
A language for hierarchical data parallel design-space exploration on GPUs

Published online by Cambridge University Press:  17 March 2016

BO JOEL SVENSSON
Affiliation: Indiana University, School of Informatics and Computer Science, Bloomington, IN, USA (e-mail: bo.joel.svensson@gmail.com)
RYAN R. NEWTON
Affiliation: Indiana University, School of Informatics and Computer Science, Bloomington, IN, USA (e-mail: rrnewton@indiana.edu)
MARY SHEERAN
Affiliation: Chalmers University of Technology, Department of Computer Science and Engineering, Gothenburg, Sweden (e-mail: ms@chalmers.se)

Abstract


Graphics Processing Units (GPUs) offer potential for very high performance; they are also rapidly evolving. Obsidian is an embedded language (in Haskell) for implementing high-performance kernels to be run on GPUs. We would like to have our cake and eat it too; we want to raise the level of abstraction beyond CUDA code and still give the programmer control over the details relevant to kernel performance. To that end, Obsidian provides array representations that guarantee elimination of intermediate arrays while also using the type system to model the hierarchy of the GPU. Operations are compiled very differently depending on what level of the GPU they target, and as a result, the user is gently constrained to write code that matches the capabilities of the GPU. Thus, we implement not Nested Data Parallelism, but a more limited form that we call Hierarchical Data Parallelism. We walk through case studies that demonstrate how to use Obsidian for rapid design exploration or auto-tuning, resulting in performance that compares well to the hand-tuned kernels used in Accelerate and NVIDIA Thrust.
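The abstract's claim that the array representations "guarantee elimination of intermediate arrays" can be illustrated with the well-known pull-array technique that Obsidian builds on. The sketch below is not Obsidian's actual API; it is a minimal, hypothetical Haskell encoding showing why composing operations over an index function fuses by construction: no array exists until it is explicitly materialized.

```haskell
-- A "pull array" is just a length plus an indexing function.
-- Composing operations composes the functions, so no
-- intermediate array is ever materialized.
data Pull a = Pull { len :: Int, ix :: Int -> a }

mapP :: (a -> b) -> Pull a -> Pull b
mapP f (Pull n g) = Pull n (f . g)

zipWithP :: (a -> b -> c) -> Pull a -> Pull b -> Pull c
zipWithP f (Pull n g) (Pull m h) = Pull (min n m) (\i -> f (g i) (h i))

-- Only an explicit conversion materializes data; everything
-- before this point fuses into a single indexing function.
toList :: Pull a -> [a]
toList (Pull n g) = map g [0 .. n - 1]

fromList :: [a] -> Pull a
fromList xs = Pull (length xs) (xs !!)

main :: IO ()
main = do
  let xs = fromList [1 .. 5 :: Int]
      ys = toList (mapP (* 2) (zipWithP (+) xs xs))
  print ys
```

In Obsidian itself, the analogue of `toList` is a `force`-style operation that writes to GPU memory at a chosen level of the hierarchy (thread, block, grid); the type at which forcing happens is what constrains code to the capabilities of that level.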

Type: Articles
Copyright: © Cambridge University Press 2016

References

Axelsson, E., Claessen, K., Sheeran, M., Svenningsson, J., Engdal, D. & Persson, A. (2011) The design and implementation of Feldspar: An embedded language for digital signal processing. In Proceedings of the 22nd International Conference on Implementation and Application of Functional Languages, IFL'10. Berlin Heidelberg: Springer-Verlag, pp. 121–136.
Bergstrom, L. & Reppy, J. (2012) Nested data-parallelism on the GPU. In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming, ICFP'12. New York, NY, USA: ACM, pp. 247–258.
Billeter, M., Olsson, O. & Assarsson, U. (2009) Efficient stream compaction on wide SIMD many-core architectures. In Proceedings of the Conference on High Performance Graphics, HPG '09. New York, NY, USA: ACM, pp. 159–166.
Bjesse, P., Claessen, K., Sheeran, M. & Singh, S. (1998) Lava: Hardware design in Haskell. In Proceedings of the 3rd ACM SIGPLAN International Conference on Functional Programming, ICFP'98. New York, NY, USA: ACM, pp. 174–184.
Blelloch, G. (1996) Programming parallel algorithms. Commun. ACM 39 (3), 85–97.
Catanzaro, B., Garland, M. & Keutzer, K. (2011) Copperhead: Compiling an embedded data parallel language. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11. New York, NY, USA: ACM, pp. 47–56.
Chafi, H., Sujeeth, A. K., Brown, K. J., Lee, H., Atreya, A. R. & Olukotun, K. (2011) A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11. New York, NY, USA: ACM, pp. 35–46.
Chakravarty, M. M. T., Keller, G., Lee, S., McDonell, T. L. & Grover, V. (2011) Accelerating Haskell array codes with multicore GPUs. In Proceedings of the 6th Workshop on Declarative Aspects of Multicore Programming, DAMP'11. New York, NY, USA: ACM, pp. 3–14.
Claessen, K., Sheeran, M. & Svensson, B. J. (2012) Expressive array constructs in an embedded GPU kernel programming language. In Proceedings of the 7th Workshop on Declarative Aspects and Applications of Multicore Programming, DAMP '12. New York, NY, USA: ACM, pp. 21–30.
Elliott, C. (2003) Functional images. In The Fun of Programming, "Cornerstones of Computing" series. Palgrave, pp. 131–150.
Elliott, C., Finne, S. & de Moor, O. (2003) Compiling embedded languages. J. Funct. Program. 13 (3), 455–481.
Guibas, L. J. & Wyatt, D. K. (1978) Compilation and delayed evaluation in APL. In Proceedings of the 5th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL'78. New York, NY, USA: ACM, pp. 1–8.
Harris, M. (2007) Optimizing parallel reduction in CUDA. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
Harris, M., Sengupta, S. & Owens, J. D. (2007) Parallel prefix sum (scan) with CUDA. In GPU Gems 3, Nguyen, H. (ed). Boston, MA: Addison Wesley, pp. 851–876.
Holk, E., Byrd, W. E., Mahajan, N., Willcock, J., Chauhan, A. & Lumsdaine, A. (2012) Declarative parallel programming for GPUs. In Proceedings of ParCo 2011: Applications, Tools and Techniques on the Road to Exascale Computing, Advances in Parallel Computing. Amsterdam: IOS Press, pp. 297–304.
Keller, G., Chakravarty, M. M. T., Leshchinskiy, R., Peyton Jones, S. & Lippmeier, B. (2010) Regular, shape-polymorphic, parallel arrays in Haskell. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming, ICFP'10. New York, NY, USA: ACM, pp. 261–272.
Kloeckner, A. (2015) Loo.py: From Fortran to performance via transformation and substitution rules. In Proceedings of the ACM SIGPLAN 2nd International Workshop on Libraries, Languages and Compilers for Array Programming, ARRAY'15. New York, NY, USA: ACM, pp. 1–6.
Mainland, G. & Morrisett, G. (2010) Nikola: Embedding compiled GPU functions in Haskell. In Proceedings of the 3rd ACM Haskell Symposium. New York, NY, USA: ACM, pp. 67–78.
McDonell, T. L., Chakravarty, M. M. T., Keller, G. & Lippmeier, B. (2013) Optimising purely functional GPU programs. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming, ICFP'13. New York, NY, USA: ACM, pp. 49–60.
NVIDIA. (2015b) NVIDIA CUB Library. http://nvlabs.github.io/cub/
NVIDIA. (2015c) NVIDIA Thrust Library. https://developer.nvidia.com/thrust
Oancea, C. E., Andreetta, C., Berthold, J., Frisch, A. & Henglein, F. (2012) Financial software on GPUs: Between Haskell and Fortran. In Proceedings of the 1st ACM SIGPLAN Workshop on Functional High-Performance Computing, FHPC'12. New York, NY, USA: ACM, pp. 61–72.
Persson, A., Axelsson, E. & Svenningsson, J. (2012) Generic monadic constructs for embedded languages. In Implementation and Application of Functional Languages, Gill, A. & Hage, J. (eds), IFL '11. Berlin Heidelberg: Springer-Verlag, pp. 85–99.
Sculthorpe, N., Bracker, J., Giorgidze, G. & Gill, A. (2013) The constrained-monad problem. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming, ICFP 2013. New York, NY, USA: ACM, pp. 287–298.
Sklansky, J. (1960) Conditional-sum addition logic. IRE Trans. Electron. Comput. EC–9 (2), 226–231.
Stevens, R. T. (1989) Fractal Programming in C. M&T Books.
Svenningsson, J. & Axelsson, E. (2013) Combining deep and shallow embedding for EDSL. In Trends in Functional Programming, TFP '12, Loidl, H.-W. & Peña, R. (eds), Lecture Notes in Computer Science, vol. 7829. Berlin Heidelberg: Springer-Verlag, pp. 21–36.
Svenningsson, J. & Svensson, B. J. (2013) Simple and compositional reification of monadic embedded languages. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming, ICFP'13. New York, NY, USA: ACM.
Svenningsson, J., Svensson, B. J. & Sheeran, M. (2013) Efficient counting and occurrence sort for GPUs using an embedded GPU programming language. In Proceedings of the 2nd ACM SIGPLAN Workshop on Functional High-Performance Computing, FHPC'13. New York, NY, USA: ACM, pp. 50–63.
Svensson, B. J., Sheeran, M. & Newton, R. R. (2014) Design exploration through code-generating DSLs. Commun. ACM 57 (6), 56–63.
Svensson, J., Sheeran, M. & Claessen, K. (2008) Obsidian: A domain specific embedded language for parallel programming of graphics processors. In Proceedings of the International Conference on Implementation and Application of Functional Languages, IFL '08. Berlin Heidelberg: Springer-Verlag, pp. 156–173.
Svensson, J., Claessen, K. & Sheeran, M. (2010) GPGPU kernel implementation and refinement using Obsidian. Procedia Comput. Sci. 1 (1), 2065–2074.
Ulvinge, N. (2014) Increasing Programmability of an Embedded Domain Specific Language for GPGPU Kernels using Static Analysis. MSc Thesis, Dept. of Computer Science and Engineering, Chalmers University of Technology.