
Oracle-guided scheduling for controlling granularity in implicitly parallel languages*

Published online by Cambridge University Press: 10 November 2016

UMUT A. ACAR
Affiliation:
Carnegie Mellon University, Pittsburgh, PA, USA, and Inria, Paris, France (e-mail: umut@cs.cmu.edu)
ARTHUR CHARGUÉRAUD
Affiliation:
Inria, Université Paris-Saclay, Palaiseau, France, and LRI, CNRS & Univ. Paris-Sud, Université Paris-Saclay, Orsay, France (e-mail: arthur.chargueraud@inria.fr)
MIKE RAINEY
Affiliation:
Inria, Paris, France (e-mail: mike.rainey@inria.fr)

Abstract

A classic problem in parallel computing is determining whether to execute a thread in parallel or sequentially. If small threads are executed in parallel, the overheads due to thread creation can overwhelm the benefits of parallelism, resulting in suboptimal efficiency and performance. If large threads are executed sequentially, processors may spin idle, resulting again in suboptimal efficiency and performance. This “granularity problem” is especially important in implicitly parallel languages, where the programmer expresses all potential for parallelism, leaving it to the system to exploit parallelism by creating threads as necessary. Although this problem has been identified as important, it is not well understood; broadly applicable solutions remain elusive. In this paper, we propose techniques for automatically controlling granularity in implicitly parallel programming languages to achieve parallel efficiency and performance. To this end, we first extend a classic result, Brent's theorem (a.k.a. the work-time principle), to include thread-creation overheads. Using a cost semantics for a general-purpose language in the style of the lambda calculus with parallel tuples, we then present a precise accounting of thread-creation overheads and bound their impact on efficiency and performance. To reduce such overheads, we propose an oracle-guided semantics that uses estimates of the sizes of parallel threads. We show that, if the oracle provides accurate estimates in constant time, then the oracle-guided semantics reduces the thread-creation overheads for a reasonably large class of parallel computations. We describe how to approximate the oracle-guided semantics in practice by combining static and dynamic techniques: we require the programmer to provide the asymptotic complexity cost for each parallel thread and use runtime profiling to determine hardware-specific constant factors. We present an implementation of the proposed approach as an extension of the Manticore compiler for Parallel ML. Our empirical evaluation shows that our techniques can reduce thread-creation overheads, leading to good efficiency and performance.

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

*

This research was partially supported by the National Science Foundation (grants CCF-1320563 and CCF-1408940), by the European Research Council (grant ERC-2012-StG-308246), and by Microsoft Research.

References

Acar, U. A. & Blelloch, G. (2015a). 15210: Algorithms: Parallel and sequential. Accessed August 2016. Available at: http://www.cs.cmu.edu/~15210/.
Acar, U. A. & Blelloch, G. (2015b). Algorithm design: Parallel and sequential. Accessed August 2016. Available at: http://www.parallel-algorithms-book.com.
Acar, U. A., Blelloch, G. E. & Blumofe, R. D. (2002). The data locality of work stealing. Theory Comput. Syst. 35 (3), 321–347.
Acar, U. A., Charguéraud, A. & Rainey, M. (2011). Oracle scheduling: Controlling granularity in implicitly parallel languages. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 499–518.
Acar, U. A., Charguéraud, A. & Rainey, M. (2013). Scheduling parallel programs by work stealing with private deques. In PPoPP '13.
Acar, U. A., Charguéraud, A. & Rainey, M. (2015a). An introduction to parallel computing in C++. Available at: http://www.cs.cmu.edu/15210/pasl.html.
Acar, U. A., Charguéraud, A. & Rainey, M. (2015b). A work-efficient algorithm for parallel unordered depth-first search. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). New York, NY, USA: ACM.
Aharoni, G., Feitelson, D. G. & Barak, A. (1992). A run-time algorithm for managing the granularity of parallel functional programs. J. Funct. Program. 2, 387–405.
Arora, N. S., Blumofe, R. D. & Plaxton, C. G. (1998). Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA '98. ACM Press, pp. 119–129.
Arora, N. S., Blumofe, R. D. & Plaxton, C. G. (2001). Thread scheduling for multiprogrammed multiprocessors. Theory Comput. Syst. 34 (2), 115–144.
Barnes, J. & Hut, P. (December 1986). A hierarchical O(N log N) force calculation algorithm. Nature 324, 446–449.
Bergstrom, L., Fluet, M., Rainey, M., Reppy, J. & Shaw, A. (2010). Lazy tree splitting. In ICFP 2010. ACM Press, pp. 93–104.
Blelloch, G. & Greiner, J. (1995). Parallelism in sequential functional languages. In Proceedings of the 7th International Conference on Functional Programming Languages and Computer Architecture. FPCA '95. ACM, pp. 226–237.
Blelloch, G. E., Fineman, J. T., Gibbons, P. B. & Simhadri, H. V. (2011). Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures. SPAA '11, pp. 355–366.
Blelloch, G. E. & Gibbons, P. B. (2004). Effectively sharing a cache among threads. In SPAA.
Blelloch, G. E. & Greiner, J. (1996). A provable time and space efficient implementation of NESL. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming. ACM, pp. 213–225.
Blelloch, G. E., Hardwick, J. C., Sipelstein, J., Zagha, M. & Chatterjee, S. (1994). Implementation of a portable nested data-parallel language. J. Parallel Distrib. Comput. 21 (1), 4–14.
Blelloch, G. E. & Sabot, G. W. (February 1990). Compiling collection-oriented languages onto massively parallel computers. J. Parallel Distrib. Comput. 8, 119–134.
Blumofe, R. D. & Leiserson, C. E. (September 1999). Scheduling multithreaded computations by work stealing. J. ACM 46, 720–748.
Brent, R. P. (1974). The parallel evaluation of general arithmetic expressions. J. ACM 21 (2), 201–206.
Chakravarty, M. M. T., Leshchinskiy, R., Peyton Jones, S., Keller, G. & Marlow, S. (2007). Data parallel Haskell: A status report. In Workshop on Declarative Aspects of Multicore Programming. DAMP '07, pp. 10–18.
Chowdhury, R. A., Silvestri, F., Blakeley, B. & Ramachandran, V. (April 2010). Oblivious algorithms for multicores and network of processors. In Proceedings of the International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–12.
Cole, R. & Ramachandran, V. (2010). Resource oblivious sorting on multicores. In Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming. ICALP '10. Springer-Verlag, pp. 226–237.
Crary, K. & Weirich, S. (2000). Resource bound certification. In Proceedings of the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL '00, pp. 184–198.
Feeley, M. (1992). A message passing implementation of lazy task creation. In Proceedings of Parallel Symbolic Computing, pp. 94–107.
Feeley, M. (1993). An Efficient and General Implementation of Futures on Large Scale Shared-Memory Multiprocessors. PhD Thesis, Brandeis University, Waltham, MA, USA. UMI Order No. GAX93-22348.
Fluet, M., Rainey, M. & Reppy, J. (2008). A scheduling framework for general-purpose parallel languages. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP). ACM, pp. 241–252.
Fluet, M., Rainey, M., Reppy, J. & Shaw, A. (2011). Implicitly threaded parallelism in Manticore. J. Funct. Program. 20 (5–6), 1–40.
Frens, J. D. & Wise, D. S. (1997). Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP '97. New York, NY, USA: ACM, pp. 206–216.
Frigo, M., Leiserson, C. E. & Randall, K. H. (1998). The implementation of the Cilk-5 multithreaded language. In PLDI, pp. 212–223.
Goldsmith, S. F., Aiken, A. S. & Wilkerson, D. S. (2007). Measuring empirical computational complexity. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM Symposium on the Foundations of Software Engineering, pp. 395–404.
Gulwani, S., Mehra, K. K. & Chilimbi, T. (2009). SPEED: Precise and efficient static estimation of program computational complexity. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 127–139.
Halstead, R. H. (1985). Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 501–538.
Hiraishi, T., Yasugi, M., Umatani, S. & Yuasa, T. (2009). Backtracking-based load balancing. In PPoPP '09. ACM, pp. 55–64.
Huelsbergen, L., Larus, J. R. & Aiken, A. (1994). Using the run-time sizes of data structures to guide parallel-thread creation. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming. LFP '94, pp. 79–90.
Jost, S., Hammond, K., Loidl, H. & Hofmann, M. (2010). Static determination of quantitative resource usage for higher-order programs. In Principles of Programming Languages (POPL), pp. 223–236.
Leroy, X., Doligez, D., Garrigue, J., Rémy, D. & Vouillon, J. (2005). The Objective Caml System.
Lopez, P., Hermenegildo, M. & Debray, S. (June 1996). A methodology for granularity-based control of parallelism in logic programs. J. Symbol. Comput. 21, 715–734.
Mohr, E., Kranz, D. A. & Halstead, R. H. Jr. (1990). Lazy task creation: A technique for increasing the granularity of parallel programs. In Conference Record of the 1990 ACM Conference on Lisp and Functional Programming. New York, NY, USA: ACM Press, pp. 185–197.
Narlikar, G. J. (1999). Space-Efficient Scheduling for Parallel, Multithreaded Computations. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Pehoushek, J. & Weening, J. (1990). Low-cost process creation and dynamic partitioning in Qlisp. In: Ito, T. & Halstead, R. (eds), Parallel Lisp: Languages and Systems. Lecture Notes in Computer Science, vol. 441. Springer Berlin/Heidelberg, pp. 182–199.
Peyton Jones, S. L. (2008). Harnessing the multicores: Nested data parallelism in Haskell. In APLAS, p. 138.
Peyton Jones, S. L., Leshchinskiy, R., Keller, G. & Chakravarty, M. M. T. (2008). Harnessing the multicores: Nested data parallelism in Haskell. In FSTTCS, pp. 383–414.
Plummer, H. C. (March 1911). On the problem of distribution in globular star clusters. Mon. Not. R. Astron. Soc. 71, 460–470.
Rainey, M. (August 2010). Effective Scheduling Techniques for High-Level Parallel Programming Languages. PhD Thesis, University of Chicago.
Rosendahl, M. (1989). Automatic complexity analysis. In FPCA '89: Functional Programming Languages and Computer Architecture. ACM, pp. 144–156.
Sanchez, D., Yoo, R. M. & Kozyrakis, C. (2010). Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems. ASPLOS '10. New York, NY, USA: ACM, pp. 311–322.
Sands, D. (September 1990). Calculi for Time Analysis of Functional Programs. PhD Thesis, University of London, Imperial College.
Sivaramakrishnan, K. C., Ziarek, L. & Jagannathan, S. (2014). MultiMLton: A multicore-aware runtime for Standard ML. J. Funct. Program., FirstView, 1–62.
Spoonhower, D. (2009). Scheduling Deterministic Parallel Programs. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Spoonhower, D., Blelloch, G. E., Harper, R. & Gibbons, P. B. (2008). Space profiling for parallel functional programs. In International Conference on Functional Programming.
Tzannes, A., Caragea, G. C., Vishkin, U. & Barua, R. (September 2014). Lazy scheduling: A runtime adaptive scheduler for declarative parallelism. TOPLAS 36 (3), 10:1–10:51.
Valiant, L. G. (August 1990). A bridging model for parallel computation. CACM 33, 103–111.
Weening, J. S. (1989). Parallel Execution of Lisp Programs. PhD Thesis, Stanford University. Computer Science Technical Report STAN-CS-89-1265.