
Oracle-guided scheduling for controlling granularity in implicitly parallel languages*

  • UMUT A. ACAR (a1), ARTHUR CHARGUÉRAUD (a2) and MIKE RAINEY (a3)

Abstract

A classic problem in parallel computing is determining whether to execute a thread in parallel or sequentially. If small threads are executed in parallel, the overheads due to thread creation can overwhelm the benefits of parallelism, resulting in suboptimal efficiency and performance. If large threads are executed sequentially, processors may sit idle, again resulting in suboptimal efficiency and performance. This "granularity problem" is especially important in implicitly parallel languages, where the programmer expresses all potential for parallelism, leaving it to the system to exploit parallelism by creating threads as necessary. Although granularity control has long been recognized as an important problem, it is not well understood: broadly applicable solutions remain elusive. In this paper, we propose techniques for automatically controlling granularity in implicitly parallel programming languages to achieve parallel efficiency and performance. To this end, we first extend a classic result, Brent's theorem (a.k.a. the work-time principle), to include thread-creation overheads. Using a cost semantics for a general-purpose language in the style of the lambda calculus with parallel tuples, we then present a precise accounting of thread-creation overheads and bound their impact on efficiency and performance. To reduce these overheads, we propose an oracle-guided semantics that uses estimates of the sizes of parallel threads. We show that, if the oracle provides accurate estimates in constant time, the oracle-guided semantics reduces thread-creation overheads for a reasonably large class of parallel computations. We describe how to approximate the oracle-guided semantics in practice by combining static and dynamic techniques: we require the programmer to provide the asymptotic complexity of each parallel thread and use runtime profiling to determine hardware-specific constant factors.
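For reference, the classic statement of Brent's theorem that the paper extends is the standard bound without overheads; the paper's contribution is to add thread-creation cost terms to a bound of this shape:

```latex
T_P \;\le\; \frac{W}{P} + S
```

where $W$ is the total work, $S$ is the span (critical-path length), and $T_P$ is the running time on $P$ processors.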
We present an implementation of the proposed approach as an extension of the Manticore compiler for Parallel ML. Our empirical evaluation shows that our techniques can reduce thread-creation overheads, leading to good efficiency and performance.
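As a rough illustration of the approach described in the abstract, the following Python sketch shows an oracle-guided spawn decision: the programmer supplies an asymptotic cost function, the runtime supplies a profiled constant factor, and a task runs sequentially whenever its predicted cost falls below a cutoff. All names here (`KAPPA`, `should_parallelize`, `psum`) are hypothetical and are not taken from the paper or from Manticore; a real implementation would spawn threads, whereas this sketch simulates the parallel branches sequentially.

```python
# Hypothetical sketch of oracle-guided granularity control.
# Assumptions (not from the paper): the cutoff KAPPA and the profiled
# constant C are placeholder values; parallel branches are simulated
# sequentially since no parallel runtime is used here.

KAPPA = 1000.0  # cutoff on predicted cost; hardware-specific in practice
C = 1.0         # constant factor, determined by runtime profiling

def predicted_cost(complexity, n):
    """Oracle estimate: programmer-supplied asymptotic complexity
    scaled by the profiled constant factor."""
    return C * complexity(n)

def should_parallelize(complexity, n):
    """Create a parallel thread only if the predicted cost of the
    task exceeds the cutoff; otherwise run it sequentially."""
    return predicted_cost(complexity, n) > KAPPA

def psum(lo, hi):
    """Sum the integers in [lo, hi), splitting in half whenever the
    oracle predicts the work is large enough to justify a spawn."""
    n = hi - lo
    if n <= 1:
        return lo if n == 1 else 0
    if not should_parallelize(float, n):  # linear cost: complexity(n) = n
        return sum(range(lo, hi))         # sequential execution
    mid = lo + n // 2
    # In Manticore this would be a parallel tuple; simulated sequentially.
    a, b = psum(lo, mid), psum(mid, hi)
    return a + b

print(psum(0, 100))   # small input: runs entirely sequentially
print(psum(0, 5000))  # large input: splits until subtasks fall below the cutoff
```

The key design point mirrored from the abstract is that the decision uses a *predicted* cost (asymptotic complexity times a profiled constant), not a fixed input-size threshold, so the same cutoff adapts across tasks with different complexity functions.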

Footnotes

* This research was partially supported by the National Science Foundation (grants CCF-1320563 and CCF-1408940), the European Research Council (grant ERC-2012-StG-308246), and by Microsoft Research.
References

Acar, U. A. & Blelloch, G. (2015a). 15210: Algorithms: Parallel and sequential. Accessed August 2016. Available at: http://www.cs.cmu.edu/~15210/.
Acar, U. A. & Blelloch, G. (2015b). Algorithm design: Parallel and sequential. Accessed August 2016. Available at: http://www.parallel-algorithms-book.com.
Acar, U. A., Blelloch, G. E. & Blumofe, R. D. (2002). The data locality of work stealing. Theory Comput. Syst. 35 (3), 321–347.
Acar, U. A., Charguéraud, A. & Rainey, M. (2011). Oracle scheduling: Controlling granularity in implicitly parallel languages. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 499–518.
Acar, U. A., Charguéraud, A. & Rainey, M. (2013). Scheduling parallel programs by work stealing with private deques. In PPoPP '13.
Acar, U. A., Charguéraud, A. & Rainey, M. (2015a). An introduction to parallel computing in C++. Available at: http://www.cs.cmu.edu/15210/pasl.html.
Acar, U. A., Charguéraud, A. & Rainey, M. (2015b). A work-efficient algorithm for parallel unordered depth-first search. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). New York, NY, USA: ACM.
Aharoni, G., Feitelson, D. G. & Barak, A. (1992). A run-time algorithm for managing the granularity of parallel functional programs. J. Funct. Program. 2, 387–405.
Arora, N. S., Blumofe, R. D. & Plaxton, C. G. (1998). Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '98). ACM Press, pp. 119–129.
Arora, N. S., Blumofe, R. D. & Plaxton, C. G. (2001). Thread scheduling for multiprogrammed multiprocessors. Theory Comput. Syst. 34 (2), 115–144.
Barnes, J. & Hut, P. (December 1986). A hierarchical O(N log N) force-calculation algorithm. Nature 324, 446–449.
Bergstrom, L., Fluet, M., Rainey, M., Reppy, J. & Shaw, A. (2010). Lazy tree splitting. In ICFP 2010. ACM Press, pp. 93–104.
Blelloch, G. & Greiner, J. (1995). Parallelism in sequential functional languages. In Proceedings of the 7th International Conference on Functional Programming Languages and Computer Architecture (FPCA '95). ACM, pp. 226–237.
Blelloch, G. E., Fineman, J. T., Gibbons, P. B. & Simhadri, H. V. (2011). Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11), pp. 355–366.
Blelloch, G. E. & Gibbons, P. B. (2004). Effectively sharing a cache among threads. In SPAA.
Blelloch, G. E. & Greiner, J. (1996). A provable time and space efficient implementation of NESL. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming. ACM, pp. 213–225.
Blelloch, G. E., Hardwick, J. C., Sipelstein, J., Zagha, M. & Chatterjee, S. (1994). Implementation of a portable nested data-parallel language. J. Parallel Distrib. Comput. 21 (1), 4–14.
Blelloch, G. E. & Sabot, G. W. (February 1990). Compiling collection-oriented languages onto massively parallel computers. J. Parallel Distrib. Comput. 8, 119–134.
Blumofe, R. D. & Leiserson, C. E. (September 1999). Scheduling multithreaded computations by work stealing. J. ACM 46, 720–748.
Brent, R. P. (1974). The parallel evaluation of general arithmetic expressions. J. ACM 21 (2), 201–206.
Chakravarty, M. M. T., Leshchinskiy, R., Peyton Jones, S., Keller, G. & Marlow, S. (2007). Data Parallel Haskell: A status report. In Workshop on Declarative Aspects of Multicore Programming (DAMP '07), pp. 10–18.
Chowdhury, R. A., Silvestri, F., Blakeley, B. & Ramachandran, V. (April 2010). Oblivious algorithms for multicores and networks of processors. In Proceedings of the International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–12.
Cole, R. & Ramachandran, V. (2010). Resource oblivious sorting on multicores. In Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP '10). Springer-Verlag, pp. 226–237.
Crary, K. & Weirich, S. (2000). Resource bound certification. In Proceedings of the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '00), pp. 184–198.
Feeley, M. (1992). A message passing implementation of lazy task creation. In Proceedings of Parallel Symbolic Computing, pp. 94–107.
Feeley, M. (1993). An Efficient and General Implementation of Futures on Large Scale Shared-Memory Multiprocessors. PhD Thesis, Brandeis University, Waltham, MA, USA. UMI Order No. GAX93-22348.
Fluet, M., Rainey, M. & Reppy, J. (2008). A scheduling framework for general-purpose parallel languages. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP). ACM, pp. 241–252.
Fluet, M., Rainey, M., Reppy, J. & Shaw, A. (2011). Implicitly threaded parallelism in Manticore. J. Funct. Program. 20 (5–6), 1–40.
Frens, J. D. & Wise, D. S. (1997). Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). New York, NY, USA: ACM, pp. 206–216.
Frigo, M., Leiserson, C. E. & Randall, K. H. (1998). The implementation of the Cilk-5 multithreaded language. In PLDI, pp. 212–223.
Goldsmith, S. F., Aiken, A. S. & Wilkerson, D. S. (2007). Measuring empirical computational complexity. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM Symposium on the Foundations of Software Engineering, pp. 395–404.
Gulwani, S., Mehra, K. K. & Chilimbi, T. (2009). SPEED: Precise and efficient static estimation of program computational complexity. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 127–139.
Halstead, R. H. (1985). Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 501–538.
Hiraishi, T., Yasugi, M., Umatani, S. & Yuasa, T. (2009). Backtracking-based load balancing. In PPoPP '09. ACM, pp. 55–64.
Huelsbergen, L., Larus, J. R. & Aiken, A. (1994). Using the run-time sizes of data structures to guide parallel-thread creation. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming (LFP '94), pp. 79–90.
Jost, S., Hammond, K., Loidl, H. & Hofmann, M. (2010). Static determination of quantitative resource usage for higher-order programs. In Principles of Programming Languages (POPL), pp. 223–236.
Leroy, X., Doligez, D., Garrigue, J., Rémy, D. & Vouillon, J. (2005). The Objective Caml System.
Lopez, P., Hermenegildo, M. & Debray, S. (June 1996). A methodology for granularity-based control of parallelism in logic programs. J. Symbol. Comput. 21, 715–734.
Mohr, E., Kranz, D. A. & Halstead, R. H. Jr. (1990). Lazy task creation: A technique for increasing the granularity of parallel programs. In Conference Record of the 1990 ACM Conference on Lisp and Functional Programming. New York, NY, USA: ACM Press, pp. 185–197.
Narlikar, G. J. (1999). Space-Efficient Scheduling for Parallel, Multithreaded Computations. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Pehoushek, J. & Weening, J. (1990). Low-cost process creation and dynamic partitioning in Qlisp. In Ito, T. & Halstead, R. (eds), Parallel Lisp: Languages and Systems. Lecture Notes in Computer Science, vol. 441. Springer Berlin/Heidelberg, pp. 182–199.
Peyton Jones, S. L. (2008). Harnessing the multicores: Nested data parallelism in Haskell. In APLAS, p. 138.
Peyton Jones, S. L., Leshchinskiy, R., Keller, G. & Chakravarty, M. M. T. (2008). Harnessing the multicores: Nested data parallelism in Haskell. In FSTTCS, pp. 383–414.
Plummer, H. C. (March 1911). On the problem of distribution in globular star clusters. Mon. Not. R. Astron. Soc. 71, 460–470.
Rainey, M. (August 2010). Effective Scheduling Techniques for High-Level Parallel Programming Languages. PhD Thesis, University of Chicago.
Rosendahl, M. (1989). Automatic complexity analysis. In FPCA '89: Functional Programming Languages and Computer Architecture. ACM, pp. 144–156.
Sanchez, D., Yoo, R. M. & Kozyrakis, C. (2010). Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (ASPLOS '10). New York, NY, USA: ACM, pp. 311–322.
Sands, D. (September 1990). Calculi for Time Analysis of Functional Programs. PhD Thesis, University of London, Imperial College.
Sivaramakrishnan, K. C., Ziarek, L. & Jagannathan, S. (2014). MultiMLton: A multicore-aware runtime for Standard ML. J. Funct. Program., FirstView, 1–62.
Spoonhower, D. (2009). Scheduling Deterministic Parallel Programs. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Spoonhower, D., Blelloch, G. E., Harper, R. & Gibbons, P. B. (2008). Space profiling for parallel functional programs. In International Conference on Functional Programming.
Tzannes, A., Caragea, G. C., Vishkin, U. & Barua, R. (September 2014). Lazy scheduling: A runtime adaptive scheduler for declarative parallelism. TOPLAS 36 (3), 10:1–10:51.
Valiant, L. G. (August 1990). A bridging model for parallel computation. CACM 33, 103–111.
Weening, J. S. (1989). Parallel Execution of Lisp Programs. PhD Thesis, Stanford University. Computer Science Technical Report STAN-CS-89-1265.
