Communication lower bounds and optimal algorithms for numerical linear algebra*†

G. Ballard; E. Carson; J. Demmel; M. Hoemmen; N. Knight; O. Schwartz

doi:10.1017/S0962492914000038

Communication lower bounds and optimal algorithms for numerical linear algebra*†

Published online by Cambridge University Press: 12 May 2014

G. Ballard ,

E. Carson ,

J. Demmel ,

M. Hoemmen ,

N. Knight and

O. Schwartz

Show author details

G. Ballard: Affiliation:
Sandia National Laboratories, Livermore, CA 94551, USA, E-mail: gmballa@sandia.gov
E. Carson: Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: ecc2z@cs.berkeley.edu, knight@cs.berkeley.edu, odedsc@cs.berkeley.edu
J. Demmel: Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: ecc2z@cs.berkeley.edu, knight@cs.berkeley.edu, odedsc@cs.berkeley.edu Mathematics Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: demmel@cs.berkeley.edu
M. Hoemmen: Affiliation:
Sandia National Laboratories, Albuquerque, NM 87185, USA, E-mail: mhoemme@sandia.gov
N. Knight: Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: ecc2z@cs.berkeley.edu, knight@cs.berkeley.edu, odedsc@cs.berkeley.edu
O. Schwartz: Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: ecc2z@cs.berkeley.edu, knight@cs.berkeley.edu, odedsc@cs.berkeley.edu

Article contents

Extract
Footnotes
References

Get access

Rights & Permissions

Extract

The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.

Information

Type: Research Article
Information: Acta Numerica , Volume 23 , May 2014 , pp. 1 - 155

DOI: https://doi.org/10.1017/S0962492914000038 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2014

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

Footnotes

We acknowledge funding from Microsoft (award 024263) and Intel (award 024894), and matching funding by UC Discovery (award DIG07-10227). Additional support comes from ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle and Samsung, as well as MathWorks. Research is also supported by DOE grants DE-SC0004938, DE-SC0005136, DE-SC0003959, DE-SC0008700, DE-SC0010200, DE-FC02-06-ER25786, AC02-05CH11231, and DARPA grant HR0011-12-2-0016. This research is supported by grant 3-10891 from the Ministry of Science and Technology, Israel, and grant 2010231 from the US-Israel Bi-National Science Foundation. This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its US Department of Energy Contract DE-AC04-94AL85000.

†

Colour online for monochrome figures available at journals.cambridge.org/anu.

References

REFERENCES

Aasen, J. O. (1971), ‘On the reduction of a symmetric matrix to tridiagonal form’, BIT Numer. Math. 11, 233–242.Google Scholar

Abdelmalek, N. N. (1971), ‘Round off error analysis for Gram–Schmidt method and solution of linear least squares problems’, BIT Numer. Math. 11, 345–367.Google Scholar

Agarwal, R., Balle, S., Gustavson, F., Joshi, M. and Palkar, P. (1995), ‘A three-dimensional approach to parallel matrix multiplication’, IBM J. Res. Dev. 39, 575–582.Google Scholar

Agarwal, R., Gustavson, F. and Zubair, M. (1994), ‘A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication’, IBM J. Res. Dev. 38, 673–681.Google Scholar

Aggarwal, A. and Vitter, J. (1988), ‘The input/output complexity of sorting and related problems’, Comm. Assoc. Comput. Mach. 31, 1116–1127.Google Scholar

Aggarwal, A., Chandra, A. K. and Snir, M. (1990), ‘Communication complexity of PRAMs’, Theoret. Comput. Sci. 71, 3–28.Google Scholar

Ahmed, N. and Pingali, K. (2000), Automatic generation of block-recursive codes. In Proc. 6th International Euro-Par Conference on Parallel Processing, Springer, pp. 368–378.Google Scholar

Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Croz, J. D., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S. and Sorensen, D. (1992), LAPACK Users' Guide, SIAM. Also available from http://www.netlib.org/lapack/.Google Scholar

Anderson, M., Ballard, G., Demmel, J. and Keutzer, K. (2011), Communication-avoiding QR decomposition for GPUs. In Proc. 2011 IEEE International Parallel and Distributed Processing Symposium: IPDPS '11, pp. 48–58.Google Scholar

Arnoldi, W. E. (1951), ‘The principle of minimized iterations in the solution of the matrix eigenvalue problem’, Quart. Appl. Math. 9, 17–29.Google Scholar

Bai, Z. and Day, D. (2000), Block Arnoldi method. In Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Bai, Z., Demmel, J. W., Dongarra, J. J., Ruhe, A. and van der Vorst, H., eds), SIAM, pp. 196–204.Google Scholar

Bai, Z., Day, D., Demmel, J. and Dongarra, J. (1997 a), A test matrix collection for non-Hermitian eigenvalue problems. Technical report CS-97-355, Department of CS, University of Tennessee.Google Scholar

Bai, Z., Demmel, J. and Gu, M. (1997 b), ‘An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems’, Numer. Math. 76, 279–308.CrossRef Google Scholar

Bai, Z., Hu, D. and Reichel, L. (1991), An implementation of the GMRES method using QR factorization. In Proc. Fifth SIAM Conference on Parallel Processing for Scientific Computing, pp. 84–91.Google Scholar

Bai, Z., Hu, D. and Reichel, L. (1994), ‘A Newton basis GMRES implementation’, IMA J. Numer. Anal. 14, 563–581.Google Scholar

Ballard, G. (2013), Avoiding communication in dense linear algebra. PhD thesis, EECS Department, UC Berkeley.Google Scholar

Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013 a), Communication-avoiding symmetric-indefinite factorization. Technical report UCB/EECS-2013-127, EECS Department, UC Berkeley.Google Scholar

Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013 b), Implementing a blocked Aasen's algorithm with a dynamic scheduler on multicore architectures. In Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS '13, pp. 895–907.Google Scholar

Ballard, G., Buluç, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O. and Toledo, S. (2013 c), Communication optimal parallel multiplication of sparse random matrices. In Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '13, ACM, pp. 222–231.Google Scholar

Ballard, G., Demmel, J. and Dumitriu, I. (2011a), Communication-optimal parallel and sequential eigenvalue and singular value algorithms. Technical Report EECS-2011-14, UC Berkeley.Google Scholar

Ballard, G., Demmel, J. and Gearhart, A. (2011 b), Brief announcement: Communication bounds for heterogeneous architectures. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '11, ACM, pp. 257–258.Google Scholar

Ballard, G., Demmel, J. and Knight, N. (2012 a), Communication avoiding successive band reduction. In Proc. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: PPoPP '12, ACM, pp. 35–44.Google Scholar

Ballard, G., Demmel, J. and Knight, N. (2013 d), Avoiding communication in successive band reduction. Technical report UCB/EECS-2013-131, EECS Department, UC Berkeley.Google Scholar

Ballard, G., Demmel, J., Grigori, L., Jacquelin, M., Nguyen, H. D. and Solomonik, E. (2014), Reconstructing Householder vectors from Tall-Skinny QR. In Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS '14, to appear.Google Scholar

Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2010), ‘Communication-optimal parallel and sequential Cholesky decomposition’, SIAM J. Sci. Comput. 32, 3495–3523.Google Scholar

Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2011 c), Graph expansion and communication costs of fast matrix multiplication. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '11, ACM, pp. 1–12.Google Scholar

Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2011 d), ‘Minimizing communication in numerical linear algebra’, SIAM J. Matrix Anal. Appl. 32, 866–901.Google Scholar

Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2012 b), ‘Graph expansion and communication costs of fast matrix multiplication’, J. Assoc. Comput. Mach. 59, #32.CrossRef Google Scholar

Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2012 c), Sequential communication bounds for fast linear algebra. Technical report EECS-2012-36, UC Berkeley.Google Scholar

Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 d), Brief announcement: Strong scaling of matrix multiplication algorithms and memory independent communication lower bounds. In Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '12, ACM, pp. 77–79.Google Scholar

Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 e), Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '12, ACM, pp. 193–204.Google Scholar

Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 f), Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In Design and Analysis of Algorithms (Even, G. and Rawitz, D., eds), Vol. 7659 of Lecture Notes in Computer Science, Springer, pp. 13–36.Google Scholar

Ballard, G., Demmel, J., Lipshitz, B., Schwartz, O. and Toledo, S. (2013 f), Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout. In Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '13, ACM, pp. 232–240.Google Scholar

Barrett, R., Berry, M., Chan, T. F., Demmel, J. W., Donato, J., Dongarra, J. J., Eijkhout, V., Pozo, R., Romine, C. and van der Vorst, H. (1994), Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second edition, SIAM.Google Scholar

Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. (2010), ‘Optimal sparse matrix dense vector multiplication in the I/O-model’, Theory Comput. Syst. 47, 934–962.Google Scholar

Bennett, J., Carbery, A., Christ, M. and Tao, T. (2010), ‘Finite bounds for Holder-Brascamp-Lieb multilinear inequalities’, Math. Res. Lett. 17, 647–666.CrossRef Google Scholar

Berntsen, J. (1989), ‘Communication efficient matrix multiplication on hypercubes’, Parallel Comput. 12, 335–342.Google Scholar

Bilardi, G. and Preparata, F. P. (1999), ‘Processor-time tradeoffs under bounded-speed message propagation II: Lower bounds’, Theory Comput. Syst. 32, 531–559.Google Scholar

Bilardi, G., Pietracaprina, A. and D'Alberto, P. (2000), On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science: 26th International Workshop (Brandes, U. and Wagner, D., eds), Vol. 1928 of Lecture Notes in Computer Science, Springer, pp. 47–58.Google Scholar

Bischof, C. and Loan, C. Van (1987), ‘The WY representation for products of Householder matrices’, SIAM J. Sci. Statist. Comput. 8, 2–13.Google Scholar

Bischof, C. H., Lang, B. and Sun, X. (2000a), ‘Algorithm 807: The SBR Toolbox, Software for successive band reduction’, ACM Trans. Math. Softw. 26, 602–616.CrossRef Google Scholar

Bischof, C., Lang, B. and Sun, X. (2000b), ‘A framework for symmetric band reduction’, ACM Trans. Math. Softw. 26, 581–601.CrossRef Google Scholar

Björck, A. (1967), ‘Solving linear least squares problems by Gram–Schmidt orthogonalization’, BIT Numer. Math. 7, 1–21.Google Scholar

Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C. (1997), ScaLAPACK Users' Guide, SIAM. Also available from http://www.netlib.org/scalapack/.Google Scholar

Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K. and Whaley, R. C. (2002), ‘An updated set of basic linear algebra subroutines (BLAS)’, J. ACM Trans. Math. Softw. 28, 135–151.Google Scholar

Börm, S. and Grasedyck, L. (2006), HLib package. http://www.hlib.org/hlib.html Google Scholar

Börm, S., Grasedyck, L. and Hackbusch, W. (2004), Hierarchical matrices. http://www.mis.mpg.de/scicomp/Fulltext/WS_HMatrices.pdf Google Scholar

Braman, K., Byers, R. and Mathias, R. (2002 a), ‘The multishift QR algorithm I: Maintaining well-focused shifts and level 3 performance’, SIAM J. Matrix Anal. Appl. 23, 929–947.Google Scholar

Braman, K., Byers, R. and Mathias, R. (2002b), ‘The multishift QR algorithm II: Aggressive early deflation’, SIAM J. Matrix Anal. Appl. 23, 948–973.Google Scholar

Bunch, J. and Kaufman, L. (1977), ‘Some stable methods for calculating inertia and solving symmetric linear systems’, Math. Comp. 31, 163–179.Google Scholar

Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. J. (2007), A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 191.Google Scholar

Byun, J., Lin, R., Yelick, K. and Demmel, J. (2012), Autotuning sparse matrix-vector multiplication for multicore. Technical report UCB/EECS-2012-215, EECS Department, UC Berkeley.Google Scholar

Cannon, L. (1969), A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University.Google Scholar

Carson, E. and Demmel, J. (2014), ‘A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods’, SIAM J. Matrix Anal. Appl. 35, 22–43.Google Scholar

Carson, E., Knight, N. and Demmel, J. (2013), ‘Avoiding communication in non-symmetric Lanczos-based Krylov subspace methods’, SIAM J. Sci. Comput. 35, S42–S61.Google Scholar

Catalyurek, U. V. and Aykanat, C. (1999), ‘Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication’, IEEE Trans. Parallel Distributed Systems 10, 673–693.Google Scholar

Catalyiirek, Ü. V. and Aykanat, C. (2001), A fine-grain hypergraph model for 2D decomposition of sparse matrices. In Proc. 15th IEEE International Parallel and Distributed Processing Symposium: IPDPS '01.Google Scholar

Chan, E., Heimlich, M., Purkayastha, A. and van de Geijn, R. (2007), ‘Collective communication: Theory, practice, and experience’, Concurrency and Computation: Pract. Exp. 19, 1749–1783.Google Scholar

Chow, E. (2000), ‘Apriori sparsity patterns for parallel sparse approximate inverse preconditioners’, SIAM J. Sci. Comput. 21, 1804–1822.Google Scholar

Chow, E. (2001), ‘Parallel implementation and practical use of sparse approximate inverses with a priori sparsity patterns’, Internat. J. High Perf. Comput. Appl. 15, 56–74.Google Scholar

Christ, M., Demmel, J., Knight, N., Scanlon, T. and Yelick, K. (2013), Communication lower bounds and optimal algorithms for programs that reference arrays, part 1. Technical report UCB/EECS-2013-61, EECS Department, UC Berkeley.Google Scholar

Chronopoulos, A. and Gear, C. (1989 a), ‘On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy’, Parallel Comput. 11, 37–53.Google Scholar

Chronopoulos, A. and Gear, C. (1989 b), ‘s-step iterative methods for symmetric linear systems’, J. Comput. Appl. Math. 25, 153–168.Google Scholar

Chronopoulos, A. and Swanson, C. (1996), ‘Parallel iterative s-step methods for unsymmetric linear systems’, Parallel Comput. 22, 623–641.Google Scholar

Cohen, E. (1994), Estimating the size of the transitive closure in linear time, in Proc. 35th Ann. Symp. Found. Comp. Sci., IEEE, pp. 190–200.Google Scholar

Cohen, E. (1997), ‘Size-estimation framework with applications to transitive closure and reachability’, J. Comput. System Sci. 55, 441–453.Google Scholar

Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council (2013), Frontiers in Massive Data Analysis, The National Academies Press.Google Scholar

da Cunha, R. D., Becker, D. and Patterson, J. C. (2002), New parallel (rank-revealing) QR factorization algorithms. In Euro-Par 2002 Parallel Processing, Springer, pp. 677–686.Google Scholar

Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J. and Yelick, K. (2008), Stencil computation optimization and autotuning on state-of-the-art multicore architectures. In Proc. 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, p. 4.Google Scholar

Davis, T. and Hu, Y. (2011), ‘The University of Florida sparse matrix collection’, ACM Trans. Math. Softw. 38, 1–25.Google Scholar

Dekel, E., Nassimi, D. and Sahni, S. (1981), ‘Parallel matrix and graph algorithms’, SIAM J. Comput. 10, 657–675.Google Scholar

Demmel, J. (1997), Applied Numerical Linear Algebra, SIAM.Google Scholar

Demmel, J. (2013), An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra. Technical report UCB/EECS-2013-126, EECS Department, UC Berkeley.Google Scholar

Demmel, J., Dumitriu, I. and Holtz, O. (2007 a), ‘Fast linear algebra is stable’, Numer. Math. 108, 59–91.Google Scholar

Demmel, J., Dumitriu, I., Holtz, O. and Kleinberg, R. (2007 b), ‘Fast matrix multiplication is stable’, Numer. Math. 106, 199–224.Google Scholar

Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O. and Spillinger, O. (2013 a), Communication-optimal parallel recursive rectangular matrix multiplication. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 261–272.Google Scholar

Demmel, J., Gearhart, A., Lipshitz, B. and Schwartz, O. (2013 b), Perfect strong scaling using no additional energy. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 649–660.Google Scholar

Demmel, J., Grigori, L., Gu, M. and Xiang, H. (2013 c), Communication avoiding rank revealing QR factorization with column pivoting. Technical report UCB/EECS-2013-46, EECS Department, UC Berkeley.Google Scholar

Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2008 a), Communication-avoiding parallel and sequential QR and LU factorizations: Theory and practice. LAPACK Working Note.Google Scholar

Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2012), ‘Communication-optimal parallel and sequential QR and LU factorizations’, SIAM J. Sci. Comput. 34, A206–A239.Google Scholar

Demmel, J., Hoemmen, M., Mohiyuddin, M. and Yelick, K. (2007 c), Avoiding communication in computing Krylov subspaces. Technical report UCB/EECS-2007-123, EECS Department, UC Berkeley.Google Scholar

Demmel, J., Hoemmen, M., Mohiyuddin, M. and Yelick, K. (2008 b), Avoiding communication in sparse matrix computations. In Proc. 2008 IEEE International Parallel and Distributed Processing Symposium: IPDPS 2008, pp. 1–12.Google Scholar

Demmel, J., Marques, O., Parlett, B. and Vomel, C. (2008 c), ‘Performance and accuracy of LAPACK's symmetric tridiagonal eigensolvers’, SIAM J. Sci. Comput. 30, 1508–1526.Google Scholar

Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H. and Catalyurek, U. V. (2006), Parallel hypergraph partitioning for scientific computing. In Proc. 20th IEEE International Parallel and Distributed Processing Symposium: IPDPS 2006.Google Scholar

Dongarra, J. J., Croz, J. D., Duff, I. S. and Hammarling, S. (1990 a), ‘Algorithm 679: A set of level 3 basic linear algebra subprograms’, ACM Trans. Math. Softw. 16, 18–28.Google Scholar

Dongarra, J. J., Croz, J. D., Duff, I. S. and Hammarling, S. (1990 b), ‘A set of level 3 basic linear algebra subprograms’, ACM Trans. Math. Softw. 16, 1–17.Google Scholar

Dongarra, J. J., Croz, J. D., Hammarling, S. and Hanson, R. J. (1988 a), ‘Algorithm 656: An extended set of Fortran basic linear algebra subprograms’, ACM Trans. Math. Softw. 14, 18–32.CrossRef Google Scholar

Dongarra, J. J., Croz, J. D., Hammarling, S. and Hanson, R. J. (1988 b), ‘An extended set of Fortran basic linear algebra subprograms’, ACM Trans. Math. Softw. 14, 1–17.CrossRef Google Scholar

Dongarra, J. J., Moler, C. B., Bunch, J. R. and Stewart, G. W. (1979), LINPACK Users' Guide, SIAM.Google Scholar

Douglas, C. C., Hu, J., Kowarschik, M., Rüde, U. and Weiβ, C. (2000), ‘Cache optimization for structured and unstructured grid multigrid’, Electron. Trans. Numer. Anal. 10, 21–40.Google Scholar

Driscoll, M., Georganas, E., Koanantakool, P., Solomonik, E. and Yelick, K. (2013), A communication-optimal W-body algorithm for direct interactions. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 1075–1084.Google Scholar

Dursun, H., Nomura, K.-I., Peng, L., Seymour, R., Wang, W., Kalia, R. K., Nakano, A. and Vashishta, P. (2009), A multilevel parallelization framework for highorder stencil computations. In Euro-Par 2009 Parallel Processing, Springer, pp. 642–653.CrossRef Google Scholar

Elmroth, E. and Gustavson, F. (1998), New serial and parallel recursive QR factorization algorithms for SMP systems. In Applied Parallel Computing: Large Scale Scientific and Industrial Problems (Kågström, B.et al., eds), Vol. 1541 of Lecture Notes in Computer Science, Springer, pp. 120–128.Google Scholar

Floyd, R. (1962), ‘Algorithm 97: Shortest path’, Commun. Assoc. Comput. Mach. 5, 345.Google Scholar

Frens, J. D. and Wise, D. S. (2003), ‘QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism’, ACM SIGPLAN Notices 38, 144–154.Google Scholar

Frigo, M. and Strumpen, V. (2005), Cache oblivious stencil computations. In Proc. 19th Annual International Conference on Supercomputing, ACM, pp. 361–366.Google Scholar

Frigo, M. and Strumpen, V. (2009), ‘The cache complexity of multithreaded cache oblivious algorithms’, Theory Comput. Syst. 45, 203–233.Google Scholar

Frigo, M., Leiserson, C. E., Prokop, H. and Ramachandran, S. (1999), Cache-oblivious algorithms. In Proc. 40th Annual Symposium on Foundations of Computer Science: FOCS '99, IEEE Computer Society, pp. 285–297.Google Scholar

Fuller, S. H. and Millett, L. I., eds (2011), The Future of Computing Performance: Game Over or Next Level? The National Academies Press. http://www.nap.edu Google Scholar

Gannon, D. and Rosendale, J. Van (1984), ‘On the impact of communication complexity on the design of parallel numerical algorithms’, Trans. Comput. 100, 1180–1194.Google Scholar

Georganas, E., Gonzalez-Dominguez, J., Solomonik, E., Zheng, Y., Tourino, J. and Yelick, K. (2012), Communication avoiding and overlapping for numerical linear algebra. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC '12, pp. 1–11.Google Scholar

George, A. (1973), ‘Nested dissection of a regular finite element mesh’, SIAM J. Numer. Anal. 10, 345–363.Google Scholar

Gilbert, J. R. and Tarjan, R. E. (1987), ‘The analysis of a nested dissection algorithm’, Numer. Math. pp. 377–404.Google Scholar

Giraud, L. and Langou, J. (2003), ‘A robust criterion for the modified Gram–Schmidt algorithm with selective reorthogonalization’, SIAM J. Sci. Comput. 25, 417–441.Google Scholar

Giraud, L., Langou, J. and Rozloznik, M. (2005), ‘The loss of orthogonality in the Gram–Schmidt orthogonalization process’, Comput. Math. Appl. 50, 1069–1075.Google Scholar

Golub, G. and Loan, C. Van (1996), Matrix Computations, third edition, Johns Hopkins University Press.Google Scholar

Golub, G. H., Plemmons, R. J. and Sameh, A. (1988), Parallel block schemes for large-scale least-squares computations. In High-Speed Computing: Scientific Applications and Algorithm Design, University of Illinois Press, pp. 171–179.Google Scholar

Graham, S. L., Snir, M. and Patterson, C. A., eds (2004), Getting up to Speed: The Future of Supercomputing, Report of National Research Council of the National Academies Sciences, The National Academies Press.Google Scholar

Granat, R., Kågström, B., Kressner, D. and Shao, M. (2012), Parallel library software for the multishift QR algorithm with aggressive early deflation. Report UMINF 12.06, Department of Computing Science, Umea University, SE-901.Google Scholar

Greenbaum, A. (1997 a), ‘Estimating the attainable accuracy of recursively computed residual methods’, SIAM J. Matrix Anal. Appl. 18, 535–551.Google Scholar

Greenbaum, A. (1997 b), Iterative Methods for Solving Linear Systems, SIAM.Google Scholar

Greenbaum, A., Rozložník, M. and Strakoš, Z. (1997), ‘Numerical behavior of the modified Gram–Schmidt GMRES implementation’, BIT Numer. Math. 37, 706–719.Google Scholar

Greiner, G. and Jacob, R. (2010 a), Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model. In Mathematical Foundations of Computer Science 2010, Springer, pp. 393–404.Google Scholar

Greiner, G. and Jacob, R. (2010 b), The I/O complexity of sparse matrix dense matrix multiplication. In LATIN 2010: Theoretical Informatics, Springer, pp. 143–156.Google Scholar

Grigori, L. and Moufawad, S. (2013), Communication avoiding ILU(0) preconditioned Research report RR-8266, INRIA.Google Scholar

Grigori, L., David, P.-Y., Demmel, J. and Peyronnet, S. (2010), Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem. In Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '10, ACM, pp. 79–81.Google Scholar

Grigori, L., Demmel, J. and Xiang, H. (2011), ‘CALU: A communication optimal LU factorization algorithm’, SIAM J. Matrix Anal. Appl. 32, 1317–1350.Google Scholar

Gu, M. and Eisenstat, S. (1996), ‘Efficient algorithms for computing a strong rank-revealing QR factorization’, SIAM J. Sci. Comput. 17, 848–869.Google Scholar

Gunter, B. C. and van de Geijn, R. A. (2005), ‘Parallel out-of-core computation and updating of the QR factorization’, ACM Trans. Math. Softw. 31, 60–78.Google Scholar

Gustavson, F. G. (1997), ‘Recursion leads to automatic variable blocking for dense linear-algebra algorithms’, IBM J. Res. Dev. 41, 737–756.CrossRef Google Scholar

Gutknecht, M. (1997), Lanczos-type solvers for nonsymmetric linear systems of equations. In Acta Numerica, Vol. 6, Cambridge University Press, pp. 271–398.Google Scholar

Gutknecht, M. and Ressel, K. (2000), ‘Look-ahead procedures for Lanczos-type product methods based on three-term Lanczos recurrences’, SIAM J. Matrix Anal. Appl. 21, 1051–1078.Google Scholar

Gutknecht, M. and Strakoš, Z. (2000), ‘Accuracy of two three-term and three two-term recurrences for Krylov space solvers’, SIAM J. Matrix Anal. Appl. 22, 213–229.Google Scholar

Hackbusch, W. (2006), Hierarchische Matrizen: Algorithmen und Analysis. http://www.mis.mpg.de/scicomp/Fulltext/hmvorlesung.ps Google Scholar

Haidar, A., Luszczek, P., Kurzak, J. and Dongarra, J. (2013), An improved parallel singular value algorithm and its implementation for multicore hardware. LAPACK Working Note 283.Google Scholar

Hestenes, M. R. and Stiefel, E. (1952), ‘Methods of conjugate gradients for solving linear systems’, J. Res. Nat. Bur. Standards 49, 409–436.Google Scholar

Hindmarsh, A. and Walker, H. (1986), Note on a Householder implementation of the GMRES method. Technical report UCID-20899, Lawrence Livermore National Laboratory.Google Scholar

Hoemmen, M. (2010), Communication-avoiding Krylov subspace methods. PhD thesis, EECS Department, UC Berkeley.Google Scholar

Hoffman, A. J., Martin, M. S. and Rose, D. J. (1973), ‘Complexity bounds for regular finite difference and finite element grids’, SIAM J. Numer. Anal. 10, 364–369.Google Scholar

Hong, J. W. and Kung, H. T. (1981), I/O complexity: The red-blue pebble game. In Proc. 13th Annual ACM Symposium on Theory of Computing: STOC '81, ACM, pp. 326–333.Google Scholar

Howell, G. W., Demmel, J., Fulton, C. T., Hammarling, S. and Marmol, K. (2008), ‘Cache efficient bidiagonalization using BLAS 2.5 operators’, ACM Trans. Math. Softw. 34, #14.Google Scholar

Hupp, P. and Jacob, R. (2013), Tight bounds for low dimensional star stencils in the external memory model. In Algorithms and Data Structures (Dehne, F., Solis-Oba, R. and Sack, J.-R., eds), Vol. 8037 of Lecture Notes in Computer Science, Springer, pp. 415–426.Google Scholar

Irigoin, F. and Triolet, R. (1988), Supernode partitioning. In Proc. 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, pp. 319–329.Google Scholar

Irony, D., Toledo, S. and Tiskin, A. (2004), ‘Communication lower bounds for distributed-memory matrix multiplication’, J. Parallel Distrib. Comput. 64, 1017–1026.Google Scholar

Jalby, W. and Philippe, B. (1991), ‘Stability analysis and improvement of the block Gram–Schmidt algorithm’, SIAM J. Sci. Statist. Comput. 12, 1058–1073.Google Scholar

Johnsson, S. L. (1992), Minimizing the communication time for matrix multiplication on multiprocessors’, Parallel Comput.Google Scholar

Joubert, W. and Carey, G. (1992), ‘Parallelizable restarted iterative methods for nonsymmetric linear systems I: Theory’, Internat. J. Comput. Math. 44, 243–267.Google Scholar

Kågström, B., Kressner, D. and Shao, M. (2012), On aggressive early deflation in parallel variants of the QR algorithm. In Applied Parallel and Scientific Computing, Springer, pp. 1–10.Google Scholar

Karlsson, L. and Kågström, B. (2011), ‘Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures’, Parallel Comput. 37, 771–782.Google Scholar

Kepner, J. and Gilbert, J. (2011), Graph Algorithms in the Language of Linear Algebra, Vol. 22, SIAM.Google Scholar

Kielbasinski, A. (1974), ‘Numerical analysis of the Gram–Schmidt orthogonalization algorithm (analiza numeryczna algorytmu ortogonalizacji Grama–Schmidta)’, Roczniki Polskiego Towarzystwa Matematycznego, Seria III: Matematyka Stosowana II, pp. 15–35.Google Scholar

Kim, S. and Chronopoulos, A. (1992), ‘An efficient nonsymmetric Lanczos method on parallel vector computers’, J. Comput. Appl. Math. 42, 357–374.Google Scholar

Knight, N., Carson, E. and Demmel, J. (2014), Exploiting data sparsity in parallel matrix powers computations. In Proc. PPAM '13, Vol. 8384 of Lecture Notes in Computer Science, Springer, to appear.Google Scholar

Kolda, T. G. and Bader, B. W. (2009), ‘Tensor decompositions and applications’, SIAM Review 51, 455–500.Google Scholar

LaMielle, A. and Strout, M. (2010), Enabling code generation within the sparse polyhedral framework. Technical report CS-10-102, Colorado State University.Google Scholar

Lanczos, C. (1950), An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, United States Government Press Office.Google Scholar

Lawson, C. L., Hanson, R. J., Kincaid, D. and Krogh, F. T. (1979), ‘Basic linear algebra subprograms for Fortran usage’, ACM Trans. Math. Softw. 5, 308–323.Google Scholar

Leiserson, C. E., Rao, S. and Toledo, S. (1997), ‘Efficient out-of-core algorithms for linear relaxation using blocking covers’, J. Comput. System Sci. 54, 332–344.Google Scholar

Lev, G. and Valiant, L. (1983), ‘Size bounds for superconcentrators’, Theor. Comput. Sci. 22, 233–251.Google Scholar

Lipshitz, B. (2013), Communication-avoiding parallel recursive algorithms for matrix multiplication. Master's thesis, EECS Department, UC Berkeley.Google Scholar

Lipshitz, B., Ballard, G., Demmel, J. and Schwartz, O. (2012), Communication-avoiding parallel Strassen: Implementation and performance. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis: SC '12, #101.Google Scholar

Loomis, L. H. and Whitney, H. (1949), ‘An inequality related to the isoperimetric inequality’, Bull. Amer. Math. Soc. 55, 961–962.Google Scholar

McColl, W. and Tiskin, A. (1999), ‘Memory-efficient matrix multiplication in the BSP model’, Algorithmica 24, 287–297.Google Scholar

Meurant, G. (2006), The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations, SIAM.Google Scholar

Meurant, G. and Strakos, Z. (2006), The Lanczos and conjugate gradient algorithms in finite precision arithmetic. In Acta Numerica, Vol. 15, Cambridge University Press, pp. 471–542.Google Scholar

Mohanty, S. and Gopalan, S. (2012), I/O efficient QR and QZ algorithms, in 2012 19th International Conference on High Performance Computing: HiPC, pp. 1–9.Google Scholar

Mohiyuddin, M. (2012), Tuning hardware and software for multiprocessors. PhD thesis, EECS Department, UC Berkeley.Google Scholar

Mohiyuddin, M., Hoemmen, M., Demmel, J. and Yelick, K. (2009), Minimizing communication in sparse matrix solvers. In Proc. International Conference on High Performance Computing Networking, Storage and Analysis: SC '09.Google Scholar

Nakatsukasa, Y. and Higham, N. (2012), Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. MIMS EPrint 2012.52, University of Manchester.Google Scholar

Nishtala, R., Vuduc, R. W., Demmel, J. W. and Yelick, K. A. (2007), ‘When cache blocking of sparse matrix vector multiply works and why’, Applicable Algebra in Engineering, Communication and Computing 18, 297–311.Google Scholar

Park, J.-S., Penner, M. and Prasanna, V. (2004), ‘Optimizing graph algorithms for improved cache performance’, IEEE Trans. Parallel Distributed Systems 15, 769–782.Google Scholar

Parlett, B. (1995), The new QD algorithms. In Acta Numerica, Vol. 4, Cambridge University Press, pp. 459–491.Google Scholar

Parlett, B. and Reid, J. (1970), ‘On the solution of a system of linear equations whose matrix is symmetric but not definite’, BIT Numer. Math. 10, 386–397.Google Scholar

Parlett, B., Taylor, D. and Liu, Z. (1985), ‘A look-ahead Lanczos algorithm for unsymmetric matrices’, Math. Comp. 44, 105–124.Google Scholar

Pfeifer, C. (1963), Data flow and storage allocation for the PDQ-5 program on the Philco-2000. Technical report, Westinghouse Electric Corp. Bettis Atomic Power Lab., Pittsburgh.Google Scholar

Philippe, B. and Reichel, L. (2012), ‘On the generation of Krylov subspace bases’, Appl. Numer. Math. 62, 1171–1186.Google Scholar

Puglisi, C. (1992), ‘Modification of the Householder method based on compact WY representation’, SIAM J. Sci. Statist. Comput. 13, 723–726.Google Scholar

Reichel, L. (1990), ‘Newton interpolation at Leja points’, BIT 30, 332–346.Google Scholar

Rozloznik, M., Shklarski, G. and Toledo, S. (2011), ‘Partitioned triangular tridiagonalization’, ACM Trans. Math. Softw. 37, #38.Google Scholar

Saad, Y. (1985), ‘Practical use of polynomial preconditionings for the conjugate gradient method’, SIAM J. Sci. Statist. Comput. 6, 865–881.Google Scholar

Saad, Y. (2003), Iterative Methods for Sparse Linear Systems, second edition, SIAM.Google Scholar

Saad, Y. and Schultz, M. H. (1986), ‘GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems’, SIAM J. Sci. Statist. Comput. 7, 856–869.Google Scholar

Saad, Y., Yeung, M., Erhel, J. and F. Guyomarc'h (2000), ‘A deflated version of the conjugate gradient algorithm’, SIAM J. Sci. Comput. 21, 1909–1926.Google Scholar

Savage, J. E. (1995), Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, Vol. 959, Springer, pp. 270–281.Google Scholar

Schatz, M., Poulson, J. and van de Geijn, R. (2013), Scalable universal matrix multiplication algorithms: 2D and 3D variations on a theme. Technical report, University of Texas.Google Scholar

Schreiber, R. and Loan, C. Van (1989), ‘A storage-efficient WY representation for products of Householder transformations’, SIAM J. Sci. Statist. Comput. 10, 53–57.Google Scholar

Schwarz, H. A. (1870), ‘über einen Grenzübergang durch alternierendes Verfahren’, Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich 15, 272–286.Google Scholar

Scquizzato, M. and Silvestri, F. (2014), Communication lower bounds for distributed-memory computations. In 31st International Symposium on Theoretical Aspects of Computer Science: STACS 2014 (Mayr, E. W. and Portier, N., eds), Vol. 25 of Leibniz International Proceedings in Informatics: LIPIcs, Schloss Dagstuhl-Leibniz-Zentrum für Informatik, pp. 627–638.Google Scholar

Sleijpen, G. and van der Vorst, H. (1996), ‘Reliable updated residuals in hybrid Bi-CG methods’, Computing 56, 141–163.Google Scholar

Smith, B. T., Boyle, J. M., Dongarra, J. J., Garbow, B. S., Ikebe, Y., Klema, V. C. and Moler, C. B. (1976), Matrix Eigensystem Routines: EISPACK Guide, second edition, Springer.Google Scholar

Smoktunowicz, A., Barlow, J. L. and Langou, J. (2006), ‘A note on the error analysis of classical Gram-Schmidt’, Numer. Math. 105, 299–313.Google Scholar

Solomonik, E. and Demmel, J. (2011), Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par 2011 Parallel Processing (Jeannot, E., Namyst, R. and Roman, J., eds), Vol. 6853 of Lecture Notes in Computer Science, Springer, pp. 90–109.Google Scholar

Solomonik, E., Buluc, A. and Demmel, J. (2013), Minimizing communication in all-pairs shortest-paths. In Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS '13, pp. 548–559.Google Scholar

Solomonik, E., Carson, E., Knight, N. and Demmel, J. (2014), Tradeoffs between synchronization, communication, and work in parallel linear algebra computations. Technical report UCB/EECS-2014-8, EECS Department, UC Berkeley.Google Scholar

Solomonik, E., Matthews, D., Hammond, J. and Demmel, J. (2013 c), Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 813–824.Google Scholar

Sorensen, D. (1985), ‘Analysis of pairwise pivoting in Gaussian elimination’, IEEE Trans. Computers C-34, 274–278.Google Scholar

Sorensen, D. (1992), ‘Implicit application of polynomial filters in a k-step Arnoldi method’, SIAM J. Matrix Anal. Appl. 13, 357–385.CrossRef Google Scholar

Stathopoulos, A. and Wu, K. (2002), ‘A block orthogonalization procedure with constant synchronization requirements’, SIAM J. Sci. Comput. 23, 2165–2182.Google Scholar

Stewart, G. (2008), ‘Block Gram-Schmidt orthogonalization’, SIAM J. Sci. Comput. 31, 761–775.CrossRef Google Scholar

Strassen, V. (1969), ‘Gaussian elimination is not optimal’, Numer. Math. 13, 354–356.Google Scholar

Strout, M. M., Carter, L. and Ferrante, J. (2001), Rescheduling for locality in sparse matrix computations. In Computational Science: ICCS 2001, Springer, pp. 137–146.Google Scholar

Sturler, E. (1996), ‘A performance model for Krylov subspace methods on meshbased parallel computers’, Parallel Comput. 22, 57–74.Google Scholar

Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K. and Leiserson, C. E. (2011), The Pochoir stencil compiler. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures, ACM, pp. 117–128.Google Scholar

Thakur, R., Rabenseifner, R. and Gropp, W. (2005), ‘Optimization of collective communication operations in MPICH’, Internat. J. High Performance Comput. Appl. 19, 49–66.Google Scholar

Tiskin, A. (2002), ‘Bulk-synchronous parallel Gaussian elimination’, J. Math. Sci. 108, 977–991.Google Scholar

Tiskin, A. (2007), ‘Communication-efficient parallel generic pairwise elimination’, Future Generation Computer Systems 23, 179–188.Google Scholar

Toledo, S. (1995), Quantitative performance modeling of scientific computations and creating locality in numerical algorithms. PhD thesis, MIT.Google Scholar

Toledo, S. (1997), ‘Locality of reference in LU decomposition with partial pivoting’, SIAM J. Matrix Anal. Appl. 18, 1065–1081.Google Scholar

Tong, C. and Ye, Q. (2000), ‘Analysis of the finite precision bi-conjugate gradient algorithm for nonsymmetric linear systems’, Math. Comp. 69, 1559–1576.Google Scholar

Trefethen, L. and Schreiber, R. (1990), ‘Average-case stability of Gaussian elimination’, SIAM J. Matrix Anal. Appl. 11, 335–360.Google Scholar

van de Geijn, R. A. and Watts, J. (1997), ‘SUMMA: Scalable universal matrix multiplication algorithm’, Concurrency: Pract. Exp. 9, 255–274.Google Scholar

der Vorst, H. Van and Ye, Q. (1999), ‘Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals’, SIAM J. Sci. Comput. 22, 835–852.Google Scholar

Rosendale, J. Van (1983), Minimizing inner product data dependencies in conjugate gradient iteration. Technical report 172178, ICASE-NASA.Google Scholar

Vanderstraeten, D. (1999), A stable and efficient parallel block Gram–Schmidt algorithm. In Euro-Par99 Parallel Processing, Springer, pp. 1128–1135.Google Scholar

Vuduc, R., Demmel, J. and Yelick, K. (2005), OSKI: A library of automatically tuned sparse matrix kernels. In Proc. of SciDAC 2005, J. of Physics Conference Series, Institute of Physics.Google Scholar

Vuduc, R. W. (2003), Automatic performance tuning of sparse matrix kernels. PhD thesis, EECS Department, UC Berkeley.Google Scholar

Walker, H. (1988), ‘Implementation of the GMRES method using Householder transformations’, SIAM J. Sci. Statist. Comput. 9, 152–163.Google Scholar

Warshall, S. (1962), ‘A theorem on Boolean matrices’, J. Assoc. Comput. Mach. 9, 11–12.Google Scholar

Williams, S., Lijewski, M., Almgren, A., Straalen, B. Van, Carson, E., Knight, N. and Demmel, J. (2014), s-step Krylov subspace methods as bottom solvers for geometric multigrid. In Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS '14, to appear.Google Scholar

Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K. and Demmel, J. (2009), ‘Optimization of sparse matrix-vector multiplication on emerging multicore platforms’, Parallel Comput. 35, 178–194.Google Scholar

Williams, V. (2012), Multiplying matrices faster than Coppersmith–Winograd. In Proc. 44th Annual Symposium on Theory of Computing: STOC 12, ACM, pp. 887–898.Google Scholar

Wise, D. (2000), Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In Euro-Par 2000 Parallel Processing (Bode, A., Ludwig, T., Karl, W. and Wismüoller, R., eds), Vol. 1900 of Lecture Notes in Computer Science, Springer, pp. 774–783.Google Scholar

Wolf, M. M., Boman, E. G. and Hendrickson, B. (2008), ‘Optimizing parallel sparse matrix-vector multiplication by corner partitioning’, PARA08, Trondheim, Norway.Google Scholar

Xia, J., Chandrasekaran, S., Gu, M. and Li, X. S. (2010), ‘Fast algorithms for hierarchically semiseparable matrices’, Numer. Linear Algebra Appl. 17, 953–976.CrossRef Google Scholar

Yotov, K., Roeder, T., Pingali, K., Gunnels, J. and Gustavson, F. (2007), An experimental comparison of cache-oblivious and cache-conscious programs. In Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures: SPAA '07, ACM, pp. 93–104.Google Scholar

Yzelman, A. and Bisseling, R. H. (2011), ‘Two-dimensional cache-oblivious sparse matrix-vector multiplication’, Parallel Comput. 37, 806–819.Google Scholar

Article contents

Communication lower bounds and optimal algorithms for numerical linear algebra*†

Extract

Information

Access options

Article purchase

Temporarily unavailable

Footnotes

References

REFERENCES

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests