Communication lower bounds and optimal algorithms for numerical linear algebra*

  • G. Ballard (a1), E. Carson (a2), J. Demmel (a2) (a3), M. Hoemmen (a4), N. Knight (a2) and O. Schwartz (a2)
  • Published online: 12 May 2014

The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical ($O(n^3)$) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.
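
To make the notion of a communication lower bound concrete, the sketch below counts words moved between slow and fast memory for classical matrix multiplication under a simple two-level memory model. It is an illustration only, not the paper's analysis. The assumptions are: a fast memory holding M words, square n-by-n matrices with n divisible by the tile size b, and three b-by-b tiles resident in fast memory at once. The function names are hypothetical. The reference scale n^3/sqrt(M) is the well-known asymptotic lower bound for classical matrix multiplication in this model, stated here without its constant factor.

```python
import math


def naive_matmul_words(n):
    """Model of an untiled C = A*B with no reuse in fast memory.

    Each of the n*n entries of C reads one row of A and one column of B
    (2n words) and writes one word.
    """
    return n * n * (2 * n + 1)


def blocked_matmul_words(n, b):
    """Model of a tiled C = A*B using b-by-b tiles.

    For each tile of C: read it once, stream the matching tiles of A and B
    through fast memory, then write the tile of C back once.
    """
    if n % b != 0:
        raise ValueError("this sketch assumes b divides n")
    nb = n // b
    words = 0
    for _i in range(nb):
        for _j in range(nb):
            words += b * b              # read tile of C
            for _k in range(nb):
                words += 2 * b * b      # read one tile of A and one of B
            words += b * b              # write tile of C
    return words


def largest_tile(M):
    """Largest b such that three b-by-b tiles fit in M words."""
    return math.isqrt(M // 3)


def lower_bound_scale(n, M):
    """n^3 / sqrt(M): the asymptotic lower-bound scale, constant omitted."""
    return n ** 3 / math.sqrt(M)
```

Under this model the tiled count simplifies to 2n^2 + 2n^3/b, so choosing b close to sqrt(M/3) brings the traffic within a constant factor of n^3/sqrt(M), whereas the untiled count grows like 2n^3 regardless of M. For example, comparing `blocked_matmul_words(64, largest_tile(768))` with `naive_matmul_words(64)` shows the gap for a small case.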


We acknowledge funding from Microsoft (award 024263) and Intel (award 024894), and matching funding by UC Discovery (award DIG07-10227). Additional support comes from ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle and Samsung, as well as MathWorks. Research is also supported by DOE grants DE-SC0004938, DE-SC0005136, DE-SC0003959, DE-SC0008700, DE-SC0010200, DE-FC02-06-ER25786, AC02-05CH11231, and DARPA grant HR0011-12-2-0016. This research is supported by grant 3-10891 from the Ministry of Science and Technology, Israel, and grant 2010231 from the US-Israel Bi-National Science Foundation. This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its US Department of Energy Contract DE-AC04-94AL85000.

Colour online for monochrome figures available at



Acta Numerica
  • ISSN: 0962-4929
  • EISSN: 1474-0508
  • URL: /core/journals/acta-numerica