Aasen, J. O. (1971), ‘On the reduction of a symmetric matrix to tridiagonal form’, BIT Numer. Math. 11, 233–242.
Abdelmalek, N. N. (1971), ‘Round off error analysis for Gram–Schmidt method and solution of linear least squares problems’, BIT Numer. Math. 11, 345–367.
Agarwal, R., Balle, S., Gustavson, F., Joshi, M. and Palkar, P. (1995), ‘A three-dimensional approach to parallel matrix multiplication’, IBM J. Res. Dev. 39, 575–582.
Agarwal, R., Gustavson, F. and Zubair, M. (1994), ‘A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication’, IBM J. Res. Dev. 38, 673–681.
Aggarwal, A. and Vitter, J. (1988), ‘The input/output complexity of sorting and related problems’, Comm. Assoc. Comput. Mach. 31, 1116–1127.
Aggarwal, A., Chandra, A. K. and Snir, M. (1990), ‘Communication complexity of PRAMs’, Theoret. Comput. Sci. 71, 3–28.
Ahmed, N. and Pingali, K. (2000), Automatic generation of block-recursive codes. In Proc. 6th International Euro-Par Conference on Parallel Processing, Springer, pp. 368–378.
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Croz, J. D., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S. and Sorensen, D. (1992), LAPACK Users' Guide, SIAM. Also available from http://www.netlib.org/lapack/.
Anderson, M., Ballard, G., Demmel, J. and Keutzer, K. (2011), Communication-avoiding QR decomposition for GPUs. In Proc. 2011 IEEE International Parallel and Distributed Processing Symposium: IPDPS '11, pp. 48–58.
Arnoldi, W. E. (1951), ‘The principle of minimized iterations in the solution of the matrix eigenvalue problem’, Quart. Appl. Math. 9, 17–29.
Bai, Z. and Day, D. (2000), Block Arnoldi method. In Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Bai, Z., Demmel, J. W., Dongarra, J. J., Ruhe, A. and van der Vorst, H., eds), SIAM, pp. 196–204.
Bai, Z., Day, D., Demmel, J. and Dongarra, J. (1997 a), A test matrix collection for non-Hermitian eigenvalue problems. Technical report CS-97-355, Department of CS, University of Tennessee.
Bai, Z., Demmel, J. and Gu, M. (1997 b), ‘An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems’, Numer. Math. 76, 279–308.
Bai, Z., Hu, D. and Reichel, L. (1991), An implementation of the GMRES method using QR factorization. In Proc. Fifth SIAM Conference on Parallel Processing for Scientific Computing, pp. 84–91.
Bai, Z., Hu, D. and Reichel, L. (1994), ‘A Newton basis GMRES implementation’, IMA J. Numer. Anal. 14, 563–581.
Ballard, G. (2013), Avoiding communication in dense linear algebra. PhD thesis, EECS Department, UC Berkeley.
Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013 a), Communication-avoiding symmetric-indefinite factorization. Technical report UCB/EECS-2013-127, EECS Department, UC Berkeley.
Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013 b), Implementing a blocked Aasen's algorithm with a dynamic scheduler on multicore architectures. In Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS '13, pp. 895–907.
Ballard, G., Buluç, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O. and Toledo, S. (2013 c), Communication optimal parallel multiplication of sparse random matrices. In Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '13, ACM, pp. 222–231.
Ballard, G., Demmel, J. and Dumitriu, I. (2011 a), Communication-optimal parallel and sequential eigenvalue and singular value algorithms. Technical report EECS-2011-14, UC Berkeley.
Ballard, G., Demmel, J. and Gearhart, A. (2011 b), Brief announcement: Communication bounds for heterogeneous architectures. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '11, ACM, pp. 257–258.
Ballard, G., Demmel, J. and Knight, N. (2012 a), Communication avoiding successive band reduction. In Proc. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: PPoPP '12, ACM, pp. 35–44.
Ballard, G., Demmel, J. and Knight, N. (2013 d), Avoiding communication in successive band reduction. Technical report UCB/EECS-2013-131, EECS Department, UC Berkeley.
Ballard, G., Demmel, J., Grigori, L., Jacquelin, M., Nguyen, H. D. and Solomonik, E. (2014), Reconstructing Householder vectors from Tall-Skinny QR. In Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS '14, to appear.
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2010), ‘Communication-optimal parallel and sequential Cholesky decomposition’, SIAM J. Sci. Comput. 32, 3495–3523.
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2011 c), Graph expansion and communication costs of fast matrix multiplication. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '11, ACM, pp. 1–12.
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2011 d), ‘Minimizing communication in numerical linear algebra’, SIAM J. Matrix Anal. Appl. 32, 866–901.
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2012 b), ‘Graph expansion and communication costs of fast matrix multiplication’, J. Assoc. Comput. Mach. 59, #32.
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2012 c), Sequential communication bounds for fast linear algebra. Technical report EECS-2012-36, UC Berkeley.
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 d), Brief announcement: Strong scaling of matrix multiplication algorithms and memory independent communication lower bounds. In Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '12, ACM, pp. 77–79.
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 e), Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '12, ACM, pp. 193–204.
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 f), Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In Design and Analysis of Algorithms (Even, G. and Rawitz, D., eds), Vol. 7659 of Lecture Notes in Computer Science, Springer, pp. 13–36.
Ballard, G., Demmel, J., Lipshitz, B., Schwartz, O. and Toledo, S. (2013 f), Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout. In Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '13, ACM, pp. 232–240.
Barrett, R., Berry, M., Chan, T. F., Demmel, J. W., Donato, J., Dongarra, J. J., Eijkhout, V., Pozo, R., Romine, C. and van der Vorst, H. (1994), Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second edition, SIAM.
Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. (2010), ‘Optimal sparse matrix dense vector multiplication in the I/O-model’, Theory Comput. Syst. 47, 934–962.
Bennett, J., Carbery, A., Christ, M. and Tao, T. (2010), ‘Finite bounds for Hölder–Brascamp–Lieb multilinear inequalities’, Math. Res. Lett. 17, 647–666.
Berntsen, J. (1989), ‘Communication efficient matrix multiplication on hypercubes’, Parallel Comput. 12, 335–342.
Bilardi, G. and Preparata, F. P. (1999), ‘Processor-time tradeoffs under bounded-speed message propagation II: Lower bounds’, Theory Comput. Syst. 32, 531–559.
Bilardi, G., Pietracaprina, A. and D'Alberto, P. (2000), On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science: 26th International Workshop (Brandes, U. and Wagner, D., eds), Vol. 1928 of Lecture Notes in Computer Science, Springer, pp. 47–58.
Bischof, C. and Loan, C. Van (1987), ‘The WY representation for products of Householder matrices’, SIAM J. Sci. Statist. Comput. 8, 2–13.
Bischof, C. H., Lang, B. and Sun, X. (2000 a), ‘Algorithm 807: The SBR Toolbox, Software for successive band reduction’, ACM Trans. Math. Softw. 26, 602–616.
Bischof, C., Lang, B. and Sun, X. (2000 b), ‘A framework for symmetric band reduction’, ACM Trans. Math. Softw. 26, 581–601.
Björck, A. (1967), ‘Solving linear least squares problems by Gram–Schmidt orthogonalization’, BIT Numer. Math. 7, 1–21.
Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C. (1997), ScaLAPACK Users' Guide, SIAM. Also available from http://www.netlib.org/scalapack/.
Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K. and Whaley, R. C. (2002), ‘An updated set of basic linear algebra subroutines (BLAS)’, ACM Trans. Math. Softw. 28, 135–151.
Braman, K., Byers, R. and Mathias, R. (2002 a), ‘The multishift QR algorithm I: Maintaining well-focused shifts and level 3 performance’, SIAM J. Matrix Anal. Appl. 23, 929–947.
Braman, K., Byers, R. and Mathias, R. (2002 b), ‘The multishift QR algorithm II: Aggressive early deflation’, SIAM J. Matrix Anal. Appl. 23, 948–973.
Bunch, J. and Kaufman, L. (1977), ‘Some stable methods for calculating inertia and solving symmetric linear systems’, Math. Comp. 31, 163–179.
Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. J. (2007), A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 191.
Byun, J., Lin, R., Yelick, K. and Demmel, J. (2012), Autotuning sparse matrix-vector multiplication for multicore. Technical report UCB/EECS-2012-215, EECS Department, UC Berkeley.
Cannon, L. (1969), A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University.
Carson, E. and Demmel, J. (2014), ‘A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods’, SIAM J. Matrix Anal. Appl. 35, 22–43.
Carson, E., Knight, N. and Demmel, J. (2013), ‘Avoiding communication in non-symmetric Lanczos-based Krylov subspace methods’, SIAM J. Sci. Comput. 35, S42–S61.
Çatalyürek, Ü. V. and Aykanat, C. (1999), ‘Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication’, IEEE Trans. Parallel Distributed Systems 10, 673–693.
Çatalyürek, Ü. V. and Aykanat, C. (2001), A fine-grain hypergraph model for 2D decomposition of sparse matrices. In Proc. 15th IEEE International Parallel and Distributed Processing Symposium: IPDPS '01.
Chan, E., Heimlich, M., Purkayastha, A. and van de Geijn, R. (2007), ‘Collective communication: Theory, practice, and experience’, Concurrency and Computation: Pract. Exp. 19, 1749–1783.
Chow, E. (2000), ‘A priori sparsity patterns for parallel sparse approximate inverse preconditioners’, SIAM J. Sci. Comput. 21, 1804–1822.
Chow, E. (2001), ‘Parallel implementation and practical use of sparse approximate inverses with a priori sparsity patterns’, Internat. J. High Perf. Comput. Appl. 15, 56–74.
Christ, M., Demmel, J., Knight, N., Scanlon, T. and Yelick, K. (2013), Communication lower bounds and optimal algorithms for programs that reference arrays, part 1. Technical report UCB/EECS-2013-61, EECS Department, UC Berkeley.
Chronopoulos, A. and Gear, C. (1989 a), ‘On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy’, Parallel Comput. 11, 37–53.
Chronopoulos, A. and Gear, C. (1989 b), ‘s-step iterative methods for symmetric linear systems’, J. Comput. Appl. Math. 25, 153–168.
Chronopoulos, A. and Swanson, C. (1996), ‘Parallel iterative s-step methods for unsymmetric linear systems’, Parallel Comput. 22, 623–641.
Cohen, E. (1994), Estimating the size of the transitive closure in linear time. In Proc. 35th Ann. Symp. Found. Comp. Sci., IEEE, pp. 190–200.
Cohen, E. (1997), ‘Size-estimation framework with applications to transitive closure and reachability’, J. Comput. System Sci. 55, 441–453.
Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council (2013), Frontiers in Massive Data Analysis, The National Academies Press.
da Cunha, R. D., Becker, D. and Patterson, J. C. (2002), New parallel (rank-revealing) QR factorization algorithms. In Euro-Par 2002 Parallel Processing, Springer, pp. 677–686.
Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J. and Yelick, K. (2008), Stencil computation optimization and autotuning on state-of-the-art multicore architectures. In Proc. 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, p. 4.
Davis, T. and Hu, Y. (2011), ‘The University of Florida sparse matrix collection’, ACM Trans. Math. Softw. 38, 1–25.
Dekel, E., Nassimi, D. and Sahni, S. (1981), ‘Parallel matrix and graph algorithms’, SIAM J. Comput. 10, 657–675.
Demmel, J. (1997), Applied Numerical Linear Algebra, SIAM.
Demmel, J. (2013), An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra. Technical report UCB/EECS-2013-126, EECS Department, UC Berkeley.
Demmel, J., Dumitriu, I. and Holtz, O. (2007 a), ‘Fast linear algebra is stable’, Numer. Math. 108, 59–91.
Demmel, J., Dumitriu, I., Holtz, O. and Kleinberg, R. (2007 b), ‘Fast matrix multiplication is stable’, Numer. Math. 106, 199–224.
Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O. and Spillinger, O. (2013 a), Communication-optimal parallel recursive rectangular matrix multiplication. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 261–272.
Demmel, J., Gearhart, A., Lipshitz, B. and Schwartz, O. (2013 b), Perfect strong scaling using no additional energy. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 649–660.
Demmel, J., Grigori, L., Gu, M. and Xiang, H. (2013 c), Communication avoiding rank revealing QR factorization with column pivoting. Technical report UCB/EECS-2013-46, EECS Department, UC Berkeley.
Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2008 a), Communication-avoiding parallel and sequential QR and LU factorizations: Theory and practice. LAPACK Working Note.
Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2012), ‘Communication-optimal parallel and sequential QR and LU factorizations’, SIAM J. Sci. Comput. 34, A206–A239.
Demmel, J., Hoemmen, M., Mohiyuddin, M. and Yelick, K. (2007 c), Avoiding communication in computing Krylov subspaces. Technical report UCB/EECS-2007-123, EECS Department, UC Berkeley.
Demmel, J., Hoemmen, M., Mohiyuddin, M. and Yelick, K. (2008 b), Avoiding communication in sparse matrix computations. In Proc. 2008 IEEE International Parallel and Distributed Processing Symposium: IPDPS 2008, pp. 1–12.
Demmel, J., Marques, O., Parlett, B. and Vömel, C. (2008 c), ‘Performance and accuracy of LAPACK's symmetric tridiagonal eigensolvers’, SIAM J. Sci. Comput. 30, 1508–1526.
Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H. and Çatalyürek, Ü. V. (2006), Parallel hypergraph partitioning for scientific computing. In Proc. 20th IEEE International Parallel and Distributed Processing Symposium: IPDPS 2006.
Dongarra, J. J., Croz, J. D., Duff, I. S. and Hammarling, S. (1990 a), ‘Algorithm 679: A set of level 3 basic linear algebra subprograms’, ACM Trans. Math. Softw. 16, 18–28.
Dongarra, J. J., Croz, J. D., Duff, I. S. and Hammarling, S. (1990 b), ‘A set of level 3 basic linear algebra subprograms’, ACM Trans. Math. Softw. 16, 1–17.
Dongarra, J. J., Croz, J. D., Hammarling, S. and Hanson, R. J. (1988 a), ‘Algorithm 656: An extended set of Fortran basic linear algebra subprograms’, ACM Trans. Math. Softw. 14, 18–32.
Dongarra, J. J., Croz, J. D., Hammarling, S. and Hanson, R. J. (1988 b), ‘An extended set of Fortran basic linear algebra subprograms’, ACM Trans. Math. Softw. 14, 1–17.
Dongarra, J. J., Moler, C. B., Bunch, J. R. and Stewart, G. W. (1979), LINPACK Users' Guide, SIAM.
Douglas, C. C., Hu, J., Kowarschik, M., Rüde, U. and Weiß, C. (2000), ‘Cache optimization for structured and unstructured grid multigrid’, Electron. Trans. Numer. Anal. 10, 21–40.
Driscoll, M., Georganas, E., Koanantakool, P., Solomonik, E. and Yelick, K. (2013), A communication-optimal N-body algorithm for direct interactions. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 1075–1084.
Dursun, H., Nomura, K.-I., Peng, L., Seymour, R., Wang, W., Kalia, R. K., Nakano, A. and Vashishta, P. (2009), A multilevel parallelization framework for high-order stencil computations. In Euro-Par 2009 Parallel Processing, Springer, pp. 642–653.
Elmroth, E. and Gustavson, F. (1998), New serial and parallel recursive QR factorization algorithms for SMP systems. In Applied Parallel Computing: Large Scale Scientific and Industrial Problems (Kågström, B. et al., eds), Vol. 1541 of Lecture Notes in Computer Science, Springer, pp. 120–128.
Floyd, R. (1962), ‘Algorithm 97: Shortest path’, Commun. Assoc. Comput. Mach. 5, 345.
Frens, J. D. and Wise, D. S. (2003), ‘QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism’, ACM SIGPLAN Notices 38, 144–154.
Frigo, M. and Strumpen, V. (2005), Cache oblivious stencil computations. In Proc. 19th Annual International Conference on Supercomputing, ACM, pp. 361–366.
Frigo, M. and Strumpen, V. (2009), ‘The cache complexity of multithreaded cache oblivious algorithms’, Theory Comput. Syst. 45, 203–233.
Frigo, M., Leiserson, C. E., Prokop, H. and Ramachandran, S. (1999), Cache-oblivious algorithms. In Proc. 40th Annual Symposium on Foundations of Computer Science: FOCS '99, IEEE Computer Society, pp. 285–297.
Fuller, S. H. and Millett, L. I., eds (2011), The Future of Computing Performance: Game Over or Next Level? The National Academies Press. http://www.nap.edu
Gannon, D. and Rosendale, J. Van (1984), ‘On the impact of communication complexity on the design of parallel numerical algorithms’, Trans. Comput. 100, 1180–1194.
Georganas, E., Gonzalez-Dominguez, J., Solomonik, E., Zheng, Y., Tourino, J. and Yelick, K. (2012), Communication avoiding and overlapping for numerical linear algebra. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC '12, pp. 1–11.
George, A. (1973), ‘Nested dissection of a regular finite element mesh’, SIAM J. Numer. Anal. 10, 345–363.
Gilbert, J. R. and Tarjan, R. E. (1987), ‘The analysis of a nested dissection algorithm’, Numer. Math. pp. 377–404.
Giraud, L. and Langou, J. (2003), ‘A robust criterion for the modified Gram–Schmidt algorithm with selective reorthogonalization’, SIAM J. Sci. Comput. 25, 417–441.
Giraud, L., Langou, J. and Rozložník, M. (2005), ‘The loss of orthogonality in the Gram–Schmidt orthogonalization process’, Comput. Math. Appl. 50, 1069–1075.
Golub, G. and Loan, C. Van (1996), Matrix Computations, third edition, Johns Hopkins University Press.
Golub, G. H., Plemmons, R. J. and Sameh, A. (1988), Parallel block schemes for large-scale least-squares computations. In High-Speed Computing: Scientific Applications and Algorithm Design, University of Illinois Press, pp. 171–179.
Graham, S. L., Snir, M. and Patterson, C. A., eds (2004), Getting up to Speed: The Future of Supercomputing, Report of National Research Council of the National Academies Sciences, The National Academies Press.
Granat, R., Kågström, B., Kressner, D. and Shao, M. (2012), Parallel library software for the multishift QR algorithm with aggressive early deflation. Report UMINF 12.06, Department of Computing Science, Umeå University, SE-901.
Greenbaum, A. (1997 a), ‘Estimating the attainable accuracy of recursively computed residual methods’, SIAM J. Matrix Anal. Appl. 18, 535–551.
Greenbaum, A. (1997 b), Iterative Methods for Solving Linear Systems, SIAM.
Greenbaum, A., Rozložník, M. and Strakoš, Z. (1997), ‘Numerical behavior of the modified Gram–Schmidt GMRES implementation’, BIT Numer. Math. 37, 706–719.
Greiner, G. and Jacob, R. (2010 a), Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model. In Mathematical Foundations of Computer Science 2010, Springer, pp. 393–404.
Greiner, G. and Jacob, R. (2010 b), The I/O complexity of sparse matrix dense matrix multiplication. In LATIN 2010: Theoretical Informatics, Springer, pp. 143–156.
Grigori, L. and Moufawad, S. (2013), Communication avoiding ILU(0) preconditioner. Research report RR-8266, INRIA.
Grigori, L., David, P.-Y., Demmel, J. and Peyronnet, S. (2010), Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem. In Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '10, ACM, pp. 79–81.
Grigori, L., Demmel, J. and Xiang, H. (2011), ‘CALU: A communication optimal LU factorization algorithm’, SIAM J. Matrix Anal. Appl. 32, 1317–1350.
Gu, M. and Eisenstat, S. (1996), ‘Efficient algorithms for computing a strong rank-revealing QR factorization’, SIAM J. Sci. Comput. 17, 848–869.
Gunter, B. C. and van de Geijn, R. A. (2005), ‘Parallel out-of-core computation and updating of the QR factorization’, ACM Trans. Math. Softw. 31, 60–78.
Gustavson, F. G. (1997), ‘Recursion leads to automatic variable blocking for dense linear-algebra algorithms’, IBM J. Res. Dev. 41, 737–756.
Gutknecht, M. (1997), Lanczos-type solvers for nonsymmetric linear systems of equations. In Acta Numerica, Vol. 6, Cambridge University Press, pp. 271–398.
Gutknecht, M. and Ressel, K. (2000), ‘Look-ahead procedures for Lanczos-type product methods based on three-term Lanczos recurrences’, SIAM J. Matrix Anal. Appl. 21, 1051–1078.
Gutknecht, M. and Strakoš, Z. (2000), ‘Accuracy of two three-term and three two-term recurrences for Krylov space solvers’, SIAM J. Matrix Anal. Appl. 22, 213–229.
Haidar, A., Luszczek, P., Kurzak, J. and Dongarra, J. (2013), An improved parallel singular value algorithm and its implementation for multicore hardware. LAPACK Working Note 283.
Hestenes, M. R. and Stiefel, E. (1952), ‘Methods of conjugate gradients for solving linear systems’, J. Res. Nat. Bur. Standards 49, 409–436.
Hindmarsh, A. and Walker, H. (1986), Note on a Householder implementation of the GMRES method. Technical report UCID-20899, Lawrence Livermore National Laboratory.
Hoemmen, M. (2010), Communication-avoiding Krylov subspace methods. PhD thesis, EECS Department, UC Berkeley.
Hoffman, A. J., Martin, M. S. and Rose, D. J. (1973), ‘Complexity bounds for regular finite difference and finite element grids’, SIAM J. Numer. Anal. 10, 364–369.
Hong, J. W. and Kung, H. T. (1981), I/O complexity: The red-blue pebble game. In Proc. 13th Annual ACM Symposium on Theory of Computing: STOC '81, ACM, pp. 326–333.
Howell, G. W., Demmel, J., Fulton, C. T., Hammarling, S. and Marmol, K. (2008), ‘Cache efficient bidiagonalization using BLAS 2.5 operators’, ACM Trans. Math. Softw. 34, #14.
Hupp, P. and Jacob, R. (2013), Tight bounds for low dimensional star stencils in the external memory model. In Algorithms and Data Structures (Dehne, F., Solis-Oba, R. and Sack, J.-R., eds), Vol. 8037 of Lecture Notes in Computer Science, Springer, pp. 415–426.
Irigoin, F. and Triolet, R. (1988), Supernode partitioning. In Proc. 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, pp. 319–329.
Irony, D., Toledo, S. and Tiskin, A. (2004), ‘Communication lower bounds for distributed-memory matrix multiplication’, J. Parallel Distrib. Comput. 64, 1017–1026.
Jalby, W. and Philippe, B. (1991), ‘Stability analysis and improvement of the block Gram–Schmidt algorithm’, SIAM J. Sci. Statist. Comput. 12, 1058–1073.
Johnsson, S. L. (1992), ‘Minimizing the communication time for matrix multiplication on multiprocessors’, Parallel Comput.
Joubert, W. and Carey, G. (1992), ‘Parallelizable restarted iterative methods for nonsymmetric linear systems I: Theory’, Internat. J. Comput. Math. 44, 243–267.
Kågström, B., Kressner, D. and Shao, M. (2012), On aggressive early deflation in parallel variants of the QR algorithm. In Applied Parallel and Scientific Computing, Springer, pp. 1–10.
Karlsson, L. and Kågström, B. (2011), ‘Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures’, Parallel Comput. 37, 771–782.
Kepner, J. and Gilbert, J. (2011), Graph Algorithms in the Language of Linear Algebra, Vol. 22, SIAM.
Kielbasinski, A. (1974), ‘Numerical analysis of the Gram–Schmidt orthogonalization algorithm (analiza numeryczna algorytmu ortogonalizacji Grama–Schmidta)’, Roczniki Polskiego Towarzystwa Matematycznego, Seria III: Matematyka Stosowana II, pp. 15–35.
Kim, S. and Chronopoulos, A. (1992), ‘An efficient nonsymmetric Lanczos method on parallel vector computers’, J. Comput. Appl. Math. 42, 357–374.
Knight, N., Carson, E. and Demmel, J. (2014), Exploiting data sparsity in parallel matrix powers computations. In Proc. PPAM '13, Vol. 8384 of Lecture Notes in Computer Science, Springer, to appear.
Kolda, T. G. and Bader, B. W. (2009), ‘Tensor decompositions and applications’, SIAM Review 51, 455–500.
LaMielle, A. and Strout, M. (2010), Enabling code generation within the sparse polyhedral framework. Technical report CS-10-102, Colorado State University.
Lanczos, C. (1950), An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, United States Government Press Office.
Lawson, C. L., Hanson, R. J., Kincaid, D. and Krogh, F. T. (1979), ‘Basic linear algebra subprograms for Fortran usage’, ACM Trans. Math. Softw. 5, 308–323.
Leiserson, C. E., Rao, S. and Toledo, S. (1997), ‘Efficient out-of-core algorithms for linear relaxation using blocking covers’, J. Comput. System Sci. 54, 332–344.
Lev, G. and Valiant, L. (1983), ‘Size bounds for superconcentrators’, Theoret. Comput. Sci. 22, 233–251.
Lipshitz, B. (2013), Communication-avoiding parallel recursive algorithms for matrix multiplication. Master's thesis, EECS Department, UC Berkeley.
Lipshitz, B., Ballard, G., Demmel, J. and Schwartz, O. (2012), Communication-avoiding parallel Strassen: Implementation and performance. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis: SC '12, #101.
Loomis, L. H. and Whitney, H. (1949), ‘An inequality related to the isoperimetric inequality’, Bull. Amer. Math. Soc. 55, 961–962.
McColl, W. and Tiskin, A. (1999), ‘Memory-efficient matrix multiplication in the BSP model’, Algorithmica 24, 287–297.
Meurant, G. (2006), The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations, SIAM.
Meurant, G. and Strakoš, Z. (2006), The Lanczos and conjugate gradient algorithms in finite precision arithmetic. In Acta Numerica, Vol. 15, Cambridge University Press, pp. 471–542.
Mohanty, S. and Gopalan, S. (2012), I/O efficient QR and QZ algorithms. In 2012 19th International Conference on High Performance Computing: HiPC, pp. 1–9.
Mohiyuddin, M. (2012), Tuning hardware and software for multiprocessors. PhD thesis, EECS Department, UC Berkeley.
Mohiyuddin, M., Hoemmen, M., Demmel, J. and Yelick, K. (2009), Minimizing communication in sparse matrix solvers. In Proc. International Conference on High Performance Computing Networking, Storage and Analysis: SC '09.
Nakatsukasa, Y. and Higham, N. (2012), Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. MIMS EPrint 2012.52, University of Manchester.
Nishtala, R., Vuduc, R. W., Demmel, J. W. and Yelick, K. A. (2007), ‘When cache blocking of sparse matrix vector multiply works and why’, Applicable Algebra in Engineering, Communication and Computing 18, 297–311.
Park, J.-S., Penner, M. and Prasanna, V. (2004), ‘Optimizing graph algorithms for improved cache performance’, IEEE Trans. Parallel Distributed Systems 15, 769–782.
Parlett, B. (1995), The new QD algorithms. In Acta Numerica, Vol. 4, Cambridge University Press, pp. 459–491.
Parlett, B. and Reid, J. (1970), ‘On the solution of a system of linear equations whose matrix is symmetric but not definite’, BIT Numer. Math. 10, 386–397.
Parlett, B., Taylor, D. and Liu, Z. (1985), ‘A look-ahead Lanczos algorithm for unsymmetric matrices’, Math. Comp. 44, 105–124.
Pfeifer, C. (1963), Data flow and storage allocation for the PDQ-5 program on the Philco-2000. Technical report, Westinghouse Electric Corp. Bettis Atomic Power Lab., Pittsburgh.
Philippe, B. and Reichel, L. (2012), ‘On the generation of Krylov subspace bases’, Appl. Numer. Math. 62, 1171–1186.
Puglisi, C. (1992), ‘Modification of the Householder method based on compact WY representation’, SIAM J. Sci. Statist. Comput. 13, 723–726.
Reichel, L. (1990), ‘Newton interpolation at Leja points’, BIT Numer. Math. 30, 332–346.
Rozložník, M., Shklarski, G. and Toledo, S. (2011), ‘Partitioned triangular tridiagonalization’, ACM Trans. Math. Softw. 37, #38.
Saad, Y. (1985), ‘Practical use of polynomial preconditionings for the conjugate gradient method’, SIAM J. Sci. Statist. Comput. 6, 865–881.
Saad, Y. (2003), Iterative Methods for Sparse Linear Systems, second edition, SIAM.
Saad, Y. and Schultz, M. H. (1986), ‘GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems’, SIAM J. Sci. Statist. Comput. 7, 856–869.
Saad, Y., Yeung, M., Erhel, J. and Guyomarc'h, F. (2000), ‘A deflated version of the conjugate gradient algorithm’, SIAM J. Sci. Comput. 21, 1909–1926.
Savage, J. E. (1995), Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, Vol. 959, Springer, pp. 270–281.
Schatz, M., Poulson, J. and van de Geijn, R. (2013), Scalable universal matrix multiplication algorithms: 2D and 3D variations on a theme. Technical report, University of Texas.
Schreiber, R. and Loan, C. Van (1989), ‘A storage-efficient WY representation for products of Householder transformations’, SIAM J. Sci. Statist. Comput. 10, 53–57.
Schwarz, H. A. (1870), ‘Über einen Grenzübergang durch alternierendes Verfahren’, Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich 15, 272–286.
Scquizzato, M. and Silvestri, F. (2014), Communication lower bounds for distributed-memory computations. In 31st International Symposium on Theoretical Aspects of Computer Science: STACS 2014 (Mayr, E. W. and Portier, N., eds), Vol. 25 of Leibniz International Proceedings in Informatics: LIPIcs, Schloss Dagstuhl-Leibniz-Zentrum für Informatik, pp. 627–638.
Sleijpen, G. and van der Vorst, H. (1996), ‘Reliable updated residuals in hybrid Bi-CG methods’, Computing 56, 141–163.
Smith, B. T., Boyle, J. M., Dongarra, J. J., Garbow, B. S., Ikebe, Y., Klema, V. C. and Moler, C. B. (1976), Matrix Eigensystem Routines: EISPACK Guide, second edition, Springer.
Smoktunowicz, A., Barlow, J. L. and Langou, J. (2006), ‘A note on the error analysis of classical Gram–Schmidt’, Numer. Math. 105, 299–313.
Solomonik, E. and Demmel, J. (2011), Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par 2011 Parallel Processing (Jeannot, E., Namyst, R. and Roman, J., eds), Vol. 6853 of Lecture Notes in Computer Science, Springer, pp. 90–109.
Solomonik, E., Buluç, A. and Demmel, J. (2013), Minimizing communication in all-pairs shortest-paths. In Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS '13, pp. 548–559.
Solomonik, E., Carson, E., Knight, N. and Demmel, J. (2014), Tradeoffs between synchronization, communication, and work in parallel linear algebra computations. Technical report UCB/EECS-2014-8, EECS Department, UC Berkeley.
Solomonik, E., Matthews, D., Hammond, J. and Demmel, J. (2013 c), Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 813–824.
Sorensen, D. (1985), ‘Analysis of pairwise pivoting in Gaussian elimination’, IEEE Trans. Computers C-34, 274–278.
Sorensen, D. (1992), ‘Implicit application of polynomial filters in a k-step Arnoldi method’, SIAM J. Matrix Anal. Appl. 13, 357–385.
Stathopoulos, A. and Wu, K. (2002), ‘A block orthogonalization procedure with constant synchronization requirements’, SIAM J. Sci. Comput. 23, 2165–2182.
Stewart, G. (2008), ‘Block Gram–Schmidt orthogonalization’, SIAM J. Sci. Comput. 31, 761–775.
Strassen, V. (1969), ‘Gaussian elimination is not optimal’, Numer. Math. 13, 354–356.
Strout, M. M., Carter, L. and Ferrante, J. (2001), Rescheduling for locality in sparse matrix computations. In Computational Science: ICCS 2001, Springer, pp. 137–146.
de Sturler, E. (1996), ‘A performance model for Krylov subspace methods on mesh-based parallel computers’, Parallel Comput. 22, 57–74.
Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K. and Leiserson, C. E. (2011), The Pochoir stencil compiler. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures, ACM, pp. 117–128.
Thakur, R., Rabenseifner, R. and Gropp, W. (2005), ‘Optimization of collective communication operations in MPICH’, Internat. J. High Performance Comput. Appl. 19, 49–66.
Tiskin, A. (2002), ‘Bulk-synchronous parallel Gaussian elimination’, J. Math. Sci. 108, 977–991.
Tiskin, A. (2007), ‘Communication-efficient parallel generic pairwise elimination’, Future Generation Computer Systems 23, 179–188.
Toledo, S. (1995), Quantitative performance modeling of scientific computations and creating locality in numerical algorithms. PhD thesis, MIT.
Toledo, S. (1997), ‘Locality of reference in LU decomposition with partial pivoting’, SIAM J. Matrix Anal. Appl. 18, 1065–1081.
Tong, C. and Ye, Q. (2000), ‘Analysis of the finite precision bi-conjugate gradient algorithm for nonsymmetric linear systems’, Math. Comp. 69, 1559–1576.
Trefethen, L. and Schreiber, R. (1990), ‘Average-case stability of Gaussian elimination’, SIAM J. Matrix Anal. Appl. 11, 335–360.
van de Geijn, R. A. and Watts, J. (1997), ‘SUMMA: Scalable universal matrix multiplication algorithm’, Concurrency: Pract. Exp. 9, 255–274.
van der Vorst, H. and Ye, Q. (1999), ‘Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals’, SIAM J. Sci. Comput. 22, 835–852.
Van Rosendale, J. (1983), Minimizing inner product data dependencies in conjugate gradient iteration. Technical report 172178, ICASE-NASA.
Vanderstraeten, D. (1999), A stable and efficient parallel block Gram–Schmidt algorithm. In Euro-Par'99 Parallel Processing, Springer, pp. 1128–1135.
Vuduc, R., Demmel, J. and Yelick, K. (2005), OSKI: A library of automatically tuned sparse matrix kernels. In Proc. of SciDAC 2005, J. of Physics Conference Series, Institute of Physics.
Vuduc, R. W. (2003), Automatic performance tuning of sparse matrix kernels. PhD thesis, EECS Department, UC Berkeley.
Walker, H. (1988), ‘Implementation of the GMRES method using Householder transformations’, SIAM J. Sci. Statist. Comput. 9, 152–163.
Warshall, S. (1962), ‘A theorem on Boolean matrices’, J. Assoc. Comput. Mach. 9, 11–12.
Williams, S., Lijewski, M., Almgren, A., Van Straalen, B., Carson, E., Knight, N. and Demmel, J. (2014), s-step Krylov subspace methods as bottom solvers for geometric multigrid. In Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS '14, to appear.
Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K. and Demmel, J. (2009), ‘Optimization of sparse matrix-vector multiplication on emerging multicore platforms’, Parallel Comput. 35, 178–194.
Williams, V. (2012), Multiplying matrices faster than Coppersmith–Winograd. In Proc. 44th Annual Symposium on Theory of Computing: STOC '12, ACM, pp. 887–898.
Wise, D. (2000), Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In Euro-Par 2000 Parallel Processing (Bode, A., Ludwig, T., Karl, W. and Wismüller, R., eds), Vol. 1900 of Lecture Notes in Computer Science, Springer, pp. 774–783.
Wolf, M. M., Boman, E. G. and Hendrickson, B. (2008), ‘Optimizing parallel sparse matrix-vector multiplication by corner partitioning’, PARA08, Trondheim, Norway.
Xia, J., Chandrasekaran, S., Gu, M. and Li, X. S. (2010), ‘Fast algorithms for hierarchically semiseparable matrices’, Numer. Linear Algebra Appl. 17, 953–976.
Yotov, K., Roeder, T., Pingali, K., Gunnels, J. and Gustavson, F. (2007), An experimental comparison of cache-oblivious and cache-conscious programs. In Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures: SPAA '07, ACM, pp. 93–104.
Yzelman, A. and Bisseling, R. H. (2011), ‘Two-dimensional cache-oblivious sparse matrix-vector multiplication’, Parallel Comput. 37, 806–819.