A Survey on Parallel Computing and its Applications in Data-Parallel Problems Using GPU Architectures

  • Cristóbal A. Navarro, Nancy Hitschfeld-Kahler and Luis Mateu

Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores confirms that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Supercomputing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; several technical restrictions must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model and cellular automata simulations. These examples represent well the kind of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems, and achieve speedups of up to two orders of magnitude over sequential implementations.
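Of the four example problems above, cellular automata illustrate the data-parallel pattern most directly: every cell's next state depends only on the previous generation, so all cells can be updated independently, which is exactly how a GPU assigns one thread per cell. A minimal CPU-side sketch of this pattern, using NumPy and Conway's Game of Life as a stand-in update rule (an illustrative example, not code from the survey):

```python
import numpy as np

def life_step(grid):
    """One synchronous cellular-automaton update (Game of Life rule).

    Every cell's new state depends only on the previous grid, so all
    cells can be computed independently -- the data-parallel pattern
    that maps one GPU thread per cell.
    """
    # Sum the 8 neighbours with periodic (toroidal) boundaries.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 live neighbours; survival on 2 or 3.
    return ((neighbours == 3) |
            ((grid == 1) & (neighbours == 2))).astype(np.uint8)

# A "blinker" (three cells in a row) oscillates with period 2.
g = np.zeros((5, 5), dtype=np.uint8)
g[2, 1:4] = 1
assert np.array_equal(life_step(life_step(g)), g)
```

On a GPU the same synchronous update becomes a kernel in which each thread reads its cell's neighbourhood from the previous grid and writes one cell of the next grid; because there is no inter-thread dependency within a generation, such problems tend to reach the large speedups discussed in the survey.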



Communications in Computational Physics
  • ISSN: 1815-2406
  • EISSN: 1991-7120
  • URL: /core/journals/communications-in-computational-physics