Large-Scale Data Management Techniques in Cloud Computing Platforms

Sherif Sakr; Anna Liu

doi:10.1017/CBO9780511844409.005

5 - Large-Scale Data Management Techniques in Cloud Computing Platforms

Published online by Cambridge University Press: 05 December 2012

Edited by Ian Gorton and Deborah K. Gracio

Sherif Sakr: Affiliation: National ICT Australia (NICTA), University of New South Wales
Anna Liu: Affiliation: National ICT Australia (NICTA), University of New South Wales
Ian Gorton: Affiliation: Pacific Northwest National Laboratory, Washington
Deborah K. Gracio: Affiliation: Pacific Northwest National Laboratory, Washington

Summary

Introduction

In the last two decades, the continuous increase in computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. In a speech given just a few weeks before he was lost at sea off the California coast in January 2007, Jim Gray, a database software pioneer and a Microsoft researcher, called the shift a “fourth paradigm” [32]. The first three paradigms were experimental, theoretical and, more recently, computational science. Gray argued that the only way to cope with this paradigm is to develop a new generation of computing tools to manage, visualize, and analyze the data flood. In general, current computer architectures are increasingly imbalanced: the latency gap between multicore CPUs and mechanical hard disks grows every year, which makes the challenges of data-intensive computing harder to overcome [6]. Therefore, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future. In response, Gray argued that the new trend should focus on supporting cheaper clusters of computers to manage and process all this data, rather than on building the biggest and fastest single computer. Figure 5.1 illustrates an example of the explosion in scientific data, which creates major challenges for cutting-edge scientific projects. For example, modern high-energy physics experiments, such as DZero, typically generate more than one terabyte of data per day.

Type: Chapter
Information: Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 85–123

DOI: https://doi.org/10.1017/CBO9780511844409.005

Publisher: Cambridge University Press

Print publication year: 2012

References

1. Abadi, D. “Data Management in the Cloud: Limitations and Opportunities.” IEEE Data Eng. Bull. 32, no. 1 (2009): 3–12.

2. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Rasin, A., and Silberschatz, A. “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.” PVLDB 2, no. 1 (2009): 922–33.

3. Abouzeid, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D., and Silberschatz, A. “HadoopDB in Action: Building Real World Applications.” In SIGMOD, 2010.

4. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. Above the Clouds: A Berkeley View of Cloud Computing. Feb. 2009.

5. Tam, E., Ramakrishnan, R., Cooper, B., Silberstein, A., and Sears, R. “Benchmarking Cloud Serving Systems with YCSB.” In ACM SoCC, 2010.

6. Bell, G., Gray, J., and Szalay, A. “Petascale Computational Systems.” IEEE Computer 39, no. 1 (2006): 110–12.

7. Bernstein, P., Cseri, I., Dani, N., Ellis, N., Kalhan, A., Kakivaya, G., Lomet, D., Manne, R., Novik, L., and Talius, T. “Adapting Microsoft SQL Server for Cloud Computing.” In ICDE, pages 1255–1263, 2011.

8. Binnig, C., Kossmann, D., Kraska, T., and Loesing, S. “How is the Weather Tomorrow?: Towards a Benchmark for the Cloud.” In DBTest, 2009.

9. Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. “Building a Database on S3.” In SIGMOD, pages 251–264, 2008.

10. Brewer, E. “Towards Robust Distributed Systems (abstract).” In PODC, page 7, 2000.

11. Bu, Y., Howe, B., Balazinska, M., and Ernst, M. “HaLoop: Efficient Iterative Data Processing on Large Clusters.” PVLDB 3, no. 1 (2010): 285–96.

12. Burrows, M. “The Chubby Lock Service for Loosely-Coupled Distributed Systems.” In OSDI, pages 335–350, 2006.

13. Cary, A., Sun, Z., Hristidis, V., and Rishe, N. “Experiences on Processing Spatial Data with MapReduce.” In SSDBM, pages 302–319, 2009.

14. Chandra, T., Griesemer, R., and Redstone, J. “Paxos Made Live: An Engineering Perspective.” In PODC, pages 398–407, 2007.

15. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. “Bigtable: A Distributed Storage System for Structured Data.” ACM Trans. Comput. Syst. 26, no. 2 (2008).

16. Chen, R., Weng, X., He, B., and Yang, M. “Large Graph Processing in the Cloud.” In SIGMOD, pages 1123–1126, 2010.

17. Cooper, B., Baldeschwieler, E., Fonseca, R., Kistler, J., Narayan, P., Neerdaels, C., Negrin, T., Ramakrishnan, R., Silberstein, A., Srivastava, U., and Stata, R. “Building a Cloud for Yahoo!” IEEE Data Eng. Bull. 32, no. 1 (2009): 36–43.

18. Cooper, B., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. “PNUTS: Yahoo!'s Hosted Data Serving Platform.” PVLDB 1, no. 2 (2008): 1277–88.

19. Das, S., Sismanis, Y., Beyer, K., Gemulla, R., Haas, P., and McPherson, J. “Ricardo: Integrating R and Hadoop.” In SIGMOD, pages 987–998, 2010.

20. Dean, J., and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” In OSDI, pages 137–150, 2004.

21. Dean, J., and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” Commun. ACM 51, no. 1 (2008): 107–13.

22. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. “Dynamo: Amazon's Highly Available Key-Value Store.” In SOSP, pages 205–220, 2007.

23. Deelman, E., Singh, G., Livny, M., Berriman, G., and Good, J. “The Cost of Doing Science on the Cloud: The Montage Example.” In SC, page 50, 2008.

24. Dittrich, J., Quiané-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., and Schad, J. “Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing).” PVLDB 3, no. 1 (2010): 518–29.

25. Foster, I., and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.

26. Friedman, E., Pawlowski, P., and Cieslewicz, J. “SQL/MapReduce: A Practical Approach to Self-Describing, Polymorphic, and Parallelizable User-Defined Functions.” PVLDB 2, no. 2 (2009): 1402–13.

27. Gartner. Gartner top ten disruptive technologies for 2008 to 2012. Emerging trends and technologies roadshow, 2008.

28. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., and Srivastava, U. “Building a High-Level Dataflow System on Top of MapReduce: The Pig Experience.” PVLDB 2, no. 2 (2009): 1414–25.

29. Ghemawat, S., Gobioff, H., and Leung, S. “The Google File System.” In SOSP, pages 29–43, 2003.

30. Gilbert, S., and Lynch, N. “Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.” SIGACT News 33, no. 2 (2002): 51–59.

31. Gonzalez, L., Merino, L., Caceres, J., and Lindner, M. “A Break in the Clouds: Towards a Cloud Definition.” Computer Communication Review 39, no. 1 (2009): 50–5.

32. Hey, T., Tansley, S., and Tolle, K., eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, October 2009.

33. Karger, D., Lehman, E., Leighton, F., Panigrahy, R., Levine, M., and Lewin, D. “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web.” In STOC, pages 654–663, 1997.

34. Kossmann, D., Kraska, T., and Loesing, S. “An Evaluation of Alternative Architectures for Transaction Processing in the Cloud.” In SIGMOD, 2010.

35. Lakshman, A., and Malik, P. “Cassandra: Structured Storage System on a P2P Network.” In PODC, page 5, 2009.

36. Lu, W., Jackson, J., and Barga, R. “AzureBlast: A Case Study of Developing Science Applications on the Cloud.” In HPDC, pages 413–420, 2010.

37. Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., and Czajkowski, G. “Pregel: A System for Large-Scale Graph Processing.” In SIGMOD, pages 135–146, 2010.

38. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., and Koudas, N. “MRShare: Sharing Across Multiple Queries in MapReduce.” PVLDB 3, no. 1 (2010): 494–505.

39. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. “Pig Latin: A Not-So-Foreign Language for Data Processing.” In SIGMOD, pages 1099–1110, 2008.

40. Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D., Madden, S., and Stonebraker, M. “A Comparison of Approaches to Large-Scale Data Analysis.” In SIGMOD, pages 165–178, 2009.

41. Stonebraker, M. “The Case for Shared Nothing.” IEEE Database Eng. Bull. 9, no. 1 (1986): 4–9.

42. Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. “MapReduce and Parallel DBMSs: Friends or Foes?” Commun. ACM 53, no. 1 (2010): 64–71.

43. Alvaro, P., Hellerstein, J., Elmeleegy, K., Condie, T., Conway, N., and Sears, R. “MapReduce Online.” In NSDI, 2010.

44. Tanenbaum, A., and Steen, M., eds. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.

45. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. “Hive – A Warehousing Solution Over a Map-Reduce Framework.” PVLDB 2, no. 2 (2009): 1626–29.

46. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., and Murthy, R. “Hive – A Petabyte Scale Data Warehouse Using Hadoop.” In ICDE, pages 996–1005, 2010.

47. Vogels, W. “Eventually Consistent.” Commun. ACM 52, no. 1 (2009): 40–44.

48. Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., Tian, W., Xu, J., and Li, R. “MapDupReducer: Detecting Near Duplicates Over Massive Datasets.” In SIGMOD, pages 1119–1122, 2010.

49. Xu, Y., Kostamaa, P., and Gao, L. “Integrating Hadoop and Parallel DBMS.” In SIGMOD, pages 969–974, 2010.

50. Yang, H., Dasdan, A., Hsiao, R., and Parker, D. “Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters.” In SIGMOD, pages 1029–1040, 2007.
