Property-Based Testing for Spark Streaming

Published online by Cambridge University Press:  19 February 2019

A. RIESCO*
Affiliation:
Universidad Complutense de Madrid, Madrid 28040, Spain (e-mail: ariesco@fdi.ucm.es)
J. RODRÍGUEZ-HORTALÁ
Affiliation:
(e-mail: juan.rodriguez.hortala@gmail.com)

Abstract

Stream processing has reached the mainstream in recent years, as a new generation of open-source distributed stream processing systems, designed to scale horizontally on commodity hardware, has brought the capability to process high-volume, high-velocity data streams to companies of all sizes. In this work, we propose combining temporal logic with property-based testing (PBT) to address the challenges of testing programs that employ this programming model. We formalize our approach in a discrete-time temporal logic for finite words, extended to improve the expressiveness of properties with timeouts for temporal operators and a binding operator for letters. In particular, we focus on testing Spark Streaming programs written with the Spark API for the functional language Scala, using the PBT library ScalaCheck. To that end, we add temporal logic operators to a set of new ScalaCheck generators and properties, as part of our testing library sscheck.
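
To make the idea of temporal operators with timeouts over finite words more concrete, the following is a minimal sketch in plain Scala. It is not the sscheck API and does not reproduce the logic formalized in the article: the names `Formula`, `Now`, `Always`, `Eventually` and `TimedLogic` are hypothetical, the binding operator and the ScalaCheck integration are left out, and the choice to report an undecided formula as `None` is an assumption of this sketch.

```scala
// Illustrative sketch only; every name here is hypothetical, not part of sscheck.
sealed trait Formula[A]
// Holds when the predicate is true for the current letter of the word.
final case class Now[A](p: A => Boolean) extends Formula[A]
// phi must hold at each of the next `timeout` positions (current one included).
final case class Always[A](phi: Formula[A], timeout: Int) extends Formula[A]
// phi must hold at some position among the next `timeout` positions.
final case class Eventually[A](phi: Formula[A], timeout: Int) extends Formula[A]

object TimedLogic {
  // Some(true) / Some(false) when the finite word decides the formula,
  // None when the word ends before the formula can be decided.
  def eval[A](f: Formula[A], word: List[A]): Option[Boolean] = f match {
    case Now(p) =>
      word.headOption.map(p)
    case Always(phi, t) =>
      val results = window(word, t).map(eval(phi, _))
      if (results.contains(Some(false))) Some(false)
      else if (results.size == t && results.forall(_ == Some(true))) Some(true)
      else None
    case Eventually(phi, t) =>
      val results = window(word, t).map(eval(phi, _))
      if (results.contains(Some(true))) Some(true)
      else if (results.size == t && results.forall(_ == Some(false))) Some(false)
      else None
  }

  // The first `t` non-empty suffixes of the word, one per time step.
  private def window[A](word: List[A], t: Int): List[List[A]] =
    word.tails.filter(_.nonEmpty).take(t).toList
}
```

Under these assumptions, `TimedLogic.eval(Always(Now[Int](_ > 0), 3), List(1, 2, 3, 4))` yields `Some(true)`, while `TimedLogic.eval(Always(Now[Int](_ > 0), 10), List(1, 2, 3, 4))` yields `None`, because the four-letter word ends before the timeout of 10 steps is reached. In a property-based setting, the words would be produced by random generators rather than written by hand.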

Type
Original Article
Copyright
© Cambridge University Press 2019 

Supplementary material: Riesco and Rodríguez-Hortalá supplementary material 1 (PDF, 289.2 KB)