A scalable architecture for data-intensive natural language processing


The demand for computational power has grown considerably in recent years, and this is also the case in the Natural Language Processing (NLP) area, where thousands of documents must be processed, i.e., linguistically analyzed, in a reasonable time frame. These computing needs have driven a radical change in the computing architectures and large-scale text processing techniques used in NLP. In this paper, we present a scalable architecture for distributed language processing. The architecture uses Storm to combine diverse NLP modules into a processing chain, which carries out the linguistic analysis of documents. Scalability requires designing solutions that are able to run distributed programs in parallel and across large machine clusters. Using the architecture presented here, it is possible to integrate a set of third-party NLP modules into a single processing chain that can be deployed onto a distributed environment, i.e., a cluster of machines, thus allowing the language-processing modules to run in parallel. No restrictions are placed a priori on the NLP modules apart from being able to consume and produce linguistic annotations following a given format. We show the feasibility of our approach by integrating two linguistic processing chains for English and Spanish. Moreover, we provide several scripts that allow building from scratch a whole distributed architecture that can then be easily installed and deployed onto a cluster of machines. The scripts and the NLP modules used in the paper are publicly available and distributed under free licenses. In the paper, we also describe a series of experiments carried out in the context of the NewsReader project with the goal of testing how the system behaves in different scenarios.
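The idea of chaining modules that each consume and produce annotations, and of running that chain over many documents in parallel, can be illustrated with a minimal sketch. It is not the architecture described in the paper: it uses a local Python thread pool instead of Storm, a plain dictionary instead of the paper's annotation format, and two toy modules (`tokenizer`, `capitalized_tagger`) that are hypothetical stand-ins for real NLP components.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# An annotation document: raw text plus named annotation layers.
# This dict layout is a hypothetical stand-in, not the paper's format.
Annotations = Dict[str, object]
Module = Callable[[Annotations], Annotations]


def tokenizer(doc: Annotations) -> Annotations:
    """Hypothetical module: adds a 'tokens' layer built from the raw text."""
    return {**doc, "tokens": str(doc["text"]).split()}


def capitalized_tagger(doc: Annotations) -> Annotations:
    """Hypothetical module: consumes 'tokens', adds a 'capitalized' layer."""
    tokens = list(doc["tokens"])  # layer produced by the previous module
    return {**doc, "capitalized": [t for t in tokens if t[:1].isupper()]}


def run_chain(doc: Annotations, chain: List[Module]) -> Annotations:
    """Pass one document through every module of the chain, in order."""
    for module in chain:
        doc = module(doc)
    return doc


def process_corpus(texts: List[str], chain: List[Module],
                   workers: int = 4) -> List[Annotations]:
    """Run the whole chain over many documents concurrently.

    A thread pool stands in for a cluster here; results keep input order.
    """
    docs = [{"text": t} for t in texts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: run_chain(d, chain), docs))


if __name__ == "__main__":
    chain = [tokenizer, capitalized_tagger]
    corpus = ["Storm runs the chain", "modules run in parallel"]
    for result in process_corpus(corpus, chain):
        print(result["tokens"], result["capitalized"])
```

The only contract between modules in this sketch is the shared annotation structure, which mirrors the abstract's point that modules are otherwise unrestricted; any function with the `Module` signature can be appended to the chain.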

This work has been partially funded by the NewsReader (FP7-ICT-2011-8-316404) project. Zuhaitz Beloki’s work is funded by a PhD grant from the University of the Basque Country.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering