A scalable architecture for data-intensive natural language processing


The demand for computational power has grown considerably in recent years, and Natural Language Processing (NLP) is no exception: thousands of documents must be processed, i.e., linguistically analyzed, within a reasonable time frame. These computing needs have brought about a radical change in the computing architectures and large-scale text processing techniques used in NLP. In this paper, we present a scalable architecture for distributed language processing. The architecture uses Storm to combine diverse NLP modules into a processing chain, which carries out the linguistic analysis of documents. Scalability requires designing solutions that can run distributed programs in parallel across large machine clusters. Using the architecture presented here, it is possible to integrate a set of third-party NLP modules into a single processing chain that can be deployed onto a distributed environment, i.e., a cluster of machines, thus allowing the language-processing modules to run in parallel. No restrictions are placed a priori on the NLP modules, apart from being able to consume and produce linguistic annotations following a given format. We show the feasibility of our approach by integrating two linguistic processing chains, for English and Spanish. Moreover, we provide several scripts that build a whole distributed architecture from scratch, which can then be easily installed and deployed onto a cluster of machines. The scripts and the NLP modules used in the paper are publicly available and distributed under free licenses. We also describe a series of experiments carried out in the context of the NewsReader project to test how the system behaves in different scenarios.
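The idea of a processing chain in which every module consumes and produces annotations in a shared format can be illustrated with a minimal Python sketch. This is not the architecture described in the paper and it does not use the Storm API: the dictionary-based document format, the two toy modules, and the names `run_chain` and `process_corpus` are all hypothetical, and a local thread pool merely stands in for the workers of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

# Hypothetical annotation document: raw text plus named annotation layers.
Doc = Dict[str, Any]
# A module is any callable that takes an annotated document and returns
# a new document with additional layers.
Module = Callable[[Doc], Doc]


def tokenizer(doc: Doc) -> Doc:
    """Toy module: adds a 'tokens' layer by splitting the text on whitespace."""
    return {**doc, "tokens": doc["text"].split()}


def token_counter(doc: Doc) -> Doc:
    """Toy module: consumes the 'tokens' layer and adds an 'n_tokens' layer."""
    return {**doc, "n_tokens": len(doc["tokens"])}


def run_chain(doc: Doc, chain: List[Module]) -> Doc:
    """Pass one document through every module of the chain, in order."""
    for module in chain:
        doc = module(doc)
    return doc


def process_corpus(texts: List[str], chain: List[Module], workers: int = 4) -> List[Doc]:
    """Run the whole chain over many documents concurrently.

    Threads are an assumption made for illustration; they stand in for
    the machines of a cluster.
    """
    docs = [{"text": text} for text in texts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input documents.
        return list(pool.map(lambda d: run_chain(d, chain), docs))


if __name__ == "__main__":
    for result in process_corpus(["a b c", "hola mundo"], [tokenizer, token_counter]):
        print(result)
```

Because each module only depends on the layers already present in the document, modules can be swapped or reordered without changing `run_chain`, which is the property that makes third-party components easy to integrate in a design of this kind.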

This work has been partially funded by the NewsReader (FP7-ICT-2011-8-316404) project. Zuhaitz Beloki’s work is funded by a PhD grant from the University of the Basque Country.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering