A fast and flexible architecture for very large word n-gram datasets

  • MICHAEL FLOR
  • DOI: http://dx.doi.org/10.1017/S1351324911000349
  • Published online: 10 January 2012
Abstract

This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
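As a rough illustration of the general idea of a word-level trie that answers both literal and wildcard n-gram queries from one structure, a minimal sketch follows. It is not TrendStream's implementation: the class `NgramTrie`, its methods, and the `*` wildcard marker are hypothetical names chosen for this example, and the sketch omits the compression, binary file format, regular-expression matching, and memory-mapping that the abstract describes.

```python
from typing import Dict, Iterator, List, Tuple

WILDCARD = "*"  # hypothetical marker: matches exactly one word


class NgramTrie:
    """Toy in-memory trie keyed by words; each node holds an n-gram count."""

    def __init__(self) -> None:
        self.children: Dict[str, "NgramTrie"] = {}
        self.count: int = 0

    def add(self, words: List[str], count: int = 1) -> None:
        """Insert an n-gram, or add to its count if it already exists."""
        node = self
        for word in words:
            node = node.children.setdefault(word, NgramTrie())
        node.count += count

    def lookup(self, words: List[str]) -> int:
        """Literal query: return the stored count, or 0 if absent."""
        node = self
        for word in words:
            nxt = node.children.get(word)
            if nxt is None:
                return 0
            node = nxt
        return node.count

    def match(self, pattern: List[str]) -> Iterator[Tuple[Tuple[str, ...], int]]:
        """Pattern query on the same structure, with no extra index."""
        yield from self._match(pattern, 0, ())

    def _match(
        self, pattern: List[str], i: int, prefix: Tuple[str, ...]
    ) -> Iterator[Tuple[Tuple[str, ...], int]]:
        if i == len(pattern):
            if self.count > 0:
                yield prefix, self.count
            return
        token = pattern[i]
        if token == WILDCARD:
            # Branch into every child, in sorted order for stable output.
            for word in sorted(self.children):
                yield from self.children[word]._match(pattern, i + 1, prefix + (word,))
        elif token in self.children:
            yield from self.children[token]._match(pattern, i + 1, prefix + (token,))


trie = NgramTrie()
trie.add(["very", "large", "dataset"], 3)
trie.add(["very", "large", "model"], 5)
trie.add(["very", "small", "model"], 2)
trie.add(["very", "large", "model"], 1)  # incremental update of an existing entry

print(trie.lookup(["very", "large", "model"]))
print(list(trie.match(["very", WILDCARD, "model"])))
```

In this sketch a wildcard simply fans out over all children at one trie level, which is why no separate index is needed; a production system would presumably replace the per-node dictionaries with a compact on-disk layout.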

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering