
DEXTER: A workbench for automatic term extraction with specialized corpora


Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, some issues remain outstanding in the construction of term extractors, particularly those intended to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three features help to bring the most important terms to the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded in the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
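To make the second point more concrete, the following minimal Python sketch compares word frequencies in a domain-specific corpus and a general corpus. It is an illustration only and not DEXTER's actual method, which also draws on the IATE database; the function names, the add-one smoothing and the threshold value are all assumptions introduced here.

```python
from collections import Counter


def frequency_ratio(domain_tokens, general_tokens):
    """Relative frequency in the domain corpus divided by the
    add-one-smoothed relative frequency in the general corpus.
    (Hypothetical helper; smoothing choice is an assumption.)"""
    domain = Counter(domain_tokens)
    general = Counter(general_tokens)
    domain_total = sum(domain.values())
    general_total = sum(general.values())
    vocabulary_size = len(set(domain) | set(general))

    ratios = {}
    for word, count in domain.items():
        domain_rel = count / domain_total
        # Add-one smoothing avoids division by zero for unseen words.
        general_rel = (general[word] + 1) / (general_total + vocabulary_size)
        ratios[word] = domain_rel / general_rel
    return ratios


def likely_stopwords(domain_tokens, general_tokens, threshold):
    """Words whose ratio does not exceed the (illustrative) threshold,
    i.e. words about as common in general language as in the domain."""
    ratios = frequency_ratio(domain_tokens, general_tokens)
    return {word for word, ratio in ratios.items() if ratio <= threshold}
```

Words that are roughly as frequent in the general corpus as in the domain corpus receive a low ratio and can be treated as stopword candidates, whereas domain-specific words receive a high ratio. A suitable threshold depends on corpus size and would need to be tuned empirically.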


Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering