Extraction of templates from phrases using Sequence Binary Decision Diagrams

D. HIRANO; K. TANAKA-ISHII; A. FINCH

doi:10.1017/S1351324918000268

Extraction of templates from phrases using Sequence Binary Decision Diagrams

Published online by Cambridge University Press: 19 July 2018

D. HIRANO ,

K. TANAKA-ISHII and

A. FINCH

Show author details

D. HIRANO: Affiliation:
Graduate School and Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan e-mail: hirano@limu.ait.kyushu-u.ac.jp
K. TANAKA-ISHII: Affiliation:
Research Center for Advanced Science and Technology, University of Tokyo, Tokyo, Japan e-mail: kumiko@cl.rcast.u-tokyo.ac.jp
A. FINCH: Affiliation:
National Institute of Information and Communications Technology Kyoto, Kyoto, Japan e-mail: andrew.finch@gmail.com

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

The extraction of templates such as ‘regard X as Y’ from a set of related phrases requires the identification of their internal structures. This paper presents an unsupervised approach for extracting templates on-the-fly from only tagged text by using a novel relaxed variant of the Sequence Binary Decision Diagram (SeqBDD). A SeqBDD can compress a set of sequences into a graphical structure equivalent to a minimal deterministic finite state automata, but more compact and better suited to the task of template extraction. The main contribution of this paper is a relaxed form of the SeqBDD construction algorithm that enables it to form general representations from a small amount of data. The process of compression of shared structures in the text during Relaxed SeqBDD construction, naturally induces the templates we wish to extract. Experiments show that the method is capable of high-quality extraction on tasks based on verb+preposition templates from corpora and phrasal templates from short messages from social media.

Type: Article
Information: Natural Language Engineering , Volume 24 , Issue 5 , September 2018 , pp. 763 - 795

DOI: https://doi.org/10.1017/S1351324918000268 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2018

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Abney, S., and Abney, S. P. 1991. Parsing by chunks. In Principle-Based Parsing, pp. 257–278. Netherlands: Kluwer Academic Publishers.Google Scholar

Baldwin, T., Cook, P., Lui, M., Mackinlay, A., and Wang, L. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP-2013). Mark: John Blitzer.Google Scholar

Baldwin, T. and Kim, S. N. 2010. Multiword expressions. In Indurkhya, N., and Damerau, F. J. (eds.), Handbook of Natural Language Processing, 3rd ed. Boca Raton, FL: CRC Press, Taylor and Francis Group.Google Scholar

Bryant, R. E., 1986. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers 35 (8): 677–691.Google Scholar

Cormen, T. H., Leiserson, C. E., and Rivest, R. L., 1990. Introduction to Algorithms. USA: MIT Press.Google Scholar

Cui, H., Kan, M.-Y., and Chua, T.-S. 2004. Unsupervised learning of soft patterns for generating definitions from online news. In Proceedings of the 13th International Conference on World Wide Web (WWW-2004), 90–99.Google Scholar

Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E., 2000. Incremental construction of minimal acyclic finite state automata. Computational Linguistics: Special Issue on Finite-State Methods in NLP 26 : 3–16.Google Scholar

de Medeiros Caseli, H., Ramisch, C., das Graças Volpe Nunes, M., and Villavicencio, A., 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation: Special Issue on Multiword Expression: Hard Going or Plain Sailing 44 (1–2): 59–77.Google Scholar

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6): 391–407.Google Scholar

Denzumi, S., Yoshinaka, R., Arimura, H., and Minato, S.-I. 2016. Sequence binary decision diagram: Minimization, relationship to acyclic automata, and complexities of boolean set operations. Discrete Applied Mathematics 212 (C): 61–80. Elsevier Sicence Publishers B. V. Amsterdam, The Netherlands, The Netherlands.Google Scholar

Denzumi, S., Yoshinaka, R., Minato, S.-I., and Arimura, H. 2011. Efficient algorithms on sequence binary decision diagrams for manipulating sets of strings. Technical Report, DCS, Hokkaido U., TCS-TR-A-11-53.Google Scholar

Duan, J., Lu, R., Wu, W., Hu, Y., and Tian, Y. 2006. A bio-inspired approach for multi-word expression extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (COLING-ACL-2006). Association for Computational Linguistics, Stroudsburg, PA, USA, 176–82.Google Scholar

Fazly, A. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), 337–344.Google Scholar

Fernau, H., 2009. Algorithms for learning regular expressions from positive data. Journal of Information and Computation 207 : 521–541.Google Scholar

Francis, G., Hunston, S., and Manning, E., 1996. Collins COBUILD Grammar Patterns 1: Verbs. London: Harper Collins.Google Scholar

Francis, G., Hunston, S., and Manning, E., 1998. Collins COBUILD Grammar Patterns 2: Nouns and Adjectives. London: HarperCollins.Google Scholar

Gimpel, K. and Smith, N. A. 2012. Concavity and initialization for unsupervised dependency parsing. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 577–581.Google Scholar

Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-2000). ACM, 1–12.Google Scholar

Headden, W. P. III, Johnson, M., and McClosky, D. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2009). Association for Computational Linguistics, 101–109.Google Scholar

Huang, Z., Xu, W., and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991.Google Scholar

Hunston, S., and Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Studies in Corpus Linguistics 4. John Benjamins Publishing Company.Google Scholar

Kim, C. 2010. Text: Automatic template extraction from heterogeneous web pages. IEEE Computer Society, 23 (4): 612–626.Google Scholar

Kita, K., Kato, Y., Omoto, T., and Yano, Y. 1994. A comparative study of automatic extraction of collocations from corpora: mutual information vs. cost criteria. Journal of Natural Language Processing 1, 1: 21–33.Google Scholar

Klein, D. and Manning, C. D. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics.Google Scholar

Knuth, D. 2009. The Art of Computer Programming, vol. 4, Fascicle 1. Addison-Wesley.Google Scholar

Loekito, E., Bailey, J., and Pei, J., 2010. A binary decision diagram based approach for mining frequent subsequences. Knowledge and Information Systems 24 (2): 235–268.Google Scholar

Macqueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 281–297.Google Scholar

Manning, C. D., Raghavan, P., and Schütze, H., 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.Google Scholar

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Association for Computational Linguistics System Demonstrations, 55–60.Google Scholar

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B., 1993. Building a large annotated corpus of english: The penn treebank. Computational Linguistics 19 (2): 313–330.Google Scholar

Martens, S. 2010. Varro: an algorithm and toolkit for regular structure discovery in treebanks. In Proceedings of the Coling 2010: Posters. Coling 2010 Organizing Committee, Beijing, China, 810–8.Google Scholar

István Nagy, T., and Vincze, V. 2014. Vpctagger: Detecting verb-particle constructions with syntax-based methods. In Proceedings of the 10th Workshop on Multiword Expressions (MWE). Association for Computational Linguistics, Gothenburg, Sweden, 17–25.Google Scholar

Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., and Hsu, M. 2001. Prefixspan: mining sequential patterns by prefix-projected growth. In Proceedings of the 17th International Conference on Data Engineering. IEEE Computer Society, 215–224.Google Scholar

Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. 2002. Multiword expressions: a pain in the neck for NLP. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 1–15.Google Scholar

Sangati, F., and van Cranenburgh, A. 2015. Multiword expression identification with recurring tree fragments and association measures. In Proceedings of the MWE@ NAACL-HLT, 10–18.Google Scholar

Smadja, F., 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19 (1): 143–177.Google Scholar

Tu, Y., and Roth, D. 2012. Sorting out the most confusing english phrasal verbs. In Proceedings of the 1st Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval-2012). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 65–9.Google Scholar

Zarrieß, S. and Kuhn, J. 2009. Exploiting translational correspondences for pattern-independent MWE identification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE-2009). Association for Computational Linguistics, Stroudsburg, PA, USA, 23–30.Google Scholar

Article contents

Extraction of templates from phrases using Sequence Binary Decision Diagrams

Abstract

Access options

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests