Skip to main content

Extraction of templates from phrases using Sequence Binary Decision Diagrams

  • D. HIRANO (a1), K. TANAKA-ISHII (a2) and A. FINCH (a3)

The extraction of templates such as ‘regard X as Y’ from a set of related phrases requires the identification of their internal structures. This paper presents an unsupervised approach for extracting templates on-the-fly from only tagged text by using a novel relaxed variant of the Sequence Binary Decision Diagram (SeqBDD). A SeqBDD can compress a set of sequences into a graphical structure equivalent to a minimal deterministic finite state automata, but more compact and better suited to the task of template extraction. The main contribution of this paper is a relaxed form of the SeqBDD construction algorithm that enables it to form general representations from a small amount of data. The process of compression of shared structures in the text during Relaxed SeqBDD construction, naturally induces the templates we wish to extract. Experiments show that the method is capable of high-quality extraction on tasks based on verb+preposition templates from corpora and phrasal templates from short messages from social media.

Hide All
Abney, S., and Abney, S. P. 1991. Parsing by chunks. In Principle-Based Parsing, pp. 257278. Netherlands: Kluwer Academic Publishers.
Baldwin, T., Cook, P., Lui, M., Mackinlay, A., and Wang, L. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP-2013). Mark: John Blitzer.
Baldwin, T. and Kim, S. N. 2010. Multiword expressions. In Indurkhya, N., and Damerau, F. J. (eds.), Handbook of Natural Language Processing, 3rd ed. Boca Raton, FL: CRC Press, Taylor and Francis Group.
Bryant, R. E., 1986. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers 35 (8): 677691.
Cormen, T. H., Leiserson, C. E., and Rivest, R. L., 1990. Introduction to Algorithms. USA: MIT Press.
Cui, H., Kan, M.-Y., and Chua, T.-S. 2004. Unsupervised learning of soft patterns for generating definitions from online news. In Proceedings of the 13th International Conference on World Wide Web (WWW-2004), 90–99.
Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E., 2000. Incremental construction of minimal acyclic finite state automata. Computational Linguistics: Special Issue on Finite-State Methods in NLP 26 : 316.
de Medeiros Caseli, H., Ramisch, C., das Graças Volpe Nunes, M., and Villavicencio, A., 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation: Special Issue on Multiword Expression: Hard Going or Plain Sailing 44 (1–2): 5977.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6): 391407.
Denzumi, S., Yoshinaka, R., Arimura, H., and Minato, S.-I. 2016. Sequence binary decision diagram: Minimization, relationship to acyclic automata, and complexities of boolean set operations. Discrete Applied Mathematics 212 (C): 6180. Elsevier Sicence Publishers B. V. Amsterdam, The Netherlands, The Netherlands.
Denzumi, S., Yoshinaka, R., Minato, S.-I., and Arimura, H. 2011. Efficient algorithms on sequence binary decision diagrams for manipulating sets of strings. Technical Report, DCS, Hokkaido U., TCS-TR-A-11-53.
Duan, J., Lu, R., Wu, W., Hu, Y., and Tian, Y. 2006. A bio-inspired approach for multi-word expression extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (COLING-ACL-2006). Association for Computational Linguistics, Stroudsburg, PA, USA, 176–82.
Fazly, A. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), 337–344.
Fernau, H., 2009. Algorithms for learning regular expressions from positive data. Journal of Information and Computation 207 : 521541.
Francis, G., Hunston, S., and Manning, E., 1996. Collins COBUILD Grammar Patterns 1: Verbs. London: Harper Collins.
Francis, G., Hunston, S., and Manning, E., 1998. Collins COBUILD Grammar Patterns 2: Nouns and Adjectives. London: HarperCollins.
Gimpel, K. and Smith, N. A. 2012. Concavity and initialization for unsupervised dependency parsing. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 577–581.
Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-2000). ACM, 1–12.
Headden, W. P. III, Johnson, M., and McClosky, D. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2009). Association for Computational Linguistics, 101–109.
Huang, Z., Xu, W., and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991.
Hunston, S., and Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Studies in Corpus Linguistics 4. John Benjamins Publishing Company.
Kim, C. 2010. Text: Automatic template extraction from heterogeneous web pages. IEEE Computer Society, 23 (4): 612626.
Kita, K., Kato, Y., Omoto, T., and Yano, Y. 1994. A comparative study of automatic extraction of collocations from corpora: mutual information vs. cost criteria. Journal of Natural Language Processing 1, 1: 2133.
Klein, D. and Manning, C. D. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics.
Knuth, D. 2009. The Art of Computer Programming, vol. 4, Fascicle 1. Addison-Wesley.
Loekito, E., Bailey, J., and Pei, J., 2010. A binary decision diagram based approach for mining frequent subsequences. Knowledge and Information Systems 24 (2): 235268.
Macqueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 281–297.
Manning, C. D., Raghavan, P., and Schütze, H., 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Association for Computational Linguistics System Demonstrations, 55–60.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B., 1993. Building a large annotated corpus of english: The penn treebank. Computational Linguistics 19 (2): 313330.
Martens, S. 2010. Varro: an algorithm and toolkit for regular structure discovery in treebanks. In Proceedings of the Coling 2010: Posters. Coling 2010 Organizing Committee, Beijing, China, 810–8.
István Nagy, T., and Vincze, V. 2014. Vpctagger: Detecting verb-particle constructions with syntax-based methods. In Proceedings of the 10th Workshop on Multiword Expressions (MWE). Association for Computational Linguistics, Gothenburg, Sweden, 17–25.
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., and Hsu, M. 2001. Prefixspan: mining sequential patterns by prefix-projected growth. In Proceedings of the 17th International Conference on Data Engineering. IEEE Computer Society, 215–224.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. 2002. Multiword expressions: a pain in the neck for NLP. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 1–15.
Sangati, F., and van Cranenburgh, A. 2015. Multiword expression identification with recurring tree fragments and association measures. In Proceedings of the MWE@ NAACL-HLT, 10–18.
Smadja, F., 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19 (1): 143177.
Tu, Y., and Roth, D. 2012. Sorting out the most confusing english phrasal verbs. In Proceedings of the 1st Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval-2012). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 65–9.
Zarrieß, S. and Kuhn, J. 2009. Exploiting translational correspondences for pattern-independent MWE identification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE-2009). Association for Computational Linguistics, Stroudsburg, PA, USA, 23–30.
Recommend this journal

Email your librarian or administrator to recommend adding this journal to your organisation's collection.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering
Please enter your name
Please enter a valid email address
Who would you like to send this to? *


Altmetric attention score

Full text views

Total number of HTML views: 0
Total number of PDF views: 0 *
Loading metrics...

Abstract views

Total abstract views: 0 *
Loading metrics...

* Views captured on Cambridge Core between <date>. This data will be updated every 24 hours.

Usage data cannot currently be displayed