Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus

  • M. VILA (a1), H. RODRÍGUEZ (a2) and M. A. MARTÍ (a1)
Abstract

Paraphrase corpora are an essential but scarce resource in Natural Language Processing. In this paper, we present the Wikipedia-based Relational Paraphrase Acquisition (WRPA) method, which extracts relational paraphrases from Wikipedia, and the derived WRPA paraphrase corpus. The WRPA corpus currently covers person-related and authorship relations in English and Spanish, respectively, suggesting that, given adequate Wikipedia coverage, our method is independent of the language and the relation addressed. WRPA extracts entity pairs from structured information in Wikipedia applying distant learning and, based on the distributional hypothesis, uses them as anchor points for candidate paraphrase extraction from the free text in the body of Wikipedia articles. Focussing on relational paraphrasing and taking advantage of Wikipedia's structured information allows for an automatic and consistent evaluation of the results. The characteristics of the WRPA corpus distinguish it from other types of corpora that rely on string similarity or transformation operations: WRPA relies on distributional similarity and is the result of the free use of language outside any reformulation framework. Validation results show a high precision for the corpus.
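
The anchoring idea in the abstract can be illustrated with a minimal sketch. It is not the WRPA implementation: the function name `extract_candidates`, the `Candidate` type, the `[X]`/`[Y]` placeholders and the toy data are all assumptions made for illustration, and plain substring matching stands in for whatever matching the actual system performs. Entity pairs that would come from structured information are used to find sentences mentioning both entities, and each such sentence becomes a candidate paraphrase pattern for the relation.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A candidate paraphrase pattern anchored on one entity pair (hypothetical type)."""
    source: str   # first entity of the pair, e.g. an author
    target: str   # second entity of the pair, e.g. a work
    pattern: str  # sentence with both entities replaced by placeholders


def extract_candidates(
    pairs: Iterable[Tuple[str, str]],
    sentences: Iterable[str],
) -> List[Candidate]:
    """Keep sentences that mention both entities of a pair.

    Assumption: exact substring matching is enough for this toy example.
    """
    sentences = list(sentences)
    candidates = []
    for x, y in pairs:
        for sentence in sentences:
            if x in sentence and y in sentence:
                # Abstract away the anchors so that patterns from
                # different entity pairs become comparable.
                pattern = sentence.replace(x, "[X]").replace(y, "[Y]")
                candidates.append(Candidate(x, y, pattern))
    return candidates


# Toy data (illustrative only): one authorship pair and three sentences.
pairs = [("Jane Austen", "Emma")]
sentences = [
    "Jane Austen is the author of Emma.",
    "Emma was written by Jane Austen.",
    "Jane Austen was born in Steventon.",
]
candidates = extract_candidates(pairs, sentences)
for candidate in candidates:
    print(candidate.pattern)
```

In this sketch the third sentence is discarded because it mentions only one of the two anchors; the two remaining patterns express the same authorship relation with different wording, which is the kind of pair a relational paraphrase corpus collects.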

Footnotes

This work was supported by the MINECO projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01), as well as a MECD FPU grant (AP2008-02185). We are also grateful to Esther Arias, Santiago González, Rita Zaragoza and Oriol Borrega, the linguists who worked on the annotation processes.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering