Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus

  • M. VILA (a1), H. RODRÍGUEZ (a2) and M. A. MARTÍ (a1)

Paraphrase corpora are an essential but scarce resource in Natural Language Processing. In this paper, we present the Wikipedia-based Relational Paraphrase Acquisition (WRPA) method, which extracts relational paraphrases from Wikipedia, and the derived WRPA paraphrase corpus. The WRPA corpus currently covers person-related and authorship relations in English and Spanish, respectively, suggesting that, given adequate Wikipedia coverage, our method is independent of the language and the relation addressed. WRPA extracts entity pairs from structured information in Wikipedia applying distant learning and, based on the distributional hypothesis, uses them as anchor points for candidate paraphrase extraction from the free text in the body of Wikipedia articles. Focussing on relational paraphrasing and taking advantage of Wikipedia-structured information allows for an automatic and consistent evaluation of the results. The WRPA corpus characteristics distinguish it from other types of corpora that rely on string similarity or transformation operations. WRPA relies on distributional similarity and is the result of the free use of language outside any reformulation framework. Validation results show a high precision for the corpus.
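
The anchor-based extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `candidate_paraphrases`, the plain substring matching, and the `[X]`/`[Y]` slot notation are all assumptions made for illustration, and the entity pair and sentences are invented sample data. The sketch assumes entity pairs have already been read from structured Wikipedia information, and only shows how such pairs can anchor candidate snippets in free text.

```python
from typing import Iterable, List, Tuple


def candidate_paraphrases(
    entity_pairs: Iterable[Tuple[str, str]],
    sentences: Iterable[str],
) -> List[str]:
    """Collect snippets that mention both entities of a pair.

    Hypothetical helper: each snippet spans from the first anchor to the
    end of the second, with the anchors replaced by [X] and [Y] slots.
    """
    sentence_list = list(sentences)
    candidates = []
    for x, y in entity_pairs:
        for sentence in sentence_list:
            # Keep only sentences in which both anchors occur.
            if x in sentence and y in sentence:
                x_pos, y_pos = sentence.index(x), sentence.index(y)
                start = min(x_pos, y_pos)
                end = max(x_pos + len(x), y_pos + len(y))
                snippet = sentence[start:end]
                candidates.append(snippet.replace(x, "[X]").replace(y, "[Y]"))
    return candidates


# Invented sample data: one (person, birthplace) pair and three sentences.
pairs = [("Jane Austen", "Steventon")]
text = [
    "Jane Austen was born in Steventon in 1775.",
    "Steventon is the birthplace of Jane Austen.",
    "Jane Austen wrote six novels.",
]
print(candidate_paraphrases(pairs, text))
# ['[X] was born in [Y]', '[Y] is the birthplace of [X]']
```

The two snippets share the same anchors and differ only in wording, which is what makes them candidates for expressing the same relation. A real system would need more than substring matching, for example handling of name variants and filtering of snippets that mention both entities without expressing the target relation.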

This work was supported by the MINECO projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01), as well as a MECD FPU grant (AP2008-02185). We are also grateful to Esther Arias, Santiago González, Rita Zaragoza and Oriol Borrega, the linguists who worked on the annotation processes.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering