Densification: Semantic document analysis using Wikipedia


This paper proposes a new method for semantic document analysis, densification, which identifies and ranks Wikipedia pages relevant to a given document. Although it resembles established tasks such as wikification and entity linking, densification does not aim for strict disambiguation of named entity mentions. Instead, it uses existing links to rank additional articles that are relevant to the document. This is a form of explicit semantic indexing that enables higher-level semantic retrieval procedures, which can benefit a wide range of NLP applications. Because no gold standard exists for evaluating densification, a study is carried out to investigate the level of agreement achievable by humans; its results call into question the feasibility of creating an annotated data set. A semi-supervised approach is therefore employed to develop a two-stage densification system, which first filters out unlikely candidate links and then ranks the remaining ones. In a first evaluation experiment, Wikipedia articles are used to estimate performance automatically in terms of recall. Results show that the proposed densification approach outperforms several wikification systems. A second experiment measures the impact of integrating the links predicted by the densification system into a semantic question answering (QA) system that relies on Wikipedia links to answer complex questions. Densification enables the QA system to find twice as many additional answers as a state-of-the-art wikification system does.
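The two-stage structure described in the abstract (filter unlikely candidates, then rank the rest) and the recall-based evaluation can be illustrated with a minimal sketch. It is not the system proposed in the paper: the abstract does not specify the features or the learning method, so the sketch substitutes a simple hypothetical score, namely the number of already linked pages that point to a candidate. The names `densify`, `recall`, `min_support`, `top_k` and the toy link graph are all assumptions made for illustration.

```python
from collections import Counter


def densify(seed_links, link_graph, min_support=2, top_k=10):
    """Toy filter-then-rank pipeline (illustrative, not the paper's system).

    seed_links: titles of pages the document already links to.
    link_graph: hypothetical mapping from a page title to the titles it links to.
    """
    seeds = set(seed_links)

    # Candidate generation: pages linked from any seed page, excluding the
    # seeds themselves. "Support" counts how many seeds link to a candidate.
    support = Counter()
    for seed in seeds:
        for target in set(link_graph.get(seed, ())):
            if target not in seeds:
                support[target] += 1

    # Stage 1 (filter): discard candidates with too little support.
    kept = {page: n for page, n in support.items() if n >= min_support}

    # Stage 2 (rank): highest support first, title as a deterministic tie-break.
    ranked = sorted(kept, key=lambda page: (-kept[page], page))
    return ranked[:top_k]


def recall(predicted, gold):
    """Fraction of gold links that appear among the predicted links."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)


# Toy data, invented for this example only.
link_graph = {
    "Paris": ["France", "Seine", "Eiffel Tower"],
    "Louvre": ["Paris", "France", "Seine"],
    "Notre-Dame de Paris": ["Paris", "Seine", "Gothic architecture"],
}
seeds = {"Paris", "Louvre", "Notre-Dame de Paris"}
held_out = ["Seine", "France", "Eiffel Tower"]

predicted = densify(seeds, link_graph)
score = recall(predicted, held_out)
```

In this toy setting the held-out links play the role of links withheld from a Wikipedia article, which mirrors the idea of estimating recall automatically from existing articles; a real system would replace the support count with learned filtering and ranking models.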

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering