
TwitterNEED: A hybrid approach for named entity extraction and disambiguation for tweets*

  • MENA B. HABIB and MAURICE VAN KEULEN
Abstract

Twitter is a rich source of continuously and instantly updated information. The shortness and informality of tweets pose challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. Five data sets are used to evaluate the extraction approach, and three of them are also used to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
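
The recall-first extraction idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the miniature knowledge base `TOY_KB`, the function names, the three features, and the hand-written `keep` rule are all assumptions made for illustration. In particular, the `keep` rule is a stand-in for the Support Vector Machine filter, and the count of knowledge base candidates is only a rough proxy for the disambiguation-derived features the abstract mentions.

```python
from typing import Dict, List

# Hypothetical miniature knowledge base: lowercase surface form -> candidate entities.
# A real system would query YAGO or a similar resource instead.
TOY_KB: Dict[str, List[str]] = {
    "paris": ["Paris_(France)", "Paris_Hilton"],
    "new york": ["New_York_City"],
    "apple": ["Apple_Inc.", "Apple_(fruit)"],
}

PUNCT = ".,!?#@"


def generate_candidates(tweet: str, max_len: int = 3) -> List[str]:
    """Recall-first step: emit every n-gram of up to max_len tokens as a candidate."""
    tokens = tweet.split()
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                candidates.append(" ".join(tokens[start:start + length]))
    return candidates


def features(mention: str) -> Dict[str, float]:
    """Toy word-shape and knowledge-base features for one candidate mention."""
    stripped = mention.strip(PUNCT)
    kb_entries = TOY_KB.get(stripped.lower(), [])
    return {
        "capitalized": 1.0 if stripped[:1].isupper() else 0.0,
        "in_kb": 1.0 if kb_entries else 0.0,
        # Rough proxy for a disambiguation-derived feature (assumption).
        "num_kb_candidates": float(len(kb_entries)),
    }


def keep(mention: str) -> bool:
    """Hand-written rule standing in for the trained SVM filter (assumption)."""
    f = features(mention)
    return f["in_kb"] == 1.0 and f["capitalized"] == 1.0


def extract(tweet: str) -> List[str]:
    """Generate candidates generously, then filter out likely false positives."""
    return [m.strip(PUNCT) for m in generate_candidates(tweet) if keep(m)]


if __name__ == "__main__":
    print(extract("Flying to New York then Paris!"))
```

The point of the design is the ordering: candidate generation is deliberately over-inclusive, and precision is recovered afterwards by a filter that can see knowledge base evidence for each candidate.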

Footnotes

The authors would like to thank Zhemin Zhu for sharing his CRF model (Zhu et al. 2013) and assisting us in applying it. This work is supported by the Dutch national research program COMMIT.


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110