
TwitterNEED: A hybrid approach for named entity extraction and disambiguation for tweet*

  • MENA B. HABIB and MAURICE VAN KEULEN
Abstract

Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
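The extract-then-filter idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy knowledge base stands in for YAGO and the Google search results, the three features are simplified examples of word shape, Knowledge Base, and disambiguation-derived features, and a hand-weighted linear score stands in for the trained Support Vector Machine. It is not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical toy knowledge base: surface form -> candidate entity ids.
# (The paper uses YAGO plus Google search results; this dict is a stand-in.)
TOY_KB: Dict[str, List[str]] = {
    "paris": ["Paris_(France)", "Paris_Hilton"],
    "obama": ["Barack_Obama"],
    "new york": ["New_York_City", "New_York_(state)"],
}


def generate_candidates(tweet: str, max_len: int = 3) -> List[str]:
    """High-recall step: every word n-gram up to max_len is a candidate mention."""
    tokens = tweet.split()
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


def features(mention: str) -> Dict[str, float]:
    """Word-shape and KB features, plus one clue derived from disambiguation."""
    entities = TOY_KB.get(mention.lower().strip("#@.,!?"), [])
    return {
        "capitalized": float(mention[:1].isupper()),
        "in_kb": float(bool(entities)),
        # Fewer candidate entities means the mention is less ambiguous.
        "disambiguation_confidence": 1.0 / len(entities) if entities else 0.0,
    }


def keep(mention: str, threshold: float = 1.0) -> bool:
    """Filter step: hand-set weights stand in for a trained SVM."""
    f = features(mention)
    score = (0.5 * f["capitalized"]
             + 1.0 * f["in_kb"]
             + 0.5 * f["disambiguation_confidence"])
    return score > threshold


def extract(tweet: str) -> List[str]:
    """Generate candidates generously, then filter out likely false positives."""
    return [m for m in generate_candidates(tweet) if keep(m)]
```

Calling `extract("Obama visits Paris today")` generates nine candidate n-grams and keeps only `"Obama"` and `"Paris"`. The point of the design is that the filter sees evidence from the disambiguation side (here, how many entities a mention maps to) before the extraction decision is final.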

Footnotes
* The authors would like to thank Zhemin Zhu for sharing his CRF model (Zhu et al. 2013) and assisting us in applying it. This work is supported by the Dutch national research program COMMIT.

References
Abeel T., Van de Peer Y., and Saeys Y. 2009. Java-ml: a machine learning library. Journal of Machine Learning Research 10 : 931–4.
Basave A. E. C., Varga A., Rowe M., Stankovic M., and Dadzie A.-S. 2013. Making sense of microposts (#msm2013) concept extraction challenge. In Making Sense of Microposts (#MSM2013) Concept Extraction Challenge, Rio de Janeiro, Brazil, pp. 1–15.
Bontcheva K., Derczynski L., Funk A., Greenwood M., Maynard D., and Aswani N. 2013. Twitie: An open-source information extraction pipeline for microblog text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, Hissar, Bulgaria, pp. 83–90.
Bunescu R. C. and Pasca M. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL, Trento, Italy, pp. 9–16.
Cano Basave A. E., Rizzo G., Varga A., Rowe M., Stankovic M., and Dadzie A.-S. 2014. Making sense of microposts (#microposts2014) named entity extraction & linking challenge. In Proceedings of the 4th Workshop on Making Sense of Microposts (#Microposts2014), Seoul, South Korea, pp. 54–60.
Castillo C., Mendoza M., and Poblete B. 2011. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India. ACM, pp. 675–84.
Chang C.-C. and Lin C.-J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (3): 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Christoforaki M., Erunse I., and Yu C. 2011. Searching social updates for topic-centric entities. In Proceedings of the 1st International Workshop on Searching and Integrating New Web Data Sources – Very Large Data Search (VLDS), Seattle, WA, USA, pp. 34–9.
Cucerzan S. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 708–16.
Cunningham H., Maynard D., Bontcheva K., and Tablan V. 2002. GATE: a framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, Pennsylvania, USA, pp. 168–75.
Dann S. 2010. Twitter content classification. First Monday 15 (12), http://firstmonday.org/ojs/index.php/fm/article/viewArticle/2745/2681.
Davis A., Veloso A., da Silva A. S., Meira W. Jr, and Laender A. H. F. 2012. Named entity disambiguation in streaming data. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers – Volume 1, ACL ’12, Jeju Island, Korea, pp. 815–24.
Delgado A. D., Martínez R., Pérez García-Plaza A., and Fresno V. 2012. Unsupervised Real-Time company name disambiguation in twitter. In Workshop on Real-Time Analysis and Mining of Social Streams (RAMSS), Palo Alto, California, USA, pp. 25–8.
Derczynski L. and Bontcheva K. 2013. Mining social media with linked open data, entity recognition, and event extraction. In Proceedings of the 3rd Workshop on Data Extraction and Object Search (DEOS 2013), Oxford, UK.
Dice L. R. 1945. Measures of the amount of ecologic association between species. Ecology 26 (3): 297–302.
Finkel J. R., Grenager T., and Manning C. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In ACL, University of Michigan, USA, pp. 363–70.
Gimpel K., Schneider N., O’Connor B., Das D., Mills D., Eisenstein J., Heilman M., Yogatama D., Flanigan J., and Smith N. A. 2011. Part-of-speech tagging for twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short papers – Volume 2, HLT ’11, Portland, Oregon, USA, pp. 42–7.
Gupta P., Goel A., Lin J., Sharma A., Wang D., and Zadeh R. 2013. Wtf: the who to follow service at twitter. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, Rio de Janeiro, Brazil, pp. 505–14.
Habib M. B. and van Keulen M. 2012a. Improving toponym disambiguation by iteratively enhancing certainty of extraction. In Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012, Barcelona, Spain. SciTePress, pp. 399–410.
Habib M. B. and van Keulen M. 2012b. Unsupervised improvement of named entity extraction in short informal context using disambiguation clues. In Proc. of the Workshop on Semantic Web and Information Extraction (SWAIE 2012), Galway, Ireland, pp. 1–10.
Habib M. B. and van Keulen M. 2013. A hybrid approach for robust multilingual toponym extraction and disambiguation. In IIS, Warsaw, Poland, pp. 1–15.
Hoffart J., Yosef M. A., Bordino I., Fürstenau H., Pinkal M., Spaniol M., Taneva B., Thater S., and Weikum G. 2011. Robust disambiguation of named entities in text. In Proceedings of EMNLP 2011, Edinburgh, Scotland, UK, pp. 782–92.
Howard P. and Hussain M. 2013. Democracy’s Fourth Wave?: Digital Media and the Arab Spring, Oxford Studies in Digital Politics. USA: OUP.
Jung J. J. 2012. Online named entity recognition method for microtexts in social networking services: a case study of twitter. Expert Systems with Applications 39 (9): 8066–70.
Kulkarni S., Singh A., Ramakrishnan G., and Chakrabarti S. 2009. Collective annotation of wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, Paris, France, pp. 457–66.
Li C., Weng J., He Q., Yao Y., Datta A., Sun A., and Lee B.-S. 2012. Twiner: named entity recognition in targeted twitter stream. In SIGIR, Portland, Oregon, USA, pp. 721–30.
Li L., Yu Z., Zou J., Su L., Xian Y., and Mao C. 2009. Research on the method of entity homepage recognition. Journal of Computational Information Systems (JCIS) 5 (4): 1617–24.
Lin T., Mausam, and Etzioni O. 2012. Entity linking at web scale. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), Montreal, Canada, pp. 84–8.
Locke B. and Martin J. 2009. Named entity recognition: adapting to microblogging. Senior Thesis, University of Colorado.
MacKay D. J. and Peto L. C. B. 1994. A hierarchical dirichlet language model. Natural Language Engineering 1 : 1–19.
Marsh E. and Perzanowski D. 1998. Muc-7 evaluation of ie technology: overview of results. In Proceedings of the 7th Message Understanding Conference (MUC-7).
McCallum A. and Li W. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of CoNLL 2003, Edmonton, Canada, pp. 188–91.
Mendes P. N., Jakob M., García-Silva A., and Bizer C. 2011. Dbpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics ’11, New York, NY, USA. ACM, pp. 1–8.
Ritter A., Clark S., Mausam, and Etzioni O. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP 2011, Edinburgh, Scotland, UK, pp. 1524–34.
Rizzo G. and Troncy R. 2011. Nerd: Evaluating named entity recognition tools in the web of data. In ISWC’11, Workshop on Web Scale Knowledge Extraction (WEKEX’11), Bonn, Germany.
Spina D., Amigó E., and Gonzalo J. 2011. Filter keywords and majority class strategies for company name disambiguation in twitter. In Proceedings of the 2nd International Conference on Multilingual and Multimodal Information Access Evaluation, CLEF’11, Amsterdam, The Netherlands, pp. 50–61.
Srinivasan H., Chen J., and Srihari R. 2009. Cross document person name disambiguation using entity profiles. In Proceedings of the Text Analysis Conference (TAC) Workshop, Gaithersburg, Maryland, USA.
Steiner T., Verborgh R., Gabarró Vallés J., and Van de Walle R. 2013. Adding meaning to social network microposts via multiple named entity disambiguation apis and tracking their data provenance. International Journal of Computer Information Systems and Industrial Management 5 : 69–78.
Suchanek F. M., Kasneci G., and Weikum G. 2007. Yago: a core of semantic knowledge. In Proc. of the 16th International Conference on World Wide Web, WWW ’07, Banff, Alberta, Canada, pp. 697–706.
Sullivan S. J., Schneiders A. G., Cheang C.-W., Kitto E., Lee H., Redhead J., Ward S., Ahmed O. H., and McCrory P. R. 2012. What’s happening? A content analysis of concussion-related traffic on twitter. British Journal of Sports Medicine 46 (4): 258–63.
Sutton C. and McCallum A. 2005. Piecewise training of undirected models. In Proceedings of UAI, Edinburgh, Scotland, UK, pp. 568–75.
Verma M., Divya, and Sofat S. 2014. Techniques to detect spammers in twitter - a survey. International Journal of Computer Applications 85 (10): 27–32.
Wang C., Chakrabarti K., Cheng T., and Chaudhuri S. 2012. Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, Lyon, France, pp. 719–28.
Wang K., Thrasher C., Viegas E., Li X., and Hsu B.-J. P. 2010. An overview of microsoft web n-gram corpus and applications. In Proceedings of the NAACL HLT 2010, Los Angeles, California, USA, pp. 45–8.
Westerveld T., Kraaij W., and Hiemstra D. 2002. Retrieving web pages using content, links, urls and anchors. In Proceedings of the 10th Text REtrieval Conference, TREC 2001, vol. SP 500, Gaithersburg, Maryland, USA, pp. 663–72.
Winkels M. 2013. The global social network landscape a country-by-country guide to social network usage. http://www.optimediaintelligence.es/noticias_archivos/719_20130715123913.pdf.
Wu T.-F., Lin C.-J., and Weng R. C. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5 : 975–1005.
Yerva S. R., Miklós Z., and Aberer K. 2012. Entity-based classification of twitter messages. IJCSA 9 (1): 88–115.
Yosef M., Hoffart J., Bordino I., Spaniol M., and Weikum G. 2011. Aida: An online tool for accurate disambiguation of named entities in text and tables. Proc. of the VLDB Endowment 4 (12): 1450–53.
Zhai C. and Lafferty J. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, New Orleans, Louisiana, USA, pp. 334–42.
Zhu Z., Hiemstra D., Apers P. M. G., and Wombacher A. 2012. Separate training for conditional random fields using co-occurrence rate factorization. Technical Report TR-CTIT-12-29, Centre for Telematics and Information Technology, University of Twente, Enschede.
Zhu Z., Hiemstra D., Apers P. M. G., and Wombacher A. 2013. Closed form maximum likelihood estimator of conditional random fields. Technical Report TR-CTIT-13-03, Centre for Telematics and Information Technology, University of Twente, Enschede.
Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering