Multilingual SMS-based author profiling: Data and methods

Published online by Cambridge University Press: 26 June 2018

MEHWISH FATIMA
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan e-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com
SABA ANWAR
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan
AMNA NAVEED
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan
WAQAS ARSHAD
Affiliation:
Department of Computer Science & IT, Superior University, Lahore, Pakistan e-mail: waqas.arshad@superior.edu.com.pk
RAO MUHAMMAD ADEEL NAWAB
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan
MUNTAHA IQBAL
Affiliation:
Al-Khwarizmi Institute of Computer Science, University of Engineering & Technology, Lahore, Pakistan e-mail: muntaha.iqbal@kics.edu.pk
ALIA MASOOD
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan

Abstract

In recent years, many benchmark author profiling corpora have been developed for various genres, including Twitter, social media, blogs, hotel reviews and e-mail. However, no such standard evaluation resource has been developed for Short Messaging Service (SMS), a popular medium of communication that is very useful for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles; each profile consists of an author's SMS messages aggregated into a single document, along with seven demographic traits: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study are (1) annotating the proposed corpus for code-switching at the lexical level (approximately 0.69 million tokens are manually annotated) and (2) applying a stylometry-based method (groups of sixty-four features) and a content-based method (twelve features) to gender identification, in order to demonstrate how the proposed corpus can be used for the development and evaluation of author profiling methods. The results show that the content-based character 5-gram feature outperformed all other features, obtaining an accuracy of 0.975 and an F1 score of 0.947 for gender identification on the entire corpus. Furthermore, our proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purposes.
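The abstract does not describe how the character 5-gram feature is computed. As a rough illustration only, the sketch below counts overlapping character 5-grams in one author profile. The function name `char_ngrams`, the lowercasing and whitespace normalization, and the sample text are all assumptions made for this example and are not taken from the article.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in one author profile.

    Hypothetical helper: the normalization below (lowercasing and
    collapsing runs of whitespace) is an assumption, not the authors'
    documented preprocessing.
    """
    normalized = " ".join(text.lower().split())
    # Slide a window of width n over the text; texts shorter than n
    # yield an empty Counter because the range is empty.
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Made-up mixed English / Roman Urdu message, for illustration only.
profile = "kal milte hain, see you at 5"
features = char_ngrams(profile)
print(len(features), "distinct character 5-grams")
```

Counts like these would typically be turned into a fixed-length vector per profile and passed to a classifier; which weighting scheme and classifier the authors used is not stated in the abstract.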

Information

Type
Article
Copyright
Copyright © Cambridge University Press 2018 
