
A clustering framework for lexical normalization of Roman Urdu

Published online by Cambridge University Press: 10 June 2020

Abdul Rafae Khan*
Affiliation:
Stevens Institute of Technology, Hoboken, NJ 07030, USA; The Graduate Center, Computer Science Department, City University of New York, 365 5th Ave, New York, NY 10016, USA
Asim Karim
Affiliation:
Lahore University of Management Sciences, D.H.A, Lahore Cantt., 54792 Lahore, Pakistan
Hassan Sajjad
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Faisal Kamiran
Affiliation:
Information Technology University, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan
Jia Xu
Affiliation:
Stevens Institute of Technology, Hoboken, NJ 07030, USA; The Graduate Center, Computer Science Department, City University of New York, 365 5th Ave, New York, NY 10016, USA
*Corresponding author. E-mail: rafae015@gmail.com

Abstract

Roman Urdu is an informal form of the Urdu language written in Roman script, and it is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora. The framework comprises a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu in Roman script. The similarity function incorporates various phonetic, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It uses a similarity threshold to balance the number of clusters against their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over their respective alternative algorithms in our clustering framework for the lexical normalization of Roman Urdu.
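To make the general idea concrete, the following is a minimal sketch of threshold-based grouping of spelling variants using a weighted combination of a phonetic feature and a string feature. It is not the authors' method: `toy_phonetic_key` is a hypothetical stand-in and does not reproduce UrduPhone, the clustering is a simple single-pass greedy grouping rather than the k-medoids variant Lex-Var, contextual features are omitted, and the weights, threshold, and example spellings are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def toy_phonetic_key(word: str) -> str:
    """Hypothetical phonetic key (NOT UrduPhone): keep the first letter,
    drop later vowels, and collapse repeated consonants that become adjacent."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiou":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phon: float = 0.5, w_str: float = 0.5) -> float:
    """Weighted sum of a phonetic match (0 or 1) and a string ratio in [0, 1].
    The weights are illustrative assumptions."""
    phon = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return w_phon * phon + w_str * string


def cluster(words, threshold: float = 0.7):
    """Greedy single-pass grouping: each word joins the most similar cluster
    whose representative (its first word) scores at least `threshold`;
    otherwise the word starts a new cluster."""
    clusters = []
    for w in words:
        best, best_sim = None, threshold
        for c in clusters:
            s = similarity(w, c[0])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([w])
        else:
            best.append(w)
    return clusters


groups = cluster(["zindagi", "zindgi", "zindagee", "kitab", "kitaab"])
```

In this sketch, raising `threshold` yields more, tighter clusters, and lowering it yields fewer, looser ones, which mirrors the trade-off the abstract attributes to the similarity threshold.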

Information

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
