
Unsupervised Arabic dialect segmentation for machine translation

Published online by Cambridge University Press:  23 September 2020

Wael Salloum*
Affiliation:
AI Research Department, Mendel.ai, San Jose, CA, USA
Nizar Habash
Affiliation:
AI Research Department, Mendel.ai, San Jose, CA, USA
*Corresponding author. E-mail: wael@ccls.columbia.edu

Abstract

Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity when data are already scarce. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich, and yet resource-poor variants of Arabic. Our approach exploits existing monolingual and parallel data. We show performance comparable to that of state-of-the-art supervised methods for DA segmentation.

Information

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
