Learning from noisy out-of-domain corpus using dataless classification

Published online by Cambridge University Press:  17 June 2020

Yiping Jin
Affiliation:
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10300, Thailand
Dittaya Wanvarie*
Affiliation:
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10300, Thailand
Phu T. V. Le
Affiliation:
Knorex Pte. Ltd., 8 Cross St, Singapore 048424, Singapore
*Corresponding author. E-mail: Dittaya.W@chula.ac.th

Abstract

In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The labelled documents that are available may also be out of domain, so a model trained on them performs poorly in the target domain. In this work, we mitigate this data problem using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automatically selected keywords and unlabelled in-domain data. The proposed approach outperformed various supervised learning and dataless classification baselines by a large margin. We evaluated different keyword selection methods both intrinsically and extrinsically, the latter by measuring their impact on dataless classification accuracy. Finally, we conducted an in-depth analysis of the classifier's behaviour and explained why the proposed dataless classification method outperformed its supervised learning counterparts.
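To make the two-stage idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes pointwise mutual information (PMI) as the statistical keyword-selection method, which is only one possible choice. It also replaces the dataless classifier with plain keyword matching, which, unlike the method described in the abstract, does not learn from unlabelled in-domain data. The function names `mine_keywords` and `classify` and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mine_keywords(labelled_docs, top_k=2, min_count=2):
    """Stage 1 (assumed): rank words per class by PMI with the class label."""
    word_class = defaultdict(Counter)  # class -> document frequency of each word
    word_total = Counter()             # document frequency of each word overall
    class_total = Counter()            # number of documents per class
    for text, label in labelled_docs:
        class_total[label] += 1
        for w in set(text.lower().split()):
            word_class[label][w] += 1
            word_total[w] += 1
    n_docs = sum(class_total.values())

    keywords = {}
    for label, counts in word_class.items():
        scores = {}
        for w, c in counts.items():
            if c < min_count:
                continue  # skip rare words, whose PMI estimates are unreliable
            # PMI(w, label) = log( P(w, label) / (P(w) * P(label)) )
            scores[w] = math.log((c * n_docs) / (word_total[w] * class_total[label]))
        # Highest PMI first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_k]
    return keywords


def classify(text, keywords):
    """Stage 2 stand-in: pick the class whose keywords occur most often."""
    tokens = Counter(text.lower().split())
    hits = {label: sum(tokens[w] for w in words) for label, words in keywords.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None  # None when no keyword matches


# Toy "noisy out-of-domain" corpus, invented for illustration only.
noisy_docs = [
    ("the team won the football match", "sports"),
    ("football fans cheered the match", "sports"),
    ("the bank raised the interest rate", "finance"),
    ("interest in the bank stock grew", "finance"),
]
keywords = mine_keywords(noisy_docs)
prediction = classify("a late football goal decided the match", keywords)
```

On this toy corpus the word "the" appears in every document, so its PMI with each class is zero and it ranks below the class-specific words. That filtering of uninformative words is the effect a statistical selection step is meant to have.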

Information

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
