Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting

Published online by Cambridge University Press:  11 February 2013

ROSA DEL GAUDIO
Affiliation:
Faculdade de Ciências, Departamento de Informática, University of Lisbon, Campo Grande, 1749-016 Lisboa, Portugal e-mails: rosa@di.fc.ul.pt, antonio.branco@di.fc.ul.pt
GUSTAVO BATISTA
Affiliation:
Department of Computer Science, University of São Paulo, PO Box 668, 13560-970 São Carlos, SP, Brazil e-mail: gbatista@icmc.usp.br
ANTÓNIO BRANCO
Affiliation:
Faculdade de Ciências, Departamento de Informática, University of Lisbon, Campo Grande, 1749-016 Lisboa, Portugal e-mails: rosa@di.fc.ul.pt, antonio.branco@di.fc.ul.pt

Abstract

This paper addresses the task of automatic definition extraction by thoroughly exploring an approach that relies solely on machine learning techniques and by focusing on the imbalance of the relevant datasets. We obtained a breakthrough in the automatic extraction of definitions by extensively and systematically experimenting with different sampling techniques and their combinations, as well as with a range of different types of classifiers. Performance consistently scored in the range of 0.95–0.99 in terms of area under the receiver operating characteristic curve, a notable improvement of between 17 and 22 percentage points over the baseline of 0.73–0.77, for datasets with different rates of imbalance. Thus, the present paper also contributes to the seminal work in natural language processing that points toward the importance of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, thereby greatly improving the performance of the large range of tools that rely on them.
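
As a rough illustration of the two ideas the abstract combines, rebalancing a skewed dataset by sampling and scoring a classifier by area under the ROC curve, the following minimal sketch may help. It is not the authors' implementation: the function names `random_oversample` and `roc_auc` are hypothetical, random oversampling stands in for whichever sampling techniques the paper actually evaluates, and the scores in the usage note are made up for the example.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size.

    Assumption: a binary task where `minority` marks the rare class
    (for example, sentences that contain a definition).
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Nothing to do if the minority class is absent or already as large.
    if not minority_idx or len(minority_idx) >= len(majority_idx):
        return list(rows), list(labels)
    deficit = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def roc_auc(scores, labels, positive=1):
    """Area under the ROC curve, computed as the probability that a random
    positive example is scored above a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

For instance, `random_oversample(list(range(10)), [1, 0, 0, 0, 0, 0, 0, 0, 0, 1])` returns 16 rows with 8 in each class, and `roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` evaluates to 0.75. Sampling of this kind would normally be applied to the training split only, so that the evaluation data keeps its original class distribution.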

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2013 
