
Leveraging machine translation for cross-lingual fine-grained cyberbullying classification amongst pre-adolescents

Published online by Cambridge University Press:  07 September 2022

Kanishk Verma*
Affiliation:
ADAPT SFI Research Centre, Dublin City University, Dublin, Ireland; DCU Anti-Bullying Centre, Dublin City University, Dublin, Ireland
Maja Popović
Affiliation:
ADAPT SFI Research Centre, Dublin City University, Dublin, Ireland
Alexandros Poulis
Affiliation:
TransPerfect DataForce, Luxembourg, Luxembourg
Yelena Cherkasova
Affiliation:
G3 Translate, Strategic Partner of TransPerfect, New York, USA
Cathal Ó hÓbáin
Affiliation:
ADAPT SFI Research Centre, Dublin City University, Dublin, Ireland
Angela Mazzone
Affiliation:
DCU Anti-Bullying Centre, Dublin City University, Dublin, Ireland
Tijana Milosevic
Affiliation:
DCU Anti-Bullying Centre, Dublin City University, Dublin, Ireland
Brian Davis
Affiliation:
ADAPT SFI Research Centre, Dublin City University, Dublin, Ireland
*Corresponding author. E-mail: kanishk.verma@adaptcentre.ie

Abstract

Cyberbullying is the wilful and repeated infliction of harm on an individual using the Internet and digital technologies. Similar to face-to-face bullying, cyberbullying can be captured formally using the Routine Activities Model (RAM), whereby the potential victim and bully are brought into proximity of one another via interaction on online social networking (OSN) platforms. Although the impact of the COVID-19 (SARS-CoV-2) restrictions on the online presence of minors has yet to be fully grasped, studies have reported that 44% of pre-adolescents encountered more cyberbullying incidents during the COVID-19 lockdown. Transparency reports shared by OSN companies indicate an increase in take-downs of cyberbullying-related comments, posts or content by artificially intelligent moderation tools. However, to detect efficiently and effectively whether a social media post or comment qualifies as cyberbullying, a number of factors based on the RAM must be taken into account, including the identification of cyberbullying roles and forms. This demands the acquisition of large amounts of fine-grained annotated data, which is costly and ethically challenging to produce. In addition, where fine-grained datasets do exist, they may be unavailable in the target language. Manual translation is costly; however, state-of-the-art neural machine translation offers a workaround. This study presents a first-of-its-kind experiment in leveraging machine translation to automatically translate a unique pre-adolescent cyberbullying gold standard dataset in Italian with fine-grained annotations into English for training and testing a native binary classifier for pre-adolescent cyberbullying.
In addition to contributing a high-quality English reference translation of the source gold standard, our experiments indicate that the performance of our target binary classifier when trained on machine-translated English output is on par with the source (Italian) classifier.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
© The Author(s), 2022. Published by Cambridge University Press
Table 1. Scenario-wise sentence breakdown.

Table 2. Count-wise fine-grained entity-based annotations.

Table 3. Count-wise fine-grained role-based annotations.

Table 4. Number of segments (sentences) in training and test data for MT systems.

Table 5. Dataset split size.

Table 6. Translation ambiguity, original labelling and domain expert error analysis.

Table 7. Cohen’s Kappa score for individual annotators with original annotations.

Table 8. Comparison of Italian-English systems by automatic evaluation scores BLEU and chrF.

Table 9. Original and replicated classification results on the Italian corpus.

Table 10. Hold-out test set binary classification best results with GRU.

Table 11. Binary classification best results on Scenario-C additional annotations.

Supplementary material: PDF

Verma et al. supplementary material

Appendix
