
Real-world sentence boundary detection using multitask learning: A case study on French

Published online by Cambridge University Press:  06 April 2022

KyungTae Lim
Affiliation:
Hanbat National University, Daejeon 34158, South Korea
Jungyeul Park*
Affiliation:
The University of British Columbia, Vancouver, BC V6T 1Z4, BC, Canada; University of Washington, Seattle, WA 98195, USA
*Corresponding author. E-mail: jungyeul@mail.ubc.ca

Abstract

We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. Via experiments, we obtained F1 scores up to 98.07% using the proposed multitask neural model, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and provide a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.
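As a rough illustration of the task and metric named in the abstract, sentence boundary detection can be framed as labeling each token and scoring the predicted sentence starts with F1. The sketch below is not the authors' model or label scheme: the `B`/`O` tags, the function names `boundary_labels` and `boundary_f1`, and the toy prediction are all assumptions made for this example.

```python
from typing import List, Tuple


def boundary_labels(sentences: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten tokenized sentences into one token stream with B/O labels.

    "B" marks a token that starts a sentence, "O" marks any other token.
    This two-tag scheme is an assumption for illustration only.
    """
    tokens: List[str] = []
    labels: List[str] = []
    for sentence in sentences:
        for position, token in enumerate(sentence):
            tokens.append(token)
            labels.append("B" if position == 0 else "O")
    return tokens, labels


def boundary_f1(gold: List[str], pred: List[str]) -> float:
    """Compute F1 over the "B" (sentence start) label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == "B" and p == "B" for g, p in zip(gold, pred))
    fp = sum(g != "B" and p == "B" for g, p in zip(gold, pred))
    fn = sum(g == "B" and p != "B" for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two sentences with no punctuation between them, as in the obituary data.
gold_sentences = [
    ["Les", "Sables-d'Olonne", "La", "Chaume"],
    ["Philippe", "et", "Véronique"],
]
tokens, gold = boundary_labels(gold_sentences)

# A hypothetical prediction that misses the unpunctuated second sentence start.
pred = ["B", "O", "O", "O", "O", "O", "O"]
score = boundary_f1(gold, pred)
```

In this toy case the prediction finds one of the two sentence starts with no false positives, so precision is 1.0 and recall is 0.5. The missed start is the kind of punctuation-free boundary that the abstract reports separately.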

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Figure 1. Paragraph example from the Europarl corpus (Koehn 2005).

Table 1. Summary of previous works on sentence boundary disambiguation.

Figure 2. Raw SBD data for French: (translation) “Les Sables-d’Olonne La Chaume Philippe and Véronique, his nephew and niece; Anne, Albert and Chantal, his brother-in-law and sister-in-law, are sad to report the death of Mr. Serge, who passed away on November 26, 2014, at the age of 82….”

Figure 3. Preprocessed SBD data for training (sent marked): Les Sables-d’Olonne La Chaume and Philippe et Véronique, son … are annotated as two sentences; the latter is a middle sentence, that is, one whose start is not preceded by any punctuation mark.

Table 2. Detailed statistics of the corpus.

Table 3. Summary of SBD as a sequence labeling problem.

Figure 4. Overall structure of our roberta-sbd model.

Table 4. Hyperparameters in neural models.

Table 5. SBD results using different linguistic information.

Figure 5. Overall structure of our multitask model.

Table 6. Comparison between the single-task and multitask learning in terms of required computing resources and training time.

Table 7. SBD (middle) results based on different multitask models.

Figure 6. Evaluation results as a function of the number of training epochs. The Y-axis represents F1 scores.

Table 8. Multitask SBD (middle) results based on different BERT models.

Figure 7. Evaluation results as a function of the amount of training data. The Y-axis represents F1 scores for SBD and NER and accuracy for POS tagging.

Table 9. SBD results of heterogeneous domains using the Europarl corpus.

Figure 8. Example of the Europarl corpus for French: (translation) “Opening of the session I declare resumed the 2000-2001 session of the European Parliament. Agenda Mr President, the second item on this morning’s agenda is the recommendation for second reading on cocoa and chocolate products, for which I am the rapporteur. Quite by accident I learnt yesterday, at 8.30 p.m., that the vote was to take place at noon today.” We note that Je déclare ouverte … (‘I declare resumed ...’) and Monsieur le Président, le deuxième ... (‘Mr President, the second ...’) are considered middle sentences because they are not preceded by punctuation marks.

Table 10. SBD results of domain adaptation using the Europarl corpus based on multitask-sbd with +p+n (p).

Figure 9. Sentence example for obituaries and possible genealogy tree diagram: (translation) “GLANGES (Cramarigeas, Le Châtaignier) Gaston and Marie-Claude, his children; Laurent and Christelle, his grandchildren; Evelyne, Guillaume, her great-grandchildren, As well as all the family and her friends, are sad to inform you of the death of Madame Lucienne at the age of 84 years. Her funeral will take place on Monday, October 19, 2009, at 2:30 p.m., in the church of Glanges. Condolences on register at the church. The family thanks in advance all the people who will take part in their grief. PF Graffeuil-Feisthammel, St-Germain-les-Belles.”

Table 11. End-to-end system result.