
Toward a shallow discourse parser for Turkish

Published online by Cambridge University Press:  11 August 2023

Ferhat Kutlu*
Affiliation:
Graduate School of Informatics, Cognitive Science Department, Middle East Technical University, Çankaya/Ankara, Turkey
Deniz Zeyrek
Affiliation:
Graduate School of Informatics, Cognitive Science Department, Middle East Technical University, Çankaya/Ankara, Turkey
Murathan Kurfalı
Affiliation:
Linguistics Department, Stockholm University, Stockholm, Sweden
*Corresponding author: Ferhat Kutlu; Email: ferhat.kutlu@metu.edu.tr

Abstract

One of the most interesting aspects of natural language is how texts cohere, which involves the pragmatic or semantic relations that hold between clauses (addition, cause-effect, conditional, similarity), referred to as discourse relations. Identifying and classifying discourse relations is an important challenge whose resolution would support tasks such as text summarization, dialogue systems, and machine translation, which need information above the clause level. Despite the recent interest in discourse relations in well-known languages such as English, data and experiments are still needed for typologically different and less-resourced languages. We report the most comprehensive investigation of shallow discourse parsing in Turkish, focusing on two main sub-tasks: identification of discourse relation realization types and sense classification of explicit and implicit relations. The work is based on fine-tuning a pre-trained language model (BERT) as an encoder and classifying the encoded data with neural network-based classifiers. We first identify the discourse relation realization type that holds in a given text, if there is any. Then, we move on to the sense classification of the identified explicit and implicit relations. In addition to in-domain experiments on a held-out test set from the Turkish Discourse Bank (TDB 1.2), we also report the out-of-domain performance of our models, using the Turkish part of the TED Multilingual Discourse Bank, in order to evaluate their generalization abilities. Finally, we explore the effect of multilingual data aggregation on the classification of relation realization type through a cross-lingual experiment. The results suggest that our models perform relatively well despite the limited size of the TDB 1.2 and that there are language-specific aspects of detecting the types of discourse relation realization. We believe that the findings are important both in providing insights regarding the performance of modern language models in a typologically different language and in the low-resource scenario, given that the TDB 1.2 is one-twentieth the size of the Penn Discourse TreeBank in terms of the total number of relations.
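
The two-step flow described in the abstract (realization type first, then sense for explicit and implicit relations) can be sketched as plain control flow. This is only an illustration under stated assumptions: the label strings, the function name `parse_shallow`, and the classifier callables are hypothetical stand-ins, not the authors' code, and the fine-tuned BERT models are replaced by arbitrary callables.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical label names: the abstract only states that explicit and
# implicit relations go on to sense classification.
SENSE_BEARING_TYPES = ("Explicit", "Implicit")

# Any callable mapping a text span to a label can play the classifier role
# here; in the paper this role is filled by BERT-based neural classifiers.
Classifier = Callable[[str], str]


def parse_shallow(
    text: str,
    type_clf: Classifier,
    sense_clfs: Dict[str, Classifier],
) -> Tuple[str, Optional[str]]:
    """Return (realization_type, sense).

    The sense is None whenever the predicted realization type is not one
    of the sense-bearing types, so no sense classifier is called.
    """
    dr_type = type_clf(text)
    if dr_type not in SENSE_BEARING_TYPES:
        return dr_type, None
    # Explicit and implicit relations each use their own sense classifier.
    return dr_type, sense_clfs[dr_type](text)
```

Keeping the sense classifiers in a dictionary keyed by realization type mirrors the separate explicit and implicit sense experiments, while leaving the models themselves out of the sketch.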

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Symbolic representation of our approach

Table 1. Distribution of DR types and Level-1 senses in the TDB 1.2 and the PDTB 3.0

Table 2. Summary of the train, development, and test splits of the TDB 1.2 used in the experiments. The upper part presents the distribution of labels in the DR realization type classification experiments and the lower part presents the sense experiments over explicit and implicit DRs.

Table 3. An overview of the conducted experiments

Table 4. DR realization type classification results over the TDB 1.2

Figure 2. The bar chart and the box plot showing the semantic similarity analysis of DR realization types in the TDB 1.2 (encoded with the Turkish BERT)

Table 5. Level-1 sense classification results of explicit and implicit DRs in the TDB 1.2

Table 6. Cross-domain DR realization type classification results over the T-TED-MDB

Table 7. Level-1 sense classification results of explicit and implicit DRs in the T-TED-MDB

Table 8. F1-scores of DR realization type classification experiments with Turkish BERT on the TDB 1.2 and with mBERT on both the PDTB 3.0 and the joint dataset

Figure A1. The leftmost column contains the Level-1 senses and the middle column contains the Level-2 senses. For asymmetric relations, Level-3 senses are located in the rightmost column (Webber et al. 2019). While the TDB 1.2 and the PDTB 3.0 both assign senses from all three levels, the present work exploits Level-1 senses only.

Figure B1. The symbolic representation of the BERT MultiClass TensorFlow Model

Figure C1. The confusion matrix of DR realization type classification with the TDB 1.2 test set, where the model is trained on the TDB 1.2 and encoded with the fine-tuned monolingual BERT

Figure C2. The confusion matrix of DR realization type classification with the TDB 1.2 test set, where the model is trained on the custom multilingual dataset (TDB 1.2 + PDTB 3.0) and encoded with the fine-tuned mBERT

Table C1. $\kappa$ coefficients of the confusion matrices in Figs. C1 and C2, calculated with Formula 4
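
Formula 4 is not reproduced in this listing. Assuming it is the standard Cohen's $\kappa$ (observed agreement corrected for chance agreement), a coefficient of this kind could be computed from a confusion matrix roughly as follows. The function name `cohen_kappa` is illustrative and not taken from the article.

```python
from typing import Sequence


def cohen_kappa(confusion: Sequence[Sequence[float]]) -> float:
    """Cohen's kappa for a square confusion matrix (rows: gold, columns: predicted).

    Assumption: this is the standard definition
    kappa = (p_o - p_e) / (1 - p_e); the article's Formula 4 may differ.
    """
    n = len(confusion)
    if any(len(row) != n for row in confusion):
        raise ValueError("confusion matrix must be square")
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: share of items on the diagonal.
    p_o = sum(confusion[i][i] for i in range(n)) / total

    # Chance agreement: sum over classes of (row marginal * column marginal).
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(row[j] for row in confusion) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)
```

For a 2x2 matrix with rows `[20, 5]` and `[10, 15]`, the observed agreement is 0.7 and the chance agreement is 0.5, which gives $\kappa = 0.4$.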