
Building a Turkish UCCA dataset

Published online by Cambridge University Press:  27 August 2024

Necva Bölücü*
Affiliation:
Department of Computer Engineering, Hacettepe University, Ankara, Turkey; Data61, CSIRO, Sydney, NSW, Australia
Burcu Can
Affiliation:
Computing Science, University of Stirling, Stirling, UK
*Corresponding author: Necva Bölücü; Email: necvaa@gmail.com

Abstract

Semantic representation is the task of conveying the meaning of a natural language utterance by converting it to a logical form that can be processed and understood by machines. It is used by many applications in natural language processing (NLP), particularly in tasks related to natural language understanding (NLU). Because semantic parsing is so widely used in NLP, many semantic representation schemes with different forms have been proposed; Universal Conceptual Cognitive Annotation (UCCA) is one of them. UCCA is a cross-lingual semantic annotation framework that allows easy annotation without requiring substantial linguistic knowledge. UCCA-annotated datasets have so far been released for English, French, German, Russian, and Hebrew. In this paper, we present a UCCA-annotated Turkish dataset of 400 sentences obtained from the METU-Sabanci Turkish Treebank. We provide the UCCA annotation specifications defined for Turkish so that the dataset can be extended further. We followed a semi-automatic annotation approach, in which an external semantic parser produces the initial annotation of the dataset, which is then manually revised by two annotators. We used the same semantic parser model to evaluate the dataset with zero-shot and few-shot learning, demonstrating that even a small sample from the target language in the training data has a notable impact on parser performance (15.6% and 2.5% gain over zero-shot for labelled and unlabelled results, respectively).

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. The Turkish UCCA dataset annotation process comprises two steps: (1) obtaining a partially annotated dataset using an external semantic parser and (2) refining the partially annotated dataset with human annotators.


Figure 2. Examples of UCCA annotation graphs. Category abbreviations: A: Participant, P: Process, D: Adverbial, C: Center, N: Connector, E: Elaborator, F: Function


Table 1. The sentence “Ama hiçbir şey söylemedim ki ben sizlere” (in English, “But I didn’t say anything to you”) in the METU-Sabanci Turkish Treebank (Atalay et al., 2003; Oflazer et al., 2003). The columns give, in order, the position of the word in the sentence, surface form, lemma, part-of-speech (PoS) tag, morphological features separated by $|$, head-word index (index of the syntactic parent, 0 for ROOT), and the syntactic relation between the head and the word.
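The seven-column layout described in the caption of Table 1 can be read with a few lines of Python. This is a minimal sketch, not the treebank's official reader: it assumes tab-separated columns and `_` for an empty feature field, and the `Token` class, the `parse_row` function, and the sample row values are hypothetical illustrations rather than data copied from the treebank.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int        # position of the word in the sentence (1-based)
    form: str         # surface form
    lemma: str
    pos: str          # part-of-speech tag
    feats: List[str]  # morphological features, split on "|"
    head: int         # index of the syntactic parent, 0 for ROOT
    deprel: str       # syntactic relation between the head and the word


def parse_row(line: str) -> Token:
    """Parse one row, assuming seven tab-separated columns (an assumption)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 7:
        raise ValueError(f"expected 7 columns, got {len(cols)}")
    idx, form, lemma, pos, feats, head, deprel = cols
    feat_list = [] if feats in ("", "_") else feats.split("|")
    return Token(int(idx), form, lemma, pos, feat_list, int(head), deprel)


# Illustrative row only; the tag and feature names are made up for the example.
row = "4\tsöylemedim\tsöyle\tVerb\tNeg|Past|A1sg\t0\tROOT"
tok = parse_row(row)
print(tok.form, tok.feats, "is root" if tok.head == 0 else f"head={tok.head}")
```

Keeping the features as a list rather than a single string makes it straightforward to test for an individual morphological tag when comparing tokens.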


Figure 3. UCCA Annotation of “Ama hiçbir şey söylemedim ki ben sizlere” (in English, “But I didn’t say anything to you”)


Figure 4. An overview of the external semantic parser


Figure 5. Confusion matrix for the outputs in partial annotation (predicted) and refined annotation (gold). Category abbreviations: A: Participant, C: Center, D: Adverbial, E: Elaborator, F: Function, G: Ground, H: Parallel Scene, L: Linker, N: Connector, P: Process, R: Relator, S: State, U: Punctuation
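A confusion matrix of the kind described in the caption of Figure 5 can be tallied from pairs of gold and predicted categories. The sketch below assumes the edges of the partial and refined annotations have already been aligned one-to-one, which is a simplification; the function names and the sample pairs are hypothetical and do not reproduce the paper's counts.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (gold category, predicted category)


def confusion_counts(pairs: Iterable[Pair]) -> Counter:
    """Count how often each gold category was predicted as each category."""
    return Counter(pairs)


def as_matrix(counts: Counter, labels: List[str]) -> List[List[int]]:
    """Rows are gold categories, columns are predicted categories."""
    return [[counts[(g, p)] for p in labels] for g in labels]


# Made-up aligned edges: one Process (P) edge was predicted as State (S).
aligned = [("P", "P"), ("P", "S"), ("A", "A"), ("A", "A")]
counts = confusion_counts(aligned)
matrix = as_matrix(counts, ["A", "P", "S"])
for label, row in zip(["A", "P", "S"], matrix):
    print(label, row)
```

Off-diagonal cells are the ones the human annotators had to correct, so the largest of them indicate which category pairs the parser confuses most often.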


Figure 6. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence “(O) Yerinden kalkmıştı.” (in English, “S/he had stood up.”). Category abbreviations: H: Parallel Scene, A: Participant, P: Process, U: Punctuation


Figure 7. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence “(Sen) Kurtulmak istiyor musun oğlum? diye sordu Şakir.” (in English, “Do you want to be saved son? asked Şakir.”). Category abbreviations: H: Parallel Scene, D: Adverbial, C: Center, U: Punctuation, R: Relator, P: Process, A: Participant, F: Function, G: Ground


Table 2. Proportions of the edges and labels as well as the number of sentences and tokens in the UCCA datasets in Turkish, English, French, and German. The statistical details of English, French, and German datasets are taken from Hershcovich et al. (2019b).


Table 3. The number of sentences in each UCCA-annotated dataset provided by SemEval 2019 (Hershcovich et al., 2019b)


Table 4. F1 results obtained from zero-shot and few-shot learning on the Turkish UCCA dataset. Avg is the macro average of the F1 metric. $\uparrow$ marks a statistically significant improvement over zero-shot learning.
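The difference between labelled and unlabelled F1, and the macro average reported as Avg, can be illustrated with a short Python sketch. It assumes each edge is identified by its category and the terminal positions it covers, and that the macro average is an unweighted mean over the scores passed in; both are simplifying assumptions, the helper names are hypothetical, and this is not the official UCCA evaluation script.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, Tuple[int, ...]]  # (category, terminal positions covered)


def f1(pred: Set, gold: Set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def labelled_f1(pred: List[Edge], gold: List[Edge]) -> float:
    # An edge counts as correct only if both its span and category match.
    return f1(set(pred), set(gold))


def unlabelled_f1(pred: List[Edge], gold: List[Edge]) -> float:
    # Categories are ignored; only the covered spans are compared.
    return f1({span for _, span in pred}, {span for _, span in gold})


def macro_average(scores: Iterable[float]) -> float:
    scores = list(scores)
    return sum(scores) / len(scores)


# Made-up three-token example: the parser mislabels one edge (D instead of P).
gold = [("A", (0,)), ("P", (1,)), ("U", (2,))]
pred = [("A", (0,)), ("D", (1,)), ("U", (2,))]
lab = labelled_f1(pred, gold)
unlab = unlabelled_f1(pred, gold)
print(round(lab, 3), round(unlab, 3), round(macro_average([lab, unlab]), 3))
```

In this toy case the structure is fully recovered while one category is wrong, so the unlabelled score is higher than the labelled one.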


Figure 8. Few-shot learning results by sentence length


Figure 9. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence “(O) Evet, dedi çaresizlikle.” (in English, “S/he said yes with desperation.”). Category abbreviations: H: Parallel Scene, D: Adverbial, G: Ground, U: Punctuation, P: Process, A: Participant


Figure 10. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence, “(Ben) Kurtulup buraya gelmeyi başardım.” (in English, “I managed to escape and come here.”). Category abbreviations: H: Parallel Scene, D: Adverbial, U: Punctuation, P: Process, A: Participant


Figure 11. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence “(Sen) Kaçıp kurtulmak istedin.” (in English, “You wanted to escape and get away.”). Category abbreviations: H: Parallel Scene, U: Punctuation, P: Process, A: Participant


Figure 12. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence, “(O) Onu elinden kaçırmış, bir başka erkeğe kaptırmıştı.” (in English, “S/he missed her/him, s/he had lost her/him to another man.”). Category abbreviations: H: Parallel Scene, D: Adverbial, F: Function, U: Punctuation, P: Process, E: Elaborator, C: Center, A: Participant


Figure 13. The semantic parse tree obtained from the semantic parsing model and the gold annotation obtained from the manual annotation of the sentence “Geldik! diye bağırdı Kerem.” (in English, “Kerem shouted that we had arrived.”). Category abbreviations: H: Parallel Scene, R: Relator, U: Punctuation, P: Process, A: Participant