Korean named entity recognition based on language-specific features

Published online by Cambridge University Press:  29 June 2023

Yige Chen
Affiliation:
The Chinese University of Hong Kong, Hong Kong
KyungTae Lim
Affiliation:
Seoul National University of Science and Technology, Seoul, South Korea
Jungyeul Park*
Affiliation:
The University of British Columbia, Vancouver, BC, Canada
*Corresponding author: Jungyeul Park; Email: jungyeul@mail.ubc.ca

Abstract

In this paper, we propose a novel way of improving named entity recognition (NER) in the Korean language using its language-specific features. While NER has been studied extensively in recent years, efficient mechanisms for recognizing named entities (NEs) in Korean have hardly been explored, because the Korean language has distinct linguistic properties that present challenges for modeling. We therefore propose an annotation scheme for Korean corpora that adopts the CoNLL-U format, which decomposes Korean words into morphemes and reduces the ambiguity of NEs in the original segmentation, where an NE may be attached to functional morphemes such as postpositions and particles. We investigate how NE tags are best represented in this morpheme-based scheme and implement an algorithm to convert word-based and syllable-based Korean corpora with NEs into the proposed morpheme-based format. Analyses of the results of traditional and neural models show that the proposed morpheme-based format is feasible, and demonstrate how model performance varies with additional language-specific features. We also consider extrinsic conditions, observing how the performance of the proposed models varies across different types of data, including the original segmentation and different tagging formats.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. XPOS (Sejong tag set) introduced in the morpheme-based CoNLL-U data converted from NAVER’s NER corpus and its type of named entities.

Figure 1. Distribution of each type of postposition/particle after NEs (NER data from NAVER). The terms and notations in the figure are described in Table 1.

Table 2. Percentage (%) of occurrences of JKB/JKO/JKS/JX among all types of parts-of-speech after LOC/ORG/PER (NER data from NAVER).

Figure 2. Various approaches of annotation for named entities (NEs): the eojeol-based approach annotates the entire word, the morpheme-based annotates only the morpheme and excludes the functional morphemes, and the syllable-based annotates syllable by syllable to exclude the functional morphemes.
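
The three granularities named in this caption can be illustrated with a minimal sketch. The example word 서울에서 ("in Seoul", i.e. the location name 서울 followed by the postposition 에서), the helper name `bio_tags`, and the use of BIO labels are assumptions chosen for illustration; they are not taken from the figure itself.

```python
def bio_tags(units, ne_len, label):
    """Tag the first `ne_len` units as one named entity (BIO), the rest as O.

    `units` is a list of annotation units: whole eojeols, morphemes,
    or single syllables, depending on the approach being illustrated.
    """
    tagged = []
    for i, unit in enumerate(units):
        if i >= ne_len:
            tag = "O"              # functional material outside the entity
        elif i == 0:
            tag = f"B-{label}"     # first unit of the entity
        else:
            tag = f"I-{label}"     # continuation of the entity
        tagged.append((unit, tag))
    return tagged


# Eojeol-based: the whole word is one unit, so the postposition is
# included in the entity span.
eojeol_based = bio_tags(["서울에서"], 1, "LOC")

# Morpheme-based: only the noun morpheme carries the entity tag.
morpheme_based = bio_tags(["서울", "에서"], 1, "LOC")

# Syllable-based: the entity covers the first two syllables only.
syllable_based = bio_tags(list("서울에서"), 2, "LOC")
```

In the eojeol-based case the postposition cannot be separated from the entity, which is the ambiguity the morpheme-based and syllable-based approaches avoid.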

Table 3. Summary of publicly available Korean NER datasets.

Figure 3. CoNLL-U style annotation with multiword tokens for morphological analysis and POS tagging. It can include BIO-based NER annotation where B-LOC is for a beginning word of location and I-PER for an inside word of person.
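
As a rough sketch of what such an annotation can look like, the snippet below writes a sentence in the ten-column CoNLL-U layout, with a multiword-token range line (for example `1-2`) for each eojeol that splits into several morphemes. The example sentence, the function names, and the choice to store the NE tag in the MISC column as `NE=...` are assumptions made for illustration; the paper's actual column layout may differ.

```python
def sentence_to_conllu(eojeols):
    """Build CoNLL-U lines from [(surface, [(form, xpos, ne_tag), ...]), ...]."""
    lines = []
    next_id = 1
    for surface, morphemes in eojeols:
        last_id = next_id + len(morphemes) - 1
        if len(morphemes) > 1:
            # Multiword-token line: ID range and surface form, other fields empty.
            lines.append("\t".join([f"{next_id}-{last_id}", surface] + ["_"] * 8))
        for form, xpos, ne_tag in morphemes:
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            cols = [str(next_id), form, form, "_", xpos,
                    "_", "_", "_", "_", f"NE={ne_tag}"]  # NE in MISC is an assumption
            lines.append("\t".join(cols))
            next_id += 1
    return lines


def read_ne_tags(lines):
    """Return (form, ne_tag) pairs for morpheme lines, skipping range lines."""
    pairs = []
    for line in lines:
        cols = line.split("\t")
        if "-" in cols[0]:
            continue  # multiword-token line carries no tag of its own
        pairs.append((cols[1], cols[9].split("=", 1)[1]))
    return pairs


# Hypothetical sentence: 서울에서 만났다 ("met in Seoul"), Sejong XPOS tags.
sentence = [
    ("서울에서", [("서울", "NNP", "B-LOC"), ("에서", "JKB", "O")]),
    ("만났다", [("만나", "VV", "O"), ("았", "EP", "O"), ("다", "EF", "O")]),
]
conllu_lines = sentence_to_conllu(sentence)
```

Keeping the original eojeol on the range line means the word-level segmentation can be recovered from the morpheme-level annotation.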

Figure 4. Overall structure of our RNN-based model.

Algorithm 1. Pseudo-code for converting data from NAVER’s eojeol-based format into the morpheme-based CoNLL-U format

Table 4. CRF/Neural results for different models on NAVER’s data converted into the proposed format.

Figure 5. CRF feature template example for word and pos.

Table 5. CRF/Neural results for various sets of features on NAVER’s data converted into the proposed format.

Table 6. CRF/Neural result comparison between the proposed CoNLL-U format and NAVER’s eojeol-based format using NAVER’s data where POS features are not applied.

Table 7. CRF/Neural result comparison between BIO and BIOES annotations using NAVER’s data converted into the proposed format where POS features are not applied.
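
For readers unfamiliar with the two tagging formats compared in this table, the following sketch shows the standard way a BIO sequence is rewritten as BIOES: a single-token entity becomes `S-`, and the last token of a longer entity becomes `E-`. The function name `bio_to_bioes` and the sample tags are illustrative assumptions, not the authors' conversion script.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            converted.append(tag if continues else f"S-{label}")
        else:  # prefix == "I"
            converted.append(tag if continues else f"E-{label}")
    return converted


bioes = bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"])
```

The BIOES form encodes entity boundaries explicitly, at the cost of a larger tag set for the model to predict.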

Table 8. Result comparison between the proposed CoNLL-U format and the syllable-based format using MODU (19 21), KLUE, and ETRI datasets where POS features are not applied.

Figure 6. Comparison of the confusion matrices of the BERT (6a) and XLM-RoBERTa (6b) models on the test set. XLM-RoBERTa tends to show better results in finding the “I-*” typed entities, with +0.7% in average score.

Figure 7. Various approaches of annotation for NEs in Korean.

Table 9. Hyperparameters.