
Chinese word segmentation with heterogeneous graph convolutional network

Published online by Cambridge University Press:  18 October 2024

Xuemei Tang
Affiliation:
Department of Information Management, Peking University, Beijing, China Digital Humanities Center of Peking University, Beijing, China
Qi Su*
Affiliation:
Digital Humanities Center of Peking University, Beijing, China School of Foreign Languages, Peking University, Beijing, China
Jun Wang
Affiliation:
Department of Information Management, Peking University, Beijing, China Digital Humanities Center of Peking University, Beijing, China
*
Corresponding author: Qi Su; Email: sukia@pku.edu.cn

Abstract

Recently, deep learning methods have achieved remarkable success in the Chinese word segmentation (CWS) task. Some enhance CWS models with contextual features and external resources (e.g., sub-words, lexicons, and syntax). However, existing approaches fail to fully exploit these heterogeneous features and their structural information. In this paper, we therefore propose a heterogeneous information learning framework for CWS, named heterogeneous graph neural segmenter (HGNSeg), which exploits heterogeneous features with graph convolutional networks and a pretrained language model. Experimental results on six benchmark datasets (e.g., SIGHAN 2005 and SIGHAN 2008) confirm that HGNSeg effectively improves CWS performance. Notably, HGNSeg also alleviates the out-of-vocabulary (OOV) issue in cross-domain scenarios.
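The graph-convolution component named in the abstract can be illustrated with a minimal sketch. The snippet below implements the standard GCN propagation rule, H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W); the paper's actual layers may differ in detail, so treat this as an assumption-laden illustration rather than HGNSeg itself.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    This is the standard GCN propagation rule, shown here only as an
    illustrative sketch; HGNSeg's exact layer may differ.
    """
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt   # symmetric normalization
    return np.maximum(0.0, norm_adj @ features @ weight)

# Toy graph: four character nodes in a chain, 8-dim inputs, 4-dim outputs.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 4))
out = gcn_layer(adj, h, w)
print(out.shape)  # (4, 4)
```

In a segmenter, the per-node outputs would then feed a sequence-labeling head that predicts boundary tags for each character.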

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. The architecture of HGNSeg. "HGN" at the bottom of the figure illustrates the construction process of the heterogeneous graph. "L" denotes the lexicon; "N" denotes the n-gram lexicon.
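The graph-construction step this caption describes (linking characters to lexicon words and n-grams that cover them) can be sketched as follows. The function name and node encoding are hypothetical, chosen only for illustration; the paper's construction may differ.

```python
def build_char_word_edges(sentence, lexicon, max_len=4):
    """Build a character-word sub-graph: each character of the sentence is
    a node, each lexicon word (or n-gram) matched in the sentence becomes
    a word node, and every covered character links to that word node."""
    chars = list(sentence)
    nodes = [("char", i, c) for i, c in enumerate(chars)]
    edges = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_len, len(chars)) + 1):
            word = sentence[i:j]
            if word in lexicon:
                w_node = ("word", (i, j), word)
                nodes.append(w_node)
                for k in range(i, j):   # connect every covered character
                    edges.append((("char", k, chars[k]), w_node))
    return nodes, edges

# Toy example with a three-entry lexicon.
lexicon = {"北京", "大学", "北京大学"}
nodes, edges = build_char_word_edges("北京大学", lexicon)
print(len(nodes), len(edges))  # 7 8
```

A GCN run over such a graph lets each character aggregate evidence from all candidate words that span it, which is the intuition behind the heterogeneous sub-graphs in the figure.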

Table 1. Statistics of the six benchmark datasets

Table 2. Statistics of the six cross-domain datasets

Table 3. Experimental hyperparameter settings

Table 4. Performance comparison between HGNSeg and previous SOTA models on the test sets of six datasets, together with results of HGNSeg using different encoders. "+HGN" indicates that the model uses the heterogeneous graph network. F and $R_{oov}$ denote the F1 score and OOV recall rate, respectively. The best F1 and $R_{oov}$ for each dataset are highlighted. "*" denotes results from our reproduced experiments

Table 5. F1 scores and $R_{oov}$ on the test sets of the six cross-domain datasets. The best F1 and $R_{oov}$ scores for each domain dataset are highlighted

Figure 2. $R_{oov}$ with different sizes of the n-gram lexicon for four cross-domain datasets. We randomly pick different percentages of n-grams from each target corpus to build new n-gram lexicons.

Table 6. Segmentation results with different proportions of n-grams from the PT corpus. "0%" means the model uses no n-grams from the PT training set, and "100%" means it uses all n-grams from the PT training set

Table 7. Ablation experiments. "w/o $\mathscr{L}$/$\mathscr{N}$" means without the character-word/n-gram sub-graph; "w/o Dep." means without the dependency tree sub-graph

Table 8. Comparison of different types of lexicon in the character-word/n-gram sub-graph. "w/o $\mathscr{N}$" means the model without the n-gram lexicon; "w/o $\mathscr{L}$" means the model without the lexicon from the training set. The last row shows the model using $\mathscr{N}$ and $\mathscr{L}$ together

Figure 3. F1 scores and $R_{oov}$ of HGNSeg using two different syntax parsing toolkits, SCT and LTP 4.0, on six datasets.

Table 9. Segmentation results for five examples from the AS test set with different syntax parsing toolkits

Figure 4. The dependency tree parsing results of “” from two syntax parsing toolkits.

Table 10. Effect of different syntax parsing toolkits

Table 11. Statistical significance test of F-scores for our system and the system of Liu et al. (2021)