
Prompt tuning discriminative language models for hierarchical text classification

Published online by Cambridge University Press:  10 October 2024

Jaco du Toit*
Affiliation:
Department of Mathematical Sciences, Computer Science Division, Stellenbosch University, Stellenbosch, South Africa; School for Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
Marcel Dunaiski
Affiliation:
Department of Mathematical Sciences, Computer Science Division, Stellenbosch University, Stellenbosch, South Africa; School for Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
Corresponding author: Jaco du Toit; Email: jacowdutoit11@gmail.com

Abstract

Hierarchical text classification (HTC) is a natural language processing task which aims to categorise a text document into a set of classes from a hierarchical class structure. Recent approaches to HTC focus on leveraging pre-trained language models (PLMs) and the hierarchical class structure by allowing these components to interact in various ways. Specifically, the Hierarchy-aware Prompt Tuning (HPT) method has proven effective in applying the prompt tuning paradigm to Bidirectional Encoder Representations from Transformers (BERT) models for HTC tasks. Prompt tuning aims to reduce the gap between the pre-training and fine-tuning phases by transforming the downstream task into the pre-training task of the PLM. Discriminative PLMs, which use a replaced token detection (RTD) pre-training task, have also been shown to perform better on flat text classification tasks when prompt tuning is used instead of vanilla fine-tuning. In this paper, we propose the Hierarchy-aware Prompt Tuning for Discriminative PLMs (HPTD) approach, which injects the HTC task into the RTD task used to pre-train discriminative PLMs. Furthermore, we make several improvements to the prompt tuning approach for discriminative PLMs that enable HTC tasks to scale to much larger hierarchical class structures. Through comprehensive experiments, we show that our method is robust and outperforms current state-of-the-art approaches on two of the three HTC benchmark datasets.
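The input construction the abstract describes, in which a template of per-level prompts and class tokens is appended to the text, can be sketched in plain Python. The token names, the budget calculation, and the maximum length of 512 are illustrative assumptions, not the paper's exact implementation:

```python
def build_input(text_tokens, prompts_per_level, level_classes, max_len=512):
    """Assemble an HPTD-style input sequence (illustrative sketch).

    text_tokens: the tokenised document; prompts_per_level: V prompt
    tokens inserted for every hierarchy level; level_classes: one list
    of class names per level, ordered from the top of the hierarchy down.
    """
    template = []
    for level, classes in enumerate(level_classes):
        # V (soft) prompt tokens for this level ...
        template += [f"[PROMPT_{level}_{i}]" for i in range(prompts_per_level)]
        # ... followed by one token per class at this level.
        template += [f"[CLASS_{c}]" for c in classes]
    # Whatever the template does not consume remains usable for input text.
    budget = max_len - len(template) - 2  # reserve [CLS] and [SEP]
    return ["[CLS]"] + list(text_tokens)[:budget] + template + ["[SEP]"]
```

Because the template grows with the number of classes in the hierarchy, the space left for text tokens shrinks accordingly, which is the scaling trade-off the paper's improvements address.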

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. The HPTD architecture during training. HPTD modifies the text input sequence of length $T$ (orange) by appending a template that comprises $V$ prompts for each of the $K$ levels in the class hierarchy (green), followed by a representation of each class in the associated level (yellow). The tokens are passed through the discriminative PLM to obtain the outputs for the class tokens, which are used to drive the training process.


Figure 2. High-level procedure used to initialise the hierarchical prompt embeddings. We construct the hierarchical class graph and attach virtual nodes which form the prompts associated with each level. The graph is passed through a GAT which obtains the embeddings for the prompts associated with each level.


Table 1. Comparison of token space available for input text tokens using the DPT and HPTD approaches for different illustrative hierarchical class structures. The column ‘Additional tokens’ shows the improvement of the HPTD approach in terms of usable tokens for input text


Table 2. Characteristics of the benchmark HTC datasets. The columns ‘Levels’ and ‘Classes’ give the number of levels and classes in the class structure. ‘Avg. Classes’ is the average number of classes per document, while ‘Train’, ‘Dev’, and ‘Test’ are the number of instances in each of the dataset splits


Table 3. The average per-level branching factor of the hierarchy in each benchmark dataset, which is calculated as the average number of child nodes for the nodes at a particular level. The number of nodes per level is given in parentheses
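The per-level branching factor defined in the Table 3 caption can be computed directly from a parent-to-children mapping. The function name and data layout below are hypothetical; dataset loading is omitted:

```python
def branching_factors(children, levels):
    """Average number of child nodes over the nodes at each level.

    children: dict mapping a node to its list of child nodes;
    levels: list of node lists, one per hierarchy level, root level first.
    """
    return [sum(len(children.get(node, [])) for node in level) / len(level)
            for level in levels]
```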


Table 4. Performance comparisons of the HPTD approach on the three commonly used benchmark datasets. Standard deviations for the proposed methods are given in parentheses


Table 5. Results on the benchmark datasets when removing components of the HPTD-ELECTRA model


Figure 3. Level-wise performance results of our approaches on the three benchmark datasets. The bar plots present the F1 scores (left y-axis) for the different approaches at each level of the hierarchy, while the line plot shows the average training instances for the classes at a particular level (right y-axis).


Table 6. Performance comparisons of the different threshold selection approaches. ‘0.5’ uses a fixed 0.5 threshold for each class, while ‘Single’ uses a single tuned threshold for each class. ‘Level’ and ‘Class’ use a tuned threshold for each level and class, respectively
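One common way to obtain the tuned thresholds compared in Table 6 is to sweep a candidate grid on the development set and keep the value maximising per-class F1. This is a hedged sketch of that idea; the paper's exact selection criterion and grid may differ:

```python
def tune_threshold(dev_probs, dev_labels, grid=None):
    """Return the decision threshold that maximises F1 for one class,
    given development-set probabilities and binary labels (0/1)."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]

    def f1(t):
        tp = sum(1 for p, y in zip(dev_probs, dev_labels) if p >= t and y)
        fp = sum(1 for p, y in zip(dev_probs, dev_labels) if p >= t and not y)
        fn = sum(1 for p, y in zip(dev_probs, dev_labels) if p < t and y)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(grid, key=f1)
```

Running this once per class gives the 'Class' variant; pooling the dev predictions of all classes at one level before tuning gives the 'Level' variant.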


Table 7. Performance results of the HPTD models under a low-resource scenario where only 10% of the training data is used. For comparison, the achieved results when all training data is used are shown in parentheses


Figure 4. Level-wise performance results of our approaches on the three benchmark datasets in the low-resource scenario. The bar plots present the F1 scores (left y-axis) for the different approaches at each level of the hierarchy, while the line plot shows the average training instances for the classes at a particular level (right y-axis).