
Learning to rank for multi-label text classification: Combining different sources of information

Published online by Cambridge University Press:  18 February 2020

Hosein Azarbonyad*
Affiliation:
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; KLM Digital Studio, KLM, Amsterdam, The Netherlands
Mostafa Dehghani
Affiliation:
Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands
Maarten Marx
Affiliation:
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Affiliation:
Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands
*Corresponding author. E-mail: hosein.azarbonyad@gmail.com

Abstract

Efficiently exploiting all available sources of information, such as labeled instances, class representations, and the relations among classes, has a high impact on the performance of Multi-Label Text Classification (MLTC) systems. Most current approaches use labeled documents as the primary source of information for MLTC. We investigate the effectiveness of different sources of information, such as the labeled training data, the textual labels of classes, and the taxonomy relations between classes, for MLTC. More specifically, for each document–class pair, different features are extracted using different sources of information; the features reflect the similarity of classes and documents. MLTC is then cast as a ranking problem, and a learning to rank (LTR) approach is used to rank classes with respect to documents and to select the labels of documents. An important characteristic of many MLTC instances is that documents can belong to multiple classes and that there are implicit relations between classes. We apply score propagation on top of LTR to incorporate the co-occurrence patterns of classes in labeled documents. Our main findings are the following. First, using an LTR approach that integrates all features, we observe significantly better performance than previous MLTC systems; in particular, we show that simple classification approaches fail when the number of classes is high. Second, the analysis of feature weights reveals the relative importance of the various sources of evidence, also giving insight into the underlying classification problem. Interestingly, the results indicate that the titles of documents are more informative than all other sources of information. Third, a lean-and-mean system using only four features performs at 96% of the effectiveness of the large LTR model proposed in this paper. Fourth, using the co-occurrence information of classes helps in classifying documents more accurately; our results show that this information is most helpful when the underlying classifier performs poorly.
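
To illustrate the ranking formulation, the sketch below casts MLTC as class ranking with a pairwise linear ranker in the spirit of SVM-Rank. It is a minimal illustration rather than the paper's implementation: the toy data, the two similarity features, and all identifiers (class_text, rank_classes, top_k, and so on) are assumptions introduced here, whereas the full model uses 28 document–class features (see Table 1).

```python
# Minimal sketch (not the paper's implementation): MLTC as class ranking.
# For each document-class pair we build a small feature vector of
# document-class similarities, train a pairwise linear ranker, and label a
# document with its top-ranked classes. All data and names are illustrative.
import itertools
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

# Toy class "profiles" built from labeled text, plus short textual labels.
class_text = {"ECON": "budget tax finance market",
              "ENV": "climate emission pollution air",
              "AGRI": "farming crops livestock subsidies"}
class_label = {"ECON": "economics", "ENV": "environment", "AGRI": "agriculture"}
train_docs = [("carbon emission tax on industrial polluters", {"ENV", "ECON"}),
              ("market reform and the annual budget", {"ECON"})]

vec = TfidfVectorizer().fit([t for t, _ in train_docs]
                            + list(class_text.values()) + list(class_label.values()))

def features(doc, cls):
    """Two similarity features for one document-class pair
    (a small subset of the 28 features used in the paper)."""
    d = vec.transform([doc])
    return np.array([cosine_similarity(d, vec.transform([class_text[cls]]))[0, 0],
                     cosine_similarity(d, vec.transform([class_label[cls]]))[0, 0]])

# Pairwise training: a correct class should outrank an incorrect one for the
# same document, so we train a linear SVM on feature-vector differences.
X_pairs, y_pairs = [], []
for doc, gold in train_docs:
    for pos, neg in itertools.product(gold, set(class_text) - gold):
        diff = features(doc, pos) - features(doc, neg)
        X_pairs.append(diff); y_pairs.append(1)
        X_pairs.append(-diff); y_pairs.append(-1)
ranker = LinearSVC().fit(X_pairs, y_pairs)

def rank_classes(doc, top_k=1):
    """Rank all classes for a document and return the top_k labels."""
    scores = {c: ranker.decision_function([features(doc, c)])[0] for c in class_text}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(rank_classes("new emission limits for factories"))  # -> ['ENV'] on this toy data
```

A document is then labeled with its top-ranked classes; the paper additionally studies a dynamic threshold for choosing how many of the ranked classes to keep (see Table 6).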

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© Cambridge University Press 2020

Figure 1. The general pipeline of the proposed LTR method for MLTC. First, a subset of the dataset is used for constructing class representations. Additional class representations can be learned using other sources of information, such as the hierarchical structure of classes. Second, document representations are constructed for the documents in the remaining subset of the dataset. Then, based on the document and class representations, the similarities of documents and classes are estimated and used as features. After building the feature vectors, cross-validation is done to train and evaluate the LTR models.

Table 1. Features extracted from each document–class pair to train LTR models. Note that for features 1–8, the similarity is calculated using three different methods; therefore, each of these features represents three features. The size of the final feature vector is 28: 24 features based on the similarity of documents and classes (features 1–8) and 4 features based on the statistics of classes (features 9–12)

Figure 2. The graph of classes for matrix P introduced in Example 1.

Table 2. Performance of SVM, JEX, best single feature, and LTR methods for MLTC on the JRC1 dataset. We report incremental improvement and significance over JEX

Table 3. Performance of SVM, JEX, best single feature, and LTR methods for MLTC on the JRC2 dataset. We report incremental improvement and significance over JEX

Figure 3. Feature importance: (1) P@5 of individual features, (2) weights in the SVM-Rank model. $title\_D$ and $text\_D$ are the title and text representations of the document, respectively. $title\_C$, $text\_C$, $label\_C$, and $anc\_label\_C$ are the title, text, label, and ancestors' label representations of classes, respectively. This analysis is done on the model trained on the JRC1 dataset.

Table 4. Performance of LTR on all features compared to four selected features (LTR-ttgp) on the JRC1 dataset

Table 5. Performance of the score propagation approach on the JRC1 dataset. We report incremental improvement and significance of each score propagation approach over its non-propagated version

Figure 4. The effect of the $\alpha$ parameter on the precision achieved by propagating the scores of different text classification approaches. The number of iterations is set to 2.

Figure 5. Precision achieved in different iterations of the score propagation method for different text classification approaches. The value of $\alpha$ is set to 0.7.
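
Figures 4 and 5 vary the propagation parameter $\alpha$ and the number of iterations. The sketch below shows one common formulation of such a propagation step over a row-normalized class co-occurrence matrix P; it is an illustration under assumptions rather than the paper's exact update rule, and propagate_scores, the toy matrix, and the choice of propagating with the transpose of P are decisions made here.

```python
# Minimal sketch of score propagation over a class co-occurrence graph:
# classifier scores for a document are repeatedly smoothed with the scores of
# co-occurring classes; alpha trades propagated scores off against the
# original scores. One common formulation, not necessarily the paper's.
import numpy as np

def propagate_scores(s0, P, alpha=0.7, n_iter=2):
    """s0: initial per-class scores for one document, shape (n_classes,).
    P:  row-normalized class co-occurrence matrix, P[i, j] ~ Pr(class j | class i).
    Returns the propagated score vector after n_iter iterations."""
    s = s0.copy()
    for _ in range(n_iter):
        s = alpha * (P.T @ s) + (1.0 - alpha) * s0
    return s

# Toy example with three classes; classes 0 and 1 frequently co-occur.
P = np.array([[0.2, 0.7, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.1, 0.8]])
s0 = np.array([0.9, 0.1, 0.4])        # e.g., LTR scores for one document
print(propagate_scores(s0, P))        # class 1 gains score via its highly scored co-occurring class 0
```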

Table 6. Performance of the SVM, JEX, BM25-Titles, LTR, and Propagated LTR methods for MLTC using a dynamic threshold for selecting the number of classes. The significance tests are done on the improvements of each method using a dynamic threshold over its corresponding method using a static threshold (e.g., setting the number of classes to 5)

Figure 6. The distribution of the number of classes per document. The X-axis corresponds to the number of classes assigned to documents in the ground truth and the Y-axis corresponds to the number of documents (log scale).

Table 7. The root mean squared error (RMSE) and mean absolute error (MAE) between the assigned and actual number of classes per document, and the accuracy of the thresholding method in choosing the correct number of classes, for the SVM, JEX, BM25-Titles, LTR, and Propagated LTR methods for MLTC. The accuracy is calculated by dividing the number of documents for which the thresholding method picked the correct number of classes by the total number of documents. We also report RMSE, MAE, and accuracy for the fixed threshold method (choosing the top five classes)
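
For reference, the error measures reported in Table 7 can be computed as in the sketch below; the predicted and actual label counts are hypothetical, and count_errors is an illustrative helper rather than code from the paper.

```python
# Minimal sketch of the error measures described in Table 7, computed over
# hypothetical predicted vs. actual numbers of classes per document.
import numpy as np

def count_errors(predicted_counts, actual_counts):
    """RMSE, MAE, and accuracy of the predicted number of classes per document."""
    p = np.asarray(predicted_counts, dtype=float)
    a = np.asarray(actual_counts, dtype=float)
    rmse = np.sqrt(np.mean((p - a) ** 2))
    mae = np.mean(np.abs(p - a))
    accuracy = np.mean(p == a)        # fraction of documents with the exact count
    return rmse, mae, accuracy

# Fixed threshold (always five classes) vs. a hypothetical dynamic threshold.
actual = [3, 5, 6, 4, 5]
print(count_errors([5] * len(actual), actual))  # static: top five for every document
print(count_errors([3, 5, 5, 4, 6], actual))    # dynamic: varies per document
```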

Figure 7. Precision achieved using different sample sizes per class. The X-axis corresponds to the number of samples used per class for training the classifiers.

Figure 8. Precision achieved for different bins of documents in the JRC1 dataset, based on their actual number of classes in the ground-truth data. The X-axis corresponds to the number of classes per document.

Table A1. A summary of MLTC methods: "Label information" indicates whether label-specific information, such as a textual glossary, is used; "Ranking loss" indicates whether a ranking loss is used for training the classifier; and "Thresholding" indicates whether a thresholding method is used for selecting the number of labels.