Semi-supervised machine learning with word embedding for classification in price statistics

Published online by Cambridge University Press:  07 September 2020

Hazel Martindale*
Affiliation:
Methodology Division, Office for National Statistics, Newport, United Kingdom
Edward Rowland
Affiliation:
Methodology Division, Office for National Statistics, Newport, United Kingdom
Tanya Flower
Affiliation:
Prices Division, Office for National Statistics, Newport, United Kingdom
Gareth Clews
Affiliation:
Methodology Division, Office for National Statistics, Newport, United Kingdom
*Corresponding author. Email: Hazel.Martindale@ons.gov.uk

Abstract

The Office for National Statistics (ONS) is currently undertaking a substantial research program into using price information scraped from online retailers in the Consumer Prices Index including owner occupiers’ housing costs (CPIH). To make full use of these data, we must classify them into the product types that make up the basket of goods and services used in the current collection. A common problem is that the amount of labeled training data is limited and it is impossible or impractical to increase it manually, as is the case with web-scraped price data. We use a semi-supervised machine learning (ML) method, label propagation, to develop a pipeline that increases the number of labels available for classification. We apply several techniques, in succession and in parallel, to raise confidence in the final enlarged labeled dataset, which is then used to train a traditional ML classifier. On a test sample of data, this method achieves promising results, with good precision and recall for both the propagated labels and the classifiers trained from those labels. We show that by combining several techniques and averaging their results, we can increase the usability of a dataset with limited labeled training data, a common obstacle to applying ML in real-world situations. In future work, we will investigate how this method can be scaled up for use in future CPIH calculations and the challenges this brings.
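As a rough illustration of the pipeline the abstract describes (a small set of seed labels, label propagation over word vectors, then a traditional classifier trained on the enlarged labeled set), the following is a minimal sketch using scikit-learn. The product names, categories, and parameter choices are invented for illustration and are not taken from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation
from sklearn.ensemble import RandomForestClassifier

products = [
    "mens blue denim jeans", "mens slim fit jeans",
    "womens floral summer dress", "womens midi dress",
    "mens straight leg jeans", "womens wrap dress",
]
# -1 marks unlabelled items; only one seed label exists per category.
seed_labels = np.array([0, -1, 1, -1, -1, -1])

# Vectorize the product names (the paper also explores word2vec and fastText).
X = TfidfVectorizer().fit_transform(products).toarray()

# Spread the seed labels across a nearest-neighbour graph of the vectors.
propagator = LabelPropagation(kernel="knn", n_neighbors=2)
propagator.fit(X, seed_labels)
full_labels = propagator.transduction_  # labels for all six items

# Train a traditional classifier on the now fully labelled data.
clf = RandomForestClassifier(random_state=0).fit(X, full_labels)
```

In the paper this basic idea is repeated per category and combined across several methods and metrics; this sketch shows only a single pass.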

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Office for National Statistics, 2020. Published by Cambridge University Press in association with Data for Policy
Table 1. Categories used in the clothing data with the number of each category present in the sample

Table 2. Demonstration of a simple word vectorization method, count vectorization
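Count vectorization, the simple method demonstrated in Table 2, maps each product name to a vector of token counts over a shared vocabulary. A minimal standard-library sketch (the example product names are invented):

```python
from collections import Counter

def count_vectorize(documents):
    """Map each document to a vector of token counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    counts = [Counter(doc) for doc in tokenized]
    return vocabulary, [[c[term] for term in vocabulary] for c in counts]

vocab, vectors = count_vectorize(["mens blue jeans", "womens blue dress"])
# vocab   -> ['blue', 'dress', 'jeans', 'mens', 'womens']
# vectors -> [[1, 0, 1, 1, 0], [1, 1, 0, 0, 1]]
```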

Figure 1. Two-dimensional projection of the word vectors for a sample of clothing items, generated using the term frequency–inverse document frequency algorithm.
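A projection like the one in Figure 1 can be produced by fitting TF–IDF vectors and reducing them to two dimensions. The sketch below uses scikit-learn with PCA as the dimensionality-reduction step; the paper does not state which projection method it used, and the item names are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

items = [
    "mens blue jeans", "mens black jeans",
    "womens summer dress", "womens evening dress",
]
tfidf = TfidfVectorizer().fit_transform(items)
# Project the sparse TF-IDF vectors down to two dimensions for plotting.
coords = PCA(n_components=2).fit_transform(tfidf.toarray())
```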

Figure 2. Two-dimensional projection of the word vectors for a sample of clothing items generated using the word2vec algorithm.

Figure 3. Two-dimensional projection of the word vectors for a sample of clothing items generated using the fastText algorithm.

Figure 4. Illustration of the principles of the label propagation algorithm. The left-hand panel shows the state before propagation, including the seed labels and the graph. The right-hand panel shows the result of running the label propagation algorithm, which correctly labels the two clusters.
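The two-cluster scenario in Figure 4 can be reproduced directly with scikit-learn's `LabelPropagation`: one seed label per cluster is enough for every point to inherit its cluster's label. The point coordinates and kernel parameters below are invented for illustration.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two well-separated point clusters, as in the figure's left-hand panel.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
cluster_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X = np.vstack([cluster_a, cluster_b])

# One seed label per cluster; -1 marks the unlabelled points.
y = np.full(40, -1)
y[0], y[20] = 0, 1

model = LabelPropagation(kernel="rbf", gamma=1.0).fit(X, y)
# After propagation, every point in each cluster carries its seed's label,
# as in the figure's right-hand panel.
```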

Figure 5. Diagram of the full pipeline for labeling and classifying the data for one item category, showing every step the data moves through, including those where the data divides across multiple methods or metrics. The entire process, from fuzzy matching to the end of label propagation, is run for each category; the final labels are then combined and used to train a classifier.

Table 3. Results of the fuzzy matching part of the label propagation pipeline across all nine target categories
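Fuzzy matching, the first stage of the pipeline assessed in Table 3, assigns a seed label to a product whose name closely resembles a known phrase despite typos or truncation. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the paper does not specify its matcher, and the threshold and example strings are invented.

```python
from difflib import SequenceMatcher

def fuzzy_match(name, label_phrases, threshold=0.8):
    """Assign a seed label when a product name closely resembles a known phrase."""
    best = max(label_phrases, key=lambda p: SequenceMatcher(None, name, p).ratio())
    score = SequenceMatcher(None, name, best).ratio()
    # Below the threshold the item stays unlabelled, to be handled later
    # by label propagation.
    return best if score >= threshold else None

print(fuzzy_match("womens dres", ["womens dress", "mens jeans"]))
```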

Table 4. Classification metrics for the label propagation algorithm used on the clothing data sample

Figure 6. Confusion matrix for the label propagation algorithm. The categories are those shown in Table 4, where W stands for women’s, M for men’s, G for girls’, and B for boys’.

Table 5. Weighted macro average assessment metrics for the three machine learning classifiers using word vectors built using the term frequency–inverse document frequency method

Table 6. Weighted macro average assessment metrics for the three machine learning classifiers using word vectors trained with the fastText model

Table 7. Weighted macro average assessment metrics for the three machine learning classifiers using word vectors trained with the word2vec model

Figure A1. Confusion matrices for the three classification methods with term frequency–inverse document frequency vectors. The top-left panel shows the decision tree, the top-right the random forest, and the bottom the nonlinear support vector machine. These are the confusion matrices for the test dataset.

Table A1. Category by category breakdown of the metrics for the classifiers using TF-IDF vectors

Figure B1. Confusion matrices for the three classification methods with fastText vectors. The top-left panel shows the decision tree, the top-right the random forest, and the bottom the nonlinear support vector machine. These are the confusion matrices for the test dataset.

Table B1. Category by category breakdown of the metrics for the classifiers using fastText vectors

Figure C1. Confusion matrices for the three classification methods with word2vec vectors. The top-left panel shows the decision tree, the top-right the random forest, and the bottom the nonlinear support vector machine. These are the confusion matrices for the test dataset.

Table C1. Category by category breakdown of the metrics for the classifiers using word2vec vectors
