Using machine learning for communication classification

Published online by Cambridge University Press:  14 March 2025

Stefan P. Penczynski*
Affiliation:
School of Economics, University of East Anglia, Norwich, UK

Abstract

This study explores the value of machine learning techniques for classifying communication content in experiments. Previously human-coded datasets are used both to train and to test algorithm-generated models that relate word counts to categories. For various games, the computer models match the human classification out of sample to a considerable extent. The analysis raises the hope that the substantial effort going into such studies can be reduced by using computer algorithms for classification. This would enable quick and replicable analysis of large-scale datasets at reasonable cost and widen the applicability of such approaches. The paper gives an easily accessible technical introduction to the computational method.
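
To make the idea of relating word counts to human-coded categories concrete, the following is a minimal sketch and not the paper's implementation. It assumes a deliberately simplified model: a single-split decision stump that picks the token whose presence (count of at least one) best separates the categories by Gini impurity, evaluated out of sample with leave-one-out cross-validation. The messages, labels, and all function names are invented for illustration.

```python
from collections import Counter


def tokenize(message):
    """Lowercase a message and split it into word tokens."""
    return message.lower().split()


def gini(labels):
    """Gini impurity of a list of category labels (0.0 for a pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, default):
    """Most frequent label, or `default` if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(messages, labels):
    """Pick the token whose presence/absence split has the lowest
    size-weighted Gini impurity; ties go to the alphabetically first token."""
    token_sets = [set(tokenize(m)) for m in messages]
    vocab = sorted(set().union(*token_sets))
    overall = majority(labels, None)
    best = None
    for tok in vocab:
        present = [y for s, y in zip(token_sets, labels) if tok in s]
        absent = [y for s, y in zip(token_sets, labels) if tok not in s]
        score = (len(present) * gini(present)
                 + len(absent) * gini(absent)) / len(labels)
        if best is None or score < best[0]:
            best = (score, tok, majority(present, overall),
                    majority(absent, overall))
    return best[1:]  # (token, label if present, label if absent)


def predict(stump, message):
    """Classify a new message with a fitted stump."""
    tok, if_present, if_absent = stump
    return if_present if tok in tokenize(message) else if_absent


def loo_accuracy(messages, labels):
    """Leave-one-out cross-validation: train on all but one message,
    predict the held-out message, and report the share of hits."""
    hits = 0
    for i in range(len(messages)):
        stump = fit_stump(messages[:i] + messages[i + 1:],
                          labels[:i] + labels[i + 1:])
        hits += predict(stump, messages[i]) == labels[i]
    return hits / len(messages)


# Hypothetical toy data: 0 = no strategic reasoning, 1 = some reasoning.
messages = [
    "i pick a random number",
    "just choose any number",
    "no idea pick something",
    "i think they will pick high so we go lower",
    "they think like us so go lower",
    "i think others choose high so lower",
]
labels = [0, 0, 0, 1, 1, 1]

stump = fit_stump(messages, labels)
accuracy = loo_accuracy(messages, labels)
```

With so few messages, the held-out accuracy can fall below the in-sample fit because the chosen token changes from fold to fold, which is the reason for testing out of sample in the first place.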

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s) 2019

Fig. 1 Exemplary decision tree

Fig. 2 Message tokens in the BCHS dataset. $M = 176$, $T = 98$, $\sum_t x_t = 1605$, $x_{\text{think}} = 127$

Fig. 3 Message tokens in the BCHS dataset by level

Table 1 Bivariate correlations and linear regression between token count and level of reasoning in BCHS

Table 2 Human classification versus computer prediction from cross-validation in BCHS

Fig. 4 Variable importance in the BCHS dataset

Fig. 5 Message tokens in the SL dataset by level

Table 3 Bivariate correlations and linear regression between token counts and level of reasoning in SL

Fig. 6 Variable importance in the SL dataset

Table 4 Human classification versus computer prediction from the cross-validation in SL. ρ gives the correlation coefficient

Table 5 Payoff structure of coordination games

Fig. 7 Message tokens in the APC dataset by level

Table 6 Bivariate correlations and linear regressions between word counts and level of reasoning

Fig. 8 Variable importance in the APC dataset

Table 7 Human classification versus computer prediction from cross-validation

Table 8 Level averages of human and computer classifications by APC game

Fig. 9 Message tokens in the APC dataset by payoff salience

Table 9 Human payoff salience classification versus computer prediction from cross-validation

Fig. 10 Variable importance in the APC dataset. Classification model with Gini criterion

Fig. 11 Message tokens in the APC dataset by label salience

Table 10 Human classification versus computer prediction from cross-validation

Table 11 Coding performance of regression and classification models in APC (N=851) depending on the size of the training set

Supplementary material

Penczynski supplementary material 1 (File, 8 KB)
Penczynski supplementary material 2 (File, 55 KB)