
How reliable is visual lameness scoring? Assessing human label variability for use in automated detection systems

Published online by Cambridge University Press: 09 March 2026

Konstantina Linardopoulou*
Affiliation:
School of Biodiversity, One Health and Veterinary Medicine, University of Glasgow, Glasgow, UK
Lorenzo Viora
Affiliation:
School of Biodiversity, One Health and Veterinary Medicine, University of Glasgow, Glasgow, UK
George King
Affiliation:
School of Biodiversity, One Health and Veterinary Medicine, University of Glasgow, Glasgow, UK
Julien Le Kernec
Affiliation:
James Watt School of Engineering, University of Glasgow, Glasgow, UK
Nicholas N Jonsson
Affiliation:
School of Biodiversity, One Health and Veterinary Medicine, University of Glasgow, Glasgow, UK
*Corresponding author: Konstantina Linardopoulou, Email: konstantina.linardopoulou@glasgow.ac.uk

Abstract

Visual mobility scoring to detect lame dairy cattle can be subjective and inconsistent. This study assessed the reliability of visual mobility scores from multiple assessors, examining how scoring method (live vs. video) and assessor experience influence label quality for machine learning applications. We gathered data from two farms using the AHDB four-point mobility scale and a simplified post-hoc dichotomised version, with both live and video assessments. Scores showed substantial within- and between-assessor variation, particularly for scores 0 and 1 (consistent with normal and slightly abnormal gaits, respectively). Assessors showed only fair consistency (weighted kappa ≈ 0.33) when scoring the same animal by different methods (live vs. video). Post-hoc simplification of the four-level scores to a dichotomous score improved agreement but reduced granularity. Assessor experience had limited influence on agreement levels (P > 0.05), and increased video viewing frequency during assessment was associated with lower inter-assessor agreement (probability estimate = −0.49, P = 0.005), suggesting higher uncertainty in ambiguous cases. Qualitative feedback from assessor comments revealed that the speed of the animal affected scoring decisions (β = −1.92, P = 0.007). These results highlight the difficulty of using subjective human scores as labels for machine learning training. Improving automatic lameness detection in dairy cattle will require strategies to reduce this variation and to obtain more definitive labels.
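The reliability analysis centres on two steps: computing weighted kappa between assessors and dichotomising the four-level AHDB scale post hoc. As a minimal sketch of those two steps (not the authors' analysis code; the linear weighting and the scores below are assumptions for illustration):

```python
# Illustrative sketch, not the authors' analysis code.
# Assumes paired AHDB mobility scores (0-3) for the same cows from two
# assessors; linear weighting is an assumption -- the paper does not state it.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores for ten cows (0 = good mobility ... 3 = severely impaired)
assessor_1 = [0, 1, 1, 2, 0, 3, 1, 2, 0, 1]
assessor_2 = [0, 1, 2, 2, 1, 3, 0, 2, 0, 2]

# Weighted kappa on the original four-level AHDB scale
kappa_ahdb = cohen_kappa_score(assessor_1, assessor_2, weights="linear")

# Post-hoc dichotomisation: sound (0-1) vs. lame (2-3)
dich_1 = [int(s >= 2) for s in assessor_1]
dich_2 = [int(s >= 2) for s in assessor_2]
kappa_dich = cohen_kappa_score(dich_1, dich_2)  # unweighted: only two levels

print(f"AHDB weighted kappa: {kappa_ahdb:.2f}; dichotomised kappa: {kappa_dich:.2f}")
```

Weighting matters for ordinal scores: linear weights penalise a 0-vs-3 disagreement more heavily than a 0-vs-1 disagreement, which matches the ordinal nature of mobility scores; quadratic weighting would be an equally plausible choice where the paper does not specify.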

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Hannah Dairy Research Foundation.

Table 1. Study design for two farms (A and B), including the number of visits, cows assessed, scoring methods (on-site and video), number of assessors, scoring systems and modes used, and types of metadata collected during video-based assessments


Table 2. Thirteen assessors, differing in experience and occupation, scored the animals in the studies. Only two assessors (R1 and R2) participated in all four evaluations, and a further five assessed the videos from both Farm A and Farm B


Figure 1. Distribution of inter-assessor agreement (weighted kappa values) during live mobility scoring sessions on the two Farm A visits using the original four-level AHDB scale and a dichotomised version (sound: 0–1, lame: 2–3). Each violin plot shows the distribution of pairwise kappa values between assessors for Visit 1 (n = 15 pairs) and Visit 2 (n = 10 pairs). Boxplots within violins indicate the median and interquartile range, while individual data points are overlaid. Dashed lines represent kappa interpretation thresholds according to Landis & Koch (Fair: 0.21, Moderate: 0.41, Substantial: 0.61).
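The pair counts in this caption follow from n(n − 1)/2: 15 pairwise comparisons are consistent with six assessors on Visit 1 (6 × 5/2) and 10 with five on Visit 2 (5 × 4/2). A hedged sketch of how such a pairwise kappa distribution could be produced and mapped onto the Landis & Koch bands (assessor labels, scores, and the linear weighting are hypothetical):

```python
# Sketch: distribution of pairwise weighted kappas among assessors.
# Assessor labels and scores are hypothetical; six assessors give
# 6*5/2 = 15 pairwise comparisons, matching Visit 1 in Figure 1.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

scores = {  # hypothetical AHDB scores for the same five cows
    "A1": [0, 1, 2, 1, 3],
    "A2": [0, 1, 2, 2, 3],
    "A3": [1, 1, 2, 1, 2],
    "A4": [0, 0, 2, 1, 3],
    "A5": [0, 1, 1, 1, 3],
    "A6": [1, 1, 2, 2, 3],
}

THRESHOLDS = [(0.61, "substantial"), (0.41, "moderate"), (0.21, "fair")]

def interpret(kappa: float) -> str:
    """Landis & Koch bands used for the dashed lines in Figures 1-2."""
    for cutoff, label in THRESHOLDS:
        if kappa >= cutoff:
            return label
    return "slight/poor"

for (a, sa), (b, sb) in combinations(scores.items(), 2):
    k = cohen_kappa_score(sa, sb, weights="linear")
    print(f"{a} vs {b}: kappa = {k:.2f} ({interpret(k)})")
```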


Figure 2. Distribution of weighted kappa agreement scores for assessor pairwise comparisons in Farm A and Farm B video scoring. Violin plots illustrate the density of agreement values, with overlaid boxplots showing median and interquartile ranges, and jittered points representing individual pairwise scores. Colours differentiate the scoring scales (AHDB and dichotomised). Horizontal dashed lines indicate agreement strength thresholds according to Landis & Koch interpretation (Fair = 0.21, Moderate = 0.41, Substantial = 0.61).


Figure 3. Heat-map of pairwise co-occurrence of assessor comment themes (N = 77 comments). Each cell shows the number of comments in which the two themes were mentioned together; darker colours indicate more frequent co-occurrence. Diagonal cells (theme with itself) give the total frequency of that theme in the dataset.
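A co-occurrence matrix like the one in Figure 3 can be assembled directly from per-comment theme sets: the diagonal holds each theme's total frequency and off-diagonal cells count joint mentions. A minimal sketch (the theme names below are invented for illustration and are not the paper's coding scheme):

```python
# Sketch: theme co-occurrence counts of the kind shown in Figure 3.
# Each comment is a set of themes; the matrix counts how often two
# themes are mentioned in the same comment. Theme names are hypothetical.
from itertools import combinations
from collections import Counter

comments = [  # hypothetical theme annotations for assessor comments
    {"speed", "head bobbing"},
    {"speed"},
    {"back arch", "head bobbing"},
    {"speed", "back arch", "video quality"},
    {"video quality"},
]

pair_counts = Counter()
theme_totals = Counter()
for themes in comments:
    theme_totals.update(themes)           # diagonal: total frequency per theme
    for a, b in combinations(sorted(themes), 2):
        pair_counts[(a, b)] += 1          # off-diagonal: co-mention counts

print(dict(theme_totals))
print(dict(pair_counts))
```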