
Statistical tools to improve assessing agreement between several observers

Published online by Cambridge University Press:  24 January 2014

I. Ruddat*
Affiliation:
Department of Biometry, Epidemiology and Information Processing, WHO Collaborating Centre for Research and Training in Veterinary Public Health, University of Veterinary Medicine, Hannover, Germany
B. Scholz
Affiliation:
Friedrich-Loeffler-Institut, Institute of Animal Welfare and Animal Husbandry, Celle, Germany
S. Bergmann
Affiliation:
Department of Veterinary Science, Faculty of Veterinary Medicine, Chair of Animal Welfare, Ethology, Animal Hygiene and Animal Housing, Ludwig-Maximilians-University, Munich, Germany
A.-L. Buehring
Affiliation:
Friedrich-Loeffler-Institut, Institute of Animal Welfare and Animal Husbandry, Celle, Germany
S. Fischer
Affiliation:
Institute for Animal Breeding and Genetics, University of Veterinary Medicine, Hannover, Germany
A. Manton
Affiliation:
Department of Farm Animal Ethology and Poultry Science, University of Hohenheim, Stuttgart, Germany
D. Prengel
Affiliation:
Department of Veterinary Science, Faculty of Veterinary Medicine, Chair of Animal Welfare, Ethology, Animal Hygiene and Animal Housing, Ludwig-Maximilians-University, Munich, Germany
E. Rauch
Affiliation:
Department of Veterinary Science, Faculty of Veterinary Medicine, Chair of Animal Welfare, Ethology, Animal Hygiene and Animal Housing, Ludwig-Maximilians-University, Munich, Germany
S. Steiner
Affiliation:
Department of Veterinary Science, Faculty of Veterinary Medicine, Chair of Animal Welfare, Ethology, Animal Hygiene and Animal Housing, Ludwig-Maximilians-University, Munich, Germany
S. Wiedmann
Affiliation:
Bavarian State Research Center for Agriculture, Kitzingen, Germany
L. Kreienbrock
Affiliation:
Department of Biometry, Epidemiology and Information Processing, WHO Collaborating Centre for Research and Training in Veterinary Public Health, University of Veterinary Medicine, Hannover, Germany
A. Campe
Affiliation:
Department of Biometry, Epidemiology and Information Processing, WHO Collaborating Centre for Research and Training in Veterinary Public Health, University of Veterinary Medicine, Hannover, Germany

Abstract

In the context of assessing the impact of management and environmental factors on animal health, behaviour or performance, it has become increasingly important to conduct (epidemiological) studies in the field. The number of farms investigated per study is therefore considerable, so that numerous observers are needed for data collection. To maintain the quality and validity of study results, calibration meetings, in which observers are trained and the current level of agreement is assessed, have to be conducted to minimise the observer effect. When study animals are rated independently by the same observers on a categorical variable, the exclusion test can be performed to identify disagreeing observers. For each variable and each observer, this statistical test compares the observer-specific agreement with the overall agreement among all observers on the basis of kappa coefficients. It accounts for two major challenges, namely the absence of a gold-standard observer and differing data types comprising ordinal, nominal and binary data. The presented methods are applied to a reliability study assessing the agreement among eight observers rating welfare parameters of laying hens. The degree to which the observers agreed depended on the investigated item (global weighted kappa coefficients: 0.37 to 0.94). The proposed method and graphical description served to assess the direction and degree to which an observer deviates from the others. We suggest further improving studies with numerous observers by conducting calibration meetings and accounting for observer bias.
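The underlying idea of comparing each observer's agreement with the group against the overall agreement can be illustrated in a few lines of Python. This is a minimal sketch only, not the authors' exclusion test: it uses unweighted pairwise Cohen's kappa and simple means, whereas the paper's test involves a formal statistical comparison of kappa coefficients. The observer names and ratings below are hypothetical.

```python
from itertools import combinations

def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters' categorical ratings."""
    n = len(r1)
    cats = set(r1) | set(r2)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n                   # observed agreement
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)  # chance agreement
    if p_e == 1.0:                                                  # degenerate: a single category
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of six animals on one binary welfare item by three observers.
ratings = {
    "A": [1, 1, 0, 0, 1, 0],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [0, 1, 1, 0, 1, 1],
}

# Kappa for every observer pair, then the overall agreement among all observers.
pairwise = {(i, j): cohen_kappa(ratings[i], ratings[j])
            for i, j in combinations(ratings, 2)}
overall = sum(pairwise.values()) / len(pairwise)

# Observer-specific agreement: mean kappa of one observer against all others.
specific = {o: sum(k for pair, k in pairwise.items() if o in pair) / (len(ratings) - 1)
            for o in ratings}

for o, k in specific.items():
    print(f"observer {o}: mean kappa {k:.2f} (overall {overall:.2f})")
```

Here observer C, whose mean kappa falls clearly below the overall value, would be the candidate flagged for retraining at a calibration meeting.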

Type
Full Paper
Copyright
© The Animal Consortium 2014 

