
A normative model for Bayesian combination of subjective probability estimates

Published online by Cambridge University Press:  24 November 2023

Susanne Trick*
Affiliation:
Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Germany Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany
Constantin A. Rothkopf
Affiliation:
Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Germany Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany Frankfurt Institute for Advanced Studies, Goethe University, Frankfurt, Germany
Frank Jäkel
Affiliation:
Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Germany Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany
*
Corresponding author: Susanne Trick; Email: susanne.trick@tu-darmstadt.de

Abstract

Combining experts’ subjective probability estimates is a fundamental task with broad applicability in domains ranging from finance to public health. However, it is still an open question how to combine such estimates optimally. Since the beta distribution is a common choice for modeling uncertainty about probabilities, here we propose a family of normative Bayesian models for aggregating probability estimates based on beta distributions. We systematically derive and compare different variants, including hierarchical and non-hierarchical as well as asymmetric and symmetric beta fusion models. Using these models, we show how the beta calibration function naturally arises in this normative framework and how it is related to the widely used Linear-in-Log-Odds calibration function. For evaluation, we provide the new Knowledge Test Confidence data set consisting of subjective probability estimates of 85 forecasters on 180 queries. On this and another data set, we show that the hierarchical symmetric beta fusion model performs best of all beta fusion models and outperforms related Bayesian fusion models in terms of mean absolute error.
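The core idea behind such a beta fusion model can be sketched as follows (a minimal illustration only; the parameter values and the independence assumption are ours, not the fitted models from the article): each forecaster's stated probability is treated as a draw from a beta distribution whose shape depends on the true answer, and Bayes' rule combines the forecasts into a posterior probability.

```python
import math

def beta_log_pdf(x, a, b):
    """Log-density of a Beta(a, b) distribution at x in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def fuse(forecasts, a1=3.0, b1=1.5, a0=1.5, b0=3.0, prior=0.5):
    """Posterior P(t = 1 | forecasts), assuming each forecast is an
    independent beta draw: Beta(a1, b1) when the answer is true,
    Beta(a0, b0) when it is false. Parameter values are illustrative."""
    log_odds = math.log(prior) - math.log(1 - prior)
    for x in forecasts:
        log_odds += beta_log_pdf(x, a1, b1) - beta_log_pdf(x, a0, b0)
    return 1 / (1 + math.exp(-log_odds))
```

With the symmetric choice above (the true-answer shape mirrors the false-answer shape), high forecasts push the posterior toward 1 and low forecasts toward 0; the hierarchical variants in the article additionally share calibration parameters across forecasters.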

Information

Type
Theory Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of Society for Judgment and Decision Making and European Association of Decision Making

Figure 1 Graphical models of one of the models proposed by Turner et al. (2014), Calibrate then Average, (a) and an exemplary normative fusion model (b). The models are simplified to show only the forecasts $x^k$ of K forecasters for a single query with truth value t.


Figure 2 Graphical models of the hierarchical (a) and non-hierarchical (b) beta fusion models.


Figure 3 Graphical models of the hierarchical (a) and non-hierarchical (b) symmetric beta fusion models.


Figure 4 Model performances on the Turner data set according to Brier score, 0–1 loss, and mean absolute error. We compare the scores’ means and standard errors of the mean of our beta fusion models, the Hierarchical Beta Fusion Model (HB), the non-hierarchical Beta Fusion Model (B), the Hierarchical Symmetric Beta Fusion Model (HSB), and the non-hierarchical Symmetric Beta Fusion Model (SB), the models by Turner et al. (2014), Average then Calibrate (ATC), Calibrate then Average (CTA), Calibrate then Average using log-odds (CTALO), Hierarchical Calibrate then Average (HCTA), and Hierarchical Calibrate then Average on log-odds (HCTALO), and the two baseline methods Unweighted Linear Opinion Pool (ULINOP) and Probit Average (PAVG).
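For reference, the three evaluation scores and the two baselines named in the captions can be computed as in this minimal sketch (our own stdlib implementation, not the authors' code):

```python
from statistics import NormalDist, fmean

def brier(preds, truths):
    """Brier score: mean squared error between probabilities and 0/1 truths."""
    return fmean((p - t) ** 2 for p, t in zip(preds, truths))

def zero_one_loss(preds, truths):
    """0-1 loss: fraction of queries misclassified at threshold 0.5."""
    return fmean(float((p >= 0.5) != bool(t)) for p, t in zip(preds, truths))

def mae(preds, truths):
    """Mean absolute error between probabilities and 0/1 truths."""
    return fmean(abs(p - t) for p, t in zip(preds, truths))

def ulinop(forecasts):
    """Unweighted Linear Opinion Pool: plain average of the forecasts."""
    return fmean(forecasts)

def probit_average(forecasts):
    """Probit Average: average forecasts in probit space, map back."""
    nd = NormalDist()
    return nd.cdf(fmean(nd.inv_cdf(x) for x in forecasts))
```

All three scores are lower-is-better; the fusion models are scored on their posterior probability for each query against the query's truth value.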


Figure 5 Model performances on the Knowledge Test Confidence data set according to Brier score, 0–1 loss, and mean absolute error. We compare the scores’ means and standard errors of the mean of our beta fusion models, the Hierarchical Beta Fusion Model (HB), the non-hierarchical Beta Fusion Model (B), the Hierarchical Symmetric Beta Fusion Model (HSB), and the non-hierarchical Symmetric Beta Fusion Model (SB), the models by Turner et al. (2014), Average then Calibrate (ATC), Calibrate then Average (CTA), Calibrate then Average using log-odds (CTALO), Hierarchical Calibrate then Average (HCTA), and Hierarchical Calibrate then Average on log-odds (HCTALO), and the two baseline methods Unweighted Linear Opinion Pool (ULINOP) and Probit Average (PAVG).


Figure 6 Model performances on the reduced Turner data set consisting of a subset of the 20 forecasters of the Turner data set that provided the most forecasts. We compare the means and standard errors of the mean of Brier scores, 0–1 losses, and mean absolute errors of our beta fusion models, the Hierarchical Beta Fusion Model (HB), the non-hierarchical Beta Fusion Model (B), the Hierarchical Symmetric Beta Fusion Model (HSB), and the non-hierarchical Symmetric Beta Fusion Model (SB), the models by Turner et al. (2014), Average then Calibrate (ATC), Calibrate then Average (CTA), Calibrate then Average using log-odds (CTALO), Hierarchical Calibrate then Average (HCTA), and Hierarchical Calibrate then Average on log-odds (HCTALO), and the two baseline methods Unweighted Linear Opinion Pool (ULINOP) and Probit Average (PAVG).


Figure 7 Model performances on the reduced Knowledge Test Confidence (KTeC) data set consisting of a subset of the first 10 forecasters of the KTeC data set. We compare the means and standard errors of the mean of Brier scores, 0–1 losses, and mean absolute errors of our beta fusion models, the Hierarchical Beta Fusion Model (HB), the non-hierarchical Beta Fusion Model (B), the Hierarchical Symmetric Beta Fusion Model (HSB), and the non-hierarchical Symmetric Beta Fusion Model (SB), the models by Turner et al. (2014), Average then Calibrate (ATC), Calibrate then Average (CTA), Calibrate then Average using log-odds (CTALO), Hierarchical Calibrate then Average (HCTA), and Hierarchical Calibrate then Average on log-odds (HCTALO), and the two baseline methods Unweighted Linear Opinion Pool (ULINOP) and Probit Average (PAVG).


Figure 8 The empirical, beta calibration (BC), and LLO curves of two exemplary forecasters from the Knowledge Test Confidence data set (top row) together with the respective asymmetric and symmetric beta distributions (bottom row). (a) shows the calibration curves and respective beta distributions of forecaster 57, for whom the LLO and beta calibration curves are similar. (b) shows the calibration curves and beta distributions of forecaster 46, for whom the beta calibration function tends to 0 as $x\rightarrow 1$, causing miscalibration.
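To illustrate the relation between the two calibration curves in this figure: under one common parameterization of the beta calibration function (the parameter names here are illustrative, following Kull et al.'s formulation rather than necessarily this article's), it reduces to the Linear-in-Log-Odds (LLO) function exactly when its two exponents coincide.

```python
import math

def llo(x, delta, gamma):
    """Linear-in-Log-Odds calibration: shifts and scales log-odds."""
    num = delta * x ** gamma
    return num / (num + (1 - x) ** gamma)

def beta_cal(x, a, b, c):
    """Beta calibration with exponents a, b and offset c.
    For a == b and c == log(delta), this equals llo(x, delta, a)."""
    return 1 / (1 + math.exp(-c) * (1 - x) ** b / x ** a)
```

With unequal exponents a and b, the beta calibration curve can behave asymmetrically near 0 and 1, which is how it can diverge from the LLO curve for a forecaster like the one shown in panel (b).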

Supplementary material: File
Trick et al. supplementary material
Download supplementary material (File, 14.7 KB)