
Forecasting forecaster accuracy: Contributions of past performance and individual differences

Published online by Cambridge University Press:  01 January 2023

Mark Himmelstein*
Affiliation:
Department of Psychology, Fordham University
Pavel Atanasov
Affiliation:
Pytho LLC
David V. Budescu
Affiliation:
Department of Psychology, Fordham University
* Correspondence Email: mhimmelstein@fordham.edu

Abstract

A growing body of research indicates that forecasting skill is a unique and stable trait: forecasters with a track record of high accuracy tend to maintain this record. But how does one identify skilled forecasters effectively? We address this question using data collected during two seasons of a longitudinal geopolitical forecasting tournament. Our first analysis, which compares psychometric traits assessed prior to forecasting, indicates intelligence consistently predicts accuracy. Next, using methods adapted from classical test theory and item response theory, we model latent forecasting skill based on the forecasters’ past accuracy, while accounting for the timing of their forecasts relative to question resolution. Our results suggest these methods perform better at assessing forecasting skill than simpler methods employed by many previous studies. By parsing the data at different time points during the competitions, we assess the relative importance of each information source over time. When past performance information is limited, psychometric traits are useful predictors of future performance, but, as more information becomes available, past performance becomes the stronger predictor of future accuracy. Finally, we demonstrate the predictive validity of these results on out-of-sample data, and their utility in producing performance weights for wisdom-of-crowds aggregations.

Information

Type
Research Article
Creative Commons
The authors license this article under the terms of the Creative Commons Attribution 3.0 License.
Copyright
Copyright © The Authors [2021] This is an Open Access article, distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Figure 1: Average accuracy of forecasts as a function of time for all questions of at least 12 weeks in duration. Average accuracy refers to normalized accuracy scores (see methods). For a binary forecasting question, normalized accuracy of 0 represents a probability of exactly .5 assigned to the correct option. Higher values represent more accurate forecasts.
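The caption pins down two properties of the normalized accuracy score for binary questions (zero when the correct option is assigned exactly .5, higher values for more accurate forecasts). The paper's actual transformation is defined in its methods section; as a purely hypothetical illustration, one minimal score satisfying those two properties is:

```python
def normalized_accuracy_binary(p_correct: float) -> float:
    """Illustrative score only (not the paper's definition): returns 0.0
    when the correct option is assigned probability .5, positive values
    for better-than-chance forecasts, negative for worse-than-chance."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 2.0 * (p_correct - 0.5)

# A forecast of .9 on the realized outcome outscores a forecast of .6:
assert normalized_accuracy_binary(0.5) == 0.0
assert normalized_accuracy_binary(0.9) > normalized_accuracy_binary(0.6)
```

Any monotone transform with a zero at .5 would display the same qualitative pattern as Figure 1; the linear version above is chosen only for simplicity.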

Table 1: Forecasting Question Examples

Figure 2: Four hypothetical item response functions. The top row represents questions whose difficulty is relatively constant over time; the bottom row, questions whose difficulty is very sensitive to timing. The left column represents questions that discriminate poorly between forecasters of differing ability levels; the right column, questions that discriminate well. (Note that b2 is held constant at 6.63, the empirical estimate for Season 1.)
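The caption describes item response functions with two question-level parameters: a discrimination parameter and a difficulty that may shift with time to resolution. The paper's exact parameterization is not reproduced here; a minimal 2PL-style sketch, assuming a logistic link and a linear time effect on difficulty (the names a, b1, b2 are illustrative), might look like:

```python
import math

def item_response(theta: float, a: float, b1: float, b2: float,
                  t: float) -> float:
    """Hypothetical 2PL-style item response function with time-varying
    difficulty. Expected accuracy rises with forecaster ability (theta)
    and, when b2 > 0, as the question nears resolution (t in [0, 1],
    with 1 = resolution day). a = discrimination, b1 = baseline
    difficulty, b2 = time sensitivity; the paper's parameterization
    may differ."""
    difficulty = b1 - b2 * t  # questions get "easier" near resolution
    return 1.0 / (1.0 + math.exp(-a * (theta - difficulty)))

# Higher-discrimination items (right column of Figure 2) separate
# high- and low-ability forecasters more sharply than low-a items:
gap_low_a = (item_response(1.0, 0.3, 0.0, 6.63, 0.1)
             - item_response(-1.0, 0.3, 0.0, 6.63, 0.1))
gap_high_a = (item_response(1.0, 2.0, 0.0, 6.63, 0.1)
              - item_response(-1.0, 2.0, 0.0, 6.63, 0.1))
assert gap_high_a > gap_low_a
```

Setting b2 near zero reproduces the top row of Figure 2 (difficulty roughly constant over time); a large b2 reproduces the bottom row.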

Figure 3: Scatterplot matrix of three ability assessments (simple, hierarchical, IRT) across all forecasters from Season 1 (n = 326).

Table 2: Correlations among accuracy measures, between and within two sets of 94 forecasting questions from Season 1 (n = 326 forecasters). Between-time correlations in italics; comparisons of the same metric across time in bold.

Table 3: Correlations between measures of individual differences and accuracy (Season 1, n = 326)

Table 4: Global Dominance measures of hierarchical regression of normalized accuracy on individual differences (Season 1 core volunteer sample, n = 326).

Figure 4: Comparison of model fit (left: R2, right: BIC) for various accuracy measures, based on past performance over time (Season 1). Mean forecasts per forecaster at each time point are also displayed for reference on the X-axis (n = 216).

Figure 5: Global dominance measures of different sources of information on accuracy of future forecast as a function of time, with IRT used for ability assessment (Season 1, n = 216). Mean forecasts per forecaster at each time point are also displayed for reference on the X-axis.

Figure 6: Global dominance measures of different sources of past performance information on accuracy of future forecasts over time, using IRT for past ability assessment (Season 1, n = 216). Mean forecasts per forecaster at each time point are also displayed for reference on the X-axis.

Table 5: Correlations between model-predicted results from training sample and observed results from testing sample by model input type (first column); cross-correlations across models in training sample (below diagonal), and absolute z-scores for tests of dependent correlations with testing sample estimates (above diagonal with significant differences at p < .05 highlighted; Lee & Preacher, 2013) from Season 1 (n = 202 forecasters, 64 questions in both testing and training samples).

Table 6: Accuracy of wisdom-of-crowds weighted aggregations, by weighting input type (rows) and accuracy measure (columns) from Season 1 (n = 202 forecasters, 64 questions).
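Tables 6 and 7 compare wisdom-of-crowds aggregates under different performance-based weighting schemes. The paper's specific weighting rules are described in its methods; a minimal sketch of the general idea, assuming nonnegative skill scores (e.g., past-accuracy or IRT ability estimates) used as weights on each forecaster's probability, is:

```python
def weighted_crowd_forecast(probs: list[float],
                            skill: list[float]) -> float:
    """Minimal sketch of a performance-weighted wisdom-of-crowds
    aggregate (not the paper's exact scheme): average the forecasters'
    probabilities, weighting each by a nonnegative skill score.
    If all weights are zero, fall back to the unweighted mean."""
    weights = [max(s, 0.0) for s in skill]
    total = sum(weights)
    if total == 0.0:
        weights, total = [1.0] * len(probs), float(len(probs))
    return sum(p * w for p, w in zip(probs, weights)) / total

# A more skilled forecaster (weight 3 vs. 1) pulls the aggregate
# toward their forecast: (0.9 * 3 + 0.4 * 1) / 4 = 0.775
assert abs(weighted_crowd_forecast([0.9, 0.4], [3.0, 1.0]) - 0.775) < 1e-9
```

With equal weights this reduces to the simple unweighted crowd mean, which is the natural baseline against which the weighted schemes in Tables 6-7 are contrasted.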

Table 7: Pairwise contrasts between weighting schemes on mean daily normalized accuracy from Season 1 (n = 202 forecasters, 64 questions). Positive Cohen’s d values favor alternative model.

Figure 7: Scatterplot matrix of three ability assessments (simple, hierarchical, IRT) across all forecasters from Season 2 (n = 547).

Table 8: Correlations among accuracy measures, between and within two sets of 94 forecasting questions from Season 2 (n = 547 forecasters). Between-time correlations in italics; comparisons of the same metric across time in bold.

Table 9: Correlations between individual differences and accuracy (Season 2, n = 547).

Table 10: Global dominance measures of hierarchical regression of normalized accuracy on individual differences (Season 2, n = 547)

Figure 8: Comparison of model fit (left: R2, right: BIC) for three accuracy measures based on past performance as a function of time (Season 2, n = 409). Mean forecasts per forecaster at each time point are also displayed for reference on the X-axis.

Figure 9: Global dominance measures of different sources of information on accuracy of future forecast as a function of time, with IRT used for ability assessment (Season 2, n = 409). Mean forecasts per forecaster at each time point are also displayed for reference on the X-axis.

Figure 10: Global dominance measures of different sources of past performance information on accuracy of future forecasts over time, using IRT for past ability assessment (Season 2, n = 409). Mean forecasts per forecaster at each time point are also displayed for reference on the X-axis.

Table 11: Correlations between model-predicted results from training sample and observed results from testing sample by model input type (first column); cross-correlations across models in training sample (below diagonal), and absolute z-scores for tests of dependent correlations with testing sample estimates (above diagonal with significant differences at p < .05 highlighted; Lee & Preacher, 2013) from Season 2 (n = 391 forecasters, 150 training questions, 98 testing questions).

Table 12: Accuracy of wisdom-of-crowds weighted aggregations, by weighting input type (rows) and accuracy measure (columns) from Season 2 (n = 391 forecasters, 98 questions).

Table 13: Pairwise contrasts between weighting schemes on mean daily normalized accuracy from Season 2 (n = 391 forecasters, 98 questions). Positive Cohen’s d values favor alternative model.

Supplementary material: File

Himmelstein et al. supplementary material (File, 3.5 MB)