
Precision of actuarial risk assessment instruments: Evaluating the ‘margins of error’ of group v. individual predictions of violence

  • Stephen D. Hart (a1), Christine Michie (a2) and David J. Cooke (a3)

Background
Actuarial risk assessment instruments (ARAIs) estimate the probability that individuals will engage in future violence.

Aims
To evaluate the ‘margins of error’ at the group and individual level for risk estimates made using ARAIs.

Method
An established statistical method was used to construct 95% CI for group and individual risk estimates made using two popular ARAIs.

Results
The 95% CI were large for risk estimates at the group level; at the individual level, they were so high as to render risk estimates virtually meaningless.

Conclusions
The ARAIs cannot be used to estimate an individual's risk for future violence with any reasonable degree of certainty and should be used with great caution or not at all. In theory, reasonably precise group estimates could be made using ARAIs if developers used very large construction samples and if the tests included few score categories with extreme risk estimates.
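The ‘established statistical method’ here appears to be Wilson's (1927) score interval for a binomial proportion (both Wilson, 1927, and Agresti & Coull, 1998, appear in the references). A minimal Python sketch, using illustrative figures rather than the paper's actual data, shows why substituting n = 1 for the individual case produces such wide intervals:

```python
import math

def wilson_ci(p_hat, n, z=1.96):
    """Wilson (1927) score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Group level: say 55% observed recidivism in a category of 100 people
lo, hi = wilson_ci(0.55, 100)    # roughly (0.45, 0.64)

# Individual level: the paper's ad hoc substitution of n = 1
lo1, hi1 = wilson_ci(0.55, 1)    # roughly (0.07, 0.96)
```

With n = 1 the interval spans nearly the whole unit line, which is the sense in which individual estimates become ‘virtually meaningless’.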

Corresponding author
Professor Stephen D. Hart, Department of Psychology, Simon Fraser University, 8888 University Drive, Burnaby, British Columbia, Canada V5A 1S6. Email:

Declaration of interest

None. Funding detailed in Acknowledgements.

References
Agresti, A. & Coull, B. A. (1998) Approximate is better than ‘exact’ for interval estimation of binomial proportions. American Statistician, 52, 119–126.
Brown, L. D., Cai, T. T. & DasGupta, A. (2001) Reply to comments on ‘Interval estimation for a binomial proportion.’ Statistical Science, 16, 128–133.
Faigman, D. L. (1995) The evidentiary status of social science under Daubert: Is it ‘scientific’, ‘technical’, or ‘other’ knowledge? Psychology, Public Policy, and Law, 1, 960–971.
Grisso, T. & Appelbaum, P. S. (1992) Is it unethical to offer predictions of future violence? Law and Human Behavior, 16, 621–633.
Hájek, A. (2003) Interpretations of probability. In The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.).
Hájek, A. & Hall, N. (2002) Induction and probability. In The Blackwell Guide to the Philosophy of Science (eds Machamer, P. & Silberstein, M.), pp. 149–172. Blackwell.
Hanson, R. K. & Thornton, D. (1999) Static 99: Improving Actuarial Risk Assessments for Sex Offenders. Ministry of the Solicitor General of Canada.
Hart, S. D. (1998) The role of psychopathy in assessing risk for violence: Conceptual and methodological issues. Legal and Criminological Psychology, 3, 123–140.
Hart, S. D. (2001) Assessing and managing violence risk. In HCR-20 Violence Risk Management Companion Guide (eds Douglas, K. S., Webster, C. D., Hart, S. D., et al), pp. 13–25. Burnaby, British Columbia: Mental Health, Law, and Policy Institute, Simon Fraser University, and Department of Mental Health Law and Policy, Florida Mental Health Institute, University of South Florida.
Hart, S. D. (2003) Actuarial risk assessment. Sexual Abuse: A Journal of Research and Treatment, 15, 383–388.
Heilbrun, K. (1992) The role of psychological testing in forensic assessment. Law and Human Behavior, 16, 257–272.
Henderson, R. & Keiding, N. (2005) Individual survival time prediction using statistical models. Journal of Medical Ethics, 31, 703–706.
Janus, E. S. (2000) Sexual predator commitment laws: Lessons for law and the behavioral sciences. Behavioral Sciences and the Law, 18, 5–21.
Litwack, T. R. (2001) Actuarial versus clinical assessments of dangerousness. Psychology, Public Policy, and Law, 7, 409–433.
Mackay, R. D., Colman, A. M. & Thornton, P. (1998) The admissibility of expert psychological and psychiatric testimony. In Analysing Witness Testimony: Psychological, Investigative, and Evidential Perspectives (eds Shepard, E., Heaton-Armstrong, A. & Wolchover, D.), pp. 321–334. Blackstone.
Maden, T. & Tyrer, P. (2003) Dangerous and severe personality disorders: a new personality concept from the United Kingdom. Journal of Personality Disorders, 17, 489–496.
Meehl, P. E. (1998) The Power of Quantitative Thinking. Paper presented at the annual meeting of the American Psychological Society, Washington, DC.
Melton, G. B., Petrila, J., Poythress, N., et al (1997) Psychological Evaluations for the Courts: A Handbook for Attorneys and Mental Health Professionals, 2nd ed. Guilford.
Monahan, J. A., Steadman, H. J., Appelbaum, P. S., et al (2005) Classification of Violence Risk (COVR). Psychological Assessment Resources.
Mossman, D. (2006) Another look at interpreting risk categories. Sexual Abuse: A Journal of Research and Treatment, 18, 41–63.
Quinsey, V. L., Harris, G. T., Rice, M. E., et al (1998) Violent Offenders: Appraising and Managing Risk. American Psychological Association.
Szmukler, G. (2001) Violence risk prediction in practice. British Journal of Psychiatry, 178, 84–85.
Tyrer, P. (2004) Getting to grips with severe personality disorder. Criminal Behaviour and Mental Health, 14, 1–4.
Wilson, E. B. (1927) Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209–212.
Zeedyk, M. S. & Raitt, F. E. (1998) Psychological evidence in the courtroom: critical reflections on the general acceptance standard. Journal of Community and Applied Social Psychology, 8, 23–39.
The British Journal of Psychiatry
  • ISSN: 0007-1250
  • EISSN: 1472-1465



Evaluating the Precision (and Accuracy!) of Criticisms

Stephen D. Hart, Professor
31 October 2007

Harris, Rice, and Quinsey (HRQ) claim that:

1. We “misapplied confidence intervals” to actuarial test scores. But we used CIs to evaluate the estimated probability of violence associated with test scores, not the raw scores themselves. The (many) problems with raw scores on actuarial tests are a separate issue.

2. We used “precision” and “accuracy” synonymously. But we did not; we simply recognized the important association between these concepts: the accuracy with which actuarial tests can predict future violence in an individual case depends on the precision of group data. As every research trainee learns, reliability places an upper bound on validity.

3. Their sanguine views about basing individual decisions on group data are supported by Grove and Meehl (1996). But they ought to read Grove and Meehl more carefully: “There is a real problem, not a fallacious objection, about uniqueness versus aggregates in defining what statisticians call the reference class for computing a particular probability in coming to a decision about an individual case” (1996; p. 306). Grove and Meehl’s (lengthy) discussion, which includes the issue of the precision of group estimates, is echoed in our paper.

4. Their belief in the “undeniable superiority of actuarials” is supported by Grove and Meehl (1996). But HRQ continue to confuse group and individual data. Grove and Meehl concluded that actuarial decision making was superior to clinical judgment in about 45% of the studies they reviewed; in the others, clinical judgment was equally accurate or even more accurate. Put differently, the “on average” superiority of actuarials translated into superiority in slightly less than half of the individual comparisons. This is an important trend, obviously, but hardly a sound basis for high-stakes gambling on one outcome. As good scientists, we recommend against betting big on the toss of a single coin.

We strongly support evidence-based practice, but HRQ have confused “evidence based” with “statistically based.” They should recognize that in forensic mental health, as in many areas of life, good practice does not equate to mindless reliance on simplistic statistical algorithms.


Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293-323.

Conflict of interest: None Declared


Margins of Error for Individual Risk Estimates: Large, Unknown, or Incalculable

Stephen D. Hart, Professor
26 September 2007

Actuarial risk assessment instruments (ARAIs), constructed using data from known groups, are used to make life-and-death decisions about individuals. How precisely do they estimate risk in individual cases? The 95% CI for proportions, which evaluates the precision of risk estimates for ARAI groups, cannot be used for individual risk estimates unless one makes a very strong assumption of heterogeneity – that ARAIs carve nature at its joints, separating people with perfect accuracy into non-overlapping categories. No-one, not even those who construct ARAIs, makes this assumption. So, we ask again, what is the precision of individual risk estimates made using ARAIs?

Professors Mossman and Sellke criticized us for inadequately defining “individual risk.” They also criticized us for using an ad hoc procedure to estimate the margin of error for individual risk estimates, which they opined served only to “pile nonsense on top of meaninglessness.”

We must plead guilty to some of the charges leveled by Mossman and Sellke – indeed, we pled guilty in our paper, acknowledging the conceptual and statistical problems with the approach we used. In our defence, we claimed duress: Because developers used inappropriate statistical methods to construct ARAIs, we could not use appropriate methods to evaluate them. Violent recidivism was measured in the ARAI development samples as a dichotomous, time-dependent outcome, and so the developers ought to have used logistic regression or survival analysis to build models; if they had, one could directly calculate logistic regression or survival scores for individuals and their associated 95% CIs.
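The remedy proposed here can be sketched in miniature. Assuming a fitted logistic regression with a known coefficient covariance matrix (all numbers below are invented for illustration, not taken from any ARAI), a Wald interval on the logit scale yields a direct 95% CI for one individual's predicted probability:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical fitted model: logit(p) = b0 + b1 * score
b0, b1 = -2.0, 0.15
cov = [[0.25, -0.01],          # assumed covariance matrix of (b0, b1)
       [-0.01, 0.001]]

def individual_ci(score, z=1.96):
    """95% CI for one person's predicted probability: a Wald interval
    on the linear predictor, transformed back through the logistic."""
    eta = b0 + b1 * score
    # Var(eta) = x' Cov x for x = (1, score)
    var = cov[0][0] + 2 * score * cov[0][1] + score**2 * cov[1][1]
    se = math.sqrt(var)
    return sigmoid(eta - z * se), sigmoid(eta + z * se)

lo, hi = individual_ci(20)
```

Unlike the n = 1 substitution, the width of this interval reflects how precisely the model's coefficients were estimated.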

But we also plead that these charges are irrelevant to our conclusion. As we discussed, to reject our findings that the margins of error for individual risk estimates are large is to acknowledge that they are either unknown or incalculable. Regardless, the current state of affairs is unacceptable for those who seek to use these tests in a professionally responsible manner or argue in favor of their legal admissibility. We urge ARAI developers to recalibrate their statistical models in a way that permits direct calculation of individual risk estimates and their precision or to make their data publicly available so others may do so.

Conflict of interest: None Declared


Abandoning Evidence-Based Risk Appraisal in Forensic Practice: Comments on Hart et al.

Grant T. Harris, Director of Research
15 August 2007

Hart, Michie, and Cooke reminded readers of a supplement to this journal that, “Predicting the future is very difficult” (p. s63). All physicians are acutely aware of the difficulty of prognosis. But does this mean it should not be attempted? Competent practice, especially for serious conditions and therapies carrying risks, is impossible without some evaluation, one patient at a time, as to the likelihood of various outcomes as a function of various contemplated interventions (including no intervention), or as a function of various diagnostic tests. The advice from Hart and colleagues seems to call for clinicians to eschew empirical data about outcomes among groups of similar patients, but they failed to advise readers about what to do instead.

Statistical and Technical Matters

Hart and colleagues made a statistical argument that the results of widely replicated actuarial systems for forensic risk assessment (the Violence Risk Appraisal Guide, VRAG, and the Static-99) must be “virtually meaningless” (p. s60). Unfortunately, they were led into statistical error by conflating test reliability and validity -- precision of measurement must be treated separately from a test’s association with an outcome. The first error resulting from this conflation was using confidence intervals to assess the “precision” or “margin of error” for an individual test result; in fact, confidence intervals were not designed for this purpose. The appropriate statistic is the standard error of measurement – the margin of error associated with a single person’s true score (an aspect of reliability). The VRAG’s standard error of measurement has been reported both for the development sample and independent replications (Quinsey, Harris, Rice, & Cormier, 2006), consistently indicating that any single score has only a .05 probability of yielding misclassification by more than one VRAG category. The analysis by Hart and colleagues was correct in one sense – the amount of misclassification to be expected does vary as a function of the score – those at the extremes exhibit greater risk of misclassification. Again, however, confidence intervals are not the way to compute this error; the conditional standard error of measurement is the statistic for this purpose.
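For readers unfamiliar with the statistic invoked here, the standard error of measurement from classical test theory is computed from a test's score spread and its reliability. A sketch with illustrative values (not the published VRAG figures):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: the expected spread of observed
    scores around a person's true score (classical test theory)."""
    return sd * math.sqrt(1 - reliability)

# Illustrative values only: score SD of 12, inter-rater reliability 0.90
e = sem(12, 0.90)      # about 3.8 score points
band = 1.96 * e        # about a 7.4-point 95% band around the true score
```

Note that this quantifies measurement precision only; it says nothing by itself about how well scores predict violence, which is the reliability/validity distinction the letter draws.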

Hart and colleagues’ analysis of confidence intervals did legitimately “prove” statistically that one usually cannot learn much from a single case -- one observation usually conveys only a little scientific information. But is it true, as they imply, that a single observation conveys absolutely no information? Readers will recognize that most research findings are simply the aggregation of many single observations. The fact that some research findings yield consistent replication inevitably means that single observations do convey valid scientific information. It’s just that we must often aggregate the single observations in order to evaluate and learn from them.

Hart and colleagues’ second mistake related to aggregated findings about the accuracy of actuarial tools. They slipped from “precision” to “accuracy” as though these are formally synonymous. They are not. As most medical professionals know, test accuracy (i.e., validity) is assessed in terms of sensitivity, specificity, and the tradeoff between these two. We are aware of more than 40 independent tests of the accuracy of the VRAG (and its allied tool, the Sex Offender Risk Appraisal Guide, SORAG) in predicting violent recidivism in a total of approximately eight thousand released correctional inmates, sex offenders, forensic patients, civil psychiatric patients, and other clinical samples. These tests have been conducted in at least seven countries and have employed mean follow-up periods ranging from a few months to ten years. By conventional standards, average predictive effects (in terms of the sensitivity-specificity tradeoff) are large and are distributed as expected by psychometric principles and the laws of probability. Contrary to the assertions of Hart and colleagues, VRAG/SORAG scores have been shown to predict the speed and severity of violent recidivism. If recalculated using all available cases, confidence intervals for category outcomes would be considerably smaller than those calculated for the development sample alone. Similarly, we are aware of approximately 40 replications, involving more than 13,000 cases, of the Static-99. The statistical argument by Hart and colleagues does not and cannot refute these empirical results supporting the accuracy of actuarial risk assessments.

It is instructive to consider the argument by Hart and colleagues in a broader medical context. Predicting violent recidivism with actuarial instruments is, in principle, no different than using diagnostic tests to predict development or outcome for such disorders as cancer. The accepted measure of predictive and diagnostic accuracy is the area under the Relative Operating Characteristic (ROC; Swets, Dawes, & Monahan, 2000), which indexes the tradeoff between sensitivity and specificity as a function of test score. Under conditions of good measurement reliability, equal follow-up duration, and few missing items, the VRAG produces ROC values that compare favorably with widely used diagnostic tests (Quinsey et al., 2006). This is true even though the accuracy of actuarial instruments is artifactually lowered by error in measuring the outcome (violent reoffending recorded in official records), whereas the accuracy of diagnostic tests for cancer prediction is generally less affected by such measurement error (for example, using death or autopsy results as the predicted outcome). Because ROC analyses are the standard for accuracy, the advice of Hart and colleagues would seem to require that many diagnostic tools also be abandoned.
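The ROC area referred to above has a simple concordance interpretation, which can be computed directly. The scores below are toy values, not VRAG data:

```python
def auc(pos_scores, neg_scores):
    """ROC area as a concordance probability: the chance that a randomly
    chosen recidivist scores above a randomly chosen non-recidivist
    (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

recidivists = [14, 9, 18, 11, 7]       # toy instrument scores
non_recidivists = [5, 8, 3, 10, 6]
a = auc(recidivists, non_recidivists)  # 0.88 for these toy data
```

An area of 0.5 is chance-level discrimination; 1.0 is perfect separation of the two groups.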

Finally, classification accuracy is the standard in assessing the kind of “precision” attempted by Hart and colleagues. In most tests of the VRAG, there have been no statistically significant differences between the observed rates and those expected on the basis of the proportions provided as norms (Harris & Rice, in press), especially given known variation predicted by Bayes’ Rule. Thus, classification accuracy has also been successfully replicated. In essence, Hart et al. have attempted, but failed, to gainsay an empirical result with a statistical argument.

The notion that it is somehow wrong to base individual decisions on “group data” has been thoroughly refuted (Grove & Meehl, 1996; Quinsey et al., 2006). Consider the example offered by Hart and colleagues themselves – betting on whether a card other than a diamond (probability = .75, 3 to 1 odds) will be drawn from an ordinary deck of playing cards. Hart and colleagues assert that one can have little confidence in winning in a single trial. What do they then advise -- bet on a diamond?! A careful reading of their paper yields only one piece of advice – refuse to bet. Yet consistently betting against a diamond is the winning strategy and all rational gamblers would make that bet. In the context of violence risk assessment over long durations, offenders in the highest two VRAG categories have generally exhibited probabilities of officially detected violent recidivism greater than 75 percent. And the lowest four categories have consistently exhibited rates below 25 percent. Surely forensic clinicians should not refuse to provide this information to those making decisions about violent offenders.
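The card-deck argument is easy to check by simulation: betting against a diamond on every trial wins about three times in four, whereas refusing to bet wins nothing. A minimal sketch:

```python
import random

random.seed(1)
p_not_diamond = 39 / 52    # 0.75: three suits out of four

trials = 10_000
wins = sum(random.random() < p_not_diamond for _ in range(trials))
win_rate = wins / trials   # close to 0.75 over many trials
```

The simulation illustrates the letter's point about repeated play; it does not, of course, settle the dispute over what the 0.75 figure means for any single draw.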

Clinical Decisions about One Case

What should a forensic clinician do when deciding to release or detain one previously violent forensic patient? Hart et al. imply that the clinician should make no release decision, presumably leaving it up to the unaided judgment of others. We disagree. An actuarial tool (such as the VRAG or Static-99) is simply an efficient, available distillation of relevant empirical evidence. An actuarial tool does not afford certainty, of course, but, as Hart and colleagues fully acknowledged, it affords more accuracy than any other known method for making such decisions.

In conclusion, the undeniable superiority in accuracy of actuarial systems over all known alternatives means they must be used where available. Except for refusing to make risk-related decisions, Hart and colleagues offer no alternative for actual forensic practice. Taken seriously, their advice is likely to worsen the practice of clinicians who must make decisions about the risk of violent recidivism. Reluctance to make risk-related decisions based on actuarial methods may well have a motivation in addition to (misguided) concerns about accuracy, however. These concerns relate to clinical and philosophical objections to civil commitment for sex offenders in the U.S. and the dangerous severe personality disorders legislation in the UK (Monahan, 2006). Although we have some sympathy here, it is important to understand that the reliability and validity of actuarial instruments are independent of their use in particular schemes for sentencing and managing offenders. Further, if forensic clinicians refuse to make risk-related decisions, decisions will be made by others using less accurate means: less accurate decisions inexorably accumulate in more avoidable harm to victims, more unnecessary restriction of offenders, or both.


Grove, W. M. & Meehl, P. E. (1996) Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical–statistical controversy. Psychology, Public Policy, and Law, 2, 293–323.

Harris, G.T. & Rice, M.E. (in press) Characterizing the value of actuarial violence risk assessment. Criminal Justice and Behavior.

Monahan, J. (2006) A jurisprudence of risk assessment: Forecasting harm among prisoners, predators, and patients. Virginia Law Review, 92, 391-435.

Quinsey, V.L., Harris, G.T., Rice, M.E., & Cormier, C.A. (2006) Violent offenders: Appraising and managing risk (Second Edition). Washington, DC: American Psychological Association.

Swets, J., Dawes, R., & Monahan, J. (2000) Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1-26.

Declaration of Interest: None

Grant T. Harris, Ph.D.

Director of Research, Mental Health Centre, Penetanguishene; Associate Professor of Psychology (adjunct), Queen's University; Associate Professor of Psychiatry (adjunct), University of Toronto; 705-549-3181, fax: 705-549-3652

Marnie E. Rice, Ph.D., FRSC

Professor of Psychiatry and Behavioural Neuroscience, McMaster University; Professor of Psychiatry (adjunct), University of Toronto; Associate Professor of Psychology (adjunct), Queen's University

Vernon L. Quinsey, Ph.D.

Professor of Psychology, Biology, and Psychiatry, Queen's University

Avoiding Errors about "Margins of Error"

Douglas Mossman, Professor of Psychiatry
05 July 2007

In discussing actuarial risk assessment instruments (ARAIs), Hart and colleagues (2007) acknowledge that "prediction" may refer to probabilistic statements (e.g., a "prediction" that an individual "falls in a category for which the estimated risk of violence was 52%" [p. s60]). For unclear reasons, however, the authors seem to value only predictions with right-or-wrong outcomes. They therefore regard statements about future behavior of large groups (where one can be almost certain that the fraction of persons who act a certain way will fall within a narrow range of proportions) as potentially "credible," but predictions for individuals as meaningless.

If the purpose of risk assessment is to make choices, however, then well-grounded probabilistic predictions about single events help us. Suppose we conclude it is legally and ethically acceptable to impose preventive confinement upon individuals in ARAI categories with estimated recidivism rates above a specified threshold T. This policy entails making "false negative" and "false positive" decision errors. We recognize, however, that unless we are omniscient, perfection is not an option, and ARAIs simply help us make better decisions than we otherwise could.

How do "margins of error" in estimated recidivism rates affect our decision process? Hart and colleagues believe their "group risk" and "individual risk" 95% confidence intervals (CIs) speak to this problem. Their group intervals are standard CIs for estimated population proportions based on random samples. If T lies outside the group risk CI for a category, then we can be reasonably certain that a decision we make concerning someone in that category is the same decision we would make if we knew the true recidivism rate for that category. If T falls within a category’s group risk CI, then our estimate quite possibly might lead to the "wrong" decision. Statistical decision theory (Berger, 1985) shows, however, that it is still a sensible strategy to choose whether to confine a member of a category based on which side of T our estimated risk falls.
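The decision-theoretic point credited to Berger (1985) can be illustrated with a toy expected-loss rule; the error costs below are hypothetical. Minimising expected loss reduces to comparing the estimated category risk against a threshold T set by the relative costs of the two errors:

```python
def expected_loss(confine, p, cost_fp=1.0, cost_fn=3.0):
    """Expected loss given true recidivism probability p.
    cost_fp: confining a non-recidivist; cost_fn: releasing a recidivist
    (both hypothetical)."""
    return cost_fp * (1 - p) if confine else cost_fn * p

def decide(p_hat, cost_fp=1.0, cost_fn=3.0):
    # Confine whenever that choice has the lower expected loss; this is
    # equivalent to asking whether p_hat > T = cost_fp / (cost_fp + cost_fn)
    return (expected_loss(True, p_hat, cost_fp, cost_fn)
            < expected_loss(False, p_hat, cost_fp, cost_fn))

# With these costs T = 0.25, so an estimated risk of 0.30 means confine
```

The rule uses only the point estimate: under a fixed loss structure, the width of the CI around the estimate does not change which side of T it falls on.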

Hart and colleagues talk about "individual risk" as though it is something different from category (or "group") risk. Yet if all one knows about an individual is his membership in a risk group, what can "individual risk" mean? The authors do not say. If "individual risk" refers to believed-to-exist-but-unspecified differences between individuals within a category, however, such differences should not affect choices by a rational decision-maker.

The 95% CIs for "individual risk" pile nonsense on top of meaninglessness. Hart and colleagues describe the replacement of "n" by "1" in the Wilson (1927) formulae as "ad hoc," but this substitution makes no sense when the basis for the estimated proportion is an n-member sample. With "1" in place of "n," the formulae just don’t mean anything.

Using ARAIs raises serious moral problems as well as the valid scientific questions that Hart and colleagues mention. But in faulting ARAIs’ capacity to address an unspecified quantity called "individual risk," and in dressing up this notion with misapplied formulae for CIs, Hart and colleagues ultimately create a muddle.


Hart, S.D., Michie, C., & Cooke D.J. (2007) Precision of actuarial risk assessment instruments: Evaluating the ‘margins of error’ of group v. individual predictions of violence. British Journal of Psychiatry, 190, s60-s65.

Wilson, E.B. (1927) Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212.

Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis, 2nd Edition. Springer-Verlag, New York.


Douglas Mossman, M.D., Department of Psychiatry, Wright State University Boonshoft School of Medicine, 627 S. Edwin C. Moses Blvd., Dayton, OH, 45408-1461 USA

Thomas Sellke, Ph.D., Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907-2067 USA

Declaration of Interest

Neither of the authors has received fees or grants from, is employed by, has consulted for, has shared ownership in, or has any close relationship with an organization whose interests, financial or otherwise, would be affected by the publication of this letter.

