## 1. Introduction

There is a received view on measurement scales. It includes both a classification of scales and a set of prescriptions regarding which measurement inferences are justified. According to this view, the measurement scales used by researchers may be classified as nominal, ordinal, interval, or ratio, depending on the information they provide. Nominal scales only represent equality among elements of the same category (as in the classification 1 = alive, 2 = dead). Ordinal scales represent rank order among elements (e.g., in an attitudinal question, options 5 = strongly agree, 4 = agree, 3 = neutral, 2 = disagree, 1 = strongly disagree). Quantitative measurement, however, begins only with interval scales. Here the intervals (i.e., the differences between subsequent levels of the scale) are equal in magnitude. For instance, the difference in temperature between 2°C and 3°C is the same as the difference between 4°C and 5°C. This equality marks the difference between interval and ordinal scales—unlike the case of temperature in the Celsius scale, we do not know whether the distance between ‘strongly agree’ and ‘agree’ is the same as that between ‘agree’ and ‘neutral’. Finally, ratio scales, such as length measured in centimeters, are distinguishable from interval scales in that they have a nonarbitrary zero.

This classification of scales is tied to a set of methodological prescriptions
concerning the kinds of mathematical operations that may be applied to measurement
results. These prescriptions are meant to ensure the correctness of inferences from
measurements. For example, one prescription says researchers should not take
averages of their results if they are measuring temperature with an ordinal scale (a
“thermoscope”). This prescription entails that researchers should not
compare the average temperature of a group of places with that of another group in
order to infer which one is on average hotter. Similarly, if measuring temperature
with an interval scale, researchers should not compute ratios in order to infer
that, say, place *a* is twice as hot as place *b*.
These scale types and the associated prescriptions were first articulated by Stevens
(Reference Stevens1946) in his famous “permissible
statistics.” Later, the Representational Theory of Measurement (RTM; e.g.,
Suppes and Zinnes Reference Suppes and Zinnes1963) provided formal
foundations for both the standard classification of scales and the (un)justified
mathematical operations. But the endorsement of the classification and prescriptions
in research methodology goes well beyond the adherence to RTM or, for that matter,
to any specific theory of measurement. It is just the received view on measurement
scales, usually presented as standard methodology in textbooks across the
sciences.

A cursory look at many areas of the social and biomedical sciences, however, reveals that the prohibition on taking averages from ordinal scales has proved especially difficult to adhere to. Averages from ordinal scales are routinely used in psychology, sociology, economics, and medicine, despite methodologists frequently denouncing the practice as “impermissible.” While we have seen a revival of the philosophy of measurement in recent years (Tal Reference Tal and Zalta2017), little has been said regarding the mismatch between practice and methodology that surrounds measurement scales. I address this lacuna here, casting doubt on the adequacy of the received view. Focusing on the ordinal/interval distinction, I argue that the received view is too blunt a tool to be a satisfactory guide to measurement.

After describing the scale classification and associated prescriptions of the received view (sec. 2), I raise the worry that the prescriptions may be overly restrictive if the classification is not exhaustive enough (sec. 3). In order to assess the relevance of this worry, I offer an epistemic (Bayesian) characterization of the ordinal/interval distinction, that is, in terms of researchers’ beliefs about intervals (sec. 4). This novel epistemic characterization reveals that the ordinal/interval distinction is too coarse grained to appropriately represent all real-world measurement scales. Indeed, (forced) application of the received view might lead to overly cautious methodological prescriptions. We need a subtler epistemic framework of measurement scales (sec. 5).

## 2. The Received View on Measurement Scales

The received view defines scales by the uniqueness of their numerical assignments, which is in turn defined by a set of transformations that preserve the information the scales give. These transformations are called “permissible” or “admissible” (Suppes and Zinnes Reference Suppes and Zinnes1963). Ordinal scales are defined as those that are unique up to order, which means that any order-preserving (“increasing monotonic”) transformation is admissible. This expresses formally the intuitive idea—famously articulated in Stevens (Reference Stevens1946)—that these scales only provide information about the relative order of elements but nothing more. Thus, any order-preserving transformation gives us the same information we already had.

Beyond providing information about order, the specific characteristic of interval
scales is that their intervals are of equal magnitude. Here, any positive linear
transformation (i.e., a transformation from *x* to *y*
that satisfies
$y=a+bx$
,
$b>0$
) is admissible. Any such transformation may change the magnitude
that the scale assigns to 0 (if
$a\ne 0$
) and the absolute magnitude of the intervals but not the equality
of the intervals. As well as having equal intervals, ratio scales have a natural 0.
Thus, only positive similarity transformations are admissible (i.e.,
$y=bx$
,
$b>0$
); otherwise, ratios would not be preserved.

The methodological prescriptions are based on these admissible transformations. The
general form of the prescriptions is as follows: when inferring claims from
measurement results, only the claims that remain true under all admissible
transformations are validly inferred. The justification for this general
prescription lies in the fact that the information each scale gives is determined by
what is common across its admissible transformations—all admissible
transformations of a scale represent the phenomenon equally well. For example,
somebody might (incorrectly) infer that place *a* is twice as hot as
place *b* because the former is at 20°C while the latter at
10°C. However, in a Fahrenheit scale, *a*’s temperature
is 68°F, and *b* is 50°F (not
${68}^{\xb0}F/2={34}^{\xb0}F$
). Inferences such as ‘here is twice as hot as there’
are not validly made with these (interval) scales since the measurement comparison
is sensitive to the admissible transformation used. This is why standard methodology
rules them out.Footnote ^{1}

Let us see how this general prescription applies to ordinal scales. The paradigm
example of an ordinal scale in the physical sciences is Mohs’s scale of
hardness for minerals. This scale uses the following rule to order minerals: if
mineral *a* is able to scratch mineral *b*, then
*a* is harder than *b*. It also assigns numbers
from 1 to 10 in increasing levels of hardness, and each level is associated with a
specific mineral. Because in ordinal scales the differences in value between levels
are not invariant across admissible transformations, mathematical operations like
addition give results that are not invariant to admissible transformations. So, we
cannot infer that groups of objects A and B have the same average hardness from the
fact that their hardness levels in Mohs’s scale are
$A=\left\{2,3,4\right\}$
and
$B=\left\{3,3,3\right\}$
. For if we apply the transformation
$y={2}^{x}$
, the averages now differ.

Note that the prescription allows inferences when they are invariant. Under which
condition are inferences with averages from ordinal scales invariant? A mathematical
concept helps state this condition. Consider G^{A} and G^{B} to be
the cumulative distributions of each group: G^{A}(*x*) is the
fraction of minerals in group A that are as hard as or less hard than
*x* (and similarly for G^{B}). A well-known result in
statistics and economics says: A’s computed average is higher than B’s
computed average under any order-preserving transformation if and only if
${G}^{A}\left(x\right)\le {G}^{B}\left(x\right)$
for all *x* and with a strict inequality over some
values of *x*. The biconditional’s right-hand side is called
first-order stochastic dominance (FOSD).Footnote ^{2} FOSD assures that no matter which order-preserving
transformation is used, A’s computed average would always be bigger than that
of B. Of course, FOSD is a very strong condition. But nothing weaker can ensure that
the average comparison remains invariant under all order-preserving
transformations.

The received view, then, offers a classification of scales in terms of admissible transformations and a set of prescriptions about measurement inferences based on whether the measurement results are invariant across admissible transformations.

## 3. A Potential Problem for the Received View

Why does the received view single out only the admissible transformations (and, thus, the kinds of scales) that it does? If scales are defined by their admissible transformations, what stops us from having many other kinds of scales? Of course, many sets of admissible transformations can be considered between that of all order-preserving transformations and that of all linear positive transformations (Suppes and Zinnes Reference Suppes and Zinnes1963, 14; see an example below).

The sets of admissible transformations considered are nested (fig. 1*a*). Order-preserving transformations
include all positive linear transformations, which in turn include all positive
similarity transformations. Importantly, there is a relation between the admissible
transformations and what is necessary for conclusions to be invariant: the larger
the set of admissible transformations, the stronger the condition for conclusions to
be invariant (and thus validly inferred). For this reason, the conditions that
ensure that conclusions are invariant are stronger for ordinal scales than for
interval scales (e.g., FOSD is not needed to make average comparisons when working
with interval scales). Given that there is a positive relationship between the size
of the set of admissible transformations and the strength of the conditions
necessary for results to be invariant, having scales that are defined by smaller
sets of admissible transformations allows us to make valid inferences with weaker
conditions. The possibility of having such scales makes the issue of what scales are
(and are not) part of the standard classification more pressing.

To illustrate, imagine there exists another kind of scale, the
‘ordinal*’, defined by the following set of admissible
transformations: any positive concave transformation (i.e.,
$y=f\left(x\right)$
,
${f}^{\prime}>0$
,
${f}^{\u2033}\le 0$
). This set of transformations is a subset of the set of
order-preserving transformations, and it contains the set of positive linear
transformations (fig. 1*b*).
Thus, the conditions necessary for results to be invariant are weaker with ordinal*
scales than with ordinal scales.

If a researcher wants to compare the average hardness of two groups of minerals, it makes a difference whether she has an ordinal versus an ordinal* scale. In the case of the former, she needs FOSD to hold. In the case of the latter, only a substantially weaker condition is necessary, called second-order stochastic dominance (SOSD; see proof in Hadar and Russell Reference Hadar and Russell1969). Group comparisons that do not satisfy FOSD may satisfy SOSD. Hence, if a researcher is working with a scale that is in fact ordinal* but that is deemed to be ordinal just because ordinal* is not part of the conceived possibilities, that researcher might be wrongly forbidden to make some inferences. If this were the case, the scales framework endorsed by the received view would be a defective guide for research—its (forced) application would classify valid inferences as ‘invalid’. Policing too strict a (methodological) morality is, surely, an unwelcome result.

Whether the received view, which excludes any kind of scale between ordinal and interval, is a good guide for research depends among other things on whether the scales applied to actual measurement situations fall (for the most part, at least) neatly into those categories. If they do, researchers would not be working with a scale excluded in the received view, and thus nobody would be wrongly abiding to too strict a prescription. We should consider, then, whether the kind of scales singled out by the received view just are the ones researchers are likely to find themselves with in actual practice. If this is true, it would save the received view from the charge of being a defective guide. I will argue that this is unlikely.

The RTM-inspired way to assess whether this is the case would be to prove
representational and uniqueness theorems about scales that lie between the ordinal
and the interval and verify whether the axioms required for those representational
theorems are satisfied by the observed empirical systems (i.e., the phenomena) that
the scales aim to measure. It has been persuasively argued, however, that it is not
strictly speaking possible to verify whether the axioms are satisfied (because of
cases not yet observed, some of which are not observable in practice; Michell Reference Michell1990, 31; Sherry Reference Sherry2011, 517ff.). In this article I offer a different route.
Taking inspiration from the “epistemic turn” in the philosophy of
measurement, I propose that we characterize the ordinal/interval distinction
explicitly in terms of researchers’ beliefs. This provides a more flexible
way of thinking about scales, one that is less focused on the complete numerical
representability of attributes abstractly considered (as in RTM) and more on the
inferences researchers can validly make with measurement results.Footnote ^{3}

From an epistemic perspective, the ordinal/interval distinction reduces to beliefs
about differences between intervals. An interval scale is a scale where the
intervals are known to be of equal magnitude (and, thus, inferences from averages
are always valid). In the case of ordinal scales, things are not as straightforward.
An ordinal scale *only* informs about *order*. The
‘order’ part is easy to understand: all the intervals are known to be
positive (so that, e.g., a 5 is strictly harder than a 4). But what does the
‘only’ part entail for a researcher’s belief about
intervals’ differences? It is not obvious. Clearly, it is not correct to say:
“If the intervals are known to be of different magnitude, a scale is
ordinal.” For example, if a researcher knows that all intervals are different
but also knows that no interval is three or more times larger than any other, then
it is not the case that all order-preserving transformations are epistemically on a
par. (Any order-preserving transformation that makes some intervals three or more
times larger than other intervals should be ruled out.) Thus, the mere knowledge
that intervals are not equal—which implies that the scale is not
interval—is insufficient for saying that the scale is ordinal. It seems
plausible to say that the less equal the intervals of a scale are, the further the
scale is from being interval. But when are the intervals different enough for the
scale to be ordinal? More generally, what beliefs about differences between
intervals are constitutive of an ordinal scale? The tools of Bayesian epistemology
can help model the ordinal/interval distinction.

## 4. A Bayesian Take on the Ordinal/Interval Distinction

Under the received view, data from an interval scale reliably inform us about average
differences, while data from an ordinal scale do not. In this line, one approach to
model the ordinal/interval distinction is to take it as a case of “unreliable
evidence” (see Howson and Franklin Reference Howson and Franklin1994). The idea here is that the computed average difference may be,
depending on the kind of scale, more or less indicative of how the two groups
actually compare to each other. Imagine a researcher interested in (dis)confirming
hypothesis *H* (‘group A is harder on average than group
B’). Observing some positive evidence (*E*: A’s average
> B’s average) may confirm hypothesis *H* more or less,
depending on how reliable the scale is (*K*) for inferring hypotheses
like *H*.Footnote ^{4} We know
that if the intervals are all equal (
$K=1$
), the scale is fully reliable: *E* entails
*H* and is entailed by *H* (the likelihood
is
$\mathrm{Pr}\left(E|H\&K\right)=1$
). The less equal the intervals are, the less indicative
*E* is of *H*. This is because the numbers that
the scale assigns to the different minerals (and that determine whether
*E* is the case) are less indicative of the actual relative
degrees of hardness. One way of putting this is using the noise versus information
analogy: the more heterogeneous the intervals are, the larger the proportion of
noise to information about degrees of hardness there is in the numbers that the
scale uses. Arguably, there is a point at which intervals are believed to be wildly
heterogeneous enough (
$K\approx 0$
) so that *H* and *E* are taken to be
(for all purposes) probabilistically independent. In this case, there is no
confirmation (
$\mathrm{Pr}\left(H|E\&K\right)=\mathrm{Pr}\left(H|K\right)$
).Footnote ^{5}

If we model the interval scale by a researcher that assumes (or assigns credence 1
to)
$K=1$
, how is an ordinal scale modeled? One option: the researcher
assumes (or assigns credence 1 to)
$K\approx 0$
. Although at first sight plausible, there is something
counterintuitive about this representation of an ordinal scale. It implies that the
researcher takes for granted something quite specific about the intervals’
differences (i.e., that they are wildly heterogeneous). This plainly contradicts the
idea that an ordinal scale gives only information about order so that nothing is
known about intervals’ differences. Credence 1 in any other value of
*K* faces the same problem.

Another possibility: the researcher assigns a uniform distribution to
$K:K\sim U\left(0,1\right)$
. Motivated by the principle of insufficient reason, the idea could
be that the researcher has no reason to take as more likely any specific degree of
heterogeneity between intervals than other degrees. Treating them on a par, the idea
would go, requires believing
$K\sim U\left(0,1\right)$
. But, as is well known, the uniform distribution does not amount
to an informationless assumption. For example, how is the fact that the researcher
assigns equal probability to, say, *K* being between 0.5 and 0.6 and
between 0.6 and 0.7 compatible with her knowing nothing about intervals’
differences?

So we have already a significant result. In this Bayesian framework, it is not clear
how to represent an ordinal scale. No credence about *K* matches the
informal description of an ordinal scale (i.e., that which gives only information
about order, so that nothing is known about intervals’ differences). For
Bayesians, at least, this result may raise some concerns about the suitability of
the notion of an ordinal scale.

There is another possibility for (somehow still) representing ordinal scales in a
standard Bayesian model. It involves giving up the goal of faithfully representing
the researcher’s beliefs (or ignorance, rather) about intervals’
differences. We can black box the beliefs (for a moment) and focus on representing
faithfully the assumed corollary of having an ordinal scale. Ordinal scales,
according to the received view, are such that positive evidence cannot be used to
(dis)confirm *H*. So, taking that as a fixed point and reverse
engineering, we can now ask, what should a researcher’s beliefs be like for
this prescription to be brought about by our representation? The only belief
compatible with such a prescription is to assign credence 1 to
$K\approx 0$
so that there is no confirmation.

As argued above, these beliefs about the intervals do not match the common
understanding of an ordinal scale. But unless we impose them on the part of the
researcher, we just do not get the prescription that is supposed to hold for ordinal
scales (i.e., that we cannot confirm *H*). Indeed, the apparently
more reasonable (but still unsatisfactory) alternative of assigning a uniform
distribution to *K* would have meant that positive evidence
*does* provide some confirmation to *H*. Thus, it
is only *certainty* about an extreme heterogeneity of
intervals—to the point of having a totally unreliable measuring
instrument—that is compatible with the prescription. In that sense, only this
certainty about intervals’ heterogeneity is a plausible representation of
what it is for a researcher to have an ordinal scale.

Summing up, when modeling the received view on measurement scales from a standard
Bayesian perspective, the most plausible interpretation of the ordinal versus
interval distinction maps to the following distinction: researchers have certainty
about the extreme heterogeneity of intervals versus researchers have certainty about
the equality of intervals. How does this result bear on our assessment of the
received view? Quite negatively—the ordinal/interval distinction is not
(epistemically) fine grained at all. The ordinal/interval distinction does not
amount to two reasonably spaced categories, so that both might jointly capture the
situation of most researchers working with scales in the ordinal/interval area.
Rather, the distinction picks out two *extremes* of a continuum of
possibilities regarding beliefs about intervals’ differences. That actual
researchers will never (or almost never) find themselves between being certain of
intervals’ equality versus being certain of intervals’ extreme
heterogeneity is, on the face of it, extremely unlikely.

Bear in mind that any knowledge about plausible differences between intervals (that does not entail equality of intervals) is ruled out by the position being challenged. Think, for example, of any bounds that physical laws might suggest for plausible relative hardness of minerals and, thus, for physically possible or likely differences between intervals of hardness scales. That kind of knowledge may rule out some levels of (substantial) heterogeneity without, of course, necessarily establishing equality of intervals. Thus, such knowledge needs to be assumed as nonexistent if we are to say, as the received view assumes, that researchers have either ordinal or interval scales but nothing in between. That the absence of any such knowledge is required from the world (of researchers) for the received view to be an adequate framework puts pressure on it.

Moreover, focusing on the ordinal scale side of the continuum, it is doubtful that
actual researchers will find themselves in such a doxastic state (let alone that
scales will actually have extremely heterogeneous intervals). In order to get the
prescription about averages within our epistemic representation, radically strong
views about intervals’ heterogeneity need to be held by researchers. What
kind of evidence they could have for rationally settling on such strong beliefs is
unclear to me. That actual researchers would rationally hold such beliefs in any
given actual case is, then, unlikely.Footnote ^{6}

Granted, toy examples of ordinal scales can be produced by stipulation. But whether this resembles the situation of real-life researchers, working with scales developed out of background knowledge and experimental work, is a different matter. Indeed, given the kind of beliefs researchers must have about intervals’ heterogeneity for a scale to be ordinal, finding a real ordinal scale might prove challenging. At least, a strong case can be made that the alleged “paradigm example” of an ordinal scale in the physical sciences—Mohs’s scale—is neither ordinal nor interval. Friedrich Mohs himself believed he had a sense of how different the intervals were. With the exception of the last interval (9–10), he believed that the intervals of his scale were not that different so as to render the scale not fit for quantitative analysis. Later on he was, to considerable extent, proved right on both counts (Tabor Reference Tabor1954; see discussion in Larroulet Philippi Reference Larroulet Philippi2021).

## 5. Conclusion

I have cast doubt on the adequacy of the received view as a framework for guiding
measurement, by arguing that it is unlikely that the scales singled out by the
received view just are those that researchers find themselves with in actual
practice. When considered from the perspective of researchers’ beliefs, the
ordinal/interval distinction marks two extremes of a continuum. That all (or most)
actual scientific scales lie in either extreme of the continuum is not self-evident.
Indeed, for a scale to be ordinal, quite strong beliefs have to be in place.Footnote ^{7} Hence, it is unlikely that real-life
researchers always have either ordinal or interval scales but never something in
between.

Let me clarify that this does not necessarily amount to a critique of RTM. The correct interpretation of RTM (e.g., either as a complete theory of measurement or as a more modest project) is an open issue. And if RTM is interpreted merely as a (nonexhaustive) library of theorems (Heilmann Reference Heilmann2015), the above cannot be a critique of RTM per se. Rather, it is a critique of what I have called the received view on measurement scales, which includes the claim that the actual measurement scales used by researchers may be smoothly classified as ordinal, interval, and ratio.

The thesis here defended raises an important methodological issue. Widespread acceptance of the received view has arguably led to the implicit endorsement of the following assumption: if a scale ranks correctly but it is not interval, then it is ordinal. This is altogether reasonable when there are no other options between ordinal and interval. But of course, we have seen that there may be other options available. Indeed, from an inferential perspective, it makes little sense to take them as the only two options. Arguably, Mohs’s scale and (at least) some scales deemed ‘ordinal’ in biomedicine and the social sciences research contexts are merely known to be not interval. Being wrongly classified as ordinal is no small problem. Researchers using these scales might wrongly be forbidden to make inferences (e.g., from computed averages). Since not interval is compatible with being close to being interval (or close enough, depending on how strongly positive the evidence is), this prohibition might not be justified across the board. This methodological overstepping is, surely, an unwelcome result of the coarse grainedness of the received view. And it may well explain part of the tension between practitioners and measurement methodologists mentioned at the beginning.

Looking forward, we need a subtler epistemic framework for measurement scales. This article is only a first step in that direction. More fine-grained possibilities, and classifications better aligned with researchers’ epistemic predicaments, may ground more reasonable prescriptions. This would not only be better epistemology. It might also avoid some of the recurrent tensions between methodologists and practitioners on the status of their average comparisons.