A channel-based perspective on conjugate priors

Abstract A desired closure property in Bayesian probability is that an updated posterior distribution be in the same class of distributions – say Gaussians – as the prior distribution. When the updating takes place via a statistical model, one calls the class of prior distributions the ‘conjugate priors’ of the model. This paper gives (1) an abstract formulation of this notion of conjugate prior, using channels, in a graphical language, (2) a simple abstract proof that such conjugate priors yield Bayesian inversions and (3) an extension to multiple updates. The theory is illustrated with several standard examples.


Introduction
The main result of this paper, Theorem 6.3, is mathematically trivial. But it is not entirely trivial to see that this result is trivial. The effort and contribution of this paper lie in setting up a framework – using the abstract language of channels, Kleisli maps and string diagrams for probability theory – to define the notion of conjugate prior in such a way that there is a trivial proof of the main statement, saying that conjugate priors yield Bayesian inversions. This is indeed what conjugate priors are meant to be.
Conjugate priors form a fundamental topic in Bayesian theory. They are commonly described via a closure property of a class of prior distributions, namely as being closed under certain Bayesian updates. Conjugate priors are especially useful because they involve not only a closure property, but also a particular structure, namely an explicit function that performs an analytical computation of posterior distributions via updates of the parameters. This greatly simplifies Bayesian analysis. For instance, the Beta distribution is conjugate prior to the Bernoulli (or 'flip') distribution, and also to the binomial distribution: updating a Beta(α, β) prior via a Bernoulli/binomial statistical model yields a new Beta(α′, β′) prior, with adapted parameters α′, β′ that can be computed explicitly from α, β and the observation at hand. Despite this importance, the descriptions in the literature of what it means to be a conjugate prior are remarkably informal. One does find several lists of classes of distributions, for instance on Wikipedia, together with formulas about how to recompute parameters. The topic has a long and rich history in statistics (see e.g. Bishop (2006)), with much emphasis on exponential families (Diaconis and Ylvisaker 1979), but a precise, general definition is hard to find.
We briefly review some common approaches, without any pretension to completeness. The definition in Alpaydin (2010) is rather short, based on an example, and just says: 'We see that the posterior has the same form as the prior and we call such a prior a conjugate prior'. Also Russell and Norvig (2003) mention the term 'conjugate prior' only in relation to an example. There is a separate section in Bishop (2006) about conjugate priors, but no precise definition. Instead, there is the informal description '. . . the posterior distribution has the same functional form as the prior'. The most precise definition (known to the author) is in Bernardo and Smith (2000, Section 5.2), where the conjugate family with respect to a statistical model, assuming a 'sufficient statistic', is described. It comes close to our channel-based description, since it explicitly mentions the conjugate family as a conditional probability distribution with (recomputed) parameters. The approach is rather concrete, however, and the high level of mathematical abstraction that we seek here is missing in Bernardo and Smith (2000).

Figure 1. Uniform prior, and two posterior probability density functions on [0, 1], after observing head, and after observing head-tail-tail-tail. These functions correspond respectively to Beta(1, 1), Beta(2, 1), and Beta(2, 4). These three plots can also be obtained via suitable Bayesian updates (inversions).
This paper presents a novel systematic perspective for precisely defining what conjugate priorship means. It uses the notion of 'channel' as its starting point. The basis of this approach lies in category theory, especially effectus theory (Cho et al. 2015; Jacobs 2015). However, we try to make this paper accessible to non-category theorists, by using the term 'channel' instead of 'morphism in a Kleisli category of a suitable monad'. Moreover, a graphical language is used for channels, which hopefully makes the approach more intuitive. Thus, the emphasis of the paper is on what it means to have conjugate priors. It does not offer new perspectives on how to find or obtain them.
The paper is organised as follows. It starts in Section 2 with a high-level description of the main ideas, without going into technical details. Preparatory definitions are provided in Sections 3 and 4, dealing with channels in probabilistic computation, with a diagrammatic language for channels, and with Bayesian inversion. Then, Section 5 contains the novel channel-based definition of conjugate priorship; it also illustrates how several standard examples fit in this new setting. Section 6 establishes the (expected) close relationship between conjugate priors and Bayesian inversions (Cho and Jacobs, 2019; Clerc et al., 2017). Section 7 illustrates that multiple updates are handled basically in the same way as single updates, and also how the notion of 'sufficient statistic' fits in.

Main Ideas
This section gives an informal description of the main ideas underlying this paper. It starts with a standard example, and then proceeds with a step-by-step introduction to the essentials of the perspective of this paper.
A well-known example in Bayesian reasoning is inferring the (unknown) bias of a coin from a sequence of consecutive head/tail observations. The bias is a number r ∈ [0, 1] in the unit interval, giving the 'Bernoulli' or 'flip' probability r for head, and 1 − r for tail. Initially we assume a uniform distribution for r, as described by the constant probability density function (pdf) on the left in Figure 1. After observing one head, this pdf changes to the second picture. After observing head-tail-tail-tail we get the third pdf. These pictures are obtained by Bayesian inversion, see Section 4.
It is a well-known fact that all the resulting distributions are instances of the Beta(α, β) family of distributions, for different parameters α, β. After each observation, one can recompute the entire updated distribution via Bayesian inversion. But in fact there is a much more efficient way to obtain the revised distribution, namely by computing the new parameter values: increment α by one for head, and increment β by one for tail; see Example 5.3 for details. The family of distributions Beta(α, β), indexed by parameters α, β, is thus suitably closed under updates with Bernoulli. This is the essence of the statement that Beta is conjugate prior to Bernoulli. It will be made precise later on.
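As a concrete illustration (not part of the paper's formal development), this parameter-update view can be sketched in a few lines of Python; the grid comparison at the end checks numerically that incrementing (α, β) agrees with pointwise Bayesian updating of the uniform prior:

```python
import math

def beta_pdf(a, b, x):
    """Density of Beta(a, b) at x; the constant 1/B(a, b) written via the Gamma function."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def update(a, b, obs):
    """Conjugate-prior parameter update: head (1) increments a, tail (0) increments b."""
    return (a + obs, b + 1 - obs)

# Start from the uniform prior Beta(1, 1) and observe head, tail, tail, tail.
a, b = 1, 1
for obs in [1, 0, 0, 0]:
    a, b = update(a, b, obs)
# (a, b) is now (2, 4): the Beta(2, 4) posterior of Figure 1.

# Cross-check against pointwise Bayesian updating on a grid:
# posterior(x) is proportional to prior(x) * x^heads * (1 - x)^tails.
xs = [i / 1000 for i in range(1, 1000)]
unnorm = [x * (1 - x) ** 3 for x in xs]          # uniform prior times likelihood
z = sum(unnorm) / len(xs)                        # crude normalising integral
for x, u in zip(xs[::100], unnorm[::100]):
    assert abs(u / z - beta_pdf(a, b, x)) < 1e-2
```

The assertions compare the analytically updated Beta(2, 4) density with the renormalised prior-times-likelihood on a grid; both are the same function, up to numerical integration error.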
Let X = (X, Σ_X) be a measurable space, where Σ_X ⊆ P(X) is a σ-algebra of measurable subsets. We shall write G(X) for the set of probability distributions on X. Elements ω ∈ G(X) are thus countably additive functions ω : Σ_X → [0, 1] with ω(X) = 1.
Idea 1: A family of distributions on X, indexed by a measurable space P of parameters, is a (measurable) function P → G(X). Categorically, such a function is a Kleisli map for G, considered as monad on the category of measurable spaces (see Section 3). These Kleisli maps are also called channels and will be written simply as arrows P → X, or diagrammatically as boxes with an input wire labelled P and an output wire labelled X, where we imagine that information is flowing upwards.
The study of families of distributions goes back a long way, e.g. as 'experiments' (Blackwell, 1951). Along these lines we shall describe the family of Beta distributions as a channel with P = R_{>0} × R_{>0} and X = [0, 1], namely as function:

Beta : R_{>0} × R_{>0} → [0, 1].   (1)

For (α, β) ∈ R_{>0} × R_{>0} there is the probability distribution Beta(α, β) ∈ G([0, 1]), determined by its value on a measurable subset M ⊆ [0, 1], which is obtained via integration:

Beta(α, β)(M) = (1 / B(α, β)) · ∫_M x^{α−1} · (1 − x)^{β−1} dx,   (2)

where B(α, β) = ∫_{[0,1]} x^{α−1} · (1 − x)^{β−1} dx is a normalisation constant. A conjugate prior relationship involves a family of distributions c : P → G(X) which is closed w.r.t. updates based on observations (or: data) from a separate domain O. Each 'parameter' element x ∈ X gives rise to a separate distribution on O. This is what is usually called a statistical or parametric model. We shall also describe it as a channel d : X → O. The idea is that c(p) is a prior, for p ∈ P, which gets updated via the statistical model (channel) d, in the light of an observation y ∈ O. The revised, updated distribution is c(h(p, y)), for a suitable parameter translation function h : P × O → P. The model d is usually written as a conditional probability d(y | θ).
What has been left unexplained is the 'suitable' equation that the parameter translation function h : P × O → P should satisfy. It is not entirely trivial, because it is an equation between channels in what is called the Kleisli category K(G) of the Giry monad G. At this stage we need to move to a more categorical description. The equation, which will appear in Definition 5.1, bears similarities with the notion of Bayesian inversion, which will be introduced in Section 4.

Channels and conditional probabilities
This section will describe conditional probabilities as arrows and will show how to compose them. Thereby we are entering the world of category theory. We aim to suppress the underlying categorical machinery and make this work accessible to readers without such background. For those with categorical background knowledge: we will be working in the Kleisli categories of the distribution monad D for discrete probability, and of the Giry monad G for continuous probability, see e.g. Giry (1982), Panangaden (2009) and Jacobs (2018). Discrete distributions may be seen as a special case of continuous distributions, via a suitable inclusion map D → G. Hence one could give a single account, using G only. However, in computer science, unlike for instance in statistics, discrete distributions are so often used that they merit separate treatment.
We thus start with discrete probability. We write a (finite, discrete) distribution on a set X as a formal convex sum r₁|x₁⟩ + · · · + rₙ|xₙ⟩ of elements xᵢ ∈ X and probabilities rᵢ ∈ [0, 1] with ∑ᵢ rᵢ = 1. The 'ket' notation |−⟩ is syntactic sugar, used to distinguish elements x of X from their occurrence |x⟩ in such formal convex sums. A distribution as above can be identified with a 'probability mass' function ω : X → [0, 1] which is rᵢ on xᵢ and 0 elsewhere. We often implicitly identify distributions with such functions. We shall write D(X) for the set of distributions on X.
We shall focus on functions of the form c : X → D(Y). They give for each element x ∈ X a distribution c(x) on Y. Hence such functions form an X-indexed collection (c(x))_{x∈X} of distributions c(x) on Y. They can be understood as conditional probabilities P(y | x) = r, if c(x) is of the form · · · + r|y⟩ + · · ·, with weight r = c(x)(y) ∈ [0, 1] for y ∈ Y. Thus, by construction, ∑_y P(y | x) = 1, for each x ∈ X. Moreover, if the sets X and Y are finite, we can describe c : X → D(Y) as a stochastic matrix, with entries P(y | x) adding up to one – per row or per column, depending on the chosen orientation of the matrix.
We shall often write functions X → D(Y) simply as arrows X → Y, call them 'channels', and write them as 'boxes' in diagrams. This arrow notation is justified, because there is a natural way to compose channels, as we shall see shortly. But first we describe state transformation, also called prediction. Given a channel c : X → D(Y) and a state ω ∈ D(X), we can form a new state, written as c ≫ ω, on Y. It is defined as:

c ≫ ω = ∑_y ( ∑_x ω(x) · c(x)(y) ) |y⟩.

The outer sum over y is a formal convex sum, whereas the inner sum over x is an actual sum in the unit interval [0, 1]. Using state transformation it is easy to define composition of channels: given functions c : X → D(Y) and d : Y → D(Z), the composite channel d ∘ c : X → D(Z) is given by (d ∘ c)(x) = d ≫ c(x). Essentially, this is matrix composition for stochastic matrices. Channel composition ∘ is associative and also has a neutral element, namely the identity channel η : X → X given by the 'Dirac' distribution η(x) = 1|x⟩.

We turn to channels in continuous probability. As already mentioned in Section 2, we write G(X) for the set of probability distributions ω : Σ_X → [0, 1], where X = (X, Σ_X) is a measurable space. These probability distributions are (also) called states. The set G(X) carries a σ-algebra itself, but that does not play an important role here. Each element x ∈ X yields a probability measure η(x) ∈ G(X), determined by η(x)(M) = 1_M(x). For a state/measure ω ∈ G(X) and a measurable function f : X → R_{≥0} we write ∫ f dω for the Lebesgue integral, if it exists. We follow the notation of Jacobs (2013) and refer there for details, or alternatively, to Panangaden (2009). We recall that an integral ∫_M f dω over a measurable subset M ⊆ X of the domain of f is defined as ∫ 1_M · f dω, and that ∫_M 1_N dω = ω(M ∩ N). Often, the measurable space X is a subset X ⊆ R of the real numbers and a probability distribution ω on X is given by a probability density function (pdf), that is, by a measurable function f : X → R_{≥0} with ∫ f(x) dx = 1. Such a pdf f gives rise to a state ω ∈ G(X), namely ω(M) = ∫_M f(x) dx. We then write ω = f. In such cases where the measure is not explicitly mentioned, one should assume that it is the Lebesgue measure on R.
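These discrete notions – states, state transformation c ≫ ω, channel composition, and the Dirac identity channel – can be sketched concretely in Python (the channels and numbers below are purely illustrative):

```python
# Discrete channels sketched as Python functions X -> dict; a state is a dict
# {outcome: probability}.

def push(channel, state):
    """State transformation c >> omega: (c >> omega)(y) = sum_x omega(x) * c(x)(y)."""
    out = {}
    for x, px in state.items():
        for y, pyx in channel(x).items():
            out[y] = out.get(y, 0.0) + px * pyx
    return out

def compose(d, c):
    """Channel composition (d . c)(x) = d >> c(x), i.e. stochastic-matrix multiplication."""
    return lambda x: push(d, c(x))

def dirac(x):
    """The identity channel eta: the Dirac distribution on x."""
    return {x: 1.0}

coin = lambda bias: {1: bias, 0: 1 - bias}     # channel [0,1] -> {0,1}
noise = lambda v: {v: 0.9, 1 - v: 0.1}         # bit-flip noise channel

omega = {0.3: 0.5, 0.8: 0.5}                   # a state on two candidate biases
pred = push(compose(noise, coin), omega)
assert abs(pred[1] - 0.54) < 1e-9              # 0.5*0.34 + 0.5*0.74
assert push(dirac, omega) == omega             # eta is neutral for state transformation
```

The final assertions illustrate the stochastic-matrix reading of composition and the neutrality of the Dirac channel.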
In this continuous context a channel is a measurable function c : X → G(Y), for measurable spaces X, Y. Like in the discrete case, it gives an X-indexed collection (c(x))_{x∈X} of probability distributions on Y. The channel c can transform a state ω ∈ G(X) on X into a state c ≫ ω ∈ G(Y) on Y, given on a measurable subset N ⊆ Y as:

(c ≫ ω)(N) = ∫ c(x)(N) ω(dx).

For another channel d : Y → G(Z), the composite d ∘ c : X → G(Z) is again given by (d ∘ c)(x) = d ≫ c(x). In many situations a channel c : X → G(Y) is given by an indexed probability density function u : X × Y → R_{≥0}, where each u(x, −) is a pdf on Y. The associated channel c is:

c(x)(N) = ∫_N u(x, y) dy.

In that case we simply write c = u and call c a pdf-channel. We have already seen such a description of the Beta distribution as a pdf-channel in (2).
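For intuition, state transformation along a pdf-channel can be approximated numerically. The sketch below (illustrative, with a uniform state on [0, 1] and a Bernoulli channel) computes c ≫ ω by a midpoint rule:

```python
def bernoulli(x):
    """Channel [0,1] -> {0,1}: the Bernoulli distribution with bias x."""
    return {1: x, 0: 1 - x}

def push_pdf(channel, pdf, grid, step):
    """Approximate (c >> omega)(y) = integral of c(x)({y}) * f(x) dx, for a
    channel with finite codomain and a state omega given by the pdf f."""
    out = {}
    for x in grid:
        w = pdf(x) * step
        for y, p in channel(x).items():
            out[y] = out.get(y, 0.0) + w * p
    return out

uniform = lambda x: 1.0                          # the uniform pdf on [0, 1]
grid = [(k + 0.5) / 1000 for k in range(1000)]   # midpoints of 1000 subintervals
pred = push_pdf(bernoulli, uniform, grid, 1 / 1000)
# A uniform prior over the bias predicts heads with probability 1/2.
assert abs(pred[1] - 0.5) < 1e-9
```

The midpoint rule is exact here because the integrand x ↦ x is linear; for general pdf-channels it is only an approximation.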
(In these pdf-channels X → Y, we use a collection of pdf's u(x, −) which are all dominated by the Lebesgue measure. This domination happens via the relationship of absolute continuity, using the Radon-Nikodym Theorem, see e.g. Panangaden (2009).)

Bayesian inversion in string diagrams
In this paper, we make superficial use of string diagrams to graphically represent sequential and parallel composition of channels, mainly in order to provide an intuitive visual overview. We refer to Selinger (2011) for mathematical details and mention here only the essentials.
A channel X → Y, for instance of the sort discussed in the previous section, can be written as a box with an input wire labelled X and an output wire labelled Y, with information flowing upwards from X to Y. Composition of channels, as in (4) or (8), simply involves connecting wires (of the same type). The identity channel is just a wire. We use a triangle notation for a state on X, drawn with a single outgoing wire labelled X. It is a special case of a channel, namely of the form 1 → X with trivial singleton domain 1.
In the present (probabilistic) setting, we allow copying of wires, written diagrammatically as a black dot with one incoming and two outgoing wires. We briefly describe such copy channels Δ : X → X × X for discrete and continuous probability: in the discrete case Δ(x) = 1|x, x⟩, and in the continuous case Δ(x)(M × N) = 1_M(x) · 1_N(x). After such a copy we can use parallel channels. We briefly describe how this works, first in the discrete case. For channels c : X → D(Y) and d : A → D(B), the parallel channel c ⊗ d : X × A → D(Y × B) is given by (c ⊗ d)(x, a)(y, b) = c(x)(y) · d(a)(b). Similarly, in the continuous case, for channels c : X → G(Y) and d : A → G(B), the parallel channel c ⊗ d : X × A → G(Y × B) is determined by (c ⊗ d)(x, a)(M × N) = c(x)(M) · d(a)(N). Recall that the product σ-algebra on Y × B is generated by measurable rectangles of the form M × N, for M ∈ Σ_Y and N ∈ Σ_B. Hence measures are uniquely determined by their actions on such rectangles.

We can now formulate what Bayesian inversion is. The definition is couched in purely diagrammatic language, but is applied only to probabilistic interpretations in this paper: a Bayesian inversion of a channel c : X → Y with respect to a state ω on X is a channel c†_ω : Y → X satisfying

⟨id, c⟩ ≫ ω = ⟨c†_ω, id⟩ ≫ (c ≫ ω).   (10)
The dagger notation c†_ω is copied from Clerc et al. (2017), see also Cho and Jacobs (2019). There the state ω is left implicit, via a restriction to a certain comma category of kernels. In that setting the operation (−)† is functorial and forms a dagger category (see e.g. Abramsky and Coecke (2009), Selinger (2007) for definitions). In particular, it preserves composition and identities of channels. Equation (10) states ⟨id, c⟩ ≫ ω = ⟨c†_ω, id⟩ ≫ (c ≫ ω). Alternatively, in the discrete case, with variables explicit, it says: ω(x) · c(x)(y) = (c ≫ ω)(y) · c†_ω(y)(x), for all x ∈ X and y ∈ Y. It is relatively easy to define Bayesian inversion in discrete probability theory: for a channel c : X → D(Y) and a state/distribution ω ∈ D(X) one can define a channel c†_ω : Y → D(X) as:

c†_ω(y) = ∑_x ( ω(x) · c(x)(y) / ∑_{x′} ω(x′) · c(x′)(y) ) |x⟩,   (11)

assuming that the denominator is non-zero. This corresponds to the familiar formula P(B | A) = P(A, B)/P(A) for conditional probability. The state c†_ω(y) can alternatively be defined via updating the state ω with the point predicate 1_{y}, transformed via c into a predicate on X, see Cho and Jacobs (2019) for details.
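Formula (11), together with the pointwise reading of Equation (10), can be checked mechanically in the discrete case; the two-point prior over coin biases below is illustrative:

```python
def invert(channel, state):
    """Discrete Bayesian inversion c†_ω of Equation (11), as a channel Y -> D(X)."""
    def dagger(y):
        evidence = sum(px * channel(x).get(y, 0.0) for x, px in state.items())
        return {x: px * channel(x).get(y, 0.0) / evidence for x, px in state.items()}
    return dagger

flip = lambda r: {1: r, 0: 1 - r}      # Bernoulli channel on a two-point bias space
omega = {0.2: 0.5, 0.9: 0.5}           # illustrative prior over two candidate biases

# Pointwise form of Equation (10):
# omega(x) * c(x)(y) == (c >> omega)(y) * c†_ω(y)(x), for all x and y.
for y in (0, 1):
    predicted = sum(px * flip(x)[y] for x, px in omega.items())
    posterior = invert(flip, omega)(y)
    for x, px in omega.items():
        assert abs(px * flip(x)[y] - predicted * posterior[x]) < 1e-12
```

The loop verifies that the joint of prior and channel factors through the predictive state and the inverted channel, which is exactly what (10) demands.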
The situation is much more difficult in continuous probability theory, since Bayesian inversions may not exist (Ackerman et al. 2011; Stoyanov 2014) or may be determined only up to measure zero. But when restricted to e.g. standard Borel spaces, as in Clerc et al. (2017), existence is ensured, see also Culbertson and Sturtz (2014) and Faden (1985). Another common solution is to assume that we have a pdf-channel: there is a map u : X × Y → R_{≥0} that defines a channel c : X → G(Y), like in (9), as c(x)(N) = ∫_N u(x, y) dy. Then, for a distribution ω ∈ G(X) we can take as Bayesian inversion:

c†_ω(y)(M) = ∫_M u(x, y) ω(dx) / ∫ u(x, y) ω(dx).   (12)

This definition satisfies the inversion Equation (10); the proof uses the calculation rules from the Appendix.

Conjugate priors
We now come to the core of this paper. As described in the introduction, the informal definition says that a class of distributions is conjugate prior to a statistical model if the associated posteriors are in the same class of distributions. The posteriors can be computed via Bayesian inversion (12) of the statistical model. This definition of 'conjugate prior' is a bit vague, since it loosely talks about 'classes of distributions', without further specification. As described in 'Idea 1' in Section 2, we interpret a 'class of states on X' as a channel P → X, where P is the type of parameters of the class.
We have already seen this channel-based description for the class of Beta distributions, in (1) and (2). Another example is the class of (normal) distributions Norm(μ, σ), where μ is the mean parameter and σ is the standard deviation parameter, giving a channel of the form Norm : R × R_{>0} → R. It is determined by its value on a measurable subset M ⊆ R as the standard integral:

Norm(μ, σ)(M) = ∫_M (1 / (σ · √(2π))) · e^{−(x−μ)² / 2σ²} dx.   (13)

Given a channel c : P → X, we shall look at states c(p), for parameters p ∈ P, as priors. The statistical model, for which these c(p)'s will be described as conjugate priors, goes from X to some other object O of 'observations'. Thus our starting point is a statistical model consisting of a pair of (composable) channels of the form:

c : P → X and d : X → O.   (15)

Such a pair of composable channels may be seen as a two-stage hierarchical Bayesian model, in which we standardly use O = observables, X = parameters of the statistical model, and P = parameters over the parameters, i.e. hyperparameters, see e.g. Bernardo and Smith (2000). There, esp. in Definition 5.6 of conjugate priorship, one can also distinguish two channels, written as p(θ | τ) and p(x | θ), corresponding respectively to our channels c and d. The τ form the hyperparameters.
In this setting we come to our main definition that formulates the notion of conjugate prior in an abstract manner, avoiding classes of distributions. It contains the crucial equation that was missing in the informal description in Section 2.
All our examples of (conjugate prior) channels are maps in the Kleisli category of the Giry monad, but the formulation applies more generally. In fact, abstraction purifies the situation and shows the essentials. The definition below speaks of 'deterministic' channels, in parentheses. This part will be explained later on, at the beginning of Section 6. It can be ignored for now.
Definition 5.1. Let c : P → X and d : X → O be composable channels, as in (15). We call c a conjugate prior to d if there is a ('deterministic') channel h : P × O → P such that the following equation between channels P → X ⊗ O holds:

⟨id, d⟩ ∘ c = ((c ∘ h) ⊗ id) ∘ (id ⊗ Δ) ∘ ⟨id, d ∘ c⟩,   (16)

where Δ is the copy channel and ⟨−, −⟩ the graph-channel notation from the Appendix.

The idea is that the map h : P × O → P translates parameters, with an observation from O as additional argument. Informally, one gets a posterior state c(h(p, y)) from the prior state c(p), given the observation y ∈ O. The power of this 'analytic' approach is that it involves simple recomputation of parameters, instead of more complicated updating of entire states. This will be illustrated in several standard examples below.
The above Equation (16) is formulated in an abstract manner – which is its main strength. We will derive an alternative formulation of Equation (16) for pdf-channels c = u and d = v. Evaluating both sides of (16) at a parameter p ∈ P on a measurable rectangle M × N gives an equation between two integrals over N. In order to prove this equation, it suffices to prove that the two functions under the outer integral over N are equal, that is, it suffices to prove for each y ∈ O,

∫_M u(p, x) · v(x, y) dx = ( ∫_M u(h(p, y), x) dx ) · ( ∫ u(p, x′) · v(x′, y) dx′ ),   (17)

for all measurable subsets M ⊆ X.

This formulation will be used in the examples below.
Proof. We extensively use the equations for integration from Section 3 and from the Appendix, in order to prove (17). For p ∈ P and measurable subsets M ⊆ X and N ⊆ O, the left-hand side of Equation (16) gives the left-hand side of (17):

(⟨id, d⟩ ∘ c)(p)(M × N) = ∫_M d(x)(N) · u(p, x) dx = ∫_N ( ∫_M u(p, x) · v(x, y) dx ) dy.

Unravelling the right-hand side of (16) is a bit more work; at p on M × N it equals:

∫_N c(h(p, y))(M) (d ∘ c)(p)(dy) = ∫_N ( ∫_M u(h(p, y), x) dx ) · ( ∫ u(p, x′) · v(x′, y) dx′ ) dy.

By combining this outcome with the earlier one, we get the desired Equation (17).
Since M ⊆ X is arbitrary, Equation (17) amounts to the pointwise formulation, for x ∈ X and y ∈ O,

u(h(p, y), x) = u(p, x) · v(x, y) / ∫ u(p, x′) · v(x′, y) dx′,   (18)

in which the denominator

∫ u(p, x) · v(x, y) dx   (19)

plays the role of a normalisation constant. Equation (18) now strongly resembles Equation (12) for Bayesian inversion. This connection will be established more generally in Theorem 6.3. Essentially, the above normalisation fraction (19) occurs in Bernardo and Smith (2000, Definition 5.6).
We are now ready to review some standard examples. The first one describes the structure underlying the coin example in Section 2.
Example 5.3. It is well known that the Beta distributions are conjugate prior to the Bernoulli 'flip' likelihood function. We shall reformulate this fact following the pattern of Definition 5.1, with two composable channels, as in (15), namely:

Beta : N_{>0} × N_{>0} → [0, 1] and Flip : [0, 1] → 2.

The Beta channel is as in (1), but now restricted to the positive natural numbers N_{>0}. We recall that the normalisation constant is then B(α, β) = (α − 1)! · (β − 1)! / (α + β − 1)!. The Flip channel sends a probability r ∈ [0, 1] to the Bernoulli(r) distribution, which can also be written as a discrete distribution Flip(r) = r|1⟩ + (1 − r)|0⟩. More formally, as a Kleisli map [0, 1] → G(2) it is, for a subset N ⊆ 2,

Flip(r)(N) = ∫_N v(r, i) di, with pdf v(r, i) = r^i · (1 − r)^{1−i}.

The i in di refers here to the counting measure. The parameter translation function h : (N_{>0} × N_{>0}) × 2 → N_{>0} × N_{>0} increments α by one for observation i = 1 (head), and increments β by one for i = 0 (tail).
Thus, in one formula, h(α, β, i) = (α + i, β + (1 − i)). We prove Equation (18) for c = Beta = u and d = Flip = v. We start from its right-hand side, for an arbitrary i ∈ 2:

u(α, β)(x) · v(x, i) / ∫ u(α, β)(x′) · v(x′, i) dx′
  = ( x^{α−1} · (1 − x)^{β−1} · x^i · (1 − x)^{1−i} / B(α, β) ) / ( ∫ x′^{α+i−1} · (1 − x′)^{β−i} dx′ / B(α, β) )
  = x^{α+i−1} · (1 − x)^{β−i} / B(α + i, β + (1 − i))
  = u(α + i, β + (1 − i))(x)
  = u(h(α, β, i))(x).

The latter expression is the left-hand side of (18). We see that the essence of the verification of the conjugate prior equation is the shifting of functions and normalisation factors. This is a general pattern.
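The same verification can be done numerically, as an illustrative sanity check; the grid sum below approximates the normalisation integral in (18):

```python
import math

# Numerical check of Equation (18) for c = Beta, d = Flip:
# u(h(p, y), x) == u(p, x) * v(x, y) / integral of u(p, x') * v(x', y) dx'.

def beta_pdf(a, b, x):
    return math.gamma(a + b) / (math.gamma(a) * math.gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

def flip_pmf(x, i):              # v(x, i) = x^i * (1 - x)^(1 - i)
    return x if i == 1 else 1 - x

def h(a, b, i):                  # parameter translation for Beta/Flip
    return (a + i, b + 1 - i)

a, b = 3, 2                      # illustrative prior parameters
xs = [k / 10000 for k in range(1, 10000)]
for i in (0, 1):
    norm = sum(beta_pdf(a, b, x) * flip_pmf(x, i) for x in xs) / len(xs)
    for x in (0.1, 0.5, 0.9):
        lhs = beta_pdf(*h(a, b, i), x)
        rhs = beta_pdf(a, b, x) * flip_pmf(x, i) / norm
        assert abs(lhs - rhs) < 1e-2
```

Both sides are the Beta(α + i, β + 1 − i) density; the tolerance absorbs the error of the Riemann-sum integral.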
In the remainder of this section, we review how some of the standard examples fit in the current setting, without each time doing the entire 'big' calculation to verify Equation (18).
Example 5.4. In a similar way, one verifies that the Beta channel is a conjugate prior to the binomial channel. For the latter we fix a natural number n > 0, and consider the two channels:

Beta : N_{>0} × N_{>0} → [0, 1] and Binom_n : [0, 1] → {0, 1, . . . , n}.

The binomial channel Binom_n is defined for r ∈ [0, 1] and M ⊆ {0, 1, . . . , n} as:

Binom_n(r)(M) = ∑_{k∈M} (n choose k) · r^k · (1 − r)^{n−k}.

The conjugate prior property requires in this situation a parameter translation function h : (N_{>0} × N_{>0}) × {0, 1, . . . , n} → N_{>0} × N_{>0}; it is given by h(α, β, k) = (α + k, β + n − k). Here is another well-known conjugate prior relationship, namely between Dirichlet and 'multinomial' distributions. The latter are simply called discrete distributions in the present context.
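A numerical sketch of Example 5.4, with illustrative parameters: the parameter-updated Beta density matches the renormalised product of prior and binomial likelihood.

```python
import math

def beta_pdf(a, b, x):
    return math.gamma(a + b) / (math.gamma(a) * math.gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

def binom_pmf(n, x, k):
    return math.comb(n, k) * x**k * (1 - x)**(n - k)

n, a, b, k = 5, 2, 3, 4                      # observe k heads out of n tosses
xs = [t / 5000 for t in range(1, 5000)]
norm = sum(beta_pdf(a, b, x) * binom_pmf(n, x, k) for x in xs) / len(xs)
for x in (0.2, 0.6, 0.8):
    assert abs(beta_pdf(a + k, b + n - k, x)
               - beta_pdf(a, b, x) * binom_pmf(n, x, k) / norm) < 1e-2
```

Here both sides compute the Beta(α + k, β + n − k) density, confirming the parameter translation h(α, β, k) = (α + k, β + n − k) on this grid.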
Example 5.5. Here we shall identify a number n ∈ N with the n-element set {0, 1, . . . , n − 1}. We then write D*(n) for the set of n-tuples (x₀, . . . , x_{n−1}) of positive probabilities with ∑ᵢ xᵢ = 1. For a fixed n > 0, let O = {y₀, . . . , y_{n−1}} be a set of 'observations'. We consider the following two channels:

Dir_n : (R_{>0})ⁿ → D*(n) and Mult : D*(n) → O.

The multinomial channel is defined as Mult(x₀, . . . , x_{n−1}) = x₀|y₀⟩ + · · · + x_{n−1}|y_{n−1}⟩. The Dirichlet channel Dir_n is more complicated: for an n-tuple α = (α₀, . . . , α_{n−1}) it is given via pdf's d_n, in:

d_n(α)(x₀, . . . , x_{n−1}) = ( Γ(∑ᵢ αᵢ) / ∏ᵢ Γ(αᵢ) ) · ∏ᵢ xᵢ^{αᵢ−1},

for (x₀, . . . , x_{n−1}) ∈ D*(n). The operation Γ is the 'Gamma' function, which is defined on natural numbers k ≥ 1 as Γ(k) = (k − 1)!. The parameter translation function h : (R_{>0})ⁿ × O → (R_{>0})ⁿ increments the i-th parameter by one upon observation yᵢ, that is, h(α, yᵢ) = (α₀, . . . , αᵢ + 1, . . . , α_{n−1}).

Example 5.6. The Gamma distribution is conjugate prior to the Poisson distribution, in:

Gam : R_{>0} × R_{>0} → R_{>0} and Pois : R_{>0} → N.

The parameter translation function h : (R_{>0} × R_{>0}) × N → R_{>0} × R_{>0} is given by h(α, β, k) = (α + k, β + 1). We include one more example, illustrating that normal channels are conjugate priors to themselves. This fact is also well known.
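Before turning to that last example, the Gam-Pois update rule of Example 5.6 can be checked in the same numerical style. The sketch below uses illustrative parameters and takes Gam(α, β) in the rate parametrisation, with density β^α · x^{α−1} · e^{−βx} / Γ(α):

```python
import math

def gamma_pdf(a, b, x):          # rate parametrisation
    return b**a * x**(a - 1) * math.exp(-b * x) / math.gamma(a)

def poisson_pmf(lam, k):
    return lam**k * math.exp(-lam) / math.factorial(k)

a, b, k = 2.0, 1.0, 3            # observe the count k under a Gamma(a, b) prior
xs = [t / 1000 for t in range(1, 30000)]     # grid on (0, 30) for the evidence integral
norm = sum(gamma_pdf(a, b, x) * poisson_pmf(x, k) for x in xs) * 0.001
for x in (0.5, 2.0, 5.0):
    assert abs(gamma_pdf(a + k, b + 1, x)
               - gamma_pdf(a, b, x) * poisson_pmf(x, k) / norm) < 1e-2
```

Both sides evaluate the Gamma(α + k, β + 1) density, confirming h(α, β, k) = (α + k, β + 1) up to grid error.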
Example 5.7. The channel Norm is described in (13); it is used twice here, the second time with a fixed standard deviation ν, for 'noise':

Norm : R × R_{>0} → R and Norm(−, ν) : R → R.

This second channel is typically used for observation, as in Kalman filtering, for which a fixed noise level can be assumed. In this situation, the first normal channel Norm is a conjugate prior to the second channel Norm(−, ν), via the parameter translation function

h(μ, σ, y) = ( (μ · ν² + y · σ²) / (σ² + ν²), σ · ν / √(σ² + ν²) ).

The state description is a special case of the channel description since a state on X is a channel 1 → X and copying on the trivial (final) object 1 does nothing, up to isomorphism.
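The Norm-Norm translation can likewise be checked on a grid, with illustrative numbers; the standard posterior parameters are h(μ, σ, y) = ((μν² + yσ²)/(σ² + ν²), σν/√(σ² + ν²)):

```python
import math

def norm_pdf(mu, sigma, x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def h(mu, sigma, nu, y):
    """Posterior (mean, std) after observing y through Norm(-, nu)."""
    s2, n2 = sigma ** 2, nu ** 2
    return ((mu * n2 + y * s2) / (s2 + n2), sigma * nu / math.sqrt(s2 + n2))

mu, sigma, nu, y = 0.0, 2.0, 1.0, 3.0
mu2, sigma2 = h(mu, sigma, nu, y)

xs = [-10 + t / 500 for t in range(10001)]   # grid on [-10, 10], step 0.002
norm = sum(norm_pdf(mu, sigma, x) * norm_pdf(x, nu, y) for x in xs) * 0.002
for x in (0.0, 2.0, 4.0):
    assert abs(norm_pdf(mu2, sigma2, x)
               - norm_pdf(mu, sigma, x) * norm_pdf(x, nu, y) / norm) < 1e-3
```

The renormalised prior-times-likelihood is again a normal density, with the translated parameters; the grid only approximates the evidence integral.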

Conjugate priors form Bayesian inversions
Definition 6.1. A channel c : X → Y is called deterministic if it commutes with copying, that is, if Δ ∘ c = (c ⊗ c) ∘ Δ. A state ω on X is called deterministic if it is deterministic as a channel 1 → X, that is, if Δ ≫ ω = ω ⊗ ω.

Few channels (or states) are deterministic. In discrete and in continuous probability, the ordinary functions f : X → Y are deterministic, when considered as a channel η ∘ f. We check this explicitly for point states, since this is what we need later on.
Example 6.2. Let x be an element of a measurable space X. The associated point state η(x) ∈ G(X) is deterministic, where η(x)(M) = 1_M(x). We check the equation on the right in Definition 6.1, on a measurable rectangle M × N:

(Δ ≫ η(x))(M × N) = 1_M(x) · 1_N(x) = (η(x) ⊗ η(x))(M × N).

We now come to the main result.

Theorem 6.3. Let the channel c : P → X be a conjugate prior to the channel d : X → O, via the parameter translation function h : P × O → P. Then, for each parameter p ∈ P, the channel c ∘ h(p, −) : O → X is a Bayesian inversion of d with respect to the state c ≫ η(p), that is, c ∘ h(p, −) = d†_{c ≫ η(p)}.

Proof. We have to prove Equation (10), for channel d and state c ≫ η(p), with the channel c ∘ h(p, −) playing the role of Bayesian inversion d†_{c ≫ η(p)}. This is easiest to see graphically, using that the state η(p) is deterministic and thus commutes with copiers, see the equation on the right in Definition 6.1.

Multiple updates

Multiple observations can be processed via a multiple parameter update function h_m : P × O^m → P, satisfying the evident m-fold analogue (21) of Equation (16). As illustration, consider m Bernoulli observations y₁, . . . , y_m ∈ 2, with prior distribution Beta(α, β). We use ∑ᵢ yᵢ as the number of 1's among the observations yᵢ and ∑ᵢ (1 − yᵢ) as the number of 0's, and then define as (multiple) parameter update function:

h_m(α, β, y₁, . . . , y_m) = (α + ∑ᵢ yᵢ, β + ∑ᵢ (1 − yᵢ)).

Equation (21) then holds: using a reformulation in the style of (18), and with pdf's u, v as in Example 5.3, it suffices to prove:

u(h_m(α, β, y))(x) = u(α, β)(x) · ∏ᵢ v(x, yᵢ) / ∫ u(α, β)(x′) · ∏ᵢ v(x′, yᵢ) dx′.

We see that for this multiple updating, we do not need the individual observations y₁, . . . , y_m ∈ 2, but only the sums ∑ᵢ yᵢ and ∑ᵢ (1 − yᵢ) = m − ∑ᵢ yᵢ. In many such cases the situation can be simplified via factorisation, using the notion of sufficient statistic, see e.g. Koopman (1936) and Bishop (2006). We adapt it to the current setting. Returning to the situation in (21): a sufficient statistic consists of a space Z together with functions t_m : O^m → Z and k : P × Z → P through which the multiple update functions factorise, as h_m = k ∘ (id × t_m) in the triangle (23). In the presence of a sufficient statistic, the Bayesian inversion from (21) factorises accordingly.
The factorisation in the above triangle (23) may satisfy an obvious universal property once we add the requirement that the functions t m satisfy a certain invariance property. The latter property does not seem to be uniform across examples, as will be illustrated below.
Example 7.2. We mention sufficient statistics for some of the examples from Section 5.
(2) For the Gam-Pois conjugate priorship in Example 5.6, we have P = R_{>0} × R_{>0} and O = N as spaces of parameters and observations, and h_m : P × O^m → P is defined by h_m(α, β, y) = (α + ∑ᵢ yᵢ, β + m). The sufficient statistic has the same Z = N × N and k(α, β, z, n) = (α + z, β + n), much as in the previous point, but with t_m : N^m → N × N given by t_m(y) = (∑ᵢ yᵢ, m).
(3) For the Norm-Norm conjugate priorship from Example 5.7, we have P = R × R_{>0} and O = R. We can take Z = R_{>0} × N, with suitable functions t_m and k. The proof that this works is laborious but essentially straightforward. Universality is obtained if we use the same invariance property as in the previous point.
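As a concrete sketch of a sufficient statistic at work: for the Beta/Flip case, batched updating via the counts agrees with sequential single-observation updates (the observation sequence below is illustrative):

```python
def h(a, b, i):
    """Single-observation Beta/Flip update from Example 5.3."""
    return (a + i, b + 1 - i)

def h_m(a, b, ys):
    """Batched update: only the sufficient statistic (number of 1's, length) is used."""
    ones = sum(ys)
    return (a + ones, b + len(ys) - ones)

ys = [1, 0, 0, 1, 1, 0, 1]
a, b = 1, 1
for i in ys:                        # update one observation at a time
    a, b = h(a, b, i)
assert (a, b) == h_m(1, 1, ys)      # both give (5, 4)
```

This is the factorisation h_m = k ∘ (id × t_m) in miniature: h_m only consults t_m(y) = (∑ᵢ yᵢ, m), never the individual yᵢ.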

Conclusions
This paper contains a novel view on conjugate priors, using the concept of channel in a systematic manner. It introduces a precise definition for conjugate priorship, using a pair of composable channels P → X → O and a parameter translation function P × O → P, satisfying a non-trivial equation between channels.

A.2 Parallel channels and pdf's

If c = u and d = w are pdf-channels, then the parallel channel c ⊗ d is the pdf-channel given by (x, a, y, b) ↦ u(x, y) · w(a, b).

A.3 Graph channels and pdf's
For a channel c : X → G(Y), we can form 'graph' channels ⟨id, c⟩ = (id ⊗ c) ∘ Δ : X → G(X × Y) and ⟨c, id⟩ = (c ⊗ id) ∘ Δ : X → G(Y × X). For x ∈ X we have:

⟨id, c⟩(x)(M × N) = 1_M(x) · c(x)(N) and ⟨c, id⟩(x)(N × M) = c(x)(N) · 1_M(x).

If c = u and ω = f is a state on X, then:

(⟨id, c⟩ ≫ ω)(M × N) = ∫_M f(x) · ( ∫_N u(x, y) dy ) dx.