This chapter presents some applications of combinatorics on words to number theory, with a brief incursion into physics. These examples have a common feature: the notion of a morphism of the free monoid. Such morphisms have been widely studied in combinatorics on words; they generate infinite words that can be considered highly ordered, and that occur ubiquitously in mathematics, theoretical computer science, and theoretical physics.
The first part of this chapter is devoted to the notion of automatic sequences and uniform morphisms, in connection with the transcendence of formal power series with coefficients in a finite field. Namely, it is possible to characterize the algebraicity of these series in a simple way: a formal power series is algebraic if and only if the sequence of its coefficients is automatic, that is, if it is the image under a letter-to-letter map of a fixed point of a uniform morphism. This criterion is known as Christol's theorem. A central tool in the study of automatic sequences is the notion of the kernel of an infinite word (sequence) over a finite alphabet: this is the set of subsequences obtained by certain decimations. A rephrasing of Christol's theorem is that transcendence of a formal power series over a finite field is equivalent to infiniteness of the kernel of the sequence of its coefficients; this will be illustrated in this chapter.
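As an illustrative sketch (not taken from the chapter), the k-kernel can be explored computationally by comparing finite prefixes of the decimated subsequences n ↦ a(kⁱn + j). For the Thue–Morse word, whose nth letter is the parity of the number of 1-bits of n, the 2-kernel contains only the word itself and its complement; the function names and the prefix length used for comparison below are ad hoc choices:

```python
def thue_morse(n):
    # t(n) = parity of the number of 1 bits in the binary expansion of n
    return bin(n).count("1") % 2

def kernel(seq, k, depth, prefix_len=64):
    # Collect the distinct decimated subsequences n -> seq(k**i * n + j),
    # 0 <= j < k**i, i <= depth, each identified by a finite prefix
    # (a finite approximation to the true k-kernel).
    seen = set()
    for i in range(depth + 1):
        p = k ** i
        for j in range(p):
            seen.add(tuple(seq(p * n + j) for n in range(prefix_len)))
    return seen

# The Thue-Morse word is 2-automatic: its 2-kernel has only two
# elements, the word itself and its letter-for-letter complement.
print(len(kernel(thue_morse, 2, depth=5)))  # -> 2
```

Here the decimations t(2n) = t(n) and t(2n+1) = 1 − t(n) close up after one step, which is why the set has exactly two elements.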
This chapter illustrates the use of words to derive enumeration results and algorithms for sampling and coding.
Given a family C of combinatorial structures, endowed with a size such that the subset Cn of objects of size n is finite, we consider three problems:
(i) Counting: determine, for all n ≥ 0, the cardinality Card(Cn) of the set Cn of objects of size n.
(ii) Sampling: design an algorithm RandC that, for any n, produces a random object chosen uniformly in Cn; in other words, the algorithm must satisfy P(RandC(n) = O) = 1/Card(Cn) for any object O ∊ Cn.
(iii) Optimal coding: construct a function φ that maps objects of C injectively to words of {0, 1}* in such a way that an object O of size n is coded by a word φ(O) of length roughly bounded above by log2 Card(Cn).
These three problems have in common an enumerative flavour, in the sense that they are immediately solved if a list of all objects of size n is available. However, since the families in which we are interested generally contain exponentially many objects of size n, this solution is in no way satisfactory. For a wide class of so-called decomposable combinatorial structures, including unambiguous algebraic languages, algorithms with polynomial complexity can be derived from the rather systematic recursive method. Our aim is to explore classes of structures for which an even tighter link exists between counting, sampling, and coding.
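To make the recursive method concrete, here is a small sketch (ours, not the chapter's) for Dyck words, i.e. balanced words of parentheses. Every nonempty Dyck word decomposes uniquely as "(" w1 ")" w2, which gives both the counting recurrence (the Catalan numbers) and a uniform sampler that draws the size of w1 with the appropriate probability:

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n):
    # Number of Dyck words with n pairs of parentheses (the Catalan
    # numbers), via the unique decomposition w = "(" w1 ")" w2.
    if n == 0:
        return 1
    return sum(count(k) * count(n - 1 - k) for k in range(n))

def sample(n):
    # Uniform sampling by the recursive method: choose the size k of w1
    # with probability count(k) * count(n - 1 - k) / count(n).
    if n == 0:
        return ""
    r = random.randrange(count(n))
    for k in range(n):
        c = count(k) * count(n - 1 - k)
        if r < c:
            return "(" + sample(k) + ")" + sample(n - 1 - k)
        r -= c

print(count(5))   # -> 42
print(sample(5))  # a uniformly random balanced word of 5 pairs
```

Both routines run in time polynomial in n once the counts are memoized, which is the general pattern behind the recursive method for decomposable structures.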
This chapter introduces various mathematical models and combinatorial algorithms that are used to infer network expressions that appear repeated in a word or are common to a set of words, where by network expression is meant a regular expression without Kleene closure over the alphabet of the input word(s). A network expression over such an alphabet is therefore any expression built up from concatenation and union operators. For example, the expression A(C + G)T concatenates A with the union (C + G) and with T. Inferring network expressions means discovering such expressions when they are initially unknown: the only input is the word(s) in which the repeated (or common) expressions are sought. This is in contrast with another problem, with which we shall not be concerned, of searching for a known expression in a word (or words), where both the expression and the word(s) are part of the input. The inference of network expressions has many applications, notably in molecular biology, system security, and text mining. Because of the richness of the mathematical and algorithmic problems posed by molecular biology, we concentrate on applications in this area. The network expressions considered may therefore contain spacers, where by spacer is meant any number of don't-care symbols (a don't care is a symbol that matches anything). Constrained spacers are consecutive don't-care symbols whose number ranges over a fixed interval of values. Network expressions with don't-care symbols but no spacers are called “simple”, while network expressions with spacers are called “flexible” if the spacers are unconstrained, and “structured” otherwise.
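A simple network expression such as A(C + G)T can be represented as a sequence of letter sets, one per position. The brute-force sketch below (ours, purely for illustration; the chapter's algorithms are far more sophisticated) enumerates all short expressions whose positions allow at most two letters and keeps those occurring in every input word:

```python
from itertools import product

# A simple network expression is a sequence of nonempty sets of letters:
# A(C + G)T becomes [{'A'}, {'C', 'G'}, {'T'}].
def occurs(expr, word):
    # Does the expression match at some position of the word?
    m = len(expr)
    return any(all(word[i + j] in s for j, s in enumerate(expr))
               for i in range(len(word) - m + 1))

def infer(words, length, alphabet="ACGT"):
    # Brute-force inference: enumerate every expression of the given
    # length whose positions allow one or two letters, and keep those
    # occurring in all input words.  Exponential in the length.
    sets = [frozenset(c) for c in alphabet]
    sets += [frozenset(p) for p in product(alphabet, repeat=2)
             if p[0] < p[1]]
    return [expr for expr in product(sets, repeat=length)
            if all(occurs(expr, w) for w in words)]

exprs = infer(["ACTTG", "AGTCC"], length=3)
# A(C + G)T occurs in both input words:
print((frozenset("A"), frozenset("CG"), frozenset("T")) in exprs)  # -> True
```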
Repetitions (periodicities) in words are important objects that play a fundamental role in combinatorial properties of words and their applications to string processing, such as compression or biological sequence analysis. Using properties of repetitions allows one to speed up pattern matching algorithms.
The problem of efficiently identifying repetitions in a given word is one of the classical pattern matching problems. Recently, searching for repetitions in strings has received new motivation from biosequence analysis. In DNA sequences, successively repeated fragments often bear important biological information, and their presence is characteristic of many genomic structures (such as telomeric regions). From a practical viewpoint, satellites and Alu repeats are involved in chromosome analysis and genotyping, and are thus of major interest to genomic researchers. Various biological studies based on the analysis of tandem repeats have therefore been carried out, and databases of tandem repeats in certain species have even been compiled.
In this chapter, we present a general efficient approach to computing different periodic structures in words. It is based on two main algorithmic techniques – a special factorization of the word and so-called longest extension functions – described in Section 8.3. Different applications of this method are described in Sections 8.4, 8.5, 8.6, 8.7, and 8.8. These sections are preceded by Section 8.2 devoted to combinatorial enumerative properties of repetitions. Bounding the maximal number of repetitions is necessary for proving complexity bounds of corresponding search algorithms.
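As a small, hedged illustration of what a longest extension function looks like (our sketch, not the chapter's algorithms), the classical Z-function records, for each position, how far the word agrees with its own suffix starting there; squares appearing as prefixes can then be read off directly:

```python
def z_function(w):
    # z[i] = length of the longest common prefix of w and w[i:], the
    # basic "longest extension function", computed in O(|w|) time.
    n = len(w)
    z = [0] * n
    z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and w[z[i]] == w[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

z = z_function("abaabaab")
print(z)  # -> [8, 0, 1, 5, 0, 1, 2, 0]
# The word has a square prefix of period p exactly when z[p] >= p:
print([p for p in range(1, len(z) // 2 + 1) if z[p] >= p])  # -> [3]
```

Here "abaaba" = "aba"·"aba" is the square prefix detected; combining such extension values with a factorization of the word is what yields the efficient repetition-finding algorithms of this chapter.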
Statistical and probabilistic properties of words in sequences have been of considerable interest in many fields, such as coding theory and reliability theory, and most recently in the analysis of biological sequences. The latter will serve as the key example in this chapter. We only consider finite words.
Two main aspects of word occurrences in biological sequences are: where do they occur, and how many times do they occur? An important problem, for instance, was to determine the statistical significance of a word's frequency in a DNA sequence. The naive idea is the following: a word may be significantly rare in a DNA sequence because it disrupts replication or gene expression (perhaps a negative selection factor), whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to avoided restriction sites, for instance in E. coli, and the Cross-over Hotspot Instigator sites in several bacteria. Identifying over- and underrepresented words in a particular genome is a very common task in genome analysis.
Statistical methods of studying the distribution of the word locations along a sequence and word frequencies have also been an active field of research; the goal of this chapter is to provide an overview of the state of this research.
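As a minimal sketch of the idea (ours, under the simplest possible null model), the expected count of a word in an i.i.d. random sequence can be compared with its observed count; the figures below for the EcoRI restriction site GAATTC and a 4.6 Mb genome are illustrative assumptions, not data from the chapter:

```python
from math import prod

def expected_count(word, seq_len, freqs):
    # Expected number of (possibly overlapping) occurrences of `word` in
    # an i.i.d. random sequence of length seq_len with letter
    # probabilities `freqs` -- the simplest null model against which
    # over- or under-representation can be judged.  (A proper z-score
    # also needs the variance, which depends on the word's overlap
    # structure.)
    m = len(word)
    return (seq_len - m + 1) * prod(freqs[c] for c in word)

def observed_count(word, seq):
    # Observed number of (overlapping) occurrences of `word` in `seq`.
    m = len(word)
    return sum(seq[i:i + m] == word for i in range(len(seq) - m + 1))

# Hypothetical figures: the palindromic EcoRI site GAATTC under a
# uniform-letter model of length 4.6e6 (roughly an E. coli genome).
e = expected_count("GAATTC", 4_600_000, {c: 0.25 for c in "ACGT"})
print(round(e))  # -> 1123
```

An observed count far below such an expectation is what flags a word like this as potentially avoided.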
Although the English mathematician Alan Mathison Turing (1912–1954) is remembered today primarily for his work in mathematical logic (Turing machines and the “Entscheidungsproblem”), machine computation, and artificial intelligence (the “Turing test”), his name is not usually thought of in connection with either probability or statistics. One of the basic tools in both of these subjects is the use of the normal or Gaussian distribution as an approximation, one basic result being the Lindeberg-Feller central limit theorem taught in first-year graduate courses in mathematical probability. No one associates Turing with the central limit theorem, but in 1934 Turing, while still an undergraduate, rediscovered a version of Lindeberg's 1922 theorem and much of the Feller-Lévy converse to it (then unpublished). This paper discusses Turing's connection with the central limit theorem and its surprising aftermath: his use of statistical methods during World War II to break key German military codes.
INTRODUCTION
Turing went up to Cambridge as an undergraduate in the Fall Term of 1931, having gained a scholarship to King's College. (Ironically, King's was his second choice; he had failed to gain a scholarship to Trinity.) Two years later, during the course of his studies, Turing attended a series of lectures on the Methodology of Science, given in the autumn of 1933 by the distinguished astrophysicist Sir Arthur Stanley Eddington. One topic Eddington discussed was the tendency of experimental measurements subject to errors of observation to often have an approximately normal or Gaussian distribution.
Laplace's rule of succession states, in brief, that if an event has occurred m times in succession, then the probability that it will occur again is (m + 1)/ (m + 2). The rule of succession was the classical attempt to reduce certain forms of inductive inference – “pure inductions” (De Morgan) or “eductions” (W. E. Johnson) – to purely probabilistic terms. Subjected to varying forms of ridicule by Venn, Keynes, and many others, it often served as a touchstone for much broader issues about the nature and role of probability.
This paper will trace the evolution of the rule, from its original formulation at the hands of Bayes, Price, and Laplace, to its generalizations by the English philosopher W. E. Johnson, and its perfection at the hands of Bruno de Finetti. By following the debate over the rule, the criticisms of it that were raised and the defenses of it that were mounted, it is hoped that some insight will be gained into the achievements and limitations of the probabilistic attempt to explain induction. Our aim is thus not purely – or even primarily – historical in nature.
As usually formulated, however, the rule of succession involves some element of the infinite in its statement or derivation. That element is not only unnecessary, it can obscure and mislead. We begin therefore by discussing the finite version of the rule, its statement, history, and derivation (sections 2–3), and then use it as a background against which to study the probabilistic analysis of induction from Bayes to de Finetti (sections 4–9).
How do Bayesians justify using conjugate priors on grounds other than mathematical convenience? In the 1920s the Cambridge philosopher William Ernest Johnson in effect characterized symmetric Dirichlet priors for multinomial sampling in terms of a natural and easily assessed subjective condition. Johnson's proof can be generalized to include asymmetric Dirichlet priors and those finitely exchangeable sequences with linear posterior expectation of success. Some interesting open problems that Johnson's result raises, and its historical and philosophical background, are also discussed.
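The linearity condition can be made concrete with a small sketch (ours, not Johnson's): under a symmetric Dirichlet(α) prior on k categories, the posterior predictive probability of category i is (nᵢ + α)/(n + kα), linear in the count nᵢ; the numerical check below verifies this against direct integration for the two-category (Beta) case:

```python
from math import gamma

def dirichlet_predictive(counts, alpha):
    # Posterior predictive probabilities under a symmetric
    # Dirichlet(alpha) prior on k categories:
    #   P(next = i | counts) = (n_i + alpha) / (n + k * alpha),
    # linear in n_i -- exactly the form that Johnson's sufficientness
    # postulate forces.  With k = 2 and alpha = 1 this is Laplace's
    # rule of succession.
    n, k = sum(counts), len(counts)
    return [(c + alpha) / (n + k * alpha) for c in counts]

def beta_posterior_mean(n1, n2, alpha, steps=20000):
    # Check for k = 2: E[p | data] under a Beta(alpha, alpha) prior is
    # the mean of the Beta(alpha + n1, alpha + n2) posterior, computed
    # here by midpoint-rule integration.
    a, b = alpha + n1, alpha + n2
    norm = gamma(a) * gamma(b) / gamma(a + b)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * x ** (a - 1) * (1 - x) ** (b - 1) * h
    return total / norm

pred = dirichlet_predictive([3, 1], alpha=1.0)
print(pred)  # approximately [2/3, 1/3]
print(abs(pred[0] - beta_posterior_mean(3, 1, 1.0)) < 1e-4)  # -> True
```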
Key words and phrases: W. E. Johnson, sufficientness postulate, exchangeability, Dirichlet prior, Rudolf Carnap.
INTRODUCTION
In 1932 a posthumously published article by the Cambridge philosopher W. E. Johnson showed how symmetric Dirichlet priors for infinitely exchangeable multinomial sequences could be characterized by a simple property termed “Johnson's sufficiency postulate” by I. J. Good (1965). (Good (1967) later shifted to the term “sufficientness” to avoid confusion with the usual statistical meaning of sufficiency.) Johnson could prove such a result, prior to the appearance of de Finetti's work on exchangeability and the representation theorem, for Johnson had himself already invented the concept of exchangeability, dubbed by him the “permutation postulate” (see Johnson, 1924, page 183). Johnson's contributions were largely overlooked by philosophers and statisticians alike until the publication of Good's 1965 monograph, which discussed and made serious use of Johnson's result.
Due perhaps in part to the posthumous nature of its publication, Johnson's proof was only sketched and contains several gaps and ambiguities; the major purpose of this paper is to present a complete version of Johnson's proof.
Laplace's rule of succession states, in brief, that the probability of an event recurring, given that it has already occurred n times in succession, is (n + 1)/ (n + 2). In his Essai philosophique sur les probabilités (1814), Laplace gave a famous, if notorious illustration of the rule: the probability of the sun's rising.
Thus we find that an event having occurred successively any number of times, the probability that it will happen again the next time is equal to this number increased by unity divided by the same number, increased by two units. Placing the most ancient epoch of history at five thousand years ago, or at 1826213 days, and the sun having risen constantly in the interval at each revolution of twenty-four hours, it is a bet of 1826214 to one that it will rise again to-morrow.
[Laplace, Essai philosophique, p. xvii]
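Laplace's figures can be checked directly; the following sketch (ours, not Laplace's) uses exact rational arithmetic:

```python
from fractions import Fraction

def rule_of_succession(n):
    # After n successes in unbroken succession, the rule gives
    # P(success on the next trial) = (n + 1) / (n + 2).
    return Fraction(n + 1, n + 2)

# Laplace's sunrise example: five thousand years of history taken as
# 1826213 days of uninterrupted sunrises.
p = rule_of_succession(1826213)
odds = p / (1 - p)  # odds in favour of the sun rising to-morrow
print(odds)  # -> 1826214
```

The odds of 1826214 to one are exactly the bet Laplace names in the passage above.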
This passage was at the center of a spirited debate for over a century about the ability of the calculus of probabilities to provide a satisfactory account of inductive inference (e.g., Keynes 1921, Chapter 30). Although the later history of this debate is well known, what is less well known, perhaps, is its history prior to the appearance of Laplace's Essai. In fact, the question whether belief in the future rising of the sun can be expressed probabilistically had been briefly alluded to by Hume in his Treatise of 1739, and had been discussed prior to the appearance of Laplace's Essai by Price, Buffon, Condorcet, Waring, Prevost, and L'Huilier (e.g., Zabell, 1988, Section 5).