Introduction
Empirical work across many research domains is commonly based on samples that were drawn from a larger population. Diversity (or complexity) is an important characteristic of such samples. An important and intuitive diversity metric is the number of unique types in an assemblage of token counts, its richness, although alternative approaches to quantifying diversity exist (Chao et al., Reference Chao, Gotelli, Hsieh, Sander, Ma, Colwell and Ellison2014, Daly et al., Reference Daly, Baetens and De Baets2018). In spite of significant registration effort and sensible sampling design, samples are typically incomplete: many types available in the original population might be missing (Chao, Reference Chao1984). The observable diversity in a sample therefore commonly underestimates the true diversity of the underlying population. Consequently, correcting or de-biasing the diversity attested in samples has important applications across many disciplines, including ecology (Magurran, Reference Magurran1988), criminology (van der Heijden et al., Reference van der Heijden, Cruyff and Böhning2014), and archaeology (Kaufman, Reference Kaufman1998). In ecology, for instance, the issue of ‘unseen species’ has long been acknowledged (Fisher et al., Reference Fisher, Corbet and Williams1943), because empirical field surveys of biodiversity tend to underestimate the true number of unique species inhabiting a given geographical area. (Animal species might be difficult to observe for a variety of reasons, for example, because they are shy, rare, or well camouflaged (Chao & Chiu, Reference Chao and Chiu2016)). A family of biostatistical methods known as ‘unseen species models’ has been developed to deal with this situation.
Recently, it has been argued that unseen species models find a relevant application in the study of cultural heritage (Kestemont et al., Reference Kestemont, Karsdorp, de Bruijn, Driscoll, Kapitan, Macháin, Sawyer, Sleiderink and Chao2022). Work in this area is, by necessity, based on the historical record consisting of the material artefacts (such as books, paintings, or statues) that still survive today. However, such cultural assemblages have commonly sustained substantial material losses (Esch, Reference Esch1985), resulting from deliberate destruction (e.g. book burning), unintended catastrophes (e.g. library fires), or more casual forms of gradual forgetting (e.g. resulting from diachronic changes in aesthetic appreciation). (A well-informed recent survey, specifically for the British Isles, has been published by (Milne, Reference Milne2025)). Because of such losses, present-day heritage collections tend to underestimate the original diversity of the past human cultures that they represent. Recent studies have shown that unseen species models can provide a valid quantitative estimate of the cultural diversity that has not been attested in the available samples. Recent applications of this approach include lost historic literature (Martynenko, Reference Martynenko2023), unregistered sailors from the East-Indian company (Wevers et al., Reference Wevers, Karsdorp and van Lottum2022), or policing bias in historical arrest data (Karsdorp et al., Reference Karsdorp, Kestemont and De Koster2024). Earlier studies in this area happened in numismatics (Esty, Reference Esty1986).
In this paper, we turn to the problem of estimating the cultural richness (diversity) that is shared by two communities. Connected communities often overlap in their cultural expression because of processes of migration or social learning (e.g. translation) that result in cultural exchange. However, in the face of severe material losses, as is typical with historical records, the reliable estimation of the original, true size of this overlap or shared richness is prohibitively difficult. Here, we focus on the surviving corpus of medieval chivalric literature in Dutch and French (ca. 1150–1450). The texts documented in the surviving assemblages show a considerable overlap, mainly because many Middle Dutch texts were derived from Old and Middle French counterparts (which were more prestigious at the time). The fact that so many of these stories, in either language, did not survive makes it particularly challenging to estimate how many they originally shared, thus clouding our view of the cultural transfer between both communities. Here, we borrow an established method from ecology and use it to estimate the number of stories that these literatures originally shared.
The structure of this paper is as follows. We first introduce at greater length the cultural phenomenon of chivalric literature in North Western Europe during the medieval period, with special attention to the material text carriers on which this literature circulated and the later loss of these artefacts. Next, we introduce the case study and associated data used for this study, i.e. counts of the (translated) chivalric texts that survive in medieval French (langue d’oïl) and Dutch (Middelnederlands). Subsequently, we introduce the Chao1 estimator and explain how it can be applied to estimate the number of lost stories in a single assemblage; we go on to discuss Chao-shared, an extension of this method to estimate the unobserved diversity shared between two assemblages. A novel contribution is the bootstrap approach proposed to obtain confidence intervals for all estimands. Finally, we present the outcome of our analyses and confront the results with prior scholarly debates, in particular on the authenticity of Middle Dutch literature.
Materials
Medieval chivalric narratives
Chivalric and heroic stories, such as the courtly romances about King Arthur or the epics about Charlemagne, were an important component of the cultural production in Europe during the Middle Ages (Krueger, Reference Krueger2000). The twelfth century witnessed the emergence of secular written literature in the continent’s vernaculars. Frequently, chivalric stories were transmitted across linguistic communities via translations or adaptations. The stories about King Arthur are a case in point: originating in Celtic folklore, these were adapted into Old French by leading authors such as Chrétien de Troyes. His texts were subsequently adapted into Middle High German, such as Wolfram von Eschenbach’s Parzival (van Oostrom, Reference van Oostrom2006). As lingua franca, the prestigious French culture was a dominant force in these developments – the romanz language even literally lent its name to the newly emerged text variety as a whole (cf. ‘romance’). Many vernacular texts across the European continent thus derived from French source materials: in the case of Middle Dutch (Middelnederlands), for instance, a meaningful share of the extant texts were in fact based on French originals.
Before the introduction of the printing press in Europe, texts were manually copied by scribes onto handwritten text carriers, such as parchment or paper codices and scrolls (Kwakkel, Reference Kwakkel2018). The number of distinct witnesses that today still survive of a medieval text is often informally treated as a proxy for the historic market demand that existed for these texts. Such count data is nevertheless dangerous to take at face value because of the significant losses which these artefacts sustained. Based on surviving inventories of medieval libraries, book historians have estimated a survival rate of only ca. 9% for the general population of medieval manuscripts (Wijsman, Reference Wijsman2010). In prior work, it has been argued that unseen species models from ecology offer a principled, quantitative framework to estimate how much of this literature has been lost (Kestemont et al., Reference Kestemont, Karsdorp, de Bruijn, Driscoll, Kapitan, Macháin, Sawyer, Sleiderink and Chao2022). Based on the number of rare, poorly attested texts (that only survive in a few witnesses), biostatistical models can be used to estimate a lower bound on the number of texts that did not survive at all. We will discuss these methods in more detail below.
Here, we extend this research to a novel aspect of lost medieval literature: the stories that were shared across two medieval literatures. We specifically focus on stories that concurrently existed in the French and Dutch literary culture of the period. When it comes to stories that were shared between French and Dutch, for instance, the following estimands should be considered:
(1) Surviving Dutch translations based on French originals, which are no longer extant;
(2) Lost Dutch translations based on French originals, which are extant;
(3) Lost Dutch translations based on French originals, which are no longer extant.
Taken together, these estimands can shed new light on the number of stories that must have been shared by two cultures, even if this is difficult to observe in the empirical data because of the historic loss of sources.
Terminology
Because we tackle complex concepts from past cultural production, it is important to resort to a maximally precise terminology. In a recent collaboration aimed at establishing an international database of medieval textual traditions, an ontology was negotiated that we try to consistently adopt throughout this paper. By a story, we shall indicate an identifiable, unique sequence of events, fictitious or not (e.g. Perceval’s quest for the Holy Grail), conceived as independent of their manifestation in texts (Culler, Reference Culler1981). Such a story can find a concrete expression in a text, which is by definition created in a specific language, through the authorship of a specific individual (e.g. Wolfram von Eschenbach’s Middle High German Parzival). Finally, we distinguish the witness, or a copy which records a concrete, material manifestation of the text (e.g. the fifteenth-century illuminated copy by Diebold Lauer’s workshop, nowadays kept under the shelfmark ‘Heidelberg, Universitätsbibliothek, Cod. Pal. germ. 339’). If a medieval witness nowadays survives as disjoint objects (or documents), such as book fragments scattered across different locales, we still count this set of documents as a single witness.
A dedicated publication (Christensen and Camps, Reference Christensen and Camps2025) discusses in more detail how these terms map onto other scholarly traditions and encoding standards, such as the Library Reference Model Object-Oriented (
$LRM_{OO}$). Whereas the story and text are intangible abstractions, witnesses (and documents) refer to material heritage objects. Note that, in our terminology, stories can be shared across multiple languages, but texts never. Important for the present contribution is that the same story might have been realised in multiple texts (e.g. in different languages) that might be related or not: in the case of a translation, the target text explicitly derives from a pre-existing source text (in another language), but both would still be considered as realisations of the same underlying story. In our modelling approach, we treat stories as unique species, and we consider the two languages in which these stories have been realised as texts as distinct assemblages. The number of witnesses, in either language, refers to the number of extant copies that we can still observe for these texts in either assemblage, analogous to the notion of species sightings in ecological terms. Shared species are therefore stories that have (observed or unobserved) realisations as texts in both languages, whether they were translated from one another or not.
Shared stories
The term ‘Middle Dutch’ refers to the amalgam of West-Germanic, vernacular dialects that were spoken in the Low Countries during the medieval period, ca. 1150–1450 (Van de Velde, Reference Van de Velde, Kürschner and Dammel2024). Written literature is extant in this language from the middle of the twelfth century onwards (van Oostrom, Reference van Oostrom2006). Typically, the Middle Dutch period is said to run until ca. 1450 (until the introduction of the movable type printing press in the Low Countries) according to conventional periodisations. An important component of the Middle Dutch literature was the ridderepiek (Goossens, Reference Goossens2005, Caers & Kestemont, Reference Caers and Kestemont2011, Caers, Reference Caers2011), a corpus of chivalric stories that still has a great deal of prestige in neerlandophone culture. Where the earliest production of this text variety was concentrated in the Low Franconian region in the east (counties like Looz and Guelders), it shifted towards the West (the county of Flanders and the duchy of Brabant) as the thirteenth century progressed (Caers & Kestemont, Reference Caers and Kestemont2011, Caers, Reference Caers2011). Ridderepiek is more sparsely documented in the northern regions of the Low Countries, such as the County of Holland. To the present day, ca. 75 texts have survived in this text variety, surviving in ca. 150 fragmentary or (near-)complete witnesses. Famous examples include the Roman van Walewein, an originally Middle Dutch Arthurian romance, the Roelantslied, an adaptation of the French Chanson de Roland, and the Historie van Troyen, which Jacob van Maerlant translated from the Roman de Troie. A foundational inventory of this corpus can be found in a repertory by Kienhorst (Kienhorst, Reference Kienhorst1988), although a number of important, new discoveries have been made since then.
Because of processes of inter-vernacular exchange, literary heritage was to a meaningful extent shared in this period. A narrow majority of the surviving Middle Dutch texts (38/72) are known to derive in some way from French sources: the French langue d’oïl enjoyed an enormous cultural prestige in this period, and many originally French texts were translated into other vernaculars, such as Middle Dutch (Morato and Schoenaers, Reference Morato and Schoenaers2019). In some cases, Dutch texts even claim to be translations from French, probably to add to their cultural prestige, even if these assertions are sometimes demonstrably false (Besamusca, Reference Besamusca and Ziegeler2008, Sleiderink, Reference Sleiderink, Kleinhenz and Busby2010a). Although a majority of the surviving Middle Dutch texts are somehow associated with French sources, the nature of this relationship can range from very faithful translations to more creative adaptations (Gerritsen, Reference Gerritsen and Stuip1988). Determining this can be challenging: what appears to be a creative adaptation of a known French text may in fact be a faithful translation of another French text that is now lost. In other cases, stories can be shared across these two literatures, even if no direct dependence has been attested. A complex case is the Middle Dutch Karel ende Elegast (Janssens, Reference Janssens2023): a very similar text must have existed in French, called the *Chanson de Basin. The Basin, however, has not survived and we only know it from other indirect evidence referencing it: this makes it difficult to assess whether the Elegast was directly based on this text. Because the protagonist carries a different name (Basin vs. Elegast), this is an example of a story that was shared, although the precise nature or the direction of the cultural exchange cannot be assessed.
For our analysis, we have reused previously collected abundance data for the chivalric and heroic texts that survive in French and Dutch (Kestemont et al., Reference Kestemont, Karsdorp, de Bruijn, Driscoll, Kapitan, Macháin, Sawyer, Sleiderink and Chao2022): this data holds the number of handwritten witnesses per text that still survive today, which we treat as observation events, or ‘sightings’ in an ecological sense. Next, we aligned the datasets and paired the texts that shared the same underlying story, typically (but not exclusively) because the Dutch text was based on a French original. Thus, we obtain text attestation frequencies for each story in both languages, whether the text is known to have been translated or not.Footnote 1 Some French originals were adapted into Dutch multiple times, such as the Lancelot en prose of which three independent adaptations in Middle Dutch survive (Gerritsen, Reference Gerritsen and Stuip1988). In such cases, the number of sightings for the Middle Dutch adaptations, even if they constitute different texts in our terminology, was collapsed into a single cumulative count. Table 1 provides basic statistics on the two assemblages considered here.
Table 1. Diversity statistics for the chivalric subcorpora and the pooled result, including the single-assemblage Chao1 estimate (with 0.95 confidence intervals) of the total number of texts

Establishing these aligned counts comes with many challenges that should be made explicit. Discretisation, or determining whether two manuscripts offer a witness of the same text can be challenging – the same is true for assessing whether two texts share the same underlying story. Interestingly, similar issues arise in ecological surveys, where it can be equally difficult to establish whether two individuals belong to the same species or not (Daly et al., Reference Daly, Baetens and De Baets2018). While many scribes produced relatively faithful copies of their exemplars, others intervened more actively in their source text, for instance, through the practice of compilation or abridgement. Such changes can be so profound that domain experts categorise the witness as a new text altogether. In general, we took the executive decision to stick with the existing categorisations of witnesses in the authoritative inventories, which we consulted. Many witnesses only survive in a fragmentary, heavily damaged form, adding to the complexity of identifying the recorded text, let alone whether the story also existed in the other vernacular (where many of the surviving witnesses are also incomplete). In the case of short fragments, we might also not be able to observe that two seemingly unrelated witnesses are in fact non-overlapping fragments of the same text. We must therefore account for the fact that this situation introduces some one-inflation in our data, or the relative over-counting of singletons (Böhning & Friedl, Reference Böhning and Friedl2024).
Intuitively, it would make sense to hypothesise a weak but positive correlation between the attestation counts of stories in either language. When we take the original, paired counts for all stories (i.e. including texts with a zero count in one of the assemblages), no positive correlation is evident (cf. Figure 1): well-attested stories in French literature are not necessarily well attested in Dutch (and vice versa). A non-parametric, directional Kendall’s
$\tau$ test revealed no significant positive correlation whatsoever (
$n=258, \tau = -0.097$,
$p = 0.97$). On the contrary, some of the texts with the highest attestation counts in Dutch are in fact not translations (e.g. Roman van Limborch), with the notable exception of the evergreen Roman de Troie / Historie van Troyen, which appears to have been equally successful in either community. Likewise, a number of extremely popular texts in French, such as the Tristan en prose, are without an empirically traceable counterpart in Dutch. It remains unclear, of course, to which extent the post-medieval loss of witnesses is responsible for this. If we limit the test to the observably shared stories, i.e. stories that have been attested at least once in both languages, a mildly positive effect emerges: (
$n=38, \tau = 0.25$,
$p = 0.02$). This suggests that, while the empirical data does not suggest an obvious (positive) correlation, the pattern of source survival might be obstructing our understanding of this shared literary heritage.

Figure 1. Text attestation patterns (i.e. the number of witnesses) for chivalric texts in Dutch and French traditions. Log-log scatter plot showing the number of witnesses per story in each language. Colours distinguish attestation patterns: Dutch-only (orange), French-only (blue), and both languages (green). Top-ranking titles in each category are annotated.
Methodology
First, we will briefly review the original Chao1 estimator for species richness in a single, ecological community, which can be extended to the problem of estimating the number of distinct species shared by two communities Chao-shared).Footnote 2
Chao1 estimator
Consider a single assemblage of types which are represented as a vector of counts (strictly positive integers). Let
$X_i$ denote the abundance or frequency of the
$i$th type in the sample:
$i = 1, 2, \ldots, S_{\text{obs}}$, where
$S_{\text{obs}}$ refers to the number of unique types observed in the sample and
$n$ the cumulative number of tokens or sightings in the sample. In the case of undersampling,
$S_{\text{obs}}$ will underestimate the true number of distinct types in the underlying population (
$S$, and the corresponding estimate
$\hat{S}$). If we denote the number of unobserved species as
$f_0$ (and the corresponding estimate
$\hat{f}_0$), the original, true diversity of the population can be estimated as
$\hat{S} = S_{\text{obs}} + \hat{f}_0$. Chao1 is an established estimator (Chao, Reference Chao1984) to obtain a universally valid lower bound of
$S$, so that
$\hat{S} = Chao1 = S_{\text{obs}} + \hat{f}_0$, where
$\hat{f}_0$ is defined in Eq. 1. The method adopts the perspective that a reliable lower bound is of more value than an inaccurate point estimate.
In prior work, we have more extensively argued the applicability of unseen species models, in particular Chao1, to abundance data that records the surviving witnesses of medieval texts.Footnote 3 Chao1 is an estimator that can, in principle, be applied to any hyper-diverse assemblage that has been severely undersampled. The estimator makes minimal theoretical assumptions: we only assume that there exists an unknown number of types (stories or texts, in the present case) in an assemblage of entities (here, witnesses). We also assume that a sample of these entities has been independently drawn from an originally much larger collection and that each entity can be correctly classified to its type (i.e. each witness can be assigned to a story). Here, independent sampling means that the outcome of any sampled entity does not affect, nor does it reveal any information about the outcome of other sampled entities. Chao1 yields a universally valid lower bound for the true number of classes; the estimator is non-parametric and makes no assumptions about the species abundance distribution, meaning that the estimator also holds when the abundance or detection rates differ wildly across species types.
In this approach, rare types in the abundance vectors are of special relevance, as it was already argued by Alan Turing that sparsely documented types carry most information about the types which were not observed at all in a sample (Chao et al., Reference Chao, Chiu, Colwell, Magnago, Chazdon and Gotelli2017). Quantities of special interest below are
$f_1$ and
$f_2$, or the number of types in a sample that have an attestation frequency of 1 and 2, respectively – also called singletons and doubletons. For a sample with
$n$ observations, Chao1 estimates the total number of species by adding the observed species richness
$S_{obs}$ to an estimated number of unseen species,
$\hat{f}_0$, as follows:
\begin{equation}
\begin{aligned}
\hat{S} = S_{obs} + \hat{f}_0 = S_{obs} + \left(\frac{n-1}{n}\right) \frac{f_1^2}{2f_2}
\end{aligned}
\end{equation} The method was originally derived as a lower bound from a moment inequality (Chao, Reference Chao1984). When
$n$ is large, removing the first term will have a negligible effect. Using a bootstrap approach, confidence intervals can be reported for these estimates to account for the considerable uncertainty in the procedure. The observation rate, or the survival ratio in the case of medieval texts, can be calculated as the following proportion:
$S_{\text{obs}} / (S_{\text{obs}} + \hat{f}_0)$. Note that this ratio becomes an upper bound (as Chao1 provides a lower bound), i.e. this ratio indicates the maximum proportion of narratives that have survived (Kestemont et al., Reference Kestemont, Karsdorp, de Bruijn, Driscoll, Kapitan, Macháin, Sawyer, Sleiderink and Chao2022).
In Table 1, we report the Chao1 estimates for the French and Dutch chivalric texts in isolation, and for the pooled data, which are visualized in Figure 2. (Note that these estimates deviate slightly from our earlier work, because the discretisation of texts follows a different logic here, for example, when the counts of parallel Dutch translations are collapsed.) In general, there is a meaningful difference in sample size across the two languages, but Chao1 suggests a very similar loss rate for both languages: in both French and Dutch, it is estimated that a maximum of roughly half of the texts are still known today. Prior research discussed the implications of this finding in the light of previous work in philology (van Oostrom, Reference van Oostrom2006, Kestemont and Karsdorp, Reference Kestemont and Karsdorp2019, Reference Kestemont and Karsdorp2020), which has often been more optimistic about the survival rate of medieval literature.

Figure 2. Single-assemblage Chao1 estimates: Dutch, French, and ‘pooled’, meaning the frequencies of each row are combined, including 0.95 confidence intervals and survival ratio for the central point estimate. Dashed horizontal lines (left) indicate observed diversity; encircled numbers reflect the debiased estimate (with CIs indicated via the horizontal line.).
Chao-shared estimator
From the aligned data, it becomes clear that some texts belong to stories which were shared across the two cultures. However, due to the heavy losses of witnesses on both sides, the present data in all likelihood severely underestimates the size of this shared repertoire (
$S_{\textrm{shared}}$). This term can be decomposed into:
The first term (
$S_{\textrm{shared}, \textrm{obs}}$) indicates the number of stories that can today still be empirically observed to have been shared, i.e. the number of stories with a non-zero attestation count in both languages. This number however should be adjusted upwards by the estimands
$f_{0,+}$,
$f_{+,0}$ and
$f_{0,0}$, which each capture their own aspect of the – now lost, or ‘latent’ – shared diversity. Assume that the French assemblage is ordered first in the collection pair, so that
${n_1}$ refers to the total number of empirical witnesses (tokens) in the French corpus and
${n_2}$ represents the cumulative number of Dutch witnesses.
$f_{0,+}$ refers to the number of shared stories which are nowadays attested in Middle Dutch witnesses, but no longer in French:
\begin{equation}
\begin{aligned}
f_{0, +} \geq \frac{(n_1 - 1)}{n_1} \frac{f^2_{1, +}}{2f_{2, +}}
\end{aligned}
\end{equation}This quantity is of great interest, because many Middle Dutch texts are assumed to be translations that can no longer easily be identified as such, because the source text no longer survives in any witnesses.
Then, there is:
$f_{+,0}$:
\begin{equation}
\begin{aligned}
f_{+, 0} \geq \frac{(n_2 - 1)}{n_2} \frac{f^2_{+, 1}}{2f_{+, 2}}
\end{aligned}
\end{equation}This is the number of French texts today, which are still attested in French witnesses, which were shared with Middle Dutch, even if no Middle Dutch witnesses are still extant.
Finally, there is one, even more abstract estimand,
$f_{0,0}$:
\begin{equation}
\begin{aligned}
f_{0,0} \geq \frac{(n_1 - 1)}{n_1} \frac{(n_2 - 1)}{n_2} \frac{f^2_{1, 1}}{4f_{2, 2}}
\end{aligned}
\end{equation}This quantity indicates the number of stories, which were originally shared, even if they can no longer be observed in either assemblage.
Readers will note the similarities of the aforementioned formulas to the conventional Chao1 estimator (Eq. 1): to estimate
$f_{+,0}$, for example, we rely on the stories which were indeed shared with Middle Dutch, but which barely survived, i.e. as singletons (
$f_{+,1}$) and doubletons (
$f_{+,2}$). This information on the abundance of the most scarce and obscure texts in this category is then exploited to provide a lower bound estimate on the number of texts in this category which did not survive at all. Because all these equations ultimately provide lower-bound estimates, the same will be true of the combined estimate, which we consequently should write down as:
\begin{equation}
\begin{aligned}
S_{\textrm{shared}} \geq \hat{S}_{\textrm{shared}} = S_{\textrm{shared}, \textrm{obs}} + \hat{f}_{0, +} + \hat{f}_{+, 0} + \hat{f}_{0, 0}
\end{aligned}
\end{equation}In the appendix, we present a novel bootstrap procedure which can be used to obtain confidence intervals on each of the estimands. Taking these into account is crucial, especially when working with smaller, heavily undersampled assemblages like ours, where there is considerable uncertainty in the estimates. Below, we will report upper and lower .95 percentiles for the obtained bootstrap values, yielding intervals that are not necessarily symmetric.
Results
In Table 2, we present the results of the shared richness estimator, including the 0.95 confidence intervals obtained through the bootstrap procedure. These estimates are also visualised more intuitively using a forest plot-like graph in Fig. 3. Overall, the results confirm the view that the number of stories which were originally shared by these two literary cultures is indeed underestimated by the still surviving, empirically observable data points. When discussing the constituent point estimates, however, we should take into account that (1) these are lower bounds, and that we should consider that these estimates might be fairly conservative; (2) there exist fairly wide confidence intervals around these estimates, which is probably due to the overall data sparsity that characterises the assemblages, as well as the considerable losses which either assemblage sustained.

Figure 3. Visualisation of the (component) Chao-shared estimates for the chivalric corpus (for 1,000 bootstrap iterations, with circled point estimates and CIs as horizontal lines). Cf. Table 2.
Table 2. Chao-shared estimates for the chivalric corpus (for 1,000 bootstrap iterations)

Discussion
Our central estimate (
$\hat{S}_{shared}$) confirms that the number of chivalric stories which were originally shared by French and Dutch medieval literature is underestimated in the surviving sample (38 shared stories can now be observed): the point estimate suggests that originally at least ca. 57 stories were shared. The CI moreover shows that this might be a fairly conservative estimate in that there is more upside than downside potential for this quantity ([44.85–164.70]). This central estimate is valuable in its own respect because it sheds new light on the composition of the unobserved share for the population. For Middle Dutch literature, the standard Chao1 estimates that a minimum of about 50 stories were lost from the original population: Chao-shared estimates that a minimum of 17 of these lost stories (
$f_{0,0} + f_{+,0}$) – roughly a third (34%) – must have been shared. (The great majority of the texts relating those stories will have been translations.) This quantity is not negligible and should, well and truly, be accounted for in historical scholarship. Another way in which Chao-shared improves on the interpretability of this kind of research is that it can be naturally decomposed into a number of constituent estimands (
$f_{0,+}, f_{+,0}, f_{0,0}$). These can help improve our understanding of why these shared stories have remained under-observed. Below, we will zoom in on each of these constituent estimands and attempt to connect it to existing scholarship in philology.
Lost French pretexts (
$f_{0,+}$)
The estimand
$f_{0,+}$ indicates that
$\geq 2$ surviving stories in Middle Dutch had a French counterpart which is now lost altogether. The already mentioned Karel ende Elegast is probably one of these, because the French counterpart is no longer extant but has been attested otherwise. Another example where a lost French original has been attested via other ways is the Roman de Torrez, which survives in Middle Dutch translation but not in the French original – even though this original is known indirectly from a historic library inventory (Claassens, Reference Claassens, Goyens and Verbeke2009). Interestingly, the estimator does not strongly suggest that there were many more cases like these, apart from the ones that we already knew about.
Lost shared stories (
$f_{0,0}$)
We also have an estimate of shared stories that are now lost in both languages:
$f_{0,0} \geq 1$. It may seem very difficult to get an idea about the nature of these romances but there is evidence pointing in this area. There are, for example, early modern Dutch texts of which it is assumed that they are adaptations of lost medieval Dutch texts, and in some cases, these lost medieval Dutch texts may be based on a lost French original. Likewise, there exist German translations of now lost Middle Dutch texts, and again, some of these might find their origin in lost French texts. An intriguing case concerns Johann aus dem Baumgarten / Joncker Jan wt den vergiere (Scholz, Reference Scholz, Haug and Wachinger1991). The rhyming German text Johann aus dem Baumgarten, transmitted in a 15th c. manuscript, clearly mentions that it is derived from a rhyming Dutch, more specifically Flemish, source: ‘uz flemschen in unser dutsche sleht / zurbrochen rime machen reht’ (Duijvestijn, Reference Duijvestijn, van Oostrom and Willaert1989, p. 153). The form and content of this now lost Dutch text can more or less be reconstructed on the basis of the German translation, but also a 16th c. Dutch prose adaptation, of which we have an edition dating to ca. 1590: Joncker Jan wt den vergiere. It must have been a 14th c. rhyming Dutch text about a foundling called ‘Jan’ (John) who is adopted by a German emperor, but who finds out in the end that his real parents are count Robert of Artois and a French princess. We may hypothesise that this now lost Dutch text was based on a French original (that was itself lost too). The plot of the reconstructed Dutch romance is similar to that of the French romance of Richars li biau – without this being the real source text – and it also has common features with the English romance Sir Degaré, which is in itself supposed to go back to a lost French romance or lai (Besamusca, Reference Besamusca2003). In short, we hypothesise the existence of a French text *Jean du vergier which was translated into Dutch (and, on the basis of that, eventually also into German). Likely, the French text was intertextually related to other French texts like Richars li biau and the English Sir Degaré or its lost French source. Thus, texts that are only transmitted in one language may, in fact, have been translations of lost translations of lost texts. The example of the hypothetical *Jean du vergier and its ties with Dutch, German, and English literature, demonstrates that it would be worthwhile to consider more European languages in future scholarship. Latin, for instance, is an obvious omission in our work, even if chivalric literature was more rare in this language. An illustrative example is Barlaam ende Josaphat, a Middle Dutch text which is conventionally categorised as ‘ridderepiek’, although it borders on the text variety of hagiography, and might well have had a (currently unidentified) LatinVorlage (Kienhorst, Reference Kienhorst1988).
Lost Dutch translations (
$f_{+,0}$)
The highest share of unobserved shared stories is represented by
$f_{+,0}$, which is estimated to be
$\geq 16$, or the lower bound estimate on the number of lost Middle Dutch translations of French sources that are themselves still extant. External, additional evidence in this domain, however, is sparse and often indirect. Consider, for instance, the countal chamberlain John VI of Gistel: in 1417, the book collection of this Flemish nobleman included a ‘Dutch-language book speaking of the history of England and the Knight of the Lion’ (Caers, Reference Caers2011). This seems to refer to a now lost Middle Dutch translation of the Yvain (or Le chevalier au lion) by Chrétien de Troyes. Since other Middle Dutch texts too reveal some familiarity with the Yvain matter, it is quite conceivable that poets were exposed to this narrative in a Middle Dutch rendition (Besamusca, Reference Besamusca and Ertzdorff1994). In other cases, the evidence is less direct: an interesting archaeological find is the ‘Tristan slipper’, a fourteenth-century leather shoe cover that was found in Mechelen (Belgium) (Sleiderink, Reference Sleiderink2010b). This artefact features an unmistakable rendition of the famous orchard scene from Tristan and Ysolde. While this text must have circulated in the original French version in the Low Countries, no Middle Dutch translation survives from this period – disregarding the more eastern, Low Franconian Tristant (Goossens, Reference Goossens2005). The slipper, however, features a Dutch quotation that was likely extracted from a Middle Dutch translation that went missing.
A final category is the (somewhat notorious) Dutch intermediaries: previous scholars have argued that some medieval German texts have been translated from French but only via Dutch intermediaries, which are now lost. While there exist valid arguments for this hypothesis, it has remained controversial. Apart from the Arthurian texts discussed by Tilvis (van Oostrom, Reference van Oostrom2006), we should also mention Schlacht von Alischanz, a Charlemagne text that was translated from an existing French epic, Chanson d’Aliscans (Kienhorst, Reference Kienhorst1998, Bastert, Reference Bastert2010). The Schlacht survived in fragments that display a curious mix of High German and Low German dialectal features. However, scholars have also noted relics of a Middle Dutch lexical stratum in the text, with words such as har unde tar and broecgurtel (Leitzmann, Reference Leitzmann1920). These relicts suggest that the Aliscans might have impacted the German cultural area via a Middle Dutch intermediary.
Implications
Understanding the transmission of cultural traits across individuals and communities is a central task in studying cultural change. Here, we focused on this task from the perspective of literary history and more specifically the inter-vernacular sharing (e.g. translation) of stories across two medieval literary systems (medieval French and Dutch). Modelling this process is complicated by the problematic survival of material sources, which can mask important interdependencies between these cultures, leading to an under-estimation of the original narrative richness of these literatures, but also their shared diversity. Above, we proposed the application of an established method (Chao-shared) from ecology to abundance counts that capture the (observed) abundance of texts in both languages. We discussed how the central estimate can be decomposed into three constituent estimates, which capture the intuition that shared stories cannot always be reliably observed, for instance, when either the source or translation has not survived in either language.
The resulting estimates broadly support the hypothesis that the observable, surviving data underestimate the shared diversity on several fronts. While this has been acknowledged by several studies in the past, not all scholars were equally convinced of the extent of this phenomenon. It is particularly meaningful to confront our results with the state of the art in Middle Dutch scholarship, where the authenticity or originality (oorspronkelijkheid) of this literature has been hotly debated. By ‘authentic’ or ‘original’, scholars typically mean that a text was originally written in Middle Dutch and has not been translated or adapted from another language, like French or Latin. In an ecological sense, the number of authentic texts could therefore be termed the ‘endogenic’ cultural diversity. As mentioned, a narrow majority of Middle Dutch texts should be considered translations (including more creative, less verbatim adaptations, which we do not distinguish as such in this context). For many Middle Dutch texts, however, it has so far not been possible to identify a French source. Earlier scholars, such as Gerritsen, were convinced that many of these texts derive from lost French exemplars (Gerritsen, Reference Gerritsen and Stuip1988, p. 41). Besamusca has argued that it would be preconceived (idée fixe) to assume that all such Middle Dutch texts must derive from now lost French sources: he proposed the opposite view, i.e. to consider such texts, in the lack of additional evidence, as authentic or endogenic: ‘Wenn kein französisches Original nachzuweisen ist, gab es auch keines’ (Besamusca, Reference Besamusca and Ziegeler2008, p. 35).
The bootstrap approach presented above also yields, perhaps surprisingly, a confidence interval for the attested number of shared stories, even if this quantity is directly observable. (Due to the nature of the bootstrap, note that the CI will not necessarily include the observed value, even if this happens to be the case above.) From a statistical perspective, the CI reflects the uncertainty associated with the observed value: if similar sampling or data collection were repeated, the number of observed shared species would vary, and the CI provides the likely range of this variation. The resulting range ([29–44]) seems informative in the context of the 38 shared stories which we identified: it gives a sense of the variation one might expect if a different team of scholars were to repeat this exercise – i.e. not so much taking inventory of the available texts, but the alignment exercise involved in mapping the texts to stories. In some cases, for example, there is clear but indirect evidence for a French pretext. The Middle Dutch Florigout, for instance, features excessively many French proper nouns (like ‘Gardepont’) which point towards an earlier French stratum. This view sheds a new light on some likely translation glitches. In the following passage, a monkey is sitting in a tree and blows a horn when a company passes under his tree. His master, the aforementioned Gardepont, reacts with lots of joy:
Lude riep de felle here
HAHAHAHA, surleboye
Mijn symminkel hevet proye
Nu vernomen goet ende vet. (Heeroma, Reference Heeroma1962, l. 3770–3773, p. 29)
(‘Loudly the evil lord shouted: “Hahaha, my monkey, Surleboye, now has spotted a good and fat prey”’.)
Apparently, the Dutch author thought that ‘Surleboye’ was the name of the monkey (Kuiper & Hendriks, Reference Kuiper and Hendriks1993–2025). However, it seems more likely that sur le bois (‘in the tree’) in the French source text was the place from where the monkey had spotted his ‘prey’.
For other texts, the debate as to the authenticity has a stronger ideological dimension. Early scholars, such as Jonckbloet and Te Winkel, considered the Roman van Moriaen as a notable example of an ‘authentic’ or ‘original’ Middle Dutch Arthurian text (Wells, Reference Wells1971). It is the very first text in Dutch literary history that features a non-white protagonist. In the wake of ideological criticism, the international attention for this text has surged, making it perhaps one of the most widely read Middle Dutch texts internationally at this moment (Heng, Reference Heng2018). A source for the Moriaen has never been retrieved in French philology, adding to the (partly chauvinistic) claims that Moriaen is indeed an endogenic text (van Buuren and Gysseling, Reference van Buuren and Gysseling1971). However, this Dutch-centric view has not gone uncontested in the international literature (Sparnaay, Reference Sparnaay and Loomis1959). In the 19th century Gaston Paris already wrote: ‘Le fait qu’on n’a pas retrouvé l’original français ne prouve naturellement rien; nous avons vu bien des exemples de pertes semblables’ (Paris, Reference Paris1888, p. 254). Nevertheless, if Moriaen was indeed a translation or adaptation of a now lost French source, it might be easier to explain that in Wolfram von Eschenbach’s Parzival the hero also has a son with a black mother, Feirefiz, a knight who was not completely black, but striped, black and white. Several scholars, including international researchers (Heng, Reference Heng2018), nevertheless continue to insist on the likely originality of this ‘endogenic’ romance (Smith & Zemel, Reference Smith, Zemel, Besamusca and Brandsma2021).
Acknowledgments
We are grateful to Cecile Vermaas and Kelly Christensen for their help in collecting and curating data. This paper has benefited substantially from the constructive feedback of Elisabeth de Bruijn on earlier drafts. Finally, we are grateful to the anonymous reviewers for their valuable suggestions.
Author Contributions
Conceptualisation: M.K.; F.K.; R.S. Methodology: M.K.; F.K.; R.S.; A.C. Data curation: M.K.; R.S.; J.-B.C. Data visualisation: M.K.; F.K.; A.C. Writing original draft: M.K.; F.K.; R.S.; A.C. Copy-editing: All authors. All authors approved the final submitted draft.
Financial Support
M.K. was supported by a grant for a sabbatical research leave, co-funded by the University of Antwerp’s Special Research Fund and the Flemish Research Agency FWO (grant number K802224N). J.-B.C. was funded by the European Union (ERC, LostMA, 101117408). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work has received support under the Major Research Program of PSL Research University ‘CultureLab’ launched by PSL Research University and implemented by ANR with the reference ANR-10-IDEX-0001.
Conflict of Interest
n/a
Research Transparency and Reproducibility
All software and data needed to replicate our analyses have been published online under a CC-BY licence: https://github.com/mikekestemont/shared-richness. The current and future releases of this repository are sustainable archived on Zenodo (https://doi.org/10.5281/zenodo.17236111). The central shared species estimator (with unit testing) is available from the copia package (v.0.1.6), commit number ‘4725e06’ (https://github.com/mikekestemont/copia).
Appendix A. Shared species confidence intervals (bootstrap procedure)
We propose a bootstrap procedure for assessing the uncertainty of the estimates of
$f_{0,+}$,
$f_{+,0}$,
$f_{0,0}$, and total shared species between two assemblages. We resort to the bootstrap method proposed in the R package iNEXT.beta3D to assess the s.e. of the estimates of
$f_{0,+}$,
$f_{+,0}$,
$f_{0,0}$, and total shared species between two assemblages. Here, the total shared is the sum of observed,
$f_{0,+}$,
$f_{+,0}$, and
$f_{0,0}$. The associated 95% CI is also obtained. Assume a random sample I (species abundance vector with sample size
$n_1$) was taken from Assemblage I with coverage estimate
$C_1$, undetected species estimate
$f_{0,1}$, and the Chao1 estimate
$S_1$. Next, assume a random sample II (species abundance vector with sample size
$n_2$) was taken from Assemblage II with coverage estimate
$C_2$, undetected species estimate
$f_{0,2}$, and the Chao1 estimates
$S_2$. Finally, assume that in the data there were
$D_{1,2}$ observed shared species,
$D_{0,+}$ observed unique species in Sample II, and
$D_{+,0}$ observed unique species in Sample I. This is shown in Fig. A1.

Figure A1. Bootstrap framework for shared species estimation showing the partition of species across two assemblages and their pooled combination.
$D_{1,2}$,
$D_{0,+}$, and
$D_{+,0}$ represent observed shared, unique to Assemblage II, and unique to Assemblage I species, respectively.
$f_{0,1}$,
$f_{0,2}$, and
$f_0^*$ denote undetected species estimates for each assemblage and the pooled sample.
The bootstrap procedures are briefly described in the following steps (see Fig. A1 for notation):
(1) Consider the pooled sample with each species frequency being the sum of the two frequencies of the same species in the two samples. Then in the pooled sample there were
$D^* = D_{1,2} + (D_{0,+}) + (D_{+,0})$ observed species. From the pooled sample, we first obtain the Chao1 species richness
$S^*$ and undetected species in the pooled assemblage
$f_0^* = S^* - D^*$. Then add
$f_{0}*$ species to each assemblage as ‘undetected’ ones; see Fig. A1.(2) Then we construct two bootstrap assemblages: For bootstrap assemblage I: in addition to the observed data, we randomly select
$f_{0,1}$ species from zero cells (there are
$(D_{0,+}) + f_0^*$ of them) as those undetected species in Sample I. If
$f_{0,1} \gt (D_{0,+}) + f_0^*$, then replace
$f_{0,1}$ with
$(D_{0,+}) + f_0^*$ and select all
$(D_{0,+}) + f_0^*$ cells. Each selected species/cell is assigned to a relative abundance
$(1-C_1)/f_{0,1}$ in the bootstrap assemblage I.(3) For bootstrap assemblage II: in addition to the observed data, we randomly select
$f_{0,2}$ species from zero cells (there are
$(D_{+,0}) + f_0^*$ of them) as those undetected species in Sample II. If
$f_{0,2} \gt (D_{+,0}) + f_0^*$, then replace
$f_{0,2}$ with
$(D_{+,0}) + f_0^*$ and select all
$(D_{+,0}) + f_0^*$ cells. Each selected species/cell is assigned to a relative abundance
$(1-C_2)/f_{0,2}$ in the bootstrap assemblage II.(4) Within each sample, for any detected species, its relative abundance is coverage-based-adjusted according to the method in (Chao et al., Reference Chao, Hsieh, Chazdon, Colwell and Gotelli2015) so that the sum of the relative abundance of the detected species is equal to the coverage estimate in each sample.
(5) After the two bootstrap assemblages are constructed, take a sample of size
$n_1$ from Bootstrap Assemblage I, and take a sample of size
$n_2$ from Bootstrap Assemblage II. Then compute bootstrap statistics (observed,
$f_{0,+}$,
$f_{+,0}$ and
$f_{0,0}$) based on the bootstrap sample.(6) Repeat Procedure (4)
$B$ times (
$B=1000$, by default), then the sample s.e., based on the replicated values, is used as the approximate s.e. of each estimate for the current data. The corresponding 95% CI is computed using the percentile method: the lower and upper confidence limits are the 2.5th and 97.5th percentiles of the
$B$ bootstrap estimates, respectively.
Each 95% CI is computed using bootstrap percentiles rather than symmetric intervals. This percentile-based approach naturally handles the constraint that estimates should be non-negative and provides more appropriate confidence intervals for skewed distributions commonly encountered in (shared) species estimation. An advantage of this bootstrap method is that it can be directly generalised to multiple assemblages and does not rely on normality assumptions. The number of bootstrap replications should be at least 200, preferably 1000, for more accurate results.
Appendix B. Shared stories in French and Dutch medieval texts
Below are aligned attestation counts for (shared) stories across Dutch and French medieval chivalric texts. If multiple independent translations or adaptations exist, the components have been joined with a semi-colon (;). Composite texts are indicated with a hyphen (-). These counts are largely based on the Forgotten books data (Kestemont et al., Reference Kestemont, Karsdorp, de Bruijn, Driscoll, Kapitan, Macháin, Sawyer, Sleiderink and Chao2022), but with some minor emendations.












