Quantitative modelling of fine-scale variations in the Arabidopsis thaliana crossover landscape

Published online by Cambridge University Press:  15 February 2022

Yu-Ming Hsu
Affiliation:
Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS2), 91405, Orsay, France; Université de Paris, CNRS, INRAE, Institute of Plant Sciences Paris-Saclay (IPS2), 91405, Orsay, France; Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
Matthieu Falque
Affiliation:
Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
Olivier C. Martin*
Affiliation:
Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS2), 91405, Orsay, France; Université de Paris, CNRS, INRAE, Institute of Plant Sciences Paris-Saclay (IPS2), 91405, Orsay, France; Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
*Author for correspondence: O. C. Martin. E-mail: olivier.c.martin@inrae.fr

Abstract

In essentially all species where meiotic crossovers (COs) have been studied, they occur preferentially in open chromatin, typically near gene promoters and, to a lesser extent, at the ends of genes. Here, in the case of Arabidopsis thaliana, we unveil further trends that arise when one considers contextual information, namely summarised epigenetic status, gene or intergenic region size, and degree of divergence between homologs. For instance, we find that the intergenic recombination rate is reduced in regions less than 1.5 kb in size. Furthermore, we propose that the presence of single nucleotide polymorphisms enhances the rate of CO formation compared to when homologous sequences are identical, in agreement with previous works comparing rates in adjacent homozygous and heterozygous blocks. Lastly, by integrating these different effects, we produce a quantitative and predictive model of the recombination landscape that reproduces much of the experimental variation.

Information

Type
Original Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
© The Author(s), 2022. Published by Cambridge University Press in association with The John Innes Centre

Figure 1. The correlations between recombination rate and nine genomic or epigenomic features taken from somatic tissues (cf. titles). Each dot represents the values for a 100-kb bin. The x-axis shows the density of each feature, and the y-axis is the recombination rate based on a total of 17,077 crossovers from the Col-0-Ler F2 population. Dots in red, blue or green are for bins located in arms, pericentromeric regions or the transition regions between arms and pericentromeric regions, respectively. The black curves are fits to polynomials of degree 4 (function lm(y ~ poly(x,4)) of the statistical package R). R² corresponds to the fraction of explained variance when using the polynomial as predictor (equation (2)). To ensure that the points fill most of the space, the scale in the main part of each panel is a zoom to display only 95% of the data, cutting the 2.5% extremities on both sides of the x-axes in all these plots. Insets show the data in the whole range.
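
The degree-4 fit and its R² can be illustrated with a minimal Python sketch. The caption states that the fits were made in R with lm(y ~ poly(x,4)); the code below is only an analogue, not the authors' code. It assumes the usual definition of R² as 1 - SS_res/SS_tot (equation (2) is not reproduced here), and the function names and toy data are hypothetical. R's poly() uses an orthogonal basis by default, which should give the same fitted values as the raw powers used here.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (constant term first),
    obtained by solving the normal equations with Gaussian elimination."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum x^(i+j) and b[i] = sum y*x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = s / A[i][i]
    return coef


def evaluate(coef, x):
    """Value of the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coef))


def r_squared(ys, preds):
    """Fraction of variance explained: 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data standing in for (feature density, recombination rate) per 100-kb bin.
density = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
rate = [4.1, 4.6, 4.4, 3.9, 3.1, 2.6, 2.0, 1.3, 1.1, 0.6, 0.5]
coef = fit_polynomial(density, rate, 4)
print(r_squared(rate, [evaluate(coef, x) for x in density]))
```

Because R² is computed over all points, bins outside the zoomed 95% window still contribute to it even though they appear only in the insets.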

Figure 2. Relations between our 10 chromatin states, genes, intergenic regions and recombination rate. (a) The top pie chart shows the genome-wide occupation percentages of each of the 10 states. ‘SV’ refers to low synteny regions or structural variations between Col-0 and Ler. The characteristics of the nine other states are: state 1 (intragenic, transcription starting site (TSS)), state 2 (intergenic, proximal promoter), state 3 (intragenic, coding sequence), state 4 (intergenic, distal promoter), state 5 (intergenic, H3K27me3 rich), state 6 (intergenic, transcription termination site (TTS)), state 7 (intragenic, long genes), state 8 (heterochromatic, AT rich) and state 9 (heterochromatic, GC rich). The lower pie chart shows the percentage of crossover occurrences identified in the 10 states. (b) Two plots, giving respectively the profiles of cumulated fractions of occurrences of the 10 different states (top) and the recombination rate pattern (bottom) in cM per Mb, along gene bodies and their 3-kb flanking regions. In the absence of SV, the entire 3-kb flanking region was used, otherwise it was truncated. The gene body goes from the TSS to the TTS as given in TAIR 10. Only non-transposable element coding genes satisfying the synteny filter have been included in the analysis. For the gene body region, the x-axis represents relative position, that is the distance from the TSS divided by the distance between TTS and TSS. That procedure allows one to pool genes of different sizes. For the flanking regions, x-axis represents position relative to the TSS or TTS in kb. The blue curve at the bottom is the predicted recombination rate when using the chromatin state profiles at the top together with the genome-wide recombination rates derived from (a). (c) Two plots as in (b) but now for the intergenic regions. Again, the blue curve is the predicted recombination rate when using the chromatin state profiles at the top together with the genome-wide recombination rates derived from (a). The legend in the middle of (b) and (c) indicates the corresponding chromatin state of each color used in plotting the chromatin-state profiles.
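
One plausible reading of how the blue prediction curves combine the two ingredients is sketched below in Python. It rests on two assumptions that the caption does not spell out: (i) the genome-wide rate of a state is the genome-wide average rate scaled by the ratio of its crossover share to its genome share, as read off the two pie charts in (a), and (ii) the predicted rate at a position is the average of the state rates weighted by the state fractions at that position. All names and numbers are hypothetical.

```python
def state_rates(genome_fraction, crossover_fraction, genome_average_rate):
    """Per-state recombination rate (cM/Mb), assuming
    rate_s = average_rate * (share of crossovers in s) / (share of genome in s)."""
    return {
        state: genome_average_rate * crossover_fraction[state] / genome_fraction[state]
        for state in genome_fraction
    }


def predicted_rate(state_profile, rates):
    """Predicted rate at one position: state fractions at that position
    (summing to 1) used as weights on the per-state rates."""
    return sum(fraction * rates[state] for state, fraction in state_profile.items())


# Hypothetical two-state example (the model in the figure uses 10 states).
genome_fraction = {"open": 0.2, "closed": 0.8}
crossover_fraction = {"open": 0.6, "closed": 0.4}
rates = state_rates(genome_fraction, crossover_fraction, genome_average_rate=4.0)

# Hypothetical state composition at one relative position along a gene.
profile_near_tss = {"open": 0.7, "closed": 0.3}
print(predicted_rate(profile_near_tss, rates))
```

Under these assumptions the prediction carries no positional information beyond the state profile itself, so any gap between the blue curve and the observed pattern points to effects not captured by chromatin state alone.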

Figure 3. The relationship between the size of intergenic regions and their average recombination rate. These bar charts were constructed using all intergenic regions, but in the bottom, the regions were divided into three categories according to the transcription orientations of the two flanking genes, corresponding to convergent, divergent and parallel transcriptions. In all cases, the x-axis gives the size of the intergenic regions in kb, and the y-axis gives the corresponding averaged recombination rate (cM/Mb). Binning of the intergenic region sizes was applied every 500 bases up to a total size of 10 kb. For example, the leftmost bin covers intergenic regions of size 0–0.5 kb. However, we also include a rightmost bar on each chart to cover intergenic regions of sizes larger than 10 kb. Error bars are errors on the mean computed by the jackknife method (only the top segments are displayed). In both top and bottom figures, the blue curves give the predicted recombination rates using the genome-wide recombination rates of the 10 chromatin states as obtained from Figure 2a. The red curves show the predicted recombination rates when one includes the modulation based on the size of the intergenic regions as specified in equation (4).
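
The binning and error bars described in this caption can be sketched in Python as follows. The sketch assumes a delete-one jackknife over the individual intergenic regions of a bin and an unweighted mean per bin; the caption does not say which units were resampled or whether regions were weighted by length, so both choices are assumptions. Regions of exactly 10 kb are placed in the overflow bin here, which is also a guess, and the function names are hypothetical.

```python
import math

BIN_WIDTH_BP = 500
MAX_SIZE_BP = 10_000
N_REGULAR_BINS = MAX_SIZE_BP // BIN_WIDTH_BP  # 20 bins of 0.5 kb each


def size_bin(size_bp):
    """Bin index 0..19 for sizes below 10 kb; index 20 is the overflow bin."""
    if size_bp >= MAX_SIZE_BP:
        return N_REGULAR_BINS
    return int(size_bp // BIN_WIDTH_BP)


def jackknife_se_of_mean(values):
    """Delete-one jackknife standard error of the mean."""
    n = len(values)
    total = sum(values)
    # Mean recomputed with each observation left out in turn.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    centre = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((m - centre) ** 2 for m in leave_one_out))


def binned_means(regions):
    """regions: iterable of (size_bp, rate_cM_per_Mb) pairs.
    Returns {bin index: (mean rate, jackknife SE or None if only one region)}."""
    by_bin = {}
    for size_bp, rate in regions:
        by_bin.setdefault(size_bin(size_bp), []).append(rate)
    return {
        b: (sum(v) / len(v), jackknife_se_of_mean(v) if len(v) > 1 else None)
        for b, v in sorted(by_bin.items())
    }


# Hypothetical intergenic regions: (size in bp, recombination rate in cM/Mb).
example = [(300, 1.0), (450, 2.0), (1200, 5.0), (1400, 6.0), (25_000, 3.0)]
print(binned_means(example))
```

For a plain mean, the delete-one jackknife standard error coincides with the familiar sample standard deviation divided by the square root of n, which gives a quick sanity check.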

Figure 4. The relationship between recombination rate and single nucleotide polymorphism (SNP) density. The Col-0 genome was decomposed into bins of 100 kb. For each cross starting with that of Rowan et al. (2019), SNPs and crossovers (COs) were inferred from reads produced using the F2 populations by mapping to the Col-0 genome. SNP density and recombination rates were then determined for each bin and displayed as a scatter plot. The five additional crosses are from Blackwell et al. (2020). The continuous red curves are fits when using the function (a + b x) exp(−cx) so as to maximise the log likelihood. To filter out the high SNP density regions that are expected to causally repress recombination, we restricted the analysis to SNP densities in the first two quantiles. All crosses show a reduced recombination rate at low SNP density and the likelihood ratio test allows us to reject the hypothesis H0 that ‘b = 0’, corresponding to no such suppressive effect (p-values shown for each cross and computed using the chi-square distribution with one degree of freedom).
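
The likelihood ratio test mentioned in this caption can be sketched in Python as follows. The fitted curve (a + b x) exp(−cx) and the one-degree-of-freedom chi-square reference come from the caption; the likelihood used to fit the curve is not specified there, so the sketch takes the two maximised log likelihoods as given inputs. The function names are hypothetical. For one degree of freedom, the chi-square tail probability has the closed form erfc(sqrt(stat/2)), so no statistics library is needed.

```python
import math


def rate_model(x, a, b, c):
    """Recombination rate as a function of SNP density x: (a + b*x) * exp(-c*x).
    With b = 0 (hypothesis H0) this reduces to a * exp(-c*x)."""
    return (a + b * x) * math.exp(-c * x)


def lrt_pvalue_1df(loglik_full, loglik_null):
    """p-value of the likelihood ratio test for one extra free parameter.

    loglik_full: maximised log likelihood with a, b and c free.
    loglik_null: maximised log likelihood with b fixed to 0.
    """
    stat = 2.0 * (loglik_full - loglik_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimiser noise
    # Chi-square survival function with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log likelihoods for one cross.
print(lrt_pvalue_1df(loglik_full=-1520.3, loglik_null=-1527.9))
```

A small p-value means that freeing b improves the fit by more than one additional parameter would be expected to achieve by chance alone.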

Figure 5. Experimental and predicted recombination landscapes of chromosome 1. Landscapes using 100 kb bins obtained from the Rowan et al. (2019) dataset (red) and predicted from our calibrated model based on chromatin states (blue) with 15 parameters. Inset: a zoom in the right arm. For landscapes of all chromosomes, see Supplementary Figure S9.

Supplementary material

Hsu et al. supplementary material 1 (PDF, 2.8 MB)
Hsu et al. supplementary material 2 (File, 15.7 MB)

Author comment: Quantitative modelling of fine-scale variations in the Arabidopsis thaliana crossover landscape — R0/PR1

Comments

Dear colleagues,

Genetic recombination shuffles alleles between parental chromosomes to generate novel combinations to be passed on to offspring. That shuffling is a cornerstone of research in genetics, one of the most important forces in natural evolution, and key for artificial genetic improvement (e.g. plant breeding). The distribution of meiotic crossovers along chromosomes is strikingly heterogeneous on all scales: large heterochromatic pericentromeric regions are almost entirely devoid of crossovers, while hot spots of recombination typically arise in regions of 1 kb or less. Although the qualitative trends of such "recombination landscapes" (repression in heterochromatin, enrichment in intergenic regions) are common to nearly all species studied, there have been hardly any attempts to model these landscapes quantitatively. The two fundamental reasons are (i) the lack of high-quality data and (ii) the fact that many different genomic or epigenomic features are expected to affect recombination rate. To overcome this last difficulty, theoretical investigations of these landscapes have relied on two frameworks. The first is standard quantitative genetics, leading to linear and generalized linear models (also allowing for interactions). The second is supervised learning, which is qualitative rather than quantitative (regions are only classified as high vs low in recombination rate, cf. Demirci et al., 2018, cited in our manuscript).

In the work presented in this manuscript, we have been able to construct a quantitative model reproducing the recombination landscape of Arabidopsis thaliana, a species chosen because of its available data, both genomic and epigenomic. Rather than use one of the two frameworks mentioned above, we take a third path that allows us to escape the pitfalls of those high-complexity models. Notably, (i) our model explains a substantial fraction of the variance in recombination rate, (ii) it is predictive while the models based on quantitative genetics are not, and (iii) its different parameters all have a direct interpretation, either as base recombination rates or as motivated factors modulating those rates. Given these successes, our work will interest scientists working on recombination rates. But, more generally, our framework is likely to inspire researchers attempting to model complex biological systems driven by a plethora of factors. If we hope to publish in "Quantitative Plant Biology", it is because it has precisely the readers we want to target: readers motivated by quantitative and predictive modeling solidly anchored in plant biology.

Sincerely yours,

Y.-M. Hsu, M. Falque and O. Martin

Review: Quantitative modelling of fine-scale variations in the Arabidopsis thaliana crossover landscape — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

Comments to Author: Meiotic crossover rates vary greatly along genomes. While there is now a considerable amount of empirical data for crossover distributions, we are still lacking good fine-scale models that explain these distributions. The authors of the current work have exploited publicly available crossover data and integrated them with genomic and epigenomic features. It is interesting that using the raw data does not allow for a very good model, but that a previously established segmentation of the genome into different epigenetic states based on multiple epigenetic marks does. This model is surprisingly powerful, and further enhanced by including the size of intergenic regions and local interparent sequence divergence. It is curious that intergenic size matters, as this should be captured by the epigenetic states, and I would like the authors to comment on this.

Minor comment: that homozygosity suppresses recombination rate in Arabidopsis was first described by Barth et al. https://pubmed.ncbi.nlm.nih.gov/11768224/

Review: Quantitative modelling of fine-scale variations in the Arabidopsis thaliana crossover landscape — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

Comments to Author: In this paper, Hsu et al. use published A. thaliana genetic and epigenetic data to develop a quantitative model that does a good job of predicting the fine-scale recombination landscapes of Arabidopsis chromosomes. The paper is well written and the narrative clearly describes the steps taken to develop the final model. The authors also make interesting observations along the way that COs are suppressed in small intergenic regions and in regions of low SNP density (the latter supporting previous observations from Ian Henderson’s lab). The final model doesn’t quite capture all the variation in recombination frequency in the experimental data, but this is rightly acknowledged by the authors and they suggest reasonable explanations as to why this is the case (such as the lack of crossover interference effects within the model).

I only have the following minor comments that the authors may wish to address in their final manuscript:

1. It is not entirely clear to me why the final rescaling of the recombination rates in the model (lines 452-454) is necessary or justified. I agree that genetic lengths of whole genomes may not vary much with genome size, but it is my understanding that the recombination rates of individual Arabidopsis chromosomes positively correlate with physical length (see Giraut et al., 2011 doi.org/10.1371/journal.pgen.1002354 and Salome et al., 2012 doi.org/10.1038/hdy.2011.95). It would be useful if the authors could add some citations or data to further justify why this rescaling is necessary.

2. On a related note, the authors mention that they rescaled the recombination rates to incorporate the effects of CO homeostasis (lines 159-162). CO homeostasis refers to the retention of COs at the expense of non-COs (e.g. when DSB frequency is reduced – see Martini et al., 2006 doi.org/10.1016/j.cell.2006.05.044). It’s not clear to me why the rescaling would address this effect, so it would be useful if this could be further explained in the text.

3. It would be nice in the discussion if the authors could mention other mathematical models that have been used to explain recombination rate in Arabidopsis (e.g. Lloyd & Jenczewski 2019 doi.org/10.1534/genetics.118.301838 and Morgan et al., 2021 doi.org/10.1038/s41467-021-24827-w) and how the model presented in this paper differs from these other models.

Recommendation: Quantitative modelling of fine-scale variations in the Arabidopsis thaliana crossover landscape — R0/PR4

Comments

Comments to Author: Dear authors,

Thanks for your submission to QPB. Your manuscript has now been viewed by two reviewers, who consider it interesting and valuable. Their comments and requests are attached to this message. In addition I have some editorial requests about the statistical approaches used. We would be delighted to consider a revised version of the manuscript when these points have been addressed.

Statistical points:

1. Please rephrase to avoid describing continuous variables as "factors" (e.g. l299, 368, 438, 442, 496, ...). "Factor" in statistics typically means a discrete categorical variable and this phrasing could lead to some confusion in this ms. You could consider "covariates" or just "variables"?

2. l128 -- this p-value, derived from resampling a goodness-of-fit statistic, tests the hypothesis that a more complex model provides a better fit to the data. This is not really a useful hypothesis -- a more complex model with free parameters will always provide a better fit. The question is whether the improvement "justifies" the additional complexity. This tradeoff between fit and complexity is at the heart of model comparison approaches like AIC/BIC and the likelihood ratio test. If this comparison is to be used to justify the use of one model over another, please replace this with one of these approaches or similar that accounts for model complexity.

3. In Fig 1, the R^2 statistics compare the predictions from a spline model to observations. But it's not clear how this spline model comes about. Please describe what's going into this. Also, please avoid the truncation of axes here and in the supplement -- the outliers play an essential part in determining the R^2 values. You could consider using log(x+1) or sqrt transformations?

4. The majority of the main text considers only goodness-of-fit and not model complexity (as in point 2). This makes l478 onwards, and Supp Tables S4-S6, particularly important. Please summarise the findings of this section in the main text so the reader can see at a glance that you have considered the issue of overfitting. Ideally consider using model selection criteria as above.

5. Thank you for providing code as Supp Info. Could you please include a readme file describing the code payload and any steps required to run it on a clean machine?

Decision: Quantitative modelling of fine-scale variations in the Arabidopsis thaliana crossover landscape — R0/PR5

Comments

No accompanying comment.