PacBio assembly of a Plasmodium knowlesi genome sequence with Hi-C correction and manual annotation of the SICAvar gene family

Published online by Cambridge University Press:  19 July 2017

S. A. LAPP
Affiliation:
Emory Vaccine Center, Yerkes National Primate Research Center, Emory University, Atlanta, GA, USA
J. A. GERALDO
Affiliation:
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil; René Rachou Research Center (CPqRR-FIOCRUZ), Belo Horizonte, MG, Brazil
J.-T. CHIEN
Affiliation:
Emory Vaccine Center, Yerkes National Primate Research Center, Emory University, Atlanta, GA, USA; Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
F. AY
Affiliation:
La Jolla Institute for Allergy and Immunology, La Jolla, CA 92037, USA
S. B. PAKALA
Affiliation:
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA; Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA 30602, USA
G. BATUGEDARA
Affiliation:
Center for Disease and Vector Research, Institute for Integrative Genome Biology, Department of Cell Biology & Neuroscience, University of California Riverside, CA 92521, USA
J. HUMPHREY
Affiliation:
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA; Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA 30602, USA
J. D. DeBARRY
Affiliation:
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA; Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA 30602, USA
K. G. Le ROCH
Affiliation:
Center for Disease and Vector Research, Institute for Integrative Genome Biology, Department of Cell Biology & Neuroscience, University of California Riverside, CA 92521, USA
M. R. GALINSKI
Affiliation:
Emory Vaccine Center, Yerkes National Primate Research Center, Emory University, Atlanta, GA, USA; Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA; Division of Infectious Diseases, Department of Medicine, Emory University, Atlanta, GA, USA
J. C. KISSINGER*
Affiliation:
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA; Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA 30602, USA; Department of Genetics, University of Georgia, Athens, GA 30602, USA
the MaHPIC consortium
Affiliation:
Malaria Host-Pathogen Interaction Center, http://www.systemsbiology.emory.edu/
*Corresponding author: Coverdell Center, University of Georgia, 500 D.W. Brooks Drive, Suite 107, Athens, GA 30602, USA. E-mail: jkissing@uga.edu

Summary

Plasmodium knowlesi has risen in importance as a zoonotic parasite that has been causing regular episodes of malaria throughout South East Asia. The P. knowlesi genome sequence generated in 2008 highlighted and confirmed many similarities and differences in Plasmodium species, including a global view of several multigene families, such as the large SICAvar multigene family encoding the variant antigens known as the schizont-infected cell agglutination proteins. However, repetitive DNA sequences are the bane of any genome project, and this and other Plasmodium genome projects have not been immune to the gaps, rearrangements and other pitfalls created by these genomic features. Today, long-read PacBio and chromatin conformation technologies are overcoming such obstacles. Here, based on the use of these technologies, we present a highly refined de novo P. knowlesi genome sequence of the Pk1(A+) clone. This sequence and annotation, referred to as the ‘MaHPIC Pk genome sequence’, includes manual annotation of the SICAvar gene family with 136 full-length members categorized as type I or II. This sequence provides a framework that will permit a better understanding of the SICAvar repertoire, selective pressures acting on this gene family and mechanisms of antigenic variation in this species and other pathogens.

Information

Type
Special Issue Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © Cambridge University Press 2017

Table 1. Characteristics of nuclear genome sequences utilized in this study

Fig. 1. Hi-C-assisted scaffolding of PacBio contigs. (A) Alignment of Hi-C data to the initial set of 35 high-coverage contigs from the PacBio assembly showed that one contig includes DNA from three different chromosomes, as evidenced by the tripartite structure of its intracontig contact map (right). Other contigs did not exhibit similar contact patterns (representative example, left), suggesting that each is a contiguous piece of a single chromosome. (B) Intercontig Hi-C contact maps of the unordered set of contigs (left), which were named according to their similarity to chromosomes in the PKNH assembly, show striking off-diagonal contact enrichment, suggesting that pairs of contigs belonging to the same chromosome are not ordered consecutively. The corresponding intercontig maps after contigs are clustered into scaffolds according to their Hi-C contact counts (middle) show minimal off-diagonal enrichment. The interchromosomal/scaffold contact map generated by aligning Hi-C reads to the new, chromosome-level assembly (right) exhibits contact patterns that are expected of and observed in Plasmodium and yeast species (Ay et al. 2014b; Duan et al. 2010). This assembly was generated by breaking the problematic contig, clustering contigs into chromosomal groups, and ordering and reorienting contigs within each group to maximize Hi-C contacts between adjacent and correctly oriented contigs, creating scaffolds representative of each chromosome. (C) Intrascaffold Hi-C contact maps (normalized counts, 10 kb resolution) from two representative scaffolds in the new assembly. Scaffold 6 (left) and scaffold 14 (right) were constructed by joining two and four PacBio contigs, respectively. The rows/columns marked in white represent regions that are unmappable or poorly mappable with Hi-C reads (Illumina 76 × 2 bp, paired-end sequencing).
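
The ordering step described in this caption (placing contigs within a chromosomal group so that Hi-C contacts between adjacent contigs are maximized) can be illustrated with a minimal sketch. This is not the authors' pipeline: the contig names, the contact counts and the exhaustive search are all assumptions made for illustration, and contig orientation is ignored.

```python
from itertools import permutations

def adjacency_score(order, contacts):
    """Sum of Hi-C contact counts between consecutive contigs in `order`."""
    return sum(contacts.get(frozenset((a, b)), 0) for a, b in zip(order, order[1:]))

def best_order(contigs, contacts):
    """Try every ordering and keep the one with the highest adjacency score.

    Exhaustive search is only practical for a handful of contigs per group;
    an ordering and its reverse always score the same.
    """
    return max(permutations(contigs), key=lambda order: adjacency_score(order, contacts))

# Hypothetical contact counts between four contigs of one chromosomal group.
contigs = ["c1", "c2", "c3", "c4"]
contacts = {
    frozenset(("c3", "c1")): 95,
    frozenset(("c1", "c4")): 88,
    frozenset(("c4", "c2")): 91,
    frozenset(("c3", "c4")): 20,
    frozenset(("c1", "c2")): 18,
    frozenset(("c3", "c2")): 5,
}

order = best_order(contigs, contacts)
print(order, adjacency_score(order, contacts))
```

In this toy input the three strongest contacts form the chain c3, c1, c4, c2, so the search recovers that order or its reverse. A real scaffolder would also need to normalize contacts for contig length and test both orientations of each contig.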

Fig. 2. Chromosomal synteny between the PKNH and MaHPIC PKNOH genome sequences. (A) SyMAP circular DNA comparison of the MaHPIC Pk genome sequence scaffolds to the PKNH 2015 consensus sequence. (B) SyMAP circular DNA comparison of the MaHPIC Pk genome sequence scaffolds to the Plasmodium coatneyi HACKERI genome sequence, which was assembled using PacBio technologies (Chien et al. 2016). (C) SyMAP circular DNA comparison of the PKNH 2015 consensus sequence and the P. coatneyi genome sequence.

Fig. 3. Hi-C contact maps for the join regions present on scaffolds 8 and 9. Hi-C contact maps of two scaffolds from the PKNOH-PacBio-Hi-C assembly that contain contigs previously assigned to two different chromosomes in the PKNH assembly. These contact maps are zoomed in to the join regions and are at the level of single MboI restriction fragments (~1 kb resolution). Each heatmap is rotated 45 degrees relative to the previous intracontig/scaffold heatmaps for visualization purposes. (A) The 200 kb region of scaffold 8 (scf8:500 000–700 000) that surrounds the join (at scf8:593 400) between two contigs previously assigned to chr13 and chr4 (left), compared with a matched 200 kb region from scaffold 12, which consists of a single contiguous PacBio contig (right). (B) A similar case vs control comparison for scaffold 9 and matched coordinates in scaffold 5. The dashed blue lines correspond to the location of the join (or the matching coordinates on the right), and the sum and average number (excluding zeros) of interactions between the left and right sides of a join (rectangular area) are reported for each case.
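
The two statistics reported for each join (the sum of interactions across the join and their average excluding zeros) can be computed as in the following minimal sketch. The function name, the bin indexing, the `flank` parameter and the toy matrix are assumptions for illustration only, not the authors' code or data.

```python
def join_contacts(matrix, join_bin, flank):
    """Summarize contacts spanning a join in a symmetric contact matrix.

    `matrix` is a list of lists indexed by bin (for example, restriction
    fragments); `join_bin` is the first bin to the right of the join; `flank`
    is how many bins to take on each side. Returns (sum, mean of non-zero
    entries) over the rectangular block linking left bins to right bins.
    """
    left = range(max(0, join_bin - flank), join_bin)
    right = range(join_bin, min(len(matrix), join_bin + flank))
    values = [matrix[i][j] for i in left for j in right]
    nonzero = [v for v in values if v != 0]
    total = sum(values)
    mean = total / len(nonzero) if nonzero else 0.0
    return total, mean

# Hypothetical 4-bin contact matrix with the join between bins 1 and 2.
toy = [
    [0, 5, 2, 0],
    [5, 0, 3, 1],
    [2, 3, 0, 4],
    [0, 1, 4, 0],
]
print(join_contacts(toy, join_bin=2, flank=2))  # (6, 2.0)
```

Running the same function on a join region and on matched coordinates in an uninterrupted contig gives the kind of case vs control comparison the caption describes: a correct join should show cross-join contacts comparable to the control.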

Table 2. Nuclear genome annotation metrics

Fig. 4. SICAvar distribution and gene models. (A) Representative examples of type I and type II SICAvar genes, with exons shown in blue and directionality indicated by arrowheads placed at the end of the 3-prime exons. Type I SICAvar genes are characterized by multiple exons (5–16), often with extremely large introns, particularly between exons 2 and 3. Type II SICAvar genes have three or four exons and are more compact, with smaller introns. In five of the six examples, the first two exons shown are typical. (B) Distribution of full SICAvar genes (types I and II) along the PKNOH scaffolds. (C) Distribution of partial SICAvar gene segments (types I and II) along the PKNOH scaffolds.
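
The exon-count ranges given in this caption can be expressed as a simple rule, sketched below. This covers only the exon-number criterion stated here; the function name and the "unclassified" label are assumptions, and the manual annotation may well have relied on additional features such as intron size.

```python
def classify_sicavar(exon_count):
    """Assign a SICAvar type from exon count alone, using the caption's ranges."""
    if 5 <= exon_count <= 16:
        return "type I"    # multiple exons, often with very large introns
    if exon_count in (3, 4):
        return "type II"   # compact genes with smaller introns
    return "unclassified"  # outside both ranges given in the caption

for n in (3, 4, 5, 12, 16, 2):
    print(n, classify_sicavar(n))
```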

Table 3. Comparative SICAvar gene statistics in the PKNH (April 2017) and PKNOH (MaHPIC PacBio) assemblies

Supplementary material

Lapp supplementary material: Figures S1-S8 (PDF, 4.9 MB)
Lapp supplementary material: Lapp supplementary material 1 (File, 215.2 KB)
Lapp supplementary material: Tables S1-S3 (PDF, 66.2 KB)