
Parasitism as the main factor shaping peptide vocabularies in current organisms



Self/non-self discrimination by vertebrate immune systems relies on recognizing peptides that occur in a parasite's proteins but not in the host's proteins. Reducing the number of ‘words’ in its own peptide vocabulary could therefore be an efficient evolutionary strategy by which a parasite escapes recognition. Here, we compared the peptide vocabularies of 30 endoparasitic and 17 free-living unicellular organisms, and of eight multicellular parasitic and 16 multicellular free-living organisms. Both unicellular and multicellular parasites used significantly fewer distinct pentapeptides than free-living controls. Impoverished pentapeptide vocabularies in parasites were observed across all five clades that contain both parasitic and free-living species. The effect of parasitism on the number of peptides used in an organism's proteins was larger than the effects of all other factors studied, including proteome size and the number of encoded proteins. This decrease in pentapeptide diversity was partly compensated for by an increased number of hexapeptides. Our results support the hypothesis of a parasitism-associated reduction of peptide vocabulary and suggest that T-cell receptors mostly recognize a five-amino-acid-long part of the peptides presented in the groove of major histocompatibility complex molecules.
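As an illustration of what a peptide vocabulary size means in practice, the following minimal Python sketch counts the distinct k-residue peptides in a set of protein sequences. It assumes overlapping sliding windows and plain string sequences; the function name `peptide_vocabulary` and the toy sequences are hypothetical, and the sketch does not reproduce the study's actual pipeline, data, or statistical treatment.

```python
def peptide_vocabulary(proteins, k=5):
    """Return the set of distinct k-residue peptides in the given sequences.

    Assumption: peptides are taken as overlapping windows of length k.
    """
    words = set()
    for seq in proteins:
        seq = seq.upper()
        # Slide a window of length k along the sequence, one residue at a time.
        for i in range(len(seq) - k + 1):
            words.add(seq[i:i + k])
    return words


# Toy sequences invented for illustration only (not real proteomes).
toy_repetitive = ["MKKKKKKLA", "MKKKKLA"]
toy_diverse = ["MKTAYIAKQ", "GSHMLEDPV"]

n_repetitive = len(peptide_vocabulary(toy_repetitive, k=5))
n_diverse = len(peptide_vocabulary(toy_diverse, k=5))
print(n_repetitive, n_diverse)
```

Passing `k=6` gives the corresponding hexapeptide vocabulary. A real comparison would also need to account for factors such as proteome size and the number of encoded proteins, which the abstract names as studied covariates.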


Corresponding author

*Corresponding author: Division of Biology, Faculty of Science, Charles University in Prague, Vinicna 7, 128 44, Prague, Czech Republic.




Supplementary materials

Zemková supplementary material: Tables S1-S3 and Figures S1-S2 (PDF, 3.0 MB)
