First whole-genome analysis of the novel coronavirus (SARS-CoV-2) obtained from COVID-19 patients from five districts in Western Serbia

This study was endeavoured to contribute in furthering our understanding of the molecular epidemiology of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by sequencing and analysing the first full-length genome sequences obtained from 48 coronavirus disease-2019 (COVID-19) patients in five districts in Western Serbia in the period April 2020–July 2020. SARS-CoV-2 sequences in Western Serbia distinguished from the Wuhan sequence in 128 SNPs in total. The phylogenetic structure of local SARS-CoV-2 isolates suggested the existence of at least four distinct groups of SARS-CoV-2 strains in Western Serbia. The first group is the most similar to the strain from Italy. These isolates included two 20A sequences and 15−30 20B sequences that displayed a newly occurring set of four conjoined mutations. The second group is the most similar to the strain from France, carrying two mutations and belonged to 20A clade. The third group is the most similar to the strain from Switzerland carrying four co-occurring mutations and belonging to 20B clade. The fourth group is the most similar to another strain from France, displaying one mutation that gave rise to a single local isolate that belonged to 20A clade.


Introduction
After initially making an appearance in December 2019 [1,2], the novel betacoronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) swiftly and saliently escalated into an international health emergency, reaching more than 5.6 million cases of coronavirus disease-2019 (COVID-19) by 31 July 2020 [3]. Key elements in revealing the architecture of any emerging pathogen are deciphering its genome and performing a comprehensive phylogenetic analysis. Next-generation sequencing (NGS) techniques were able to rapidly identify SARS-CoV-2, which bore a striking sequence similarity to a bat coronavirus, as well as to the previously described SARS-CoV [1]. SARS-CoV-2 emerged as an enveloped positive-sense single-stranded RNA virus of 29 903 nucleotides (nt) long, encoding for ORF1a [cleaved into 11 non-structural proteins (nsp)] and ORF1b (cleaved into 16 nsp), four structural proteins (spike/S, nucleocapsid/N, membrane/M, and envelope/E) and 11 accessory proteins (3a, 3b, 3c, 3d, 6, 7a, 7b, 8, 9b, 9c and 10) (the International Coronaviridae Study Group 2020, [4,5]). Accumulation of the SARS-CoV-2 whole genome sequences from the whole world in the GISAID database (https://www.gisaid.org/), and supporting software tools such as Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) [6], enabled tracking of the geographical prevalence of individual mutations and the paths of virus spread through the human population.
Early GISAID phylogenetic analyses partitioned SARS-CoV-2 based on specific combinations of genetic markers into three variants − A, B and C − distinguishing East Asian isolates from European and American ones [7]. However, this is a dynamic nomenclature, which changes as the virus spreads and mutates, and currently, it distinguishes seven SARS-CoV-2 clades: S, L, V, G, GH, GR, GV and GRY [8]. In parallel, another classification system was developed through the Nextstrain.org platform (https://nextstrain.org/sars-cov-2/), which tracks SARS-CoV-2 evolution in real-time. The Nextstrain classification organises sequences into major clades marked by the year in which the particular SARS-CoV-2 strains have emerged and a letter (i.e. 19A, 19B, 20A, 20B, 20C) [9]. Within each major clade, any new clade is at least two mutations away from its parent major clade, thus smaller clades are being labelled by their parental clade name and specific nucleotide mutations they contain (i.e. 20A/20268G). Nevertheless, both GISAID and Nextstrain classifications are based on dynamic SARS-CoV-2 phylogenetic frameworks, which identify those viral lineages that contribute most to the active spread and those that became inactive.
The first confirmed case of COVID-19 in Serbia was reported on 6 March 2020, with 25 552 more cases registered until 31 July 2020. In Western Serbia, the first official case of COVID-19 was recorded in the district of Cacak on 13 March 2020, in the district of Uzice on 17 March, in the district of Krusevac on 20 March, in the district of Kraljevo on 22 March and in the district of Novi Pazar on 4 April. However, no human isolate-derived sequences from Serbia have been analysed in detail so far. Thus, the objectives of this investigation were to sequence 48 SARS-CoV-2 isolates obtained from COVID-19 patients in five districts in Western Serbia and to determine their phylogenetic relationship to the large database of hitherto published data.

Sample collection
During the COVID-19 outbreak in Serbia, the Veterinary Specialized Institute Kraljevo was included in the testing of human naso-pharyngeal swabs for the presence of SARS-COV-2 virus. Samples were taken by the Public Health Epidemiology service from Raska, Rasina, Zlatibor and Moravian districts, all located in Western Serbia (Kraljevo, Novi Pazar, Krusevac, Uzice and Cacak).

Extraction of total nucleic acids and SARS-CoV-2 qRT-PCR
Total nucleic acids were extracted from samples using a commercial kit, BIOEXTRACT® SUPERBALL®, according to the manufacturer's instructions (Biosellal, France). Extractions were performed on a Kingfisher Flex device (Thermo Fisher Scientific, Finland). Extracted RNA was tested immediately and the remaining RNA was stored at −80°C for further molecular analysis.
Real-time RT-PCR reactions to identify SARS-CoV-2 isolates were carried out using a commercial Real-Time Fluorescent RT-PCR Kit for Detecting SARS-CoV-2 (BGI, China) according to the manufacturer's instructions. Amplification was performed on an AriaMX Real-Time PCR device (Agilent, USA). Ct values were exported and analysed in Microsoft Excel. A total of 48 isolates with Ct between 19 and 25 were selected for further NGS analyses.
cDNA synthesis and multiplex PCR cDNA synthesis was performed using SuperScript IV Reverse Transcriptase (Thermo Fisher Scientific, Finland) according to the manufacturer's instructions. Synthesised cDNA was quantified using a Real-Time RT-PCR assay designed by Pasteur Institute, France, targeting the RdRp (RNA-dependent RNA polymerase) gene. The reaction was performed on an AriaMx Real-Time PCR device (Agilent, USA) using Luna® Universal Probe qPCR Master Mix (NEB, USA).

Nanopore library preparation
Preparation of each library for Oxford Nanopore platform was performed according to the ARTIC protocol using Ligation Sequencing Kit (SQK-LSK109, Oxford Nanopore Technologies) and Native Barcoding Expansion 1-12 (PCR-free) (EXP-NBD104, Oxford Nanopore Technologies). After preparation, 7 ng of the library was loaded on an R9.4.1 flow cell (Oxford Nanopore Technologies), and sequenced on the Oxford Nanopore MinION (Oxford Nanopore Technologies) for 48 h.

Sanger sequencing
In order to confirm the mutations that were found in sequences obtained on the MinION device, Sanger sequencing was performed covering 37 SNPs using Artic ARTIC Network primers. PCR products were excised from gel and purified using mi-Gel Extraction Kit (Metabion). Sequencing reactions were performed using a BigDye Terminator kit (Applied Biosystems) v3.1 and run on a Genetic Analyzer 3130 (Applied Biosystems).

Data deposition
Analysed SARS-CoV-2 isolates in five districts in Western Serbia included six isolates from Uzice (all from April 2020), four isolates from Krusevac (all from April 2020), two isolates from Cacak (both from April 2020), 24 isolates from Kraljevo (three from April 2020, two from May 2020 and 19 from July 2020) and 12 isolates from Novi Pazar (one from April 2020, two from May 2020 and nine from July 2020). Genome sequences were deposited in the GISAID database (the Global Initiative on Sharing all Individual Data, available at https://www.gisaid.org) under the accession numbers: EPI_ISL_445086, EPI_ISL_445087, EPI_ISL_454795,  Table S1) was performed in MEGA X software [10] using Neighbour-Joining (NJ) method [11] with a 1000 bootstrap iterations. Additionally, the same analysis was performed using the whole amino acid SARS-CoV2 sequences translated from the whole genome sequences using ExPASy. The Wuhan-Hu-1 complete genome sequence used in this work is identical to the complete genome sequence of the WIV04 isolate (NCBI accession number MN996528.1) that is used by the GISAID database for the detection of the SARS-CoV-2 mutations, with the exception of the 3 ′ poly-A length. The Nextstrain (https://nextstrain.org/ ncov/global) platform was used for the real-time tracking of SARS-Cov-2 genome evolution [12]. In order to reveal the potential impact of nucleotide polymorphisms found in SARS-Cov-2 sequences from Western Serbia at the protein level, the obtained nucleotide sequences were translated into amino acid sequences using the ExPASy translate tool (https:// web.expasy.org/translate/).

Results
We report 48 whole-genome SARS-CoV-2 sequences obtained from Western Serbia patients with coronavirus disease 2019 (COVID-19) sampled in Cacak, Uzice and Krusevac in April 2020 and in Kraljevo and Novi Pazar in the period April 2020-July 2020. The sequences 29 903 nt in length, with a mean read depth ranging from 160 to 220, and harboured 126 distinct single nucleotide polymorphisms (SNPs) in comparison to the first Wuhan complete genome sequence. Four of the detected SNPs occurred in non-translated 5 ′ /3 ′ regions, 49 were synonymous changes leading to the same amino acid as in the original codon, while 75 of the detected SNPs were deduced to cause amino acid replacements, as summarised in Table 1  In accordance with presence of the above mentioned four conjoined major mutations along with presence or absence of 28881GGG>AAC, sequences were respectively classified into GR vs. G clade, according to GISAID (Fig. 1a), or into 20A vs. 20B clade, according to the Nextstrain analysis (Fig. 1b) Obtained data suggest that there may have been four independent introductions of the virus. Results of the phylogenetic analysis, presented in Figure 1c, showed that SARS-CoV-2 strains from Western Serbia could be descending from the original Wuhan strain via a more recent intermediate strain from Germany (EPI_ISL_450200 -collected on 28 January 2020), that gave origin to four other intermediate strains that actually entered Western Serbia. The first introduction could have occurred via the strain from Italy (EPI_ISL_542230 -24 February 2020) that entered Western Serbia in March/April 2020 and contributed to the rise of 31.25-66.67% of local isolates in all five districts by July 2020. This Italian strain carried four major European hotspot mutations 241C>T&3037C>T&14408C >T&23403A>G and shared an identical sequence with Western Serbia isolate from Kraljevo (EPI_ISL_462437 -23 April 2020). Besides the isolate from Kraljevo, another isolate from Uzice (EPI_ISL_462436 -28 April 2020) that belonged to 20A clade, has descended from the Italian strain. In parallel, other descendant(s) from this initial strain independently acquired an additional major European mutation 28881GGG>AAC, contributing to 15-30 Western Serbia 20B sequences (the number could not be determined precisely because of the possibility to one to three uncertain secondary introductions from Switzerland and England into this group of sequences, see Discussion). All these sequences displayed four new co-occurring polymorphisms NS5-10265G>A|G71S&NSP12-15972A>G&NSP15-19718C>T|T33I &N-28759T>C, clustering together as one 20B sub-clade and were Epidemiology and Infection The second introduction could have occurred via the strain from France (EPI_ISL_663234 -9 March 2020) that entered in two districts in Western Serbiain Cacak (EPI_ISL_455480 -7 April 2020) and Kraljevo (EPI_ISL_462438 -22 April 2020 and EPI_ISL_514426 -11 May 2020), giving rise to 6.25% of local isolates that belonged to clade 20A in April 2020 and   Dejan Vidanović et al.
strains were no longer recorded in July 2020 in Western Serbia, local 20B strains remained and spread locally, at least in Kraljevo and Novi Pazar, by July 2020.

Epidemiology and Infection
proteins accumulated all detected nucleotide changes, ranging (in descending order) from 14.06% to 0.78% of all detected SNPs, respectively. Regarding the type and positions of the mutations detected in Western Serbia isolates, they are concordant with the reports published so far. The number of transitions was 64.06%, which is agreeable with reports on transitions being generally more frequent in molecular evolution than transversions [13]. Also, finding that C>T nucleotide change is predominant among all of detected mutations (38.28%) is consistent with the other SARS-CoV-2 reports [14][15][16]. Identified nucleotide substitutions included five major hotspot mutations characteristic for Europe [17][18][19][20][21][22]. All isolates collected in five districts in Western Serbia contained D614G mutation in spike protein (known to create a more contiguous SARS-CoV-2 form), which was recorded to be introduced in the viral genome for the first time in late February [18] and entered in Western  Dejan Vidanović et al.
isolates and the dates of the local COVID-19 disease case reports suggested at least four independent introductions in Western Serbia. From four distinct SARS-CoV-2 groups deduced in Western Serbia only descendants of the strains sharing high similarity to the Italian and Switzerland strain remained active by July 2020, while there was no evidence on the activity of the strains sharing high similarity with two French strains beyond May 2020. Four identical SARS-CoV-2 strains found as shared between Italy/ Switzerland/France and Western Serbia suggest they were exchanged between countries, but deducing the direction of the introduction requires epidemiological data which are currently lacking. Also, one to three possible additional introductions in Western Serbia could have occurred, but additional data are needed to elucidate if these were introductions (and if so in which direction) or homoplasies. One of these uncertain additional entrances included the isolate from Switzerland (EPI_ISL_500938), carrying NSP3-6566G>T|D1283Y&NSP13-17864A>G|Y543C& S23707C>T&N-29422G>T, which clustered closely with the isolate from Kraljevo (EPI_ISL_541656). Still, since both isolates were sampled on the same date (6 July 2020), and the latter one from Kraljevo harboured four additional mutations NSP3-6566G>T|D1283Y&NSP13-17864A>G|Y543C &S-23707C>T&N-29422G>T, the origin of Kraljevo isolate remained uncertain. Another unclear case was related to two identical isolates from England (EPI_ISL_533103-15 July 2020 and EPI_ISL_556656 -18 July 2020), carrying NSP4-9961T>C&NSP12-16178C>T|S913L&NSP15-19648G>T|V10L& NS3a-25492A>G|T34A, which clustered closely with the isolate from Kraljevo (EPI_ISL_582531 -24 July), but the latter also displayed four additional mutations NSP3-4113C>T| A465V&NSP4-9811&NSP7-12052| K70N&NSP16-20733. Moreover, closely to these isolates, two more samples from Kraljevo (EPI_ ISL_541660 -20 July 2020 and EPI_ISL_644574 -24 July 2020) clustered, sharing two synonymous NSP3-3169T>C&NSP5-10396G>T mutations and individually acquiring additional different ones. The correct interpretation of the accumulation of numerous different mutations in local isolates in a short time period (based on the collecting dates), require additional data. One more uncertain case involved the strain from Switzerland (EPI_ISL_486532 -24 June 2020), which, based on the shared mutation M-26700G>T|V60L, clustered closely to eight isolates from Novi Pazar (EPI_ISL_541649, EPI_ISL_541654, EPI_ISL_541661, EPI_ISL_644579, EPI_ISL_ 541650, EPI_ISL_541653, EPI_ISL_541652 and EPI_ISL_541659) and three isolates from Kraljevo (EPI_ISL_582532, EPI_ISL_514427, EPI_ISL_514428), all collected in July 2020. However, the entire subgroup, in which these isolates clustered together, has descended from a recent unknown intermediate ancestral strain (originating from the Italian strain via local 20A isolate from Kraljevo -EPI_ISL_462437), which also gave the origin to an isolate from Novi Pazar (EPI_ ISL_514429 -14 July 2020) that lacked M-26700G>T|V60L, while displaying additional individual rare mutations.
Taking into account performances of SARS-CoV-2 strains detected in Western Serbia, results suggested that 20A strains did not survive as such in Western Serbia by July 2020, only the local 20B strains did. Regarding mutations that appeared in Western Serbia independently of virus foreign introductions they included NSP5-10265G>A|G71S&NSP12-15972A> G&NSP15-19718C>T|T33I, NSP2-1551C>T|A249V& NSP8-12525C>T|T145I&S-23674A>G&NS8-28077G>T|V62L, S-23621G> A|V687I, NSP4-9421T>C&NS3a-25966A>G|K192E&NS7b-27881C >T and a major triplet mutation N-28881GGG>AAC|R203K& G204R. While some of these polymorphisms were reported to occur in other countries as well, like 10265G>A causing G71S amino acid replacement in 3-CL proteinase, reported as the ninth of ten most frequent mutations detected in the main protease of SARS-CoV-2 [23] and a polymorphic position NS8-28077 which harboured G>C or G>T mutation (which was also reported as positively selected by homoplasy) [24], the others, including amino acid replacement in the spike protein S-23621G>A|V687I that appeared in Kraljevo in July 2020, were not found in public records (except the indirect link in the report by [25] that 23621 position was included in an 'in frame' 24 nt deletion in spike protein, informally called 'Bristol deletion').
Concerning the impact of all mutations recorded in Western Serbia SARS-CoV-2 strains on the viral performance, their genomic positions support influence on the RNA-RNA, RNA-protein and protein-protein interactions of SARS-CoV-2 affecting strains' performance and possibly virulence. However, epidemiological data on their virulence/pathogenicity/reproduction power are lacking at the moment. This absence of specific epidemiological data on patients with coronavirus disease 2019 linked to sequenced SARS-CoV-2 isolates is one of the shortcomings of the study. Experimental and epidemiological data on various SARS-CoV-2 mutations are just emerging in the literature, in the first place related to mutations affecting the spike protein.
Considering information published so far on the impact of amino acid changes in the spike protein matching the ones detected in Western Serbia SARS-CoV-2 strains, beside previously described D614G, they included a signal peptide mutation L5F, which was reported to increase the viral infectivity in four cell culture lines [26], as well as the H49Y mutation, which was reported to increase viral entrance in the cell [27]. Moreover, L5F and D614G were reported to possibly impact CD8T cell epitope production, influencing the severity of infection [28]. On the other hand, the M1229I mutation, occurring in the hydrophobic transmembrane domain of the spike protein, was only considered as co-occurring to several other mutations in the lineage B.1.1.298 (cluster 5), which was characterised as a variant of concern [29]. Thus, while collecting future novel data, at least the Western Serbia strains containing the aforementioned mutations should be followed through time.
The great efforts worldwide are being put forward to sequence as many complete genomes of SARS-CoV-2 as possible, allowing a comprehensive viral phylogeographic study, and more recently they were accompanied by efforts to elucidate viral transcriptome and epitranscriptome as well [5]. However, the same efforts are needed to be put forward to build a comprehensive epidemiological database on patients with coronavirus disease 2019 that will be linked to sequenced SARS-CoV-2 isolates. Only in this way, the entirety of collected data will enable association studies of viral genotype with the viral phenotype (pathogenicity, viability, etc.) and epidemiological data, enabling the development of different knowledge-based strategies to handle this new virus. Given that any of the evolutionary prediction models we are building is as good as the starting datasets and underlying bioinformatics algorithms are, securing a critical amount of relevant data is essential for any credible prediction. Besides mutations leading to amino acid replacements, synonymous nucleotide changes may also contribute to different viral pathogenicity levels, as was already shown for another positive-sense single-stranded RNA virus [30]. Therefore, all detected nucleotide changes, even the silent ones, should be taken into account when conducting data mining analysis and viral genotype−phenotype association SARS-CoV-2 studies. Such knowledge would be extremely valuable in developing efficient SARS-CoV-2 monitoring/novel fast screening tests, devising novel strategies to fight the virus and choosing adequate medical nurture for patients with coronavirus disease 2019.
Author contributions. DV and BT: WGS sequencing, bioinformatic analysis, data analysis, review; JV and CA: bioinformatic and phylogenetic analysis, review; JQ review, MS, AK, MJ, TJ and TP: funding acquisition and review; BBJ: writing the manuscript, bioinformatic analysis, data analysis.
Financial support. This study is financed by Government of Republic of Serbia, "National study of seroprevalence and molecular characterization of SARS-CoV-2 during an epidemic in the population of Serbia (EPI-COVID-SERBIA)", decision No. 53-3455/2020-3, and by Veterinary Specialized Institute Kraljevo.
Data availability. The sequences of SARS-CoV-2 that are subject of this study are deposited in the Global Initiative on Sharing all Individual Data (GISAID), available at https://www.gisaid.org.in EpiCoV database. It can be accessed under numbers described in the Material and methods section, data deposition. Supplementary