Introduction
Between June 2020–April 2021, the highest nationwide incidence of COVID-19 in the U.S. was among 18–29-year-olds, which represents most university students. Reference Srinivasa, Griffith and Waggle1 Even though students are less likely to develop severe COVID-19, identifying transmission is vital for protecting susceptible faculty, staff, and the surrounding community. Reference Srinivasa, Griffith and Waggle1–Reference Ramaiah, Khubbar and Bauer3 Using whole genome sequencing (WGS), we investigated SARS-CoV-2 transmission between University of Pittsburgh students and the surrounding community.
Methods
Study setting
SARS-CoV-2 specimens were collected between July 2020–April 2021 from students attending the University of Pittsburgh’s sprawling urban main campus (detailed in Reference O’Donnell, Brownlee and Martin4 ), and from patients and healthcare workers (HCWs) at 12 regional hospitals within an integrated healthcare system across the greater Pittsburgh region.
We assumed that SARS-CoV-2 specimens from the hospital population were representative of the community surrounding the university campus. Rationale for sequencing these specimens, along with specimen collection from students, laboratory methods, and bioinformatics analyses, has been detailed previously. Reference Srinivasa, Griffith and Waggle1,Reference Rader, Srinivasa and Griffith5,Reference Zhu, Marsh and Griffith6
Genomic data analysis
We used a pairwise SNP cutoff of 2 with average linkage hierarchical clustering to identify genetically related clusters. Reference Srinivasa, Griffith and Waggle1,Reference Rader, Srinivasa and Griffith5 FastTree (v0.11.2, default parameters) was used to construct a maximum likelihood phylogenetic tree based on the generalized time-reversible (GTR) model.
Epidemiological analyses
To evaluate potential transmission directionality and overlap between students and community members, we defined a putative transmission cluster as containing genetically related SARS-CoV-2 genomes belonging to both population groups. We then assessed the relationship between geographic proximity and clustering using ZIP code data. Specimens were excluded if the ZIP code was missing. Patient names and demographic information were reviewed to remove duplicate specimens across two groups.
To ensure appropriate geographic assignment, community SARS-CoV-2–positive specimens were stratified into two groups: (1) Ambulatory Community: included outpatients (tested in an outpatient clinic, emergency department, or within three days of hospital admission) and HCWs. For this group, we used residential ZIP codes as geographic reference, (2) Hospitalized Community: included inpatients who tested positive ≥ 3 days after hospital admission, suggesting healthcare-associated infection. Therefore, the hospital of specimen collection served as the geographic reference. For students, addresses reported during the university’s pandemic contact tracing efforts were used to obtain ZIP codes.
Pairwise distances between ZIP codes of students and community members were calculated using the Haversine formula (geosphere package in R). Mann-Whitney U test was used to assess whether geographic proximity was associated with clustering among genomes belonging to the same SARS-CoV-2 lineage across the two population groups.
Results
We identified 664 SARS-CoV-2 specimens from 653 individuals after excluding 16 with missing ZIP codes and two that failed WGS quality control (≥95% coverage at 10 × depth). Of these, 264 (39.8%) were from students and 400 (60.2%) were from community members (Supp. Table 1); 339 (51%) specimens were collected during the fall semester of 2020 and 325 (49%) in the spring semester of 2021 (Supp. Figure 1A and 1B). Among community specimens, 299 (74.8%) belonged to the Ambulatory Community (269 outpatients and 30 HCWs), while 101 (25.3%) belonged to the Hospitalized Community (Table 1). Community members were predominantly female (231, 59.4%) and white (287, 73.8%), with ages ranging from 9 months to 98 years (median = 54 years).
Table 1. Demographic characteristics of community members

We identified 52 SARS-CoV-2 lineages, predominantly B.1.2 (255, 38.4%) and B.1.1.7 (134, 20.2%). Seven lineages containing 18 genomes were unique to students, 16 (88.9%) of which coincided with semester move in days, holidays, and recesses, supported by contact tracing links. Reference Srinivasa, Griffith and Waggle1 There were 23 lineages with 54 genomes unique to community patients, ranging from 1–14 patients/lineage. Same-lineage, pairwise SNP distances ranged from 0–52, with most genetic relatedness explained within 2-SNPs, thus supporting our choice of a 2-SNP threshold for clustering (Supp. Figure 1C).
Overall, 99/664 (15%) genomes were closely genetically related, forming 14 putative transmission clusters (range = 2–30 genomes/cluster), each involving students and community members. Of these, 77 (77.8%) were from students, and 22 (22.2%) from community members (Supp. Table 1 and 2). The average cluster duration was 28 days (range = 6–55 days). In nine clusters (64.3%), the earliest specimen was from a student, while the remaining five began in the community (Figure 1A). Fewer clusters were noted during Dec 2020–Jan 2021 when students were on winter break.

Figure 1. A. Putative transmission clusters among students and the surrounding community. Note: Each data point represents a sequenced SARS-CoV-2 specimen in a cluster, colored by demographic groups. The horizontal line represents the cluster duration. N indicates the number of patients within each cluster. Visualization created in R using ggplot2 package. Important dates: Fall term move in days, 07/15/2020–08/19/2020; Family weekend, 09/25/2020; Winter recess and spring term move in days, 12/06/2020–01/19/2021; St. Patrick’s Day, 03/17/2021. B. Spatial distribution of ZIP codes from which SARS-CoV-2 specimens were collected. Note: The geographic borders represent the state (dark red) and county (gray) boundaries. Each data point represents a ZIP code that contributed SARS-CoV-2 specimens. N indicates the number of ZIP codes represented by each group. C. Spatial distribution of putative transmission clusters. Note: The outer ring represents the clusters, and the inner ring represents specimen grouping. Pie charts indicate ZIP codes containing multiple clusters and/or specimen groups. The geographic borders represent the state (dark red) and county (gray) boundaries. Visualization created in R using tigris and ggplot2 packages.
Specimens were collected from 110 ZIP codes across 17 counties, reflecting a broad geographic distribution (Figure 1B). Putative transmission clusters were widely dispersed, with a median pairwise geographic distance of 15 miles (mean = 15.5; range = 0–41; Figure 1C). In contrast, non-clustered cases had a median distance of 5.5 miles (mean = 10.8; range = 0–171; P = .0007).
Discussion
WGS surveillance provided insights into SARS-CoV-2 transmission dynamics between students on an urban campus and the surrounding community. We identified 14 overlapping clusters between the two groups, most originating from students (64.3%). The median distance for clustered cases was 15 miles vs 5.5 miles for non-clustered cases, indicating that putative transmission extended beyond the campus area and contributed to broader regional spread. These findings underscore the difficulty in preventing transmission despite multiple interventions implemented by the university. Reference O’Donnell, Brownlee and Martin4
Greater lineage diversity was observed among community members (44% of lineages) as compared to students (13%), consistent with prior findings. Reference Aggarwal, Warne and Jahun7 Furthermore, 42% of lineages were shared between students and the community, indicating substantial overlap in circulating lineages. However, this may also suggest that some circulating lineages were not sequenced, highlighting inherent limitations of convenience sampling.
Prior studies in the U.S. using publicly available genomes have shown clustering between SARS-CoV-2 sampled from students and the local community. Reference Ramaiah, Khubbar and Bauer3,Reference Casto, Paredes and Bennett8,Reference Currie, Moreno and Delahoy9 Availability of ZIP code data strengthened our study, enabling assessment of the relationship between geographical proximity and clustering, which is relevant in a city like Pittsburgh, PA that has over 20 colleges and universities interspersed throughout the community. Prior studies have shown that geographically intertwined population groups foster cross-transmission, Reference Nickbakhsh, Hughes and Christofidis10 and medical students may act as bridges for transmission to the community. Reference Aggarwal, Warne and Jahun7 Clustered specimens had a narrower distance range (0–41 vs 0–171 miles) but a higher median distance than non-clustered cases, potentially explained by frequent visitation between students and nearby family and friends. However, we did not identify student degree programs and did not have data beyond ZIP codes to establish epidemiological links.
Our study had several limitations. We sequenced a small subset of the community and student specimens, which may not reflect cases that were not collected and/or not sequenced. Clinician-initiated testing and definition of healthcare-associated infection may have led to misclassification. Community specimens were sampled from hospitals, which could underrepresent viral transmission and diversity. We inferred transmission using ZIP codes as a proxy for geographic proximity, which offers relatively low geographic resolution. Lack of symptom onset data limits temporal interpretation. Some clusters may represent widely circulating community strains that meet the stringent 2-SNP cutoff without indicating true epidemiological links. Finally, the study was conducted in a single university–community setting and, therefore, the findings may not be generalized.
In conclusion, this hypothesis-generating study suggests that SARS-CoV-2 transmission between students and the surrounding community was substantial, yet complex. This underscores the need for integrated genomic surveillance to enable targeted interventions; coordinated intervention strategies between universities, hospitals, and local public health departments through shared communication to identify early overlap events; and data-driven public health policies that guide isolation and quarantine practices to prevent transmission between the two populations.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/ash.2026.10307.
Data availability statement
The NCBI, GISAID, and SRA accession numbers for the respiratory viral genomes used in this study can be found in Supp. Table 1. Sequence data can be found at NCBI BioProject PRJNA1337675.
Acknowledgements
We are grateful to Ms. Melissa McCullough, UPMC Clinical Microbiology/Virology lab, for helping with SARS-CoV-2 specimen collection. We also appreciate Dr. Jane Marsh’s support in conceptualizing the study design.
Author contributions
VRS: Conceptualization, Methodology, Analysis, Data Curation, Visualization, Project Administration, Writing - Original Draft, Writing - Review and Editing MPG: Data Curation, Analysis, Writing - Review and Editing KAS: Data Curation, Writing - Review and Editing HC: Data Curation and Visualization, Writing - Review and Editing NJR: Data Curation and Visualization, Writing - Review and Editing KDW: Data Curation, Writing - Review and Editing TP: Data Curation, Writing - Review and Editing GMS: Conceptualization, Writing - Review and Editing LLP: Conceptualization, Resources, Project administration, Writing - Review and Editing DVT: Conceptualization, Methodology, Supervision and Writing - Review and Editing AJS: Conceptualization, Methodology, Analysis, Supervision, Writing - Review and Editing EMM: Conceptualization, Methodology, Data Curation, Supervision, Funding acquisition, Writing - Review and Editing LHH: Conceptualization, Methodology, Analysis, Resources, Data Curation, Writing - Original Draft, Writing - Review and Editing, Supervision, Project administration and Funding acquisition
Financial support
This work was funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health (NIH) (R01AI127472) and internal funding from the University of Pittsburgh and UPMC. NIH, University of Pittsburgh, and UPMC played no role in data collection, analysis, or interpretation; study design; writing of the manuscript; or decision to submit for publication.
Competing interests
LHH serves on the scientific advisory board of Next Gen Diagnostics. AJS is a consultant for Next Gen Diagnostics. LPP reports grant funding from AstraZeneca. Neither company had a role in the study design, data collection, analysis, interpretation, or writing of this manuscript. The other authors declare that there are no conflicts of interest, including financial interests, activities, relationships, and affiliations.
Ethical standard
Ethics approval was obtained from the University of Pittsburgh’s Institutional Review Board (STUDY22120012).
