*Result*: Compressive pangenomics using mutation-annotated networks.
*Further Information*
*Pangenomics is an emerging field that uses collections of genomes, rather than a single reference, to reduce bias and capture intra-species diversity. However, existing pangenomic data formats struggle to scale to millions of genomes and primarily emphasize variation, often neglecting the underlying mutational events and evolutionary relationships. This work introduces the Pangenome Mutation-Annotated Network (PanMAN), a lossless pangenome representation that produces files 3.5× to 1,391× smaller than existing variation-preserving formats, with performance generally improving on larger datasets. Beyond compression, PanMAN increases representational capacity by encoding detailed mutational and evolutionary histories inferred across genomes, thereby enabling new biological insights. Using PanMAN, a comprehensive SARS-CoV-2 pangenome was constructed from 8 million publicly available sequences, requiring only 366 MB of disk space. We also present 'panmanUtils', a toolkit that supports common analyses and ensures interoperability with existing software. PanMAN is poised to greatly improve the scale, speed, resolution and scope of pangenomic analysis and data sharing.
(© 2026. The Author(s), under exclusive licence to Springer Nature America, Inc.)*
*Competing interests: All authors declare no competing interests.*
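
The abstract's central idea, storing mutations along an evolutionary history instead of storing every genome in full, can be illustrated with a toy example. The sketch below is not the PanMAN file format or the panmanUtils API: the `Node` class, the `reconstruct` function, and the example sequences are hypothetical, and it assumes a plain tree with substitutions only (no indels, no network edges). It shows only why such an encoding can be lossless: each leaf genome is recovered exactly by replaying the mutations on the path from the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; not the PanMAN data model."""
    name: str
    # Substitutions on the branch leading to this node: (0-based position, new base).
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def reconstruct(root_seq, root):
    """Return {leaf name: sequence} by replaying mutations from the root down."""
    out = {}

    def walk(node, seq):
        seq = list(seq)  # copy so sibling branches do not see each other's edits
        for pos, base in node.mutations:
            seq[pos] = base
        if not node.children:
            out[node.name] = "".join(seq)
        for child in node.children:
            walk(child, seq)

    walk(root, root_seq)
    return out


# Assumed toy data: one stored root sequence plus three substitutions
# stands in for three full genomes.
root_seq = "ACGTACGT"
tree = Node("root", children=[
    Node("n1", [(1, "T")], [Node("A", [(7, "A")]), Node("B")]),
    Node("C", [(3, "G")]),
])

genomes = reconstruct(root_seq, tree)
print(genomes)
```

In this toy case the substitution at position 1 is stored once on the internal branch `n1` and shared by leaves `A` and `B`, which is the source of the space saving when many closely related genomes descend from the same ancestor.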