*Result*: Normalized compression distance for DNA classification.

Title:
Normalized compression distance for DNA classification.
Authors:
Hearne G; Department of Electrical Engineering, Drexel University, Philadelphia, PA, United States of America.
S Refahi M; Department of Electrical Engineering, Drexel University, Philadelphia, PA, United States of America.
Duan HN; Department of Electrical Engineering, Drexel University, Philadelphia, PA, United States of America.
Brown JR; Department of Electrical Engineering, Drexel University, Philadelphia, PA, United States of America.
Rosen GL; Department of Electrical Engineering, Drexel University, Philadelphia, PA, United States of America.
Source:
PeerJ [PeerJ] 2026 Feb 06; Vol. 14, pp. e20677. Date of Electronic Publication: 2026 Feb 06 (Print Publication: 2026).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: PeerJ Inc; Country of Publication: United States; NLM ID: 101603425; Publication Model: eCollection; Cited Medium: Internet; ISSN: 2167-8359 (Electronic); Linking ISSN: 21678359; NLM ISO Abbreviation: PeerJ; Subsets: MEDLINE
Imprint Name(s):
Original Publication: Corte Madera, CA : PeerJ Inc.
Contributed Indexing:
Keywords: Alignment-free methods; Bioinformatics; Compression distance; Genomic classification; Genomic sequence analysis; Gzip compression; Metagenomics
Substance Nomenclature:
9007-49-2 (DNA)
Entry Date(s):
Date Created: 20260211; Date Completed: 20260211; Latest Revision: 20260213
Update Code:
20260213
PubMed Central ID:
PMC12884959
DOI:
10.7717/peerj.20677
PMID:
41669552
Database:
MEDLINE

*Further Information*

*Analyzing the origin and diversity of numerous genomic sequences, such as those sampled from the human microbiome, is an important first step in genomic analysis. Normalized compression distance (NCD) has proven capable in text classification as a low-resource alternative to deep neural networks (DNNs), leveraging compression algorithms to approximate the Kolmogorov information distance. To apply this technique to genomics tasks akin to those handled by tools such as Many-against-Many sequence searching (MMseqs) and Kraken2, we explored the use of a gzip-based NCD combination in both gene labeling of open reading frames (ORFs) and taxonomic classification of short reads. Our implementation achieved 0.89 accuracy and 0.88 macro-F1 on human gene classification, surpassing similar NCD-based approaches. In prokaryotic gene labeling tasks, NCD shows superior classification accuracy to traditional alignment or exact-match tools in out-of-distribution settings, while also outperforming comparable sequence-embedding methods in in-distribution classification. However, the O(MN) computational complexity, where M and N denote the sizes of the training and test databases, respectively, constrains scalability to very large datasets. These findings nonetheless demonstrate that compression-based approaches provide an effective alternative for genomic sequence classification, particularly in low-data environments.
(©2026 Hearne et al.)*
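
As a rough illustration of the technique the abstract describes, the sketch below computes a gzip-based NCD using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length, and then assigns a label by nearest neighbor. This is not the authors' implementation: the abstract does not specify their pipeline, and the function names (`compressed_size`, `ncd`, `nearest_label`), the use of plain concatenation, and the 1-nearest-neighbor rule are all assumptions made for illustration.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input (illustrative helper)."""
    return len(gzip.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses the standard formula with gzip as the compressor; smaller values
    suggest the two sequences share more compressible structure.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # assumption: simple concatenation, no separator
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_label(query: bytes, training: list[tuple[bytes, str]]) -> str:
    """Return the label of the training sequence closest to `query` under NCD.

    Assumption: a 1-nearest-neighbor rule. Each query is compared against
    every training sequence, so labeling N queries against M training
    sequences costs on the order of M * N compressions.
    """
    _, label = min(training, key=lambda pair: ncd(query, pair[0]))
    return label
```

A call such as `nearest_label(read, [(seq_a, "gene_A"), (seq_b, "gene_B")])` would return the label of whichever reference compresses best together with `read`. One design caveat worth noting: DEFLATE, the algorithm behind gzip, uses a 32 KB sliding window, so for inputs much longer than that the compressor cannot refer back to the first sequence while encoding the second, and the distance becomes less informative.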

*The authors declare there are no competing interests.*