*Result*: Statistical modelling of an outcome variable with integrated multi-omics.
*Further Information*
*Background: In studies that aim to model the relationship between an outcome variable and multiple omics datasets, it is often desirable to reduce the dimensionality of these datasets or to represent one omics dataset in terms of another. Several approaches exist for this purpose, including univariate methods such as polygenic scores, and multivariate methods. Multivariate approaches offer advantages by producing lower-dimensional integrative scores, capturing joint structures across datasets, and filtering out dataset-specific noise. In this paper, we describe one univariate and two multivariate methods, and evaluate their performance through simulations involving two correlated multivariate normally distributed omics datasets, as well as a combination of one multivariate normal and one fixed categorical dataset.
Results: We assess method performance using the root mean squared error (RMSE) when modelling the outcome variable as a function of the reduced omics representations. Multivariate methods generally perform well, particularly when a slightly higher number of components is used for integration. They outperform the univariate method in scenarios involving two normally distributed omics datasets and perform comparably in settings with one normal and one categorical dataset. In real data applications, including two metabolomics datasets from TwinsUK and a metabolomics-genetic dataset from ORCADES, all methods show similar performance in modelling body mass index.
Conclusions: Multivariate methods provide a valuable framework for summarizing multi-omics datasets into low-dimensional components suitable for outcome modelling. Even in the presence of non-normal data, these methods offer a promising alternative to high-dimensional univariate approaches.
(© 2025. The Author(s).)*
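The evaluation described in the abstract (simulate two correlated omics datasets, reduce them to a few components, model the outcome, and score by RMSE) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the shared-latent-factor simulation, the sample and feature sizes, the train/test split, and in particular the use of plain principal components on the concatenated datasets as a stand-in reduction step. It is not the univariate or multivariate methods evaluated in the paper, whose details are not given here.

```python
import numpy as np


def simulate(n=300, p1=40, p2=30, k=3, noise=1.0, seed=0):
    """Two correlated, normally distributed datasets driven by shared latent factors.

    All sizes and noise levels are hypothetical illustration values.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, k))  # latent factors shared by both datasets
    x1 = z @ rng.normal(size=(k, p1)) + noise * rng.normal(size=(n, p1))
    x2 = z @ rng.normal(size=(k, p2)) + noise * rng.normal(size=(n, p2))
    y = z @ np.ones(k) + 0.5 * rng.normal(size=n)  # outcome depends on the factors
    return x1, x2, y


def joint_scores(x_train, x_test, n_components):
    """Project onto the leading principal components fitted on training data only."""
    mean = x_train.mean(axis=0)
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    loadings = vt[:n_components].T
    return (x_train - mean) @ loadings, (x_test - mean) @ loadings


def holdout_rmse(x1, x2, y, n_components, n_train=200):
    """Regress the outcome on the component scores and return test-set RMSE."""
    x = np.hstack([x1, x2])  # concatenate the two omics blocks
    s_train, s_test = joint_scores(x[:n_train], x[n_train:], n_components)
    a_train = np.column_stack([np.ones(len(s_train)), s_train])  # add intercept
    a_test = np.column_stack([np.ones(len(s_test)), s_test])
    beta, *_ = np.linalg.lstsq(a_train, y[:n_train], rcond=None)
    residuals = y[n_train:] - a_test @ beta
    return float(np.sqrt(np.mean(residuals ** 2)))


if __name__ == "__main__":
    x1, x2, y = simulate()
    for n_components in (1, 3, 5):
        print(n_components, round(holdout_rmse(x1, x2, y, n_components), 3))
```

Varying `n_components` in this toy setting is one way to explore how the number of retained components affects the held-out RMSE, which is the kind of comparison the Results paragraph reports.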
*Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: ZG is currently an employee of Novartis Pharmaceuticals UK, but all work presented in this manuscript was completed while he was an employee of the University of Cambridge and UMC Utrecht.*