Dealing with dimensionality: the application of machine learning to multi-omics data.

Feldner-Busztin, Dylan; Firbas Nisantzis, Panos; Edmunds, Shelley Jane; Boza, Gergely; Racimo, Fernando; Gopalakrishnan, Shyam; Limborg, Morten Tønsberg; Lahti, Leo; de Polavieja, Gonzalo G

Affiliation

Feldner-Busztin D; Champalimaud Centre for the Unknown, Champalimaud Foundation, 1400-038 Lisbon, Portugal.
Firbas Nisantzis P; Champalimaud Centre for the Unknown, Champalimaud Foundation, 1400-038 Lisbon, Portugal.
Edmunds SJ; Center for Evolutionary Hologenomics, GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, 1353 Copenhagen, Denmark.
Boza G; Centre for Ecological Research, 1113 Budapest, Hungary.
Racimo F; Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark.
Gopalakrishnan S; Center for Evolutionary Hologenomics, GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, 1353 Copenhagen, Denmark.
Limborg MT; Center for Evolutionary Hologenomics, GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, 1353 Copenhagen, Denmark.
Lahti L; Department of Computing, University of Turku, 20014 Turku, Finland.
de Polavieja GG; Champalimaud Centre for the Unknown, Champalimaud Foundation, 1400-038 Lisbon, Portugal.

Bioinformatics; 39(2), 2023 02 03.

Article in English | MEDLINE | ID: mdl-36637211

ABSTRACT

MOTIVATION: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets.

RESULTS: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments.

AVAILABILITY AND IMPLEMENTATION: All data and processing scripts are available at this GitLab repository: https://gitlab.com/polavieja_lab/ml_multi-omics_review/ or in Zenodo: https://doi.org/10.5281/zenodo.7361807.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
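The strategy summarised in the RESULTS, reducing the feature count before fitting a model on relatively few samples, can be sketched as follows. This is a minimal illustration on synthetic data, not code from the surveyed studies or from the authors' repository. PCA (computed with a NumPy SVD) and a nearest-centroid classifier are stand-ins chosen for brevity rather than the autoencoders, random forests or support vector machines named in the abstract, and every function name, dataset size and parameter value is a hypothetical assumption.

```python
import numpy as np


def make_synthetic_omics(n_samples=40, n_features=1000, n_informative=50,
                         shift=3.0, seed=0):
    """Hypothetical 'few samples, many features' dataset with two classes."""
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % 2                    # balanced binary labels
    X = rng.normal(size=(n_samples, n_features))    # pure-noise features
    X[:, :n_informative] += shift * y[:, None]      # class 1 is shifted on a few features
    return X, y


def fit_pca(X_train, n_components):
    """Return the training mean and the top principal axes, via SVD."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Map samples into the low-dimensional PCA space."""
    return (X - mean) @ components.T


def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Assign each test sample to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


X, y = make_synthetic_omics()
train, test = np.arange(0, 30), np.arange(30, 40)

# Fit the reduction on training samples only, so no test information leaks in.
mean, components = fit_pca(X[train], n_components=5)
Z_train = project(X[train], mean, components)
Z_test = project(X[test], mean, components)

y_pred = nearest_centroid_predict(Z_train, y[train], Z_test)
accuracy = float((y_pred == y[test]).mean())
print(Z_train.shape, accuracy)
```

Here 1000 features are compressed to 5 components before classification, so the downstream model sees far fewer dimensions than samples. With real multi-omics data, the choice of reduction method and model would depend on the data types being integrated.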

Subject(s)

Multiomics; Neoplasms; Humans; Neoplasms/genetics; Machine Learning; Genome

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Multiomics / Neoplasms | Study type: Prognostic_studies | Limits: Humans | Language: En | Journal: Bioinformatics | Journal subject: Medical Informatics | Year: 2023 | Document type: Article | Affiliation country: Portugal
