DBFE: distribution-based feature extraction from structural variants in whole-genome data.

Piernik, Maciej; Brzezinski, Dariusz; Sztromwasser, Pawel; Pacewicz, Klaudia; Majer-Burman, Weronika; Gniot, Michal; Sielski, Dawid; Bryzghalov, Oleksii; Wozna, Alicja; Zawadzki, Pawel

Affiliation

Piernik M; Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology, 60-965 Poznan, Poland.
Brzezinski D; MNM Bioscience Inc., Cambridge, MA 02142, USA.
Sztromwasser P; Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology, 60-965 Poznan, Poland.
Pacewicz K; MNM Bioscience Inc., Cambridge, MA 02142, USA.
Majer-Burman W; Institute of Bioorganic Chemistry of the Polish Academy of Sciences, 61-704 Poznan, Poland.
Gniot M; MNM Bioscience Inc., Cambridge, MA 02142, USA.
Sielski D; MNM Bioscience Inc., Cambridge, MA 02142, USA.
Bryzghalov O; MNM Bioscience Inc., Cambridge, MA 02142, USA.
Wozna A; MNM Bioscience Inc., Cambridge, MA 02142, USA.
Zawadzki P; Department of Hematology and Bone Marrow Transplantation, Poznan University of Medical Sciences, 60-569 Poznan, Poland.

Bioinformatics; 38(19): 4466-4473, 2022 Sep 30.

Article in English | MEDLINE | ID: mdl-35929780

ABSTRACT

MOTIVATION: Whole-genome sequencing has revolutionized biosciences by providing tools for constructing complete DNA sequences of individuals. With entire genomes at hand, scientists can pinpoint DNA fragments responsible for oncogenesis and predict patient responses to cancer treatments. Machine learning plays a paramount role in this process. However, the sheer volume of whole-genome data makes it difficult to encode the characteristics of genomic variants as features for learning algorithms.

RESULTS: In this article, we propose three feature extraction methods that facilitate classifier learning from sets of genomic variants. The core contributions of this work include: (i) strategies for determining features using variant length binning, clustering and density estimation; and (ii) a programming library for automating distribution-based feature extraction in machine learning pipelines. The proposed methods have been validated on five real-world datasets using four different classification algorithms and a clustering approach. Experiments on genomes of 219 ovarian, 61 lung and 929 breast cancer patients show that the proposed approaches automatically identify genomic biomarkers associated with cancer subtypes and clinical response to oncological treatment. Finally, we show that the extracted features can be used alongside unsupervised learning methods to analyze genomic samples.

AVAILABILITY AND IMPLEMENTATION: The source code of the presented algorithms and reproducible experimental scripts are available on GitHub at https://github.com/MNMdiagnostics/dbfe.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
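The variant length binning strategy named in the abstract can be illustrated with a minimal sketch. This is not the API of the dbfe library and not necessarily the authors' procedure: the function name `length_bin_features`, the log-spaced bin edges, the number of bins and the length range are all assumptions chosen for illustration, and the input lengths are made-up toy values. The idea shown is only that each sample's set of variant lengths is summarized as a fixed-length vector of per-bin counts that a classifier can consume.

```python
import numpy as np


def length_bin_features(samples, n_bins=5, min_len=50, max_len=10_000_000):
    """Turn each sample's variant lengths into per-bin counts.

    Hypothetical helper for illustration only; bin layout and
    defaults are assumptions, not taken from the article.
    """
    # Geometrically spaced edges, because structural variant lengths
    # span several orders of magnitude.
    edges = np.geomspace(min_len, max_len, n_bins + 1)
    features = np.zeros((len(samples), n_bins), dtype=int)
    for i, lengths in enumerate(samples):
        # Clip so that out-of-range lengths fall into the outer bins.
        clipped = np.clip(np.asarray(lengths, dtype=float), min_len, max_len)
        counts, _ = np.histogram(clipped, bins=edges)
        features[i] = counts
    return features, edges


# Toy input: one list of variant lengths (in base pairs) per sample.
samples = [
    [60, 120, 5_000, 2_000_000],
    [75, 80, 90, 300_000],
]
X, edges = length_bin_features(samples)
print(X)
```

Each row of `X` has the same width regardless of how many variants a sample carries, which is what lets sets of variants of different sizes be fed to standard classifiers.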

Subject(s)

Genome; Software; Humans; Genomics/methods; Algorithms; Machine Learning

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Software / Genome | Type of study: Prognostic studies | Limit: Humans | Language: English | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2022 | Document type: Article | Affiliation country: Poland
