A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism.

Varma, Maya; Paskov, Kelley M; Chrisman, Brianna S; Sun, Min Woo; Jung, Jae-Yoon; Stockham, Nate T; Washington, Peter Y; Wall, Dennis P

Affiliation

Varma M; Department of Computer Science, Stanford University, Stanford, CA, USA.
Paskov KM; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
Chrisman BS; Department of Bioengineering, Stanford University, Stanford, CA, USA.
Sun MW; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
Jung JY; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
Stockham NT; Department of Pediatrics, Stanford University, Stanford, CA, USA.
Washington PY; Department of Neuroscience, Stanford University, Stanford, CA, USA.
Wall DP; Department of Bioengineering, Stanford University, Stanford, CA, USA.

BioData Min; 14(1): 28, 2021 May 03.

Article in English | MEDLINE | ID: mdl-33941233

ABSTRACT

BACKGROUND:

Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders.

RESULTS:

We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet shows great variability in the features it selects from over 230,000 possible variants. To improve model stability, we extract the variants assigned non-zero weights in each of five cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allows us to identify 55 variants, which we show to be more stable than the features identified by the original classifier.
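The abstract does not spell out how the flow network is built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes one layer of nodes per cross-validation fold, an edge between variants in adjacent folds when a hypothetical predicate `in_ld` says they are in LD, and a capacity of 1 per variant so that each variant can sit on at most one source-to-sink path. The function names (`max_flow`, `stable_variants`), the toy variant positions, and the distance-based LD rule are all illustrative assumptions.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths until none remain."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Update forward and reverse residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def stable_variants(fold_features, in_ld):
    """Assumed construction: layered network, one layer per fold.

    fold_features: list of sets of variant IDs (non-zero-weight features per fold).
    in_ld: hypothetical predicate, True if two variants are treated as LD-linked.
    """
    cap = defaultdict(dict)
    last = len(fold_features) - 1
    for k, feats in enumerate(fold_features):
        for v in sorted(feats):
            # Split each variant into in/out nodes so it carries at most 1 unit.
            cap[("in", k, v)][("out", k, v)] = 1
            if k == 0:
                cap["source"][("in", k, v)] = 1
            if k == last:
                cap[("out", k, v)]["sink"] = 1
            else:
                for w in sorted(fold_features[k + 1]):
                    if in_ld(v, w):
                        cap[("out", k, v)][("in", k + 1, w)] = 1
    flow, residual = max_flow(cap, "source", "sink")
    # A variant is on a flow path if its in->out edge is saturated.
    used = {
        node[2]
        for node in cap
        if isinstance(node, tuple) and node[0] == "in"
        and residual[node][("out", node[1], node[2])] == 0
    }
    return flow, used


# Toy data: variants are integer positions; "LD" is assumed to mean within 5 bp.
folds = [{100, 200, 300}, {102, 205, 900}, {101, 203, 500}]
flow, used = stable_variants(folds, lambda a, b: abs(a - b) <= 5)
print(flow, sorted(used))  # 2 [100, 101, 102, 200, 203, 205]
```

In this toy setting, variants 300, 900, and 500 are dropped because no LD-linked chain carries them through every fold, which is one way to read "stable" in the sense described above.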

CONCLUSION:

Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.

Keywords

Feature selection; Feature stability; Linkage disequilibrium; Machine learning; Maximum flow; Network

Full text: 1 | Database: MEDLINE | Study type: Diagnostic studies / Prognostic studies / Risk factors studies | Language: English | Journal: BioData Min | Year: 2021 | Document type: Article