Search | VHL Regional Portal

MousiPLIER: A Mouse Pathway-Level Information Extractor Model.

Zhang, Shuo; Heil, Benjamin J; Mao, Weiguang; Chikina, Maria; Greene, Casey S; Heller, Elizabeth A.

eNeuro ; 11(6)2024 Jun.

Article in English | MEDLINE | ID: mdl-38789274

ABSTRACT

High-throughput gene expression profiling measures individual gene expression across conditions. However, genes are regulated in complex networks, not as individual entities, limiting the interpretability of gene expression data. Machine learning models that incorporate prior biological knowledge are a powerful tool to extract meaningful biology from gene expression data. Pathway-level information extractor (PLIER) is an unsupervised machine learning method that defines biological pathways by leveraging the vast amount of published transcriptomic data. PLIER converts gene expression data into known pathway gene sets, termed latent variables (LVs), to substantially reduce data dimensionality and improve interpretability. In the current study, we trained the first mouse PLIER model on 190,111 mouse brain RNA-sequencing samples, the greatest amount of training data ever used by PLIER. We then validated the mousiPLIER approach in a study of microglia and astrocyte gene expression across mouse brain aging. mousiPLIER identified biological pathways that are significantly associated with aging, including one latent variable (LV41) corresponding to striatal signal. To gain further insight into the genes contained in LV41, we performed k-means clustering on the training data to identify studies that respond strongly to LV41. We found that the variable was relevant to striatum and aging across the scientific literature. Finally, we built a Web server (http://mousiplier.greenelab.com/) for users to easily explore the learned latent variables. Taken together, this study defines mousiPLIER as a method to uncover meaningful biological processes in mouse brain transcriptomic studies.

Subject(s)

Brain , Animals , Mice , Brain/metabolism , Gene Expression Profiling , Aging/physiology , Unsupervised Machine Learning , Transcriptome , Astrocytes/metabolism , Microglia/metabolism , Machine Learning , Male , Mice, Inbred C57BL

MousiPLIER: A Mouse Pathway-Level Information Extractor Model.

Zhang, Shuo; Heil, Benjamin J; Mao, Weiguang; Chikina, Maria; Greene, Casey S; Heller, Elizabeth A.

bioRxiv ; 2023 Aug 15.

Article in English | MEDLINE | ID: mdl-37577575

ABSTRACT

High throughput gene expression profiling is a powerful approach to generate hypotheses on the underlying causes of biological function and disease. Yet this approach is limited by its ability to infer underlying biological pathways and burden of testing tens of thousands of individual genes. Machine learning models that incorporate prior biological knowledge are necessary to extract meaningful pathways and generate rational hypothesis from the vast amount of gene expression data generated to date. We adopted an unsupervised machine learning method, Pathway-level information extractor (PLIER), to train the first mouse PLIER model on 190,111 mouse brain RNA-sequencing samples, the greatest amount of training data ever used by PLIER. mousiPLER converted gene expression data into a latent variables that align to known pathway or cell maker gene sets, substantially reducing data dimensionality and improving interpretability. To determine the utility of mousiPLIER, we applied it to a mouse brain aging study of microglia and astrocyte transcriptomic profiling. We found a specific set of latent variables that are significantly associated with aging, including one latent variable (LV41) corresponding to striatal signal. We next performed k-means clustering on the training data to identify studies that respond strongly to LV41, finding that the variable is relevant to striatum and aging across the scientific literature. Finally, we built a web server (http://mousiplier.greenelab.com/) for users to easily explore the learned latent variables. Taken together this study provides proof of concept that mousiPLIER can uncover meaningful biological processes in mouse transcriptomic studies.

The effect of non-linear signal in classification problems using gene expression.

Heil, Benjamin J; Crawford, Jake; Greene, Casey S.

PLoS Comput Biol ; 19(3): e1010984, 2023 03.

Article in English | MEDLINE | ID: mdl-36972227

ABSTRACT

Those building predictive models from transcriptomic data are faced with two conflicting perspectives. The first, based on the inherent high dimensionality of biological systems, supposes that complex non-linear models such as neural networks will better match complex biological systems. The second, imagining that complex systems will still be well predicted by simple dividing lines prefers linear models that are easier to interpret. We compare multi-layer neural networks and logistic regression across multiple prediction tasks on GTEx and Recount3 datasets and find evidence in favor of both possibilities. We verified the presence of non-linear signal when predicting tissue and metadata sex labels from expression data by removing the predictive linear signal with Limma, and showed the removal ablated the performance of linear methods but not non-linear ones. However, we also found that the presence of non-linear signal was not necessarily sufficient for neural networks to outperform logistic regression. Our results demonstrate that while multi-layer neural networks may be useful for making predictions from gene expression data, including a linear baseline model is critical because while biological systems are high-dimensional, effective dividing lines for predictive models may not be.

Subject(s)

Gene Expression , Nonlinear Dynamics , Gene Expression Profiling , Neural Networks, Computer , Linear Models

Hetnet connectivity search provides rapid insights into how two biomedical entities are related.

Himmelstein, Daniel S; Zietz, Michael; Rubinetti, Vincent; Kloster, Kyle; Heil, Benjamin J; Alquaddoomi, Faisal; Hu, Dongbo; Nicholson, David N; Hao, Yun; Sullivan, Blair D; Nagle, Michael W; Greene, Casey S.

bioRxiv ; 2023 Jan 07.

Article in English | MEDLINE | ID: mdl-36711546

ABSTRACT

Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet connects 11 types of nodes - including genes, diseases, drugs, pathways, and anatomical structures - with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search . We provide an open source implementation of these methods in our new Python package named hetmatpy .

The Field-Dependent Nature of PageRank Values in Citation Networks.

Heil, Benjamin J; Greene, Casey S.

bioRxiv ; 2023 Jan 06.

Article in English | MEDLINE | ID: mdl-36711900

ABSTRACT

The value of scientific research can be easier to assess at the collective level than at the level of individual contributions. Several journal-level and article-level metrics aim to measure the importance of journals or individual manuscripts. However, many are citation-based and citation practices vary between fields. To account for these differences, scientists have devised normalization schemes to make metrics more comparable across fields. We use PageRank as an example metric and examine the extent to which field-specific citation norms drive estimated importance differences. In doing so, we recapitulate differences in journal and article PageRanks between fields. We also find that manuscripts shared between fields have different PageRanks depending on which field's citation network the metric is calculated in. We implement a degree-preserving graph shuffling algorithm to generate a null distribution of similar networks and find differences more likely attributed to field-specific preferences than citation norms. Our results suggest that while differences exist between fields' metric distributions, applying metrics in a field-aware manner rather than using normalized global metrics avoids losing important information about article preferences. They also imply that assigning a single importance value to a manuscript may not be a useful construct, as the importance of each manuscript varies by the reader's field.

Hetnet connectivity search provides rapid insights into how biomedical entities are related.

Gigascience ; 122022 12 28.

Article in English | MEDLINE | ID: mdl-37503959

ABSTRACT

BACKGROUND: Hetnets, short for "heterogeneous networks," contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes-including genes, diseases, drugs, pathways, and anatomical structures-with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious about not only how metformin is related to breast cancer but also how a given gene might be involved in insomnia. FINDINGS: We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any 2 nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. CONCLUSION: We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open-source implementation of these methods in our new Python package named hetmatpy.

Subject(s)

Algorithms , Probability

Reproducibility standards for machine learning in the life sciences.

Heil, Benjamin J; Hoffman, Michael M; Markowetz, Florian; Lee, Su-In; Greene, Casey S; Hicks, Stephanie C.

Nat Methods ; 18(10): 1132-1135, 2021 10.

Article in English | MEDLINE | ID: mdl-34462593

Subject(s)

Computational Biology/methods , Computational Biology/standards , Machine Learning/standards , Reproducibility of Results , Software

Protein classification using modified n-grams and skip-grams.

Islam, S M Ashiqul; Heil, Benjamin J; Kearney, Christopher Michel; Baker, Erich J.

Bioinformatics ; 34(9): 1481-1487, 2018 05 01.

Article in English | MEDLINE | ID: mdl-29309523

ABSTRACT

Motivation: Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG). Results: A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists. Availability and implementation: m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg. Contact: erich_baker@baylor.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Molecular Sequence Annotation/methods , Natural Language Processing , Protein Conformation , Proteins/classification , Sequence Analysis, Protein/methods , Supervised Machine Learning , Models, Molecular , Proteins/metabolism

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL