ABSTRACT
Computational data-centric research techniques play a prevalent and multi-disciplinary role in life science research. In the past, scientists in wet labs generated the data and computational researchers focused on creating tools for analyzing those data. Leveraging the increased availability of public data, computational researchers are now becoming more independent and taking leadership roles within biomedical projects. We can now generate vast amounts of data, and the challenge has shifted from data generation to data analysis. Here we discuss the pitfalls, challenges, and opportunities facing the field of data-centric research in biology. We examine the evolving perception of computational data-driven research and its rise as an independent domain in biomedical research, while also addressing the significant collaborative opportunities that arise from integrating computational research with experimental and translational biology. Finally, we discuss the future of data-centric research and its applications across various areas of the biomedical field.
Subject(s)
Biomedical Research , Computational Biology , Computational Biology/methods , Humans
ABSTRACT
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWASs). Standard GWASs are well powered to interrogate additive models; however, new approaches are required for investigating other modes of inheritance such as dominance and epistasis. Epistasis, or non-additive interaction between genes, exists across the genome but often goes undetected because of a lack of statistical power. Furthermore, the customary adoption of LD pruning in standard GWASs excludes detection of sites that are in LD but might underlie the genetic architecture of complex traits. We hypothesize that uncovering long-range interactions between loci with strong LD due to epistatic selection can elucidate genetic mechanisms underlying common diseases. To investigate this hypothesis, we tested for associations between 23 common diseases and 5,625,845 epistatic SNP-SNP pairs (determined by Ohta's D statistics) in long-range LD (>0.25 cM). Across five disease phenotypes, we identified one significant and four near-significant associations that replicated in two large genotype-phenotype datasets (UK Biobank and eMERGE). The genes most likely involved in the replicated associations were (1) members of highly conserved gene families with complex roles in multiple pathways, (2) essential genes, and/or (3) genes associated in the literature with complex traits that display variable expressivity. These results support the highly pleiotropic and conserved nature of variants in long-range LD under epistatic selection. Our work supports the hypothesis that epistatic interactions regulate diverse clinical mechanisms and might especially be driving factors in conditions with a wide range of phenotypic outcomes.
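The pairwise statistics used in this kind of analysis build on the basic two-locus LD coefficient. Below is a minimal sketch of that coefficient and its normalized form D', assuming phased haplotype counts; it illustrates the building block only, not the full Ohta's D decomposition used in the study.

```python
# Minimal sketch: two-locus LD coefficient D and normalized D'.
# Assumes phased haplotype counts; not the Ohta's D decomposition itself.
def ld_coefficient(hap_counts):
    """hap_counts: dict with counts of haplotypes 'AB', 'Ab', 'aB', 'ab'."""
    n = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / n                        # A-B haplotype frequency
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / n    # allele A frequency
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / n    # allele B frequency
    d = p_ab - p_a * p_b                               # raw LD coefficient
    # Normalize by the maximum |D| attainable at these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, (d / d_max if d_max > 0 else 0.0)

# Example: two loci in strong positive LD.
print(ld_coefficient({"AB": 45, "Ab": 5, "aB": 5, "ab": 45}))  # D = 0.2, D' = 0.8
```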
Subject(s)
Epistasis, Genetic , Genome-Wide Association Study , Linkage Disequilibrium/genetics , Genotype , Biological Specimen Banks , United Kingdom , Polymorphism, Single Nucleotide/genetics
ABSTRACT
Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction.
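As a concrete illustration of the polygenic model described above, a standard polygenic risk score is a weighted sum of risk-allele dosages. The minimal sketch below uses hypothetical genotype dosages and GWAS effect sizes; real pipelines add quality control, LD clumping or thresholding, and ancestry adjustment.

```python
import numpy as np

# Minimal polygenic risk score sketch: weighted sum of risk-allele dosages
# (0, 1, or 2 copies per SNP) using GWAS effect sizes as weights.
# Data and variable names are hypothetical.
def polygenic_risk_score(dosages, betas):
    """dosages: (n_individuals, n_snps) array; betas: (n_snps,) effect sizes."""
    return dosages @ betas

rng = np.random.default_rng(0)
dosages = rng.integers(0, 3, size=(5, 100))    # 5 individuals, 100 SNPs
betas = rng.normal(0, 0.05, size=100)          # per-SNP effect weights
print(polygenic_risk_score(dosages, betas))    # one score per individual
```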
Subject(s)
Electronic Health Records , Genetic Association Studies , Genetic Predisposition to Disease , Multifactorial Inheritance , Algorithms , Computational Biology/methods , Genome-Wide Association Study , Genomics/methods , Genotype , Humans , Phenotype , Reproducibility of Results , Risk Assessment , Risk Factors
ABSTRACT
MOTIVATION: Answering and solving complex problems with a large language model (LLM) in a specialized domain such as biomedicine is a challenging task that requires both factual consistency and logic, and LLMs often suffer from major limitations such as hallucinating false or irrelevant information or being influenced by noisy data. These issues can compromise the trustworthiness, accuracy, and compliance of LLM-generated text and insights. RESULTS: The Knowledge Retrieval Augmented Generation ENgine (KRAGEN) is a new tool that combines knowledge graphs, Retrieval Augmented Generation (RAG), and advanced prompting techniques to solve complex problems with natural language. KRAGEN converts knowledge graphs into a vector database and uses RAG to retrieve relevant facts from it. KRAGEN applies an advanced prompting technique, graph-of-thoughts (GoT), to dynamically break a complex problem into smaller subproblems; it solves each subproblem using the relevant knowledge retrieved through the RAG framework, which limits hallucination, and finally consolidates the subproblem answers into a solution. KRAGEN's graph visualization allows the user to interact with and evaluate the quality of the solution's GoT structure and logic. AVAILABILITY AND IMPLEMENTATION: KRAGEN is deployed by running its custom Docker containers. KRAGEN is available as open-source from GitHub at: https://github.com/EpistasisLab/KRAGEN.
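The retrieval step described above, vectorizing knowledge-graph facts and fetching the most relevant ones for a subproblem, can be sketched conceptually as follows. This is an illustration with a placeholder embedding function and toy triples, not KRAGEN's actual implementation.

```python
import numpy as np

# Conceptual RAG retrieval sketch: knowledge-graph triples are embedded once,
# then the triples most similar to a query are returned as LLM context.
def embed(text: str) -> np.ndarray:
    # Placeholder only; a real system would call a text-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

triples = [
    "GeneA - upregulates - GeneB",
    "DrugX - inhibits - GeneA",
    "GeneB - associated_with - DiseaseY",
]
index = np.stack([embed(t) for t in triples])   # the "vector database"

def retrieve(query: str, k: int = 2):
    q = embed(query)
    scores = index @ q                          # cosine similarity (unit vectors)
    return [triples[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("Which drugs affect DiseaseY?"))
```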
Subject(s)
Software , Natural Language Processing , Problem Solving , Algorithms , Information Storage and Retrieval/methods , Humans , Computational Biology/methods , Databases, Factual
ABSTRACT
Natural language processing techniques are having an increasing impact on clinical care from the patient, clinician, administrator, and research perspectives. Applications include automated generation of clinical notes and discharge letters, medical term coding for billing, medical chatbots for both patients and clinicians, data enrichment for the identification of disease symptoms or diagnoses, cohort selection for clinical trials, and auditing. This review presents an overview of the history of natural language processing techniques together with a brief technical background. It then discusses implementation strategies for natural language processing tools, focusing specifically on large language models, and concludes with future opportunities for applying such techniques in the field of cardiology.
Subject(s)
Artificial Intelligence , Cardiology , Humans , Natural Language Processing , Patient Discharge
ABSTRACT
MOTIVATION: Biomedical and healthcare domains generate vast amounts of complex data that can be challenging to analyze using machine learning tools, especially for researchers without computer science training. RESULTS: Aliro is an open-source software package designed to automate machine learning analysis through a clean web interface. By incorporating large language models, Aliro lets users interact with their data conversationally, seamlessly retrieving and executing code generated by the model and accelerating the automated discovery of new insights from data. Aliro includes a pre-trained machine learning recommendation system that assists the user in automating the selection of machine learning algorithms and their hyperparameters, and it provides visualizations of the evaluated models and data. AVAILABILITY AND IMPLEMENTATION: Aliro is deployed by running its custom Docker containers. Aliro is available as open-source from GitHub at: https://github.com/EpistasisLab/Aliro.
Subject(s)
Algorithms , Software , Machine Learning , Language
ABSTRACT
Information is the cornerstone of research, from experimental (meta)data and computational processes to complex inventories of reagents and equipment. These 10 simple rules discuss best practices for leveraging laboratory information management systems to transform this large information load into useful scientific findings.
ABSTRACT
Investigating the relationship between genetic variation and phenotypic traits is a key issue in quantitative genetics. For Alzheimer's disease specifically, the associations between genetic markers and quantitative traits remain poorly characterized; once identified, they will provide valuable guidance for the study and development of genetics-based treatment approaches. Currently, to analyze the association between two modalities, sparse canonical correlation analysis (SCCA) is commonly used to compute one sparse linear combination of the variable features for each modality, giving a pair of linear combination vectors that maximize the cross-correlation between the analyzed modalities. One drawback of the plain SCCA model is that existing findings and knowledge cannot be integrated into the model as priors to help extract interesting correlations and identify biologically meaningful genetic and phenotypic markers. To bridge this gap, we introduce preference matrix guided SCCA (PM-SCCA), which not only incorporates priors encoded as a preference matrix but also maintains computational simplicity. A simulation study and a real-data experiment were conducted to investigate the effectiveness of the model. Both experiments demonstrate that the proposed PM-SCCA model captures not only the genotype-phenotype correlation but also relevant features effectively.
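For reference, one common penalized formulation of the plain SCCA model referenced above (the exact constraint form used by PM-SCCA may differ and is described in the paper) is

```latex
\max_{u,\,v} \; u^{\top} X^{\top} Y v
\quad \text{s.t.} \quad
\|u\|_2^2 \le 1,\; \|v\|_2^2 \le 1,\;
\|u\|_1 \le c_1,\; \|v\|_1 \le c_2,
```

where X (e.g., genetic markers) and Y (e.g., imaging phenotypes) are column-standardized data matrices, u and v are the sparse canonical weight vectors, and c1 and c2 control the degree of sparsity.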
Subject(s)
Alzheimer Disease , Neuroimaging , Humans , Neuroimaging/methods , Canonical Correlation Analysis , Algorithms , Alzheimer Disease/diagnostic imaging , Alzheimer Disease/genetics , Brain , Magnetic Resonance Imaging
ABSTRACT
Assumptions are made about the genetic model of single nucleotide polymorphisms (SNPs) when choosing a traditional genetic encoding: additive, dominant, and recessive. Furthermore, SNPs across the genome are unlikely to demonstrate identical genetic models. However, running SNP-SNP interaction analyses with every combination of encodings raises the multiple testing burden. Here, we present a novel and flexible encoding for genetic interactions, the elastic data-driven genetic encoding (EDGE), in which SNPs are assigned a heterozygous value based on the genetic model they demonstrate in a dataset prior to interaction testing. We assessed the power of EDGE to detect genetic interactions using 29 combinations of simulated genetic models and found it outperformed the traditional encoding methods across 10%, 30%, and 50% minor allele frequencies (MAFs). Further, EDGE maintained a low false-positive rate, while additive and dominant encodings demonstrated inflation. We evaluated EDGE and the traditional encodings with genetic data from the Electronic Medical Records and Genomics (eMERGE) Network for five phenotypes: age-related macular degeneration (AMD), age-related cataract, glaucoma, type 2 diabetes (T2D), and resistant hypertension. A multi-encoding genome-wide association study (GWAS) for each phenotype was performed using the traditional encodings, and the top results of the multi-encoding GWAS were considered for SNP-SNP interaction using the traditional encodings and EDGE. EDGE identified a novel SNP-SNP interaction for age-related cataract that no other method identified: rs7787286 (MAF: 0.041; intergenic region of chromosome 7)-rs4695885 (MAF: 0.34; intergenic region of chromosome 4) with a Bonferroni LRT p of 0.018. A SNP-SNP interaction was found in data from the UK Biobank within 25 kb of these SNPs using the recessive encoding: rs60374751 (MAF: 0.030) and rs6843594 (MAF: 0.34) (Bonferroni LRT p: 0.026). We recommend using EDGE to flexibly detect interactions between SNPs exhibiting diverse action.
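To make the encoding contrast concrete, the sketch below shows the three traditional genotype encodings alongside a data-driven heterozygote value alpha, which captures the EDGE idea in spirit; the alpha value and genotypes are hypothetical, and the actual EDGE estimation procedure is described in the paper.

```python
import numpy as np

# Genotypes coded as minor-allele counts: 0 (hom. ref), 1 (het), 2 (hom. alt).
genotypes = np.array([0, 1, 2, 1, 0, 2])

def encode(g, model="additive", alpha=0.5):
    """Recode the heterozygote according to an assumed genetic model.
    Homozygotes map to 0 and 1; the heterozygote maps to:
      additive -> 0.5, dominant -> 1, recessive -> 0,
      edge     -> alpha, a value estimated from the data (hypothetical here).
    """
    het_value = {"additive": 0.5, "dominant": 1.0,
                 "recessive": 0.0, "edge": alpha}[model]
    return np.select([g == 0, g == 1, g == 2], [0.0, het_value, 1.0])

print(encode(genotypes, "additive"))
print(encode(genotypes, "edge", alpha=0.73))  # data-driven heterozygote weight
```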
Subject(s)
Models, Genetic , Cataract/genetics , Datasets as Topic , Diabetes Mellitus, Type 2/genetics , Gene Frequency , Genome-Wide Association Study , Glaucoma/genetics , Humans , Hypertension/genetics , Macular Degeneration/genetics , Phenotype , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture relationships central to the disease's etiology and response to drugs. OBJECTIVE: We designed the Alzheimer's Knowledge Base (AlzKB) to address this need by providing a comprehensive knowledge representation of AD etiology and candidate therapeutics. METHODS: We designed the AlzKB as a large, heterogeneous graph knowledge base assembled using 22 diverse external data sources describing biological and pharmaceutical entities at different levels of organization (eg, chemicals, genes, anatomy, and diseases). AlzKB uses a Web Ontology Language 2 ontology to enforce semantic consistency and allow for ontological inference. We provide a public version of AlzKB and allow users to run and modify local versions of the knowledge base. RESULTS: AlzKB is freely available on the web and currently contains 118,902 entities with 1,309,527 relationships between those entities. To demonstrate its value, we used graph data science and machine learning to (1) propose new therapeutic targets based on similarities of AD to Parkinson disease and (2) repurpose existing drugs that may treat AD. For each use case, AlzKB recovers known therapeutic associations while proposing biologically plausible new ones. CONCLUSIONS: AlzKB is a new, publicly available knowledge resource that enables researchers to discover complex translational associations for AD drug discovery. Through 2 use cases, we show that it is a valuable tool for proposing novel therapeutic hypotheses based on public biomedical knowledge.
Subject(s)
Alzheimer Disease , Humans , Alzheimer Disease/drug therapy , Alzheimer Disease/genetics , Pattern Recognition, Automated , Knowledge Bases , Machine Learning , Knowledge
ABSTRACT
This perspective outlines the Artificial Intelligence and Technology Collaboratories (AITC) at Johns Hopkins University, University of Pennsylvania, and University of Massachusetts, highlighting their roles in developing AI-based technologies for older adult care, particularly targeting Alzheimer's disease (AD). These National Institute on Aging (NIA) centers foster collaboration among clinicians, gerontologists, ethicists, business professionals, and engineers to create AI solutions. Key activities include identifying technology needs, stakeholder engagement, training, mentoring, data integration, and navigating ethical challenges. The objective is to apply these innovations effectively in real-world scenarios, including in rural settings. In addition, the AITC focuses on developing best practices for AI application in the care of older adults, facilitating pilot studies, and addressing ethical concerns related to technology development for older adults with cognitive impairment, with the ultimate aim of improving the lives of older adults and their caregivers. HIGHLIGHTS: Addressing the complex needs of older adults with Alzheimer's disease (AD) requires a comprehensive approach, integrating medical and social support. Current gaps in training, techniques, tools, and expertise hinder uniform access across communities and health care settings. Artificial intelligence (AI) and digital technologies hold promise in transforming care for this demographic. Yet, transitioning these innovations from concept to marketable products presents significant challenges, often stalling promising advancements in the developmental phase. The Artificial Intelligence and Technology Collaboratories (AITC) program, funded by the National Institute on Aging (NIA), presents a viable model. These Collaboratories foster the development and implementation of AI methods and technologies through projects aimed at improving care for older Americans, particularly those with AD, and promote the sharing of best practices in AI and technology integration. Why Does This Matter? The National Institute on Aging (NIA) Artificial Intelligence and Technology Collaboratories (AITC) program's mission is to accelerate the adoption of artificial intelligence (AI) and new technologies for the betterment of older adults, especially those with dementia. By bridging scientific and technological expertise, fostering clinical and industry partnerships, and enhancing the sharing of best practices, this program can significantly improve the health and quality of life for older adults with Alzheimer's disease (AD).
Subject(s)
Alzheimer Disease , Isothiocyanates , United States , Humans , Aged , Alzheimer Disease/therapy , Artificial Intelligence , Geroscience , Quality of Life , Technology
ABSTRACT
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments. Failure to account for genetic heterogeneity may lead to missed associations and incorrect inferences. Thus, it is critical to review the impact of genetic heterogeneity on the design and analysis of population level genetic studies, aspects that are often overlooked in the literature. In this review, we first contextualize our approach to genetic heterogeneity by proposing a high-level categorization of heterogeneity into "feature," "outcome," and "associative" heterogeneity, drawing on perspectives from epidemiology and machine learning to illustrate distinctions between them. We highlight the unique nature of genetic heterogeneity as a heterogeneous pattern of association that warrants specific methodological considerations. We then focus on the challenges that preclude effective detection and characterization of genetic heterogeneity across a variety of epidemiological contexts. Finally, we discuss systems heterogeneity as an integrated approach to using genetic and other high-dimensional multi-omic data in complex disease research.
Subject(s)
Genetic Heterogeneity , Precision Medicine , Humans , Precision Medicine/methods , Machine Learning , Phenotype
ABSTRACT
MOTIVATION: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. RESULTS: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. AVAILABILITY AND IMPLEMENTATION: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
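A typical use of the Python interface looks like the sketch below. The dataset name is shown for illustration; consult the repository for the current dataset list and API details.

```python
# Install with: pip install pmlb
from pmlb import fetch_data, classification_dataset_names

# List the available classification benchmarks, then load one as a DataFrame.
print(len(classification_dataset_names))
df = fetch_data("mushroom")                     # features plus a 'target' column
X, y = df.drop(columns="target"), df["target"]
print(X.shape, y.nunique())
```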
Subject(s)
Benchmarking , Software , Machine Learning , Models, Statistical
ABSTRACT
The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of AI-based prediction model tools and software in cardiovascular patient care, cardiovascular researchers and healthcare professionals are challenged to understand both the opportunities and the limitations of AI-based predictions. In this article, we present 12 critical questions for cardiovascular health professionals to ask when confronted with an AI-based prediction model. We aim to support medical professionals in distinguishing the AI-based prediction models that can add value to patient care from those that do not.
Subject(s)
Artificial Intelligence , Cardiovascular Diseases , Health Personnel , Humans , Software
ABSTRACT
The Translational Machine (TM) is a machine learning (ML)-based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population substructure on outcome prediction. The TM consists of three main components. First, replicable but flexible feature engineering procedures translate genome-scale data into biologically informative features that appropriately contextualize simple variant calls/genotypes within biological and functional contexts. Second, model-free, nonparametric ML-based feature filtering procedures empirically reduce dimensionality and noise of both original genotype calls and engineered features. Third, a powerful ML algorithm for feature selection is used to differentiate risk variant contributions across variant frequency and functional prediction spectra. The TM simultaneously evaluates potential contributions of variants operative under polygenic and heterogeneous models of genetic architecture. Our TM enables integration of biological information (e.g., genomic annotations) within conceptual frameworks akin to geneset-/pathways-based and collapsing methods, but overcomes some of these methods' limitations. The full TM pipeline is executed in R. Our approach and initial findings from its application to a whole-exome schizophrenia case-control data set are presented. These TM procedures extend the findings of the primary investigation and yield novel results.
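One way to picture the first TM component, translating variant calls into biologically contextualized features, is gene-level collapsing of rare variants, akin to the collapsing methods the abstract mentions. The sketch below is a conceptual Python illustration with hypothetical inputs and thresholds; the actual pipeline is implemented in R and is more elaborate.

```python
import pandas as pd

# Conceptual gene-level collapsing sketch: per-sample counts of qualifying
# rare variants are summed within each gene to form one engineered feature
# per gene. Genes, samples, dosages, and the MAF cutoff are hypothetical.
variants = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S2", "S3"],
    "gene":   ["BRCA2", "TP53", "BRCA2", "BRCA2", "TP53"],
    "dosage": [1, 1, 2, 1, 1],                 # minor-allele count per call
    "maf":    [0.002, 0.01, 0.002, 0.0005, 0.01],
})

rare = variants[variants["maf"] < 0.01]        # keep rare variants only
gene_burden = rare.pivot_table(index="sample", columns="gene",
                               values="dosage", aggfunc="sum", fill_value=0)
print(gene_burden)   # samples x genes burden matrix for downstream ML steps
```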
Subject(s)
Machine Learning , Models, Genetic , Algorithms , Genomics , Genotype , Humans
ABSTRACT
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning methods. Unfortunately, selecting the right machine learning algorithm and tuning its hyperparameters can be daunting for experts and non-experts alike. The goal of automated machine learning (AutoML) is to let a computer algorithm identify the right algorithms and hyperparameters thus taking the guesswork out of the optimization process. We review the promises and challenges of AutoML for the genetic analysis of complex traits and give an overview of several approaches and some example applications to omics data. It is our hope that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space. The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy.
Subject(s)
Machine Learning , Multifactorial Inheritance , Algorithms , Genomics/methods , Humans , Software
ABSTRACT
SUMMARY: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree's leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students' understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. AVAILABILITY AND IMPLEMENTATION: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.
Subject(s)
Machine Learning , Software , Decision Trees , Humans
ABSTRACT
MOTIVATION: Many researchers with domain expertise are unable to easily apply machine learning (ML) to their bioinformatics data due to a lack of ML and/or coding expertise. Methods that have been proposed thus far to automate ML mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based AI platform to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we conduct an experiment on 165 classification problems, comparing it with state-of-the-art automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients. RESULTS: We find that matrix factorization-based recommendation systems outperform metalearning methods for automating ML. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated ML methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent ML model (AUROC 0.85±0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort. AVAILABILITY AND IMPLEMENTATION: PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
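The matrix-factorization idea that the results favor can be pictured as completing a sparse datasets-by-algorithm-configurations performance matrix from low-rank factors. The toy example below uses plain numpy gradient descent and hypothetical scores; it is a sketch of the general technique, not PennAI's implementation.

```python
import numpy as np

# Toy matrix-factorization recommender: observed entries of a
# datasets x algorithm-configurations performance matrix are approximated
# by low-rank factors, and missing entries are predicted to rank which
# configuration to try next. All scores are hypothetical.
rng = np.random.default_rng(0)
R = np.array([[0.90, 0.70, np.nan],
              [0.60, np.nan, 0.80],
              [np.nan, 0.75, 0.85]])           # NaN = not yet evaluated
mask = ~np.isnan(R)

k, lr, reg = 2, 0.05, 0.01
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # dataset factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # configuration factors

for _ in range(2000):
    err = np.where(mask, R - U @ V.T, 0.0)        # error on observed entries only
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 2))   # completed matrix guides the next experiment
```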
Subject(s)
Algorithms , Machine Learning , Humans , Informatics
ABSTRACT
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause these disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations based solely on risk allele frequencies. We applied a population-specific PRS (psPRS) to 26 populations from the 1000 Genomes Project for four phenotypes: lactase persistence (LP), melanoma, multiple sclerosis (MS), and height. Our models assumed additive genetic architecture among the polymorphisms in the psPRSs, as is convention. Linear psPRSs explained a significant proportion of trait variance, ranging from 0.32 for height in men to 0.88 for melanoma. The best models for LP and height were linear, while those for melanoma and MS were nonlinear. Because not all variants in a PRS may confer similar, or even any, risk among diverse populations, we also filtered out SNPs to assess whether the variance explained improved using psPRSs with fewer SNPs. Variance explained usually improved with fewer SNPs in the psPRS and was as high as 0.99 for height in men using only 548 of the initial 4,208 SNPs. That reducing SNPs improves psPRS performance may indicate that missing heritability is partially due to complex architecture that does not mandate additivity, to undiscovered variants, or to spurious associations in the databases. We demonstrated that PRS-based analyses can be used across diverse populations and phenotypes for population prediction and that these comparisons can identify universal risk variants.
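The linear population-level score described above amounts to summing, over risk alleles, the product of each allele's frequency in a population and its effect weight. A minimal sketch with hypothetical frequencies and weights follows; it illustrates the form of the score only, not the study's fitted models.

```python
import numpy as np

# Sketch of a linear population-specific PRS (psPRS): instead of individual
# genotypes, each population contributes its risk-allele frequencies, so the
# score is an expected PRS per population. Frequencies, weights, and
# population labels are hypothetical.
betas = np.array([0.20, -0.10, 0.35, 0.05])           # per-SNP effect weights
allele_freqs = {                                       # risk-allele frequencies
    "POP1": np.array([0.10, 0.45, 0.30, 0.60]),
    "POP2": np.array([0.25, 0.40, 0.05, 0.55]),
}
ps_prs = {pop: float(2 * freqs @ betas)                # diploid expectation
          for pop, freqs in allele_freqs.items()}
print(ps_prs)   # population-level scores compared against trait prevalence
```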
Subject(s)
Multifactorial Inheritance , Polymorphism, Single Nucleotide , Genome-Wide Association Study , Humans , Multifactorial Inheritance/genetics , Phenotype , Polymorphism, Single Nucleotide/genetics , Prevalence , Risk Factors
ABSTRACT
ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI's graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previous technologies and data resources. These use-cases each demonstrate a tool for information retrieval from the knowledge base being used to solve a specific task: the "shortest path" module is used to identify mechanistic links between perfluorooctanoic acid (PFOA) exposure and nonalcoholic fatty liver disease; the "expand network" module identifies communities that are linked to dioxin toxicity; and the quantitative structure-activity relationship (QSAR) dataset generator predicts pregnane X receptor agonism in a set of 4,021 pesticide ingredients. The contents of ComptoxAI's source data are rigorously aggregated from a diverse array of public third-party databases, and ComptoxAI is designed as a free, public, and open-source toolkit that enables diverse classes of users, including biomedical researchers, public health and regulatory officials, and the general public, to predict the toxicity of unknown compounds and their modes of action.
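As an illustration of the kind of query the "shortest path" module answers, the sketch below runs a shortest-path search with networkx on a tiny hand-made graph. The node names and edges are hypothetical stand-ins; ComptoxAI runs the analogous query against its full graph knowledge base.

```python
import networkx as nx

# Toy shortest-path query over a heterogeneous chemical-gene-pathway-disease
# graph. Nodes and edges are hypothetical illustrations only.
G = nx.Graph()
G.add_edges_from([
    ("PFOA", "GENE_PPARA"),             # chemical-gene interaction
    ("GENE_PPARA", "PATH_LipidMetab"),  # gene-pathway membership
    ("PATH_LipidMetab", "NAFLD"),       # pathway-disease association
    ("PFOA", "GENE_OTHER"),
])
print(nx.shortest_path(G, source="PFOA", target="NAFLD"))
# ['PFOA', 'GENE_PPARA', 'PATH_LipidMetab', 'NAFLD']
```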