ABSTRACT
Computational data-centric research techniques play a prevalent and multi-disciplinary role in life science research. In the past, scientists in wet labs generated the data and computational researchers focused on creating tools for analyzing those data. Leveraging the increased availability of public data, computational researchers are now becoming more independent and taking leadership roles within biomedical projects. We can now generate vast amounts of data, and the challenge has shifted from data generation to data analysis. Here we discuss the pitfalls, challenges, and opportunities facing the field of data-centric research in biology. We examine the evolving perception of computational data-driven research and its rise as an independent domain in biomedical research, while also addressing the significant collaborative opportunities that arise from integrating computational research with experimental and translational biology. Finally, we discuss the future of data-centric research and its applications across various areas of the biomedical field.
Subject(s)
Biomedical Research , Computational Biology , Computational Biology/methods , Humans
ABSTRACT
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWASs). Standard GWASs are well powered to interrogate additive models; however, new approaches are required for investigating other modes of inheritance such as dominance and epistasis. Epistasis, or non-additive interaction between genes, exists across the genome but often goes undetected because of a lack of statistical power. Furthermore, the customary adoption of LD pruning in standard GWASs excludes detection of sites that are in LD but might underlie the genetic architecture of complex traits. We hypothesize that uncovering long-range interactions between loci with strong LD due to epistatic selection can elucidate genetic mechanisms underlying common diseases. To investigate this hypothesis, we tested for associations between 23 common diseases and 5,625,845 epistatic SNP-SNP pairs (determined by Ohta's D statistics) in long-range LD (>0.25 cM). Across five disease phenotypes, we identified one significant and four near-significant associations that replicated in two large genotype-phenotype datasets (UK Biobank and eMERGE). The genes most likely involved in the replicated associations were (1) members of highly conserved gene families with complex roles in multiple pathways, (2) essential genes, and/or (3) genes associated in the literature with complex traits that display variable expressivity. These results support the highly pleiotropic and conserved nature of variants in long-range LD under epistatic selection. Our work supports the hypothesis that epistatic interactions regulate diverse clinical mechanisms and might especially be driving factors in conditions with a wide range of phenotypic outcomes.
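The pairwise statistics used in this kind of analysis build on the basic two-locus LD coefficient. Below is a minimal sketch of that coefficient and its normalized form D', assuming phased haplotype counts; it illustrates the building block only, not the full Ohta's D decomposition used in the study.

```python
# Minimal sketch: two-locus LD coefficient D and normalized D'.
# Assumes phased haplotype counts; not the Ohta's D decomposition itself.
def ld_coefficient(hap_counts):
    """hap_counts: dict with counts of haplotypes 'AB', 'Ab', 'aB', 'ab'."""
    n = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / n                        # A-B haplotype frequency
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / n    # allele A frequency
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / n    # allele B frequency
    d = p_ab - p_a * p_b                               # raw LD coefficient
    # Normalize by the maximum |D| attainable at these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, (d / d_max if d_max > 0 else 0.0)

# Example: two loci in strong positive LD.
print(ld_coefficient({"AB": 45, "Ab": 5, "aB": 5, "ab": 45}))  # D = 0.2, D' = 0.8
```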
Subject(s)
Epistasis, Genetic , Genome-Wide Association Study , Linkage Disequilibrium/genetics , Genotype , Biological Specimen Banks , United Kingdom , Polymorphism, Single Nucleotide/genetics
ABSTRACT
Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction.
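As a concrete illustration of the polygenic model described above, a standard polygenic risk score is a weighted sum of risk-allele dosages. The minimal sketch below uses hypothetical genotype dosages and GWAS effect sizes; real pipelines add quality control, LD clumping or thresholding, and ancestry adjustment.

```python
import numpy as np

# Minimal polygenic risk score sketch: weighted sum of risk-allele dosages
# (0, 1, or 2 copies per SNP) using GWAS effect sizes as weights.
# Data and variable names are hypothetical.
def polygenic_risk_score(dosages, betas):
    """dosages: (n_individuals, n_snps) array; betas: (n_snps,) effect sizes."""
    return dosages @ betas

rng = np.random.default_rng(0)
dosages = rng.integers(0, 3, size=(5, 100))    # 5 individuals, 100 SNPs
betas = rng.normal(0, 0.05, size=100)          # per-SNP effect weights
print(polygenic_risk_score(dosages, betas))    # one score per individual
```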
Subject(s)
Electronic Health Records , Genetic Association Studies , Genetic Predisposition to Disease , Multifactorial Inheritance , Algorithms , Computational Biology/methods , Genome-Wide Association Study , Genomics/methods , Genotype , Humans , Phenotype , Reproducibility of Results , Risk Assessment , Risk Factors
ABSTRACT
MOTIVATION: Answering and solving complex problems with a large language model (LLM) in a specialized domain such as biomedicine is a challenging task that requires both factual consistency and logic, and LLMs often suffer from major limitations such as hallucinating false or irrelevant information or being influenced by noisy data. These issues can compromise the trustworthiness, accuracy, and compliance of LLM-generated text and insights. RESULTS: The Knowledge Retrieval Augmented Generation ENgine (KRAGEN) is a new tool that combines knowledge graphs, Retrieval Augmented Generation (RAG), and advanced prompting techniques to solve complex problems with natural language. KRAGEN converts knowledge graphs into a vector database and uses RAG to retrieve relevant facts from it. KRAGEN applies an advanced prompting technique, graph-of-thoughts (GoT), to dynamically break a complex problem into smaller subproblems; it solves each subproblem using the relevant knowledge retrieved through the RAG framework, which limits hallucination, and finally consolidates the subproblem answers into a solution. KRAGEN's graph visualization allows the user to interact with and evaluate the quality of the solution's GoT structure and logic. AVAILABILITY AND IMPLEMENTATION: KRAGEN is deployed by running its custom Docker containers. KRAGEN is available as open-source from GitHub at: https://github.com/EpistasisLab/KRAGEN.
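The retrieval step described above, vectorizing knowledge-graph facts and fetching the most relevant ones for a subproblem, can be sketched conceptually as follows. This is an illustration with a placeholder embedding function and toy triples, not KRAGEN's actual implementation.

```python
import numpy as np

# Conceptual RAG retrieval sketch: knowledge-graph triples are embedded once,
# then the triples most similar to a query are returned as LLM context.
def embed(text: str) -> np.ndarray:
    # Placeholder only; a real system would call a text-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

triples = [
    "GeneA - upregulates - GeneB",
    "DrugX - inhibits - GeneA",
    "GeneB - associated_with - DiseaseY",
]
index = np.stack([embed(t) for t in triples])   # the "vector database"

def retrieve(query: str, k: int = 2):
    q = embed(query)
    scores = index @ q                          # cosine similarity (unit vectors)
    return [triples[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("Which drugs affect DiseaseY?"))
```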
Subject(s)
Software , Natural Language Processing , Problem Solving , Algorithms , Information Storage and Retrieval/methods , Humans , Computational Biology/methods , Databases, Factual
ABSTRACT
Natural language processing techniques are having an increasing impact on clinical care from the patient, clinician, administrator, and research perspectives. Applications include automated generation of clinical notes and discharge letters, medical term coding for billing, medical chatbots for both patients and clinicians, data enrichment for the identification of disease symptoms or diagnoses, cohort selection for clinical trials, and auditing. This review presents an overview of the history of natural language processing techniques together with a brief technical background. It then discusses implementation strategies for natural language processing tools, focusing specifically on large language models, and concludes with future opportunities for applying such techniques in the field of cardiology.
Subject(s)
Artificial Intelligence , Cardiology , Humans , Natural Language Processing , Patient Discharge
ABSTRACT
MOTIVATION: Biomedical and healthcare domains generate vast amounts of complex data that can be challenging to analyze using machine learning tools, especially for researchers without computer science training. RESULTS: Aliro is an open-source software package designed to automate machine learning analysis through a clean web interface. By incorporating large language models, Aliro lets users interact with their data conversationally, seamlessly retrieving and executing code generated by the model and accelerating the automated discovery of new insights from data. Aliro includes a pre-trained machine learning recommendation system that assists the user in automating the selection of machine learning algorithms and their hyperparameters, and it provides visualizations of the evaluated models and data. AVAILABILITY AND IMPLEMENTATION: Aliro is deployed by running its custom Docker containers. Aliro is available as open-source from GitHub at: https://github.com/EpistasisLab/Aliro.
Subject(s)
Algorithms , Software , Machine Learning , Language
ABSTRACT
Information is the cornerstone of research, from experimental (meta)data and computational processes to complex inventories of reagents and equipment. These 10 simple rules discuss best practices for leveraging laboratory information management systems to transform this large information load into useful scientific findings.
ABSTRACT
Investigating the relationship between genetic variation and phenotypic traits is a key issue in quantitative genetics. For Alzheimer's disease specifically, the associations between genetic markers and quantitative traits remain poorly characterized; once identified, they will provide valuable guidance for the study and development of genetics-based treatment approaches. Currently, to analyze the association between two modalities, sparse canonical correlation analysis (SCCA) is commonly used to compute one sparse linear combination of the variable features for each modality, giving a pair of linear combination vectors that maximize the cross-correlation between the analyzed modalities. One drawback of the plain SCCA model is that existing findings and knowledge cannot be integrated into the model as priors to help extract interesting correlations and identify biologically meaningful genetic and phenotypic markers. To bridge this gap, we introduce preference matrix guided SCCA (PM-SCCA), which not only incorporates priors encoded as a preference matrix but also maintains computational simplicity. A simulation study and a real-data experiment were conducted to investigate the effectiveness of the model. Both experiments demonstrate that the proposed PM-SCCA model captures not only the genotype-phenotype correlation but also relevant features effectively.
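For reference, one common penalized formulation of the plain SCCA model referenced above (the exact constraint form used by PM-SCCA may differ and is described in the paper) is

```latex
\max_{u,\,v} \; u^{\top} X^{\top} Y v
\quad \text{s.t.} \quad
\|u\|_2^2 \le 1,\; \|v\|_2^2 \le 1,\;
\|u\|_1 \le c_1,\; \|v\|_1 \le c_2,
```

where X (e.g., genetic markers) and Y (e.g., imaging phenotypes) are column-standardized data matrices, u and v are the sparse canonical weight vectors, and c1 and c2 control the degree of sparsity.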
Subject(s)
Alzheimer Disease , Neuroimaging , Humans , Neuroimaging/methods , Canonical Correlation Analysis , Algorithms , Alzheimer Disease/diagnostic imaging , Alzheimer Disease/genetics , Brain , Magnetic Resonance Imaging
ABSTRACT
Assumptions are made about the genetic model of single nucleotide polymorphisms (SNPs) when choosing a traditional genetic encoding: additive, dominant, and recessive. Furthermore, SNPs across the genome are unlikely to demonstrate identical genetic models. However, running SNP-SNP interaction analyses with every combination of encodings raises the multiple testing burden. Here, we present a novel and flexible encoding for genetic interactions, the elastic data-driven genetic encoding (EDGE), in which SNPs are assigned a heterozygous value based on the genetic model they demonstrate in a dataset prior to interaction testing. We assessed the power of EDGE to detect genetic interactions using 29 combinations of simulated genetic models and found it outperformed the traditional encoding methods across 10%, 30%, and 50% minor allele frequencies (MAFs). Further, EDGE maintained a low false-positive rate, while additive and dominant encodings demonstrated inflation. We evaluated EDGE and the traditional encodings with genetic data from the Electronic Medical Records and Genomics (eMERGE) Network for five phenotypes: age-related macular degeneration (AMD), age-related cataract, glaucoma, type 2 diabetes (T2D), and resistant hypertension. A multi-encoding genome-wide association study (GWAS) for each phenotype was performed using the traditional encodings, and the top results of the multi-encoding GWAS were considered for SNP-SNP interaction using the traditional encodings and EDGE. EDGE identified a novel SNP-SNP interaction for age-related cataract that no other method identified: rs7787286 (MAF: 0.041; intergenic region of chromosome 7)-rs4695885 (MAF: 0.34; intergenic region of chromosome 4) with a Bonferroni LRT p of 0.018. A SNP-SNP interaction was found in data from the UK Biobank within 25 kb of these SNPs using the recessive encoding: rs60374751 (MAF: 0.030) and rs6843594 (MAF: 0.34) (Bonferroni LRT p: 0.026). We recommend using EDGE to flexibly detect interactions between SNPs exhibiting diverse action.
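To make the encoding contrast concrete, the sketch below shows the three traditional genotype encodings alongside a data-driven heterozygote value alpha, which captures the EDGE idea in spirit; the alpha value and genotypes are hypothetical, and the actual EDGE estimation procedure is described in the paper.

```python
import numpy as np

# Genotypes coded as minor-allele counts: 0 (hom. ref), 1 (het), 2 (hom. alt).
genotypes = np.array([0, 1, 2, 1, 0, 2])

def encode(g, model="additive", alpha=0.5):
    """Recode the heterozygote according to an assumed genetic model.
    Homozygotes map to 0 and 1; the heterozygote maps to:
      additive -> 0.5, dominant -> 1, recessive -> 0,
      edge     -> alpha, a value estimated from the data (hypothetical here).
    """
    het_value = {"additive": 0.5, "dominant": 1.0,
                 "recessive": 0.0, "edge": alpha}[model]
    return np.select([g == 0, g == 1, g == 2], [0.0, het_value, 1.0])

print(encode(genotypes, "additive"))
print(encode(genotypes, "edge", alpha=0.73))  # data-driven heterozygote weight
```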
Subject(s)
Models, Genetic , Cataract/genetics , Datasets as Topic , Diabetes Mellitus, Type 2/genetics , Gene Frequency , Genome-Wide Association Study , Glaucoma/genetics , Humans , Hypertension/genetics , Macular Degeneration/genetics , Phenotype , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture relationships central to the disease's etiology and response to drugs. OBJECTIVE: We designed the Alzheimer's Knowledge Base (AlzKB) to address this need by providing a comprehensive knowledge representation of AD etiology and candidate therapeutics. METHODS: We designed the AlzKB as a large, heterogeneous graph knowledge base assembled using 22 diverse external data sources describing biological and pharmaceutical entities at different levels of organization (eg, chemicals, genes, anatomy, and diseases). AlzKB uses a Web Ontology Language 2 ontology to enforce semantic consistency and allow for ontological inference. We provide a public version of AlzKB and allow users to run and modify local versions of the knowledge base. RESULTS: AlzKB is freely available on the web and currently contains 118,902 entities with 1,309,527 relationships between those entities. To demonstrate its value, we used graph data science and machine learning to (1) propose new therapeutic targets based on similarities of AD to Parkinson disease and (2) repurpose existing drugs that may treat AD. For each use case, AlzKB recovers known therapeutic associations while proposing biologically plausible new ones. CONCLUSIONS: AlzKB is a new, publicly available knowledge resource that enables researchers to discover complex translational associations for AD drug discovery. Through 2 use cases, we show that it is a valuable tool for proposing novel therapeutic hypotheses based on public biomedical knowledge.
Subject(s)
Alzheimer Disease , Humans , Alzheimer Disease/drug therapy , Alzheimer Disease/genetics , Pattern Recognition, Automated , Knowledge Bases , Machine Learning , Knowledge
ABSTRACT
This perspective outlines the Artificial Intelligence and Technology Collaboratories (AITC) at Johns Hopkins University, University of Pennsylvania, and University of Massachusetts, highlighting their roles in developing AI-based technologies for older adult care, particularly targeting Alzheimer's disease (AD). These National Institute on Aging (NIA) centers foster collaboration among clinicians, gerontologists, ethicists, business professionals, and engineers to create AI solutions. Key activities include identifying technology needs, stakeholder engagement, training, mentoring, data integration, and navigating ethical challenges. The objective is to apply these innovations effectively in real-world scenarios, including in rural settings. In addition, the AITC focuses on developing best practices for AI application in the care of older adults, facilitating pilot studies, and addressing ethical concerns related to technology development for older adults with cognitive impairment, with the ultimate aim of improving the lives of older adults and their caregivers. HIGHLIGHTS: Addressing the complex needs of older adults with Alzheimer's disease (AD) requires a comprehensive approach, integrating medical and social support. Current gaps in training, techniques, tools, and expertise hinder uniform access across communities and health care settings. Artificial intelligence (AI) and digital technologies hold promise in transforming care for this demographic. Yet, transitioning these innovations from concept to marketable products presents significant challenges, often stalling promising advancements in the developmental phase. The Artificial Intelligence and Technology Collaboratories (AITC) program, funded by the National Institute on Aging (NIA), presents a viable model. These Collaboratories foster the development and implementation of AI methods and technologies through projects aimed at improving care for older Americans, particularly those with AD, and promote the sharing of best practices in AI and technology integration. Why Does This Matter? The National Institute on Aging (NIA) Artificial Intelligence and Technology Collaboratories (AITC) program's mission is to accelerate the adoption of artificial intelligence (AI) and new technologies for the betterment of older adults, especially those with dementia. By bridging scientific and technological expertise, fostering clinical and industry partnerships, and enhancing the sharing of best practices, this program can significantly improve the health and quality of life for older adults with Alzheimer's disease (AD).
Subject(s)
Alzheimer Disease , Isothiocyanates , United States , Humans , Aged , Alzheimer Disease/therapy , Artificial Intelligence , Geroscience , Quality of Life , Technology
ABSTRACT
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments. Failure to account for genetic heterogeneity may lead to missed associations and incorrect inferences. Thus, it is critical to review the impact of genetic heterogeneity on the design and analysis of population level genetic studies, aspects that are often overlooked in the literature. In this review, we first contextualize our approach to genetic heterogeneity by proposing a high-level categorization of heterogeneity into "feature," "outcome," and "associative" heterogeneity, drawing on perspectives from epidemiology and machine learning to illustrate distinctions between them. We highlight the unique nature of genetic heterogeneity as a heterogeneous pattern of association that warrants specific methodological considerations. We then focus on the challenges that preclude effective detection and characterization of genetic heterogeneity across a variety of epidemiological contexts. Finally, we discuss systems heterogeneity as an integrated approach to using genetic and other high-dimensional multi-omic data in complex disease research.
Subject(s)
Genetic Heterogeneity , Precision Medicine , Humans , Precision Medicine/methods , Machine Learning , Phenotype
ABSTRACT
MOTIVATION: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. RESULTS: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. AVAILABILITY AND IMPLEMENTATION: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
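A typical use of the Python interface looks like the sketch below. The dataset name is shown for illustration; consult the repository for the current dataset list and API details.

```python
# Install with: pip install pmlb
from pmlb import fetch_data, classification_dataset_names

# List the available classification benchmarks, then load one as a DataFrame.
print(len(classification_dataset_names))
df = fetch_data("mushroom")                     # features plus a 'target' column
X, y = df.drop(columns="target"), df["target"]
print(X.shape, y.nunique())
```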
Subject(s)
Benchmarking , Software , Machine Learning , Models, Statistical
ABSTRACT
The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of AI-based prediction model tools and software in cardiovascular patient care, cardiovascular researchers and healthcare professionals are challenged to understand both the opportunities and the limitations of AI-based predictions. In this article, we present 12 critical questions for cardiovascular health professionals to ask when confronted with an AI-based prediction model. We aim to support medical professionals in distinguishing the AI-based prediction models that can add value to patient care from those that do not.
Subject(s)
Artificial Intelligence , Cardiovascular Diseases , Health Personnel , Humans , Software
ABSTRACT
The Translational Machine (TM) is a machine learning (ML)-based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population substructure on outcome prediction. The TM consists of three main components. First, replicable but flexible feature engineering procedures translate genome-scale data into biologically informative features that appropriately contextualize simple variant calls/genotypes within biological and functional contexts. Second, model-free, nonparametric ML-based feature filtering procedures empirically reduce dimensionality and noise of both original genotype calls and engineered features. Third, a powerful ML algorithm for feature selection is used to differentiate risk variant contributions across variant frequency and functional prediction spectra. The TM simultaneously evaluates potential contributions of variants operative under polygenic and heterogeneous models of genetic architecture. Our TM enables integration of biological information (e.g., genomic annotations) within conceptual frameworks akin to geneset-/pathways-based and collapsing methods, but overcomes some of these methods' limitations. The full TM pipeline is executed in R. Our approach and initial findings from its application to a whole-exome schizophrenia case-control data set are presented. These TM procedures extend the findings of the primary investigation and yield novel results.
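One way to picture the first TM component, translating variant calls into biologically contextualized features, is gene-level collapsing of rare variants, akin to the collapsing methods the abstract mentions. The sketch below is a conceptual Python illustration with hypothetical inputs and thresholds; the actual pipeline is implemented in R and is more elaborate.

```python
import pandas as pd

# Conceptual gene-level collapsing sketch: per-sample counts of qualifying
# rare variants are summed within each gene to form one engineered feature
# per gene. Genes, samples, dosages, and the MAF cutoff are hypothetical.
variants = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S2", "S3"],
    "gene":   ["BRCA2", "TP53", "BRCA2", "BRCA2", "TP53"],
    "dosage": [1, 1, 2, 1, 1],                 # minor-allele count per call
    "maf":    [0.002, 0.01, 0.002, 0.0005, 0.01],
})

rare = variants[variants["maf"] < 0.01]        # keep rare variants only
gene_burden = rare.pivot_table(index="sample", columns="gene",
                               values="dosage", aggfunc="sum", fill_value=0)
print(gene_burden)   # samples x genes burden matrix for downstream ML steps
```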
Subject(s)
Machine Learning , Models, Genetic , Algorithms , Genomics , Genotype , Humans
ABSTRACT
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning methods. Unfortunately, selecting the right machine learning algorithm and tuning its hyperparameters can be daunting for experts and non-experts alike. The goal of automated machine learning (AutoML) is to let a computer algorithm identify the right algorithms and hyperparameters thus taking the guesswork out of the optimization process. We review the promises and challenges of AutoML for the genetic analysis of complex traits and give an overview of several approaches and some example applications to omics data. It is our hope that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space. The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy.
Subject(s)
Machine Learning , Multifactorial Inheritance , Algorithms , Genomics/methods , Humans , Software
ABSTRACT
SUMMARY: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree's leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students' understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. AVAILABILITY AND IMPLEMENTATION: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.
Subject(s)
Machine Learning , Software , Decision Trees , Humans
ABSTRACT
MOTIVATION: Many researchers with domain expertise are unable to easily apply machine learning (ML) to their bioinformatics data due to a lack of ML and/or coding expertise. Methods that have been proposed thus far to automate ML mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based AI platform to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we conduct an experiment on 165 classification problems, comparing it with state-of-the-art automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients. RESULTS: We find that matrix factorization-based recommendation systems outperform metalearning methods for automating ML. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated ML methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent ML model (AUROC 0.85±0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort. AVAILABILITY AND IMPLEMENTATION: PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
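The matrix-factorization idea that the results favor can be pictured as completing a sparse datasets-by-algorithm-configurations performance matrix from low-rank factors. The toy example below uses plain numpy gradient descent and hypothetical scores; it is a sketch of the general technique, not PennAI's implementation.

```python
import numpy as np

# Toy matrix-factorization recommender: observed entries of a
# datasets x algorithm-configurations performance matrix are approximated
# by low-rank factors, and missing entries are predicted to rank which
# configuration to try next. All scores are hypothetical.
rng = np.random.default_rng(0)
R = np.array([[0.90, 0.70, np.nan],
              [0.60, np.nan, 0.80],
              [np.nan, 0.75, 0.85]])           # NaN = not yet evaluated
mask = ~np.isnan(R)

k, lr, reg = 2, 0.05, 0.01
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # dataset factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # configuration factors

for _ in range(2000):
    err = np.where(mask, R - U @ V.T, 0.0)        # error on observed entries only
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 2))   # completed matrix guides the next experiment
```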
Subject(s)
Algorithms , Machine Learning , Humans , Informatics
ABSTRACT
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause these disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations based solely on risk allele frequencies. We applied a population-specific PRS (psPRS) to 26 populations from the 1000 Genomes Project for four phenotypes: lactase persistence (LP), melanoma, multiple sclerosis (MS), and height. Our models assumed additive genetic architecture among the polymorphisms in the psPRSs, as is convention. Linear psPRSs explained a significant proportion of trait variance, ranging from 0.32 for height in men to 0.88 for melanoma. The best models for LP and height were linear, while those for melanoma and MS were nonlinear. Because not all variants in a PRS may confer similar, or even any, risk among diverse populations, we also filtered out SNPs to assess whether the variance explained improved using psPRSs with fewer SNPs. Variance explained usually improved with fewer SNPs in the psPRS and was as high as 0.99 for height in men using only 548 of the initial 4,208 SNPs. That reducing SNPs improves psPRS performance may indicate that missing heritability is partially due to complex architecture that does not mandate additivity, to undiscovered variants, or to spurious associations in the databases. We demonstrated that PRS-based analyses can be used across diverse populations and phenotypes for population prediction and that these comparisons can identify universal risk variants.
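The linear population-level score described above amounts to summing, over risk alleles, the product of each allele's frequency in a population and its effect weight. A minimal sketch with hypothetical frequencies and weights follows; it illustrates the form of the score only, not the study's fitted models.

```python
import numpy as np

# Sketch of a linear population-specific PRS (psPRS): instead of individual
# genotypes, each population contributes its risk-allele frequencies, so the
# score is an expected PRS per population. Frequencies, weights, and
# population labels are hypothetical.
betas = np.array([0.20, -0.10, 0.35, 0.05])           # per-SNP effect weights
allele_freqs = {                                       # risk-allele frequencies
    "POP1": np.array([0.10, 0.45, 0.30, 0.60]),
    "POP2": np.array([0.25, 0.40, 0.05, 0.55]),
}
ps_prs = {pop: float(2 * freqs @ betas)                # diploid expectation
          for pop, freqs in allele_freqs.items()}
print(ps_prs)   # population-level scores compared against trait prevalence
```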
Subject(s)
Multifactorial Inheritance , Polymorphism, Single Nucleotide , Genome-Wide Association Study , Humans , Multifactorial Inheritance/genetics , Phenotype , Polymorphism, Single Nucleotide/genetics , Prevalence , Risk Factors
ABSTRACT
ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI's graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previous technologies and data resources. These use-cases each demonstrate a tool for information retrieval from the knowledge base being used to solve a specific task: the "shortest path" module is used to identify mechanistic links between perfluorooctanoic acid (PFOA) exposure and nonalcoholic fatty liver disease; the "expand network" module identifies communities that are linked to dioxin toxicity; and the quantitative structure-activity relationship (QSAR) dataset generator predicts pregnane X receptor agonism in a set of 4,021 pesticide ingredients. The contents of ComptoxAI's source data are rigorously aggregated from a diverse array of public third-party databases, and ComptoxAI is designed as a free, public, and open-source toolkit that enables diverse classes of users, including biomedical researchers, public health and regulatory officials, and the general public, to predict the toxicity of unknown compounds and their modes of action.
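As an illustration of the kind of query the "shortest path" module answers, the sketch below runs a shortest-path search with networkx on a tiny hand-made graph. The node names and edges are hypothetical stand-ins; ComptoxAI runs the analogous query against its full graph knowledge base.

```python
import networkx as nx

# Toy shortest-path query over a heterogeneous chemical-gene-pathway-disease
# graph. Nodes and edges are hypothetical illustrations only.
G = nx.Graph()
G.add_edges_from([
    ("PFOA", "GENE_PPARA"),             # chemical-gene interaction
    ("GENE_PPARA", "PATH_LipidMetab"),  # gene-pathway membership
    ("PATH_LipidMetab", "NAFLD"),       # pathway-disease association
    ("PFOA", "GENE_OTHER"),
])
print(nx.shortest_path(G, source="PFOA", target="NAFLD"))
# ['PFOA', 'GENE_PPARA', 'PATH_LipidMetab', 'NAFLD']
```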