ABSTRACT
Complex astrophysical systems often exhibit low-scatter relations between observable properties (e.g., luminosity, velocity dispersion, oscillation period). These scaling relations illuminate the underlying physics, and can provide observational tools for estimating masses and distances. Machine learning can provide a fast and systematic way to search for new scaling relations (or for simple extensions to existing relations) in abstract high-dimensional parameter spaces. We use a machine learning tool called symbolic regression (SR), which models patterns in a dataset in the form of analytic equations. We focus on the Sunyaev-Zeldovich flux-cluster mass relation ($Y_{\mathrm{SZ}}$-$M$), the scatter in which affects inference of cosmological parameters from cluster abundance data. Using SR on the data from the IllustrisTNG hydrodynamical simulation, we find a new proxy for cluster mass which combines $Y_{\mathrm{SZ}}$ and concentration of ionized gas ($c_{\mathrm{gas}}$): $M \propto Y_{\mathrm{conc}}^{3/5} \equiv Y_{\mathrm{SZ}}^{3/5}(1 - A\,c_{\mathrm{gas}})$. $Y_{\mathrm{conc}}$ reduces the scatter in the predicted $M$ by $\sim 20$-$30\%$ for large clusters ($M \gtrsim 10^{14}\,h^{-1}\,M_\odot$), as compared to using just $Y_{\mathrm{SZ}}$. We show that the dependence on $c_{\mathrm{gas}}$ is linked to cores of clusters exhibiting larger scatter than their outskirts. Finally, we test $Y_{\mathrm{conc}}$ on clusters from CAMELS simulations and show that $Y_{\mathrm{conc}}$ is robust against variations in cosmology, subgrid physics, and cosmic variance. Our results and methodology can be useful for accurate multiwavelength cluster mass estimation from upcoming CMB and X-ray surveys like ACT, SO, eROSITA and CMB-S4.
ABSTRACT
Machine learning methods, particularly neural networks trained on large datasets, are transforming how scientists approach scientific discovery and experimental design. However, current state-of-the-art neural networks are limited by their uninterpretability: Despite their excellent accuracy, they cannot describe how they arrived at their predictions. Here, using an "interpretable-by-design" approach, we present a neural network model that provides insights into RNA splicing, a fundamental process in the transfer of genomic information into functional biochemical products. Although we designed our model to emphasize interpretability, its predictive accuracy is on par with state-of-the-art models. To demonstrate the model's interpretability, we introduce a visualization that, for any given exon, allows us to trace and quantify the entire decision process from input sequence to output splicing prediction. Importantly, the model revealed uncharacterized components of the splicing logic, which we experimentally validated. This study highlights how interpretable machine learning can advance scientific discovery.
Subjects
Machine Learning; Neural Networks, Computer; Genomics; RNA Splicing/genetics; Logic
ABSTRACT
Artificial intelligence (AI) systems utilizing deep neural networks and machine learning (ML) algorithms are widely used for solving critical problems in bioinformatics, biomedical informatics and precision medicine. However, complex ML models that are often perceived as opaque and black-box methods make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. In sensitive areas such as healthcare, explainability and accountability are not only desirable properties but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable AI (XAI) aims to overcome the opaqueness of black-box models and to provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and identify factors that influence their outcomes. However, the majority of the state-of-the-art interpretable ML methods are domain-agnostic and have evolved from fields such as computer vision, automated reasoning or statistics, making direct application to bioinformatics problems challenging without customization and domain adaptation. In this paper, we discuss the importance of explainability and algorithmic transparency in the context of bioinformatics. We provide an overview of model-specific and model-agnostic interpretable ML methods and tools and outline their potential limitations. We discuss how existing interpretable ML methods can be customized and fit to bioinformatics research problems. Further, through case studies in bioimaging, cancer genomics and text mining, we demonstrate how XAI methods can improve transparency and decision fairness. 
Our review aims at providing valuable insights and serving as a starting point for researchers wanting to enhance explainability and decision transparency while solving bioinformatics problems. GitHub: https://github.com/rezacsedu/XAI-for-bioinformatics.
Subjects
Artificial Intelligence; Computational Biology; Humans; Machine Learning; Algorithms; Genomics
ABSTRACT
Cell-surface proteins play a critical role in cell function and are primary targets for therapeutics. CITE-seq is a single-cell technique that enables simultaneous measurement of gene and surface protein expression. It is powerful but costly and technically challenging. Computational methods have been developed to predict surface protein expression using gene expression information such as from single-cell RNA sequencing (scRNA-seq) data. Existing methods, however, are computationally demanding and lack the interpretability to reveal underlying biological processes. We propose CrossmodalNet, an interpretable machine learning model, to predict surface protein expression from scRNA-seq data. Our model with a customized adaptive loss accurately predicts surface protein abundances. When samples from multiple time points are given, our model encodes temporal information into an easy-to-interpret time embedding to make predictions in a time-point-specific manner, and is able to uncover noise-free causal gene-protein relationships. Using three publicly available time-resolved CITE-seq data sets, we validate the performance of our model by comparing it with benchmarking methods and evaluate its interpretability. Together, we show that our method accurately and interpretably profiles surface protein expression using scRNA-seq data, thereby expanding the capacity of CITE-seq experiments for investigating molecular mechanisms involving surface proteins.
Subjects
Algorithms; Gene Expression Profiling; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Membrane Proteins
ABSTRACT
Rapid identification of newly emerging or circulating viruses is an important first step toward managing the public health response to potential outbreaks. A portable virus capture device, coupled with label-free Raman spectroscopy, holds the promise of fast detection by rapidly obtaining the Raman signature of a virus followed by a machine learning (ML) approach applied to recognize the virus based on its Raman spectrum, which is used as a fingerprint. We present such an ML approach for analyzing Raman spectra of human and avian viruses. A convolutional neural network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks. In particular, it achieves 99% accuracy for classifying influenza virus type A versus type B, 96% accuracy for classifying four subtypes of influenza A, 95% accuracy for differentiating enveloped and nonenveloped viruses, and 99% accuracy for differentiating avian coronavirus (infectious bronchitis virus [IBV]) from other avian viruses. Furthermore, interpretation of neural net responses in the trained CNN model using a full-gradient algorithm highlights Raman spectral ranges that are most important to virus identification. By correlating ML-selected salient Raman ranges with the signature ranges of known biomolecules and chemical functional groups (for example, amide, amino acid, and carboxylic acid), we verify that our ML model effectively recognizes the Raman signatures of proteins, lipids, and other vital functional groups present in different viruses and uses a weighted combination of these signatures to identify viruses.
Subjects
Machine Learning; Neural Networks, Computer; Viruses; Disease Outbreaks; Pandemics; Serogroup; Viruses/classification
ABSTRACT
Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the "Locally Spiky Sparse" (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called "Depth-Weighted Prevalence" (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
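The LSS regression function described in this abstract (a linear combination of piecewise constant Boolean interaction terms) can be written schematically as below. The notation is an assumption introduced here for illustration and is not taken from the abstract: $\beta_j$ are coefficients, $S_j$ is the feature set of the $j$-th interaction, $\gamma_k$ is a threshold, and $b_k$ is the sign attached to feature $k$.

```latex
\mathbb{E}[Y \mid X] \;=\; \beta_0 \;+\; \sum_{j=1}^{J} \beta_j \prod_{k \in S_j} \mathbf{1}\{\, b_k (X_k - \gamma_k) \ge 0 \,\}, \qquad b_k \in \{-1, +1\}
```

Each product of indicators acts as a Boolean AND over thresholded features, so the function is constant on each region defined by the thresholds.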
Subjects
Algorithms; Machine Learning
ABSTRACT
Cancer is one of the leading causes of death worldwide. Survival analysis and prediction for cancer patients are of great significance for precision medicine. The robustness and interpretability of survival prediction models are important: robustness indicates whether a model has actually learned the relevant knowledge, and interpretability indicates whether a model can show humans what it has learned. In this paper, we propose a robust and interpretable model, SurvConvMixer, which uses pathway-customized gene expression images and ConvMixer for short-term, mid-term and long-term overall survival prediction in cancer. With ConvMixer, the representation of each pathway can be learned separately. We show the robustness of our model by testing the trained model on external datasets that were not used in training. The interpretability of SurvConvMixer relies on gradient-weighted class activation mapping (Grad-CAM), by which we obtain pathway-level activation heat maps. Wilcoxon rank-sum tests are then conducted to identify the statistically significant pathways, thereby revealing which pathways the model focuses on more. SurvConvMixer achieves remarkable performance on short-term, mid-term and long-term overall survival prediction for lung adenocarcinoma, lung squamous cell carcinoma and skin cutaneous melanoma, and the external validation tests show that SurvConvMixer generalizes to external datasets, indicating that it is robust. Finally, we investigate the activation maps generated by Grad-CAM; after Wilcoxon rank-sum tests and Kaplan-Meier estimation, we find that some survival-related pathways play an important role in SurvConvMixer.
Subjects
Adenocarcinoma of Lung; Lung Neoplasms; Melanoma; Skin Neoplasms; Humans; Gene Expression
ABSTRACT
Major depressive disorder (MDD) is a serious and heterogeneous psychiatric disorder that needs accurate diagnosis. Resting-state functional MRI (rsfMRI), which captures multiple perspectives on brain structure, function, and connectivity, is increasingly applied in the diagnosis and pathological research of MDD. Different machine learning algorithms are then developed to exploit the rich information in rsfMRI and discriminate MDD patients from normal controls. Despite recent advances reported, the MDD discrimination accuracy has room for further improvement. The generalizability and interpretability of the discrimination method are not sufficiently addressed either. Here, we propose a machine learning method (MFMC) for MDD discrimination by concatenating multiple features and stacking multiple classifiers. MFMC is tested on the REST-meta-MDD data set that contains 2428 subjects collected from 25 different sites. MFMC yields 96.9% MDD discrimination accuracy, demonstrating a significant improvement over existing methods. In addition, the generalizability of MFMC is validated by the good performance when the training and testing subjects are from independent sites. The use of XGBoost as the meta classifier allows us to probe the decision process of MFMC. We identify 13 feature values related to 9 brain regions including the posterior cingulate gyrus, superior frontal gyrus orbital part, and angular gyrus, which contribute most to the classification and also demonstrate significant differences at the group level. The use of these 13 feature values alone can reach 87% of MFMC's full performance when taking all feature values. These features may serve as clinically useful diagnostic and prognostic biomarkers for MDD in the future.
Subjects
Depressive Disorder, Major; Humans; Depressive Disorder, Major/diagnostic imaging; Depressive Disorder, Major/pathology; Brain Mapping/methods; Magnetic Resonance Imaging/methods; Brain; Machine Learning
ABSTRACT
Although immune checkpoint inhibitor-based therapy has shown promising results in non-small cell lung cancer patients with high programmed death-ligand 1 expression, not all patients respond to therapy. The tumor microenvironment (TME) is complex and heterogeneous, making it challenging to understand the key agents and features that influence response to therapies. In this study, we leverage multiplex fluorescent immunohistochemistry to quantitatively assess interactions between tumor and immune cells in an effort to identify patterns occurring at multiple spatial levels of the TME. To do so, we introduce several computational methods novel to a data set of 1,269 multiplex fluorescent immunohistochemistry images from a cohort of 52 patients with metastatic non-small cell lung cancer. With the spatial G-cross function, we quantify the degree of cell interaction at an entire image level, where we see significantly increased activity of cytotoxic T cells and helper T cells with epithelial tumor cells in responders to immune checkpoint inhibitor-based therapy (P = .022 and P < .001, respectively) and decreased activity of T-regulatory cells with epithelial tumor cells compared with nonresponders (P = .010). By leveraging spatial overlap methods, we define tumor subregions (which we call the tumor "periphery," "edge," and "center") and discover more localized immune-immune interactions influencing positive response, including those between cytotoxic T cells and helper T cells with antigen presenting cells in these subregions specifically. Finally, we trained an interpretable deep learning model that identified key cellular regions of interest that most influenced response classification (area under the curve = 0.71 ± 0.02). 
Assessing spatial interactions within these subregions further revealed new insights that were not significant at the whole image level, particularly the elevated association of antigen presenting cells and T-regulatory cells with one another in responder groups (P = .024). Altogether, we demonstrate that elucidating patterns of cell composition and interplay across multiple levels of spatial analyses can improve our understanding of the TME and better differentiate patient responses to immunotherapy.
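To make the G-cross function mentioned in this abstract concrete, the following is a minimal sketch of a naive empirical estimator: the share of cells of one type whose nearest cell of another type lies within a given radius. It assumes no edge correction, and the function name `g_cross` and the coordinates are hypothetical; the abstract does not state which estimator the study used.

```python
import math


def g_cross(from_cells, to_cells, radius):
    """Share of from_cells whose nearest to_cell is within radius (no edge correction)."""
    hits = 0
    for p in from_cells:
        # Distance from this cell to its nearest neighbor of the other type.
        nearest = min(math.dist(p, q) for q in to_cells)
        if nearest <= radius:
            hits += 1
    return hits / len(from_cells)


# Hypothetical cell centroids in micrometers, for illustration only.
tumor_cells = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0)]
t_cells = [(1.0, 1.0), (12.0, 0.0), (90.0, 90.0)]

# Evaluating at several radii traces out the empirical G-cross curve.
curve = [(r, g_cross(tumor_cells, t_cells, r)) for r in (1.0, 20.0, 60.0)]
```

A curve that rises quickly at small radii indicates that the two cell types tend to sit close together, which is the sense of "interaction" at the whole-image level.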
ABSTRACT
INTRODUCTION: The relationship between cognitive function and subsequent sarcopenia remains unclear. Therefore, this study aimed to examine the associations of performance on multiple cognitive domains with sarcopenia in middle-aged and older adults. METHODS: This longitudinal analysis (wave 2011-2013) included 2,934 participants from the CHARLS study. Sarcopenia was defined by the Asian Working Group for Sarcopenia 2019 criteria. Cognitive function was measured by the Chinese version of the Mini-Mental State Examination (MMSE). Three interpretable techniques, namely SHapley Additive exPlanations (SHAP) and two built-in methods (coefficients of logistic regression and Gini importance of random forest), were used to assess the relationship between MMSE, its components (orientation, attention, episodic memory, and visuospatial ability) and sarcopenia. In addition, the association of MMSE score and its components with sarcopenia was further validated using stepwise regression. RESULTS: All interpretable methods showed that MMSE score was an important predictor of sarcopenia, especially SHAP (MMSE score ranked first). For its components, episodic memory, visuospatial ability, and attention showed high predictive value compared with orientation. Stepwise regression analyses showed that MMSE score and its components of episodic memory and visuospatial ability were correlated with sarcopenia, with odds ratios of 0.93 (95% CI: 0.91-0.96, p < 0.001), 0.87 (95% CI: 0.82-0.93, p < 0.001), and 1.32 (95% CI: 1.05-1.65, p = 0.016), respectively. CONCLUSIONS: Better cognitive function, especially episodic memory and visuospatial ability, was negatively associated with incident sarcopenia among community-dwelling middle-aged and older adults.
Subjects
Cognition; Sarcopenia; Humans; Sarcopenia/psychology; Male; Female; Aged; Middle Aged; Longitudinal Studies; Cognition/physiology; Memory, Episodic; Mental Status and Dementia Tests; Cognitive Dysfunction/psychology; China/epidemiology; Neuropsychological Tests; Aged, 80 and over; Attention/physiology
ABSTRACT
Long waiting time in outpatient departments is a crucial factor in patient dissatisfaction. We aim to analytically interpret the waiting times predicted by machine learning models and provide patients with an explanation of the expected waiting time. Here, underestimating waiting times can cause patient dissatisfaction, so preventing this in predictive models is necessary. To address this issue, we propose a framework considering dissatisfaction for estimating the waiting time in an outpatient department. In our framework, we leverage asymmetric loss functions to ensure robustness against underestimation. We also propose a dissatisfaction-aware asymmetric error score (DAES) to determine an appropriate model by considering the trade-off between underestimation and accuracy. Finally, Shapley additive explanation (SHAP) is applied to interpret the relationship trained by the model, enabling decision makers to use this information for improving outpatient service operations. We apply our framework in the endocrinology metabolism department and neurosurgery department in one of the largest hospitals in South Korea. The use of asymmetric functions prevents underestimation in the model, and with the proposed DAES, we can strike a balance in selecting the best model. By using SHAP, we can analytically interpret the waiting time in outpatient service (e.g., the length of the queue affects the waiting time the most) and provide explanations about the expected waiting time to patients. The proposed framework aids in improving operations, considering practical application in hospitals for real-time patient notification and minimizing patient dissatisfaction. Given the significance of managing hospital operations from the perspective of patients, this work is expected to contribute to operations improvement in health service practices.
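As an illustration of how an asymmetric loss can discourage underestimation, here is a minimal sketch using the pinball (quantile) loss. This is one common asymmetric loss chosen for illustration; the abstract does not say which asymmetric functions the framework uses, and the name `pinball_loss`, the value `tau=0.8`, and the waiting times are assumptions.

```python
def pinball_loss(y_true, y_pred, tau=0.8):
    """Mean pinball (quantile) loss; tau > 0.5 penalizes underestimation more heavily."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p  # positive when the predicted wait is too short
        if diff >= 0:
            total += tau * diff            # underestimation: weight tau
        else:
            total += (tau - 1.0) * diff    # overestimation: weight 1 - tau
    return total / len(y_true)


# Hypothetical waiting times in minutes, for illustration only.
actual = [30.0, 45.0, 60.0]
too_short = [20.0, 35.0, 50.0]  # every prediction 10 minutes under
too_long = [40.0, 55.0, 70.0]   # every prediction 10 minutes over

loss_under = pinball_loss(actual, too_short)
loss_over = pinball_loss(actual, too_long)
```

With `tau` above 0.5, errors of the same size cost more when the wait is underestimated than when it is overestimated, so a model trained on this loss is pushed toward predictions that patients are less likely to exceed.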
Subjects
Machine Learning; Patient Satisfaction; Waiting Lists; Humans; Republic of Korea; Time Factors; Outpatients
ABSTRACT
Proactive analysis of patient pathways helps healthcare providers anticipate treatment-related risks, identify outcomes, and allocate resources. Machine learning (ML) can leverage a patient's complete health history to make informed decisions about future events. However, previous work has mostly relied on so-called black-box models, which are unintelligible to humans, making it difficult for clinicians to apply such models. Our work introduces PatWay-Net, an ML framework designed for interpretable predictions of admission to the intensive care unit (ICU) for patients with symptoms of sepsis. We propose a novel type of recurrent neural network and combine it with multi-layer perceptrons to process the patient pathways and produce predictive yet interpretable results. We demonstrate its utility through a comprehensive dashboard that visualizes patient health trajectories, predictive outcomes, and associated risks. Our evaluation includes both predictive performance - where PatWay-Net outperforms standard models such as decision trees, random forests, and gradient-boosted decision trees - and clinical utility, validated through structured interviews with clinicians. By providing improved predictive accuracy along with interpretable and actionable insights, PatWay-Net serves as a valuable tool for healthcare decision support in the critical case of patients with symptoms of sepsis.
Subjects
Intensive Care Units; Machine Learning; Sepsis; Humans; Neural Networks, Computer; Critical Pathways
ABSTRACT
BACKGROUND: Overweight and obesity pose a huge burden on individuals and society. While the relationship between lifestyle factors and overweight and obesity is well-established, the relative contribution of specific lifestyle factors remains unclear. To address this gap in the literature, this study utilizes interpretable machine learning methods to identify the relative importance of specific lifestyle factors as predictors of overweight and obesity in adults. METHODS: Data were obtained from 46,057 adults in the China Health and Nutrition Survey (2004-2011) and the National Health and Nutrition Examination Survey (2007-2014). Basic demographic information, self-reported lifestyle factors (including physical activity, macronutrient intake, and tobacco and alcohol consumption), and body weight status were collected. Three machine learning models, namely decision tree, random forest, and gradient-boosting decision tree, were employed to predict body weight status from lifestyle factors. The SHapley Additive exPlanation (SHAP) method was used to interpret the prediction results of the best-performing model by determining the contributions of specific lifestyle factors to the development of overweight and obesity in adults. RESULTS: The gradient-boosting decision tree model outperformed the decision tree and random forest models. Analysis based on the SHAP method indicates that sedentary behavior, alcohol consumption, and protein intake were important lifestyle factors predicting the development of overweight and obesity in adults. The amount of alcohol consumption and time spent sedentary were the strongest predictors of overweight and obesity, respectively. Specifically, sedentary behavior exceeding 28-35 h/week, alcohol consumption of more than 7 cups/week, and protein intake exceeding 80 g/day increased the risk of being predicted as overweight and obese. 
CONCLUSION: Pooled evidence from two nationally representative studies suggests that recognizing demographic differences and emphasizing the relative importance of sedentary behavior, alcohol consumption, and protein intake are beneficial for managing body weight status in adults. The specific risk thresholds for lifestyle factors observed in this study can help inform and guide future research and public health actions.
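The idea behind SHAP-style attribution can be sketched with an exact Shapley value computation on a tiny model. This is an illustration only: the toy risk score, the feature order (sedentary time, alcohol, protein), and the zero baseline are assumptions, and replacing absent features by a fixed baseline is a simplification of what SHAP implementations do with background data.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset of feature indices)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            # Weight of a coalition of this size in the Shapley average.
            w = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += w * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy risk score; feature order: sedentary, alcohol, protein.
def toy_model(x):
    return 2.0 * x[0] + 3.0 * x[1] + 1.0 * x[0] * x[2]


instance = [1.0, 1.0, 1.0]   # the individual being explained (hypothetical)
baseline = [0.0, 0.0, 0.0]   # reference values for "absent" features (assumption)


def value_fn(subset):
    # Features outside the coalition are replaced by their baseline value.
    mixed = [instance[i] if i in subset else baseline[i] for i in range(3)]
    return toy_model(mixed)


phi = shapley_values(value_fn, 3)
```

The attributions sum to the difference between the model output for the instance and for the baseline, and the interaction term between the first and third features is split equally between them.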
Subjects
Life Style; Machine Learning; Nutrition Surveys; Obesity; Overweight; Humans; Adult; Male; Obesity/epidemiology; Female; Middle Aged; Overweight/epidemiology; China/epidemiology; Risk Factors; Decision Trees; Young Adult
ABSTRACT
Limited information is available on the potential predictive value of environmental chemicals for mortality. Our study aimed to investigate the associations between 43 representative environmental chemicals from 8 classes, measured in serum or urine, and mortality, and to further develop interpretable machine learning models based on environmental chemicals to predict mortality. A total of 1602 participants were included from the National Health and Nutrition Examination Survey (NHANES). During 154,646 person-months of follow-up, 127 deaths occurred. We found that machine learning showed promise in predicting mortality. CoxPH was selected as the optimal model for predicting all-cause mortality with time-dependent AUROC of 0.953 (95%CI: 0.951-0.955). Coxnet was the best model for predicting cardiovascular disease (CVD) and cancer mortality with time-dependent AUROCs of 0.935 (95%CI: 0.933-0.936) and 0.850 (95%CI: 0.844-0.857). Adding environmental chemicals to models based on clinical variables could enhance the prediction of cancer mortality (P < 0.05). Some environmental chemicals contributed more to the models than traditional clinical variables. Combining the results of association and prediction models through interpretable machine learning analyses, we found that urinary methyl paraben (MP) and urinary 2-naphthol (2-NAP) were negatively associated with all-cause mortality, while serum cadmium (Cd) was positively associated with all-cause mortality. Urinary bisphenol A (BPA) was positively associated with CVD mortality.
Subjects
Cardiovascular Diseases; Neoplasms; Humans; Longitudinal Studies; Nutrition Surveys; Machine Learning; Neoplasms/chemically induced
ABSTRACT
The Northern Sea Route (NSR) makes travel between Europe and Asia shorter and quicker than a southern transit via the Strait of Malacca and Suez Canal. It provides greater access to Arctic resources such as oil and gas. As global warming accelerates, melting Arctic ice caps are likely to increase traffic in the NSR and enhance its commercial viability. Because the harsh Arctic environment threatens the safety of ship navigation, it is necessary to assess Arctic navigation risk to maintain shipping safety. Currently, most studies focus on conventional risk assessment, which lacks validation against actual data. In this study, actual data about the Arctic navigation environment and related expert judgments were used to generate a structured data set. Based on the structured data set, extreme gradient boosting (XGBoost) and alternative methods were used to establish models for the assessment of Arctic navigation risk, which were validated using cross-validation. The results show that, compared with alternative models, XGBoost models have the best performance in terms of mean absolute errors and root mean squared errors. The XGBoost models can learn and reproduce expert judgments and knowledge for the assessment of Arctic navigation risk. Feature importance (FI) and Shapley additive explanations (SHAP) are used to further interpret the relationship between input data and predictions. The application of XGBoost, FI, and SHAP aims to improve the safety of Arctic shipping using advanced artificial intelligence techniques. Validation enhances the quality and robustness of the assessment.
ABSTRACT
Optimization and control of the wastewater treatment process (WTP) can contribute to cost reduction and efficiency. A wastewater treatment process multi-objective optimization (WTPMO) framework is proposed in this paper to provide suggestions for decision-making in setting WTP parameters. First, prediction models based on Extreme Gradient Boosting (XGB) with Bayesian optimization (BO) are developed for predicting effluent water quality (EQ) and energy consumption (EC) for different influent quality and process parameter settings. Then, the SHapley Additive exPlanations (SHAP) algorithm is used to complement the interpretability of machine learning by quantitatively evaluating the impact of different features on the predicted targets. Finally, the Non-dominated Sorting Genetic Algorithm II (NSGA-II) with the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is introduced to solve and make decisions on the multi-objective optimization problem. The applicability of WTPMO is validated on Benchmark Simulation Model 1 (BSM1). The results show that BOXGB achieves accurate prediction for EQ and EC with $R^2$ values of 0.923 and 0.965, respectively, indicating that BO can effectively select the model hyperparameters in XGB. SHAP supplements the interpretability of the model, explaining how the influent water quality and decision variables affect the EQ and EC of the WTP. In addition, the optimized process parameters are determined based on NSGA-II and TOPSIS, and the EC optimization rate is 1.552% while guaranteeing water quality compliance. Overall, this research can effectively achieve the optimization of WTP, ensure that the effluent water quality meets the standards while reducing energy consumption, assist wastewater treatment plants (WWTPs) in achieving more intelligent and efficient operation and maintenance management, and provide strong support for environmental protection and sustainable development goals.
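The decision step in this abstract, choosing one solution from a Pareto front with TOPSIS, can be sketched as follows. This is a minimal textbook version (vector normalization, Euclidean distances); the function name `topsis`, the equal weights, and the candidate values are hypothetical and are not taken from the study.

```python
import math


def topsis(matrix, weights, benefit):
    """Closeness scores in [0, 1]; a higher score is closer to the ideal solution.

    matrix  : rows are alternatives, columns are criteria
    weights : one weight per criterion
    benefit : True if larger is better for that criterion, False if smaller is better
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal value for each criterion.
    cols = [[row[j] for row in v] for j in range(n_crit)]
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical Pareto-front candidates: (effluent quality index, energy consumption).
# Both are treated as cost criteria here, i.e. smaller is better (assumption).
pareto = [[6000.0, 3900.0], [6200.0, 3700.0], [6500.0, 3650.0]]
scores = topsis(pareto, weights=[0.5, 0.5], benefit=[False, False])
chosen = max(range(len(scores)), key=scores.__getitem__)
```

The weights encode how the operator trades effluent quality against energy use; changing them can change which Pareto-front candidate is selected.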
Subjects
Algorithms; Bayes Theorem; Machine Learning; Waste Disposal, Fluid; Wastewater; Water Quality; Waste Disposal, Fluid/methods; Water Purification/methods; Models, Theoretical
ABSTRACT
Groundwater nitrate contamination poses a potential threat to human health and environmental safety globally. This study proposes an interpretable stacking ensemble learning (SEL) framework for enhancing and interpreting groundwater nitrate spatial predictions by integrating a two-level heterogeneous SEL model and SHapley Additive exPlanations (SHAP). In the SEL model, five commonly used machine learning models were utilized as base models (gradient boosting decision tree, extreme gradient boosting, random forest, extremely randomized trees, and k-nearest neighbor), whose outputs were taken as input data for the meta-model. When applied to an agriculturally intensive area, the Eden Valley in the UK, the SEL model outperformed the individual models in predictive performance and generalization ability. It reveals a mean groundwater nitrate level of 2.22 mg/L-N, with 2.46% of sandstone aquifers exceeding the drinking standard of 11.3 mg/L-N. Alarmingly, 8.74% of areas with high groundwater nitrate remain outside the designated nitrate vulnerable zones. Moreover, SHAP identified transmissivity, baseflow index, hydraulic conductivity, the percentage of arable land, and the C:N ratio in the soil as the top five driving factors of groundwater nitrate. With nitrate threatening groundwater globally, this study presents a high-accuracy, interpretable, and flexible modeling framework that enhances our understanding of the mechanisms behind groundwater nitrate contamination. It implies that the interpretable SEL framework has great promise for providing valuable evidence for environmental management, water resource protection, and sustainable development, particularly in data-scarce areas.
Subjects
Groundwater; Machine Learning; Nitrates; Water Pollutants, Chemical; Groundwater/chemistry; Nitrates/analysis; Water Pollutants, Chemical/analysis; Environmental Monitoring/methods
ABSTRACT
Nanoparticle (NP) characterization is essential because diverse shapes, sizes, and morphologies inevitably occur in as-synthesized NP mixtures, profoundly impacting their properties and applications. Currently, the only technique that can concurrently determine these structural parameters is electron microscopy, but it is time-intensive and tedious. Here, we create a three-dimensional (3D) NP structural space to concurrently determine the purity, size, and shape of 1000 sets of as-synthesized Ag nanocube mixtures containing interfering nanospheres and nanowires from their extinction spectra, attaining low predictive errors of 2.7-7.9%. We first use plasmonically driven feature enrichment to extract localized surface plasmon resonance attributes from spectra and establish a lasso regressor (LR) model to predict purity, size, and shape. Leveraging the learned LR, we artificially generate 425,592 augmented extinction spectra to overcome data scarcity and create a comprehensive NP structural space to bidirectionally predict extinction spectra from structural parameters with <4% error. Our interpretable NP structural space further elucidates the two higher-order combined electric dipole, quadrupole, and magnetic dipole as the critical structural parameter predictors. By incorporating other NP shapes and mixtures' extinction spectra, we anticipate that our approach, especially the data augmentation, can create a fully generalizable NP structural space to drive on-demand, autonomous synthesis-characterization platforms.
ABSTRACT
BACKGROUND: Understanding the impact of gene interactions on disease phenotypes is increasingly recognised as a crucial aspect of genetic disease research. This trend is reflected by the growing amount of clinical research on oligogenic diseases, where disease manifestations are influenced by combinations of variants on a few specific genes. Although statistical machine-learning methods have been developed to identify relevant genetic variant or gene combinations associated with oligogenic diseases, they rely on abstract features and black-box models, which makes their predictions difficult for medical experts to interpret, comprehend, and validate. In this work, we present a novel, interpretable predictive approach based on a knowledge graph that not only provides accurate predictions of disease-causing gene interactions but also offers explanations for these results. RESULTS: We introduce BOCK, a knowledge graph constructed to explore disease-causing genetic interactions, integrating curated information on oligogenic diseases from clinical cases with relevant biomedical networks and ontologies. Using this graph, we developed a novel predictive framework based on heterogeneous paths connecting gene pairs. This method trains an interpretable decision set model that not only accurately predicts pathogenic gene interactions, but also unveils the patterns associated with these diseases. A unique aspect of our approach is its ability to offer, along with each positive prediction, explanations in the form of subgraphs, revealing the specific entities and relationships that led to each pathogenic prediction. CONCLUSION: Our method, built with interpretability in mind, leverages heterogeneous path information in knowledge graphs to predict pathogenic gene interactions and generate meaningful explanations. This not only broadens our understanding of the molecular mechanisms underlying oligogenic diseases, but also presents a novel application of knowledge graphs in creating more transparent and insightful predictors for genetic research.
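As a rough illustration of what "heterogeneous paths connecting gene pairs" can mean in practice, the sketch below counts typed paths between two genes in a tiny toy graph. Everything in it is hypothetical: the node names, the relation names, the `metapath_counts` helper, and the path-length limit are assumptions for illustration and are not taken from BOCK. The decision set model trained on such features is not shown.

```python
from collections import Counter

# Hypothetical toy knowledge graph: (source, relation, target).
# The node type is the prefix before the colon.
EDGES = [
    ("gene:A", "interacts", "gene:C"),
    ("gene:C", "interacts", "gene:B"),
    ("gene:A", "in_pathway", "pathway:P1"),
    ("gene:B", "in_pathway", "pathway:P1"),
    ("gene:A", "has_phenotype", "phenotype:H1"),
    ("gene:B", "has_phenotype", "phenotype:H1"),
]

def build_adjacency(edges):
    """Index edges by node, treating every relation as undirected."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
        adjacency.setdefault(target, []).append((relation, source))
    return adjacency

def node_type(node):
    return node.split(":", 1)[0]

def metapath_counts(adjacency, start, goal, max_edges=3):
    """Count simple paths from start to goal, grouped by their type signature.

    A signature alternates node types and relation names, e.g.
    ("gene", "in_pathway", "pathway", "in_pathway", "gene").
    """
    counts = Counter()

    def walk(node, visited, signature, depth):
        if node == goal and depth > 0:
            counts[signature] += 1
            return
        if depth == max_edges:
            return
        for relation, neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour},
                     signature + (relation, node_type(neighbour)), depth + 1)

    walk(start, {start}, (node_type(start),), 0)
    return counts

adjacency = build_adjacency(EDGES)
features = metapath_counts(adjacency, "gene:A", "gene:B")
for signature, count in sorted(features.items()):
    print(count, " -> ".join(signature))
```

Each signature count can serve as one interpretable feature for a gene pair, and the concrete paths behind a non-zero count form a small subgraph that can be shown to a domain expert as the explanation for a prediction.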
Subjects
Epistasis, Genetic; Pattern Recognition, Automated; Machine Learning; Phenotype; Gene Ontology
ABSTRACT
Because of its ability to find complex patterns in high-dimensional and heterogeneous data, machine learning (ML) has emerged as a critical tool for making sense of the growing amount of genetic and genomic data available. While the complexity of ML models is what makes them powerful, it also makes them difficult to interpret. Fortunately, efforts to develop approaches that make the inner workings of ML models understandable to humans have improved our ability to gain novel biological insights. Here, we discuss the importance of interpretable ML, different strategies for interpreting ML models, and examples of how these strategies have been applied. Finally, we identify challenges and promising future directions for interpretable ML in genetics and genomics.