Results 1 - 17 of 17
1.
Front Oral Health ; 5: 1408867, 2024.
Article in English | MEDLINE | ID: mdl-39092200

ABSTRACT

Oral diseases pose a significant burden on global healthcare. While many oral conditions are preventable and manageable through regular dental office visits, a substantial portion of the population faces obstacles in accessing essential and affordable quality oral healthcare. In this mini review, we describe the issue of inequity and bias in oral healthcare and discuss various strategies to address these challenges, with an emphasis on the application of artificial intelligence (AI). Recent advances in AI technologies have led to significant performance improvements in oral healthcare. AI also holds tremendous potential for advancing equity in oral healthcare, yet its application must be approached with caution to prevent the exacerbation of inequities. The "black box" approaches of some advanced AI models raise uncertainty about their operations and decision-making processes. To this end, we discuss the use of interpretable and explainable AI techniques in enhancing transparency and trustworthiness. Those techniques, aimed at augmenting rather than replacing oral health practitioners' judgment and skills, have the potential to achieve personalized dental and oral care that is unbiased, equitable, and transparent. Overall, achieving equity in oral healthcare through the responsible use of AI requires collective efforts from all stakeholders involved in the design, implementation, regulation, and utilization of AI systems. We use the United States as an example due to its uniquely diverse population, making it an excellent model for our discussion. However, the general and responsible AI strategies suggested in this article can be applied to address equity in oral healthcare on a global level.

2.
Digit Health ; 10: 20552076231224225, 2024.
Article in English | MEDLINE | ID: mdl-38235416

ABSTRACT

Objective: Chronic kidney disease (CKD) poses a major global health burden. Early CKD risk prediction enables timely interventions, but conventional models have limited accuracy. Machine learning (ML) enhances prediction, but interpretability is needed to support clinical use in both diagnosis and decision-making. Methods: A cohort of 491 patients with clinical data was collected for this study. The dataset was randomly split into an 80% training set and a 20% testing set. To achieve the first objective, we developed four ML algorithms (logistic regression, random forests, neural networks, and eXtreme Gradient Boosting (XGBoost)) to classify patients into two classes: those who progressed to CKD stages 3-5 during follow-up (positive class) and those who did not (negative class). For the classification task, the area under the receiver operating characteristic curve (AUC-ROC) was used to evaluate model performance in discriminating between the two classes. For survival analysis, Cox proportional hazards regression (COX) and random survival forests (RSFs) were employed to predict CKD progression, and the concordance index (C-index) and integrated Brier score were used for model evaluation. Furthermore, variable importance, partial dependence plots, and restricted cubic splines were used to interpret the models' results. Results: XGBoost demonstrated the best predictive performance for CKD progression in the classification task, with an AUC-ROC of 0.867 (95% confidence interval (CI): 0.728-1.000), outperforming the other ML algorithms. In survival analysis, RSF showed slightly better discrimination and calibration on the test set compared to COX, indicating better generalization to new data. Variable importance analysis identified estimated glomerular filtration rate, age, and creatinine as the most important predictors for CKD survival analysis. Further analysis revealed non-linear associations between age and CKD progression, suggesting higher risks in patients aged 52-55 and 65-66 years. The association between cholesterol levels and CKD progression was also non-linear, with lower risks observed when cholesterol levels were in the range of 5.8-6.4 mmol/L. Conclusions: Our study demonstrated the effectiveness of interpretable ML models for predicting CKD progression. The comparison between COX and RSF highlighted the advantages of ML in survival analysis, particularly in handling non-linearity and high-dimensional data. By leveraging interpretable ML to unravel risk-factor relationships, contrast predictive techniques, and expose non-linear associations, this study advances CKD risk prediction to enable enhanced clinical decision-making.
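The AUC-ROC used above to compare the classifiers can be computed directly from its rank interpretation. A minimal sketch (the labels and scores are illustrative, not data from the study):

```python
def auc_roc(labels, scores):
    """AUC-ROC via the Mann-Whitney identity: the probability that a
    randomly chosen positive outscores a randomly chosen negative,
    counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: two progressors (1) and two non-progressors (0).
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to chance-level discrimination between progressors and non-progressors; 1.0 is perfect separation.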

3.
mSystems ; 8(4): e0053123, 2023 08 31.
Article in English | MEDLINE | ID: mdl-37404032

ABSTRACT

With the concomitant advances in both the microbiome and machine learning fields, the gut microbiome has become of great interest for the potential discovery of biomarkers to be used in the classification of host health status. Shotgun metagenomics data derived from the human microbiome comprise a high-dimensional set of microbial features. The use of such complex data for modeling host-microbiome interactions remains a challenge, as retaining de novo content yields a highly granular set of microbial features. In this study, we compared the prediction performances of machine learning approaches across different types of data representations derived from shotgun metagenomics. These representations include the commonly used taxonomic and functional profiles and the more granular gene cluster approach. For the five case-control datasets used in this study (type 2 diabetes, obesity, liver cirrhosis, colorectal cancer, and inflammatory bowel disease), gene-based approaches, whether used alone or in combination with reference-based data types, yielded classification performance similar to or better than the taxonomic and functional profiles. In addition, we show that using subsets of gene families from specific functional categories highlights the importance of these functions for the host phenotype. This study demonstrates that both reference-free microbiome representations and curated metagenomic annotations can provide relevant representations for machine learning based on metagenomic data. IMPORTANCE: Data representation is an essential part of machine learning performance when using metagenomic data. In this work, we show that different microbiome representations provide varied host phenotype classification performance depending on the dataset. In classification tasks, untargeted microbiome gene content can provide similar or improved classification compared to taxonomic profiling. Feature selection based on biological function also improves classification performance for some pathologies. Function-based feature selection combined with interpretable machine learning algorithms can generate new hypotheses that can potentially be assayed mechanistically. This work thus proposes new approaches to representing microbiome data for machine learning that can potentiate the findings associated with metagenomic data.
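Function-based feature selection, as described in the IMPORTANCE paragraph, can be sketched as a simple filter over a gene-family abundance table. All gene-family names, categories, and abundances below are invented for illustration:

```python
# Toy annotation: gene family -> functional category (invented names).
CATEGORY = {
    "gf_001": "carbohydrate_metabolism",
    "gf_002": "carbohydrate_metabolism",
    "gf_003": "amino_acid_metabolism",
    "gf_004": "motility",
}

def select_by_function(samples, category):
    """Keep only gene-family features annotated with `category`,
    mirroring function-based feature selection before model fitting."""
    keep = {g for g, c in CATEGORY.items() if c == category}
    return [{g: a for g, a in s.items() if g in keep} for s in samples]

samples = [
    {"gf_001": 5.0, "gf_002": 0.5, "gf_003": 2.0, "gf_004": 1.0},
    {"gf_001": 0.2, "gf_002": 4.0, "gf_003": 0.1, "gf_004": 3.0},
]
subset = select_by_function(samples, "carbohydrate_metabolism")
```

The reduced table would then be passed to any classifier; an importance gain on the subset suggests the category matters for the phenotype.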


Subjects
Diabetes Mellitus, Type 2 ; Gastrointestinal Microbiome ; Microbiota ; Humans ; Diabetes Mellitus, Type 2/genetics ; Microbiota/genetics ; Metagenome ; Gastrointestinal Microbiome/genetics ; Phenotype
4.
Comput Struct Biotechnol J ; 21: 2446-2453, 2023.
Article in English | MEDLINE | ID: mdl-37090433

ABSTRACT

Peptide retention time (RT) prediction algorithms are tools to study and identify the physicochemical properties that drive the peptide-sorbent interaction. Traditional RT algorithms use multiple linear regression with manually curated parameters to determine the degree of direct contribution of each parameter, and improvements to RT prediction accuracy relied on superior feature engineering. Deep learning led to a significant increase in RT prediction accuracy and automated feature engineering by chaining multiple learning modules. However, the significance and identity of these extracted variables are not well understood due to the inherent complexity of interpreting the "relationships-of-relationships" found in deep learning variables. To achieve both accuracy and interpretability simultaneously, we isolated the individual modules used in deep learning; these isolated modules are the shallow learners employed for RT prediction in this work. Using a shallow convolutional neural network (CNN) and gated recurrent unit (GRU), we find that the spatial features obtained via the CNN correlate with real-world physicochemical properties, namely collision cross sections (CCS) and variations of accessible surface area (ASA). Furthermore, we determined that the discovered parameters are "micro-coefficients" that contribute to the "macro-coefficient" of hydrophobicity. Manually embedding CCS and the variations of ASA into the GRU model yielded R2 = 0.981 using only 525 variables and can represent 88% of the ∼110,000 tryptic peptides used in our dataset. This work highlights that the feature discovery process of our shallow learners can surpass traditional RT models in performance while offering better interpretability than the deep learning RT algorithms found in the literature.

5.
Comput Methods Programs Biomed ; 233: 107482, 2023 May.
Article in English | MEDLINE | ID: mdl-36947980

ABSTRACT

BACKGROUND AND OBJECTIVE: Prediction of survival in patients diagnosed with a brain tumour is challenging because of heterogeneous tumour behaviours and treatment responses. Advances in machine learning have led to the development of clinical prognostic models, but due to the lack of model interpretability, integration into clinical practice is almost non-existent. In this retrospective study, we compare five classification models with varying degrees of interpretability for the prediction of brain tumour survival greater than one year following diagnosis. METHODS: 1028 patients aged ≥16 years with a brain tumour diagnosis between April 2012 and April 2020 were included in our study. Three intrinsically interpretable 'glass box' classifiers (Bayesian Rule Lists [BRL], Explainable Boosting Machine [EBM], and Logistic Regression [LR]) and two 'black box' classifiers (Random Forest [RF] and Support Vector Machine [SVM]) were trained on electronic patient records for the prediction of one-year survival. All models were evaluated using balanced accuracy (BAC), F1-score, sensitivity, specificity, and receiver operating characteristics. Black box model interpretability and misclassified predictions were quantified using SHapley Additive exPlanations (SHAP) values, and model feature importance was evaluated by clinical experts. RESULTS: The RF model achieved the highest BAC of 78.9%, closely followed by SVM (77.7%), LR (77.5%), and EBM (77.1%). Across all models, age, diagnosis (tumour type), functional features, and first treatment were top contributors to the prediction of one-year survival. We used EBM and SHAP to explain model misclassifications and investigated the role of feature interactions in prognosis. CONCLUSION: Interpretable models are a natural choice for the domain of predictive medicine. Intrinsically interpretable models, such as EBMs, may provide an advantage over traditional clinical assessment of brain tumour prognosis by weighting potential risk factors and their interactions that may be unknown to clinicians. Agreement between model predictions and clinical knowledge is essential for establishing trust in a model's decision-making process, as well as trust that the model will make accurate predictions when applied to new data.


Subjects
Brain Neoplasms ; Humans ; Bayes Theorem ; Retrospective Studies ; Brain Neoplasms/diagnosis ; Machine Learning ; Brain
6.
J Gerontol A Biol Sci Med Sci ; 78(4): 718-726, 2023 03 30.
Article in English | MEDLINE | ID: mdl-35657011

ABSTRACT

BACKGROUND: Multiple organ dysfunction syndrome (MODS) is associated with a high risk of mortality among older patients. Current severity scores are limited in their ability to assist clinicians with triage and management decisions. We aim to develop mortality prediction models for older patients with MODS admitted to the ICU. METHODS: The study analyzed older patients from 197 hospitals in the United States and 1 hospital in the Netherlands. The cohort was divided into the young-old (65-80 years) and old-old (≥80 years), which were separately used to develop and evaluate models with internal, external, and temporal validation. Demographic characteristics, comorbidities, vital signs, laboratory measurements, and treatments were used as predictors. We used the XGBoost algorithm to train models and the SHapley Additive exPlanations (SHAP) method to interpret predictions. RESULTS: In total, 34,497 young-old (11.3% mortality) and 21,330 old-old (15.7% mortality) patients were analyzed. Discrimination (AUROC) of the internal validation models in 9,046 U.S. patients was 0.87 and 0.82, respectively; discrimination of the external validation models in 1,905 European patients was 0.86 and 0.85, respectively; and discrimination of the temporal validation models in 8,690 U.S. patients was 0.85 and 0.78, respectively. These models outperformed standard clinical scores such as the Sequential Organ Failure Assessment and Acute Physiology Score III. The Glasgow Coma Scale, Charlson Comorbidity Index, and code status emerged as top predictors of mortality. CONCLUSIONS: Our models integrate data spanning physiologic and geriatric-relevant variables and outperform existing scores used in older adults with MODS, a proof of concept of how machine learning can streamline data analysis for busy ICU clinicians and potentially optimize prognostication and decision making.
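For intuition about the SHAP values used to interpret predictions here: linear models admit exact Shapley attributions in closed form under feature independence, phi_i = beta_i * (x_i - E[x_i]). A sketch with made-up coefficients and features (the study's XGBoost models require the tree-specific TreeSHAP algorithm instead):

```python
def linear_shap(beta, x, background_mean):
    """Exact SHAP values for a linear model under feature independence:
    phi_i = beta_i * (x_i - E[x_i]). The attributions sum to
    f(x) - E[f(x)] for a linear f."""
    return [b * (xi - mi) for b, xi, mi in zip(beta, x, background_mean)]

beta = [0.8, -1.2, 0.05]   # illustrative coefficients, not from the study
x = [3.0, 1.0, 40.0]       # one hypothetical patient's features
mean = [2.0, 1.5, 38.0]    # cohort means (the "background")
phi = linear_shap(beta, x, mean)
# sum(phi) equals f(x) - f(mean) for a linear f
```

Per-feature attributions of this kind are what SHAP summary plots aggregate across a cohort to rank predictors such as the Glasgow Coma Scale.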


Subjects
Hospitals ; Multiple Organ Failure ; Humans ; Aged ; Retrospective Studies ; Multiple Organ Failure/diagnosis ; Hospital Mortality ; Machine Learning
7.
Med Image Comput Comput Assist Interv ; 14221: 628-638, 2023 Oct.
Article in English | MEDLINE | ID: mdl-38827244

ABSTRACT

Building generalizable AI models is one of the primary challenges in the healthcare domain. While radiologists rely on generalizable descriptive rules of abnormality, Neural Network (NN) models degrade under even a slight shift in input distribution (e.g., scanner type). Fine-tuning a model to transfer knowledge from one domain to another requires a significant amount of labeled data in the target domain. In this paper, we develop an interpretable model that can be efficiently fine-tuned to an unseen target domain with minimal computational cost. We assume the interpretable component of the NN to be approximately domain-invariant. However, interpretable models typically underperform compared to their Blackbox (BB) variants. We start with a BB in the source domain and distill it into a mixture of shallow interpretable models using human-understandable concepts. As each interpretable model covers a subset of the data, a mixture of interpretable models achieves performance comparable to the BB. Further, we use the pseudo-labeling technique from semi-supervised learning (SSL) to learn the concept classifier in the target domain, followed by fine-tuning the interpretable models in the target domain. We evaluate our model using a real-life large-scale chest X-ray (CXR) classification dataset. The code is available at: https://github.com/batmanlab/MICCAI-2023-Route-interpret-repeat-CXRs.

8.
Biotechnol Adv ; 60: 108008, 2022 11.
Article in English | MEDLINE | ID: mdl-35738510

ABSTRACT

Glycans are complex, yet ubiquitous across biological systems. They are involved in diverse essential organismal functions. Aberrant glycosylation may lead to disease development, such as cancer, autoimmune diseases, and inflammatory diseases. Glycans, both normal and aberrant, are synthesized using extensive glycosylation machinery, and understanding this machinery can provide invaluable insights for the diagnosis, prognosis, and treatment of various diseases. Increasing amounts of glycomics data are being generated thanks to advances in glycoanalytics technologies, but to maximize the value of such data, innovations are needed for analyzing and interpreting large-scale glycomics data. Artificial intelligence (AI) provides a powerful analysis toolbox in many scientific fields, and here we review state-of-the-art AI approaches to glycosylation analysis. We further discuss how models can be analyzed to gain mechanistic insights into the glycosylation machinery and how the machinery shapes glycans under different scenarios. Finally, we propose how to leverage the gained knowledge for developing predictive AI-based models of glycosylation. Guiding future research on AI-based glycosylation model development in this way will provide valuable insights into glycosylation and the glycan machinery.


Subjects
Artificial Intelligence ; Neoplasms ; Glycomics ; Glycosylation ; Humans ; Polysaccharides
9.
Sci Total Environ ; 837: 155856, 2022 Sep 01.
Article in English | MEDLINE | ID: mdl-35561926

ABSTRACT

Droughts are among the most devastating and recurrent natural disasters. Among the different drought studies, drought forecasting is one of the key aspects of effective drought management. Drought occurrence is driven by a combination of hydro-meteorological and climatic factors. These variables are non-linear in nature, and neural networks have been found to forecast drought effectively. However, classical neural networks often succumb to over-fitting due to the various lag components among the variables, a problem that newer deep learning and explainable models can effectively address. The present study uses an attention-based model to forecast meteorological droughts (Standardized Precipitation Index) at a short-term forecast range (1-3 months) for five sites in Eastern Australia. The main aim of the work is to interpret the model outcomes and examine how a deep neural network achieves the forecasting results. The plots show the importance of the variables along with their short-term and long-term dependencies at different lead times. The results indicate the importance of large-scale climatic indices at different sequence dependencies specific to the study site, thus providing an example of the necessity of building a spatio-temporal explainable AI model for drought forecasting. Such interpretable models would help decision-makers and planners adopt data-driven models as an effective measure for drought forecasting, as they provide transparency and trust.


Subjects
Droughts ; Meteorology ; Australia ; Forecasting ; Neural Networks, Computer
10.
BMC Plant Biol ; 22(1): 180, 2022 Apr 08.
Article in English | MEDLINE | ID: mdl-35395721

ABSTRACT

Recent growth in crop genomic and trait data has opened opportunities for the application of novel approaches to accelerate crop improvement. Machine learning and deep learning are at the forefront of prediction-based data analysis. However, few approaches for genotype-to-phenotype prediction compare machine learning with deep learning and further interpret the models that support the predictions. This study uses genome-wide molecular markers and traits across 1110 soybean individuals to develop accurate prediction models. For 13 of 14 sets of predictions, XGBoost or random forest outperformed deep learning models in prediction performance. Top-ranked SNPs by F-score were identified from XGBoost and, upon further investigation, were found to overlap with significantly associated loci identified from GWAS and previous literature. Feature importance rankings were used to reduce marker input by up to 90%, and the subsequent models maintained or improved their prediction performance. These findings support interpretable machine learning as an approach for genomic prediction of traits in soybean and other crops.
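The marker-reduction step above can be sketched as ranking markers by an importance score and keeping only the top fraction for retraining. SNP names and F-scores below are invented for illustration:

```python
def reduce_markers(importances, keep_fraction=0.1):
    """Rank markers by an importance score (e.g. an XGBoost F-score)
    and keep only the top fraction as model input for retraining."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Hypothetical SNP F-scores (names and values invented).
fscores = {"snp_%02d" % i: score for i, score in enumerate(
    [3, 41, 7, 0, 19, 2, 55, 1, 8, 4])}
top = reduce_markers(fscores, keep_fraction=0.2)
```

A model retrained on `top` alone would then be compared against the full-marker model, as the study does with its up-to-90% reductions.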


Subjects
Deep Learning ; Glycine max ; Genotype ; Machine Learning ; Phenotype ; Glycine max/genetics
11.
Comput Biol Med ; 145: 105388, 2022 06.
Article in English | MEDLINE | ID: mdl-35349798

ABSTRACT

BACKGROUND AND OBJECTIVE: Diabetes mellitus manifests as prolonged elevated blood glucose levels resulting from impaired insulin production. Such high glucose levels over a long period of time damage multiple internal organs. To mitigate this condition, researchers and engineers have developed the closed-loop artificial pancreas, consisting of a continuous glucose monitor and an insulin pump connected via a microcontroller or smartphone. A problem, however, is how to accurately predict short-term future glucose levels in order to exert efficient glucose-level control. Much work in the literature focuses on least prediction error as a key metric and therefore pursues complex prediction methods such as deep learning. Such an approach neglects other important and significant design issues such as method complexity (impacting interpretability and safety), hardware requirements for low-power devices such as the insulin pump, the required amount of input data for training (potentially rendering the method infeasible for new patients), and the fact that very small improvements in accuracy may not have significant clinical benefit. METHODS: We propose a novel low-complexity, explainable blood glucose prediction method derived from the Intel P6 branch predictor algorithm. We use Meta-Differential Evolution to determine predictor parameters on training-data splits of the benchmark datasets we use. A comparison is made between our new algorithm and a state-of-the-art deep-learning method for blood glucose level prediction. RESULTS: To evaluate the new method, the Blood Glucose Level Prediction Challenge benchmark dataset is utilised. On the official test data split after training, the state-of-the-art deep learning method predicted glucose levels 30 min ahead of the current time with 96.3% of predicted glucose levels having relative error less than 30% (which is equivalent to the safe zone of the Surveillance Error Grid).
Our simpler, interpretable approach prolonged the prediction horizon by another 5 min with 95.8% of predicted glucose levels of all patients having relative error less than 30%. CONCLUSIONS: When considering predictive performance as assessed using the Blood Glucose Level Prediction Challenge benchmark dataset and Surveillance Error Grid metrics, we found that the new algorithm delivered comparable predictive accuracy performance, while operating only on the glucose-level signal with considerably less computational complexity.
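The Intel P6 predictor that the method builds on is based on two-bit saturating counters. A loose, illustrative sketch of that core element applied to up/down glucose-trend outcomes (the published method's actual state machine, indexing, and parameters differ and are tuned by Meta-Differential Evolution):

```python
class TwoBitPredictor:
    """Two-bit saturating counter, the core element of a P6-style
    branch predictor. States 0-1 predict "not up", states 2-3 predict
    "up"; each observed outcome nudges the counter by one step."""
    def __init__(self):
        self.state = 2  # start weakly predicting "up"

    def predict(self):
        return self.state >= 2

    def update(self, went_up):
        if went_up:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]  # toy glucose trend directions
hits = 0
for o in outcomes:
    hits += (p.predict() == o)
    p.update(o)
```

The appeal for an insulin-pump setting is exactly what the abstract argues: a handful of counter states is interpretable and cheap enough for low-power hardware.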


Subjects
Blood Glucose Self-Monitoring ; Diabetes Mellitus, Type 1 ; Algorithms ; Blood Glucose ; Humans ; Insulin
12.
IEEE Trans Signal Process ; 70: 5954-5966, 2022.
Article in English | MEDLINE | ID: mdl-36777018

ABSTRACT

Probabilistic generative models are attractive for scientific modeling because their inferred parameters can be used to generate hypotheses and design experiments. This requires that the learned model provides an accurate representation of the input data and yields a latent space that effectively predicts outcomes relevant to the scientific question. Supervised Variational Autoencoders (SVAEs) have previously been used for this purpose, as a carefully designed decoder can be used as an interpretable generative model of the data, while the supervised objective ensures a predictive latent representation. Unfortunately, the supervised objective forces the encoder to learn a biased approximation to the generative posterior distribution, which renders the generative parameters unreliable when used in scientific models. This issue has remained undetected as reconstruction losses commonly used to evaluate model performance do not detect bias in the encoder. We address this previously-unreported issue by developing a second-order supervision framework (SOS-VAE) that updates the decoder parameters, rather than the encoder, to induce a predictive latent representation. This ensures that the encoder maintains a reliable posterior approximation and the decoder parameters can be effectively interpreted. We extend this technique to allow the user to trade-off the bias in the generative parameters for improved predictive performance, acting as an intermediate option between SVAEs and our new SOS-VAE. We also use this methodology to address missing data issues that often arise when combining recordings from multiple scientific experiments. We demonstrate the effectiveness of these developments using synthetic data and electrophysiological recordings with an emphasis on how our learned representations can be used to design scientific experiments.

13.
Neural Comput Appl ; 34(1): 67-78, 2022.
Article in English | MEDLINE | ID: mdl-33935376

ABSTRACT

We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions, avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-realize reject options for uncertain data. These options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. The rejected sequences invite speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
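A nearest-prototype decision with a distance-based reject option, the basic mechanism behind LVQ's self-controlled evidence, can be sketched as follows. Prototype positions, labels, and the reject radius are illustrative; the paper's matrix LVQ additionally learns a parametrized dissimilarity measure:

```python
import math

def lvq_classify(x, prototypes, reject_radius):
    """Nearest-prototype decision with a simple reject option: if even
    the closest prototype is farther than `reject_radius`, refuse to
    classify. `prototypes` is a list of (label, vector) pairs."""
    label, dist = None, float("inf")
    for lab, w in prototypes:
        d = math.dist(x, w)
        if d < dist:
            label, dist = lab, d
    return label if dist <= reject_radius else None

protos = [("typeA", (0.0, 0.0)), ("typeB", (4.0, 0.0))]
print(lvq_classify((0.5, 0.2), protos, reject_radius=1.5))  # typeA
print(lvq_classify((2.0, 9.0), protos, reject_radius=1.5))  # None -> rejected
```

A returned `None` plays the role of an "atypical sample" flag, which is how rejected sequences become candidates for new virus types.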

14.
Entropy (Basel) ; 23(10)2021 Oct 17.
Article in English | MEDLINE | ID: mdl-34682081

ABSTRACT

In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
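The Shannon-entropy case of the mutual information function can be sketched as follows: for a lag k, estimate the joint distribution of symbol pairs at distance k and compare it to the product of the marginals. This is a minimal sketch; the resolved variants in the article refine this per symbol, and the Rényi and Tsallis versions swap in generalized entropies:

```python
import math
from collections import Counter

def mutual_information(seq, k):
    """Shannon mutual information (in bits) between symbols at distance
    k: I(k) = sum_{a,b} p_ab * log2(p_ab / (p_a * q_b)), with p_ab the
    joint frequency of pairs (seq[i], seq[i+k])."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    n = len(pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi
```

Evaluating I(k) over a range of lags yields the fingerprint curve used as classifier input; for a strictly period-2 binary sequence such as "ABABABAB", I(2) attains the 1-bit maximum.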

15.
Gigascience ; 9(3)2020 03 01.
Article in English | MEDLINE | ID: mdl-32150601

ABSTRACT

BACKGROUND: Microbiome biomarker discovery for patient diagnosis, prognosis, and risk evaluation is attracting broad interest. Selected groups of microbial features provide signatures that characterize host disease states such as cancer or cardio-metabolic diseases. Yet, the current predictive models stemming from machine learning still behave as black boxes and seldom generalize well. Their interpretation is challenging for physicians and biologists, which makes them difficult to trust and use routinely in the physician-patient decision-making process. Novel methods that provide interpretability and biological insight are needed. Here, we introduce "predomics", an original machine learning approach inspired by microbial ecosystem interactions that is tailored for metagenomics data. It discovers accurate predictive signatures and provides unprecedented interpretability. The decision provided by the predictive model is based on a simple, yet powerful score computed by adding, subtracting, or dividing cumulative abundance of microbiome measurements. RESULTS: Tested on >100 datasets, we demonstrate that predomics models are simple and highly interpretable. Even with such simplicity, they are at least as accurate as state-of-the-art methods. The family of best models, discovered during the learning process, offers the ability to distil biological information and to decipher the predictability signatures of the studied condition. In a proof-of-concept experiment, we successfully predicted body corpulence and metabolic improvement after bariatric surgery using pre-surgery microbiome data. CONCLUSIONS: Predomics is a new algorithm that helps in providing reliable and trustworthy diagnostic decisions in the microbiome field. Predomics is in accord with societal and legal requirements that plead for an explainable artificial intelligence approach in the medical field.
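The score underlying a predomics model of the additive/subtractive kind can be sketched in a few lines: add the cumulative abundance of one group of taxa, subtract another, and threshold the result. Taxa names, abundances, and the zero threshold below are invented for illustration:

```python
def predomics_score(abundances, plus, minus):
    """Ternary-style score in the spirit of predomics: cumulative
    abundance of the `plus` taxa minus that of the `minus` taxa.
    The sign/threshold of the score drives the class decision."""
    return (sum(abundances.get(t, 0.0) for t in plus)
            - sum(abundances.get(t, 0.0) for t in minus))

# Hypothetical relative abundances for one sample.
sample = {"taxon_a": 0.30, "taxon_b": 0.05, "taxon_c": 0.20}
score = predomics_score(sample, plus=["taxon_a"],
                        minus=["taxon_b", "taxon_c"])
label = "case" if score > 0 else "control"
```

Because the decision reduces to adding and subtracting a handful of abundances, a physician can read the signature directly, which is the interpretability claim of the abstract.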


Subjects
Gastrointestinal Microbiome/genetics ; Metagenome ; Metagenomics/methods ; Humans ; Models, Genetic ; Support Vector Machine
16.
BioData Min ; 12: 1, 2019.
Article in English | MEDLINE | ID: mdl-30627219

ABSTRACT

BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in the life sciences, they have become increasingly important for handling the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance but also gain complexity, and tend to neglect interpretability and comprehensibility of the resulting models. RESULTS: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method that provides comprehensive visualization capabilities not present in other classifiers, which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for the imbalanced classification problems frequent in the life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR) that have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic curve of 76.6% was achieved, which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/. CONCLUSIONS: The application to EFR prediction demonstrates how easy interpretation of classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR which were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements and they participate in networks of hydrophobic residues. Visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results.

17.
Article in English | MEDLINE | ID: mdl-34335110

ABSTRACT

Variable importance (VI) tools describe how much covariates contribute to a prediction model's accuracy. However, important variables for one well-performing model (for example, a linear model f(x) = xᵀβ with a fixed coefficient vector β) may be unimportant for another model. In this paper, we propose model class reliance (MCR) as the range of VI values across all well-performing models in a prespecified class. Thus, MCR gives a more comprehensive description of importance by accounting for the fact that many prediction models, possibly of different parametric forms, may fit the data well. In the process of deriving MCR, we show several informative results for permutation-based VI estimates, based on the VI measures used in Random Forests. Specifically, we derive connections between permutation importance estimates for a single prediction model, U-statistics, conditional variable importance, conditional causal effects, and linear model coefficients. We then give probabilistic bounds for MCR using a novel, generalizable technique. We apply MCR to a public dataset of Broward County criminal records to study the reliance of recidivism prediction models on sex and race. In this application, MCR can be used to help inform VI for unknown, proprietary models.
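The permutation-based VI estimate that MCR builds on can be sketched model-agnostically: shuffle one feature column and record the average drop in performance. The model, data, and metric below are toy stand-ins, not the paper's recidivism models:

```python
import random

def permutation_importance(model, X, y, feature, metric,
                           n_repeats=10, seed=0):
    """Random-Forest-style permutation VI: shuffle the values of one
    feature across rows and measure the mean drop in the metric.
    `model` is any callable row -> prediction."""
    rng = random.Random(seed)
    baseline = metric([model(row) for row in X], y)
    col = [row[feature] for row in X]
    drops = []
    for _ in range(n_repeats):
        shuffled = col[:]
        rng.shuffle(shuffled)
        Xp = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, shuffled)]
        drops.append(baseline - metric([model(row) for row in Xp], y))
    return sum(drops) / n_repeats

def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# A model that depends only on feature 0: permuting feature 1 is free.
model = lambda row: int(row[0] > 0.5)
X = [[0.9, 5.0], [0.1, 5.0], [0.8, 5.0], [0.2, 5.0]]
y = [1, 0, 1, 0]
```

MCR then asks how large and how small this quantity can be across every model in a class that fits almost as well as the best one, rather than for a single fitted model.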
