ABSTRACT
MOTIVATION: Mutational signatures are a critical component in deciphering the genetic alterations that underlie cancer development and have become a valuable resource to understand the genomic changes during tumorigenesis. Therefore, it is essential to employ precise and accurate methods for their extraction to ensure that the underlying patterns are reliably identified and can be effectively utilized in new strategies for diagnosis, prognosis, and treatment of cancer patients. RESULTS: We present MUSE-XAE, a novel method for mutational signature extraction from cancer genomes using an explainable autoencoder. Our approach employs a hybrid architecture consisting of a nonlinear encoder that can capture nonlinear interactions among features, and a linear decoder which ensures the interpretability of the active signatures. We evaluated and compared MUSE-XAE with other available tools on both synthetic and real cancer datasets and demonstrated that it achieves superior performance in terms of precision and sensitivity in recovering mutational signature profiles. MUSE-XAE extracts highly discriminative mutational signature profiles, enhancing the classification of primary tumour types and subtypes in real-world settings. This approach could facilitate further research in this area, with neural networks playing a critical role in advancing our understanding of cancer genomics. AVAILABILITY AND IMPLEMENTATION: MUSE-XAE software is freely available at https://github.com/compbiomed-unito/MUSE-XAE.
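The hybrid encoder/decoder idea can be illustrated with a minimal NumPy sketch (toy data and dimensions; this is not the MUSE-XAE implementation): a ReLU encoder maps mutation counts to non-negative exposures, while a linear, non-negativity-constrained decoder plays the role of the signature matrix and remains directly interpretable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 genomes x 96 mutation channels generated from 3 hidden
# non-negative "signatures" (stand-ins for SBS-like profiles).
X = rng.random((50, 3)) @ rng.random((3, 96))

n_feat, n_hid, n_sig = 96, 16, 3
W1 = rng.normal(0, 0.1, (n_feat, n_hid))         # nonlinear encoder layer
W2 = rng.normal(0, 0.1, (n_hid, n_sig))
S = np.abs(rng.normal(0, 0.1, (n_sig, n_feat)))  # linear decoder = signatures

lr, losses = 0.5, []
for _ in range(500):
    Z = X @ W1
    H = np.maximum(Z, 0)            # ReLU hidden layer
    A = H @ W2
    E = np.maximum(A, 0)            # non-negative exposures
    Xhat = E @ S                    # interpretable linear reconstruction
    R = Xhat - X
    losses.append((R ** 2).mean())
    # Manual backprop of the mean squared error.
    dXhat = 2 * R / R.size
    dS = E.T @ dXhat
    dA = (dXhat @ S.T) * (A > 0)
    dW2 = H.T @ dA
    dZ = (dA @ W2.T) * (Z > 0)
    dW1 = X.T @ dZ
    W1 -= lr * dW1
    W2 -= lr * dW2
    S = np.maximum(S - lr * dS, 0)  # projected step keeps signatures >= 0

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the decoder is linear and non-negative, each row of S can be read as a mutational signature profile, which is the interpretability property the abstract refers to.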
Subjects
Mutation; Neoplasms; Humans; Neoplasms/genetics; Algorithms; Software; Genomics/methods; Computational Biology/methods; Neural Networks, Computer
ABSTRACT
One of the primary challenges in human genetics is determining the functional impact of single nucleotide variants (SNVs) and insertions and deletions (InDels), whether coding or noncoding. In the past, methods have been created to detect disease-related single amino acid changes, but only some can assess the influence of noncoding variations. CADD is the most commonly used and advanced algorithm for predicting the diverse effects of genome variations. It employs a combination of sequence conservation and functional features derived from the ENCODE project data. To use CADD, a large set of pre-calculated information must be downloaded during the installation process. To streamline the variant annotation process, we developed PhD-SNPg, a machine-learning tool that is easy to install and lightweight, relying solely on sequence-based features. Here we present an updated version, trained on a larger dataset, that can also predict the impact of InDel variants. Despite its simplicity, PhD-SNPg performs similarly to CADD, making it ideal for rapid genome interpretation and as a benchmark for tool development.
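As a sketch of what "sequence-based features" can look like (an illustrative encoding; PhD-SNPg's actual feature set, which also includes conservation scores, is not reproduced here), a variant's nucleotide context can be turned into a fixed-length vector:

```python
# Sketch: one-hot encoding of the +/-2 bp context around a variant,
# plus the alternate allele - a generic sequence-only feature vector
# of the kind a lightweight predictor can be trained on.
ALPHABET = "ACGT"

def one_hot(base):
    return [1.0 if base == b else 0.0 for b in ALPHABET]

def context_features(sequence, pos, alt, flank=2):
    """Encode the window [pos-flank, pos+flank] plus the alternate allele."""
    window = sequence[pos - flank : pos + flank + 1]
    feats = []
    for base in window:
        feats.extend(one_hot(base))
    feats.extend(one_hot(alt))  # the substituted base
    return feats

seq = "GGACGTTACG"
x = context_features(seq, pos=4, alt="A")
print(len(x))  # (2*flank+1 + 1) bases x 4 channels = 24 features
```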
Subjects
Algorithms; Genome, Human; Humans; INDEL Mutation; Machine Learning; Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Seafood is increasingly traded worldwide, but its supply chain is particularly prone to fraud. To increase consumer confidence, prevent illegal trade, and provide independent validation for eco-labelling, accurate tools for seafood traceability are needed. Here we show that the use of microbiome profiling (MP) coupled with machine learning (ML) allows precise tracing of the origin of Manila clams harvested in areas separated by small geographic distances. The study was designed to represent a real-world scenario. Clams were collected in different seasons across the most important production area in Europe (lagoons along the northern Adriatic coast) to cover the known seasonal variation in microbiome composition for the species. Before DNA extraction, samples underwent the same depuration process as commercial products (i.e. at least 12 h in open flow systems). RESULTS: Machine learning-based analysis of microbiome profiles was carried out using two completely independent sets of data (collected at the same locations but in different years), one for training the algorithm and the other for testing its accuracy and assessing the temporal stability of the signal. Briefly, gills (GI) and digestive glands (DG) of clams were collected in summer and winter over two different years (i.e. from 2018 to 2020) in one banned area and four farming sites. 16S DNA metabarcoding was performed on clam tissues and the resulting amplicon sequence variant (ASV) table was used as input for the ML analysis of microbiome profiles. The best predictive performance was obtained using the combined information of GI and DG (consensus analysis), showing a Cohen's kappa score > 0.95 when the task was separating samples collected from the banned area from those harvested at farming sites. Classification of the four different farming areas showed slightly lower accuracy, with a score of 0.76. CONCLUSIONS: We show here that MP coupled with ML is an effective tool to trace the origin of shellfish products.
The tool is extremely robust against seasonal and inter-annual variability, as well as product depuration, and is ready for implementation in routine assessment to prevent the trade of illegally harvested or mislabeled shellfish.
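The reported score is Cohen's kappa, i.e. classification agreement corrected for chance. A self-contained sketch on toy harvest-area labels (not the study's data):

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    freq_t = Counter(y_true)
    freq_p = Counter(y_pred)
    labels = set(freq_t) | set(freq_p)
    expected = sum(freq_t[l] * freq_p[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: classifying samples as 'banned' vs one of two farms.
y_true = ["banned", "banned", "farmA", "farmA", "farmB", "farmB"]
y_pred = ["banned", "banned", "farmA", "farmB", "farmB", "farmB"]
print(round(cohen_kappa(y_true, y_pred), 3))  # -> 0.75
```

Unlike raw accuracy, kappa stays near zero for a classifier that merely reproduces the label frequencies, which is why it is the usual choice for multi-class traceability tasks.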
Subjects
Bivalvia; Machine Learning; Microbiota; Seafood; Seafood/microbiology; Animals; Bivalvia/microbiology; Commerce
ABSTRACT
OBJECTIVE: Hyperferritinaemia is associated with liver fibrosis severity in patients with metabolic dysfunction-associated steatotic liver disease (MASLD), but the longitudinal implications have not been thoroughly investigated. We assessed the role of serum ferritin in predicting long-term outcomes or death. DESIGN: We evaluated the relationship between baseline serum ferritin and longitudinal events in a multicentre cohort of 1342 patients. Four survival models considering ferritin with confounders or non-invasive scoring systems were applied with a repeated five-fold cross-validation scheme. Prediction performance was evaluated in terms of Harrell's C-index and its improvement by including ferritin as a covariate. RESULTS: Median follow-up time was 96 months. Liver-related events occurred in 7.7%, hepatocellular carcinoma in 1.9%, cardiovascular events in 10.9%, extrahepatic cancers in 8.3% and all-cause mortality in 5.8%. Hyperferritinaemia was associated with a 50% increased risk of liver-related events and 27% of all-cause mortality. A stepwise increase in baseline ferritin thresholds was associated with a statistical increase in C-index, ranging between 0.02 (lasso-penalised Cox regression) and 0.03 (ridge-penalised Cox regression); the risk of developing liver-related events mainly increased from threshold 215.5 µg/L (median HR=1.71 and C-index=0.71) and the risk of overall mortality from threshold 272 µg/L (median HR=1.49 and C-index=0.70). The inclusion of serum ferritin thresholds (215.5 µg/L and 272 µg/L) in predictive models increased the performance of Fibrosis-4 and Non-Alcoholic Fatty Liver Disease Fibrosis Score in the longitudinal risk assessment of liver-related events (C-indices>0.71) and overall mortality (C-indices>0.65). CONCLUSIONS: This study supports the potential use of serum ferritin values for predicting the long-term prognosis of patients with MASLD.
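Harrell's C-index used here measures how well a risk score orders survival times over comparable pairs of patients. A minimal sketch on toy right-censored data:

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs in which the
    subject with the shorter observed event time has the higher risk.
    events: 1 = event observed, 0 = censored."""
    conc, ties, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if subject i had the event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    conc += 1
                elif risk_scores[i] == risk_scores[j]:
                    ties += 1
    return (conc + 0.5 * ties) / comparable

times = [5, 8, 12, 20, 25]      # months of follow-up (toy data)
events = [1, 1, 0, 1, 0]        # 0 = censored
risk = [0.9, 0.7, 0.6, 0.4, 0.1]
print(harrell_c(times, events, risk))  # perfectly ordered -> 1.0
```

A C-index of 0.5 corresponds to random ordering, so the reported gains of 0.02-0.03 on top of C-indices around 0.7 are meaningful improvements in discrimination.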
Subjects
Liver Neoplasms; Metabolic Diseases; Non-alcoholic Fatty Liver Disease; Humans; Non-alcoholic Fatty Liver Disease/pathology; Liver Cirrhosis/pathology; Fibrosis; Liver Neoplasms/complications; Ferritins
ABSTRACT
Predicting the difference in thermodynamic stability between protein variants is crucial for protein design and understanding the genotype-phenotype relationships. So far, several computational tools have been created to address this task. Nevertheless, most of them have been trained or optimized on the same and 'all' available data, making a fair comparison unfeasible. Here, we introduce a novel dataset, collected and manually cleaned from the latest version of the ThermoMutDB database, consisting of 669 variants not included in the most widely used training datasets. The prediction performance and the ability to satisfy the antisymmetry property by considering both direct and reverse variants were evaluated across 21 different tools. The Pearson correlations of the tested tools were in the ranges of 0.21-0.5 and 0-0.45 for the direct and reverse variants, respectively. When both direct and reverse variants are considered, the antisymmetric methods perform better, achieving a Pearson correlation in the range of 0.51-0.62. The tested methods seem relatively insensitive to the physiological conditions, performing well also on the variants measured with more extreme pH and temperature values. A common issue with all the tested methods is the compression of the ΔΔG predictions toward zero. Furthermore, the thermodynamic stability of the most significantly stabilizing variants was found to be more challenging to predict. This study is the most extensive comparison of prediction methods using an entirely novel set of variants never tested before.
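The evaluation scheme described above, direct and reverse variants plus the antisymmetry check, can be sketched with a toy predictor (all numbers are synthetic; the predictor is hypothetical and only mimics the reported compression toward zero):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy experimental ddG values for "direct" variants; by definition the
# reverse variant has the opposite ddG.
ddg_direct = rng.normal(0, 1.5, 200)
ddg_reverse = -ddg_direct

# Hypothetical predictor: compresses toward zero (slope 0.5) and adds
# a small systematic offset on reverse variants.
pred_direct = 0.5 * ddg_direct + rng.normal(0, 0.5, 200)
pred_reverse = 0.5 * ddg_reverse + rng.normal(0, 0.5, 200) - 0.3

pearson_direct = np.corrcoef(ddg_direct, pred_direct)[0, 1]
pearson_reverse = np.corrcoef(ddg_reverse, pred_reverse)[0, 1]
# Antisymmetry bias: mean of (direct + reverse) predictions; 0 if perfect.
bias = np.mean(pred_direct + pred_reverse)
print(f"r_dir={pearson_direct:.2f} r_rev={pearson_reverse:.2f} bias={bias:.2f}")
```

The bias statistic and the shrunken prediction spread are exactly the two failure modes the benchmark quantifies on real tools.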
Subjects
Point Mutation; Proteins; Mutation; Protein Stability; Proteins/chemistry; Thermodynamics
ABSTRACT
MOTIVATION: The prediction of reliable Drug-Target Interactions (DTIs) is a key task in computer-aided drug design and repurposing. Here, we present a new approach based on data fusion for DTI prediction built on top of the NXTfusion library, which generalizes the Matrix Factorization paradigm by extending it to the nonlinear inference over Entity-Relation graphs. RESULTS: We benchmarked our approach on five datasets and we compared our models against state-of-the-art methods. Our models outperform most of the existing methods and, simultaneously, retain the flexibility to predict both DTIs as binary classification and regression of the real-valued drug-target affinity, competing with models built explicitly for each task. Moreover, our findings suggest that the validation of DTI methods should be stricter than what has been proposed in some previous studies, focusing more on mimicking real-life DTI settings where predictions for previously unseen drugs, proteins, and drug-protein pairs are needed. These settings are exactly the context in which the benefit of integrating heterogeneous information with our Entity-Relation data fusion approach is the most evident. AVAILABILITY AND IMPLEMENTATION: All software and data are available at https://github.com/eugeniomazzone/CPI-NXTFusion and https://pypi.org/project/NXTfusion/.
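The Matrix Factorization paradigm that NXTfusion generalizes can be sketched as learning low-rank drug and target embeddings by gradient descent on the observed entries of an affinity matrix (toy data; this is not the NXTfusion API, which adds nonlinear inference over Entity-Relation graphs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy affinity matrix: 8 drugs x 6 targets with rank-2 ground truth;
# ~30% of entries are held out as unobserved drug-target pairs.
Y = rng.normal(size=(8, 2)) @ rng.normal(size=(6, 2)).T
mask = rng.random(Y.shape) > 0.3          # True = observed pair

k, lr, lam = 2, 0.05, 0.01
D = rng.normal(0, 0.3, (8, k))            # drug embeddings
T = rng.normal(0, 0.3, (6, k))            # target embeddings
for _ in range(2000):
    E = (D @ T.T - Y) * mask              # error on observed entries only
    D, T = D - lr * (E @ T + lam * D), T - lr * (E.T @ D + lam * T)

obs_rmse = np.sqrt((((D @ T.T - Y) * mask) ** 2).sum() / mask.sum())
print(f"RMSE on observed pairs: {obs_rmse:.3f}")
```

The masked entries play the role of unseen drug-protein pairs: the learned embeddings still produce a prediction D @ T.T for them, which is the regression setting the abstract describes.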
Subjects
Drug Development; Software; Proteins; Drug Interactions; Drug Design
ABSTRACT
Genetic markers (especially short tandem repeats or STRs) located on the X chromosome are a valuable resource to solve complex kinship cases in forensic genetics, in addition to, or as an alternative to, autosomal STRs. Groups of tightly linked markers are combined into haplotypes, thus increasing the discriminating power of tests. However, this approach requires precise knowledge of the recombination rates between adjacent markers. The International Society of Forensic Genetics recommends that recombination rate estimation on the X chromosome be performed from pedigree genetic data while taking into account the confounding effect of mutations. However, implementations that satisfy these requirements have several drawbacks: they were never publicly released, they are very slow and/or need cluster-level hardware and strong computational expertise to use. To address these key concerns, we developed Recombulator-X, a new open-source Python tool. The most challenging issue, namely the running time, was addressed with dynamic programming techniques to greatly reduce the computational complexity of the algorithm. Compared to the previous methods, Recombulator-X reduces the estimation times from weeks or months to less than one hour for typical datasets. Moreover, the estimation process, including preprocessing, has been streamlined and packaged into a simple command-line tool that can be run on a normal PC. Where previous approaches were limited to small panels of STR markers (up to 15), our tool can handle greater numbers (up to 100) of mixed STR and non-STR markers. In conclusion, Recombulator-X makes the estimation process much simpler, faster and accessible to researchers without a computational background, hopefully spurring increased adoption of best practices.
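The dynamic-programming idea can be illustrated on the simplest case: along ordered markers, the grandparental origin of a transmitted haplotype forms a Markov chain over per-interval recombination fractions, so a pattern's probability is a linear-time product rather than an enumeration over haplotype configurations (an illustrative sketch, not the Recombulator-X algorithm, which also models mutations and incomplete pedigree information):

```python
# Sketch: the probability of a gamete's grandparental-origin pattern
# across ordered markers factorizes as a Markov chain over per-interval
# recombination fractions, so it is computed in O(markers) time instead
# of enumerating the 2^markers possible origin configurations.

def pattern_probability(origins, rec_fractions):
    """origins: grandparental origin (0/1) at each marker of one gamete;
    rec_fractions: recombination fraction of each adjacent interval."""
    p = 0.5  # either grandparental origin is equally likely at marker 1
    for prev, curr, theta in zip(origins, origins[1:], rec_fractions):
        p *= theta if prev != curr else (1.0 - theta)
    return p

# Three markers, one recombination between markers 2 and 3.
print(pattern_probability([0, 0, 1], [0.1, 0.2]))  # 0.5 * 0.9 * 0.2
```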
ABSTRACT
Beta-blockers are a crucial part of post-myocardial infarction (MI) pharmacological therapy. Recent studies have raised questions about their efficacy in patients without reduced left ventricular ejection fraction (LVEF). This study aims to assess adherence to beta-blockers after discharge for ST-segment elevation myocardial infarction (STEMI) and the impact of adherence on outcomes based on LVEF at discharge. The retrospective registry FAST-STEMI evaluated real-world adherence to main cardiovascular drugs in STEMI patients between 2012 and 2017 by comparing purchased tablets to expected ones at one year through pharmacy registries. Optimal adherence was defined as ≥80%. Primary outcomes included all-cause and cardiovascular death, while secondary outcomes were myocardial infarction, major/minor bleeding events, and ischemic stroke. The study included 4688 patients discharged on beta-blockers. Mean age was 64 ± 12.3 years, 76% were male, and mean LVEF was 49.2 ± 8.8%. Mean adherence at one year was 87.1%. Optimal adherence was associated with lower all-cause (adjHR 0.62, 95%CI 0.41-0.92, p=0.02) and cardiovascular mortality (adjHR 0.55, 95%CI 0.26-0.98, p=0.043). In patients with LVEF ≤40%, optimal adherence was linked to reduced all-cause and cardiovascular mortality, but this was not found in patients with either preserved or mildly reduced LVEF. Predictors of cardiovascular mortality included older age, chronic kidney disease, male gender, and atrial fibrillation. Optimal adherence to beta-blocker therapy in all-comers STEMI patients reduced all-cause and cardiovascular mortality at 1 year; once stratified by LVEF, this effect was confirmed only in patients with reduced LVEF (<40%) at hospital discharge.
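The adherence definition used in the registry reduces to a simple ratio of purchased to expected tablets, with the 80% cut-off for "optimal" adherence:

```python
def adherence(purchased_tablets, expected_tablets):
    """Proportion of expected supply actually purchased, a proxy for
    adherence as measured through pharmacy registries."""
    return purchased_tablets / expected_tablets

def is_optimal(purchased_tablets, expected_tablets, threshold=0.80):
    return adherence(purchased_tablets, expected_tablets) >= threshold

# One tablet per day expected over a year of follow-up (illustrative).
print(is_optimal(purchased_tablets=320, expected_tablets=365))  # 320/365 ~ 0.88
print(is_optimal(purchased_tablets=250, expected_tablets=365))  # 250/365 ~ 0.68
```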
ABSTRACT
In many cases, the unprecedented availability of data provided by high-throughput sequencing has shifted the bottleneck from data availability to data interpretation, thus delaying the promised breakthroughs in genetics and precision medicine (in human genetics) and in phenotype prediction to improve plant adaptation to climate change and resistance to bioaggressors (in plant sciences). In this paper, we propose a novel Genome Interpretation paradigm, which aims at directly modeling the genotype-to-phenotype relationship, and we focus on A. thaliana since it is the best-studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes in/phenotypes out paradigm and it is trained to predict 288 real-valued Arabidopsis thaliana phenotypes from Whole Genome sequencing data. We show that 75 of these phenotypes are predicted with a Pearson correlation ≥0.4, and are mostly related to flowering traits. We show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models predicting single phenotypes from the GWAS-derived known associated genes. Galiana is also fully interpretable, thanks to gradient-based Saliency Map approaches. We followed this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, finding evidence for 6 of them in the existing literature.
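Gradient-based saliency assigns each input the magnitude of the model output's gradient with respect to it. A minimal sketch with a hypothetical stand-in model and finite-difference gradients (Galiana itself applies autodiff saliency to a deep network over whole-genome inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in model: a phenotype as a nonlinear function of 10
# "variant" inputs; only inputs 2 and 7 actually matter.
W = np.zeros(10)
W[2], W[7] = 1.5, -2.0

def model(x):
    return np.tanh(x @ W)

def saliency(x, eps=1e-5):
    """Gradient magnitude of the output w.r.t. each input
    (central finite differences as a stand-in for autodiff)."""
    grads = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grads[i] = (model(xp) - model(xm)) / (2 * eps)
    return np.abs(grads)

x = rng.integers(0, 2, 10).astype(float)   # toy genotype vector
s = saliency(x)
top2 = set(np.argsort(s)[-2:])
print(top2)  # the two truly causal inputs
```

Ranking genomic positions by saliency is how candidate genes (here, the 36 flowering-trait genes) are proposed for follow-up.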
Subjects
Arabidopsis; Arabidopsis/genetics; Genome; Genome-Wide Association Study; Genotype; Neural Networks, Computer; Phenotype; Whole Genome Sequencing
ABSTRACT
Estimating the functional effect of single amino acid variants in proteins is fundamental for predicting the change in thermodynamic stability, measured as the difference in the Gibbs free energy of unfolding between the wild-type and the variant protein (ΔΔG). Here, we present the web server of the DDGun method, which was previously developed for the prediction of ΔΔG upon amino acid variants. DDGun is an untrained method based on basic features derived from evolutionary information. It is antisymmetric, as it predicts opposite ΔΔG values for direct (A → B) and reverse (B → A) single and multiple site variants. DDGun is available in two versions, one based on only sequence information and the other one based on sequence and structure information. Despite being untrained, DDGun reaches prediction performances comparable to those of trained methods. Here we make DDGun available as a web server. For the web server version, we updated the protein sequence database used for the computation of the evolutionary features, and we compiled two new data sets of protein variants to do a blind test of its performances. On these blind data sets of single and multiple site variants, DDGun confirms its prediction performance, reaching an average correlation coefficient between experimental and predicted ΔΔG of 0.45 and 0.49 for the sequence-based and structure-based versions, respectively. Besides being used for the prediction of ΔΔG, we suggest that DDGun should be adopted as a benchmark method to assess the predictive capabilities of newly developed methods. Releasing DDGun as a web server, stand-alone program and Docker image will facilitate the necessary process of method comparison to improve ΔΔG prediction.
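Antisymmetry can be guaranteed by construction: for any raw scoring function g over (wild-type, mutant) pairs, the score f(a, b) = (g(a, b) - g(b, a)) / 2 satisfies f(a, b) = -f(b, a) exactly. A sketch with a deliberately crude, hypothetical raw score (DDGun's actual score is built from evolutionary features, not shown here):

```python
# Generic antisymmetrization of an arbitrary, non-symmetric raw score.

# Hypothetical raw score: a crude hydrophobicity-based quantity.
HYDRO = {"A": 1.8, "R": -4.5, "L": 3.8, "D": -3.5, "G": -0.4}

def g(wt, mut):
    return HYDRO[mut] - 0.8 * HYDRO[wt]   # deliberately non-symmetric

def ddg_pred(wt, mut):
    """Antisymmetric by construction: ddg_pred(a, b) == -ddg_pred(b, a)."""
    return 0.5 * (g(wt, mut) - g(mut, wt))

print(ddg_pred("A", "D"), ddg_pred("D", "A"))  # opposite values
```

Any predictor wrapped this way passes the direct/reverse consistency check by design, which is why antisymmetry is a structural property rather than something to be learned.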
Subjects
Amino Acids; Protein Stability; Proteins; Amino Acids/genetics; Computers; Databases, Protein; Proteins/genetics; Proteins/chemistry
ABSTRACT
BACKGROUND AND AIMS: Nonalcoholic fatty liver disease (NAFLD) is a complex disease, resulting from the interplay between environmental determinants and genetic variations. Single nucleotide polymorphism rs738409 C>G in the PNPLA3 gene is associated with hepatic fibrosis and with higher risk of developing hepatocellular carcinoma. Here, we analyzed a longitudinal cohort of biopsy-proven NAFLD subjects with the aim of identifying individuals in whom genetics may have a stronger impact on disease progression. METHODS: We retrospectively analyzed 756 consecutive, prospectively enrolled biopsy-proven NAFLD subjects from Italy, United Kingdom, and Spain who were followed for a median of 84 months (interquartile range, 65-109 months). We stratified the study cohort according to sex, body mass index (BMI ≥30 kg/m²), and age (≥50 years). Liver-related events (hepatic decompensation, hepatic encephalopathy, esophageal variceal bleeding, and hepatocellular carcinoma) were recorded during the follow-up and the log-rank test was used to compare groups. RESULTS: Overall, the median age was 48 years and most individuals were men (64.7%). The PNPLA3 rs738409 genotype was CC in 235 (31.1%), CG in 328 (43.4%), and GG in 193 (25.5%) patients. At univariate analysis, the PNPLA3 GG risk genotype was associated with female sex and inversely related to BMI (odds ratio, 1.6; 95% confidence interval, 1.1-2.2; P = .006; and odds ratio, 0.97; 95% confidence interval, 0.94-0.99; P = .043, respectively). Specifically, PNPLA3 GG risk homozygosity was more prevalent in female vs male individuals (31.5% vs 22.3%; P = .006) and in nonobese compared with obese NAFLD subjects (50.0% vs 44.2%; P = .011). Following stratification for age, sex, and BMI, we observed an increased incidence of liver-related events in the subgroup of nonobese women older than 50 years of age carrying the PNPLA3 GG risk genotype (log-rank test, P = .0047).
CONCLUSIONS: Nonobese female patients with NAFLD who are 50 years of age and older and carry the PNPLA3 GG risk genotype are at higher risk of developing liver-related events compared with those with the wild-type allele (CC/CG). This finding may have implications in clinical practice for risk stratification and personalized medicine.
Subjects
Carcinoma, Hepatocellular; Esophageal and Gastric Varices; Liver Neoplasms; Non-alcoholic Fatty Liver Disease; Humans; Female; Male; Middle Aged; Non-alcoholic Fatty Liver Disease/complications; Non-alcoholic Fatty Liver Disease/genetics; Non-alcoholic Fatty Liver Disease/epidemiology; Carcinoma, Hepatocellular/epidemiology; Carcinoma, Hepatocellular/genetics; Carcinoma, Hepatocellular/complications; Retrospective Studies; Esophageal and Gastric Varices/complications; Gastrointestinal Hemorrhage/complications; Genotype; Polymorphism, Single Nucleotide; Liver Neoplasms/epidemiology; Liver Neoplasms/genetics; Liver Neoplasms/complications; Genetic Predisposition to Disease
ABSTRACT
Most living organisms rely on double-stranded DNA (dsDNA) to store their genetic information and perpetuate themselves. This biological information has been considered the main target of evolution. However, here we show that symmetries and patterns in the dsDNA sequence can emerge from the physical peculiarities of the dsDNA molecule itself and the maximum entropy principle alone, rather than from biological or environmental evolutionary pressure. This randomness accounts for human codon biases and context-dependent mutation patterns in human populations. Thus, the DNA 'exceptional symmetries' that emerge from this randomness have to be taken into account when looking for the DNA encoded information. Our results suggest that the double helix energy constraints and, more generally, the physical properties of the dsDNA are the hard drivers of the overall DNA sequence architecture, whereas the selective biological processes act as soft drivers, which only under extraordinary circumstances overtake the overall entropy content of the genome.
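One of the intra-strand symmetries alluded to here is Chargaff's second parity rule: within a single strand, a k-mer and its reverse complement occur with similar frequencies. A sketch that counts dinucleotides on a random (maximum-entropy) sequence, where the symmetry holds by construction:

```python
import random
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    return kmer.translate(COMP)[::-1]

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy single-strand sequence drawn uniformly at random; real genomes
# also show near-equality of a k-mer and its reverse complement
# counted on the same strand.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(200000))
counts = kmer_counts(seq, 2)
for kmer in ["AG", "CA", "TT"]:
    rc = revcomp(kmer)
    print(kmer, counts[kmer], rc, counts[rc])
```

On real genomes the interesting question, raised by this abstract, is how much of this symmetry is physical/entropic baseline and how much is selection.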
Subjects
DNA/genetics; Evolution, Molecular; Sequence Analysis, DNA/methods; Humans
ABSTRACT
A review, recently published in this journal by Fang (2019), showed that methods trained for the prediction of protein stability changes upon mutation have a very critical bias: they neglect that a protein variation (A → B) and its reverse (B → A) must have opposite values of the free energy difference (ΔΔG(A→B) = -ΔΔG(B→A)). In this letter, we complement Fang's paper by presenting a more general view of the problem. In particular, a machine learning-based method, published in 2015 (INPS), addressed the bias issue directly. We include the analysis of the missing method, showing that INPS is nearly insensitive to the addressed problem.
Subjects
Algorithms; Machine Learning; Mutation; Protein Stability
ABSTRACT
OBJECTIVE: The full phenotypic expression of non-alcoholic fatty liver disease (NAFLD) in lean subjects is incompletely characterised. We aimed to investigate prevalence, characteristics and long-term prognosis of Caucasian lean subjects with NAFLD. DESIGN: The study cohort comprises 1339 biopsy-proven NAFLD subjects from four countries (Italy, UK, Spain and Australia), stratified into lean and non-lean (body mass index (BMI) ≥25 kg/m2). Liver/non-liver-related events and survival free of transplantation were recorded during the follow-up, compared by log-rank testing and reported by adjusted HR. RESULTS: Lean patients represented 14.4% of the cohort and were predominantly of Italian origin (89%). They had less severe histological disease (lean vs non-lean: non-alcoholic steatohepatitis 54.1% vs 71.2% p<0.001; advanced fibrosis 10.1% vs 25.2% p<0.001), lower prevalence of diabetes (9.2% vs 31.4%, p<0.001), but no significant differences in the prevalence of the PNPLA3 I148M variant (p=0.57). During a median follow-up of 94 months (>10 483 person-years), 4.7% of lean vs 7.7% of non-lean patients reported liver-related events (p=0.37). No difference in survival was observed compared with non-lean NAFLD (p=0.069). CONCLUSIONS: Caucasian lean subjects with NAFLD may progress to advanced liver disease, develop metabolic comorbidities and experience cardiovascular disease (CVD) as well as liver-related mortality, independent of longitudinal progression to obesity and PNPLA3 genotype. These patients represent one end of a wide spectrum of phenotypic expression of NAFLD where the disease manifests at lower overall BMI thresholds. LAY SUMMARY: NAFLD may affect and progress in both obese and lean individuals. Lean subjects are predominantly males, have a younger age at diagnosis and are more prevalent in some geographic areas. During the follow-up, lean subjects can develop hepatic and extrahepatic disease, including metabolic comorbidities, in the absence of weight gain. 
These patients represent one end of a wide spectrum of phenotypic expression of NAFLD.
Subjects
Non-alcoholic Fatty Liver Disease/complications; Thinness/complications; White People; Adult; Body Mass Index; Cohort Studies; Female; Humans; Male; Middle Aged; Non-alcoholic Fatty Liver Disease/mortality; Non-alcoholic Fatty Liver Disease/pathology; Prognosis; Survival Rate; Thinness/mortality; Thinness/pathology
ABSTRACT
BACKGROUND: Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent. RESULTS: Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that uses the average of the probabilities from a Markov model applied to both strands of the sequence, and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix. CONCLUSIONS: Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side.
Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.
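The bidirectional averaging can be sketched directly: an order-k Markov model predicts a base from its left context, and applying the same model to the opposite strand turns the right context into a left context (order 2 and a toy sequence here; the paper uses up to order 14 on the human genome):

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(COMP)[::-1]

def train_markov(seq, k):
    """Order-k Markov counts: (k preceding bases, next base), trained
    on both the sequence and its reverse complement."""
    counts = Counter()
    for s in (seq, revcomp(seq)):
        for i in range(k, len(s)):
            counts[(s[i - k:i], s[i])] += 1
    return counts

def prob(counts, context, base):
    total = sum(counts[(context, b)] for b in "ACGT")
    return counts[(context, base)] / total if total else 0.25

def bidirectional_prob(counts, left, right, base, k):
    """Average the model applied left-to-right with the model applied
    to the opposite strand, where the right context becomes a left
    context and the base becomes its complement."""
    p_fwd = prob(counts, left[-k:], base)
    p_rev = prob(counts, revcomp(right[:k]), base.translate(COMP))
    return 0.5 * (p_fwd + p_rev)

seq = "ACGTGACGTTACGGACGTACGATCGACGTAGCACGT" * 50  # toy training sequence
m = train_markov(seq, k=2)
p = bidirectional_prob(m, left="AC", right="TG", base="G", k=2)
print(round(p, 3))
```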
Subjects
DNA; Nucleotides; DNA/genetics; Genome, Human; Genomics; Humans; Nucleotides/genetics; Probability
ABSTRACT
Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We showed that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than that obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD and REVEL, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. As expected, our method results in an ~3% lower area under the receiver operating characteristic curve (AUC) when compared with an ensemble-based algorithm (REVEL). Nevertheless, the combination of predictions of multiple methods can help to identify more reliable predictions. These observations indicate that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
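Profile-derived features of the kind described can be sketched as the wild-type frequency, the mutant frequency, and a conservation term computed from one alignment column (an illustrative choice of the third feature; the paper's exact feature set is not reproduced here):

```python
import math
from collections import Counter

def column_profile(msa, pos):
    """Residue frequencies at one alignment column (gaps ignored)."""
    col = [seq[pos] for seq in msa if seq[pos] != "-"]
    n = len(col)
    return {aa: c / n for aa, c in Counter(col).items()}

def variant_features(msa, pos, wt, mut):
    prof = column_profile(msa, pos)
    f_wt = prof.get(wt, 0.0)
    f_mut = prof.get(mut, 0.0)
    # Shannon entropy of the column as a conservation measure.
    entropy = -sum(p * math.log2(p) for p in prof.values())
    return f_wt, f_mut, entropy

# Toy alignment of five homologous sequences.
msa = ["MKVL", "MKVL", "MRVL", "MKIL", "MKVL"]
print(variant_features(msa, pos=1, wt="K", mut="R"))
```

A variant toward a residue already common in the column (high f_mut) in a variable column (high entropy) is the profile-level picture of a tolerated substitution.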
Subjects
Computational Biology; Nucleic Acids; Algorithms; Amino Acid Sequence; Computational Biology/methods; Humans; Mutation, Missense; Sequence Alignment
ABSTRACT
SUMMARY: Identifying pathogenic variants and annotating them is a major challenge in human genetics, especially for the non-coding ones. Several tools have been developed and used to predict the functional effect of genetic variants. However, the calibration assessment of the predictions has received little attention. Calibration refers to the idea that if a model predicts a group of variants to be pathogenic with a probability P, it is expected that the same fraction P of true positives is found in the observed set. For instance, a well-calibrated classifier should label the variants such that among the ones to which it gave a probability value close to 0.7, approximately 70% actually belong to the pathogenic class. Poorly calibrated algorithms can be misleading and potentially harmful for clinical decision making. AVAILABILITY AND IMPLEMENTATION: The dataset used for testing the methods is available through the DOI:10.5281/zenodo.4448197. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
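Calibration as defined here can be checked by binning predicted probabilities and comparing each bin's mean prediction with the observed fraction of pathogenic variants (a reliability-diagram sketch on toy predictions):

```python
# Reliability sketch: bin predicted probabilities and compare each
# bin's mean prediction with the observed fraction of positives.
def reliability_bins(probs, labels, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            rows.append((round(mean_p, 2), round(frac_pos, 2), len(b)))
    return rows

# Toy predictions: 70% of the variants scored ~0.7 are truly pathogenic,
# so that bin is well calibrated.
probs  = [0.1, 0.1, 0.1, 0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.7,
          0.7, 0.7, 0.7, 0.7, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
for row in reliability_bins(probs, labels):
    print(row)  # (mean predicted prob, observed fraction, bin size)
```

A classifier can have a high AUC and still be badly calibrated; the reliability table is the direct check of the clinical-probability claim.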
ABSTRACT
BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias that prevents the machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open.
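The construction bias can be demonstrated on toy data: when all driver variants fall in a few genes and passengers are spread over many, a "classifier" that only memorizes gene names scores perfectly without modeling any variant-level effect:

```python
import random

random.seed(0)

# Hypothetical gene names for illustration only.
DRIVER_GENES = ["TP53", "KRAS"]
PASSENGER_GENES = [f"GENE{i}" for i in range(1000)]

# Biased dataset: drivers only in driver genes, passengers spread out.
data = [(random.choice(DRIVER_GENES), "driver") for _ in range(200)]
data += [(random.choice(PASSENGER_GENES), "passenger") for _ in range(200)]

def gene_only_classifier(gene):
    """Ignores the variant entirely; memorizes the driver gene list."""
    return "driver" if gene in DRIVER_GENES else "passenger"

acc = sum(gene_only_classifier(g) == y for g, y in data) / len(data)
print(acc)  # -> 1.0 without any variant-level modeling
```

The unbiased data set proposed above removes exactly this shortcut by ensuring every gene contributes variants of both classes, so gene identity alone can no longer separate them.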
Subjects
Carcinogenesis/genetics; Disease Progression; Machine Learning; Medical Oncology/instrumentation; Neoplasms/genetics; Precision Medicine/instrumentation; Neoplasms/pathology
ABSTRACT
BACKGROUND & AIMS: Non-invasive scoring systems (NSS) are used to identify patients with non-alcoholic fatty liver disease (NAFLD) who are at risk of advanced fibrosis, but their reliability in predicting long-term outcomes for hepatic/extrahepatic complications or death and their concordance in cross-sectional and longitudinal risk stratification remain uncertain. METHODS: The most common NSS (NFS, FIB-4, BARD, APRI) and the Hepamet fibrosis score (HFS) were assessed in 1,173 European patients with NAFLD from tertiary centres. Performance for fibrosis risk stratification and for the prediction of long-term hepatic/extrahepatic events, hepatocellular carcinoma (HCC) and overall mortality was evaluated in terms of AUC and Harrell's c-index. For longitudinal data, NSS-based Cox proportional hazard models were trained on the whole cohort with repeated 5-fold cross-validation, sampling for testing from the 607 patients with all NSS available. RESULTS: Cross-sectional analysis revealed HFS as the best performer for the identification of significant (F0-1 vs. F2-4, AUC = 0.758) and advanced (F0-2 vs. F3-4, AUC = 0.805) fibrosis, while NFS and FIB-4 showed the best performance for detecting histological cirrhosis (range AUCs 0.85-0.88). Considering longitudinal data (follow-up between 62 and 110 months), NFS and FIB-4 were the best at predicting liver-related events (c-indices>0.7), NFS for HCC (c-index = 0.9 on average), and FIB-4 and HFS for overall mortality (c-indices >0.8). All NSS showed limited performance (c-indices <0.7) for extrahepatic events. CONCLUSIONS: Overall, NFS, HFS and FIB-4 outperformed APRI and BARD for both cross-sectional identification of fibrosis and prediction of long-term outcomes, confirming that they are useful tools for the clinical management of patients with NAFLD at increased risk of fibrosis and liver-related complications or death.
LAY SUMMARY: Non-invasive scoring systems are increasingly being used in patients with non-alcoholic fatty liver disease to identify those at risk of advanced fibrosis and hence clinical complications. Herein, we compared various non-invasive scoring systems and identified those that were best at identifying risk, as well as those that were best for the prediction of long-term outcomes, such as liver-related events, liver cancer and death.
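Of the scores compared, FIB-4 and APRI have simple closed forms (these are the standard published formulas; the patient values below are illustrative):

```python
import math

def fib4(age_years, ast, alt, platelets_10e9_per_l):
    """FIB-4 = (age x AST) / (platelets x sqrt(ALT)),
    with AST/ALT in U/L and platelets in 10^9/L."""
    return (age_years * ast) / (platelets_10e9_per_l * math.sqrt(alt))

def apri(ast, ast_upper_limit_normal, platelets_10e9_per_l):
    """APRI = (AST / upper limit of normal) / platelets x 100."""
    return (ast / ast_upper_limit_normal) / platelets_10e9_per_l * 100

# Example patient: 61 y, AST 40 U/L, ALT 36 U/L, platelets 150 x 10^9/L.
print(round(fib4(61, 40, 36, 150), 2))   # (61*40)/(150*6) = 2.71
print(round(apri(40, 40, 150), 2))       # (40/40)/150*100 = 0.67
```

Their reliance on routine laboratory values is what makes these scores attractive for longitudinal risk stratification, even when, as reported above, the more composite NFS and HFS perform better on some endpoints.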