Search | BVS IEC

1.

MUSE-XAE: MUtational Signature Extraction with eXplainable AutoEncoder enhances tumour types classification.

Pancotti, Corrado; Rollo, Cesare; Codicè, Francesco; Birolo, Giovanni; Fariselli, Piero; Sanavia, Tiziana.

Bioinformatics ; 40(5), 2024 May 02.

Article in English | MEDLINE | ID: mdl-38754097

ABSTRACT

MOTIVATION: Mutational signatures are a critical component in deciphering the genetic alterations that underlie cancer development and have become a valuable resource to understand the genomic changes during tumorigenesis. Therefore, it is essential to employ precise and accurate methods for their extraction to ensure that the underlying patterns are reliably identified and can be effectively utilized in new strategies for diagnosis, prognosis, and treatment of cancer patients. RESULTS: We present MUSE-XAE, a novel method for mutational signature extraction from cancer genomes using an explainable autoencoder. Our approach employs a hybrid architecture consisting of a nonlinear encoder that can capture nonlinear interactions among features, and a linear decoder which ensures the interpretability of the active signatures. We evaluated and compared MUSE-XAE with other available tools on both synthetic and real cancer datasets and demonstrated that it achieves superior performance in terms of precision and sensitivity in recovering mutational signature profiles. MUSE-XAE extracts highly discriminative mutational signature profiles by enhancing the classification of primary tumour types and subtypes in real world settings. This approach could facilitate further research in this area, with neural networks playing a critical role in advancing our understanding of cancer genomics. AVAILABILITY AND IMPLEMENTATION: MUSE-XAE software is freely available at https://github.com/compbiomed-unito/MUSE-XAE.

Subjects

Mutation , Neoplasms , Humans , Neoplasms/genetics , Algorithms , Software , Genomics/methods , Computational Biology/methods , Neural Networks, Computer

2.

PhD-SNPg: updating a webserver and lightweight tool for scoring nucleotide variants.

Capriotti, Emidio; Fariselli, Piero.

Nucleic Acids Res ; 51(W1): W451-W458, 2023 07 05.

Article in English | MEDLINE | ID: mdl-37246737

ABSTRACT

One of the primary challenges in human genetics is determining the functional impact of single nucleotide variants (SNVs) and insertions and deletions (InDels), whether coding or noncoding. In the past, methods have been created to detect disease-related single amino acid changes, but only some can assess the influence of noncoding variations. CADD is the most commonly used and advanced algorithm for predicting the diverse effects of genome variations. It employs a combination of sequence conservation and functional features derived from the ENCODE project data. To use CADD, a large set of pre-calculated information must be downloaded during the installation process. To streamline the variant annotation process, we developed PhD-SNPg, a machine-learning tool that is easy to install and lightweight, relying solely on sequence-based features. Here we present an updated version, trained on a larger dataset, that can also predict the impact of InDel variations. Despite its simplicity, PhD-SNPg performs similarly to CADD, making it ideal for rapid genome interpretation and as a benchmark for tool development.

Subjects

Algorithms , Genome, Human , Humans , INDEL Mutation , Machine Learning , Polymorphism, Single Nucleotide

3.

Serum ferritin levels can predict long-term outcomes in patients with metabolic dysfunction-associated steatotic liver disease.

Armandi, Angelo; Sanavia, Tiziana; Younes, Ramy; Caviglia, Gian Paolo; Rosso, Chiara; Govaere, Olivier; Liguori, Antonio; Francione, Paolo; Gallego-Duràn, Rocìo; Ampuero, Javier; Pennisi, Grazia; Aller, Rocio; Tiniakos, Dina; Burt, Alastair; David, Ezio; Vecchio, Fabio; Maggioni, Marco; Cabibi, Daniela; McLeod, Duncan; Pareja, Maria Jesus; Zaki, Marco Y W; Grieco, Antonio; Stål, Per; Kechagias, Stergios; Fracanzani, Anna Ludovica; Valenti, Luca; Miele, Luca; Fariselli, Piero; Eslam, Mohammed; Petta, Salvatore; Hagström, Hannes; George, Jacob; Schattenberg, Jörn M; Romero-Gómez, Manuel; Anstee, Quentin Mark; Bugianesi, Elisabetta.

Gut ; 73(5): 825-834, 2024 Apr 05.

Article in English | MEDLINE | ID: mdl-38199805

ABSTRACT

OBJECTIVE: Hyperferritinaemia is associated with liver fibrosis severity in patients with metabolic dysfunction-associated steatotic liver disease (MASLD), but the longitudinal implications have not been thoroughly investigated. We assessed the role of serum ferritin in predicting long-term outcomes or death. DESIGN: We evaluated the relationship between baseline serum ferritin and longitudinal events in a multicentre cohort of 1342 patients. Four survival models considering ferritin with confounders or non-invasive scoring systems were applied with repeated five-fold cross-validation schema. Prediction performance was evaluated in terms of Harrell's C-index and its improvement by including ferritin as a covariate. RESULTS: Median follow-up time was 96 months. Liver-related events occurred in 7.7%, hepatocellular carcinoma in 1.9%, cardiovascular events in 10.9%, extrahepatic cancers in 8.3% and all-cause mortality in 5.8%. Hyperferritinaemia was associated with a 50% increased risk of liver-related events and 27% of all-cause mortality. A stepwise increase in baseline ferritin thresholds was associated with a statistical increase in C-index, ranging between 0.02 (lasso-penalised Cox regression) and 0.03 (ridge-penalised Cox regression); the risk of developing liver-related events mainly increased from threshold 215.5 µg/L (median HR=1.71 and C-index=0.71) and the risk of overall mortality from threshold 272 µg/L (median HR=1.49 and C-index=0.70). The inclusion of serum ferritin thresholds (215.5 µg/L and 272 µg/L) in predictive models increased the performance of Fibrosis-4 and Non-Alcoholic Fatty Liver Disease Fibrosis Score in the longitudinal risk assessment of liver-related events (C-indices>0.71) and overall mortality (C-indices>0.65). CONCLUSIONS: This study supports the potential use of serum ferritin values for predicting the long-term prognosis of patients with MASLD.

Subjects

Liver Neoplasms , Metabolic Diseases , Non-alcoholic Fatty Liver Disease , Humans , Non-alcoholic Fatty Liver Disease/pathology , Liver Cirrhosis/pathology , Fibrosis , Liver Neoplasms/complications , Ferritins
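
Harrell's C-index, used in the record above to evaluate the survival models, can be illustrated with a minimal sketch. This is not the study's code: the function name and the toy data are hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplifying assumption.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by the model.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) and a strictly shorter follow-up time than j.
    The pair is concordant when i also has the higher predicted risk;
    tied risk scores count as half. Tied times are ignored (assumption).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: months of follow-up, event flag, predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.5, 0.1]
c_index = harrell_c_index(times, events, risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported gains of 0.02 to 0.03 are increments on that scale.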

4.

Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset.

Pancotti, Corrado; Benevenuta, Silvia; Birolo, Giovanni; Alberini, Virginia; Repetto, Valeria; Sanavia, Tiziana; Capriotti, Emidio; Fariselli, Piero.

Brief Bioinform ; 23(2), 2022 03 10.

Article in English | MEDLINE | ID: mdl-35021190

ABSTRACT

Predicting the difference in thermodynamic stability between protein variants is crucial for protein design and understanding the genotype-phenotype relationships. So far, several computational tools have been created to address this task. Nevertheless, most of them have been trained or optimized on the same and 'all' available data, making a fair comparison unfeasible. Here, we introduce a novel dataset, collected and manually cleaned from the latest version of the ThermoMutDB database, consisting of 669 variants not included in the most widely used training datasets. The prediction performance and the ability to satisfy the antisymmetry property by considering both direct and reverse variants were evaluated across 21 different tools. The Pearson correlations of the tested tools were in the ranges of 0.21-0.5 and 0-0.45 for the direct and reverse variants, respectively. When both direct and reverse variants are considered, the antisymmetric methods perform better, achieving a Pearson correlation in the range of 0.51-0.62. The tested methods seem relatively insensitive to the physiological conditions, performing well also on the variants measured with more extreme pH and temperature values. A common issue with all the tested methods is the compression of the ΔΔG predictions toward zero. Furthermore, the thermodynamic stability of the most significantly stabilizing variants was found to be more challenging to predict. This study is the most extensive comparison of prediction methods using an entirely novel set of variants never tested before.

Subjects

Point Mutation , Proteins , Mutation , Protein Stability , Proteins/chemistry , Thermodynamics

5.

Nonlinear data fusion over Entity-Relation graphs for Drug-Target Interaction prediction.

Mazzone, Eugenio; Moreau, Yves; Fariselli, Piero; Raimondi, Daniele.

Bioinformatics ; 39(6), 2023 06 01.

Article in English | MEDLINE | ID: mdl-37255310

ABSTRACT

MOTIVATION: The prediction of reliable Drug-Target Interactions (DTIs) is a key task in computer-aided drug design and repurposing. Here, we present a new approach based on data fusion for DTI prediction built on top of the NXTfusion library, which generalizes the Matrix Factorization paradigm by extending it to the nonlinear inference over Entity-Relation graphs. RESULTS: We benchmarked our approach on five datasets and we compared our models against state-of-the-art methods. Our models outperform most of the existing methods and, simultaneously, retain the flexibility to predict both DTIs as binary classification and regression of the real-valued drug-target affinity, competing with models built explicitly for each task. Moreover, our findings suggest that the validation of DTI methods should be stricter than what has been proposed in some previous studies, focusing more on mimicking real-life DTI settings where predictions for previously unseen drugs, proteins, and drug-protein pairs are needed. These settings are exactly the context in which the benefit of integrating heterogeneous information with our Entity-Relation data fusion approach is the most evident. AVAILABILITY AND IMPLEMENTATION: All software and data are available at https://github.com/eugeniomazzone/CPI-NXTFusion and https://pypi.org/project/NXTfusion/.

Subjects

Drug Development , Software , Proteins , Drug Interactions , Drug Design

6.

Recombulator-X: A fast and user-friendly tool for estimating X chromosome recombination rates in forensic genetics.

Aneli, Serena; Fariselli, Piero; Chierto, Elena; Bini, Carla; Robino, Carlo; Birolo, Giovanni.

PLoS Comput Biol ; 19(9): e1011474, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37721960

ABSTRACT

Genetic markers (especially short tandem repeats or STRs) located on the X chromosome are a valuable resource to solve complex kinship cases in forensic genetics in addition or alternatively to autosomal STRs. Groups of tightly linked markers are combined into haplotypes, thus increasing the discriminating power of tests. However, this approach requires precise knowledge of the recombination rates between adjacent markers. The International Society of Forensic Genetics recommends that recombination rate estimation on the X chromosome is performed from pedigree genetic data while taking into account the confounding effect of mutations. However, implementations that satisfy these requirements have several drawbacks: they were never publicly released, they are very slow and/or need cluster-level hardware and strong computational expertise to use. In order to address these key concerns we developed Recombulator-X, a new open-source Python tool. The most challenging issue, namely the running time, was addressed with dynamic programming techniques to greatly reduce the computational complexity of the algorithm. Compared to the previous methods, Recombulator-X reduces the estimation times from weeks or months to less than one hour for typical datasets. Moreover, the estimation process, including preprocessing, has been streamlined and packaged into a simple command-line tool that can be run on a normal PC. Where previous approaches were limited to small panels of STR markers (up to 15), our tool can handle greater numbers (up to 100) of mixed STR and non-STR markers. In conclusion, Recombulator-X makes the estimation process much simpler, faster and accessible to researchers without a computational background, hopefully spurring increased adoption of best practices.

7.

From genotype to phenotype in Arabidopsis thaliana: in-silico genome interpretation predicts 288 phenotypes from sequencing data.

Raimondi, Daniele; Corso, Massimiliano; Fariselli, Piero; Moreau, Yves.

Nucleic Acids Res ; 50(3): e16, 2022 02 22.

Article in English | MEDLINE | ID: mdl-34792168

ABSTRACT

In many cases, the unprecedented availability of data provided by high-throughput sequencing has shifted the bottleneck from a data availability issue to a data interpretation issue, thus delaying the promised breakthroughs: genetics and precision medicine in human genetics, and phenotype prediction to improve plant adaptation to climate change and resistance to bioaggressors in plant sciences. In this paper, we propose a novel Genome Interpretation paradigm, which aims at directly modeling the genotype-to-phenotype relationship, and we focus on A. thaliana since it is the best studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes in/phenotypes out paradigm and it is trained to predict 288 real-valued Arabidopsis thaliana phenotypes from Whole Genome sequencing data. We show that 75 of these phenotypes are predicted with a Pearson correlation ≥0.4, and are mostly related to flowering traits. We show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models predicting single phenotypes from the GWAS-derived known associated genes. Galiana is also fully interpretable, thanks to the Saliency Maps gradient-based approaches. We followed this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, finding evidence for 6 of them in the existing literature.

Subjects

Arabidopsis , Arabidopsis/genetics , Genome , Genome-Wide Association Study , Genotype , Neural Networks, Computer , Phenotype , Whole Genome Sequencing

8.

DDGun: an untrained predictor of protein stability changes upon amino acid variants.

Montanucci, Ludovica; Capriotti, Emidio; Birolo, Giovanni; Benevenuta, Silvia; Pancotti, Corrado; Lal, Dennis; Fariselli, Piero.

Nucleic Acids Res ; 50(W1): W222-W227, 2022 07 05.

Article in English | MEDLINE | ID: mdl-35524565

ABSTRACT

Estimating the functional effect of single amino acid variants in proteins is fundamental for predicting the change in the thermodynamic stability, measured as the difference in the Gibbs free energy of unfolding, between the wild-type and the variant protein (ΔΔG). Here, we present the web-server of the DDGun method, which was previously developed for the ΔΔG prediction upon amino acid variants. DDGun is an untrained method based on basic features derived from evolutionary information. It is antisymmetric, as it predicts opposite ΔΔG values for direct (A → B) and reverse (B → A) single and multiple site variants. DDGun is available in two versions, one based on only sequence information and the other one based on sequence and structure information. Despite being untrained, DDGun reaches prediction performances comparable to those of trained methods. Here we make DDGun available as a web server. For the web server version, we updated the protein sequence database used for the computation of the evolutionary features, and we compiled two new data sets of protein variants to do a blind test of its performances. On these blind data sets of single and multiple site variants, DDGun confirms its prediction performance, reaching an average correlation coefficient between experimental and predicted ΔΔG of 0.45 and 0.49 for the sequence-based and structure-based versions, respectively. Besides being used for the prediction of ΔΔG, we suggest that DDGun should be adopted as a benchmark method to assess the predictive capabilities of newly developed methods. Releasing DDGun as a web-server, stand-alone program and docker image will facilitate the necessary process of method comparison to improve ΔΔG prediction.

Subjects

Amino Acids , Protein Stability , Proteins , Amino Acids/genetics , Computers , Databases, Protein , Proteins/genetics , Proteins/chemistry

9.

Impact of PNPLA3 rs738409 Polymorphism on the Development of Liver-Related Events in Patients With Nonalcoholic Fatty Liver Disease.

Rosso, Chiara; Caviglia, Gian Paolo; Birolo, Giovanni; Armandi, Angelo; Pennisi, Grazia; Pelusi, Serena; Younes, Ramy; Liguori, Antonio; Perez-Diaz-Del-Campo, Nuria; Nicolosi, Aurora; Govaere, Olivier; Castelnuovo, Gabriele; Olivero, Antonella; Abate, Maria Lorena; Ribaldone, Davide Giuseppe; Fariselli, Piero; Valenti, Luca; Miele, Luca; Petta, Salvatore; Romero-Gomez, Manuel; Anstee, Quentin M; Bugianesi, Elisabetta.

Clin Gastroenterol Hepatol ; 21(13): 3314-3321.e3, 2023 12.

Article in English | MEDLINE | ID: mdl-37149016

ABSTRACT

BACKGROUND AND AIMS: Nonalcoholic fatty liver disease (NAFLD) is a complex disease, resulting from the interplay between environmental determinants and genetic variations. Single nucleotide polymorphism rs738409 C>G in the PNPLA3 gene is associated with hepatic fibrosis and with higher risk of developing hepatocellular carcinoma. Here, we analyzed a longitudinal cohort of biopsy-proven NAFLD subjects with the aim to identify individuals in whom genetics may have a stronger impact on disease progression. METHODS: We retrospectively analyzed 756 consecutive, prospectively enrolled biopsy-proven NAFLD subjects from Italy, United Kingdom, and Spain who were followed for a median of 84 months (interquartile range, 65-109 months). We stratified the study cohort according to sex, body mass index (BMI)

Subjects

Carcinoma, Hepatocellular , Esophageal and Gastric Varices , Liver Neoplasms , Non-alcoholic Fatty Liver Disease , Humans , Female , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/complications , Non-alcoholic Fatty Liver Disease/genetics , Non-alcoholic Fatty Liver Disease/epidemiology , Carcinoma, Hepatocellular/epidemiology , Carcinoma, Hepatocellular/genetics , Carcinoma, Hepatocellular/complications , Retrospective Studies , Esophageal and Gastric Varices/complications , Gastrointestinal Hemorrhage/complications , Genotype , Polymorphism, Single Nucleotide , Liver Neoplasms/epidemiology , Liver Neoplasms/genetics , Liver Neoplasms/complications , Genetic Predisposition to Disease

10.

DNA sequence symmetries from randomness: the origin of the Chargaff's second parity rule.

Fariselli, Piero; Taccioli, Cristian; Pagani, Luca; Maritan, Amos.

Brief Bioinform ; 22(2): 2172-2181, 2021 03 22.

Article in English | MEDLINE | ID: mdl-32266404

ABSTRACT

Most living organisms rely on double-stranded DNA (dsDNA) to store their genetic information and perpetuate themselves. This biological information has been considered as the main target of evolution. However, here we show that symmetries and patterns in the dsDNA sequence can emerge from the physical peculiarities of the dsDNA molecule itself and the maximum entropy principle alone, rather than from biological or environmental evolutionary pressure. The randomness justifies the human codon biases and context-dependent mutation patterns in human populations. Thus, the DNA 'exceptional symmetries,' emerged from the randomness, have to be taken into account when looking for the DNA encoded information. Our results suggest that the double helix energy constraints and, more generally, the physical properties of the dsDNA are the hard drivers of the overall DNA sequence architecture, whereas the selective biological processes act as soft drivers, which only under extraordinary circumstances overtake the overall entropy content of the genome.

Subjects

DNA/genetics , Evolution, Molecular , Sequence Analysis, DNA/methods , Humans
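
Chargaff's second parity rule, named in the title of the record above, states that within a single DNA strand a k-mer and its reverse complement occur at approximately the same frequency. A minimal sketch of how one might measure the deviation from that symmetry follows; the function names and toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count all overlapping k-mers on one strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def parity_deviation(seq, k):
    """Largest count difference between a k-mer and its reverse complement,
    divided by the total number of k-mers.

    0.0 means the strand satisfies the parity rule exactly at order k.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    kmers = set(counts) | {reverse_complement(m) for m in counts}
    worst = max(abs(counts[m] - counts[reverse_complement(m)]) for m in kmers)
    return worst / total


symmetric = parity_deviation("ACGT", 2)   # AC/GT balance, CG is its own partner
skewed = parity_deviation("AAAA", 1)      # only A, no T on this strand
```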

11.

On the critical review of five machine learning-based algorithms for predicting protein stability changes upon mutation.

Savojardo, Castrense; Martelli, Pier Luigi; Casadio, Rita; Fariselli, Piero.

Brief Bioinform ; 22(1): 601-603, 2021 01 18.

Article in English | MEDLINE | ID: mdl-31885042

ABSTRACT

A review, recently published in this journal by Fang (2019), showed that methods trained for the prediction of protein stability changes upon mutation have a very critical bias: they neglect that a protein variation (A → B) and its reverse (B → A) must have the opposite value of the free energy difference (ΔΔG_AB = −ΔΔG_BA). In this letter, we complement Fang's paper by presenting a more general view of the problem. In particular, a machine learning-based method, published in 2015 (INPS), addressed the bias issue directly. We include the analysis of the missing method, showing that INPS is nearly insensitive to the addressed problem.

Subjects

Algorithms , Machine Learning , Mutation , Protein Stability
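
The antisymmetry requirement discussed in the record above, that the prediction for a variant and for its reverse have opposite signs and equal magnitude, can be checked numerically. The sketch below assumes two summaries one might compute: the Pearson correlation between direct and reverse predictions (ideally -1) and the mean of their sum (ideally 0). The function names and the example values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def antisymmetry_summary(direct, reverse):
    """Return (correlation, bias) for paired direct/reverse predictions.

    correlation: Pearson r between direct and reverse values (ideal: -1).
    bias: mean of (direct + reverse) / 2 over all pairs (ideal: 0).
    """
    r = pearson(direct, reverse)
    bias = sum(d + v for d, v in zip(direct, reverse)) / (2 * len(direct))
    return r, bias


# Hypothetical predictions in kcal/mol for three variant pairs.
direct = [1.0, -0.5, 2.0]
perfect_reverse = [-1.0, 0.5, -2.0]   # exactly antisymmetric
biased_reverse = [1.0, -0.5, 2.0]     # same value for both directions
r_ok, bias_ok = antisymmetry_summary(direct, perfect_reverse)
r_bad, bias_bad = antisymmetry_summary(direct, biased_reverse)
```

A predictor that returns the same value for both directions, as in the second toy case, shows a correlation near +1 and a non-zero bias.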

12.

Caucasian lean subjects with non-alcoholic fatty liver disease share long-term prognosis of non-lean: time for reappraisal of BMI-driven approach?

Younes, Ramy; Govaere, Olivier; Petta, Salvatore; Miele, Luca; Tiniakos, Dina; Burt, Alastair; David, Ezio; Vecchio, Fabio Maria; Maggioni, Marco; Cabibi, Daniela; McLeod, Duncan; Pareja, Maria Jesus; Fracanzani, Anna Ludovica; Aller, Rocio; Rosso, Chiara; Ampuero, Javier; Gallego-Durán, Rocío; Armandi, Angelo; Caviglia, Gian Paolo; Zaki, Marco Y W; Liguori, Antonio; Francione, Paolo; Pennisi, Grazia; Grieco, Antonio; Birolo, Giovanni; Fariselli, Piero; Eslam, Mohammed; Valenti, Luca; George, Jacob; Romero-Gómez, Manuel; Anstee, Quentin Mark; Bugianesi, Elisabetta.

Gut ; 71(2): 382-390, 2022 02.

Article in English | MEDLINE | ID: mdl-33541866

ABSTRACT

OBJECTIVE: The full phenotypic expression of non-alcoholic fatty liver disease (NAFLD) in lean subjects is incompletely characterised. We aimed to investigate prevalence, characteristics and long-term prognosis of Caucasian lean subjects with NAFLD. DESIGN: The study cohort comprises 1339 biopsy-proven NAFLD subjects from four countries (Italy, UK, Spain and Australia), stratified into lean and non-lean (body mass index (BMI) 10 483 person-years), 4.7% of lean vs 7.7% of non-lean patients reported liver-related events (p=0.37). No difference in survival was observed compared with non-lean NAFLD (p=0.069). CONCLUSIONS: Caucasian lean subjects with NAFLD may progress to advanced liver disease, develop metabolic comorbidities and experience cardiovascular disease (CVD) as well as liver-related mortality, independent of longitudinal progression to obesity and PNPLA3 genotype. These patients represent one end of a wide spectrum of phenotypic expression of NAFLD where the disease manifests at lower overall BMI thresholds. LAY SUMMARY: NAFLD may affect and progress in both obese and lean individuals. Lean subjects are predominantly males, have a younger age at diagnosis and are more prevalent in some geographic areas. During the follow-up, lean subjects can develop hepatic and extrahepatic disease, including metabolic comorbidities, in the absence of weight gain. These patients represent one end of a wide spectrum of phenotypic expression of NAFLD.

Subjects

Non-alcoholic Fatty Liver Disease/complications , Thinness/complications , White People , Adult , Body Mass Index , Cohort Studies , Female , Humans , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/mortality , Non-alcoholic Fatty Liver Disease/pathology , Prognosis , Survival Rate , Thinness/mortality , Thinness/pathology

13.

Context dependency of nucleotide probabilities and variants in human DNA.

Liang, Yuhu; Grønbæk, Christian; Fariselli, Piero; Krogh, Anders.

BMC Genomics ; 23(1): 87, 2022 Jan 31.

Article in English | MEDLINE | ID: mdl-35100973

ABSTRACT

BACKGROUND: Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent. RESULTS: Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that uses an average of the probability from a Markov model applied to both strands of the sequence and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix. CONCLUSIONS: Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.

Subjects

DNA , Nucleotides , DNA/genetics , Genome, Human , Genomics , Humans , Nucleotides/genetics , Probability
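
The bidirectional Markov model described in the record above averages the probability from a Markov model applied to both strands. A toy sketch of one possible reading of that idea follows. It is an assumption-laden illustration, not the authors' implementation: the order `k`, the add-one smoothing, training on both strands, and all function names are hypothetical choices made here.

```python
from collections import defaultdict

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return "".join(COMP[b] for b in reversed(seq))


def train_markov(seq, k):
    """Count next-base occurrences after each k-mer context.

    Both the sequence and its reverse complement are used for counting
    (an assumption of this sketch).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for strand in (seq, revcomp(seq)):
        for i in range(k, len(strand)):
            counts[strand[i - k:i]][strand[i]] += 1
    return counts


def markov_prob(counts, context, base):
    """P(base | context) with add-one smoothing over the four bases."""
    ctx = counts.get(context, {})
    total = sum(ctx.values())
    return (ctx.get(base, 0) + 1) / (total + 4)


def bidirectional_prob(counts, seq, i, k):
    """Average the forward-strand and reverse-strand estimates at position i.

    Requires k <= i and i + k < len(seq). On the reverse strand, the
    complement of the base at i follows the reverse complement of the
    k bases to the right of i.
    """
    left = seq[i - k:i]
    right = seq[i + 1:i + 1 + k]
    probs = {}
    for b in "ACGT":
        fwd = markov_prob(counts, left, b)
        rev = markov_prob(counts, revcomp(right), COMP[b])
        probs[b] = 0.5 * (fwd + rev)
    return probs


# Toy periodic sequence; position 10 holds a G.
seq = "ACGT" * 20
counts = train_markov(seq, 2)
probs = bidirectional_prob(counts, seq, 10, 2)
best = max(probs, key=probs.get)
```

With order `k`, the estimate at a position depends on `k` bases to each side, which mirrors the abstract's statement that the order-14 model depends on up to 14 bases to each side.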

14.

Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants.

Capriotti, Emidio; Fariselli, Piero.

Hum Genet ; 141(10): 1649-1658, 2022 Oct.

Article in English | MEDLINE | ID: mdl-35098354

ABSTRACT

Evolutionary information is the primary tool for detecting functional conservation in nucleic acid and protein. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We showed that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than that obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD and REVEL, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. As expected, our method results in ~3% lower area under the receiver-operating characteristic curve (AUC) when compared with an ensemble-based algorithm (REVEL). Nevertheless, the combination of predictions of multiple methods can help to identify more reliable predictions. These observations indicate that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.

Subjects

Computational Biology , Nucleic Acids , Algorithms , Amino Acid Sequence , Computational Biology/methods , Humans , Mutation, Missense , Sequence Alignment

15.

Calibrating variant-scoring methods for clinical decision making.

Benevenuta, Silvia; Capriotti, Emidio; Fariselli, Piero.

Bioinformatics ; 36(24): 5709-5711, 2021 Apr 05.

Article in English | MEDLINE | ID: mdl-33492342

ABSTRACT

SUMMARY: Identifying pathogenic variants and annotating them is a major challenge in human genetics, especially for the non-coding ones. Several tools have been developed and used to predict the functional effect of genetic variants. However, the calibration assessment of the predictions has received little attention. Calibration refers to the idea that if a model predicts a group of variants to be pathogenic with a probability P, it is expected that the same fraction P of true positives is found in the observed set. For instance, a well-calibrated classifier should label the variants such that among the ones to which it gave a probability value close to 0.7, approximately 70% actually belong to the pathogenic class. Poorly calibrated algorithms can be misleading and potentially harmful for clinical decision making. AVAILABILITY AND IMPLEMENTATION: The dataset used for testing the methods is available through the DOI:10.5281/zenodo.4448197. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
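
The calibration idea in the summary above (among variants scored near 0.7, about 70% should be pathogenic) can be made concrete with a small sketch. It bins predicted probabilities and compares each bin's mean prediction with the observed pathogenic fraction. The function names, the equal-width binning, and the toy data are assumptions for illustration and are not taken from the paper.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed positive
    fraction, number of variants). Bins are equal-width over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        table.append((mean_p, frac_pos, len(members)))
    return table


def expected_calibration_error(probs, labels, n_bins=5):
    """Size-weighted mean gap between predicted and observed fractions."""
    total = len(probs)
    return sum(
        n / total * abs(mean_p - frac_pos)
        for mean_p, frac_pos, n in reliability_table(probs, labels, n_bins)
    )


# Ten hypothetical variants scored 0.7, of which seven are pathogenic (label 1).
well_calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
# Ten hypothetical variants scored 0.9, of which only five are pathogenic.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

In the first toy case the gap is essentially zero; in the second the classifier overstates pathogenicity by about 0.4, which is the kind of miscalibration the summary warns about.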

16.

Current cancer driver variant predictors learn to recognize driver genes instead of functional variants.

Raimondi, Daniele; Passemiers, Antoine; Fariselli, Piero; Moreau, Yves.

BMC Biol ; 19(1): 3, 2021 01 13.

Article in English | MEDLINE | ID: mdl-33441128

ABSTRACT

BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open.

Subjects

Carcinogenesis/genetics , Disease Progression , Machine Learning , Medical Oncology/instrumentation , Neoplasms/genetics , Precision Medicine/instrumentation , Neoplasms/pathology

17.

Long-term outcomes and predictive ability of non-invasive scoring systems in patients with non-alcoholic fatty liver disease.

Younes, Ramy; Caviglia, Gian Paolo; Govaere, Olivier; Rosso, Chiara; Armandi, Angelo; Sanavia, Tiziana; Pennisi, Grazia; Liguori, Antonio; Francione, Paolo; Gallego-Durán, Rocío; Ampuero, Javier; Garcia Blanco, Maria J; Aller, Rocio; Tiniakos, Dina; Burt, Alastair; David, Ezio; Vecchio, Fabio M; Maggioni, Marco; Cabibi, Daniela; Pareja, María Jesús; Zaki, Marco Y W; Grieco, Antonio; Fracanzani, Anna L; Valenti, Luca; Miele, Luca; Fariselli, Piero; Petta, Salvatore; Romero-Gomez, Manuel; Anstee, Quentin M; Bugianesi, Elisabetta.

J Hepatol ; 75(4): 786-794, 2021 10.

Article in English | MEDLINE | ID: mdl-34090928

ABSTRACT

BACKGROUND & AIMS: Non-invasive scoring systems (NSS) are used to identify patients with non-alcoholic fatty liver disease (NAFLD) who are at risk of advanced fibrosis, but their reliability in predicting long-term outcomes for hepatic/extrahepatic complications or death and their concordance in cross-sectional and longitudinal risk stratification remain uncertain. METHODS: The most common NSS (NFS, FIB-4, BARD, APRI) and the Hepamet fibrosis score (HFS) were assessed in 1,173 European patients with NAFLD from tertiary centres. Performance for fibrosis risk stratification and for the prediction of long-term hepatic/extrahepatic events, hepatocarcinoma (HCC) and overall mortality was evaluated in terms of AUC and Harrell's c-index. For longitudinal data, NSS-based Cox proportional hazard models were trained on the whole cohort with repeated 5-fold cross-validation, sampling for testing from the 607 patients with all NSS available. RESULTS: Cross-sectional analysis revealed HFS as the best performer for the identification of significant (F0-1 vs. F2-4, AUC = 0.758) and advanced (F0-2 vs. F3-4, AUC = 0.805) fibrosis, while NFS and FIB-4 showed the best performance for detecting histological cirrhosis (AUC range 0.85-0.88). Considering longitudinal data (follow-up between 62 and 110 months), NFS and FIB-4 were the best at predicting liver-related events (c-indices >0.7), NFS for HCC (c-index = 0.9 on average), and FIB-4 and HFS for overall mortality (c-indices >0.8). All NSS showed limited performance (c-indices <0.7) for extrahepatic events. CONCLUSIONS: Overall, NFS, HFS and FIB-4 outperformed APRI and BARD for both cross-sectional identification of fibrosis and prediction of long-term outcomes, confirming that they are useful tools for the clinical management of patients with NAFLD at increased risk of fibrosis and liver-related complications or death.
LAY SUMMARY: Non-invasive scoring systems are increasingly being used in patients with non-alcoholic fatty liver disease to identify those at risk of advanced fibrosis and hence clinical complications. Herein, we compared various non-invasive scoring systems and identified those that were best at identifying risk, as well as those that were best for the prediction of long-term outcomes, such as liver-related events, liver cancer and death.
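As a rough illustration of the two quantities this abstract relies on, the sketch below computes the FIB-4 index from its standard formula and then Harrell's c-index for a handful of patients. The patient values are invented for the example, and the c-index routine is a plain pairwise implementation written for readability; it is not the study's analysis pipeline, which used Cox models with cross-validation.

```python
import math

def fib4(age_years, ast_u_l, alt_u_l, platelets_10e9_l):
    """FIB-4 index: (age x AST) / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))

def harrell_c_index(times, events, scores):
    """Harrell's c-index for a risk score (higher score = higher expected risk).

    A pair (i, j) is comparable when patient i had an observed event and
    times[i] < times[j]. The pair is concordant when i also has the higher
    score; ties in score count as half."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical patients: (age, AST, ALT, platelets, follow-up in months, event observed)
patients = [
    (65, 80, 60, 120, 20, 1),
    (50, 40, 45, 220, 70, 1),
    (40, 25, 30, 280, 100, 0),
    (58, 55, 50, 160, 90, 0),
]
scores = [fib4(a, ast, alt, plt) for a, ast, alt, plt, _, _ in patients]
times = [p[4] for p in patients]
events = [p[5] for p in patients]
c_index = harrell_c_index(times, events, scores)
print([round(s, 2) for s in scores], round(c_index, 2))
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the thresholds quoted in the abstract (for example >0.7 or >0.8) should be read.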

Subjects

Non-alcoholic Fatty Liver Disease/complications, Predictive Value of Tests, Research Design/standards, Time, Adult, Area Under Curve, Cross-Sectional Studies, Female, Humans, Liver/pathology, Male, Middle Aged, Non-alcoholic Fatty Liver Disease/mortality, Prognosis, ROC Curve, Reproducibility of Results, Research Design/trends, Severity of Illness Index

18.

Insight into the protein solubility driving forces with neural attention.

Raimondi, Daniele; Orlando, Gabriele; Fariselli, Piero; Moreau, Yves.

PLoS Comput Biol ; 16(4): e1007722, 2020 04.

Article in English | MEDLINE | ID: mdl-32352965

ABSTRACT

Protein solubility is a key aspect of many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of protein solubility may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes such as amyloidosis. Here we present SKADE, a novel neural network protein solubility predictor, and we show how it can provide novel insight into protein solubility mechanisms thanks to its neural attention architecture. First, we show that SKADE compares favourably with state-of-the-art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this property to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the amino acids, they suggest that N- and C-termini are the most relevant regions for solubility prediction and are predictive of complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large-scale in silico mutagenesis of proteins in order to maximize their solubility.

Subjects

Computational Biology/methods, Nerve Net/physiology, Solubility, Algorithms, Amino Acid Sequence/physiology, Amino Acids, Animals, Computer Simulation, Humans, Molecular Models, Protein Conformation, Proteins/chemistry, Proteins/metabolism, Software

19.

Fido-SNP: the first webserver for scoring the impact of single nucleotide variants in the dog genome.

Capriotti, Emidio; Montanucci, Ludovica; Profiti, Giuseppe; Rossi, Ivan; Giannuzzi, Diana; Aresu, Luca; Fariselli, Piero.

Nucleic Acids Res ; 47(W1): W136-W141, 2019 07 02.

Article in English | MEDLINE | ID: mdl-31114899

ABSTRACT

As the amount of genomic variation data increases, tools that are able to score the functional impact of single nucleotide variants become more and more necessary. While there are several prediction servers available for interpreting the effects of variants in the human genome, only a few have been developed for other species, and none were specifically designed for species of veterinary interest such as the dog. Here, we present Fido-SNP, the first predictor able to discriminate between Pathogenic and Benign single-nucleotide variants in the dog genome. Fido-SNP is a binary classifier based on the Gradient Boosting algorithm. It is able to classify and score the impact of variants in both coding and non-coding regions based on sequence features within seconds. When validated on a previously unseen set of annotated variants from the OMIA database, Fido-SNP reaches 88% overall accuracy, 0.77 Matthews correlation coefficient and 0.91 Area Under the ROC Curve.
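The three figures of merit quoted at the end of this abstract (overall accuracy, Matthews correlation coefficient and area under the ROC curve) can be computed for any binary classifier as sketched below. The labels and scores are invented toy values with 1 standing for Pathogenic and 0 for Benign; they are not Fido-SNP outputs, and the 0.5 decision threshold is an assumption made for the example.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative
    (ties count as half). Quadratic in sample size, fine for small examples."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = Pathogenic, 0 = Benign; scores are hypothetical classifier outputs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
mcc = matthews_corrcoef(*counts)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.2f} MCC={mcc:.2f} AUC={auc:.4f}")
```

Accuracy and MCC depend on the chosen threshold, whereas AUC summarises the ranking over all thresholds, which is why the three numbers are usually reported together.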

Subjects

Genome/genetics, Genomics, Single Nucleotide Polymorphism/genetics, Software, Algorithms, Animals, Dogs, Genetic Variation, Genome-Wide Association Study, Genotype, Internet

20.

A natural upper bound to the accuracy of predicting protein stability changes upon mutations.

Montanucci, Ludovica; Martelli, Pier Luigi; Ben-Tal, Nir; Fariselli, Piero.

Bioinformatics ; 35(9): 1513-1517, 2019 05 01.

Article in English | MEDLINE | ID: mdl-30329016

ABSTRACT

MOTIVATION: Accurate prediction of protein stability changes upon single-site variations (ΔΔG) is important for protein design, as well as for our understanding of the mechanisms of genetic diseases. The performance of high-throughput computational methods to this end is evaluated mostly based on the Pearson correlation coefficient between predicted and observed data, assuming that the upper bound would be 1 (perfect correlation). However, the performance of these predictors can be limited by the distribution and noise of the experimental data. Here we estimate, for the first time, a theoretical upper bound to the ΔΔG prediction performance imposed by the intrinsic structure of currently available ΔΔG data. RESULTS: Given a set of measured ΔΔG protein variations, the theoretically "best predictor" is estimated based on its similarity to another set of experimentally determined ΔΔG values. We investigate the correlation between pairs of measured ΔΔG variations, where one is used as a predictor for the other. We analytically derive an upper bound to the Pearson correlation as a function of the noise and distribution of the ΔΔG data. We also evaluate the available datasets to highlight the effect of the noise in conjunction with the ΔΔG distribution. We conclude that the upper bound is a function of both uncertainty and spread of the ΔΔG values, and that with current data the best performance should be between 0.7 and 0.8, depending on the dataset used; higher Pearson correlations might be indicative of overtraining. It also follows that comparisons of predictors using different datasets are inherently misleading. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
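The idea that noise and spread jointly cap the achievable Pearson correlation can be made concrete with a small simulation. The sketch below assumes a simple additive model, which is an assumption made here for illustration and not necessarily the derivation used in the paper: each measurement is a true ΔΔG value drawn with standard deviation `sigma_true` plus independent Gaussian noise with standard deviation `sigma_noise`. Under that model, the correlation between two independent measurements of the same variations is sigma_true^2 / (sigma_true^2 + sigma_noise^2). The numeric values used are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def expected_correlation(sigma_true, sigma_noise):
    """Correlation between two noisy replicates under the additive-noise assumption."""
    return sigma_true ** 2 / (sigma_true ** 2 + sigma_noise ** 2)

def simulated_correlation(sigma_true, sigma_noise, n=20000, seed=0):
    """Draw n true values, measure each twice with independent noise,
    and use one replicate as the 'predictor' of the other."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, sigma_true) for _ in range(n)]
    replicate_a = [t + rng.gauss(0.0, sigma_noise) for t in true_values]
    replicate_b = [t + rng.gauss(0.0, sigma_noise) for t in true_values]
    return pearson(replicate_a, replicate_b)

# Hypothetical spreads and noise levels (arbitrary units), chosen only for illustration.
for sigma_true, sigma_noise in [(1.0, 0.5), (2.0, 0.5), (1.0, 1.0)]:
    theory = expected_correlation(sigma_true, sigma_noise)
    observed = simulated_correlation(sigma_true, sigma_noise)
    print(f"spread={sigma_true} noise={sigma_noise} "
          f"expected={theory:.3f} simulated={observed:.3f}")
```

In this toy setting a wider spread of true values raises the ceiling and larger measurement noise lowers it, consistent with the abstract's statement that the bound depends on both the uncertainty and the spread of the ΔΔG values.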

Subjects

Proteins/genetics, Mutation, Protein Stability
