ABSTRACT
PURPOSE: Rare cancers constitute over 20% of human neoplasms, often affecting patients with unmet medical needs. The development of effective classification and prognostication systems is crucial to improve the decision-making process and drive innovative treatment strategies. We have created and implemented MOSAIC, an artificial intelligence (AI)-based framework designed for multimodal analysis, classification, and personalized prognostic assessment in rare cancers. Clinical validation was performed on myelodysplastic syndrome (MDS), a rare hematologic cancer with clinical and genomic heterogeneities. METHODS: We analyzed 4,427 patients with MDS divided into training and validation cohorts. Deep learning methods were applied to integrate and impute clinical/genomic features. Clustering was performed by combining Uniform Manifold Approximation and Projection for Dimension Reduction + Hierarchical Density-Based Spatial Clustering of Applications with Noise (UMAP + HDBSCAN) methods, compared with the conventional Hierarchical Dirichlet Process (HDP). Linear and AI-based nonlinear approaches were compared for survival prediction. Explainable AI (Shapley Additive Explanations approach [SHAP]) and federated learning were used to improve the interpretation and the performance of the clinical models, integrating them into distributed infrastructure. RESULTS: UMAP + HDBSCAN clustering obtained a more granular patient stratification, achieving a higher average silhouette coefficient (0.16) than HDP (0.01) and higher balanced accuracy in cluster classification by Random Forest (92.7% ± 1.3% and 85.8% ± 0.8%, respectively). AI methods for survival prediction outperformed conventional statistical techniques and the reference prognostic tool for MDS. Nonlinear Gradient Boosting Survival stood out in both internal (Concordance-Index [C-Index], 0.77; SD, 0.01) and external validation (C-Index, 0.74; SD, 0.02).
SHAP analysis revealed that similar features drove patients' subgroups and outcomes in both training and validation cohorts. Federated implementation improved the accuracy of developed models. CONCLUSION: MOSAIC provides an explainable and robust framework to optimize classification and prognostic assessment of rare cancers. AI-based approaches demonstrated superior accuracy in capturing genomic similarities and providing individual prognostic information compared with conventional statistical methods. Its federated implementation ensures broad clinical application, guaranteeing high performance and data protection.
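The average silhouette coefficient used above to compare the two clusterings can be illustrated with a minimal sketch. This is not the MOSAIC implementation: the function name, the Euclidean distance and the toy points are assumptions made for illustration, and the original study computed the score on its own embedded patient data.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within the own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points standing in for an embedding of patients.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
well_separated = mean_silhouette(toy_points, [0, 0, 1, 1])
mixed = mean_silhouette(toy_points, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest misassigned points. HDBSCAN labels noise points as -1, so in practice those points are usually excluded before scoring.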
Subjects
Artificial Intelligence , Precision Medicine , Humans , Prognosis , Precision Medicine/methods , Female , Rare Diseases/classification , Rare Diseases/genetics , Rare Diseases/diagnosis , Male , Deep Learning , Neoplasms/classification , Neoplasms/genetics , Neoplasms/diagnosis , Myelodysplastic Syndromes/diagnosis , Myelodysplastic Syndromes/classification , Myelodysplastic Syndromes/genetics , Myelodysplastic Syndromes/therapy , Algorithms , Middle Aged , Aged , Cluster Analysis
ABSTRACT
MOTIVATION: Mutational signatures are a critical component in deciphering the genetic alterations that underlie cancer development and have become a valuable resource to understand the genomic changes during tumorigenesis. Therefore, it is essential to employ precise and accurate methods for their extraction to ensure that the underlying patterns are reliably identified and can be effectively utilized in new strategies for diagnosis, prognosis, and treatment of cancer patients. RESULTS: We present MUSE-XAE, a novel method for mutational signature extraction from cancer genomes using an explainable autoencoder. Our approach employs a hybrid architecture consisting of a nonlinear encoder that can capture nonlinear interactions among features, and a linear decoder which ensures the interpretability of the active signatures. We evaluated and compared MUSE-XAE with other available tools on both synthetic and real cancer datasets and demonstrated that it achieves superior performance in terms of precision and sensitivity in recovering mutational signature profiles. MUSE-XAE extracts highly discriminative mutational signature profiles by enhancing the classification of primary tumour types and subtypes in real world settings. This approach could facilitate further research in this area, with neural networks playing a critical role in advancing our understanding of cancer genomics. AVAILABILITY AND IMPLEMENTATION: MUSE-XAE software is freely available at https://github.com/compbiomed-unito/MUSE-XAE.
Subjects
Mutation , Neoplasms , Humans , Neoplasms/genetics , Algorithms , Software , Genomics/methods , Computational Biology/methods , Neural Networks, Computer
ABSTRACT
OBJECTIVE: Hyperferritinaemia is associated with liver fibrosis severity in patients with metabolic dysfunction-associated steatotic liver disease (MASLD), but the longitudinal implications have not been thoroughly investigated. We assessed the role of serum ferritin in predicting long-term outcomes or death. DESIGN: We evaluated the relationship between baseline serum ferritin and longitudinal events in a multicentre cohort of 1342 patients. Four survival models considering ferritin with confounders or non-invasive scoring systems were applied with repeated five-fold cross-validation schema. Prediction performance was evaluated in terms of Harrell's C-index and its improvement by including ferritin as a covariate. RESULTS: Median follow-up time was 96 months. Liver-related events occurred in 7.7%, hepatocellular carcinoma in 1.9%, cardiovascular events in 10.9%, extrahepatic cancers in 8.3% and all-cause mortality in 5.8%. Hyperferritinaemia was associated with a 50% increased risk of liver-related events and 27% of all-cause mortality. A stepwise increase in baseline ferritin thresholds was associated with a statistical increase in C-index, ranging between 0.02 (lasso-penalised Cox regression) and 0.03 (ridge-penalised Cox regression); the risk of developing liver-related events mainly increased from threshold 215.5 µg/L (median HR=1.71 and C-index=0.71) and the risk of overall mortality from threshold 272 µg/L (median HR=1.49 and C-index=0.70). The inclusion of serum ferritin thresholds (215.5 µg/L and 272 µg/L) in predictive models increased the performance of Fibrosis-4 and Non-Alcoholic Fatty Liver Disease Fibrosis Score in the longitudinal risk assessment of liver-related events (C-indices>0.71) and overall mortality (C-indices>0.65). CONCLUSIONS: This study supports the potential use of serum ferritin values for predicting the long-term prognosis of patients with MASLD.
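Harrell's C-index, the performance measure used above, is the fraction of comparable patient pairs in which the patient with the earlier event also received the higher predicted risk. The following is a minimal sketch under simplifying assumptions (pairs with tied times are skipped, ties in risk count as one half); the function name and the toy data are hypothetical and do not come from the study.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had the event and
    times[i] < times[j]. Pairs with tied times are ignored here.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy follow-up times (months), event indicators (1 = event, 0 = censored)
# and two hypothetical sets of model risk scores.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
c_good = harrell_c_index(times, events, [0.9, 0.7, 0.4, 0.1])
c_bad = harrell_c_index(times, events, [0.1, 0.4, 0.7, 0.9])
```

The gain from adding a covariate such as ferritin can then be expressed as the difference between the C-index of the model with the covariate and that of the model without it, evaluated on the same held-out folds.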
Subjects
Liver Neoplasms , Metabolic Diseases , Non-alcoholic Fatty Liver Disease , Humans , Non-alcoholic Fatty Liver Disease/pathology , Liver Cirrhosis/pathology , Fibrosis , Liver Neoplasms/complications , Ferritins
ABSTRACT
Introduction: Prostate cancer (PCa) is the most frequent tumor among men in Europe and has both indolent and aggressive forms. There are several treatment options, the choice of which depends on multiple factors. To further improve current prognostication models, we established the Turin Prostate Cancer Prognostication (TPCP) cohort, an Italian retrospective biopsy cohort of patients with PCa and long-term follow-up. This work presents this new cohort with its main characteristics and the distributions of some of its core variables, along with its potential contributions to PCa research. Methods: The TPCP cohort includes consecutive non-metastatic patients with first positive biopsy for PCa performed between 2008 and 2013 at the main hospital in Turin, Italy. The follow-up ended on December 31st 2021. The primary outcome is the occurrence of metastasis; death from PCa and overall mortality are the secondary outcomes. In addition to numerous clinical variables, the study's prognostic variables include histopathologic information assigned by a centralized uropathology review using a digital pathology software system specialized for the study of PCa, tumor DNA methylation in candidate genes, and features extracted from digitized slide images via Deep Neural Networks. Results: The cohort includes 891 patients followed-up for a median time of 10 years. During this period, 97 patients had progression to metastatic disease and 301 died; of these, 56 died from PCa. In total, 65.3% of the cohort has a Gleason score less than or equal to 3 + 4, and 44.5% has a clinical stage cT1. Consistent with previous studies, age and clinical stage at diagnosis are important prognostic factors: the crude cumulative incidence of metastatic disease during the 14-years of follow-up increases from 9.1% among patients younger than 64 to 16.2% for patients in the age group of 75-84, and from 6.1% for cT1 stage to 27.9% in cT3 stage. 
Discussion: This study stands to be an important resource for updating existing prognostic models for PCa on an Italian cohort. In addition, the integrated collection of multi-modal data will allow development and/or validation of new models including new histopathological, digital, and molecular markers, with the goal of better directing clinical decisions to manage patients with PCa.
ABSTRACT
Purpose: The objective of this work was to investigate the ability of machine learning models to use treatment plan dosimetry for prediction of clinician approval of treatment plans (no further planning needed) for left-sided whole breast radiation therapy with boost. Methods and Materials: Investigated plans were generated to deliver a dose of 40.05 Gy to the whole breast in 15 fractions over 3 weeks, with the tumor bed simultaneously boosted to 48 Gy. In addition to the manually generated clinical plan of each of the 120 patients from a single institution, an automatically generated plan was included for each patient to enhance the number of study plans to 240. In random order, the treating clinician retrospectively scored all 240 plans as (1) approved without further planning to seek improvement or (2) further planning needed, while being blind for type of plan generation (manual or automated). In total, 2 × 5 classifiers were trained and evaluated for ability to correctly predict the clinician's plan evaluations: random forest (RF) and constrained logistic regression (LR) classifiers, each trained for 5 different sets of dosimetric plan parameters (feature sets [FS]). Importances of included features for predictions were investigated to better understand clinicians' choices. Results: Although all 240 plans were in principle clinically acceptable for the clinician, only for 71.5% was no further planning required. For the most extensive FS, accuracy, area under the receiver operating characteristic curve, and Cohen's κ for generated RF/LR models for prediction of approval without further planning were 87.2 ± 2.0/86.7 ± 2.2, 0.80 ± 0.03/0.86 ± 0.02, and 0.63 ± 0.05/0.69 ± 0.04, respectively. In contrast to LR, RF performance was independent of the applied FS. 
For both RF and LR, the whole breast excluding the boost PTV (PTV40.05Gy) was the most important structure for predictions, with importance factors of 44.6% and 43%, respectively, and the dose received by 95% of the PTV40.05 volume (D95%) was the most important parameter in most cases. Conclusions: The investigated use of machine learning to predict clinician approval of treatment plans is highly promising. Including nondosimetric parameters could further increase classifiers' performances. The tool could become useful for aiding treatment planners in generating plans with a high probability of being directly approved by the treating clinician.
ABSTRACT
BACKGROUND AND AIMS: Nonalcoholic fatty liver disease (NAFLD) is a complex disease, resulting from the interplay between environmental determinants and genetic variations. The single nucleotide polymorphism rs738409 C>G in the PNPLA3 gene is associated with hepatic fibrosis and with a higher risk of developing hepatocellular carcinoma. Here, we analyzed a longitudinal cohort of biopsy-proven NAFLD subjects with the aim of identifying individuals in whom genetics may have a stronger impact on disease progression. METHODS: We retrospectively analyzed 756 consecutive, prospectively enrolled biopsy-proven NAFLD subjects from Italy, the United Kingdom, and Spain who were followed for a median of 84 months (interquartile range, 65-109 months). We stratified the study cohort according to sex, body mass index (BMI ≥30 kg/m2), and age (≥50 years). Liver-related events (hepatic decompensation, hepatic encephalopathy, esophageal variceal bleeding, and hepatocellular carcinoma) were recorded during the follow-up and the log-rank test was used to compare groups. RESULTS: Overall, the median age was 48 years and most individuals were men (64.7%). The PNPLA3 rs738409 genotype was CC in 235 (31.1%), CG in 328 (43.4%), and GG in 193 (25.5%) patients. In univariate analysis, the PNPLA3 GG risk genotype was associated with female sex and inversely related to BMI (odds ratio, 1.6; 95% confidence interval, 1.1-2.2; P = .006; and odds ratio, 0.97; 95% confidence interval, 0.94-0.99; P = .043, respectively). Specifically, PNPLA3 GG risk homozygosity was more prevalent in female vs male individuals (31.5% vs 22.3%; P = .006) and in nonobese compared with obese NAFLD subjects (50.0% vs 44.2%; P = .011). Following stratification for age, sex, and BMI, we observed an increased incidence of liver-related events in the subgroup of nonobese women older than 50 years of age carrying the PNPLA3 GG risk genotype (log-rank test, P = .0047).
CONCLUSIONS: Nonobese female patients with NAFLD 50 years of age and older, and carrying the PNPLA3 GG risk genotype, are at higher risk of developing liver-related events compared with those with the wild-type allele (CC/CG). This finding may have implications in clinical practice for risk stratification and personalized medicine.
Subjects
Carcinoma, Hepatocellular , Esophageal and Gastric Varices , Liver Neoplasms , Non-alcoholic Fatty Liver Disease , Humans , Female , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/complications , Non-alcoholic Fatty Liver Disease/genetics , Non-alcoholic Fatty Liver Disease/epidemiology , Carcinoma, Hepatocellular/epidemiology , Carcinoma, Hepatocellular/genetics , Carcinoma, Hepatocellular/complications , Retrospective Studies , Esophageal and Gastric Varices/complications , Gastrointestinal Hemorrhage/complications , Genotype , Polymorphism, Single Nucleotide , Liver Neoplasms/epidemiology , Liver Neoplasms/genetics , Liver Neoplasms/complications , Genetic Predisposition to Disease
ABSTRACT
BACKGROUND: The exposome drivers are less studied than its consequences but may be crucial in identifying population subgroups with unfavourable exposures. OBJECTIVES: We used three approaches to study the socioeconomic position (SEP) as a driver of the early-life exposome in Turin children of the NINFEA cohort (Italy). METHODS: Forty-two environmental exposures, collected at 18 months of age (N = 1989), were classified in 5 groups (lifestyle, diet, meteoclimatic, traffic-related, built environment). We performed cluster analysis to identify subjects sharing similar exposures, and intra-exposome-group Principal Component Analysis (PCA) to reduce the dimensionality. SEP at childbirth was measured through the Equivalised Household Income Indicator. SEP-exposome association was evaluated using: 1) an Exposome Wide Association Study (ExWAS), a one-exposure (SEP) one-outcome (exposome) approach; 2) multinomial regression of cluster membership on SEP; 3) regressions of each intra-exposome-group PC on SEP. RESULTS: In the ExWAS, medium/low SEP children were more exposed to greenness, pet ownership, passive smoking, TV screen and sugar; less exposed to NO2, NOX, PM25abs, humidity, built environment, traffic load, unhealthy food facilities, fruit, vegetables, eggs, grain products, and childcare than high SEP children. Medium/low SEP children were more likely to belong to a cluster with poor diet, less air pollution, and to live in the suburbs than high SEP children. Medium/low SEP children were more exposed to lifestyle PC1 (unhealthy lifestyle) and diet PC2 (unhealthy diet), and less exposed to PC1s of the built environment (urbanization factors), diet (mixed diet), and traffic (air pollution) than high SEP children. CONCLUSIONS: The three approaches provided consistent and complementary results, suggesting that children with lower SEP are less exposed to urbanization factors and more exposed to unhealthy lifestyles and diet. 
The simplest method, the ExWAS, conveys most of the information and is more replicable in other populations. Clustering and PCA may facilitate results interpretation and communication.
Subjects
Air Pollution , Exposome , Humans , Child , Birth Cohort , Environmental Exposure/analysis , Socioeconomic Factors
ABSTRACT
BACKGROUND/AIM: The current availability of large volumes of clinical data has provided medical departments with the opportunity for large-scale analyses, but it has also brought forth the need for an effective strategy of data-storage and data-analysis that is both technically feasible and economically sustainable in the context of limited resources and manpower. Therefore, the aim of this study was to develop a widely-usable data-collection and data-analysis workflow that could be applied in medical departments to perform high-volume relational data analysis on real-time data. METHODS: A sample project, based on a research database on prostate-specific-membrane-antigen/positron-emission-tomography scans performed in prostate cancer patients at our department, was used to develop a new workflow for data-collection and data-analysis. A checklist of requirements for a successful data-collection/analysis strategy, based on shared clinical research experience, was used as reference standard. Software libraries were selected based on widespread availability, reliability, cost, and technical expertise of the research team (REDCap-v11.0.0 for collaborative data-collection, Python-v3.8.5 for data retrieval and SQLite-v3.31.1 for data storage). The primary objective of this study was to develop and implement a workflow to: a) easily store large volumes of structured data into a relational database, b) perform scripted analyses on relational data retrieved in real-time from the database. The secondary objective was to enhance the strategy cost-effectiveness by using open-source/cost-free software libraries. RESULTS: A fully working data strategy was developed and successfully applied to a sample research project. The REDCap platform provided a remote and secure method to collaboratively collect large volumes of standardized relational data, with low technical difficulty and role-based access-control. 
A Python program was written to retrieve live data through the REDCap API and persist them to an SQLite database, preserving data relationships. SQL enabled the retrieval of complex datasets, while Python allowed for scripted data computation and analysis. Only cost-free software libraries were used and the sample code was made available through a GitHub repository. CONCLUSIONS: A REDCap-based data-collection and data-analysis workflow, suitable for high-volume relational data analysis on live data, was developed and successfully implemented using open-source software.
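The retrieve-persist-query loop described above can be sketched with the Python standard library alone. This is not the authors' published code: the table layout and the field names (`record_id`, `scan_date`, `psa`) are hypothetical, and the API URL and token must be supplied by the reader. The request parameters (`token`, `content`, `format`, `type`) follow the REDCap record-export API as commonly documented, which should be checked against the local REDCap instance.

```python
import json
import sqlite3
import urllib.parse
import urllib.request


def fetch_redcap_records(api_url, token):
    """Export all records from a REDCap project as a list of dicts (network call)."""
    payload = urllib.parse.urlencode({
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
    }).encode()
    with urllib.request.urlopen(api_url, data=payload) as response:
        return json.load(response)


def persist_scans(conn, records):
    """Store records in two related tables (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS patient (record_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scan ("
        "record_id TEXT REFERENCES patient(record_id), "
        "scan_date TEXT, psa REAL, "
        "PRIMARY KEY (record_id, scan_date))"
    )
    for rec in records:
        conn.execute("INSERT OR IGNORE INTO patient VALUES (?)", (rec["record_id"],))
        conn.execute(
            "INSERT OR REPLACE INTO scan VALUES (?, ?, ?)",
            (rec["record_id"], rec["scan_date"], float(rec["psa"])),
        )
    conn.commit()


def scans_per_patient(conn):
    """Relational query: number of scans and mean PSA per patient."""
    return conn.execute(
        "SELECT p.record_id, COUNT(s.scan_date), AVG(s.psa) "
        "FROM patient p JOIN scan s ON s.record_id = p.record_id "
        "GROUP BY p.record_id ORDER BY p.record_id"
    ).fetchall()


# Offline demonstration with made-up records shaped like a flat JSON export.
demo_records = [
    {"record_id": "1", "scan_date": "2020-01-10", "psa": "0.5"},
    {"record_id": "1", "scan_date": "2020-06-10", "psa": "1.5"},
    {"record_id": "2", "scan_date": "2020-03-01", "psa": "2.0"},
]
demo_conn = sqlite3.connect(":memory:")
persist_scans(demo_conn, demo_records)
demo_summary = scans_per_patient(demo_conn)
```

In a live setting, `fetch_redcap_records` would replace the made-up records, and rerunning the script would refresh the SQLite copy before each analysis.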
Subjects
Data Analysis , Software , Humans , Workflow , Reproducibility of Results , Databases, Factual
ABSTRACT
Diffuse large B-cell lymphoma (DLBCL) is the most common lymphoid neoplasm in dogs and in humans. It is characterized by a remarkable degree of clinical heterogeneity that is not completely elucidated by molecular data. This poses a major barrier to understanding the disease and its response to therapy, or when treating dogs with DLBCL within clinical trials. We performed an integrated analysis of exome (n = 77) and RNA sequencing (n = 43) data in a cohort of canine DLBCL to define the genetic landscape of this tumor. A wide range of signaling pathways and cellular processes were found in common with human DLBCL, but the frequencies of the most recurrently mutated genes (TRAF3, SETD2, POT1, TP53, MYC, FBXW7, DDX3X and TBL1XR1) differed. We developed a prognostic model integrating exonic variants and clinical and transcriptomic features to predict the outcome in dogs with DLBCL. These results comprehensively define the genetic drivers of canine DLBCL and can be prospectively utilized to identify new therapeutic opportunities.
Subjects
Lymphoma, Large B-Cell, Diffuse , Animals , Dogs , Genomics , Humans , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Large B-Cell, Diffuse/therapy , Lymphoma, Large B-Cell, Diffuse/veterinary , Signal Transduction
ABSTRACT
OBJECTIVE: The full phenotypic expression of non-alcoholic fatty liver disease (NAFLD) in lean subjects is incompletely characterised. We aimed to investigate prevalence, characteristics and long-term prognosis of Caucasian lean subjects with NAFLD. DESIGN: The study cohort comprises 1339 biopsy-proven NAFLD subjects from four countries (Italy, UK, Spain and Australia), stratified into lean and non-lean (body mass index (BMI) ≥25 kg/m2). Liver/non-liver-related events and survival free of transplantation were recorded during the follow-up, compared by log-rank testing and reported by adjusted HR. RESULTS: Lean patients represented 14.4% of the cohort and were predominantly of Italian origin (89%). They had less severe histological disease (lean vs non-lean: non-alcoholic steatohepatitis 54.1% vs 71.2% p<0.001; advanced fibrosis 10.1% vs 25.2% p<0.001), lower prevalence of diabetes (9.2% vs 31.4%, p<0.001), but no significant differences in the prevalence of the PNPLA3 I148M variant (p=0.57). During a median follow-up of 94 months (>10 483 person-years), 4.7% of lean vs 7.7% of non-lean patients reported liver-related events (p=0.37). No difference in survival was observed compared with non-lean NAFLD (p=0.069). CONCLUSIONS: Caucasian lean subjects with NAFLD may progress to advanced liver disease, develop metabolic comorbidities and experience cardiovascular disease (CVD) as well as liver-related mortality, independent of longitudinal progression to obesity and PNPLA3 genotype. These patients represent one end of a wide spectrum of phenotypic expression of NAFLD where the disease manifests at lower overall BMI thresholds. LAY SUMMARY: NAFLD may affect and progress in both obese and lean individuals. Lean subjects are predominantly males, have a younger age at diagnosis and are more prevalent in some geographic areas. During the follow-up, lean subjects can develop hepatic and extrahepatic disease, including metabolic comorbidities, in the absence of weight gain. 
These patients represent one end of a wide spectrum of phenotypic expression of NAFLD.
Subjects
Non-alcoholic Fatty Liver Disease/complications , Thinness/complications , White People , Adult , Body Mass Index , Cohort Studies , Female , Humans , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/mortality , Non-alcoholic Fatty Liver Disease/pathology , Prognosis , Survival Rate , Thinness/mortality , Thinness/pathology
ABSTRACT
The high cosine similarity between some single-base substitution mutational signatures and their characteristic flat profiles could suggest the presence of overfitting and mathematical artefacts. The newest version (v3.3) of the signature database available in the Catalogue Of Somatic Mutations In Cancer (COSMIC) provides a collection of 79 mutational signatures, more than double the number in the previous version (30 profiles available in COSMIC signatures v2), which makes the associations between signatures and specific mutagenic processes more critical. This study provides a systematic assessment of the de novo extraction task through simulation scenarios based on the latest version of the COSMIC signatures and highlights, through a novel approach using archetypal analysis, which COSMIC signatures are redundant and more likely to be mathematical artefacts. Twenty-nine archetypes were able to reconstruct the profiles of all the COSMIC signatures with cosine similarity > 0.8. Interestingly, these archetypes tend to group similar original signatures sharing either the same aetiology or similar biological processes. We believe that these findings will be useful to encourage the development of new de novo extraction methods that avoid redundancy of information among the signatures while preserving the biological interpretation.
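The cosine similarity mentioned above can be illustrated with a short sketch. The 96-channel profiles below are made-up vectors, not COSMIC signatures, and the function names are hypothetical. The example shows why flat profiles are hard to tell apart: two nearly flat vectors score close to 1 even when every channel differs by 20%.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two signature profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0:
        raise ValueError("profiles must not be all zeros")
    return dot / norm


def redundant_pairs(signatures, threshold):
    """Return the name pairs whose profiles exceed the similarity threshold."""
    names = sorted(signatures)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine_similarity(signatures[a], signatures[b]) > threshold
    ]


# Hypothetical 96-channel profiles (each sums to 1).
flat = [1 / 96] * 96
noisy_flat = [(0.8 if k % 2 == 0 else 1.2) / 96 for k in range(96)]
peaked = [1.0] + [0.0] * 95

toy_signatures = {"flat": flat, "noisy_flat": noisy_flat, "peaked": peaked}
similar = redundant_pairs(toy_signatures, threshold=0.8)
```

Here only the two flat-like profiles exceed the 0.8 threshold, while the single-peak profile stays clearly distinct from both.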
ABSTRACT
BACKGROUND & AIMS: Non-invasive scoring systems (NSS) are used to identify patients with non-alcoholic fatty liver disease (NAFLD) who are at risk of advanced fibrosis, but their reliability in predicting long-term outcomes for hepatic/extrahepatic complications or death and their concordance in cross-sectional and longitudinal risk stratification remain uncertain. METHODS: The most common NSS (NFS, FIB-4, BARD, APRI) and the Hepamet fibrosis score (HFS) were assessed in 1,173 European patients with NAFLD from tertiary centres. Performance for fibrosis risk stratification and for the prediction of long-term hepatic/extrahepatic events, hepatocarcinoma (HCC) and overall mortality were evaluated in terms of AUC and Harrell's c-index. For longitudinal data, NSS-based Cox proportional hazard models were trained on the whole cohort with repeated 5-fold cross-validation, sampling for testing from the 607 patients with all NSS available. RESULTS: Cross-sectional analysis revealed HFS as the best performer for the identification of significant (F0-1 vs. F2-4, AUC = 0.758) and advanced (F0-2 vs. F3-4, AUC = 0.805) fibrosis, while NFS and FIB-4 showed the best performance for detecting histological cirrhosis (range AUCs 0.85-0.88). Considering longitudinal data (follow-up between 62 and 110 months), NFS and FIB-4 were the best at predicting liver-related events (c-indices>0.7), NFS for HCC (c-index = 0.9 on average), and FIB-4 and HFS for overall mortality (c-indices >0.8). All NSS showed limited performance (c-indices <0.7) for extrahepatic events. CONCLUSIONS: Overall, NFS, HFS and FIB-4 outperformed APRI and BARD for both cross-sectional identification of fibrosis and prediction of long-term outcomes, confirming that they are useful tools for the clinical management of patients with NAFLD at increased risk of fibrosis and liver-related complications or death. 
LAY SUMMARY: Non-invasive scoring systems are increasingly being used in patients with non-alcoholic fatty liver disease to identify those at risk of advanced fibrosis and hence clinical complications. Herein, we compared various non-invasive scoring systems and identified those that were best at identifying risk, as well as those that were best for the prediction of long-term outcomes, such as liver-related events, liver cancer and death.
Subjects
Non-alcoholic Fatty Liver Disease/complications , Predictive Value of Tests , Research Design/standards , Time , Adult , Area Under Curve , Cross-Sectional Studies , Female , Humans , Liver/pathology , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/mortality , Prognosis , ROC Curve , Reproducibility of Results , Research Design/trends , Severity of Illness Index
ABSTRACT
BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and for precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that this task is indeed still open.
Subjects
Carcinogenesis/genetics , Disease Progression , Machine Learning , Medical Oncology/instrumentation , Neoplasms/genetics , Precision Medicine/instrumentation , Neoplasms/pathology
ABSTRACT
Protein stability predictions are becoming essential in medicine to develop novel immunotherapeutic agents and for drug discovery. Despite the large number of computational approaches for predicting the protein stability upon mutation, there are still critical unsolved problems: 1) the limited number of thermodynamic measurements for proteins provided by current databases; 2) the large intrinsic variability of ΔΔG values due to different experimental conditions; 3) biases in the development of predictive methods caused by ignoring the anti-symmetry of ΔΔG values between mutant and native protein forms; 4) over-optimistic prediction performance, due to sequence similarity between proteins used in training and test datasets. Here, we review these issues, highlighting new challenges required to improve current tools and to achieve more reliable predictions. In addition, we provide a perspective of how these methods will be beneficial for designing novel precision medicine approaches for several genetic disorders caused by mutations, such as cancer and neurodegenerative diseases.
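The anti-symmetry property in point 3 states that the stability change of a mutation and that of its reverse mutation should be equal in magnitude and opposite in sign, ΔΔG(A→B) = −ΔΔG(B→A). One common way to check a predictor against this property, sketched below, is to compare its predictions for forward and reverse mutations: the Pearson correlation should approach −1 and the mean bias should approach 0. The function names and the numbers are illustrative assumptions, not values from any published method.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def antisymmetry_report(forward, reverse):
    """Return (correlation, bias) for paired forward/reverse ddG predictions.

    forward[k] is the predicted ddG for mutation A->B and reverse[k] the
    prediction for B->A. A perfectly anti-symmetric predictor gives
    correlation -1 and bias 0.
    """
    bias = sum(f + r for f, r in zip(forward, reverse)) / (2 * len(forward))
    return pearson(forward, reverse), bias


# Hypothetical predictions (kcal/mol) for four mutation pairs.
forward = [1.2, -0.5, 2.0, 0.3]
ideal_reverse = [-value for value in forward]
shifted_reverse = [-value + 1.0 for value in forward]

r_ideal, bias_ideal = antisymmetry_report(forward, ideal_reverse)
r_shifted, bias_shifted = antisymmetry_report(forward, shifted_reverse)
```

The shifted example shows why both numbers are needed: a predictor that adds a constant offset to every prediction still reaches a correlation of −1, and only the non-zero bias reveals the violation.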
ABSTRACT
Protein solubility is a key aspect of many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of protein solubility may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes, such as amyloidosis. Here we present SKADE, a novel neural network protein solubility predictor, and we show how it can provide novel insight into protein solubility mechanisms thanks to its neural attention architecture. First, we show that SKADE compares favourably with state-of-the-art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this feature to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the amino acids, they suggest that the N- and C-termini are the most relevant regions for solubility prediction and are predictive of complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large-scale in silico mutagenesis of proteins in order to maximize their solubility.
Subjects
Computational Biology/methods , Nerve Net/physiology , Solubility , Algorithms , Amino Acid Sequence/physiology , Amino Acids , Animals , Computer Simulation , Humans , Models, Molecular , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Software
ABSTRACT
Frataxin (FXN) is a highly conserved protein found in prokaryotes and eukaryotes that is required for efficient regulation of cellular iron homeostasis. Experimental evidence associates amino acid substitutions in FXN with Friedreich Ataxia, a neurodegenerative disorder. Recently, new thermodynamic experiments have been performed to study the impact on protein stability of somatic variations identified in cancer tissues. The Critical Assessment of Genome Interpretation (CAGI) data provider at the University of Rome measured the unfolding free energy of a set of variants (FXN challenge data set) with far-UV circular dichroism and intrinsic fluorescence spectra. These values have been used to calculate the change in unfolding free energy between the variant and wild-type proteins at zero concentration of denaturant (ΔΔGH2O). The FXN challenge data set, composed of eight amino acid substitutions, was used to evaluate the performance of current computational methods in predicting the ΔΔGH2O value associated with the variants and in classifying them as destabilizing or not destabilizing. For the fifth edition of CAGI, six independent research groups from Asia, Australia, Europe, and North America submitted 12 sets of predictions from different approaches. In this paper, we report the results of our assessment and discuss the limitations of the tested algorithms.
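The two quantities evaluated in the challenge, the ΔΔGH2O value and the destabilizing / not destabilizing label, can be illustrated with a short calculation. This is a sketch under stated assumptions, not the assessors' protocol: the subtraction order (variant minus wild type, so that negative values mean a less stable variant), the -1.0 kcal/mol cutoff, and the free-energy values are all hypothetical choices made for the example.

```python
def ddg_unfolding(dg_variant, dg_wild_type):
    """Change in unfolding free energy (kcal/mol), assumed here as variant minus wild type."""
    return dg_variant - dg_wild_type


def classify(ddg, threshold=-1.0):
    """Label a variant; the -1.0 kcal/mol cutoff is a hypothetical choice."""
    return "destabilizing" if ddg <= threshold else "not destabilizing"


# Hypothetical unfolding free energies at zero denaturant concentration (kcal/mol).
wild_type_dg = 8.0
variant_dg = {"variant_1": 6.5, "variant_2": 7.75}

for name, dg in variant_dg.items():
    ddg = ddg_unfolding(dg, wild_type_dg)
    print(name, ddg, classify(ddg))
```

With the opposite sign convention (wild type minus variant), the comparison in `classify` would have to be flipped, which is one reason the convention should always be stated alongside reported ΔΔG values.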
Subjects
Amino Acid Substitution , Iron-Binding Proteins/chemistry , Iron-Binding Proteins/genetics , Algorithms , Circular Dichroism , Humans , Models, Molecular , Protein Conformation , Protein Folding , Protein Stability , Frataxin
ABSTRACT
Canine malignant melanoma (MM) is a highly aggressive tumour with a low survival rate and represents an ideal spontaneous model for the human counterpart. Considerable progress has recently been made, but therapeutic success in canine melanoma remains challenging. Little is known about the mechanisms behind pathogenesis and melanoma development, and the molecular response to radiotherapy has never been explored before. A faster and deeper understanding of cancer mutational processes and developmental mechanisms is now possible through next generation sequencing technologies. In this study, we matched whole exome and transcriptome sequencing in four dogs affected by MM, at diagnosis and at disease progression, to identify possible genetic mechanisms associated with therapy failure. In agreement with previous studies, a genetic similarity between canine MM and its human counterpart was observed. Several somatic mutations were functionally related to the MAPK, PI3K/AKT and p53 signalling pathways, but were located in genes other than BRAF, RAS and KIT. At disease progression, several mutations were related to therapy effects. Natural killer cell-mediated cytotoxicity and several immune-system-related pathways were activated, opening a new scenario on the microenvironment in this tumour. In conclusion, this study suggests a potential role of the immune system in association with radiotherapy in canine melanoma, but a larger sample size and functional studies are needed.
Subjects
Dog Diseases/radiotherapy , Melanoma/veterinary , Transcriptome/radiation effects , Animals , Base Sequence , Chromosome Aberrations , Dogs , Female , Gene Expression Profiling , Gene Expression Regulation, Neoplastic/radiation effects , Male , Melanoma/radiotherapy , Mutation
ABSTRACT
Surface active maghemite nanoparticles (SAMNs) are able to recognize and bind selected proteins in complex biological systems, forming a hard protein corona. Upon a 5-min incubation in bovine whey from mastitis-affected cows, a significant enrichment of a single peptide with a molecular weight of 4338 Da, originating from the proteolysis of αS1-casein, was observed. Notably, among the large number of macromolecules in bovine milk, the detection of this specific peptide can hardly be accomplished by conventional analytical techniques. The selective formation of a stable binding between the peptide and SAMNs is due to the stability gained by adsorption-induced surface restructuring of the nanomaterial. We attributed the surface recognition properties of SAMNs to the chelation of iron(III) sites on their surface by sterically compatible carboxylic groups of the peptide. The specific peptide recognition by SAMNs allows its easy determination by MALDI-TOF mass spectrometry, and a threshold value of its normalized peak intensity was identified by a logistic regression approach and suggested for the rapid diagnosis of the pathology. Thus, the present report proposes the analysis of the hard protein corona on nanomaterials as a perspective for developing fast analytical procedures for the diagnosis of mastitis in cows. Moreover, the huge simplification of proteome complexity obtained by exploiting the selectivity derived from the peculiar SAMN surface topography, due to the iron(III) distribution pattern, could be of general interest, leading to competitive applications in food science and in biomedicine and allowing the rapid determination of hidden biomarkers by a cutting-edge diagnostic strategy.
Graphical abstract: The topography of iron(III) sites on surface active maghemite nanoparticles (SAMNs) allows the recognition of sterically compatible carboxylic groups on proteins and peptides in complex biological matrices. The analysis of the hard protein corona on SAMNs led to the determination of a biomarker for cow mastitis in milk by MALDI-TOF mass spectrometry.
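The abstract states that a diagnostic threshold on the normalized peak intensity was identified by logistic regression. The sketch below shows how such a threshold can be derived in principle: fit P(mastitis) = sigmoid(b0 + b1 * intensity) and solve for the intensity at which the probability reaches 0.5. The data, the plain gradient-ascent fit, the 0.5 cutoff, and all names are hypothetical and are not taken from the study.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, lr=0.5, steps=20000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = yi - sigmoid(b0 + b1 * xi)
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def intensity_threshold(b0, b1, p=0.5):
    """Intensity at which the fitted probability equals p."""
    return (math.log(p / (1.0 - p)) - b0) / b1


# Hypothetical normalized peak intensities; label 1 = mastitis, 0 = healthy.
intensity = [0.05, 0.10, 0.15, 0.20, 0.35, 0.30, 0.45, 0.60, 0.70, 0.85]
mastitis = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(intensity, mastitis)
print("threshold:", intensity_threshold(b0, b1))
```

Choosing a probability other than 0.5 shifts the threshold and trades sensitivity against specificity, which is a design decision for the diagnostic use case.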
Subjects
Ferric Compounds/chemistry , Mastitis, Bovine/diagnosis , Milk Proteins/analysis , Nanoparticles/chemistry , Protein Corona/analysis , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods , Whey/chemistry , Amino Acid Sequence , Animals , Biomarkers/analysis , Cattle , Female , Milk/chemistry , Models, Molecular , Peptides/analysis , Proteomics/methods
ABSTRACT
SNPs&GO is a machine learning method for predicting the association of single amino acid variations (SAVs) with disease, taking protein functional annotation into account. The method is a binary classifier that implements a support vector machine algorithm to discriminate between disease-related and neutral SAVs. SNPs&GO combines information from protein sequence with functional annotation encoded by gene ontology (GO) terms. Tested in sequence mode on more than 38,000 SAVs from the SwissVar dataset, our method reached 81% overall accuracy and an area under the receiver operating characteristic curve of 0.88 with a low false-positive rate. In almost all editions of the Critical Assessment of Genome Interpretation (CAGI) experiments, SNPs&GO ranked among the most accurate algorithms for predicting the effect of SAVs. In this paper, we summarize the best results obtained by SNPs&GO on disease-related variations in four CAGI challenges involving the following genes: CHEK2 (CAGI 2010), RAD50 (CAGI 2011), p16-INK (CAGI 2013), and NAGLU (CAGI 2016). Result evaluation provides insights into the accuracy of our algorithm and the relevance of GO terms in annotating the effect of the variants. It also helps to define good practices for the detection of deleterious SAVs.
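The two figures of merit quoted above, overall accuracy and the area under the receiver operating characteristic curve (AUC), can be computed for any binary classifier that outputs a score. The sketch below is illustrative only: the labels, the scores, and the 0.5 decision cutoff are hypothetical, and the AUC is computed through its rank interpretation (the probability that a randomly chosen disease-related variant scores higher than a randomly chosen neutral one, with ties counted as one half).

```python
def accuracy(labels, scores, cutoff=0.5):
    """Fraction of variants whose thresholded score matches the label."""
    correct = sum((score >= cutoff) == bool(label) for label, score in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = disease-related, 0 = neutral) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

print(accuracy(labels, scores))  # 0.5
print(roc_auc(labels, scores))   # 0.75
```

Accuracy depends on the chosen cutoff, whereas AUC summarizes ranking quality across all cutoffs, which is why the two are usually reported together.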
Subjects
Amino Acid Substitution , Checkpoint Kinase 2/genetics , Computational Biology/methods , Cyclin-Dependent Kinase Inhibitor p16/genetics , DNA Repair Enzymes/genetics , DNA-Binding Proteins/genetics , alpha-N-Acetylgalactosaminidase/genetics , Acid Anhydride Hydrolases , Algorithms , Gene Ontology , Genetic Predisposition to Disease , Humans , Molecular Sequence Annotation , ROC Curve , Support Vector Machine
ABSTRACT
MOTIVATION: Molecular recognition of N-terminal targeting peptides is the most common mechanism controlling the import of nuclear-encoded proteins into mitochondria and chloroplasts. When experimental information is lacking, computational methods can annotate targeting peptides and determine their cleavage sites in order to characterize protein localization, function, and mature protein sequences. The problem of discriminating mitochondrial from chloroplastic propeptides is particularly relevant when annotating proteomes of photosynthetic eukaryotes, which are endowed with both types of sequences. RESULTS: Here, we introduce TPpred3, a computational method that, given any eukaryotic protein sequence, performs three different tasks: (i) the detection of targeting peptides; (ii) their classification as mitochondrial or chloroplastic and (iii) the precise localization of the cleavage sites in an organelle-specific framework. Our implementation builds on our previously introduced TPpred. Here, we integrate a new N-to-1 Extreme Learning Machine specifically designed for the classification task (ii). For the last task, we introduce an organelle-specific Support Vector Machine that exploits sequence motifs retrieved with an extensive motif-discovery analysis of a large set of mitochondrial and chloroplastic proteins. We show that TPpred3 outperforms the state-of-the-art methods in all three tasks. AVAILABILITY AND IMPLEMENTATION: The method server and datasets are available at http://tppred3.biocomp.unibo.it. CONTACT: gigi@biocomp.unibo.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.