Results 1 - 20 of 38
1.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37255310

ABSTRACT

MOTIVATION: The prediction of reliable Drug-Target Interactions (DTIs) is a key task in computer-aided drug design and repurposing. Here, we present a new approach based on data fusion for DTI prediction built on top of the NXTfusion library, which generalizes the Matrix Factorization paradigm by extending it to the nonlinear inference over Entity-Relation graphs. RESULTS: We benchmarked our approach on five datasets and we compared our models against state-of-the-art methods. Our models outperform most of the existing methods and, simultaneously, retain the flexibility to predict both DTIs as binary classification and regression of the real-valued drug-target affinity, competing with models built explicitly for each task. Moreover, our findings suggest that the validation of DTI methods should be stricter than what has been proposed in some previous studies, focusing more on mimicking real-life DTI settings where predictions for previously unseen drugs, proteins, and drug-protein pairs are needed. These settings are exactly the context in which the benefit of integrating heterogeneous information with our Entity-Relation data fusion approach is the most evident. AVAILABILITY AND IMPLEMENTATION: All software and data are available at https://github.com/eugeniomazzone/CPI-NXTFusion and https://pypi.org/project/NXTfusion/.


Subject(s)
Drug Development , Software , Proteins , Drug Interactions , Drug Design
2.
Nucleic Acids Res ; 50(3): e16, 2022 02 22.
Article in English | MEDLINE | ID: mdl-34792168

ABSTRACT

In many cases, the unprecedented availability of data provided by high-throughput sequencing has shifted the bottleneck from data availability to data interpretation, thus delaying the promised breakthroughs in genetics and precision medicine in human genetics, and in phenotype prediction to improve plant adaptation to climate change and resistance to bioaggressors in plant sciences. In this paper, we propose a novel Genome Interpretation paradigm, which aims at directly modeling the genotype-to-phenotype relationship, and we focus on A. thaliana since it is the best studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes in/phenotypes out paradigm and it is trained to predict 288 real-valued Arabidopsis thaliana phenotypes from Whole Genome sequencing data. We show that 75 of these phenotypes are predicted with a Pearson correlation ≥0.4, and are mostly related to flowering traits. We show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models predicting single phenotypes from the GWAS-derived known associated genes. Galiana is also fully interpretable, thanks to the Saliency Maps gradient-based approaches. We followed this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, finding evidence for 6 of them in the existing literature.


Subject(s)
Arabidopsis , Arabidopsis/genetics , Genome , Genome-Wide Association Study , Genotype , Neural Networks, Computer , Phenotype , Whole Genome Sequencing
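
The selection criterion quoted in the abstract above (phenotypes predicted with a Pearson correlation ≥0.4) can be illustrated with a minimal sketch. The function names and the toy phenotype names and values are assumptions made for illustration only; they are not taken from Galiana or its data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def well_predicted(observed, predicted, threshold=0.4):
    """Names of phenotypes whose predictions reach r >= threshold.

    Both arguments map a phenotype name to a list of values,
    one value per accession, in the same order.
    """
    return [name for name in observed
            if pearson(observed[name], predicted[name]) >= threshold]


# Hypothetical toy values, for illustration only.
observed = {"flowering_time": [1, 2, 3, 4], "leaf_count": [1, 2, 3, 4]}
predicted = {"flowering_time": [1.1, 1.9, 3.2, 3.8], "leaf_count": [4, 1, 3, 2]}
selected = well_predicted(observed, predicted)
```

With these toy values only the first phenotype passes the threshold, since the second one is anti-correlated with its predictions.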
3.
Bioinformatics ; 38(10): 2802-2809, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561176

ABSTRACT

MOTIVATION: Transcriptional regulation mechanisms allow cells to adapt and respond to external stimuli by altering gene expression. The possible cell transcriptional states are determined by the underlying gene regulatory network (GRN), and reliably inferring such a network would be invaluable to understand biological processes and disease progression. RESULTS: In this article, we present a novel method for the inference of GRNs, called PORTIA, which is based on robust precision matrix estimation, and we show that it compares favorably with state-of-the-art methods while being orders of magnitude faster. We extensively validated PORTIA using the DREAM and MERLIN+P datasets as benchmarks. In addition, we propose a novel scoring metric that builds on graph-theoretical concepts. AVAILABILITY AND IMPLEMENTATION: The code and instructions for data acquisition and full reproduction of our results are available at https://github.com/AntoinePassemiers/PORTIA-Manuscript. PORTIA is available on PyPI as a Python package (portia-grn). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Gene Regulatory Networks , Gene Expression Regulation
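
To make the idea of precision-matrix-based network inference concrete, here is a generic textbook sketch: estimate a covariance matrix from expression samples, shrink it toward its diagonal so it can be inverted, and read partial correlations off the inverse as undirected edge scores. This is not PORTIA's algorithm and does not use the portia-grn package; the shrinkage scheme, the function names and the toy data are all assumptions.

```python
def covariance(samples):
    """Sample covariance matrix; `samples` holds one row per experiment."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
             for j in range(p)] for i in range(p)]


def shrink(cov, lam):
    """Shrink toward the diagonal: (1 - lam) * cov + lam * diag(cov)."""
    p = len(cov)
    return [[cov[i][j] if i == j else (1 - lam) * cov[i][j] for j in range(p)]
            for i in range(p)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def partial_correlations(precision):
    """rho_ij = -P_ij / sqrt(P_ii * P_jj) for i != j, and 1 on the diagonal."""
    p = len(precision)
    return [[1.0 if i == j
             else -precision[i][j] / (precision[i][i] * precision[j][j]) ** 0.5
             for j in range(p)] for i in range(p)]


def edge_scores(samples, lam=0.1):
    """Undirected gene-gene scores from expression samples (illustrative only)."""
    return partial_correlations(invert(shrink(covariance(samples), lam)))


# Hypothetical toy expression matrix: 4 experiments x 3 genes.
toy_samples = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]
scores = edge_scores(toy_samples)
```

A zero entry in the precision matrix corresponds to conditional independence under a Gaussian model, which is why the inverse covariance, rather than the covariance itself, is the natural object for separating direct from indirect associations.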
4.
Nucleic Acids Res ; 49(W1): W52-W59, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34057475

ABSTRACT

We provide integrated protein sequence-based predictions via https://bio2byte.be/b2btools/. The aim of our predictions is to identify the biophysical behaviour or features of proteins that are not readily captured by structural biology and/or molecular dynamics approaches. Upload of a FASTA file or text input of a sequence provides integrated predictions from DynaMine backbone and side-chain dynamics, conformational propensities, and derived EFoldMine early folding, DisoMine disorder, and Agmata β-sheet aggregation. These predictions, several of which were previously not available online, capture 'emergent' properties of proteins, i.e. the inherent biophysical propensities encoded in their sequence, rather than context-dependent behaviour (e.g. final folded state). In addition, upload of a multiple sequence alignment (MSA) in a variety of formats enables exploration of the biophysical variation observed in homologous proteins. The associated plots indicate the biophysical limits of functionally relevant protein behaviour, with unusual residues flagged by a Gaussian mixture model analysis. The prediction results are available as JSON or CSV files and directly accessible via an API. Online visualisation is available as interactive plots, with brief explanations and tutorial pages included. The server and API employ an email-free token-based system that can be used to anonymously access previously generated results.


Subject(s)
Proteins/chemistry , Sequence Alignment , Sequence Analysis, Protein/methods , Software , Internet
5.
Bioinformatics ; 37(16): 2275-2281, 2021 Aug 25.
Article in English | MEDLINE | ID: mdl-33560405

ABSTRACT

MOTIVATION: Modern bioinformatics is facing increasingly complex problems to solve, and we are indeed rapidly approaching an era in which the ability to seamlessly integrate heterogeneous sources of information will be crucial for scientific progress. Here, we present a novel non-linear data fusion framework that generalizes the conventional matrix factorization paradigm allowing inference over arbitrary entity-relation graphs, and we applied it to the prediction of protein-protein interactions (PPIs). Improving our knowledge of PPI networks at the proteome scale is indeed crucial to understand protein function, physiological and disease states and cell life in general. RESULTS: We devised three data fusion-based models for the proteome-level prediction of PPIs, and we show that our method outperforms state-of-the-art approaches on common benchmarks. Moreover, we investigate its predictions on newly published PPIs, showing that this new data has a clear shift in its underlying distributions and we thus train and test our models on this extended dataset. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
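
For readers unfamiliar with the conventional matrix factorization paradigm that the abstract above says the framework generalizes, the following sketch shows the plain linear baseline: each row and column entity gets a low-dimensional embedding, and an observed cell is approximated by the dot product of the two embeddings. It is not the non-linear data fusion framework itself; the function names, hyperparameters and the toy interaction matrix are assumptions.

```python
import random


def factorize(observed, n_rows, n_cols, rank=2, lr=0.05, reg=0.01,
              epochs=500, seed=0):
    """Fit observed[(i, j)] ~ dot(U[i], V[j]) by stochastic gradient descent.

    `observed` maps (row, column) pairs to known values; unobserved cells
    are simply absent. Returns (U, V, per-epoch mean squared error).
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    cells = sorted(observed)
    history = []
    for _ in range(epochs):
        rng.shuffle(cells)
        squared_error = 0.0
        for i, j in cells:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = observed[(i, j)] - pred
            squared_error += err * err
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularisation.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
        history.append(squared_error / len(cells))
    return U, V, history


def predict(U, V, i, j):
    """Score for cell (i, j), including cells never seen in training."""
    return sum(a * b for a, b in zip(U[i], V[j]))


# Hypothetical 3 x 3 interaction matrix with cell (2, 2) left unobserved.
toy = {(0, 0): 1, (0, 1): 1, (0, 2): 0,
       (1, 0): 1, (1, 1): 1, (1, 2): 0,
       (2, 0): 0, (2, 1): 0}
U, V, history = factorize(toy, n_rows=3, n_cols=3)
held_out_score = predict(U, V, 2, 2)
```

A data fusion setting would share the same entity embeddings across several such matrices (relations) at once; this single-matrix version only shows the building block.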

6.
Bioinformatics ; 37(20): 3473-3479, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33983381

ABSTRACT

MOTIVATION: Proteins able to undergo liquid-liquid phase separation (LLPS) in vivo and in vitro are drawing a lot of interest, due to their functional relevance for cell life. Nevertheless, the proteome-scale experimental screening of these proteins seems unfeasible, because besides being expensive and time-consuming, LLPS is heavily influenced by multiple environmental conditions such as concentration, pH and temperature, thus requiring a combinatorial number of experiments for each protein. RESULTS: To overcome this problem, we propose Droppler, a neural network model able to predict the LLPS behavior of proteins given specified experimental conditions, effectively predicting the outcome of in vitro experiments. Our model can be used to rapidly screen proteins and experimental conditions searching for LLPS, thus reducing the search space that needs to be covered experimentally. We experimentally validate Droppler's prediction on the TAR DNA-binding protein in different experimental conditions, showing the consistency of its predictions. AVAILABILITY AND IMPLEMENTATION: A Python implementation of Droppler is available at https://bitbucket.org/grogdrinker/droppler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.
Nucleic Acids Res ; 48(W1): W36-W40, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32459331

ABSTRACT

Nuclear magnetic resonance (NMR) spectroscopy data provides valuable information on the behaviour of proteins in solution. The primary data to determine when studying proteins are the per-atom NMR chemical shifts, which reflect the local environment of atoms and provide insights into amino acid residue dynamics and conformation. Within an amino acid residue, chemical shifts present multi-dimensional and complexly cross-correlated information, making them difficult to analyse. The ShiftCrypt method, based on neural network auto-encoder architecture, compresses the per-amino acid chemical shift information in a single, interpretable, amino acid-type independent value that reflects the biophysical state of a residue. We here present the ShiftCrypt web server, which makes the method readily available. The server accepts chemical shift input files in the NMR Exchange Format (NEF) or NMR-STAR format, executes ShiftCrypt and visualises the results, which are also accessible via an API. It also enables the "biophysically-based" pairwise alignment of two proteins based on their ShiftCrypt values. This approach uses Dynamic Time Warping and can optionally include their amino acid code information, and has applications in, for example, the alignment of disordered regions. The server uses a token-based system to ensure the anonymity of the users and results. The web server is available at www.bio2byte.be/shiftcrypt.


Subject(s)
Nuclear Magnetic Resonance, Biomolecular/methods , Proteins/chemistry , Software , Amino Acids/chemistry , Neural Networks, Computer , Protein Denaturation , Protein Folding , Protein Unfolding
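
The abstract above mentions that two proteins are aligned on their per-residue values with Dynamic Time Warping. The sketch below is the classic textbook DTW recurrence with an absolute-difference cost and a traceback, applied to two short numeric profiles. It is not the server's implementation: the cost function, the tie-breaking rule and the toy values are assumptions, and the optional amino acid code information is not modelled.

```python
def dtw_align(a, b):
    """Dynamic Time Warping between two non-empty numeric sequences.

    Returns (total cost, path), where path is a list of (index_in_a,
    index_in_b) pairs from the first residues to the last ones.
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # match
                                     cost[i - 1][j],      # stretch b[j-1]
                                     cost[i][j - 1])      # stretch a[i-1]
    # Trace the optimal warping path back from the end.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Hypothetical per-residue values; the second profile repeats one state.
distance, path = dtw_align([1, 2, 3], [1, 2, 2, 3])
```

In this toy case the repeated value in the second profile is absorbed by mapping one position of the first profile onto two positions of the second, which is the behaviour that makes DTW attractive for regions of unequal length.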
8.
BMC Biol ; 19(1): 3, 2021 01 13.
Article in English | MEDLINE | ID: mdl-33441128

ABSTRACT

BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open.


Subject(s)
Carcinogenesis/genetics , Disease Progression , Machine Learning , Medical Oncology/instrumentation , Neoplasms/genetics , Precision Medicine/instrumentation , Neoplasms/pathology
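
The bias-reduction rule described in the abstract above, keeping only genes that contain both driver and passenger variants, can be sketched as a simple filter. The record layout, the function name and the placeholder gene and variant identifiers are assumptions; the weighting procedure mentioned at the end of the abstract is not covered here.

```python
from collections import defaultdict


def keep_mixed_genes(variants):
    """Keep variants whose gene carries both labels.

    `variants` is a list of (gene, variant_id, label) tuples with label
    either "driver" or "passenger". Genes seen with only one label are
    dropped, so gene identity alone can no longer determine the label.
    """
    labels_by_gene = defaultdict(set)
    for gene, _, label in variants:
        labels_by_gene[gene].add(label)
    mixed = {gene for gene, labels in labels_by_gene.items()
             if {"driver", "passenger"} <= labels}
    return [record for record in variants if record[0] in mixed]


# Hypothetical records; gene and variant names are placeholders.
toy_variants = [
    ("GENE_A", "v1", "driver"),
    ("GENE_A", "v2", "passenger"),
    ("GENE_B", "v3", "driver"),
    ("GENE_C", "v4", "passenger"),
    ("GENE_C", "v5", "passenger"),
]
balanced = keep_mixed_genes(toy_variants)
```

On the toy input only the two GENE_A records survive, because GENE_B has drivers only and GENE_C has passengers only.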
9.
Bioinformatics ; 36(7): 2076-2081, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31904854

ABSTRACT

MOTIVATION: Protein beta-aggregation is an important but poorly understood phenomenon involved in diseases as well as in beneficial physiological processes. However, while it has been investigated for over 50 years, very little is known about its mechanisms of action. Moreover, the identification of regions involved in aggregation is still an open problem and the state-of-the-art methods are often inadequate in real case applications. RESULTS: In this article we present AgMata, an unsupervised tool for the identification of such regions from amino acid sequence based on a generalized definition of statistical potentials that includes biophysical information. The tool outperforms the state-of-the-art methods on two different benchmarks. As a case study, we applied our tool to human ataxin-3, a protein involved in Machado-Joseph disease. Interestingly, AgMata identifies aggregation-prone residues that share the very same structural environment. Additionally, it successfully predicts the outcome of in vitro mutagenesis experiments, identifying point mutations that lead to an alteration of the aggregation propensity of the wild-type ataxin-3. AVAILABILITY AND IMPLEMENTATION: A Python implementation of the tool is available at https://bitbucket.org/bio2byte/agmata. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machado-Joseph Disease , Proteins , Amino Acid Sequence , Ataxin-3 , Humans
10.
PLoS Comput Biol ; 16(4): e1007722, 2020 04.
Article in English | MEDLINE | ID: mdl-32352965

ABSTRACT

Protein solubility is a key aspect for many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of the solubility of proteins may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes such as amyloidosis. Here we present SKADE, a novel Neural Network protein solubility predictor and we show how it can provide novel insight into the protein solubility mechanisms, thanks to its neural attention architecture. First, we show that SKADE compares favorably with state-of-the-art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this property to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the amino acids, they suggest that N- and C-termini are the most relevant regions for solubility prediction and are predictive for complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large-scale in silico mutagenesis of proteins in order to maximize their solubility.


Subject(s)
Computational Biology/methods , Nerve Net/physiology , Solubility , Algorithms , Amino Acid Sequence/physiology , Amino Acids , Animals , Computer Simulation , Humans , Models, Molecular , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Software
11.
Bioinformatics ; 35(22): 4617-4623, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30994888

ABSTRACT

MOTIVATION: Eukaryotic cells contain different membrane-delimited compartments, which are crucial for the biochemical reactions necessary to sustain cell life. Recent studies showed that cells can also trigger the formation of membraneless organelles composed of phase-separated proteins to respond to various stimuli. These condensates provide new ways to control the reactions and phase-separation proteins (PSPs) are thus revolutionizing how cellular organization is conceived. The small number of experimentally validated proteins, and the difficulty in discovering them, remain bottlenecks in PSP research. RESULTS: Here we present PSPer, the first in silico screening tool for prion-like RNA-binding PSPs. We show that it can prioritize PSPs among proteins containing similar RNA-binding domains, intrinsically disordered regions and prions. PSPer is thus suitable to screen proteomes, identifying the most likely PSPs for further experimental investigation. Moreover, its predictions are fully interpretable in the sense that it assigns specific functional regions to the predicted proteins, providing valuable information for experimental investigation of targeted mutations on these regions. Finally, we show that it can estimate the ability of artificially designed proteins to form condensates (r=-0.87), thus providing an in silico screening tool for protein design experiments. AVAILABILITY AND IMPLEMENTATION: PSPer is available at bio2byte.com/psp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
RNA-Binding Proteins/metabolism , Organelles , Prions , Proteome
12.
Mult Scler ; 26(10): 1157-1162, 2020 09.
Article in English | MEDLINE | ID: mdl-32662757

ABSTRACT

BACKGROUND: We need high-quality data to assess the determinants for COVID-19 severity in people with MS (PwMS). Several studies have recently emerged but there is great benefit in aligning data collection efforts at a global scale. OBJECTIVES: Our mission is to scale up COVID-19 data collection efforts and provide the MS community with data-driven insights as soon as possible. METHODS: Numerous stakeholders were brought together. Small dedicated interdisciplinary task forces were created to speed up the formulation of the study design and work plan. The first step was to agree upon a COVID-19 MS core data set. Second, we worked on providing a user-friendly and rapid pipeline to share COVID-19 data at a global scale. RESULTS: The COVID-19 MS core data set was agreed within 48 hours. To date, 23 data collection partners are involved and the first data imports have been performed successfully. Data processing and analysis are ongoing. CONCLUSIONS: We reached a consensus on a core data set and established data sharing processes with multiple partners to address an urgent need for information to guide clinical practice. First results show that partners are motivated to share data to attain the ultimate joint goal: to better understand the effect of COVID-19 in PwMS.


Subject(s)
Coronavirus Infections/physiopathology , Multiple Sclerosis/therapy , Pneumonia, Viral/physiopathology , Registries , Betacoronavirus , COVID-19 , Coronavirus Infections/complications , Coronavirus Infections/therapy , Data Collection , Humans , Information Dissemination , International Cooperation , Multiple Sclerosis/complications , Pandemics , Pneumonia, Viral/complications , Pneumonia, Viral/therapy , Risk Factors , SARS-CoV-2 , Treatment Outcome
13.
Bioinformatics ; 34(18): 3118-3125, 2018 09 15.
Article in English | MEDLINE | ID: mdl-29684140

ABSTRACT

Motivation: Evolutionary information is crucial for the annotation of proteins in bioinformatics. The number of retrieved homologs often correlates with the quality of predicted protein annotations related to structure or function. With a growing number of sequences available, fast and reliable methods for homology detection are essential, as they have a direct impact on predicted protein annotations. Results: We developed a discriminative, alignment-free algorithm for homology detection with quasi-linear complexity, enabling theoretically much faster homology searches. To reach this goal, we convert the protein sequence into numeric biophysical representations. These are shrunk to a fixed length using a novel vector quantization method which uses a Discrete Cosine Transform compression. We then compute, for each compressed representation, similarity scores between proteins with the Dynamic Time Warping algorithm and we feed them into a Random Forest. The performance of WARP is comparable with state-of-the-art methods. Availability and implementation: The method is available at http://ibsquare.be/warp. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing , Proteins/chemistry , Algorithms , Amino Acid Sequence , Data Compression , Molecular Sequence Annotation , Software , Time Factors
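
The abstract above describes shrinking variable-length numeric profiles to a fixed length with a Discrete Cosine Transform compression. The sketch below shows one simple way to do this, assumed here purely for illustration: compute a DCT-II of the profile and keep only the first few low-frequency coefficients. It is plain DCT truncation, not the paper's vector quantization method, and the function names and scaling are assumptions.

```python
import math


def dct2(signal):
    """Unnormalised DCT-II: X_k = sum_n x_n * cos(pi / N * (n + 0.5) * k)."""
    size = len(signal)
    return [sum(x * math.cos(math.pi / size * (n + 0.5) * k)
                for n, x in enumerate(signal))
            for k in range(size)]


def compress(signal, length):
    """Fixed-length summary of a non-empty profile of arbitrary length.

    Coefficients are divided by the profile length so that profiles of
    different lengths are on a comparable scale; profiles shorter than
    `length` are zero-padded.
    """
    if not signal:
        raise ValueError("signal must be non-empty")
    coefficients = [c / len(signal) for c in dct2(signal)]
    return (coefficients + [0.0] * length)[:length]


# Two hypothetical profiles of different length map to vectors of length 4.
short_vec = compress([0.2, 0.9, 0.4], 4)
long_vec = compress([0.1 * (i % 7) for i in range(50)], 4)
```

Because the low-frequency coefficients capture the overall shape of a profile, two proteins of different length end up with vectors of identical size that can then be compared directly.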
14.
Nucleic Acids Res ; 45(15): e140, 2017 Sep 06.
Article in English | MEDLINE | ID: mdl-28911095

ABSTRACT

To further our understanding of the complexity and genetic heterogeneity of rare diseases, it has become essential to shed light on how combinations of variants in different genes are responsible for a disease phenotype. With the appearance of a resource on digenic diseases, it has become possible to evaluate how digenic combinations differ in terms of the phenotypes they produce. All instances in this resource were assigned to two classes of digenic effects, annotated as true digenic and composite classes. Whereas in the true digenic class variants in both genes are required for developing the disease, in the composite class, a variant in one gene is sufficient to produce the phenotype, but an additional variant in a second gene impacts the disease phenotype or alters the age of onset. We show that a combination of variant, gene and higher-level features can differentiate between these two classes with high accuracy. Moreover, we show via the analysis of three digenic disorders that a digenic effect decision profile, extracted from the predictive model, motivates why an instance was assigned to either of the two classes. Together, our results show that digenic disease data generates novel insights, providing a glimpse into the oligogenic realm.


Subject(s)
Epistasis, Genetic/physiology , Genetic Diseases, Inborn/genetics , Mutation/physiology , Computational Biology/methods , Datasets as Topic , Genetic Association Studies/methods , Genetic Diseases, Inborn/diagnosis , Genetic Predisposition to Disease , Humans , Models, Genetic , Phenotype , Prognosis , Validation Studies as Topic
15.
Nucleic Acids Res ; 45(W1): W201-W206, 2017 07 03.
Article in English | MEDLINE | ID: mdl-28498993

ABSTRACT

High-throughput sequencing methods are generating enormous amounts of genomic data, giving unprecedented insights into human genetic variation and its relation to disease. An individual human genome contains millions of Single Nucleotide Variants: to discriminate the deleterious from the benign ones, a variety of methods have been developed that predict whether a protein-coding variant likely affects the carrier individual's health. We present such a method, DEOGEN2, which incorporates heterogeneous information about the molecular effects of the variants, the domains involved, the relevance of the gene and the interactions in which it participates. This extensive contextual information is non-linearly mapped into one single deleteriousness score for each variant. Since for the non-expert user it is sometimes still difficult to assess what this score means, how it relates to the encoded protein, and where it originates from, we developed an interactive online framework (http://deogen2.mutaframe.com/) to better present the DEOGEN2 deleteriousness predictions of all possible variants in all human proteins. The prediction is visualized so both expert and non-expert users can gain insights into the meaning, protein context and origins of each prediction.


Subject(s)
Amino Acid Substitution , Proteins/genetics , Software , Computer Graphics , Genetic Variation , Humans , Internet , Protein Domains/genetics , Protein Folding
16.
Bioinformatics ; 33(24): 3902-3908, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-28666322

ABSTRACT

MOTIVATION: Methods able to provide reliable protein alignments are crucial for many bioinformatics applications. In recent years many different algorithms have been developed and various kinds of information, from sequence conservation to secondary structure, have been used to improve alignment performance. This is especially relevant for proteins with highly divergent sequences. However, recent works suggest that different features may have different importance in diverse protein classes and it would be an advantage to have more customizable approaches, capable of dealing with different alignment definitions. RESULTS: Here we present Rigapollo, a highly flexible pairwise alignment method based on a pairwise HMM-SVM that can use any type of information to build alignments. Rigapollo lets the user decide the optimal features to align their protein class of interest. It outperforms current state-of-the-art methods on two well-known benchmark datasets when aligning highly divergent sequences. AVAILABILITY AND IMPLEMENTATION: A Python implementation of the algorithm is available at http://ibsquare.be/rigapollo. CONTACT: wim.vranken@vub.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Sequence Alignment/methods , Sequence Analysis, Protein/methods , Support Vector Machine , Algorithms , Markov Chains , Protein Structure, Secondary , Proteins/chemistry , Software
17.
Hum Mutat ; 38(1): 86-94, 2017 01.
Article in English | MEDLINE | ID: mdl-27667481

ABSTRACT

Cysteines are among the rarest amino acids in nature, and are both functionally and structurally very important for proteins. The ability of cysteines to form disulfide bonds is especially relevant, both for constraining the folded state of the protein and for performing enzymatic duties. But how does the variation record of human proteins reflect their functional importance and structural role, especially with regard to deleterious mutations? We created HUMCYS, a manually curated dataset of single amino acid variants that (1) have a known disease/neutral phenotypic outcome and (2) cause the loss of a cysteine, in order to investigate how mutated cysteines relate to structural aspects such as surface accessibility and cysteine oxidation state. We also have developed a sequence-based in silico cysteine oxidation predictor to overcome the scarcity of experimentally derived oxidation annotations, and applied it to extend our analysis to classes of proteins for which the experimental determination of their structure is technically challenging, such as transmembrane proteins. Our investigation shows that we can gain insights into the reason behind the outcome of cysteine losses in otherwise uncharacterized proteins, and we discuss the possible molecular mechanisms leading to deleterious phenotypes, such as the involvement of the mutated cysteine in a structurally or enzymatically relevant disulfide bond.


Subject(s)
Cysteine/genetics , Models, Biological , Mutation , Oxidation-Reduction , Algorithms , Amino Acid Substitution , Codon , Computational Biology/methods , Databases, Genetic , Genetic Association Studies , Humans , Intracellular Space/metabolism , Polymorphism, Single Nucleotide , Protein Transport , Reproducibility of Results , Software , Web Browser
18.
Bioinformatics ; 32(12): 1797-804, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27153718

ABSTRACT

MOTIVATION: There are now many predictors capable of identifying the likely phenotypic effects of single nucleotide variants (SNVs) or short in-frame Insertions or Deletions (INDELs) on the increasing amount of genome sequence data. Most of these predictors focus on SNVs and use a combination of features related to sequence conservation, biophysical, and/or structural properties to link the observed variant to either neutral or disease phenotype. Despite notable successes, the mapping between genetic variants and their phenotypic effects is riddled with levels of complexity that are not yet fully understood and that are often not taken into account in the predictions, despite their promise of significantly improving the prediction of deleterious mutants. RESULTS: We present DEOGEN, a novel variant effect predictor that can handle both missense SNVs and in-frame INDELs. By integrating information from different biological scales and mimicking the complex mixture of effects that lead from the variant to the phenotype, we obtain significant improvements in the variant-effect prediction results. Next to the typical variant-oriented features based on the evolutionary conservation of the mutated positions, we added a collection of protein-oriented features that are based on functional aspects of the gene affected. We cross-validated DEOGEN on 36 825 polymorphisms, 20 821 deleterious SNVs, and 1038 INDELs from SwissProt. The multilevel contextualization of each (variant, protein) pair in DEOGEN provides a 10% improvement of MCC with respect to current state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: The software and the data presented here are publicly available at http://ibsquare.be/deogen CONTACT: wvranken@vub.ac.be SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins/genetics , Databases, Protein , Genetic Variation , INDEL Mutation , Software
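
The abstract above reports its improvement in terms of MCC (Matthews correlation coefficient). For reference, this is the standard definition computed from a binary confusion matrix; the function name and the convention of returning 0.0 when the denominator vanishes are choices made for this sketch, and no values from the paper are reproduced.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when any marginal is empty (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

MCC is often preferred over accuracy for variant-effect data sets because it stays informative when deleterious and neutral classes are imbalanced, as with the class sizes listed in the abstract.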
19.
Biophys J ; 110(3): 572-583, 2016 Feb 02.
Article in English | MEDLINE | ID: mdl-26840723

ABSTRACT

Protein folding is in its early stages largely determined by the protein sequence and complex local interactions between amino acids, resulting in lower energy conformations that provide the context for further folding into the native state. We compiled a comprehensive data set of early folding residues based on pulsed labeling hydrogen deuterium exchange experiments. These early folding residues have corresponding higher backbone rigidity as predicted by DynaMine from sequence, an effect also present when accounting for the secondary structures in the folded protein. We then show that the amino acids involved in early folding events are not more conserved than others, but rather, early folding fragments and the secondary structure elements they are part of show a clear trend toward conserving a rigid backbone. We therefore propose that backbone rigidity is a fundamental physical feature conserved by proteins that can provide important insights into their folding mechanisms and stability.


Subject(s)
Molecular Dynamics Simulation , Protein Folding , Amino Acid Sequence , Cytochromes c/chemistry , Molecular Sequence Data , Protein Binding , Protein Conformation
20.
Bioinformatics ; 31(8): 1219-25, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25492406

ABSTRACT

MOTIVATION: Cysteine residues have particular structural and functional relevance in proteins because of their ability to form covalent disulfide bonds. Bioinformatics tools that can accurately predict cysteine bonding states are already available, whereas it remains challenging to infer the disulfide connectivity pattern of unknown protein sequences. Improving accuracy in this area is highly relevant for the structural and functional annotation of proteins. RESULTS: We predict the intra-chain disulfide bond connectivity patterns starting from known cysteine bonding states with an evolutionary-based unsupervised approach called Sephiroth that relies on high-quality alignments obtained with HHblits and is based on a coarse-grained cluster-based modeling of tandem cysteine mutations within a protein family. We compared our method with state-of-the-art unsupervised predictors and achieve a performance improvement of 25-27% while requiring an order of magnitude fewer aligned homologous sequences (∼10^3 instead of ∼10^4). AVAILABILITY AND IMPLEMENTATION: The software described in this article and the datasets used are available at http://ibsquare.be/sephiroth. CONTACT: wvranken@vub.ac.be SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.


Subject(s)
Algorithms , Cysteine/chemistry , Disulfides/chemistry , Models, Statistical , Proteins/chemistry , Software , Amino Acid Sequence , Cluster Analysis , Cysteine/classification , Cysteine/genetics , Humans , Molecular Sequence Data , Mutation/genetics , Proteins/analysis , Proteins/genetics , Sequence Homology