1.
Bioinformatics; 40(9), 2024 09 02.
Article in English | MEDLINE | ID: mdl-39177091

ABSTRACT

MOTIVATION: Circulating cell-free DNA (cfDNA) is widely explored as a noninvasive biomarker for cancer screening and diagnosis. The ability to decode the cells of origin in cfDNA would provide biological insights into pathophysiological mechanisms, aiding in cancer characterization and directing clinical management and follow-up. RESULTS: We developed a DNA methylation signature-based deconvolution algorithm, MetDecode, for cancer tissue origin identification. We built a reference atlas exploiting de novo and published whole-genome methylation sequencing data for colorectal, breast, ovarian, and cervical cancer, and blood-cell-derived entities. MetDecode models the contributors absent in the atlas with methylation patterns learnt on-the-fly from the input cfDNA methylation profiles. In addition, our model accounts for the coverage of each marker region to alleviate potential sources of noise. In silico experiments showed a limit of detection down to 2.88% of tumor tissue contribution in cfDNA. MetDecode produced Pearson correlation coefficients above 0.95 and outperformed other methods in simulations (P < 0.001; one-sided t-test). In plasma cfDNA profiles from cancer patients, MetDecode assigned the correct tissue-of-origin in 84.2% of cases. In conclusion, MetDecode can unravel alterations in the cfDNA pool components by accurately estimating the contribution of multiple tissues, even when supplied with an imperfect reference atlas. AVAILABILITY AND IMPLEMENTATION: MetDecode is available at https://github.com/JorisVermeeschLab/MetDecode.
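
As a rough illustration of the coverage-aware deconvolution idea in this abstract, the sketch below estimates tissue proportions by coverage-weighted non-negative least squares, solved with multiplicative updates. It is not MetDecode's actual implementation: the abstract does not state the objective or the optimizer, the modelling of contributors absent from the atlas is left out, and the function name, the toy atlas, and the coverage values are assumptions made for this example.

```python
def deconvolve(atlas, sample, coverage, n_iter=2000, eps=1e-12):
    """Estimate tissue proportions from methylation ratios.

    atlas[m][t]  methylation ratio (0..1) of tissue t at marker m (toy values)
    sample[m]    methylation ratio observed in the cfDNA sample at marker m
    coverage[m]  read depth at marker m, used as the weight of that marker
    """
    n_markers, n_tissues = len(atlas), len(atlas[0])
    x = [1.0 / n_tissues] * n_tissues  # start from a uniform mixture
    for _ in range(n_iter):
        # Predicted sample profile under the current mixture.
        pred = [sum(atlas[m][t] * x[t] for t in range(n_tissues))
                for m in range(n_markers)]
        # Multiplicative update: keeps every proportion non-negative.
        for t in range(n_tissues):
            num = sum(coverage[m] * atlas[m][t] * sample[m] for m in range(n_markers))
            den = sum(coverage[m] * atlas[m][t] * pred[m] for m in range(n_markers))
            x[t] *= num / (den + eps)
    total = sum(x)
    return [v / total for v in x]  # report as proportions summing to 1


# Toy atlas: 6 marker regions x 3 hypothetical tissues.
ATLAS = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.1],
    [0.2, 0.8, 0.2],
    [0.1, 0.2, 0.8],
]
TRUE_MIX = [0.7, 0.2, 0.1]
SAMPLE = [sum(a * p for a, p in zip(row, TRUE_MIX)) for row in ATLAS]
COVERAGE = [30, 10, 50, 20, 40, 15]

estimate = deconvolve(ATLAS, SAMPLE, COVERAGE)
print([round(v, 3) for v in estimate])
```

Weighting each marker by its read depth means that poorly covered regions, whose methylation ratios are noisier, pull less on the estimate.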


Subjects
Algorithms; Biomarkers, Tumor; Cell-Free Nucleic Acids; DNA Methylation; Neoplasms; Humans; Neoplasms/genetics; Cell-Free Nucleic Acids/blood; Biomarkers, Tumor/blood
2.
Sci Rep; 14(1): 18243, 2024 08 06.
Article in English | MEDLINE | ID: mdl-39107347

ABSTRACT

Individual Specific Networks (ISNs) are a tool used in computational biology to infer individual-specific relationships between biological entities from omics data. ISNs provide insights into how the interactions among these entities affect their respective functions. To address the scarcity of solutions for efficiently computing ISNs on large biological datasets, we present ISN-tractor, a data-agnostic, highly optimized Python library to build and analyse ISNs. ISN-tractor demonstrates superior scalability and efficiency in generating ISNs when compared to existing methods such as LionessR, both in terms of time and memory usage, allowing ISNs to be used on large datasets. We show how ISN-tractor can be applied to real-life datasets, including The Cancer Genome Atlas (TCGA) and HapMap, showcasing its versatility. ISN-tractor can be used to build ISNs from various omics data types, including transcriptomics, proteomics, and genotype arrays, and can detect distinct patterns of gene interactions within and across cancer types. We also show how Filtration Curves provide valuable insights into ISN characteristics, revealing topological distinctions among individuals with different clinical outcomes. Additionally, ISN-tractor can effectively cluster populations based on genetic relationships, as demonstrated with Principal Component Analysis on HapMap data.
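
To make the notion of an individual-specific edge concrete, the sketch below uses the leave-one-out formulation popularised by LIONESS: the edge for individual q is N * (e_all - e_without_q) + e_without_q, where e is here the Pearson correlation between two features. Whether ISN-tractor computes exactly this quantity is not stated in the abstract, so treat it as an assumption; the function names and the toy measurements are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def individual_edges(x, y):
    """One edge weight per individual for the feature pair (x, y)."""
    n = len(x)
    e_all = pearson(x, y)  # edge in the network built on everyone
    edges = []
    for q in range(n):
        # Edge in the network built without individual q.
        e_minus = pearson(x[:q] + x[q + 1:], y[:q] + y[q + 1:])
        edges.append(n * (e_all - e_minus) + e_minus)
    return edges


# Toy data: six individuals, two features; the last one breaks the trend.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [1.1, 2.0, 2.9, 4.2, 5.1, 0.0]
edges = individual_edges(gene_a, gene_b)
print([round(e, 2) for e in edges])
```

The individual whose measurements disagree with the population-level relationship receives a strongly negative edge, which is the kind of per-sample signal an ISN is meant to expose. The naive loop recomputes the correlation N times per edge, which hints at why an optimized implementation matters on large datasets.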


Subjects
Computational Biology; Humans; Computational Biology/methods; Gene Regulatory Networks; Neoplasms/genetics; Software; Proteomics/methods; Algorithms
3.
Sci Rep; 13(1): 19449, 2023 11 09.
Article in English | MEDLINE | ID: mdl-37945674

ABSTRACT

High-throughput sequencing allowed the discovery of many disease variants, but it is becoming clear that the abundance of genomics data has mostly just moved the bottleneck in Genetics and Precision Medicine from a data availability issue to a data interpretation issue. To solve this impasse it would be beneficial to apply the latest Deep Learning (DL) methods to the Genome Interpretation (GI) problem, similarly to what AlphaFold did for Structural Biology. Unfortunately DL requires large datasets to be viable, and aggregating genomics datasets poses several legal, ethical and infrastructural complications. Federated Learning (FL) is a Machine Learning (ML) paradigm designed to tackle these issues. It allows ML methods to be collaboratively trained and tested on collections of physically separate datasets, without requiring the actual centralization of sensitive data. FL could thus be key to enabling DL applications to GI on sufficiently large genomics data. We propose FedCrohn, an FL GI Neural Network model for exome-based Crohn's Disease risk prediction, providing a proof-of-concept that FL is a viable paradigm to build novel ML GI approaches. We benchmark it in several realistic scenarios, showing that FL can indeed provide performance similar to conventional ML on centralized data, and that collaborating in FL initiatives is likely beneficial for most of the medical centers participating in them.
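
The training loop that Federated Learning implies can be sketched with Federated Averaging (FedAvg), a widely used aggregation scheme: each site trains on its own data and only model parameters are sent back and averaged, weighted by the number of local samples. The abstract does not say which aggregation rule FedCrohn uses, and the model here is a toy linear regression rather than a neural network, so every name and value below is an assumption for illustration.

```python
def local_update(w, b, data, lr=0.05, epochs=20):
    """Gradient descent on one site's private data; raw data never leaves."""
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def federated_average(updates, sizes):
    """Server step: average parameters, weighted by local sample counts."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return w, b


def train(clients, rounds=50):
    w, b = 0.0, 0.0  # global model shared with every site
    for _ in range(rounds):
        updates = [local_update(w, b, data) for data in clients]
        w, b = federated_average(updates, [len(d) for d in clients])
    return w, b


# Three hypothetical centers, each holding points from y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (0.5, 2.0), (1.5, 4.0), (2.5, 6.0)],
]
w, b = train(clients)
print(round(w, 3), round(b, 3))
```

Only the pairs `(w, b)` cross site boundaries, which is the property that lets centers collaborate without centralizing sensitive genomic data.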


Subjects
Crohn Disease; Exome; Humans; Exome/genetics; Crohn Disease/genetics; Genomics; Benchmarking; High-Throughput Nucleotide Sequencing
4.
BMC Biol; 19(1): 3, 2021 01 13.
Article in English | MEDLINE | ID: mdl-33441128

ABSTRACT

BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that this task is indeed still open.
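
The construction bias described in this abstract can be illustrated with a deliberately naive baseline that ignores the variant entirely and only memorizes which genes carried drivers in the training set. This is a toy demonstration, not the authors' benchmark: the gene names, the tiny data sets, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_gene_baseline(train):
    """Map each gene to the majority label seen for it during training."""
    votes = defaultdict(Counter)
    for gene, label in train:
        votes[gene][label] += 1
    return {gene: c.most_common(1)[0][0] for gene, c in votes.items()}


def accuracy(model, test, default="passenger"):
    """Unseen genes fall back to the majority class of typical data sets."""
    hits = sum(model.get(gene, default) == label for gene, label in test)
    return hits / len(test)


# Biased construction: drivers sit in two genes, passengers elsewhere.
train = (
    [("GENE_A", "driver")] * 3
    + [("GENE_B", "driver")] * 2
    + [("GENE_C", "passenger"), ("GENE_D", "passenger"), ("GENE_E", "passenger")]
)
biased_test = [
    ("GENE_A", "driver"), ("GENE_B", "driver"),
    ("GENE_F", "passenger"), ("GENE_G", "passenger"),
]
# Balanced construction: every gene carries both classes.
balanced_test = [
    ("GENE_A", "driver"), ("GENE_A", "passenger"),
    ("GENE_B", "driver"), ("GENE_B", "passenger"),
]

model = fit_gene_baseline(train)
biased_acc = accuracy(model, biased_test)
balanced_acc = accuracy(model, balanced_test)
print(biased_acc, balanced_acc)
```

On the biased split the gene lookup looks perfect although it knows nothing about variant-level effects; once every gene contains both classes, the same lookup cannot do better than chance.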


Subjects
Carcinogenesis/genetics; Disease Progression; Machine Learning; Medical Oncology/instrumentation; Neoplasms/genetics; Precision Medicine/instrumentation; Neoplasms/pathology
5.
PLoS Comput Biol; 16(4): e1007722, 2020 04.
Article in English | MEDLINE | ID: mdl-32352965

ABSTRACT

Protein solubility is a key aspect for many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of the solubility of proteins may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes such as amyloidosis. Here we present SKADE, a novel Neural Network protein solubility predictor, and we show how it can provide novel insight into the protein solubility mechanisms, thanks to its neural attention architecture. First, we show that SKADE compares favorably with state-of-the-art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this property to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the amino acids, they suggest that N- and C-termini are the most relevant regions for solubility prediction and are predictive for complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large-scale in silico mutagenesis of proteins in order to maximize their solubility.
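
The in silico mutagenesis mentioned at the end of this abstract amounts to scoring every single-residue substitution and ranking the mutants by the predicted change. The sketch below shows that loop only. `toy_score` is a placeholder with no biological validity (it is not SKADE and not a solubility model); in practice it would be replaced by a call to a real sequence-based predictor. The example sequence and all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Placeholder scorer, NOT a solubility model: fraction of D/E/K/R."""
    return sum(seq.count(aa) for aa in "DEKR") / len(seq)


def scan_mutations(seq, score=toy_score):
    """Score all single substitutions; return (delta, label), best first."""
    base = score(seq)
    results = []
    for pos, wild in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == wild:
                continue
            mutant = seq[:pos] + aa + seq[pos + 1:]
            label = f"{wild}{pos + 1}{aa}"  # e.g. M1D, 1-based position
            results.append((score(mutant) - base, label))
    results.sort(reverse=True)
    return results


ranked = scan_mutations("MKTAYIAKQR")
print(ranked[:3])
print(ranked[-3:])
```

A sequence of length L yields 19 * L single mutants, so the scan stays cheap for a sequence-only predictor even on long proteins.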


Subjects
Computational Biology/methods; Nerve Net/physiology; Solubility; Algorithms; Amino Acid Sequence/physiology; Amino Acids; Animals; Computer Simulation; Humans; Models, Molecular; Protein Conformation; Proteins/chemistry; Proteins/metabolism; Software
6.
Sci Rep; 9(1): 16932, 2019 11 15.
Article in English | MEDLINE | ID: mdl-31729443

ABSTRACT

Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training an ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acid properties that are intuitively relevant for many structural and functional aspects of proteins, and are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and we show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the "biologically meaningful" scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and "real" propensity scales by properly prioritizing the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumption-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.
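
The three kinds of input features compared in this abstract can be written down in a few lines: a one-hot encoding, an encoding through a propensity scale, and a randomized scale in which the same values are reassigned to different amino acids. The Kyte-Doolittle hydropathy scale is used here only as a familiar example; the abstract does not list which scales were tested, and the shuffling shown is one plausible reading of "randomization experiments", not the paper's protocol.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, chosen as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(seq):
    """One 20-dimensional indicator vector per residue."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS] for aa in seq]


def scale_encode(seq, scale):
    """One scalar per residue, looked up in a propensity scale."""
    return [scale[aa] for aa in seq]


def randomized_scale(scale, seed=0):
    """Same values, reassigned to amino acids at random (control feature)."""
    values = list(scale.values())
    random.Random(seed).shuffle(values)
    return dict(zip(scale.keys(), values))


peptide = "IRAC"
print(one_hot(peptide)[0])
print(scale_encode(peptide, KYTE_DOOLITTLE))
print(scale_encode(peptide, randomized_scale(KYTE_DOOLITTLE)))
```

If a model performs equally well on the real and the shuffled scale, the scale acts as an arbitrary residue label and its biophysical meaning is not what the model exploits.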


Subjects
Computational Biology/methods; Machine Learning; Sequence Analysis, Protein/methods; Amino Acids/chemistry; Biophysical Phenomena; Cysteine; Oxidation-Reduction; Propensity Score; Proteins/chemistry; Solvents/chemistry
7.
Hum Mutat; 38(1): 86-94, 2017 01.
Article in English | MEDLINE | ID: mdl-27667481

ABSTRACT

Cysteines are among the rarest amino acids in nature, and are both functionally and structurally very important for proteins. The ability of cysteines to form disulfide bonds is especially relevant, both for constraining the folded state of the protein and for performing enzymatic duties. But how does the variation record of human proteins reflect their functional importance and structural role, especially with regard to deleterious mutations? We created HUMCYS, a manually curated dataset of single amino acid variants that (1) have a known disease/neutral phenotypic outcome and (2) cause the loss of a cysteine, in order to investigate how mutated cysteines relate to structural aspects such as surface accessibility and cysteine oxidation state. We also developed a sequence-based in silico cysteine oxidation predictor to overcome the scarcity of experimentally derived oxidation annotations, and applied it to extend our analysis to classes of proteins for which the experimental determination of their structure is technically challenging, such as transmembrane proteins. Our investigation shows that we can gain insights into the reasons behind the outcome of cysteine losses in otherwise uncharacterized proteins, and we discuss the possible molecular mechanisms leading to deleterious phenotypes, such as the involvement of the mutated cysteine in a structurally or enzymatically relevant disulfide bond.


Subjects
Cysteine/genetics; Models, Biological; Mutation; Oxidation-Reduction; Algorithms; Amino Acid Substitution; Codon; Computational Biology/methods; Databases, Genetic; Genetic Association Studies; Humans; Intracellular Space/metabolism; Polymorphism, Single Nucleotide; Protein Transport; Reproducibility of Results; Software; Web Browser
8.
PLoS One; 10(7): e0131792, 2015.
Article in English | MEDLINE | ID: mdl-26161671

ABSTRACT

Disulfide bonds are crucial for many structural and functional aspects of proteins. They have a stabilizing role during folding, can regulate enzymatic activity and can trigger allosteric changes in the protein structure. Moreover, knowledge of the topology of the disulfide connectivity can be relevant in genomic annotation tasks and can provide long-range constraints for ab initio protein structure predictors. In this paper we describe PhyloCys, a novel unsupervised predictor of disulfide bond connectivity from known cysteine oxidation states. For each query protein, PhyloCys retrieves and aligns homologs with HHblits and builds a phylogenetic tree using ClustalW. A simplified model of cysteine co-evolution is then applied to the tree in order to hypothesize the presence of oxidized cysteines in the inner nodes of the tree, which represent ancestral protein sequences. The tree is then traversed from the leaves to the root and the putative disulfide connectivity is inferred by observing repeated patterns of tandem mutations between a sequence and its ancestors. A final correction is applied using the Edmonds-Gabow maximum weight perfect matching algorithm. The evolutionary approach applied in PhyloCys results in disulfide bond predictions equivalent to Sephiroth, another approach that takes whole sequence information into account, and is 26-29% better than state-of-the-art methods based on cysteine covariance patterns in multiple sequence alignments, while requiring one order of magnitude fewer homologous sequences (10^3 instead of 10^4), thus extending its range of applicability. The software described in this article and the datasets used are available at http://ibsquare.be/phylocys.
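
The final correction step named in this abstract is a maximum weight perfect matching: every oxidized cysteine must be paired with exactly one partner so that the total pairing score is maximal. The sketch below solves the same problem by exhaustive search instead of the Edmonds-Gabow algorithm, which is adequate for the handful of cysteines in a typical protein but scales factorially. The cysteine positions and the pair scores are invented for the example, and how PhyloCys derives its scores from the tree is not reproduced here.

```python
def best_pairing(weights, nodes=None):
    """Return (total score, pairs) of the best perfect matching.

    weights maps (i, j) with i < j to a pairing score; missing pairs are
    treated as forbidden. Exhaustive search, not Edmonds-Gabow.
    """
    if nodes is None:
        nodes = sorted({n for pair in weights for n in pair})
    if not nodes:
        return 0.0, []
    first, rest = nodes[0], nodes[1:]
    best = (float("-inf"), [])
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        sub_score, sub_pairs = best_pairing(weights, remaining)
        score = weights.get((first, partner), float("-inf")) + sub_score
        if score > best[0]:
            best = (score, [(first, partner)] + sub_pairs)
    return best


# Hypothetical co-evolution scores between four oxidized cysteines.
pair_scores = {
    (5, 12): 0.2, (5, 30): 0.9, (5, 44): 0.1,
    (12, 30): 0.3, (12, 44): 0.8, (30, 44): 0.4,
}
score, pairs = best_pairing(pair_scores)
print(pairs, round(score, 2))
```

Solving for the whole pairing at once prevents locally attractive pairs from leaving other cysteines without a plausible partner, which greedy selection of the highest-scoring pairs could do.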


Subjects
Computational Biology/methods; Cysteine/genetics; Disulfides/chemistry; Mutation; Algorithms; Cysteine/chemistry; Cysteine/classification; Evolution, Molecular; Internet; Models, Genetic; Oxidation-Reduction; Phylogeny; Reproducibility of Results; Software
9.
Bioinformatics; 31(8): 1219-25, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25492406

ABSTRACT

MOTIVATION: Cysteine residues have particular structural and functional relevance in proteins because of their ability to form covalent disulfide bonds. Bioinformatics tools that can accurately predict cysteine bonding states are already available, whereas it remains challenging to infer the disulfide connectivity pattern of unknown protein sequences. Improving accuracy in this area is highly relevant for the structural and functional annotation of proteins. RESULTS: We predict the intra-chain disulfide bond connectivity patterns starting from known cysteine bonding states with an evolutionary-based unsupervised approach called Sephiroth that relies on high-quality alignments obtained with HHblits and is based on a coarse-grained cluster-based modeling of tandem cysteine mutations within a protein family. We compared our method with state-of-the-art unsupervised predictors and achieved a performance improvement of 25-27% while requiring an order of magnitude fewer aligned homologous sequences (∼10^3 instead of ∼10^4). AVAILABILITY AND IMPLEMENTATION: The software described in this article and the datasets used are available at http://ibsquare.be/sephiroth. CONTACT: wvranken@vub.ac.be SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.


Subjects
Algorithms; Cysteine/chemistry; Disulfides/chemistry; Models, Statistical; Proteins/chemistry; Software; Amino Acid Sequence; Cluster Analysis; Cysteine/classification; Cysteine/genetics; Humans; Molecular Sequence Data; Mutation/genetics; Proteins/analysis; Proteins/genetics; Sequence Homology