Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 82
Filtrar
Mais filtros

Base de dados
País/Região como assunto
Tipo de documento
Intervalo de ano de publicação
1.
Nucleic Acids Res ; 51(19): 10162-10175, 2023 10 27.
Artigo em Inglês | MEDLINE | ID: mdl-37739408

RESUMO

Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.


Assuntos
Bactérias , Bactérias/citologia , Bactérias/genética , Bases de Dados Factuais , Microbiota , Filogenia , RNA Ribossômico 16S/genética , Fenômenos Fisiológicos Bacterianos
2.
Nat Rev Genet ; 23(6): 322-323, 2022 06.
Artigo em Inglês | MEDLINE | ID: mdl-35338359

Assuntos
Algoritmos
3.
Nucleic Acids Res ; 49(22): 12673-12691, 2021 12 16.
Artigo em Inglês | MEDLINE | ID: mdl-34850938

RESUMO

Synonymous single nucleotide variants (sSNVs) are common in the human genome but are often overlooked. However, sSNVs can have significant biological impact and may lead to disease. Existing computational methods for evaluating the effect of sSNVs suffer from the lack of gold-standard training/evaluation data and exhibit over-reliance on sequence conservation signals. We developed synVep (synonymous Variant effect predictor), a machine learning-based method that overcomes both of these limitations. Our training data was a combination of variants reported by gnomAD (observed) and those unreported, but possible in the human genome (generated). We used positive-unlabeled learning to purify the generated variant set of any likely unobservable variants. We then trained two sequential extreme gradient boosting models to identify subsets of the remaining variants putatively enriched and depleted in effect. Our method attained 90% precision/recall on a previously unseen set of variants. Furthermore, although synVep does not explicitly use conservation, its scores correlated with evolutionary distances between orthologs in cross-species variation analysis. synVep was also able to differentiate pathogenic vs. benign variants, as well as splice-site disrupting variants (SDV) vs. non-SDVs. Thus, synVep provides an important improvement in annotation of sSNVs, allowing users to focus on variants that most likely harbor effects.


Assuntos
Variação Genética , Aprendizado de Máquina , Doença/genética , Genoma Humano , Humanos , Proteínas/genética , Splicing de RNA
4.
Nucleic Acids Res ; 49(W1): W535-W540, 2021 07 02.
Artigo em Inglês | MEDLINE | ID: mdl-33999203

RESUMO

Since 1992 PredictProtein (https://predictprotein.org) is a one-stop online resource for protein sequence analysis with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability, and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.


Assuntos
Conformação Proteica , Software , Sítios de Ligação , Proteínas do Nucleocapsídeo de Coronavírus/química , Proteínas de Ligação a DNA/química , Fosfoproteínas/química , Estrutura Secundária de Proteína , Proteínas/química , Proteínas/fisiologia , Proteínas de Ligação a RNA/química , Alinhamento de Sequência , Análise de Sequência de Proteína
5.
Hum Genet ; 141(10): 1615-1627, 2022 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-35347416

RESUMO

Infertility is a major reproductive health issue that affects about 12% of women of reproductive age in the United States. Aneuploidy in eggs accounts for a significant proportion of early miscarriage and in vitro fertilization failure. Recent studies have shown that genetic variants in several genes affect chromosome segregation fidelity and predispose women to a higher incidence of egg aneuploidy. However, the exact genetic causes of aneuploid egg production remain unclear, making it difficult to diagnose infertility based on individual genetic variants in mother's genome. In this study, we evaluated machine learning-based classifiers for predicting the embryonic aneuploidy risk in female IVF patients using whole-exome sequencing data. Using two exome datasets, we obtained an area under the receiver operating curve of 0.77 and 0.68, respectively. High precision could be traded off for high specificity in classifying patients by selecting different prediction score cutoffs. For example, a strict prediction score cutoff of 0.7 identified 29% of patients as high-risk with 94% precision. In addition, we identified MCM5, FGGY, and DDX60L as potential aneuploidy risk genes that contribute the most to the predictive power of the model. These candidate genes and their molecular interaction partners are enriched for meiotic-related gene ontology categories and pathways, such as microtubule organizing center and DNA recombination. In summary, we demonstrate that sequencing data can be mined to predict patients' aneuploidy risk thus improving clinical diagnosis. The candidate genes and pathways we identified are promising targets for future aneuploidy studies.


Assuntos
Infertilidade , Diagnóstico Pré-Implantação , Aneuploidia , DNA , Feminino , Fertilização in vitro , Humanos , Gravidez , Sequenciamento do Exoma
6.
J Lipid Res ; 62: 100046, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-33587919

RESUMO

Lecithin:retinol acyltransferase and retinol-binding protein enable vitamin A (VA) storage and transport, respectively, maintaining tissue homeostasis of retinoids (VA derivatives). The precarious VA status of the lecithin:retinol acyltransferase-deficient (Lrat-/-) retinol-binding protein-deficient (Rbp-/-) mice rapidly deteriorates upon dietary VA restriction, leading to signs of severe vitamin A deficiency (VAD). As retinoids impact gut morphology and functions, VAD is often linked to intestinal pathological conditions and microbial dysbiosis. Thus, we investigated the contribution of VA storage and transport to intestinal retinoid homeostasis and functionalities. We showed the occurrence of intestinal VAD in Lrat-/-Rbp-/- mice, demonstrating the critical role of both pathways in preserving gut retinoid homeostasis. Moreover, in the mutant colon, VAD resulted in a compromised intestinal barrier as manifested by reduced mucins and antimicrobial defense, leaky gut, increased inflammation and oxidative stress, and altered mucosal immunocytokine profiles. These perturbations were accompanied by fecal dysbiosis, revealing that the VA status (sufficient vs. deficient), rather than the amount of dietary VA per se, is likely a major initial discriminant of the intestinal microbiome. Our data also pointed to a specific fecal taxonomic profile and distinct microbial functionalities associated with VAD. Overall, our findings revealed the suitability of the Lrat-/-Rbp-/- mice as a model to study intestinal dysfunctions and dysbiosis promoted by changes in tissue retinoid homeostasis induced by the host VA status and/or intake.


Assuntos
Vitamina A
7.
Nucleic Acids Res ; 47(21): e142, 2019 12 02.
Artigo em Inglês | MEDLINE | ID: mdl-31584091

RESUMO

Evaluating the impact of non-synonymous genetic variants is essential for uncovering disease associations and mechanisms of evolution. An in-depth understanding of sequence changes is also fundamental for synthetic protein design and stability assessments. However, the variant effect predictor performance gain observed in recent years has not kept up with the increased complexity of new methods. One likely reason for this might be that most approaches use similar sets of gene and protein features for modeling variant effects, often emphasizing sequence conservation. While high levels of conservation highlight residues essential for protein activity, much of the variation observable in vivo is arguably weaker in its impact, thus requiring evaluation at a higher level of resolution. Here, we describe functionNeutral/Toggle/Rheostatpredictor (funtrp), a novel computational method that categorizes protein positions based on the position-specific expected range of mutational impacts: Neutral (weak/no effects), Rheostat (function-tuning positions), or Toggle (on/off switches). We show that position types do not correlate strongly with familiar protein features such as conservation or protein disorder. We also find that position type distribution varies across different protein functions. Finally, we demonstrate that position types can improve performance of existing variant effect predictors and suggest a way forward for the development of new ones.


Assuntos
Biologia Computacional/métodos , Sequência Conservada/genética , Mutação/genética , Proteínas , Sequência de Aminoácidos/genética , Sequência de Bases/genética , Bases de Dados de Proteínas , Humanos , Modelos Moleculares , Proteínas/química , Proteínas/genética , Relação Estrutura-Atividade
8.
BMC Bioinformatics ; 21(1): 235, 2020 Jun 09.
Artigo em Inglês | MEDLINE | ID: mdl-32517697

RESUMO

BACKGROUND: The number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially, when bigger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction - a process called 'end-to-end learning' - has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually-curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN. RESULTS: By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. CONCLUSION: Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.


Assuntos
Aminoácidos/metabolismo , Biologia Computacional/métodos , Aprendizado Profundo/normas , Humanos
9.
Nucleic Acids Res ; 46(D1): D535-D541, 2018 01 04.
Artigo em Inglês | MEDLINE | ID: mdl-29112720

RESUMO

Microbial functional diversification is driven by environmental factors, i.e. microorganisms inhabiting the same environmental niche tend to be more functionally similar than those from different environments. In some cases, even closely phylogenetically related microbes differ more across environments than across taxa. While microbial similarities are often reported in terms of taxonomic relationships, no existing databases directly link microbial functions to the environment. We previously developed a method for comparing microbial functional similarities on the basis of proteins translated from their sequenced genomes. Here, we describe fusionDB, a novel database that uses our functional data to represent 1374 taxonomically distinct bacteria annotated with available metadata: habitat/niche, preferred temperature, and oxygen use. Each microbe is encoded as a set of functions represented by its proteome and individual microbes are connected via common functions. Users can search fusionDB via combinations of organism names and metadata. Moreover, the web interface allows mapping new microbial genomes to the functional spectrum of reference bacteria, rendering interactive similarity networks that highlight shared functionality. fusionDB provides a fast means of comparing microbes, identifying potential horizontal gene transfer events, and highlighting key environment-specific functionality.


Assuntos
Bases de Dados Factuais , Microbiota/fisiologia , Bactérias/classificação , Bactérias/genética , Fenômenos Fisiológicos Bacterianos , Proteínas de Bactérias/genética , Proteínas de Bactérias/fisiologia , Biodiversidade , Bases de Dados Genéticas , Microbiologia Ambiental , Transferência Genética Horizontal , Humanos , Internet , Metadados , Metagenômica , Filogenia , Synechococcus/classificação , Synechococcus/genética , Synechococcus/fisiologia , Interface Usuário-Computador
10.
Nucleic Acids Res ; 46(4): e23, 2018 02 28.
Artigo em Inglês | MEDLINE | ID: mdl-29194524

RESUMO

The vast majority of microorganisms on Earth reside in often-inseparable environment-specific communities-microbiomes. Meta-genomic/-transcriptomic sequencing could reveal the otherwise inaccessible functionality of microbiomes. However, existing analytical approaches focus on attributing sequencing reads to known genes/genomes, often failing to make maximal use of available data. We created faser (functional annotation of sequencing reads), an algorithm that is optimized to map reads to molecular functions encoded by the read-correspondent genes. The mi-faser microbiome analysis pipeline, combining faser with our manually curated reference database of protein functions, accurately annotates microbiome molecular functionality. mi-faser's minutes-per-microbiome processing speed is significantly faster than that of other methods, allowing for large scale comparisons. Microbiome function vectors can be compared between different conditions to highlight environment-specific and/or time-dependent changes in functionality. Here, we identified previously unseen oil degradation-specific functions in BP oil-spill data, as well as functional signatures of individual-specific gut microbiome responses to a dietary intervention in children with Prader-Willi syndrome. Our method also revealed variability in Crohn's Disease patient microbiomes and clearly distinguished them from those of related healthy individuals. Our analysis highlighted the microbiome role in CD pathogenicity, demonstrating enrichment of patient microbiomes in functions that promote inflammation and that help bacteria survive it.


Assuntos
Metagenômica/métodos , Microbiota , Anotação de Sequência Molecular/métodos , Algoritmos , Proteínas de Bactérias/fisiologia , Criança , Doença de Crohn/microbiologia , Humanos , Síndrome de Prader-Willi/microbiologia , Alinhamento de Sequência
11.
Hum Mutat ; 40(9): 1321-1329, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31144782

RESUMO

Venous thromboembolism (VTE) is a common hematological disorder. VTE affects millions of people around the world each year and can be fatal. Earlier studies have revealed the possible VTE genetic risk factors in Europeans. The 2018 Critical Assessment of Genome Interpretation (CAGI) challenge had asked participants to distinguish between 66 VTE and 37 non-VTE African American (AA) individuals based on their exome sequencing data. We used variants from AA VTE association studies and VTE genes from DisGeNET database to evaluate VTE risk via four different approaches; two of these methods were most successful at the task. Our best performing method represented each exome as a vector of predicted functional effect scores of variants within the known genes. These exome vectors were then clustered with k-means. This approach achieved 70.8% precision and 69.7% recall in identifying VTE patients. Our second-best ranked method had collapsed the variant effect scores into gene-level function changes, using the same vector clustering approach for patient/control identification. These results show predictability of VTE risk in AA population and highlight the importance of variant-driven gene functional changes in judging disease status. Of course, more in-depth understanding of AA VTE pathogenicity is still needed for more precise predictions.


Assuntos
Biologia Computacional/métodos , Sequenciamento do Exoma/métodos , Polimorfismo de Nucleotídeo Único , Tromboembolia Venosa/genética , Negro ou Afro-Americano/genética , Estudos de Casos e Controles , Feminino , Estudos de Associação Genética , Predisposição Genética para Doença , Humanos , Masculino , Análise de Componente Principal , Estados Unidos/etnologia , Tromboembolia Venosa/tratamento farmacológico , Tromboembolia Venosa/etnologia , Varfarina
12.
Hum Mutat ; 40(9): 1486-1494, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31268618

RESUMO

The recent years have seen a drastic increase in the amount of available genomic sequences. Alongside this explosion, hundreds of computational tools were developed to assess the impact of observed genetic variation. Critical Assessment of Genome Interpretation (CAGI) provides a platform to evaluate the performance of these tools in experimentally relevant contexts. In the CAGI-5 challenge assessing the 38 missense variants affecting the human Pericentriolar material 1 protein (PCM1), our SNAP-based submission was the top performer, although it did worse than expected from other evaluations. Here, we compare the CAGI-5 submissions, and 24 additional commonly used variant effect predictors, to analyze the reasons for this observation. We identified per residue conservation, structural, and functional PCM1 characteristics, which may be responsible. As expected, predictors had a hard time distinguishing effect variants in nonconserved positions. They were also better able to call effect variants in a structurally rich region than in a less-structured one; in the latter, they more often correctly identified benign than effect variants. Curiously, most of the protein was predicted to be functionally robust to mutation-a feature that likely makes it a harder problem for generalized variant effect predictors.


Assuntos
Autoantígenos/genética , Proteínas de Ciclo Celular/genética , Biologia Computacional/métodos , Mutação de Sentido Incorreto , Algoritmos , Autoantígenos/metabolismo , Proteínas de Ciclo Celular/metabolismo , Bases de Dados Genéticas , Predisposição Genética para Doença , Humanos
13.
Hum Mutat ; 40(9): 1495-1506, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31184403

RESUMO

Thermodynamic stability is a fundamental property shared by all proteins. Changes in stability due to mutation are a widespread molecular mechanism in genetic diseases. Methods for the prediction of mutation-induced stability change have typically been developed and evaluated on incomplete and/or biased data sets. As part of the Critical Assessment of Genome Interpretation, we explored the utility of high-throughput variant stability profiling (VSP) assay data as an alternative for the assessment of computational methods and evaluated state-of-the-art predictors against over 7,000 nonsynonymous variants from two proteins. We found that predictions were modestly correlated with actual experimental values. Predictors fared better when evaluated as classifiers of extreme stability effects. While different methods emerging as top performers depending on the metric, it is nontrivial to draw conclusions on their adoption or improvement. Our analyses revealed that only 16% of all variants in VSP assays could be confidently defined as stability-affecting. Furthermore, it is unclear as to what extent VSP abundance scores were reasonable proxies for the stability-related quantities that participating methods were designed to predict. Overall, our observations underscore the need for clearly defined objectives when developing and using both computational and experimental methods in the context of measuring variant impact.


Assuntos
Biologia Computacional/métodos , Metiltransferases/química , Mutação , PTEN Fosfo-Hidrolase/química , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Metiltransferases/genética , PTEN Fosfo-Hidrolase/genética , Estabilidade Proteica
14.
Hum Mutat ; 40(9): 1474-1485, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31260570

RESUMO

The CAGI-5 pericentriolar material 1 (PCM1) challenge aimed to predict the effect of 38 transgenic human missense mutations in the PCM1 protein implicated in schizophrenia. Participants were provided with 16 benign variants (negative controls), 10 hypomorphic, and 12 loss of function variants. Six groups participated and were asked to predict the probability of effect and standard deviation associated to each mutation. Here, we present the challenge assessment. Prediction performance was evaluated using different measures to conclude in a final ranking which highlights the strengths and weaknesses of each group. The results show a great variety of predictions where some methods performed significantly better than others. Benign variants played an important role as negative controls, highlighting predictors biased to identify disease phenotypes. The best predictor, Bromberg lab, used a neural-network-based method able to discriminate between neutral and non-neutral single nucleotide polymorphisms. The CAGI-5 PCM1 challenge allowed us to evaluate the state of the art techniques for interpreting the effect of novel variants for a difficult target protein.


Assuntos
Autoantígenos/genética , Proteínas de Ciclo Celular/genética , Biologia Computacional/métodos , Mutação de Sentido Incorreto , Esquizofrenia/genética , Bases de Dados Genéticas , Predisposição Genética para Doença , Humanos , Redes Neurais de Computação , Fenótipo , Polimorfismo de Nucleotídeo Único
15.
Hum Mutat ; 40(9): 1314-1320, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31140652

RESUMO

Genetics play a key role in venous thromboembolism (VTE) risk, however established risk factors in European populations do not translate to individuals of African descent because of the differences in allele frequencies between populations. As part of the fifth iteration of the Critical Assessment of Genome Interpretation, participants were asked to predict VTE status in exome data from African American subjects. Participants were provided with 103 unlabeled exomes from patients treated with warfarin for non-VTE causes or VTE and asked to predict which disease each subject had been treated for. Given the lack of training data, many participants opted to use unsupervised machine learning methods, clustering the exomes by variation in genes known to be associated with VTE. The best performing method using only VTE related genes achieved an area under the ROC curve of 0.65. Here, we discuss the range of methods used in the prediction of VTE from sequence data and explore some of the difficulties of conducting a challenge with known confounders. In addition, we show that an existing genetic risk score for VTE that was developed in European subjects works well in African Americans.


Assuntos
Sequenciamento do Exoma/métodos , Tromboembolia Venosa/genética , Varfarina/administração & dosagem , Análise por Conglomerados , Biologia Computacional/métodos , Congressos como Assunto , Feminino , Predisposição Genética para Doença , Humanos , Masculino , Curva ROC , Aprendizado de Máquina não Supervisionado , Tromboembolia Venosa/tratamento farmacológico , Varfarina/uso terapêutico
16.
Hum Mutat ; 40(9): 1530-1545, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31301157

RESUMO

Accurate prediction of the impact of genomic variation on phenotype is a major goal of computational biology and an important contributor to personalized medicine. Computational predictions can lead to a better understanding of the mechanisms underlying genetic diseases, including cancer, but their adoption requires thorough and unbiased assessment. Cystathionine-beta-synthase (CBS) is an enzyme that catalyzes the first step of the transsulfuration pathway, from homocysteine to cystathionine, and in which variations are associated with human hyperhomocysteinemia and homocystinuria. We have created a computational challenge under the CAGI framework to evaluate how well different methods can predict the phenotypic effect(s) of CBS single amino acid substitutions using a blinded experimental data set. CAGI participants were asked to predict yeast growth based on the identity of the mutations. The performance of the methods was evaluated using several metrics. The CBS challenge highlighted the difficulty of predicting the phenotype of an ex vivo system in a model organism when classification models were trained on human disease data. We also discuss the variations in difficulty of prediction for known benign and deleterious variants, as well as identify methodological and experimental constraints with lessons to be learned for future challenges.


Assuntos
Substituição de Aminoácidos , Biologia Computacional/métodos , Cistationina beta-Sintase/genética , Cistationina/metabolismo , Cistationina beta-Sintase/metabolismo , Homocisteína/metabolismo , Humanos , Fenótipo , Medicina de Precisão
17.
Hum Mutat ; 40(9): 1612-1622, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31241222

RESUMO

The availability of disease-specific genomic data is critical for developing new computational methods that predict the pathogenicity of human variants and advance the field of precision medicine. However, the lack of gold standards to properly train and benchmark such methods is one of the greatest challenges in the field. In response to this challenge, the scientific community is invited to participate in the Critical Assessment for Genome Interpretation (CAGI), where unpublished disease variants are available for classification by in silico methods. As part of the CAGI-5 challenge, we evaluated the performance of 18 submissions and three additional methods in predicting the pathogenicity of single nucleotide variants (SNVs) in checkpoint kinase 2 (CHEK2) for cases of breast cancer in Hispanic females. As part of the assessment, the efficacy of the analysis method and the setup of the challenge were also considered. The results indicated that though the challenge could benefit from additional participant data, the combined generalized linear model analysis and odds of pathogenicity analysis provided a framework to evaluate the methods submitted for SNV pathogenicity identification and for comparison to other available methods. The outcome of this challenge and the approaches used can help guide further advancements in identifying SNV-disease relationships.


Assuntos
Neoplasias da Mama/genética , Quinase do Ponto de Checagem 2/genética , Biologia Computacional/métodos , Hispânico ou Latino/genética , Polimorfismo de Nucleotídeo Único , Adulto , Idoso , Neoplasias da Mama/etnologia , Estudos de Casos e Controles , Simulação por Computador , Feminino , Predisposição Genética para Doença , Humanos , Modelos Lineares , Pessoa de Meia-Idade , Estados Unidos/etnologia , Sequenciamento do Exoma
18.
Hum Mutat ; 40(9): 1519-1529, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-31342580

RESUMO

The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4) in 2016, invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to .61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population-scale analysis of disease epidemiology and rare variant association analysis.


Assuntos
Acetilglucosaminidase/metabolismo , Biologia Computacional/métodos , Mutação de Sentido Incorreto , Acetilglucosaminidase/genética , Humanos , Modelos Genéticos , Análise de Regressão
19.
Bioinformatics ; 34(13): i304-i312, 2018 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-29950013

RESUMO

Motivation: The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods. Results: Here we describe homology-derived functional similarity of proteins (HFSP), a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (85% precision) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 16% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Biologia Computacional , Anotação de Sequência Molecular , Proteínas , Sequência de Aminoácidos , Biologia Computacional/métodos , Bases de Dados de Proteínas , Anotação de Sequência Molecular/métodos , Proteínas/química , Alinhamento de Sequência
20.
Hum Mutat ; 38(9): 1042-1050, 2017 09.
Artigo em Inglês | MEDLINE | ID: mdl-28440912

RESUMO

Correct phenotypic interpretation of variants of unknown significance for cancer-associated genes is a diagnostic challenge as genetic screenings gain in popularity in the next-generation sequencing era. The Critical Assessment of Genome Interpretation (CAGI) experiment aims to test and define the state of the art of genotype-phenotype interpretation. Here, we present the assessment of the CAGI p16INK4a challenge. Participants were asked to predict the effect on cellular proliferation of 10 variants for the p16INK4a tumor suppressor, a cyclin-dependent kinase inhibitor encoded by the CDKN2A gene. Twenty-two pathogenicity predictors were assessed with a variety of accuracy measures for reliability in a medical context. Different assessment measures were combined in an overall ranking to provide more robust results. The R scripts used for assessment are publicly available from a GitHub repository for future use in similar assessment exercises. Despite a limited test-set size, our findings show a variety of results, with some methods performing significantly better. Methods combining different strategies frequently outperform simpler approaches. The best predictor, Yang&Zhou lab, uses a machine learning method combining an empirical energy function measuring protein stability with an evolutionary conservation term. The p16INK4a challenge highlights how subtle structural effects can neutralize otherwise deleterious variants.


Assuntos
Biologia Computacional/métodos , Inibidor de Quinase Dependente de Ciclina p18/genética , Variação Genética , Linhagem Celular Tumoral , Proliferação de Células , Simulação por Computador , Inibidor p16 de Quinase Dependente de Ciclina , Inibidor de Quinase Dependente de Ciclina p18/química , Bases de Dados Genéticas , Predisposição Genética para Doença , Humanos , Aprendizado de Máquina , Estabilidade Proteica
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA