Results 1 - 20 of 47
1.
Nucleic Acids Res ; 50(D1): D543-D552, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34723319

ABSTRACT

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.


Subject(s)
Databases, Protein , Metadata/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Peptides/chemistry , Proteins/chemistry , Software , Amino Acid Sequence , Bibliometrics , Datasets as Topic , Humans , Information Storage and Retrieval , Internet , Mass Spectrometry , Peptides/genetics , Peptides/metabolism , Proteins/genetics , Proteins/metabolism , Proteomics/instrumentation , Proteomics/methods , Sequence Alignment
2.
Nucleic Acids Res ; 49(D1): D325-D334, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33290552

ABSTRACT

The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings, and, to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.


Subject(s)
Gene Ontology , Molecular Sequence Annotation/statistics & numerical data , User-Computer Interface , Animals , Arabidopsis/genetics , Arabidopsis/metabolism , Caenorhabditis elegans/genetics , Caenorhabditis elegans/metabolism , Dictyostelium/genetics , Dictyostelium/metabolism , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Escherichia coli/genetics , Escherichia coli/metabolism , Humans , Internet , Mice , Rats , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Schizosaccharomyces/genetics , Schizosaccharomyces/metabolism , Zebrafish/genetics , Zebrafish/metabolism
3.
PLoS Comput Biol ; 17(2): e1007948, 2021 02.
Article in English | MEDLINE | ID: mdl-33600408

ABSTRACT

Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.


Subject(s)
Models, Genetic , Molecular Sequence Annotation/methods , Phylogeny , Algorithms , Animals , Bayes Theorem , Computational Biology , Databases, Genetic , Evolution, Molecular , Gene Ontology/statistics & numerical data , Humans , Likelihood Functions , Markov Chains , Mice , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Monte Carlo Method , Multigene Family
4.
PLoS Comput Biol ; 17(9): e1009446, 2021 09.
Article in English | MEDLINE | ID: mdl-34555022

ABSTRACT

Only a small fraction of genes deposited to databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown; experimental attempts to validate the accuracy within individual enzyme classes are lacking. In this study we performed an overview of functional annotations to the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations to an enzyme class of S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database, and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing no similarity or domain architecture to experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.


Subject(s)
Alcohol Oxidoreductases/classification , Databases, Protein/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Alcohol Oxidoreductases/chemistry , Alcohol Oxidoreductases/genetics , Animals , Computational Biology , Enzymes/chemistry , Enzymes/classification , Enzymes/genetics , Humans , Models, Molecular , Protein Domains , Sequence Homology, Amino Acid
5.
Nat Rev Genet ; 17(11): 679-692, 2016 10 14.
Article in English | MEDLINE | ID: mdl-27739534

ABSTRACT

The pervasive expression of circular RNAs (circRNAs) is a recently discovered feature of gene expression in highly diverged eukaryotes. Numerous algorithms that are used to detect genome-wide circRNA expression from RNA sequencing (RNA-seq) data have been developed in the past few years, but there is little overlap in their predictions and no clear gold-standard method to assess the accuracy of these algorithms. We review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches to address these biases. We conclude with a discussion of the current experimental progress on the topic.


Subject(s)
Computational Biology/methods , Molecular Sequence Annotation/statistics & numerical data , RNA/metabolism , Sequence Analysis, RNA/methods , Databases, Nucleic Acid , Humans , Molecular Sequence Annotation/methods , RNA/chemistry , RNA, Circular , Software
6.
Brief Bioinform ; 20(4): 1071-1084, 2019 07 19.
Article in English | MEDLINE | ID: mdl-28968784

ABSTRACT

The overwhelming list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform the manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. Then, the functionalities of the platform are presented in a way that provides practical guidance and help to nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and a recently developed original method (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly contribute, through the user community, to maintaining accurate resources.


Subject(s)
Genome, Microbial , Genomics/methods , Molecular Sequence Annotation/methods , Software , Computational Biology , Computer Graphics , Database Management Systems , Databases, Chemical , Genomics/statistics & numerical data , Internet , Metabolic Networks and Pathways/genetics , Microbiological Phenomena , Molecular Sequence Annotation/statistics & numerical data , User-Computer Interface
7.
Brief Bioinform ; 20(1): 168-177, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968630

ABSTRACT

Pathway enrichment analysis has been widely used to identify cancer risk pathways, and contributes to elucidating the mechanism of tumorigenesis. However, most existing approaches use outdated pathway information and neglect the complex gene interactions within pathways. Here, we first briefly review the widely used pathway enrichment analysis approaches, and then propose a novel topology-based pathway enrichment analysis (TPEA) method, which integrates topological properties and global upstream/downstream positions of genes in pathways. We compared TPEA with four widely used pathway enrichment analysis tools, including database for annotation, visualization and integrated discovery (DAVID), gene set enrichment analysis (GSEA), centrality-based pathway enrichment (CePa) and signaling pathway impact analysis (SPIA), by analyzing six gene expression profiles of three tumor types (colorectal cancer, thyroid cancer and endometrial cancer). As a result, we identified several well-known cancer risk pathways that could not be obtained by the existing tools, and the results of TPEA were more stable than those of the other tools in analyzing different data sets of the same cancer. Finally, we developed an R package to implement TPEA, which can update KEGG pathway information online and is available at the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/TPEA/.


Subject(s)
Databases, Genetic/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Neoplasms/genetics , Carcinogenesis/genetics , Computational Biology/methods , Female , Gene Regulatory Networks , Humans , Male , Molecular Sequence Annotation/statistics & numerical data , Signal Transduction/genetics , Software
8.
Brief Bioinform ; 20(1): 288-298, 2019 01 18.
Article in English | MEDLINE | ID: mdl-29028903

ABSTRACT

RNA sequencing (RNA-seq) has become a standard procedure to investigate transcriptional changes between conditions and is routinely used in research and clinics. While standard differential expression (DE) analysis between two conditions has been extensively studied and improved over the past decades, RNA-seq time course (TC) DE analysis algorithms are still in their early stages. In this study, we compare, for the first time, existing TC RNA-seq tools on an extensive simulation data set and validate the best-performing tools on published data. Surprisingly, TC tools were outperformed by the classical pairwise comparison approach on short time series (<8 time points) in terms of overall performance and robustness to noise, mostly because of a high number of false positives, with the exception of ImpulseDE2. Overlapping the candidate lists between tools mitigated this shortcoming, as the majority of false-positive, but not true-positive, candidates were unique to each method. On longer time series, the pairwise approach was less efficient in overall performance than splineTC and maSigPro, which did not identify any false-positive candidates.


Subject(s)
Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Bayes Theorem , Computational Biology/methods , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Humans , Markov Chains , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Sequence Analysis, RNA/statistics & numerical data , Signal-To-Noise Ratio , Software , Time Factors
9.
Brief Bioinform ; 20(4): 1449-1464, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29490019

ABSTRACT

Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency.


Subject(s)
Cluster Analysis , Computational Biology/methods , Data Mining/methods , Algorithms , Big Data , Databases, Genetic/statistics & numerical data , Disease/classification , Disease/genetics , Gene Expression/drug effects , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Humans , Molecular Sequence Annotation/statistics & numerical data
10.
PLoS Comput Biol ; 16(11): e1008325, 2020 11.
Article in English | MEDLINE | ID: mdl-33180771

ABSTRACT

Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.


Subject(s)
Genome , Genomics/methods , Molecular Sequence Annotation/methods , Animals , Computational Biology , Gene Library , Genomics/education , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Molecular Sequence Annotation/statistics & numerical data , RNA-Seq/methods , RNA-Seq/statistics & numerical data , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data
11.
PLoS Comput Biol ; 15(4): e1006682, 2019 04.
Article in English | MEDLINE | ID: mdl-30943207

ABSTRACT

High quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms and usually have minor or major errors. Manual curation is time consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities but involving undergraduates in community genome annotation consortiums can be mutually beneficial for both education and improved genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.


Subject(s)
Computational Biology/education , Genomics/education , Models, Genetic , Molecular Sequence Annotation/statistics & numerical data , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , Guidelines as Topic , Humans , Students
12.
BMC Bioinformatics ; 20(1): 610, 2019 Nov 27.
Article in English | MEDLINE | ID: mdl-31775616

ABSTRACT

BACKGROUND: Over the last 10 years, there have been over 3300 genome-wide association studies (GWAS). Almost every GWAS provides a Manhattan plot either as a main figure or in the supplement. Several software packages can generate a Manhattan plot, but they are all limited in the extent to which they can annotate gene names, allele frequencies, and variants having high impact on gene function, or provide any other added information or flexibility. Furthermore, in a conventional Manhattan plot, there is no way of distinguishing a locus identified due to a single variant with a very significant p-value from a locus with multiple variants that appear to be in a haplotype block and have very similar p-values. RESULTS: Here we present a software tool written in R that generates a transposed Manhattan plot, annotates it with additional features such as variant consequence and minor allele frequency, and addresses these limitations. The software also gives flexibility on how and where the user wants to display the annotations. The software can be downloaded from the CRAN repository and also from the GitHub project page. CONCLUSIONS: We present a major step up from the existing conventional Manhattan plot generation tools. We hope this form of display, along with the added annotations, will give the reader more insight from this new Manhattan++ plot.


Subject(s)
Genome-Wide Association Study/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Genome , Humans , Polymorphism, Single Nucleotide , Software
13.
J Exp Bot ; 70(4): 1069-1076, 2019 02 20.
Article in English | MEDLINE | ID: mdl-30590678

ABSTRACT

The use of draft genomes of different species and re-sequencing of accessions and populations are now common tools for plant biology research. The de novo assembled draft genomes make it possible to identify pivotal divergence points in the plant lineage and provide an opportunity to investigate the genomic basis and timing of biological innovations by inferring orthologs between species. Furthermore, re-sequencing facilitates the mapping and subsequent molecular characterization of causative loci for traits, such as those for plant stress tolerance and development. In both cases, high-quality gene annotation (the identification of protein-coding regions, gene promoters, and 5'- and 3'-untranslated regions) is critical for investigation of gene function. Annotations are constantly improving but automated gene annotations still require manual curation and experimental validation. This is particularly important for genes with large introns, genes located in regions rich with transposable elements or repeats, large gene families, and segmentally duplicated genes. In this opinion paper, we highlight the impact of annotation quality on evolutionary analyses, genome-wide association studies, and the identification of orthologous genes in plants. Furthermore, we predict that incorporating accurate information from manual curation into databases will dramatically improve the performance of automated gene predictors.


Subject(s)
Evolution, Molecular , Genes, Plant , Genome-Wide Association Study , Plants/genetics , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation/statistics & numerical data
14.
Nucleic Acids Res ; 45(8): e57, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28053114

ABSTRACT

Whole transcriptome sequencing (RNA-seq) has become a standard for cataloguing and monitoring RNA populations. One of the main bottlenecks, however, is to correctly identify the different classes of RNAs among the plethora of reconstructed transcripts, particularly those that will be translated (mRNAs) from the class of long non-coding RNAs (lncRNAs). Here, we present FEELnc (FlExible Extraction of LncRNAs), an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames. Benchmarking versus five state-of-the-art tools shows that FEELnc achieves similar or better classification performance on GENCODE and NONCODE data sets. The program also provides specific modules that enable the user to fine-tune classification accuracy, to formalize the annotation of lncRNA classes and to identify lncRNAs even in the absence of a training set of non-coding RNAs. We used FEELnc on a real data set comprising 20 canine RNA-seq samples produced by the European LUPA consortium to substantially expand the canine genome annotation to include 10 374 novel lncRNAs and 58 640 mRNA transcripts. FEELnc moves beyond conventional coding potential classifiers by providing a standardized and complete solution for annotating lncRNAs and is freely available at https://github.com/tderrien/FEELnc.


Subject(s)
Genome , Molecular Sequence Annotation/methods , RNA, Long Noncoding/genetics , Software , Transcriptome , Animals , Benchmarking , Decision Trees , Dogs , Gene Expression Regulation , Humans , Mice , Molecular Sequence Annotation/statistics & numerical data , Open Reading Frames , RNA, Long Noncoding/classification , RNA, Long Noncoding/metabolism , RNA, Messenger/classification , RNA, Messenger/genetics , RNA, Messenger/metabolism , Sequence Analysis, RNA
15.
Nucleic Acids Res ; 44(6): e58, 2016 Apr 07.
Article in English | MEDLINE | ID: mdl-26657634

ABSTRACT

CircRNAs are novel members of the non-coding RNA family. CircRNAs have been known to exist for several decades; however, their widespread abundance has only recently become appreciated. Annotation of circRNAs depends on sequencing reads that span the backsplice junction and therefore map as non-linear reads in the genome. Several pipelines have been developed to specifically identify these non-linear reads and consequently predict the landscape of circRNAs based on deep sequencing datasets. Here, we use common RNAseq datasets to scrutinize and compare the output from five different algorithms (circRNA_finder, find_circ, CIRCexplorer, CIRI, and MapSplice) and evaluate the levels of bona fide and false positive circRNAs based on RNase R resistance. By this approach, we observe surprisingly dramatic differences between the algorithms, specifically regarding the highly expressed circRNAs and the circRNAs derived from proximal splice sites. Collectively, this study emphasizes that circRNA annotation should be handled with care and that several algorithms should ideally be combined to achieve reliable predictions.


Subject(s)
Algorithms , Artifacts , Molecular Sequence Annotation/statistics & numerical data , RNA/chemistry , Software , Alternative Splicing , Gene Library , Genome, Human , Humans , Molecular Sequence Annotation/methods , Nucleic Acid Conformation , RNA/genetics , RNA/metabolism , RNA, Circular , Sequence Analysis, RNA
16.
Brief Bioinform ; 16(2): 255-64, 2015 Mar.
Article in English | MEDLINE | ID: mdl-24626529

ABSTRACT

High-throughput DNA sequencing has become a mainstay for the discovery of genomic variants that may cause disease or affect phenotype. A next-generation sequencing pipeline typically identifies thousands of variants in each sample. A particular challenge is the annotation of each variant in a way that is useful to downstream consumers of the data, such as clinical sequencing centers or researchers. These users may require that all data storage and analysis remain on secure local servers to protect patient confidentiality or intellectual property, may have unique and changing needs to draw on a variety of annotation data sets and may prefer not to rely on closed-source applications beyond their control. Here we describe scalable methods for using the plugin capability of the Ensembl Variant Effect Predictor to enrich its basic set of variant annotations with additional data on genes, function, conservation, expression, diseases, pathways and protein structure, and describe an extensible framework for easily adding additional custom data sets.


Subject(s)
High-Throughput Nucleotide Sequencing/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Genetic Variation , Humans , Software
17.
PLoS Comput Biol ; 10(2): e1003460, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24516372

ABSTRACT

A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies of single-genes, which lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. VarWalker fits generalized additive models for each sample based on sample-specific mutation profiles and builds on the joint frequency of both mutation genes and their close interactors. These interactors are selected and optimized using the Random Walk with Restart algorithm in a protein-protein interaction network. We applied the method in >300 tumor genomes in two large-scale NGS benchmark datasets: 183 lung adenocarcinoma samples and 121 melanoma samples. In each cancer, we derived a consensus mutation subnetwork containing significantly enriched consensus cancer genes and cancer-related functional pathways. These cancer-specific mutation networks were then validated using independent datasets for each cancer. Importantly, VarWalker prioritizes well-known, infrequently mutated genes, which are shown to interact with highly recurrently mutated genes yet have been ignored by conventional single-gene-based approaches. Utilizing VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of cancer driver genes in NGS data.
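The Random Walk with Restart step mentioned in this abstract can be illustrated with a minimal, generic sketch. It is not the VarWalker implementation: the toy network, the node names, the function name `random_walk_with_restart` and the restart probability of 0.5 are all assumptions made for illustration. The walker repeatedly moves to a random neighbor with probability 1 - restart and jumps back to a seed node with probability restart, so nodes close to the seeds receive higher steady-state scores.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for an undirected graph.

    adjacency: dict mapping each node to a list of its neighbors (hypothetical input format).
    seeds: set of nodes the walker restarts from.
    """
    nodes = sorted(adjacency)
    # Initial distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart component of the update.
        nxt = {n: restart * p0[n] for n in nodes}
        # Walk component: each node spreads its probability evenly to its neighbors.
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                continue
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path network A - B - C - D, with A as the only seed (illustrative only).
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, seeds={"A"})
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

In a real protein-protein interaction network, the seeds would be the mutated genes of a sample and the highest-scoring non-seed nodes would be candidate interactors; how the abstract's method selects and optimizes them is not specified here.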


Subject(s)
Algorithms , Mutation , Neoplasms/genetics , Oncogenes , Adenocarcinoma/genetics , Adenocarcinoma of Lung , Computational Biology , Consensus Sequence , DNA Mutational Analysis/statistics & numerical data , DNA, Neoplasm/genetics , Databases, Genetic/statistics & numerical data , Gene Frequency , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Lung Neoplasms/genetics , Melanoma/genetics , Models, Genetic , Molecular Sequence Annotation/statistics & numerical data
18.
J Proteome Res ; 12(6): 2571-81, 2013 Jun 07.
Article in English | MEDLINE | ID: mdl-23668635

ABSTRACT

Because of its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semitryptic and nontryptic peptides. Many of these peptides are thought to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. If other possibilities such as post-translational modifications and single amino acid polymorphisms are ignored, this suggests that many unidentified spectra originate from semitryptic and nontryptic peptides. Including them in database searches, however, may not improve overall peptide identification because expanding the search space can reduce sensitivity. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose differences in molecular weight from the parent ion are within a specified error tolerance) into three tiers: tryptic, semitryptic, and nontryptic. This classification allows peptides that belong to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance compared with search strategies that assign equal Bonferroni correction factors to all qualified peptides.
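
The abstract does not say how the per-tier Bonferroni factors are chosen, so the sketch below uses a generic weighted Bonferroni correction as a stand-in: each tier receives a fraction of the error budget, and a peptide's E-value is its P-value multiplied by the number of qualified peptides in its tier divided by that fraction. The tier weights, peptide labels, P-values, and function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def flat_e_values(candidates):
    """Equal Bonferroni factor: every P-value is multiplied by the total count."""
    total = len(candidates)
    return [(pep, tier, p * total) for pep, tier, p in candidates]


def tiered_e_values(candidates, tier_weights):
    """Weighted Bonferroni: factor for a tier = (peptides in tier) / (tier weight).

    candidates: list of (peptide, tier, p_value) tuples.
    tier_weights: dict tier -> share of the error budget; shares must sum to 1.
    """
    if abs(sum(tier_weights.values()) - 1.0) > 1e-9:
        raise ValueError("tier weights must sum to 1")
    counts = Counter(tier for _, tier, _ in candidates)
    results = []
    for pep, tier, p in candidates:
        factor = counts[tier] / tier_weights[tier]
        results.append((pep, tier, p * factor))
    return results


# Hypothetical search space for one spectrum: few tryptic candidates,
# many nontryptic ones, all with the same illustrative P-value.
CANDIDATES = (
    [(f"tryptic_{i}", "tryptic", 1e-4) for i in range(2)]
    + [(f"semi_{i}", "semitryptic", 1e-4) for i in range(8)]
    + [(f"non_{i}", "nontryptic", 1e-4) for i in range(90)]
)
# Assumed split of the error budget across tiers (illustrative only).
WEIGHTS = {"tryptic": 0.80, "semitryptic": 0.15, "nontryptic": 0.05}

if __name__ == "__main__":
    flat = flat_e_values(CANDIDATES)
    tiered = tiered_e_values(CANDIDATES, WEIGHTS)
    for (pep, tier, e_flat), (_, _, e_tier) in zip(flat[:1] + flat[-1:], tiered[:1] + tiered[-1:]):
        print(f"{pep} ({tier}): flat E={e_flat:.3g}, tiered E={e_tier:.3g}")
```

Under these assumed weights, a tryptic candidate is penalized far less than under the flat correction, while a nontryptic candidate is penalized more, which is the trade-off the tiering idea relies on.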


Subject(s)
Algorithms , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Peptide Fragments/isolation & purification , Sequence Analysis, Protein/statistics & numerical data , Animals , Humans , Proteolysis , Proteomics , Sensitivity and Specificity , Tandem Mass Spectrometry , Trypsin/chemistry
19.
J Comput Biol ; 29(2): 140-154, 2022 02.
Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing as output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to keep only counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
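
For orientation, here is a minimal sketch of the classic Count-Min sketch that the abstract cites as a baseline, applied to k-mer counting. It is not Set-Min or Max-Min, whose construction the abstract does not describe. The class name, table dimensions, hash choice (salted BLAKE2b), and example sequence are assumptions made for illustration.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Classic Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One independent-looking hash per row, obtained by salting with the row id.
        digest = hashlib.blake2b(
            item.encode("ascii"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate counters, so the minimum across rows
        # is an upper bound on the true count and the tightest one available.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))


def kmers(sequence, k):
    """Yield every length-k substring of `sequence`."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


if __name__ == "__main__":
    sequence = "ACGTACGTAC"  # illustrative sequence, not real data
    sketch = CountMinSketch(width=64, depth=4)
    for kmer in kmers(sequence, 3):
        sketch.add(kmer)
    exact = Counter(kmers(sequence, 3))
    for kmer, true_count in sorted(exact.items()):
        print(kmer, "true:", true_count, "estimate:", sketch.estimate(kmer))
```

The sketch stores only counters, never the k-mers themselves, which is the property the abstract highlights; the price is that estimates can overshoot when k-mers collide in every row.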


Subject(s)
Computational Biology/methods , Genomics/statistics & numerical data , Software , Algorithms , Animals , Computer Graphics , Databases, Genetic/statistics & numerical data , Genome, Human , Humans , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data
20.
Biochim Biophys Acta Gene Regul Mech ; 1864(11-12): 194752, 2021.
Article in English | MEDLINE | ID: mdl-34461313

ABSTRACT

Transcription plays a central role in defining the identity and functionalities of cells, as well as in their responses to changes in the cellular environment. The Gene Ontology (GO) provides a rigorously defined set of concepts that describe the functions of gene products. A GO annotation is a statement about the function of a particular gene product, represented as an association between a gene product and the biological concept a GO term defines. Critically, each GO annotation is based on traceable scientific evidence. Here, we describe the different GO terms that are associated with proteins involved in transcription and its regulation, focusing on the standard of evidence required to support these associations. This article is intended to help users of GO annotations understand how to interpret the annotations and can contribute to the consistency of GO annotations. We distinguish between three classes of activities involved in transcription or directly regulating it: general transcription factors, DNA-binding transcription factors, and transcription co-regulators.


Subject(s)
Databases, Genetic/statistics & numerical data , Gene Expression Regulation , Gene Ontology/statistics & numerical data , Transcription Factors/classification , Computational Biology/methods , Molecular Sequence Annotation/statistics & numerical data