Results 1 - 20 of 47
1.
J Comput Biol ; 29(2): 140-154, 2022 02.
Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, and they output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to keep only the counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.


Subject(s)
Computational Biology/methods , Genomics/statistics & numerical data , Software , Algorithms , Animals , Computer Graphics , Databases, Genetic/statistics & numerical data , Genome, Human , Humans , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data
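
As a point of reference for the Count-Min baseline named in this abstract, the following is a minimal Python sketch of a plain Count-Min sketch applied to k-mer counting. It is not the Set-Min sketch from the paper, whose construction the abstract does not describe. The class name `CountMinSketch`, the `kmers` helper, and the width and depth values are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Plain Count-Min sketch: `depth` rows of `width` counters (illustrative)."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate counters, so the minimum over the rows
        # is an upper bound on the true count that is never below it.
        return min(self.rows[row][self._index(key, row)] for row in range(self.depth))


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of `seq`."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


sketch = CountMinSketch(width=64, depth=4)
for kmer in kmers("ACGTACGTAC", 3):
    sketch.add(kmer)
```

Only the counters are stored, not the k-mers themselves; the price is that `estimate` may overestimate a count when distinct k-mers collide in every row.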
2.
Nucleic Acids Res ; 50(D1): D543-D552, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34723319

ABSTRACT

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.


Subject(s)
Databases, Protein , Metadata/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Peptides/chemistry , Proteins/chemistry , Software , Amino Acid Sequence , Bibliometrics , Datasets as Topic , Humans , Information Storage and Retrieval , Internet , Mass Spectrometry , Peptides/genetics , Peptides/metabolism , Proteins/genetics , Proteins/metabolism , Proteomics/instrumentation , Proteomics/methods , Sequence Alignment
3.
PLoS Comput Biol ; 17(9): e1009446, 2021 09.
Article in English | MEDLINE | ID: mdl-34555022

ABSTRACT

Only a small fraction of genes deposited to databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown; experimental attempts to validate the accuracy within individual enzyme classes are lacking. In this study we performed an overview of functional annotations to the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations to an enzyme class of S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database, and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing no similarity or domain architecture to experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.


Subject(s)
Alcohol Oxidoreductases/classification , Databases, Protein/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Alcohol Oxidoreductases/chemistry , Alcohol Oxidoreductases/genetics , Animals , Computational Biology , Enzymes/chemistry , Enzymes/classification , Enzymes/genetics , Humans , Models, Molecular , Protein Domains , Sequence Homology, Amino Acid
4.
Biochim Biophys Acta Gene Regul Mech ; 1864(11-12): 194752, 2021.
Article in English | MEDLINE | ID: mdl-34461313

ABSTRACT

Transcription plays a central role in defining the identity and functionalities of cells, as well as in their responses to changes in the cellular environment. The Gene Ontology (GO) provides a rigorously defined set of concepts that describe the functions of gene products. A GO annotation is a statement about the function of a particular gene product, represented as an association between a gene product and the biological concept a GO term defines. Critically, each GO annotation is based on traceable scientific evidence. Here, we describe the different GO terms that are associated with proteins involved in transcription and its regulation, focusing on the standard of evidence required to support these associations. This article is intended to help users of GO annotations understand how to interpret the annotations and can contribute to the consistency of GO annotations. We distinguish between three classes of activities involved in transcription or directly regulating it - general transcription factors, DNA-binding transcription factors, and transcription co-regulators.


Subject(s)
Databases, Genetic/statistics & numerical data , Gene Expression Regulation , Gene Ontology/statistics & numerical data , Transcription Factors/classification , Computational Biology/methods , Molecular Sequence Annotation/statistics & numerical data
5.
Nat Commun ; 12(1): 2845, 2021 05 14.
Article in English | MEDLINE | ID: mdl-33990588

ABSTRACT

Quantifying the overall magnitude of every single locus' genetic effect on the widely measured human phenome is a great challenge. We introduce a unified modelling technique that can consistently provide a total genetic contribution assessment (TGCA) of a gene or genetic variant without thresholding genetic association signals. Genome-wide TGCA in five UK Biobank phenotype domains highlights loci such as the HLA locus for medical conditions, the bone mineral density locus WNT16 for physical measures, and the skin tanning locus MC1R and smoking behaviour locus CHRNA3 for lifestyle. Tissue-specificity investigation reveals several tissues associated with total genetic contributions, including the brain tissues for mental health. Such associations are driven by tissue-specific gene expression, which shares a genetic basis with the total genetic contributions. TGCA can provide a genome-wide atlas of the overall genetic contributions in each particular domain of human complex traits.


Subject(s)
Genome, Human , Models, Genetic , Biological Specimen Banks/statistics & numerical data , Computer Simulation , Genome-Wide Association Study/statistics & numerical data , Humans , Molecular Sequence Annotation/statistics & numerical data , Multifactorial Inheritance/genetics , Organ Specificity/genetics , Phenotype , Polymorphism, Single Nucleotide , Quantitative Trait Loci
6.
PLoS Comput Biol ; 17(2): e1007948, 2021 02.
Article in English | MEDLINE | ID: mdl-33600408

ABSTRACT

Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.


Subject(s)
Models, Genetic , Molecular Sequence Annotation/methods , Phylogeny , Algorithms , Animals , Bayes Theorem , Computational Biology , Databases, Genetic , Evolution, Molecular , Gene Ontology/statistics & numerical data , Humans , Likelihood Functions , Markov Chains , Mice , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Monte Carlo Method , Multigene Family
7.
Nucleic Acids Res ; 49(D1): D325-D334, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33290552

ABSTRACT

The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings, and, to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.


Subject(s)
Gene Ontology , Molecular Sequence Annotation/statistics & numerical data , User-Computer Interface , Animals , Arabidopsis/genetics , Arabidopsis/metabolism , Caenorhabditis elegans/genetics , Caenorhabditis elegans/metabolism , Dictyostelium/genetics , Dictyostelium/metabolism , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Escherichia coli/genetics , Escherichia coli/metabolism , Humans , Internet , Mice , Rats , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Schizosaccharomyces/genetics , Schizosaccharomyces/metabolism , Zebrafish/genetics , Zebrafish/metabolism
8.
PLoS Comput Biol ; 16(11): e1008325, 2020 11.
Article in English | MEDLINE | ID: mdl-33180771

ABSTRACT

Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.


Subject(s)
Genome , Genomics/methods , Molecular Sequence Annotation/methods , Animals , Computational Biology , Gene Library , Genomics/education , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Molecular Sequence Annotation/statistics & numerical data , RNA-Seq/methods , RNA-Seq/statistics & numerical data , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data
9.
J Comput Biol ; 27(9): 1407-1421, 2020 09.
Article in English | MEDLINE | ID: mdl-32048871

ABSTRACT

By using next-generation sequencing technologies, it is possible to quickly and inexpensively generate large numbers of relatively short reads from both the nuclear and mitochondrial DNA (mtDNA) contained in a biological sample. Unfortunately, assembling such whole-genome sequencing (WGS) data with standard de novo assemblers often fails to generate high-quality mitochondrial genome sequences due to the large difference in copy number (and hence sequencing depth) between the mitochondrial and nuclear genomes. Assembly of complete mitochondrial genome sequences is further complicated by the fact that many de novo assemblers are not designed for circular genomes and by the presence of repeats in the mitochondrial genomes of some species. In this article, we describe the Statistical Mitogenome Assembly with RepeaTs (SMART) pipeline for automated assembly of mitochondrial genomes from WGS data. SMART uses an efficient coverage-based filter to first select a subset of reads enriched in mtDNA sequences. Contigs produced by an initial assembly step are filtered using the Basic Local Alignment Search Tool searches against a comprehensive mitochondrial genome database and are used as "baits" for an alignment-based filter that produces the set of reads used in a second de novo assembly and scaffolding step. In the presence of repeats, the possible paths through the assembly graph are evaluated using a maximum likelihood model. Additionally, the assembly process is repeated for a user-specified number of times on resampled subsets of reads to select for annotation of the reconstructed sequences with highest bootstrap support. Experiments on WGS data sets from a variety of species show that the SMART pipeline produces complete circular mitochondrial genome sequences with a higher success rate than current state-of-the-art tools, particularly for low-coverage WGS data sets.


Subject(s)
Genome, Mitochondrial/genetics , Genomics/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Software/statistics & numerical data , DNA, Mitochondrial/genetics , Sequence Analysis, DNA/statistics & numerical data , Whole Genome Sequencing/statistics & numerical data
10.
BMC Bioinformatics ; 20(1): 610, 2019 Nov 27.
Article in English | MEDLINE | ID: mdl-31775616

ABSTRACT

BACKGROUND: Over the last 10 years, there have been over 3300 genome-wide association studies (GWAS). Almost every GWAS provides a Manhattan plot either as a main figure or in the supplement. Several software packages can generate a Manhattan plot, but they are all limited in the extent to which they can annotate gene names, allele frequencies, and variants having high impact on gene function, or provide any other added information or flexibility. Furthermore, in a conventional Manhattan plot, there is no way of distinguishing a locus identified due to a single variant with a very significant p-value from a locus with multiple variants that appear to be in a haplotype block and have very similar p-values. RESULTS: Here we present a software tool written in R that generates a transposed Manhattan plot, annotated with additional features such as variant consequence and minor allele frequency, and addresses these limitations. The software also gives flexibility on how and where the user wants to display the annotations. The software can be downloaded from the CRAN repository and also from the GitHub project page. CONCLUSIONS: We present a major step up from the existing conventional Manhattan plot generation tools. We hope this form of display, along with the added annotations, will bring more insight to the reader from this new Manhattan++ plot.


Subject(s)
Genome-Wide Association Study/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Genome , Humans , Polymorphism, Single Nucleotide , Software
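
To make the plot geometry in this abstract concrete, the sketch below computes the point coordinates of a Manhattan plot, with an option to swap the axes for a transposed layout. It is written in Python for illustration and is not the R tool the abstract describes. The function name `manhattan_points`, the `(chromosome, position, p_value)` tuple layout, and the use of the largest observed position as a stand-in for chromosome length are assumptions.

```python
import math


def manhattan_points(variants, transposed=False):
    """Return plot coordinates for a list of (chromosome, position, p_value).

    Conventional layout: x = cumulative genomic position, y = -log10(p).
    Transposed layout:   x = -log10(p), y = cumulative genomic position.
    """
    # Approximate each chromosome's length by its largest observed position.
    chrom_len = {}
    for chrom, pos, _ in variants:
        chrom_len[chrom] = max(chrom_len.get(chrom, 0), pos)

    # Lay the chromosomes end to end in sorted order.
    offsets, running = {}, 0
    for chrom in sorted(chrom_len):
        offsets[chrom] = running
        running += chrom_len[chrom]

    points = []
    for chrom, pos, p_value in variants:
        genome_pos = offsets[chrom] + pos
        score = -math.log10(p_value)
        points.append((score, genome_pos) if transposed else (genome_pos, score))
    return points


# Hypothetical variants, for illustration only.
example = [(1, 100, 1e-8), (1, 250, 0.01), (2, 50, 1e-3)]
conventional = manhattan_points(example)
flipped = manhattan_points(example, transposed=True)
```

With genomic position on the vertical axis, one likely motivation is that per-variant text annotations can run horizontally next to each point; the abstract itself does not state the reason.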
11.
PLoS Comput Biol ; 15(4): e1006682, 2019 04.
Article in English | MEDLINE | ID: mdl-30943207

ABSTRACT

High quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms and usually have minor or major errors. Manual curation is time consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities but involving undergraduates in community genome annotation consortiums can be mutually beneficial for both education and improved genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.


Subject(s)
Computational Biology/education , Genomics/education , Models, Genetic , Molecular Sequence Annotation/statistics & numerical data , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , Guidelines as Topic , Humans , Students
12.
Brief Bioinform ; 20(1): 168-177, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968630

ABSTRACT

Pathway enrichment analysis has been widely used to identify cancer risk pathways, and contributes to elucidating the mechanism of tumorigenesis. However, most of the existing approaches use outdated pathway information and neglect the complex gene interactions within pathways. Here, we first briefly reviewed the existing widely used pathway enrichment analysis approaches, and then proposed a novel topology-based pathway enrichment analysis (TPEA) method, which integrated topological properties and global upstream/downstream positions of genes in pathways. We compared TPEA with four widely used pathway enrichment analysis tools, including database for annotation, visualization and integrated discovery (DAVID), gene set enrichment analysis (GSEA), centrality-based pathway enrichment (CePa) and signaling pathway impact analysis (SPIA), by analyzing six gene expression profiles of three tumor types (colorectal cancer, thyroid cancer and endometrial cancer). As a result, we identified several well-known cancer risk pathways that could not be obtained by the existing tools, and the results of TPEA were more stable than those of the other tools in analyzing different data sets of the same cancer. Ultimately, we developed an R package to implement TPEA, which can update KEGG pathway information online and is available at the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/TPEA/.


Subject(s)
Databases, Genetic/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Neoplasms/genetics , Carcinogenesis/genetics , Computational Biology/methods , Female , Gene Regulatory Networks , Humans , Male , Molecular Sequence Annotation/statistics & numerical data , Signal Transduction/genetics , Software
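
For readers unfamiliar with pathway enrichment, the sketch below shows a common baseline formulation: a one-sided hypergeometric over-representation test. It is not the TPEA method, which the abstract says additionally uses pathway topology, and it is not claimed to match the exact statistic of any tool named above. The function name and the parameter names `population`, `pathway_size`, `selected`, and `overlap` are illustrative assumptions.

```python
from math import comb


def overrepresentation_p(population: int, pathway_size: int,
                         selected: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    population   : number of genes in the background set
    pathway_size : background genes that belong to the pathway
    selected     : size of the gene list of interest (e.g. differentially expressed)
    overlap      : genes of interest that fall in the pathway
    """
    total = comb(population, selected)
    upper = min(pathway_size, selected)
    tail = sum(
        comb(pathway_size, i) * comb(population - pathway_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 selected, all 3 in the pathway.
p_toy = overrepresentation_p(10, 4, 3, 3)
```

This test treats a pathway as an unordered gene set, which is exactly the simplification that topology-based methods aim to move beyond.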
13.
Brief Bioinform ; 20(4): 1071-1084, 2019 07 19.
Article in English | MEDLINE | ID: mdl-28968784

ABSTRACT

The overwhelming list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform the manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. Then, the functionalities of the platform are presented in a way that provides practical guidance and help to nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and a recently developed original method (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly contribute, through the user community, to maintaining accurate resources.


Subject(s)
Genome, Microbial , Genomics/methods , Molecular Sequence Annotation/methods , Software , Computational Biology , Computer Graphics , Database Management Systems , Databases, Chemical , Genomics/statistics & numerical data , Internet , Metabolic Networks and Pathways/genetics , Microbiological Phenomena , Molecular Sequence Annotation/statistics & numerical data , User-Computer Interface
14.
Brief Bioinform ; 20(1): 288-298, 2019 01 18.
Article in English | MEDLINE | ID: mdl-29028903

ABSTRACT

RNA sequencing (RNA-seq) has become a standard procedure to investigate transcriptional changes between conditions and is routinely used in research and clinics. While standard differential expression (DE) analysis between two conditions has been extensively studied and improved over the past decades, RNA-seq time course (TC) DE analysis algorithms are still in their early stages. In this study, we compared, for the first time, existing TC RNA-seq tools on an extensive simulation data set and validated the best performing tools on published data. Surprisingly, TC tools were outperformed by the classical pairwise comparison approach on short time series (<8 time points) in terms of overall performance and robustness to noise, mostly because of a high number of false positives, with the exception of ImpulseDE2. Overlapping the candidate lists of different tools mitigated this shortcoming, as the majority of false-positive, but not true-positive, candidates were unique to each method. On longer time series, the pairwise approach performed worse overall than splineTC and maSigPro, which did not identify any false-positive candidate.


Subject(s)
Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Bayes Theorem , Computational Biology/methods , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Humans , Markov Chains , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Sequence Analysis, RNA/statistics & numerical data , Signal-To-Noise Ratio , Software , Time Factors
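
The overlap strategy mentioned in this abstract can be sketched as a simple vote across tools: keep a candidate gene only if enough tools report it. The snippet is a minimal illustration in Python, not the study's evaluation code. The function name `consensus_candidates`, the `min_tools` threshold, and the tool and gene identifiers are hypothetical.

```python
from collections import Counter


def consensus_candidates(calls_by_tool, min_tools=2):
    """Return the genes reported by at least `min_tools` different tools.

    calls_by_tool maps a tool name to the list of genes it called as
    differentially expressed.
    """
    support = Counter()
    for genes in calls_by_tool.values():
        # Count each gene once per tool, even if a tool lists it twice.
        support.update(set(genes))
    return {gene for gene, n_tools in support.items() if n_tools >= min_tools}


# Hypothetical call sets, for illustration only.
calls = {
    "tool_a": ["g1", "g2", "g3"],
    "tool_b": ["g1", "g4"],
    "tool_c": ["g1", "g2", "g5"],
}
shared_by_two = consensus_candidates(calls)
shared_by_all = consensus_candidates(calls, min_tools=3)
```

If false positives are mostly unique to one tool, as the abstract reports, they drop out of the consensus while shared true positives remain.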
15.
Brief Bioinform ; 20(4): 1449-1464, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29490019

ABSTRACT

Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in the public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers in analyzing their big data effectively and in generating valuable biological knowledge and novel insights with higher efficiency.


Subject(s)
Cluster Analysis , Computational Biology/methods , Data Mining/methods , Algorithms , Big Data , Databases, Genetic/statistics & numerical data , Disease/classification , Disease/genetics , Gene Expression/drug effects , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Humans , Molecular Sequence Annotation/statistics & numerical data
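
As one concrete example of how a bicluster (a subset of rows together with a subset of columns) can be scored, the sketch below computes the mean squared residue, a coherence score commonly associated with early biclustering work on gene expression. The abstract does not name this score, so its choice here is an assumption, as are the function name and the toy matrices.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by `rows` and `cols`.

    A value of 0 means every selected row follows the same additive
    pattern across the selected columns (a perfectly coherent bicluster).
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    return sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    ) / (n_rows * n_cols)


# Rows differ only by a constant shift, so the bicluster is perfectly coherent.
additive = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
coherent_score = mean_squared_residue(additive, [0, 1, 2], [0, 1, 2])

# A checkerboard pattern has no shared additive trend.
checker = [[1, 0], [0, 1]]
incoherent_score = mean_squared_residue(checker, [0, 1], [0, 1])
```

A search procedure would then add or remove rows and columns to keep this score low while making the bicluster as large as possible.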
16.
J Exp Bot ; 70(4): 1069-1076, 2019 02 20.
Article in English | MEDLINE | ID: mdl-30590678

ABSTRACT

The use of draft genomes of different species and re-sequencing of accessions and populations are now common tools for plant biology research. The de novo assembled draft genomes make it possible to identify pivotal divergence points in the plant lineage and provide an opportunity to investigate the genomic basis and timing of biological innovations by inferring orthologs between species. Furthermore, re-sequencing facilitates the mapping and subsequent molecular characterization of causative loci for traits, such as those for plant stress tolerance and development. In both cases, high-quality gene annotation (the identification of protein-coding regions, gene promoters, and 5'- and 3'-untranslated regions) is critical for investigation of gene function. Annotations are constantly improving, but automated gene annotations still require manual curation and experimental validation. This is particularly important for genes with large introns, genes located in regions rich with transposable elements or repeats, large gene families, and segmentally duplicated genes. In this opinion paper, we highlight the impact of annotation quality on evolutionary analyses, genome-wide association studies, and the identification of orthologous genes in plants. Furthermore, we predict that incorporating accurate information from manual curation into databases will dramatically improve the performance of automated gene predictors.


Subject(s)
Evolution, Molecular , Genes, Plant , Genome-Wide Association Study , Plants/genetics , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation/statistics & numerical data
17.
Pac Symp Biocomput ; 23: 602-613, 2018.
Article in English | MEDLINE | ID: mdl-29218918

ABSTRACT

Analysis of patient genomes and transcriptomes routinely recognizes new gene sets associated with human disease. Here we present an integrative natural language processing system which infers common functions for a gene set through automatic mining of the scientific literature with biological networks. This system links genes with associated literature phrases and combines these links with protein interactions in a single heterogeneous network. Multiscale functional annotations are inferred based on network distances between phrases and genes and then visualized as an ontology of biological concepts. To evaluate this system, we predict functions for gene sets representing known pathways and find that our approach achieves substantial improvement over the conventional text-mining baseline method. Moreover, our system discovers novel annotations for gene sets or pathways without previously known functions. Two case studies demonstrate how the system is used in discovery of new cancer-related pathways with ontological annotations.


Subject(s)
Gene Ontology/statistics & numerical data , Gene Regulatory Networks , Molecular Sequence Annotation/statistics & numerical data , Protein Interaction Maps , Algorithms , Computational Biology/methods , Data Mining/statistics & numerical data , Humans , Natural Language Processing , Neoplasms/genetics
18.
G3 (Bethesda) ; 8(1): 1-8, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29167271

ABSTRACT

Pyrenophora teres f. teres, the causal agent of net form net blotch (NFNB) of barley, is a destructive pathogen in barley-growing regions throughout the world. Typical yield losses due to NFNB range from 10 to 40%; however, complete loss has been observed on highly susceptible barley lines where environmental conditions favor the pathogen. Currently, genomic resources for this economically important pathogen are limited to a fragmented draft genome assembly and annotation, with limited RNA support of the P. teres f. teres isolate 0-1. This research presents an updated 0-1 reference assembly facilitated by long-read sequencing and scaffolding with the assistance of genetic linkage maps. Additionally, genome annotation was mediated by RNAseq analysis using three infection time points and a pure culture sample, resulting in 11,541 high-confidence gene models. The 0-1 genome assembly and annotation presented here now contains the majority of the repetitive content of the genome. Analysis of the 0-1 genome revealed classic characteristics of a "two-speed" genome, being compartmentalized into GC-equilibrated and AT-rich compartments. The assembly of repetitive AT-rich regions will be important for future investigation of genes known as effectors, which often reside in close proximity to repetitive regions. These effectors are responsible for manipulation of the host defense during infection. This updated P. teres f. teres isolate 0-1 reference genome assembly and annotation provides a robust resource for the examination of the barley-P. teres f. teres host-pathogen coevolution.


Subject(s)
Ascomycota/genetics , Chromosome Mapping/methods , Genome, Fungal , Hordeum/microbiology , Host-Pathogen Interactions/genetics , Molecular Sequence Annotation/statistics & numerical data , Ascomycota/isolation & purification , Ascomycota/pathogenicity , Base Composition , Gene Ontology , Genetic Linkage , High-Throughput Nucleotide Sequencing , Plant Diseases/microbiology , Virulence
19.
Nucleic Acids Res ; 45(8): e57, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28053114

ABSTRACT

Whole transcriptome sequencing (RNA-seq) has become a standard for cataloguing and monitoring RNA populations. One of the main bottlenecks, however, is to correctly identify the different classes of RNAs among the plethora of reconstructed transcripts, particularly those that will be translated (mRNAs) from the class of long non-coding RNAs (lncRNAs). Here, we present FEELnc (FlExible Extraction of LncRNAs), an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames. Benchmarking versus five state-of-the-art tools shows that FEELnc achieves similar or better classification performance on GENCODE and NONCODE data sets. The program also provides specific modules that enable the user to fine-tune classification accuracy, to formalize the annotation of lncRNA classes and to identify lncRNAs even in the absence of a training set of non-coding RNAs. We used FEELnc on a real data set comprising 20 canine RNA-seq samples produced by the European LUPA consortium to substantially expand the canine genome annotation to include 10 374 novel lncRNAs and 58 640 mRNA transcripts. FEELnc moves beyond conventional coding potential classifiers by providing a standardized and complete solution for annotating lncRNAs and is freely available at https://github.com/tderrien/FEELnc.


Subject(s)
Genome , Molecular Sequence Annotation/methods , RNA, Long Noncoding/genetics , Software , Transcriptome , Animals , Benchmarking , Decision Trees , Dogs , Gene Expression Regulation , Humans , Mice , Molecular Sequence Annotation/statistics & numerical data , Open Reading Frames , RNA, Long Noncoding/classification , RNA, Long Noncoding/metabolism , RNA, Messenger/classification , RNA, Messenger/genetics , RNA, Messenger/metabolism , Sequence Analysis, RNA
20.
Pac Symp Biocomput ; 22: 27-38, 2017.
Article in English | MEDLINE | ID: mdl-27896959

ABSTRACT

Automated annotation of protein function has become a critical task in the post-genomic era. Network-based approaches and homology-based approaches have been widely used and recently tested in large-scale community-wide assessment experiments. It is natural to integrate network data with homology information to further improve the predictive performance. However, integrating these two heterogeneous, high-dimensional and noisy datasets is non-trivial. In this work, we introduce a novel protein function prediction algorithm ProSNet. An integrated heterogeneous network is first built to include molecular networks of multiple species and link together homologous proteins across multiple species. Based on this integrated network, a dimensionality reduction algorithm is introduced to obtain compact low-dimensional vectors to encode proteins in the network. Finally, we develop machine learning classification algorithms that take the vectors as input and make predictions by transferring annotations both within each species and across different species. Extensive experiments on five major species demonstrate that our integration of homology with molecular networks substantially improves the predictive performance over existing approaches.


Subject(s)
Algorithms , Molecular Sequence Annotation/statistics & numerical data , Protein Interaction Mapping/statistics & numerical data , Animals , Computational Biology , Humans , Machine Learning , Mice , Sequence Homology, Amino Acid