Results 1 - 7 of 7
1.
bioRxiv ; 2023 Dec 02.
Article in English | MEDLINE | ID: mdl-38076842

ABSTRACT

Despite many improvements over the years, the annotation of the human genome remains imperfect, and even the best annotations of the human reference genome sometimes contradict one another. Hence, refinement of the human genome annotation is an important challenge. The use of evolutionarily conserved sequences provides a strategy for addressing this problem, and the rapidly growing number of genomes from other species increases the power of an evolution-driven approach. Using the latest large-scale whole genome alignment data, we found that splice sites from protein-coding genes in the high-quality MANE annotation are consistently conserved across more than 400 species. We also studied splice sites from the RefSeq, GENCODE, and CHESS databases that are not present in MANE, from both protein-coding genes and lncRNAs. We trained a logistic regression classifier to distinguish between the conservation patterns exhibited by splice sites from MANE versus sites that were flanked by the standard GT-AG dinucleotides, but that were chosen randomly from a sequence not under selection. We found that up to 70% of splice sites from annotated protein-coding transcripts outside of MANE exhibit conservation patterns closer to random sequence as opposed to highly-conserved splice sites from MANE. Our study highlights potentially erroneous splice sites that might require further scrutiny.
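The classification step described in this abstract can be illustrated with a small sketch. It is only a toy under stated assumptions: the two features (fraction of aligned species preserving the donor and the acceptor dinucleotide), the synthetic data, and the names `train_logistic` and `predict_proba` are hypothetical and are not taken from the paper, which does not describe its features or training procedure here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=1.0, epochs=1000):
    """Fit logistic regression by plain batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that a site looks like a conserved (reference-like) site."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Synthetic, made-up data: each row is
# [fraction of species with conserved donor, fraction with conserved acceptor].
rng = random.Random(0)
conserved = [[rng.uniform(0.85, 1.0), rng.uniform(0.85, 1.0)] for _ in range(50)]
background = [[rng.uniform(0.0, 0.3), rng.uniform(0.0, 0.3)] for _ in range(50)]
X = conserved + background
y = [1] * 50 + [0] * 50

w, b = train_logistic(X, y)
accuracy = sum((predict_proba(w, b, xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(X)
```

A site from another annotation would then be scored with `predict_proba(w, b, features)`; a low probability would mark it as resembling the random GT-AG background rather than the reference set.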

2.
Genome Biol ; 24(1): 249, 2023 10 30.
Article in English | MEDLINE | ID: mdl-37904256

ABSTRACT

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess .


Subject(s)
Human Genome, Proteins, Humans, Phylogeny, Proteins/genetics, Algorithms, Software, Molecular Sequence Annotation
3.
Elife ; 11, 2022 12 15.
Article in English | MEDLINE | ID: mdl-36519529

ABSTRACT

Recently developed methods to predict three-dimensional protein structure with high accuracy have opened new avenues for genome and proteome research. We explore a new hypothesis in genome annotation, namely whether computationally predicted structures can help to identify which of multiple possible gene isoforms represents a functional protein product. Guided by protein structure predictions, we evaluated over 230,000 isoforms of human protein-coding genes assembled from over 10,000 RNA sequencing experiments across many human tissues. From this set of assembled transcripts, we identified hundreds of isoforms with more confidently predicted structure and potentially superior function in comparison to canonical isoforms in the latest human gene database. We illustrate our new method with examples where structure provides a guide to function in combination with expression and evolutionary evidence. Additionally, we provide the complete set of structures as a resource to better understand the function of human genes and their isoforms. These results demonstrate the promise of protein structure prediction as a genome annotation tool, allowing us to refine even the most highly curated catalog of human proteins. More generally we demonstrate a practical, structure-guided approach that can be used to enhance the annotation of any genome.
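The selection idea in this abstract, flagging isoforms whose predicted structure is more confident than that of the canonical isoform, might look roughly like the sketch below. The abstract does not name the confidence measure or any threshold, so the per-isoform score (imagined here as something like a mean per-residue confidence on a 0-100 scale), the `margin` parameter, the function name, and all gene and isoform labels are assumptions made for illustration.

```python
def isoforms_beating_canonical(scores, canonical, margin=5.0):
    """Return, per gene, the isoforms whose confidence score exceeds the
    canonical isoform's score by at least `margin` (hypothetical rule)."""
    flagged = {}
    for gene, isoform_scores in scores.items():
        reference = isoform_scores[canonical[gene]]
        better = [
            isoform
            for isoform, score in isoform_scores.items()
            if isoform != canonical[gene] and score - reference >= margin
        ]
        # Highest-confidence alternatives first.
        better.sort(key=lambda isoform: -isoform_scores[isoform])
        if better:
            flagged[gene] = better
    return flagged


# Made-up scores for two made-up genes.
scores = {
    "geneA": {"iso1": 62.0, "iso2": 81.5, "iso3": 64.0},
    "geneB": {"iso1": 90.0, "iso2": 70.0},
}
canonical = {"geneA": "iso1", "geneB": "iso1"}
flagged = isoforms_beating_canonical(scores, canonical)
```

In practice such a score would be only one line of evidence; the abstract itself combines structure with expression and evolutionary evidence.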


Subject(s)
Genome, Transcriptome, Humans, Molecular Sequence Annotation, Protein Isoforms/genetics, RNA Sequence Analysis
4.
Nat Commun ; 11(1): 6327, 2020 12 10.
Article in English | MEDLINE | ID: mdl-33303762

ABSTRACT

Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently-assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine.
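To make the notion of anchors and locally collinear blocks concrete, here is a heavily simplified sketch. It is not the SibeliaZ-LCB algorithm: it uses exact k-mers that occur once in each of two sequences as anchors and merges anchors lying on the same diagonal, ignoring reverse complements, mismatches, and the graph structure the paper relies on. The function names are hypothetical.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map every k-mer of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def collinear_blocks(seq_a, seq_b, k):
    """Return (start_a, start_b, length) for maximal runs of consecutive
    anchors, where an anchor is a k-mer unique in both sequences."""
    index_a = kmer_positions(seq_a, k)
    index_b = kmer_positions(seq_b, k)
    anchors = sorted(
        (positions[0], index_b[kmer][0])
        for kmer, positions in index_a.items()
        if len(positions) == 1 and len(index_b.get(kmer, ())) == 1
    )
    blocks = []  # each entry: [start_a, start_b, number_of_anchors]
    for pos_a, pos_b in anchors:
        if blocks:
            start_a, start_b, count = blocks[-1]
            if pos_a == start_a + count and pos_b == start_b + count:
                blocks[-1][2] += 1  # anchor extends the current block
                continue
        blocks.append([pos_a, pos_b, 1])
    # Convert anchor counts to lengths in bases.
    return [(a, b, count + k - 1) for a, b, count in blocks]


blocks = collinear_blocks("TTACGGATCCAA", "GACGGATCCTT", k=4)
```

Here the shared substring `ACGGATCC` is recovered as a single block starting at position 2 in the first sequence and position 1 in the second.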


Subject(s)
Algorithms, Genome, Animals, Base Sequence, Computer Simulation, Genetic Databases, Genetic Variation, Mice, Nucleotides/genetics, Time Factors
5.
iScience ; 23(6): 101224, 2020 Jun 26.
Article in English | MEDLINE | ID: mdl-32563153

ABSTRACT

Pairwise whole-genome homology mapping is the problem of finding all pairs of homologous intervals between a pair of genomes. As the number of available whole genomes has been rising dramatically in the last few years, there has been a need for more scalable homology mappers. In this paper, we develop an algorithm (BubbZ) for computing whole-genome pairwise homology mappings, especially in the context of all-to-all comparison for multiple genomes. BubbZ is based on an algorithm for computing chains in compacted de Bruijn graphs. We evaluate BubbZ on simulated datasets, a dataset composed of 16 long mouse genomes, and a large dataset of 1,600 Salmonella genomes. We show up to approximately an order of magnitude speed improvement, compared with MashMap2 and Minimap2, while retaining similar accuracy.
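The chaining idea mentioned in this abstract can be sketched in a generic form. The code below is a textbook quadratic-time collinear chaining of anchor pairs with a gap limit, not BubbZ's algorithm, which according to the abstract operates on compacted de Bruijn graphs. The function name and the `max_gap` parameter are assumptions for illustration.

```python
def best_chain(anchors, max_gap):
    """Longest chain of (pos_a, pos_b) anchors that increases in both
    coordinates, with consecutive anchors at most max_gap apart in each."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    score = [1] * n      # best chain length ending at anchor i
    previous = [-1] * n  # back-pointer for reconstructing the chain
    for i in range(n):
        a_i, b_i = anchors[i]
        for j in range(i):
            a_j, b_j = anchors[j]
            close_enough = a_i - a_j <= max_gap and b_i - b_j <= max_gap
            if a_j < a_i and b_j < b_i and close_enough and score[j] + 1 > score[i]:
                score[i] = score[j] + 1
                previous[i] = j
    end = max(range(n), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = previous[end]
    return chain[::-1]


anchors = [(0, 10), (5, 15), (7, 3), (12, 22), (30, 80), (14, 24)]
chain = best_chain(anchors, max_gap=10)
```

The anchors `(7, 3)` and `(30, 80)` are left out of the chain: the first breaks collinearity and the second exceeds the gap limit.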

6.
F1000Res ; 8: 1751, 2019.
Article in English | MEDLINE | ID: mdl-34386196

ABSTRACT

In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.

7.
Bioinformatics ; 33(24): 4024-4032, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-27659452

ABSTRACT

MOTIVATION: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). RESULTS: In this article, we present TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. AVAILABILITY AND IMPLEMENTATION: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. CONTACT: ium125@psu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
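For readers unfamiliar with the data structure, the sketch below shows what a compacted de Bruijn graph is on a toy input. It is a naive in-memory illustration and not TwoPaCo's algorithm: it stores every k-mer in Python sets, ignores reverse complements, and would not scale to real genomes. The function names are hypothetical.

```python
from collections import defaultdict


def junction_kmers(sequences, k):
    """k-mers that start or end a sequence, or that have more than one
    distinct successor or predecessor k-mer."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    junctions = set()
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not kmers:
            continue
        junctions.add(kmers[0])
        junctions.add(kmers[-1])
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)
            predecessors[right].add(left)
    junctions.update(kmer for kmer, nxt in successors.items() if len(nxt) > 1)
    junctions.update(kmer for kmer, prv in predecessors.items() if len(prv) > 1)
    return junctions


def compacted_edges(sequences, k):
    """Cut each sequence at junction k-mers; every piece spells one edge
    (a non-branching path) of the compacted graph."""
    junctions = junction_kmers(sequences, k)
    edges = set()
    for seq in sequences:
        start = 0
        for i in range(1, len(seq) - k + 1):
            if seq[i:i + k] in junctions:
                edges.add(seq[start:i + k])
                start = i
    return edges


genomes = ["AACGTTGCA", "CACGTTGCT"]
junctions = junction_kmers(genomes, k=4)
edges = compacted_edges(genomes, k=4)
```

The two toy sequences share the middle path `ACGTTGC`, which appears once as a single compacted edge, while the differing ends become separate short edges.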


Subject(s)
Algorithms, Genomics/methods, Animals, Human Genome, Humans, Primates/genetics, Software