Results 1 - 20 of 47
1.
Cell ; 173(5): 1098-1110.e18, 2018 05 17.
Article in English | MEDLINE | ID: mdl-29706541

ABSTRACT

Bats harbor many viruses asymptomatically, including several notorious for causing extreme virulence in humans. To identify differences between antiviral mechanisms in humans and bats, we sequenced, assembled, and analyzed the genome of Rousettus aegyptiacus, a natural reservoir of Marburg virus and the only known reservoir for any filovirus. We found an expanded and diversified KLRC/KLRD family of natural killer cell receptors, MHC class I genes, and type I interferons, which dramatically differ from their functional counterparts in other mammals. Such concerted evolution of key components of bat immunity is strongly suggestive of novel modes of antiviral defense. An evaluation of the theoretical function of these genes suggests that an inhibitory immune state may exist in bats. Based on our findings, we hypothesize that tolerance of viral infection, rather than enhanced potency of antiviral defenses, may be a key mechanism by which bats asymptomatically host viruses that are pathogenic in humans.


Subject(s)
Chiroptera/genetics; Genome; Immunity, Innate/genetics; Amino Acid Sequence; Animals; Cell Line; Chiroptera/classification; Chiroptera/immunology; Chromosome Mapping; Disease Reservoirs/virology; Egypt; Evolution, Molecular; Genetic Variation; Histocompatibility Antigens Class I/classification; Histocompatibility Antigens Class I/genetics; Humans; Interferon Type I/classification; Interferon Type I/genetics; Marburg Virus Disease/immunology; Marburg Virus Disease/pathology; Marburgvirus/physiology; NK Cell Lectin-Like Receptor Subfamily C/chemistry; NK Cell Lectin-Like Receptor Subfamily C/classification; NK Cell Lectin-Like Receptor Subfamily C/genetics; NK Cell Lectin-Like Receptor Subfamily D/chemistry; NK Cell Lectin-Like Receptor Subfamily D/classification; NK Cell Lectin-Like Receptor Subfamily D/genetics; Phylogeny; Sequence Alignment
2.
Nature ; 604(7905): 310-315, 2022 04.
Article in English | MEDLINE | ID: mdl-35388217

ABSTRACT

Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE (ref. 1) and RefSeq (ref. 2) launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref. 3) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.


Subject(s)
Computational Biology; Databases, Genetic; Genomics; Genome; Humans; Information Dissemination; Molecular Sequence Annotation; National Library of Medicine (U.S.); United States
3.
Nature ; 611(7936): 519-531, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society (refs. 1,2). However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals (refs. 3,4). Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome (ref. 5). To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity (ref. 6). Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.


Subject(s)
Chromosome Mapping; Diploidy; Genome, Human; Genomics; Humans; Chromosome Mapping/standards; Genome, Human/genetics; Haplotypes/genetics; High-Throughput Nucleotide Sequencing/methods; High-Throughput Nucleotide Sequencing/standards; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/standards; Reference Standards; Genomics/methods; Genomics/standards; Chromosomes, Human/genetics; Genetic Variation/genetics
4.
Nature ; 594(7861): 77-81, 2021 06.
Article in English | MEDLINE | ID: mdl-33953399

ABSTRACT

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation (refs. 1,2). Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes (refs. 1,3-5) and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.


Subject(s)
Evolution, Molecular; Genome/genetics; Genomics; Pan paniscus/genetics; Phylogeny; Animals; Eukaryotic Initiation Factor-4A/genetics; Female; Genes; Gorilla gorilla/genetics; Molecular Sequence Annotation/standards; Pan troglodytes/genetics; Pongo/genetics; Segmental Duplications, Genomic; Sequence Analysis, DNA
5.
Nature ; 592(7856): 737-746, 2021 04.
Article in English | MEDLINE | ID: mdl-33911273

ABSTRACT

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species (refs. 1-4). To address this issue, the international Genome 10K (G10K) consortium (refs. 5,6) has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.


Subject(s)
Genome; Genomics/methods; Vertebrates/genetics; Animals; Birds; Gene Library; Genome Size; Genome, Mitochondrial; Haplotypes; High-Throughput Nucleotide Sequencing; Molecular Sequence Annotation; Sequence Alignment; Sequence Analysis, DNA; Sex Chromosomes/genetics
6.
Nature ; 585(7823): 79-84, 2020 09.
Article in English | MEDLINE | ID: mdl-32663838

ABSTRACT

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist (refs. 1,2). Here we present a human genome assembly that surpasses the continuity of GRCh38 (ref. 2), along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome (ref. 3), we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.


Subject(s)
Chromosomes, Human, X/genetics; Genome, Human/genetics; Telomere/genetics; Centromere/genetics; CpG Islands/genetics; DNA Methylation; DNA, Satellite/genetics; Female; Humans; Hydatidiform Mole/genetics; Male; Pregnancy; Reproducibility of Results; Testis/metabolism
7.
Nucleic Acids Res ; 52(D1): D762-D769, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37962425

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase in the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.


Subject(s)
Archaea; Bacteria; Databases, Nucleic Acid; Metagenome; Archaea/genetics; Bacteria/genetics; Databases, Nucleic Acid/standards; Databases, Nucleic Acid/trends; Genome, Archaeal/genetics; Genome, Bacterial/genetics; Internet; Molecular Sequence Annotation; Proteins/genetics
8.
Nucleic Acids Res ; 52(D1): D33-D43, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37994677

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.


Subject(s)
Databases, Genetic; National Library of Medicine (U.S.); Biotechnology/instrumentation; Databases, Nucleic Acid; Internet; United States
9.
Proc Natl Acad Sci U S A ; 120(7): e2201076120, 2023 02 14.
Article in English | MEDLINE | ID: mdl-36749728

ABSTRACT

Sea turtles represent an ancient lineage of marine vertebrates that evolved from terrestrial ancestors over 100 Mya. The genomic basis of the unique physiological and ecological traits enabling these species to thrive in diverse marine habitats remains largely unknown. Additionally, many populations have drastically declined due to anthropogenic activities over the past two centuries, and their recovery is a high global conservation priority. We generated and analyzed high-quality reference genomes for the leatherback (Dermochelys coriacea) and green (Chelonia mydas) turtles, representing the two extant sea turtle families. These genomes are highly syntenic and homologous, but localized regions of noncollinearity were associated with higher copy numbers of immune, zinc-finger, and olfactory receptor (OR) genes in green turtles, with ORs related to waterborne odorants greatly expanded in green turtles. Our findings suggest that divergent evolution of these key gene families may underlie immunological and sensory adaptations assisting navigation, occupancy of neritic versus pelagic environments, and diet specialization. Reduced collinearity was especially prevalent in microchromosomes, with greater gene content, heterozygosity, and genetic distances between species, supporting their critical role in vertebrate evolutionary adaptation. Finally, diversity and demographic histories starkly contrasted between species, indicating that leatherback turtles have had a low yet stable effective population size, exhibit extremely low diversity compared with other reptiles, and harbor a higher genetic load compared with green turtles, reinforcing concern over their persistence under future climate scenarios. These genomes provide invaluable resources for advancing our understanding of evolution and conservation best practices in an imperiled vertebrate lineage.


Subject(s)
Turtles; Animals; Ecosystem; Population Dynamics
10.
Nat Methods ; 19(6): 696-704, 2022 06.
Article in English | MEDLINE | ID: mdl-35361932

ABSTRACT

Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.


Subject(s)
High-Throughput Nucleotide Sequencing; Nanopores; Genome; Genomics; Humans; Sequence Analysis, DNA
11.
Proc Natl Acad Sci U S A ; 119(4), 2022 01 25.
Article in English | MEDLINE | ID: mdl-35042802

ABSTRACT

A global international initiative, such as the Earth BioGenome Project (EBP), requires both agreement and coordination on standards to ensure that the collective effort generates rapid progress toward its goals. To this end, the EBP initiated five technical standards committees comprising volunteer members from the global genomics scientific community: Sample Collection and Processing, Sequencing and Assembly, Annotation, Analysis, and IT and Informatics. The current versions of the resulting standards documents are available on the EBP website, with the recognition that opportunities, technologies, and challenges may improve or change in the future, requiring flexibility for the EBP to meet its goals. Here, we describe some highlights from the proposed standards, and areas where additional challenges will need to be met.


Subject(s)
Base Sequence/genetics; Eukaryota/genetics; Genomics/standards; Animals; Biodiversity; Genomics/methods; Humans; Reference Standards; Reference Values; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/standards
12.
Nucleic Acids Res ; 50(D1): D20-D26, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34850941

ABSTRACT

The National Center for Biotechnology Information (NCBI) produces a variety of online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, RefSeq, SRA, Virus, dbSNP, dbVar, ClinicalTrials.gov, MMDB, iCn3D and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.


Subject(s)
Biotechnology/trends; Databases, Genetic/trends; Databases, Chemical; Databases, Nucleic Acid; Databases, Protein; Humans; Internet; National Library of Medicine (U.S.); PubMed; United States
13.
Nucleic Acids Res ; 49(D1): D1020-D1028, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33270901

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Subject(s)
Computational Biology/methods; Databases, Genetic; Genome, Archaeal/genetics; Genome, Bacterial/genetics; Molecular Sequence Annotation/methods; Proteins/genetics; Data Curation/methods; Data Mining/methods; Genomics/methods; Internet; Proteins/classification; User-Computer Interface
14.
Nucleic Acids Res ; 48(D1): D9-D16, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31602479

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface, a sequence database search and a gene orthologs page. Additional resources that were updated in the past year include PMC, Bookshelf, My Bibliography, Assembly, RefSeq, viral genomes, the prokaryotic genome annotation pipeline, Genome Workbench, dbSNP, BLAST, Primer-BLAST, IgBLAST and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


Subject(s)
Computational Biology/methods; Computational Biology/organization & administration; Databases, Genetic; National Library of Medicine (U.S.); Databases, Nucleic Acid; Genomics/methods; Humans; PubMed; United States; Web Browser
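
Note on entry 14: the abstract describes the E-utilities as the programming interface for the Entrez system. As a rough illustration only, the sketch below composes an ESearch request URL without sending it. The function name `build_esearch_url`, the default `retmax`, and the example query are assumptions made for this note and do not come from the abstract; the endpoint and the `db`, `term`, `retmax`, and `retmode` parameters follow the public E-utilities documentation.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities, per the public documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Compose an ESearch URL for one Entrez database (hypothetical helper).

    Nothing is sent over the network; the caller decides how and when
    to issue the request.
    """
    params = {
        "db": db,            # Entrez database name, e.g. "pubmed"
        "term": term,        # free-text query
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("pubmed", "RefSeq annotation", retmax=5))
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect and leaves request throttling to the caller.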
15.
Genome Res ; 27(5): 849-864, 2017 05.
Article in English | MEDLINE | ID: mdl-28396521

ABSTRACT

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.


Subject(s)
Contig Mapping/methods; Genome, Human; Genomics/methods; Sequence Analysis, DNA/methods; Software; Contig Mapping/standards; Genomics/standards; Haploidy; Haplotypes; Humans; Polymorphism, Genetic; Reference Standards; Sequence Analysis, DNA/standards
16.
Nucleic Acids Res ; 46(D1): D851-D860, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29112715

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Subject(s)
Data Curation; Databases, Nucleic Acid; Genome; Molecular Sequence Annotation; Prokaryotic Cells; Archaea/genetics; Bacteria/genetics; Databases, Protein; Eukaryota/genetics; Forecasting; Humans; Sequence Homology; Software; Viruses/genetics
17.
Plant J ; 89(4): 789-804, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27862469

ABSTRACT

The flowering plant Arabidopsis thaliana is a dicot model organism for research in many aspects of plant biology. A comprehensive annotation of its genome paves the way for understanding the functions and activities of all types of transcripts, including mRNA, the various classes of non-coding RNA, and small RNA. The TAIR10 annotation update had a profound impact on Arabidopsis research but was released more than 5 years ago. Maintaining the accuracy of the annotation continues to be a prerequisite for future progress. Using an integrative annotation pipeline, we assembled tissue-specific RNA-Seq libraries from 113 datasets and constructed 48 359 transcript models of protein-coding genes in eleven tissues. In addition, we annotated various classes of non-coding RNA including microRNA, long intergenic RNA, small nucleolar RNA, natural antisense transcript, small nuclear RNA, and small RNA using published datasets and in-house analytic results. Altogether, we identified 635 novel protein-coding genes, 508 novel transcribed regions, 5178 non-coding RNAs, and 35 846 small RNA loci that were formerly unannotated. Analysis of the splicing events and RNA-Seq based expression profiles revealed the landscapes of gene structures, untranslated regions, and splicing activities to be more intricate than previously appreciated. Furthermore, we present 692 uniformly expressed housekeeping genes, 43% of whose human orthologs are also housekeeping genes. This updated Arabidopsis genome annotation with a substantially increased resolution of gene models will not only further our understanding of the biological processes of this plant model but also of other species.


Subject(s)
Arabidopsis Proteins/genetics; Arabidopsis/genetics; Gene Expression Profiling; Gene Expression Regulation, Plant/genetics; Genome, Plant/genetics; RNA, Plant/genetics; Transcriptome/genetics
18.
Nucleic Acids Res ; 44(D1): D73-80, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26578580

ABSTRACT

The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.


Subject(s)
Databases, Nucleic Acid; Genomics; Animals; Genome; Humans; Internet; Mice
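
Note on entry 18: the abstract lists contig N50 among the contiguity metrics reported by the Assembly database. For readers unfamiliar with the metric, the sketch below uses the common definition: the contig length at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `contig_n50` and the example lengths are illustrative assumptions; the abstract does not describe how NCBI computes the value internally.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)   # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, used only to exercise the function.
    print(contig_n50([80, 70, 50, 40, 30, 20]))
```

Because N50 depends only on the sorted lengths, it says nothing about correctness of the assembled sequence, which is why the database reports it alongside other statistics such as total gap length.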
19.
Nucleic Acids Res ; 44(D1): D733-45, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26553804

ABSTRACT

The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.


Subject(s)
Databases, Genetic; Genomics; Animals; Cattle; Gene Expression Profiling; Genome, Fungal; Genome, Human; Genome, Microbial; Genome, Plant; Genome, Viral; Genomics/standards; Humans; Invertebrates/genetics; Mice; Molecular Sequence Annotation; Nematoda/genetics; Phylogeny; RNA, Long Noncoding/genetics; Rats; Reference Standards; Sequence Analysis, Protein; Sequence Analysis, RNA; Vertebrates/genetics
20.
Nucleic Acids Res ; 42(Database issue): D756-63, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24259432

ABSTRACT

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI's eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI's eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.


Subject(s)
Databases, Genetic; Genomics; Mammals/genetics; Animals; Eukaryota/genetics; Exons; Genome; Genomics/standards; Humans; Internet; Molecular Sequence Annotation; Proteins/chemistry; Proteins/genetics; RNA/chemistry; Reference Standards