Results 1 - 20 of 36
1.
PLoS Biol ; 22(5): e3002405, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38713717

ABSTRACT

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.

2.
Cell Genom ; 4(4): 100527, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38537634

ABSTRACT

The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors by approximately 9-fold, and increases contiguity by 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomics datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.


Subject(s)
Genome , Genomics , Rats , Animals , Genome/genetics , Molecular Sequence Annotation , Whole Genome Sequencing , Genetic Variation/genetics
3.
Genome Biol ; 25(1): 60, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38409096

ABSTRACT

Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084 .


Subject(s)
Databases, Nucleic Acid , Genome , Software
4.
BMC Biol ; 22(1): 16, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38273363

ABSTRACT

BACKGROUND: Understanding genome organization and evolution is important for species involved in transmission of human diseases, such as mosquitoes. Anophelinae and Culicinae subfamilies of mosquitoes show striking differences in genome sizes, sex chromosome arrangements, behavior, and ability to transmit pathogens. However, the genomic basis of these differences is not fully understood. METHODS: In this study, we used a combination of advanced genome technologies such as Oxford Nanopore Technology sequencing, Hi-C scaffolding, Bionano, and cytogenetic mapping to develop an improved chromosome-scale genome assembly for the West Nile vector Culex quinquefasciatus. RESULTS: We then used this assembly to annotate odorant receptors, odorant binding proteins, and transposable elements. A genomic region containing male-specific sequences on chromosome 1 and a polymorphic inversion on chromosome 3 were identified in the Cx. quinquefasciatus genome. In addition, the genome of Cx. quinquefasciatus was compared with the genomes of other mosquitoes such as malaria vectors An. coluzzii and An. albimanus, and the vector of arboviruses Ae. aegypti. Our work confirms significant expansion of the two chemosensory gene families in Cx. quinquefasciatus, as well as a significant increase and relocation of the transposable elements in both Cx. quinquefasciatus and Ae. aegypti relative to the Anophelines. Phylogenetic analysis clarifies the divergence time between the mosquito species. Our study provides new insights into chromosomal evolution in mosquitoes and finds that the X chromosome of Anophelinae and the sex-determining chromosome 1 of Culicinae have a significantly higher rate of evolution than autosomes. CONCLUSION: The improved Cx. quinquefasciatus genome assembly uncovered new details of mosquito genome evolution and has the potential to speed up the development of novel vector control strategies.


Subject(s)
Aedes , Culex , Animals , Humans , Male , Phylogeny , DNA Transposable Elements/genetics , Mosquito Vectors/genetics , Culex/genetics , Aedes/genetics , Chromosomes , Evolution, Molecular
5.
Nucleic Acids Res ; 52(D1): D33-D43, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37994677

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.


Subject(s)
Databases, Genetic , National Library of Medicine (U.S.) , Biotechnology/instrumentation , Databases, Nucleic Acid , Internet , United States
6.
bioRxiv ; 2023 Nov 29.
Article in English | MEDLINE | ID: mdl-38077029

ABSTRACT

We report a new visualization tool for analysis of whole genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by NCBI using assembly alignment algorithms developed by us and others. Researchers can examine the alignments between the two assemblies using two alternate views: a chromosome ideogram-based view or a 2D genome dotplot. Whole genome alignment views expose large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed in the ideogram view where available. CGV currently provides approximately 700 alignments from over 300 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.

7.
Nature ; 622(7981): 41-47, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.


Subject(s)
Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
8.
bioRxiv ; 2023 Jun 06.
Article in English | MEDLINE | ID: mdl-37292984

ABSTRACT

Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 minutes. Testing FCS-GX on artificially fragmented genomes demonstrates sensitivity >95% for diverse contaminant species and specificity >99.93%. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination (0.16% of total bases), with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/.

9.
bioRxiv ; 2023 Sep 28.
Article in English | MEDLINE | ID: mdl-37214860

ABSTRACT

The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors by approximately 9-fold, and increases contiguity by 290-fold compared to its predecessor. Gene annotations are now more complete, significantly improving the mapping precision of genomic, transcriptomic, and proteomics data sets. We jointly analyzed 163 short-read whole genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ~20.0 million sequence variations, of which 18.7 thousand are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.

10.
ArXiv ; 2023 Mar 24.
Article in English | MEDLINE | ID: mdl-36994150

ABSTRACT

Scientists have been trying to identify all of the genes in the human genome since the initial draft of the genome was published in 2001. Over the intervening years, much progress has been made in identifying protein-coding genes, and the estimated number has shrunk to fewer than 20,000, although the number of distinct protein-coding isoforms has expanded dramatically. The invention of high-throughput RNA sequencing and other technological breakthroughs have led to an explosion in the number of reported non-coding RNA genes, although most of them do not yet have any known function. A combination of recent advances offers a path forward to identifying these functions and towards eventually completing the human gene catalogue. However, much work remains to be done before we have a universal annotation standard that includes all medically significant genes, maintains their relationships with different reference genomes, and describes clinically relevant genetic variants.

11.
Nucleic Acids Res ; 51(D1): D29-D38, 2023 Jan 06.
Article in English | MEDLINE | ID: mdl-36370100

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.


Subject(s)
Databases, Genetic , Databases, Nucleic Acid , United States , National Library of Medicine (U.S.) , Sequence Alignment , Biotechnology , Internet
12.
Hypertension ; 80(1): 138-146, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36330812

ABSTRACT

BACKGROUND: We report the creation and evaluation of a de novo assembly of the genome of the spontaneously hypertensive rat, the most widely used model of human cardiovascular disease. METHODS: The genome is assembled from long read sequencing (PacBio HiFi and continuous long read data [CLR]) and scaffolded with long-range structural information obtained from Bionano optical maps and proximity ligation sequencing proximity analysis of the genome. The genome assembly was polished with Illumina short reads. Completeness of the assembly was investigated using Benchmarking Universal Single Copy Orthologs analysis. The genome assembly was also evaluated with the rat reference gene set, using NCBI automated protocols. We also generated orthogonal single molecule transcript sequence reads (Iso-Seq) from 8 tissues and used them to validate the coding assembly, to annotate the assembly with RNA transcripts representing unique full length transcript isoforms for each gene and to determine whether divergences between RefSeq sequences and the assembly were attributable to assembly errors or polymorphisms. RESULTS: The assembly analysis indicates that this assembly is comparable in contiguity and completeness to the current rat reference assembly, while the use of HiFi sequencing yields an assembly that is more correct at the single base level. Synteny analysis was performed to uncover the extent of synteny and the presence and distribution of chromosomal rearrangements between the reference and this assembly. CONCLUSION: The resulting genome assembly is reference quality and captures significant structural variation.


Subject(s)
Stroke , Humans , Rats , Animals , Rats, Inbred SHR , Stroke/genetics
13.
Nature ; 604(7905): 310-315, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35388217

ABSTRACT

Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE (ref. 1) and RefSeq (ref. 2) launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref. 3) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.


Subject(s)
Computational Biology , Databases, Genetic , Genomics , Genome , Humans , Information Dissemination , Molecular Sequence Annotation , National Library of Medicine (U.S.) , United States
14.
Genome Res ; 32(1): 175-188, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34876495

ABSTRACT

Eukaryotic genomes contain many nongenic elements that function in gene regulation, chromosome organization, recombination, repair, or replication, and mutation of those elements can affect genome function and cause disease. Although numerous epigenomic studies provide high coverage of gene regulatory regions, those data are not usually exposed in traditional genome annotation and can be difficult to access and interpret without field-specific expertise. The National Center for Biotechnology Information (NCBI) therefore provides RefSeq Functional Elements (RefSeqFEs), which represent experimentally validated human and mouse nongenic elements derived from the literature. The curated data set comprises richly annotated sequence records, descriptive records in the NCBI Gene database, reference genome feature annotation, and activity-based interactions between nongenic regions, target genes, and each other. The data set provides succinct functional details and transparent experimental evidence, leverages data from multiple experimental sources, is readily accessible and adaptable, and uses a flexible data model. The data have multiple uses: for basic functional discovery, bioinformatics studies, and genetic variant interpretation; as known positive controls for epigenomic data evaluation; and as reference standards for functional interactions. Comparisons to other gene regulatory data sets show that the RefSeqFE data set includes a wider range of feature types representing more areas of biology, but it is comparatively smaller and subject to data selection biases. RefSeqFEs thus provide an alternative and complementary resource for experimentally assayed functional elements, with future data set growth expected.


Subject(s)
Computational Biology , Genome , Animals , Databases, Genetic , Eukaryota/genetics , Humans , Mice , Reference Standards
16.
Genes (Basel) ; 12(6), 2021 May 30.
Article in English | MEDLINE | ID: mdl-34070911

ABSTRACT

The domestic dog has evolved to be an important biomedical model for studies regarding the genetic basis of disease, morphology and behavior. Genetic studies in the dog have relied on a draft reference genome of a purebred female boxer dog named "Tasha" initially published in 2005. Derived from a Sanger whole genome shotgun sequencing approach coupled with limited clone-based sequencing, the initial assembly and subsequent updates have served as the predominant resource for canine genetics for 15 years. While the initial assembly produced a good-quality draft, as with all assemblies produced at the time, it contained gaps, assembly errors and missing sequences, particularly in GC-rich regions, which are found at many promoters and in the first exons of protein-coding genes. Here, we present Dog10K_Boxer_Tasha_1.0, an improved chromosome-level highly contiguous genome assembly of Tasha created with long-read technologies that increases sequence contiguity >100-fold, closes >23,000 gaps of the CanFam3.1 reference assembly and improves gene annotation by identifying >1200 new protein-coding transcripts. The assembly and annotation are available at NCBI under the accession GCF_000002285.5.


Subject(s)
Dogs/genetics , Genome , Animals , Contig Mapping , Molecular Sequence Annotation
17.
BMC Biol ; 19(1): 41, 2021 Mar 10.
Article in English | MEDLINE | ID: mdl-33750380

ABSTRACT

BACKGROUND: The stable fly, Stomoxys calcitrans, is a major blood-feeding pest of livestock that has near worldwide distribution, causing an annual cost of over $2 billion for control and product loss in the USA alone. Control of these flies has been limited to increased sanitary management practices and insecticide application for suppressing larval stages. Few genetic and molecular resources are available to help in developing novel methods for controlling stable flies. RESULTS: This study examines stable fly biology by utilizing a combination of high-quality genome sequencing and RNA-Seq analyses targeting multiple developmental stages and tissues. In conjunction, 1600 genes were manually curated to characterize genetic features related to stable fly reproduction, vector host interactions, host-microbe dynamics, and putative targets for control. Most notable was characterization of genes associated with reproduction and identification of expanded gene families with functional associations to vision, chemosensation, immunity, and metabolic detoxification pathways. CONCLUSIONS: The combined sequencing, assembly, and curation of the male stable fly genome followed by RNA-Seq and downstream analyses provide insights necessary to understand the biology of this important pest. These resources and new data will provide the groundwork for expanding the tools available to control stable fly infestations. The close relationship of Stomoxys to other blood-feeding (horn flies and Glossina) and non-blood-feeding flies (house flies, medflies, Drosophila) will facilitate understanding of the evolutionary processes associated with development of blood feeding among the Cyclorrhapha.


Subject(s)
Genome, Insect , Host-Parasite Interactions/genetics , Insect Control , Muscidae/genetics , Animals , Reproduction/genetics
18.
BMC Genomics ; 21(1): 227, 2020 Mar 14.
Article in English | MEDLINE | ID: mdl-32171258

ABSTRACT

BACKGROUND: Halyomorpha halys (Stål), the brown marmorated stink bug, is a highly invasive insect species due in part to its exceptionally high levels of polyphagy. This species is also a nuisance due to overwintering in human-made structures. It has caused significant agricultural losses in recent years along the Atlantic seaboard of North America and in continental Europe. Genomic resources will assist with determining the molecular basis for this species' feeding and habitat traits, defining potential targets for pest management strategies. RESULTS: Analysis of the 1.15-Gb draft genome assembly has identified a wide variety of genetic elements underpinning the biological characteristics of this formidable pest species, encompassing the roles of sensory functions, digestion, immunity, detoxification and development, all of which likely support H. halys' capacity for invasiveness. Many of the genes identified herein have potential for biomolecular pesticide applications. CONCLUSIONS: Availability of the H. halys genome sequence will be useful for the development of environmentally friendly biomolecular pesticides to be applied in concert with more traditional, synthetic chemical-based controls.


Subject(s)
Heteroptera/genetics , Insect Proteins/genetics , Insecticide Resistance , Whole Genome Sequencing/methods , Animals , Ecosystem , Gene Transfer, Horizontal , Genome Size , Heteroptera/classification , Introduced Species , Phylogeny
19.
Nucleic Acids Res ; 48(D1): D9-D16, 2020 Jan 08.
Article in English | MEDLINE | ID: mdl-31602479

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface, a sequence database search and a gene orthologs page. Additional resources that were updated in the past year include PMC, Bookshelf, My Bibliography, Assembly, RefSeq, viral genomes, the prokaryotic genome annotation pipeline, Genome Workbench, dbSNP, BLAST, Primer-BLAST, IgBLAST and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


Subject(s)
Computational Biology/methods , Computational Biology/organization & administration , Databases, Genetic , National Library of Medicine (U.S.) , Databases, Nucleic Acid , Genomics/methods , Humans , PubMed , United States , Web Browser
20.
Nat Commun ; 10(1): 260, 2019 Jan 16.
Article in English | MEDLINE | ID: mdl-30651564

ABSTRACT

Rapid innovation in sequencing technologies and improvement in assembly algorithms have enabled the creation of highly contiguous mammalian genomes. Here we report a chromosome-level assembly of the water buffalo (Bubalus bubalis) genome using single-molecule sequencing and chromatin conformation capture data. PacBio Sequel reads, with a mean length of 11.5 kb, helped to resolve repetitive elements and generate sequence contiguity. All five B. bubalis sub-metacentric chromosomes were correctly scaffolded with centromeres spanned. Although the index animal was partly inbred, 58% of the genome was haplotype-phased by FALCON-Unzip. This new reference genome improves the contig N50 of the previous short-read based buffalo assembly more than a thousand-fold and contains only 383 gaps. It surpasses the human and goat references in sequence contiguity and facilitates the annotation of hard to assemble gene clusters such as the major histocompatibility complex (MHC).


Subject(s)
Buffaloes/genetics , Chromosomes, Mammalian/genetics , Contig Mapping/methods , Genome/genetics , Goats/genetics , Animals , Chromatin/chemistry , Chromatin/genetics , Female , Genomics/methods , Haplotypes , High-Throughput Nucleotide Sequencing , Humans , Major Histocompatibility Complex/genetics , Molecular Sequence Annotation/methods , Multigene Family/genetics , Repetitive Sequences, Nucleic Acid/genetics , Whole Genome Sequencing