Results 1 - 20 of 72
1.
Nature ; 622(7981): 41-47, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.


Subject(s)
Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
2.
PLoS Comput Biol ; 17(7): e1008984, 2021 07.
Article in English | MEDLINE | ID: mdl-34329294

ABSTRACT

Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a report in 2016 highlighting the extent of the problem. To assess this, we performed a scan of supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. Improved scanning software that we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. This is due to gene names being converted not just to dates and floating-point numbers, but also to an internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.
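The three kinds of conversion named in this abstract (dates, floating-point numbers, five-digit internal date numbers) can be illustrated with a minimal Python sketch. The regular expressions, the function names `classify_cell` and `flagged_fraction`, and the sample column are all assumptions made for illustration; they are not the rules or code of the scanning software the authors describe.

```python
import re

# Illustrative patterns only; these are assumptions, not the rules used by the
# scanning software described in the abstract.
MONTHS = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sept?|oct|nov|dec)"
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}[-/ ]{MONTHS}|{MONTHS}[-/ ]\d{{1,2}})$", re.IGNORECASE
)
SERIAL_LIKE = re.compile(r"^\d{5}$")  # five-digit internal date number, e.g. 44075
FLOAT_LIKE = re.compile(r"^\d+(?:\.\d+)?E\+?\d+$", re.IGNORECASE)  # e.g. 2.31E+13


def classify_cell(value: str) -> str:
    """Label one gene-column cell as 'date', 'serial', 'float', or 'ok'."""
    text = value.strip()
    if DATE_LIKE.match(text):
        return "date"
    if SERIAL_LIKE.match(text):
        return "serial"
    if FLOAT_LIKE.match(text):
        return "float"
    return "ok"


def flagged_fraction(cells):
    """Fraction of cells that look like converted gene names."""
    cells = list(cells)
    if not cells:
        return 0.0
    flagged = sum(1 for cell in cells if classify_cell(cell) != "ok")
    return flagged / len(cells)


# Hypothetical gene column as it might appear after a spreadsheet round trip.
column = ["TP53", "1-Sep", "Mar-01", "44075", "2.31E+13", "BRCA1"]
labels = [classify_cell(cell) for cell in column]
print(labels)
```

A pattern-based check like this will produce false positives on columns that legitimately hold dates or numbers, so it is best applied only to columns already known to contain gene symbols.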


Subject(s)
Computational Biology/standards , Genes/genetics , Molecular Sequence Annotation/standards , Humans , PubMed , Software , Terminology as Topic
3.
Am J Hum Genet ; 108(9): 1551-1557, 2021 09 02.
Article in English | MEDLINE | ID: mdl-34329581

ABSTRACT

Clinical validity assessments of gene-disease associations underpin analysis and reporting in diagnostic genomics, and yet wide variability exists in practice, particularly in use of these assessments for virtual gene panel design and maintenance. Harmonization efforts are hampered by the lack of agreed terminology, agreed gene curation standards, and platforms that can be used to identify and resolve discrepancies at scale. We undertook a systematic comparison of the content of 80 virtual gene panels used in two healthcare systems by multiple diagnostic providers in the United Kingdom and Australia. The process was enabled by a shared curation platform, PanelApp, and resulted in the identification and review of 2,144 discordant gene ratings, demonstrating the utility of sharing structured gene-disease validity assessments and collaborative discordance resolution in establishing national and international consensus.


Subject(s)
Consensus , Data Curation/standards , Genetic Diseases, Inborn/genetics , Genomics/standards , Molecular Sequence Annotation/standards , Australia , Biomarkers/metabolism , Data Curation/methods , Delivery of Health Care , Gene Expression , Gene Ontology , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genomics/methods , Humans , Mobile Applications/supply & distribution , Terminology as Topic , United Kingdom
4.
Nature ; 594(7861): 77-81, 2021 06.
Article in English | MEDLINE | ID: mdl-33953399

ABSTRACT

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation [1,2]. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes [1,3-5] and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.


Subject(s)
Evolution, Molecular , Genome/genetics , Genomics , Pan paniscus/genetics , Phylogeny , Animals , Eukaryotic Initiation Factor-4A/genetics , Female , Genes , Gorilla gorilla/genetics , Molecular Sequence Annotation/standards , Pan troglodytes/genetics , Pongo/genetics , Segmental Duplications, Genomic , Sequence Analysis, DNA
5.
Proteins ; 89(2): 242-250, 2021 02.
Article in English | MEDLINE | ID: mdl-32935893

ABSTRACT

A major challenge for protein databases is reconciling information from diverse sources. This is especially difficult when some information consists of secondary, human-interpreted rather than primary data. For example, the Swiss-Prot database contains curated annotations of subcellular location that are based on predictions from protein sequence, statements in scientific articles, and published experimental evidence. The Human Protein Atlas (HPA) consists of millions of high-resolution microscopic images that show protein spatial distribution on a cellular and subcellular level. These images are manually annotated with protein subcellular locations by trained experts. The image annotations in HPA can capture the variation of subcellular location across different cell lines, tissues, or tissue states. Systematic investigation of the consistency between HPA and Swiss-Prot assignments of subcellular location, which is important for understanding and utilizing protein location data from the two databases, has not been described previously. In this paper, we quantitatively evaluate the consistency of subcellular location annotations between HPA and Swiss-Prot at multiple levels, as well as variation of protein locations across cell lines and tissues. Our results show that annotations of these two databases differ significantly in many cases, leading to proposed procedures for deriving and integrating the protein subcellular location data. We also find that proteins having highly variable locations are more likely to be biomarkers of diseases, providing support for incorporating analysis of subcellular location in protein biomarker identification and screening.


Subject(s)
Databases, Protein/standards , Molecular Sequence Annotation/standards , Proteins/metabolism , Atlases as Topic , Cell Compartmentation , Cell Line , Eukaryotic Cells/metabolism , Eukaryotic Cells/ultrastructure , Humans , Observer Variation , Proteins/chemistry , Proteins/genetics , Reproducibility of Results , Uncertainty
6.
Genomics ; 113(1 Pt 2): 748-754, 2021 01.
Article in English | MEDLINE | ID: mdl-33053411

ABSTRACT

Next-generation sequencing (NGS), and specifically targeted panel sequencing, is the state of the art in clinical genetic diagnosis of Mendelian diseases. However, the bioinformatics analysis and interpretation of the generated data can be challenging. We examine the default transcript selection of Illumina® VariantStudio®, a user-friendly, commercially available software package that is widely used by genetics professionals. For comparison, we employed Ensembl VEP, an open-source command-line tool, because it provides flexibility regarding transcript selection. Analysis of NGS data from sequencing of 857 germline DNA samples of cancer patients indicated a concordance of 82.82% between the two programs. Significantly, using the default transcript configuration of VariantStudio®, we failed to correctly annotate 11.45% of the identified loss-of-function variants. Our results underline the importance of cautious software and transcript selection and the need for reliable, white-box data analysis, along with bioinformatics expertise, in clinical diagnostics.


Subject(s)
Genetic Testing/methods , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Annotation/methods , Neoplasms/genetics , Genetic Testing/standards , Germ-Line Mutation , High-Throughput Nucleotide Sequencing/standards , Humans , Molecular Sequence Annotation/standards , Neoplasms/diagnosis , Sensitivity and Specificity , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
7.
Cancer Res ; 81(2): 282-288, 2021 01 15.
Article in English | MEDLINE | ID: mdl-33115802

ABSTRACT

Although next-generation sequencing is widely used in cancer to profile tumors and detect variants, most somatic variant callers used in these pipelines identify variants at the lowest possible granularity, single-nucleotide variants (SNV). As a result, multiple adjacent SNVs are called individually instead of as a single multi-nucleotide variant (MNV). With this approach, the amino acid change from an individual SNV within a codon can differ from the amino acid change based on the MNV that results from combining the SNVs, leading to incorrect conclusions about the downstream effects of the variants. Here, we analyzed 10,383 variant call files (VCF) from the Cancer Genome Atlas (TCGA) and found 12,141 incorrectly annotated MNVs. Analysis of seven commonly mutated genes from 178 studies in cBioPortal revealed that MNVs were consistently missed in 20 of these studies, whereas they were correctly annotated in 15 more recent studies. At the BRAF V600 locus, the most common example of MNV, several public datasets reported separate BRAF V600E and BRAF V600M variants instead of a single merged V600K variant. VCFs from the TCGA Mutect2 caller were used to develop a solution to merge SNVs into MNVs. Our custom script used the phasing information from the SNV VCF and determined whether SNVs were at the same codon and needed to be merged into an MNV before variant annotation. This study shows that institutions performing NGS for cancer genomics should incorporate the step of merging MNVs as a best practice in their pipelines. SIGNIFICANCE: Identification of incorrect mutation calls in TCGA, including clinically relevant BRAF V600 and KRAS G12, will influence research and potentially clinical decisions.
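The merging step described in this abstract can be sketched in Python under strong simplifying assumptions: positions are 0-based offsets into a single forward-strand coding sequence, and a `phase_set` label stands in for the phasing information in the VCF. The names `SNV`, `merge_codon_snvs`, and `CODON_TABLE`, and the toy sequence, are hypothetical; this is not the authors' script. The demo reproduces the V600 pattern from the abstract, where GTG (Val) carries two phased substitutions that together give AAG (Lys) rather than separate Met and Glu changes.

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class SNV(NamedTuple):
    pos: int        # 0-based offset into the coding sequence (assumption)
    ref: str        # reference base at that offset
    alt: str        # alternate base
    phase_set: str  # SNVs sharing a label are assumed to lie on one haplotype


# Only the codons needed for the demo; not a full genetic code table.
CODON_TABLE = {"GTG": "V", "GAG": "E", "ATG": "M", "AAG": "K"}


def merge_codon_snvs(cds: str, snvs: List[SNV]) -> List[Tuple[int, str, str]]:
    """Merge phased SNVs in the same codon; return (codon_index, ref, alt)."""
    groups = defaultdict(list)
    for snv in snvs:
        if cds[snv.pos] != snv.ref:
            raise ValueError(f"reference mismatch at offset {snv.pos}")
        groups[(snv.pos // 3, snv.phase_set)].append(snv)

    merged = []
    for (codon_index, _phase), members in sorted(groups.items()):
        start = codon_index * 3
        ref_codon = cds[start:start + 3]
        alt_codon = list(ref_codon)
        for snv in members:
            alt_codon[snv.pos - start] = snv.alt
        merged.append((codon_index, ref_codon, "".join(alt_codon)))
    return merged


toy_cds = "ATGGTGAAA"  # codon 1 is GTG (Val)
phased = [SNV(3, "G", "A", "PS1"), SNV(4, "T", "A", "PS1")]
unphased = [SNV(3, "G", "A", "PS1"), SNV(4, "T", "A", "PS2")]

merged_call = merge_codon_snvs(toy_cds, phased)
separate_calls = merge_codon_snvs(toy_cds, unphased)
print(merged_call, separate_calls)
```

Grouping on both codon index and phase set keeps SNVs on different haplotypes apart, which matters because two substitutions in one codon change the amino acid jointly only when they occur on the same DNA molecule.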


Subject(s)
Genome, Human , Genomics/standards , Molecular Sequence Annotation/standards , Mutation , Neoplasms/genetics , Polymorphism, Single Nucleotide , Scientific Experimental Error/statistics & numerical data , Algorithms , High-Throughput Nucleotide Sequencing/methods , Humans , Neoplasms/pathology
8.
PLoS Genet ; 16(12): e1009060, 2020 12.
Article in English | MEDLINE | ID: mdl-33320851

ABSTRACT

Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.


Subject(s)
Genome-Wide Association Study/methods , Molecular Sequence Annotation/methods , Algorithms , Genome-Wide Association Study/standards , Humans , Molecular Sequence Annotation/standards , Polymorphism, Genetic , Quantitative Trait Loci , Reproducibility of Results
10.
BMC Genomics ; 21(1): 708, 2020 Oct 12.
Article in English | MEDLINE | ID: mdl-33045985

ABSTRACT

BACKGROUND: Nematode model organisms such as Caenorhabditis elegans and Pristionchus pacificus are powerful systems for studying the evolution of gene function at a mechanistic level. However, the identification of P. pacificus orthologs of candidate genes known from C. elegans is complicated by the discrepancy in the quality of gene annotations, a common problem in nematode and invertebrate genomics. RESULTS: Here, we combine comparative genomic screens for suspicious gene models with community-based curation to further improve the quality of gene annotations in P. pacificus. We extend previous curations of one-to-one orthologs to larger gene families and also orphan genes. Cross-species comparisons of protein lengths, screens for atypical domain combinations and species-specific orphan genes resulted in 4311 candidate genes that were subject to community-based curation. Corrections for 2946 gene models were implemented in a new version of the P. pacificus gene annotations. The new set of gene annotations contains 28,896 genes and has a single copy ortholog completeness level of 97.6%. CONCLUSIONS: Our work demonstrates the effectiveness of comparative genomic screens to identify suspicious gene models and the scalability of community-based approaches to improve the quality of thousands of gene models. Similar community-based approaches can help to improve the quality of gene annotations in other invertebrate species, including parasitic nematodes.


Subject(s)
Molecular Sequence Annotation , Rhabditida , Animals , Caenorhabditis elegans/genetics , Genomics , Molecular Sequence Annotation/methods , Molecular Sequence Annotation/standards , Rhabditida/genetics , Species Specificity
11.
Biochemistry ; 59(35): 3258-3270, 2020 09 08.
Article in English | MEDLINE | ID: mdl-32786413

ABSTRACT

Free guanidine is increasingly recognized as a relevant molecule in biological systems. Recently, it was reported that urea carboxylase acts preferentially on guanidine, and consequently, it was considered to participate directly in guanidine biodegradation. Urea carboxylase combines with allophanate hydrolase to comprise the activity of urea amidolyase, an enzyme predominantly found in bacteria and fungi that catalyzes the carboxylation and subsequent hydrolysis of urea to ammonia and carbon dioxide. Here, we demonstrate that urea carboxylase and allophanate hydrolase from Pseudomonas syringae are insufficient to catalyze the decomposition of guanidine. Rather, guanidine is decomposed to ammonia through the combined activities of urea carboxylase, allophanate hydrolase, and two additional proteins of the DUF1989 protein family, expansively annotated as urea carboxylase-associated family proteins. These proteins comprise the subunits of a heterodimeric carboxyguanidine deiminase (CgdAB), which hydrolyzes carboxyguanidine to N-carboxyurea (allophanate). The genes encoding CgdAB colocalize with genes encoding urea carboxylase and allophanate hydrolase. However, 25% of urea carboxylase genes, including all fungal urea amidolyases, do not colocalize with cgdAB. This subset of urea carboxylases correlates with a notable Asp to Asn mutation in the carboxyltransferase active site. Consistent with this observation, we demonstrate that fungal urea amidolyase retains a strong substrate preference for urea. The combined activities of urea carboxylase, carboxyguanidine deiminase and allophanate hydrolase represent a newly recognized pathway for the biodegradation of guanidine. These findings reinforce the relevance of guanidine as a biological metabolite and reveal a broadly distributed group of enzymes that act on guanidine in bacteria.


Subject(s)
Guanidine/metabolism , Hydrolases/metabolism , Nitrogen/metabolism , Pseudomonas syringae/enzymology , Urea/metabolism , Allophanate Hydrolase/chemistry , Allophanate Hydrolase/metabolism , Ammonia/metabolism , Carbon-Nitrogen Ligases/chemistry , Carbon-Nitrogen Ligases/metabolism , Catalysis , Citrullination/physiology , Hydrolases/chemistry , Metabolic Networks and Pathways/physiology , Molecular Sequence Annotation/standards , Protein Subunits/chemistry , Protein Subunits/metabolism , Pseudomonas syringae/metabolism
12.
Nature ; 583(7817): 578-584, 2020 07.
Article in English | MEDLINE | ID: mdl-32699395

ABSTRACT

Bats possess extraordinary adaptations, including flight, echolocation, extreme longevity and unique immunity. High-quality genomes are crucial for understanding the molecular basis and evolution of these traits. Here we incorporated long-read sequencing and state-of-the-art scaffolding protocols [1] to generate, to our knowledge, the first reference-quality genomes of six bat species (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus). We integrated gene projections from our 'Tool to infer Orthologs from Genome Alignments' (TOGA) software with de novo and homology gene predictions as well as short- and long-read transcriptomics to generate highly complete gene annotations. To resolve the phylogenetic position of bats within Laurasiatheria, we applied several phylogenetic methods to comprehensive sets of orthologous protein-coding and noncoding regions of the genome, and identified a basal origin for bats within Scrotifera. Our genome-wide screens revealed positive selection on hearing-related genes in the ancestral branch of bats, which is indicative of laryngeal echolocation being an ancestral trait in this clade. We found selection and loss of immunity-related genes (including pro-inflammatory NF-κB regulators) and expansions of anti-viral APOBEC3 genes, which highlights molecular mechanisms that may contribute to the exceptional immunity of bats. Genomic integrations of diverse viruses provide a genomic record of historical tolerance to viral infection in bats. Finally, we found and experimentally validated bat-specific variation in microRNAs, which may regulate bat-specific gene-expression programs. Our reference-quality bat genomes provide the resources required to uncover and validate the genomic basis of adaptations of bats, and stimulate new avenues of research that are directly relevant to human health and disease [1].


Subject(s)
Adaptation, Physiological/genetics , Chiroptera/genetics , Evolution, Molecular , Genome/genetics , Genomics/standards , Adaptation, Physiological/immunology , Animals , Chiroptera/classification , Chiroptera/immunology , DNA Transposable Elements/genetics , Immunity/genetics , Molecular Sequence Annotation/standards , Phylogeny , RNA, Untranslated/genetics , Reference Standards , Reproducibility of Results , Virus Integration/genetics , Viruses/genetics
13.
Nature ; 583(7818): 693-698, 2020 07.
Article in English | MEDLINE | ID: mdl-32728248

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression [1]. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.


Subject(s)
Databases, Genetic , Genome/genetics , Genomics , Molecular Sequence Annotation , Animals , Binding Sites , Chromatin/genetics , Chromatin/metabolism , DNA Methylation , Databases, Genetic/standards , Databases, Genetic/trends , Gene Expression Regulation/genetics , Genome, Human/genetics , Genomics/standards , Genomics/trends , Histones/metabolism , Humans , Mice , Molecular Sequence Annotation/standards , Quality Control , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/metabolism
14.
Trends Genet ; 36(7): 461-463, 2020 07.
Article in English | MEDLINE | ID: mdl-32544447

ABSTRACT

Since 2002, published miRNAs have been collected and named by the online repository miRBase. However, with 11 000 annual publications this has become challenging. Recently, four specialized miRNA databases were published, addressing particular needs for diverse scientific communities. This development provides major opportunities for the future of miRNA annotation and nomenclature.


Subject(s)
Databases, Nucleic Acid , Gene Expression Regulation , MicroRNAs/genetics , Molecular Sequence Annotation/standards , Sequence Analysis, RNA/standards , Software , Genomics , Humans
15.
FEMS Microbiol Rev ; 44(4): 418-431, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32386204

ABSTRACT

With the rapid increase in the number of sequenced prokaryotic genomes, relying on automated gene annotation became a necessity. Multiple lines of evidence, however, suggest that current bacterial genome annotations may contain inconsistencies and are incomplete, even for so-called well-annotated genomes. We here discuss underexplored sources of protein diversity and new methodologies for high-throughput genome reannotation. The expression of multiple molecular forms of proteins (proteoforms) from a single gene, particularly driven by alternative translation initiation, is gaining interest as a prominent contributor to bacterial protein diversity. In consequence, riboproteogenomic pipelines were proposed to comprehensively capture proteoform expression in prokaryotes by the complementary use of (positional) proteomics and the direct readout of translated genomic regions using ribosome profiling. To complement these discoveries, tailored strategies are required for the functional characterization of newly discovered bacterial proteoforms.


Subject(s)
Bacteria/genetics , Bacterial Proteins/genetics , Genome, Bacterial/genetics , Molecular Sequence Annotation/standards , Proteogenomics , Bacterial Proteins/chemistry
16.
Nature ; 581(7809): 452-458, 2020 05.
Article in English | MEDLINE | ID: mdl-32461655

ABSTRACT

The acceleration of DNA sequencing in samples from patients and population studies has resulted in extensive catalogues of human genetic variation, but the interpretation of rare genetic variants remains problematic. A notable example of this challenge is the existence of disruptive variants in dosage-sensitive disease genes, even in apparently healthy individuals. Here, by manual curation of putative loss-of-function (pLoF) variants in haploinsufficient disease genes in the Genome Aggregation Database (gnomAD) [1], we show that one explanation for this paradox involves alternative splicing of mRNA, which allows exons of a gene to be expressed at varying levels across different cell types. Currently, no existing annotation tool systematically incorporates information about exon expression into the interpretation of variants. We develop a transcript-level annotation metric known as the 'proportion expressed across transcripts', which quantifies isoform expression for variants. We calculate this metric using 11,706 tissue samples from the Genotype Tissue Expression (GTEx) project [2] and show that it can differentiate between weakly and highly evolutionarily conserved exons, a proxy for functional importance. We demonstrate that expression-based annotation selectively filters 22.8% of falsely annotated pLoF variants found in haploinsufficient disease genes in gnomAD, while removing less than 4% of high-confidence pathogenic variants in the same genes. Finally, we apply our expression filter to the analysis of de novo variants in patients with autism spectrum disorder and intellectual disability or developmental disorders to show that pLoF variants in weakly expressed regions have similar effect sizes to those of synonymous variants, whereas pLoF variants in highly expressed exons are most strongly enriched among cases. Our annotation is fast, flexible and generalizable, making it possible for any variant file to be annotated with any isoform expression dataset, and will be valuable for the genetic diagnosis of rare diseases, the analysis of rare variant burden in complex disorders, and the curation and prioritization of variants in recall-by-genotype studies.
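One way to read the 'proportion expressed across transcripts' metric is as the share of a gene's expression that comes from the transcripts on which a variant carries a given annotation, computed per tissue and then averaged. The Python sketch below implements that reading only; the exact normalization and averaging used by the authors are not stated in the abstract, and the function name `pext`, the transcript labels, and the expression values are all hypothetical.

```python
import math
from typing import Dict, Set


def pext(expr_by_tissue: Dict[str, Dict[str, float]],
         annotated_transcripts: Set[str]) -> float:
    """Mean, over tissues, of annotated-transcript expression / gene expression.

    expr_by_tissue maps tissue -> {transcript_id: expression}. Tissues where
    the gene is not expressed at all are skipped (an assumption of this sketch).
    """
    ratios = []
    for expression in expr_by_tissue.values():
        gene_total = sum(expression.values())
        if gene_total == 0:
            continue
        carrying = sum(
            value for transcript, value in expression.items()
            if transcript in annotated_transcripts
        )
        ratios.append(carrying / gene_total)
    return sum(ratios) / len(ratios) if ratios else math.nan


# Hypothetical isoform expression for one gene in two tissues.
expression = {
    "brain": {"T1": 9.0, "T2": 1.0},
    "liver": {"T1": 5.0, "T2": 5.0},
}

# A variant that is loss-of-function only on the minor isoform T2.
weak_exon_score = pext(expression, {"T2"})
# A variant that is loss-of-function on every isoform.
constitutive_score = pext(expression, {"T1", "T2"})
print(weak_exon_score, constitutive_score)
```

Under this reading, a low score flags a variant that sits in an exon used mainly by weakly expressed isoforms, which is the situation the abstract describes for falsely annotated pLoF variants.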


Subject(s)
Disease/genetics , Haploinsufficiency/genetics , Loss of Function Mutation/genetics , Molecular Sequence Annotation , Transcription, Genetic , Transcriptome/genetics , Autism Spectrum Disorder/genetics , Datasets as Topic , Developmental Disabilities/genetics , Exons/genetics , Female , Genotype , Humans , Intellectual Disability/genetics , Male , Molecular Sequence Annotation/standards , Poisson Distribution , RNA, Messenger/analysis , RNA, Messenger/genetics , Rare Diseases/diagnosis , Rare Diseases/genetics , Reproducibility of Results , Exome Sequencing
17.
Annu Rev Genomics Hum Genet ; 21: 55-79, 2020 08 31.
Article in English | MEDLINE | ID: mdl-32421357

ABSTRACT

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.


Subject(s)
Genome, Human , Molecular Sequence Annotation/methods , Molecular Sequence Annotation/standards , Humans
18.
BMC Bioinformatics ; 21(1): 211, 2020 May 24.
Article in English | MEDLINE | ID: mdl-32448124

ABSTRACT

BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. RESULTS: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. CONCLUSION: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. 
The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.


Subject(s)
Betacoronavirus , Coronavirus Infections , Databases, Nucleic Acid , Molecular Sequence Annotation , Pandemics , Pneumonia, Viral , Software , Betacoronavirus/genetics , COVID-19 , Coronavirus Infections/genetics , DNA Viruses , Genomics , Humans , Molecular Sequence Annotation/standards , Pneumonia, Viral/genetics , SARS-CoV-2 , Viruses
19.
Gigascience ; 9(4)2020 04 01.
Article in English | MEDLINE | ID: mdl-32315029

ABSTRACT

BACKGROUND: Jellyfish belong to the phylum Cnidaria, which occupies an important phylogenetic location in the early-branching Metazoa lineages. The jellyfish Rhopilema esculentum is an important fishery resource in China. However, the genome resource of R. esculentum has not been reported to date. FINDINGS: In this study, we constructed a chromosome-level genome assembly of R. esculentum using Pacific Biosciences, Illumina, and Hi-C sequencing technologies. The final genome assembly was ∼275.42 Mb, with a contig N50 length of 1.13 Mb. Using Hi-C technology to identify the contacts among contigs, 260.17 Mb (94.46%) of the assembled genome were anchored onto 21 pseudochromosomes with a scaffold N50 of 12.97 Mb. We identified 17,219 protein-coding genes, with an average CDS length of 1,575 bp. The genome-wide phylogenetic analysis indicated that R. esculentum might have evolved more slowly than the other scyphozoan species used in this study. In addition, 127 toxin-like genes were identified, and 1 toxin-related "hub" was found by a genomic survey. CONCLUSIONS: We have generated a chromosome-level genome assembly of R. esculentum that could provide a valuable genomic background for studying the biology and pharmacology of jellyfish, as well as the evolutionary history of Cnidaria.
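The contig and scaffold N50 values quoted in this abstract follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of that definition is given below; the function name `n50` and the toy lengths are illustrative, and only the two assembly sizes in megabases are taken from the abstract.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one sequence length")


# Toy contig lengths (not from the paper): total 20, so half is 10.
toy_contigs = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_contigs)

# Anchored fraction from the figures in the abstract (both in Mb).
anchored_percent = round(260.17 / 275.42 * 100, 2)
print(toy_n50, anchored_percent)
```

Because N50 depends only on the length distribution, it says nothing about base-level accuracy or gene completeness, which is why assemblies are usually reported with several metrics side by side.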


Subject(s)
Chromosomes/genetics , Cnidaria/genetics , Genome/genetics , Reference Standards , Animals , China/epidemiology , Genomics/standards , High-Throughput Nucleotide Sequencing/standards , Molecular Sequence Annotation/standards
20.
Gigascience ; 9(3)2020 03 01.
Article in English | MEDLINE | ID: mdl-32170312

ABSTRACT

BACKGROUND: Over the past few years the variety of experimental designs and protocols for sequencing experiments increased greatly. To ensure the wide usability of the produced data beyond an individual project, rich and systematic annotation of the underlying experiments is crucial. FINDINGS: We first developed an annotation structure that captures the overall experimental design as well as the relevant details of the steps from the biological sample to the library preparation, the sequencing procedure, and the sequencing and processed files. Through various design features, such as controlled vocabularies and different field requirements, we ensured a high annotation quality, comparability, and ease of annotation. The structure can be easily adapted to a large variety of species. We then implemented the annotation strategy in a user-hosted web platform with data import, query, and export functionality. CONCLUSIONS: We present here an annotation structure and user-hosted platform for sequencing experiment data, suitable for lab-internal documentation, collaborations, and large-scale annotation efforts.


Subject(s)
Molecular Sequence Annotation/methods , Sequence Analysis/methods , Software , Molecular Sequence Annotation/standards , Sequence Analysis/standards