ABSTRACT
PURPOSE: The ongoing lack of data standardization severely undermines the potential for automated learning from the vast amount of information routinely archived in electronic health records (EHRs), radiation oncology information systems, treatment planning systems, and other cancer care and outcomes databases. We sought to create a standardized ontology for clinical data, social determinants of health, and other radiation oncology concepts and interrelationships. METHODS AND MATERIALS: The American Association of Physicists in Medicine's Big Data Science Committee was initiated in July 2019 to explore common ground from the stakeholders' collective experience of issues that typically compromise the formation of large inter- and intra-institutional databases from EHRs. The Big Data Science Committee adopted an iterative, cyclical approach to engaging stakeholders beyond its membership to optimize the integration of diverse perspectives from the community. RESULTS: We developed the Operational Ontology for Oncology (O3), which identified 42 key elements, 359 attributes, 144 value sets, and 155 relationships ranked in relative importance of clinical significance, likelihood of availability in EHRs, and the ability to modify routine clinical processes to permit aggregation. Recommendations are provided for best use and development of the O3 to 4 constituencies: device manufacturers, centers of clinical care, researchers, and professional societies. CONCLUSIONS: O3 is designed to extend and interoperate with existing global infrastructure and data science standards. The implementation of these recommendations will lower the barriers for aggregation of information that could be used to create large, representative, findable, accessible, interoperable, and reusable data sets to support the scientific objectives of grant programs. 
The construction of comprehensive "real-world" data sets and application of advanced analytical techniques, including artificial intelligence, holds the potential to revolutionize patient management and improve outcomes by leveraging increased access to information derived from larger, more representative data sets.
Subject(s)
Neoplasms; Radiation Oncology; Humans; Artificial Intelligence; Consensus; Neoplasms/radiotherapy; Informatics
ABSTRACT
The HuRef Genome Browser is a web application for the navigation and analysis of the previously published genome of a human individual, termed HuRef. The browser provides a comparative view between the NCBI human reference sequence and the HuRef assembly, and it enables the navigation of the HuRef genome in the context of HuRef, NCBI and Ensembl annotations. Single nucleotide polymorphisms, indels, inversions, structural and copy-number variations are shown in the context of existing functional annotations on either genome in the comparative view. Demonstrated here are some potential uses of the browser to enable a better understanding of individual human genetic variation. The browser provides full access to the underlying reads with sequence and quality information, the genome assembly and the evidence supporting the identification of DNA polymorphisms. The HuRef Browser is a unique and versatile tool for browsing genome assemblies and studying individual human sequence variation in a diploid context. The browser is available online at http://huref.jcvi.org.
Subject(s)
Databases, Nucleic Acid; Genetic Variation; Genome, Human; Genomics; Humans; Internet; Software
ABSTRACT
Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels) (1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
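The 22% figure quoted in the abstract above can be checked against the variant counts it lists. The following is a minimal sketch, not the authors' method: it assumes that "all events" means the five enumerated categories only (segmental duplications and copy number variation regions are described as "numerous" and carry no count), and the names `variant_counts` and `non_snp_fraction` are hypothetical.

```python
# Variant counts as quoted in the abstract. Segmental duplications and
# copy number variation regions are not enumerated there, so they are
# left out of this tally (an assumption of this sketch).
variant_counts = {
    "SNP": 3_213_401,
    "block substitution": 53_823,
    "heterozygous indel": 292_102,
    "homozygous indel": 559_473,
    "inversion": 90,
}


def non_snp_fraction(counts):
    """Return the fraction of enumerated variant events that are not SNPs."""
    total = sum(counts.values())
    return (total - counts["SNP"]) / total


total_events = sum(variant_counts.values())
print(f"total enumerated events: {total_events:,}")
print(f"non-SNP share of events: {non_snp_fraction(variant_counts):.1%}")
```

Under that assumption the enumerated categories sum to just over 4.1 million events and the non-SNP share rounds to 22%, consistent with the abstract. The 74% share of variant bases cannot be reproduced this way because per-category base totals are not given.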
Subject(s)
Chromosome Mapping; Diploidy; Genome, Human; Sequence Analysis, DNA; Base Sequence; Chromosome Mapping/instrumentation; Chromosome Mapping/methods; Chromosomes, Human; Chromosomes, Human, Y/genetics; Gene Dosage; Genotype; Haplotypes; Human Genome Project; Humans; INDEL Mutation; In Situ Hybridization, Fluorescence; Male; Microarray Analysis; Middle Aged; Molecular Sequence Data; Pedigree; Phenotype; Polymorphism, Single Nucleotide; Reproducibility of Results; Sequence Analysis, DNA/instrumentation; Sequence Analysis, DNA/methods
ABSTRACT
The world's oceans contain a complex mixture of microorganisms that are, for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand-kilometer transect from the North Atlantic through the Panama Canal and ending in the South Pacific, yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity, with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed "fragment recruitment," addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed "extreme assembly," made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content, some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preferences.
We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS.
Subject(s)
Water Microbiology; Computational Biology; Food Chain; Oceans and Seas; Plankton; Species Specificity
ABSTRACT
The Genomic Contextual Data Markup Language (GCDML) is a core project of the Genomic Standards Consortium (GSC) that implements the "Minimum Information about a Genome Sequence" (MIGS) specification and its extension, the "Minimum Information about a Metagenome Sequence" (MIMS). GCDML is an XML Schema for generating MIGS/MIMS compliant reports for data entry, exchange, and storage. When mature, this sample-centric, strongly-typed schema will provide a diverse set of descriptors for describing the exact origin and processing of a biological sample, from sampling to sequencing, and subsequent analysis. Here we describe the need for such a project, outline design principles required to support the project, and make an open call for participation in defining the future content of GCDML. GCDML is freely available, and can be downloaded, along with documentation, from the GSC Web site (http://gensc.org).
Subject(s)
Databases, Genetic; Genomics; Programming Languages
ABSTRACT
This meeting report summarizes the proceedings of the "eGenomics: Cataloguing our Complete Genome Collection IV" workshop held June 6-8, 2007, at the National Institute for Environmental eScience (NIEeS), Cambridge, United Kingdom. This fourth workshop of the Genomic Standards Consortium (GSC) was a mix of short presentations, strategy discussions, and technical sessions. Speakers provided progress reports on the development of the "Minimum Information about a Genome Sequence" (MIGS) specification and the closely integrated "Minimum Information about a Metagenome Sequence" (MIMS) specification. The key outcome of the workshop was consensus on the next version of the MIGS/MIMS specification (v1.2). This drove further definition and restructuring of the MIGS/MIMS XML schema (syntax). With respect to semantics, a term vetting group was established to ensure that terms are properly defined and submitted to the appropriate ontology projects. Perhaps the single most important outcome of the workshop was a proposal to move beyond the concept of "minimum" to create a far richer XML schema that would define a "Genomic Contextual Data Markup Language" (GCDML) suitable for wider semantic integration across databases. GCDML will contain not only curated information (e.g., compliant with MIGS/MIMS), but also be extended to include a variety of data processing and calculations. Further information about the Genomic Standards Consortium and its range of activities can be found at http://gensc.org.
Subject(s)
Databases, Genetic; Genomics; Education; Programming Languages; Reference Standards
ABSTRACT
Caminibacter mediatlanticus strain TB-2(T) [1] is a thermophilic, anaerobic, chemolithoautotrophic bacterium isolated from the walls of an active deep-sea hydrothermal vent chimney on the Mid-Atlantic Ridge; it is the type strain of the species. C. mediatlanticus is a Gram-negative member of the Epsilonproteobacteria (order Nautiliales) that grows chemolithoautotrophically with H₂ as the energy source and CO₂ as the carbon source. Nitrate or sulfur is used as the terminal electron acceptor, with resulting production of ammonium and hydrogen sulfide, respectively. In view of the widespread distribution, importance and physiological characteristics of thermophilic Epsilonproteobacteria in deep-sea geothermal environments, it is likely that these organisms provide a relevant contribution to both primary productivity and the biogeochemical cycling of carbon, nitrogen and sulfur at hydrothermal vents. Here we report the main features of the genome of C. mediatlanticus strain TB-2(T).
ABSTRACT
The JCVI metagenomics analysis pipeline provides for the efficient and consistent annotation of shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. The process can be equally applied to individual sequence reads from traditional Sanger capillary electrophoresis sequences, newer technologies such as 454 pyrosequencing, or sequence assemblies derived from one or more of these data types. It includes the analysis of both coding and non-coding genes, whether full-length or, as is often the case for shotgun metagenomics, fragmentary. The system is designed to provide the best-supported conservative functional annotation based on a combination of trusted homology-based scientific evidence and computational assertions and an annotation value hierarchy established through extensive manual curation. The functional annotation attributes assigned by this system include gene name, gene symbol, GO terms, EC numbers, and JCVI functional role categories.
ABSTRACT
With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed under the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the 'transparency' of the information contained in existing genomic databases.
Subject(s)
Chromosome Mapping/methods; Chromosome Mapping/standards; Databases, Factual/standards; Information Dissemination/methods; Information Storage and Retrieval/standards; Information Theory; Internationality
ABSTRACT
We present a draft sequence of the genome of Aedes aegypti, the primary vector for yellow fever and dengue fever, which at approximately 1376 million base pairs is about 5 times the size of the genome of the malaria vector Anopheles gambiae. Nearly 50% of the Ae. aegypti genome consists of transposable elements. These contribute to a factor of approximately 4 to 6 increase in average gene length and in sizes of intergenic regions relative to An. gambiae and Drosophila melanogaster. Nonetheless, chromosomal synteny is generally maintained among all three insects, although conservation of orthologous gene order is higher (by a factor of approximately 2) between the mosquito species than between either of them and the fruit fly. An increase in genes encoding odorant binding, cytochrome P450, and cuticle domains relative to An. gambiae suggests that members of these protein families underpin some of the biological differences between the two mosquito species.
Subject(s)
Aedes/genetics; Genome, Insect; Insect Vectors/genetics; Aedes/metabolism; Animals; Anopheles/genetics; Anopheles/metabolism; Arboviruses; Base Sequence; DNA Transposable Elements; Dengue/prevention & control; Dengue/transmission; Drosophila melanogaster/genetics; Female; Genes, Insect; Humans; Insect Proteins/genetics; Insect Vectors/metabolism; Male; Membrane Transport Proteins/genetics; Molecular Sequence Data; Multigene Family; Protein Structure, Tertiary/genetics; Sequence Analysis, DNA; Sex Characteristics; Sex Determination Processes; Species Specificity; Synteny; Transcription, Genetic; Yellow Fever/prevention & control; Yellow Fever/transmission
ABSTRACT
Since its introduction a decade ago, whole-genome shotgun sequencing (WGS) has been the main approach for producing cost-effective and high-quality genome sequence data. Until now, the Sanger sequencing technology that has served as a platform for WGS has not been truly challenged by emerging technologies. The recent introduction of the pyrosequencing-based 454 sequencing platform (454 Life Sciences, Branford, CT) offers a very promising alternative sequencing technology for incorporation in WGS. In this study, we evaluated the utility and cost-effectiveness of a hybrid sequencing approach using 3730xl Sanger data and 454 data to generate higher-quality, lower-cost assemblies of microbial genomes compared to current Sanger sequencing strategies alone.
Subject(s)
Biotechnology/methods; Genes, Bacterial; Genome, Bacterial; Sequence Analysis, DNA/methods; Biotechnology/trends; Computational Biology/methods; Contig Mapping
ABSTRACT
We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.
Subject(s)
Computational Biology; Genome, Human; Human Genome Project; Computational Biology/standards; Contig Mapping/standards; Humans; RNA, Messenger/analysis; Software
ABSTRACT
The high degree of similarity between the mouse and human genomes is demonstrated through analysis of the sequence of mouse chromosome 16 (Mmu 16), which was obtained as part of a whole-genome shotgun assembly of the mouse genome. The mouse genome is about 10% smaller than the human genome, owing to a lower repetitive DNA content. Comparison of the structure and protein-coding potential of Mmu 16 with that of the homologous segments of the human genome identifies regions of conserved synteny with human chromosomes (Hsa) 3, 8, 12, 16, 21, and 22. Gene content and order are highly conserved between Mmu 16 and the syntenic blocks of the human genome. Of the 731 predicted genes on Mmu 16, 509 align with orthologs on the corresponding portions of the human genome, 44 are likely paralogous to these genes, and 164 genes have homologs elsewhere in the human genome; there are 14 genes for which we could find no human counterpart.
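The gene breakdown in the abstract above is simple bookkeeping, and a short tally makes the proportions explicit. This is a minimal sketch using only the counts quoted in the abstract; the category labels and the names `gene_categories` and `category_shares` are hypothetical shorthand, not the authors' terminology.

```python
# Gene counts for mouse chromosome 16 (Mmu 16) as quoted in the abstract.
PREDICTED_GENES = 731
gene_categories = {
    "ortholog in syntenic human region": 509,
    "likely paralog of those genes": 44,
    "homolog elsewhere in human genome": 164,
    "no human counterpart found": 14,
}


def category_shares(categories):
    """Return each category's count as a fraction of the category total."""
    total = sum(categories.values())
    return {name: count / total for name, count in categories.items()}


# The four categories should account for every predicted gene.
assert sum(gene_categories.values()) == PREDICTED_GENES

for name, share in category_shares(gene_categories).items():
    print(f"{name}: {share:.1%}")
```

The four categories sum to exactly 731, so the abstract's counts are internally consistent, and roughly seven in ten predicted genes have an ortholog in the corresponding human segments.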