RESUMEN
A systematic way of recording data use conditions that are based on consent permissions as found in the datasets of the main public genome archives (NCBI dbGaP and EMBL-EBI/CRG EGA).
Asunto(s)
Bases de Datos de Ácidos Nucleicos , Genoma , Biblioteca Genómica , Investigación sobre Servicios de SaludRESUMEN
The Clinical Genome Resource (ClinGen)'s work to develop a knowledge base to support the understanding of genes and variants for use in precision medicine and research depends on robust, broadly applicable, and adaptable technical standards for sharing data and information. To forward this goal, ClinGen has joined with the Global Alliance for Genomics and Health (GA4GH) to support the development of open, freely-available technical standards and regulatory frameworks for secure and responsible sharing of genomic and health-related data. In its capacity as one of the 15 inaugural GA4GH "Driver Projects," ClinGen is providing input on the key standards needs of the global genomics community, and has committed to participate on GA4GH Work Streams to support the development of: (1) a standard model for computer-readable variant representation; (2) a data model for linking variant data to annotations; (3) a specification to enable sharing of genomic variant knowledge and associated clinical interpretations; and (4) a set of best practices for use of phenotype and disease ontologies. ClinGen's participation as a GA4GH Driver Project will provide a robust environment to test drive emerging genomic knowledge sharing standards and prove their utility among the community, while accelerating the construction of the ClinGen evidence base.
Asunto(s)
Genoma Humano/genética , Difusión de la Información/métodos , Biología Computacional , Bases de Datos Genéticas , Variación Genética , Genómica , Humanos , Medicina de PrecisiónRESUMEN
BACKGROUND: The Personal Genome Project Canada is a comprehensive public data resource that integrates whole genome sequencing data and health information. We describe genomic variation identified in the initial recruitment cohort of 56 volunteers. METHODS: Volunteers were screened for eligibility and provided informed consent for open data sharing. Using blood DNA, we performed whole genome sequencing and identified all possible classes of DNA variants. A genetic counsellor explained the implication of the results to each participant. RESULTS: Whole genome sequencing of the first 56 participants identified 207 662 805 sequence variants and 27 494 copy number variations. We analyzed a prioritized disease-associated data set (n = 1606 variants) according to standardized guidelines, and interpreted 19 variants in 14 participants (25%) as having obvious health implications. Six of these variants (e.g., in BRCA1 or mosaic loss of an X chromosome) were pathogenic or likely pathogenic. Seven were risk factors for cancer, cardiovascular or neurobehavioural conditions. Four other variants - associated with cancer, cardiac or neurodegenerative phenotypes - remained of uncertain significance because of discrepancies among databases. We also identified a large structural chromosome aberration and a likely pathogenic mitochondrial variant. There were 172 recessive disease alleles (e.g., 5 individuals carried mutations for cystic fibrosis). Pharmacogenomics analyses revealed another 3.9 potentially relevant genotypes per individual. INTERPRETATION: Our analyses identified a spectrum of genetic variants with potential health impact in 25% of participants. When also considering recessive alleles and variants with potential pharmacologic relevance, all 56 participants had medically relevant findings. Although access is mostly limited to research, whole genome sequencing can provide specific and novel information with the potential of major impact for health care.
Asunto(s)
Variación Genética/genética , Genoma Humano/genética , Análisis de Secuencia de ADN/métodos , Secuenciación Completa del Genoma/métodos , Canadá , Femenino , Genes Recesivos/genética , Predisposición Genética a la Enfermedad/genética , Humanos , MasculinoRESUMEN
Recent technologies have made it cost-effective to collect diverse types of genome-wide data. Computational methods are needed to combine these data to create a comprehensive view of a given disease or a biological process. Similarity network fusion (SNF) solves this problem by constructing networks of samples (e.g., patients) for each available data type and then efficiently fusing these into one network that represents the full spectrum of underlying data. For example, to create a comprehensive view of a disease given a cohort of patients, SNF computes and fuses patient similarity networks obtained from each of their data types separately, taking advantage of the complementarity in the data. We used SNF to combine mRNA expression, DNA methylation and microRNA (miRNA) expression data for five cancer data sets. SNF substantially outperforms single data type analysis and established integrative approaches when identifying cancer subtypes and is effective for predicting survival.
Asunto(s)
Biología Computacional/métodos , Redes Reguladoras de Genes , Genómica , Estadística como Asunto/métodos , Neoplasias Encefálicas/genética , Enfermedad/genética , Glioblastoma/genética , HumanosRESUMEN
High-throughput RNA sequencing (RNA-seq) promises to revolutionize our understanding of genes and their role in human disease by characterizing the RNA content of tissues and cells. The realization of this promise, however, is conditional on the development of effective computational methods for the identification and quantification of transcripts from incomplete and noisy data. In this article, we introduce iReckon, a method for simultaneous determination of the isoforms and estimation of their abundances. Our probabilistic approach incorporates multiple biological and technical phenomena, including novel isoforms, intron retention, unspliced pre-mRNA, PCR amplification biases, and multimapped reads. iReckon utilizes regularized expectation-maximization to accurately estimate the abundances of known and novel isoforms. Our results on simulated and real data demonstrate a superior ability to discover novel isoforms with a significantly reduced number of false-positive predictions, and our abundance accuracy prediction outmatches that of other state-of-the-art tools. Furthermore, we have applied iReckon to two cancer transcriptome data sets, a triple-negative breast cancer patient sample and the MCF7 breast cancer cell line, and show that iReckon is able to reconstruct the complex splicing changes that were not previously identified. QT-PCR validations of the isoforms detected in the MCF7 cell line confirmed all of iReckon's predictions and also showed strong agreement (r(2) = 0.94) with the predicted abundances.
Asunto(s)
Algoritmos , Simulación por Computador , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Empalme del ARN , Análisis de Secuencia de ARN/métodos , Femenino , Humanos , Células MCF-7 , Precursores del ARN/genética , Precursores del ARN/metabolismo , ARN Mensajero/genética , ARN Mensajero/metabolismo , Transcriptoma , Proteína p53 Supresora de Tumor/genética , Proteína p53 Supresora de Tumor/metabolismoRESUMEN
High-throughput sequencing (HTS) technologies are providing an unprecedented capacity for data generation, and there is a corresponding need for efficient data exploration and analysis capabilities. Although most existing tools for HTS data analysis are developed for either automated (e.g. genotyping) or visualization (e.g. genome browsing) purposes, such tools are most powerful when combined. For example, integration of visualization and computation allows users to iteratively refine their analyses by updating computational parameters within the visual framework in real-time. Here we introduce the second version of the Savant Genome Browser, a standalone program for visual and computational analysis of HTS data. Savant substantially improves upon its predecessor and existing tools by introducing innovative visualization modes and navigation interfaces for several genomic datatypes, and synergizing visual and automated analyses in a way that is powerful yet easy even for non-expert users. We also present a number of plugins that were developed by the Savant Community, which demonstrate the power of integrating visual and automated analyses using Savant. The Savant Genome Browser is freely available (open source) at www.savantbrowser.com.
Asunto(s)
Genómica/métodos , Secuenciación de Nucleótidos de Alto Rendimiento , Programas Informáticos , Gráficos por Computador , Mutación INDEL , Internet , Polimorfismo de Nucleótido Simple , Población/genéticaRESUMEN
The COVID-19 pandemic led to a large global effort to sequence SARS-CoV-2 genomes from patient samples to track viral evolution and inform the public health response. Millions of SARS-CoV-2 genome sequences have been deposited in global public repositories. The Canadian COVID-19 Genomics Network (CanCOGeN - VirusSeq), a consortium tasked with coordinating expanded sequencing of SARS-CoV-2 genomes across Canada early in the pandemic, created the Canadian VirusSeq Data Portal, with associated data pipelines and procedures, to support these efforts. The goal of VirusSeq was to allow open access to Canadian SARS-CoV-2 genomic sequences and enhanced, standardized contextual data that were unavailable in other repositories and that meet FAIR standards (Findable, Accessible, Interoperable and Reusable). In addition, the portal data submission pipeline contains data quality checking procedures and appropriate acknowledgement of data generators that encourages collaboration. From inception to execution, the portal was developed with a conscientious focus on strong data governance principles and practices. Extensive efforts ensured a commitment to Canadian privacy laws, data security standards, and organizational processes. This portal has been coupled with other resources, such as Viral AI, and was further leveraged by the Coronavirus Variants Rapid Response Network (CoVaRR-Net) to produce a suite of continually updated analytical tools and notebooks. Here we highlight this portal (https://virusseq-dataportal.ca/), including its contextual data not available elsewhere, and the Duotang (https://covarr-net.github.io/duotang/duotang.html), a web platform that presents key genomic epidemiology and modelling analyses on circulating and emerging SARS-CoV-2 variants in Canada. Duotang presents dynamic changes in variant composition of SARS-CoV-2 in Canada and by province, estimates variant growth, and displays complementary interactive visualizations, with a text overview of the current situation. The VirusSeq Data Portal and Duotang resources, alongside additional analyses and resources computed from the portal (COVID-MVP, CoVizu), are all open source and freely available. Together, they provide an updated picture of SARS-CoV-2 evolution to spur scientific discussions, inform public discourse, and support communication with and within public health authorities. They also serve as a framework for other jurisdictions interested in open, collaborative sequence data sharing and analyses.
Asunto(s)
COVID-19 , Genoma Viral , SARS-CoV-2 , Canadá/epidemiología , SARS-CoV-2/genética , Humanos , COVID-19/epidemiología , COVID-19/virología , Genómica/métodos , Pandemias , Bases de Datos GenéticasRESUMEN
We have developed PhenoTips: open source software for collecting and analyzing phenotypic information for patients with genetic disorders. Our software combines an easy-to-use interface, compatible with any device that runs a Web browser, with a standardized database back end. The PhenoTips' user interface closely mirrors clinician workflows so as to facilitate the recording of observations made during the patient encounter. Collected data include demographics, medical history, family history, physical and laboratory measurements, physical findings, and additional notes. Phenotypic information is represented using the Human Phenotype Ontology; however, the complexity of the ontology is hidden behind a user interface, which combines simple selection of common phenotypes with error-tolerant, predictive search of the entire ontology. PhenoTips supports accurate diagnosis by analyzing the entered data, then suggesting additional clinical investigations and providing Online Mendelian Inheritance in Man (OMIM) links to likely disorders. By collecting, classifying, and analyzing phenotypic information during the patient encounter, PhenoTips allows for streamlining of clinic workflow, efficient data entry, improved diagnosis, standardization of collected patient phenotypes, and sharing of anonymized patient phenotype data for the study of rare disorders. Our source code and a demo version of PhenoTips are available at http://phenotips.org.
Asunto(s)
Bases de Datos Genéticas , Enfermedades Genéticas Congénitas/diagnóstico , Enfermedades Genéticas Congénitas/genética , Investigación Genética , Fenotipo , Programas Informáticos , Interfaz Usuario-Computador , Algoritmos , Niño , Sistemas de Administración de Bases de Datos , Bases de Datos Factuales , Ontología de Genes , Humanos , Almacenamiento y Recuperación de la InformaciónRESUMEN
The development of high-throughput sequencing (HTS) technologies has opened the door to novel methods for detecting copy number variants (CNVs) in the human genome. While in the past CNVs have been detected based on array CGH data, recent studies have shown that depth-of-coverage information from HTS technologies can also be used for the reliable identification of large copy-variable regions. Such methods, however, are hindered by sequencing biases that lead certain regions of the genome to be over- or undersampled, lowering their resolution and ability to accurately identify the exact breakpoints of the variants. In this work, we develop a method for CNV detection that supplements the depth-of-coverage with paired-end mapping information, where mate pairs mapping discordantly to the reference serve to indicate the presence of variation. Our algorithm, called CNVer, combines this information within a unified computational framework called the donor graph, allowing us to better mitigate the sequencing biases that cause uneven local coverage and accurately predict CNVs. We use CNVer to detect 4879 CNVs in the recently described genome of a Yoruban individual. Most of the calls (77%) coincide with previously known variants within the Database of Genomic Variants, while 81% of deletion copy number variants previously known for this individual coincide with one of our loss calls. Furthermore, we demonstrate that CNVer can reconstruct the absolute copy counts of segments of the donor genome and evaluate the feasibility of using CNVer with low coverage datasets.
Asunto(s)
Emparejamiento Base/fisiología , Mapeo Cromosómico/métodos , Variaciones en el Número de Copia de ADN , Análisis Mutacional de ADN/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Algoritmos , Secuencia de Bases/fisiología , Rotura Cromosómica , Variaciones en el Número de Copia de ADN/genética , Barajamiento de ADN , Humanos , Reproducibilidad de los ResultadosRESUMEN
MOTIVATION: The advent of high-throughput sequencing (HTS) technologies has made it affordable to sequence many individuals' genomes. Simultaneously the computational analysis of the large volumes of data generated by the new sequencing machines remains a challenge. While a plethora of tools are available to map the resulting reads to a reference genome, and to conduct primary analysis of the mappings, it is often necessary to visually examine the results and underlying data to confirm predictions and understand the functional effects, especially in the context of other datasets. RESULTS: We introduce Savant, the Sequence Annotation, Visualization and ANalysis Tool, a desktop visualization and analysis browser for genomic data. Savant was developed for visualizing and analyzing HTS data, with special care taken to enable dynamic visualization in the presence of gigabases of genomic reads and references the size of the human genome. Savant supports the visualization of genome-based sequence, point, interval and continuous datasets, and multiple visualization modes that enable easy identification of genomic variants (including single nucleotide polymorphisms, structural and copy number variants), and functional genomic information (e.g. peaks in ChIP-seq data) in the context of genomic annotations. AVAILABILITY: Savant is freely available at http://compbio.cs.toronto.edu/savant.
Asunto(s)
Genómica/métodos , Programas Informáticos , Secuencia de Bases , Genoma , Genoma Humano , Ensayos Analíticos de Alto Rendimiento , Humanos , Polimorfismo de Nucleótido Simple , Análisis de SecuenciaRESUMEN
We promote a shared vision and guide for how and when to federate genomic and health-related data sharing, enabling connections and insights across independent, secure databases. The GA4GH encourages a federated approach wherein data providers have the mandate and resources to share, but where data cannot move for legal or technical reasons. We recommend a federated approach to connect national genomics initiatives into a global network and precision medicine resource.
RESUMEN
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
Asunto(s)
Mapeo Cromosómico/métodos , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Algoritmos , Animales , Simulación por Computador , Modelos Estadísticos , Reproducibilidad de los Resultados , Urocordados/genéticaRESUMEN
In the version of this article initially published, Lena Dolman's second affiliation was given as Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK. The correct second affiliation is Ontario Institute for Cancer Research, Toronto, Ontario, Canada. The error has been corrected in the HTML and PDF versions of the article.
RESUMEN
The volume of genomics and health data is growing rapidly, driven by sequencing for both research and clinical use. However, under current practices, the data is fragmented into many distinct datasets, and researchers must go through a separate application process for each dataset. This is time-consuming both for the researchers and the data stewards, and it reduces the velocity of research and new discoveries that could improve human health. We propose to simplify this process, by introducing a standard Library Card that identifies and authenticates researchers across all participating datasets. Each researcher would only need to apply once to establish their bona fides as a qualified researcher, and could then use the Library Card to access a wide range of datasets that use a compatible data access policy and authentication protocol.
RESUMEN
The Global Alliance for Genomics and Health (GA4GH) proposes a data access policy model-"registered access"-to increase and improve access to data requiring an agreement to basic terms and conditions, such as the use of DNA sequence and health data in research. A registered access policy would enable a range of categories of users to gain access, starting with researchers and clinical care professionals. It would also facilitate general use and reuse of data but within the bounds of consent restrictions and other ethical obligations. In piloting registered access with the Scientific Demonstration data sharing projects of GA4GH, we provide additional ethics, policy and technical guidance to facilitate the implementation of this access model in an international setting.