ABSTRACT
In 2020, Novartis Pharmaceuticals Corporation and the U.S. Food and Drug Administration (FDA) started a 4-year scientific collaboration to approach complex new data modalities and advanced analytics. The scientific question was to find novel radio-genomics-based prognostic and predictive factors for HR+/HER2- metastatic breast cancer under a Research Collaboration Agreement. This collaboration has been providing valuable insights to help successfully implement future scientific projects, particularly those using artificial intelligence and machine learning. This tutorial aims to provide tangible guidelines for a multi-omics project that includes multidisciplinary expert teams spanning different institutions. We cover key ideas, such as "maintaining effective communication" and "following good data science practices," followed by the four steps of exploratory projects, namely (1) plan, (2) design, (3) develop, and (4) disseminate. We break each step into smaller concepts with strategies for implementation and provide illustrations from our collaboration to give readers further actionable guidance.
Subjects
Artificial Intelligence, Multiomics, Humans, Machine Learning, Genomics
ABSTRACT
BACKGROUND: Novartis and the University of Oxford's Big Data Institute (BDI) have established a research alliance with the aim of improving health care and drug development by making it more efficient and targeted. Using a combination of the latest statistical machine learning technology and an innovative IT platform developed to manage large volumes of anonymised data from numerous data sources and types, we plan to identify novel, clinically relevant patterns that cannot be detected by humans alone, in order to identify phenotypes and early predictors of patient disease activity and progression. METHOD: The collaboration focuses on highly complex autoimmune diseases and develops a computational framework to assemble a research-ready dataset across numerous modalities. For the Multiple Sclerosis (MS) project, the collaboration has anonymised and integrated phase II to phase IV clinical and imaging trial data from ≈35,000 patients across all clinical phenotypes, collected in more than 2200 centres worldwide. For the "IL-17" project, the collaboration has anonymised and integrated clinical and imaging data from over 30 phase II and III Cosentyx clinical trials including more than 15,000 patients with four autoimmune disorders (psoriasis, axial spondyloarthritis, psoriatic arthritis (PsA) and rheumatoid arthritis (RA)). RESULTS: A fundamental component of successful data analysis and of the collaborative development of novel machine learning methods on these rich data sets has been the construction of a research informatics framework that captures the data at regular intervals, anonymises the images and integrates them with the de-identified clinical data, and applies quality control before compiling everything into a research-ready relational database that is available to multidisciplinary analysts. Collaborative development by a group of software developers, data wranglers, statisticians, clinicians, and domain scientists across both organisations has been key. This framework is innovative, as it facilitates collaborative data management and makes a complicated clinical trial data set from a pharmaceutical company available to academic researchers who become associated with the project. CONCLUSIONS: An informatics framework has been developed to capture clinical trial data into a pipeline of anonymisation, quality control, data exploration, and subsequent integration into a database. Establishing this framework has been integral to the development of analytical tools.
Subjects
Data Science, Information Dissemination, Factual Databases, Drug Development, Humans, Research Design
ABSTRACT
Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.
Subjects
Genomic Structural Variation, Genotyping Techniques, Human Genome, Humans
ABSTRACT
SUMMARY: We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci. AVAILABILITY AND IMPLEMENTATION: ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Microsatellite Repeats, Software, Genotype
ABSTRACT
Standardized benchmarking approaches are required to assess the accuracy of variants called from sequence data. Although variant-calling tools and the metrics used to assess their performance continue to improve, important challenges remain. Here, as part of the Global Alliance for Genomics and Health (GA4GH), we present a benchmarking framework for variant calling. We provide guidance on how to match variant calls with different representations, define standard performance metrics, and stratify performance by variant type and genome context. We describe limitations of high-confidence calls and regions that can be used as truth sets (for example, single-nucleotide variant concordance of two methods is 99.7% inside versus 76.5% outside high-confidence regions). Our web-based app enables comparison of variant calls against truth sets to obtain a standardized performance report. Our approach has been piloted in the PrecisionFDA variant-calling challenges to identify the best-in-class variant-calling methods within high-confidence regions. Finally, we recommend a set of best practices for using our tools and evaluating the results.
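As an illustration of the kind of standard performance metrics and stratification the abstract mentions, the following minimal sketch computes precision, recall and F1 per variant-type stratum from true-positive, false-positive and false-negative counts. The counts, the `Counts` class and the `metrics` function are hypothetical and are not taken from the GA4GH tools.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # calls that match the truth set
    fp: int  # calls absent from the truth set
    fn: int  # truth-set variants that were not called


def metrics(c: Counts) -> dict:
    """Return precision, recall and F1 for one stratum (0.0 when undefined)."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, stratified by variant type.
strata = {
    "SNV": Counts(tp=990, fp=10, fn=10),
    "INDEL": Counts(tp=90, fp=10, fn=30),
}

report = {name: metrics(c) for name, c in strata.items()}
for name, m in report.items():
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Reporting each stratum separately, rather than one pooled figure, keeps a weak indel result from being hidden by the much larger number of SNVs.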
Subjects
Benchmarking, Exome/genetics, Human Genome/genetics, High-Throughput Nucleotide Sequencing, Algorithms, Genomics/trends, Germ Cells, Humans, Single Nucleotide Polymorphism/genetics, Software
ABSTRACT
In the version of this article initially published online, two pairs of headings were switched with each other in Table 4: "Recall (PCR free)" was switched with "Recall (with PCR)," and "Precision (PCR free)" was switched with "Precision (with PCR)." The error has been corrected in the print, PDF and HTML versions of this article.
ABSTRACT
We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.
Subjects
Genetic Variation, Germ-Line Mutation, Software, Genetic Databases/statistics & numerical data, Haplotypes, High-Throughput Nucleotide Sequencing/statistics & numerical data, Humans, INDEL Mutation, Genetic Models, Neoplasms/genetics, Whole Genome Sequencing/statistics & numerical data
ABSTRACT
Horizontal gene transfer accelerates bacterial adaptation to novel environments, allowing selection to act on genes that have evolved in multiple genetic backgrounds. This can lead to ecological specialization. However, little is known about how zoonotic bacteria maintain the ability to colonize multiple hosts whilst competing with specialists in the same niche. Here we develop a stochastic evolutionary model and show how genetic transfer of host-segregating alleles, distributed as predicted for niche-specifying genes, and the opportunity for host transition could interact to promote the emergence of host generalist lineages of the zoonotic bacterium Campylobacter. Using a modelling approach, we show that increasing levels of homologous recombination enhance the efficiency with which selection can fix combinations of beneficial alleles, speeding adaptation. We then show how these predictions change in a multi-host system, with low levels of recombination, consistent with real r/m estimates, increasing the standing variation in the population, allowing a more effective response to changes in the selective landscape. Our analysis explains how observed gradients of host specialism and generalism can evolve in a multi-host system through the transfer of ecologically important loci among coexisting strains.
Subjects
Biological Adaptation, Physiological Adaptation, Biological Evolution, Campylobacter/genetics, Campylobacter/physiology, Genetic Models, Horizontal Gene Transfer, Genetic Recombination, Genetic Selection
ABSTRACT
Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
Subjects
Human Genome/genetics, Genomics, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Algorithms, Genetic Databases, Exome/genetics, Genotype, Humans, INDEL Mutation/genetics, Pedigree, Single Nucleotide Polymorphism, Software
ABSTRACT
BACKGROUND: Phylogenetic footprinting is a comparative method based on the principle that functional sequence elements will acquire fewer mutations over time than non-functional sequences. Successful comparisons of distantly related species will thus yield highly important sequence elements likely to serve fundamental biological roles. RNA regulatory elements are less well understood than those in DNA. In this study we use the emerging model organism Nasonia vitripennis, a parasitic wasp, in a comparative analysis against 12 insect genomes to identify deeply conserved non-coding elements (CNEs) conserved in large groups of insects, with a focus on 5' UTRs and promoter sequences. RESULTS: We report the identification of 322 CNEs conserved across a broad range of insect orders. The identified regions are associated with regulatory and developmental genes, and contain short footprints revealing aspects of their likely function in translational regulation. The most ancient regions identified in our analysis were all found to overlap transcribed regions of genes, reflecting stronger conservation of translational regulatory elements than transcriptional elements. Further expanding sequence analyses to non-insect species, we also report the discovery of, to our knowledge, the two oldest and most ubiquitous CNEs yet described in the animal kingdom (700 MYA). These ancient conserved non-coding elements are associated with the two ribosomal stalk genes, RPLP1 and RPLP2, and were very likely functional in some of the earliest animals. CONCLUSIONS: We report the identification of the most deeply conserved CNEs found to date, and several other deeply conserved elements which are, without exception, part of 5' untranslated regions of transcripts and occur in a number of key translational regulatory genes, highlighting translational regulation of translational regulators as a conserved feature of insect genomes.
Subjects
Wasps/genetics, 5' Untranslated Regions, Animals, Base Sequence, Conserved Sequence, Developmental Genes, Insect Genome, Insects/classification, Insects/genetics, Molecular Sequence Data, Phylogeny, Genetic Promoter Regions, Nucleic Acid Regulatory Sequences, Sequence Alignment
ABSTRACT
Daily synchronous rhythms of cell division at the tissue or organism level are observed in many species and suggest that the circadian clock and cell cycle oscillators are coupled. For mammals, despite known mechanistic interactions, the effect of such coupling on clock and cell cycle progression, and hence its biological relevance, is not understood. In particular, we do not know how the temporal organization of cell division at the single-cell level produces this daily rhythm at the tissue level. Here we use multispectral imaging of single live cells, computational methods, and mathematical modeling to address this question in proliferating mouse fibroblasts. We show that in unsynchronized cells the cell cycle and circadian clock robustly phase lock each other in a 1:1 fashion so that in an expanding cell population the two oscillators oscillate in a synchronized way with a common frequency. Dexamethasone-induced synchronization reveals additional clock states. As well as the low-period phase-locked state there are distinct coexisting states with a significantly higher period clock. Cells transition to these states after dexamethasone synchronization. The temporal coordination of cell division by phase locking to the clock at a single-cell level has significant implications because disordered circadian function is increasingly being linked to the pathogenesis of many diseases, including cancer.
Subjects
CLOCK Proteins/metabolism, Cell Cycle Proteins/metabolism, Animals, Circadian Rhythm/drug effects, Dexamethasone/pharmacology, Mice, NIH 3T3 Cells
ABSTRACT
Specific stages of the cell cycle are often restricted to particular times of day because of regulation by the circadian clock. In zebrafish, both mitosis (M phase) and DNA synthesis (S phase) are clock-controlled in cell lines and during embryo development. Despite the ubiquity of this phenomenon, relatively little is known about the underlying mechanism linking the clock to the cell cycle. In this study, we describe an evolutionarily conserved cell-cycle regulator, cyclin-dependent kinase inhibitor 1d (20 kDa protein, p20), which, along with p21, is a strongly rhythmic gene and directly clock-controlled. Both p20 and p21 regulate the G1/S transition of the cell cycle. However, their expression patterns differ, with p20 predominant in developing brain and peak expression occurring 6 h earlier than p21. p20 expression is also p53-independent, in contrast to p21 regulation. Such differences provide a unique mechanism whereby S phase is set to different times of day in a tissue-specific manner, depending on the balance of these two inhibitors.
Subjects
Circadian Rhythm/genetics, Cyclin-Dependent Kinase Inhibitor Proteins/metabolism, DNA Replication/genetics, G1 Phase Cell Cycle Checkpoints/genetics, Zebrafish Proteins/metabolism, Zebrafish/genetics, Amino Acid Sequence, Animals, Base Sequence, Brain/metabolism, Cell Line, Circadian Rhythm/physiology, Computational Biology, Cyclin-Dependent Kinase Inhibitor Proteins/genetics, Cyclin-Dependent Kinase Inhibitor p21/metabolism, DNA Replication/physiology, Flow Cytometry, G1 Phase Cell Cycle Checkpoints/physiology, Immunohistochemistry, In Situ Hybridization, Likelihood Functions, Fluorescence Microscopy, Genetic Models, Molecular Sequence Data, Nocodazole, Phylogeny, Tertiary Protein Structure, Reverse Transcriptase Polymerase Chain Reaction, Sequence Alignment, DNA Sequence Analysis, Time Factors, Zebrafish/physiology, Zebrafish Proteins/genetics
ABSTRACT
Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNSs is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research.
Subjects
Arabidopsis/genetics, Carica/genetics, Plant DNA/chemistry, Plant Gene Expression Regulation, Gene Regulatory Networks, Populus/genetics, Vitis/genetics, Binding Sites, Conserved Sequence, Genomics, Nucleosomes/metabolism, DNA Sequence Analysis, Software
ABSTRACT
Identification of regulatory sequences within non-coding regions of DNA is an essential step towards elucidation of gene networks. This approach constitutes a major challenge, however, as only a very small fraction of non-coding DNA is thought to contribute to gene regulation. The mapping of regulatory regions traditionally involves the laborious construction of promoter deletion series, which are then fused to reporter genes and assayed in transgenic organisms. Bioinformatic methods can be used to scan sequences for matches to known regulatory motifs; however, these methods are currently hampered by the relatively small number of such motifs and by a high false-discovery rate. Here, we demonstrate a robust and highly sensitive in silico method to identify evolutionarily conserved regions within non-coding DNA. Sequence conservation within these regions is taken as evidence for evolutionary pressure against mutations, which is suggestive of functional importance. We test this method on a small set of well-characterised promoters, and show that it successfully identifies known regulatory regions. We further show that these evolutionarily conserved sequences contain clusters of transcription factor binding sites, often described as regulatory modules. A version of the tool optimised for the analysis of plant promoters is available online at http://wsbc.warwick.ac.uk/ears/main.php.
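To make the idea of scanning for conserved regions concrete, here is a minimal sketch that slides a fixed-size window along two sequences that are assumed to be already aligned, and reports windows whose fraction of identical positions reaches a threshold. This is a generic illustration, not the method of the tool described in the abstract; the function name `conserved_windows`, the window size, the threshold and the toy sequences are all assumptions.

```python
def conserved_windows(seq_a: str, seq_b: str, window: int = 10,
                      min_identity: float = 0.8) -> list[tuple[int, float]]:
    """Return (start, identity) for each window meeting the identity threshold.

    Both inputs must be rows of the same pairwise alignment (equal length,
    '-' for gaps). A gap never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy aligned sequences: the first 10 columns are identical, the rest differ.
seq_a = "ACGTACGTACTTTTTTTTTT"
seq_b = "ACGTACGTACGGGGGGGGGG"

hits = conserved_windows(seq_a, seq_b, window=10, min_identity=0.9)
print(hits)
```

In practice overlapping hits would be merged into regions and the threshold calibrated against background conservation; both steps are left out of this sketch.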