ABSTRACT
MOTIVATION: In diploid organisms, phasing is the problem of assigning the alleles at heterozygous variants to one of two haplotypes. Reads from PacBio HiFi sequencing provide long, accurate observations that can be used as the basis for both calling and phasing variants. HiFi reads also excel at calling larger classes of variation, such as structural or tandem repeat variants. However, current phasing tools typically only phase small variants, leaving larger variants unphased. RESULTS: We developed HiPhase, a tool that jointly phases SNVs, indels, structural, and tandem repeat variants. The main benefits of HiPhase are (i) dual mode allele assignment for detecting large variants, (ii) a novel application of the A*-algorithm to phasing, and (iii) logic allowing phase blocks to span breaks caused by alignment issues around reference gaps and homozygous deletions. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes, improving over the current state of the art. Additionally, HiPhase jointly phases SNVs, indels, structural, and tandem repeat variants and includes innate multi-threading, statistics gathering, and concurrent phased alignment output generation. AVAILABILITY AND IMPLEMENTATION: HiPhase is available as source code and a pre-compiled Linux binary with a user guide at https://github.com/PacificBiosciences/HiPhase.
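As a rough illustration of the phase block NG50 metric reported above, the sketch below uses the usual definition: the largest block length L such that blocks of length L or longer together cover at least half of the genome size. The function name `ng50` and the toy numbers are assumptions made for illustration; this is not HiPhase's own statistics code.

```python
def ng50(block_lengths, genome_size):
    """Return the NG50 of block_lengths relative to genome_size.

    Blocks are visited from longest to shortest; the NG50 is the length of
    the block at which the running total first reaches half the genome size.
    Returns 0 if the blocks never cover half of the genome.
    """
    target = genome_size / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Toy example (hypothetical block lengths, not data from the paper).
toy_blocks = [50, 30, 20, 10]
toy_ng50 = ng50(toy_blocks, genome_size=200)
```

Unlike N50, which uses the summed block length as the denominator, NG50 uses a fixed genome size, so values stay comparable across tools that phase different fractions of the genome.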
Subject(s)
Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Algorithms , Haplotypes , Tandem Repeat Sequences
ABSTRACT
A high quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
Subject(s)
Genome, Human , Genomic Structural Variation , Heuristics , Humans , INDEL Mutation
ABSTRACT
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.
Subject(s)
Biomarkers/analysis , Genetic Variation , Genome, Human , Haploidy , Hydatidiform Mole/genetics , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Female , High-Throughput Nucleotide Sequencing , Humans , Molecular Sequence Annotation , Pregnancy
ABSTRACT
Genetic variation in cis-regulatory elements is thought to be a major driving force in morphological and physiological changes. However, identifying transcription factor binding events that code for complex traits remains a challenge, motivating novel means of detecting putatively important binding events. Using a curated set of 1154 high-quality transcription factor motifs, we demonstrate that independently eroded binding sites are enriched for independently lost traits in three distinct pairs of placental mammals. We show that these independently eroded events pinpoint the loss of hindlimbs in dolphin and manatee, degradation of vision in naked mole-rat and star-nosed mole, and the loss of external testes in white rhinoceros and Weddell seal. We additionally show that our method may also be utilized with more than two species. Our study exhibits a novel methodology to detect cis-regulatory mutations which help explain a portion of the molecular mechanism underlying complex trait formation and loss.
Subject(s)
Evolution, Molecular , Nucleotide Motifs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/genetics , Vision, Ocular/genetics , Animals , Binding Sites/genetics , Dolphins/genetics , Dolphins/physiology , Hindlimb/physiology , Male , Mammals/genetics , Mammals/physiology , Mole Rats/genetics , Mole Rats/physiology , Protein Binding/genetics , Testis/physiology , Trichechus/genetics , Trichechus/physiology , Vision, Ocular/physiology
ABSTRACT
PURPOSE: Exome sequencing and diagnosis is beginning to spread across the medical establishment. The most time-consuming part of genome-based diagnosis is the manual step of matching the potentially long list of patient candidate genes to patient phenotypes to identify the causative disease. METHODS: We introduce Phrank (for phenotype ranking), an information theory-inspired method that utilizes a Bayesian network to prioritize candidate diseases or genes, as a stand-alone module that can be run with any underlying knowledgebase and any variant filtering scheme. RESULTS: Phrank outperforms existing methods at ranking the causative disease or gene when applied to 169 real patient exomes with Mendelian diagnoses. Phrank's greatest improvement is in disease space, where across all 169 patients it ranks only 3 diseases on average ahead of the true diagnosis, whereas Phenomizer ranks 32 diseases ahead of the causal one. CONCLUSIONS: Using Phrank to rank all patient candidate genes or diseases, as they start working through a new case, will save the busy clinician much time in deriving a genetic diagnosis.
Subject(s)
Diagnosis, Computer-Assisted , Genetic Diseases, Inborn/diagnosis , Genetic Testing , Phenotype , Software , Benchmarking , Computational Biology/methods , Exome , Humans , Knowledge Bases , Pathology, Molecular/methods
ABSTRACT
PURPOSE: Current clinical genomics assays primarily utilize short-read sequencing (SRS), but SRS has limited ability to evaluate repetitive regions and structural variants. Long-read sequencing (LRS) has complementary strengths, and we aimed to determine whether LRS could offer a means to identify overlooked genetic variation in patients undiagnosed by SRS. METHODS: We performed low-coverage genome LRS to identify structural variants in a patient who presented with multiple neoplasia and cardiac myxomata, in whom the results of targeted clinical testing and genome SRS were negative. RESULTS: This LRS approach yielded 6,971 deletions and 6,821 insertions > 50 bp. Filtering for variants that are absent in an unrelated control and overlap a disease gene coding exon identified three deletions and three insertions. One of these, a heterozygous 2,184 bp deletion, overlaps the first coding exon of PRKAR1A, which is implicated in autosomal dominant Carney complex. RNA sequencing demonstrated decreased PRKAR1A expression. The deletion was classified as pathogenic based on guidelines for interpretation of sequence variants. CONCLUSION: This first successful application of genome LRS to identify a pathogenic variant in a patient suggests that LRS has significant potential for the identification of disease-causing structural variation. Larger studies will ultimately be required to evaluate the potential clinical utility of LRS.
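The filtering step described in this abstract (keep variants absent in an unrelated control that overlap a disease gene coding exon) can be sketched as a simple interval filter. Everything below is a hypothetical illustration: the `Interval` type, the function names, and the coordinates are made up, and "absent in the control" is approximated as "no overlapping control call", which is an assumption rather than the authors' actual matching rule.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_candidates(patient_svs, control_svs, coding_exons):
    """Keep patient SVs with no overlapping control SV that hit a coding exon."""
    return [
        sv
        for sv in patient_svs
        if not any(overlaps(sv, c) for c in control_svs)
        and any(overlaps(sv, e) for e in coding_exons)
    ]


# Hypothetical toy coordinates, not taken from the study.
patient = [Interval("chr1", 100, 300), Interval("chr1", 1000, 1200), Interval("chr2", 50, 90)]
control = [Interval("chr1", 1100, 1150)]
exons = [Interval("chr1", 250, 400), Interval("chr1", 1050, 1300)]
candidates = filter_candidates(patient, control, exons)
```

A production pipeline would more likely match SVs by type, size, and reciprocal overlap, and would use an interval index rather than the quadratic scan shown here.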
Subject(s)
Genetic Association Studies , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/genetics , Genetic Predisposition to Disease , Genetic Variation , Genome, Human , Genomics , Sequence Analysis, DNA , Child , Cyclic AMP-Dependent Protein Kinase RIalpha Subunit/genetics , Echocardiography , Genomics/methods , Humans , Male , Phenotype , Sequence Analysis, DNA/methods , Sequence Deletion
ABSTRACT
Robinow syndrome (RS) is a well-recognized Mendelian disorder known to demonstrate both autosomal dominant and autosomal recessive inheritance. Typical manifestations include short stature, characteristic facies, and skeletal anomalies. Recessive inheritance has been associated with mutations in ROR2 while dominant inheritance has been observed for mutations in WNT5A, DVL1, and DVL3. Through trio whole genome sequencing, we identified a homozygous frameshifting single nucleotide deletion in WNT5A in a previously reported, deceased infant with a unique constellation of features comprising a 46,XY disorder of sex development with multiple congenital malformations including congenital diaphragmatic hernia, ambiguous genitalia, dysmorphic facies, shortened long bones, adactyly, and ventricular septal defect. The parents, who are both heterozygous for the deletion, appear clinically unaffected. In conjunction with published observations of Wnt5a double knockout mice, we provide evidence for the possibility of autosomal recessive inheritance in association with WNT5A loss-of-function mutations in RS.
Subject(s)
Alleles , Craniofacial Abnormalities/diagnosis , Craniofacial Abnormalities/genetics , Dwarfism/diagnosis , Dwarfism/genetics , Limb Deformities, Congenital/diagnosis , Limb Deformities, Congenital/genetics , Loss of Function Mutation , Phenotype , Urogenital Abnormalities/diagnosis , Urogenital Abnormalities/genetics , Wnt-5a Protein/genetics , Animals , Disease Models, Animal , Female , Frameshift Mutation , Gene Frequency , Genetic Association Studies , Homozygote , Humans , Infant , Mice , Mice, Knockout , Point Mutation , Severity of Illness Index , Symptom Assessment , Ultrasonography , Whole Genome Sequencing
ABSTRACT
Mutations of genes within the phosphatidylinositol-3-kinase (PI3K)-AKT-MTOR pathway are well known causes of brain overgrowth (megalencephaly) as well as segmental cortical dysplasia (such as hemimegalencephaly, focal cortical dysplasia and polymicrogyria). Mutations of the AKT3 gene have been reported in a few individuals with brain malformations, to date. Therefore, our understanding regarding the clinical and molecular spectrum associated with mutations of this critical gene is limited, with no clear genotype-phenotype correlations. We sought to further delineate this spectrum, study levels of mosaicism and identify genotype-phenotype correlations of AKT3-related disorders. We performed targeted sequencing of AKT3 on individuals with these phenotypes by molecular inversion probes and/or Sanger sequencing to determine the type and level of mosaicism of mutations. We analysed all clinical and brain imaging data of mutation-positive individuals including neuropathological analysis in one instance. We performed ex vivo kinase assays on AKT3 engineered with the patient mutations and examined the phospholipid binding profile of pleckstrin homology domain localizing mutations. We identified 14 new individuals with AKT3 mutations with several phenotypes dependent on the type of mutation and level of mosaicism. Our comprehensive clinical characterization, and review of all previously published patients, broadly segregates individuals with AKT3 mutations into two groups: patients with highly asymmetric cortical dysplasia caused by the common p.E17K mutation, and patients with constitutional AKT3 mutations exhibiting more variable phenotypes including bilateral cortical malformations, polymicrogyria, periventricular nodular heterotopia and diffuse megalencephaly without cortical dysplasia. All mutations increased kinase activity, and pleckstrin homology domain mutants exhibited enhanced phospholipid binding. 
Overall, our study shows that activating mutations of the critical AKT3 gene are associated with a wide spectrum of brain involvement ranging from focal or segmental brain malformations (such as hemimegalencephaly and polymicrogyria) predominantly due to mosaic AKT3 mutations, to diffuse bilateral cortical malformations, megalencephaly and heterotopia due to constitutional AKT3 mutations. We also provide the first detailed neuropathological examination of a child with extreme megalencephaly due to a constitutional AKT3 mutation. This child has one of the largest documented paediatric brain sizes, to our knowledge. Finally, our data show that constitutional AKT3 mutations are associated with megalencephaly, with or without autism, similar to PTEN-related disorders. Recognition of this broad clinical and molecular spectrum of AKT3 mutations is important for providing early diagnosis and appropriate management of affected individuals, and will facilitate targeted design of future human clinical trials using PI3K-AKT pathway inhibitors.
Subject(s)
Developmental Disabilities/genetics , Megalencephaly/genetics , Mutation/genetics , Proto-Oncogene Proteins c-akt/genetics , Brain/diagnostic imaging , Child , Developmental Disabilities/diagnostic imaging , Developmental Disabilities/pathology , Female , Genetic Association Studies , HEK293 Cells , Humans , Immunoprecipitation , Magnetic Resonance Imaging , Male , Megalencephaly/diagnostic imaging , Megalencephaly/pathology , Mutagenesis, Site-Directed/methods , Phosphatidylinositols/metabolism , Transfection
ABSTRACT
Microbiota regulate intestinal physiology by modifying host gene expression along the length of the intestine, but the underlying regulatory mechanisms remain unresolved. Transcriptional specificity occurs through interactions between transcription factors (TFs) and cis-regulatory regions (CRRs) characterized by nucleosome-depleted accessible chromatin. We profiled transcriptome and accessible chromatin landscapes in intestinal epithelial cells (IECs) from mice reared in the presence or absence of microbiota. We show that regional differences in gene transcription along the intestinal tract were accompanied by major alterations in chromatin accessibility. Surprisingly, we discovered that microbiota modify host gene transcription in IECs without significantly impacting the accessible chromatin landscape. Instead, microbiota regulation of host gene transcription might be achieved by differential expression of specific TFs and enrichment of their binding sites in nucleosome-depleted CRRs near target genes. Our results suggest that the chromatin landscape in IECs is preprogrammed by the host in a region-specific manner to permit responses to microbiota through binding of open CRRs by specific TFs.
Subject(s)
Chromatin Assembly and Disassembly , Intestinal Mucosa/metabolism , Microbiota , Transcription, Genetic , Animals , Intestinal Mucosa/microbiology , Mice , Mice, Inbred C57BL , Organ Specificity , Promoter Regions, Genetic , Transcription Factors/genetics , Transcription Factors/metabolism , Transcriptome
ABSTRACT
PURPOSE: Clinical exome sequencing is nondiagnostic for about 75% of patients evaluated for a possible Mendelian disorder. We examined the ability of systematic reevaluation of exome data to establish additional diagnoses. METHODS: The exome and phenotypic data of 40 individuals with previously nondiagnostic clinical exomes were reanalyzed with current software and literature. RESULTS: A definitive diagnosis was identified for 4 of 40 participants (10%). In these cases the causative variant is de novo and in a relevant autosomal-dominant disease gene. The literature to tie the causative genes to the participants' phenotypes was weak, nonexistent, or not readily located at the time of the initial clinical exome reports. At the time of diagnosis by reanalysis, the supporting literature was 1 to 3 years old. CONCLUSION: Approximately 250 gene-disease and 9,200 variant-disease associations are reported annually. This increase in information necessitates regular reevaluation of nondiagnostic exomes. To be practical, systematic reanalysis requires further automation and more up-to-date variant databases. To maximize the diagnostic yield of exome sequencing, providers should periodically request reanalysis of nondiagnostic exomes. Accordingly, policies regarding reanalysis should be weighed in combination with factors such as cost and turnaround time when selecting a clinical exome laboratory. Genet Med 19(2), 209-214.
Subject(s)
Exome Sequencing/standards , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/genetics , Genetics, Medical/standards , Child , Child, Preschool , Exome/genetics , Female , Genetic Diseases, Inborn/pathology , Humans , Infant , Male , Mutation , Pedigree , Sequence Analysis, DNA
ABSTRACT
Humans differ from other animals in many aspects of anatomy, physiology, and behaviour; however, the genotypic basis of most human-specific traits remains unknown. Recent whole-genome comparisons have made it possible to identify genes with elevated rates of amino acid change or divergent expression in humans, and non-coding sequences with accelerated base pair changes. Regulatory alterations may be particularly likely to produce phenotypic effects while preserving viability, and are known to underlie interesting evolutionary differences in other species. Here we identify molecular events particularly likely to produce significant regulatory changes in humans: complete deletion of sequences otherwise highly conserved between chimpanzees and other mammals. We confirm 510 such deletions in humans, which fall almost exclusively in non-coding regions and are enriched near genes involved in steroid hormone signalling and neural function. One deletion removes a sensory vibrissae and penile spine enhancer from the human androgen receptor (AR) gene, a molecular change correlated with anatomical loss of androgen-dependent sensory vibrissae and penile spines in the human lineage. Another deletion removes a forebrain subventricular zone enhancer near the tumour suppressor gene growth arrest and DNA-damage-inducible, gamma (GADD45G), a loss correlated with expansion of specific brain regions in humans. Deletions of tissue-specific enhancers may thus accompany both loss and gain traits in the human lineage, and provide specific examples of the kinds of regulatory alterations and inactivation events long proposed to have an important role in human evolutionary divergence.
Subject(s)
Biological Evolution , DNA/genetics , Genome, Human/genetics , Human Characteristics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Deletion/genetics , Animals , Brain/anatomy & histology , Brain/metabolism , Chromosomes, Mammalian/genetics , Conserved Sequence/genetics , DNA, Intergenic/genetics , Enhancer Elements, Genetic/genetics , Evolution, Molecular , Genes, Tumor Suppressor , Humans , Male , Mice , Organ Specificity , Pan troglodytes/genetics , Penis/anatomy & histology , Penis/metabolism , Species Specificity , Transgenes/genetics
ABSTRACT
The human genome encodes 1500-2000 different transcription factors (TFs). ChIP-seq is revealing the global binding profiles of a fraction of TFs in a fraction of their biological contexts. These data show that the majority of TFs bind directly next to a large number of context-relevant target genes, that most binding is distal, and that binding is context specific. Because of the effort and cost involved, ChIP-seq is seldom used in search of novel TF function. Such exploration is instead done using expression perturbation and genetic screens. Here we propose a comprehensive computational framework for transcription factor function prediction. We curate 332 high-quality nonredundant TF binding motifs that represent all major DNA binding domains, and improve cross-species conserved binding site prediction to obtain 3.3 million conserved, mostly distal, binding site predictions. We combine these with 2.4 million facts about all human and mouse gene functions, in a novel statistical framework, in search of enrichments of particular motifs next to groups of target genes of particular functions. Rigorous parameter tuning and a harsh null are used to minimize false positives. Our novel PRISM (predicting regulatory information from single motifs) approach obtains 2543 TF function predictions in a large variety of contexts, at a false discovery rate of 16%. The predictions are highly enriched for validated TF roles, and 45 of 67 (67%) tested binding site regions in five different contexts act as enhancers in functionally matched cells.
Subject(s)
Binding Sites/genetics , Computational Biology , Software , Transcription Factors/genetics , Algorithms , Animals , Base Sequence , DNA-Binding Proteins/genetics , Genome , Humans , Mice , Protein Binding/genetics , Regulatory Sequences, Nucleic Acid
ABSTRACT
Genetic studies have identified a core set of transcription factors and target genes that control the development of the neocortex, the region of the human brain responsible for higher cognition. The specific regulatory interactions between these factors, many key upstream and downstream genes, and the enhancers that mediate all these interactions remain mostly uncharacterized. We perform p300 ChIP-seq to identify over 6,600 candidate enhancers active in the dorsal cerebral wall of embryonic day 14.5 (E14.5) mice. Over 95% of the peaks we measure are conserved to human. Eight of ten (80%) candidates tested using mouse transgenesis drive activity in restricted laminar patterns within the neocortex. GREAT based computational analysis reveals highly significant correlation with genes expressed at E14.5 in key areas for neocortex development, and allows the grouping of enhancers by known biological functions and pathways for further studies. We find that multiple genes are flanked by dozens of candidate enhancers each, including well-known key neocortical genes as well as suspected and novel genes. Nearly a quarter of our candidate enhancers are conserved well beyond mammals. Human and zebrafish regions orthologous to our candidate enhancers are shown to most often function in other aspects of central nervous system development. Finally, we find strong evidence that specific interspersed repeat families have contributed potentially key developmental enhancers via co-option. Our analysis expands the methodologies available for extracting the richness of information found in genome-wide functional maps.
Subject(s)
Enhancer Elements, Genetic , Evolution, Molecular , Neocortex/growth & development , Regulatory Sequences, Nucleic Acid/genetics , Animals , Base Sequence , Conserved Sequence/genetics , Gene Expression Regulation, Developmental , Humans , Mice , Neocortex/metabolism , Oligonucleotide Array Sequence Analysis , Promoter Regions, Genetic , Transcription Factors/genetics , Zebrafish/genetics , Zebrafish/growth & development
ABSTRACT
Enhancers are essential gene regulatory elements whose alteration can lead to morphological differences between species, developmental abnormalities, and human disease. Current strategies to identify enhancers focus primarily on noncoding sequences and tend to exclude protein coding sequences. Here, we analyzed 25 available ChIP-seq data sets that identify enhancers in an unbiased manner (H3K4me1, H3K27ac, and EP300) for peaks that overlap exons. We find that, on average, 7% of all ChIP-seq peaks overlap coding exons (after excluding for peaks that overlap with first exons). By using mouse and zebrafish enhancer assays, we demonstrate that several of these exonic enhancer (eExons) candidates can function as enhancers of their neighboring genes and that the exonic sequence is necessary for enhancer activity. Using ChIP, 3C, and DNA FISH, we further show that one of these exonic limb enhancers, Dync1i1 exon 15, has active enhancer marks and physically interacts with Dlx5/6 promoter regions 900 kb away. In addition, its removal by chromosomal abnormalities in humans could cause split hand and foot malformation 1 (SHFM1), a disorder associated with DLX5/6. These results demonstrate that DNA sequences can have a dual function, operating as coding exons in one tissue and enhancers of nearby gene(s) in another tissue, suggesting that phenotypes resulting from coding mutations could be caused not only by protein alteration but also by disrupting the regulation of another gene.
Subject(s)
Enhancer Elements, Genetic , Exons , Gene Expression Regulation , Animals , Chromatin Immunoprecipitation , Chromosome Aberrations , Cytoplasmic Dyneins/genetics , Extremities/embryology , Extremities/physiology , Female , Homeodomain Proteins/genetics , Humans , In Situ Hybridization, Fluorescence , Limb Deformities, Congenital/genetics , Male , Mice , Mice, Transgenic , Promoter Regions, Genetic , Zebrafish/genetics
ABSTRACT
Identifying enhancers regulating gene expression remains an important and challenging task. While recent sequencing-based methods provide epigenomic characteristics that correlate well with enhancer activity, it remains onerous to comprehensively identify all enhancers across development. Here we introduce a computational framework to identify tissue-specific enhancers evolving under purifying selection. First, we incorporate high-confidence binding site predictions with target gene functional enrichment analysis to identify transcription factors (TFs) likely functioning in a particular context. We then search the genome for clusters of binding sites for these TFs, overcoming previous constraints associated with biased manual curation of TFs or enhancers. Applying our method to the placenta, we find 33 known and implicate 17 novel TFs in placental function, and discover 2,216 putative placenta enhancers. Using luciferase reporter assays, 31/36 (86%) tested candidates drive activity in placental cells. Our predictions agree well with recent epigenomic data in human and mouse, yet over half our loci, including 7/8 (87%) tested regions, are novel. Finally, we establish that our method is generalizable by applying it to 5 additional tissues: heart, pancreas, blood vessel, bone marrow, and liver.
Subject(s)
Enhancer Elements, Genetic , Transcription Factors/metabolism , Algorithms , Amino Acid Motifs , Animals , Automation , Binding Sites , Cluster Analysis , Computational Biology , Computer Simulation , Epigenomics , Female , Gene Expression Profiling , Gene Expression Regulation , Humans , Mice , Placenta/physiology , Pregnancy , Trophoblasts/cytology
ABSTRACT
Many important model organisms for biomedical and evolutionary research have sequenced genomes, but occupy a phylogenetically isolated position, evolutionarily distant from other sequenced genomes. This phylogenetic isolation is exemplified for zebrafish, a vertebrate model for cis-regulation, development and human disease, whose evolutionary distance to all other currently sequenced fish exceeds the distance between human and chicken. Such large distances make it difficult to align genomes and use them for comparative analysis beyond gene-focused questions. In particular, detecting conserved non-genic elements (CNEs) as promising cis-regulatory elements with biological importance is challenging. Here, we develop a general comparative genomics framework to align isolated genomes and to comprehensively detect CNEs. Our approach integrates highly sensitive and quality-controlled local alignments and uses alignment transitivity and ancestral reconstruction to bridge large evolutionary distances. We apply our framework to zebrafish and demonstrate substantially improved CNE detection and quality compared with previous sets. Our zebrafish CNE set comprises 54 533 CNEs, of which 11 792 (22%) are conserved to human or mouse. Our zebrafish CNEs (http://zebrafish.stanford.edu) are highly enriched in known enhancers and extend existing experimental (ChIP-Seq) sets. The same framework can now be applied to the isolated genomes of frog, amphioxus, Caenorhabditis elegans and many others.
Subject(s)
Computational Biology/methods , Conserved Sequence , Phylogeny , Sequence Analysis, DNA/methods , Zebrafish/genetics , Animals , Base Sequence , Evolution, Molecular , Genomics/methods , Internet , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid , Sensitivity and Specificity , Sequence Alignment , Synteny , Zebrafish/classification
ABSTRACT
The identification of homologies, whether morphological, molecular, or genetic, is fundamental to our understanding of common biological principles. Homologies bridging the great divide between deuterostomes and protostomes have served as the basis for current models of animal evolution and development. It is now appreciated that these two clades share a common developmental toolkit consisting of conserved transcription factors and signaling pathways. These patterning genes sometimes show common expression patterns and genetic interactions, suggesting the existence of similar or even conserved regulatory apparatus. However, previous studies have found no regulatory sequence conserved between deuterostomes and protostomes. Here we describe the first such enhancers, which we call bilaterian conserved regulatory elements (Bicores). Bicores show conservation of sequence and gene synteny. Sequence conservation of Bicores reflects conserved patterns of transcription factor binding sites. We predict that Bicores act as response elements to signaling pathways, and we show that Bicores are developmental enhancers that drive expression of transcriptional repressors in the vertebrate central nervous system. Although the small number of identified Bicores suggests extensive rewiring of cis-regulation between the protostome and deuterostome clades, additional Bicores may be revealed as our understanding of cis-regulatory logic and sample of bilaterian genomes continue to grow.
Subject(s)
Enhancer Elements, Genetic , Genome , Invertebrates/genetics , Transcription Factors/genetics , Vertebrates/genetics , Amino Acid Sequence , Animals , Binding Sites , Biological Evolution , Central Nervous System/embryology , Central Nervous System/metabolism , Conserved Sequence , Gene Expression Regulation, Developmental , Humans , Invertebrates/embryology , Invertebrates/metabolism , Molecular Sequence Data , Protein Binding , Sequence Alignment , Signal Transduction , Synteny , Transcription Factors/metabolism , Vertebrates/embryology , Vertebrates/metabolism
ABSTRACT
The Genome in a Bottle Consortium (GIAB), hosted by the National Institute of Standards and Technology (NIST), is developing new matched tumor-normal samples, the first to be explicitly consented for public dissemination of genomic data and cell lines. Here, we describe a comprehensive genomic dataset from the first individual, HG008, including DNA from an adherent, epithelial-like pancreatic ductal adenocarcinoma (PDAC) tumor cell line and matched normal cells from duodenal and pancreatic tissues. Data for the tumor-normal matched samples comes from thirteen distinct state-of-the-art whole genome measurement technologies, including high depth short and long-read bulk whole genome sequencing (WGS), single cell WGS, and Hi-C, and karyotyping. These data will be used by the GIAB Consortium to develop matched tumor-normal benchmarks for somatic variant detection. We expect these data to facilitate innovation for whole genome measurement technologies, de novo assembly of tumor and normal genomes, and bioinformatic tools to identify small and structural somatic mutations. This first-of-its-kind broadly consented open-access resource will facilitate further understanding of sequencing methods used for cancer biology.
ABSTRACT
BACKGROUND: Long-read sequencing (LRS) techniques have been very successful in identifying structural variants (SVs). However, the high error rate of LRS made the detection of small variants (substitutions and short indels < 20 bp) more challenging. The introduction of PacBio HiFi sequencing makes LRS also suited for detecting small variation. Here we evaluate the ability of HiFi reads to detect de novo mutations (DNMs) of all types, which are technically challenging variant types and a major cause of sporadic, severe, early-onset disease. METHODS: We sequenced the genomes of eight parent-child trios using high coverage PacBio HiFi LRS (~30-fold coverage) and Illumina short-read sequencing (SRS) (~50-fold coverage). De novo substitutions, small indels, short tandem repeats (STRs) and SVs were called in both datasets and compared to each other to assess the accuracy of HiFi LRS. In addition, we determined the parent-of-origin of the small DNMs using phasing. RESULTS: We identified a total of 672 and 859 de novo substitutions/indels, 28 and 126 de novo STRs, and 24 and 1 de novo SVs in LRS and SRS respectively. For the small variants, there was a 92 and 85% concordance between the platforms. For the STRs and SVs, the concordance was 3.6 and 0.8%, and 4 and 100% respectively. We successfully validated 27/54 LRS-unique small variants, of which 11 (41%) were confirmed as true de novo events. For the SRS-unique small variants, we validated 42/133 DNMs, of which 8 (19%) were confirmed as true de novo events. Validation of 18 LRS-unique de novo STR calls confirmed none of the repeat expansions as true DNMs. Confirmation of the 23 LRS-unique SVs was possible for 19 candidate SVs, of which 10 (52.6%) were true de novo events. Furthermore, we were able to assign 96% of DNMs to their parental allele with LRS data, as opposed to just 20% with SRS data.
CONCLUSIONS: HiFi LRS can now produce the most comprehensive variant dataset obtainable by a single technology in a single laboratory, allowing accurate calling of substitutions, indels, STRs and SVs. The accuracy even allows sensitive calling of DNMs on all variant levels, and also allows for phasing, which helps to distinguish true positive from false positive DNMs.
Subject(s)
High-Throughput Nucleotide Sequencing , INDEL Mutation , Humans , Alleles , Microsatellite Repeats
ABSTRACT
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained for the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
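
The Q20, Q30, Q40, Q43 and Q45 figures above are quality values on the Phred scale, where Q = -10 * log10(p) and p is the estimated error probability. The sketch below is not taken from the paper; the function names are hypothetical and it only illustrates the standard conversion, so that the thresholds can be read as error rates.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred-scaled quality."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Thresholds mentioned in the abstract (hypothetical helper names above).
    for q in (20, 30, 40, 43, 45):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:.6f}, accuracy {1 - p:.4%}")
```

Under this convention, each step of 10 in Q corresponds to a tenfold lower error probability, so Q20 is about 1 error in 100 bases and Q40 about 1 in 10,000.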