Search | VHL Regional Portal

1.

Cost-Effective Cas9-Mediated Targeted Sequencing of Spinocerebellar Ataxia Repeat Expansions.

Tachikawa, Keiji; Shimizu, Takahiro; Imai, Takeshi; Ko, Riyoko; Kawai, Yosuke; Omae, Yosuke; Tokunaga, Katsushi; Frith, Martin C; Yamano, Yoshihisa; Mitsuhashi, Satomi.

J Mol Diagn ; 26(2): 85-95, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38008286

ABSTRACT

Hereditary repeat diseases are caused by an abnormal expansion of short tandem repeats in the genome. Among them, spinocerebellar ataxia (SCA) is a heterogeneous disease, and currently, 16 responsible repeats are known. Genetic diagnosis is obtained by analyzing the number of repeats through separate testing of each repeat. Although simultaneous detection of candidate repeats using current massively parallel sequencing technologies has been developed to avoid complicated multiple experiments, these methods are generally expensive. This study developed a cost-effective SCA repeat panel [Flongle SCA repeat panel sequencing (FLO-SCAp)] using Cas9-mediated targeted long-read sequencing and the smallest long-read sequencing apparatus, Flongle. This panel enabled the detection of repeat copy number changes, internal repeat sequences, and DNA methylation in seven patients with different repeat expansion diseases. The median (interquartile range) values of coverage and on-target rate were 39.5 (12 to 72) and 11.6% (7.5% to 16.5%), respectively. This approach was validated by comparing repeat copy number changes measured by FLO-SCAp and short-read whole-genome sequencing. A high correlation was observed between FLO-SCAp and short-read whole-genome sequencing when the repeat length was ≤250 bp (r = 0.98; P < 0.001). Thus, FLO-SCAp represents the most cost-effective method for conducting multiplex testing of repeats and can serve as the first-line diagnostic tool for SCA.

Subject(s)

CRISPR-Cas Systems , Spinocerebellar Ataxias , Humans , Cost-Benefit Analysis , Spinocerebellar Ataxias/diagnosis , Spinocerebellar Ataxias/genetics , Microsatellite Repeats/genetics , Whole Genome Sequencing , High-Throughput Nucleotide Sequencing

2.

DNA Conserved in Diverse Animals Since the Precambrian Controls Genes for Embryonic Development.

Frith, Martin C; Ni, Shengliang.

Mol Biol Evol ; 40(12)2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38085182

ABSTRACT

DNA that controls gene expression (e.g. enhancers, promoters) has seemed almost never to be conserved between distantly related animals, like vertebrates and arthropods. This is mysterious, because development of such animals is partly organized by homologous genes with similar complex expression patterns, termed "deep homology." Here, we report 25 regulatory DNA segments conserved across bilaterian animals, of which 7 are also conserved in cnidaria (coral and sea anemone). They control developmental genes (e.g. Nr2f, Ptch, Rfx1/3, Sall, Smad6, Sp5, Tbx2/3), including six homeobox genes: Gsx, Hmx, Meis, Msx, Six1/2, and Zfhx3/4. The segments contain perfectly or near-perfectly conserved CCAAT boxes, E-boxes, and other sequences recognized by regulatory proteins. More such DNA conservation will surely be found soon, as more genomes are published and sequence comparison is optimized. This reveals a control system for animal development conserved since the Precambrian.

Subject(s)

Anthozoa , Genes, Homeobox , Animals , DNA , Transcription Factors/genetics , Anthozoa/genetics , Embryonic Development/genetics , Conserved Sequence/genetics

3.

Biallelic structural variations within FGF12 detected by long-read sequencing in epilepsy.

Ohori, Sachiko; Miyauchi, Akihiko; Osaka, Hitoshi; Lourenco, Charles Marques; Arakaki, Naohiro; Sengoku, Toru; Ogata, Kazuhiro; Honjo, Rachel Sayuri; Kim, Chong Ae; Mitsuhashi, Satomi; Frith, Martin C; Seyama, Rie; Tsuchida, Naomi; Uchiyama, Yuri; Koshimizu, Eriko; Hamanaka, Kohei; Misawa, Kazuharu; Miyatake, Satoko; Mizuguchi, Takeshi; Saito, Kuniaki; Fujita, Atsushi; Matsumoto, Naomichi.

Life Sci Alliance ; 6(8)2023 08.

Article in English | MEDLINE | ID: mdl-37286232

ABSTRACT

We discovered biallelic intragenic structural variations (SVs) in FGF12 by applying long-read whole genome sequencing to an exome-negative patient with developmental and epileptic encephalopathy (DEE). We also found another DEE patient carrying a biallelic (homozygous) single-nucleotide variant (SNV) in FGF12 that was detected by exome sequencing. FGF12 heterozygous recurrent missense variants with gain-of-function or heterozygous entire duplication of FGF12 are known causes of epilepsy, but biallelic SNVs/SVs have never been described. FGF12 encodes intracellular proteins interacting with the C-terminal domain of the alpha subunit of voltage-gated sodium channels 1.2, 1.5, and 1.6, promoting excitability by delaying fast inactivation of the channels. To validate the molecular pathomechanisms of these biallelic FGF12 SVs/SNV, highly sensitive gene expression analyses using lymphoblastoid cells from the patient with biallelic SVs, structural considerations, and Drosophila in vivo functional analysis of the SNV were performed, confirming loss-of-function. Our study highlights the importance of small SVs in Mendelian disorders, which may be overlooked by exome sequencing but can be detected efficiently by long-read whole genome sequencing, providing new insights into the pathomechanisms of human diseases.

Subject(s)

Epilepsy , Mutation, Missense , Humans , Epilepsy/genetics , Fibroblast Growth Factors

4.

Analysis of Tandem Repeat Expansions Using Long DNA Reads.

Mitsuhashi, Satomi; Frith, Martin C.

Methods Mol Biol ; 2632: 147-159, 2023.

Article in English | MEDLINE | ID: mdl-36781727

ABSTRACT

Abnormal expansion or shortening of tandem repeats can cause a variety of genetic diseases. The use of long DNA reads has facilitated the analysis of disease-causing repeats in the human genome. Long read sequencers enable us to directly analyze repeat length and sequence content by covering whole repeats; they are therefore considered suitable for the analysis of long tandem repeats. Here, we describe an expanded repeat analysis using target sequencing data produced by the Oxford Nanopore Technologies (hereafter referred to as ONT) nanopore sequencer.

Subject(s)

High-Throughput Nucleotide Sequencing , Nanopores , Humans , Tandem Repeat Sequences/genetics , Sequence Analysis, DNA , DNA/genetics

5.

Finding Rearrangements in Nanopore DNA Reads with LAST and dnarrange.

Frith, Martin C; Mitsuhashi, Satomi.

Methods Mol Biol ; 2632: 161-175, 2023.

Article in English | MEDLINE | ID: mdl-36781728

ABSTRACT

Long-read DNA sequencing techniques such as nanopore are especially useful for characterizing complex sequence rearrangements, which occur in some genetic diseases and also during evolution. Analyzing the sequence data to understand such rearrangements is not trivial, due to sequencing error, rearrangement intricacy, and abundance of repeated similar sequences in genomes. The LAST and dnarrange software packages can resolve complex relationships between DNA sequences and characterize changes such as gene conversion, processed pseudogene insertion, and chromosome shattering. They can filter out numerous rearrangements shared by controls, e.g., healthy humans versus a patient, to focus on rearrangements unique to the patient. One useful ingredient is last-train, which learns the rates (probabilities) of deletions, insertions, and each kind of base match and mismatch. These probabilities are then used to find the most likely sequence relationships/alignments, which is especially useful for DNA with unusual rates, such as DNA from Plasmodium falciparum (malaria) with ~80% a+t. This is also useful for less-studied species that lack reference genomes, so the DNA reads are compared to a different species' genome. We also point out that a reference genome with ancestral alleles would be ideal.

Subject(s)

Nanopores , Humans , DNA , Sequence Analysis, DNA/methods , Genome , Gene Rearrangement , High-Throughput Nucleotide Sequencing/methods

6.

An immune-suppressing protein in human endogenous retroviruses.

Zhang, Huan; Ni, Shengliang; Frith, Martin C.

Bioinform Adv ; 3(1): vbad013, 2023.

Article in English | MEDLINE | ID: mdl-36818731

ABSTRACT

Motivation: Retroviruses are important contributors to disease and evolution in vertebrates. Sometimes, retrovirus DNA is heritably inserted in a vertebrate genome: an endogenous retrovirus (ERV). Vertebrate genomes have many such virus-derived fragments, usually with mutations disabling their original functions. Results: Some primate ERVs appear to encode an overlooked protein. This protein is homologous to protein MC132 from Molluscum contagiosum virus, which is a human poxvirus, not a retrovirus. MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now. The ERV homologs of MC132 in the human genome are mostly disrupted by mutations, but there is an intact copy on chromosome 4. We found homologs of MC132 in ERVs of apes, monkeys and bushbaby, but not tarsiers, lemurs or non-primates. This suggests that some primate retroviruses had, or have, an extra immune-suppressing protein, which underwent horizontal genetic transfer between unrelated viruses. Contact: mcfrith@edu.k.u-tokyo.ac.jp.

7.

How to optimally sample a sequence for rapid analysis.

Frith, Martin C; Shaw, Jim; Spouge, John L.

Bioinformatics ; 39(2)2023 02 03.

Article in English | MEDLINE | ID: mdl-36702468

ABSTRACT

MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , Sequence Analysis, DNA/methods

8.

Improved DNA-Versus-Protein Homology Search for Protein Fossils.

Yao, Yin; Frith, Martin C.

IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 1691-1699, 2023.

Article in English | MEDLINE | ID: mdl-35617174

ABSTRACT

Protein fossils, i.e., noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and faster. Of the ~7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally. This is an extended version of a conference paper (Yao & Frith, 2021).

9.

Paleozoic Protein Fossils Illuminate the Evolution of Vertebrate Genomes and Transposable Elements.

Frith, Martin C.

Mol Biol Evol ; 39(4)2022 04 11.

Article in English | MEDLINE | ID: mdl-35348724

ABSTRACT

Genomes hold a treasure trove of protein fossils: Fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (eight from TEs and two from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics," where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently conserved DNA segments. This paves the way to further studies of ancient protein fossils.

Subject(s)

DNA Transposable Elements , Fossils , Animals , DNA Transposable Elements/genetics , Evolution, Molecular , Humans , Regulatory Sequences, Nucleic Acid , Retroelements , Vertebrates/genetics

10.

Author Correction: Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network.

Grapotte, Mathys; Saraswat, Manu; Bessière, Chloé; Menichelli, Christophe; Ramilowski, Jordan A; Severin, Jessica; Hayashizaki, Yoshihide; Itoh, Masayoshi; Tagami, Michihira; Murata, Mitsuyoshi; Kojima-Ishiyama, Miki; Noma, Shohei; Noguchi, Shuhei; Kasukawa, Takeya; Hasegawa, Akira; Suzuki, Harukazu; Nishiyori-Sueki, Hiromi; Frith, Martin C; Chatelain, Clément; Carninci, Piero; de Hoon, Michiel J L; Wasserman, Wyeth W; Bréhélin, Laurent; Lecellier, Charles-Henri.

Nat Commun ; 13(1): 1200, 2022 Mar 01.

Article in English | MEDLINE | ID: mdl-35232988

11.

Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network.

Grapotte, Mathys; Saraswat, Manu; Bessière, Chloé; Menichelli, Christophe; Ramilowski, Jordan A; Severin, Jessica; Hayashizaki, Yoshihide; Itoh, Masayoshi; Tagami, Michihira; Murata, Mitsuyoshi; Kojima-Ishiyama, Miki; Noma, Shohei; Noguchi, Shuhei; Kasukawa, Takeya; Hasegawa, Akira; Suzuki, Harukazu; Nishiyori-Sueki, Hiromi; Frith, Martin C; Chatelain, Clément; Carninci, Piero; de Hoon, Michiel J L; Wasserman, Wyeth W; Bréhélin, Laurent; Lecellier, Charles-Henri.

Nat Commun ; 12(1): 3297, 2021 06 02.

Article in English | MEDLINE | ID: mdl-34078885

ABSTRACT

Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.

Subject(s)

Microsatellite Repeats , Neural Networks, Computer , Neurodegenerative Diseases/genetics , Transcription Initiation Site , Transcription Initiation, Genetic , A549 Cells , Animals , Base Sequence , Computational Biology/methods , Deep Learning , Enhancer Elements, Genetic , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Mice , Neurodegenerative Diseases/diagnosis , Neurodegenerative Diseases/metabolism , Polymorphism, Genetic , Promoter Regions, Genetic

12.

Nanopore direct RNA sequencing detects DUX4-activated repeats and isoforms in human muscle cells.

Mitsuhashi, Satomi; Nakagawa, So; Sasaki-Honda, Mitsuru; Sakurai, Hidetoshi; Frith, Martin C; Mitsuhashi, Hiroaki.

Hum Mol Genet ; 30(7): 552-563, 2021 05 12.

Article in English | MEDLINE | ID: mdl-33693705

ABSTRACT

Facioscapulohumeral muscular dystrophy (FSHD) is an inherited muscle disease caused by misexpression of the DUX4 gene in skeletal muscle. DUX4 is a transcription factor, which is normally expressed in the cleavage-stage embryo and regulates gene expression involved in early embryonic development. Recent studies revealed that DUX4 also activates the transcription of repetitive elements such as endogenous retroviruses (ERVs), mammalian apparent long terminal repeat (LTR)-retrotransposons and pericentromeric satellite repeats (Human Satellite II). DUX4-bound ERV sequences also create alternative promoters for genes or long non-coding RNAs, producing fusion transcripts. To further understand transcriptional regulation by DUX4, we performed nanopore long-read direct RNA sequencing (dRNA-seq) of human muscle cells induced by DUX4, because long reads show whole isoforms with greater confidence. We successfully detected differential expression of known DUX4-induced genes and discovered 61 differentially expressed repeat loci, which are near DUX4-ChIP peaks. We also identified 247 gene-ERV fusion transcripts, of which 216 were not reported previously. In addition, long-read dRNA-seq clearly shows that RNA splicing is a common event in DUX4-activated ERV transcripts. Long-read analysis showed non-LTR transposons including Alu elements are also transcribed from LTRs. Our findings revealed further complexity of DUX4-induced ERV transcripts. This catalogue of DUX4-activated repetitive elements may provide useful information to elucidate the pathology of FSHD. Also, our results indicate that nanopore dRNA-seq has complementary strengths to conventional short-read complementary DNA sequencing.

Subject(s)

Homeodomain Proteins/genetics , Muscle, Skeletal/metabolism , Muscular Dystrophy, Facioscapulohumeral/genetics , Nanopores , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, RNA/methods , Cell Line, Tumor , Gene Expression Profiling , Gene Expression Regulation , Humans , Muscle Cells/metabolism , Muscular Dystrophy, Facioscapulohumeral/pathology , Protein Isoforms/genetics , RNA Isoforms/genetics , Reverse Transcriptase Polymerase Chain Reaction , Sequence Analysis, RNA/statistics & numerical data

13.

Significant non-existence of sequences in genomes and proteomes.

Koulouras, Grigorios; Frith, Martin C.

Nucleic Acids Res ; 49(6): 3139-3155, 2021 04 06.

Article in English | MEDLINE | ID: mdl-33693858

ABSTRACT

Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.

Subject(s)

Databases, Genetic , Genomics/methods , Proteomics/methods , Animals , Genome , Humans , Markov Chains , Mutation , Peptides/chemistry , Proteome , Software , Viruses/genetics
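
As an illustration of the minimal absent words (MAWs) discussed in the abstract above, here is a minimal brute-force sketch. It assumes the usual definition of a MAW: a word that does not occur in the sequence although all of its proper substrings do. The function name `minimal_absent_words`, its parameters, and the toy inputs are hypothetical and are not taken from the paper or its database. The Markov-model significance test mentioned in the abstract is not covered here.

```python
from itertools import product


def minimal_absent_words(seq, max_len, alphabet="acgt"):
    """Return all minimal absent words of length <= max_len (brute force).

    Assumed definition: a word w is a MAW if w does not occur in seq,
    but both w[:-1] and w[1:] do occur (so every proper substring occurs).
    An absent single letter counts as a MAW of length 1.
    """
    # Collect every substring of length 1..max_len that occurs in seq.
    present = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }
    maws = []
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if word in present:
                continue
            # Checking the longest prefix and suffix is enough: if both
            # occur, then all shorter substrings of the word occur too.
            if k == 1 or (word[:-1] in present and word[1:] in present):
                maws.append(word)
    return maws


if __name__ == "__main__":
    print(minimal_absent_words("acgt", max_len=3))
```

This enumeration is exponential in `max_len`, so it is only suitable for toy inputs; genome-scale MAW detection needs index-based algorithms.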

14.

Long-read whole-genome sequencing identified a partial MBD5 deletion in an exome-negative patient with neurodevelopmental disorder.

Ohori, Sachiko; Tsuburaya, Rie S; Kinoshita, Masako; Miyagi, Etsuko; Mizuguchi, Takeshi; Mitsuhashi, Satomi; Frith, Martin C; Matsumoto, Naomichi.

J Hum Genet ; 66(7): 697-705, 2021 Jul.

Article in English | MEDLINE | ID: mdl-33510365

ABSTRACT

Whole-exome sequencing (WES) can detect not only single-nucleotide variants in causal genes, but also pathogenic copy-number variations using several methods. However, there may be overlooked pathogenic variations in the out of target genome regions of WES analysis (e.g., promoters), leaving many patients undiagnosed. Whole-genome sequencing (WGS) can potentially analyze such regions. We applied long-read nanopore WGS and our recently developed analysis pipeline "dnarrange" to a patient who was undiagnosed by trio-based WES analysis, and identified a heterozygous 97-kb deletion partially involving 5'-untranslated exons of MBD5, which was outside the WES target regions. The phenotype of the patient, a 32-year-old male, was consistent with haploinsufficiency of MBD5. The transcript level of MBD5 in the patient's lymphoblastoid cells was reduced. We therefore concluded that the partial MBD5 deletion is the culprit for this patient. Furthermore, we found other rare structural variations (SVs) in this patient, i.e., a large inversion and a retrotransposon insertion, which were not seen in 33 controls. Although we considered that they are benign SVs, this finding suggests that our pipeline using long-read WGS is useful for investigating various types of potentially pathogenic SVs. In conclusion, we identified a 97-kb deletion, which causes haploinsufficiency of MBD5 in a patient with neurodevelopmental disorder, demonstrating that long-read WGS is a powerful technique to discover pathogenic SVs.

Subject(s)

DNA-Binding Proteins/genetics , Genetic Predisposition to Disease , Neurodevelopmental Disorders/genetics , Adult , Exome/genetics , Haploinsufficiency/genetics , Humans , Male , Mutagenesis, Insertional/genetics , Neurodevelopmental Disorders/pathology , Retroelements/genetics , Whole Genome Sequencing

15.

Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population.

Mitsuhashi, Satomi; Frith, Martin C; Matsumoto, Naomichi.

BMC Med Genomics ; 14(1): 17, 2021 01 07.

Article in English | MEDLINE | ID: mdl-33413375

ABSTRACT

BACKGROUND: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. METHODS: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. RESULTS: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5'UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with near-by GWAS SNP genotypes. CONCLUSIONS: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long read sequencing data. This method especially if used in genome wide association study, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.

Subject(s)

Nanopore Sequencing , Genome-Wide Association Study , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA

16.

Minimally overlapping words for sequence similarity search.

Frith, Martin C; Noé, Laurent; Kucherov, Gregory.

Bioinformatics ; 36(22-23): 5344-5350, 2021 Apr 01.

Article in English | MEDLINE | ID: mdl-33346833

ABSTRACT

MOTIVATION: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. RESULTS: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. AVAILABILITY AND IMPLEMENTATION: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
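
The sparse-seeding idea in the abstract above can be sketched in a few lines. This is only an illustration that uses the example words from the abstract (ac, at, gc, gt); the function name `sparse_seed_positions` and the test sequence are hypothetical, and the code is not the noverlap software.

```python
# Example word set from the abstract. Every word starts with a or g and
# ends with c or t, so two word occurrences can never overlap.
WORDS = frozenset({"ac", "at", "gc", "gt"})


def sparse_seed_positions(sequence, words=WORDS):
    """Return the start positions in sequence where one of the words occurs.

    Assumes all words have the same length. Seeds would then be
    considered only at these positions, not at every position.
    """
    k = len(next(iter(words)))
    s = sequence.lower()
    return [i for i in range(len(s) - k + 1) if s[i:i + k] in words]


if __name__ == "__main__":
    positions = sparse_seed_positions("gattacagtc")
    print(positions)
```

With this word set, no two selected positions are adjacent, which is one simple way to see the "anti-clumped" property that the abstract describes.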

17.

lamassemble: Multiple Alignment and Consensus Sequence of Long Reads.

Frith, Martin C; Mitsuhashi, Satomi; Katoh, Kazutaka.

Methods Mol Biol ; 2231: 135-145, 2021.

Article in English | MEDLINE | ID: mdl-33289891

ABSTRACT

Long DNA and RNA reads from nanopore and PacBio technologies have many applications, but the raw reads have a substantial error rate. More accurate sequences can be obtained by merging multiple reads from overlapping parts of the same sequence. lamassemble aligns up to ~1000 reads to each other, and makes a consensus sequence, which is often much more accurate than the raw reads. It is useful for studying a region of interest such as an expanded tandem repeat or other disease-causing mutation.

Subject(s)

Consensus Sequence , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Animals , Genetic Techniques , High-Throughput Nucleotide Sequencing , Humans , Nanopores

18.

A pipeline for complete characterization of complex germline rearrangements from long DNA reads.

Mitsuhashi, Satomi; Ohori, Sachiko; Katoh, Kazutaka; Frith, Martin C; Matsumoto, Naomichi.

Genome Med ; 12(1): 67, 2020 07 31.

Article in English | MEDLINE | ID: mdl-32731881

ABSTRACT

BACKGROUND: Many genetic/genomic disorders are caused by genomic rearrangements. Standard methods can often characterize these variations only partly, e.g., copy number changes or breakpoints. It is important to fully understand the order and orientation of rearranged fragments, with precise breakpoints, to know the pathogenicity of the rearrangements. METHODS: We performed whole-genome-coverage nanopore sequencing of long DNA reads from four patients with chromosomal translocations. We identified rearrangements relative to a reference human genome, subtracted rearrangements shared by any of 33 control individuals, and determined the order and orientation of rearranged fragments, with our newly developed analysis pipeline. RESULTS: We describe the full characterization of complex chromosomal rearrangements, by filtering out genomic rearrangements seen in controls without the same disease, reducing the number of loci per patient from a few thousand to a few dozen. Breakpoint detection was very accurate; we usually see ~0 ± 1 base difference from Sanger sequencing-confirmed breakpoints. For one patient with two reciprocal chromosomal translocations, we find that the translocation points have complex rearrangements of multiple DNA fragments involving 5 chromosomes, which we could order and orient by an automatic algorithm, thereby fully reconstructing the rearrangement. A rearrangement is more than the sum of its parts: some properties, such as sequence loss, can be inferred only after reconstructing the whole rearrangement. In this patient, the rearrangements were evidently caused by shattering of the chromosomes into multiple fragments, which rejoined in a different order and orientation with loss of some fragments. CONCLUSIONS: We developed an effective analytic pipeline to find chromosomal aberration in congenital diseases by filtering benign changes, only from long read sequencing. Our algorithm for reconstruction of complex rearrangements is useful to interpret rearrangements with many breakpoints, e.g., chromothripsis. Our approach promises to fully characterize many congenital germline rearrangements, provided they do not involve poorly understood loci such as centromeric repeats.

Subject(s)

Gene Rearrangement , Genome-Wide Association Study , Germ-Line Mutation , Chromosome Aberrations , Chromosome Breakpoints , Genetic Association Studies/methods , Genetic Predisposition to Disease , Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans , Translocation, Genetic , Whole Genome Sequencing

19.

Long-read DNA sequencing fully characterized chromothripsis in a patient with Langer-Giedion syndrome and Cornelia de Lange syndrome-4.

Lei, Ming; Liang, Desheng; Yang, Yifeng; Mitsuhashi, Satomi; Katoh, Kazutaka; Miyake, Noriko; Frith, Martin C; Wu, Lingqian; Matsumoto, Naomichi.

J Hum Genet ; 65(8): 667-674, 2020 Aug.

Article in English | MEDLINE | ID: mdl-32296131

ABSTRACT

Chromothripsis is a type of chaotic complex genomic rearrangement caused by a single event of chromosomal shattering and repair processes. Chromothripsis is known to cause rare congenital diseases when it occurs in germline cells, however, current genome analysis technologies have difficulty in detecting and deciphering chromothripsis. It is possible that this type of complex rearrangement may be overlooked in rare-disease patients whose genetic diagnosis is unsolved. We applied long read nanopore sequencing and our recently developed analysis pipeline dnarrange to a patient who has a reciprocal chromosomal translocation t(8;18)(q22;q21) as a result of chromothripsis between the two chromosomes, and fully characterize the complex rearrangements at the translocation site. The patient genome was evidently shattered into 19 fragments, and rejoined into derivative chromosomes in a random order and orientation. The reconstructed patient genome indicates loss of five genomic regions, which all overlap with microarray-detected copy number losses. We found that two disease-related genes RAD21 and EXT1 were lost by chromothripsis. These two genes could fully explain the disease phenotype with facial dysmorphisms and bone abnormality, which is likely a contiguous gene syndrome, Cornelia de Lange syndrome type IV (CdLs-4) and atypical Langer-Giedion syndrome (LGS), also known as trichorhinophalangeal syndrome type II (TRPSII). This provides evidence that our approach based on long read sequencing can fully characterize chromothripsis in a patient's genome, which is important for understanding the phenotype of disease caused by complex genomic rearrangement.

Subject(s)

Cell Cycle Proteins/genetics , Chromothripsis , DNA-Binding Proteins/genetics , De Lange Syndrome/genetics , Langer-Giedion Syndrome/genetics , N-Acetylglucosaminyltransferases/genetics , Child , Chromosome Deletion , De Lange Syndrome/diagnosis , De Lange Syndrome/physiopathology , Genome , Humans , Langer-Giedion Syndrome/diagnosis , Langer-Giedion Syndrome/physiopathology , Male , Nanopore Sequencing , Phenotype , Sequence Analysis, DNA , Translocation, Genetic

20.

Long-read sequencing identifies the pathogenic nucleotide repeat expansion in RFC1 in a Japanese case of CANVAS.

Nakamura, Haruko; Doi, Hiroshi; Mitsuhashi, Satomi; Miyatake, Satoko; Katoh, Kazutaka; Frith, Martin C; Asano, Tetsuya; Kudo, Yosuke; Ikeda, Takuya; Kubota, Shun; Kunii, Misako; Kitazawa, Yu; Tada, Mikiko; Okamoto, Mitsuo; Joki, Hideto; Takeuchi, Hideyuki; Matsumoto, Naomichi; Tanaka, Fumiaki.

J Hum Genet ; 65(5): 475-480, 2020 May.

Article in English | MEDLINE | ID: mdl-32066831

ABSTRACT

Recently, a recessively inherited intronic repeat expansion in replication factor C1 (RFC1) was identified in cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS). Here, we describe a Japanese case of genetically confirmed CANVAS with autonomic failure and auditory hallucination. The case showed impaired uptake of iodine-123-metaiodobenzylguanidine and 123I-ioflupane in the cardiac sympathetic nerve and dopaminergic neurons, respectively, by single-photon emission computed tomography. Long-read sequencing identified biallelic pathogenic (AAGGG)n nucleotide repeat expansion in RFC1 and heterozygous benign (TAAAA)n and (TAGAA)n expansions in brain expressed, associated with NEDD4 (BEAN1). Enrichment of the repeat regions in RFC1 and BEAN1 using a Cas9-mediated system clearly distinguished between pathogenic and benign repeat expansions. The haplotype around RFC1 indicated that the (AAGGG)n expansion in our case was on the same ancestral allele as that of European cases. Thus, long-read sequencing facilitates precise genetic diagnosis of diseases with complex repeat structures and various expansions.

Subject(s)

Bilateral Vestibulopathy/genetics , Cerebellar Ataxia/genetics , DNA Repeat Expansion , Replication Protein C/genetics , Sequence Analysis, DNA , Aged, 80 and over , Asian People , Bilateral Vestibulopathy/diagnosis , Cerebellar Ataxia/diagnosis , Female , Humans , Japan , Nedd4 Ubiquitin Protein Ligases/genetics
