Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
Tørresen, Ole K; Star, Bastiaan; Mier, Pablo; Andrade-Navarro, Miguel A; Bateman, Alex; Jarnot, Patryk; Gruca, Aleksandra; Grynberg, Marcin; Kajava, Andrey V; Promponas, Vasilis J; Anisimova, Maria; Jakobsen, Kjetill S; Linke, Dirk.
Affiliation
  • Tørresen OK; Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway.
  • Star B; Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway.
  • Mier P; Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany.
  • Andrade-Navarro MA; Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany.
  • Bateman A; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK.
  • Jarnot P; Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland.
  • Gruca A; Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland.
  • Grynberg M; Institute of Biochemistry and Biophysics PAS, Pawinskiego 5A, 02-106 Warsaw, Poland.
  • Kajava AV; Centre de Recherche en Biologie cellulaire de Montpellier, UMR 5237 CNRS, Université Montpellier, 1919 Route de Mende, CEDEX 5, 34293 Montpellier, France; Institut de Biologie Computationnelle, 34095 Montpellier, France.
  • Promponas VJ; Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, PO Box 20537, CY 1678 Nicosia, Cyprus.
  • Anisimova M; Institute of Applied Simulations, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
  • Jakobsen KS; Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway.
  • Linke D; Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway.
Nucleic Acids Res; 47(21): 10994-11006, 2019 Dec 02.
Article in English | MEDLINE | ID: mdl-31584084
ABSTRACT
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: DNA / Tandem Repeat Sequences / Databases, Nucleic Acid / Databases, Protein / Scientific Experimental Error | Type of study: Risk_factors_studies | Limits: Animals | Language: En | Journal: Nucleic Acids Res | Year: 2019 | Document type: Article | Affiliation country: Norway
