K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes.
Contreras-Moreira, Bruno; Filippi, Carla V; Naamati, Guy; García Girón, Carlos; Allen, James E; Flicek, Paul.
Affiliation
  • Contreras-Moreira B; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
  • Filippi CV; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
  • Naamati G; Instituto de Biotecnología, Centro de Investigaciones en Ciencias Veterinarias y Agronómicas (CICVyA), Instituto Nacional de Tecnología Agropecuaria (INTA); Instituto de Agrobiotecnología y Biología Molecular (IABIMO), INTA-Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) Nicolas
  • García Girón C; CONICET, Av Rivadavia 1917, C1033AAJ Ciudad de Buenos Aires, Argentina.
  • Allen JE; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
  • Flicek P; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Plant Genome; 14(3): e20143, 2021 Nov.
Article in En | MEDLINE | ID: mdl-34562304
ABSTRACT
The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
Subjects

Full text: 1 Database: MEDLINE Main subject: Repetitive Sequences, Nucleic Acid / Genome, Plant Language: En Publication year: 2021 Document type: Article