A machine learning based framework to identify and classify long terminal repeat retrotransposons.

Schietgat, Leander; Vens, Celine; Cerri, Ricardo; Fischer, Carlos N; Costa, Eduardo; Ramon, Jan; Carareto, Claudia M A; Blockeel, Hendrik

Affiliation

Schietgat L; Department of Computer Science, KU Leuven, Leuven, Belgium.
Vens C; Department of Computer Science, KU Leuven, Leuven, Belgium.
Cerri R; Department of Public Health and Primary Care, KU Leuven Kulak, Kortrijk, Belgium.
Fischer CN; Department of Respiratory Medicine, Ghent University, and VIB Inflammation Research Center, Ghent, Belgium.
Costa E; Department of Computer Science, UFSCar Federal University of São Carlos, São Carlos, São Paulo, Brazil.
Ramon J; Department of Statistics, Applied Mathematics, and Computer Science, UNESP São Paulo State University, Rio Claro, São Paulo, Brazil.
Carareto CMA; Department of Computer Science, KU Leuven, Leuven, Belgium.
Blockeel H; Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo, Brazil.

PLoS Comput Biol ; 14(4): e1006097, 2018 04.

Article in English | MEDLINE | ID: mdl-29684010

ABSTRACT

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques; it outperforms them in terms of predictive performance while learning models and making predictions efficiently. Moreover, we show that our method is able to identify TEs that none of the above methods could find, and we investigate those TE-Learner predictions that do not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.

Subject(s)

Machine Learning; Retroelements; Terminal Repeat Sequences; Animals; Arabidopsis/genetics; Arabidopsis Proteins/genetics; Computational Biology; Conserved Sequence; DNA, Plant/genetics; Decision Trees; Drosophila Proteins/genetics; Drosophila melanogaster/genetics; Evolution, Molecular; Genome, Insect; Genome, Plant; Software

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Retroelements / Terminal Repeat Sequences / Machine Learning | Type of study: Prognostic_studies | Limit: Animals | Language: En | Journal: PLoS Comput Biol | Journal subject: BIOLOGY / MEDICAL INFORMATICS | Year: 2018 | Document type: Article | Country of affiliation: Belgium