iSeg: an efficient algorithm for segmentation of genomic and epigenomic data.

Girimurugan, Senthil B; Liu, Yuhang; Lung, Pei-Yau; Vera, Daniel L; Dennis, Jonathan H; Bass, Hank W; Zhang, Jinfeng

Affiliation

Girimurugan SB; Department of Mathematics, Florida Gulf Coast University, Fort Myers, FL, USA.
Liu Y; Department of Statistics, Florida State University, Tallahassee, FL, USA.
Lung PY; Department of Statistics, Florida State University, Tallahassee, FL, USA.
Vera DL; Center for Genomics and Personalized Medicine, Florida State University, Tallahassee, FL, USA.
Dennis JH; Department of Biological Science, Florida State University, Tallahassee, FL, USA.
Bass HW; Department of Biological Science, Florida State University, Tallahassee, FL, USA.
Zhang J; Department of Statistics, Florida State University, Tallahassee, FL, USA. jinfeng@stat.fsu.edu.

BMC Bioinformatics; 19(1): 131, 2018-04-11.

Article in English | MEDLINE | ID: mdl-29642840

ABSTRACT

BACKGROUND: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems.
RESULTS: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences.
CONCLUSIONS: We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.
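
The idea of scoring candidate segments by p-value can be illustrated with a minimal sketch. This is not the iSeg implementation: it scans fixed-width windows using prefix sums, scores each window mean against the global mean, and substitutes a normal approximation for the t-test mentioned in the abstract. The function name `most_significant_window`, the window widths, and the toy profile are all hypothetical and chosen only for illustration; the coupled-tree overlap handling, refinement, and merging stages are not shown.

```python
import math
from statistics import NormalDist


def most_significant_window(values, widths):
    """Return (p_value, start, end) of the window whose mean deviates most
    significantly from the global mean (illustrative sketch, not iSeg)."""
    n = len(values)
    mu = sum(values) / n
    # Sample standard deviation of the whole profile (assumed background).
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))

    # Prefix sums let each window mean be computed in O(1).
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)

    std_normal = NormalDist()
    best = None
    for w in widths:
        for start in range(0, n - w + 1):
            seg_mean = (prefix[start + w] - prefix[start]) / w
            # Assumption: normal approximation in place of a t-test.
            z = (seg_mean - mu) / (sigma / math.sqrt(w))
            p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
            if best is None or p < best[0]:
                best = (p, start, start + w)
    return best


# Toy profile: a flat, slightly noisy baseline with one elevated region.
profile = [0.1 if i % 2 == 0 else -0.1 for i in range(50)]
for i in range(20, 30):
    profile[i] = 3.0

p_value, start, end = most_significant_window(profile, widths=[5, 10, 20])
print(start, end, p_value)
```

In this toy profile the elevated region spans indices 20 to 29, so the width-10 window starting at 20 should receive the smallest p-value. A fuller treatment would also need to resolve overlaps among significant windows and refine their boundaries, which the abstract describes only at a high level.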

Subjects

Algorithms; Databases, Genetic; Epigenomics; Computer Simulation; DNA Copy Number Variations/genetics; Deoxyribonucleases/metabolism; Genome; Humans; Models, Statistical; Neoplasms/genetics; Zea mays/genetics

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms / Databases, Genetic / Epigenomics Type of study: Prognostic_studies / Risk_factors_studies Limits: Humans Language: En Year of publication: 2018 Document type: Article
