1.
BMC Bioinformatics ; 22(1): 323, 2021 Jun 14.
Article in English | MEDLINE | ID: mdl-34126932

ABSTRACT

BACKGROUND: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as the Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. RESULTS: Our comparisons on seven reference datasets of histone modifications (H3K36me3 and H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS ( https://github.com/aLiehrmann/CROCS ), detect the peaks more accurately than algorithms that rely on natural assumptions. CONCLUSION: The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.


Subjects
Chromatin Immunoprecipitation Sequencing , High-Throughput Nucleotide Sequencing , Algorithms , Chromatin Immunoprecipitation , DNA Sequence Analysis
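
The penalized multiple changepoint formulation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the plain Poisson loss (the "natural assumption" the authors start from) and a fixed, hand-chosen penalty; it is a textbook quadratic-time optimal partitioning recursion, not the CROCS implementation, and the function names are hypothetical.

```python
import math


def poisson_cost(total, length):
    """Poisson negative log-likelihood of one segment at its best-fit mean.

    Terms that do not depend on the segmentation are dropped.
    """
    if total == 0:
        return 0.0
    return total - total * math.log(total / length)


def optimal_partitioning(counts, penalty):
    """Return changepoint positions minimizing total cost + penalty per change.

    A changepoint at position k means a new segment starts at counts[k].
    Hypothetical helper for illustration only (O(n^2) time).
    """
    n = len(counts)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    best = [0.0] * (n + 1)   # best[e]: optimal penalized cost of counts[:e]
    best[0] = -penalty       # so that the first segment carries no penalty
    last = [0] * (n + 1)     # last[e]: start of the final segment of counts[:e]

    for end in range(1, n + 1):
        best_val, best_start = math.inf, 0
        for start in range(end):
            seg = poisson_cost(prefix[end] - prefix[start], end - start)
            val = best[start] + penalty + seg
            if val < best_val:
                best_val, best_start = val, start
        best[end], last[end] = best_val, best_start

    # Walk back through the stored segment starts to recover the changepoints.
    changes = []
    end = n
    while end > 0:
        start = last[end]
        if start > 0:
            changes.append(start)
        end = start
    return sorted(changes)
```

A larger penalty yields fewer changepoints. In the supervised setting described in the abstract, the penalty would be learned from labeled data rather than fixed by hand.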
2.
Am J Hum Genet ; 103(4): 474-483, 2018 Oct 04.
Article in English | MEDLINE | ID: mdl-30220433

ABSTRACT

Advances in high-throughput DNA sequencing have revolutionized the discovery of variants in the human genome; however, interpreting the phenotypic effects of those variants is still a challenge. While several computational approaches to predict variant impact are available, their accuracy is limited and further improvement is needed. Here, we introduce ClinPred, an efficient tool for identifying disease-relevant nonsynonymous variants. Our predictor incorporates two machine learning algorithms that use existing pathogenicity scores and, notably, benefits from inclusion of normal population allele frequency from the gnomAD database as an input feature. Another major strength of our approach is the use of ClinVar, a rapidly growing database that allows selection of confidently annotated disease-causing variants, as a training set. Compared to other methods, ClinPred showed superior accuracy for predicting pathogenicity, achieving the highest area under the curve (AUC) score and increasing both the specificity and sensitivity in different test datasets. It also obtained the best performance according to various other metrics. Moreover, ClinPred performance remained robust with respect to disease type (cancer or rare disease) and mechanism (gain or loss of function). Importantly, we observed that adding allele frequency as a predictive feature, as opposed to setting fixed allele frequency cutoffs, boosts the performance of prediction. We provide pre-computed ClinPred scores for all possible human missense variants in the exome to facilitate its use by the community.


Subjects
Computational Biology/methods , Disease/genetics , Single Nucleotide Polymorphism/genetics , Algorithms , Area Under Curve , Exome/genetics , Gene Frequency/genetics , Human Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Machine Learning , Software
3.
Biostatistics ; 21(4): 709-726, 2020 Oct 01.
Article in English | MEDLINE | ID: mdl-30753436

ABSTRACT

Calcium imaging data promises to transform the field of neuroscience by making it possible to record from large populations of neurons simultaneously. However, determining the exact moment in time at which a neuron spikes, from a calcium imaging data set, amounts to a non-trivial deconvolution problem which is of critical importance for downstream analyses. While a number of formulations have been proposed for this task in the recent literature, in this article, we focus on a formulation recently proposed in Jewell and Witten (2018. Exact spike train inference via $\ell_0$ optimization. The Annals of Applied Statistics 12(4), 2457-2482) that can accurately estimate not just the spike rate, but also the specific times at which the neuron spikes. We develop a much faster algorithm that can be used to deconvolve a fluorescence trace of 100 000 timesteps in less than a second. Furthermore, we present a modification to this algorithm that precludes the possibility of a "negative spike". We demonstrate the performance of this algorithm for spike deconvolution on calcium imaging datasets that were recently released as part of the spikefinder challenge (http://spikefinder.codeneuro.org/). The algorithm presented in this article was used in the Allen Institute for Brain Science's "platform paper" to decode neural activity from the Allen Brain Observatory; this is the main scientific paper in which their data resource is presented. Our C++ implementation, along with R and python wrappers, is publicly available. R code is available on CRAN and Github, and python wrappers are available on Github; see https://github.com/jewellsean/FastLZeroSpikeInference.


Subjects
Calcium , Neurons , Algorithms , Brain/diagnostic imaging , Diagnostic Imaging , Humans
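
To make the $\ell_0$ deconvolution problem in the abstract above concrete, here is a hedged sketch. It assumes the formulation of the cited Jewell and Witten paper, in which calcium decays as c_t = gamma * c_(t-1) between spikes and each spike costs a penalty lam, with squared-error loss. The decay rate `gamma`, the penalty `lam`, and the function name are assumptions for illustration. This is a simple quadratic-time dynamic program, not the much faster algorithm the article develops, and it does not include the "negative spike" modification.

```python
def l0_spike_times(y, gamma, lam):
    """Estimate spike times by exact L0-penalized segmentation (O(n^2) sketch).

    Within a segment starting at s, calcium is modeled as c_s * gamma**(t - s).
    A spike is reported at every segment start except the first.
    """
    n = len(y)
    best = [0.0] * (n + 1)   # best[e]: optimal penalized cost of y[:e]
    best[0] = -lam           # the first segment is not counted as a spike
    prev = [0] * (n + 1)     # prev[e]: start of the last segment of y[:e]

    for end in range(1, n + 1):
        # Running sums for the segment y[start:end], updated as start decreases:
        #   a = sum of y[t] * gamma**(t - start)
        #   b = sum of gamma**(2 * (t - start))
        #   q = sum of y[t]**2
        a = b = q = 0.0
        best_val, best_start = float("inf"), 0
        for start in range(end - 1, -1, -1):
            a = y[start] + gamma * a
            b = 1.0 + gamma * gamma * b
            q += y[start] * y[start]
            cost = 0.5 * (q - a * a / b)   # residual at the best-fit c_start
            val = best[start] + lam + cost
            if val < best_val:
                best_val, best_start = val, start
        best[end], prev[end] = best_val, best_start

    # Backtrack through segment starts to list the spike times.
    spikes = []
    end = n
    while end > 0:
        start = prev[end]
        if start > 0:
            spikes.append(start)
        end = start
    return spikes[::-1]
```

The inner loop reuses running sums so each candidate segment is scored in constant time; the speed-up reported in the article comes from avoiding this exhaustive search over segment starts.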
4.
Bioinformatics ; 33(4): 491-499, 2017 Feb 15.
Article in English | MEDLINE | ID: mdl-27797775

ABSTRACT

Motivation: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given dataset. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome. Results: We created 7 new histone mark datasets with 12 826 visually determined labels, and analyzed 3 existing transcription factor datasets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms. Availability and Implementation: Labeled histone mark data: http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/ ; R package to compute the label error of predicted peaks: https://github.com/tdhock/PeakError. Contacts: toby.hocking@mail.mcgill.ca or guil.bourque@mcgill.ca. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Chromatin Immunoprecipitation/methods , DNA Sequence Analysis/methods , Software , Supervised Machine Learning , Animals , Genomics/methods , Humans
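
The idea of scoring predicted peaks against visually determined labels, as described in the abstract above, can be sketched as follows. This is a simplified illustration with only two assumed label types, "peaks" (the region should contain at least one predicted peak) and "noPeaks" (the region should contain none). The function names and data layout are hypothetical and do not reproduce the PeakError package API.

```python
def overlaps(peak, region):
    """True if two half-open intervals (start, end) share at least one base."""
    return peak[0] < region[1] and region[0] < peak[1]


def label_errors(peaks, labels):
    """Count labels that the predicted peaks contradict.

    peaks:  list of (start, end) predicted peak intervals.
    labels: list of (start, end, kind), kind being "peaks" or "noPeaks".
    """
    false_positives = 0
    false_negatives = 0
    for start, end, kind in labels:
        n_overlapping = sum(overlaps(p, (start, end)) for p in peaks)
        if kind == "noPeaks" and n_overlapping > 0:
            false_positives += 1   # a peak was predicted where none is visible
        elif kind == "peaks" and n_overlapping == 0:
            false_negatives += 1   # a visible peak was missed
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "errors": false_positives + false_negatives,
    }
```

Computing this error for several parameter settings on a small labeled training set, then keeping the setting with the fewest errors, is one simple way to realize the parameter learning the abstract describes.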
5.
BMC Bioinformatics ; 14: 164, 2013 May 22.
Article in English | MEDLINE | ID: mdl-23697330

ABSTRACT

BACKGROUND: Many models have been proposed to detect copy number alterations in chromosomal copy number profiles, but it is usually not obvious which is most effective for a given data set. Furthermore, most methods have a smoothing parameter that determines the number of breakpoints and must be chosen using various heuristics. RESULTS: We present three contributions for copy number profile smoothing model selection. First, we propose to select the model and degree of smoothness that maximizes agreement with visual breakpoint region annotations. Second, we develop cross-validation procedures to estimate the error of the trained models. Third, we apply these methods to compare 17 smoothing models on a new database of 575 annotated neuroblastoma copy number profiles, which we make available as a public benchmark for testing new algorithms. CONCLUSIONS: Whereas previous studies have been qualitative or limited to simulated data, our annotation-guided approach is quantitative and suggests which algorithms are fastest and most accurate in practice on real data. In the neuroblastoma data, the equivalent pelt.n and cghseg.k methods were the best breakpoint detectors, and exhibited reasonable computation times.


Subjects
Chromosome Breakpoints , Gene Dosage/genetics , Gene Expression Profiling/methods , Genetic Models , Algorithms , Chromosome Mapping/methods , Humans
6.
Pac Symp Biocomput ; 25: 367-378, 2020.
Article in English | MEDLINE | ID: mdl-31797611

ABSTRACT

Joint peak detection is a central problem when comparing samples in epigenomic data analysis, but current algorithms for this task are unsupervised and limited to at most 2 sample types. We propose PeakSegPipeline, a new genome-wide multi-sample peak calling pipeline for epigenomic data sets. It performs peak detection using a constrained maximum likelihood segmentation model with essentially only one free parameter that needs to be tuned: the number of peaks. To select the number of peaks, we propose to learn a penalty function based on user-provided labels that indicate genomic regions with or without peaks in specific samples. In comparisons with state-of-the-art peak detection algorithms, PeakSegPipeline achieves similar or better accuracy, and a more interpretable model with overlapping peaks that occur in exactly the same positions across all samples. Our novel approach is able to learn that predicted peak sizes vary by experiment type.


Subjects
Algorithms , Computational Biology , Genomics , Machine Learning
7.
PLoS One ; 5(8): e11913, 2010 Aug 02.
Article in English | MEDLINE | ID: mdl-20689851

ABSTRACT

BACKGROUND: The recent advent of high-throughput SNP genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged. METHODOLOGY/PRINCIPAL FINDINGS: The purpose of this study is to develop an efficient model-based approach to perform Bayesian exploratory analyses for adaptive differentiation in very large SNP data sets. The basic idea is to start with a very simple model for neutral loci that is easy to implement under a Bayesian framework and to identify selected loci as outliers via Posterior Predictive P-values (PPP-values). Applications of this strategy are considered using two different statistical models. The first one was initially interpreted in the context of populations evolving respectively under pure genetic drift from a common ancestral population, while the second one relies on populations under migration-drift equilibrium. Robustness and power of the two resulting Bayesian model-based approaches to detect SNPs under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided. CONCLUSIONS/SIGNIFICANCE: The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient, especially for detecting loci under positive selection.


Subjects
Genetic Databases , Genomics/methods , Single Nucleotide Polymorphism , Genetic Selection , Physiological Adaptation , Animals , Bayes Theorem , Cattle , Gene Deletion , Genetic Loci/genetics , Genotype , Likelihood Functions , Genetic Models , Reproducibility of Results
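
The outlier test via Posterior Predictive P-values mentioned in the abstract above can be illustrated generically. The sketch below assumes the standard Monte Carlo definition: for each posterior draw, a discrepancy statistic is computed once on the observed data and once on data replicated from the model, and the PPP-value is the fraction of draws where the replicated value is at least as large. The discrepancy statistic and the function name are hypothetical; the specific statistics used in the study are not given in the abstract.

```python
def posterior_predictive_p_value(observed_stats, replicated_stats):
    """Monte Carlo estimate of a posterior predictive p-value.

    observed_stats[i]:   discrepancy of the observed data under posterior draw i.
    replicated_stats[i]: discrepancy of data simulated from posterior draw i.
    """
    if not observed_stats or len(observed_stats) != len(replicated_stats):
        raise ValueError("need two non-empty sequences of equal length")
    exceed = sum(
        rep >= obs for obs, rep in zip(observed_stats, replicated_stats)
    )
    return exceed / len(observed_stats)
```

Under this definition, a locus whose PPP-value lies close to 0 or 1 is poorly described by the neutral model and would be flagged as a candidate outlier.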