Results 1 - 11 of 11
1.
Bioinformatics ; 35(10): 1720-1728, 2019 05 15.
Article in English | MEDLINE | ID: mdl-30321307

ABSTRACT

MOTIVATION: Approximate Bayesian computation (ABC) has grown into a standard methodology that manages Bayesian inference for models associated with intractable likelihood functions. Most ABC implementations require the preliminary selection of a vector of informative statistics summarizing raw data. Furthermore, in almost all existing implementations, the tolerance level that separates acceptance from rejection of simulated parameter values needs to be calibrated. RESULTS: We propose to conduct likelihood-free Bayesian inferences about parameters with no prior selection of the relevant components of the summary statistics and bypassing the derivation of the associated tolerance level. The approach relies on the random forest (RF) methodology of Breiman (2001) applied in a (non-parametric) regression setting. We advocate the derivation of a new RF for each component of the parameter vector of interest. When compared with earlier ABC solutions, this method offers significant gains in terms of robustness to the choice of the summary statistics, does not depend on any type of tolerance level, and is a good trade-off in terms of quality of point estimator precision and credible interval estimations for a given computing time. We illustrate the performance of our methodological proposal and compare it with earlier ABC methods on a Normal toy example and a population genetics example dealing with human population evolution. AVAILABILITY AND IMPLEMENTATION: All methods designed here have been incorporated in the R package abcrf (version 1.7.1) available on CRAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
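
To make the role of the summary statistic and the tolerance level concrete, the following minimal sketch shows classical rejection ABC on a Normal toy problem. It illustrates the earlier style of ABC that the abstract contrasts with, not the random forest method or the abcrf package. The unit variance, the flat prior on the mean, the choice of the sample mean as the only summary statistic, and the tolerance of 0.1 are all assumptions made for illustration.

```python
import random
import statistics


def simulate(mu, n, rng):
    """Draw n observations from Normal(mu, 1); unit variance is an assumption."""
    return [rng.gauss(mu, 1.0) for _ in range(n)]


def summary(data):
    """Single summary statistic: the sample mean."""
    return statistics.fmean(data)


def abc_rejection(observed, n_sims, tolerance, rng):
    """Keep prior draws whose simulated summary is within `tolerance` of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_sims):
        mu = rng.uniform(-5.0, 5.0)  # assumed flat prior on the mean
        s_sim = summary(simulate(mu, len(observed), rng))
        if abs(s_sim - s_obs) <= tolerance:
            accepted.append(mu)
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # pseudo-observed data, true mean 2.0
posterior_sample = abc_rejection(observed, n_sims=5000, tolerance=0.1, rng=rng)
posterior_mean = statistics.fmean(posterior_sample)
print(len(posterior_sample), round(posterior_mean, 2))
```

Shrinking `tolerance` tightens the approximation but discards more simulations, which is the calibration burden the abstract says the random forest approach avoids.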


Subjects
Bayes Theorem , Biometry , Computer Simulation , Population Genetics , Humans , Likelihood Functions
2.
Mol Ecol ; 29(23): 4542-4558, 2020 12.
Article in English | MEDLINE | ID: mdl-33000872

ABSTRACT

Dating population divergence within species from molecular data and relating such dating to climatic and biogeographic changes is not trivial. Yet it can help formulate evolutionary hypotheses regarding local adaptation and future responses to changing environments. Key issues include statistical selection of a demographic and historical scenario among a set of possible scenarios, and estimation of the parameter(s) of interest under the chosen scenario. Such inferences greatly benefit from (a) independent information on evolutionary rate and pattern at genetic markers; and (b) new statistical approaches, such as approximate Bayesian computation-random forest (ABC-RF), which provides reliable inference at a low computational cost and the possibility to measure prediction quality at the exact position of the observed data set. Here, we show the full potential of the ABC-RF approach including prior knowledge on microsatellite genetic markers to decipher the evolutionary history of the African arid-adapted pest locust, Schistocerca gregaria, with support for a southern colonization of Africa, from a low number of founders of northern origin, dating back 2.6 Ky (90% CI: 0.9-6.6 Ky). We verify that this divergence time estimate accurately reflected true divergence time values by computing accuracy at a local posterior scale from simulated pseudo-observed data sets. The inferred divergence history is better explained by the peculiar biology of S. gregaria, which involves a density-dependent swarming phase with some exceptional spectacular migrations, rather than a continuous colonization resulting from the continental expansion of open vegetation habitats during more ancient Quaternary glacial climatic episodes.


Subjects
Population Genetics , Grasshoppers , Africa , Animals , Bayes Theorem , Genetic Variation , Grasshoppers/genetics , Microsatellite Repeats/genetics
3.
PLoS Comput Biol ; 14(1): e1005921, 2018 01.
Article in English | MEDLINE | ID: mdl-29293496

ABSTRACT

Gene expression is orchestrated by distinct regulatory regions to ensure a wide variety of cell types and functions. A challenge is to identify which regulatory regions are active, what their associated features are and how they work together in each cell type. Several approaches have tackled this problem by modeling gene expression based on epigenetic marks, with the ultimate goal of identifying driving regions and associated genomic variations that are clinically relevant, in particular in precision medicine. However, these models rely on experimental data, which are limited to specific samples (often even to cell lines) and cannot be generated for all regulators and all patients. In addition, we show here that, although these approaches are accurate in predicting gene expression, inference of TF combinations from this type of model is not straightforward. Furthermore, these methods are not designed to capture regulation instructions present at the sequence level, before the binding of regulators or the opening of the chromatin. Here, we probe sequence-level instructions for gene expression and develop a method to explain mRNA levels based solely on nucleotide features. Our method positions nucleotide composition as a critical component of gene expression. Moreover, our approach, able to rank regulatory regions according to their contribution, unveils a strong influence of the gene body sequence, in particular introns. We further provide evidence that the contribution of nucleotide content can be linked to co-regulations associated with genome 3D architecture and to associations of genes within topologically associated domains.


Subjects
Base Composition , Gene Expression Regulation , Nucleic Acid Regulatory Sequences , Computational Biology , DNA Copy Number Variations , Genetic Enhancer Elements , Human Genome , Humans , Genetic Models , Neoplasms/genetics , Neoplasms/metabolism , Single Nucleotide Polymorphism , Genetic Promoter Regions , Quantitative Trait Loci , Messenger RNA/chemistry , Messenger RNA/genetics , Messenger RNA/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
4.
Mol Biol Evol ; 34(4): 980-996, 2017 04 01.
Article in English | MEDLINE | ID: mdl-28122970

ABSTRACT

Deciphering invasion routes from molecular data is crucial to understanding biological invasions, including identifying bottlenecks in population size and admixture among distinct populations. Here, we unravel the invasion routes of the invasive pest Drosophila suzukii using a multi-locus microsatellite dataset (25 loci on 23 worldwide sampling locations). To do this, we use approximate Bayesian computation (ABC), which has improved the reconstruction of invasion routes, but can be computationally expensive. We use our study to illustrate the use of a new, more efficient, ABC method, ABC random forest (ABC-RF) and compare it to a standard ABC method (ABC-LDA). We find that Japan emerges as the most probable source of the earliest recorded invasion into Hawaii. Southeast China and Hawaii together are the most probable sources of populations in western North America, which then in turn served as sources for those in eastern North America. European populations are genetically more homogeneous than North American populations, and their most probable source is northeast China, with evidence of limited gene flow from the eastern US as well. All introduced populations passed through bottlenecks, and analyses reveal five distinct admixture events. These findings can inform hypotheses concerning how this species evolved between different and independent source and invasive populations. Methodological comparisons indicate that ABC-RF and ABC-LDA show concordant results if ABC-LDA is based on a large number of simulated datasets but that ABC-RF out-performs ABC-LDA when using a comparable and more manageable number of simulated datasets, especially when analyzing complex introduction scenarios.


Subjects
Bayes Theorem , Drosophila/genetics , Population Genetics/methods , Phylogeography/methods , Animals , China , Computer Simulation , Genetic Variation/genetics , Genotype , Hawaii , Introduced Species , Japan , Microsatellite Repeats/genetics , Genetic Models , North America
5.
Bioinformatics ; 32(6): 859-66, 2016 03 15.
Article in English | MEDLINE | ID: mdl-26589278

ABSTRACT

MOTIVATION: Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities may be poorly evaluated by standard ABC techniques. RESULTS: We propose a novel approach based on a machine learning tool named random forests (RF) to conduct selection among the highly complex models covered by ABC algorithms. We thus modify the way Bayesian model selection is both understood and operated, in that we rephrase the inferential goal as a classification problem, first predicting the model that best fits the data with RF and postponing the approximation of the posterior probability of the selected model for a second stage also relying on RF. Compared with earlier implementations of ABC model choice, the ABC RF approach offers several potential improvements: (i) it often has a larger discriminative power among the competing models, (ii) it is more robust against the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a gain in computation efficiency of at least 50) and (iv) it includes an approximation of the posterior probability of the selected model. The call to RF will undoubtedly extend the range of size of datasets and complexity of models that ABC can handle. We illustrate the power of this novel methodology by analyzing controlled experiments as well as genuine population genetics datasets. AVAILABILITY AND IMPLEMENTATION: The proposed methodology is implemented in the R package abcrf available on the CRAN. CONTACT: jean-michel.marin@umontpellier.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
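
The idea of treating model choice as a classification problem on a simulated reference table can be sketched as follows. This is a minimal illustration under stated assumptions, not the abcrf implementation: a k-nearest-neighbour vote stands in for the random forest classifier, and the two competing toy models, the prior on `mu`, the two summary statistics, and `k = 50` are all hypothetical choices.

```python
import math
import random
import statistics


def simulate(model, n, rng):
    """Toy models: 0 is Normal(mu, 1), 1 is Normal(mu, 3); mu ~ Uniform(-5, 5)."""
    mu = rng.uniform(-5.0, 5.0)
    sd = 1.0 if model == 0 else 3.0
    return [rng.gauss(mu, sd) for _ in range(n)]


def summaries(data):
    """Summary statistics: sample mean and sample standard deviation."""
    return (statistics.fmean(data), statistics.stdev(data))


def build_reference_table(n_sims, n, rng):
    """Simulate (model index, summaries) pairs with a uniform prior over models."""
    table = []
    for _ in range(n_sims):
        model = rng.randrange(2)
        table.append((model, summaries(simulate(model, n, rng))))
    return table


def knn_model_choice(table, s_obs, k):
    """Classify the observed summaries by majority vote among the k nearest simulations."""
    nearest = sorted(table, key=lambda row: math.dist(row[1], s_obs))[:k]
    share_model_1 = sum(model for model, _ in nearest) / k
    chosen = 1 if share_model_1 > 0.5 else 0
    return chosen, share_model_1


rng = random.Random(1)
table = build_reference_table(4000, 40, rng)
observed = [rng.gauss(1.0, 3.0) for _ in range(40)]  # pseudo-observed data from model 1
chosen, share = knn_model_choice(table, summaries(observed), k=50)
print(chosen, share)
```

The vote share is only a rough indicator of support. The abstract notes that model posterior probabilities may be poorly evaluated by standard ABC techniques, which is why it postpones that approximation to a second stage.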


Subjects
Population Genetics , Algorithms , Bayes Theorem , Computer Simulation , Genetic Models
6.
Bioinformatics ; 30(8): 1187-1189, 2014 04 15.
Article in English | MEDLINE | ID: mdl-24389659

ABSTRACT

MOTIVATION: DIYABC is a software package for a comprehensive analysis of population history using approximate Bayesian computation on DNA polymorphism data. Version 2.0 implements a number of new features and analytical methods. It allows (i) the analysis of single nucleotide polymorphism data at a large number of loci, apart from microsatellite and DNA sequence data, (ii) efficient Bayesian model choice using linear discriminant analysis on summary statistics and (iii) the serial launching of multiple post-processing analyses. DIYABC v2.0 also includes a user-friendly graphical interface with various new options. It can be run on three operating systems: GNU/Linux, Microsoft Windows and Apple OS X. AVAILABILITY: Freely available with a detailed notice document and example projects to academic users at http://www1.montpellier.inra.fr/CBGP/diyabc CONTACT: estoup@supagro.inra.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Population Genetics/methods , Single Nucleotide Polymorphism , Software , Bayes Theorem , Computational Biology , Humans , Microsatellite Repeats , DNA Sequence Analysis
7.
Proc Natl Acad Sci U S A ; 108(37): 15112-7, 2011 Sep 13.
Article in English | MEDLINE | ID: mdl-21876135

ABSTRACT

Approximate Bayesian computation (ABC) has become an essential tool for the analysis of complex stochastic models. Grelaud et al. [(2009) Bayesian Anal 3:427-442] advocated the use of ABC for model choice in the specific case of Gibbs random fields, relying on an intermodel sufficiency property to show that the approximation was legitimate. We implemented ABC model choice in a wide range of phylogenetic models in the Do It Yourself-ABC (DIY-ABC) software [Cornuet et al. (2008) Bioinformatics 24:2713-2719]. We now present arguments as to why the theoretical arguments for ABC model choice are missing, because the algorithm involves an unknown loss of information induced by the use of insufficient summary statistics. The approximation error of the posterior probabilities of the models under comparison may thus be unrelated to the computational effort spent in running an ABC algorithm. We then conclude that additional empirical verifications of the performance of the ABC procedure, such as those available in DIY-ABC, are necessary to conduct model choice.


Subjects
Bayes Theorem , Computational Biology/methods , Computer Simulation , Population Genetics
8.
Mol Ecol Resour ; 21(8): 2598-2613, 2021 Nov.
Article in English | MEDLINE | ID: mdl-33950563

ABSTRACT

Simulation-based methods such as approximate Bayesian computation (ABC) are well-adapted to the analysis of complex scenarios of populations and species genetic history. In this context, supervised machine learning (SML) methods provide attractive statistical solutions to conduct efficient inferences about scenario choice and parameter estimation. The Random Forest methodology (RF) is a powerful ensemble of SML algorithms used for classification or regression problems. Random Forest allows conducting inferences at a low computational cost, without preliminary selection of the relevant components of the ABC summary statistics, and bypassing the derivation of ABC tolerance levels. We have implemented a set of RF algorithms to process inferences using simulated data sets generated from an extended version of the population genetic simulator implemented in DIYABC v2.1.0. The resulting computer package, named DIYABC Random Forest v1.0, integrates two functionalities into a user-friendly interface: the simulation under custom evolutionary scenarios of different types of molecular data (microsatellites, DNA sequences or SNPs) and RF treatments including statistical tools to evaluate the power and accuracy of inferences. We illustrate the functionalities of DIYABC Random Forest v1.0 for both scenario choice and parameter estimation through the analysis of pseudo-observed and real data sets corresponding to pool-sequencing and individual-sequencing SNP data sets. Because of the properties inherent to the implemented RF methods and the large feature vector (including various summary statistics and their linear combinations) available for SNP data, DIYABC Random Forest v1.0 can efficiently contribute to the analysis of large SNP data sets to make inferences about complex population genetic histories.


Subjects
Algorithms , Population Genetics , Bayes Theorem , Computer Simulation , Demography , Single Nucleotide Polymorphism , Supervised Machine Learning
9.
Bioinformatics ; 24(23): 2713-9, 2008 Dec 01.
Article in English | MEDLINE | ID: mdl-18842597

ABSTRACT

Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations typically involving a small number of samples. We propose here a computer program (DIY ABC) for inference based on approximate Bayesian computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIY ABC can be used to compare competing scenarios, estimate parameters for one or more scenarios and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes key methods used in the program and provides its main features. The analysis of one simulated and one real dataset, both with complex evolutionary scenarios, illustrates the main possibilities of DIY ABC. AVAILABILITY: The software DIY ABC is freely available at http://www.montpellier.inra.fr/CBGP/diyabc.


Subjects
Population Genetics/methods , Software , Algorithms , Bayes Theorem , Molecular Evolution , Humans , Population Groups/genetics
10.
Mol Ecol Resour ; 12(5): 846-55, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22571382

ABSTRACT

Comparison of demo-genetic models using Approximate Bayesian Computation (ABC) is an active research field. Although large numbers of populations and models (i.e. scenarios) can be analysed with ABC using molecular data obtained from various marker types, methodological and computational issues arise when these numbers become too large. Moreover, Robert et al. (Proceedings of the National Academy of Sciences of the United States of America, 2011, 108, 15112) have shown that the conclusions drawn on ABC model comparison cannot be trusted per se and require additional simulation analyses. Monte Carlo inferential techniques to empirically evaluate confidence in scenario choice are very time-consuming, however, when the numbers of summary statistics (Ss) and scenarios are large. We here describe a methodological innovation to process efficient ABC scenario probability computation using linear discriminant analysis (LDA) on Ss before computing logistic regression. We used simulated pseudo-observed data sets (pods) to assess the main features of the method (precision and computation time) in comparison with traditional probability estimation using raw (i.e. not LDA transformed) Ss. We also illustrate the method on real microsatellite data sets produced to make inferences about the invasion routes of the coccinellid Harmonia axyridis. We found that scenario probabilities computed from LDA-transformed and raw Ss were strongly correlated. Type I and II errors were similar for both methods. The faster probability computation that we observed (speed gain around a factor of 100 for LDA-transformed Ss) substantially increases the ability of ABC practitioners to analyse large numbers of pods and hence provides a manageable way to empirically evaluate the power available to discriminate among a large set of complex scenarios.
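
The dimension-reduction step described above can be illustrated with a minimal two-scenario sketch of Fisher linear discriminant analysis, which projects a vector of summary statistics onto a single discriminant axis. It assumes only two scenarios, two summary statistics, and Gaussian toy summaries, all of which are hypothetical. The logistic regression that the abstract applies to the transformed statistics is not shown, and in general LDA on K scenarios yields at most K - 1 axes.

```python
import random
import statistics


def mean_vector(rows):
    """Column-wise mean of a list of 2-D points."""
    return [statistics.fmean(col) for col in zip(*rows)]


def pooled_covariance(groups):
    """Pooled within-class 2x2 covariance matrix."""
    sxx = sxy = syy = 0.0
    dof = 0
    for rows in groups:
        mx, my = mean_vector(rows)
        for x, y in rows:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
        dof += len(rows) - 1
    return [[sxx / dof, sxy / dof], [sxy / dof, syy / dof]]


def lda_direction(class0, class1):
    """Fisher direction w = S^{-1} (m1 - m0) for two classes of 2-D summaries."""
    (a, b), (c, d) = pooled_covariance([class0, class1])
    det = a * d - b * c
    m0, m1 = mean_vector(class0), mean_vector(class1)
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    return [(d * dx - b * dy) / det, (-c * dx + a * dy) / det]


def project(w, row):
    """Scalar LDA-transformed statistic for one vector of summaries."""
    return w[0] * row[0] + w[1] * row[1]


rng = random.Random(2)
# Hypothetical simulated summaries under two scenarios.
scenario0 = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
scenario1 = [(rng.gauss(1.5, 1.0), rng.gauss(1.0, 1.0)) for _ in range(200)]
w = lda_direction(scenario0, scenario1)
z0 = statistics.fmean(project(w, r) for r in scenario0)
z1 = statistics.fmean(project(w, r) for r in scenario1)
print(round(z0, 2), round(z1, 2))
```

Working with the projected statistic instead of the raw summaries shrinks the input to the later regression step, which is consistent with the speed gain the abstract reports.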


Subjects
Biostatistics/methods , Computational Biology/methods , Genetic Models , Animals , Beetles/genetics , Genetic Markers , Population Genetics
11.
Nat Struct Mol Biol ; 19(8): 837-44, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22751019

ABSTRACT

DNA replication is highly regulated, ensuring faithful inheritance of genetic information through each cell cycle. In metazoans, this process is initiated at many thousands of DNA replication origins whose cell type-specific distribution and usage are poorly understood. We exhaustively mapped the genome-wide location of replication origins in human cells using deep sequencing of short nascent strands and identified ten times more origin positions than we expected; most of these positions were conserved in four different human cell lines. Furthermore, we identified a consensus G-quadruplex-forming DNA motif that can predict the position of DNA replication origins in human cells, accounting for their distribution, usage efficiency and timing. Finally, we discovered a cell type-specific reprogrammable signature of cell identity that was revealed by specific efficiencies of conserved origin positions and not by the selection of cell type-specific subsets of origins.


Subjects
G-Quadruplexes , Replication Origin/genetics , Base Sequence , Cell Line , Chromosome Mapping , Consensus Sequence , DNA Primers/genetics , DNA Replication/genetics , Human Genome , HeLa Cells , Humans , Nucleotide Motifs