Results 1 - 17 of 17
1.
Bull Math Biol ; 85(11): 114, 2023 Oct 12.
Article in English | MEDLINE | ID: mdl-37828255

ABSTRACT

The serial nature of reactions involved in the RNA life-cycle motivates the incorporation of delays in models of transcriptional dynamics. The models couple a transcriptional process to a fairly general set of delayed monomolecular reactions with no feedback. We provide numerical strategies for calculating the RNA copy number distributions induced by these models, and solve several systems with splicing, degradation, and catalysis. An analysis of single-cell and single-nucleus RNA sequencing data using these models reveals that the kinetics of nuclear export do not appear to require invocation of a non-Markovian waiting time.


Subjects
Mathematical Concepts , Biological Models , Stochastic Processes , Computer Simulation , RNA , Markov Chains , Algorithms
2.
HardwareX ; 10: e00201, 2021 Oct.
Article in English | MEDLINE | ID: mdl-35607693

ABSTRACT

We present colosseum, a low-cost, modular, and automated fluid sampling device for scalable fluidic applications. The colosseum fraction collector uses a single motor, can be built for less than $100 using off-the-shelf and 3D-printed components, and can be assembled in less than an hour. Build instructions and source files are available at https://doi.org/10.5281/zenodo.4677604.

3.
Phys Rev E ; 102(2-1): 022409, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32942485

ABSTRACT

We explore a Markov model used in the analysis of gene expression, involving the bursty production of pre-mRNA, its conversion to mature mRNA, and its consequent degradation. We demonstrate that the integration used to compute the solution of the stochastic system can be approximated by the evaluation of special functions. Furthermore, the form of the special function solution generalizes to a broader class of burst distributions. In light of the broader goal of biophysical parameter inference from transcriptomics data, we apply the method to simulated data, demonstrating effective control of precision and runtime. Finally, we propose and validate a non-Bayesian approach for parameter estimation based on the characteristic function of the target joint distribution of pre-mRNA and mRNA.


Subjects
Genetic Models , Genetic Transcription , Markov Chains , RNA Precursors/genetics , Messenger RNA/genetics , Stochastic Processes
4.
Stat Appl Genet Mol Biol ; 10(1), 2011 Sep 23.
Article in English | MEDLINE | ID: mdl-23089814

ABSTRACT

Recent experimental and computational work confirms that CpGs can be unmethylated inside coding exons, thereby showing that codons may be subjected to both genomic and epigenomic constraint. It is therefore of interest to identify coding CpG islands (CCGIs) that are regions inside exons enriched for CpGs. The difficulty in identifying such islands is that coding exons exhibit sequence biases determined by codon usage and constraints that must be taken into account. We present a method for finding CCGIs that showcases a novel approach we have developed for identifying regions of interest that are significant (with respect to a Markov chain) for the counts of any pattern. Our method begins with the exact computation of tail probabilities for the number of CpGs in all regions contained in coding exons, and then applies a greedy algorithm for selecting islands from among the regions. We show that the greedy algorithm provably optimizes a biologically motivated criterion for selecting islands while controlling the false discovery rate. We applied this approach to the human genome (hg18) and annotated CpG islands in coding exons. The statistical criterion we apply to evaluating islands reduces the number of false positives in existing annotations, while our approach to defining islands reveals significant numbers of undiscovered CCGIs in coding exons. Many of these appear to be examples of functional epigenetic specialization in coding exons.


Subjects
Computational Biology/methods , CpG Islands , Markov Chains , Software , Algorithms , Cell Line , DNA Methylation , Genetic Epigenesis , Exons , Human Genome , Genomics/methods , Humans , Molecular Sequence Annotation , ROC Curve , Reproducibility of Results , Sensitivity and Specificity , Time Factors
5.
PLoS Comput Biol ; 6(8): e1000888, 2010 Aug 19.
Article in English | MEDLINE | ID: mdl-20856582

ABSTRACT

The ability to assay genome-scale methylation patterns using high-throughput sequencing makes it possible to carry out association studies to determine the relationship between epigenetic variation and phenotype. While bisulfite sequencing can determine a methylome at high resolution, cost inhibits its use in comparative and population studies. MethylSeq, based on sequencing of fragment ends produced by a methylation-sensitive restriction enzyme, is a method for methyltyping (survey of methylation states) and is a site-specific and cost-effective alternative to whole-genome bisulfite sequencing. Despite its advantages, the use of MethylSeq has been restricted by biases in MethylSeq data that complicate the determination of methyltypes. Here we introduce a statistical method, MetMap, that produces corrected site-specific methylation states from MethylSeq experiments and annotates unmethylated islands across the genome. MetMap integrates genome sequence information with experimental data, in a statistically sound and cohesive Bayesian Network. It infers the extent of methylation at individual CGs and across regions, and serves as a framework for comparative methylation analysis within and among species. We validated MetMap's inferences with direct bisulfite sequencing, showing that the methylation status of sites and islands is accurately inferred. We used MetMap to analyze MethylSeq data from four human neutrophil samples, identifying novel, highly unmethylated islands that are invisible to sequence-based annotation strategies. The combination of MethylSeq and MetMap is a powerful and cost-effective tool for determining genome-scale methyltypes suitable for comparative and association studies.


Subjects
DNA Methylation , Human Genome , Genetic Models , Population/genetics , DNA Sequence Analysis/methods , Software , Bayes Theorem , CpG Islands , Genomics/economics , Genomics/methods , Humans , Neutrophils , Sulfites/chemistry
6.
BMC Bioinformatics ; 11: 430, 2010 Aug 18.
Article in English | MEDLINE | ID: mdl-20718980

ABSTRACT

BACKGROUND: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can be used to detect regions with anomalous coverage. This modeling perspective is especially germane to current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. RESULTS: Under the mild assumptions that fragment start sites are Poisson distributed and successive fragment lengths are independent and identically distributed, we observe that, regardless of fragment length distribution, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the successive jumps of the coverage function, and show that they can be encoded as a random tree that is approximately a Galton-Watson tree with generation-dependent geometric offspring distributions whose parameters can be computed. CONCLUSIONS: We extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. Our approach leads to explicit determinations of the null distributions of certain test statistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focus on fragments also leads to a new approach to visualizing sequencing data that is of independent interest.


Subjects
Statistical Models , DNA Sequence Analysis/methods , Base Sequence , DNA Fragmentation , Monte Carlo Method , Statistical Distributions
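
For orientation, the classic Lander-Waterman model that this abstract extends can be sketched in a few lines. This is a minimal illustration of the textbook fixed-length case only, not the paper's extension to general fragment length distributions; the function name and the example numbers are hypothetical.

```python
import math

def lander_waterman(num_fragments, fragment_length, genome_length):
    """Textbook Lander-Waterman expectations for fixed-length fragments.

    Assumes fragment start sites are uniformly (Poisson) distributed and
    ignores any minimum overlap needed to join two fragments.
    """
    # Mean coverage depth c = N * L / G.
    coverage = num_fragments * fragment_length / genome_length
    # Probability that a given base is covered by at least one fragment.
    covered_fraction = 1.0 - math.exp(-coverage)
    # Expected number of islands (contigs): fragments not overlapped on one side.
    expected_islands = num_fragments * math.exp(-coverage)
    return coverage, covered_fraction, expected_islands

# Illustrative numbers only: 1000 fragments of length 100 on a 100 kb genome.
coverage, covered_fraction, expected_islands = lander_waterman(1000, 100, 100_000)
```

With a mean depth of 1, roughly 63% of the genome is expected to be covered, which is why practical experiments aim for much higher depth.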
7.
PLoS Comput Biol ; 5(5): e1000392, 2009 May.
Article in English | MEDLINE | ID: mdl-19478997

ABSTRACT

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment--previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches--yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.


Subjects
Statistical Data Interpretation , Genetic Models , Sequence Alignment/methods , Software , Algorithms , Amino Acid Sequence , Animals , Artificial Intelligence , Base Sequence , Genetic Databases , Humans , Markov Chains , Molecular Sequence Data , Sensitivity and Specificity , Sequence Analysis
8.
BMC Evol Biol ; 7: 60, 2007 Apr 13.
Article in English | MEDLINE | ID: mdl-17433106

ABSTRACT

BACKGROUND: Understanding interactions between mutations and how they affect fitness is a central problem in evolutionary biology that bears on such fundamental issues as the structure of fitness landscapes and the evolution of sex. To date, analyses of fitness landscapes have focused either on the overall directional curvature of the fitness landscape or on the distribution of pairwise interactions. In this paper, we propose and employ a new mathematical approach that allows a more complete description of multi-way interactions and provides new insights into the structure of fitness landscapes. RESULTS: We apply the mathematical theory of gene interactions developed by Beerenwinkel et al. to a fitness landscape for Escherichia coli obtained by Elena and Lenski. The genotypes were constructed by introducing nine mutations into a wild-type strain and constructing a restricted set of 27 double mutants. Despite the absence of mutants higher than second order, our analysis of this genotypic space points to previously unappreciated gene interactions, in addition to the standard pairwise epistasis. Our analysis confirms Elena and Lenski's inference that the fitness landscape is complex, so that an overall measure of curvature obscures a diversity of interaction types. We also demonstrate that some mutations contribute disproportionately to this complexity. In particular, some mutations are systematically better than others at mixing with other mutations. We also find a strong correlation between epistasis and the average fitness loss caused by deleterious mutations. In particular, the epistatic deviations from multiplicative expectations tend toward more positive values in the context of more deleterious mutations, emphasizing that pairwise epistasis is a local property of the fitness landscape. Finally, we determine the geometry of the fitness landscape, which reflects many of these biologically interesting features. 
CONCLUSION: A full description of complex fitness landscapes requires more information than the average curvature or the distribution of independent pairwise interactions. We have proposed a mathematical approach that, in principle, allows a complete description and, in practice, can suggest new insights into the structure of real fitness landscapes. Our analysis emphasizes the value of non-independent genotypes for these inferences.


Subjects
Biological Evolution , Genetic Epistasis , Escherichia coli/genetics , Genetic Models , Escherichia coli/physiology , Bacterial Gene Expression Regulation , Genotype , Markov Chains , Mutation , Genetic Selection
9.
J Comput Biol ; 12(6): 599-608, 2005.
Article in English | MEDLINE | ID: mdl-16108706

ABSTRACT

The Gibbs sampling method has been widely used for sequence analysis after it was successfully applied to the problem of identifying regulatory motif sequences upstream of genes. Since then, numerous variants of the original idea have emerged; however, in all cases the application has been to finding short motifs in collections of short sequences (typically less than 100 nucleotides long). In this paper, we introduce a Gibbs sampling approach for identifying genes in multiple large genomic sequences up to hundreds of kilobases long. This approach leverages the evolutionary relationships between the sequences to improve the gene predictions, without explicitly aligning the sequences. We have applied our method to the analysis of genomic sequence from 14 genomic regions, totaling roughly 1.8 Mb of sequence in each organism. We show that our approach compares favorably with existing ab initio approaches to gene finding, including pairwise comparison-based gene prediction methods which make explicit use of alignments. Furthermore, excellent performance can be obtained with as few as four organisms, and the method overcomes a number of difficulties of previous comparison-based gene finding approaches: it is robust with respect to genomic rearrangements, can work with draft sequence, and is fast (linear in the number and length of the sequences). It can also be seamlessly integrated with Gibbs sampling motif detection methods.


Subjects
Algorithms , Exons , Genetic Models , Proteins/genetics , Sequence Alignment/methods , Protein Sequence Analysis/methods , Animals , Genomics , Humans , Markov Chains , Mice , Rats , Sensitivity and Specificity
10.
Proc Natl Acad Sci U S A ; 101(46): 16138-43, 2004 Nov 16.
Article in English | MEDLINE | ID: mdl-15534223

ABSTRACT

One of the major successes in computational biology has been the unification, by using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied to these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.


Subjects
Sequence Analysis/statistics & numerical data , Algorithms , Markov Chains , Statistical Models , Sequence Alignment/statistics & numerical data , DNA Sequence Analysis/statistics & numerical data
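
The sum-product algorithm mentioned in this abstract can be illustrated on a hidden Markov model, where it reduces to the forward recursion. The sketch below is not the polytope propagation algorithm itself; it only shows, under a hypothetical two-state HMM with made-up parameters, how one recursion serves two inference problems depending on which "addition" is plugged in: ordinary sum gives the likelihood of the observations, while max gives the probability of the single best hidden path.

```python
from functools import reduce

def forward(init, trans, emit, obs, add, mul):
    """Forward recursion over a semiring defined by the pair (add, mul)."""
    states = range(len(init))
    # Initialise with the first observation.
    alpha = [mul(init[s], emit[s][obs[0]]) for s in states]
    # Propagate one observation at a time.
    for o in obs[1:]:
        alpha = [
            mul(reduce(add, (mul(alpha[r], trans[r][s]) for r in states)), emit[s][o])
            for s in states
        ]
    return reduce(add, alpha)

# Hypothetical two-state HMM; all numbers are illustrative assumptions.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
obs = [0, 1, 1]

def times(a, b):
    return a * b

# (sum, product): total probability of the observations.
likelihood = forward(init, trans, emit, obs, lambda a, b: a + b, times)
# (max, product): probability of the most likely hidden-state path.
best_path_prob = forward(init, trans, emit, obs, max, times)
```

The swap works because multiplication by a non-negative number distributes over max just as it does over addition.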
11.
Proc Natl Acad Sci U S A ; 101(46): 16132-7, 2004 Nov 16.
Article in English | MEDLINE | ID: mdl-15534224

ABSTRACT

This article presents a unified mathematical framework for inference in graphical models, building on the observation that graphical models are algebraic varieties. From this geometric viewpoint, observations generated from a model are coordinates of a point in the variety, and the sum-product algorithm is an efficient tool for evaluating specific coordinates. Here, we address the question of how the solutions to various inference problems depend on the model parameters. The proposed answer is expressed in terms of tropical algebraic geometry. The Newton polytope of a statistical model plays a key role. Our results are applied to the hidden Markov model and the general Markov model on a binary tree.


Subjects
Statistical Models , Algorithms , Markov Chains , Mathematics
12.
Bioinformatics ; 20(12): 1850-60, 2004 Aug 12.
Article in English | MEDLINE | ID: mdl-14988105

ABSTRACT

MOTIVATION: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. RESULTS: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. AVAILABILITY: A Web server is available at http://bonaire.lbl.gov/shadower


Subjects
Algorithms , Chromosome Mapping/methods , Molecular Evolution , Gene Expression Profiling/methods , Genetic Models , Sequence Alignment/methods , DNA Sequence Analysis/methods , Markov Chains , Statistical Models , Phylogeny , Nucleic Acid Sequence Homology , Software
13.
Bioinformatics ; 19 Suppl 2: ii36-41, 2003 Oct.
Article in English | MEDLINE | ID: mdl-14534169

ABSTRACT

The standard method of applying hidden Markov models to biological problems is to find a Viterbi (maximal weight) path through the HMM graph. The Viterbi algorithm reduces the problem of finding the most likely hidden state sequence that explains given observations, to a dynamic programming problem for corresponding directed acyclic graphs. For example, in the gene finding application, the HMM is used to find the most likely underlying gene structure given a DNA sequence. In this note we discuss the applications of sampling methods for HMMs. The standard sampling algorithm for HMMs is a variant of the common forward-backward and backtrack algorithms, and has already been applied in the context of Gibbs sampling methods. Nevertheless, the practice of sampling state paths from HMMs does not seem to have been widely adopted, and important applications have been overlooked. We show how sampling can be used for finding alternative splicings for genes, including alternative splicings that are conserved between genes from related organisms. We also show how sampling from the posterior distribution is a natural way to compute probabilities for predicted exons and gene structures being correct under the assumed model. Finally, we describe a new memory-efficient sampling algorithm for certain classes of HMMs which provides a practical sampling alternative to the Hirschberg algorithm for optimal alignment. The ideas presented have applications not only to gene finding and HMMs but more generally to stochastic context-free grammars and RNA structure prediction.


Subjects
Algorithms , Alternative Splicing/genetics , Automated Pattern Recognition/methods , RNA Splice Sites/genetics , Sequence Alignment/methods , DNA Sequence Analysis/methods , Artificial Intelligence , Base Sequence , Conserved Sequence , Markov Chains , Molecular Sequence Data , Nucleic Acid Sequence Homology
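
The standard sampling scheme this abstract refers to (a forward pass followed by a randomized backtrack) can be sketched as follows. This is a minimal illustration for a small discrete HMM, not the memory-efficient algorithm the paper introduces; the function name and parameter layout are assumptions, and the unscaled forward values would underflow on long sequences.

```python
import random

def sample_path(init, trans, emit, obs, rng):
    """Draw one hidden-state path from P(path | obs) for a small HMM."""
    n = len(init)
    # Forward pass: alpha[t][s] = P(obs[0..t], state at time t is s).
    alpha = [[init[s] * emit[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[r] * trans[r][s] for r in range(n)) * emit[s][o]
            for s in range(n)
        ])
    # Stochastic backtrack: sample the final state in proportion to alpha,
    # then each earlier state in proportion to alpha[t][r] * trans[r][next].
    path = [rng.choices(range(n), weights=alpha[-1])[0]]
    for t in range(len(obs) - 2, -1, -1):
        nxt = path[-1]
        weights = [alpha[t][r] * trans[r][nxt] for r in range(n)]
        path.append(rng.choices(range(n), weights=weights)[0])
    path.reverse()
    return path

# Hypothetical parameters; a seeded generator keeps the draw reproducible.
example = sample_path(
    [0.5, 0.5],
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
    [0, 1, 1, 0],
    random.Random(0),
)
```

Repeating the draw many times and counting how often a feature appears gives a Monte Carlo estimate of that feature's posterior probability, which is the use the abstract describes for predicted exons.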
14.
J Comput Biol ; 10(3-4): 509-20, 2003.
Article in English | MEDLINE | ID: mdl-12935341

ABSTRACT

The application of Needleman-Wunsch alignment techniques to biological sequences is complicated by two serious problems when the sequences are long: the running time, which scales as the product of the lengths of sequences, and the difficulty in obtaining suitable parameters that produce meaningful alignments. The running time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high-scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman-Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated nonconserved segments. We present a solution to the problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relatives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, successfully reducing the Viterbi algorithm search space of alignment PHMMs by three orders of magnitude.


Subjects
Computational Biology/methods , Statistical Data Interpretation , Sequence Alignment/methods , Algorithms , Animals , CD4 Antigens/genetics , Humans , Markov Chains , Mice
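
The running-time problem described in this abstract is visible in a plain Needleman-Wunsch implementation: the nested loops visit every cell of a table whose size is the product of the two sequence lengths. The sketch below computes only the global alignment score with a linear gap penalty; the scoring values are illustrative assumptions, and it does not implement the search-space reduction the paper proposes.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty.

    Runs in O(len(a) * len(b)) time and keeps only two rows in memory.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

score_identical = needleman_wunsch_score("ACGT", "ACGT")
score_one_gap = needleman_wunsch_score("ACGT", "AGT")
```

Banding or chaining, as mentioned in the abstract, amounts to filling in only a subset of these cells.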
15.
Nucleic Acids Res ; 31(13): 3507-9, 2003 Jul 01.
Article in English | MEDLINE | ID: mdl-12824355

ABSTRACT

SLAM is a program that simultaneously aligns and annotates pairs of homologous sequences. The SLAM web server integrates SLAM with repeat masking tools and the AVID alignment program to allow for rapid alignment and gene prediction in user submitted sequences. Along with annotations and alignments for the submitted sequences, users obtain a list of predicted conserved non-coding sequences (and their associated alignments). The web site also links to whole genome annotations of the human, mouse and rat genomes produced with the SLAM program. The server can be accessed at http://bio.math.berkeley.edu/slam.


Subjects
Genomics/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Algorithms , Amino Acid Sequence , Animals , Base Sequence , Conserved Sequence , Gene Components , Humans , Internet , Markov Chains , Mice , Peptides/chemistry , Messenger RNA/chemistry , Untranslated RNA/chemistry , Rats
16.
Genome Res ; 13(3): 496-502, 2003 Mar.
Article in English | MEDLINE | ID: mdl-12618381

ABSTRACT

Comparative-based gene recognition is driven by the principle that conserved regions between related organisms are more likely than divergent regions to be coding. We describe a probabilistic framework for gene structure and alignment that can be used to simultaneously find both the gene structure and alignment of two syntenic genomic regions. A key feature of the method is the ability to enhance gene predictions by finding the best alignment between two syntenic sequences, while at the same time finding biologically meaningful alignments that preserve the correspondence between coding exons. Our probabilistic framework is the generalized pair hidden Markov model, a hybrid of (1) generalized hidden Markov models, which have been used previously for gene finding, and (2) pair hidden Markov models, which have applications to sequence alignment. We have built a gene finding and alignment program called SLAM, which aligns and identifies complete exon/intron structures of genes in two related but unannotated sequences of DNA. SLAM is able to reliably predict gene structures for any suitably related pair of organisms, most notably with fewer false-positive predictions compared to previous methods (examples are provided for Homo sapiens/Mus musculus and Plasmodium falciparum/Plasmodium vivax comparisons). Accuracy is obtained by distinguishing conserved noncoding sequence (CNS) from conserved coding sequence. CNS annotation is a novel feature of SLAM and may be useful for the annotation of UTRs, regulatory elements, and other noncoding features.


Subjects
Genes/genetics , Markov Chains , Sequence Alignment/statistics & numerical data , Software , Animals , Computational Biology/methods , Computational Biology/statistics & numerical data , Conserved Sequence/genetics , DNA/genetics , Protozoan DNA/genetics , Protozoan Genes/genetics , Humans , Mice , Plasmodium falciparum/genetics , Plasmodium vivax/genetics , Sequence Alignment/methods , Software Design , Species Specificity
17.
J Comput Biol ; 9(2): 389-99, 2002.
Article in English | MEDLINE | ID: mdl-12015888

ABSTRACT

Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding and describe applications to DNA-cDNA and DNA-protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems.


Subjects
Markov Chains , Sequence Alignment/statistics & numerical data , Algorithms , Computational Biology , DNA/genetics , Statistical Models , Proteins/genetics