Results 1 - 15 of 15
1.
PLoS Comput Biol ; 20(5): e1011543, 2024 May.
Article in English | MEDLINE | ID: mdl-38768195

ABSTRACT

Random forests have emerged as a promising tool in comparative metagenomics because they can predict environmental characteristics based on microbial composition in datasets where β-diversity metrics fall short of revealing meaningful relationships between samples. Nevertheless, despite this efficacy, they lack biological insight in tandem with their predictions, potentially hindering scientific advancement. To overcome this limitation, we leverage a geometric characterization of random forests to introduce a data-driven phylogenetic β-diversity metric, the adaptive Haar-like distance. This new metric assigns a weight to each internal node (i.e., split or bifurcation) of a reference phylogeny, indicating the relative importance of that node in discerning environmental samples based on their microbial composition. Alongside this, a weighted nearest-neighbors classifier, constructed using the adaptive metric, can be used as a proxy for the random forest while maintaining accuracy on par with that of the original forest and another state-of-the-art classifier, CoDaCoRe. As shown in datasets from diverse microbial environments, however, the new metric and classifier significantly enhance the biological interpretability and visualization of high-dimensional metagenomic samples.


Subject(s)
Algorithms , Computational Biology , Metagenomics , Phylogeny , Metagenomics/methods , Computational Biology/methods , Microbiota/genetics , Machine Learning , Metagenome/genetics
2.
J Math Biol ; 87(2): 26, 2023 07 10.
Article in English | MEDLINE | ID: mdl-37428265

ABSTRACT

Data taking values on discrete sample spaces are the embodiment of modern biological research. "Omics" experiments based on high-throughput sequencing produce millions of symbolic outcomes in the form of reads (i.e., DNA sequences of a few dozen to a few hundred nucleotides). Unfortunately, these intrinsically non-numerical datasets often deviate dramatically from natural assumptions a practitioner might make, and the possible sources of this deviation are usually poorly characterized. This contrasts with numerical datasets where Gaussian-type errors are often well-justified. To overcome this hurdle, we introduce the notion of latent weight, which measures the largest expected fraction of samples from a probabilistic source that conform to a model in a class of idealized models. We examine various properties of latent weights, which we specialize to the class of exchangeable probability distributions. As proof of concept, we analyze DNA methylation data from the 22 human autosome pairs. Contrary to what is usually assumed in the literature, we provide strong evidence that highly specific methylation patterns are overrepresented at some genomic locations when latent weights are taken into account.


Subject(s)
Genome , Genomics , Humans , Probability , High-Throughput Nucleotide Sequencing
3.
J Math Biol ; 79(1): 1-29, 2019 07.
Article in English | MEDLINE | ID: mdl-30929047

ABSTRACT

Numerous data analysis and data mining techniques require that data be embedded in a Euclidean space. When faced with symbolic datasets, particularly biological sequence data produced by high-throughput sequencing assays, conventional embedding approaches like binary and k-mer count vectors may be too high dimensional or coarse-grained to learn from the data effectively. Other representation techniques such as Multidimensional Scaling (MDS) and Node2Vec may be inadequate for large datasets as they require recomputing the full embedding from scratch when faced with new, unclassified data. To overcome these issues, we amend the graph-theoretic notion of "metric dimension" to that of "multilateration." Much like trilateration can be used to represent points in the Euclidean plane by their distances to three non-collinear points, multilateration allows us to represent any node in a graph by its distances to a subset of nodes. Unfortunately, the problem of determining a minimal subset and hence the lowest dimensional embedding is NP-complete for general graphs. However, by specializing to Hamming graphs, which are particularly well suited to representing biological sequences, we can readily generate low-dimensional embeddings to map sequences of arbitrary length to a real space. As proof-of-concept, we use MDS, Node2Vec, and multilateration-based embeddings to classify DNA 20-mers centered at intron-exon boundaries. Although these different techniques perform comparably, MDS and Node2Vec potentially suffer from scalability issues with increasing sequence length whereas multilateration provides an efficient means of mapping long genomic sequences.
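
As a rough illustration of the multilateration idea in the abstract above, the following Python sketch represents each sequence by its Hamming distances to a small set of landmark sequences, then checks by brute force whether those distance vectors are all distinct. The function names (`hamming`, `embed`, `resolves`) and the toy landmark sets are assumptions made for illustration; the exhaustive check is not the authors' construction and is only feasible for very short sequences.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def embed(seq: str, landmarks: list[str]) -> tuple[int, ...]:
    """Represent seq by its Hamming distances to each landmark."""
    return tuple(hamming(seq, ref) for ref in landmarks)


def resolves(landmarks: list[str], alphabet: str, k: int) -> bool:
    """True if every k-mer over the alphabet gets a distinct distance vector.

    Brute force over all len(alphabet)**k sequences, so toy sizes only.
    """
    seen = set()
    for kmer in map("".join, product(alphabet, repeat=k)):
        vec = embed(kmer, landmarks)
        if vec in seen:
            return False
        seen.add(vec)
    return True


# Hypothetical toy cases: single nucleotides, then binary 2-mers.
print(resolves(["A", "C", "G"], "ACGT", 1))  # True
print(resolves(["A", "C"], "ACGT", 1))       # False: G and T both map to (1, 1)
print(resolves(["00", "01"], "01", 2))       # True
```

Once a resolving set of landmarks is fixed, a new sequence can be embedded by computing only its distances to those landmarks, which is the property the abstract contrasts with methods that recompute the full embedding.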


Subject(s)
Genomics/methods , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Computer Simulation , Data Analysis , Data Mining/methods , Genomics/statistics & numerical data , Principal Component Analysis , Proof of Concept Study
4.
J Math Biol ; 74(1-2): 77-97, 2017 01.
Article in English | MEDLINE | ID: mdl-27142882

ABSTRACT

A mixture model and statistical method are proposed to interpret the distribution of reads from a nascent transcriptional assay, such as global run-on sequencing (GRO-seq) data. The model is annotation agnostic and leverages current understanding of the behavior of RNA polymerase II. Briefly, it assumes that polymerase loads at key positions (transcription start sites) within the genome. Once loaded, polymerase either remains in the initiation form (with some probability) or transitions into an elongating form (with the remaining probability). The model can be fit genome-wide, allowing patterns of Pol II behavior to be assessed on each distinct transcript. Furthermore, it allows for the first time a principled approach to distinguishing the initiation signal from the elongation signal; in particular, it implies a data-driven method for calculating the pausing index, a commonly used metric that informs on the behavior of RNA polymerase II. We demonstrate that this approach improves on existing analyses of GRO-seq data and uncovers a novel biological understanding of the impact of knocking down the Male Specific Lethal (MSL) complex in Drosophila melanogaster.


Subject(s)
Biological Models , RNA Polymerase II/metabolism , Genetic Transcription/genetics , Animals , Computer Simulation , Drosophila melanogaster/genetics , Gene Knockdown Techniques , Lethal Genes/genetics , Genome/genetics , Genetic Promoter Regions , DNA Sequence Analysis
5.
J Math Biol ; 69(1): 147-82, 2014 Jul.
Article in English | MEDLINE | ID: mdl-23739838

ABSTRACT

Sojourn-times provide a versatile framework to assess the statistical significance of motifs in genome-wide searches even under non-Markovian background models. However, the large state spaces encountered in genomic sequence analyses make the exact calculation of sojourn-time distributions computationally intractable in long sequences. Here, we use coupling and analytic combinatoric techniques to approximate these distributions in the general setting of Polish state spaces, which encompass discrete state spaces. Our approximations are accompanied with explicit, easy to compute, error bounds for total variation distance. Broadly speaking, if $T_n$ is the random number of times a Markov chain visits a certain subset $T$ of states in its first $n$ transitions, then we can usually approximate the distribution of $T_n$ for $n$ of order $(1-\alpha)^{-m}$, where $m$ is the largest integer for which the exact distribution of $T_m$ is accessible and $0 \le \alpha \le 1$ is an ergodicity coefficient associated with the probability transition kernel of the chain. This gives access to approximations of sojourn-times in the intermediate regime where $n$ is perhaps too large for exact calculations, but too small to rely on Normal approximations or stationarity assumptions underlying Poisson and compound Poisson approximations. As proof of concept, we approximate the distribution of the number of matches with a motif in promoter regions of C.
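
To make the quantity in the abstract above concrete, the sketch below computes the exact distribution of the number of visits to a target set in the first n transitions of a small finite-state Markov chain, using plain dynamic programming over (state, count) pairs. This is only the brute-force exact calculation for tiny chains, not the coupling-based approximation the abstract describes; the function name `sojourn_distribution` and the two-state example chain are assumptions for illustration.

```python
def sojourn_distribution(P, init, target, n):
    """Exact distribution of the number of visits to `target` in n transitions.

    P      : row-stochastic transition matrix as a list of lists
    init   : initial distribution over states (the initial state is not counted)
    target : set of state indices whose visits are counted
    n      : number of transitions
    Returns a list d where d[c] is the probability of exactly c visits.
    """
    m = len(P)
    # dist[s][c] = P(current state is s and c visits have occurred so far)
    dist = [[0.0] * (n + 1) for _ in range(m)]
    for s in range(m):
        dist[s][0] = init[s]
    for _ in range(n):
        new = [[0.0] * (n + 1) for _ in range(m)]
        for s in range(m):
            for c in range(n + 1):
                p = dist[s][c]
                if p == 0.0:
                    continue
                for t in range(m):
                    if P[s][t] == 0.0:
                        continue
                    # A transition into a target state adds one visit.
                    new[t][c + (1 if t in target else 0)] += p * P[s][t]
        dist = new
    return [sum(dist[s][c] for s in range(m)) for c in range(n + 1)]


# Hypothetical example: a fair two-state chain, counting visits to state 1.
print(sojourn_distribution([[0.5, 0.5], [0.5, 0.5]], [1.0, 0.0], {1}, 3))
# [0.125, 0.375, 0.375, 0.125]
```

The work grows with the square of n times the square of the number of states, which gives a sense of why exact calculation becomes impractical for the large state spaces and long sequences mentioned above.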


Subject(s)
Base Sequence/genetics , Markov Chains , Statistical Models , Nucleotide Motifs/genetics , Animals , Caenorhabditis elegans/genetics , Genetic Promoter Regions
6.
RNA ; 16(12): 2370-83, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20940341

ABSTRACT

The invariant choice of L-amino acids and D-ribose RNA for biological translation requires explanation. Here we study this chiral choice using mixed, equimolar D-ribose RNAs having 15, 18, 21, 27, 35, and 45 contiguous randomized nucleotides. These are used for simultaneous affinity selection of the smallest bound and eluted RNAs using equal amounts of L- and D-His immobilized on an achiral glass support, with racemic histidine elution. The experiment as a whole therefore determines whether RNA containing D-ribose binds L-histidine or D-histidine more easily (that is, by using a site that is more abundant/requires fewer nucleotides). The most prevalent/smallest RNA sites are reproducibly and repeatedly selected and there is a four- to sixfold greater abundance of L-histidine sites. RNA's chiral D-ribose therefore yields a more frequent fit to L-histidine. Accordingly, a D-ribose RNA site for L-His is smaller by the equivalent of just over one conserved nucleotide. The most prevalent L-His site also performs better than the most frequent D-His site, but rarer D-ribose RNAs can bind D-His with excellent affinity and discrimination. The prevalent L-His site is one we have selected before under very different conditions. Thus, selection is again reproducible, as is the recurrence of cognate coding triplets in these most probable L-His sites. If our selected RNA population were equilibrated with racemic His, we calculate that L-His would participate in seven of eight His:RNA complexes, or more. Thus, if D-ribose RNA were first chosen biologically, translational L-His usage could have followed.


Subject(s)
Histidine/chemistry , RNA/chemistry , RNA/metabolism , Ribose/chemistry , Ribose/metabolism , Base Sequence , Catalytic Domain , Genetic Code/physiology , Biological Models , Molecular Sequence Data , Nucleic Acid Conformation , RNA/chemical synthesis , Random Allocation , Stereoisomerism , Substrate Specificity
7.
RNA ; 16(2): 280-9, 2010 Feb.
Article in English | MEDLINE | ID: mdl-20032164

ABSTRACT

Different chemical and mutational processes within genomes give rise to sequences with different compositions and perhaps different capacities for evolution. The evolution of functional RNAs may occur on a "neutral network" in which sequences with any given function can easily mutate to sequences with any other. This neutral network hypothesis is more likely if there is a particular region of composition that contains sequences that are functional in general, and if many different functions are possible within this preferred region of composition. We show that sequence preferences in active sites recovered by in vitro selection combine with biophysical folding rules to support the neutral network hypothesis. These simple active-site specifications and folding preferences obtained by artificial selection experiments recapture the previously observed purine bias and specific spread along the GC axis of naturally occurring aptamers and ribozymes isolated from organisms, although other types of RNAs, such as miRNA precursors and spliceosomal RNAs, that act primarily through complementarity to other nucleic acids do not share these preferences. These universal evolved sequence features are therefore intrinsic in RNA molecules that bind small-molecule targets or catalyze reactions.


Subject(s)
RNA/chemistry , RNA/genetics , Nucleotide Aptamers/chemistry , Nucleotide Aptamers/genetics , Nucleotide Aptamers/metabolism , Base Composition , Base Sequence , Binding Sites/genetics , Biophysical Phenomena , Computational Biology , Genetic Models , Molecular Models , Statistical Models , Mutation , Nucleic Acid Conformation , Poisson Distribution , RNA/metabolism , Catalytic RNA/chemistry , Catalytic RNA/genetics , Catalytic RNA/metabolism , SELEX Aptamer Technique , Genetic Selection
8.
BMC Bioinformatics ; 9: 511, 2008 Dec 01.
Article in English | MEDLINE | ID: mdl-19046431

ABSTRACT

BACKGROUND: The nucleotide substitution rate matrix is a key parameter of molecular evolution. Several methods for inferring this parameter have been proposed, with different mathematical bases. These methods include counting sequence differences and taking the log of the resulting probability matrices, methods based on Markov triples, and maximum likelihood methods that infer the substitution probabilities that lead to the most likely model of evolution. However, the speed and accuracy of these methods have not been compared. RESULTS: Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small. Encouragingly, relatively simple and fast methods can provide results at least as accurate as far more complex and computationally intensive methods, especially when the sequences to be compared are relatively short. CONCLUSION: Based on the conditions tested, we recommend the use of the method of Gojobori et al. (1982) for long sequences (> 600 nucleotides), and the method of Goldman et al. (1996) for shorter sequences (< 600 nucleotides). The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time. The availability of methods that are both fast and accurate will allow us to gain a global picture of change in the nucleotide substitution rate matrix on a genomewide scale across the tree of life.


Subject(s)
Computational Biology/methods , DNA Mutational Analysis/methods , Molecular Evolution , Nucleotides/genetics , Algorithms , Computer Simulation , DNA/genetics , Statistical Data Interpretation , Logistic Models , Markov Chains , Genetic Models , Phylogeny , Reproducibility of Results , Sensitivity and Specificity
9.
Front Biosci ; 13: 6060-71, 2008 May 01.
Article in English | MEDLINE | ID: mdl-18508643

ABSTRACT

The abundance of simple but functional RNA sites in random-sequence pools is critical for understanding the emergence of RNA functions in nature and in the laboratory today. The complexity of a site is typically measured in terms of information, i.e., the Shannon entropy of the positions in a multiple sequence alignment. However, this calculation can be incorrect by many orders of magnitude. Here we compare several methods for estimating the abundance of RNA active-site patterns in the context of in vitro selection (SELEX), highlighting the strengths and weaknesses of each. We include in these methods a new approach that yields confidence bounds for the exact probability of finding specific kinds of RNA active sites. We show that all of the methods that take modularity into account provide far more accurate estimates of this probability than the informational methods, and that fast approximate methods are suitable for a wide range of RNA motifs.


Subject(s)
RNA/genetics , RNA/metabolism , Binding Sites , DNA/genetics , DNA/metabolism , Mathematics , Theoretical Models , Poisson Distribution , Probability , Proteins/genetics , Proteins/metabolism , Stochastic Processes
10.
IEEE/ACM Trans Comput Biol Bioinform ; 14(5): 1070-1081, 2017.
Article in English | MEDLINE | ID: mdl-26829802

ABSTRACT

We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq). GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady state RNA levels which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, presents unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without a priori need for annotation uncovers surprising new insights into several aspects of the transcription process.


Subject(s)
Algorithms , Computational Biology/methods , Molecular Sequence Annotation/methods , RNA/genetics , RNA Sequence Analysis/methods , Genetic Databases , Humans , Markov Chains , RNA/analysis
11.
mSystems ; 2(1)2017.
Article in English | MEDLINE | ID: mdl-28144630

ABSTRACT

Advances in sequencing technologies have enabled novel insights into microbial niche differentiation, from analyzing environmental samples to understanding human diseases and informing dietary studies. However, identifying the microbial taxa that differentiate these samples can be challenging. These issues stem from the compositional nature of 16S rRNA gene data (or, more generally, taxon or functional gene data); the changes in the relative abundance of one taxon influence the apparent abundances of the others. Here we acknowledge that inferring properties of individual bacteria is a difficult problem and instead introduce the concept of balances to infer meaningful properties of subcommunities, rather than properties of individual species. We show that balances can yield insights about niche differentiation across multiple microbial environments, including soil environments and lung sputum. These techniques have the potential to reshape how we carry out future ecological analyses aimed at revealing differences in relative taxonomic abundances across different samples. IMPORTANCE By explicitly accounting for the compositional nature of 16S rRNA gene data through the concept of balances, balance trees yield novel biological insights into niche differentiation. The software to perform this analysis is available under an open-source license and can be obtained at https://github.com/biocore/gneiss. Author Video: An author video summary of this article is available.
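
The abstract above does not spell out how a balance is computed. As a hedged sketch, the code below uses the standard isometric log-ratio balance from compositional data analysis: the scaled log-ratio of the geometric means of two groups of taxa. Treating this as the balance the abstract refers to is an assumption, and the function name `balance` and the sample proportions are hypothetical, not taken from the gneiss software.

```python
import math


def balance(numerator, denominator):
    """Isometric log-ratio balance between two groups of positive abundances.

    b = sqrt(r*s / (r + s)) * ln(gmean(numerator) / gmean(denominator)),
    where r and s are the group sizes. Zeros must be handled beforehand
    (for example with a pseudocount), since the logarithm is undefined there.
    """
    r, s = len(numerator), len(denominator)
    if r == 0 or s == 0:
        raise ValueError("both groups must be non-empty")
    if min(min(numerator), min(denominator)) <= 0:
        raise ValueError("abundances must be strictly positive")
    # Mean of logs equals the log of the geometric mean.
    log_gm_num = sum(math.log(x) for x in numerator) / r
    log_gm_den = sum(math.log(x) for x in denominator) / s
    return math.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)


# Hypothetical sample with four taxa split into two subcommunities.
b_relative = balance([0.1, 0.3], [0.2, 0.4])
b_counts = balance([10, 30], [20, 40])  # same sample before normalization
print(math.isclose(b_relative, b_counts))  # True
```

Because the balance depends only on ratios, rescaling every abundance by the same factor leaves it unchanged, which is why it sidesteps the compositional problem the abstract describes.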

12.
PLoS One ; 7(11): e42368, 2012.
Article in English | MEDLINE | ID: mdl-23139734

ABSTRACT

A classical problem in statistics is estimating the expected coverage of a sample, which has had applications in gene expression, microbial ecology, optimization, and even numismatics. Here we consider a related extension of this problem to random samples of two discrete distributions. Specifically, we estimate what we call the dissimilarity probability of a sample, i.e., the probability of a draw from one distribution not being observed in [Formula: see text] draws from another distribution. We show our estimator of dissimilarity to be a [Formula: see text]-statistic and a uniformly minimum variance unbiased estimator of dissimilarity over the largest appropriate range of [Formula: see text]. Furthermore, despite the non-Markovian nature of our estimator when applied sequentially over [Formula: see text], we show it converges uniformly in probability to the dissimilarity parameter, and we present criteria when it is approximately normally distributed and admits a consistent jackknife estimator of its variance. As proof of concept, we analyze V35 16S rRNA data to discern between various microbial environments. Other potential applications concern any situation where dissimilarity of two discrete distributions may be of interest. For instance, in SELEX experiments, each urn could represent a random RNA pool and each draw a possible solution to a particular binding site problem over that pool. The dissimilarity of these pools is then related to the probability of finding binding site solutions in one pool that are absent in the other.
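
For intuition about the parameter described in the abstract above, the sketch below evaluates it directly when both distributions are known: the probability that one draw from the first distribution is absent from a given number of independent draws from the second. The symbol `k` for the number of draws, the function name `dissimilarity`, and the toy distributions are assumptions for illustration. This computes the parameter itself, not the estimator studied in the paper.

```python
def dissimilarity(p, q, k):
    """P(a draw from p does not appear among k independent draws from q).

    p, q : dicts mapping outcomes to probabilities
    k    : number of draws from q (hypothetical name for the elided symbol)
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # An outcome x drawn from p is missed by all k draws from q
    # with probability (1 - q(x))**k.
    return sum(px * (1.0 - q.get(x, 0.0)) ** k for x, px in p.items())


# Hypothetical toy distributions over three taxa.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.5, "c": 0.5}
print(dissimilarity(p, q, 1))  # 0.75
print(dissimilarity(p, q, 2))  # 0.625
```

As k grows, the value decreases toward the total mass that p places on outcomes q can never produce (0.5 here, from outcome "b").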


Subject(s)
Statistical Models , Probability , Databases as Topic , Humans , Metagenome/genetics , Biological Models
13.
PLoS One ; 6(6): e21105, 2011.
Article in English | MEDLINE | ID: mdl-21738613

ABSTRACT

The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower-bounds on the α-diversity i.e. the total number of different species present in the environment, tight bounds on this quantity may be highly uncertain because a small fraction of the environment could be composed of a vast number of different species. To better assess what remains unknown, we propose instead to predict the fraction of the environment that belongs to unsampled classes. Modeling samples as draws with replacement of colored balls from an urn with an unknown composition, and under the sole assumption that there are still undiscovered species, we show that conditionally unbiased predictors and exact prediction intervals (of constant length in logarithmic scale) are possible for the fraction of the environment that belongs to unsampled classes. Our predictions are based on a poissonization argument, which we have implemented in what we call the Embedding algorithm. In fixed i.e. non-randomized sample sizes, the algorithm leads to very accurate predictions on a sub-sample of the original sample. We quantify the effect of fixed sample sizes on our prediction intervals and test our methods and others found in the literature against simulated environments, which we devise taking into account datasets from a human-gut and -hand microbiota. Our methodology applies to any dataset that can be conceptualized as a sample with replacement from an urn. In particular, it could be applied, for example, to quantify the proportion of all the unseen solutions to a binding site problem in a random RNA pool, or to reassess the surveillance of a certain terrorist group, predicting the conditional probability that it deploys a new tactic in a next attack.


Subject(s)
Algorithms , Environmental Microbiology
14.
J Math Biol ; 56(1-2): 51-92, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17668213

ABSTRACT

RNA motifs typically consist of short, modular patterns that include base pairs formed within and between modules. Estimating the abundance of these patterns is of fundamental importance for assessing the statistical significance of matches in genomewide searches, and for predicting whether a given function has evolved many times in different species or arose from a single common ancestor. In this manuscript, we review in an integrated and self-contained manner some basic concepts of automata theory, generating functions and transfer matrix methods that are relevant to pattern analysis in biological sequences. We formalize, in a general framework, the concept of Markov chain embedding to analyze patterns in random strings produced by a memoryless source. This conceptualization, together with the capability of automata to recognize complicated patterns, allows a systematic analysis of problems related to the occurrence and frequency of patterns in random strings. The applications we present focus on the concept of synchronization of automata, as well as automata used to search for a finite number of keywords (including sets of patterns generated according to base pairing rules) in a general text.
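
A minimal sketch of the Markov chain embedding idea in the abstract above, for the simplest case of a single keyword: a pattern-matching automaton tracks how much of the keyword has been matched, and because the text comes from a memoryless source, the automaton state evolves as a Markov chain whose distribution can be propagated step by step. The function names (`build_automaton`, `prob_contains`) and the two-letter example are assumptions for illustration and do not reproduce the paper's general framework.

```python
def build_automaton(pattern, alphabet):
    """Deterministic automaton whose state is the length of the longest
    prefix of `pattern` that is a suffix of the text read so far.
    The final state len(pattern) is made absorbing (pattern already seen).
    """
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            if state == m:
                row[ch] = m
                continue
            text = pattern[:state] + ch
            k = len(text)
            # Shrink k until the last k characters match a pattern prefix.
            while k > 0 and text[-k:] != pattern[:k]:
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def prob_contains(pattern, probs, n):
    """Probability that a random string of length n, with i.i.d. letters
    drawn according to `probs`, contains `pattern` at least once."""
    delta = build_automaton(pattern, list(probs))
    m = len(pattern)
    dist = [0.0] * (m + 1)  # distribution over automaton states
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * (m + 1)
        for state, p in enumerate(dist):
            if p == 0.0:
                continue
            for ch, pc in probs.items():
                new[delta[state][ch]] += p * pc
        dist = new
    return dist[m]


# Hypothetical two-letter source: AAA, AAB and BAA are the 3 of 8 hits.
print(prob_contains("AA", {"A": 0.5, "B": 0.5}, 3))  # 0.375
```

The same propagation works with any deterministic automaton, which is how more complicated pattern sets can be handled once an automaton recognizing them is available.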


Subject(s)
Computational Biology/methods , Markov Chains , Automated Pattern Recognition/methods , RNA/chemistry , RNA/genetics , Sequence Alignment