1.
Med Image Anal ; 95: 103162, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38593644

ABSTRACT

Active Learning (AL) has the potential to solve a major problem of digital pathology: the efficient acquisition of labeled data for machine learning algorithms. However, existing AL methods often struggle in realistic settings with artifacts, ambiguities, and class imbalances, as commonly seen in the medical field. The lack of precise uncertainty estimations leads to the acquisition of images with a low informative value. To address these challenges, we propose Focused Active Learning (FocAL), which combines a Bayesian Neural Network with Out-of-Distribution detection to estimate different uncertainties for the acquisition function. Specifically, the weighted epistemic uncertainty accounts for the class imbalance, aleatoric uncertainty for ambiguous images, and an OoD score for artifacts. We perform extensive experiments to validate our method on MNIST and the real-world Panda dataset for the classification of prostate cancer. The results confirm that other AL methods are 'distracted' by ambiguities and artifacts which harm the performance. FocAL effectively focuses on the most informative images, avoiding ambiguities and artifacts during acquisition. For both experiments, FocAL outperforms existing AL approaches, reaching a Cohen's kappa of 0.764 with only 0.69% of the labeled Panda data.


Subject(s)
Prostatic Neoplasms ; Humans ; Prostatic Neoplasms/diagnostic imaging ; Prostatic Neoplasms/pathology ; Male ; Machine Learning ; Bayes Theorem ; Algorithms ; Image Interpretation, Computer-Assisted/methods ; Artifacts ; Neural Networks, Computer
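
Illustrative sketch for entry 1 (not taken from the article): the abstract separates epistemic uncertainty, aleatoric uncertainty, and an OoD score. One common way to obtain the first two from a Bayesian neural network is to decompose the predictive entropy of several stochastic forward passes. The function names, the class-weighting rule, and the way the three terms are combined in `acquisition_score` are assumptions made for illustration, not the published FocAL formula.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mean_prediction(mc_probs):
    """Average the class probabilities over all stochastic forward passes."""
    t = len(mc_probs)
    return [sum(p[c] for p in mc_probs) / t for c in range(len(mc_probs[0]))]


def decompose_uncertainty(mc_probs):
    """Return (total, aleatoric, epistemic) for one image.

    total     = entropy of the mean prediction
    aleatoric = mean entropy of the individual predictions
    epistemic = total - aleatoric (the mutual information)
    """
    total = entropy(mean_prediction(mc_probs))
    aleatoric = sum(entropy(p) for p in mc_probs) / len(mc_probs)
    return total, aleatoric, total - aleatoric


def acquisition_score(mc_probs, class_weights, ood_score):
    """Hypothetical combination: favour class-weighted epistemic uncertainty,
    penalise ambiguity (aleatoric) and likely artifacts (OoD score)."""
    _, aleatoric, epistemic = decompose_uncertainty(mc_probs)
    mean = mean_prediction(mc_probs)
    weight = sum(w * m for w, m in zip(class_weights, mean))
    return weight * epistemic - aleatoric - ood_score
```

Under this toy scoring, an image on which the forward passes disagree confidently ranks above one on which every pass returns the same ambiguous prediction.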
2.
PLoS One ; 13(3): e0193067, 2018.
Article in English | MEDLINE | ID: mdl-29494600

ABSTRACT

Bulk tissue samples examined by gene expression studies are usually heterogeneous. The data gained from these samples display the confounding patterns of mixtures consisting of multiple cell types or similar cell types in various functional states, which hinders the elucidation of the molecular mechanisms underlying complex biological phenomena. A realistic approach to compensate for the limitations of experimentally separating homogenous cell populations from mixed tissues is to computationally identify cell-type specific patterns from bulk, heterogeneous measurements. We designed the CellDistinguisher algorithm to analyze the gene expression data of mixed samples, identifying genes that best distinguish biological processes and cell types. Coupled with a deconvolution algorithm that takes cell type specific gene lists as input, we show that CellDistinguisher performs as well as partial deconvolution algorithms in predicting cell type composition without the need for prior knowledge of cell type signatures. This approach is also better in predicting cell type signatures than the one-step traditional complete deconvolution methods. To illustrate its wide applicability, the algorithm was tested on multiple publicly available data sets. In each case, CellDistinguisher identified genes reflecting biological processes typical for the tissues and development stages of interest and estimated the sample compositions accurately.


Subject(s)
Gene Expression Profiling/methods ; Genomics/methods ; Algorithms ; Animals ; B-Lymphocytes/cytology ; B-Lymphocytes/metabolism ; Brain/cytology ; Brain/metabolism ; Gene Expression ; Humans ; Liver/cytology ; Liver/metabolism ; Lung/cytology ; Lung/metabolism ; Oligonucleotide Array Sequence Analysis/methods ; Rats ; Sequence Analysis, RNA/methods ; Yeasts/cytology ; Yeasts/genetics
3.
Nucleic Acids Res ; 38(Web Server issue): W268-74, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20435683

ABSTRACT

The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: (http://bayesweb.wadsworth.org/phyloscan/).


Subject(s)
Gene Expression Regulation ; Promoter Regions, Genetic ; Software ; Transcription, Genetic ; Binding Sites ; Internet ; Sequence Alignment ; Sequence Analysis, DNA ; Transcription Factors/metabolism ; User-Computer Interface
4.
BMC Bioinformatics ; 10: 212, 2009 Jul 09.
Article in English | MEDLINE | ID: mdl-19589158

ABSTRACT

BACKGROUND: Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold? RESULTS: Here we present a novel general approach to estimating these false positive and true positive rates that is significantly more efficient than are existing general approaches. We validate the technique via an implementation within the HMMER 3.0 package, which scans DNA or protein sequence databases for patterns of interest, using a profile-HMM. CONCLUSION: The new approach is faster than general naïve sampling approaches, and more general than other current approaches. It provides an efficient mechanism by which to estimate error statistics for hidden Markov model and hidden Boltzmann model results.


Subject(s)
Computational Biology/methods ; False Positive Reactions ; Markov Chains ; Sequence Analysis, DNA/methods ; Sequence Analysis, Protein/methods
5.
J Comput Biol ; 16(1): 1-18, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19119992

ABSTRACT

Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.


Subject(s)
Algorithms ; Base Sequence ; Computational Biology/methods ; Sequence Alignment/methods ; Animals ; Bayes Theorem ; Drosophila/classification ; Humans ; Mathematics ; Molecular Sequence Data ; Phylogeny ; Probability
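
Illustrative sketch for entry 5 (not taken from the article): when a score takes integer values and is built up stage by stage, its exact distribution can be accumulated with a dynamic programming recursion over scores. The example below does this for the simplest case, an integer-valued motif score summed over independent columns under an i.i.d. background model. It is a generic convolution recursion, not necessarily one of the three algorithms the abstract refers to, and the names `exact_score_distribution` and `p_value` are hypothetical.

```python
from collections import defaultdict


def exact_score_distribution(int_pwm, background):
    """Exact distribution of an integer motif score for a random sequence.

    int_pwm:    list of columns, each a dict mapping letter -> integer score
    background: dict mapping letter -> probability (i.i.d. background model)
    Returns a dict mapping each attainable score to its probability.
    """
    dist = {0: 1.0}  # before any column is read, the score is 0 with certainty
    for column in int_pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for letter, p_letter in background.items():
                nxt[score + column[letter]] += prob * p_letter
        dist = dict(nxt)
    return dist


def p_value(dist, threshold):
    """Probability that a random sequence scores at least `threshold`."""
    return sum(p for s, p in dist.items() if s >= threshold)
```

Because scores are integers, the table never holds more entries than the range of attainable scores, so the cost grows with motif width times score range rather than with the number of possible sequences.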
6.
J Comput Biol ; 15(9): 1187-94, 2008 Nov.
Article in English | MEDLINE | ID: mdl-18973434

ABSTRACT

Measuring the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well-chosen importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^(-1314). Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.


Subject(s)
Sequence Alignment/methods ; Amino Acid Sequence ; Base Sequence ; Computational Biology/methods ; Models, Statistical ; Probability
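
Illustrative sketch for entry 6 (not taken from the article): importance sampling estimates a very small tail probability by drawing from a proposal distribution under which the rare event is common, then reweighting each draw by the likelihood ratio. The toy model below, the number of matches in `n` fair coin flips, stands in for an alignment score. The biased-coin proposal with parameter `q` and the function name are assumptions chosen for brevity, not the sampling distribution used by the authors.

```python
import math
import random


def tail_probability_is(n, threshold, q, num_samples, rng):
    """Estimate P(K >= threshold) for K ~ Binomial(n, 0.5).

    Draws K from the proposal Binomial(n, q) and weights every draw that
    reaches the threshold by target probability / proposal probability.
    Returns (estimate, standard_error).
    """
    log_w_head = math.log(0.5 / q)          # log likelihood ratio per head
    log_w_tail = math.log(0.5 / (1.0 - q))  # log likelihood ratio per tail
    total = 0.0
    total_sq = 0.0
    for _ in range(num_samples):
        heads = sum(rng.random() < q for _ in range(n))
        if heads >= threshold:
            w = math.exp(heads * log_w_head + (n - heads) * log_w_tail)
            total += w
            total_sq += w * w
    mean = total / num_samples
    variance = max(total_sq / num_samples - mean * mean, 0.0)
    return mean, math.sqrt(variance / num_samples)


if __name__ == "__main__":
    estimate, std_err = tail_probability_is(20, 18, 0.9, 20000, random.Random(0))
    print(f"P(K >= 18) is approximately {estimate:.3e} +/- {std_err:.1e}")
```

Working with log weights is what makes it possible to report probabilities far below floating-point range in a real application; the toy case here is small enough that the weights can be exponentiated directly.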
7.
Bioinformatics ; 24(16): 1772-8, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18558620

ABSTRACT

MOTIVATION: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward-backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis. RESULTS: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10,000. AVAILABILITY: Sample C++-code for optimal backtrace is available in the Supplementary Materials. SUPPLEMENTARY INFORMATION: Supplementary data is available at Bioinformatics online.


Subject(s)
Algorithms ; Information Storage and Retrieval/methods ; Numerical Analysis, Computer-Assisted ; Sequence Alignment/methods ; Sequence Analysis/methods
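
Illustrative sketch for entry 7 (not taken from the article): the checkpointing idea described in the abstract is to keep only selected stages of the forward computation and recompute the rest during the backtrace. The example uses a simple row-by-row path recursion and evenly spaced checkpoints controlled by a hypothetical `stride` parameter. Even spacing is a baseline chosen for clarity; it is not the optimal strategy the article presents.

```python
def next_row(prev, costs):
    """One DP stage: best[j] = costs[j] + max(prev[j-1], prev[j], prev[j+1])."""
    m = len(prev)
    return [costs[j] + max(prev[max(j - 1, 0):min(j + 2, m)]) for j in range(m)]


def best_path_checkpointed(grid, stride):
    """Find a maximum-score top-to-bottom path, storing every `stride`-th DP row."""
    n = len(grid)
    checkpoints = {0: list(grid[0])}
    row = checkpoints[0]
    for i in range(1, n):  # forward pass keeps only the checkpoint rows
        row = next_row(row, grid[i])
        if i % stride == 0:
            checkpoints[i] = row
    best_score = max(row)
    j = row.index(best_score)
    path = [j]
    block = {}  # DP rows recomputed for the block currently being backtraced
    for i in range(n - 1, 0, -1):
        prev_i = i - 1
        if prev_i not in block:
            # Recompute rows from the nearest checkpoint at or before prev_i.
            start = (prev_i // stride) * stride
            block = {start: checkpoints[start]}
            for k in range(start + 1, prev_i + 1):
                block[k] = next_row(block[k - 1], grid[k])
        prev = block[prev_i]
        lo, hi = max(j - 1, 0), min(j + 2, len(prev))
        j = max(range(lo, hi), key=lambda c: prev[c])  # best predecessor column
        path.append(j)
    path.reverse()
    return best_score, path
```

With `stride` near the square root of the number of rows, this baseline holds roughly that many rows in memory at once, at the price of computing each row about twice.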
8.
Nucleic Acids Res ; 35(Web Server issue): W232-7, 2007 Jul.
Article in English | MEDLINE | ID: mdl-17483517

ABSTRACT

The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. The Gibbs Centroid Sampler reports a centroid alignment, i.e. an alignment that has the minimum total distance to the set of samples chosen from the a posteriori probability distribution of transcription factor binding-site alignments. In so doing, it garners information from the full ensemble of solutions, rather than only the single most probable point that is the target of many motif-finding algorithms, including its predecessor, the Gibbs Recursive Sampler. Centroid estimators have been shown to yield substantial improvements, in both sensitivity and positive predictive values, to the prediction of RNA secondary structure and motif finding. The Gibbs Centroid Sampler, along with interactive tutorials, an online user manual, and information on downloading the software, is available at: http://bayesweb.wadsworth.org/gibbs/gibbs.html.


Subject(s)
Computational Biology/methods ; Sequence Analysis, DNA/methods ; Software ; Transcription Factors/metabolism ; Algorithms ; Binding Sites ; Genetic Variation ; Internet ; Markov Chains ; Monte Carlo Method ; Repetitive Sequences, Nucleic Acid ; Statistics, Nonparametric ; Transcription Initiation Site ; User-Computer Interface
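
Illustrative sketch for entry 8 (not taken from the article): a centroid alignment minimises the total distance to a set of sampled alignments. If each sample is reduced to a set of predicted site positions and distance is the size of the symmetric difference, the unconstrained minimiser is simply the set of positions present in more than half of the samples. That representation, the lack of any overlap constraint between sites, and both function names are simplifying assumptions, not a description of the Gibbs Centroid Sampler's implementation.

```python
from collections import Counter


def centroid_sites(samples):
    """Return the site set with minimum total symmetric-difference distance.

    samples: list of sets, each holding the site start positions of one
    sampled alignment. A position belongs to the centroid when it occurs
    in more than half of the samples.
    """
    counts = Counter(pos for sample in samples for pos in sample)
    half = len(samples) / 2
    return {pos for pos, count in counts.items() if count > half}


def total_distance(candidate, samples):
    """Sum of symmetric-difference sizes between a candidate and every sample."""
    return sum(len(candidate ^ sample) for sample in samples)
```

For the samples `{1, 2}`, `{2, 3}` and `{1, 3}`, the centroid is `{1, 2, 3}`, which is not any single sample and is closer to the ensemble than each of them. This is the sense in which a centroid draws on the full ensemble rather than one most probable solution.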
9.
Bioinformatics ; 23(14): 1718-27, 2007 Jul 15.
Article in English | MEDLINE | ID: mdl-17488758

ABSTRACT

MOTIVATION: Identification of functionally conserved regulatory elements in sequence data from closely related organisms is becoming feasible, due to the rapid growth of public sequence databases. Closely related organisms are most likely to have common regulatory motifs; however, the recent speciation of such organisms results in a high degree of correlation in their genome sequences, confounding the detection of functional elements. Additionally, alignment algorithms that use optimization techniques are limited to the detection of a single alignment that may not be representative. Comparative-genomics studies must be able to address the phylogenetic correlation in the data and efficiently explore the alignment space, in order to make specific and biologically relevant predictions. RESULTS: We describe here a Gibbs sampler that employs a full phylogenetic model and reports an ensemble centroid solution. We describe regulatory motif detection using both simulated and real data, and demonstrate that this approach achieves improved specificity, sensitivity, and positive predictive value over non-phylogenetic algorithms, and over phylogenetic algorithms that report a maximum likelihood solution. AVAILABILITY: The software is freely available at http://bayesweb.wadsworth.org/gibbs/gibbs.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms ; Computational Biology/methods ; Animals ; Automation ; Conserved Sequence ; Genome ; Genome, Bacterial ; Genomics ; Humans ; Mice ; Models, Statistical ; Phylogeny ; Promoter Regions, Genetic ; Regulatory Sequences, Nucleic Acid
10.
Algorithms Mol Biol ; 2: 1, 2007 Jan 23.
Article in English | MEDLINE | ID: mdl-17244358

ABSTRACT

BACKGROUND: When transcription factor binding sites are known for a particular transcription factor, it is possible to construct a motif model that can be used to scan sequences for additional sites. However, few statistically significant sites are revealed when a transcription factor binding site motif model is used to scan a genome-scale database. METHODS: We have developed a scanning algorithm, PhyloScan, which combines evidence from matching sites found in orthologous data from several related species with evidence from multiple sites within an intergenic region, to better detect regulons. The orthologous sequence data may be multiply aligned, unaligned, or a combination of aligned and unaligned. In aligned data, PhyloScan statistically accounts for the phylogenetic dependence of the species contributing data to the alignment and, in unaligned data, the evidence for sites is combined assuming phylogenetic independence of the species. The statistical significance of the gene predictions is calculated directly, without employing training sets. RESULTS: In a test of our methodology on synthetic data modeled on seven Enterobacteriales, four Vibrionales, and three Pasteurellales species, PhyloScan produces better sensitivity and specificity than MONKEY, an advanced scanning approach that also searches a genome for transcription factor binding sites using phylogenetic information. The application of the algorithm to real sequence data from seven Enterobacteriales species identifies novel Crp and PurR transcription factor binding sites, thus providing several new potential sites for these transcription factors. These sites enable targeted experimental validation and thus further delineation of the Crp and PurR regulons in E. coli. CONCLUSION: Better sensitivity and specificity can be achieved through a combination of (1) using mixed alignable and non-alignable sequence data and (2) combining evidence from multiple sites within an intergenic region.

11.
Stat Appl Genet Mol Biol ; 4: Article 13, 2005.
Article in English | MEDLINE | ID: mdl-16646830

ABSTRACT

Approaches that use sequence weights to construct a position weight matrix of nucleotides from aligned inputs are popular, but little effort has been expended to measure their quality. We derive optimal sequence weights that minimize the sum of the variances of the estimators of base frequency parameters for sequences related by a phylogenetic tree. Using these, we find that approaches based upon sequence weights can perform very poorly in comparison to approaches based upon a theoretically optimal maximum-likelihood method in the inference of the parameters of a position-weight matrix. Specifically, we find that among a collection of primate sequences, even an optimal sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in inferences of base frequency parameters. We also show how to employ the variance estimators to obtain a greedy ordering of species for sequencing. Application of this ordering for the weighted estimators to a primate collection yields a curve with a long plateau that is not observed with maximum-likelihood estimators. This plateau indicates that the use of weighted estimators on these data seriously limits the utility of obtaining the sequences of more than two or three additional species.

12.
Stat Appl Genet Mol Biol ; 3: Article 23, 2004.
Article in English | MEDLINE | ID: mdl-16646802

ABSTRACT

Under the assumption that a significant motivation for sequencing the genomes of mammals is the resulting ability to help us locate and characterize functional DNA segments shared with humans, we have developed a statistical analysis to quantify the expected advantage. Examining uncertainty in terms of the width of a confidence interval, we show that uncertainty in the rate of nucleotide mutation can be shrunk by a factor of nearly four when nine mammals (human, chimpanzee, baboon, cat, dog, cow, pig, rat, mouse) are used instead of just two (human and mouse). In contrast, we show confidence interval shrinkage by a factor of only 1.5 for measurements of the distribution of nucleotides at an aligned sequence site. These additional genomes should greatly help in identifying conserved DNA sites, but would be much less effective at precisely describing the expected pattern of nucleotides at those sites.

...