Results 1 - 7 of 7
1.
IEEE/ACM Trans Comput Biol Bioinform ; 17(4): 1222-1230, 2020.
Article in English | MEDLINE | ID: mdl-30507538

ABSTRACT

Advances in modern genomics have allowed researchers to apply phylogenetic analyses on a genome-wide scale. While large volumes of genomic data can be generated cheaply and quickly, data missingness is a non-trivial and somewhat expected problem. Since the available information is often incomplete for a given set of genetic loci and individual organisms, a large proportion of trees that depict the evolutionary history of a single genetic locus, called gene trees, fail to contain all individuals. Data incompleteness causes difficulties in data collection, information extraction, and gene tree inference. Furthermore, identifying outlying gene trees, which can represent horizontal gene transfers, gene duplications, or hybridizations, is difficult when data is missing from the gene trees. The typical approach is to remove all individuals with missing data from the gene trees, and focus the analysis on individuals whose information is fully available - a huge loss of information. In this work, we propose and design an optimization-based imputation approach to infer the missing distances between leaves in a set of gene trees via a mixed integer non-linear programming model. We also present a new research pipeline, imPhy, that can (i) simulate a set of gene trees with leaves randomly missing in each tree, (ii) impute the missing pairwise distances in each gene tree, (iii) reconstruct the gene trees using the Neighbor Joining (NJ) and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) methods, and (iv) analyze and report the efficiency of the reconstruction. To impute the missing leaves, we employ our newly proposed non-linear programming framework, and demonstrate its capability in reconstructing gene trees with incomplete information in both simulated and empirical datasets. In the empirical datasets apicomplexa and lungfish, our imputation has very small normalized mean square errors, even in the extreme case where 50 percent of the individuals in each gene tree are missing. Data, software, and user manuals can be found at https://github.com/yasuiniko/imPhy.
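To make the impute-then-reconstruct idea concrete, the following Python sketch fills missing pairwise distances and feeds the completed matrix to UPGMA. It is not the paper's mixed integer non-linear program (see the imPhy repository for that); as a stand-in objective, the missing entries are chosen to minimize triangle-inequality violations, and the 4-leaf distance matrix is a made-up toy.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.cluster.hierarchy import average, to_tree
from scipy.spatial.distance import squareform

def impute_distances(D, observed):
    """Fill the missing entries of a symmetric distance matrix D.

    Missing entries (observed == False) are chosen to minimize the total
    squared violation of the triangle inequality, a simple stand-in for
    the tree-based constraints of the actual MINLP model.
    """
    n = D.shape[0]
    miss = [(i, j) for i in range(n) for j in range(i + 1, n)
            if not observed[i, j]]

    def fill(x):
        M = D.copy()
        for (i, j), v in zip(miss, x):
            M[i, j] = M[j, i] = v
        return M

    def violation(x):
        M = fill(x)
        t = 0.0
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    t += max(0.0, M[i, j] - M[i, k] - M[k, j]) ** 2
        return t

    x0 = np.full(len(miss), D[observed].mean())
    res = minimize(violation, x0, method="L-BFGS-B",
                   bounds=[(1e-6, None)] * len(miss))
    return fill(res.x)

# Toy example: a 4-leaf gene tree's distance matrix with d(2,3) missing.
D = np.array([[0., 2., 4., 4.],
              [2., 0., 4., 4.],
              [4., 4., 0., 0.],   # the 0 at [2, 3] is just a placeholder
              [4., 4., 0., 0.]])
observed = np.ones((4, 4), dtype=bool)
observed[2, 3] = observed[3, 2] = False

D_full = impute_distances(D, observed)
print(np.round(D_full, 2))
tree = to_tree(average(squareform(D_full)))  # UPGMA reconstruction (step iii)
```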


Subject(s)
Computational Biology/methods; Evolution, Molecular; Phylogeny; Software; Algorithms; Animals; Databases, Genetic; Gene Transfer, Horizontal/genetics; Models, Genetic; Nonlinear Dynamics
2.
Neurocomputing (Amst) ; 304: 12-29, 2018 Aug 23.
Article in English | MEDLINE | ID: mdl-30416263

ABSTRACT

Many unsupervised kernel methods rely on the estimation of the kernel covariance operator (kernel CO) or the kernel cross-covariance operator (kernel CCO). Both are sensitive to contaminated data, even when bounded positive definite kernels are used. To the best of our knowledge, there are few well-founded robust kernel methods for statistical unsupervised learning. In addition, while the influence function (IF) of an estimator can characterize its robustness, asymptotic properties, and standard error, the IF of standard kernel canonical correlation analysis (standard kernel CCA) has not yet been derived. To fill this gap, we first propose a robust kernel covariance operator (robust kernel CO) and a robust kernel cross-covariance operator (robust kernel CCO) based on a generalized loss function instead of the quadratic loss function. Second, we derive the IF of the robust kernel CCO and of standard kernel CCA. Using the IF of standard kernel CCA, we can detect influential observations in two sets of data. Finally, we propose a method based on the robust kernel CO and the robust kernel CCO, called robust kernel CCA, which is less sensitive to noise than standard kernel CCA. The introduced principles can also be applied to many other kernel methods involving kernel CO or kernel CCO. Our experiments on both synthesized and imaging genetics data demonstrate that the proposed IF of standard kernel CCA can identify outliers, and that the proposed robust kernel CCA method performs better than standard kernel CCA on both ideal and contaminated data.
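For reference, here is a minimal sketch of the standard kernel CCA baseline that the paper robustifies, using Gaussian kernels and Tikhonov regularization. The robust variant's M-estimation of the covariance operators under a generalized loss is not reproduced, and the bandwidth and regularization values are illustrative choices.

```python
import numpy as np

def gram(X, sigma):
    """Gaussian RBF Gram matrix."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_first_correlation(X, Y, sigma=1.0, eps=1e-3):
    """Largest kernel canonical correlation between paired samples X, Y."""
    n = X.shape[0]
    Kx, Ky = center(gram(X, sigma)), center(gram(Y, sigma))
    Rx = Kx + n * eps * np.eye(n)   # regularized Gram matrices
    Ry = Ky + n * eps * np.eye(n)
    # Standard KCCA as an eigenproblem on Gram matrices: the eigenvalues
    # of Rx^{-1} Kx Ry^{-1} Ky estimate the squared canonical correlations.
    M = np.linalg.solve(Rx, Kx) @ np.linalg.solve(Ry, Ky)
    eig = np.linalg.eigvals(M)
    return float(np.sqrt(np.max(eig.real.clip(0, 1))))

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 1))
X = Z + 0.1 * rng.normal(size=(200, 1))
Y = Z**2 + 0.1 * rng.normal(size=(200, 1))   # nonlinear dependence
print(kcca_first_correlation(X, Y))          # close to 1 for dependent data
```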

3.
Article in English | MEDLINE | ID: mdl-27076458

ABSTRACT

Recent years have witnessed a surge of biological interest in the minimum spanning tree (MST) problem for its relevance to automatic model construction using the distances between data points. Despite the increasing use of MST algorithms for this purpose, the goodness-of-fit of an MST to the data is often elusive because no quantitative criteria have been developed to measure it. Motivated by this, we provide a necessary and sufficient condition to ensure that a metric space on n points can be represented by a fully labeled tree on n vertices, and thereby determine when an MST preserves all pairwise distances between points in a finite metric space.
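The question has a direct computational counterpart: build the MST and test whether path lengths along the tree reproduce every pairwise distance. Below is a small SciPy-based check; the point sets are made up, and the function is only a brute-force test, not the paper's characterization.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

def mst_preserves_metric(D, tol=1e-9):
    """True iff path distances in the MST of D reproduce D exactly."""
    T = minimum_spanning_tree(D)          # sparse, edges stored one-way
    sym = T.toarray()
    sym = sym + sym.T                     # symmetrize the tree edges
    path = shortest_path(sym, method="D") # distances along the tree
    return bool(np.all(np.abs(path - D) <= tol))

# A path metric: 4 points on a line at 0, 1, 3, 6. The MST is the line
# itself, so every pairwise distance is preserved.
pts = np.array([0., 1., 3., 6.])
D_line = np.abs(pts[:, None] - pts[None, :])
print(mst_preserves_metric(D_line))   # True

# A generic Euclidean metric in the plane, which an MST cannot preserve.
rng = np.random.default_rng(1)
P = rng.normal(size=(5, 2))
D_eucl = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
print(mst_preserves_metric(D_eucl))   # almost surely False
```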


Subject(s)
Algorithms; Cell Differentiation/physiology; Computational Biology/methods; Models, Biological; Hematopoietic Stem Cells/cytology; Hematopoietic Stem Cells/physiology
4.
Neural Comput ; 28(2): 382-444, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26654205

ABSTRACT

This letter addresses the problem of filtering with a state-space model. Standard approaches for filtering assume that a probabilistic model for observations (i.e., the observation model) is given explicitly or at least parametrically. We consider a setting where this assumption is not satisfied; we assume that the knowledge of the observation model is provided only by examples of state-observation pairs. This setting is important and appears when state variables are defined as quantities that are very different from the observations. We propose kernel Monte Carlo filter, a novel filtering method that is focused on this setting. Our approach is based on the framework of kernel mean embeddings, which enables nonparametric posterior inference using the state-observation examples. The proposed method represents state distributions as weighted samples, propagates these samples by sampling, estimates the state posteriors by kernel Bayes' rule, and resamples by kernel herding. In particular, the sampling and resampling procedures are novel in being expressed using kernel mean embeddings, so we theoretically analyze their behaviors. We reveal the following properties, which are similar to those of corresponding procedures in particle methods: the performance of sampling can degrade if the effective sample size of a weighted sample is small, and resampling improves the sampling performance by increasing the effective sample size. We first demonstrate these theoretical findings by synthetic experiments. Then we show the effectiveness of the proposed filter by artificial and real data experiments, which include vision-based mobile robot localization.
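Two ingredients named above, the effective sample size of a weighted sample and kernel herding as a resampling step, are easy to sketch. The following is a toy illustration, not the full filter; the Gaussian kernel, its bandwidth, and the candidate set (the sample points themselves) are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def ess(w):
    """Effective sample size of normalized weights w."""
    return 1.0 / np.sum(w**2)

def herding_resample(X, w, m, sigma=1.0):
    """Greedy kernel herding: pick m points whose empirical embedding
    approximates the weighted embedding sum_i w_i k(., x_i)."""
    K = rbf(X, X, sigma)
    mu = K @ w                        # mu(x_j) at each candidate x_j
    picked, acc = [], np.zeros(len(X))
    for t in range(m):
        scores = mu - acc / (t + 1)   # herding objective at step t + 1
        j = int(np.argmax(scores))
        picked.append(j)
        acc += K[:, j]
    return X[picked]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
w = np.exp(-((X[:, 0] - 1.0)**2))     # unnormalized posterior-like weights
w /= w.sum()
print("ESS:", ess(w))                 # small ESS signals weight degeneracy
Xr = herding_resample(X, w, m=50)
print("weighted mean:", float(X[:, 0] @ w), "herded mean:", float(Xr.mean()))
```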


Subject(s)
Models, Theoretical; Monte Carlo Method; Signal Processing, Computer-Assisted; Algorithms; Computer Simulation; Humans; Probability
5.
Stat Appl Genet Mol Biol ; 12(6): 667-78, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24150124

ABSTRACT

Approximate Bayesian computation (ABC) is a likelihood-free approach to Bayesian inference based on a rejection algorithm that applies a tolerance of dissimilarity between summary statistics computed from observed and simulated data. Although several improvements to the algorithm have been proposed, none avoids the following two sources of approximation: 1) lack of sufficient statistics: sampling is not from the true posterior density given the data but from an approximate posterior density given summary statistics; and 2) non-zero tolerance: sampling from the posterior density given summary statistics is achieved only in the limit of zero tolerance. The first source of approximation can be reduced by adding summary statistics, but an increase in the number of summary statistics can introduce additional variance because of the resulting low acceptance rate. Consequently, many researchers have attempted to develop techniques for choosing informative summary statistics. The present study evaluated the utility of a kernel-based ABC method [Fukumizu, K., L. Song and A. Gretton (2010): "Kernel Bayes' rule: Bayesian inference with positive definite kernels," arXiv:1009.5736; Fukumizu, K., L. Song and A. Gretton (2011): "Kernel Bayes' rule," Advances in Neural Information Processing Systems (NIPS) 24, pp. 1549-1557] for complex problems that demand many summary statistics. Specifically, kernel ABC was applied to population genetic inference. We demonstrate that, in contrast to conventional ABC, kernel ABC can incorporate a large number of summary statistics while maintaining high performance of the inference.
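A minimal sketch of the kernel ABC weighting scheme, on a toy problem (inferring a normal mean) rather than the population-genetic setting of the paper. The bandwidth heuristic and the regularization constant delta are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

theta = rng.uniform(-3, 3, size=n)                 # draws from the prior
sims = rng.normal(theta[:, None], 1.0, size=(n, 50))
S = np.column_stack([sims.mean(1), sims.std(1)])   # summary statistics

obs = rng.normal(1.5, 1.0, size=50)                # "observed" data
s_obs = np.array([obs.mean(), obs.std()])

sigma = np.median(np.abs(S - np.median(S, axis=0)))   # crude bandwidth
G = np.exp(-((S[:, None, :] - S[None, :, :])**2).sum(-1) / (2 * sigma**2))
k_obs = np.exp(-((S - s_obs)**2).sum(1) / (2 * sigma**2))

# Kernel ABC weights: (G + n*delta*I)^{-1} k(s_obs). No rejection
# tolerance is involved; delta is a regularization constant. The weights
# can be negative, so normalizing by their sum is a pragmatic choice.
delta = 0.01
w = np.linalg.solve(G + n * delta * np.eye(n), k_obs)
print("posterior mean estimate:", float(theta @ w / w.sum()))   # near 1.5
```

Note how adding more columns to S leaves the algorithm unchanged: only the Gram matrix computation grows, with no drop in acceptance rate, which is the point of the kernel approach.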


Subject(s)
Computer Simulation; Models, Genetic; Algorithms; Bayes Theorem; Data Interpretation, Statistical; Genetics, Population; Humans; Linear Models; Software
6.
Neural Netw ; 21(1): 48-58, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18206348

ABSTRACT

This paper investigates the relation between over-fitting and weight size in neural network regression, focusing on the over-fitting of a network to Gaussian noise. Using re-parametrization, a network function is represented as a bounded function g multiplied by a coefficient c. The squared sum of the outputs of g at the given inputs is assumed to be bounded from below by a positive constant delta(n); this assumption restricts the weight size of the network and enables a probabilistic upper bound on the degree of over-fitting to be derived. The analysis reveals that the order of this upper bound can change depending on delta(n). By applying the bound to the over-fitting behavior of a single Gaussian unit, it is shown that when the sample size is large, the probability of obtaining an extremely small value for the width parameter during training is close to one.
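The claim about a single Gaussian unit can be probed numerically. The sketch below fits c * exp(-(x - m)^2 / (2 s^2)) to pure noise and reports the fitted width; the sample size, optimizer, and restart count are arbitrary choices, and a local optimizer is not guaranteed to find the small-width solution on any given run.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = rng.normal(0.0, 1.0, size=n)       # pure Gaussian noise as targets

def loss(p):
    c, m, log_s = p                    # width parametrized as exp(log_s)
    g = np.exp(-(x - m)**2 / (2 * np.exp(2 * log_s)))
    return np.mean((y - c * g)**2)

# The spike solution (tiny width, unit centred on one sample) lowers the
# training loss; whether a local optimizer finds it depends on the
# starting point, hence the random restarts.
best = min((minimize(loss, rng.normal(size=3), method="Nelder-Mead")
            for _ in range(30)), key=lambda r: r.fun)
print("fitted width:", float(np.exp(best.x[2])))
```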


Subject(s)
Body Size; Neural Networks, Computer; Regression (Psychology); Humans; Likelihood Functions
7.
Neural Netw ; 9(5): 871-879, 1996 Jul.
Article in English | MEDLINE | ID: mdl-12662569

ABSTRACT

The Fisher information matrix of a multi-layer perceptron network can be singular at certain parameters, and in such cases many statistical techniques based on asymptotic theory cannot be applied properly. In this paper, we prove rigorously that the Fisher information matrix of a three-layer perceptron network is positive definite if and only if the network is irreducible; that is, if there is no hidden unit that makes no contribution to the output and there is no pair of hidden units that could be collapsed to a single unit without altering the input-output map. This implies that a network that has a singular Fisher information matrix can be reduced to a network with a positive definite Fisher information matrix by eliminating redundant hidden units.
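The theorem invites a quick numerical check: estimate the Fisher information matrix of a small three-layer perceptron (here under a Gaussian noise model, where it is proportional to the expected outer product of the output gradient) and compare an irreducible network against one with a duplicated hidden unit. The network sizes and parameter values below are made up.

```python
import numpy as np

def fim(V, W, B, xs):
    """Monte Carlo Fisher information of f(x) = sum_i V[i]*tanh(W[i]*x + B[i]),
    with parameters ordered as (V, W, B)."""
    n_h = len(V)
    F = np.zeros((3 * n_h, 3 * n_h))
    for x in xs:
        h = np.tanh(W * x + B)
        dh = 1 - h**2                                 # tanh'
        g = np.concatenate([h, V * dh * x, V * dh])   # df/dV, df/dW, df/dB
        F += np.outer(g, g)
    return F / len(xs)

rng = np.random.default_rng(0)
xs = rng.normal(size=5000)

# Irreducible network: two distinct, contributing hidden units.
V, W, B = np.array([1.0, -0.5]), np.array([0.8, 2.0]), np.array([0.1, -0.3])
irreducible = fim(V, W, B, xs)

# Reducible network: unit 3 duplicates unit 2, so they can be collapsed.
V2 = np.array([1.0, -0.5, 0.7])
W2 = np.array([0.8, 2.0, 2.0])
B2 = np.array([0.1, -0.3, -0.3])
reducible = fim(V2, W2, B2, xs)

print("min eig, irreducible:", np.linalg.eigvalsh(irreducible).min())  # > 0
print("min eig, reducible:  ", np.linalg.eigvalsh(reducible).min())    # ~ 0
```

The duplicated unit makes the gradients with respect to the two output weights identical, so the estimated matrix is singular, exactly as the irreducibility condition predicts.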
