Results 1 - 18 of 18
1.
Proc Natl Acad Sci U S A ; 120(39): e2303904120, 2023 Sep 26.
Article in English | MEDLINE | ID: mdl-37722063

ABSTRACT

Partial differential equations (PDE) learning is an emerging field that combines physics and machine learning to recover unknown physical systems from experimental data. While deep learning models traditionally require copious amounts of training data, recent PDE learning techniques achieve spectacular results with limited data availability. Still, these results are empirical. Our work provides theoretical guarantees on the number of input-output training pairs required in PDE learning. Specifically, we exploit randomized numerical linear algebra and PDE theory to derive a provably data-efficient algorithm that recovers solution operators of three-dimensional uniformly elliptic PDEs from input-output data and achieves an exponential convergence rate of the error with respect to the size of the training dataset with an exceptionally high probability of success.
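
As a toy illustration of learning an operator from input-output pairs (and not the paper's randomized-numerical-linear-algebra algorithm), the sketch below recovers a synthetic discretized linear solution operator by least squares from random forcing/solution pairs; the grid size, pair counts, and the `apply_operator` stand-in are all assumptions.

```python
# Simplified illustration (not the paper's algorithm): recover a discretized linear
# solution operator A (mapping a forcing f to a solution u) from random input-output
# pairs via least squares. A synthetic matrix stands in for the true operator; a real
# experiment would replace `apply_operator` with a PDE solver or lab measurements.
import numpy as np

rng = np.random.default_rng(0)
n_grid = 50                                   # size of the discretized domain
A_true = rng.standard_normal((n_grid, n_grid)) / np.sqrt(n_grid)

def apply_operator(f):
    """Stand-in for solving the PDE with forcing f and observing the solution."""
    return A_true @ f

for n_pairs in (10, 25, 50, 100):
    F = rng.standard_normal((n_grid, n_pairs))             # random forcings (inputs)
    U = np.column_stack([apply_operator(f) for f in F.T])  # observed solutions
    A_hat = U @ np.linalg.pinv(F)                          # least-squares operator fit
    err = np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true)
    print(f"{n_pairs:4d} training pairs -> relative operator error {err:.3f}")
```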

2.
Proc Natl Acad Sci U S A ; 119(31): e2202116119, 2022 Aug 02.
Article in English | MEDLINE | ID: mdl-35901210

ABSTRACT

A/B testing is widely used to tune search and recommendation algorithms, to compare product variants as efficiently and effectively as possible, and even to study animal behavior. With ongoing investment, due to diminishing returns, the items produced by the new alternative B show smaller and smaller improvements in quality over the items produced by the current system A. By formalizing this observation, we develop closed-form analytical expressions for the sample efficiency of a number of widely used families of slate-based comparison tests. In empirical trials, these theoretical sample complexity results are shown to be predictive of real-world testing efficiency outcomes. These findings offer opportunities for both more cost-effective testing and a better analytical understanding of the problem.
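
For a rough sense of why shrinking improvements make testing expensive, here is the textbook two-sample power calculation (not the paper's slate-based closed forms): the required sample size per arm grows like 1/δ² as the quality gap δ between B and A shrinks. The function name and default settings are illustrative.

```python
# Textbook two-sample power calculation (not the paper's slate-based formulas): the
# number of users per arm grows roughly as 1/delta^2, so ever smaller improvements
# of B over A make the comparison dramatically more expensive.
from math import ceil
from statistics import NormalDist

def samples_per_arm(delta, sigma, alpha=0.05, power=0.8):
    """Approximate n per arm to detect a mean difference `delta` with noise std `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

for delta in (0.10, 0.05, 0.01):          # ever smaller quality improvements
    print(f"effect {delta:.2f} -> {samples_per_arm(delta, sigma=1.0):,} samples per arm")
```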

3.
J Math Biol ; 85(3): 22, 2022 08 17.
Article in English | MEDLINE | ID: mdl-35976512

ABSTRACT

Summary methods seek to infer a species tree from a set of gene trees. A desirable property of such methods is statistical consistency; that is, the probability of inferring the wrong species tree (the error probability) tends to 0 as the number of input gene trees becomes large. A popular paradigm is to infer a species tree that agrees with the maximum number of quartets from the input set of gene trees; this approach has been proved to be statistically consistent under several models of gene evolution. In this paper, we study the asymptotic behaviour of the error probability of such methods in this limit and show that it decays exponentially. For a 4-taxon species tree, we derive a closed form for the asymptotic behaviour in terms of the probability that the gene evolution process produces the correct topology. We also derive bounds on the sample complexity (the number of gene trees required to infer the true species tree with a given probability), which outperform existing bounds. We then extend our results to bounds on the asymptotic behaviour of the error probability for any species tree, and compare these to the true error probability for some model species trees using simulations.
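
A hedged Monte Carlo sketch of the 4-taxon setting described above: if each gene tree yields the correct quartet with probability p > 1/3 and the plurality topology is returned, the empirical error probability decays roughly exponentially in the number of gene trees. The value of p and the trial counts are illustrative, not taken from the paper.

```python
# Monte Carlo sketch of the 4-taxon case: each gene tree yields the correct quartet
# topology with probability p (> 1/3) and each wrong topology with probability
# (1 - p)/2; the species tree estimate is the plurality topology, and ties count as
# errors. The estimated error decays roughly exponentially in the number of gene trees.
import numpy as np

rng = np.random.default_rng(1)
p = 0.4                                    # per-gene probability of the correct quartet
probs = [p, (1 - p) / 2, (1 - p) / 2]      # topology 0 is the true one

def error_probability(n_genes, n_trials=20000):
    counts = rng.multinomial(n_genes, probs, size=n_trials)
    # Error whenever the true topology is not the unique plurality winner.
    return (counts[:, 0] <= counts[:, 1:].max(axis=1)).mean()

for n in (25, 50, 100, 200, 400):
    print(f"{n:4d} gene trees -> estimated error {error_probability(n):.4f}")
```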


Subjects
Molecular Evolution, Genetic Models, Genetic Speciation, Phylogeny, Probability
4.
Neuroimage ; 155: 549-564, 2017 07 15.
Article in English | MEDLINE | ID: mdl-28456584

ABSTRACT

Neuroscience is undergoing faster changes than ever before. For over 100 years, our field qualitatively described and invasively manipulated single or a few organisms to gain anatomical, physiological, and pharmacological insights. In the last 10 years, neuroscience has spawned quantitative datasets of unprecedented breadth (e.g., microanatomy, synaptic connections, and optogenetic brain-behavior assays) and size (e.g., cognition, brain imaging, and genetics). While growing data availability and information granularity have been amply discussed, we direct attention to a less explored question: How will the unprecedented data richness shape data analysis practices? Statistical reasoning is becoming more important for distilling neurobiological knowledge from healthy and pathological brain measurements. We argue that large-scale data analysis will use more statistical models that are non-parametric and generative and that mix frequentist and Bayesian aspects, while supplementing classical hypothesis testing with out-of-sample predictions.
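
As a minimal illustration of the closing point (supplementing hypothesis testing with out-of-sample prediction), the sketch below contrasts a t-test with cross-validated classification accuracy on synthetic two-group data; it assumes scikit-learn and SciPy are available and is not tied to any particular neuroimaging pipeline.

```python
# Minimal sketch of "hypothesis testing vs. out-of-sample prediction" on synthetic
# data: a t-test asks whether group means differ, while cross-validated accuracy asks
# how well unseen subjects can be classified from the same measurements.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_group, n_features = 60, 20
patients = rng.normal(0.3, 1.0, (n_per_group, n_features))   # mean-shifted group
controls = rng.normal(0.0, 1.0, (n_per_group, n_features))

X = np.vstack([patients, controls])
y = np.array([1] * n_per_group + [0] * n_per_group)

t_stat, p_value = ttest_ind(patients[:, 0], controls[:, 0])   # classical in-sample test
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print(f"t-test on one feature: p = {p_value:.3f}")
print(f"5-fold out-of-sample accuracy: {acc:.2f}")
```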


Subjects
Statistical Data Interpretation, Datasets as Topic/trends, Theoretical Models, Neurosciences/trends, Humans
5.
Proc IEEE Inst Electr Electron Eng ; 104(1): 93-110, 2016 Jan.
Article in English | MEDLINE | ID: mdl-27087700

ABSTRACT

When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far smaller than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of the recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the latter regime applies to exascale data dimensions. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
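
A small sketch of the sample-starved regime discussed above, under a null model with no true correlations: with n fixed and p growing, some sample correlations exceed even a high threshold purely by chance, which is why sample-complexity analysis of the n-fixed, p-growing regime matters. The threshold and dimensions below are illustrative assumptions, not values from the paper.

```python
# Correlation screening with few replicates and many variables, under a null model
# with no true correlations: spurious large sample correlations appear by chance,
# and their number grows with p at fixed n.
import numpy as np

rng = np.random.default_rng(0)
n, threshold = 20, 0.8                    # few replicates, high screening threshold

for p in (100, 500, 2000):
    X = rng.standard_normal((n, p))       # independent variables: no real correlations
    C = np.corrcoef(X, rowvar=False)      # p x p sample correlation matrix
    iu = np.triu_indices(p, k=1)          # upper triangle: each pair counted once
    false_hits = int((np.abs(C[iu]) > threshold).sum())
    print(f"p = {p:5d}: {false_hits} variable pairs with |corr| > {threshold}")
```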

6.
Sensors (Basel) ; 15(7): 16105-35, 2015 Jul 03.
Article in English | MEDLINE | ID: mdl-26151216

ABSTRACT

The most commonly used spectrum sensing techniques in cognitive radio (CR) networks, such as the energy detector (ED), the matched filter (MF), and others, suffer from noise uncertainty and the signal-to-noise ratio (SNR) wall phenomenon. These detectors cannot achieve the required signal detection performance regardless of the sensing time. In this paper, we explore a signal processing scheme, namely the generalized detector (GD), constructed based on the generalized approach to signal processing (GASP) in noise, for spectrum sensing in antenna-array-based CR networks, with the aim of alleviating the SNR wall problem and improving signal detection robustness at low SNR. The simulation results confirm our theoretical analysis and the effectiveness of implementing the GD in antenna-array-based CR networks.
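
For orientation, here is a sketch of a plain energy detector (not the generalized detector built from GASP) showing how a small noise-power uncertainty collapses detection at low SNR; the SNR, window length, and false-alarm target are illustrative.

```python
# Plain energy detector sketch: the test statistic is the average received power,
# compared against a threshold set for a ~1% false-alarm rate under the *assumed*
# noise power (Gaussian approximation). A 0.5 dB error in the assumed noise power
# pushes low-SNR signals below the threshold.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 5000                    # samples per sensing window, Monte Carlo runs
snr_linear = 10 ** (-10 / 10)             # -10 dB signal-to-noise ratio
noise_power = 1.0

def detection_probability(assumed_noise_power):
    thr = assumed_noise_power * (1 + 2.326 * np.sqrt(2.0 / N))   # ~1% false alarm
    noise = rng.normal(0, np.sqrt(noise_power), (trials, N))
    signal = rng.normal(0, np.sqrt(snr_linear * noise_power), (trials, N))
    energy = ((noise + signal) ** 2).mean(axis=1)                # average power
    return (energy > thr).mean()                                 # empirical P_d

print("P_d, perfect noise knowledge :", detection_probability(noise_power))
print("P_d, 0.5 dB noise uncertainty:", detection_probability(noise_power * 10 ** 0.05))
```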

7.
Neural Netw ; 169: 462-474, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37939535

ABSTRACT

We study the generalization capacity of group convolutional neural networks. We identify precise estimates for the VC dimensions of simple sets of group convolutional neural networks. In particular, we find that for infinite groups and appropriately chosen convolutional kernels, already two-parameter families of convolutional neural networks have an infinite VC dimension, despite being invariant to the action of an infinite group.
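
For context, the classical VC-type generalization bound (standard textbook form with an unspecified universal constant, not the paper's estimates) shows why an infinite VC dimension voids this kind of guarantee.

```latex
% Classical VC-type generalization bound (standard form; not the paper's estimates):
% with probability at least 1 - \delta over an i.i.d. sample of size n, every
% classifier f in a class \mathcal{H} of VC dimension d satisfies
\[
  R(f) \;\le\; \widehat{R}_n(f) + C \sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}} ,
\]
% for a universal constant C, where R is the true risk and \widehat{R}_n the
% empirical risk. When d = \infty, as for the two-parameter group-convolutional
% families identified above, the bound is vacuous.
```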


Assuntos
Algoritmos , Redes Neurais de Computação
8.
Neural Netw ; 167: 445-449, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37673030

ABSTRACT

The statistical supervised learning framework assumes an input-output set with a joint probability distribution that is reliably represented by the training dataset. The learning system is then required to output a prediction rule learned from the training dataset's input-output pairs. In this work, we investigate the relationship between the sample complexity, the empirical risk and the generalization error based on the asymptotic equipartition property (AEP) (Shannon, 1948). We provide theoretical guarantees for reliable learning under the information-theoretic AEP, with respect to the generalization error and the sample size in different settings.
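
The AEP referenced above can be stated in its standard form (Shannon, 1948), which is the concentration phenomenon such sample-size statements build on.

```latex
% The asymptotic equipartition property (Shannon, 1948): for X_1, X_2, \ldots drawn
% i.i.d. from p with entropy H(X) (in bits),
\[
  -\tfrac{1}{n} \log_2 p(X_1, \ldots, X_n) \;\longrightarrow\; H(X)
  \quad \text{in probability as } n \to \infty,
\]
% so almost all probability mass concentrates on a typical set of roughly 2^{nH(X)}
% sequences, each with probability close to 2^{-nH(X)}; sample-size requirements can
% then be phrased in terms of how fast this concentration takes hold.
```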


Assuntos
Probabilidade , Tamanho da Amostra
9.
Front Robot AI ; 9: 797213, 2022.
Article in English | MEDLINE | ID: mdl-35391942

ABSTRACT

This paper offers a new hybrid probably approximately correct (PAC) reinforcement learning (RL) algorithm for Markov decision processes (MDPs) that intelligently maintains favorable features of both model-based and model-free methodologies. The designed algorithm, referred to as the Dyna-Delayed Q-learning (DDQ) algorithm, combines model-free Delayed Q-learning and model-based R-max algorithms while outperforming both in most cases. The paper includes a PAC analysis of the DDQ algorithm and a derivation of its sample complexity. Numerical results are provided to support the claim regarding the new algorithm's sample efficiency compared to its parents as well as the best known PAC model-free and model-based algorithms in application. A real-world experimental implementation of DDQ in the context of pediatric motor rehabilitation facilitated by infant-robot interaction highlights the potential benefits of the reported method.
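
To make the model-based/model-free combination concrete, here is a generic Dyna-Q sketch (not the paper's DDQ algorithm, which combines Delayed Q-learning and R-max machinery): each real transition triggers one model-free update plus several planning updates replayed from a learned model. The environment interface and hyperparameters are assumptions.

```python
# Generic Dyna-Q sketch (not the paper's DDQ): after every real transition the agent
# does one model-free Q-learning update, stores the transition in a learned
# (deterministic) model, and performs a few extra "planning" updates from the model.
import random
from collections import defaultdict

alpha, gamma, eps, n_planning = 0.1, 0.95, 0.1, 10
Q = defaultdict(float)                   # Q[(state, action)], default 0
model = {}                               # model[(state, action)] = (reward, next_state)

def q_update(s, a, r, s2, actions):
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(env, s, actions):
    """One interaction with a hypothetical environment exposing env.step(s, a) -> (r, s2)."""
    a = random.choice(actions) if random.random() < eps else \
        max(actions, key=lambda a2: Q[(s, a2)])
    r, s2 = env.step(s, a)               # real experience
    q_update(s, a, r, s2, actions)       # model-free update
    model[(s, a)] = (r, s2)              # update the learned model
    for _ in range(n_planning):          # model-based planning on remembered pairs
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2, actions)
    return s2
```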

10.
J Ambient Intell Humaniz Comput ; 13(8): 3935-3944, 2022.
Article in English | MEDLINE | ID: mdl-34868373

ABSTRACT

With the recent explosion in the number of wireless communication technologies, the frequency spectrum has become a scarce resource. An efficient method of utilizing the existing spectrum is needed, and cognitive radio is one such technology that can mitigate spectrum scarcity. In a cognitive radio system, an unlicensed secondary user accesses the spectrum allotted to licensed primary users when it lies vacant. To implement dynamic or opportunistic spectrum access, secondary users perform spectrum sensing, which is a quintessential part of a cognitive radio. From the cognitive user's point of view, a lower error probability means an increased likelihood of channel reuse when the channel is vacant, while a higher detection probability signifies better protection for the licensed users. In both cases, the decision threshold plays a pivotal role in determining the fate of the unused spectrum. In this paper, we study the difficulty of selecting an appropriate threshold to minimize the error probability in an uncertain low-SNR regime. The sensing failure problem is analyzed, and an optimal threshold that yields the minimum error rate is computed. An adaptive double-threshold scheme is proposed to make detection robust, and a closed-form expression for the optimal threshold is derived to minimize the error. Simulation results show an improvement in the probability of detection and a reduction in the probability of error at low SNR in the presence of a noise uncertainty factor.
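
A numerical sketch of the threshold-selection problem (not the paper's closed-form derivation): simulate the energy statistic under both hypotheses at low SNR and sweep the threshold for the value minimizing the total error probability under equal priors. All parameter values are illustrative.

```python
# Numerical threshold selection: simulate the energy statistic under H0 (noise only)
# and H1 (signal present) and pick the threshold minimizing
# P_e = 0.5 * P(false alarm) + 0.5 * P(missed detection), i.e. equal priors.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 500, 20000
noise_power, snr = 1.0, 10 ** (-12 / 10)            # -12 dB SNR

e_h0 = (rng.normal(0, np.sqrt(noise_power), (trials, N)) ** 2).mean(axis=1)
e_h1 = (rng.normal(0, np.sqrt(noise_power * (1 + snr)), (trials, N)) ** 2).mean(axis=1)

thresholds = np.linspace(0.9, 1.2, 301)
p_fa = (e_h0[:, None] > thresholds).mean(axis=0)     # false-alarm rate per threshold
p_md = (e_h1[:, None] <= thresholds).mean(axis=0)    # missed-detection rate per threshold
p_err = 0.5 * (p_fa + p_md)

best = thresholds[np.argmin(p_err)]
print(f"optimal threshold ~ {best:.4f}, minimum P_e ~ {p_err.min():.4f}")
```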

11.
Adv Neural Inf Process Syst ; 34: 16671-16685, 2021 Dec.
Article in English | MEDLINE | ID: mdl-36168331

ABSTRACT

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S| × |A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1-γ)^3 ε^2) (resp. K/((1-γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
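
Written out as formulas (up to logarithmic factors), the abstract's sample-size thresholds are:

```latex
% The abstract's sample-complexity statements, up to logarithmic factors, with K the
% feature dimension, \gamma \in (0,1) the discount factor, and \varepsilon the target
% accuracy:
\[
  n_{\mathrm{MB}} \;=\; \widetilde{O}\!\left(\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}\right)
  \quad\text{(model-based)},
  \qquad
  n_{\mathrm{QL}} \;=\; \widetilde{O}\!\left(\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}\right)
  \quad\text{(Q-learning)},
\]
% versus the tabular minimax rate, which carries |S|\,|A| in place of K.
```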

12.
J Comput Biol ; 27(4): 613-625, 2020 04.
Article in English | MEDLINE | ID: mdl-31794679

ABSTRACT

Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, such as the seminal work of Li and Durbin, attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure: the history of multiple subpopulations that merge, split, and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations. Along the way and of independent interest, we essentially determine the optimal number of samples needed to learn an exponential mixture distribution information-theoretically, proving the upper bound by analyzing natural (and efficient) algorithms for this problem.
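
As a small illustration of the mixture-learning subproblem mentioned above, the sketch below fits a two-component exponential mixture with standard EM updates; this is a generic algorithm for such mixtures, not necessarily the one analyzed in the paper, and all parameter values are made up.

```python
# Standard EM for a two-component exponential mixture (density w_k * l_k * exp(-l_k x)).
# E-step: responsibilities; M-step: re-estimate weights and rate parameters.
import numpy as np

rng = np.random.default_rng(0)
true_rates, true_weights = np.array([1.0, 5.0]), np.array([0.7, 0.3])
comp = rng.choice(2, size=5000, p=true_weights)
x = rng.exponential(1.0 / true_rates[comp])            # numpy takes the scale = 1/rate

rates, weights = np.array([0.5, 2.0]), np.array([0.5, 0.5])   # rough initial guess
for _ in range(200):
    dens = weights * rates * np.exp(-np.outer(x, rates))      # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)             # E-step responsibilities
    weights = resp.mean(axis=0)                               # M-step: mixture weights
    rates = resp.sum(axis=0) / (resp * x[:, None]).sum(axis=0)  # M-step: rates

print("estimated rates:", np.round(rates, 2), " estimated weights:", np.round(weights, 2))
```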


Subjects
Computational Biology, Population Genetics, Information Theory, Genetic Models, Algorithms, Computer Simulation, Single Nucleotide Polymorphism/genetics
13.
Front Microbiol ; 10: 1277, 2019.
Article in English | MEDLINE | ID: mdl-31244801

ABSTRACT

The amount of host DNA poses a major challenge to metagenome analysis. However, there is no guidance on the levels of host DNA, nor on the depth of sequencing, needed to acquire meaningful information from whole metagenome sequencing (WMS). Here, we evaluated the impact of a wide range of amounts of host DNA and sequencing depths on microbiome taxonomic profiling using WMS. Synthetic samples with increasing levels of host DNA were created by spiking DNA of a mock bacterial community with DNA from a mouse-derived cell line. Taxonomic analysis revealed that increasing proportions of host DNA led to decreased sensitivity in detecting very low and low abundance species. Reducing the sequencing depth had a major impact on the sensitivity of WMS for profiling samples with 90% host DNA, increasing the number of undetected species. Finally, analysis of simulated datasets with a fixed depth of 10 million reads confirmed that microbiome profiling becomes more inaccurate as the level of host DNA in a sample increases. In conclusion, high amounts of host DNA, coupled with reduced sequencing depths, decrease the coverage WMS provides for characterizing the microbiome. This study highlights the importance of carefully considering these aspects in the design of WMS experiments to maximize microbiome analyses.
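
The depth/host-DNA trade-off can be made concrete with back-of-envelope arithmetic: the reads available to the microbiome are roughly the total depth times (1 - host fraction), and a species at relative abundance a receives about that many reads times a. The 50-read detection cutoff below is a hypothetical threshold for illustration, not a value from the study.

```python
# Back-of-envelope version of the depth/host-DNA trade-off: reads left for the
# microbiome after discarding host reads, and whether a low-abundance species still
# receives enough reads to be called (hypothetical 50-read cutoff).
MIN_READS_TO_DETECT = 50

def detectable(total_reads, host_fraction, species_abundance):
    microbial_reads = total_reads * (1 - host_fraction)
    species_reads = microbial_reads * species_abundance
    return species_reads, species_reads >= MIN_READS_TO_DETECT

for host in (0.0, 0.5, 0.9):
    for depth in (10_000_000, 1_000_000):
        reads, ok = detectable(depth, host, species_abundance=1e-4)
        print(f"depth {depth:>10,}, host {host:.0%}: "
              f"{reads:,.0f} reads for a 0.01% species -> detected: {ok}")
```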

14.
Algorithms Mol Biol ; 14: 2, 2019.
Article in English | MEDLINE | ID: mdl-30839943

ABSTRACT

BACKGROUND: Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch weights are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was DCM_NJ, published in SODA 2001. The main empirical advantage of DCM_NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCM_NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy. RESULTS: In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCM_NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve the scalability of phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
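
For reference, here is plain neighbor joining on a small additive distance matrix, the building block the approach reuses; branch-length bookkeeping is omitted for brevity, and this is not the paper's divide-and-conquer pipeline.

```python
# Plain neighbor joining (NJ) on a distance matrix; the tree is returned as nested
# tuples and branch lengths are not tracked. Illustrates NJ itself, not the paper's
# divide-and-conquer method.
import numpy as np

def neighbor_joining(D, names):
    D = np.asarray(D, dtype=float)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        Q = (n - 2) * D - r[:, None] - r[None, :]     # NJ join criterion
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        new_dist = 0.5 * (D[i] + D[j] - D[i, j])      # distances to the new node
        keep = [k for k in range(n) if k not in (i, j)]
        col = new_dist[keep].reshape(-1, 1)
        D = np.block([[D[np.ix_(keep, keep)], col],
                      [col.T, np.zeros((1, 1))]])
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
    return tuple(nodes)

# Additive distances for the tree ((A,B),(C,D)) with unit external and length-2
# internal branches; NJ recovers the correct grouping.
D = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
print(neighbor_joining(D, ["A", "B", "C", "D"]))      # (('A', 'B'), ('C', 'D'))
```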

15.
Artif Intell Med ; 96: 123-133, 2019 05.
Article in English | MEDLINE | ID: mdl-31164206

ABSTRACT

In Traditional Chinese Medicine, body constitution is closely related to diseases and the corresponding treatment programs. The constitution can be recognized through tongue image diagnosis, so constitution identification is essentially a tongue image classification problem in which each tongue image is assigned to one of nine constitution types. This paper first presents a system framework to automatically identify the constitution from natural tongue images, in which deep convolutional neural networks are carefully designed for tongue coating detection, tongue coating calibration, and constitution recognition. Under this framework, a novel complexity perception (CP) classification method is proposed for constitution recognition; it better handles the adverse effects that varying environmental conditions and the uneven distribution of tongue images have on recognition performance. CP performs constitution recognition based on the complexity of each tongue image by selecting the classifier that corresponds to that complexity. To evaluate the performance of the proposed method, experiments are conducted on three clinical tongue image datasets of different sizes collected from hospitals. The experimental results show that CP effectively improves the accuracy of body constitution recognition.
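
As a rough sketch of the classification component only, a minimal nine-class convolutional network is shown below in PyTorch; the paper's carefully designed networks for coating detection, calibration, and complexity-aware classifier selection are far more elaborate, and every layer size here is an assumption.

```python
# Minimal nine-class CNN stand-in for a tongue-constitution classifier; all layer
# sizes are illustrative assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class ConstitutionNet(nn.Module):
    def __init__(self, n_classes: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = ConstitutionNet()
logits = model(torch.randn(4, 3, 224, 224))   # a batch of 4 RGB tongue images
print(logits.shape)                           # torch.Size([4, 9])
```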


Subjects
Deep Learning, Computer-Assisted Image Processing/methods, Tongue/diagnostic imaging, Humans, Traditional Chinese Medicine, Neural Networks (Computer)
16.
Methods Mol Biol ; 1696: 1-12, 2018.
Article in English | MEDLINE | ID: mdl-29086393

ABSTRACT

Free flow zonal electrophoresis (FFZE) is a versatile, reproducible, and potentially high-throughput technique for the separation of plant organelles and membranes by differences in membrane surface charge. It offers considerable benefits over traditional fractionation techniques, such as density gradient centrifugation and two-phase partitioning, as it is relatively fast, sample recovery is high, and the method provides unparalleled sample purity. It has been used successfully to purify chloroplasts and mitochondria from plants, but also to obtain highly pure fractions of plasma membrane, tonoplast, ER, Golgi, and thylakoid membranes. Application of the technique can significantly improve protein coverage in large-scale proteomics studies by decreasing sample complexity. Here, we describe the method for the fractionation of plant cellular membranes from leaves by FFZE.


Subjects
Cell Membrane/metabolism, Plant Leaves/cytology, Proteomics/methods, Chemical Fractionation, Electrophoresis, Golgi Apparatus/metabolism, Plant Leaves/metabolism, Plant Proteins/analysis, Thylakoids/metabolism
17.
Methods Mol Biol ; 1841: 303-318, 2018.
Article in English | MEDLINE | ID: mdl-30259495

ABSTRACT

Soil and litter metaproteomics, assigning soil and litter proteins to specific phylogenetic and functional groups, has a great potential to shed light on the impact of microbial diversity on soil ecosystem functioning. However, metaproteomic analysis of soil and litter is often hampered by the enormous heterogeneity of the soil matrix and high concentrations of humic acids. To circumvent these challenges, sophisticated protocols for sample preparation have to be applied. This chapter provides the reader with detailed information on well-established protocols for protein extraction from soil and litter samples together with protocols for further sample preparation for subsequent MS analyses.


Subjects
Plant Leaves/chemistry, Proteome, Proteomics, Soil/chemistry, Polyacrylamide Gel Electrophoresis, Proteomics/methods
18.
Proc IEEE Int Conf Data Min ; 2011: 477-486, 2011.
Article in English | MEDLINE | ID: mdl-25309141

ABSTRACT

Finding ways of incorporating auxiliary information or auxiliary data into the learning process has been the topic of active data mining and machine learning research in recent years. In this work we study and develop a new framework for a classification learning problem in which, in addition to class labels, the learner is provided with auxiliary (probabilistic) information that reflects how strongly the expert feels about the class label. This approach can be extremely useful for many practical classification tasks that rely on subjective label assessment and where the cost of acquiring additional auxiliary information is negligible compared to the cost of the example analysis and labelling. We develop classification algorithms capable of using the auxiliary information to make the learning process more efficient in terms of sample complexity. We demonstrate the benefit of the approach on a number of synthetic and real-world datasets by comparing it to learning with class labels only.
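
One natural way to exploit such probabilistic labels (not necessarily the algorithms developed in the paper) is to train against the expert's confidence as a soft target; the sketch below does this for logistic regression with the soft-label cross-entropy loss on synthetic data.

```python
# Logistic regression trained on soft (probabilistic) labels q in [0, 1] via gradient
# descent on the cross-entropy -[q log p + (1 - q) log(1 - p)]; data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
q = 1.0 / (1.0 + np.exp(-(X @ w_true) + 0.5 * rng.standard_normal(n)))  # noisy expert confidence

w, lr = np.zeros(d), 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - q) / n          # gradient of the mean soft-label cross-entropy
    w -= lr * grad

hard_labels = (q > 0.5)
accuracy = (((X @ w) > 0) == hard_labels).mean()
print("recovered weights:", np.round(w, 2), " hard-label accuracy:", round(accuracy, 3))
```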
