1.
Bioinformatics ; 38(9): 2389-2396, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35212706

ABSTRACT

MOTIVATION: Microbiome datasets provide rich information about microbial communities. However, vast variations in library size across samples make proper statistical comparisons difficult. Rarefaction is often used in practice as a normalization technique to deal with this, although whether it should ever be used has been debated. Conventional wisdom and previous work suggest that rarefaction should never be used in practice, arguing that rarefying microbiome data is statistically inadmissible. These discussions, however, have been confined to particular parametric models and simulation studies.
RESULTS: We develop a semiparametric graphical model framework for grouped microbiome data and analyze, in the context of differential abundance testing, the statistical trade-offs of the rarefaction procedure, accounting for latent variations and measurement errors. Under this framework, it can be shown that rarefaction guarantees that subsequent permutation tests properly control the Type I error. In addition, the loss in sensitivity from rarefaction is solely due to increased measurement error; if the underlying variation in microbial composition is large among samples, rarefaction might not hurt subsequent statistical inference much. We develop the rarefaction efficiency index (REI) as an indicator of efficiency loss and illustrate it with a dataset on the effect of storage conditions on microbiome data. Simulation studies based on real data demonstrate that the impact of rarefaction on sensitivity is negligible when overdispersion is prominent, while a low REI corresponds to scenarios in which rarefying might substantially lower the statistical power. Whether to rarefy or not ultimately depends on assumptions about the data-generating process and on characteristics of the data.
AVAILABILITY AND IMPLEMENTATION: Source code is publicly available at https://github.com/jcyhong/rarefaction.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
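
For readers unfamiliar with the procedure discussed in this abstract, the following is a minimal sketch of rarefaction itself: each sample's read counts are subsampled without replacement to a common depth. It is not taken from the authors' repository; the function name `rarefy`, the toy count matrix, and the choice to drop samples with fewer reads than the target depth are assumptions made for illustration.

```python
import numpy as np

def rarefy(counts, depth, seed=0):
    """Subsample each sample (row) without replacement to `depth` reads.

    Rows with fewer than `depth` reads are dropped (an illustrative choice).
    Returns the rarefied matrix and the indices of the retained rows.
    """
    counts = np.asarray(counts, dtype=np.int64)
    rng = np.random.default_rng(seed)
    keep = np.flatnonzero(counts.sum(axis=1) >= depth)
    # Drawing `depth` reads without replacement from the observed taxon
    # counts is a multivariate hypergeometric draw.
    rarefied = np.array(
        [rng.multivariate_hypergeometric(counts[i], depth) for i in keep]
    )
    return rarefied, keep

# Hypothetical samples-by-taxa count matrix with very different library sizes.
counts = np.array([[50, 30, 20], [500, 300, 200], [5, 3, 1]])
rarefied, keep = rarefy(counts, depth=100)
print(keep)
print(rarefied)
```

After rarefying, every retained sample has the same library size, which is what makes the samples exchangeable with respect to sequencing depth in a subsequent permutation test.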


Subjects
Microbiota, Microbiota/genetics, Software, Computer Simulation, Gene Library
2.
Ann Stat ; 42(5): 1693-1724, 2014 Oct 01.
Article in English | MEDLINE | ID: mdl-25492979

ABSTRACT

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to [Formula: see text] if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly [Formula: see text] times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
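
The accept-reject scheme described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. It assumes that a point is accepted with probability |y − p̃(x)|, where p̃ is the pilot model's predicted probability, and that the post-hoc adjustment consists of adding the pilot coefficients to the subsample fit. The simulated data, the split used to obtain an independent pilot, and the helper names `fit_logistic` and `local_case_control` are likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression by Newton's method (X includes an intercept)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One scan: accept each point with probability |y - p_pilot(x)|, then refit."""
    p_pilot = sigmoid(X @ theta_pilot)
    keep = rng.uniform(size=y.size) < np.abs(y - p_pilot)
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed form of the post-hoc adjustment: add back the pilot coefficients.
    return theta_sub + theta_pilot, keep

# Simulated, heavily imbalanced data (hypothetical parameters).
rng = np.random.default_rng(0)
n = 200_000
theta_true = np.array([-4.0, 1.0, -1.0])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = (rng.uniform(size=n) < sigmoid(X @ theta_true)).astype(float)

# Pilot fitted on a held-out slice so that it is independent of the scanned data.
n_pilot = 20_000
theta_pilot = fit_logistic(X[:n_pilot], y[:n_pilot])
theta_lcc, keep = local_case_control(X[n_pilot:], y[n_pilot:], theta_pilot, rng)
print(theta_lcc, keep.mean())
```

Because the acceptance probability is small wherever the pilot already predicts the response well, the retained subsample concentrates on conditionally surprising points and is only a small fraction of the scanned data.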

3.
J Exp Med ; 221(4)2024 Apr 01.
Article in English | MEDLINE | ID: mdl-38411617

ABSTRACT

In vivo T cell screens are a powerful tool for elucidating complex mechanisms of immunity, yet there is no consensus on the design parameters required for robust in vivo screens: gene library size, cell transfer quantity, and number of mice. Here, we describe the Framework for In vivo T cell Screens (FITS), which provides experimental and analytical guidelines for determining optimal parameters in diverse in vivo contexts. As a proof of concept, we used FITS to optimize the parameters for a CD8+ T cell screen in the B16-OVA tumor model. We also included unique molecular identifiers (UMIs) in our screens to (1) improve statistical power and (2) track T cell clonal dynamics for distinct gene knockouts (KOs) across multiple tissues. These findings provide an experimental and analytical framework for performing in vivo screens in immune cells and illustrate a case study for in vivo T cell screens with UMIs.


Subjects
CD8-Positive T-Lymphocytes, Animals, Mice, Gene Knockout Techniques
4.
Biometrika ; 102(2): 479-485, 2015 Jun 01.
Article in English | MEDLINE | ID: mdl-26977114

ABSTRACT

To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.
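
As a hedged illustration of how degrees of freedom can diverge from a nominal parameter count, the sketch below estimates degrees of freedom by Monte Carlo using the covariance definition, df = Σ_i Cov(ŷ_i, y_i) / σ², which is assumed here since the abstract does not state it. The fitting procedure keeps only the largest-magnitude coordinate of y, which is a projection onto a non-convex set (the union of the coordinate axes) with a single nominal parameter. This example is chosen for simplicity and is not claimed to be one of the paper's own; the function names are hypothetical.

```python
import numpy as np

def best_single_coordinate_fit(y):
    """Keep the largest-magnitude coordinate of each row of y and zero the rest."""
    fit = np.zeros_like(y)
    rows = np.arange(y.shape[0])
    j = np.argmax(np.abs(y), axis=1)
    fit[rows, j] = y[rows, j]
    return fit

def monte_carlo_df(fit_fn, mu, sigma=1.0, n_rep=20_000, seed=0):
    """Estimate sum_i Cov(yhat_i, y_i) / sigma^2 from simulated Gaussian responses."""
    rng = np.random.default_rng(seed)
    y = mu + sigma * rng.standard_normal((n_rep, mu.size))
    fit = fit_fn(y)
    cov = ((fit - fit.mean(axis=0)) * (y - mu)).mean(axis=0)
    return cov.sum() / sigma**2

mu = np.zeros(10)
df_subset = monte_carlo_df(best_single_coordinate_fit, mu)   # one nominal parameter
df_identity = monte_carlo_df(lambda y: y, mu)                # saturated fit, df = 10
print(df_subset, df_identity)
```

With a zero mean vector, the estimate for the single-coordinate fit comes out well above 1, because the data-driven choice of which coordinate to keep is itself a source of fitting flexibility that the nominal parameter count ignores.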

5.
Methods Ecol Evol ; 6(4): 424-438, 2015 Apr.
Article in English | MEDLINE | ID: mdl-27840673

ABSTRACT

Presence-only records may provide data on the distributions of rare species, but commonly suffer from large, unknown biases due to their typically haphazard collection schemes. Presence-absence or count data collected in systematic, planned surveys are more reliable but typically less abundant. We propose a probabilistic model to allow for joint analysis of presence-only and survey data to exploit their complementary strengths. Our method pools presence-only and presence-absence data for many species and maximizes a joint likelihood, simultaneously estimating and adjusting for the sampling bias affecting the presence-only data. By assuming that the sampling bias is the same for all species, we can borrow strength across species to efficiently estimate the bias and improve our inference from presence-only data. We evaluate our model's performance on data for 36 eucalypt species in south-eastern Australia. We find that presence-only records exhibit a strong sampling bias towards the coast and towards Sydney, the largest city. Our data-pooling technique substantially improves the out-of-sample predictive performance of our model when the amount of available presence-absence data for a given species is scarce. If we have only presence-only data and no presence-absence data for a given species, but both types of data for several other species that suffer from the same spatial sampling bias, then our method can obtain an unbiased estimate of the first species' geographic range.

6.
Ann Appl Stat ; 7(4): 1917-1939, 2013 Dec 01.
Article in English | MEDLINE | ID: mdl-25493106

ABSTRACT

Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions, and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data allow estimation of only the relative, and not the absolute, intensity of species occurrence. All three of the above techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give exactly the same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified (as it practically always is), logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose "infinitely weighted logistic regression," which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.
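
The relationship between heavily weighted logistic regression and the IPP can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation. It simulates presence points on [0, 1] from a log-linear intensity, fits the IPP by maximizing a likelihood whose integral is approximated on a regular grid of background points, and fits a logistic regression in which each background point carries a large finite weight `W` standing in for the infinite-weight limit. The one-dimensional setup, the grid quadrature, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = np.log(300.0), 2.0

# Simulate an IPP on [0, 1] with intensity exp(a_true + b_true * x).
total = np.exp(a_true) * (np.exp(b_true) - 1.0) / b_true    # integral of the intensity
n_pres = rng.poisson(total)
u = rng.uniform(size=n_pres)
x_pres = np.log1p(u * (np.exp(b_true) - 1.0)) / b_true       # inverse-CDF sampling

# Background (quadrature) points: midpoints of a regular grid on [0, 1].
m = 1000
x_bg = (np.arange(m) + 0.5) / m

def design(x):
    return np.column_stack([np.ones_like(x), x])

X_pres, X_bg = design(x_pres), design(x_bg)

def fit_ipp(n_iter=30):
    """Newton's method for sum_i eta(x_i) - (1/m) * sum_j exp(eta(z_j))."""
    theta = np.array([np.log(n_pres), 0.0])
    for _ in range(n_iter):
        lam = np.exp(X_bg @ theta) / m           # intensity times quadrature weight
        grad = X_pres.sum(axis=0) - X_bg.T @ lam
        hess = (X_bg * lam[:, None]).T @ X_bg
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def fit_weighted_logistic(W, n_iter=30):
    """Logistic regression: presences have weight 1, background points weight W."""
    X = np.vstack([X_pres, X_bg])
    y = np.concatenate([np.ones(n_pres), np.zeros(m)])
    w = np.concatenate([np.ones(n_pres), np.full(m, W)])
    theta = np.array([np.log(n_pres / (W * m)), 0.0])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X
        theta = theta + np.linalg.solve(hess, grad)
    return theta

W = 1e6
theta_ipp = fit_ipp()
theta_iwlr = fit_weighted_logistic(W)
print(theta_ipp, theta_iwlr)
```

In this setup the slope estimates should agree closely, and the intercepts should differ by approximately log(W · m), which reflects the abstract's point that presence-only data identify relative rather than absolute intensity.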
