ABSTRACT
The physicist Ernest Rutherford said, "If your experiment needs statistics, you ought to have done a better experiment." Although this aphorism remains true for much of today's research in cell biology, a basic understanding of statistics can help cell biologists monitor the conduct of their experiments, interpret the results, present them in publications, and critically evaluate research by others. However, training in statistics is often focused on the sophisticated needs of clinical researchers, psychologists, and epidemiologists, whose conclusions depend wholly on statistics, rather than on the practical needs of cell biologists, whose experiments often provide evidence that is not statistical in nature. This review describes some basic statistical principles that may be of use to experimental biologists, but it does not cover the sophisticated statistics needed for papers that contain evidence of no other kind.
Subject(s)
Cell Biology, Statistics as Topic, Causality, Statistical Data Interpretation, Probability, Reproducibility of Results, Research Design, Statistical Distributions
ABSTRACT
Contingency tables, data represented as matrices of counts, are ubiquitous across quantitative research and data-science applications. However, existing statistical tests are insufficient, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data such as single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate while Pearson's χ² consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's χ² in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
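As a rough illustration of the general idea (this is not the OASIS construction itself, whose optimized row and column embeddings are specified in the paper), the sketch below evaluates a statistic that is linear in a column-normalized contingency table and converts it into a closed-form P-value bound with a Hoeffding-type inequality. The weight vectors f and c, the toy table, and the independence assumptions noted in the comments are placeholders for illustration only.

```python
import numpy as np

def linear_table_statistic(X, f, c):
    """A statistic linear in the (column-normalized) contingency table X.

    Each column j is summarized by the f-weighted average of its counts,
    centered at the pooled f-average and combined with column weights c.
    Illustration only; not the OASIS statistic.
    """
    X = np.asarray(X, dtype=float)
    n_j = X.sum(axis=0)                    # column totals
    mu_j = (f @ X) / n_j                   # per-column f-average
    mu = (f @ X.sum(axis=1)) / X.sum()     # pooled f-average
    return float(c @ (mu_j - mu)), n_j

def hoeffding_pvalue_bound(stat, f, c, n_j):
    """Crude two-sided Hoeffding-style bound on P(|S| >= |stat|) under the null.

    Treats each column average as a mean of n_j independent terms bounded by
    the range of f, and ignores the estimation of the pooled mean.
    """
    var_proxy = np.sum(c ** 2 * np.ptp(f) ** 2 / (4.0 * n_j))
    return float(min(1.0, 2.0 * np.exp(-stat ** 2 / (2.0 * var_proxy))))

X = np.array([[12, 30, 7], [25, 18, 22]])        # toy 2x3 table of counts
f = np.array([0.0, 1.0])                         # row weights (placeholder)
c = np.array([1.0, -1.0, 0.5]) / np.sqrt(3)      # column weights (placeholder)
stat, n_j = linear_table_statistic(X, f, c)
print(stat, hoeffding_pvalue_bound(stat, f, c, n_j))
```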
Subject(s)
Genome, Genomics, Chromosome Mapping
ABSTRACT
Set-based association analysis is a valuable tool for studying the etiology of complex diseases in genome-wide association studies, as it allows for the joint testing of variants in a region or group. Two common types of single nucleotide polymorphism (SNP)-disease functional models are recognized when evaluating the joint function of a set of SNPs: the cumulative weak signal model, in which multiple functional variants with small effects contribute to disease risk, and the dominating strong signal model, in which a few functional variants with large effects contribute to disease risk. However, existing methods have two main limitations that reduce their power. First, they typically consider only one disease-SNP association model, which can result in substantial power loss if the model is misspecified. Second, they do not account for the high-dimensional nature of SNPs, leading to low power or high false-positive rates. In this study, we address these challenges with a high-dimensional inference procedure that simultaneously fits many SNPs in a regression model. We also propose an omnibus testing procedure that employs a robust and powerful P-value combination method to enhance the power of SNP-set association. Results from extensive simulation studies and a real data analysis demonstrate that our set-based high-dimensional inference strategy is both flexible and computationally efficient and can substantially improve the power of SNP-set association analysis. Application to a real dataset further demonstrates the utility of the testing strategy.
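The abstract does not name the P-value combination rule; one widely used combiner with the robustness properties described (valid under dependence, with an analytic null distribution) is the Cauchy combination test. The sketch below is an assumption about the kind of method meant, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvalues, weights=None):
    """Cauchy-combination (ACAT-style) of p-values.

    Transforms each p-value to a Cauchy quantile, averages the quantiles,
    and maps the average back to a combined p-value; the weighted average
    is approximately standard Cauchy under the null even when the tests
    are dependent, provided no p-value is exactly 0 or 1.
    """
    p = np.asarray(pvalues, dtype=float)
    if weights is None:
        weights = np.full(p.shape, 1.0 / p.size)
    t = np.sum(weights * np.tan((0.5 - p) * np.pi))
    return float(cauchy.sf(t))

# Example: combine p-values from several SNP-set tests of the same region
print(cauchy_combination([0.003, 0.20, 0.55]))
```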
Subject(s)
Genome-Wide Association Study, Single Nucleotide Polymorphism, Genome-Wide Association Study/methods, Humans, Genetic Predisposition to Disease, Genetic Models, Algorithms, Computer Simulation
ABSTRACT
Bayes factors represent a useful alternative to P-values for reporting outcomes of hypothesis tests by providing direct measures of the relative support that data provide to competing hypotheses. Unfortunately, the competing hypotheses have to be specified, and the calculation of Bayes factors in high-dimensional settings can be difficult. To address these problems, we define Bayes factor functions (BFFs) directly from common test statistics. BFFs depend on a single noncentrality parameter that can be expressed as a function of standardized effects, and plots of BFFs versus effect size provide informative summaries of hypothesis tests that can be easily aggregated across studies. Such summaries eliminate the need for arbitrary P-value thresholds to define "statistical significance." Because BFFs are defined using nonlocal alternative prior densities, they provide more rapid accumulation of evidence in favor of true null hypotheses without sacrificing efficiency in supporting true alternative hypotheses. BFFs can be expressed in closed form and can be computed easily from z, t, χ², and F statistics.
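For intuition about how a Bayes factor can be written in closed form as a function of a test statistic and a single prior scale, consider the simpler case of a local normal prior on the noncentrality (BFFs instead use nonlocal priors, whose closed forms are given in the paper): if z ~ N(delta, 1) with delta = 0 under H0 and delta ~ N(0, tau^2) under H1, then

```latex
\mathrm{BF}_{10}(z;\tau)
  = \frac{\phi\!\left(z;\,0,\,1+\tau^{2}\right)}{\phi\!\left(z;\,0,\,1\right)}
  = \frac{1}{\sqrt{1+\tau^{2}}}\,
    \exp\!\left(\frac{z^{2}}{2}\cdot\frac{\tau^{2}}{1+\tau^{2}}\right).
```

Plotting this quantity against tau (or against the standardized effect it implies) gives a curve in the spirit of a Bayes factor function.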
Subject(s)
Research Design, Bayes Theorem
ABSTRACT
DNA methylation plays a crucial role in transcriptional regulation. Reduced representation bisulfite sequencing (RRBS) is a technique of increasing use for analyzing genome-wide methylation profiles. Many computational tools, such as Metilene, MethylKit, BiSeq and DMRfinder, have been developed to use RRBS data for the detection of differentially methylated regions (DMRs) potentially involved in the epigenetic regulation of gene expression. For DMR detection tools, as for countless other medical applications, P-values and their adjustments are among the most standard reporting statistics used to assess the statistical significance of biological findings. However, P-values have come under increasing criticism for their questionable accuracy and relatively high rates of false positive or false negative indications. Here, we propose a method to calculate E-values, defined as likelihood ratios against the null hypothesis over the entire parameter space, for DMR detection in RRBS data. We also provide the R package 'metevalue' as a user-friendly interface for implementing E-value calculations in various DMR detection tools. To evaluate the performance of E-values, we generated various RRBS benchmarking datasets using our simulator 'RRBSsim', with eight samples in each experimental group. Our comprehensive benchmarking analyses showed that using E-values not only significantly improved accuracy, area under the ROC curve and power over P-values or adjusted P-values, but also reduced false discovery rates and type I errors. In applications to real RRBS data from CRL rats and a clinical trial on a low-salt diet, the use of E-values detected more biologically relevant DMRs and strengthened the expected negative association between DNA methylation and gene expression.
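As background on why E-values admit simple error control (the paper's specific construction for RRBS data is as described above; this is only the generic property), an E-value is a nonnegative statistic, such as a likelihood ratio, whose expectation under the null is at most 1, so Markov's inequality bounds the type I error when rejecting at E >= 1/alpha:

```latex
\mathbb{E}_{H_0}[E] \le 1
\quad\Longrightarrow\quad
\Pr_{H_0}\!\left(E \ge \tfrac{1}{\alpha}\right)
  \;\le\; \alpha\,\mathbb{E}_{H_0}[E] \;\le\; \alpha .
```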
Subject(s)
DNA Methylation, Animals, Rats, DNA Sequence Analysis/methods, ROC Curve, CpG Islands
ABSTRACT
Exact p-value (XPV)-based methods for dot-product-like score functions (such as the XCorr score implemented in Tide, SEQUEST and Comet, or the shared-peak-count-based scoring in MSGF+ and ASPV) provide fairly good calibration for peptide-spectrum-match (PSM) scoring in database-searching-based MS/MS spectrum identification. Unfortunately, standard XPV methods in practice cannot handle the high-resolution fragmentation data produced by state-of-the-art mass spectrometers, because using smaller bins increases the number of fragment matches that are assigned to incorrect bins and scored improperly. In this article, we present an extension of the XPV method, called the high-resolution exact p-value (HR-XPV) method, which can be used to calibrate PSM scores of high-resolution MS/MS spectra obtained with dot-product-like scoring such as XCorr. HR-XPV carries remainder masses throughout the fragmentation, which greatly increases the number of fragments assigned to the correct bin and thus takes advantage of high-resolution data. Using four mass spectrometry data sets, our experimental results demonstrate that HR-XPV produces well-calibrated scores, which in turn result in more trusted spectrum annotations at any false discovery rate level.
Subject(s)
Algorithms, Tandem Mass Spectrometry, Tandem Mass Spectrometry/methods, Software, Peptides/chemistry, Calibration, Protein Databases
ABSTRACT
BACKGROUND: The term eGene denotes a gene whose expression level is affected by at least one independent expression quantitative trait locus (eQTL). Identifying eQTLs and eGenes is both theoretically and empirically important in genomic studies. However, standard eGene detection methods generally focus on individual cis-variants and cannot efficiently leverage useful knowledge acquired from auxiliary samples into target studies. METHODS: We propose a multilocus-based eGene identification method called TLegene, which integrates shared genetic similarity information available from auxiliary studies under the statistical framework of transfer learning. We apply TLegene to eGene identification in ten TCGA cancers that have an explicit relevant tissue in the GTEx project, learning the genetic effects of variants in TCGA from GTEx. We also apply TLegene to the Geuvadis project to evaluate its usefulness in non-cancer studies. RESULTS: We observed substantial correlation of cis-variant genetic effects between TCGA and GTEx for a large number of genes. Furthermore, consistent with our simulation results, we found that TLegene was more powerful than existing methods and identified 169 distinct candidate eGenes, many more than the approach that did not consider knowledge transfer across target and auxiliary studies. Previous studies and functional enrichment analyses provided empirical support for the associations of the discovered eGenes, and also showed evidence of allelic heterogeneity of gene expression. Furthermore, TLegene identified more eGenes in Geuvadis and revealed that these eGenes were mainly enriched in EBV-transformed lymphocytes. CONCLUSION: Overall, TLegene represents a flexible and powerful statistical method for eGene identification through transfer learning of genetic similarity shared across auxiliary and target studies.
Subject(s)
Neoplasms, Single Nucleotide Polymorphism, Humans, Quantitative Trait Loci/genetics, Genomics, Neoplasms/genetics, Machine Learning, Genome-Wide Association Study/methods
ABSTRACT
OBJECTIVE: The fragility index (FI) measures the robustness of statistically significant findings in randomized controlled trials (RCTs) by quantifying the minimum number of event conversions required to reverse the statistical significance of a dichotomous outcome. In vascular surgery, many clinical guidelines and critical decision-making points are informed by a handful of key RCTs, especially regarding open surgical versus endovascular treatment. The objective of this study was to evaluate the FI of RCTs with statistically significant primary outcomes that compared open versus endovascular surgery. METHODS: In this meta-epidemiological study and systematic review, MEDLINE, Embase, and CENTRAL were searched to December 2022 for RCTs comparing open versus endovascular treatments for abdominal aortic aneurysms, carotid artery stenosis, and peripheral arterial disease. RCTs with statistically significant primary outcomes were included. Data screening and extraction were conducted in duplicate. The FI was calculated by adding an event to the group with the smaller number of events, while subtracting a nonevent from the same group, until Fisher's exact test produced a nonstatistically significant result. The primary outcomes were the FI and the proportion of outcomes for which loss to follow-up exceeded the FI. The secondary outcomes assessed the relationship of the FI to disease state, presence of commercial funding, and study design. RESULTS: Overall, 5133 articles were captured in the initial search, with 21 RCTs reporting 23 different primary outcomes included in the final analysis. The median FI (first quartile, third quartile) was 3 (3, 20), with 16 outcomes (70%) reporting a loss to follow-up greater than the FI. Mann-Whitney U tests revealed that commercially funded RCTs and composite outcomes had greater FIs (median, 20.0 [5.5, 24.5] vs median, 3.0 [2.0, 5.5], P = .035; median, 21 [8, 38] vs median, 3.0 [2.0, 8.5], P = .01, respectively). The FI did not vary between disease states (P = .285) or between index and follow-up trials (P = .147). There were significant correlations between the FI and P values (Pearson r = 0.90; 95% confidence interval, 0.77-0.96) and between the FI and the number of events (r = 0.82; 95% confidence interval, 0.48-0.97). CONCLUSIONS: A small number of event conversions (median, 3) is needed to alter the statistical significance of primary outcomes in vascular surgery RCTs evaluating open surgical and endovascular treatments. Most studies had loss to follow-up greater than their FI, which calls trial results into question, and commercially funded studies had greater FIs. The FI and these findings should be considered in future trial design in vascular surgery.
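The FI calculation described in the methods can be sketched in a few lines; the 2x2 counts below are hypothetical, and the 0.05 threshold with Fisher's exact test follows the description above.

```python
from scipy.stats import fisher_exact

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Fragility index for a dichotomous outcome (events e, totals n per arm).

    Converts non-events to events in the arm with fewer events, one at a
    time, until Fisher's exact test is no longer significant; returns the
    number of conversions (None if the result is not significant to start).
    """
    def p(a, b):
        return fisher_exact([[a, n1 - a], [b, n2 - b]])[1]

    if p(e1, e2) >= alpha:
        return None
    a, b, fi = e1, e2, 0
    while p(a, b) < alpha:
        if a <= b and a < n1:   # add an event (remove a non-event) in the smaller-event arm
            a += 1
        elif b < n2:
            b += 1
        else:
            break               # no non-events left to convert
        fi += 1
    return fi

# Hypothetical trial: 10/100 events in one arm vs 30/100 in the other
print(fragility_index(10, 100, 30, 100))
```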
Subject(s)
Research Design, Surgical Specialties, Humans, Randomized Controlled Trials as Topic, Sample Size, Vascular Surgical Procedures/adverse effects
ABSTRACT
Arsenic is a relatively abundant metalloid that affects DNA methylation and has been implicated in various adverse health outcomes, including several cancers and diabetes. However, uncertainty remains about the identity of the genomic CpGs that are sensitive to arsenic exposure, in utero or otherwise. Here we identified a high-confidence set of CpG sites whose methylation is sensitive to in utero arsenic exposure. To do so, we analyzed the methylation of infant CpGs as a function of maternal urinary arsenic in cord blood and placenta from geographically and ancestrally distinct human populations. Independent analyses of these distinct populations were followed by combination of results across sexes and populations/tissue types. From these analyses, we concluded that both sex and tissue type are important drivers of heterogeneity in the methylation response at several CpGs. We also identified 17 high-confidence CpGs that were hypermethylated across sex, tissue type and population; 11 of these were located within protein-coding genes. This pattern is consistent with hypotheses that arsenic increases cancer risk by inducing hypermethylation of genic regions. This study represents an opportunity to understand consistent, reproducible patterns of epigenomic response after in utero arsenic exposure and may aid in developing novel biomarkers or signatures of arsenic exposure. Identifying arsenic-responsive sites can also contribute to our understanding of the biological mechanisms by which arsenic exposure affects biological function and increases the risk of cancer and other age-related diseases.
Subject(s)
Arsenic, Neoplasms, Pregnancy, Female, Humans, Arsenic/toxicity, DNA Methylation, Placenta, Fetal Blood, CpG Islands, Neoplasms/chemically induced, Neoplasms/genetics, Maternal Exposure/adverse effects
ABSTRACT
Statistical analysis and data visualization are integral parts of science communication. One of the major issues in current data analysis practice is an overdependence on, and misuse of, p-values. Researchers have been advocating the estimation and reporting of effect sizes in quantitative research to enhance the clarity and effectiveness of data analysis. Reporting effect sizes in scientific publications has until now been mainly limited to numeric tables, even though plotting effect sizes is a more effective means of communicating results. We have developed the Durga R package for estimating and plotting effect sizes for paired and unpaired group comparisons. Durga allows users to estimate unstandardized and standardized effect sizes and bootstrapped confidence intervals for those effect sizes. The central functionality of Durga is to combine effect size visualization with traditional plotting methods. Durga is a powerful statistical and data visualization package that is easy to use, providing the flexibility to estimate effect sizes for paired and unpaired data using different statistical methods, along with a wide range of options for plotting effect sizes in informative and aesthetically pleasing ways. Here, we introduce the package and its various functions, and describe a workflow for estimating and plotting effect sizes using example data sets.
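Durga's own interface is documented with the package; as a language-neutral illustration of the core quantities it reports (here an unstandardized mean difference with a percentile-bootstrap confidence interval), a minimal sketch with made-up unpaired data:

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=40)   # made-up unpaired data
group_b = rng.normal(11.2, 2.0, size=40)

def bootstrap_mean_diff(x, y, n_boot=10_000, ci=95):
    """Unstandardized effect size (difference in means) with a percentile bootstrap CI."""
    diff = y.mean() - x.mean()
    boot = np.empty(n_boot)
    for i in range(n_boot):
        boot[i] = (rng.choice(y, y.size, replace=True).mean()
                   - rng.choice(x, x.size, replace=True).mean())
    lo, hi = np.percentile(boot, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return diff, (lo, hi)

print(bootstrap_mean_diff(group_a, group_b))
```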
Subject(s)
Software, Statistical Data Interpretation, Data Visualization
ABSTRACT
We develop a method for hybrid analyses that uses external controls to augment internal control arms in randomized controlled trials (RCTs), in which the degree of borrowing is determined by the similarity between RCT and external-control patients in order to account for systematic differences (e.g., unmeasured confounders). The method is a novel extension of the power prior in which discounting weights are computed separately for each external control based on compatibility with the randomized control data. The discounting weights are determined using the predictive distribution for the external controls, derived via the posterior distribution of the time-to-event parameters estimated from the RCT. The method is applied using a proportional hazards regression model with a piecewise constant baseline hazard. A simulation study and a real-data example are presented based on a completed trial in non-small cell lung cancer. We show that the case-weighted power prior provides robust inference under various forms of incompatibility between the external controls and the RCT population.
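For readers unfamiliar with the power prior, its standard form discounts the external-control likelihood by a single weight a0 in [0, 1]; the case-weighted extension described above assigns each external control its own weight, written here schematically as a_i (the notation is illustrative, not the paper's):

```latex
\pi(\theta \mid D_0, a_0) \;\propto\; L(\theta \mid D_0)^{a_0}\,\pi_0(\theta),
\qquad
\pi(\theta \mid D_0, a_1,\dots,a_{n_0}) \;\propto\;
  \left[\prod_{i=1}^{n_0} L(\theta \mid x_{0i})^{a_i}\right]\pi_0(\theta),
```

where D0 = {x_01, ..., x_0n0} is the external-control data and pi_0 an initial prior; in the case-weighted version each a_i is computed from the predictive distribution for external control i under the time-to-event model fitted to the RCT.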
Subject(s)
Research Design, Humans, Computer Simulation, Proportional Hazards Models, Bayes Theorem
ABSTRACT
Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable because of finite sample sizes. One critical challenge in applying permutation tests to genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework for parameterizing the permutation sample spaces of these two types of data using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. The performance of the proposed algorithms is demonstrated on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, with comparisons to existing methods such as crude permutation and SAMC. The results show that our approaches achieve orders-of-magnitude gains in computational efficiency when estimating small p-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
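To see why crude permutation becomes prohibitive for small p-values (the problem the cross-entropy approach targets), note that estimating a p-value near 1e-6 with useful relative accuracy requires on the order of 1e8 random permutations. A minimal crude-permutation sketch for an independent two-group difference in means (the test statistic is a placeholder):

```python
import numpy as np

def crude_permutation_pvalue(x, y, n_perm=100_000, rng=None):
    """Monte Carlo permutation p-value for a two-group difference in means.

    The relative error of the estimate scales like 1/sqrt(n_perm * p), so
    very small p-values need an enormous n_perm, which motivates
    importance-sampling schemes such as the cross-entropy method.
    """
    rng = rng or np.random.default_rng(0)
    pooled = np.concatenate([x, y])
    obs = abs(x.mean() - y.mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(perm[: x.size].mean() - perm[x.size :].mean())
        hits += stat >= obs
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0
```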
Subject(s)
Genomics, Research Design, Entropy, Algorithms, Data Analysis
ABSTRACT
This paper provides computations comparing the accuracy of the saddlepoint approximation and the normal approximation for the mid-p-value of Wilcoxon and log-rank tests applied to left-truncated data under a truncated binomial design. The comparison is illustrated with real-data examples and simulation studies. Confidence intervals are obtained by inverting the tests under consideration.
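For reference, the mid-p-value being approximated is the standard modification for discrete test statistics, which counts only half of the probability of the observed value:

```latex
p_{\mathrm{mid}} \;=\; \Pr\!\left(T > t_{\mathrm{obs}}\right)
  \;+\; \tfrac{1}{2}\,\Pr\!\left(T = t_{\mathrm{obs}}\right),
```

with the probabilities taken under the null (permutation) distribution of the test statistic T.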
Subject(s)
Confidence Intervals, Humans, Sample Size
ABSTRACT
The main idea of this paper is to approximate the exact p-value of a class of non-parametric two-sample location-scale tests. The most widely used non-parametric two-sample location-scale tests are formulated as a class of linear rank tests, and the permutation distribution of this class is derived under a random allocation design. This allows us to approximate the exact p-value of the tests in this class using the saddlepoint approximation method. The proposed method is highly accurate in approximating the exact p-value compared with the normal approximation method, and it requires far less calculation and computing time than the simulation-based method. The procedures are illustrated on four real data sets representing applications in several different fields. In addition, a simulation study compares the proposed method with the traditional methods for approximating the exact p-value of the specified class of non-parametric two-sample location-scale tests.
ABSTRACT
First popularized almost a century ago in epidemiologic research by Ronald Fisher and Jerzy Neyman, the P-value has become perhaps the most misunderstood and misused statistical descriptor. Indeed, modern clinical research has come to be centered on and guided by an arbitrary P-value of <0.05 as a magical threshold for significance, so much so that experimental design, the reporting of experimental findings, and the interpretation and adoption of those findings have become largely dependent on this "significant" P-value. This has given rise to multiple biases in the overall body of biomedical literature that threaten the very validity of clinical research. Ultimately, a drive toward reporting a "significant" P-value (by various statistical manipulations) risks creating a falsely positive body of science, leading to (i) wasted resources in pursuing fruitless research and (ii) futile or even harmful policies and therapeutic recommendations. This article reviews the history of the P-value, the conceptual basis of the P-value in the context of hypothesis testing, and the challenges of critically appraising clinical evidence vis-à-vis the P-value. The review aims to raise awareness of the pitfalls of rigid adherence to this threshold of statistical significance when evaluating clinical trials and to generate discussion about whether the scientific community needs to rethink how we decide clinical significance.
Subject(s)
Evidence-Based Medicine, Humans, Biomedical Research, Research Design, Statistical Data Interpretation
ABSTRACT
The fragility index is a clinically meaningful metric, based on modifying patient outcomes, that is increasingly used to interpret the robustness of clinical trial results. It relies on the concept of exploring alternative realizations of the same clinical trial by modifying patient measurements. In this article, we generalize the fragility index to a family of fragility indices, called the incidence fragility indices, that permit only outcome modifications that are sufficiently likely, and we provide an exact algorithm to calculate them. Additionally, we introduce a far-reaching generalization of the fragility index to any data type and explain how to permit only sufficiently likely modifications for nondichotomous outcomes. All of the proposed methodologies follow the fragility index concept.
Asunto(s)
Interpretación Estadística de Datos , Algoritmos , Humanos , Proyectos de Investigación , Tamaño de la MuestraRESUMEN
Clinical trials with continuous primary endpoints typically measure outcomes at baseline, at a fixed timepoint (denoted Tmin), and at intermediate timepoints. The analysis is commonly performed using the mixed-model repeated measures method. It is sometimes expected that the effect size will be larger with follow-up longer than Tmin, but extending the follow-up for all patients delays trial completion. We propose an alternative trial design and analysis method that potentially increases statistical power without extending the trial duration or increasing the sample size. We propose following the last enrolled patient until Tmin, with earlier enrollees having variable follow-up durations up to a maximum of Tmax. The sample size at Tmax will be smaller than at Tmin, and because of the staggered enrollment, data missing at Tmax will be missing completely at random. For analysis, we propose an alpha-adjusted procedure based on the smaller of the p values at Tmin and Tmax, termed minP. This approach provides the highest power when the powers at Tmin and Tmax are similar; if they differ substantially, the power of minP is modestly reduced compared with the larger of the two. Rare disease trials, whose patient populations are limited in size, may benefit most from this design.
ABSTRACT
Storey's estimator for the proportion of true null hypotheses, originally proposed under the continuous framework, is modified in this work for the discrete framework. The modification improves estimation of the parameter of interest. The proposed estimator is used to formulate an adaptive version of the Benjamini-Hochberg procedure, and control of the false discovery rate by this adaptive procedure is proved analytically. The proposed estimator is also used to formulate an adaptive version of the Benjamini-Hochberg-Heyse procedure, and simulation experiments establish the conservative nature of this new adaptive procedure. A substantial gain in power is observed for the new adaptive procedures over the standard procedures. The proposed method is demonstrated on two important real-life gene expression data sets, one from an HIV study and the other from a methylation study.
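For orientation, the classical continuous-case Storey estimator and the adaptive Benjamini-Hochberg step it plugs into are sketched below; the discrete-framework modification developed in this work is not implemented here.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Storey's estimator of the proportion of true nulls (continuous case).

    The +1 in the numerator is a common finite-sample correction.
    """
    p = np.asarray(pvalues, dtype=float)
    return min(1.0, (np.sum(p > lam) + 1) / (p.size * (1.0 - lam)))

def adaptive_bh(pvalues, alpha=0.05, lam=0.5):
    """Adaptive Benjamini-Hochberg: run the BH step-up at level alpha / pi0_hat."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    pi0 = storey_pi0(p, lam)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / (m * pi0)
    passed = np.nonzero(p[order] <= thresh)[0]
    k = passed.max() + 1 if passed.size else 0   # largest i with p_(i) below its threshold
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```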
Subject(s)
Computer Simulation
ABSTRACT
Genomic analyses of pediatric acute lymphoblastic leukemia (ALL) subtypes, particularly the T-cell and B-cell lineages, have been pivotal in identifying potential therapeutic targets. Typical genomic analyses have directed attention toward the most commonly mutated genes; however, assessing the contribution of mutations to cancer phenotypes is crucial. We therefore estimated the cancer effects (scaled selection coefficients) of somatic substitutions in T-cell and B-cell cohorts, revealing key insights into mutation contributions. Cancer effects for well-known, frequently mutated genes such as NRAS and KRAS in B-ALL were high, underscoring their importance as therapeutic targets. However, the less frequently mutated genes IL7R, XBP1, and TOX also demonstrated high cancer effects, suggesting pivotal roles in the development of leukemia when present. In T-ALL, KRAS and NRAS are mutated less frequently than in B-ALL, yet their cancer effects when present are high in both subtypes. Mutations in PIK3R1 and RPL10 were not highly prevalent, yet exhibited some of the highest cancer effects in individual T-ALL patients. Even CDKN2A, with a low prevalence and a relatively modest cancer effect, is potentially highly relevant because of the epistatic effects its mutated form exerts on other mutations. Prioritizing investigation of these moderately frequent but potentially high-impact targets not only presents novel opportunities for personalized therapy but also enhances understanding of disease mechanisms and advances precision therapeutics for pediatric ALL.
Subject(s)
Mutation, Humans, Child, Precursor B-Cell Lymphoblastic Leukemia-Lymphoma/genetics, Precursor B-Cell Lymphoblastic Leukemia-Lymphoma/epidemiology, Precursor T-Cell Lymphoblastic Leukemia-Lymphoma/genetics, T-Lymphocytes/immunology, T-Lymphocytes/metabolism, B-Lymphocytes/immunology, B-Lymphocytes/metabolism
ABSTRACT
The various debates around model selection paradigms are important, but in the absence of a consensus there is a demonstrable need for a deeper appreciation of existing approaches, at least among the end-users of statistics and model selection tools. In the ecological literature, the Akaike information criterion (AIC) dominates model selection practice, and while it is a relatively straightforward concept, there exist what we perceive to be some common misunderstandings about its application. Two specific questions arise with surprising regularity among colleagues and students when interpreting and reporting AIC model tables. The first concerns the issue of 'pretending' variables, and specifically a muddled understanding of what this means. The second concerns p-values and what constitutes statistical support when using AIC. There is a wealth of technical literature describing AIC and the relationship between p-values and AIC differences. Here, we complement this technical treatment and use simulation to develop intuition around these important concepts, aiming to promote better statistical practice in using, interpreting and reporting models selected with AIC.
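One piece of intuition linking the two ideas: AIC = 2k - 2 log L, so for two nested models differing by a single parameter the larger model attains the lower AIC exactly when the likelihood-ratio statistic exceeds 2, which against a chi-squared reference with one degree of freedom corresponds to p of roughly 0.157, a far weaker criterion than p < 0.05 and one reason an uninformative 'pretending' variable can sit within 2 AIC units of the best model:

```latex
\Delta\mathrm{AIC}
  = \mathrm{AIC}_{\text{larger}} - \mathrm{AIC}_{\text{smaller}}
  = 2 - \underbrace{2\left(\log L_{\text{larger}} - \log L_{\text{smaller}}\right)}_{\text{LRT statistic}},
\qquad
\Pr\!\left(\chi^{2}_{1} > 2\right) \approx 0.157 .
```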