ABSTRACT
Replicability is the cornerstone of modern scientific research. Reliable identifications of genotype-phenotype associations that are significant in multiple genome-wide association studies (GWASs) provide stronger evidence for the findings. Current replicability analysis relies on the independence assumption among single-nucleotide polymorphisms (SNPs) and ignores the linkage disequilibrium (LD) structure. We show that such a strategy may produce either overly liberal or overly conservative results in practice. We develop an efficient method, ReAD, to detect replicable SNPs associated with the phenotype from two GWASs accounting for the LD structure. The local dependence structure of SNPs across two heterogeneous studies is captured by a four-state hidden Markov model (HMM) built on two sequences of p values. By incorporating information from adjacent locations via the HMM, our approach provides more accurate SNP significance rankings. ReAD is scalable, platform independent, and more powerful than existing replicability analysis methods with effective false discovery rate control. Through analysis of datasets from two asthma GWASs and two ulcerative colitis GWASs, we show that ReAD can identify replicable genetic loci that existing methods might otherwise miss.
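The four-state HMM idea can be sketched compactly. The following toy implementation is an illustration of that general construction, not ReAD itself: the Beta-shaped alternative density, the stickiness parameter, and all variable names are assumptions chosen for demonstration.

```python
# Hidden state s_i encodes the joint association status of SNP i in two studies:
# 0=(null,null), 1=(null,assoc), 2=(assoc,null), 3=(assoc,assoc);
# only state 3 is "replicable". Emissions: p-values ~ Uniform(0,1) under the
# null, Beta(a,1) under association (density a*p^(a-1), concentrated near 0).

A_SHAPE = 0.2          # Beta(a,1) shape for the associated component (assumed)
STICKY = 0.9           # self-transition probability, mimicking LD persistence

def beta_pdf(p):
    return A_SHAPE * p ** (A_SHAPE - 1.0)

def emission(state, p1, p2):
    d1 = beta_pdf(p1) if state in (2, 3) else 1.0   # study 1 associated?
    d2 = beta_pdf(p2) if state in (1, 3) else 1.0   # study 2 associated?
    return d1 * d2

def transition(i, j):
    return STICKY if i == j else (1.0 - STICKY) / 3.0

def posterior_states(p1s, p2s):
    """Scaled forward-backward; returns per-SNP posteriors over the 4 states."""
    n, K = len(p1s), 4
    fwd, scale = [], []
    prev = [0.25 * emission(k, p1s[0], p2s[0]) for k in range(K)]
    c = sum(prev); fwd.append([v / c for v in prev]); scale.append(c)
    for t in range(1, n):
        cur = [sum(fwd[-1][i] * transition(i, k) for i in range(K))
               * emission(k, p1s[t], p2s[t]) for k in range(K)]
        c = sum(cur); fwd.append([v / c for v in cur]); scale.append(c)
    bwd = [[1.0] * K]
    for t in range(n - 2, -1, -1):
        cur = [sum(transition(k, j) * emission(j, p1s[t + 1], p2s[t + 1])
                   * bwd[0][j] for j in range(K)) / scale[t + 1]
               for k in range(K)]
        bwd.insert(0, cur)
    post = []
    for t in range(n):
        raw = [fwd[t][k] * bwd[t][k] for k in range(K)]
        z = sum(raw); post.append([v / z for v in raw])
    return post

# Toy example: SNPs 10-19 are associated in both studies (small p-values).
p1 = [0.6] * 10 + [1e-4] * 10 + [0.5] * 10
p2 = [0.4] * 10 + [5e-4] * 10 + [0.7] * 10
post = posterior_states(p1, p2)
```

Ranking SNPs by the posterior probability of the doubly-associated state, and estimating the FDR of a rejection set by averaging one minus that posterior over the set, is the standard way such HMM output is turned into a replicability decision rule.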
Subject(s)
Asthma, Genome-Wide Association Study, Linkage Disequilibrium, Polymorphism, Single Nucleotide, Genome-Wide Association Study/methods, Humans, Asthma/genetics, Markov Chains, Colitis, Ulcerative/genetics, Reproducibility of Results, Phenotype, Genotype
ABSTRACT
We discuss a relatively new meta-scientific research design: many-analyst studies that attempt to assess the replicability and credibility of research based on large-scale observational data. In these studies, a large number of analysts try to answer the same research question using the same data. The key idea is that the greater the variation in results, the greater the uncertainty in answering the research question and, accordingly, the lower the credibility of any individual research finding. Compared to individual replications, the large crowd of analysts allows for a more systematic investigation of uncertainty and its sources. However, many-analyst studies are also resource-intensive, and there are some doubts about their potential to provide credible assessments. We identify three issues that any many-analyst study must address: 1) identifying the source of variation in the results; 2) providing an incentive structure similar to that of standard research; and 3) conducting a proper meta-analysis of the results. We argue that some recent many-analyst studies have failed to address these issues satisfactorily and have therefore provided an overly pessimistic assessment of the credibility of science. We also provide some concrete guidance on how future many-analyst studies could provide a more constructive assessment.
ABSTRACT
BACKGROUND: Single-cell transcriptome sequencing (scRNA-Seq) has allowed new types of investigations at unprecedented levels of resolution. Among the primary goals of scRNA-Seq is the classification of cells into distinct types. Many approaches build on existing clustering literature to develop tools specific to single-cell data. However, almost all of these methods rely on heuristics or user-supplied parameters to control the number of clusters. This affects both the resolution of the clusters within the original dataset and their replicability across datasets. While many recommendations exist, in general there is little assurance that any given set of parameters represents an optimal choice in the trade-off between cluster resolution and replicability. For instance, another set of parameters may result in more clusters that are also more replicable. RESULTS: Here, we propose Dune, a new method for optimizing the trade-off between the resolution of clusters and their replicability. Our method takes as input a set of clustering results, or partitions, on a single dataset and iteratively merges clusters within each partition in order to maximize their concordance between partitions. As demonstrated on multiple datasets from different platforms, Dune outperforms existing techniques that rely on hierarchical merging to reduce the number of clusters, both in terms of replicability of the resultant merged clusters and in concordance with ground truth. Dune is available as an R package on Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/Dune.html. CONCLUSIONS: Cluster refinement by Dune helps improve the robustness of any clustering analysis and reduces reliance on tuning parameters. This method provides an objective approach for borrowing information across multiple clusterings to generate replicable clusters most likely to represent common biological features across multiple datasets.
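The core merging idea can be illustrated with concordance measured by the Adjusted Rand Index. This is a minimal sketch, not Dune's actual algorithm: the greedy search, the stopping rule, and all function names are assumptions for demonstration.

```python
from collections import Counter
from math import comb

def ari(a, b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(a)
    pairs = Counter(zip(a, b))
    sum_comb = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_comb - expected) / (max_index - expected)

def greedy_merge(partitions, min_clusters=2):
    """Greedily merge a cluster pair (within one partition) whenever doing so
    increases the mean pairwise ARI across partitions; stop at a local optimum."""
    parts = [list(p) for p in partitions]

    def mean_ari(ps):
        m = len(ps)
        vals = [ari(ps[i], ps[j]) for i in range(m) for j in range(i + 1, m)]
        return sum(vals) / len(vals)

    improved = True
    while improved:
        improved = False
        base = mean_ari(parts)
        best = None
        for pi, p in enumerate(parts):
            labels = sorted(set(p))
            if len(labels) <= min_clusters:
                continue
            for x in labels:
                for y in labels:
                    if x >= y:
                        continue
                    trial = [y if v == x else v for v in p]  # merge x into y
                    score = mean_ari(parts[:pi] + [trial] + parts[pi + 1:])
                    if best is None or score > best[0]:
                        best = (score, pi, trial)
        if best and best[0] > base + 1e-12:
            parts[best[1]] = best[2]
            improved = True
    return parts
```

For example, if one partition over-splits a cluster that another partition keeps whole (e.g. `[0,0,0,1,1,1]` vs `[0,0,0,1,1,2]`), the greedy pass merges clusters 1 and 2 of the second partition, raising the pairwise ARI to 1.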
Subject(s)
RNA-Seq, Single-Cell Analysis, Software, Single-Cell Analysis/methods, RNA-Seq/methods, Cluster Analysis, Algorithms, Sequence Analysis, RNA/methods, Humans, Transcriptome/genetics, Reproducibility of Results, Gene Expression Profiling/methods, Single-Cell Gene Expression Analysis
ABSTRACT
In biomedical research, the replicability of findings across studies is highly desired. In this study, we focus on cancer omics data, for which the examination of replicability has mostly concerned important omics variables identified in different studies. In the published literature, although there has been extensive attention and ad hoc discussion, there is insufficient quantitative research looking into replicability measures and their properties. The goal of this study is to fill this important knowledge gap. In particular, we consider three sensible replicability measures, for which we examine distributional properties and develop a way of making inference. Applying them to three The Cancer Genome Atlas (TCGA) datasets reveals generally low replicability and significant across-data variation. To further comprehend these findings, we resort to simulation, which confirms the validity of the findings with the TCGA data and further informs the dependence of replicability on signal level (or, equivalently, sample size). Overall, this study can advance our understanding of replicability for cancer omics and other studies that have identification as a key goal.
Subject(s)
Biomedical Research, Neoplasms, Humans, Neoplasms/genetics, Sample Size
ABSTRACT
Performance in tests of various cognitive abilities has often been compared, both within and between species. In intraspecific comparisons, habitat effects on cognition have been a popular topic, frequently with an underlying assumption that urban animals should perform better than their rural conspecifics. In this study, we tested problem-solving ability in great tits Parus major, in a string-pulling and a plug-opening test. Our aim was to compare performance between urban and rural great tits, and to compare their performance with previously published problem-solving studies. Our great tits performed better in string-pulling than their conspecifics in previous studies (solving success: 54%), and better than their close relative, the mountain chickadee Poecile gambeli, in the plug-opening test (solving success: 70%). Solving latency became shorter over four repeated sessions, indicating learning abilities, and showed among-individual correlation between the two tests. However, solving ability did not differ between habitat types in either test. Somewhat unexpectedly, we found marked differences between study years even though we tried to keep conditions identical. These were probably due to small changes to the experimental protocol between years, for example the unavoidable changes of observers and changes in the size and material of test devices. This has an important implication: if small changes in an otherwise identical set-up can have strong effects, meaningful comparisons of cognitive performance between different labs must be extremely hard. In a wider perspective, this highlights the replicability problem often present in animal behaviour studies.
Subject(s)
Problem Solving, Animals, Male, Female, Ecosystem, Passeriformes/physiology
ABSTRACT
The two-trials rule for drug approval requires "at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness." This is usually implemented by requiring two significant pivotal trials and is the standard regulatory requirement to provide evidence for a new drug's efficacy. However, there is a need to develop suitable alternatives to this rule for a number of reasons, including the possible availability of data from more than two trials. I consider the case of up to three studies and stress the importance of controlling the partial Type-I error rate, where only some studies have a true null effect, while maintaining the overall Type-I error rate of the two-trials rule, where all studies have a null effect. Some less-known P-value combination methods are useful to achieve this: Pearson's method, Edgington's method, and the recently proposed harmonic mean χ²-test. I study their properties and discuss how they can be extended to a sequential assessment of success while still ensuring overall Type-I error control. I compare the different methods in terms of partial Type-I error rate, project power, and the expected number of studies required. Edgington's method is eventually recommended, as it is easy to implement and communicate, has only moderate partial Type-I error rate inflation, and substantially increases project power.
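Edgington's method has a simple closed form: the combined p-value is the Irwin-Hall CDF of the sum of the individual p-values. A minimal sketch follows, with Fisher's method included purely for contrast (the harmonic mean χ² test discussed in the abstract is not reproduced here):

```python
from math import comb, factorial, log, exp, floor

def edgington(pvalues):
    """Edgington's combined p-value: P(sum of n iid U(0,1) <= S),
    the Irwin-Hall CDF evaluated at S = sum of the observed p-values."""
    n, s = len(pvalues), sum(pvalues)
    return sum((-1) ** k * comb(n, k) * (s - k) ** n
               for k in range(floor(s) + 1)) / factorial(n)

def fisher(pvalues):
    """Fisher's method, for contrast: -2*sum(log p) ~ chi-square with 2n df
    under the global null (survival function in closed form for even df)."""
    n = len(pvalues)
    x = -2.0 * sum(log(p) for p in pvalues)
    return exp(-x / 2.0) * sum((x / 2.0) ** k / factorial(k) for k in range(n))

# Two studies each significant at one-sided 0.025 (the two-trials benchmark):
print(edgington([0.025, 0.025]))  # ≈ 0.00125 (= 0.05**2 / 2)
```

Note how the combined evidence from two borderline trials maps to a single, much smaller p-value, which is the quantity whose behavior under partial nulls the article scrutinizes.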
Subject(s)
Drug Approval, Humans, Clinical Trials as Topic/economics, Models, Statistical, Research Design
ABSTRACT
PURPOSE: Our objective is to describe how the U.S. Food and Drug Administration (FDA)'s Sentinel System implements best practices to ensure trust in drug safety studies using real-world data from disparate sources. METHODS: We present a stepwise schematic for Sentinel's data harmonization, data quality check, query design and implementation, and reporting practices, and describe approaches to enhancing the transparency, reproducibility, and replicability of studies at each step. CONCLUSIONS: Each Sentinel data partner converts its source data into the Sentinel Common Data Model. The transformed data undergoes rigorous quality checks before it can be used for Sentinel queries. The Sentinel Common Data Model framework, data transformation codes for several data sources, and data quality assurance packages are publicly available. Designed to run against the Sentinel Common Data Model, Sentinel's querying system comprises a suite of pre-tested, parametrizable computer programs that allow users to perform sophisticated descriptive and inferential analysis without having to exchange individual-level data across sites. Detailed documentation of capabilities of the programs as well as the codes and information required to execute them are publicly available on the Sentinel website. Sentinel also provides public trainings and online resources to facilitate use of its data model and querying system. Its study specifications conform to established reporting frameworks aimed at facilitating reproducibility and replicability of real-world data studies. Reports from Sentinel queries and associated design and analytic specifications are available for download on the Sentinel website. Sentinel is an example of how real-world data can be used to generate regulatory-grade evidence at scale using a transparent, reproducible, and replicable process.
Subject(s)
Pharmacoepidemiology, United States Food and Drug Administration, Pharmacoepidemiology/methods, Reproducibility of Results, United States Food and Drug Administration/standards, Humans, United States, Data Accuracy, Adverse Drug Reaction Reporting Systems/statistics & numerical data, Adverse Drug Reaction Reporting Systems/standards, Drug-Related Side Effects and Adverse Reactions/epidemiology, Databases, Factual/standards, Research Design/standards
ABSTRACT
PURPOSE: To assess the validity of privacy-preserving synthetic data by comparing results from synthetic versus original EHR data analysis. METHODS: A published retrospective cohort study on the real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between the synthetic and original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection, and hospitalization due to infection, and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were utilized: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and the Wald test. Synthetic data were generated five times to assess the stability of results. RESULTS: The distributions of demographic and clinical characteristics demonstrated very small differences (< 0.01 SMD). In the comparison of vaccine effectiveness, assessed as relative risk reduction, between synthetic and original data, there was 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between the respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in the estimate and decision metrics in some subgroups and replicates. CONCLUSIONS: Overall, the comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency on the process used to generate high-fidelity synthetic data and assurances of patient privacy are warranted.
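Two of the comparison metrics are easy to compute directly. This is a sketch under assumed definitions (one common SMD variant, and CI overlap measured relative to the original interval's width); the study's exact formulas may differ.

```python
from math import sqrt

def smd(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with a pooled standard deviation
    (one common variant)."""
    pooled = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled

def ci_overlap(lo1, hi1, lo2, hi2):
    """Percent overlap of two confidence intervals, relative to the width of
    the first (original-data) interval -- an illustrative definition."""
    inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    return 100.0 * inter / (hi1 - lo1)

# e.g. original-data CI (0.87, 0.96) vs synthetic-data CI (0.88, 0.97)
print(ci_overlap(0.87, 0.96, 0.88, 0.97))  # ~88.9% overlap
```

The example numbers are hypothetical, chosen only to land near the lower end of the overlap range reported in the abstract.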
Subject(s)
COVID-19 Vaccines, COVID-19, Electronic Health Records, Humans, COVID-19/prevention & control, COVID-19/epidemiology, COVID-19 Vaccines/administration & dosage, Israel/epidemiology, Retrospective Studies, Male, Female, Vaccine Efficacy, Middle Aged, Hospitalization/statistics & numerical data, Reproducibility of Results, Adult, Aged, Privacy, Cohort Studies
ABSTRACT
Grouping/read-across is widely used for predicting the toxicity of data-poor target substance(s) using data-rich source substance(s). While the chemical industry and the regulators recognise its benefits, registration dossiers are often rejected due to weak analogue/category justifications based largely on the structural similarity of source and target substances. Here we demonstrate how multi-omics measurements can improve confidence in grouping via a statistical assessment of the similarity of molecular effects. Six azo dyes provided a pool of potential source substances to predict long-term toxicity to aquatic invertebrates (Daphnia magna) for the dye Disperse Yellow 3 (DY3) as the target substance. First, we assessed the structural similarities of the dyes, generating a grouping hypothesis with DY3 and two Sudan dyes within one group. Daphnia magna were exposed acutely to equi-effective doses of all seven dyes (each at 3 doses and 3 time points), transcriptomics and metabolomics data were generated from 760 samples. Multi-omics bioactivity profile-based grouping uniquely revealed that Sudan 1 (S1) is the most suitable analogue for read-across to DY3. Mapping ToxPrint structural fingerprints of the dyes onto the bioactivity profile-based grouping indicated an aromatic alcohol moiety could be responsible for this bioactivity similarity. The long-term reproductive toxicity to aquatic invertebrates of DY3 was predicted from S1 (21-day NOEC, 40 µg/L). This prediction was confirmed experimentally by measuring the toxicity of DY3 in D. magna. While limitations of this 'omics approach are identified, the study illustrates an effective statistical approach for building chemical groups.
Subject(s)
Azo Compounds, Coloring Agents, Daphnia, Water Pollutants, Chemical, Daphnia/drug effects, Animals, Azo Compounds/toxicity, Azo Compounds/chemistry, Coloring Agents/toxicity, Water Pollutants, Chemical/toxicity, Metabolomics, Toxicity Tests/methods, Transcriptome/drug effects, Daphnia magna, Multiomics
ABSTRACT
Replicability takes on special meaning when researching phenomena that are embedded in space and time, including phenomena distributed on the surface and near surface of the Earth. Two principles, spatial dependence and spatial heterogeneity, are generally characteristic of such phenomena. Various practices have evolved in dealing with spatial heterogeneity, including the use of place-based models. We review the rapidly emerging applications of artificial intelligence to phenomena distributed in space and time and speculate on how the principle of spatial heterogeneity might be addressed. We introduce a concept of weak replicability and discuss possible approaches to its measurement.
ABSTRACT
Collaborative efforts to directly replicate empirical studies in the medical and social sciences have revealed alarmingly low rates of replicability, a phenomenon dubbed the 'replication crisis'. Poor replicability has spurred cultural changes targeted at improving reliability in these disciplines. Given the absence of equivalent replication projects in ecology and evolutionary biology, two inter-related indicators offer the opportunity to retrospectively assess replicability: publication bias and statistical power. This registered report assesses the prevalence and severity of small-study effects (i.e., smaller studies reporting larger effect sizes) and decline effects (i.e., effect sizes decreasing over time) across ecology and evolutionary biology using 87 meta-analyses comprising 4,250 primary studies and 17,638 effect sizes. Further, we estimate how publication bias might distort the estimation of effect sizes, statistical power, and errors in magnitude (Type M, or exaggeration ratio) and sign (Type S). We show strong evidence for the pervasiveness of both small-study and decline effects in ecology and evolution. There was widespread prevalence of publication bias that resulted in meta-analytic means being over-estimated by (at least) 0.12 standard deviations. The prevalence of publication bias distorted confidence in meta-analytic results, with 66% of initially statistically significant meta-analytic means becoming non-significant after correcting for publication bias. Ecological and evolutionary studies consistently had low statistical power (15%) with a 4-fold exaggeration of effects on average (Type M error rate = 4.4). Notably, publication bias reduced power from 23% to 15% and increased Type M error rates from 2.7 to 4.4, because it creates a non-random sample of effect size evidence. The rate of sign errors in effect sizes (Type S error) increased from 5% to 8% because of publication bias.
Our research provides clear evidence that many published ecological and evolutionary findings are inflated. Our results highlight the importance of designing high-power empirical studies (e.g., via collaborative team science), promoting and encouraging replication studies, testing and correcting for publication bias in meta-analyses, and adopting open and transparent research practices, such as (pre)registration, data- and code-sharing, and transparent reporting.
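The power, Type M, and Type S quantities above can be reproduced for any assumed true effect and standard error. The following is a Monte Carlo sketch after Gelman and Carlin's "retrodesign" idea; the effect size, SE, and simulation settings are illustrative assumptions, not the study's values.

```python
import random

def retrodesign(true_effect, se, sims=200_000, seed=1):
    """Monte Carlo estimate of power, Type S error rate (wrong sign among
    significant results), and Type M error (average exaggeration among
    significant results), assuming a normal sampling model."""
    rng = random.Random(seed)
    z = 1.959963984540054          # two-sided 5% critical value
    sig, wrong_sign, mags = 0, 0, []
    for _ in range(sims):
        est = rng.gauss(true_effect, se)
        if abs(est) > z * se:       # "statistically significant" replicate
            sig += 1
            wrong_sign += (est * true_effect < 0)
            mags.append(abs(est))
    power = sig / sims
    type_s = wrong_sign / sig
    type_m = (sum(mags) / sig) / abs(true_effect)
    return power, type_s, type_m

# A low-power scenario (effect equal to its SE) gives roughly 17% power and
# a Type M exaggeration around 2.5 -- the same qualitative pattern the
# abstract reports for ecology and evolution.
power, type_s, type_m = retrodesign(true_effect=0.5, se=0.5)
```

The design lesson is visible directly: conditioning on significance in a low-power setting inflates the magnitude of published effects and occasionally flips their sign.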
Subject(s)
Biology, Bias, Publication Bias, Reproducibility of Results, Retrospective Studies, Meta-Analysis as Topic
ABSTRACT
Comparative simulation studies are workhorse tools for benchmarking statistical methods. As with other empirical studies, the success of simulation studies hinges on the quality of their design, execution, and reporting. If not conducted carefully and transparently, their conclusions may be misleading. In this paper, we discuss various questionable research practices, which may impact the validity of simulation studies, some of which cannot be detected or prevented by the current publication process in statistics journals. To illustrate our point, we invent a novel prediction method with no expected performance gain and benchmark it in a preregistered comparative simulation study. We show how easy it is to make the method appear superior over well-established competitor methods if questionable research practices are employed. Finally, we provide concrete suggestions for researchers, reviewers, and other academic stakeholders for improving the methodological quality of comparative simulation studies, such as preregistering simulation protocols, incentivizing neutral simulation studies, and code and data sharing.
Subject(s)
Benchmarking, Computer Simulation
ABSTRACT
For many problems in clinical practice, multiple treatment alternatives are available. Given data from a randomized controlled trial or an observational study, an important challenge is to estimate an optimal decision rule that specifies for each client the most effective treatment alternative, given his or her pattern of pretreatment characteristics. In the present paper we look for such a rule within the insightful family of classification trees. Unfortunately, however, there is a dearth of readily accessible software tools for optimal decision tree estimation in the case of more than two treatment alternatives. Moreover, this primary tree estimation problem is also cursed with two secondary problems: a structural missingness in typical studies on treatment evaluation (because every individual is assigned to a single treatment alternative only), and a major issue of replicability. In this paper we propose solutions for both the primary and the secondary problems at stake. We evaluate the proposed solution in a simulation study, and illustrate with an application on the search for an optimal tree-based treatment regime in a randomized controlled trial on K = 3 different types of aftercare for younger women with early-stage breast cancer. We conclude by arguing that the proposed solutions may have relevance for several other classification problems inside and outside the domain of optimal treatment assignment.
ABSTRACT
The construct of personal control is crucial for understanding a variety of human behaviors. Perceived lack of control affects performance and psychological well-being in diverse contexts - educational, organizational, clinical, and social. Thus, it is important to know to what extent we can rely on the established experimental manipulations of (lack of) control. In this article, we examine the construct validity of recall-based manipulations of control (or lack thereof). Using existing datasets (Study 1a and 1b: N = 627 and N = 454, respectively) we performed content-based analyses of control experiences induced by two different procedures (free recall and positive events recall). The results indicate low comparability between high and low control conditions in terms of the emotionality of a recalled event, the domain and sphere of control, amongst other differences. In an experimental study that included three types of recall-based control manipulations (Study 2: N = 506), we found that the conditions differed not only in emotionality but also in a generalized sense of control. This suggests that different aspects of personal control can be activated, and other constructs evoked, depending on the experimental procedure. We discuss potential sources of variability between control manipulation procedures and propose improvements in practices when using experimental manipulations of sense of control and other psychological constructs.
Subject(s)
Emotions, Mental Recall, Humans, Mental Recall/physiology, Male, Female, Adult, Emotions/physiology, Young Adult, Reproducibility of Results, Self-Control/psychology, Adolescent, Middle Aged
ABSTRACT
In the last decade, scientists investigating human social cognition have started bringing traditional laboratory paradigms more "into the wild" to examine how socio-cognitive mechanisms of the human brain work in real-life settings. As this implies transferring 2D observational paradigms to 3D interactive environments, there is a risk of compromising experimental control. In this context, we propose a methodological approach which uses humanoid robots as proxies of social interaction partners and embeds them in experimental protocols that adapt classical paradigms of cognitive psychology to interactive scenarios. This allows for a relatively high degree of "naturalness" of interaction and excellent experimental control at the same time. Here, we present two case studies where our methods and tools were applied and replicated across two different laboratories, namely the Italian Institute of Technology in Genova (Italy) and the Agency for Science, Technology and Research in Singapore. In the first case study, we present a replication of an interactive version of a gaze-cueing paradigm reported in Kompatsiari et al. (J Exp Psychol Gen 151(1):121-136, 2022). The second case study presents a replication of a "shared experience" paradigm reported in Marchesi et al. (Technol Mind Behav 3(3):11, 2022). As both studies replicate results across labs and different cultures, we argue that our methods allow for reliable and replicable setups, even though the protocols are complex and involve social interaction. We conclude that our approach can be of benefit to the research field of social cognition and grant higher replicability, for example, in cross-cultural comparisons of social cognition mechanisms.
Subject(s)
Social Cognition, Social Interaction, Humans, Robotics/methods, Male, Italy, Cues, Female, Adult, Cognition/physiology, Interpersonal Relations
ABSTRACT
The Think/No-Think (TNT) task has just celebrated 20 years since its inception, and its use has been growing as a tool to investigate the mechanisms underlying memory control and its neural underpinnings. Here, we present a theoretical and practical guide for designing, implementing, and running TNT studies. For this purpose, we provide a step-by-step description of the structure of the TNT task, methodological choices that can be made, parameters that can be chosen, instruments available, aspects to be aware of, systematic information about how to run a study and analyze the data. Importantly, we provide a TNT training package (as Supplementary Material), that is, a series of multimedia materials (e.g., tutorial videos, informative HTML pages, MATLAB code to run experiments, questionnaires, scoring sheets, etc.) to complement this method paper and facilitate a deeper understanding of the TNT task, its rationale, and how to set it up in practice. Given the recent discussion about the replication crisis in the behavioral sciences, we hope that this contribution will increase standardization, reliability, and replicability across laboratories.
Subject(s)
Thinking, Humans, Thinking/physiology, Memory/physiology, Reproducibility of Results
ABSTRACT
Complex systems pose significant challenges to traditional scientific and statistical methods due to their inherent unpredictability and resistance to simplification. Accurately detecting complex behavior, and the uncertainty that comes with it, is therefore essential. Building on previous studies, we introduce a new information-theoretic measure, termed "incoherence". Using an adapted Jensen-Shannon divergence across an ensemble of outcomes, it quantifies the aleatoric uncertainty of the system. We first compare this measure to established statistical tests using both continuous and discrete data, before demonstrating how incoherence can be applied to identify key characteristics of complex systems, including sensitivity to initial conditions, criticality, and response to perturbations.
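The underlying ingredient, the Jensen-Shannon divergence of an ensemble of outcome distributions, is standard and easy to compute; "incoherence" itself is the paper's adaptation and is not reproduced here. A minimal sketch over discrete distributions:

```python
from math import log2

def jsd(dists, weights=None):
    """Jensen-Shannon divergence (base 2) of an ensemble of discrete
    distributions, given as dicts mapping outcome -> probability;
    0 for identical members, 1 for fully disjoint ones."""
    m = len(dists)
    weights = weights or [1.0 / m] * m
    keys = set().union(*(d.keys() for d in dists))
    mix = {k: sum(w * d.get(k, 0.0) for w, d in zip(weights, dists))
           for k in keys}

    def entropy(d):
        return -sum(p * log2(p) for p in d.values() if p > 0)

    # entropy of the mixture minus mean entropy of the members
    return entropy(mix) - sum(w * entropy(d) for w, d in zip(weights, dists))

# Identical ensemble members -> no aleatoric spread; disjoint ones -> maximal.
same = [{"a": 0.5, "b": 0.5}] * 3
disjoint = [{"a": 1.0}, {"b": 1.0}]
print(jsd(same), jsd(disjoint))  # ~0.0 and 1.0 (up to float rounding)
```

An ensemble here would be the outcome distributions from repeated runs of the same system, so a larger divergence signals genuinely irreducible (aleatoric) variability rather than estimation noise.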
ABSTRACT
Widespread failures of replication and generalization are, ironically, a scientific triumph, in that they confirm the fundamental metascientific theory that underlies our field. Generalizable and replicable findings require testing large numbers of subjects from a wide range of demographics with a large, randomly-sampled stimulus set, and using a variety of experimental parameters. Because few studies accomplish any of this, meta-scientists predict that findings will frequently fail to replicate or generalize. We argue that to be more robust and replicable, developmental psychology needs to find a mechanism for collecting data at greater scale and from more diverse populations. Luckily, this mechanism already exists: Citizen science, in which large numbers of uncompensated volunteers provide data. While best-known for its contributions to astronomy and ecology, citizen science has also produced major findings in neuroscience and psychology, and increasingly in developmental psychology. We provide examples, address practical challenges, discuss limitations, and compare to other methods of obtaining large datasets. Ultimately, we argue that the range of studies where it makes sense *not* to use citizen science is steadily dwindling.
ABSTRACT
The fragility index has been increasingly used to assess the robustness of the results of clinical trials since 2014. It aims at finding the smallest number of event changes that could alter originally statistically significant results. Despite its popularity, some researchers have expressed several concerns about the validity and usefulness of the fragility index. This article offers a comprehensive review of the fragility index's rationale, calculation, software, and interpretation, with emphasis on application to studies in obstetrics and gynecology. It presents the fragility index in the settings of individual clinical trials, standard pairwise meta-analyses, and network meta-analyses. Moreover, it provides worked examples to demonstrate how the fragility index can be appropriately calculated and interpreted. In addition, the limitations of the traditional fragility index and some solutions proposed in the literature to address these limitations are reviewed. In summary, the fragility index is recommended as a supplemental measure in the reporting of clinical trials and as a tool to communicate the robustness of trial results to clinicians. Other considerations that can aid in the fragility index's interpretation include the loss to follow-up and the likelihood of data modifications that achieve the loss of statistical significance.
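The calculation itself is a short loop: flip events in the arm with fewer events until the significance test crosses 0.05. A worked sketch follows, using Fisher's exact test (one common choice; published fragility index analyses vary in the test used), implemented from the standard library.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]
    (events / non-events in two arms), summing all tables whose probability
    does not exceed that of the observed table."""
    n, r1, c1 = a + b + c + d, a + b, a + c

    def prob(x):  # hypergeometric probability of x events in arm 1
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    return min(1.0, sum(prob(x) for x in range(lo, hi + 1)
                        if prob(x) <= p_obs * (1 + 1e-9)))

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Smallest number of event flips in the arm with fewer events that makes
    the Fisher test non-significant; 0 if already non-significant."""
    fi = 0
    a, c = e1, e2  # events in arm 1 / arm 2
    while fisher_two_sided(a, n1 - a, c, n2 - c) < alpha:
        if a <= c and a < n1:
            a += 1   # convert a non-event to an event in the low-event arm
        elif c < n2:
            c += 1
        else:
            break    # no further modifications possible
        fi += 1
    return fi
```

For example, a trial with 0/100 events versus 20/100 events is highly significant, and roughly ten event flips are needed before significance is lost; a fragility index that small relative to the sample size is exactly the kind of robustness signal the article recommends reporting.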
Subject(s)
Probability, Humans, Network Meta-Analysis, Meta-Analysis as Topic, Clinical Trials as Topic
ABSTRACT
Cancer drug development is hindered by high clinical attrition rates, which are blamed on the weak predictive power of preclinical models and the limited replicability of preclinical findings. However, the technically feasible level of replicability remains unknown. To fill this gap, we conducted an analysis of data from the NCI60 cancer cell line screen (2.8 million compound/cell line experiments), which is to our knowledge the largest repository of experiments that have been repeatedly performed over decades. The findings revealed profound intra-laboratory data variability, although all experiments were executed following highly standardised protocols that avoid all known confounders of data quality. All compound/cell line combinations with > 100 independent biological replicates displayed maximum GI50 (50% growth inhibition) fold changes (highest/lowest GI50) > 5, and 70.5% displayed maximum fold changes > 1000. The highest maximum fold change was 3.16 × 10¹⁰ (lowest GI50: 7.93 × 10⁻¹⁰ µM; highest GI50: 25.0 µM). FDA-approved drugs and experimental agents displayed similar variation. Variability remained high after outlier removal, when only considering experiments that tested drugs at the same concentration range, and when only considering NCI60-provided quality-controlled data. In conclusion, high variability is an intrinsic feature of anti-cancer drug testing, even among standardised experiments in a world-leading research environment. Awareness of this inherent variability will support realistic data interpretation and inspire research to improve data robustness. Further research will have to show whether the inclusion of a wider variety of model systems, such as animal and/or patient-derived models, may improve data robustness.
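The headline fold change can be checked directly from the two GI50 values given in the abstract (the small discrepancy from the reported 3.16 × 10¹⁰ reflects rounding of the published GI50s):

```python
# Maximum fold change = highest GI50 / lowest GI50 for one
# compound/cell line combination, values taken from the abstract.
lowest_gi50 = 7.93e-10   # µM
highest_gi50 = 25.0      # µM
fold_change = highest_gi50 / lowest_gi50
print(f"{fold_change:.2e}")  # ≈ 3.15e+10
```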