Results 1 - 20 of 47
1.
J Am Stat Assoc ; 119(546): 1019-1031, 2024.
Article in English | MEDLINE | ID: mdl-38974187

ABSTRACT

We introduce a simple diagnostic test for assessing the overall or partial goodness of fit of a linear causal model with errors being independent of the covariates. In particular, we consider situations where hidden confounding is potentially present. We develop a method and discuss its capability to distinguish between covariates that are confounded with the response by latent variables and those that are not. Thus, we provide a test and methodology for partial goodness of fit. The test is based on comparing a novel higher-order least squares principle with ordinary least squares. In spite of its simplicity, the proposed method is extremely general and is also proven to be valid for high-dimensional settings. Supplementary materials for this article are available online.
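The comparison of a higher-order least squares fit with ordinary least squares can be sketched in a toy setting. The sketch below assumes a single centered covariate and assumes the higher-order estimate takes the form sum(x^3 y) / sum(x^4); that form, and the function names, are illustrative choices and are not stated in the abstract. In a well-specified linear model the two slopes should agree, so a large gap between them hints at misspecification or confounding.

```python
def ols_slope(x, y):
    """Ordinary least squares slope without intercept: sum(x*y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def hols_slope(x, y):
    """Illustrative higher-order slope (an assumption): sum(x^3*y) / sum(x^4)."""
    return sum(xi ** 3 * yi for xi, yi in zip(x, y)) / sum(xi ** 4 for xi in x)


def slope_gap(x, y):
    """Absolute difference between the two slopes, used as a crude diagnostic."""
    return abs(ols_slope(x, y) - hols_slope(x, y))


# Noise-free linear data: both estimates recover the same slope.
x = [-2, -1, 0, 1, 2]
y_linear = [2 * xi for xi in x]
# A cubic relationship violates the linear model and separates the two slopes.
y_cubic = [xi ** 3 for xi in x]
gap_linear = slope_gap(x, y_linear)
gap_cubic = slope_gap(x, y_cubic)
```

This is only a point comparison; a formal test would also need the sampling distribution of the gap, which is not reproduced here.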

2.
Proc Natl Acad Sci U S A ; 121(8): e2314228121, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38363866

ABSTRACT

In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as the presence or absence of a variable or an edge. Consequently, false-positive error or false-negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false-positive and false-negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false-positive and false-negative errors. We describe model selection procedures that provide false-positive error control in our general setting, and we illustrate their utility with numerical experiments.
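For the Boolean case described at the start of this abstract (variable selection), false-positive and false-negative errors reduce to set differences. The following minimal sketch uses hypothetical variable names and a hypothetical helper `selection_errors`; it does not cover the partial-order generalization that the paper proposes.

```python
def selection_errors(true_support, estimated_support):
    """Count variables wrongly included (FP) and wrongly excluded (FN)."""
    true_set = set(true_support)
    est_set = set(estimated_support)
    false_positives = est_set - true_set  # selected but not truly active
    false_negatives = true_set - est_set  # truly active but not selected
    return len(false_positives), len(false_negatives)


# Hypothetical example: three truly active variables, four selected ones.
fp, fn = selection_errors({"x1", "x2", "x3"}, {"x2", "x3", "x4", "x5"})
```

The same function applies to graph estimation if each edge is encoded as a hashable pair such as `("a", "b")`.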

3.
EClinicalMedicine ; 62: 102124, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37588623

ABSTRACT

Background: When sepsis is detected, organ damage may have progressed to irreversible stages, leading to poor prognosis. The use of machine learning for predicting sepsis early has shown promise, however international validations are missing. Methods: This was a retrospective, observational, multi-centre cohort study. We developed and externally validated a deep learning system for the prediction of sepsis in the intensive care unit (ICU). Our analysis represents the first international, multi-centre in-ICU cohort study for sepsis prediction using deep learning to our knowledge. Our dataset contains 136,478 unique ICU admissions, representing a refined and harmonised subset of four large ICU databases comprising data collected from ICUs in the US, the Netherlands, and Switzerland between 2001 and 2016. Using the international consensus definition Sepsis-3, we derived hourly-resolved sepsis annotations, amounting to 25,694 (18.8%) patient stays with sepsis. We compared our approach to clinical baselines as well as machine learning baselines and performed an extensive internal and external statistical validation within and across databases, reporting area under the receiver-operating-characteristic curve (AUC). Findings: Averaged over sites, our model was able to predict sepsis with an AUC of 0.846 (95% confidence interval [CI], 0.841-0.852) on a held-out validation cohort internal to each site, and an AUC of 0.761 (95% CI, 0.746-0.770) when validating externally across sites. Given access to a small fine-tuning set (10% per site), the transfer to target sites was improved to an AUC of 0.807 (95% CI, 0.801-0.813). Our model raised 1.4 false alerts per true alert and detected 80% of the septic patients 3.7 h (95% CI, 3.0-4.3) prior to the onset of sepsis, opening a vital window for intervention. 
Interpretation: By monitoring clinical and laboratory measurements in a retrospective simulation of a real-time prediction scenario, a deep learning system for the detection of sepsis generalised to previously unseen ICU cohorts, internationally. Funding: This study was funded by the Personalized Health and Related Technologies (PHRT) strategic focus area of the ETH domain.

4.
Article in English | MEDLINE | ID: mdl-37502671

ABSTRACT

Technological developments now make it possible to gather large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of such data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article overviews the power and flexibility of GAMLSS in relation to some ML techniques. GAMLSS' capability to be tailored toward causality via causal regularization is also briefly discussed. This overview is illustrated via a data set from the field of LA. This article is categorized under: Application Areas > Education and Learning; Algorithmic Development > Statistics; Technologies > Machine Learning.

5.
Sci Adv ; 9(6): eade9238, 2023 02 10.
Article in English | MEDLINE | ID: mdl-36753540

ABSTRACT

Rhabdomyosarcoma (RMS) is a group of pediatric cancers with features of developing skeletal muscle. The cellular hierarchy and mechanisms leading to developmental arrest remain elusive. Here, we combined single-cell RNA sequencing, mass cytometry, and high-content imaging to resolve intratumoral heterogeneity of patient-derived primary RMS cultures. We show that the aggressive alveolar RMS (aRMS) subtype contains plastic muscle stem-like cells and cycling progenitors that drive tumor growth, and a subpopulation of differentiated cells that lost its proliferative potential and correlates with better outcomes. While chemotherapy eliminates cycling progenitors, it enriches aRMS for muscle stem-like cells. We screened for drugs hijacking aRMS toward clinically favorable subpopulations and identified a combination of RAF and MEK inhibitors that potently induces myogenic differentiation and inhibits tumor growth. Overall, our work provides insights into the developmental states underlying aRMS aggressiveness, chemoresistance, and progression and identifies the RAS pathway as a promising therapeutic target.


Subject(s)
Antineoplastic Agents , Rhabdomyosarcoma, Alveolar , Rhabdomyosarcoma , Child , Humans , Rhabdomyosarcoma, Alveolar/drug therapy , Rhabdomyosarcoma, Alveolar/genetics , Rhabdomyosarcoma, Alveolar/pathology , Rhabdomyosarcoma/drug therapy , Rhabdomyosarcoma/genetics , Rhabdomyosarcoma/pathology , Muscle, Skeletal/metabolism , Cell Differentiation , Antineoplastic Agents/therapeutic use , Cell Line, Tumor
6.
Ann Stat ; 50(3): 1320-1347, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35958884

ABSTRACT

Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting, where the measured covariates are affected by hidden confounding and propose the Doubly Debiased Lasso estimator for individual components of the regression coefficient vector. Our advocated method simultaneously corrects both the bias due to estimation of high-dimensional parameters as well as the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.

7.
Stat Comput ; 32(3): 39, 2022.
Article in English | MEDLINE | ID: mdl-35582000

ABSTRACT

Prediction models often fail if train and test data do not stem from the same distribution. Out-of-distribution (OOD) generalization to unseen, perturbed test data is a desirable but difficult-to-achieve property for prediction models and in general requires strong assumptions on the data generating process (DGP). In a causally inspired perspective on OOD generalization, the test data arise from a specific class of interventions on exogenous random variables of the DGP, called anchors. Anchor regression models, introduced by Rothenhäusler et al. (J R Stat Soc Ser B 83(2):215-246, 2021. 10.1111/rssb.12398), protect against distributional shifts in the test data by employing causal regularization. However, so far anchor regression has only been used with a squared-error loss which is inapplicable to common responses such as censored continuous or ordinal data. Here, we propose a distributional version of anchor regression which generalizes the method to potentially censored responses with at least an ordered sample space. To this end, we combine a flexible class of parametric transformation models for distributional regression with an appropriate causal regularizer under a more general notion of residuals. In an exemplary application and several simulation scenarios we demonstrate the extent to which OOD generalization is possible.

8.
Gigascience ; 12, 2022 Dec 28.
Article in English | MEDLINE | ID: mdl-37318234

ABSTRACT

OBJECTIVE: To develop a unified framework for analyzing data from 5 large publicly available intensive care unit (ICU) datasets. FINDINGS: Using 3 American (Medical Information Mart for Intensive Care III, Medical Information Mart for Intensive Care IV, electronic ICU) and 2 European (Amsterdam University Medical Center Database, High Time Resolution ICU Dataset) databases, we constructed a mapping for each database to a set of clinically relevant concepts, which are grounded in the Observational Medical Outcomes Partnership Vocabulary wherever possible. Furthermore, we performed synchronization in the units of measurement and data type representation. On top of this, we built functionality, which allows the user to download, set up, and load data from all of the 5 databases, through a unified Application Programming Interface. The resulting ricu R-package represents the computational infrastructure for handling publicly available ICU datasets, and its latest release allows the user to load 119 existing clinical concepts from the 5 data sources. CONCLUSION: The ricu R-package (available on GitHub and CRAN) is the first tool that enables users to analyze publicly available ICU datasets simultaneously (datasets are available upon request from respective owners). Such an interface saves researchers time when analyzing ICU data and helps reproducibility. We hope that ricu can become a community-wide effort, so that data harmonization is not repeated by each research group separately. One current limitation is that concepts were added on a case-to-case basis, and therefore the resulting dictionary of concepts is not comprehensive. Further work is needed to make the dictionary comprehensive.


Subject(s)
Critical Care , Intensive Care Units , Humans , Reproducibility of Results , Critical Care/methods , Databases, Factual , Data Management
9.
Cell Syst ; 13(1): 43-57.e6, 2022 01 19.
Article in English | MEDLINE | ID: mdl-34666007

ABSTRACT

We profiled the liver transcriptome, proteome, and metabolome in 347 individuals from 58 isogenic strains of the BXD mouse population across age (7 to 24 months) and diet (low or high fat) to link molecular variations to metabolic traits. Several hundred genes are affected by diet and/or age at the transcript and protein levels. Orthologs of two aging-associated genes, St7 and Ctsd, were knocked down in C. elegans, reducing longevity in wild-type and mutant long-lived strains. The multiomics data were analyzed as segregating gene networks according to each independent variable, providing causal insight into dietary and aging effects. Candidates were cross-examined in an independent diversity outbred mouse liver dataset segregating for similar diets, with ∼80%-90% of diet-related candidate genes found in common across datasets. Together, we have developed a large multiomics resource for multivariate analysis of complex traits and demonstrate a methodology for moving from observational associations to causal connections.


Subject(s)
Caenorhabditis elegans , Liver , Animals , Caenorhabditis elegans/genetics , Diet , Gene Regulatory Networks , Liver/metabolism , Mice , Transcriptome/genetics
10.
Bioinformatics ; 38(6): 1550-1559, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34927666

ABSTRACT

MOTIVATION: Signaling pathways control cellular behavior. Dysregulated pathways, for example, due to mutations that cause genes and proteins to be expressed abnormally, can lead to diseases, such as cancer. RESULTS: We introduce a novel computational approach, called Differential Causal Effects (dce), which compares normal to cancerous cells using the statistical framework of causality. The method allows detection of individual edges in a signaling pathway that are dysregulated in cancer cells, while accounting for confounding. Hence, technical artifacts have less influence on the results and dce is more likely to detect the true biological signals. We extend the approach to handle unobserved dense confounding, where each latent variable, such as batch effects or cell cycle states, affects many covariates. We show that dce outperforms competing methods on synthetic datasets and on CRISPR knockout screens. We validate its latent confounding adjustment properties on a GTEx (Genotype-Tissue Expression) dataset. Finally, in an exploratory analysis on breast cancer data from TCGA (The Cancer Genome Atlas), we recover known and discover new genes involved in breast cancer progression. AVAILABILITY AND IMPLEMENTATION: The method dce is freely available as an R package on Bioconductor (https://bioconductor.org/packages/release/bioc/html/dce.html) as well as on https://github.com/cbg-ethz/dce. The GitHub repository also contains the Snakemake workflows needed to reproduce all results presented here. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Breast Neoplasms , Software , Humans , Female , Genome , Signal Transduction
11.
Proc Natl Acad Sci U S A ; 117(42): 25963-25965, 2020 10 20.
Article in English | MEDLINE | ID: mdl-33046646
12.
Stud Health Technol Inform ; 270: 1163-1167, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570564

ABSTRACT

Sepsis is a highly heterogeneous syndrome with variable causes and outcomes. As part of the SPHN/PHRT funding program, we aim to build a highly interoperable, interconnected network for data collection, exchange and analysis of patients on intensive care units in order to predict sepsis onset and mortality earlier. All five University Hospitals, Universities, the Swiss Institute of Bioinformatics and ETH Zurich are involved in this multi-disciplinary project. With two prospective clinical observational studies, we test our infrastructure setup, gradually improve the framework, and generate relevant data for research.


Subject(s)
Sepsis , Hospitals, University , Humans , Intensive Care Units , Observational Studies as Topic , Prospective Studies , Switzerland
13.
Nat Commun ; 7: 13299, 2016 11 10.
Article in English | MEDLINE | ID: mdl-27830750

ABSTRACT

All common genome-wide association (GWA) methods rely on population structure correction, to avoid false genotype-to-phenotype associations. However, population structure correction is a stringent penalization, which also impedes identification of real associations. Using recent statistical advances, we developed a new GWA method, called Quantitative Trait Cluster Association Test (QTCAT), enabling simultaneous multi-marker associations while considering correlations between markers. With this, QTCAT overcomes the need for population structure correction and also reflects the polygenic nature of complex traits better than single-marker methods. Using simulated data, we show that QTCAT clearly outperforms linear mixed model approaches. Moreover, using QTCAT to reanalyse public human, mouse and Arabidopsis GWA data revealed nearly all known and some previously undetected associations. Following up on the most significant novel association in the Arabidopsis data allowed us to identify a so far unknown component of root growth.


Subject(s)
Chromosome Mapping/methods , Genetic Association Studies/methods , Genome-Wide Association Study/methods , Quantitative Trait Loci/genetics , Arabidopsis/genetics , Gene Frequency , Genome, Plant/genetics , Genotype , Linear Models , Phenotype , Polymorphism, Single Nucleotide , Reproducibility of Results
14.
Proc Natl Acad Sci U S A ; 113(27): 7361-8, 2016 07 05.
Article in English | MEDLINE | ID: mdl-27382150

ABSTRACT

Inferring causal effects from observational and interventional data is a highly desirable but ambitious goal. Many of the computational and statistical methods are plagued by fundamental identifiability issues, instability, and unreliable performance, especially for large-scale systems with many measured variables. We present software and provide some validation of a recently developed methodology based on an invariance principle, called invariant causal prediction (ICP). The ICP method quantifies confidence probabilities for inferring causal structures and thus leads to more reliable and confirmatory statements for causal relations and predictions of external intervention effects. We validate the ICP method and some other procedures using large-scale genome-wide gene perturbation experiments in Saccharomyces cerevisiae. The results suggest that prediction and prioritization of future experimental interventions, such as gene deletions, can be improved by using our statistical inference techniques.


Subject(s)
Models, Genetic , Statistics as Topic , Algorithms , Flow Cytometry , Gene Deletion , Saccharomyces cerevisiae , Software
15.
Bioinformatics ; 32(13): 1990-2000, 2016 07 01.
Article in English | MEDLINE | ID: mdl-27153677

ABSTRACT

MOTIVATION: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS. RESULTS: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields P-values for assessing significance of single SNPs or groups of SNPs while controlling for all other SNPs and the family wise error rate (FWER). Thus, our method tests whether or not a SNP carries any additional information about the phenotype beyond that available by all the other SNPs. This rules out spurious correlations between phenotypes and SNPs that can arise from marginal methods because the 'spuriously correlated' SNP merely happens to be correlated with the 'truly causal' SNP. In addition, the method offers a data driven approach to identifying and refining groups of SNPs that jointly contain informative signals about the phenotype. We demonstrate the value of our method by applying it to the seven diseases analyzed by the Wellcome Trust Case Control Consortium (WTCCC). We show, in particular, that our method is also capable of finding significant SNPs that were not identified in the original WTCCC study, but were replicated in other independent studies. AVAILABILITY AND IMPLEMENTATION: Reproducibility of our research is supported by the open-source Bioconductor package hierGWAS. CONTACT: peter.buehlmann@stat.math.ethz.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Cluster Analysis , Computer Simulation , Genotype , Humans , Linear Models , Phenotype , Reproducibility of Results
16.
Int J Biostat ; 12(1): 79-95, 2016 05 01.
Article in English | MEDLINE | ID: mdl-27227719

ABSTRACT

We propose a general, modular method for significance testing of groups (or clusters) of variables in a high-dimensional linear model. In the presence of high correlations among the covariables, serious identifiability problems make it indispensable to focus on detecting groups of variables rather than singletons. We propose an inference method that allows hierarchical structures to be built in. It relies on repeated sample splitting and sequential rejection, and we prove that it asymptotically controls the familywise error rate. It can be implemented on any collection of clusters and leads to improved power in comparison to more standard non-sequential rejection methods. We complement the theoretical analysis with empirical results for simulated and real data.
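As a generic illustration of sequential rejection with familywise error rate control, the sketch below implements the classical Holm step-down procedure. This is a swapped-in textbook technique, not the hierarchical procedure of the paper, and it omits sample splitting altogether; the function name `holm_reject` and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns one rejection flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection; all larger p-values are retained
    return reject


decisions = holm_reject([0.01, 0.04, 0.03], alpha=0.05)
```

Because the threshold relaxes after each rejection, the procedure rejects at least as many hypotheses as a plain Bonferroni correction at the same level.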


Subject(s)
Biostatistics/methods , Computational Biology/methods , Data Interpretation, Statistical , Models, Statistical
17.
New Phytol ; 209(1): 252-64, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26224411

ABSTRACT

Most plastid isoprenoids, including photosynthesis-related metabolites such as carotenoids and the side chain of chlorophylls, tocopherols (vitamin E), phylloquinones (vitamin K), and plastoquinones, derive from geranylgeranyl diphosphate (GGPP) synthesized by GGPP synthase (GGPPS) enzymes. Seven out of 10 functional GGPPS isozymes in Arabidopsis thaliana reside in plastids. We aimed to address the function of different GGPPS paralogues for plastid isoprenoid biosynthesis. We constructed a gene co-expression network (GCN) using GGPPS paralogues as guide genes and genes from the upstream and downstream pathways as query genes. Furthermore, knock-out and/or knock-down ggpps mutants were generated and their growth and metabolic phenotypes were analyzed. Also, interacting protein partners of GGPPS11 were searched for. Our data showed that GGPPS11, encoding the only plastid isozyme essential for plant development, functions as a hub gene among GGPPS paralogues and is required for the production of all major groups of plastid isoprenoids. Furthermore, we showed that the GGPPS11 protein physically interacts with enzymes that use GGPP for the production of carotenoids, chlorophylls, tocopherols, phylloquinone, and plastoquinone. GGPPS11 is a hub isozyme required for the production of most photosynthesis-related isoprenoids. Both gene co-expression and protein-protein interaction likely contribute to the channeling of GGPP by GGPPS11.


Subject(s)
Alkyl and Aryl Transferases/metabolism , Arabidopsis Proteins/metabolism , Arabidopsis/enzymology , Terpenes/metabolism , Alkyl and Aryl Transferases/genetics , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Carotenoids/metabolism , Chlorophyll/metabolism , Isoenzymes , Phenotype , Photosynthesis , Plastids/enzymology , Polyisoprenyl Phosphates/metabolism , Protein Interaction Mapping
18.
Neural Comput ; 27(3): 771-99, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25602767

ABSTRACT

Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Different graphs may result in different causal inference statements and different intervention distributions. To quantify such differences, we propose a (pre-)metric between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore well suited for evaluating graphs that are used for computing interventions. Instead of DAGs, it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. The SID differs significantly from the widely used structural Hamming distance and therefore constitutes a valuable additional measure. We discuss properties of this distance and provide a (reasonably) efficient implementation with software code available on the first author's home page.
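For orientation, the structural Hamming distance that the abstract contrasts with the SID can be computed from adjacency matrices. The sketch below assumes the convention that each node pair whose edge status differs (missing, extra, or reversed edge) adds one to the distance; other conventions count a reversal as two. It does not implement the SID itself, and the function name is hypothetical.

```python
def structural_hamming_distance(a, b):
    """Count node pairs whose edge status differs between two directed graphs.

    a[i][j] == 1 encodes an edge i -> j; both matrices must have the same size.
    """
    n = len(a)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Compare the edge status of the unordered pair {i, j} in both graphs.
            if (a[i][j], a[j][i]) != (b[i][j], b[j][i]):
                distance += 1
    return distance


# Graph 1: 0 -> 1 -> 2.  Graph 2: 1 -> 0, 1 -> 2, 0 -> 2.
g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
g2 = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
shd = structural_hamming_distance(g1, g2)
```

Here the pair {0, 1} differs by a reversal and the pair {0, 2} by an extra edge, while {1, 2} is identical in both graphs.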

19.
BMC Genomics ; 15: 1162, 2014 Dec 22.
Article in English | MEDLINE | ID: mdl-25534632

ABSTRACT

BACKGROUND: Large-scale RNAi screening has become an important technology for identifying genes involved in biological processes of interest. However, the quality of large-scale RNAi screening is often degraded by off-target effects. In order to find statistically significant effector genes for pathogen entry, we systematically analyzed entry pathways in human host cells for eight pathogens using image-based kinome-wide siRNA screens with siRNAs from three vendors. We propose a Parallel Mixed Model (PMM) approach that simultaneously analyzes several non-identical screens performed with the same RNAi libraries. RESULTS: We show that PMM gains statistical power for hit detection due to parallel screening. PMM allows incorporating siRNA weights that can be assigned according to available information on RNAi quality. Moreover, PMM is able to estimate a sharedness score that can be used to focus follow-up efforts on generic or specific gene regulators. By fitting a PMM to our data, we found several novel hit genes for most of the pathogens studied. CONCLUSIONS: Our results show that parallel RNAi screening can improve the results of individual screens. This is particularly interesting now that large-scale parallel datasets are becoming more and more publicly available. Our comprehensive siRNA dataset provides a public, freely available resource for further statistical and biological analyses in the high-content, high-throughput siRNA screening field.


Subject(s)
Genomics/methods , RNA Interference , RNA, Small Interfering/genetics , Cell Line , Gene Library , Genomics/standards , High-Throughput Screening Assays , Host-Pathogen Interactions/genetics , Humans , ROC Curve , Reproducibility of Results
20.
Mol Cell Proteomics ; 13(2): 666-77, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24255132

ABSTRACT

A major goal in proteomics is the comprehensive and accurate description of a proteome. This task includes not only the identification of proteins in a sample, but also the accurate quantification of their abundance. Although mass spectrometry typically provides information on peptide identity and abundance in a sample, it does not directly measure the concentration of the corresponding proteins. Specifically, most mass-spectrometry-based approaches (e.g. shotgun proteomics or selected reaction monitoring) allow one to quantify peptides using chromatographic peak intensities or spectral counting information. Ultimately, based on these measurements, one wants to infer the concentrations of the corresponding proteins. Inferring properties of the proteins based on experimental peptide evidence is often a complex problem because of the ambiguity of peptide assignments and different chemical properties of the peptides that affect the observed concentrations. We present SCAMPI, a novel generic and statistically sound framework for computing protein abundance scores based on quantified peptides. In contrast to most previous approaches, our model explicitly includes information from shared peptides to improve protein quantitation, especially in eukaryotes with many homologous sequences. The model accounts for uncertainty in the input data, leading to statistical prediction intervals for the protein scores. Furthermore, peptides with extreme abundances can be reassessed and classified as either regular data points or actual outliers. We used the proposed model with several datasets and compared its performance to that of other, previously used approaches for protein quantification in bottom-up mass spectrometry.


Subject(s)
Computational Biology/methods , Data Interpretation, Statistical , Proteins/analysis , Proteomics/statistics & numerical data , Cell Line, Tumor , Databases, Protein/statistics & numerical data , Humans , Isotope Labeling/methods , Leptospira interrogans/metabolism , Leukemia, Myeloid, Acute/metabolism , Markov Chains , Proteomics/methods , Research Design , Software