Results 1 - 20 of 46
1.
Proc Natl Acad Sci U S A ; 121(8): e2314228121, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38363866

ABSTRACT

In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as the presence or absence of a variable or an edge. Consequently, false-positive error or false-negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false-positive and false-negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false-positive and false-negative errors. We describe model selection procedures that provide false-positive error control in our general setting, and we illustrate their utility with numerical experiments.

2.
Bioinformatics ; 38(6): 1550-1559, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34927666

ABSTRACT

MOTIVATION: Signaling pathways control cellular behavior. Dysregulated pathways, for example, due to mutations that cause genes and proteins to be expressed abnormally, can lead to diseases, such as cancer. RESULTS: We introduce a novel computational approach, called Differential Causal Effects (dce), which compares normal to cancerous cells using the statistical framework of causality. The method detects individual edges in a signaling pathway that are dysregulated in cancer cells, while accounting for confounding. Hence, technical artifacts have less influence on the results and dce is more likely to detect the true biological signals. We extend the approach to handle unobserved dense confounding, where each latent variable, such as batch effects or cell cycle states, affects many covariates. We show that dce outperforms competing methods on synthetic datasets and on CRISPR knockout screens. We validate its latent confounding adjustment properties on a GTEx (Genotype-Tissue Expression) dataset. Finally, in an exploratory analysis on breast cancer data from TCGA (The Cancer Genome Atlas), we recover known and discover new genes involved in breast cancer progression. AVAILABILITY AND IMPLEMENTATION: The method dce is freely available as an R package on Bioconductor (https://bioconductor.org/packages/release/bioc/html/dce.html) as well as on https://github.com/cbg-ethz/dce. The GitHub repository also contains the Snakemake workflows needed to reproduce all results presented here. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Breast Neoplasms , Software , Humans , Female , Genome , Signal Transduction
3.
Ann Stat ; 50(3): 1320-1347, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35958884

ABSTRACT

Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting, where the measured covariates are affected by hidden confounding and propose the Doubly Debiased Lasso estimator for individual components of the regression coefficient vector. Our advocated method simultaneously corrects both the bias due to estimation of high-dimensional parameters as well as the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.

4.
Proc Natl Acad Sci U S A ; 113(27): 7361-8, 2016 07 05.
Article in English | MEDLINE | ID: mdl-27382150

ABSTRACT

Inferring causal effects from observational and interventional data is a highly desirable but ambitious goal. Many of the computational and statistical methods are plagued by fundamental identifiability issues, instability, and unreliable performance, especially for large-scale systems with many measured variables. We present software and provide some validation of a recently developed methodology based on an invariance principle, called invariant causal prediction (ICP). The ICP method quantifies confidence probabilities for inferring causal structures and thus leads to more reliable and confirmatory statements for causal relations and predictions of external intervention effects. We validate the ICP method and some other procedures using large-scale genome-wide gene perturbation experiments in Saccharomyces cerevisiae. The results suggest that prediction and prioritization of future experimental interventions, such as gene deletions, can be improved by using our statistical inference techniques.


Subjects
Models, Genetic , Statistics as Topic , Algorithms , Flow Cytometry , Gene Deletion , Saccharomyces cerevisiae , Software
5.
Proc Natl Acad Sci U S A ; 117(42): 25963-25965, 2020 10 20.
Article in English | MEDLINE | ID: mdl-33046646
6.
Bioinformatics ; 32(13): 1990-2000, 2016 07 01.
Article in English | MEDLINE | ID: mdl-27153677

ABSTRACT

MOTIVATION: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS. RESULTS: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields P-values for assessing significance of single SNPs or groups of SNPs while controlling for all other SNPs and the family-wise error rate (FWER). Thus, our method tests whether or not a SNP carries any additional information about the phenotype beyond that available from all the other SNPs. This rules out spurious correlations between phenotypes and SNPs that can arise from marginal methods because the 'spuriously correlated' SNP merely happens to be correlated with the 'truly causal' SNP. In addition, the method offers a data-driven approach to identifying and refining groups of SNPs that jointly contain informative signals about the phenotype. We demonstrate the value of our method by applying it to the seven diseases analyzed by the Wellcome Trust Case Control Consortium (WTCCC). We show, in particular, that our method is also capable of finding significant SNPs that were not identified in the original WTCCC study, but were replicated in other independent studies. AVAILABILITY AND IMPLEMENTATION: Reproducibility of our research is supported by the open-source Bioconductor package hierGWAS. CONTACT: peter.buehlmann@stat.math.ethz.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
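
To make the "high significance threshold" of one-SNP-at-a-time testing concrete, here is a minimal sketch of FWER control by the Bonferroni correction. This is a deliberately simple stand-in and not the hierarchical procedure implemented in hierGWAS; the function name and the example P-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if the null hypothesis is rejected.

    Bonferroni controls the family-wise error rate at level alpha by
    comparing every P-value with alpha divided by the number of tests.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical P-values for three SNPs tested marginally.
decisions = bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05)

# With m tests the per-test threshold shrinks to alpha / m, which is why
# marginal analysis of a very large number of SNPs loses power.
per_test_threshold = 0.05 / 500_000
```

With only three tests the threshold is already 0.05 / 3, so only the first hypothetical SNP is rejected; at GWAS scale the same rule demands P-values on the order of 1e-7.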


Subjects
Computational Biology/methods , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Cluster Analysis , Computer Simulation , Genotype , Humans , Linear Models , Phenotype , Reproducibility of Results
7.
New Phytol ; 209(1): 252-64, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26224411

ABSTRACT

Most plastid isoprenoids, including photosynthesis-related metabolites such as carotenoids and the side chain of chlorophylls, tocopherols (vitamin E), phylloquinones (vitamin K), and plastoquinones, derive from geranylgeranyl diphosphate (GGPP) synthesized by GGPP synthase (GGPPS) enzymes. Seven out of 10 functional GGPPS isozymes in Arabidopsis thaliana reside in plastids. We aimed to address the function of different GGPPS paralogues for plastid isoprenoid biosynthesis. We constructed a gene co-expression network (GCN) using GGPPS paralogues as guide genes and genes from the upstream and downstream pathways as query genes. Furthermore, knock-out and/or knock-down ggpps mutants were generated and their growth and metabolic phenotypes were analyzed. Also, interacting protein partners of GGPPS11 were searched for. Our data showed that GGPPS11, encoding the only plastid isozyme essential for plant development, functions as a hub gene among GGPPS paralogues and is required for the production of all major groups of plastid isoprenoids. Furthermore, we showed that the GGPPS11 protein physically interacts with enzymes that use GGPP for the production of carotenoids, chlorophylls, tocopherols, phylloquinone, and plastoquinone. GGPPS11 is a hub isozyme required for the production of most photosynthesis-related isoprenoids. Both gene co-expression and protein-protein interaction likely contribute to the channeling of GGPP by GGPPS11.


Subjects
Alkyl and Aryl Transferases/metabolism , Arabidopsis Proteins/metabolism , Arabidopsis/enzymology , Terpenes/metabolism , Alkyl and Aryl Transferases/genetics , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Carotenoids/metabolism , Chlorophyll/metabolism , Isoenzymes , Phenotype , Photosynthesis , Plastids/enzymology , Polyisoprenyl Phosphates/metabolism , Protein Interaction Mapping
8.
Mol Cell Proteomics ; 13(2): 666-77, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24255132

ABSTRACT

A major goal in proteomics is the comprehensive and accurate description of a proteome. This task includes not only the identification of proteins in a sample, but also the accurate quantification of their abundance. Although mass spectrometry typically provides information on peptide identity and abundance in a sample, it does not directly measure the concentration of the corresponding proteins. Specifically, most mass-spectrometry-based approaches (e.g. shotgun proteomics or selected reaction monitoring) allow one to quantify peptides using chromatographic peak intensities or spectral counting information. Ultimately, based on these measurements, one wants to infer the concentrations of the corresponding proteins. Inferring properties of the proteins based on experimental peptide evidence is often a complex problem because of the ambiguity of peptide assignments and different chemical properties of the peptides that affect the observed concentrations. We present SCAMPI, a novel generic and statistically sound framework for computing protein abundance scores based on quantified peptides. In contrast to most previous approaches, our model explicitly includes information from shared peptides to improve protein quantitation, especially in eukaryotes with many homologous sequences. The model accounts for uncertainty in the input data, leading to statistical prediction intervals for the protein scores. Furthermore, peptides with extreme abundances can be reassessed and classified as either regular data points or actual outliers. We used the proposed model with several datasets and compared its performance to that of other, previously used approaches for protein quantification in bottom-up mass spectrometry.


Subjects
Computational Biology/methods , Data Interpretation, Statistical , Proteins/analysis , Proteomics/statistics & numerical data , Cell Line, Tumor , Databases, Protein/statistics & numerical data , Humans , Isotope Labeling/methods , Leptospira interrogans/metabolism , Leukemia, Myeloid, Acute/metabolism , Markov Chains , Proteomics/methods , Research Design , Software
9.
Neural Comput ; 27(3): 771-99, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25602767

ABSTRACT

Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Different graphs may result in different causal inference statements and different intervention distributions. To quantify such differences, we propose a (pre-)metric between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore well suited for evaluating graphs that are used for computing interventions. Instead of DAGs, it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. The SID differs significantly from the widely used structural Hamming distance and therefore constitutes a valuable additional measure. We discuss properties of this distance and provide a (reasonably) efficient implementation with software code available on the first author's home page.
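
For contrast with the SID, the structural Hamming distance named in this abstract can be sketched in a few lines. This is a minimal illustration under one common convention (a missing, extra, or reversed edge each counts once); other conventions count a reversal twice. The function name and the example graphs are hypothetical, and the SID itself, which requires checking adjustment sets, is not reproduced here.

```python
def structural_hamming_distance(a, b):
    """Count node pairs whose edge status differs between two directed graphs.

    a and b are square 0/1 adjacency matrices (lists of lists), where
    a[i][j] == 1 means there is an edge i -> j.
    """
    n = len(a)
    if len(b) != n or any(len(row) != n for row in a + b):
        raise ValueError("both adjacency matrices must be square and equal in size")
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Compare the full edge status of the unordered pair {i, j}.
            if (a[i][j], a[j][i]) != (b[i][j], b[j][i]):
                distance += 1
    return distance


# Hypothetical three-node graphs.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]           # 0 -> 1 -> 2
first_reversed = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]  # 1 -> 0, 1 -> 2
empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]           # no edges

shd_reversal = structural_hamming_distance(chain, first_reversed)
shd_empty = structural_hamming_distance(chain, empty)
```

Under this convention a single reversed edge gives a distance of 1 no matter how much it changes the implied intervention distributions, which is the kind of gap the SID is designed to capture.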

10.
BMC Genomics ; 15: 1162, 2014 Dec 22.
Article in English | MEDLINE | ID: mdl-25534632

ABSTRACT

BACKGROUND: Large-scale RNAi screening has become an important technology for identifying genes involved in biological processes of interest. However, the quality of large-scale RNAi screening is often degraded by off-target effects. In order to find statistically significant effector genes for pathogen entry, we systematically analyzed entry pathways in human host cells for eight pathogens using image-based kinome-wide siRNA screens with siRNAs from three vendors. We propose a Parallel Mixed Model (PMM) approach that simultaneously analyzes several non-identical screens performed with the same RNAi libraries. RESULTS: We show that PMM gains statistical power for hit detection due to parallel screening. PMM allows incorporating siRNA weights that can be assigned according to available information on RNAi quality. Moreover, PMM is able to estimate a sharedness score that can be used to focus follow-up efforts on generic or specific gene regulators. By fitting a PMM to our data, we found several novel hit genes for most of the pathogens studied. CONCLUSIONS: Our results show that parallel RNAi screening can improve the results of individual screens. This is particularly relevant now that large-scale parallel datasets are becoming increasingly publicly available. Our comprehensive siRNA dataset provides a public, freely available resource for further statistical and biological analyses in the high-content, high-throughput siRNA screening field.


Subjects
Genomics/methods , RNA Interference , RNA, Small Interfering/genetics , Cell Line , Gene Library , Genomics/standards , High-Throughput Screening Assays , Host-Pathogen Interactions/genetics , Humans , ROC Curve , Reproducibility of Results
11.
Bioinformatics ; 28(1): 112-8, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-22039212

ABSTRACT

MOTIVATION: Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously. RESULTS: We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other methods of imputation especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. AVAILABILITY: The package missForest is freely available from http://stat.ethz.ch/CRAN/. CONTACT: stekhoven@stat.math.ethz.ch; buhlmann@stat.math.ethz.ch


Subjects
Algorithms , Data Interpretation, Statistical , Arabidopsis/metabolism , Escherichia coli/metabolism , Gene Expression Profiling/methods , Humans , Oligonucleotide Array Sequence Analysis/methods
12.
Bioinformatics ; 28(21): 2819-23, 2012 Nov 01.
Article in English | MEDLINE | ID: mdl-22945788

ABSTRACT

Genotypic causes of a phenotypic trait are typically determined via randomized controlled intervention experiments. Such experiments are often prohibitive with respect to durations and costs, and informative prioritization of experiments is desirable. We therefore consider predicting stable rankings of genes (covariates), according to their total causal effects on a phenotype (response), from observational data. Since causal effects are generally non-identifiable from observational data only, we use a method that can infer lower bounds for the total causal effect under some assumptions. We validated our method, which we call Causal Stability Ranking (CStaR), in two situations. First, we performed knock-out experiments with Arabidopsis thaliana according to a predicted ranking based on observational gene expression data, using flowering time as phenotype of interest. Besides several known regulators of flowering time, we found almost half of the tested top ranking mutants to have a significantly changed flowering time. Second, we compared CStaR to established regression-based methods on a gene expression dataset of Saccharomyces cerevisiae. We found that CStaR outperforms these established methods. Our method allows for efficient design and prioritization of future intervention experiments, and due to its generality it can be used for a broad spectrum of applications.


Subjects
Arabidopsis/genetics , Gene Expression Profiling/methods , Genomic Instability/genetics , Models, Genetic , Saccharomyces cerevisiae/genetics , False Positive Reactions , Flowers/genetics , Gene Knockout Techniques , Genes, Regulator/genetics , Genotype , Phenotype , ROC Curve , Regression Analysis
13.
Mol Syst Biol ; 8: 606, 2012.
Article in English | MEDLINE | ID: mdl-22929616

ABSTRACT

Leaves have a central role in plant energy capture and carbon conversion and therefore must continuously adapt their development to prevailing environmental conditions. To reveal the dynamic systems behaviour of leaf development, we profiled Arabidopsis leaf number six in depth at four different growth stages, at both the end-of-day and end-of-night, in plants growing in two controlled experimental conditions: short-day conditions with optimal soil water content and constant reduced soil water conditions. We found that the lower soil water potential led to reduced, but prolonged, growth and an adaptation at the molecular level without a drought stress response. Clustering of the protein and transcript data using a decision tree revealed different patterns in abundance changes across the growth stages and between end-of-day and end-of-night that are linked to specific biological functions. Correlations between protein and transcript levels depend on the time-of-day and also on protein localisation and function. Surprisingly, only very few of >1700 quantified proteins showed diurnal abundance fluctuations, despite strong fluctuations at the transcript level.


Subjects
Adaptation, Biological/genetics , Arabidopsis/growth & development , Plant Leaves/growth & development , Proteome/metabolism , Transcriptome/physiology , Arabidopsis/metabolism , Cluster Analysis , Darkness , Droughts , Gene Expression Profiling/methods , Light , Photoperiod , Plant Leaves/metabolism , Plant Transpiration/physiology , Proteomics/methods , Soil , Water/metabolism
14.
Proc Natl Acad Sci U S A ; 107(27): 12101-6, 2010 Jul 06.
Article in English | MEDLINE | ID: mdl-20562346

ABSTRACT

One of the major goals of proteomics is the comprehensive and accurate description of a proteome. Shotgun proteomics, the method of choice for the analysis of complex protein mixtures, requires that experimentally observed peptides are mapped back to the proteins they were derived from. This process is also known as protein inference. We present Markovian Inference of Proteins and Gene Models (MIPGEM), a statistical model based on clearly stated assumptions to address the problem of protein and gene model inference for shotgun proteomics data. In particular, we are dealing with dependencies among peptides and proteins using a Markovian assumption on k-partite graphs. We are also addressing the problems of shared peptides and ambiguous proteins by scoring the encoding gene models. Empirical results on two control datasets with synthetic mixtures of proteins and on complex protein samples of Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana suggest that the results with MIPGEM are competitive with existing tools for protein inference.


Subjects
Computational Biology/methods , Models, Statistical , Proteins/analysis , Proteomics/methods , Algorithms , Animals , Arabidopsis Proteins/analysis , Databases, Protein , Drosophila Proteins/analysis , Markov Chains , Peptides/analysis , Proteome/analysis , Reproducibility of Results , Saccharomyces cerevisiae Proteins/analysis
15.
EClinicalMedicine ; 62: 102124, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37588623

ABSTRACT

Background: When sepsis is detected, organ damage may have progressed to irreversible stages, leading to poor prognosis. The use of machine learning for predicting sepsis early has shown promise; however, international validations are missing. Methods: This was a retrospective, observational, multi-centre cohort study. We developed and externally validated a deep learning system for the prediction of sepsis in the intensive care unit (ICU). Our analysis represents the first international, multi-centre in-ICU cohort study for sepsis prediction using deep learning to our knowledge. Our dataset contains 136,478 unique ICU admissions, representing a refined and harmonised subset of four large ICU databases comprising data collected from ICUs in the US, the Netherlands, and Switzerland between 2001 and 2016. Using the international consensus definition Sepsis-3, we derived hourly-resolved sepsis annotations, amounting to 25,694 (18.8%) patient stays with sepsis. We compared our approach to clinical baselines as well as machine learning baselines and performed an extensive internal and external statistical validation within and across databases, reporting area under the receiver-operating-characteristic curve (AUC). Findings: Averaged over sites, our model was able to predict sepsis with an AUC of 0.846 (95% confidence interval [CI], 0.841-0.852) on a held-out validation cohort internal to each site, and an AUC of 0.761 (95% CI, 0.746-0.770) when validating externally across sites. Given access to a small fine-tuning set (10% per site), the transfer to target sites was improved to an AUC of 0.807 (95% CI, 0.801-0.813). Our model raised 1.4 false alerts per true alert and detected 80% of the septic patients 3.7 h (95% CI, 3.0-4.3) prior to the onset of sepsis, opening a vital window for intervention. Interpretation: By monitoring clinical and laboratory measurements in a retrospective simulation of a real-time prediction scenario, a deep learning system for the detection of sepsis generalised to previously unseen ICU cohorts, internationally. Funding: This study was funded by the Personalized Health and Related Technologies (PHRT) strategic focus area of the ETH domain.
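
Two of the reported figures can be related by simple arithmetic. The sketch below assumes that "1.4 false alerts per true alert" is the ratio of false-positive to true-positive alerts, in which case it corresponds to a positive predictive value of 1 / (1 + 1.4); that reading is an interpretation, not something the abstract states. The function and variable names are hypothetical.

```python
def precision_from_false_alert_ratio(false_alerts_per_true_alert):
    """Convert a false-to-true alert ratio into the fraction of alerts that are true."""
    if false_alerts_per_true_alert < 0:
        raise ValueError("the ratio must be non-negative")
    return 1.0 / (1.0 + false_alerts_per_true_alert)


# Assumed reading of the abstract: 1.4 false alerts for every true alert.
alert_precision = precision_from_false_alert_ratio(1.4)

# Sepsis prevalence among ICU stays, from the counts given in the abstract.
sepsis_stays = 25_694
icu_admissions = 136_478
sepsis_percentage = 100.0 * sepsis_stays / icu_admissions
```

Under that assumption roughly 42% of alerts would be true alerts, and the prevalence computed from the counts agrees with the 18.8% quoted in the abstract.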

16.
Article in English | MEDLINE | ID: mdl-37502671

ABSTRACT

Technological developments now allow large amounts of data to be gathered in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of this type of data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article overviews the power and flexibility of GAMLSS in relation to some ML techniques. GAMLSS' capability to be tailored toward causality via causal regularization is also briefly discussed. This overview is illustrated via a data set from the field of LA. This article is categorized under: Application Areas > Education and Learning; Algorithmic Development > Statistics; Technologies > Machine Learning.

17.
Sci Adv ; 9(6): eade9238, 2023 02 10.
Article in English | MEDLINE | ID: mdl-36753540

ABSTRACT

Rhabdomyosarcoma (RMS) is a group of pediatric cancers with features of developing skeletal muscle. The cellular hierarchy and mechanisms leading to developmental arrest remain elusive. Here, we combined single-cell RNA sequencing, mass cytometry, and high-content imaging to resolve intratumoral heterogeneity of patient-derived primary RMS cultures. We show that the aggressive alveolar RMS (aRMS) subtype contains plastic muscle stem-like cells and cycling progenitors that drive tumor growth, and a subpopulation of differentiated cells that lost its proliferative potential and correlates with better outcomes. While chemotherapy eliminates cycling progenitors, it enriches aRMS for muscle stem-like cells. We screened for drugs hijacking aRMS toward clinically favorable subpopulations and identified a combination of RAF and MEK inhibitors that potently induces myogenic differentiation and inhibits tumor growth. Overall, our work provides insights into the developmental states underlying aRMS aggressiveness, chemoresistance, and progression and identifies the RAS pathway as a promising therapeutic target.


Subjects
Antineoplastic Agents , Rhabdomyosarcoma, Alveolar , Rhabdomyosarcoma , Child , Humans , Rhabdomyosarcoma, Alveolar/drug therapy , Rhabdomyosarcoma, Alveolar/genetics , Rhabdomyosarcoma, Alveolar/pathology , Rhabdomyosarcoma/drug therapy , Rhabdomyosarcoma/genetics , Rhabdomyosarcoma/pathology , Muscle, Skeletal/metabolism , Cell Differentiation , Antineoplastic Agents/therapeutic use , Cell Line, Tumor
18.
BMC Cancer ; 12: 310, 2012 Jul 23.
Article in English | MEDLINE | ID: mdl-22824167

ABSTRACT

BACKGROUND: Renal cell carcinoma (RCC) is characterized by a number of diverse molecular aberrations that differ among individuals. Recent approaches to molecularly classify RCC were based on clinical, pathological as well as on single molecular parameters. As a consequence, gene expression patterns reflecting the sum of genetic aberrations in individual tumors may not have been recognized. In an attempt to uncover such molecular features in RCC, we used a novel, unbiased and integrative approach. METHODS: We integrated gene expression data from 97 primary RCC of different pathologic parameters, 15 RCC metastases as well as 34 cancer cell lines for two-way nonsupervised hierarchical clustering using gene groups suggested by the PANTHER Classification System. We depicted the genomic landscape of the resulting tumor groups by means of single nucleotide polymorphism (SNP) technology. Finally, the results were analyzed immunohistochemically using a tissue microarray (TMA) composed of 254 RCC. RESULTS: We found robust, genome-wide expression signatures, which split RCC into three distinct molecular subgroups. These groups remained stable even if randomly selected gene sets were clustered. Notably, the pattern obtained from RCC cell lines was clearly distinguishable from that of primary tumors. SNP array analysis demonstrated differing frequencies of chromosomal copy number alterations among RCC subgroups. TMA analysis with group-specific markers showed a prognostic significance of the different groups. CONCLUSION: We propose the existence of characteristic and histologically independent genome-wide expression outputs in RCC with potential biological and clinical relevance.


Subjects
Carcinoma, Renal Cell/classification , Gene Expression Profiling , Kidney Neoplasms/classification , Carcinoma, Renal Cell/genetics , Carcinoma, Renal Cell/mortality , Carcinoma, Renal Cell/pathology , Cell Line, Tumor , Cluster Analysis , DNA Copy Number Variations , Humans , Kidney Neoplasms/genetics , Kidney Neoplasms/mortality , Kidney Neoplasms/pathology , Polymorphism, Single Nucleotide , Prognosis , Proportional Hazards Models
19.
Stat Comput ; 32(3): 39, 2022.
Article in English | MEDLINE | ID: mdl-35582000

ABSTRACT

Prediction models often fail if train and test data do not stem from the same distribution. Out-of-distribution (OOD) generalization to unseen, perturbed test data is a desirable but difficult-to-achieve property for prediction models and in general requires strong assumptions on the data generating process (DGP). In a causally inspired perspective on OOD generalization, the test data arise from a specific class of interventions on exogenous random variables of the DGP, called anchors. Anchor regression models, introduced by Rothenhäusler et al. (J R Stat Soc Ser B 83(2):215-246, 2021. 10.1111/rssb.12398), protect against distributional shifts in the test data by employing causal regularization. However, so far anchor regression has only been used with a squared-error loss which is inapplicable to common responses such as censored continuous or ordinal data. Here, we propose a distributional version of anchor regression which generalizes the method to potentially censored responses with at least an ordered sample space. To this end, we combine a flexible class of parametric transformation models for distributional regression with an appropriate causal regularizer under a more general notion of residuals. In an exemplary application and several simulation scenarios we demonstrate the extent to which OOD generalization is possible.
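
For orientation, the squared-error anchor regression that this abstract generalizes can be written as a single penalized least-squares problem. The display below is a sketch of that objective as we understand it from Rothenhäusler et al.; the symbols are not taken from this abstract and are assumptions: Y is the response vector, X the covariate matrix, A the matrix of anchor variables, P_A the orthogonal projection onto the column space of A, and gamma >= 0 the causal regularization parameter.

```latex
\hat{b}^{\gamma}
  = \operatorname*{arg\,min}_{b}
    \left\| (I - P_A)(Y - Xb) \right\|_2^2
    + \gamma \left\| P_A (Y - Xb) \right\|_2^2,
\qquad
P_A = A (A^{\top} A)^{-1} A^{\top}.
```

Setting gamma = 1 recovers ordinary least squares, because the two orthogonal residual components then add up to the full squared residual, while larger gamma penalizes the part of the residual explained by the anchors more heavily. The distributional version described in the abstract replaces the squared-error residuals with a more general notion of residuals.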

20.
Gigascience ; 12, 2022 Dec 28.
Article in English | MEDLINE | ID: mdl-37318234

ABSTRACT

OBJECTIVE: To develop a unified framework for analyzing data from 5 large publicly available intensive care unit (ICU) datasets. FINDINGS: Using 3 American (Medical Information Mart for Intensive Care III, Medical Information Mart for Intensive Care IV, electronic ICU) and 2 European (Amsterdam University Medical Center Database, High Time Resolution ICU Dataset) databases, we constructed a mapping for each database to a set of clinically relevant concepts, which are grounded in the Observational Medical Outcomes Partnership Vocabulary wherever possible. Furthermore, we harmonized the units of measurement and the data type representation. On top of this, we built functionality that allows the user to download, set up, and load data from all 5 databases through a unified Application Programming Interface. The resulting ricu R-package represents the computational infrastructure for handling publicly available ICU datasets, and its latest release allows the user to load 119 existing clinical concepts from the 5 data sources. CONCLUSION: The ricu R-package (available on GitHub and CRAN) is the first tool that enables users to analyze publicly available ICU datasets simultaneously (datasets are available upon request from respective owners). Such an interface saves researchers time when analyzing ICU data and helps reproducibility. We hope that ricu can become a community-wide effort, so that data harmonization is not repeated by each research group separately. One current limitation is that concepts were added on a case-by-case basis, and therefore the resulting dictionary of concepts is not comprehensive. Further work is needed to make the dictionary comprehensive.


Subjects
Critical Care , Intensive Care Units , Humans , Reproducibility of Results , Critical Care/methods , Databases, Factual , Data Management