Results 1-20 of 1,557
1.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39007597

ABSTRACT

Thyroid cancer incidence continues to increase even though many diagnostic tools have been developed in recent years. Because there is no single standard procedure for diagnosing thyroid cancer, clinicians must conduct a variety of tests. This workup yields multi-dimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data, both of which are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose thyroid cancer. To this end, the singularities in the learning problem that stem from randomly distributed missing data are treated, and dimensionality reduction with inner and target similarity approaches is developed to select the most informative input datasets. In addition, size reduction with a hierarchical clustering algorithm is performed to eliminate highly similar data samples. Four machine learning algorithms are trained and then tested on unseen data to validate their generalization and robustness. The results yield 100% training and 83% testing accuracy on the unseen data. The computational time efficiency of the algorithms is also examined under equal conditions.
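
The size-reduction step can be illustrated with a toy sketch. This is not the paper's method — it is a greedy stand-in (assumed Euclidean distance, assumed threshold) for the idea of eliminating considerably similar samples before training:

```python
import math

def reduce_by_similarity(samples, threshold):
    """Greedy size reduction: keep a sample only if it is at least
    `threshold` (Euclidean distance) away from every sample kept so far.
    A simplified stand-in for a hierarchical-clustering size-reduction step."""
    kept = []
    for s in samples:
        if all(math.dist(s, k) >= threshold for k in kept):
            kept.append(s)
    return kept

data = [(1.0, 2.0), (1.01, 2.02), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0)]
print(reduce_by_similarity(data, threshold=0.5))
# keeps one representative per tight group: [(1.0, 2.0), (5.0, 5.0), (9.0, 1.0)]
```

A real pipeline would cluster first (e.g., single-linkage dendrogram cut at a height) and keep one representative per cluster; the greedy pass above approximates that for well-separated groups.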


Subjects
Algorithms; Deep Learning; Thyroid Neoplasms; Thyroid Neoplasms/diagnosis; Humans; Machine Learning; Cluster Analysis
2.
Biostatistics ; 25(2): 289-305, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-36977366

ABSTRACT

Causally interpretable meta-analysis combines information from a collection of randomized controlled trials to estimate treatment effects in a target population in which experimentation may not be possible but from which covariate information can be obtained. In such analyses, a key practical challenge is the presence of systematically missing data when some trials have collected data on one or more baseline covariates, but other trials have not, such that the covariate information is missing for all participants in the latter. In this article, we provide identification results for potential (counterfactual) outcome means and average treatment effects in the target population when covariate data are systematically missing from some of the trials in the meta-analysis. We propose three estimators for the average treatment effect in the target population, examine their asymptotic properties, and show that they have good finite-sample performance in simulation studies. We use the estimators to analyze data from two large lung cancer screening trials and target population data from the National Health and Nutrition Examination Survey (NHANES). To accommodate the complex survey design of the NHANES, we modify the methods to incorporate survey sampling weights and allow for clustering.
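
The core idea — modeling outcomes in trial data and averaging predictions over the target population's covariates — can be sketched in a noise-free toy (all data below are hypothetical; this is plain g-computation/standardization and does not implement the paper's estimators for systematically missing covariates):

```python
def ols2(xs, ys):
    # simple one-covariate least squares: returns (intercept, slope)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical trial: covariate x, treatment a, outcome y = 2x + a*(1 + x),
# so the treatment effect is 1 + x (effect modification by x).
trial = [(x, a, 2 * x + a * (1 + x)) for x in (0.0, 1.0, 2.0, 3.0) for a in (0, 1)]
treated = [(x, y) for x, a, y in trial if a == 1]
control = [(x, y) for x, a, y in trial if a == 0]
i1, s1 = ols2(*zip(*treated))   # outcome model among the treated
i0, s0 = ols2(*zip(*control))   # outcome model among the controls

# Standardize the model-predicted effect over the target covariate distribution:
target_x = [4.0, 5.0, 6.0]      # target population has mean x = 5
ate = sum((i1 + s1 * x) - (i0 + s0 * x) for x in target_x) / len(target_x)
print(round(ate, 6))  # 1 + E_target[x] = 6.0, not the trial-population effect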


Subjects
Early Detection of Cancer; Lung Neoplasms; Humans; Nutrition Surveys; Lung Neoplasms/epidemiology; Computer Simulation; Research Design
3.
Mol Cell Proteomics ; 22(8): 100558, 2023 08.
Article in English | MEDLINE | ID: mdl-37105364

ABSTRACT

Mass spectrometry (MS) enables high-throughput identification and quantification of proteins in complex biological samples and can provide insights into the global function of biological systems. Label-free quantification is cost-effective and suitable for the analysis of human samples. Despite rapid developments in label-free data acquisition workflows, the number of proteins quantified across samples can be limited by technical and biological variability. This variation can result in missing values which can in turn challenge downstream data analysis tasks. General purpose or gene expression-specific imputation algorithms are widely used to improve data completeness. Here, we propose an imputation algorithm designed for label-free MS data that is aware of the type of missingness affecting the data. On published datasets acquired by data-dependent and data-independent acquisition workflows with variable degrees of biological complexity, we demonstrate that the proposed missing value estimation procedure by barycenter computation competes closely with the state-of-the-art imputation algorithms in differential abundance tasks while outperforming them in the accuracy of variance estimates of the peptide abundance measurements, and better controls the false discovery rate in label-free MS experiments. The barycenter estimation procedure is implemented in the msImpute software package and is available from the Bioconductor repository.


Subjects
Algorithms; Peptides; Humans; Peptides/analysis; Proteins; Mass Spectrometry/methods
4.
Genet Epidemiol ; 47(1): 61-77, 2023 02.
Article in English | MEDLINE | ID: mdl-36125445

ABSTRACT

There is an increasing interest in using multiple types of omics features (e.g., DNA sequences, RNA expressions, methylation, protein expressions, and metabolic profiles) to study how the relationships between phenotypes and genotypes may be mediated by other omics markers. Genotypes and phenotypes are typically available for all subjects in genetic studies, but typically, some omics data will be missing for some subjects, due to limitations such as cost and sample quality. In this article, we propose a powerful approach for mediation analysis that accommodates missing data among multiple mediators and allows for various interaction effects. We formulate the relationships among genetic variants, other omics measurements, and phenotypes through linear regression models. We derive the joint likelihood for models with two mediators, accounting for arbitrary patterns of missing values. Utilizing computationally efficient and stable algorithms, we conduct maximum likelihood estimation. Our methods produce unbiased and statistically efficient estimators. We demonstrate the usefulness of our methods through simulation studies and an application to the Metabolic Syndrome in Men study.
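
For readers unfamiliar with mediation analysis, the basic product-of-coefficients idea behind linear-model mediation can be sketched as follows. This toy uses complete data and a single mediator — it does not implement the paper's likelihood-based method for missing mediators; all data and effect sizes are hypothetical:

```python
import random

def ols(X, y):
    """Least squares via normal equations (Gaussian elimination with pivoting).
    X: list of rows, each starting with a 1 for the intercept."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yy for r, yy in zip(X, y)) for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]; b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]; b[r] -= f * b[c]
    beta = [0.0] * p
    for c in reversed(range(p)):
        beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, p))) / A[c][c]
    return beta

random.seed(1)
x = [i / 100 for i in range(500)]                      # genotype-like exposure
m = [0.5 * xi + random.gauss(0, 0.2) for xi in x]      # omics mediator
y = [0.7 * mi + 0.3 * xi for mi, xi in zip(m, x)]      # phenotype

a = ols([[1, xi] for xi in x], m)[1]                   # x -> m path
b = ols([[1, mi, xi] for mi, xi in zip(m, x)], y)[1]   # m -> y path, given x
print(round(a * b, 2))  # indirect (mediated) effect; truth is 0.5*0.7 = 0.35
```

The paper's contribution is to recover such path coefficients when mediator values are missing in arbitrary patterns, via the joint likelihood rather than complete-case regressions.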


Subjects
Mediation Analysis; Models, Genetic; Humans; Genotype; Computer Simulation; Likelihood Functions; Algorithms
5.
Clin Infect Dis ; 2024 Jun 02.
Article in English | MEDLINE | ID: mdl-38824440

ABSTRACT

Data on alcohol use and incident tuberculosis (TB) infection are needed. In adults aged 15+ in rural Uganda (N=49,585), the estimated risk of incident TB infection was 29.2% with alcohol use vs. 19.2% without (RR: 1.49; 95% CI: 1.40-1.60). There is potential for interventions to interrupt transmission among people who drink alcohol.
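
A risk ratio with a Wald confidence interval on the log scale can be computed as below. The group sizes are hypothetical (chosen only to reproduce the reported risks); the published RR of 1.49 is model-based, so the crude ratio differs slightly:

```python
import math

def risk_ratio(a, n1, b, n2, z=1.96):
    """Risk ratio of two proportions with a Wald CI on the log scale."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)   # SE of log(RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical counts matching the reported risks: 29.2% vs 19.2%.
rr, lo, hi = risk_ratio(2920, 10000, 7600, 39585)
print(round(rr, 2))  # crude RR = 1.52
```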

6.
Mol Biol Evol ; 40(5)2023 05 02.
Article in English | MEDLINE | ID: mdl-37140129

ABSTRACT

The data available for reconstructing molecular phylogenies have become wildly disparate. Phylogenomic studies can generate data for thousands of genetic markers for dozens of species, but for hundreds of other taxa, data may be available from only a few genes. Can these two types of data be integrated to combine the advantages of both, addressing the relationships of hundreds of species with thousands of genes? Here, we show that this is possible, using data from frogs. We generated a phylogenomic data set for 138 ingroup species and 3,784 nuclear markers (ultraconserved elements [UCEs]), including new UCE data from 70 species. We also assembled a supermatrix data set, including data from 97% of frog genera (441 total), with 1-307 genes per taxon. We then produced a combined phylogenomic-supermatrix data set (a "gigamatrix") containing 441 ingroup taxa and 4,091 markers but with 86% missing data overall. Likelihood analysis of the gigamatrix yielded a generally well-supported tree among families, largely consistent with trees from the phylogenomic data alone. All terminal taxa were placed in the expected families, even though 42.5% of these taxa each had >99.5% missing data and 70.2% had >90% missing data. Our results show that missing data need not be an impediment to successfully combining very large phylogenomic and supermatrix data sets, and they open the door to new studies that simultaneously maximize sampling of genes and taxa.


Subjects
Anura; Animals; Phylogeny; Sequence Analysis, DNA; Anura/genetics; Probability
7.
Am J Epidemiol ; 193(6): 908-916, 2024 06 03.
Article in English | MEDLINE | ID: mdl-38422371

ABSTRACT

Routinely collected testing data have been a vital resource for public health response during the COVID-19 pandemic and have revealed the extent to which Black and Hispanic persons have borne a disproportionate burden of SARS-CoV-2 infections and hospitalizations in the United States. However, missing race and ethnicity data and missed infections due to testing disparities limit the interpretation of testing data and obscure the true toll of the pandemic. We investigated potential bias arising from these 2 types of missing data through a case study carried out in Holyoke, Massachusetts, during the prevaccination phase of the pandemic. First, we estimated SARS-CoV-2 testing and case rates by race and ethnicity, imputing missing data using a joint modeling approach. We then investigated disparities in SARS-CoV-2 reported case rates and missed infections by comparing case rate estimates with estimates derived from a COVID-19 seroprevalence survey. Compared with the non-Hispanic White population, we found that the Hispanic population had similar testing rates (476 tested per 1000 vs 480 per 1000) but twice the case rate (8.1% vs 3.7%). We found evidence of inequitable testing, with a higher rate of missed infections in the Hispanic population than in the non-Hispanic White population (79 infections missed per 1000 vs 60 missed per 1000).
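
The "missed infections" comparison is a simple gap between survey-estimated infections and reported cases. The seroprevalence inputs below are hypothetical, back-calculated only so the arithmetic reproduces the reported per-1,000 gaps:

```python
def missed_per_1000(seroprevalence, reported_case_rate):
    """Infections missed by testing, per 1,000 residents: the gap between
    survey-estimated cumulative infections and officially reported cases."""
    return round(1000 * (seroprevalence - reported_case_rate), 1)

# Hypothetical inputs consistent with the reported gaps (79 vs 60 per 1,000):
print(missed_per_1000(0.160, 0.081))  # Hispanic population: 79.0
print(missed_per_1000(0.097, 0.037))  # non-Hispanic White population: 60.0
```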


Subjects
COVID-19 Testing; COVID-19; Hispanic or Latino; SARS-CoV-2; Humans; COVID-19/ethnology; COVID-19/epidemiology; COVID-19/diagnosis; Massachusetts/epidemiology; COVID-19 Testing/statistics & numerical data; Hispanic or Latino/statistics & numerical data; Male; Female; Middle Aged; Healthcare Disparities/ethnology; Healthcare Disparities/statistics & numerical data; Adult; Health Status Disparities; Black or African American/statistics & numerical data; Ethnicity/statistics & numerical data; Aged; Missed Diagnosis/statistics & numerical data
8.
Am J Epidemiol ; 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39191658

ABSTRACT

Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. These variables may often themselves be incomplete. We explored how missing data in auxiliary variables influenced estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data and different missingness mechanisms for the auxiliary variable on bias of an unadjusted linear regression coefficient and the fraction of missing information. We illustrate our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI including the auxiliary variable to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing not at random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). Careful consideration of the quantity and nature of missing data in auxiliary variables needs to be made when selecting them for use in MI models.
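
Why a good auxiliary variable matters can be shown with a stripped-down toy: when outcome missingness depends on a fully observed auxiliary variable, a complete-records mean is biased while regression imputation using the auxiliary variable is not. This single deterministic imputation is a simplified stand-in for MI (which would add draws and pool over several imputed datasets), and it does not reproduce the paper's setting of the auxiliary variable itself being incomplete:

```python
import random

def fit_line(xs, ys):
    # one-covariate least squares: returns (intercept, slope)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

random.seed(7)
z = [i / 10 for i in range(100)]                   # auxiliary variable (complete)
y = [2 * zi + random.gauss(0, 0.5) for zi in z]    # outcome of interest
observed = [zi <= 5.0 for zi in z]                 # y missing when z > 5 (MAR given z)

true_mean = sum(y) / len(y)
cc_mean = sum(yi for yi, o in zip(y, observed) if o) / sum(observed)

# Regression imputation of missing y from the auxiliary variable z:
a, b = fit_line([zi for zi, o in zip(z, observed) if o],
                [yi for yi, o in zip(y, observed) if o])
filled = [yi if o else a + b * zi for yi, zi, o in zip(y, z, observed)]
imp_mean = sum(filled) / len(filled)

print(abs(cc_mean - true_mean) > abs(imp_mean - true_mean))  # True
```

The paper's warning is the flip side: if z itself has substantial missing data — or is missing not at random — this corrective power erodes and can even introduce bias.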

9.
Am J Epidemiol ; 193(1): 203-213, 2024 Jan 08.
Article in English | MEDLINE | ID: mdl-37650647

ABSTRACT

We developed and validated a claims-based algorithm that classifies patients into obesity categories. Using Medicare (2007-2017) and Medicaid (2000-2014) claims data linked to 2 electronic health record (EHR) systems in Boston, Massachusetts, we identified a cohort of patients with an EHR-based body mass index (BMI) measurement (calculated as weight (kg)/height (m)²). We used regularized regression to select from 137 variables and built generalized linear models to classify patients with BMIs of ≥25, ≥30, and ≥40. We developed the prediction model using EHR system 1 (training set) and validated it in EHR system 2 (validation set). The cohort contained 123,432 patients in the Medicare population and 40,736 patients in the Medicaid population. The model comprised 97 variables in the Medicare set and 95 in the Medicaid set, including BMI-related diagnosis codes, cardiovascular and antidiabetic drugs, and obesity-related comorbidities. The areas under the receiver-operating-characteristic curve in the validation set were 0.72, 0.75, and 0.83 (Medicare) and 0.66, 0.66, and 0.70 (Medicaid) for BMIs of ≥25, ≥30, and ≥40, respectively. The positive predictive values were 81.5%, 80.6%, and 64.7% (Medicare) and 81.6%, 77.5%, and 62.5% (Medicaid), for BMIs of ≥25, ≥30, and ≥40, respectively. The proposed model can identify obesity categories in claims databases when BMI measurements are missing and can be used for confounding adjustment, defining subgroups, or probabilistic bias analysis.
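
The AUC figures quoted above have a simple rank interpretation — the probability that a randomly chosen case scores higher than a randomly chosen non-case — which can be computed directly from predicted scores. The scores below are hypothetical, for illustration only:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney statistic:
    P(random positive outranks random negative), ties counted as 1/2."""
    wins = ties = 0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1
            elif sp == sn:
                ties += 1
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities of BMI >= 30 from a claims-based model:
obese = [0.9, 0.8, 0.6, 0.55]
non_obese = [0.7, 0.4, 0.3, 0.2]
print(auc(obese, non_obese))  # 0.875
```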


Subjects
Medicare; Obesity; Aged; Humans; United States/epidemiology; Obesity/epidemiology; Body Mass Index; Comorbidity; Hypoglycemic Agents; Electronic Health Records
10.
Am J Epidemiol ; 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38904459

ABSTRACT

When analyzing a selected sample from a general population, selection bias can arise relative to the causal average treatment effect (ATE) for the general population, and also relative to the ATE for the selected sample itself. We provide simple graphical rules that indicate: (1) if a selected-sample analysis will be unbiased for each ATE; (2) whether adjusting for certain covariates could eliminate selection bias. The rules can easily be checked in a standard single-world intervention graph. When the treatment could affect selection, a third estimand of potential scientific interest is the "net treatment difference", namely the net change in outcomes that would occur for the selected sample if all members of the general population were treated versus not treated, including any effects of the treatment on which individuals are in the selected sample. We provide graphical rules for this estimand as well. We decompose bias in a selected-sample analysis relative to the general-population ATE into: (1) "internal bias" relative to the net treatment difference; (2) "net-external bias", a discrepancy between the net treatment difference and the general-population ATE. Each bias can be assessed unambiguously via a distinct graphical rule, providing new conceptual insight into the mechanisms by which certain causal structures produce selection bias.

11.
Am J Epidemiol ; 193(7): 1019-1030, 2024 07 08.
Article in English | MEDLINE | ID: mdl-38400653

ABSTRACT

Targeted maximum likelihood estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on data (1992-1998) from the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate 8 missing-data methods in this context: complete-case analysis, extended TMLE incorporating an outcome-missingness model, the missing covariate missing indicator method, and 5 multiple imputation (MI) approaches using parametric or machine-learning models. We considered 6 scenarios that varied in terms of exposure/outcome generation models (presence of confounder-confounder interactions) and missingness mechanisms (whether outcome influenced missingness in other variables and presence of interaction/nonlinear terms in missingness models). Complete-case analysis and extended TMLE had small biases when outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when exposure/outcome generation models included interactions. Parametric MI including interactions performed best in bias and variance reduction across all settings, except when missingness models included a nonlinear term. When choosing a method for handling missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and nonlinearities is expected to perform well.


Subjects
Causality; Humans; Likelihood Functions; Adolescent; Data Interpretation, Statistical; Bias; Models, Statistical; Computer Simulation
12.
Biostatistics ; 24(2): 502-517, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34939083

ABSTRACT

Cluster randomized trials (CRTs) randomly assign an intervention to groups of individuals (e.g., clinics or communities) and measure outcomes on individuals in those groups. While offering many advantages, this experimental design introduces challenges that are only partially addressed by existing analytic approaches. First, outcomes are often missing for some individuals within clusters. Failing to appropriately adjust for differential outcome measurement can result in biased estimates and inference. Second, CRTs often randomize limited numbers of clusters, resulting in chance imbalances on baseline outcome predictors between arms. Failing to adaptively adjust for these imbalances and other predictive covariates can result in efficiency losses. To address these methodological gaps, we propose and evaluate a novel two-stage targeted minimum loss-based estimator to adjust for baseline covariates in a manner that optimizes precision, after controlling for baseline and postbaseline causes of missing outcomes. Finite sample simulations illustrate that our approach can nearly eliminate bias due to differential outcome measurement, while existing CRT estimators yield misleading results and inferences. Application to real data from the SEARCH community randomized trial demonstrates the gains in efficiency afforded through adaptive adjustment for baseline covariates, after controlling for missingness on individual-level outcomes.


Assuntos
Avaliação de Resultados em Cuidados de Saúde , Projetos de Pesquisa , Humanos , Ensaios Clínicos Controlados Aleatórios como Assunto , Probabilidade , Viés , Análise por Conglomerados , Simulação por Computador
13.
Biostatistics ; 2023 Aug 02.
Article in English | MEDLINE | ID: mdl-37531621

ABSTRACT

Cluster randomized trials (CRTs) often enroll large numbers of participants; yet due to resource constraints, only a subset of participants may be selected for outcome assessment, and those sampled may not be representative of all cluster members. Missing data also present a challenge: if sampled individuals with measured outcomes are dissimilar from those with missing outcomes, unadjusted estimates of arm-specific endpoints and the intervention effect may be biased. Further, CRTs often enroll and randomize few clusters, limiting statistical power and raising concerns about finite sample performance. Motivated by SEARCH-TB, a CRT aimed at reducing incident tuberculosis infection, we demonstrate interlocking methods to handle these challenges. First, we extend Two-Stage targeted minimum loss-based estimation to account for three sources of missingness: (i) subsampling; (ii) measurement of baseline status among those sampled; and (iii) measurement of final status among those in the incidence cohort (persons known to be at risk at baseline). Second, we critically evaluate the assumptions under which subunits of the cluster can be considered the conditionally independent unit, improving precision and statistical power but also causing the CRT to behave like an observational study. Our application to SEARCH-TB highlights the real-world impact of different assumptions on measurement and dependence; estimates relying on unrealistic assumptions suggested the intervention increased the incidence of TB infection by 18% (risk ratio [RR]=1.18, 95% confidence interval [CI]: 0.85-1.63), while estimates accounting for the sampling scheme, missingness, and within-community dependence found the intervention decreased incident TB infection by 27% (RR=0.73, 95% CI: 0.57-0.92).

14.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34472591

ABSTRACT

Missing values are common in high-throughput mass spectrometry data. Two strategies are available to address missing values: (i) eliminate or impute the missing values and apply statistical methods that require complete data and (ii) use statistical methods that specifically account for missing values without imputation (imputation-free methods). This study reviews the effect of sample size and percentage of missing values on statistical inference for multiple methods under these two strategies. With increasing missingness, the ability of imputation and imputation-free methods to identify differentially and non-differentially regulated compounds in a two-group comparison study declined. Random forest and k-nearest neighbor imputation combined with a Wilcoxon test performed well in statistical testing for up to 50% missingness with little bias in estimating the effect size. Quantile regression imputation accompanied with a Wilcoxon test also had good statistical testing outcomes but substantially distorted the difference in means between groups. None of the imputation-free methods performed consistently better for statistical testing than imputation methods.
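
A minimal k-nearest-neighbor imputer — one of the strategies the review evaluates — can be sketched in a few lines. This is a bare-bones illustration (distances over co-observed features, unweighted mean of the k nearest donors), not the reviewed implementations:

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with the mean of that column among the k rows
    closest on the co-observed columns (a minimal k-NN imputer)."""
    out = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is not None:
                continue
            cands = []
            for r, other in enumerate(matrix):
                if r == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    d = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                    cands.append((d, other[j]))
            nearest = sorted(cands)[:k]
            out[i][j] = sum(v for _, v in nearest) / len(nearest)
    return out

# Toy abundance matrix: rows are samples, columns are compounds.
m = [[1.0, 10.0], [1.1, 12.0], [1.05, None], [8.0, 50.0]]
print(knn_impute(m, k=2)[2][1])  # mean of the two nearest rows' values: 11.0
```

After imputation, a rank-based test such as the Wilcoxon rank-sum test would be applied to the completed columns, as in the review's best-performing pipelines.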


Subjects
Research Design; Bias; Cluster Analysis; Mass Spectrometry/methods
15.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34882223

ABSTRACT

Clinical data are increasingly being mined to derive new medical knowledge with a goal of enabling greater diagnostic precision, better-personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and type of data, with entries often being unmeasured or unknown. As a result, missing data often represent one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate the imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used them as the ground truth. To the best of our knowledge, DACMI is the first shared-task challenge on clinical time series imputation. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to avoid excessive model complexity. The challenge participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal imputation and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.
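
The DACMI evaluation design — hide one genuinely recorded result, impute it, and score against the held-out value — can be mimicked with a simple linear-interpolation baseline. This baseline is illustrative only (the challenge systems used far richer models); the lab values are hypothetical:

```python
def interpolate_missing(series):
    """Linear interpolation over None gaps; the ends carry the nearest
    observed value. A simple baseline imputer for clinical time-series labs."""
    known = [(i, v) for i, v in enumerate(series) if v is not None]
    out = []
    for i, v in enumerate(series):
        if v is not None:
            out.append(v)
            continue
        left = [(j, w) for j, w in known if j < i]
        right = [(j, w) for j, w in known if j > i]
        if left and right:
            (j0, w0), (j1, w1) = left[-1], right[0]
            out.append(w0 + (w1 - w0) * (i - j0) / (j1 - j0))
        else:  # before the first or after the last observation
            out.append((left or right)[0 if right else -1][1])
    return out

# DACMI-style evaluation: hide one truly observed value, impute, compare.
truth = [4.1, 4.0, 3.8, 3.9, 4.2]      # hypothetical serum potassium over time
masked = truth[:]
masked[2] = None                        # remove one recorded result
est = interpolate_missing(masked)[2]
print(round(est, 2), round(abs(est - truth[2]), 2))  # estimate 3.95, error 0.15
```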


Subjects
Algorithms; Machine Learning; Cross-Sectional Studies; Data Collection; Humans; Models, Statistical
16.
Mol Phylogenet Evol ; 200: 108177, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39142526

ABSTRACT

Despite the many advances of the genomic era, there is a persistent problem in assessing the uncertainty of phylogenomic hypotheses. We see this in the recent history of phylogenetics for cockroaches and termites (Blattodea), where huge advances have been made, but there are still major inconsistencies between studies. To address this, we present a phylogenetic analysis of Blattodea that emphasizes identification and quantification of uncertainty. We analyze 1183 gene domains using three methods (multi-species coalescent inference, concatenation, and a supermatrix-supertree hybrid approach) and assess support for controversial relationships while considering data quality. The hybrid approach-here dubbed "tiered phylogenetic inference"-incorporates information about data quality into an incremental tree building framework. Leveraging this method, we are able to identify cases of low or misleading support that would not be possible otherwise, and explore them more thoroughly with follow-up tests. In particular, quality annotations pointed towards nodes with high bootstrap support that later turned out to have large ambiguities, sometimes resulting from low-quality data. We also clarify issues related to some recalcitrant nodes: Anaplectidae's placement lacks unbiased signal, Ectobiidae s.s. and Anaplectoideini need greater taxon sampling, and the deepest relationships among most Blaberidae lack signal. As a result, several previous phylogenetic uncertainties are now closer to being resolved (e.g., African and Malagasy "Rhabdoblatta" spp. are the sister to all other Blaberidae, and Oxyhaloinae is sister to the remaining Blaberidae). Overall, we argue for more approaches to quantifying support that take data quality into account to uncover the nature of recalcitrant nodes.

17.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38456546

ABSTRACT

The problem of estimating the size of a population based on a subset of individuals observed across multiple data sources is often referred to as capture-recapture or multiple-systems estimation. This is fundamentally a missing data problem, where the number of unobserved individuals represents the missing data. As with any missing data problem, multiple-systems estimation requires users to make an untestable identifying assumption in order to estimate the population size from the observed data. If an appropriate identifying assumption cannot be found for a data set, no estimate of the population size should be produced based on that data set, as models with different identifying assumptions can produce arbitrarily different population size estimates-even with identical observed data fits. Approaches to multiple-systems estimation often do not explicitly specify identifying assumptions. This makes it difficult to decouple the specification of the model for the observed data from the identifying assumption and to provide justification for the identifying assumption. We present a re-framing of the multiple-systems estimation problem that leads to an approach that decouples the specification of the observed-data model from the identifying assumption, and discuss how common models fit into this framing. This approach takes advantage of existing software and facilitates various sensitivity analyses. We demonstrate our approach in a case study estimating the number of civilian casualties in the Kosovo war.
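
The simplest instance of multiple-systems estimation is the two-list case, where the identifying assumption is that the lists capture individuals independently. Two standard estimators under that assumption (the counts below are hypothetical):

```python
def lincoln_petersen(n1, n2, m):
    """Classic two-list population-size estimate: N = n1*n2 / m, under the
    untestable identifying assumptions of independent lists and closure."""
    return n1 * n2 / m

def chapman(n1, n2, m):
    """Chapman's bias-corrected variant, defined even when m = 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical two-source casualty lists: 200 on list A, 100 on list B,
# 40 individuals appearing on both.
print(lincoln_petersen(200, 100, 40))   # 500.0
print(round(chapman(200, 100, 40), 1))  # 494.1
```

The paper's point is precisely that such estimates hinge on the identifying assumption: a different (equally data-compatible) dependence assumption between the lists would yield a different population size from the same observed counts.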


Subjects
Population Density; Humans
18.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38364812

ABSTRACT

People living with HIV on antiretroviral therapy often have undetectable virus levels by standard assays, but "latent" HIV still persists in viral reservoirs. Eliminating these reservoirs is the goal of HIV cure research. The quantitative viral outgrowth assay (QVOA) is commonly used to estimate the reservoir size, that is, the infectious units per million (IUPM) of HIV-persistent resting CD4+ T cells. A new variation of the QVOA, the ultra deep sequencing assay of the outgrowth virus (UDSA), was recently developed that further quantifies the number of viral lineages within a subset of infected wells. Performing the UDSA on a subset of wells provides additional information that can improve IUPM estimation. This paper considers statistical inference about the IUPM from combined dilution assay (QVOA) and deep viral sequencing (UDSA) data, even when some deep sequencing data are missing. Methods are proposed to accommodate assays with wells sequenced at multiple dilution levels and with imperfect sensitivity and specificity, and a novel bias-corrected estimator is included for small samples. The proposed methods are evaluated in a simulation study, applied to data from the University of North Carolina HIV Cure Center, and implemented in the open-source R package SLDeepAssay.
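
The textbook special case behind IUPM estimation — a single dilution level with a perfect assay — has a closed-form maximum-likelihood solution under a Poisson model for infectious units per well. The plate counts below are hypothetical; the paper's methods extend this to multiple dilutions, sequencing data, and imperfect sensitivity/specificity:

```python
import math

def iupm_single_dilution(n_wells, n_negative, cells_per_well):
    """ML estimate of infectious units per million (IUPM) from one dilution
    level, assuming a Poisson count of infectious units per well:
    P(well negative) = exp(-lambda * cells), so lambda = -ln(p_neg) / cells."""
    p_neg = n_negative / n_wells
    lam_per_cell = -math.log(p_neg) / cells_per_well
    return lam_per_cell * 1e6

# Hypothetical QVOA plate: 12 wells of 0.5 million cells each, 6 wells negative.
print(round(iupm_single_dilution(12, 6, 0.5e6), 3))  # 1.386 IUPM
```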


Subjects
HIV Infections; HIV-1; Humans; Virus Latency; HIV-1/genetics; CD4-Positive T-Lymphocytes; Computer Simulation; Viral Load
19.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38281771

ABSTRACT

Statistical approaches that successfully combine multiple datasets are more powerful, efficient, and scientifically informative than separate analyses. To address variation architectures correctly and comprehensively for high-dimensional data across multiple sample sets (ie, cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method to concurrently learn both covariate-driven and auxiliary structured variations. We consider a structured nuclear norm objective that is motivated by random matrix theory, in which the regression or factorization terms may be shared or specific to any number of cohorts. Our framework subsumes several existing methods, such as reduced rank regression and unsupervised multimatrix factorization approaches, and includes a promising novel approach to regression and factorization of a single dataset (aRRR) as a special case. Simulations demonstrate substantial gains in power from combining multiple datasets, and from parsimoniously accounting for all structured variations. We apply maRRR to gene expression data from multiple cancer types (ie, pan-cancer) from The Cancer Genome Atlas, with somatic mutations as covariates. The method performs well with respect to prediction and imputation of held-out data, and provides new insights into mutation-driven and auxiliary variations that are shared or specific to certain cancer types.


Subjects
Neoplasms; Humans; Multivariate Analysis; Neoplasms/genetics
20.
Stat Med ; 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39054668

ABSTRACT

We consider the problem of optimal model averaging for partially linear models when the responses are missing at random and some covariates are measured with error. A novel weight choice criterion based on the Mallows-type criterion is proposed for the weight vector to be used in the model averaging. The resulting model averaging estimator for the partially linear models is shown to be asymptotically optimal under some regularity conditions in terms of achieving the smallest possible squared loss. In addition, the existence of a local minimizing weight vector and its convergence rate to the risk-based optimal weight vector are established. Simulation studies suggest that the proposed model averaging method generally outperforms existing methods. As an illustration, the proposed method is applied to analyze an HIV-CD4 dataset.
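
The Mallows-type weighting idea can be sketched in its simplest form: average two candidate fits, choosing the weight that minimizes residual error plus a penalty on the effective number of parameters. This bare two-model toy (complete data, no measurement error, grid search over a scalar weight) is only an illustration of the criterion's shape, not the paper's estimator:

```python
def fit_mean(y):
    # candidate model 1: intercept only (1 parameter)
    mu = sum(y) / len(y)
    return [mu] * len(y)

def fit_line(x, y):
    # candidate model 2: simple linear regression (2 parameters)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [my - b * mx + b * xi for xi in x]

def mallows_weight(x, y, sigma2):
    """Grid-search the weight w for yhat = w*f1 + (1-w)*f2 minimizing a
    Mallows-type criterion ||y - yhat||^2 + 2*sigma2*k(w), where
    k(w) = w*1 + (1-w)*2 is the averaged parameter count."""
    f1, f2 = fit_mean(y), fit_line(x, y)
    def crit(w):
        rss = sum((yi - (w * a + (1 - w) * b)) ** 2 for yi, a, b in zip(y, f1, f2))
        return rss + 2 * sigma2 * (w * 1 + (1 - w) * 2)
    return min((i / 100 for i in range(101)), key=crit)

x = list(range(20))
y = [0.8 * xi + 3 for xi in x]          # strong linear signal
w = mallows_weight(x, y, sigma2=1.0)
print(w)  # weight on the intercept-only model: 0.0 when the slope is real
```

With a genuine slope, the residual penalty of the intercept-only model dominates and the criterion puts all weight on the linear fit; with pure-noise data the penalty term pushes weight the other way.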
