Results 1 - 20 of 129
1.
Biometrics ; 80(3), 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39005072

ABSTRACT

The increasing availability and scale of biobanks and "omic" datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using Genome Wide Association Studies (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of "signal" genes with those of "noise" genes. From the estimated genetic component, PathGPS then extracts genetic pathways via principal component and factor analysis, leveraging the low-rank and sparse properties. In addition, we provide a bootstrap aggregating ("bagging") algorithm to improve stability under data perturbation and hyperparameter tuning. When applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene-trait clusters and suggests multiple new hypotheses for future investigations.
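
As a rough, non-authoritative illustration of the decoupling idea (not the PathGPS implementation), the Python sketch below contrasts hypothetical association matrices for "signal" and "noise" gene sets and factorizes the resulting genetic component; all dimensions, the rank, and the random data are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
n_traits, n_signal, n_noise, rank = 30, 200, 200, 3

# Hypothetical GWAS summary associations (genes x traits).
B_signal = rng.normal(size=(n_signal, n_traits))   # "signal" genes: genetic plus environmental pathways
B_noise = rng.normal(size=(n_noise, n_traits))     # "noise" genes: assumed to act only through shared environment

# Contrast the two gene sets: estimate the environmental trait-trait component from the
# noise genes and subtract it, leaving an estimate of the genetic component.
env_cov = B_noise.T @ B_noise / n_noise
genetic_cov = B_signal.T @ B_signal / n_signal - env_cov

# A low-rank factorization of the genetic component gives candidate pathway loadings over traits.
eigvals, eigvecs = np.linalg.eigh(genetic_cov)
pathways = eigvecs[:, np.argsort(eigvals)[::-1][:rank]]
print(pathways.shape)   # (n_traits, rank)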


Subject(s)
Algorithms, Genome-Wide Association Study, Genome-Wide Association Study/statistics & numerical data, Humans, Metabolomics/methods, Principal Component Analysis, Genetic Models, Single Nucleotide Polymorphism, Biological Specimen Banks, Computer Simulation, Statistical Models
2.
ArXiv ; 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38947923

ABSTRACT

Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation-maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train, for example, lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically annotated primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL can incorporate cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.
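
For intuition only, here is a minimal EM-style sketch of the mixture-modeling idea with a logistic-regression base learner: cells from healthy samples are treated as known negatives, while cells from disease samples form an unlabeled mixture whose responsibilities are re-estimated each iteration. The two-class setup, weighting scheme, and iteration count are illustrative assumptions, not the authors' code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def mmil_em_sketch(X, bag_label, n_iter=20):
    """Cell-level classifier trained from patient-level (bag) labels via EM (illustrative).

    X         : (n_cells, n_features) cell measurements
    bag_label : (n_cells,) 0 = cell from a healthy sample, 1 = cell from a disease sample
    """
    resp = np.where(bag_label == 1, 0.5, 0.0)   # initial probability that each cell is a disease cell
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_iter):
        # M-step: each cell contributes a positive copy weighted by resp and a negative copy weighted by 1 - resp.
        Xw = np.vstack([X, X])
        yw = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
        w = np.concatenate([resp, 1.0 - resp])
        keep = w > 0
        clf.fit(Xw[keep], yw[keep], sample_weight=w[keep])
        # E-step: update responsibilities for cells in disease samples; healthy-sample cells stay negative.
        resp = np.where(bag_label == 1, clf.predict_proba(X)[:, 1], 0.0)
    return clf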

3.
J Comput Graph Stat ; 33(2): 551-566, 2024.
Article in English | MEDLINE | ID: mdl-38993268

ABSTRACT

In clinical practice and biomedical research, measurements are often collected sparsely and irregularly in time, while data acquisition is expensive and inconvenient. Examples include measurements of spine bone mineral density, cancer growth through mammography or biopsy, progression of defective vision, or assessment of gait in patients with neurological disorders. Practitioners often need to infer the progression of diseases from such sparse observations. A classical tool for analyzing such data is a mixed-effects model where time is treated as both a fixed effect (population progression curve) and a random effect (individual variability). Alternatively, researchers use Gaussian processes or functional data analysis, assuming that observations are drawn from a certain distribution of processes. While these models are flexible, they rely on probabilistic assumptions, require very careful implementation, and tend to be slow in practice. In this study, we propose an alternative elementary framework for analyzing longitudinal data motivated by matrix completion. Our method yields estimates of progression curves by iterative application of the Singular Value Decomposition. Our framework covers multivariate longitudinal data and regression, and can be easily extended to other settings. As it relies on existing tools for matrix algebra, it is efficient and easy to implement. We apply our methods to understand trends of progression of motor impairment in children with Cerebral Palsy. Our model approximates individual progression curves and explains 30% of the variability. Low-rank representation of progression trends enables identification of different progression trends in subtypes of Cerebral Palsy.
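
A bare-bones, soft-impute-style sketch of the iterative-SVD idea follows: observations are placed on a common subjects-by-timepoints grid, missing cells are filled, and a truncated SVD is applied repeatedly while observed values are held fixed. The rank, iteration budget, and tolerance are placeholders, not the paper's settings.

import numpy as np

def svd_impute(Y, rank=2, n_iter=100, tol=1e-6):
    """Complete a sparsely observed subjects x timepoints matrix by iterative truncated SVD (sketch)."""
    mask = ~np.isnan(Y)                        # observed entries
    X = np.where(mask, Y, np.nanmean(Y))       # initialize missing entries with the overall mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_new = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # best rank-`rank` approximation
        X_new = np.where(mask, Y, X_new)                # keep observed values fixed
        if np.linalg.norm(X_new - X) < tol * np.linalg.norm(X):
            return X_new
        X = X_new
    return X

Each row of the completed matrix then serves as a smoothed progression curve for one subject.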

4.
ArXiv ; 2024 May 07.
Article in English | MEDLINE | ID: mdl-38764589

ABSTRACT

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of individuals of non-European descent, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gallstones, cystitis, asthma, and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS was the primary driver of prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy, but only for a limited set of diseases and with moderate improvements in accuracy.

5.
bioRxiv ; 2024 May 02.
Article in English | MEDLINE | ID: mdl-38464202

ABSTRACT

Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Here, we present a novel framework for genome-wide detection of sets of variants that carry non-redundant information on the phenotypes and are therefore more likely to be causal in a biological sense. Crucially, our framework requires only summary statistics obtained from standard genome-wide marginal association testing. The described approach, implemented in open-source software, is also computationally efficient, requiring less than 15 minutes on a single CPU to perform genome-wide analysis. Through extensive genome-wide simulation studies, we show that the method can substantially outperform the usual two-stage marginal association testing and fine-mapping procedures in precision and recall. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD), we identified 82 loci associated with AD, including 37 additional loci missed by the conventional GWAS pipeline. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of 67 large-scale GWAS summary statistics published since 2013 for a variety of phenotypes. The results reveal the method's capacity to robustly discover additional loci for polygenic traits and pinpoint potential causal variants underpinning each locus beyond the conventional GWAS pipeline, contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses.

6.
BMC Med Res Methodol ; 24(1): 27, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38302887

ABSTRACT

BACKGROUND: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life and compared this interpolation to several common interpolation methods and pediatric growth models. METHODS: We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N = 97) then in a large, outpatient, pediatric sample (N = 14,695). RESULTS: The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR: 0.19; 90% < 0.43]; girls: 0.20 kg [IQR: 0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR: 0.53; 90% < 1.0]; girls: 0.91 cm [IQR: 0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Interpolation with this equation had comparable (for weight) or lower (for height) mean RMSE compared to the best performing alternative models. CONCLUSIONS: A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.
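
The exact parameterization is not reproduced in this listing, so the following is only one plausible reading of a "modified" Michaelis-Menten growth form, with the birth measurement acting as an intercept, W(t) = W0 + Wmax * t / (K + t), fit by nonlinear least squares; the ages and weights below are invented.

import numpy as np
from scipy.optimize import curve_fit

def mm_growth(t, w0, wmax, k):
    # Assumed modified Michaelis-Menten form: birth value w0 plus a saturating gain.
    return w0 + wmax * t / (k + t)

# Hypothetical weights (kg) at ages in months for one infant, including birth weight at t = 0.
t_obs = np.array([0.0, 2.0, 4.0, 6.0, 12.0])
w_obs = np.array([3.4, 5.3, 6.6, 7.5, 9.6])

params, _ = curve_fit(mm_growth, t_obs, w_obs, p0=[3.0, 8.0, 6.0])
print(mm_growth(9.0, *params))   # interpolate a missing 9-month visit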


Subject(s)
Kinetics, Male, Infant, Female, Humans, Child, Birth Weight
7.
JAMA Intern Med ; 183(10): 1128-1135, 2023 Oct 01.
Article in English | MEDLINE | ID: mdl-37669046

ABSTRACT

Importance: Although oral temperature is commonly assessed in medical examinations, the range of usual or "normal" temperature is poorly defined. Objective: To determine normal oral temperature ranges by age, sex, height, weight, and time of day. Design, Setting, and Participants: This cross-sectional study used clinical visit information from the divisions of Internal Medicine and Family Medicine in a single large medical care system. All adult outpatient encounters that included temperature measurements from April 28, 2008, through June 4, 2017, were eligible for inclusion. The LIMIT (Laboratory Information Mining for Individualized Thresholds) filtering algorithm was applied to iteratively remove encounters with primary diagnoses overrepresented in the tails of the temperature distribution, leaving only those diagnoses unrelated to temperature. Mixed-effects modeling was applied to the remaining temperature measurements to identify independent factors associated with normal oral temperature and to generate individualized normal temperature ranges. Data were analyzed from July 5, 2017, to June 23, 2023. Exposures: Primary diagnoses and medications, age, sex, height, weight, time of day, and month, abstracted from each outpatient encounter. Main Outcomes and Measures: Normal temperature ranges by age, sex, height, weight, and time of day. Results: Of 618 306 patient encounters, 35.92% were removed by LIMIT because they included diagnoses or medications that fell disproportionately in the tails of the temperature distribution. The encounters removed due to overrepresentation in the upper tail were primarily linked to infectious diseases (76.81% of all removed encounters); type 2 diabetes was the only diagnosis removed for overrepresentation in the lower tail (15.71% of all removed encounters). The 396 195 encounters included in the analysis set represented 126 705 patients (57.35% women; mean [SD] age, 52.7 [15.9] years). Prior to running LIMIT, the mean (SD) overall oral temperature was 36.71 °C (0.43 °C); following LIMIT, the mean (SD) temperature was 36.64 °C (0.35 °C). Using mixed-effects modeling, age, sex, height, weight, and time of day accounted for 6.86% (overall) and up to 25.52% (per patient) of the observed variability in temperature. Mean normal oral temperature did not reach 37 °C for any subgroup; the upper 99th percentile ranged from 36.81 °C (a tall, underweight 80-year-old man at 8:00 am) to 37.88 °C (a short 20-year-old woman with obesity at 2:00 pm). Conclusions and Relevance: The findings of this cross-sectional study suggest that normal oral temperature varies in an expected manner based on sex, age, height, weight, and time of day, allowing individualized normal temperature ranges to be established. The clinical significance of a value outside of the usual range is an area for future study.
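
For readers wanting a concrete handle on the modeling step (after LIMIT-style filtering), a hedged sketch using a linear mixed model with a per-patient random intercept is shown below; the simulated data frame, column names, and formula are placeholders rather than the study's actual specification.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for a post-filtering encounter table (all columns hypothetical).
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "patient_id": rng.integers(0, 100, n),
    "age": rng.uniform(20, 80, n),
    "sex": rng.choice(["F", "M"], n),
    "height_cm": rng.normal(168, 9, n),
    "weight_kg": rng.normal(75, 15, n),
    "hour_of_day": rng.uniform(8, 17, n),
})
df["temp_c"] = 36.6 + 0.01 * (df["hour_of_day"] - 12) - 0.002 * (df["age"] - 50) + rng.normal(0, 0.3, n)

# Fixed effects for the covariates of interest, random intercept per patient.
fit = smf.mixedlm(
    "temp_c ~ age + sex + height_cm + weight_kg + hour_of_day",
    data=df,
    groups=df["patient_id"],
).fit()
print(fit.summary())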

8.
Stat Modelling ; 23(3): 203-227, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37334164

ABSTRACT

Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA), which imposes an ℓ2 penalty on the CCA coefficients, is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article, we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate computational strategies that avoid excessive computation when applying regularized CCA in high dimensions. We demonstrate these methods in our motivating neuroscience application, as well as in a small simulation example.
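
To make the ℓ2-regularized starting point concrete, here is a compact ridge-regularized CCA sketch: shrink each within-block covariance toward the identity, whiten, and take an SVD of the cross-covariance. The group extension described in the article would replace the single penalty per block with group-specific penalties; the penalty values and function interface here are illustrative.

import numpy as np

def rcca(X, Y, lam_x=0.1, lam_y=0.1, n_components=2):
    """Ridge-regularized CCA via whitened cross-covariance (illustrative sketch)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n + lam_x * np.eye(X.shape[1])   # ridge-shrunken covariances
    Cyy = Yc.T @ Yc / n + lam_y * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Singular vectors of the whitened cross-covariance give the canonical directions,
    # and the singular values are the (regularized) canonical correlations.
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U[:, :n_components])   # canonical coefficients for X
    B = np.linalg.solve(Ly.T, Vt[:n_components].T)   # canonical coefficients for Y
    return A, B, s[:n_components]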

9.
Multivariate Behav Res ; 58(6): 1057-1071, 2023.
Article in English | MEDLINE | ID: mdl-37229653

ABSTRACT

Despite its potential benefits, using prediction targets generated by latent variable (LV) modeling is not a common practice in supervised learning, the dominant framework for developing prediction models. In supervised learning, it is typically assumed that the outcome to be predicted is clear and readily available, and therefore validating outcomes before predicting them is a foreign concept and an unnecessary step. The usual goal of LV modeling is inference, and therefore using it in supervised learning and in the prediction context requires a major conceptual shift. This study lays out the methodological adjustments and conceptual shifts necessary for integrating LV modeling into supervised learning. It is shown that such integration is possible by combining the traditions of LV modeling, psychometrics, and supervised learning. In this interdisciplinary learning framework, generating practical outcomes using LV modeling and systematically validating them based on clinical validators are the two main strategies. In an example using data from the Longitudinal Assessment of Manic Symptoms (LAMS) Study, a large pool of candidate outcomes is generated by flexible LV modeling. It is demonstrated that this exploratory situation can be used as an opportunity to tailor desirable prediction targets, taking advantage of contemporary science and clinical insights.


Subject(s)
Supervised Machine Learning, Latent Class Analysis
10.
J Stat Softw ; 106, 2023.
Article in English | MEDLINE | ID: mdl-37138589

ABSTRACT

The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression, and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
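
The workhorse in this line of software is cyclic coordinate descent with soft-thresholding; a toy Gaussian-family version is sketched below for orientation (standardized, centered data assumed, no intercept). It is not the glmnet implementation itself.

import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha=0.5, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * (alpha*||b||_1 + (1-alpha)/2*||b||_2^2).

    Assumes the columns of X have mean 0 and variance 1, and y is centered.
    """
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                       # current residual y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]     # partial residual with feature j removed
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            r = r - X[:, j] * b[j]
    return b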

11.
Stat Sin ; 33(1): 259-279, 2023 Jan.
Article in English | MEDLINE | ID: mdl-37102071

ABSTRACT

In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
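
The "features of features" idea can be emulated, loosely, with per-feature penalty factors: meta-features that suggest a feature is relevant translate into a smaller penalty on its coefficient. The mapping from meta-features to penalties and the column-rescaling trick below are stand-ins for illustration, not the fwelnet algorithm.

import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, penalty_factors, lam=0.1):
    """Lasso with per-feature penalty factors, implemented by rescaling columns (illustrative)."""
    Xs = X / penalty_factors            # features with small factors are penalized less
    fit = Lasso(alpha=lam, max_iter=10000).fit(Xs, y)
    return fit.coef_ / penalty_factors  # map coefficients back to the original scale

# Hypothetical meta-feature z_j scoring each feature's prior relevance.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(size=n)     # only the first 5 features matter
z = np.concatenate([np.ones(5), np.zeros(p - 5)])
penalty_factors = np.exp(-z)                      # higher relevance score -> smaller penalty
coef = weighted_lasso(X, y, penalty_factors)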

12.
Sci Adv ; 9(3): eadd1166, 2023 Jan 20.
Article in English | MEDLINE | ID: mdl-36662860

ABSTRACT

Although the literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty about the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX to create a publicly browsable tissue atlas of inflammation in 42 tissue regions from 29 patients with UC and 5 healthy individuals. We analyzed 52 biomarkers on 1,710,973 spatially resolved single cells to determine cell types, cell-cell contacts, and cellular neighborhoods. We observed that cellular functional states are associated with cellular neighborhoods. We further observed that a subset of inflammatory cell types and cellular neighborhoods are present in patients with UC with TNFi treatment, potentially indicating resistant niches. Last, we explored applying convolutional neural networks (CNNs) to our dataset with respect to patient clinical variables. We note concerns and offer guidelines for reporting CNN-based predictions in similar datasets.


Subject(s)
Ulcerative Colitis, Humans, Ulcerative Colitis/drug therapy, Ulcerative Colitis/complications, Tumor Necrosis Factor Inhibitors/therapeutic use, Inflammation/complications, Biomarkers
13.
Res Sq ; 2023 Jun 22.
Article in English | MEDLINE | ID: mdl-36711501

ABSTRACT

Background and Objectives: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life. Methods: We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N = 97) then in a large, outpatient, pediatric sample (N = 14,695). Results: The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR: 0.19; 90% < 0.43]; girls: 0.20 kg [IQR: 0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR: 0.53; 90% < 1.0]; girls: 0.91 cm [IQR: 0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Conclusions: A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.

14.
Ann Appl Stat ; 16(3): 1891-1918, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36091495

ABSTRACT

In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component, we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
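
The alternating scheme can be caricatured in a few lines: with the trait-side factor V fixed, a multi-response sparse regression gives the variant-side factor U; with U fixed, an orthogonal Procrustes SVD refreshes V. The toy code below omits the adaptive screening and optimality checks that make the actual method scale; the penalty and rank are arbitrary.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_reduced_rank(X, Y, rank=2, lam=0.05, n_iter=10):
    """Alternate a lasso step for U with an SVD step for V so the coefficient matrix is U @ V.T (sketch)."""
    q = Y.shape[1]
    V = np.linalg.svd(np.random.default_rng(0).normal(size=(q, rank)), full_matrices=False)[0]
    for _ in range(n_iter):
        # Sparse regression of the projected responses Y @ V on X yields a sparse U (features x rank).
        U = np.column_stack([
            Lasso(alpha=lam, max_iter=10000).fit(X, (Y @ V)[:, k]).coef_
            for k in range(rank)
        ])
        # With U fixed, the best orthonormal V solves a Procrustes problem via an SVD.
        A, _, Bt = np.linalg.svd(Y.T @ (X @ U), full_matrices=False)
        V = A @ Bt
    return U, V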

15.
Ann Stat ; 50(2): 949-986, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36120512

ABSTRACT

Interpolators, estimators that achieve zero training error, have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ ℝ^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ ℝ^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ ℝ^d, W ∈ ℝ^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
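
For orientation, the minimum ℓ2-norm interpolator studied here has the closed form given by the pseudoinverse, and it fits the training data exactly whenever p ≥ n and X has full row rank; the short check below uses simulated data.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                          # overparametrized regime: more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_hat = np.linalg.pinv(X) @ y        # minimum l2-norm solution among all interpolators
print(np.allclose(X @ beta_hat, y))     # True: zero training error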

16.
PLoS Genet ; 18(3): e1010105, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35324888

ABSTRACT

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10^-5) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's ρ = 0.61, p = 2.2 × 10^-59 for quantitative traits; ρ = 0.21, p = 9.6 × 10^-4 for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).


Subject(s)
Genome-Wide Association Study, Multifactorial Inheritance, Biological Specimen Banks, Genetic Predisposition to Disease, Humans, Multifactorial Inheritance/genetics, Phenotype, Risk Factors, United Kingdom
17.
J Mach Learn Res ; 23, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37102181

ABSTRACT

Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable Models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method to a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.

18.
Biostatistics ; 23(2): 626-642, 2022 Apr 13.
Article in English | MEDLINE | ID: mdl-33221831

ABSTRACT

Three-dimensional (3D) genome spatial organization is critical for numerous cellular processes, including transcription, while certain conformation-driven structural alterations are frequently oncogenic. Genome architecture had been notoriously difficult to elucidate, but the advent of the suite of chromatin conformation capture assays, notably Hi-C, has transformed understanding of chromatin structure and provided downstream biological insights. Although many findings have flowed from direct analysis of the pairwise proximity data produced by these assays, there is added value in generating corresponding 3D reconstructions, on which genomic features can be superposed. Accordingly, many methods for inferring 3D architecture from proximity data have been advanced. However, none of these approaches exploit the fact that single-chromosome solutions constitute a one-dimensional (1D) curve in 3D. Rather, this aspect has either been addressed by imposition of constraints, which is both computationally burdensome and cell-type specific, or ignored, with contiguity imposed after the fact. Here, we target finding a 1D curve by extending principal curve methodology to the metric scaling problem. We illustrate how this approach yields a sequence of candidate solutions, indexed by an underlying smoothness or degrees-of-freedom parameter, and propose methods for selection from this sequence. We apply the methodology to Hi-C data obtained on IMR90 cells and so are positioned to evaluate reconstruction accuracy by referencing orthogonal imaging data. The results indicate the utility and reproducibility of our principal curve approach in the face of underlying structural variation.
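
A very rough sketch of the two ingredients named above, metric scaling followed by a smooth 1D curve through the embedding, is given below. Classical MDS is applied to contact-derived distances, and a moving-average smoother along genomic order stands in for the principal-curve fit; the count-to-distance transform and smoothing span are assumptions, not the paper's choices.

import numpy as np

def classical_mds(D, dim=3):
    """Classical multidimensional scaling of a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                       # double-centered squared distances
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def smooth_along_order(coords, span=5):
    """Moving-average smoother along genomic order (stand-in for a principal curve)."""
    kernel = np.ones(span) / span
    return np.column_stack([np.convolve(coords[:, k], kernel, mode="same")
                            for k in range(coords.shape[1])])

# Hypothetical Hi-C contact counts converted to distances via an assumed power law.
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(100, 100)) + 1
counts = (counts + counts.T) / 2
D = 1.0 / np.sqrt(counts)
np.fill_diagonal(D, 0.0)
curve = smooth_along_order(classical_mds(D, dim=3))   # one 3D point per genomic bin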


Subject(s)
Chromatin, Genome, Chromatin/genetics, Chromosomes, Genomics/methods, Humans, Reproducibility of Results
19.
Biostatistics ; 23(2): 522-540, 2022 Apr 13.
Article in English | MEDLINE | ID: mdl-32989444

ABSTRACT

We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the L1-regularized (lasso) partial likelihood, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full lasso path: the parameter estimates at all predefined regularization parameters, together with their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
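
As a stand-in for the model class only (not the BASIL screening strategy or the snpnet-Cox implementation), the sketch below fits an L1-penalized Cox model by proximal gradient descent on the partial likelihood; the step size, penalty, and absence of tie handling are simplifying assumptions.

import numpy as np

def cox_neg_loglik_grad(beta, X, time, event):
    """Gradient of the negative Cox partial log-likelihood (Breslow-style, no tie correction)."""
    eta = X @ beta
    order = np.argsort(-time)                           # latest times first, so cumsums give risk sets
    Xo, eo, expo = X[order], event[order], np.exp(eta[order])
    cum_w = np.cumsum(expo)                             # sum of exp(eta) over each risk set
    cum_wx = np.cumsum(expo[:, None] * Xo, axis=0)      # sum of exp(eta) * x over each risk set
    grad = np.zeros_like(beta)
    for i in np.flatnonzero(eo):                        # loop over observed events
        grad -= Xo[i] - cum_wx[i] / cum_w[i]
    return grad

def lasso_cox(X, time, event, lam=0.05, step=0.01, n_iter=500):
    """L1-penalized Cox model via proximal gradient (toy sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * cox_neg_loglik_grad(beta, X, time, event)
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding step
    return beta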


Subject(s)
Algorithms, Biological Specimen Banks, Humans, Likelihood Functions, Proportional Hazards Models, United Kingdom