1.
JAMA Psychiatry ; 2024 Aug 28.
Article in English | MEDLINE | ID: mdl-39196567

ABSTRACT

Importance: Mental illnesses are a leading cause of disability globally, and functional disability is often in part caused by cognitive impairments across psychiatric disorders. However, studies have consistently reported seemingly opposite findings regarding the association between cognition and psychiatric symptoms. Objective: To determine if the association between general cognition and mental health symptoms diverges at different symptom severities in children. Design, Setting, and Participants: A total of 5175 children with complete data at 2 time points assessed 2 years apart (aged 9 to 11 years at the first assessment) from the ongoing Adolescent Brain and Cognitive Development (ABCD) study were evaluated for a general cognition factor and mental health symptoms from September 2016 to August 2020 at 21 sites across the US. Polynomial and generalized additive models afforded derivation of continuous associations between cognition and psychiatric symptoms across different ranges of symptom severity. Data were analyzed from December 2022 to April 2024. Main Outcomes and Measures: Aggregate cognitive test scores (general cognition) were primarily evaluated in relation to total and subscale-specific symptoms reported from the Child Behavioral Checklist. Results: The sample included 5175 children (2713 male [52.4%] and 2462 female [47.6%]; mean [SD] age, 10.9 [1.18] years). Previously reported mixed findings regarding the association between general cognition and symptoms may consist of several underlying, opposed associations that depend on the class and severity of symptoms. Linear models recovered differing associations between general cognition and mental health symptoms, depending on the range of symptom severities queried. 
Nonlinear models confirm that internalizing symptoms were significantly positively associated with cognition at low symptom burdens (higher cognition = more symptoms) and significantly negatively associated with cognition at high symptom burdens. Conclusions and Relevance: The association between mental health symptoms and general cognition in this study was nonlinear. Internalizing symptoms were both positively and negatively associated with general cognition at a significant level, depending on the range of symptom severities queried in the analysis sample. These results appear to reconcile mixed findings in prior studies, which implicitly assume that symptom severity tracks linearly with cognitive ability across the entire spectrum of mental health. As the association between cognition and symptoms may be opposite in low vs high symptom severity samples, these results reveal the necessity of clinical enrichment in studies of cognitive impairment.

2.
Biometrics ; 80(3), 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39005072

ABSTRACT

The increasing availability and scale of biobanks and "omic" datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using Genome Wide Association Studies (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of "signal" genes with those of "noise" genes. From the estimated genetic component, PathGPS then extracts genetic pathways via principal component and factor analysis, leveraging the low-rank and sparse properties. In addition, we provide a bootstrap aggregating ("bagging") algorithm to improve stability under data perturbation and hyperparameter tuning. When applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene-trait clusters and suggests multiple new hypotheses for future investigations.


Subjects
Algorithms , Genome-Wide Association Study , Genome-Wide Association Study/statistics & numerical data , Humans , Metabolomics/methods , Principal Component Analysis , Models, Genetic , Polymorphism, Single Nucleotide , Biological Specimen Banks , Computer Simulation , Models, Statistical
3.
ArXiv ; 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38947923

ABSTRACT

Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.

4.
J Comput Graph Stat ; 33(2): 551-566, 2024.
Article in English | MEDLINE | ID: mdl-38993268

ABSTRACT

In clinical practice and biomedical research, measurements are often collected sparsely and irregularly in time, while the data acquisition is expensive and inconvenient. Examples include measurements of spine bone mineral density, cancer growth through mammography or biopsy, a progression of defective vision, or assessment of gait in patients with neurological disorders. Practitioners often need to infer the progression of diseases from such sparse observations. A classical tool for analyzing such data is a mixed-effect model where time is treated as both a fixed effect (population progression curve) and a random effect (individual variability). Alternatively, researchers use Gaussian processes or functional data analysis, assuming that observations are drawn from a certain distribution of processes. While these models are flexible, they rely on probabilistic assumptions, require very careful implementation, and tend to be slow in practice. In this study, we propose an alternative elementary framework for analyzing longitudinal data motivated by matrix completion. Our method yields estimates of progression curves by iterative application of the Singular Value Decomposition. Our framework covers multivariate longitudinal data and regression, and can be easily extended to other settings. As it relies on existing tools for matrix algebra, it is efficient and easy to implement. We apply our methods to understand trends of progression of motor impairment in children with Cerebral Palsy. Our model approximates individual progression curves and explains 30% of the variability. Low-rank representation of progression trends enables identification of different progression trends in subtypes of Cerebral Palsy.
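The "iterative application of the Singular Value Decomposition" mentioned in this abstract can be illustrated with a generic soft-thresholded SVD completion loop. This is a minimal sketch of the general matrix-completion idea, not the authors' implementation: the function name `soft_impute`, the penalty `lam`, and the plain patient-by-time matrix layout are assumptions made for illustration, and the paper's framework may differ (for example in how time is represented).

```python
import numpy as np

def soft_impute(Y, observed, lam=1.0, n_iter=100):
    """Estimate a low-rank matrix from partially observed entries.

    Y        : 2-D array of measurements (values at unobserved cells are ignored)
    observed : boolean array, True where Y was actually measured
    lam      : hypothetical soft-threshold applied to the singular values
    """
    Y = np.asarray(Y, dtype=float)
    X = np.zeros_like(Y)  # current low-rank estimate
    for _ in range(n_iter):
        # Keep measured values; fill the gaps with the current estimate.
        filled = np.where(observed, Y, X)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Shrink singular values toward zero, which encourages low rank.
        s = np.maximum(s - lam, 0.0)
        X = (U * s) @ Vt
    return X
```

Rows would correspond to patients and columns to time bins under this assumed layout; larger values of `lam` give smoother, lower-rank estimates.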

5.
ArXiv ; 2024 May 07.
Article in English | MEDLINE | ID: mdl-38764589

ABSTRACT

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy, but only for a limited set of diseases and with moderate improvements in accuracy.

6.
bioRxiv ; 2024 May 02.
Article in English | MEDLINE | ID: mdl-38464202

ABSTRACT

Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Here, we present a novel framework for genome-wide detection of sets of variants that carry non-redundant information on the phenotypes and are therefore more likely to be causal in a biological sense. Crucially, our framework requires only summary statistics obtained from standard genome-wide marginal association testing. The described approach, implemented in open-source software, is also computationally efficient, requiring less than 15 minutes on a single CPU to perform genome-wide analysis. Through extensive genome-wide simulation studies, we show that the method can substantially outperform usual two-stage marginal association testing and fine-mapping procedures in precision and recall. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD), we identified 82 loci associated with AD, including 37 additional loci missed by the conventional GWAS pipeline. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of 67 large-scale GWAS summary statistics since 2013 for a variety of phenotypes. Results reveal the method's capacity to robustly discover additional loci for polygenic traits and pinpoint potential causal variants underpinning each locus beyond the conventional GWAS pipeline, contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses.

7.
BMC Med Res Methodol ; 24(1): 27, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38302887

ABSTRACT

BACKGROUND: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life and compared this interpolation to several common interpolation methods and pediatric growth models. METHODS: We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N = 97) then in a large, outpatient, pediatric sample (N = 14,695). RESULTS: The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR:0.19; 90% < 0.43]; girls: 0.20 kg [IQR:0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR:0.53; 90% < 1.0]; girls: 0.91 cm [IQR:0.50;90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Interpolation with this equation had comparable (for weight) or lower (for height) mean RMSE compared to the best performing alternative models. CONCLUSIONS: A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.
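To make the saturation-curve idea concrete, the sketch below fits a Michaelis-Menten-shaped curve with an added birth-weight offset, w(t) = w0 + vmax * t / (km + t), and uses it to interpolate a missing value. The abstract does not state the authors' modified equation, so this functional form, the names `fit_mm` and `predict_mm`, and the grid of candidate `km` values are all assumptions for illustration only.

```python
import numpy as np

def predict_mm(t, w0, vmax, km):
    """Assumed saturation curve: offset w0 plus a Michaelis-Menten term."""
    t = np.asarray(t, dtype=float)
    return w0 + vmax * t / (km + t)

def fit_mm(t, w, km_grid=None):
    """Fit (w0, vmax, km) by scanning km and solving least squares for w0, vmax."""
    t = np.asarray(t, dtype=float)
    w = np.asarray(w, dtype=float)
    if km_grid is None:
        km_grid = np.linspace(0.5, 60.0, 240)  # hypothetical range, in months
    best = None
    for km in km_grid:
        # For a fixed km the model is linear in w0 and vmax.
        A = np.column_stack([np.ones_like(t), t / (km + t)])
        coef, *_ = np.linalg.lstsq(A, w, rcond=None)
        sse = float(np.sum((A @ coef - w) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], km)
    _, w0, vmax, km = best
    return w0, vmax, km
```

With ages in months and weights in kg from a handful of visits, `predict_mm(t_missing, *fit_mm(t, w))` would give an interpolated weight at an unmeasured age under these assumptions.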


Subjects
Kinetics , Male , Infant , Female , Humans , Child , Birth Weight
8.
JAMA Intern Med ; 183(10): 1128-1135, 2023 Oct 01.
Article in English | MEDLINE | ID: mdl-37669046

ABSTRACT

Importance: Although oral temperature is commonly assessed in medical examinations, the range of usual or "normal" temperature is poorly defined. Objective: To determine normal oral temperature ranges by age, sex, height, weight, and time of day. Design, Setting, and Participants: This cross-sectional study used clinical visit information from the divisions of Internal Medicine and Family Medicine in a single large medical care system. All adult outpatient encounters that included temperature measurements from April 28, 2008, through June 4, 2017, were eligible for inclusion. The LIMIT (Laboratory Information Mining for Individualized Thresholds) filtering algorithm was applied to iteratively remove encounters with primary diagnoses overrepresented in the tails of the temperature distribution, leaving only those diagnoses unrelated to temperature. Mixed-effects modeling was applied to the remaining temperature measurements to identify independent factors associated with normal oral temperature and to generate individualized normal temperature ranges. Data were analyzed from July 5, 2017, to June 23, 2023. Exposures: Primary diagnoses and medications, age, sex, height, weight, time of day, and month, abstracted from each outpatient encounter. Main Outcomes and Measures: Normal temperature ranges by age, sex, height, weight, and time of day. Results: Of 618 306 patient encounters, 35.92% were removed by LIMIT because they included diagnoses or medications that fell disproportionately in the tails of the temperature distribution. The encounters removed due to overrepresentation in the upper tail were primarily linked to infectious diseases (76.81% of all removed encounters); type 2 diabetes was the only diagnosis removed for overrepresentation in the lower tail (15.71% of all removed encounters). The 396 195 encounters included in the analysis set consisted of 126 705 patients (57.35% women; mean [SD] age, 52.7 [15.9] years). 
Prior to running LIMIT, the mean (SD) overall oral temperature was 36.71 °C (0.43 °C); following LIMIT, the mean (SD) temperature was 36.64 °C (0.35 °C). Using mixed-effects modeling, age, sex, height, weight, and time of day accounted for 6.86% (overall) and up to 25.52% (per patient) of the observed variability in temperature. Mean normal oral temperature did not reach 37 °C for any subgroup; the upper 99th percentile ranged from 36.81 °C (a tall man with underweight aged 80 years at 8:00 am) to 37.88 °C (a short woman with obesity aged 20 years at 2:00 pm). Conclusions and Relevance: The findings of this cross-sectional study suggest that normal oral temperature varies in an expected manner based on sex, age, height, weight, and time of day, allowing individualized normal temperature ranges to be established. The clinical significance of a value outside of the usual range is an area for future study.

9.
Stat Modelling ; 23(3): 203-227, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37334164

ABSTRACT

Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an ℓ2 penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.

10.
J Stat Softw ; 106, 2023.
Article in English | MEDLINE | ID: mdl-37138589

ABSTRACT

The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
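For readers unfamiliar with the penalty this abstract refers to, the elastic net objective for least squares regression is commonly written as below. The notation ($n$ observations, intercept $\beta_0$, coefficients $\beta$, penalty strength $\lambda$, mixing parameter $\alpha$) is the standard textbook convention and is supplied here as an assumption; the abstract itself does not state the formula.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression; the regularization path is the set of solutions traced out as $\lambda$ varies.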

11.
Multivariate Behav Res ; 58(6): 1057-1071, 2023.
Article in English | MEDLINE | ID: mdl-37229653

ABSTRACT

Despite its potential benefits, using prediction targets generated based on latent variable (LV) modeling is not a common practice in supervised learning, a dominating framework for developing prediction models. In supervised learning, it is typically assumed that the outcome to be predicted is clear and readily available, and therefore validating outcomes before predicting them is a foreign concept and an unnecessary step. The usual goal of LV modeling is inference, and therefore using it in supervised learning and in the prediction context requires a major conceptual shift. This study lays out methodological adjustments and conceptual shifts necessary for integrating LV modeling into supervised learning. It is shown that such integration is possible by combining the traditions of LV modeling, psychometrics, and supervised learning. In this interdisciplinary learning framework, generating practical outcomes using LV modeling and systematically validating them based on clinical validators are the two main strategies. In the example using the data from the Longitudinal Assessment of Manic Symptoms (LAMS) Study, a large pool of candidate outcomes is generated by flexible LV modeling. It is demonstrated that this exploratory situation can be used as an opportunity to tailor desirable prediction targets taking advantage of contemporary science and clinical insights.


Subjects
Supervised Machine Learning , Latent Class Analysis
12.
Stat Sin ; 33(1): 259-279, 2023 Jan.
Article in English | MEDLINE | ID: mdl-37102071

ABSTRACT

In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.

13.
Sci Adv ; 9(3): eadd1166, 2023 Jan 20.
Article in English | MEDLINE | ID: mdl-36662860

ABSTRACT

Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX to create a publicly browsable tissue atlas of inflammation in 42 tissue regions from 29 patients with UC and 5 healthy individuals. We analyzed 52 biomarkers on 1,710,973 spatially resolved single cells to determine cell types, cell-cell contacts, and cellular neighborhoods. We observed that cellular functional states are associated with cellular neighborhoods. We further observed that a subset of inflammatory cell types and cellular neighborhoods are present in patients with UC with TNFi treatment, potentially indicating resistant niches. Last, we explored applying convolutional neural networks (CNNs) to our dataset with respect to patient clinical variables. We note concerns and offer guidelines for reporting CNN-based predictions in similar datasets.


Subjects
Colitis, Ulcerative , Humans , Colitis, Ulcerative/drug therapy , Colitis, Ulcerative/complications , Tumor Necrosis Factor Inhibitors/therapeutic use , Inflammation/complications , Biomarkers
14.
Res Sq ; 2023 Jun 22.
Article in English | MEDLINE | ID: mdl-36711501

ABSTRACT

Background and Objectives: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life. Methods: We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N = 97) then in a large, outpatient, pediatric sample (N = 14,695). Results: The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR:0.19; 90% < 0.43]; girls: 0.20 kg [IQR:0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR:0.53; 90% < 1.0]; girls: 0.91 cm [IQR:0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Conclusions: A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.

15.
Ann Stat ; 50(2): 949-986, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36120512

ABSTRACT

Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
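As a small illustration of what "minimum norm interpolation" means when there are more parameters than samples, the sketch below computes the least squares solution of smallest Euclidean norm with the Moore-Penrose pseudoinverse. The function name `min_norm_interpolator` and the toy dimensions in the usage note are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Return the minimum Euclidean-norm solution of the least squares problem.

    When X has more columns than rows and full row rank, this solution
    fits the training data exactly (zero training error).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The pseudoinverse picks, among all least squares solutions,
    # the one with the smallest Euclidean norm.
    return np.linalg.pinv(X) @ y
```

For example, with a random Gaussian design of 20 samples and 50 features, `X @ min_norm_interpolator(X, y)` reproduces `y` up to floating-point error, and any other interpolating coefficient vector has a larger norm.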

16.
Ann Appl Stat ; 16(3): 1891-1918, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36091495

ABSTRACT

In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.

17.
PLoS Genet ; 18(3): e1010105, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35324888

ABSTRACT

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant ($p < 2.5 \times 10^{-5}$) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's $\rho = 0.61$, $p = 2.2 \times 10^{-59}$ for quantitative traits, $\rho = 0.21$, $p = 9.6 \times 10^{-4}$ for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).


Subjects
Genome-Wide Association Study , Multifactorial Inheritance , Biological Specimen Banks , Genetic Predisposition to Disease , Humans , Multifactorial Inheritance/genetics , Phenotype , Risk Factors , United Kingdom
18.
Front Neurol ; 13: 960760, 2022.
Article in English | MEDLINE | ID: mdl-36601297

ABSTRACT

Muscle weakness is common in many neurological, neuromuscular, and musculoskeletal conditions. Muscle size only partially explains muscle strength as adaptations within the nervous system also contribute to strength. Brain-based biomarkers of neuromuscular function could provide diagnostic, prognostic, and predictive value in treating these disorders. Therefore, we sought to characterize and quantify the brain's contribution to strength by developing multimodal MRI pipelines to predict grip strength. However, the prediction of strength was not straightforward, and we present a case of sex being a clear confound in brain decoding analyses. While each MRI modality, namely structural MRI (i.e., gray matter morphometry), diffusion MRI (i.e., white matter fractional anisotropy), resting state functional MRI (i.e., functional connectivity), and task-evoked functional MRI (i.e., left or right hand motor task activation), as well as a multimodal prediction pipeline demonstrated significant predictive power for strength ($R^2$ = 0.108 to 0.536, p ≤ 0.001), after correcting for sex, the predictive power was substantially reduced ($R^2$ = -0.038 to 0.075). Next, we flipped the analysis and demonstrated that each MRI modality and a multimodal prediction pipeline could significantly predict sex (accuracy = 68.0% to 93.3%, AUC = 0.780 to 0.982, p < 0.001). However, correcting the brain features for strength reduced the accuracy for predicting sex (accuracy = 57.3% to 69.3%, AUC = 0.615 to 0.780). Here we demonstrate the effects of sex-correlated confounds in brain-based predictive models across multiple brain MRI modalities for both regression and classification models. We discuss implications of confounds in predictive modeling and the development of brain-based MRI biomarkers, as well as possible strategies to overcome these barriers.

19.
J Mach Learn Res ; 23, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37102181

ABSTRACT

Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
