Search | BVS - Ministry of Health

1.

A Staged Approach using Machine Learning and Uncertainty Quantification to Predict the Risk of Hip Fracture.

Shaik, Anjum; Larsen, Kristoffer; Lane, Nancy E; Zhao, Chen; Su, Kuan-Jui; Keyak, Joyce H; Tian, Qing; Sha, Qiuying; Shen, Hui; Deng, Hong-Wen; Zhou, Weihua.

ArXiv ; 2024 May 30.

Article in English | MEDLINE | ID: mdl-38855554

ABSTRACT

Hip fractures present a significant healthcare challenge, especially within aging populations, where they are often caused by falls. These fractures lead to substantial morbidity and mortality, emphasizing the need for timely surgical intervention. Despite advancements in medical care, hip fractures impose a significant burden on individuals and healthcare systems. This paper focuses on the prediction of hip fracture risk in older and middle-aged adults, where falls and compromised bone quality are predominant factors. We propose a novel staged model that combines advanced imaging and clinical data to improve predictive performance. By using convolutional neural networks (CNNs) to extract features from hip DXA images, along with clinical variables, shape measurements, and texture features, our method provides a comprehensive framework for assessing fracture risk. The study cohort included 547 patients, with 94 experiencing hip fracture. A staged machine learning-based model was developed using two ensemble models: Ensemble 1 (clinical variables only) and Ensemble 2 (clinical variables and DXA imaging features). This staged approach used uncertainty quantification from Ensemble 1 to decide if DXA features are necessary for further prediction. Ensemble 2 exhibited the highest performance, achieving an Area Under the Curve (AUC) of 0.9541, an accuracy of 0.9195, a sensitivity of 0.8078, and a specificity of 0.9427. The staged model also performed well, with an AUC of 0.8486, an accuracy of 0.8611, a sensitivity of 0.5578, and a specificity of 0.9249, outperforming Ensemble 1, which had an AUC of 0.5549, an accuracy of 0.7239, a sensitivity of 0.1956, and a specificity of 0.8343. Furthermore, the staged model suggested that 54.49% of patients did not require DXA scanning. It effectively balanced accuracy and specificity, offering a robust solution when DXA data acquisition is not always feasible. Statistical tests confirmed significant differences between the models, highlighting the advantages of the advanced modeling strategies. Our staged approach offers a cost-effective holistic view of patients' health. It could identify at-risk individuals with high accuracy while reducing unnecessary DXA scanning. Our approach shows great promise for guiding interventions to prevent hip fractures with reduced cost and radiation.
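
The staged decision described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how uncertainty is quantified, so the sketch assumes the spread (population standard deviation) of the Ensemble 1 member probabilities, and the names `ensemble_predict`, `staged_predict`, `acquire_imaging`, and the threshold `max_std` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# A "model" here is any callable mapping a feature vector to a probability in [0, 1].
Model = Callable[[Sequence[float]], float]


def ensemble_predict(models: List[Model], features: Sequence[float]) -> Tuple[float, float]:
    """Return the mean probability and the spread across ensemble members."""
    probs = [m(features) for m in models]
    return mean(probs), pstdev(probs)


def staged_predict(
    stage1: List[Model],
    stage2: List[Model],
    clinical: Sequence[float],
    acquire_imaging: Callable[[], Sequence[float]],
    max_std: float = 0.1,  # assumed uncertainty threshold, not taken from the paper
) -> Tuple[float, bool]:
    """Use clinical data alone when stage 1 is confident; otherwise add imaging.

    Returns (probability, imaging_was_used).
    """
    p1, spread = ensemble_predict(stage1, clinical)
    if spread <= max_std:
        # Stage 1 members agree: skip the imaging acquisition entirely.
        return p1, False
    # Stage 1 is uncertain: acquire imaging features and defer to stage 2.
    imaging = acquire_imaging()
    p2, _ = ensemble_predict(stage2, list(clinical) + list(imaging))
    return p2, True
```

Passing imaging as a callable means the costly acquisition only happens on the uncertain branch, which mirrors the abstract's point that a share of patients may not need DXA scanning.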

2.

A novel method for multiple phenotype association studies based on genotype and phenotype network.

Cao, Xuewei; Zhang, Shuanglin; Sha, Qiuying.

PLoS Genet ; 20(5): e1011245, 2024 May.

Article in English | MEDLINE | ID: mdl-38728360

ABSTRACT

Joint analysis of multiple correlated phenotypes for genome-wide association studies (GWAS) can identify and interpret pleiotropic loci, which are essential to understanding pleiotropy in diseases and complex traits. Meanwhile, constructing a network based on associations between phenotypes and genotypes offers a new way to analyze multiple phenotypes and to explore whether phenotypes and genotypes might be related to each other at a higher level of cellular and organismal organization. In this paper, we first develop a bipartite signed network by linking phenotypes and genotypes into a Genotype and Phenotype Network (GPN). The GPN can be constructed by a mixture of quantitative and qualitative phenotypes and is applicable to binary phenotypes with extremely unbalanced case-control ratios in large-scale biobank datasets. We then apply a powerful community detection method to partition phenotypes into disjoint network modules based on GPN. Finally, we jointly test the association between multiple phenotypes in a network module and a single nucleotide polymorphism (SNP). Simulations and analyses of 72 complex traits in the UK Biobank show that multiple phenotype association tests based on network modules detected by GPN are much more powerful than those without considering network modules. The newly proposed GPN offers a new way to investigate the genetic architecture among different types of phenotypes. Multiple-phenotype association studies based on GPN are improved by incorporating the genetic information into the phenotype clustering. Notably, it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy.

Subjects

Genome-Wide Association Study, Genotype, Phenotype, Single Nucleotide Polymorphism, Humans, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism/genetics, Genetic Models, Genetic Pleiotropy, Genetic Association Studies/methods, Quantitative Trait Loci/genetics

3.

A new method of modeling the multi-stage decision-making process of CRT using machine learning with uncertainty quantification.

Larsen, Kristoffer; Zhao, Chen; Keyak, Joyce; Sha, Qiuying; Paez, Diana; Zhang, Xinwei; Hung, Guang-Uei; Zou, Jiangang; Peix, Amalia; Zhou, Weihua.

ArXiv ; 2024 Apr 28.

Article in English | MEDLINE | ID: mdl-38463497

ABSTRACT

Aims: Current machine learning-based (ML) models usually attempt to utilize all available patient data to predict patient outcomes while ignoring the associated cost and time for data acquisition. The purpose of this study is to create a multi-stage machine learning model to predict cardiac resynchronization therapy (CRT) response for heart failure (HF) patients. This model exploits uncertainty quantification to recommend additional collection of single-photon emission computed tomography myocardial perfusion imaging (SPECT MPI) variables if baseline clinical variables and features from electrocardiogram (ECG) are not sufficient. Methods: 218 patients who underwent rest-gated SPECT MPI were enrolled in this study. CRT response was defined as an increase in left ventricular ejection fraction (LVEF) > 5% at a 6±1 month follow-up. A multi-stage ML model was created by combining two ensemble models: Ensemble 1 was trained with clinical variables and ECG; Ensemble 2 included Ensemble 1 plus SPECT MPI features. Uncertainty quantification from Ensemble 1 allowed for multi-stage decision-making to determine if the acquisition of SPECT data for a patient is necessary. The performance of the multi-stage model was compared with that of Ensemble models 1 and 2. Results: The response rate for CRT was 55.5% (n = 121) with overall male gender 61.0% (n = 133), an average age of 62.0±11.8, and LVEF of 27.7±11.0. The multi-stage model performed similarly to Ensemble 2 (which utilized the additional SPECT data) with AUC of 0.75 vs. 0.77, accuracy of 0.71 vs. 0.69, sensitivity of 0.70 vs. 0.72, and specificity 0.72 vs. 0.65, respectively. However, the multi-stage model only required SPECT MPI data for 52.7% of the patients across all folds. Conclusions: By using rule-based logic stemming from uncertainty quantification, the multi-stage model was able to reduce the need for additional SPECT MPI data acquisition without sacrificing performance.

4.

A new hip fracture risk index derived from FEA-computed proximal femur fracture loads and energies-to-failure.

Cao, Xuewei; Keyak, Joyce H; Sigurdsson, Sigurdur; Zhao, Chen; Zhou, Weihua; Liu, Anqi; Lang, Thomas F; Deng, Hong-Wen; Gudnason, Vilmundur; Sha, Qiuying.

Osteoporos Int ; 35(5): 785-794, 2024 May.

Article in English | MEDLINE | ID: mdl-38246971

ABSTRACT

Hip fracture risk assessment is an important but challenging task. Quantitative CT-based patient-specific finite element (FE) analysis (FEA) incorporates bone geometry and bone density in the proximal femur. We developed a global FEA-computed fracture risk index to increase the prediction accuracy of hip fracture incidence. PURPOSE: Quantitative CT-based patient-specific finite element (FE) analysis (FEA) incorporates bone geometry and bone density in the proximal femur to compute the force (fracture load) and energy necessary to break the proximal femur in a particular loading condition. The fracture loads and energies-to-failure are individually associated with incident hip fracture, and provide different structural information about the proximal femur. METHODS: We used principal component analysis (PCA) to develop a global FEA-computed fracture risk index that incorporates the FEA-computed yield and ultimate failure loads and energies-to-failure in four loading conditions of 110 hip fracture subjects and 235 age- and sex-matched control subjects from the AGES-Reykjavik study. Using a logistic regression model, we compared the prediction performance for hip fracture based on stratified resampling. RESULTS: We refer to the first principal component (PC1) of the FE parameters as the global FEA-computed fracture risk index, which was a significant predictor of hip fracture (p-value < 0.001). The area under the receiver operating characteristic curve (AUC) using PC1 (0.776) was higher than that using all FE parameters combined (0.737) in the males (p-value < 0.001). CONCLUSIONS: The global FEA-computed fracture risk index increased hip fracture risk prediction accuracy in males.
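
The idea of collapsing several correlated FE parameters into one index via the first principal component can be sketched as follows. This is an illustrative sketch only: the abstract does not state whether the parameters were standardized, so z-scoring is an assumption here, power iteration stands in for a full eigendecomposition, and the function names are hypothetical. The sign of a principal component is arbitrary, so the direction of the resulting index would need to be fixed by convention.

```python
from math import sqrt
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Z-score each column (assumed preprocessing; not confirmed by the abstract)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(d)]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]


def first_pc_loadings(z: List[List[float]], iters: int = 500) -> List[float]:
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(d)] for a in range(d)]
    v = [1.0 / sqrt(d)] * d  # uniform start; suits positively correlated inputs
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pc1_index(rows: List[List[float]]) -> List[float]:
    """Project each subject's standardized parameters onto PC1."""
    z = standardize(rows)
    v = first_pc_loadings(z)
    return [sum(zi * vi for zi, vi in zip(r, v)) for r in z]
```

The resulting per-subject score could then serve as a single predictor in a logistic regression, which is the role PC1 plays in the abstract.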

Subjects

Hip Fractures, Proximal Femoral Fractures, Male, Humans, Hip Fractures/epidemiology, Hip Fractures/etiology, Bone Density, Femur/diagnostic imaging, ROC Curve, Finite Element Analysis

5.

CLCLSA: Cross-omics linked embedding with contrastive learning and self attention for integration with incomplete multi-omics data.

Zhao, Chen; Liu, Anqi; Zhang, Xiao; Cao, Xuewei; Ding, Zhengming; Sha, Qiuying; Shen, Hui; Deng, Hong-Wen; Zhou, Weihua.

Comput Biol Med ; 170: 108058, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38295477

ABSTRACT

Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding etiology of complex genetic diseases. Each omics technique only provides a limited view of the underlying biological process and integrating heterogeneous omics layers simultaneously would lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost. Studies may fail if certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Utilizing complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn the feature representation across different types of biological data. The multi-omics contrastive learning is employed, which maximizes the mutual information between different types of omics. In addition, the feature-level self-attention and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Finally, a Softmax classifier is employed to perform multi-omics data classification. Extensive experiments were conducted on four public multi-omics datasets. The experimental results indicate that our proposed CLCLSA produces promising results in multi-omics data classification using both complete and incomplete multi-omics data.

Subjects

Head, Multiomics, Humans, Phenotype

6.

Integrating External Controls by Regression Calibration for Genome-Wide Association Study.

Zhu, Lirong; Yan, Shijia; Cao, Xuewei; Zhang, Shuanglin; Sha, Qiuying.

Genes (Basel) ; 15(1), 2024 Jan 03.

Article in English | MEDLINE | ID: mdl-38254957

ABSTRACT

Genome-wide association studies (GWAS) have successfully revealed many disease-associated genetic variants. For a case-control study, the adequate power of an association test can be achieved with a large sample size, although genotyping large samples is expensive. A cost-effective strategy to boost power is to integrate external control samples with publicly available genotyped data. However, the naive integration of external controls may inflate the type I error rates if ignoring the systematic differences (batch effect) between studies, such as the differences in sequencing platforms, genotype-calling procedures, population stratification, and so forth. To account for the batch effect, we propose an approach by integrating External Controls into the Association Test by Regression Calibration (iECAT-RC) in case-control association studies. Extensive simulation studies show that iECAT-RC not only can control type I error rates but also can boost statistical power in all models. We also apply iECAT-RC to the UK Biobank data for M72 Fibroblastic disorders by considering genotype calling as the batch effect. Four SNPs associated with fibroblastic disorders have been detected by iECAT-RC and the other two comparison methods, iECAT-Score and Internal. However, our method has a higher probability of identifying these significant SNPs in the scenario of an unbalanced case-control association study.

Subjects

Genome-Wide Association Study, Calibration, Case-Control Studies, Computer Simulation, Genotype

7.

Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics.

Zhao, Chen; Su, Kuan-Jui; Wu, Chong; Cao, Xuewei; Sha, Qiuying; Li, Wu; Luo, Zhe; Qin, Tian; Qiu, Chuan; Zhao, Lan Juan; Liu, Anqi; Jiang, Lindong; Zhang, Xiao; Shen, Hui; Zhou, Weihua; Deng, Hong-Wen.

ArXiv ; 2024 Mar 12.

Article in English | MEDLINE | ID: mdl-37873011

ABSTRACT

Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolite-derived burden scores, PGS, and LD-pruned SNPs, the proposed methods achieved R² scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.

8.

Multi-view information fusion using multi-view variational autoencoder to predict proximal femoral fracture load.

Zhao, Chen; Keyak, Joyce H; Cao, Xuewei; Sha, Qiuying; Wu, Li; Luo, Zhe; Zhao, Lan-Juan; Tian, Qing; Serou, Michael; Qiu, Chuan; Su, Kuan-Jui; Shen, Hui; Deng, Hong-Wen; Zhou, Weihua.

Front Endocrinol (Lausanne) ; 14: 1261088, 2023.

Article in English | MEDLINE | ID: mdl-38075049

ABSTRACT

Background: Hip fracture occurs when an applied force exceeds the force that the proximal femur can support (the fracture load or "strength") and can have devastating consequences with poor functional outcomes. Proximal femoral strengths for specific loading conditions can be computed by subject-specific finite element analysis (FEA) using quantitative computerized tomography (QCT) images. However, the radiation and availability of QCT limit its clinical usability. Alternative low-dose and widely available measurements, such as dual energy X-ray absorptiometry (DXA) and genetic factors, would be preferable for bone strength assessment. The aim of this paper is to design a deep learning-based model to predict proximal femoral strength using multi-view information fusion. Results: We developed new models using multi-view variational autoencoder (MVAE) for feature representation learning and a product of experts (PoE) model for multi-view information fusion. We applied the proposed models to an in-house Louisiana Osteoporosis Study (LOS) cohort with 931 male subjects, including 345 African Americans and 586 Caucasians. We performed genome-wide association studies (GWAS) to select 256 genetic variants with the lowest p-values for each proximal femoral strength and integrated whole genome sequence (WGS) features and DXA-derived imaging features to predict proximal femoral strength. The best prediction model for fall fracture load was acquired by integrating WGS features and DXA-derived imaging features. The designed models achieved mean absolute percentage errors of 18.04%, 6.84% and 7.95% for predicting proximal femoral fracture loads using linear models of fall loading, nonlinear models of fall loading, and nonlinear models of stance loading, respectively. Conclusion: The proposed models are capable of predicting proximal femoral strength using WGS features and DXA-derived imaging features. Though this tool is not a substitute for predicting FEA using QCT images, it would make improved assessment of hip fracture risk more widely available while avoiding the increased radiation exposure from QCT.
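
As a rough illustration of product-of-experts fusion, the sketch below combines per-view Gaussian estimates of one latent dimension by precision weighting. The abstract does not give the fusion formula, so Gaussian experts and the optional standard-normal prior expert are assumptions, and `product_of_gaussian_experts` is a hypothetical name rather than anything from the paper.

```python
from typing import Sequence, Tuple


def product_of_gaussian_experts(
    means: Sequence[float],
    variances: Sequence[float],
    include_prior: bool = True,
) -> Tuple[float, float]:
    """Fuse per-view Gaussian posteriors for one latent dimension.

    The product of Gaussian densities is again Gaussian: its precision is the
    sum of the experts' precisions and its mean is the precision-weighted
    average of their means. A standard-normal prior expert N(0, 1) is
    optionally included (an assumption of this sketch).
    """
    mus = list(means)
    vars_ = list(variances)
    if include_prior:
        mus = [0.0] + mus
        vars_ = [1.0] + vars_
    precisions = [1.0 / v for v in vars_]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(mus, precisions)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance
```

For example, two equally confident views with means 2.0 and 4.0 and unit variance fuse to mean 3.0 with variance 0.5 when the prior expert is omitted. A view with high variance contributes little, which is why this kind of fusion tolerates a weak or missing view.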

Subjects

Hip Fractures, Osteoporosis, Proximal Femoral Fractures, Humans, Male, Genome-Wide Association Study, Photon Absorptiometry/methods, Hip Fractures/diagnostic imaging, Osteoporosis/diagnostic imaging

9.

Joint analysis of multiple phenotypes for extremely unbalanced case-control association studies using multi-layer network.

Xie, Hongjing; Cao, Xuewei; Zhang, Shuanglin; Sha, Qiuying.

Bioinformatics ; 39(12), 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-37991852

ABSTRACT

MOTIVATION: Genome-wide association studies (GWAS) are an essential tool for analyzing associations between phenotypes and single nucleotide polymorphisms (SNPs). Most binary phenotypes in large biobanks are extremely unbalanced, which leads to inflated type I error rates for many widely used association tests for joint analysis of multiple phenotypes. In this article, we first propose a novel method to construct a Multi-Layer Network (MLN) using individuals with at least one case status among all phenotypes. Then, we introduce a computationally efficient community detection method to group phenotypes into disjoint clusters based on the MLN. Finally, we propose a novel approach, MLN with Omnibus (MLN-O), to jointly analyse the association between phenotypes and a SNP. MLN-O uses the score test to test the association between each merged phenotype in a cluster and a SNP, then uses the Omnibus test to obtain an overall test statistic to test the association between all phenotypes and a SNP. RESULTS: We conduct extensive simulation studies, which show that the proposed approach can control type I error rates and is more powerful than some existing methods. Meanwhile, we apply the proposed method to a real data set in the UK Biobank. Using phenotypes in Chapter XIII (Diseases of the musculoskeletal system and connective tissue) in the UK Biobank, we find that MLN-O identifies more significant SNPs than the other methods we compare it with. AVAILABILITY AND IMPLEMENTATION: https://github.com/Hongjing-Xie/Multi-Layer-Network-with-Omnibus-MLN-O.

Subjects

Genome-Wide Association Study, Single Nucleotide Polymorphism, Humans, Genome-Wide Association Study/methods, Phenotype, Case-Control Studies, Computer Simulation

10.

TGPred: efficient methods for predicting target genes of a transcription factor by integrating statistics, machine learning and optimization.

Cao, Xuewei; Zhang, Ling; Islam, Md Khairul; Zhao, Mingxia; He, Cheng; Zhang, Kui; Liu, Sanzhen; Sha, Qiuying; Wei, Hairong.

NAR Genom Bioinform ; 5(3): lqad083, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37711605

ABSTRACT

Four statistical selection methods for inferring transcription factor (TF)-target gene (TG) pairs were developed by coupling mean squared error (MSE) or Huber loss function with elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we improved an accelerated proximal gradient descent (APGD) algorithm to optimize parameter selection processes, resulting in an equally effective but much faster algorithm than the commonly used convex optimization solver. Synthetic data generated in a general setting was used to test the four TF-TG identification methods; ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings was used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF-TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data; Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF-TG identification methods address the lack of a method for genome-wide TG prediction of a TF and have potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.
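
To make the accelerated proximal gradient idea concrete, the sketch below applies the textbook accelerated scheme (FISTA-style momentum) to the simplest combination named in the abstract, MSE loss with a Lasso penalty. It is not the authors' improved APGD algorithm, and the function names, step-size bound, and iteration count are choices made for this sketch.

```python
from math import sqrt
from typing import List


def soft_threshold(x: float, t: float) -> float:
    """Proximal operator of t * |x| (the Lasso penalty)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_apgd(X: List[List[float]], y: List[float], lam: float, iters: int = 500) -> List[float]:
    """Minimize (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 with accelerated proximal gradient."""
    n, d = len(X), len(X[0])
    # The squared Frobenius norm over n upper-bounds the gradient's Lipschitz
    # constant, so its inverse is a safe (if conservative) step size.
    step = 1.0 / (sum(v * v for row in X for v in row) / n)
    beta = [0.0] * d
    z = beta[:]  # extrapolated point used for the gradient step
    t = 1.0
    for _ in range(iters):
        resid = [sum(xi * zi for xi, zi in zip(row, z)) - yi for row, yi in zip(X, y)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(d)]
        new_beta = [soft_threshold(z[j] - step * grad[j], step * lam) for j in range(d)]
        new_t = (1.0 + sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / new_t
        z = [new_beta[j] + momentum * (new_beta[j] - beta[j]) for j in range(d)]
        beta, t = new_beta, new_t
    return beta
```

Swapping the squared-error gradient for a Huber gradient, or the soft-threshold step for an elastic net proximal step, would give the other loss and penalty pairings the abstract lists; those variants are not shown here.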

11.

CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data.

Zhao, Chen; Liu, Anqi; Zhang, Xiao; Cao, Xuewei; Ding, Zhengming; Sha, Qiuying; Shen, Hui; Deng, Hong-Wen; Zhou, Weihua.

Res Sq ; 2023 May 02.

Article in English | MEDLINE | ID: mdl-37205427

ABSTRACT

Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data. Each omics technique only provides a limited view of the underlying biological process and integrating heterogeneous omics layers simultaneously would lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost. Studies may fail if certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Utilizing complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn the feature representation across different types of biological data. The multi-omics contrastive learning, which is used to maximize the mutual information between different types of omics, is employed before latent feature concatenation. In addition, the feature-level self-attention and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The experimental results indicated that the proposed CLCLSA outperformed the state-of-the-art approaches for multi-omics data classification using incomplete multiomics data.

12.

A machine learning method integrating ECG and gated SPECT for cardiac resynchronization therapy decision support.

de A Fernandes, Fernando; Larsen, Kristoffer; He, Zhuo; Nascimento, Erivelton; Peix, Amalia; Sha, Qiuying; Paez, Diana; Garcia, Ernest V; Zhou, Weihua; Mesquita, Claudio T.

Eur J Nucl Med Mol Imaging ; 50(10): 3022-3033, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37195444

ABSTRACT

PURPOSE: Cardiac resynchronization therapy (CRT) has been established as an important therapy for heart failure. Mechanical dyssynchrony has the potential to predict responders to CRT. The aim of this study was to report the development and the validation of machine learning models which integrate ECG, gated SPECT MPI (GMPS), and clinical variables to predict patients' response to CRT. METHODS: This analysis included 153 patients who met criteria for CRT from a prospective cohort study. The variables were used to model predictive methods for CRT. Patients were classified as "responders" for an increase of LVEF ≥ 5% at follow-up. In a second analysis, patients were classified as "super-responders" for an increase of LVEF ≥ 15%. For ML, variable selection was applied, and the Prediction Analysis of Microarrays (PAM) approach was used to model response while Naïve Bayes (NB) was used to model super-response. These ML models were compared to models obtained with guideline variables. RESULTS: PAM had an AUC of 0.80 versus 0.72 for partial least squares-discriminant analysis with guideline variables (p = 0.52). The sensitivity (0.86) and specificity (0.75) were better than for guideline variables alone (sensitivity 0.75, specificity 0.24). A neural network with guideline variables was better than NB (AUC = 0.93 vs. 0.87), although without statistical significance (p = 0.48). Its sensitivity and specificity (1.0 and 0.75, respectively) were better than guideline alone (0.78 and 0.25, respectively). CONCLUSIONS: Compared to guideline criteria, ML methods trended toward improved CRT response and super-response prediction. GMPS was central in the acquisition of most parameters. Further studies are needed to validate the models.

Subjects

Cardiac Resynchronization Therapy, Heart Failure, Humans, Cardiac Resynchronization Therapy/methods, Prospective Studies, Bayes Theorem, Single-Photon Emission Computed Tomography/methods, Heart Failure/diagnostic imaging, Heart Failure/therapy, Electrocardiography, Machine Learning, Treatment Outcome

13.

CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data.

Zhao, Chen; Liu, Anqi; Zhang, Xiao; Cao, Xuewei; Ding, Zhengming; Sha, Qiuying; Shen, Hui; Deng, Hong-Wen; Zhou, Weihua.

ArXiv ; 2023 Apr 12.

Article in English | MEDLINE | ID: mdl-37090237

ABSTRACT

Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data. Each omics technique only provides a limited view of the underlying biological process and integrating heterogeneous omics layers simultaneously would lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost. Studies may fail if certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Utilizing complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn the feature representation across different types of biological data. The multi-omics contrastive learning, which is used to maximize the mutual information between different types of omics, is employed before latent feature concatenation. In addition, the feature-level self-attention and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The experimental results indicated that the proposed CLCLSA outperformed the state-of-the-art approaches for multi-omics data classification using incomplete multi-omics data.

14.

A clustering linear combination method for multiple phenotype association studies based on GWAS summary statistics.

Wang, Meida; Cao, Xuewei; Zhang, Shuanglin; Sha, Qiuying.

Sci Rep ; 13(1): 3389, 2023 Feb 28.

Article in English | MEDLINE | ID: mdl-36854754

ABSTRACT

There is strong evidence showing that joint analysis of multiple phenotypes in genome-wide association studies (GWAS) can increase statistical power when detecting the association between genetic variants and human complex diseases. We previously developed the Clustering Linear Combination (CLC) method and a computationally efficient CLC (ceCLC) method to test the association between multiple phenotypes and a genetic variant, which perform very well. However, both of these methods require individual-level genotypes and phenotypes that are often not easily accessible. In this research, we develop a novel method called sCLC for association studies of multiple phenotypes and a genetic variant based on GWAS summary statistics. We use LD score regression to estimate the correlation matrix among phenotypes. The test statistic of sCLC is constructed from GWAS summary statistics and has an approximate Cauchy distribution. We perform a variety of simulation studies and compare sCLC with other commonly used methods for multiple phenotype association studies using GWAS summary statistics. Simulation results show that sCLC can control Type I error rates well and has the highest power in most scenarios. Moreover, we apply the newly developed method to the UK Biobank GWAS summary statistics from the XIII category with 70 related musculoskeletal system and connective tissue phenotypes. The results demonstrate that sCLC detects the largest number of significant SNPs, and most of these identified SNPs can be matched to genes that have been reported in the GWAS catalog to be associated with those phenotypes. Furthermore, sCLC also identifies some novel signals that were missed by standard GWAS, which provide new insight into the potential genetic factors of the musculoskeletal system and connective tissue phenotypes.

Subjects

Lung Neoplasms, Small Cell Lung Carcinoma, Humans, Genome-Wide Association Study, Phenotype, Genotype, Cluster Analysis, Lung Neoplasms/genetics

15.

Joint analysis of multiple phenotypes for extremely unbalanced case-control association studies.

Xie, Hongjing; Cao, Xuewei; Zhang, Shuanglin; Sha, Qiuying.

Genet Epidemiol ; 47(2): 185-197, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36691904

ABSTRACT

In genome-wide association studies (GWAS) for thousands of phenotypes in biobanks, most binary phenotypes have substantially fewer cases than controls. Many widely used approaches for joint analysis of multiple phenotypes produce inflated type I error rates for such extremely unbalanced case-control phenotypes. In this research, we develop a method to jointly analyze multiple unbalanced case-control phenotypes to circumvent this issue. We first group multiple phenotypes into different clusters based on a hierarchical clustering method, then we merge phenotypes in each cluster into a single phenotype. In each cluster, we use the saddlepoint approximation to estimate the p value of an association test between the merged phenotype and a single nucleotide polymorphism (SNP) which eliminates the issue of inflated type I error rate of the test for extremely unbalanced case-control phenotypes. Finally, we use the Cauchy combination method to obtain an integrated p value for all clusters to test the association between multiple phenotypes and a SNP. We use extensive simulation studies to evaluate the performance of the proposed approach. The results show that the proposed approach can control type I error rate very well and is more powerful than other available methods. We also apply the proposed approach to phenotypes in category IX (diseases of the circulatory system) in the UK Biobank. We find that the proposed approach can identify more significant SNPs than the other viable methods we compared with.
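
The final combining step mentioned in this abstract, the Cauchy combination method, can be sketched in a few lines. The sketch uses the standard form of the test (each p value mapped to a Cauchy variate, a weighted sum taken, and the sum mapped back to a p value); equal weights are an assumption, since the abstract does not state the weights used, and `cauchy_combination` is a hypothetical name.

```python
from math import atan, pi, tan
from typing import Optional, Sequence


def cauchy_combination(p_values: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-cluster p values into one p value.

    Each p value is transformed to tan((0.5 - p) * pi), which is standard
    Cauchy under the null; the weighted sum (weights summing to 1) is then
    converted back to a p value with the Cauchy survival function.
    """
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights: an assumption of this sketch
    statistic = sum(w * tan((0.5 - p) * pi) for w, p in zip(weights, p_values))
    return 0.5 - atan(statistic) / pi
```

A practical appeal of this combination is that the null distribution of the statistic stays approximately Cauchy without modeling the correlation between the per-cluster tests, so no permutation step is needed.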

Subjects

Genome-Wide Association Study , Genetic Models , Humans , Genome-Wide Association Study/methods , Phenotype , Case-Control Studies , Single Nucleotide Polymorphism

16.

Gene selection by incorporating genetic networks into case-control association studies.

Cao, Xuewei; Liang, Xiaoyu; Zhang, Shuanglin; Sha, Qiuying.

Eur J Hum Genet ; 2022 Dec 19.

Article in English | MEDLINE | ID: mdl-36529820

ABSTRACT

Large-scale genome-wide association studies (GWAS) have been successfully applied to a wide range of genetic variants underlying complex diseases. The network-based regression approach has been developed to incorporate a biological genetic network and to overcome the computational challenges of analyzing high-dimensional genomic data. In this paper, we propose a gene selection approach by incorporating genetic networks into case-control association studies for DNA sequence data or DNA methylation data. Instead of using traditional dimension reduction techniques such as principal component analyses and supervised principal component analyses, we use a linear combination of genotypes at SNPs or methylation values at CpG sites in a gene to capture gene-level signals. We employ three linear combination approaches: optimally weighted sum (OWS), beta-based weighted sum (BWS), and LD-adjusted polygenic risk score (LD-PRS). OWS and LD-PRS are supervised approaches that depend on the effect of each SNP or CpG site on the case-control status, while BWS can be extracted without using the case-control status. After using one of the linear combinations of genotypes or methylation values in each gene to capture gene-level signals, we regularize them to perform gene selection based on the biological network. Simulation studies show that the proposed approaches have higher true positive rates than traditional dimension reduction techniques. We also apply our approaches to DNA methylation data and UK Biobank DNA sequence data for analyzing rheumatoid arthritis. The results show that the proposed methods can select genes potentially related to rheumatoid arthritis that are missed by existing methods.

17.

HCLC-FC: A novel statistical method for phenome-wide association studies.

Liang, Xiaoyu; Cao, Xuewei; Sha, Qiuying; Zhang, Shuanglin.

PLoS One ; 17(11): e0276646, 2022.

Article in English | MEDLINE | ID: mdl-36350801

ABSTRACT

The emergence of genetic data coupled to longitudinal electronic medical records (EMRs) offers the possibility of phenome-wide association studies (PheWAS). In PheWAS, the whole phenome can be divided into numerous phenotypic categories according to the genetic architecture across phenotypes. Currently, statistical analyses for PheWAS are mainly univariate analyses, which test the association between one genetic variant and one phenotype at a time. In this article, we derive a novel and powerful multivariate method for PheWAS. The proposed method involves three steps. In the first step, we apply the bottom-up hierarchical clustering method to partition a large number of phenotypes into disjoint clusters within each phenotypic category. In the second step, the clustering linear combination method is used to combine test statistics within each category based on the phenotypic clusters and obtain p-values from each phenotypic category. In the third step, we propose a new false discovery rate (FDR) control approach. We perform extensive simulation studies to compare the performance of our method with that of other existing methods. The results show that our proposed method controls FDR very well and outperforms the other methods with which we compared it. We also apply the proposed approach to a set of EMR-based phenotypes across more than 300,000 samples from the UK Biobank. We find that the proposed approach can not only control FDR well at a nominal level but also successfully identify 1,244 significant SNPs that are reported to be associated with some phenotypes in the GWAS catalog. Our open-access tools and instructions on how to implement HCLC-FC are available at https://github.com/XiaoyuLiang/HCLCFC.

Subjects

Genome-Wide Association Study , Single Nucleotide Polymorphism , Genome-Wide Association Study/methods , Phenotype , Phenomics , Cluster Analysis
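The abstract above does not describe how the new FDR control approach in HCLC-FC works, so the following is only a baseline illustration of what FDR control means in practice. It is a minimal sketch of the standard Benjamini-Hochberg step-up procedure, which is a swapped-in, well-known technique and not the method proposed in the paper. The function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Standard Benjamini-Hochberg step-up procedure (NOT the HCLC-FC
    approach): find the largest rank k such that the k-th smallest
    p-value is <= k * alpha / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_passing_rank = rank

    rejected = [False] * m
    for idx in order[:largest_passing_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-category p-values, for illustration only.
example = [0.001, 0.03, 0.035, 0.5]
decisions = benjamini_hochberg(example, alpha=0.05)
```

Because the procedure is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value passes its threshold, as happens for the second value in the hypothetical example.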

18.

Gene-Based Association Tests Using New Polygenic Risk Scores and Incorporating Gene Expression Data.

Yan, Shijia; Sha, Qiuying; Zhang, Shuanglin.

Genes (Basel) ; 13(7), 2022 06 22.

Article in English | MEDLINE | ID: mdl-35885903

ABSTRACT

Recently, gene-based association studies have shown that integrating genome-wide association studies (GWAS) with expression quantitative trait locus (eQTL) data can boost statistical power and that the genetic liability of traits can be captured by polygenic risk scores (PRSs). In this paper, we propose a new gene-based statistical method that leverages gene-expression measurements and new PRSs to identify genes that are associated with phenotypes of interest. We used a generalized linear model to associate phenotypes with gene expression and PRSs and used a score-test statistic to test the association between phenotypes and genes. Our simulation studies show that the newly developed method has correct type I error rates and can boost statistical power compared with other methods that use either gene expression or PRS in association tests. A real data analysis figure based on UK Biobank data for asthma shows that the proposed method is applicable to GWAS.

Subjects

Genome-Wide Association Study , Quantitative Trait Loci , Gene Expression , Phenotype , Quantitative Trait Loci/genetics , Risk Factors

19.

Control for population stratification in genetic association studies based on GWAS summary statistics.

Yan, Shijia; Sha, Qiuying; Zhang, Shuanglin.

Genet Epidemiol ; 46(8): 604-614, 2022 12.

Article in English | MEDLINE | ID: mdl-35766057

ABSTRACT

In recent years, genome-wide association studies (GWAS) have generated a wealth of new information. Summary data from many GWAS are now publicly available, promoting the development of many statistical methods for association studies based on GWAS summary statistics, which avoid the increasing challenges associated with individual-level genotype and phenotype data sharing. However, for population-based association studies such as GWAS, it has long been recognized that population stratification can seriously confound association results. For large GWAS, population stratification and cryptic relatedness are very likely to exist and will result in inflated Type I error in association testing. Although many methods have been developed to control for population stratification, only two of these approaches can be used to control population stratification without individual-level data: one is based on genomic control (GC) and the other is based on linkage disequilibrium score regression (LDSC). However, the performance of these two approaches is currently unknown. In this study, we use extensive simulation studies including populations with subpopulations, spatially structured populations, and populations with cryptic relatedness to compare the performance of these two approaches to control for population stratification using only GWAS summary statistics without individual-level data. Data sets from the Genetic Analysis Workshop 19 and UK Biobank are also used to evaluate these two approaches. We demonstrate that the intercept of LDSC can be used as a more accurate correction factor than GC. The results from this study will provide very useful information for researchers using GWAS summary statistics while trying to control for population stratification.

Subjects

Genome-Wide Association Study , Single Nucleotide Polymorphism , Humans , Genome-Wide Association Study/methods , Genetic Models , Genetic Association Studies , Linkage Disequilibrium , Phenotype
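To make the idea of a "correction factor" in the abstract above concrete, here is a minimal sketch of how a summary-statistic correction is commonly applied: each 1-df chi-square association statistic is divided by the factor before its p value is computed. The genomic control factor is taken here as the median observed statistic divided by the median of a chi-square distribution with 1 degree of freedom (about 0.4549). An LDSC intercept would be plugged in the same way, but estimating it is not shown; the function names and the example values are assumptions for illustration, not the authors' code.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Genomic control inflation factor: observed median / expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def corrected_p_value(chi2_stat, correction_factor):
    """P value of a 1-df chi-square statistic after dividing by a factor.

    The factor may be a GC lambda or a (separately estimated) LDSC
    intercept. For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    No flooring of the factor at 1 is applied in this sketch.
    """
    if correction_factor <= 0:
        raise ValueError("correction factor must be positive")
    adjusted = chi2_stat / correction_factor
    return math.erfc(math.sqrt(adjusted / 2.0))


# Hypothetical chi-square statistics and a hypothetical LDSC intercept.
observed = [0.2, 0.5, 0.9, 1.4, 6.1]
lam = genomic_control_lambda(observed)
p_gc = corrected_p_value(6.1, lam)
p_ldsc = corrected_p_value(6.1, 1.05)
```

Both corrections rescale the test statistic by a single number; the study's comparison concerns which number reflects confounding more accurately.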

20.

A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS.

Wang, Meida; Zhang, Shuanglin; Sha, Qiuying.

PLoS One ; 17(4): e0260911, 2022.

Article in English | MEDLINE | ID: mdl-35482827

ABSTRACT

There has been increasing interest in joint analysis of multiple phenotypes in genome-wide association studies (GWAS) because jointly analyzing multiple phenotypes may increase statistical power to detect genetic variants associated with complex diseases or traits. Recently, many statistical methods have been developed for joint analysis of multiple phenotypes in genetic association studies, including the Clustering Linear Combination (CLC) method. The CLC method works particularly well with phenotypes that have natural groupings, but because the number of clusters for a given data set is unknown, the final test statistic of the CLC method is the minimum p-value among all p-values of the CLC test statistics obtained from each possible number of clusters. Therefore, a simulation procedure needs to be used to evaluate the p-value of the final test statistic. This makes the CLC method computationally demanding. We develop a new method called computationally efficient CLC (ceCLC) to test the association between multiple phenotypes and a genetic variant. Instead of using the minimum p-value as the test statistic in the CLC method, ceCLC uses the Cauchy combination test to combine all p-values of the CLC test statistics obtained from each possible number of clusters. The test statistic of ceCLC approximately follows a standard Cauchy distribution, so the p-value can be obtained from the cumulative distribution function without the need for the simulation procedure. Extensive simulation studies and an application to the COPDGene data demonstrate that the type I error rates of ceCLC are effectively controlled in different simulation settings and that ceCLC either outperforms all other methods or has statistical power very close to that of the most powerful method with which it has been compared.

Subjects

Genome-Wide Association Study , Cluster Analysis , Computer Simulation , Genetic Association Studies , Genome-Wide Association Study/methods , Phenotype
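The step described in the abstract above, combining p-values with the Cauchy combination test and reading the result off the standard Cauchy distribution, can be sketched as follows. This is a generic illustration of the Cauchy combination test with equal weights by default, not the ceCLC implementation: the function name `cauchy_combination` and the example p-values (standing in for CLC p-values at different numbers of clusters) are assumptions.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Each p-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted sum T is again approximately
    standard Cauchy, so the combined p-value is 0.5 - atan(T) / pi.
    Weights are assumed to be non-negative and to sum to 1.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")

    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights by default
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")

    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p-values, e.g. one per candidate number of clusters.
combined = cauchy_combination([0.01, 0.5])
```

The closed-form tail probability is what removes the need for simulation: unlike a minimum-p statistic, no null distribution has to be generated empirically.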
