ABSTRACT
Polygenic scores (PGSs) are quantitative metrics for predicting phenotypic values, such as human height or disease status. Some PGS methods require only summary statistics of a relevant genome-wide association study (GWAS) for their score. One such method is Lassosum, which inherits the model selection advantages of Lasso to select a meaningful subset of the GWAS single-nucleotide polymorphisms as predictors from their association statistics. However, even efficient scores like Lassosum, when derived from European-based GWASs, are poor predictors of phenotype for subjects of non-European ancestry; that is, they have limited portability to other ancestries. To increase the portability of Lassosum, when GWAS information and estimates of linkage disequilibrium are available for both ancestries, we propose Joint-Lassosum (JLS). In the simulation settings we explore, JLS provides more accurate PGSs compared to other methods, especially when measured in terms of fairness. In analyses of UK Biobank data, JLS was computationally more efficient but slightly less accurate than a Bayesian comparator, SDPRX. Like all PGS methods, JLS requires selection of predictors, which are determined by data-driven tuning parameters. We describe a new approach to selecting tuning parameters and note its relevance for model selection for any PGS. We also draw connections to the literature on algorithmic fairness and discuss how JLS can help mitigate fairness-related harms that might result from the use of PGSs in clinical settings. While no PGS method is likely to be universally portable, due to the diversity of human populations and unequal information content of GWASs for different ancestries, JLS is an effective approach for enhancing portability and reducing predictive bias.
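The idea of fitting a Lasso from summary statistics alone can be sketched as follows. This is a minimal single-ancestry illustration, not the JLS method: it assumes the objective beta'R beta - 2 beta'r + 2*lam*||beta||_1, where r holds SNP-phenotype correlations from the GWAS and R is an LD matrix from a reference panel. The function names and toy numbers are hypothetical.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator used by Lasso-type coordinate updates."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_from_summary(r, R, lam, n_iter=100):
    """Coordinate descent for beta' R beta - 2 beta' r + 2 lam ||beta||_1.

    r: SNP-phenotype correlations (from GWAS summary statistics)
    R: LD (SNP-SNP correlation) matrix as a list of lists
    lam: tuning parameter controlling sparsity (assumed chosen elsewhere)
    """
    p = len(r)
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation left for SNP j after removing the other SNPs' contributions.
            rho = r[j] - sum(R[j][k] * beta[k] for k in range(p) if k != j)
            beta[j] = soft_threshold(rho, lam) / R[j][j]
    return beta


# Toy example with three uncorrelated SNPs (identity LD matrix).
identity_ld = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = lasso_from_summary([0.5, 0.05, -0.3], identity_ld, lam=0.1)
print(weights)  # the weakly associated middle SNP is shrunk to exactly zero
```

With uncorrelated SNPs the update reduces to soft-thresholding each correlation, which is why weak associations drop out of the score entirely.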
Subjects
Genome-Wide Association Study, Health Equity, Humans, Bayes Theorem, Benchmarking, Computer Simulation
ABSTRACT
Increasing anthropogenic global warming has emerged as a significant challenge to human health in China, as extreme heat hazards increasingly threaten outdoor-exposed populations. Differences in thermal comfort, outdoor activity duration, and social vulnerability between females and males may exacerbate gender inequalities in heat-related health risks, which have been overlooked by previous studies. Here, we combine three heat hazards and outdoor activity duration to identify the spatiotemporal variation in gender-specific heat risk in China during 1991-2020. We found that females' heat risk tends to be higher than that of males. Gender disparities in heat risk decrease in southern regions, while those in northern regions remain severe. Males are prone to overheating in highly urbanized areas, whereas females are prone to overheating in less urbanized areas. Males' overheating risk is mainly attributed to population clustering associated with prolonged outdoor activity time and skewed social resource allocation. In contrast, females' overheating risk is primarily affected by social inequalities. Our findings suggest that China needs to further diminish gender disparities and accelerate climate adaptation planning.
Subjects
Extreme Heat, Heat Stroke, Male, Female, Humans, Hot Temperature, Seasons, Socioeconomic Factors, China/epidemiology
ABSTRACT
The disparity in genetic risk prediction accuracy between European and non-European individuals highlights a critical challenge in health inequality. To bridge this gap, we introduce JointPRS, a novel method that models multiple populations jointly to improve genetic risk predictions for non-European individuals. JointPRS has three key features. First, it encompasses all diverse populations to improve prediction accuracy, rather than relying solely on the target population with a single auxiliary European group. Second, it autonomously estimates and leverages chromosome-wise cross-population genetic correlations to infer the effect sizes of genetic variants. Lastly, it provides an auto version whose performance is comparable to that of the tuning version, to accommodate situations in which no validation dataset is available. Through extensive simulations and real data applications to 22 quantitative traits and four binary traits in East Asian populations, nine quantitative traits and one binary trait in African populations, and four quantitative traits in South Asian populations, we demonstrate that JointPRS outperforms state-of-the-art methods, improving the prediction accuracy for both quantitative and binary traits in non-European populations.
ABSTRACT
In the era of precision medicine, many biomarkers have been discovered to be associated with drug efficacy and safety responses, which can be used for patient stratification and drug response prediction. Due to the small sample size and limited power of randomized clinical studies, meta-analysis is usually conducted to aggregate all available studies to maximize the power for identifying prognostic and predictive biomarkers. However, it is often challenging to find an independent study to replicate the discoveries from the meta-analysis (e.g. meta-analysis of pharmacogenomics genome-wide association studies (PGx GWAS)), which seriously limits the potential impacts of the discovered biomarkers. To overcome this challenge, we develop a novel statistical framework, MAJAR (meta-analysis of joint effect associations for biomarker replicability assessment), to jointly test prognostic and predictive effects and assess the replicability of identified biomarkers by implementing an enhanced expectation-maximization algorithm and calculating their posterior-probability-of-replicabilities and Bayesian false discovery rates (Fdr). Extensive simulation studies were conducted to compare the performance of MAJAR and existing methods in terms of Fdr, power, and computational efficiency. The simulation results showed improved statistical power with well-controlled Fdr of MAJAR over existing methods and robustness to outliers under different data generation processes. We further demonstrated the advantages of MAJAR over existing methods by applying MAJAR to the PGx GWAS summary statistics data from a large cardiovascular randomized clinical trial. Compared to testing main effects only, MAJAR identified 12 novel variants associated with the treatment-related low-density lipoprotein cholesterol reduction from baseline.
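A Bayesian false discovery rate of the kind mentioned above is commonly computed as the average posterior probability of being a false positive among the selected items. The sketch below shows that common definition only; the abstract does not give MAJAR's exact procedure, so the function name, the input format, and the 0.05 threshold are assumptions.

```python
def bayesian_fdr_select(post_prob, alpha=0.05):
    """Return the largest top-ranked set whose Bayesian FDR stays at or below alpha.

    post_prob: posterior probability that each biomarker is a true (replicable) signal
    The Bayesian FDR of a set is the mean of (1 - post_prob) over its members.
    """
    # Rank biomarkers from most to least likely to be true signals.
    order = sorted(range(len(post_prob)), key=lambda i: post_prob[i], reverse=True)
    selected = []
    running_null = 0.0
    for rank, i in enumerate(order, start=1):
        running_null += 1.0 - post_prob[i]
        # The running mean is non-decreasing, so we can stop at the first violation.
        if running_null / rank > alpha:
            break
        selected.append(i)
    return selected


# Hypothetical posterior probabilities for four variants.
print(bayesian_fdr_select([0.99, 0.50, 0.97, 0.90]))
```

Ranking first and stopping at the first violation works because each added item has a posterior null probability at least as large as those already included.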
Subjects
Genome-Wide Association Study, Single Nucleotide Polymorphism, Humans, Phenotype, Bayes Theorem, Biomarkers, Randomized Controlled Trials as Topic
ABSTRACT
BACKGROUND: Whether there are sex differences in hemodynamic profiles among people with elevated blood pressure is not well understood and could guide personalization of treatment. METHODS AND RESULTS: We described the clinical and hemodynamic characteristics of adults with elevated blood pressure in China using impedance cardiography. We included 45,082 individuals with elevated blood pressure (defined as a systolic blood pressure of ≥130 mmHg or a diastolic blood pressure of ≥80 mmHg), of whom 35.2% were women. Overall, women had a higher mean systolic blood pressure than men (139.0 [±15.7] mmHg vs 136.8 [±13.8] mmHg, P<0.001), but a lower mean diastolic blood pressure (82.6 [±9.0] mmHg vs 85.6 [±8.9] mmHg, P<0.001). After adjusting for age, region, and body mass index, women <50 years old had a lower systemic vascular resistance index (SVRI) (beta-coefficient [β] -31.7; 95% CI: -51.2, -12.2) and a higher cardiac index (β 0.07; 95% CI: 0.04, 0.09) than men in the same age group, whereas among those ≥50 years old women had a higher SVRI (β 120.4; 95% CI: 102.4, 138.5) but a lower cardiac index (β -0.15; 95% CI: -0.16, -0.13). Results were consistent with a propensity score matching sensitivity analysis, although the magnitude of the SVRI difference was lower and non-significant. However, there was substantial overlap between women and men in the distribution plots of these variables, with overlapping areas ranging from 78% to 88%. CONCLUSIONS: Our findings indicate that there are sex differences in hypertension phenotype, but that sex alone is insufficient to infer an individual's profile.
Subjects
Impedance Cardiography, Hypertension, Blood Pressure/physiology, Diastole, Female, Hemodynamics, Humans, Male
ABSTRACT
The development of single-cell RNA-sequencing (scRNA-seq) technologies has offered insights into complex biological systems at single-cell resolution. In particular, these techniques facilitate the identification of genes showing cell-type-specific differential expression (DE). In this paper, we introduce MARBLES, a novel statistical model for cross-condition DE gene detection from scRNA-seq data. MARBLES employs a Markov Random Field model to borrow information across similar cell types and utilizes cell-type-specific pseudobulk counts to account for sample-level variability. Our simulation results showed that MARBLES is more powerful than existing methods at detecting DE genes while appropriately controlling the false positive rate. Applications of MARBLES to real data identified novel disease-related DE genes and biological pathways from both a single-cell lipopolysaccharide mouse dataset with 24 381 cells and 11 076 genes and a Parkinson's disease human dataset with 76 212 cells and 15 891 genes. Overall, MARBLES is a powerful tool to identify cell-type-specific DE genes across conditions from scRNA-seq data.
Subjects
Lipopolysaccharides, Single-Cell Analysis, Animals, Gene Expression Profiling/methods, Humans, Mice, RNA/genetics, RNA-Seq, RNA Sequence Analysis/methods, Single-Cell Analysis/methods
ABSTRACT
BACKGROUND: Recent development of single cell sequencing technologies has made it possible to identify genes with different expression (DE) levels at the cell type level between different groups of samples. In this article, we propose to borrow information through known biological networks to increase statistical power to identify differentially expressed genes (DEGs). RESULTS: We develop MRFscRNAseq, which is based on a Markov random field (MRF) model to appropriately accommodate gene network information as well as dependencies among cell types to identify cell-type specific DEGs. We implement an Expectation-Maximization (EM) algorithm with mean field-like approximation to estimate model parameters and a Gibbs sampler to infer DE status. Simulation study shows that our method has better power to detect cell-type specific DEGs than conventional methods while appropriately controlling type I error rate. The usefulness of our method is demonstrated through its application to study the pathogenesis and biological processes of idiopathic pulmonary fibrosis (IPF) using a single-cell RNA-sequencing (scRNA-seq) data set, which contains 18,150 protein-coding genes across 38 cell types on lung tissues from 32 IPF patients and 28 normal controls. CONCLUSIONS: The proposed MRF model is implemented in the R package MRFscRNAseq available on GitHub. By utilizing gene-gene and cell-cell networks, our method increases statistical power to detect differentially expressed genes from scRNA-seq data.
Subjects
Gene Expression Profiling, Gene Regulatory Networks, Algorithms, Humans, RNA-Seq, RNA Sequence Analysis, Single-Cell Analysis
ABSTRACT
We extend the popular Jukes-Cantor evolution model and calculate the probability of an orthologous nucleotide sequence set [a reference sequence (B1) stays with the other sequences (B-1)], where the sequence evolution [from a last common ancestral sequence (É)] follows the (prospective) Poisson process with the overall event rate λ prorated among mutation types (nucleotide/codon substitution, insertion, and deletion) and sites along each sequence. The corresponding retrospective process (reversing the prospective process) facilitates developing algorithms to calculate the marginal probability [Pr(B1)] (Monte Carlo integration) and sample É (given B1). We calculate probability Pr(B-1|É) based on the identified events (during "É→B-1") from pairwise sequence alignment to implement Pr(B-1|B1) calculation (Monte Carlo integration). Event queue sampling and probability magnifiers are used to improve the computational efficiency when the number of events is large. We finally test our procedure on both simulated and recently studied hexapod transcriptome data (Brandt et al.), where each asexual lineage pairs with its closest related sexual lineage. Rate estimates (for Phasmatodea and Zygentoma) and model comparison indicate that the asexual lineages likely mutate several times faster than their sexual relatives.
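For orientation, the standard Jukes-Cantor model that this work extends can be sketched as below. The sketch covers nucleotide substitutions only (no insertions, deletions, or codon events) and is not the extended model from the abstract. It assumes `rate` is the total substitution rate per site, split equally among the three alternative bases; the function names and example sequences are hypothetical.

```python
import math


def jc69_site_prob(same, rate, t):
    """Jukes-Cantor probability for one site after time t.

    same=True  -> probability the site shows the same base as the ancestor
    same=False -> probability it shows one specific different base
    rate: total substitution rate per site (each alternative base at rate/3)
    """
    decay = math.exp(-4.0 * rate * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def jc69_log_prob(ancestor, descendant, rate, t):
    """Log-probability of an ungapped descendant given its ancestor (independent sites)."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(
        math.log(jc69_site_prob(a == d, rate, t))
        for a, d in zip(ancestor, descendant)
    )


# Toy example: one mismatch among six aligned sites.
print(jc69_log_prob("ACGTAC", "ACGTTC", rate=1.0, t=0.1))
```

Comparing this log-probability across candidate rates is the simplest form of the rate estimation described above; the abstract's method handles insertions and deletions and integrates over the unobserved ancestor.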
Subjects
Computational Biology/methods, Insects/classification, Algorithms, Animals, Molecular Evolution, Insects/genetics, Genetic Models, Monte Carlo Method, Neoptera/genetics, Phylogeny, Poisson Distribution, DNA Sequence Analysis, Nucleic Acid Sequence Homology
ABSTRACT
BACKGROUND: The impact of therapeutic drug management (TDM) on reducing toxicity and improving efficacy in colorectal cancer (CRC) patients receiving fluorouracil-based chemotherapy is still unclear. MATERIAL AND METHODS: A total of 207 patients (Study Group n=54, Historical Group n=153) with metastatic colorectal cancer were enrolled. All of them received 6 administrations of 5-FU-based regimens. Initial 5-FU dosing for all patients was calculated using body surface area (BSA). In the Study Group, individual exposure during each cycle was measured using a nanoparticle immunoassay, and the 5-FU blood concentration was calculated using the area under the curve (AUC). We adjusted the 5-FU infusion dose of the next cycle based on the AUC data of the previous cycle to achieve the target of 20-30 mg×h/L. RESULTS: In the fourth cycle, patients in the target concentration range (AUC mean, 26.3 mg×h/L; median, 28 mg×h/L; range, 14-38 mg×h/L; CV, 22.4%) accounted for 46.8% of all patients, a higher proportion than in the first cycle (P<0.001). 5-FU TDM significantly reduced the toxicity of chemotherapy and improved its efficacy. The Study Group (30/289) showed a lower percentage of severe adverse events than the Historical Group (185/447) (P<0.001). The incidences of complete response and partial response in the Study Group were higher than those in the Historical Group (P=0.032). CONCLUSIONS: TDM in colorectal cancer can reduce toxicity, improve efficacy and clinical outcome, and can be routinely used in 5-FU-based chemotherapy.
Subjects
Antineoplastic Combined Chemotherapy Protocols, Colorectal Neoplasms, Drug Monitoring/methods, Drug-Related Side Effects and Adverse Reactions, Fluorouracil, Neoplasm Metastasis, Aged, Antineoplastic Antimetabolites/administration & dosage, Antineoplastic Antimetabolites/adverse effects, Antineoplastic Antimetabolites/blood, Antineoplastic Combined Chemotherapy Protocols/administration & dosage, Antineoplastic Combined Chemotherapy Protocols/adverse effects, Area Under Curve, China/epidemiology, Colorectal Neoplasms/blood, Colorectal Neoplasms/drug therapy, Colorectal Neoplasms/epidemiology, Colorectal Neoplasms/pathology, Drug Dosage Calculations, Drug-Related Side Effects and Adverse Reactions/diagnosis, Drug-Related Side Effects and Adverse Reactions/etiology, Drug-Related Side Effects and Adverse Reactions/prevention & control, Female, Fluorouracil/administration & dosage, Fluorouracil/adverse effects, Fluorouracil/blood, Humans, Male, Medication Therapy Management/statistics & numerical data, Neoplasm Metastasis/pathology, Neoplasm Metastasis/therapy, Neoplasm Staging, Risk Adjustment/methods, Treatment Outcome
ABSTRACT
Gait analysis, a common method for examining human gait, can provide kinematic, dynamic, and other parameters through instrumental measurement. In recent years, gait analysis has gradually been applied to disease diagnosis and to the evaluation of orthopedic surgery and rehabilitation progress. In particular, gait phase abnormality can serve as a clinical diagnostic indicator of Alzheimer's disease and Parkinson's disease, which usually show varying degrees of gait phase abnormality. This research proposes an inertial-sensor-based gait analysis method. The smoothed and filtered angular velocity signal was chosen as the input data for the 15-dimensional temporal characteristic feature. A Hidden Markov Model (HMM) and a parameter adaptive model are used to segment gait phases. Experimental results show that the proposed model based on HMM and parameter adaptation achieves a good recognition rate in gait phase segmentation compared to other classification models, and that the gait phase recognition results are consistent with ground truth. The proposed wearable device used for data collection can be embedded in the shoe; it can not only collect patients' gait data stably and reliably, ensuring the integrity and objectivity of the gait data, but also collect data in daily-life scenes and ambulatory outdoor environments.
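Segmenting gait phases with an HMM amounts to decoding the most likely hidden phase sequence from the sensor signal. The sketch below shows generic Viterbi decoding only. It assumes two hypothetical phases ("stance", "swing"), an angular velocity signal already discretized into "low"/"high", and made-up probabilities; the abstract's 15-dimensional feature and parameter adaptation are not reproduced.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a discrete-observation HMM (log-space)."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda q: prev[q] + math.log(trans_p[q][s]))
            score[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace the best path backwards from the most likely final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical model parameters, for illustration only.
states = ["stance", "swing"]
start_p = {"stance": 0.6, "swing": 0.4}
trans_p = {"stance": {"stance": 0.8, "swing": 0.2},
           "swing": {"stance": 0.3, "swing": 0.7}}
emit_p = {"stance": {"low": 0.9, "high": 0.1},
          "swing": {"low": 0.2, "high": 0.8}}

signal = ["low", "low", "high", "high", "low"]
print(viterbi(signal, states, start_p, trans_p, emit_p))
# → ['stance', 'stance', 'swing', 'swing', 'stance']
```

Working in log-space avoids numerical underflow on long recordings, which matters for continuous wearable data.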
Subjects
Gait, Wearable Electronic Devices, Biomechanical Phenomena, Humans
ABSTRACT
Wearable devices have been increasingly used in research to provide continuous physical activity monitoring, but how to effectively extract features remains challenging for researchers. To analyze the generated actigraphy data in large-scale population studies, we developed computationally efficient methods to derive sleep and activity features through a Hidden Markov Model-based sleep/wake identification algorithm, and circadian rhythm features through a Penalized Multi-band Learning approach adapted from machine learning. Unsupervised feature extraction is useful when labeled data are unavailable, especially in large-scale population studies. We applied these two methods to the UK Biobank wearable device data and used the derived sleep and circadian features as phenotypes in genome-wide association studies. We identified 53 genetic loci with p < 5×10⁻⁸ including genes known to be associated with sleep disorders and circadian rhythms as well as novel loci associated with Body Mass Index, mental diseases and neurological disorders, which suggest shared genetic factors of sleep and circadian rhythms with physical and mental health. Further cross-tissue enrichment analysis highlights the important role of the central nervous system and the shared genetic architecture with metabolism-related traits and the metabolic system. Our study demonstrates the effectiveness of our unsupervised methods for wearable device data when additional training data cannot be easily acquired, and our study further expands the application of wearable devices in population studies and genetic studies to provide novel biological insights.
Subjects
Circadian Rhythm/genetics, Genetic Predisposition to Disease, Sleep Wake Disorders/genetics, Sleep/genetics, Actigraphy/methods, Circadian Rhythm/physiology, Female, Genome-Wide Association Study, Humans, Male, Markov Chains, Middle Aged, Sleep/physiology, Sleep Wake Disorders/pathology, Wearable Electronic Devices
ABSTRACT
One goal of human microbiome studies is to relate host traits with human microbiome compositions. The analysis of microbial community sequencing data presents great statistical challenges, especially when the samples have different library sizes and the data are overdispersed with many zeros. To address these challenges, we introduce a new statistical framework, called predictive analysis in metagenomics via inverse regression (PAMIR), to analyze microbiome sequencing data. Within this framework, an inverse regression model is developed for overdispersed microbiota counts given the trait, and then a prediction rule is constructed by taking advantage of the dimension-reduction structure in the model. An efficient Monte Carlo expectation-maximization algorithm is proposed for maximum likelihood estimation. The method is further generalized to accommodate other types of covariates. We demonstrate the advantages of PAMIR through simulations and two real data examples.
Subjects
Microbiota/genetics, Sequence Analysis, Algorithms, Bacteria/genetics, Humans, Likelihood Functions, Monte Carlo Method, Regression Analysis
ABSTRACT
Public trust in health care systems has been measured in many countries, but there have been few studies of the intercountry variability in trust, or of the degree to which such variability is due to population or structural characteristics. We used data from the health care survey conducted by the International Social Survey Program from 2011 to 2013 in 31 countries to assess whether intercountry variability was significantly greater than intracountry variability using general linear models in which country was treated as a fixed factor. We also assessed the extent to which intercountry variability was due to respondent and economic circumstances (gross national income per capita). Public trust in the health care system varied significantly across countries (P < .001), even after adjustment for 8 within-country predictors and gross national income per capita. One of the strongest predictors of trust was the respondents' most recent health care experience. Higher respondent education, urban residence, and lower gross national income of the country predicted less trust in the health care system. After countries with the 10% highest health care expenditures per capita (United States) and the 10% lowest health care expenditures per capita (China and the Philippines) were removed, public trust in the health care system was positively associated with the remaining countries' health care expenditures per capita (Pearson correlation coefficient, 0.490; P = .008) and gross national income per capita (Pearson correlation coefficient, 0.495; P = .007). There is significant variation in public trust in health care across the countries studied. The intercountry differences are due, in part, to economic circumstances.
Subjects
Delivery of Health Care, Internationality, Trust, Adolescent, Adult, Aged, Aged 80 and over, Female, Health Expenditures, Humans, Male, Middle Aged, Patient Satisfaction/statistics & numerical data, Surveys and Questionnaires, Young Adult
ABSTRACT
OBJECTIVES: Despite increasing research attention on public trust in health care systems, empirical evidence on this topic in the developing world is limited and inconclusive. This paper examines the level and determinants of public trust in the health care system in China. METHODS: We used data from a survey conducted with a sample of 5347 adults in all Chinese provincial areas between January and February 2016. Trust in the health care system was assessed with a question used by the 2011-2013 International Social Survey Programme (ISSP) to assess public trust in the health care systems of 29 industrialized countries and regions ('In general, how much confidence do you have in the health care system in your country?'). RESULTS: Only 28% of respondents reported that they had a great deal or complete trust in China's health care system. Respondents who reported to have more trust in other people in society, more trust in the local government and who were more satisfied with their most recent health care system experience and their health insurance were significantly more likely to trust the country's health care system. Furthermore, respondents who reported a higher level of happiness, better health status and positive attitudes towards social equity were more likely to trust the health care system in China. CONCLUSIONS: Our findings suggest that low public trust in China's health care system is a potential problem. Improving health care experiences may be the most practical and effective way of improving trust in the health care system in China.
Subjects
Delivery of Health Care, Patient Satisfaction/statistics & numerical data, Trust/psychology, Adult, Aged, China, Cross-Sectional Studies, Female, Health Care Surveys, Humans, Male, Middle Aged, Young Adult
ABSTRACT
Gait and posture are regular activities which are fully controlled by the sensorimotor cortex. In this study, fluctuations of joint angle and asymmetry of foot elevation in human walking stride records are analyzed to assess gait in healthy adults and patients affected by gait disorders. This paper aims to build a low-cost, intelligent, and lightweight wearable gait analysis platform based on emerging body sensor networks, which can be used for rehabilitation assessment of patients with gait impairments. A calibration method for the accelerometer and magnetometer is proposed to deal with ubiquitous orthogonal error and magnetic disturbance. A proportional-integral-controller-based complementary filter and error correction of gait parameters are defined with a multi-sensor data fusion algorithm. The purpose of the current work is to investigate the effectiveness of the obtained gait data in differentiating healthy subjects from patients with gait impairments. Preliminary clinical gait experiment results showed that the proposed system can be effective in auxiliary diagnosis and rehabilitation plan formulation compared to existing methods, which indicates that the proposed method has great potential as an auxiliary tool for medical rehabilitation assessment.
ABSTRACT
Carbons are considered less favorable for postcombustion CO2 capture because of their low affinity toward CO2, and nitrogen doping was widely studied to enhance CO2 adsorption, but the results are still unsatisfactory. Herein, we report a simple, scalable, and controllable strategy of tethering potassium to a carbon matrix, which can enhance carbon-CO2 interaction effectively, and a remarkable working capacity of ca. 4.5 wt % under flue gas conditions was achieved, which is among the highest for carbon-based materials. More interestingly, a high CO2/N2 selectivity of 404 was obtained. Density functional theory calculations evidenced that the introduced potassium carboxylate moieties are responsible for such excellent performances. We also show the effectiveness of this strategy to be universal, and thus, cheaper precursors can be used, holding great promise for low-cost carbon capture from flue gas.
ABSTRACT
Multiple omic profiles have been generated for many cancer types; however, comprehensive assessment of their prognostic values across cancers is limited. We conducted a pan-cancer prognostic assessment and presented a multi-omic kernel machine learning method to systematically quantify the prognostic values of high-throughput genomic, epigenomic, and transcriptomic profiles individually, integratively, and in combination with clinical factors for 3,382 samples across 14 cancer types. We found that the prognostic performance varied substantially across cancer types. mRNA and miRNA expression profile frequently performed the best, followed by DNA methylation profile. Germline susceptibility variants displayed low prognostic performance consistently across cancer types. The integration of omic profiles with clinical variables can lead to substantially improved prognostic performance over the use of clinical variables alone in half of cancer types examined. Moreover, we showed that the kernel machine learning method consistently outperformed existing prognostic signatures, suggesting that including a large number of omic biomarkers may provide substantial improvement in prognostic assessment. Our study provides a comprehensive portrait of omic architecture for tumor prognosis across cancers, and highlights the prognostic value of genome-wide omic biomarker aggregation, which may facilitate refined prognostic assessment in the era of precision oncology.
Subjects
DNA Methylation, Neoplasm DNA, Epigenomics, MicroRNAs, Neoplasms, Messenger RNA, Neoplasm RNA, Neoplasm DNA/genetics, Neoplasm DNA/metabolism, Genome-Wide Association Study, Humans, MicroRNAs/biosynthesis, MicroRNAs/genetics, Neoplasms/diagnosis, Neoplasms/genetics, Neoplasms/metabolism, Prognosis, Messenger RNA/biosynthesis, Messenger RNA/genetics, Neoplasm RNA/biosynthesis, Neoplasm RNA/genetics
ABSTRACT
Subgroup identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to define subgroups. Longitudinal gene expression profiles might provide more information on disease progression than what is captured by baseline profiles alone. Therefore, subgroup identification could be more accurate and effective with the aid of longitudinal gene expression data. However, existing statistical methods are unable to fully utilize these data for patient clustering. In this article, we introduce a novel clustering method in the Bayesian setting based on longitudinal gene expression profiles. This method, called BClustLonG, adopts a linear mixed-effects framework to model the trajectory of genes over time, while clustering is jointly conducted based on the regression coefficients obtained from all genes. In order to account for the correlations among genes and alleviate the challenges of high dimensionality, we adopt a factor analysis model for the regression coefficients. The Dirichlet process prior distribution is utilized for the means of the regression coefficients to induce clustering. Through extensive simulation studies, we show that BClustLonG has improved performance over other clustering methods. When applied to a dataset of severely injured (burn or trauma) patients, our model is able to identify interesting subgroups. Copyright © 2017 John Wiley & Sons, Ltd.
Subjects
Bayes Theorem, Cluster Analysis, Factor Analysis, Gene Expression Profiling/methods, Genetic Models, Regression Analysis, Burns, Computer Simulation, Gene Expression, Humans, Markov Chains, Monte Carlo Method, Nonparametric Statistics
ABSTRACT
BACKGROUND: Distance-based unsupervised clustering of gene expression data is commonly used to identify heterogeneity in biologic samples. However, high noise levels in gene expression data and relatively high correlation between genes are often encountered, so traditional distances such as Euclidean distance may not be effective at discriminating the biological differences between samples. An alternative method to examine disease phenotypes is to use pre-defined biological pathways. These pathways have been shown to be perturbed in different ways in different subjects who have similar clinical features. We hypothesize that differences in the expression of genes in a given pathway are more predictive of biological differences than standard approaches and, if integrated into clustering analysis, will enhance the robustness and accuracy of the clustering method. To examine this hypothesis, we developed a novel computational method to assess the biological differences between samples using gene expression data by assuming that ontologically defined biological pathways in biologically similar samples have similar behavior. RESULTS: Pre-defined biological pathways were downloaded and genes in each pathway were used to cluster samples using the Gaussian mixture model. The clustering results across different pathways were then summarized to calculate the pathway-based distance score between samples. This method was applied to both simulated and real data sets and compared to the traditional Euclidean distance and another pathway-based clustering method, Pathifier. The results show that the pathway-based distance score performs significantly better than the Euclidean distance, especially when the heterogeneity is low and genes in the same pathways are correlated. Compared to Pathifier, we demonstrated that our approach achieves higher accuracy and robustness for small pathways. When the pathway size is large, by downsampling the pathways into smaller pathways, our approach was able to achieve comparable performance. CONCLUSIONS: We have developed a novel distance score that represents the biological differences between samples using gene expression data and pre-defined biological pathway information. Application of this distance score results in more accurate, robust, and biologically meaningful clustering results in both simulated data and real data when compared to traditional methods. It also has comparable or better performance compared to Pathifier.
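One way to summarize per-pathway clusterings into a distance between samples is sketched below. The abstract does not state the exact summary used, so this is an assumption: the distance between two samples is taken as the fraction of pathways in which they fall into different clusters. The Gaussian mixture step is not shown; cluster labels per pathway are assumed to be given, and the pathway names are hypothetical.

```python
def pathway_distance(labels_by_pathway):
    """Distance matrix from per-pathway cluster labels.

    labels_by_pathway: {pathway name: [cluster label of each sample]}
    d[i][j] = fraction of pathways where samples i and j land in different clusters.
    Labels only need to be consistent within a pathway, not across pathways.
    """
    pathways = list(labels_by_pathway.values())
    n = len(pathways[0])
    disagreements = [[0] * n for _ in range(n)]
    for labels in pathways:
        for i in range(n):
            for j in range(n):
                if labels[i] != labels[j]:
                    disagreements[i][j] += 1
    return [[count / len(pathways) for count in row] for row in disagreements]


# Three samples clustered separately within three hypothetical pathways.
labels = {
    "pathway_a": [0, 0, 1],
    "pathway_b": [0, 1, 1],
    "pathway_c": [1, 1, 0],
}
for row in pathway_distance(labels):
    print(row)
```

The resulting matrix can be passed to any distance-based clustering routine in place of a Euclidean distance matrix.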