ABSTRACT
Mediation analysis with high-dimensional DNA methylation markers is important for identifying epigenetic pathways between environmental exposures and health outcomes. Several methods for mediation analysis with high-dimensional mediators have been developed. However, high-dimensional mediation analysis methods for time-to-event outcome data have yet to be developed. To address this gap, we propose a new high-dimensional mediation analysis procedure for survival models that incorporates sure independence screening and minimax concave penalty techniques for variable selection, with the Sobel and joint methods for testing the significance of indirect effects. Simulation studies show that the proposed procedure performs well in identifying correct biomarkers, controlling the false discovery rate, and minimizing estimation bias. We also apply this approach to study the causal pathway from smoking to overall survival among lung cancer patients, potentially mediated by 365,307 DNA methylation markers in the TCGA lung cancer cohort. Mediation analysis using a Cox proportional hazards model estimates that patients with a serious smoking history have an increased risk of lung cancer through methylation markers including cg21926276, cg27042065, and cg26387355, with significant hazard ratios of 1.2497 (95% CI: 1.1121, 1.4045), 1.0920 (95% CI: 1.0170, 1.1726), and 1.1489 (95% CI: 1.0518, 1.2550), respectively. The three methylation sites are located in three genes that have been shown to be associated with lung cancer events or overall survival. However, the three CpG sites themselves (cg21926276, cg27042065, and cg26387355) have not previously been reported and are newly identified here as potential novel epigenetic markers linking smoking and survival of lung cancer patients. Collectively, the proposed high-dimensional mediation analysis procedure performs well in mediator selection and indirect effect estimation.
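As an illustration of the two significance tests named in the abstract, the following minimal sketch computes a first-order Sobel statistic and a joint significance p-value for a single mediator. It assumes the exposure-to-mediator coefficient (`alpha`) and the mediator-to-outcome coefficient (`beta`) and their standard errors have already been estimated; the function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def normal_two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def sobel_test(alpha, se_alpha, beta, se_beta):
    """First-order Sobel z statistic and p-value for the product alpha * beta."""
    se_indirect = math.sqrt(alpha ** 2 * se_beta ** 2 + beta ** 2 * se_alpha ** 2)
    z = (alpha * beta) / se_indirect
    return z, normal_two_sided_p(z)


def joint_test(alpha, se_alpha, beta, se_beta):
    """Joint significance p-value: the larger of the two path-wise p-values."""
    p_alpha = normal_two_sided_p(alpha / se_alpha)
    p_beta = normal_two_sided_p(beta / se_beta)
    return max(p_alpha, p_beta)


# Hypothetical path estimates for one mediator (illustrative values only).
z_sobel, p_sobel = sobel_test(0.5, 0.1, 0.3, 0.1)
p_joint = joint_test(0.5, 0.1, 0.3, 0.1)
print(f"Sobel z = {z_sobel:.3f}, p = {p_sobel:.4f}; joint p = {p_joint:.4f}")
```

In a high-dimensional setting these p-values would still need a multiple-testing correction across the selected mediators, which this sketch does not attempt.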
Subject(s)
Computational Biology/methods; Models, Statistical; Survival Analysis; Adult; Aged; Aged, 80 and over; DNA Methylation/genetics; Epigenomics; Humans; Lung Neoplasms/genetics; Lung Neoplasms/mortality; Middle Aged; Smoking/genetics; Smoking/mortality
ABSTRACT
Human cytochrome P450 3A4 (CYP3A4) is an important member of the cytochrome P450 superfamily and is responsible for metabolizing ~50% of clinical drugs. Experimental evidence has shown that CYP3A4 can accommodate multiple substrates in its active site to form a cooperative binding model, which increases the efficiency of substrate metabolism. In the current study, we constructed both normal and cooperative binding models of human CYP3A4 with the antifungal drug ketoconazole (KLN). Molecular dynamics simulations and free energy calculations were then carried out to study the cooperative binding mechanism. Our simulations showed that the second KLN in the cooperative binding model had a positive impact on the binding of the first KLN in the active site through two significant pi-pi stacking interactions. The first interaction was formed by Phe215, which positions the first KLN in a favorable orientation in the active site for further metabolic reactions. The second was contributed by Phe304. This pi-pi stacking was enhanced in the cooperative binding model by the parallel conformation between the aromatic ring of Phe304 and the dioxolan moiety of the first KLN. These findings provide atomic-level insight into cooperative binding in CYP3A4, revealing a novel pi-pi stacking mechanism for drug-drug interactions.
Subject(s)
Antifungal Agents/chemistry; Cytochrome P-450 CYP3A/chemistry; Ketoconazole/chemistry; Antifungal Agents/metabolism; Binding Sites; Crystallography, X-Ray; Cytochrome P-450 CYP3A/ultrastructure; Humans; Hydrophobic and Hydrophilic Interactions; Ketoconazole/metabolism; Models, Molecular; Molecular Dynamics Simulation; Protein Binding
ABSTRACT
Identifying spatially variable genes (SVGs) is crucial for understanding the spatiotemporal characteristics of diseases and tissue structures, and it poses a distinctive challenge in spatial transcriptomics research. We propose HEARTSVG, a distribution-free, test-based method for fast and accurate identification of spatially variable genes in large-scale spatial transcriptomic data. Extensive simulations demonstrate that HEARTSVG outperforms state-of-the-art methods, with higher F1 scores (average F1 score = 0.948), improved computational efficiency and scalability, and fewer false positives (FPs). Through analysis of twelve real datasets from various spatial transcriptomic technologies, HEARTSVG identifies a greater number of biologically significant SVGs (average AUC = 0.792) than the comparison methods, without prespecifying spatial patterns. Furthermore, by clustering SVGs, we uncover two distinct tumor spatial domains characterized by unique spatial expression patterns, spatial-temporal locations, and biological functions in human colorectal cancer data, unraveling the complexity of tumors.
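For readers unfamiliar with the F1 score reported above, the following minimal sketch shows how it is computed from counts of true positives, false positives, and false negatives. The counts in the usage lines are hypothetical and do not come from the HEARTSVG benchmarks.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; returns 0.0 when nothing is correctly detected."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical example: 90 true SVGs detected, 10 false calls, 10 true SVGs missed.
print(round(f1_score(90, 10, 10), 3))
```

Because the F1 score penalizes both false positives and missed genes, it is a common single-number summary when the truly variable genes are a small fraction of all genes.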
Subject(s)
Gene Expression Profiling; Transcriptome; Humans; Gene Expression Profiling/methods; Colorectal Neoplasms/genetics; Computational Biology/methods; Algorithms; Gene Expression Regulation, Neoplastic; Computer Simulation; Databases, Genetic
ABSTRACT
Background: Identifying the causal SNPs of complex diseases in large-scale genome-wide association analysis benefits studies of the pathogenesis, prevention, diagnosis, and treatment of these diseases. However, existing methods applicable to large-scale data suffer from low accuracy. Powerful and accurate methods for detecting SNPs associated with complex diseases are therefore highly desirable. Results: We propose a score-based two-stage Bayesian network method to identify causal SNPs of complex diseases for case-control designs. This method combines the ideas of constraint-based methods and score-and-search methods to learn the structure of the disease-centered local Bayesian network. Simulation experiments are conducted to compare this new algorithm with several common methods that serve the same purpose. The results show that our method improves accuracy and stability compared with these methods. Our method, based on Bayesian network theory, yields lower false-positive rates when all correct loci are detected. In addition, an application to real-world data suggests that our algorithm performs well when handling genome-wide association data. Conclusion: The proposed method is designed to identify the SNPs related to complex diseases, and it is more accurate than other methods that can also be adapted to large-scale genome-wide association study data.
ABSTRACT
Colorectal cancer is a highly heterogeneous disease. Tumor heterogeneity limits the efficacy of cancer treatment. Single-cell RNA-sequencing technology (scRNA-seq) is a powerful tool for studying cancer heterogeneity at cellular resolution. The sparsity, heterogeneous diversity, and fast-growing scale of scRNA-seq data pose challenges to the flexibility, accuracy, and computing efficiency of differential expression (DE) methods. We propose HEART (high-efficiency and robust test), a statistical combination test that can detect DE genes with various sources of differences beyond mean expression changes. To validate the performance of HEART, we compared it with six other popular DE methods on various simulation datasets with different settings, generated by two simulation mechanisms. HEART had high accuracy (F1 score > 0.75) and high computational efficiency (less than 2 min) on multiple simulation datasets in various experimental settings. HEART performed well in detecting DE genes in the PBMC68K dataset, quantified by UMI counts, and the human brain single-cell dataset, quantified by read counts (F1 score = 0.79 and 0.65, respectively). By applying HEART to the single-cell dataset of a colorectal cancer patient, we found several potential blood-based biomarkers (CTTN, S100A4, S100A6, UBA52, FAU, and VIM) associated with colorectal cancer metastasis and validated them on additional spatial transcriptomic data from other colorectal cancer patients.
ABSTRACT
Investigation of the genetic basis of traits or clinical outcomes relies heavily on identifying relevant variables in molecular data. However, characteristics of these data such as high dimensionality and complex correlation structures hinder the development of related methods, resulting in false positives and false negatives. We developed a variable importance measure, termed the ECAR scores, that evaluates the importance of variables in a dataset. Based on these scores, ranking and selection of variables can be achieved simultaneously. Unlike most current approaches, the ECAR scores aim to rank the influential variables as high as possible while maintaining the grouping property, instead of selecting the ones that are merely predictive. The performance of the ECAR scores is tested and compared with that of other methods on simulated, semi-synthetic, and real datasets. Results showed that the ECAR scores improve on the CAR scores in terms of the accuracy of variable selection and the predictive power of high-ranking variables. They also outperform other classic methods such as lasso and stability selection when there is a high degree of correlation among influential variables. As an application, we used the ECAR scores to analyze genes associated with forced expiratory volume in the first second in patients with lung cancer and reported six associated genes.
Subject(s)
Biomarkers, Tumor/metabolism; Computer Simulation; Forced Expiratory Volume; Gene Expression Regulation, Neoplastic; Hordeum/metabolism; Lung Neoplasms/pathology; Plant Proteins/metabolism; Biomarkers, Tumor/genetics; Gene Expression Profiling; Hordeum/genetics; Humans; Lung Neoplasms/genetics; Lung Neoplasms/metabolism; Plant Proteins/genetics
ABSTRACT
BACKGROUND: The overall genetic profile for noise-induced hearing loss (NIHL) remains elusive. Herein we propose a novel machine learning (ML) based strategy to evaluate individual susceptibility to NIHL and identify the underlying genetic risk variants based on a subsample of participants with extreme phenotypes. METHODS: Five features (age, sex, cumulative noise exposure [CNE], smoking, and alcohol drinking status) of 5,539 shipbuilding workers from large cross-sectional surveys were included in four ML classification models to predict their hearing levels. The area under the curve (AUC) and prediction accuracy were used to evaluate the performance of the models. Based on the prediction error of the ML models, the NIHL-susceptible group (n=150) and NIHL-resistant group (n=150), which showed a paradoxical relationship between hearing levels and features, were separately screened to identify the underlying variants associated with NIHL risk using whole-exome sequencing (WES). Subsequently, candidate risk variants were validated in an additional replication cohort (n=2,108), followed by a meta-analysis. RESULTS: With 10-fold cross-validation, the performances of the four ML models were robust and similar, with average AUCs and accuracies ranging from 0.783 to 0.798 and 73.7% to 73.8%, respectively. The phenotypes of the NIHL-susceptible and NIHL-resistant groups were significantly different (all p<0.001). After WES analysis and filtering, 12 risk variants contributing to NIHL susceptibility were identified and replicated. The meta-analyses showed that the A allele of CDH23 rs41281334 (odds ratio [OR]=1.506, 95% confidence interval [CI]=1.106-2.051) and the C allele of WHRN rs12339210 (OR=3.06, 95% CI=1.398-6.700) were significantly associated with increased risk of NIHL after adjustment for confounding factors. CONCLUSIONS: This study revealed two genetic variants, CDH23 rs41281334 and WHRN rs12339210, that are associated with NIHL risk, based on a promising approach for evaluating individual susceptibility using ML models.
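The odds ratios above are reported with 95% confidence intervals but without standard errors. The following minimal sketch shows one common way a reader might recover an approximate log-scale standard error and Wald z statistic from a reported interval. It assumes the interval was built as a symmetric Wald interval on the log-odds scale, which the abstract does not state; the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_se_from_ci(lower, upper, z_crit=Z_95):
    """Approximate standard error of log(OR), assuming a symmetric Wald interval on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z_crit)


def wald_z(estimate, lower, upper, z_crit=Z_95):
    """Approximate Wald z statistic for log(OR) = 0 from a point estimate and its interval."""
    return math.log(estimate) / log_se_from_ci(lower, upper, z_crit)


# Values for CDH23 rs41281334 as reported in the abstract.
print(round(wald_z(1.506, 1.106, 2.051), 2))
```

The result is only an approximation: rounding of the published interval and any non-Wald construction in the original meta-analysis would shift it slightly.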
Subject(s)
Hearing Loss, Noise-Induced; Case-Control Studies; China; Cross-Sectional Studies; Genetic Predisposition to Disease; Genotype; Hearing Loss, Noise-Induced/etiology; Hearing Loss, Noise-Induced/genetics; Humans; Noise, Occupational; Occupational Exposure; Polymorphism, Single Nucleotide
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for detailing the cellular landscape within complex tissues. Large-scale single-cell transcriptomics provides both opportunities and challenges for identifying rare cells that play crucial roles in development and disease. Here, we develop GapClust, a lightweight algorithm to detect rare cell types from ultra-large scRNA-seq datasets with state-of-the-art speed and memory efficiency. Benchmarking on diverse experimental datasets demonstrates the superior performance of GapClust compared with other recently proposed methods. When applied to an intestine dataset and a 68k PBMC dataset, GapClust identifies tuft cells and a previously unrecognised subtype of monocyte, respectively.
Subject(s)
Algorithms; RNA-Seq/methods; Single-Cell Analysis/methods; Datasets as Topic; HEK293 Cells; Humans; Intestinal Mucosa/cytology; Jurkat Cells; Software
ABSTRACT
Identifying personalized driver genes is essential for discovering critical biomarkers and developing effective personalized cancer therapies. However, few methods consider weights for different types of mutations or efficiently distinguish driver genes from a large number of passenger genes. We propose MinNetRank (Minimum used for Network-based Ranking), a new method for prioritizing cancer genes that sets weights for different types of mutations, considers the incoming and outgoing degrees of the interaction network simultaneously, and uses a minimum strategy to integrate multi-omics data. MinNetRank prioritizes cancer genes among multi-omics data for each sample. The sample-specific rankings of genes are then integrated into a population-level ranking. When evaluated for accuracy and robustness in prioritizing driver genes, our method almost always significantly outperforms other methods in terms of precision, F1 score, and partial area under the curve (AUC) on six cancer datasets. Importantly, MinNetRank is efficient in discovering novel driver genes. SP1 is selected as a candidate driver gene only by our method (ranked in the top three), and SP1 RNA and protein differential expression between tumor and normal samples is statistically significant in liver hepatocellular carcinoma. The top seven genes stratify patients into two subtypes exhibiting statistically significant survival differences in five cancer types. These top seven genes are associated with overall survival, as reported by previous researchers. MinNetRank can be very useful for identifying cancer driver genes, and these biologically relevant marker genes are associated with clinical outcome. The R package of MinNetRank is available at https://github.com/weitinging/MinNetRank.
ABSTRACT
OBJECTIVE: To evaluate the effect of pre-existing comorbidities on COVID-19 mortality, and to provide clinical suggestions accordingly. SETTING: A nested case-control design using confirmed case reports released by the news media or the national/provincial/municipal health commissions of China between 18 December 2019 and 8 March 2020. PARTICIPANTS: Patients with confirmed SARS-CoV-2 infection, excluding asymptomatic patients, in mainland China outside of Hubei Province. OUTCOME MEASURES: Patient demographics, survival time and status, and history of comorbidities. METHOD: A total of 94 publicly reported deaths in locations outside of Hubei Province, mainland China, were included as cases. Each case was matched with up to three controls, based on gender and age within ±1 year (94 cases and 181 controls). An inverse probability-weighted Cox proportional hazards model was fitted, controlling for age, gender and the early period of the outbreak. RESULTS: Among the 94 cases, the median age was 72.5 years (IQR=16), and 59.6% were men, while in the control group the median age was 67 years (IQR=22), and 64.6% were men. Adjusting for age, gender and the early period of the outbreak, poor health conditions were associated with a higher risk of COVID-19 mortality (HR of comorbidity score, 1.31 [95% CI 1.11 to 1.54]; p=0.001). The estimated mortality risk in patients with pre-existing coronary heart disease (CHD) was three times that of those without CHD (p<0.001). The estimated 30-day survival probability for a profile patient with pre-existing CHD (65-year-old woman with no other comorbidities) was 0.53 (95% CI 0.34 to 0.82), while it was 0.85 (95% CI 0.79 to 0.91) for those without CHD. Older age was also associated with increased mortality risk: every 1-year increase in age was associated with a 4% increased risk of mortality (p<0.001). CONCLUSION: Extra care and early medical interventions are needed for patients with pre-existing comorbidities, especially CHD.
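To make the per-year age effect easier to interpret, the following minimal sketch converts a hazard ratio per unit of a covariate into the hazard ratio for a larger increase. It assumes age entered the Cox model as a linear term on the log-hazard scale, which the abstract does not state explicitly; the function name is hypothetical.

```python
def hazard_ratio_for_increase(hr_per_unit, units):
    """Hazard ratio for an increase of `units`, assuming a log-linear covariate effect."""
    return hr_per_unit ** units


# A 4% higher hazard per year of age (HR = 1.04) compounds over a decade.
hr_decade = hazard_ratio_for_increase(1.04, 10)
print(f"HR for a 10-year age difference: {hr_decade:.2f}")  # roughly 1.48
```

Under that assumption, a 10-year age difference corresponds to roughly a 48% higher hazard rather than 40%, because the per-year ratios multiply instead of adding.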
Subject(s)
Coronary Disease/epidemiology; Coronavirus Infections/mortality; Pneumonia, Viral/mortality; Adult; Age Factors; Aged; Aged, 80 and over; Betacoronavirus; Bronchitis, Chronic/epidemiology; COVID-19; Case-Control Studies; Cerebral Infarction/epidemiology; China/epidemiology; Comorbidity; Diabetes Mellitus/epidemiology; Female; Heart Failure/epidemiology; Humans; Liver Failure/epidemiology; Male; Middle Aged; Pandemics; Proportional Hazards Models; Pulmonary Disease, Chronic Obstructive/epidemiology; Renal Insufficiency/epidemiology; SARS-CoV-2; Young Adult
ABSTRACT
BACKGROUND: Although many prognostic single-gene (SG) lists have been identified in cancer research, application of these features is hampered by poor robustness and performance on independent datasets. Pathway-based approaches have thus emerged, which embed biological knowledge to yield reproducible features. METHODS: Pathifier estimates a pathway deregulation score (PDS) to represent the extent of pathway deregulation based on expression data, and most of its applications treat pathways as independent without addressing the effect of gene overlap between pathway pairs, which we refer to as crosstalk. Here, we propose a novel procedure based on the Pathifier methodology, which for the first time accommodates crosstalk, to identify disease-specific features that predict prognosis in patients with hepatocellular carcinoma (HCC). FINDINGS: With the cohort (N = 355) of HCC patients from The Cancer Genome Atlas (TCGA), cross validation (CV) revealed that the PDSs identified were more robust and accurate than the SG features with a deep learning (DL)-based approach. When validated on external HCC datasets, these features consistently outperformed the SGs. INTERPRETATION: On average, we provide a 10.2% improvement in prediction accuracy. Importantly, governing genes in these features provide valuable insight into the cancer hallmarks of HCC. We develop an R package, PATHcrosstalk (available from GitHub https://github.com/fabotao/PATHcrosstalk), with which users can discover pathways of interest with the crosstalk effect considered.
Subject(s)
Biomarkers, Tumor; Carcinoma, Hepatocellular/metabolism; Carcinoma, Hepatocellular/mortality; Liver Neoplasms/genetics; Liver Neoplasms/metabolism; Liver Neoplasms/mortality; Signal Transduction; Carcinoma, Hepatocellular/genetics; Computational Biology/methods; Databases, Genetic; Gene Expression Profiling; Gene Regulatory Networks; Humans; Prognosis; Reproducibility of Results; Survival Analysis
ABSTRACT
CYP2J2 is an important member of the cytochrome P450 superfamily, a monooxygenase that catalyzes many reactions involved in drug metabolism and in the synthesis of cholesterol, steroids and other lipids. Located at the endoplasmic reticulum, CYP2J2 is responsible for the epoxidation of endogenous arachidonic acid in cardiac tissue to produce cis-epoxyeicosatrienoic acids (EETs), which have anti-inflammatory and antifibrinolytic properties and can protect endothelial cells from ischemic or hypoxic injuries. Some polymorphisms, e.g., CYP2J2 with mutation T143A, R158C, I192N or N404Y, could significantly reduce the metabolism of arachidonic acid, causing or aggravating coronary artery disease. However, the detailed mechanism of the mutation-induced dysfunction of arachidonic acid metabolism is still unknown. To reveal this mechanism, a three-dimensional (3D) structure of human CYP2J2 was developed, followed by docking of the arachidonic acid ligand into the active site of the receptor. Based on the binding mode thus found, it was observed that Gly486 and Leu378 in the active site of the receptor play a key role in recognizing and positioning the carboxyl group of the ligand via hydrogen bonding interactions, and that any of the aforementioned mutations might, either directly or indirectly, affect their role and hence cause the mutation-induced dysfunction of CYP2J2-mediated arachidonic acid metabolism. It is anticipated that the findings reported in this review article may stimulate new strategies for finding novel therapeutic approaches to treat coronary artery disease.