
1.

Machine Learning Reveals Impacts of Smoking on Gene Profiles of Different Cell Types in Lung.

Ma, Qinglan; Shen, Yulong; Guo, Wei; Feng, Kaiyan; Huang, Tao; Cai, Yudong.

Life (Basel) ; 14(4)2024 Apr 13.

Article En | MEDLINE | ID: mdl-38672772

Smoking significantly elevates the risk of lung diseases such as chronic obstructive pulmonary disease (COPD) and lung cancer. This risk is attributed to the harmful chemicals in tobacco smoke that damage lung tissue and impair lung function. Current research on the impact of smoking on gene expression in specific lung cells is limited. This study addresses this gap by analyzing gene expression profiles at the single-cell level from 43,539 lung endothelial cells, 234,349 lung epithelial cells, 189,843 lung immune cells, and 16,031 lung stromal cells using advanced machine learning techniques. The data, categorized by different lung cell types, were classified into three smoking states: active smoker, former smoker, and never smoker. Each cell sample encompassed 28,024 feature genes. Employing an incremental feature selection method within a computational framework, several specific genes have been identified as potential markers of smoking status in different lung cell types. These include B2M, EEF1A1, and TPT1 in lung endothelial cells; FTL and MT-ATP8 in lung epithelial cells; HLA-B and HLA-C in lung immune cells; and HSP90B1 and LCN2 in lung stroma cells. Additionally, this study developed quantitative rules for representing the gene expression patterns related to smoking. This research highlights the potential of machine learning in oncology, enhancing our molecular understanding of smoking's harm and laying the groundwork for future mechanism-based studies.

2.

Identification of Protein-Protein Interaction Associated Functions Based on Gene Ontology.

Zhang, Yu-Hang; Huang, FeiMing; Li, JiaBo; Shen, WenFeng; Chen, Lei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Protein J ; 2024 Mar 04.

Article En | MEDLINE | ID: mdl-38436837

Protein-protein interactions (PPIs) involve physical or functional contact between two or more proteins. Proteins that interact with each other generally share special relationships. Previous studies have reported that gene ontology (GO) terms are related to the determination of PPIs, suggesting that the GO terms of interacting proteins follow special patterns. In this study, we explored these GO term patterns in human PPIs to uncover the functional mechanisms underlying PPIs. Experimentally validated human PPIs were retrieved from the STRING database and termed positive samples. Additionally, we randomly paired proteins occurring in the positive samples, yielding a large number of negative samples. For each GO term pair, we counted the positive samples in which one protein was annotated by one term of the pair and the other protein by the other term. The corresponding count for negative samples was also obtained and then adjusted to account for the large gap between the numbers of positive and negative samples. The difference between these two counts, and its ratio relative to the positive-sample count, were then calculated. This ratio quantified how much more often a GO term pair occurred in positive samples than in negative samples, indicating latent GO term patterns for PPIs. Our analysis unveiled several nuclear biological processes, including gene transcription, cell proliferation, and nutrient metabolism, as key biological functions. Interactions between major proliferative or metabolic GO terms consistently correspond with significantly reported PPIs in recent literature.
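
The counting scheme in this abstract can be illustrated with a minimal sketch. The abstract does not say how the negative count was adjusted, so the scaling by the ratio of positive to negative sample numbers is an assumption, as are the function names and the toy protein and GO identifiers.

```python
from collections import Counter
from itertools import product


def pair_counts(samples, annotations):
    """Count, for each unordered GO term pair, the protein pairs it annotates."""
    counts = Counter()
    for a, b in samples:
        # One term from each protein; the set ensures a sample is counted
        # at most once per GO term pair.
        pairs = {tuple(sorted((x, y)))
                 for x, y in product(annotations[a], annotations[b])}
        counts.update(pairs)
    return counts


def relative_ratios(positives, negatives, annotations):
    """Return (positive count - adjusted negative count) / positive count."""
    pos = pair_counts(positives, annotations)
    neg = pair_counts(negatives, annotations)
    # ASSUMPTION: negatives are rescaled to the size of the positive set.
    scale = len(positives) / len(negatives)
    return {pair: (n_pos - neg[pair] * scale) / n_pos
            for pair, n_pos in pos.items()}


# Hypothetical toy data, not taken from the study.
annotations = {"P1": {"GO:A"}, "P2": {"GO:B"}, "P3": {"GO:A"}, "P4": {"GO:C"}}
positives = [("P1", "P2"), ("P3", "P2")]
negatives = [("P1", "P4"), ("P3", "P4"), ("P1", "P2"), ("P2", "P4")]

ratios = relative_ratios(positives, negatives, annotations)
print(ratios)
```

A ratio close to 1 means the term pair is seen almost only among interacting proteins; a ratio near or below 0 means it is at least as common among random pairs.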

3.

Exploring Prognostic Gene Factors in Breast Cancer via Machine Learning.

Ma, QingLan; Chen, Lei; Feng, KaiYan; Guo, Wei; Huang, Tao; Cai, Yu-Dong.

Biochem Genet ; 2024 Feb 21.

Article En | MEDLINE | ID: mdl-38383836

Breast cancer remains the most prevalent cancer in women. To date, its underlying molecular mechanisms have not been fully uncovered. Determining gene factors that correlate specific gene expression with tumor staging is important for improving our understanding of breast cancer. However, the knowledge in this regard is still far from complete. Thus, this study aimed to address these knowledge gaps by analyzing existing gene expression profile data from 3149 breast cancer samples, where each sample was represented by the expression of 19,644 genes and classified into Nottingham histological grade (NHG) classes (Grade 1, 2, and 3). To this end, a machine learning-based framework was designed. First, the profile data were analyzed by using seven feature ranking algorithms to evaluate the importance of features (genes). Seven feature lists were generated, each of which sorted features in accordance with feature importance evaluated from a different perspective. Then, the incremental feature selection method was applied to each list to determine essential features for classification and to build efficient classifiers. Consequently, genes shared across lists, such as AURKA, CBX2, and MYBL2, which were identified as important by multiple feature ranking algorithms, were deemed potentially related to breast cancer malignancy and prognosis. In addition, the study formulated classification rules to reflect special gene expression patterns for three NHG classes. Some genes and rules were analyzed and supported by recent literature, providing new references for studying breast cancer.
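
Incremental feature selection, which recurs throughout these abstracts, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the leave-one-out 1-nearest-neighbour scorer stands in for the classifiers named in the abstracts, and the function names, the `step` parameter, and the toy data are hypothetical.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature columns (stand-in evaluator)."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features, step=1):
    """Score the top-k ranked features for growing k and keep the best k."""
    best_k, best_score = 0, -1.0
    for k in range(step, len(ranked_features) + 1, step):
        score = loo_1nn_accuracy(X, y, ranked_features[:k])
        if score > best_score:  # ties keep the smaller feature subset
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.1], [1.1, -5.1]]
y = [0, 0, 1, 1]
ranked = [0, 1]  # feature indices, most important first

selected, score = incremental_feature_selection(X, y, ranked)
print(selected, score)
```

Because the feature list is already ranked, only nested prefixes are evaluated, which keeps the search linear in the number of features instead of exploring every subset.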

4.

Analyzing domain features of small proteins using a machine-learning method.

Ding, ShiJian; Liao, HuiPing; Huang, FeiMing; Chen, Lei; Guo, Wei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Proteomics ; : e2300302, 2024 Jan 22.

Article En | MEDLINE | ID: mdl-38258387

Small proteins (SPs) are a unique group of proteins that play crucial roles in many important biological processes. Exploring the biological function of SPs is necessary. In this study, the InterPro tool and the maximum correlation method were utilized to analyze functional domains of SPs. The purpose was to identify important functional domains that can indicate the essential differences between small and large protein sequences. First, the small and large proteins were represented by their functional domains via a one-hot scheme. Then, the MaxRel method was adopted to evaluate the relationships between each domain and the target variable, indicating small or large protein. The top 36 domain features were selected for further investigation. Among them, 14 were deemed to be highly related to SPs because they were annotated to SPs more frequently than large proteins. We found the involvement of functional domains, such as ubiquitin-conjugating enzyme/RWD-like, nuclear transport factor 2 domain, and alpha subunit of guanine nucleotide-binding protein (G-protein) in regulating the biological function of SPs. The involvement of these domains has been confirmed by other recent studies. Our findings indicate that protein functional domains may regulate small protein-related functions and predict their biological activity.

5.

Identifying Autophagy-Associated Proteins and Chemicals with a Random Walk-Based Method within Heterogeneous Interaction Network.

Huang, FeiMing; Guo, Wei; Chen, Lei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Front Biosci (Landmark Ed) ; 29(1): 21, 2024 01 17.

Article En | MEDLINE | ID: mdl-38287832

BACKGROUND: Autophagy is instrumental in various health conditions, including cancer, aging, and infections. Therefore, examining proteins and compounds associated with autophagy is paramount to understanding cellular biology and the origins of diseases, paving the way for potential therapeutic and disease prediction strategies. However, the complexity of autophagy, its intersection with other cellular pathways, and the challenges in monitoring autophagic activity make the experimental identification of these elements arduous. METHODS: In this study, autophagy-related proteins and chemicals were catalogued on the basis of Human Autophagy-dedicated Database. These entities were mapped to their respective PubChem identifications (IDs) for chemicals and Ensembl IDs for proteins, yielding 563 chemicals and 779 proteins. A network comprising protein-protein, protein-chemical, and chemical-chemical interactions was probed employing the Random-Walk-with-Restart algorithm using the aforementioned proteins and chemicals as seed nodes to unearth additional autophagy-associated proteins and chemicals. Screening tests were performed to exclude proteins and chemicals with minimal autophagy associations. RESULTS: A total of 88 inferred proteins and 50 inferred chemicals of high autophagy relevance were identified. Certain entities, such as the chemical prostaglandin E2 (PGE2), which is recognized for modulating cell death-induced inflammatory responses during pathogen invasion, and the protein G Protein Subunit Alpha I1 (GNAI1), implicated in ether lipid metabolism influencing a range of cellular processes including autophagy, were associated with autophagy. CONCLUSIONS: The discovery of novel autophagy-associated proteins and chemicals is of vital importance because it enhances the understanding of autophagy, provides potential therapeutic targets, and fosters the development of innovative therapeutic strategies and interventions.

Neoplasms , Proteins , Humans , Autophagy , Algorithms , Computational Biology/methods
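
The Random-Walk-with-Restart step described in the abstract above can be sketched in a few lines. The restart probability, the tolerance, the function name, and the four-node toy network are assumptions for illustration; the abstract does not report these settings.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.8, tol=1e-6,
                             max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    adjacency maps each node to its list of neighbours (undirected edges
    must be listed in both directions); seeds must be nodes of the network.
    """
    nodes = sorted(adjacency)
    # Seed nodes share the initial probability equally.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                continue  # an isolated node passes nothing on
            share = (1.0 - restart) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network with node A as the only seed.
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
scores = random_walk_with_restart(adjacency, seeds={"A"})
print(sorted(scores, key=scores.get, reverse=True))
```

Non-seed nodes with high final probability are the candidates for further screening; in this toy network, nodes closer to the seed receive higher scores.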

6.

Identification of key gene expression associated with quality of life after recovery from COVID-19.

Ren, JingXin; Gao, Qian; Zhou, XianChao; Chen, Lei; Guo, Wei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Med Biol Eng Comput ; 62(4): 1031-1048, 2024 Apr.

Article En | MEDLINE | ID: mdl-38123886

Post-acute sequelae of COVID-19 (PASC) is a persistent complication of severe acute respiratory syndrome coronavirus 2 infection that includes symptoms, such as fatigue, cognitive impairment, and respiratory distress. These symptoms severely affect the quality of life of patients after their recovery from COVID-19. In this study, a group of machine learning algorithms analyzed the whole blood RNA-seq data from patients with different PASC levels. The purpose of this analysis was to identify the gene markers associated with PASC and the special expression patterns for different PASC levels. By comparing the quality of life of patients after the acute phase of COVID-19 and before the disease, samples in the dataset were divided into three groups, namely, "Better," "The Same," and "Worse." Each patient was represented by the expression levels of 58,929 genes. The machine learning-based workflow included six feature-ranking algorithms, incremental feature selection (IFS), and four classification algorithms. The feature ranking algorithms were in charge of assessing feature importance, whereas IFS with classification algorithms were used to extract essential genes and to construct efficient classifiers and classification rules. The expression of top genes in the results was associated with the immune response to viral infection, which is supported by the published literature. For example, patients with low CCDC18 expression and high CPED1 expression had good quality of life, whereas those with low CDC16 expression had poor quality of life.

COVID-19 , Cognitive Dysfunction , Humans , Quality of Life , Algorithms , Gene Expression , Disease Progression

7.

Identification of key genes associated with persistent immune changes and secondary immune activation responses induced by influenza vaccination after COVID-19 recovery by machine learning methods.

Ren, Jingxin; Zhou, XianChao; Huang, Ke; Chen, Lei; Guo, Wei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Comput Biol Med ; 169: 107883, 2024 Feb.

Article En | MEDLINE | ID: mdl-38157776

COVID-19 is hypothesized to exert enduring effects on the immune systems of patients, leading to alterations in immune-related gene expression. This study aimed to scrutinize the persistent implications of SARS-CoV-2 infection on gene expression and its influence on subsequent immune activation responses. We designed a machine learning-based approach to analyze transcriptomic data from both healthy individuals and patients who had recovered from COVID-19. Patients were categorized based on their influenza vaccination status and then compared with healthy controls. The first sample set encompassed 86 blood samples from healthy controls and 72 blood samples from recovered COVID-19 patients prior to influenza vaccination. The second sample set included 123 blood samples from healthy controls and 106 blood samples from recovered COVID-19 patients who had been vaccinated against influenza. For each sample, the dataset captured expression levels of 17,060 genes. The two sample sets were first analyzed by seven feature ranking algorithms, yielding seven feature lists for each dataset. Then, each list was fed into the incremental feature selection method, incorporating three classic classification algorithms, to extract essential genes and classification rules and to build efficient classifiers. The genes and rules were analyzed in this study. The main findings were that NEXN and ZNF354A were highly expressed in recovered COVID-19 patients, whereas MKI67 and GZMB were highly expressed in patients with secondary immune activation post-COVID-19 recovery. These pivotal genes could provide valuable insights for future health monitoring of COVID-19 patients and guide the creation of continued treatment regimens.

COVID-19 , Influenza, Human , Humans , SARS-CoV-2 , Vaccination , Machine Learning

8.

Identification of Whole-Blood DNA Methylation Signatures and Rules Associated with COVID-19 Severity.

Yuan, Fei; Ren, JingXin; Liao, HuiPing; Guo, Wei; Chen, Lei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Front Biosci (Landmark Ed) ; 28(11): 284, 2023 11 08.

Article En | MEDLINE | ID: mdl-38062828

BACKGROUND: Different severities of coronavirus disease 2019 (COVID-19) cause different levels of respiratory symptoms and systemic inflammation. DNA methylation, a heritable epigenetic process, also shows differential changes in different severities of COVID-19. DNA methylation is involved in regulating the activity of various immune cells and influences immune pathways associated with viral infections. It may also be involved in regulating the expression of genes associated with the progression of COVID-19. METHODS: In this study, a sophisticated machine-learning workflow was designed to analyze whole-blood DNA methylation data from COVID-19 patients with different severities versus healthy controls. We aimed to understand the role of DNA methylation in the development of COVID-19. The sample set contained 101 negative controls, 360 mildly infected individuals, and 113 severely infected individuals. Each sample involved 768,067 methylation sites. Three feature-ranking algorithms (least absolute shrinkage and selection operator (LASSO), light gradient-boosting machine (LightGBM), and Monte Carlo feature selection (MCFS)) were used to rank and filter out sites highly correlated with COVID-19. Based on the obtained ranking results, a high-performance classification model was constructed by combining the feature incremental approach with four classification algorithms (decision tree (DT), k-nearest neighbor (kNN), random forest (RF), and support vector machine (SVM)). RESULTS: Some essential methylation sites and decision rules were obtained. CONCLUSIONS: The genes (IGSF6, CD38, and TLR2) of some essential methylation sites were confirmed to play important roles in the immune system.

COVID-19 , DNA Methylation , Humans , COVID-19/diagnosis , COVID-19/genetics , Algorithms , Epigenesis, Genetic , Inflammation

9.

Identification of Colon Immune Cell Marker Genes Using Machine Learning Methods.

Yang, Yong; Zhang, Yuhang; Ren, Jingxin; Feng, Kaiyan; Li, Zhandong; Huang, Tao; Cai, Yudong.

Life (Basel) ; 13(9)2023 Sep 07.

Article En | MEDLINE | ID: mdl-37763280

Immune cell infiltration that occurs at the site of colon tumors influences the course of cancer. Different immune cell compositions in the microenvironment lead to different immune responses and different therapeutic effects. This study analyzed single-cell RNA sequencing data in a normal colon with the aim of screening genetic markers of 25 candidate immune cell types and revealing quantitative differences between them. The dataset contains 25 classes of immune cells, 41,650 cells in total, and each cell is represented by the expression levels of 22,164 genes. The data were fed into a machine learning-based workflow. Five feature ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, minimum redundancy maximum relevance, and random forest) were first used to analyze the importance of gene features, yielding five feature lists. Then, incremental feature selection and two classification algorithms (decision tree and random forest) were combined to filter the most important genetic markers from each list. For different immune cell subtypes, their marker genes, such as KLRB1 in CD4 T cells, RPL30 in B cell IGA plasma cells, and JCHAIN in IgG producing B cells, were identified. They were confirmed to be differentially expressed in different immune cells and involved in immune processes. In addition, quantitative rules were summarized by using the decision tree algorithm to distinguish candidate immune cell types. These results provide a reference for exploring the cell composition of the colon cancer microenvironment and for clinical immunotherapy.

10.

Correction: High homocysteine is associated with idiopathic normal pressure hydrocephalus in deep perforating arteriopathy: a cross-sectional study.

Ye, Shisheng; Feng, Kaiyan; Li, Yizhong; Liu, Sanxin; Wu, Qiaoling; Feng, Jinwen; Liao, Xiaorong; Jiang, Chunmei; Liang, Bo; Yuan, Li; Chen, Hai; Huang, Jinbo; Yang, Zhi; Lu, Zhengqi; Li, Hao.

BMC Geriatr ; 23(1): 541, 2023 Sep 06.

Article En | MEDLINE | ID: mdl-37674111

11.

Identification of Gene Markers Associated with COVID-19 Severity and Recovery in Different Immune Cell Subtypes.

Ren, Jing-Xin; Gao, Qian; Zhou, Xiao-Chao; Chen, Lei; Guo, Wei; Feng, Kai-Yan; Lu, Lin; Huang, Tao; Cai, Yu-Dong.

Biology (Basel) ; 12(7)2023 Jul 02.

Article En | MEDLINE | ID: mdl-37508378

As COVID-19 develops, dynamic changes occur in the patient's immune system. Changes in molecular levels in different immune cells can reflect the course of COVID-19. This study aims to uncover the molecular characteristics of different immune cell subpopulations at different stages of COVID-19. We designed a machine learning workflow to analyze scRNA-seq data of three immune cell types (B, T, and myeloid cells) in four levels of COVID-19 severity/outcome. The datasets for three cell types included 403,700 B-cell, 634,595 T-cell, and 346,547 myeloid cell samples. Each cell subtype was divided into four groups, control, convalescence, progression mild/moderate, and progression severe/critical, and each immune cell contained 27,943 gene features. A feature analysis procedure was applied to the data of each cell type. Irrelevant features were first excluded according to their relevance to the target variable measured by mutual information. Then, four ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, and max-relevance and min-redundancy) were adopted to analyze the remaining features, resulting in four feature lists. These lists were fed into the incremental feature selection, incorporating three classification algorithms (decision tree, k-nearest neighbor, and random forest) to extract key gene features and construct classifiers with superior performance. The results confirmed that genes such as PFN1, RPS26, and FTH1 played important roles in SARS-CoV-2 infection. These findings provide a useful reference for the understanding of the ongoing effect of COVID-19 development on the immune system.

12.

Machine Learning Classification of Time since BNT162b2 COVID-19 Vaccination Based on Array-Measured Antibody Activity.

Ma, Qing-Lan; Huang, Fei-Ming; Guo, Wei; Feng, Kai-Yan; Huang, Tao; Cai, Yu-Dong.

Life (Basel) ; 13(6)2023 May 31.

Article En | MEDLINE | ID: mdl-37374086

Vaccines trigger an immunological response that includes B and T cells, with B cells producing antibodies. SARS-CoV-2 immunity weakens over time after vaccination. Discovering key changes in antigen-reactive antibodies over time after vaccination could help improve vaccine efficiency. In this study, we collected data on blood antibody levels in a cohort of healthcare workers vaccinated for COVID-19 and obtained 73 antigens in samples from four groups according to the duration after vaccination, including 104 unvaccinated healthcare workers, 534 healthcare workers within 60 days after vaccination, 594 healthcare workers between 60 and 180 days after vaccination, and 141 healthcare workers over 180 days after vaccination. Our work was a reanalysis of the data originally collected at Irvine University. This data was obtained in Orange County, California, USA, with the collection process commencing in December 2020. British variant (B.1.1.7), South African variant (B.1.351), and Brazilian/Japanese variant (P.1) were the most prevalent strains during the sampling period. An efficient machine learning based framework containing four feature selection methods (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, and maximum relevance minimum redundancy) and four classification algorithms (decision tree, k-nearest neighbor, random forest, and support vector machine) was designed to select essential antibodies against specific antigens. Several efficient classifiers with a weighted F1 value around 0.75 were constructed. The antigen microarray used for identifying antibody levels in the coronavirus features ten distinct SARS-CoV-2 antigens, comprising various segments of both nucleocapsid protein (NP) and spike protein (S). This study revealed that S1 + S2, S1.mFcTag, S1.HisTag, S1, S2, Spike.RBD.His.Bac, Spike.RBD.rFc, and S1.RBD.mFc were most highly ranked among all features, where S1 and S2 are the subunits of Spike, and the suffixes represent the tagging information of different recombinant proteins. Meanwhile, the classification rules were obtained from the optimal decision tree to explain quantitatively the roles of antigens in the classification. This study identified antibodies associated with decreased clinical immunity based on populations with different time spans after vaccination. These antibodies have important implications for maintaining long-term immunity to SARS-CoV-2.

13.

Identification of Phase-Separation-Protein-Related Function Based on Gene Ontology by Using Machine Learning Methods.

Ma, Qinglan; Huang, FeiMing; Guo, Wei; Feng, KaiYan; Huang, Tao; Cai, Yudong.

Life (Basel) ; 13(6)2023 May 31.

Article En | MEDLINE | ID: mdl-37374089

Phase-separation proteins (PSPs) are a class of proteins that play a role in the process of liquid-liquid phase separation, which is a mechanism that mediates the formation of membraneless compartments in cells. Identifying phase separation proteins and their associated function could provide insights into cellular biology and the development of diseases, such as neurodegenerative diseases and cancer. Here, PSPs and non-PSPs that have been experimentally validated in earlier studies were gathered as positive and negative samples. Each protein's corresponding Gene Ontology (GO) terms were extracted and used to create a 24,907-dimensional binary vector. The purpose was to extract essential GO terms that can describe essential functions of PSPs and, at the same time, to build efficient classifiers that identify PSPs with these GO terms. To this end, the incremental feature selection computational framework and an integrated feature analysis scheme, containing categorical boosting, least absolute shrinkage and selection operator, light gradient-boosting machine, extreme gradient boosting, and permutation feature importance, were used to build efficient classifiers and identify GO terms with classification-related importance. A set of random forest (RF) classifiers with F1 scores over 0.960 were established to distinguish PSPs from non-PSPs. A number of GO terms that are crucial for distinguishing between PSPs and non-PSPs were found, including GO:0003723, which is related to a biological process involving RNA binding; GO:0016020, which is related to membrane formation; and GO:0045202, which is related to the function of synapses. This study offered recommendations for future research aimed at determining the functional roles of PSPs in cellular processes by developing efficient RF classifiers and identifying the representative GO terms related to PSPs.

14.

High homocysteine is associated with idiopathic normal pressure hydrocephalus in deep perforating arteriopathy: a cross-sectional study.

Ye, Shisheng; Feng, Kaiyan; Li, Yizhong; Liu, Sanxin; Wu, Qiaoling; Feng, Jinwen; Liao, Xiaorong; Jiang, Chunmei; Liang, Bo; Yuan, Li; Chen, Hai; Huang, Jinbo; Yang, Zhi; Lu, Zhengqi; Li, Hao.

BMC Geriatr ; 23(1): 382, 2023 06 21.

Article En | MEDLINE | ID: mdl-37344765

BACKGROUND AND OBJECTIVE: The pathogenesis and pathophysiology of idiopathic normal pressure hydrocephalus (iNPH) remain unclear. Homocysteine may reduce the compliance of intracranial arteries and damage the endothelial function of the blood-brain barrier (BBB), which may be the underlying mechanism of iNPH. Overlap between deep perforating arteriopathy (DPA) and iNPH is not rare because the two conditions share risk factors. We aimed to investigate the relationship between serum homocysteine and iNPH in DPA. METHODS: A total of 41 DPA patients with iNPH and 49 DPA patients without iNPH were included. Demographic characteristics, vascular risk factors, laboratory results, and neuroimaging data were collected. Multivariable logistic regression analysis was performed to investigate the relationship between serum homocysteine and iNPH in DPA patients. RESULTS: Patients with iNPH had significantly higher homocysteine levels than those without iNPH (median, 16.34 mmol/L versus 14.28 mmol/L; P = 0.002). There was no significant difference in CSVD burden scores between patients with iNPH and patients without iNPH. Univariate logistic regression analysis demonstrated that patients with homocysteine levels in Tertile 3 were more likely to have iNPH than those in Tertile 1 (OR, 4.929; 95% CI, 1.612-15.071; P = 0.005). The association remained significant after multivariable adjustment for potential confounders, including age, male sex, hypertension, diabetes mellitus, atherosclerotic cardiovascular disease (ASCVD) or hypercholesterolemia, and eGFR level. CONCLUSION: Our study indicated that high serum homocysteine levels were independently associated with iNPH in DPA. However, further research is needed to determine the predictive value of homocysteine and to confirm the underlying mechanism between homocysteine and iNPH.

Hydrocephalus, Normal Pressure , Vascular Diseases , Humans , Male , Hydrocephalus, Normal Pressure/diagnostic imaging , Hydrocephalus, Normal Pressure/complications , Cross-Sectional Studies , Vascular Diseases/complications , Risk Factors , Neuroimaging

15.

Identification of dynamic gene expression profiles during sequential vaccination with ChAdOx1/BNT162b2 using machine learning methods.

Li, Jing; Ren, JingXin; Liao, HuiPing; Guo, Wei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Front Microbiol ; 14: 1138674, 2023.

Article En | MEDLINE | ID: mdl-37007526

To date, COVID-19 remains a serious global public health problem. Vaccination against SARS-CoV-2 has been adopted by many countries as an effective coping strategy. The strength of the body's immune response in the face of viral infection correlates with the number of vaccinations and the duration of vaccination. In this study, we aimed to identify specific genes that may trigger and control the immune response to COVID-19 under different vaccination scenarios. A machine learning-based approach was designed to analyze the blood transcriptomes of 161 individuals who were classified into six groups according to the dose and timing of inoculations, including I-D0, I-D2-4, I-D7 (day 0, days 2-4, and day 7 after the first dose of ChAdOx1, respectively) and II-D0, II-D1-4, II-D7-10 (day 0, days 1-4, and days 7-10 after the second dose of BNT162b2, respectively). Each sample was represented by the expression levels of 26,364 genes. The first dose was ChAdOx1, whereas the second dose was mainly BNT162b2 (only four individuals received a second dose of ChAdOx1). The groups were treated as labels and the genes as features. Several machine learning algorithms were employed to analyze this classification problem. In detail, five feature ranking algorithms (Lasso, LightGBM, MCFS, mRMR, and PFI) were first applied to evaluate the importance of each gene feature, resulting in five feature lists. Then, the lists were fed into the incremental feature selection method with four classification algorithms to extract essential genes and classification rules and to build optimal classifiers. The essential genes, namely, NRF2, RPRD1B, NEU3, SMC5, and TPX2, have been previously associated with immune response. This study also summarized expression rules that describe different vaccination scenarios to help determine the molecular mechanism of vaccine-induced antiviral immunity.

16.

Immune responses of different COVID-19 vaccination strategies by analyzing single-cell RNA sequencing data from multiple tissues using machine learning methods.

Li, Hao; Ma, Qinglan; Ren, Jingxin; Guo, Wei; Feng, Kaiyan; Li, Zhandong; Huang, Tao; Cai, Yu-Dong.

Front Genet ; 14: 1157305, 2023.

Article En | MEDLINE | ID: mdl-37007947

Multiple types of COVID-19 vaccines have been shown to be highly effective in preventing SARS-CoV-2 infection and in reducing post-infection symptoms. Almost all of these vaccines induce systemic immune responses, but differences in immune responses induced by different vaccination regimens are evident. This study aimed to reveal the differences in immune gene expression levels of different target cells under different vaccine strategies after SARS-CoV-2 infection in hamsters. A machine learning-based process was designed to analyze single-cell transcriptomic data of different cell types from the blood, lung, and nasal mucosa of hamsters infected with SARS-CoV-2, including B and T cells from the blood and nasal cavity, macrophages from the lung and nasal cavity, and alveolar epithelial and lung endothelial cells. The cohort was divided into five groups: non-vaccinated (control), 2*adenovirus (two doses of adenovirus vaccine), 2*attenuated (two doses of attenuated virus vaccine), 2*mRNA (two doses of mRNA vaccine), and mRNA/attenuated (primed by mRNA vaccine, boosted by attenuated vaccine). All genes were ranked using five feature ranking methods (LASSO, LightGBM, Monte Carlo feature selection, mRMR, and permutation feature importance). Some key genes that contributed to the analysis of immune changes, such as RPS23, DDX5, and PFN1 in immune cells and IRF9 and MX1 in tissue cells, were identified. Afterward, the five ranked feature lists were fed into the incremental feature selection framework, which contained two classification algorithms (decision tree [DT] and random forest [RF]), to construct optimal classifiers and generate quantitative rules. Results showed that the RF classifiers provided relatively higher performance than the DT classifiers, whereas the DT classifiers provided quantitative rules that indicated specific gene expression levels under different vaccine strategies. These findings may help in developing better protective vaccination programs and new vaccines.
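The incremental feature selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the leave-one-out nearest-centroid scorer stands in for the decision tree and random forest classifiers named in the abstract, and the toy matrix, labels, ranked list, and function names are all assumptions made for illustration.

```python
from typing import Hashable, List, Sequence, Tuple


def loo_nearest_centroid_accuracy(
    X: Sequence[Sequence[float]], y: Sequence[Hashable], features: Sequence[int]
) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen columns.

    Stand-in scorer (assumption): the abstract uses decision tree and random forest.
    """
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            vec = [row[f] for f in features]
            prev = sums.get(label, [0.0] * len(features))
            sums[label] = [s + v for s, v in zip(prev, vec)]
            counts[label] = counts.get(label, 0) + 1
        query = [X[i][f] for f in features]

        def sq_dist(label: Hashable) -> float:
            # squared distance from the held-out sample to this class centroid
            return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

        correct += min(sums, key=sq_dist) == y[i]
    return correct / len(X)


def incremental_feature_selection(
    X: Sequence[Sequence[float]], y: Sequence[Hashable], ranked: Sequence[int]
) -> Tuple[List[int], float]:
    """Score the top-1, top-2, ... top-N ranked features and keep the best prefix."""
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked) + 1):
        score = loo_nearest_centroid_accuracy(X, y, ranked[:k])
        if score > best_score:  # strict ">" keeps the smallest subset on ties
            best_k, best_score = k, score
    return list(ranked[:best_k]), best_score


# Hypothetical toy data: column 0 separates the groups, columns 1 and 2 do not.
X = [
    [0.0, 5.0, 1.0],
    [0.1, -4.0, 0.0],
    [0.2, 3.0, 1.0],
    [1.0, -5.0, 0.0],
    [1.1, 4.0, 1.0],
    [0.9, -3.0, 0.0],
]
y = ["group_A", "group_A", "group_A", "group_B", "group_B", "group_B"]
ranked = [0, 2, 1]  # assumed output of some feature ranking method

selected, score = incremental_feature_selection(X, y, ranked)
print(selected, score)
```

In practice each of the five ranked lists would be passed through the same loop, once per classifier, and the best-scoring prefix of each list would define the optimal feature set for that combination.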

17.

Using Machine Learning Methods in Identifying Genes Associated with COVID-19 in Cardiomyocytes and Cardiac Vascular Endothelial Cells.

Xu, Yaochen; Ma, Qinglan; Ren, Jingxin; Chen, Lei; Guo, Wei; Feng, Kaiyan; Zeng, Zhenbing; Huang, Tao; Cai, Yudong.

Life (Basel) ; 13(4)2023 Apr 14.

Article En | MEDLINE | ID: mdl-37109540

Coronavirus disease 2019 (COVID-19) not only damages the respiratory system but also strains the cardiovascular system. Vascular endothelial cells and cardiomyocytes play an important role in cardiac function, and aberrant gene expression in these cells can lead to cardiovascular diseases. In this study, we sought to explain the influence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection on the gene expression levels of vascular endothelial cells and cardiomyocytes. We designed a machine learning-based workflow to analyze the gene expression profiles of vascular endothelial cells and cardiomyocytes from patients with COVID-19 and healthy controls. An incremental feature selection method with a decision tree was used to build efficient classifiers and to summarize quantitative classification genes and rules. Some key genes that exert important effects on cardiac function, such as MALAT1, MT-CO1, and CD36, were extracted from the gene expression matrix of 104,182 cardiomyocytes (12,007 cells from patients with COVID-19 and 92,175 cells from healthy controls) and 22,438 vascular endothelial cells (10,812 cells from patients with COVID-19 and 11,626 cells from healthy controls). The findings reported in this study may provide insights into the effect of COVID-19 on cardiac cells, further explain the pathogenesis of COVID-19, and facilitate the identification of potential therapeutic targets.
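As a rough illustration of how a decision tree yields quantitative classification rules, the sketch below walks every root-to-leaf path of a small hand-built tree and emits one if-then rule per leaf. The tree, the placeholder feature names (gene_A, gene_B), the thresholds, and the class labels are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical tree: internal nodes test "feature <= threshold", leaves carry a label.
tree: Dict = {
    "feature": "gene_A",
    "threshold": 2.5,
    "left": {"label": "control"},
    "right": {
        "feature": "gene_B",
        "threshold": 1.0,
        "left": {"label": "control"},
        "right": {"label": "COVID-19"},
    },
}


def extract_rules(node: Dict, path: Tuple[str, ...] = ()) -> List[Tuple[List[str], str]]:
    """Return one (conditions, label) pair per leaf, in left-to-right order."""
    if "label" in node:
        return [(list(path), node["label"])]
    name, thr = node["feature"], node["threshold"]
    return (
        extract_rules(node["left"], path + (f"{name} <= {thr}",))
        + extract_rules(node["right"], path + (f"{name} > {thr}",))
    )


rules = extract_rules(tree)
for conditions, label in rules:
    print("IF " + " AND ".join(conditions) + " THEN " + label)
```

Each printed rule is a conjunction of expression thresholds along one path, which is the general form that decision tree rules take.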

18.

Characterization of chromatin accessibility patterns in different mouse cell types using machine learning methods at single-cell resolution.

Xu, Yaochen; Huang, FeiMing; Guo, Wei; Feng, KaiYan; Zhu, Lin; Zeng, Zhenbing; Huang, Tao; Cai, Yu-Dong.

Front Genet ; 14: 1145647, 2023.

Article En | MEDLINE | ID: mdl-36936430

Chromatin accessibility is a generic property of the eukaryotic genome that refers to the degree of physical compaction of chromatin. Recent studies have shown that chromatin accessibility is cell type dependent, indicating chromatin heterogeneity across cell lines and tissues. Identifying markers that distinguish cell types at the chromosome level is important for understanding cell function and classifying cell types. In the present study, we investigated transcriptionally active chromosome segments identified by sci-ATAC-seq at single-cell resolution, covering 69,015 cells belonging to 77 different cell types. Each cell was represented by existence status on 20,783 genes that were obtained from 436,206 active chromosome segments. The gene features were analyzed by Boruta, which retained 3897 genes; these were then ranked by Monte Carlo feature selection. The ranked list was further analyzed by the incremental feature selection (IFS) method, yielding essential genes, classification rules, and an efficient random forest (RF) classifier. To improve the performance of the optimal RF classifier, its features were further processed by an autoencoder, light gradient boosting machine, and the IFS method, producing a final RF classifier with a Matthews correlation coefficient (MCC) of 0.838. Some marker genes, such as H2-Dmb2, which is specifically expressed in antigen-presenting cells (e.g., dendritic cells or macrophages), and Tenm2, which is specifically expressed in T cells, were identified in this study. Our analysis revealed numerous potential epigenetic modification patterns that are unique to particular cell types, thereby advancing knowledge of the critical functions of chromatin accessibility in cell processes.
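The MCC (Matthews correlation coefficient) quoted above is a single score that extends to multi-class problems such as a 77-class cell type classifier. The following sketch computes the standard multi-class generalization from true and predicted labels; the function name and the toy label vectors are assumptions for illustration, and the abstract does not state which implementation the authors used.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def multiclass_mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multi-class Matthews correlation coefficient.

    MCC = (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s = number of samples, c = number of correct predictions,
    t_k = true count of class k, p_k = predicted count of class k.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Convention: return 0.0 when the score is undefined (e.g. one class only).
    return numerator / denominator if denominator else 0.0


# Toy labels (hypothetical): a perfect three-class prediction and a noisy binary one.
perfect = multiclass_mcc([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
noisy = multiclass_mcc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(perfect, round(noisy, 3))
```

For two classes this reduces to the familiar binary MCC, so the noisy example (two true positives, two true negatives, one false positive, one false negative) evaluates to 1/3.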

19.

Identification of genes related to immune enhancement caused by heterologous ChAdOx1-BNT162b2 vaccines in lymphocytes at single-cell resolution with machine learning methods.

Li, Jing; Huang, FeiMing; Ma, QingLan; Guo, Wei; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong.

Front Immunol ; 14: 1131051, 2023.

Article En | MEDLINE | ID: mdl-36936955

The widely used ChAdOx1 nCoV-19 (ChAd) vector and BNT162b2 (BNT) mRNA vaccines have been shown to induce robust immune responses. Recent studies demonstrated that the immune responses of people who received one dose of ChAdOx1 and one dose of BNT were better than those of people who received two homologous doses of ChAdOx1 or BNT. However, how heterologous vaccines function has not been extensively investigated. In this study, single-cell RNA sequencing data from three classes of samples (volunteers vaccinated with heterologous ChAdOx1-BNT and volunteers vaccinated with homologous ChAd-ChAd or BNT-BNT, all after 7 days) were divided into three types of immune cells (3654 B, 8212 CD4+ T, and 5608 CD8+ T cells). To identify differences in gene expression in various cell types induced by different vaccination strategies, multiple advanced feature selection methods (max-relevance and min-redundancy, Monte Carlo feature selection, least absolute shrinkage and selection operator, light gradient boosting machine, and permutation feature importance) and classification algorithms (decision tree and random forest) were integrated into a computational framework. The feature selection methods assessed the importance of gene features, yielding multiple ranked gene lists. These lists were fed into incremental feature selection, incorporating decision tree and random forest, to extract essential genes and classification rules and to build efficient classifiers. Highly ranked genes include PLCG2, whose differential expression is important to the B cell immune pathway and is positively correlated with immune cells, such as CD8+ T cells, and B2M, which is associated with thymic T cell differentiation. This study contributes to the mechanistic explanation of the stronger immune response induced by a heterologous ChAdOx1-BNT vaccination schedule compared with two doses of either BNT or ChAdOx1, offering a theoretical foundation for vaccine modification.

BNT162 Vaccine , ChAdOx1 nCoV-19 , Humans , BNT162 Vaccine/immunology , CD8-Positive T-Lymphocytes , ChAdOx1 nCoV-19/immunology , Machine Learning , COVID-19/prevention & control , CD4-Positive T-Lymphocytes

20.

Identification of Genes Associated with the Impairment of Olfactory and Gustatory Functions in COVID-19 via Machine-Learning Methods.

Ren, Jingxin; Zhang, Yuhang; Guo, Wei; Feng, Kaiyan; Yuan, Ye; Huang, Tao; Cai, Yu-Dong.

Life (Basel) ; 13(3)2023 Mar 15.

Article En | MEDLINE | ID: mdl-36983953

Coronavirus disease 2019 (COVID-19), a severe respiratory disease, affects many parts of the body, and approximately 20-85% of patients exhibit functional impairment of the senses of smell and taste, some of whom even experience the permanent loss of these senses. These symptoms are not life-threatening but severely affect patients' quality of life and increase the risk of depression and anxiety. The pathological mechanisms of these symptoms have not been fully identified. In the current study, we aimed to identify important expression-level biomarkers associated with the loss of taste or olfactory ability mediated by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and to suggest potential pathogenetic mechanisms of these COVID-19 complications. We designed a machine-learning-based approach to analyze the transcriptome of 577 COVID-19 patient samples, including 84 COVID-19 samples with a decreased ability to taste or smell and 493 COVID-19 samples without impairment. Each sample was represented by 58,929 gene expression levels. The features were analyzed and ranked by three feature selection methods (least absolute shrinkage and selection operator, light gradient boosting machine, and Monte Carlo feature selection). The optimal feature sets were obtained through incremental feature selection using two classification algorithms: decision tree (DT) and random forest (RF). The top genes identified by these multiple methods (H3-5, NUDT5, and AOC1) are involved in olfactory and gustatory impairments. In addition, a high-performance RF classifier was developed, and three sets of quantitative rules that describe the impairment of olfactory and gustatory functions were obtained from the optimal DT classifiers. In summary, this study provides a new computational analysis and suggests latent biomarkers (genes and rules) for predicting olfactory and gustatory impairment caused by COVID-19 complications.