1.
Front Cardiovasc Med ; 11: 1277123, 2024.
Article in English | MEDLINE | ID: mdl-38699582

ABSTRACT

Background: Electrocardiogram (ECG) signals are inevitably contaminated with various kinds of noise during acquisition and transmission. Noise may produce misleading information about cardiac health, preventing specialists from making a correct analysis. Methods: In this paper, an efficient strategy is proposed to denoise ECG signals, which employs a time-frequency framework based on the S-transform (ST) and combines bi-dimensional empirical mode decomposition (BEMD) and non-local means (NLM). In the method, the ST maps an ECG signal into a subspace in the time-frequency domain; the BEMD then decomposes the ST-based time-frequency representation (TFR) into a series of sub-TFRs at different scales; finally, the NLM removes noise and restores ECG signal characteristics based on structural self-similarity. Results: The proposed method is validated using numerous ECG signals from the MIT-BIH arrhythmia database, and several different types of noise with varying signal-to-noise ratio (SNR) are taken into account. The experimental results show that the proposed technique is superior to the existing wavelet-based approach and NLM filtering, with higher SNR and structural similarity index measure (SSIM) and lower root mean squared error (RMSE) and percent root mean square difference (PRD). Conclusions: The proposed method not only significantly suppresses the noise present in ECG signals but also better preserves their characteristics; thus, it is more suitable for ECG signal processing.
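The evaluation metrics named in this abstract can be illustrated with a short sketch. It assumes the commonly used definitions of output SNR, RMSE, and PRD computed against a clean reference signal; the abstract does not give its exact formulas (some PRD variants subtract the signal mean first), and the function name `denoising_metrics` is hypothetical.

```python
import math

def denoising_metrics(clean, denoised):
    """Return (SNR in dB, RMSE, PRD in percent) for two equal-length signals.

    Assumed definitions (not taken from the paper):
      SNR  = 10 * log10(sum(clean^2) / sum((clean - denoised)^2))
      RMSE = sqrt(mean((clean - denoised)^2))
      PRD  = 100 * sqrt(sum((clean - denoised)^2) / sum(clean^2))
    """
    if len(clean) != len(denoised) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    err_energy = sum((c - d) ** 2 for c, d in zip(clean, denoised))
    sig_energy = sum(c ** 2 for c in clean)
    if sig_energy == 0:
        raise ValueError("reference signal has zero energy")
    rmse = math.sqrt(err_energy / len(clean))
    if err_energy == 0:
        # Perfect reconstruction: SNR is unbounded, error metrics are zero.
        return math.inf, 0.0, 0.0
    snr_db = 10 * math.log10(sig_energy / err_energy)
    prd = 100 * math.sqrt(err_energy / sig_energy)
    return snr_db, rmse, prd
```

Under these definitions a better denoiser yields a higher SNR and lower RMSE and PRD, which is the direction of improvement the abstract reports.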

2.
Front Genet ; 14: 1172108, 2023.
Article in English | MEDLINE | ID: mdl-37636270

ABSTRACT

Minimal residual disease (MRD) refers to a very small number of residual tumor cells in the body during or after treatment, representing the persistence of the tumor and the possibility of clinical progression. Circulating tumor DNA (ctDNA) is a DNA fragment actively secreted by tumor cells or released into the circulatory system during apoptosis or necrosis of tumor cells, and it is emerging as a non-invasive biomarker for dynamically monitoring therapeutic effect and predicting recurrence. The feasibility of ctDNA for MRD detection and the revolution in ctDNA-based liquid biopsies provide a potential method for cancer monitoring. In this review, we summarize the main methods of ctDNA detection (PCR-based sequencing and next-generation sequencing) and their advantages and disadvantages. Additionally, we review the significance of ctDNA analysis in guiding adjuvant therapy and predicting relapse in lung, breast, colon, and other cancers. Finally, MRD detection still faces many challenges, such as a lack of standardization, misleading false-negative or false-positive results, and the need for validation in large independent cohorts to improve clinical outcomes.

3.
Comput Struct Biotechnol J ; 21: 1414-1423, 2023.
Article in English | MEDLINE | ID: mdl-36824227

ABSTRACT

Identifying the potential associations between microbes and diseases is the first step in revealing the pathological mechanisms of microbe-associated diseases. However, traditional culture-based microbial experiments are expensive and time-consuming. Thus, it is critical to prioritize disease-associated microbes by computational methods for further experimental validation. In this study, we proposed a novel method called MNNMDA to predict microbe-disease associations (MDAs) by applying a matrix nuclear norm method to known microbe and disease data. Specifically, we first calculated Gaussian interaction profile kernel similarity and functional similarity for diseases and microbes. Then we constructed a heterogeneous information network by combining the integrated disease similarity network, the integrated microbe similarity network and the known microbe-disease bipartite network. Finally, we formulated the microbe-disease association prediction problem as a low-rank matrix completion problem, which was solved by minimizing the nuclear norm of a matrix with a few regularization terms. We tested the performance of MNNMDA on three datasets, HMDAD, Disbiome, and Combined Data, of small, medium and large size, respectively. We also compared MNNMDA with five state-of-the-art methods: KATZHMDA, LRLSHMDA, NTSHMDA, GATMDA, and KGNMDA. MNNMDA achieved areas under the ROC curve (AUROC) of 0.9536 and 0.9364 on HMDAD and Disbiome, respectively, better than the AUROCs of the compared methods under 5-fold cross-validation for all microbe-disease associations. It also obtained relatively good performance, with an AUROC of 0.8858, on the combined data. In addition, MNNMDA was better than the other methods in area under the precision-recall curve (AUPR) under 5-fold cross-validation for all associations, and in both AUROC and AUPR under 5-fold cross-validation for diseases and 5-fold cross-validation for microbes.
Finally, case studies on colon cancer and inflammatory bowel disease (IBD) also validated the effectiveness of MNNMDA. In conclusion, MNNMDA is an effective method for predicting microbe-disease associations. Availability: The code and data for this paper are freely available on GitHub at https://github.com/Haiyan-Liu666/MNNMDA.
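As an illustration of the first step described above, the following sketch computes a Gaussian interaction profile (GIP) kernel from a binary association matrix. It assumes the widely used formulation in which the bandwidth is normalized by the mean squared norm of the interaction profiles; the abstract does not state MNNMDA's exact settings, and `gip_kernel` and `gamma_prime` are hypothetical names, not taken from the linked repository.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Each row is the interaction profile of one entity (for example, one
    disease's 0/1 associations with every microbe). Assumed formula:
      K[i][j] = exp(-gamma * ||row_i - row_j||^2)
      gamma   = gamma_prime / mean_i(||row_i||^2)
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * dist_sq)
    return kernel
```

Applying the function to the association matrix gives one similarity matrix (say, disease by disease), and applying it to the transposed matrix gives the other.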

4.
Front Oncol ; 12: 988680, 2022.
Article in English | MEDLINE | ID: mdl-36203428

ABSTRACT

Background: Cuproptosis is a new form of regulated cell death that is currently considered a new cancer treatment strategy. Nevertheless, the prognostic predictive value of cuproptosis-related lncRNAs in breast cancer (BC) remains unknown. Using cuproptosis-related lncRNAs, this study aims to predict the immune microenvironment and prognosis of BC patients and to develop new therapeutic strategies that target the disease. Methods: RNA-seq data and the corresponding clinical and prognostic information were obtained from The Cancer Genome Atlas (TCGA) database. Univariate and multivariate Cox regression analyses were performed to identify lncRNAs associated with cuproptosis and establish a predictive signature. The Kaplan-Meier method was used to calculate overall survival (OS) in the high-risk and low-risk groups. Gene set enrichment was performed for the high-risk and low-risk groups to explore functional differences between them. The mutation data were analyzed using the "maftools" R package. The relationship between the predictive signature and immune status was explored by single-sample gene set enrichment analysis (ssGSEA). Last, the correlation between the predictive signature and treatment condition in patients with BC was analyzed. Based on the prognostic risk model, we assessed associations between risk subgroups and immune scores and immune checkpoints. In addition, drug responses in the risk groups were predicted. Results: We identified a set of 11 cuproptosis-related lncRNAs (GORAB-AS1, AC079922.2, AL589765.4, AC005696.4, CYTOR, ZNF197-AS1, AC002398.1, AL451085.3, YTHDF3-AS1, AC008771.1, LINC02446), on which the risk model was built. Patients in the low-risk group lived longer than those in the high-risk group (p < 0.001). Moreover, cuproptosis-related lncRNA profiles can independently predict prognosis in BC patients. The AUC values of the receiver operating characteristic (ROC) curves for 1-, 3-, and 5-year risk were 0.849, 0.779, and 0.794, respectively.
Patients in the high-risk group had lower OS than those in the low-risk group when they were divided into groups based on various clinicopathological variables. The tumor mutation burden (TMB) correlation analysis showed that patients with high TMB had a worse prognosis than those with low TMB, and gene mutation frequencies differed between the high- and low-TMB groups, for example PIK3CA (36% versus 32%) and SYNE1 (4% versus 6%). Gene enrichment analysis indicated that the differential genes were significantly concentrated in immune-related pathways. The predictive signature was significantly correlated with the immune status of BC patients, according to the ssGSEA results. Finally, high-risk patients showed high sensitivity to anti-CD276 immunotherapy and to conventional chemotherapeutic drugs such as imatinib, lapatinib, and pazopanib. Conclusion: We successfully constructed a cuproptosis-related lncRNA signature, which can independently predict the prognosis of BC patients and can be used to estimate OS and clinical treatment outcomes in BRCA patients. It will serve as a foundation for further research into the mechanism of cuproptosis-related lncRNAs in breast cancer, as well as for the development of new markers and therapeutic targets for the disease.
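The Kaplan-Meier step mentioned in this abstract can be sketched as follows. This is a minimal product-limit estimator written for illustration only; it is not the authors' pipeline (which, per the abstract, was built with R tooling), and the function name `kaplan_meier` and the 1/0 event coding are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Computing one curve for the high-risk group and one for the low-risk group, then comparing them with a log-rank test, is the usual way a statement such as "low-risk patients lived longer (p < 0.001)" is obtained.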

5.
Front Cardiovasc Med ; 9: 983543, 2022.
Article in English | MEDLINE | ID: mdl-36299867

ABSTRACT

As an important auxiliary tool for arrhythmia diagnosis, the electrocardiogram (ECG) is frequently used to detect a variety of cardiovascular diseases caused by arrhythmia, such as cardiac mechanical infarction. ECG classification has remained a challenging problem over the past few years. This paper presents a novel deep learning model called the convolutional vision transformer (ConViT), which combines a vision transformer (ViT) with a convolutional neural network (CNN) for ECG arrhythmia classification, and in which the unique soft convolutional inductive bias of gated positional self-attention (GPSA) layers integrates the strengths of the attention mechanism and the convolutional architecture. Moreover, the time-reassigned synchrosqueezing transform (TSST), a newly developed time-frequency analysis (TFA) method in which the time-frequency coefficients are reassigned in the time direction, is employed to sharpen pulse traits for feature extraction. To address the class imbalance in the traditional ECG database, the SMOTE algorithm and focal loss (FL) are used for data augmentation and minority-class weighting, respectively. The experiment using the MIT-BIH arrhythmia database indicates that the overall accuracy of the proposed model is as high as 99.5%. Furthermore, the specificity (Spe), F1-score and positive Matthews correlation coefficient (MCC) for supraventricular ectopic beats (S) and ventricular ectopic beats (V) are all above 94%. These results demonstrate that the proposed method is superior to most existing methods.
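To make the minority-class weighting concrete, here is a minimal sketch of focal loss for a single sample, assuming the standard form FL = -alpha * (1 - p_t)^gamma * log(p_t). The abstract does not report the alpha and gamma values used, so the defaults below are illustrative assumptions.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    p_true : predicted probability assigned to the correct class (0 < p <= 1)
    gamma  : focusing parameter; gamma = 0 reduces to plain cross-entropy
    alpha  : class weight (assumed value; often set larger for rare classes)
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)
```

Because the factor (1 - p_t)^gamma shrinks the loss of well-classified samples, training gradients are dominated by hard examples, which in an imbalanced ECG dataset tend to come from the rare beat classes.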

6.
Front Oncol ; 12: 922178, 2022.
Article in English | MEDLINE | ID: mdl-36248992

ABSTRACT

Background: Breast cancer is a common malignant tumor in women. TIMM8A is up-regulated in various cancers. The aim of this work was to clarify the value of TIMM8A in the diagnosis and prognosis of breast cancer (BC), and its association with immune cells, immune checkpoints, and gene mutations. Methods: The transcription and expression profiles of TIMM8A in BC and normal tissues were downloaded from The Cancer Genome Atlas (TCGA). The expression of TIMM8A protein was evaluated by human protein map. The correlation between TIMM8A and clinical features was analyzed using the R package to establish a ROC diagnostic curve. cBioPortal and MethSurv were used to identify gene alterations and DNA methylation and their effects on prognosis. The Tumor Immune Estimation Resource (TIMER) database and the tumor-immune system interaction database (TISIDB) were used to determine the relationship between TIMM8A gene expression levels and immune infiltration. The CTD database was used to predict related drugs that inhibit TIMM8A, and the PubChem database was used to determine the molecular structure of potentially effective small-molecule drugs. Results: The expression of TIMM8A in breast cancer tissues was significantly higher than that in adjacent normal tissues. ROC curve analysis showed that the AUC value of TIMM8A was 0.679. Kaplan-Meier analysis showed that breast cancer patients with high TIMM8A expression had a worse prognosis than those with low TIMM8A expression (overall survival HR = 1.83 (1.31 - 2.54), P < 0.001; 148.5 months vs. 115.4 months, P < 0.001). Methylation levels at seven CpG sites were associated with prognosis. Correlation analysis showed that TIMM8A expression was associated with tumor immune cell infiltration. TIMM8A was significantly positively correlated with PD-L1 and CTLA-4 in BC.
In addition, CTD database analysis identified 15 small-molecule drugs that target TIMM8A, such as cyclosporine, leflunomide, and tretinoin, which might be effective therapies for targeted inhibition of TIMM8A. Conclusion: In breast cancer, up-regulated TIMM8A was significantly related to a lower survival rate and higher immune invasiveness. Our research showed that TIMM8A could be used as a biomarker for poor prognosis of breast cancer and as a potential target of immunotherapy.

7.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36151744

ABSTRACT

The identification of disease-causing genes is critical for a mechanistic understanding of disease etiology and for clinical intervention in disease prevention and treatment. Yet the existing approaches to this question are inadequate in accuracy and efficiency, demanding computational methods with higher identification power. Here, we proposed a new method called DGHNE to identify disease-causing genes through a heterogeneous biomedical network empowered by network enhancement. First, a disease-disease association network was constructed from the cosine similarity scores between phenotype annotation vectors of diseases, and a new heterogeneous biomedical network was constructed by using disease-gene associations to connect the disease-disease network and gene-gene network. Then, the heterogeneous biomedical network was further enhanced by using network embedding based on Gaussian random projection. Finally, network propagation was used to identify candidate genes in the enhanced network. We applied DGHNE together with five other methods to the most recent disease-gene association database, DisGeNET. Compared with all other methods, DGHNE displayed the highest area under the receiver operating characteristic curve and the precision-recall curve, as well as the highest precision and recall, both in global 5-fold cross-validation and in predicting new disease-gene associations. We further applied DGHNE to identify the candidate causal genes of Parkinson's disease and diabetes mellitus, and the genes connecting hyperglycemia and diabetes mellitus. In all cases, the predicted causal genes were enriched in disease-associated Gene Ontology terms and Kyoto Encyclopedia of Genes and Genomes pathways, and the gene-disease associations were strongly supported by independent experimental studies.
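The final propagation step can be sketched with random walk with restart, one common form of network propagation. The abstract does not say which propagation scheme DGHNE uses, so this is an assumption made for illustration; the function name, the restart probability of 0.5, and the requirement that every node has at least one edge are likewise hypothetical choices.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate scores from seed nodes over a symmetric weighted network.

    adjacency : square list-of-lists with non-negative edge weights
    seeds     : distinct indices of known disease genes (restart targets)
    Iterates p <- (1 - restart) * W p + restart * p0 with W column-normalized.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    if any(s == 0 for s in col_sums):
        raise ValueError("every node needs at least one edge")
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(adjacency[i][j] / col_sums[j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p
```

Non-seed nodes are then ranked by their steady-state score, so genes that sit close to many known disease genes in the network rise to the top of the candidate list.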


Subject(s)
Computational Biology , Gene Regulatory Networks , Computational Biology/methods , Gene Ontology , ROC Curve , Phenotype , Algorithms
8.
Comput Biol Med ; 146: 105697, 2022 07.
Article in English | MEDLINE | ID: mdl-35697529

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) provide exciting opportunities for transcriptome analysis at single-cell resolution. Clustering individual cells is a key step to reveal cell subtypes and infer cell lineage in scRNA-seq analysis. Although many dedicated algorithms have been proposed, clustering quality remains a computational challenge for scRNA-seq data, which is exacerbated by inflated zero counts due to various technical noise. To address this challenge, we assess the combinations of nine popular dropout imputation methods and eight clustering methods on a collection of 10 well-annotated scRNA-seq datasets with different sample sizes. Our results show that (i) imputation algorithms do typically improve the performance of clustering methods, and the quality of data visualization using t-Distributed Stochastic Neighbor Embedding; and (ii) the performance of a particular combination of imputation and clustering methods varies with dataset size. For example, the combination of single-cell analysis via expression recovery and Sparse Subspace Clustering (SSC) methods usually works well on smaller datasets, while the combination of adaptively-thresholded low-rank approximation and single-cell interpretation via multikernel learning (SIMLR) usually achieves the best performance on larger datasets.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Algorithms , Cluster Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
9.
Front Pharmacol ; 13: 865065, 2022.
Article in English | MEDLINE | ID: mdl-35370663

ABSTRACT

Pulmonary fibrosis is a chronic, progressive and irreversible heterogeneous disease of the pulmonary interstitial tissue. Its incidence is increasing worldwide year by year and will rise further because of the COVID-19 pandemic. However, there is at present no safe and effective treatment for this disease, so finding drugs with high efficacy and fewer adverse reactions is of great value. Natural astragalus polysaccharide has an anti-pulmonary-fibrosis pharmacological effect with few toxic and side effects, but its mechanism is not yet clear. Using network pharmacology and molecular docking, this study analyzes the mechanism of Astragalus polysaccharides in treating pulmonary fibrosis, providing a theoretical basis for their further clinical application. The active components of Astragalus polysaccharides were screened using the Swisstarget database, and the targets related to pulmonary fibrosis were screened using the GeneCards database. Protein-protein interaction network analysis and molecular docking were carried out to verify the docking affinity of the active ingredients. Through screening, we obtained 92 potential targets of Astragalus polysaccharides for treating pulmonary fibrosis, including 11 core targets. Astragalus polysaccharides act on multiple targets and multiple pathways, and their mechanism of action may involve regulating the expression of VCAM1, RELA, CDK2, JUN, CDK1, HSP90AA1, NOS2, SOD1, CASP3, AHSA1, PTGER3 and other genes during the development of pulmonary fibrosis.

10.
Front Oncol ; 11: 763527, 2021.
Article in English | MEDLINE | ID: mdl-34900711

ABSTRACT

Many diseases are accompanied by changes in certain biochemical indicators, called biomarkers, in cells or tissues. A variety of biomarkers, including proteins, nucleic acids, antibodies, and peptides, have been identified. Tumor biomarkers have been widely used in cancer risk assessment, early screening, diagnosis, prognosis, treatment, and progression monitoring. For example, the number of circulating tumor cells (CTCs) is a prognostic indicator of overall survival in breast cancer, and tumor mutation burden (TMB) can be used to predict the efficacy of immune checkpoint inhibitors. Currently, clinical methods such as polymerase chain reaction (PCR) and next-generation sequencing (NGS) are mainly used to evaluate these biomarkers, but they are time-consuming and expensive. Pathological image analysis is an essential tool in medical research, disease diagnosis and treatment, functioning by extracting important physiological and pathological information or knowledge from medical images. Recently, deep learning-based analysis of pathological images and morphology to predict tumor biomarkers has attracted great attention from both the medical imaging and machine learning communities, as this combination not only reduces the burden on pathologists but also saves cost and time. Therefore, it is necessary to summarize the current workflow for processing pathological images and the key steps and methods used at each stage, including: (1) pre-processing of pathological images, (2) image segmentation, (3) feature extraction, and (4) feature model construction. This will help people choose better and more appropriate medical image processing methods when predicting tumor biomarkers.

11.
Front Genet ; 12: 730519, 2021.
Article in English | MEDLINE | ID: mdl-34777467

ABSTRACT

Illumina is the leading sequencing platform in the next-generation sequencing (NGS) market globally. In recent years, MGI Tech has presented a series of new sequencers, including DNBSEQ-T7, MGISEQ-2000 and MGISEQ-200. As a complex application of NGS, cancer-detecting panels place increasing demands on the accuracy and sensitivity of sequencing and data analysis. In this study, we used the same capture DNA libraries constructed based on the Illumina protocol to evaluate the performance of the Illumina NextSeq 500 and MGISEQ-2000 sequencing platforms. We found that the two platforms were highly consistent in the results of hotspot mutation analysis; more importantly, we found that there was a significant loss of fragments in the 101-133 bp size range on the MGISEQ-2000 sequencing platform for Illumina libraries, but not for the capture DNA libraries prepared based on the MGISEQ protocol. This phenomenon may indicate fragment selection or low fragment ligation efficiency during the DNA circularization step, which is unique to the MGISEQ-2000 sequencing platform. In conclusion, these different sequencing libraries and corresponding sequencing platforms are compatible with each other, but protocol and platform selection need to be carefully evaluated in light of the research purpose.

12.
Front Oncol ; 11: 711225, 2021.
Article in English | MEDLINE | ID: mdl-34367996

ABSTRACT

Drug repositioning is a new way of applying the existing therapeutics to new disease indications. Due to the exorbitant cost and high failure rate in developing new drugs, the continued use of existing drugs for treatment, especially anti-tumor drugs, has become a widespread practice. With the assistance of high-throughput sequencing techniques, many efficient methods have been proposed and applied in drug repositioning and individualized tumor treatment. Current computational methods for repositioning drugs and chemical compounds can be divided into four categories: (i) feature-based methods, (ii) matrix decomposition-based methods, (iii) network-based methods, and (iv) reverse transcriptome-based methods. In this article, we comprehensively review the widely used methods in the above four categories. Finally, we summarize the advantages and disadvantages of these methods and indicate future directions for more sensitive computational drug repositioning methods and individualized tumor treatment, which are critical for further experimental validation.

13.
Front Cell Dev Biol ; 9: 619330, 2021.
Article in English | MEDLINE | ID: mdl-34012960

ABSTRACT

Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose primary tumor site cannot be identified. CUP accounts for approximately 5% of cancer cases in the United States and usually has an unfavorable prognosis, making it a major threat to public health. Traditional methods of identifying the tissue-of-origin (TOO) of CUP, such as immunohistochemistry, can resolve only around 20% of CUP patients. In recent years, a growing number of studies have suggested that the problem can be addressed by integrating machine learning techniques with big biomedical data involving multiple types of biomarkers, including epigenetic, genetic, and gene expression profiles, such as DNA methylation. Different biomarkers play different roles in cancer research; for example, genomic mutations in a patient's tumor could point to specific anticancer drugs for treatment, while DNA methylation and copy number variation could reveal the tumor tissue of origin and molecular classification. However, there has been no systematic comparison of which biomarker is better at identifying the cancer type and site of origin. In addition, it might be possible to further improve the inference accuracy by integrating multiple types of biomarkers. In this study, we used primary tumor data rather than metastatic tumor data. Although the use of primary tumors may introduce some bias into our classification model, their tissues of origin are known. In addition, previous studies have suggested that a CUP prediction model built from primary tumors can efficiently predict the TOO of metastatic cancers (Lal et al., 2013; Brachtel et al., 2016). We systematically compared the performance of three types of biomarkers, DNA methylation, gene expression profile, and somatic mutation, as well as their combinations, in inferring the TOO of CUP patients.
First, we downloaded the gene expression profile, somatic mutation and DNA methylation data of 7,224 tumor samples across 21 common cancer types from The Cancer Genome Atlas (TCGA) and generated seven different feature matrices through various combinations. Second, we performed feature selection by the Pearson correlation method. The selected features for each matrix were used to build an XGBoost multi-label classification model, an algorithm proven effective in a few previous studies, to infer cancer TOO. The performance of each biomarker and combination was compared by 10-fold cross-validation. Our results showed that the TOO tracing accuracy using gene expression profile was the highest, followed by DNA methylation, while somatic mutation performed the worst. Meanwhile, we found that simply combining multiple biomarkers does little to improve prediction accuracy.

14.
Article in English | MEDLINE | ID: mdl-32850691

ABSTRACT

Sequencing-based identification of tumor tissue-of-origin (TOO) is critical for patients with cancer of unknown primary lesions. Even if the TOO of a tumor can be diagnosed by clinicopathological observation, reevaluation by computational methods can help avoid misdiagnosis. In this study, we developed a neural network (NN) framework using the expression of a 150-gene panel to infer the TOO for 15 common solid tumor types, including lung, breast, liver, colorectal, gastroesophageal, ovarian, cervical, endometrial, pancreatic, bladder, head and neck, thyroid, prostate, kidney, and brain cancers. To begin with, we downloaded the RNA-Seq data of 7,460 primary tumor samples across the above-mentioned 15 cancer types, with each type having between 142 and 1,052 samples, from The Cancer Genome Atlas. Then, we performed feature selection by the Pearson correlation method and analyzed the resulting 150-gene panel; the genes were significantly enriched in GO:2001242 (regulation of intrinsic apoptotic signaling pathway), GO:0009755 (hormone-mediated signaling pathway) and other similar functions. Next, we developed a novel NN model using the 150 genes to predict the TOO for the 15 cancer types. The average prediction sensitivity and precision of the framework are 93.36 and 94.07%, respectively, for the 7,460 tumor samples based on 10-fold cross-validation; moreover, the prediction sensitivity and precision for a few specific cancers, such as prostate cancer, reached 100%. We also tested the trained model on a 20-sample independent dataset of metastatic tumors and achieved 80% accuracy. In summary, we present here a highly accurate method to infer tumor TOO, which has potential for clinical implementation.

15.
Article in English | MEDLINE | ID: mdl-32850708

ABSTRACT

Data quality control and preprocessing are often the first step in processing next-generation sequencing (NGS) data of tumors. They not only help evaluate the quality of sequencing data but also help obtain high-quality data for downstream analysis. However, by comparing the analysis results obtained after preprocessing with Cutadapt, FastP, and Trimmomatic against those from raw sequencing data, we found that the frequency of mutation detection showed some fluctuations and differences, and that preprocessing directly led to erroneous human leukocyte antigen (HLA) typing results. We believe our research demonstrates the impact of data preprocessing steps on downstream data analysis results. We hope that it can promote the development or optimization of better data preprocessing methods, so that downstream analysis can be more accurate.

16.
Article in English | MEDLINE | ID: mdl-32850745

ABSTRACT

Circulating tumor cells (CTCs) derived from primary tumors and/or metastatic tumors are markers for tumor prognosis, and can also be used to monitor therapeutic efficacy and tumor recurrence. CTC enrichment and screening can be automated, but the final counting of CTCs currently requires manual intervention. This not only requires the participation of experienced pathologists but is also prone to human misjudgment. Medical image recognition based on machine learning can effectively reduce the workload and improve the level of automation. We therefore use machine learning to identify CTCs. First, we collected the CTC test results of 600 patients. After immunofluorescence staining, each picture presented a positive CTC cell nucleus and several negative controls. The images of CTCs were then segmented by image denoising, image filtering, edge detection, and image dilation and erosion techniques using Python's OpenCV library. Subsequently, traditional image recognition methods and machine learning were used to identify CTCs. The machine learning algorithms were implemented by training deep convolutional neural networks. We took 2300 cells from 600 patients for training and testing. About 1300 cells were used for training and the others were used for testing. The sensitivity and specificity of recognition reached 90.3 and 91.3%, respectively. We will further revise our models, hoping to achieve higher sensitivity and specificity.

17.
Biochim Biophys Acta Mol Basis Dis ; 1866(11): 165916, 2020 11 01.
Article in English | MEDLINE | ID: mdl-32771416

ABSTRACT

Carcinoma of unknown primary (CUP), defined as metastatic cancer of unknown origin, occurs in 3-5 per 100 cancer patients in the United States. The heterogeneity and metastasis of cancer bring great difficulties to the follow-up diagnosis and treatment of CUP. Multiple methods have been proposed to find the tissue-of-origin (TOO) of CUP. However, the accuracies of computed tomography (CT) and positron emission tomography (PET) in identifying the TOO were 20%-27% and 24%-40%, respectively, which is not enough for determining targeted therapies. In this study, we provide a machine learning framework to trace tumor tissue origin by using gene length-normalized somatic mutation sequencing data. Somatic mutation data were downloaded from the Data Portal (Release 28) of the International Cancer Genome Consortium (ICGC), and 4909 samples for 13 cancers were used to identify the primary site of cancers. Optimal results were obtained based on a 600-gene set by using the random forest algorithm with 10-fold cross-validation, and the average accuracy and F1-score were 0.8822 and 0.8886, respectively, across 13 types of cancer. In conclusion, we provide an effective computational framework to infer cancer tissue-of-origin by combining DNA sequencing and machine learning techniques, which is promising for assisting the clinical diagnosis of cancers.
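The phrase "gene length-normalized somatic mutation" can be illustrated with a small sketch. The abstract does not specify the normalization, so the version below, mutations per kilobase of gene length, is an assumption; the function name and the gene lengths used in the example are hypothetical round numbers, not real annotations.

```python
def length_normalized_counts(mutation_counts, gene_lengths_bp, per_bp=1000):
    """Convert raw per-gene mutation counts into mutations per `per_bp` bases.

    mutation_counts : dict gene -> number of somatic mutations in one sample
    gene_lengths_bp : dict gene -> gene length in base pairs
    Dividing by length keeps very long genes from dominating the feature
    vector simply because they offer more positions to mutate.
    """
    features = {}
    for gene, count in mutation_counts.items():
        length = gene_lengths_bp[gene]
        if length <= 0:
            raise ValueError(f"non-positive length for gene {gene}")
        features[gene] = count * per_bp / length
    return features


# Illustrative lengths only (hypothetical values).
example = length_normalized_counts(
    {"GENE_SHORT": 2, "GENE_LONG": 2},
    {"GENE_SHORT": 1000, "GENE_LONG": 100000},
)
```

In the example, two genes with the same raw count end up with very different feature values, and it is the normalized values that would feed a classifier such as the random forest mentioned in the abstract.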


Subject(s)
DNA/genetics , Machine Learning , Neoplasms, Unknown Primary/genetics , Algorithms , Mutation/genetics , Positron-Emission Tomography , Sequence Analysis, DNA
18.
Article in English | MEDLINE | ID: mdl-32509741

ABSTRACT

Metastatic cancers require further diagnosis to determine their primary tumor sites. However, according to statistics from the United States, the tissue-of-origin of around 5% of tumors cannot be identified by routine medical diagnosis. With the development of machine learning techniques and the accumulation of large cancer datasets from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), it is now feasible to predict cancer tissue-of-origin with computational tools. A metastatic tumor inherits characteristics from its tissue-of-origin, and both gene expression profiles and somatic mutations have tissue specificity. Thus, we developed a computational framework to infer tumor tissue-of-origin by integrating both gene mutation and expression (TOOme). Specifically, we first perform feature selection on both gene expressions and mutations by a random forest method. The selected features are then used to build a multi-label classification model to infer cancer tissue-of-origin. We adopt a few popular multiple-label classification methods, which are compared by 10-fold cross-validation. We applied TOOme to the TCGA data containing 7,008 non-metastatic samples across 20 solid tumors. Seventy-four genes from the gene expression profiles and six genes from the gene mutations are selected by the random forest process; they can be divided into two categories: (1) cancer type specific genes and (2) those expressed or mutated in several cancers with different levels of expression or mutation rates. Function analysis indicates that the selected genes are significantly enriched in gland development, urogenital system development, hormone metabolic process, thyroid hormone generation, prostate hormone generation, and so on. Among the multiple-label classification methods compared, random forest performs the best, with a 10-fold cross-validation prediction accuracy of 96%. We also use the 19 metastatic samples from TCGA and 256 cancer samples downloaded from GEO as independent testing data, for which TOOme achieves a prediction accuracy of 89%. The cross-validation accuracy is better than that obtained using gene expression (95%) or gene mutation (53%) alone. In conclusion, TOOme provides a quick yet accurate alternative to traditional medical methods for inferring cancer tissue-of-origin. In addition, the methods combining somatic mutation and gene expression outperform those using gene expression or mutation alone.
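
The two-stage workflow described in this abstract (random forest feature selection, then a classifier scored by 10-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the TOOme implementation: it assumes scikit-learn is available, uses synthetic data in place of TCGA expression and mutation features, treats the task as ordinary multi-class classification, and the median importance threshold and tree counts are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a samples x features matrix
# (hypothetical: 400 samples, 60 features, 4 tissue classes).
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=10, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Stage 1: keep features whose random forest importance is at or above the median.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
# Stage 2: classify tissue-of-origin from the selected features.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)

# Putting both stages in one pipeline re-runs feature selection inside each
# training fold, so the held-out fold never influences which features are kept.
pipeline = make_pipeline(selector, classifier)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"10-fold CV accuracy: {mean_accuracy:.3f}")
```

Wrapping the selector and classifier in a single pipeline is a design choice made here to avoid leaking test-fold information into feature selection; the abstract does not state how the original study handled this.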

19.
Biomed Res Int ; 2020: 6782046, 2020.
Article in English | MEDLINE | ID: mdl-32462012

ABSTRACT

Gene coexpression analysis is widely used to infer gene modules associated with diseases and other clinical traits. However, a systematic view and comparison of gene coexpression networks and modules across a cohort of tissues has been largely ignored. In this study, we first construct gene coexpression networks and modules for 52 GTEx tissues and cell lines. The network modules are enriched in many tissue-common functions, such as organelle membrane, and in tissue-specific functions. We then study the correlation of tissues from the network point of view. The network modules of most tissues are significantly correlated, indicating a generally similar network pattern across tissues. However, the level of similarity among the tissues differs. Tissues that are physically close to each other seem to be more similar in their coexpression networks. For example, the two adjacent tissues fallopian tube and bladder have the most significant Fisher's exact test p value (8.54E-291) among all tissue pairs. It is known that immune-associated modules are frequently identified in coexpression modules. In this study, we found immune modules in many tissues, such as liver, kidney cortex, lung, uterus, adipose subcutaneous, and adipose visceral omentum. However, not all tissues have immune-associated modules; brain cerebellum, for example, does not. Finally, by clique analysis, we identify the largest clique of modules, in which the genes in each module significantly overlap with those in the other modules. We are able to find a clique of size 40 (out of 52 tissues), indicating a strong correlation of modules across tissues. It is not surprising that the 40 modules are most commonly enriched in immune-related functions.
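
The overlap test and clique search mentioned in this abstract can be illustrated with a small sketch. It is an assumption-laden toy, not the authors' pipeline: the one-sided Fisher's exact test is computed as a hypergeometric upper tail, the gene sets, background size, and 0.01 cutoff are hypothetical, and the brute-force clique search is only practical for a handful of modules.

```python
from itertools import combinations
from math import comb


def overlap_p_value(module_a, module_b, n_background):
    """One-sided Fisher's exact test: probability of an overlap at least as
    large as the observed one if both modules were drawn at random."""
    a, b = len(module_a), len(module_b)
    k = len(module_a & module_b)
    total = comb(n_background, b)
    tail = sum(
        comb(a, i) * comb(n_background - a, b - i)
        for i in range(k, min(a, b) + 1)
    )
    return tail / total


def largest_clique(nodes, edges):
    """Brute-force search for the largest fully connected group of nodes."""
    for size in range(len(nodes), 0, -1):
        for group in combinations(nodes, size):
            if all(frozenset(p) in edges for p in combinations(group, 2)):
                return set(group)
    return set()


# Hypothetical modules over a background of 100 genes named g0..g99.
modules = {
    "liver": {f"g{i}" for i in range(10)},
    "lung": {f"g{i}" for i in range(8)} | {"g20", "g21"},
    "kidney": {f"g{i}" for i in range(7)} | {"g30", "g31", "g32"},
    "brain": {f"g{i}" for i in range(50, 60)},
}

# Connect two modules when their gene overlap is significant (arbitrary cutoff).
edges = {
    frozenset((s, t))
    for s, t in combinations(modules, 2)
    if overlap_p_value(modules[s], modules[t], 100) < 0.01
}
clique = largest_clique(list(modules), edges)
print(sorted(clique))
```

For 52 tissues an exhaustive search like this would be far too slow; a dedicated maximum-clique routine from a graph library would be the usual substitute.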


Subject(s)
Cluster Analysis , Gene Expression Regulation , Gene Regulatory Networks , Adipose Tissue , Brain , Female , Gene Expression Profiling , Gene Ontology , Humans , Kidney , Liver , Lung , Uterus
20.
Front Genet ; 11: 147, 2020.
Article in English | MEDLINE | ID: mdl-32180799

ABSTRACT

Human blood contains cell-free DNA (cfDNA), and circulating tumor-derived DNA (ctDNA) is widely used in cancer diagnosis and treatment. However, it is still difficult to efficiently and accurately distinguish specific ctDNA from normal cfDNA in blood samples from cancer patients. In this study, ctDNA fragment length distribution analysis showed that ctDNA fragments are frequently shorter than normal cfDNA fragments, which is consistent with previous findings. Interestingly, the ctDNA fragment length was found to be partially associated with the mutant allele frequency: a low mutant allele frequency (< ~0.6%) was associated with a longer ctDNA fragment length compared to normal cfDNA. The findings of this study contribute to improving the detection of low-frequency tumor mutations.
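
A fragment length comparison of the kind described in this abstract might look like the following sketch. Everything in it is illustrative: the fragment lengths are made-up toy values, the 150 bp cutoff for a "short" fragment is an assumption, and the abstract does not say which summary statistics the study used.

```python
from statistics import median


def short_fraction(lengths, cutoff=150):
    """Fraction of fragments strictly shorter than the cutoff (in bp)."""
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


# Hypothetical fragment lengths (bp) for reads carrying a tumor mutation
# and for reads carrying the reference allele at the same site.
mutant_lengths = [132, 140, 145, 148, 151, 160, 166, 170]
wildtype_lengths = [150, 158, 163, 166, 167, 170, 175, 181]

mutant_median = median(mutant_lengths)
wildtype_median = median(wildtype_lengths)

print(f"median mutant fragment:    {mutant_median} bp")
print(f"median wild-type fragment: {wildtype_median} bp")
print(f"short mutant fraction:     {short_fraction(mutant_lengths):.2f}")
print(f"short wild-type fraction:  {short_fraction(wildtype_lengths):.2f}")
```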
