1.
Nutrients ; 16(10)2024 May 20.
Article in English | MEDLINE | ID: mdl-38794775

ABSTRACT

BACKGROUND: This study aims to identify unique metabolomics biomarkers associated with Type 2 Diabetes (T2D) and to develop an accurate diagnostic model using tree-based machine learning (ML) algorithms integrated with bioinformatics techniques. METHODS: Univariate and multivariate analyses, including fold change, receiver operating characteristic (ROC) curve analysis, and Partial Least-Squares Discriminant Analysis (PLS-DA), were used to identify biomarker metabolites whose concentrations differed significantly in T2D patients. Three tree-based algorithms that have demonstrated robustness in high-dimensional data analysis [eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Adaptive Boosting (AdaBoost)] were used to create a diagnostic model for T2D. RESULTS: In the biomarker discovery process, validated with three different approaches, pyruvate, D-rhamnose, AMP, pipecolate, tetradecenoic acid, tetradecanoic acid, dodecanediothioic acid, prostaglandin E3/D3 (isobars), ADP, and hexadecenoic acid were identified as potential biomarkers for T2D. The XGBoost model [accuracy = 0.831, F1-score = 0.845, sensitivity = 0.882, specificity = 0.774, positive predictive value (PPV) = 0.811, negative predictive value (NPV) = 0.857, and area under the ROC curve (AUC) = 0.887] had slightly higher performance measures than the other models. CONCLUSIONS: ML integrated with bioinformatics techniques offers an accurate and promising approach to T2D candidate biomarker discovery. The XGBoost model can successfully distinguish T2D based on metabolites.
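The performance measures listed in this abstract all derive from one 2x2 confusion matrix. The following is a minimal sketch of those definitions, not the authors' code. The abstract does not report the confusion matrix, so the counts in the example (30 true positives, 7 false positives, 24 true negatives, 4 false negatives) are an assumption, chosen only because they round to the reported values. The names `ConfusionCounts` and `diagnostic_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # patients correctly predicted as T2D
    fp: int  # controls wrongly predicted as T2D
    tn: int  # controls correctly predicted as non-T2D
    fn: int  # patients wrongly predicted as non-T2D


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Compute the standard diagnostic measures from confusion-matrix counts."""
    sensitivity = c.tp / (c.tp + c.fn)   # recall for the positive class
    specificity = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)           # precision
    npv = c.tn / (c.tn + c.fn)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


# Assumed counts (not given in the abstract), for illustration only.
example = diagnostic_metrics(ConfusionCounts(tp=30, fp=7, tn=24, fn=4))
rounded = {name: round(value, 3) for name, value in example.items()}
```

The AUC cannot be recovered from a single confusion matrix, because it depends on the ranking of predicted scores across all thresholds.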


Subjects
Biomarkers , Computational Biology , Diabetes Mellitus, Type 2 , Machine Learning , Metabolomics , Diabetes Mellitus, Type 2/metabolism , Humans , Biomarkers/blood , Computational Biology/methods , Pilot Projects , Male , Middle Aged , Female , Metabolomics/methods , ROC Curve , Algorithms , Aged , Adult
2.
Heliyon ; 10(1): e23195, 2024 Jan 15.
Article in English | MEDLINE | ID: mdl-38163104

ABSTRACT

Aims: Multi-omics data integration has emerged as a prominent avenue within the healthcare industry, with substantial potential for enhancing predictive models. This study is motivated by the need to advance prognostic methodologies in cancer diagnosis, an area where precision is pivotal for effective clinical decision-making. In this context, the study introduces a methodology that integrates copy number alteration (CNA), DNA methylation, and gene expression data. Methods: The three omics data types were merged into a two-dimensional (2D) map using the PaCMAP dimensionality reduction technique. Using the RGB coloring scheme, a visual representation of the integration was produced from the values of the three omics for each sample. The colored 2D maps were then fed into a convolutional neural network (CNN) to predict the Gleason score. Results: By integrating multi-omics data with the CNN architecture, the proposed model outperforms the state-of-the-art i-SOM-GSN model, with an accuracy of 98.89% and an AUC of 0.9996. Conclusion: This study demonstrates the effectiveness of multi-omics data integration in predicting health outcomes. The proposed methodology, combining PaCMAP for dimensionality reduction, RGB coloring for visualization, and a CNN for prediction, offers a comprehensive framework for integrating heterogeneous omics data and improving predictive accuracy. These findings contribute to the advancement of personalized medicine and have the potential to aid clinical decision-making for prostate cancer patients.
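The RGB coloring step can be pictured as writing one pixel per gene, with each color channel carrying one omics value. The following is a rough sketch under several assumptions that the abstract does not confirm: that each gene already has an integer grid position (in the study these would come from the PaCMAP embedding), that each omic is min-max scaled to 0-255, and that expression, methylation, and CNA map to red, green, and blue respectively. The function names are hypothetical.

```python
def scale_to_byte(values):
    """Min-max scale a list of numbers to integers in the range 0..255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant channel carries no contrast
    return [round(255 * (v - lo) / (hi - lo)) for v in values]


def rgb_map(coords, expression, methylation, cna, size):
    """Build a size x size image (rows of RGB tuples) for one sample.

    coords[i] is the assumed (x, y) grid position of gene i; the channel
    assignment (R=expression, G=methylation, B=CNA) is an assumption.
    """
    red = scale_to_byte(expression)
    green = scale_to_byte(methylation)
    blue = scale_to_byte(cna)
    image = [[(0, 0, 0) for _ in range(size)] for _ in range(size)]
    for (x, y), r, g, b in zip(coords, red, green, blue):
        image[y][x] = (r, g, b)
    return image


# Toy sample with three genes on a 3x3 grid (made-up values).
toy_image = rgb_map(
    coords=[(0, 0), (1, 1), (2, 0)],
    expression=[0.0, 1.0, 5.0],
    methylation=[0.2, 0.2, 0.2],
    cna=[-2, 0, 3],
    size=3,
)
```

An image built this way could then be passed to any image classifier; the CNN itself is outside the scope of this sketch.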

3.
Diagnostics (Basel) ; 13(23)2023 Nov 21.
Article in English | MEDLINE | ID: mdl-38066735

ABSTRACT

BACKGROUND: Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a complex and debilitating illness with a significant global prevalence, affecting over 65 million individuals. It affects various systems, including the immune, neurological, gastrointestinal, and circulatory systems. Studies have shown abnormalities in immune cell types, increased inflammatory cytokines, and brain abnormalities. Further research is needed to identify consistent biomarkers and develop targeted therapies. This study uses explainable artificial intelligence and machine learning techniques to identify discriminative metabolites for ME/CFS. MATERIAL AND METHODS: The model investigates a metabolomics dataset of 26 ME/CFS patients and 26 healthy controls aged 22-72. The dataset groups 768 metabolites into nine metabolic super-pathways: amino acids, carbohydrates, cofactors, vitamins, energy, lipids, nucleotides, peptides, and xenobiotics. Random forest methods, together with other classifiers, were applied to classify individuals as ME/CFS patients or healthy individuals. The classifiers' performance in the validation step was evaluated with the traditional hold-out validation method as well as with cross-validation and bootstrap methods. Explainable artificial intelligence approaches were applied to clinically explain the prediction decisions of the optimum model. RESULTS: The metabolites C-glycosyltryptophan, oleoylcholine, cortisone, and 3-hydroxydecanoate were determined to be crucial for ME/CFS diagnosis. The random forest model outperformed the other classifiers in ME/CFS prediction under the 1000-iteration bootstrap method, achieving 98% accuracy, precision, recall, and F1 score, a Brier score of 0.01, and 99% AUC. Of the validation approaches, the bootstrap method yielded the best classification results. CONCLUSION: The proposed model accurately classifies ME/CFS patients based on the selected biomarker candidate metabolites. It offers a clear interpretation of risk estimation for ME/CFS, helping physicians understand the significance of key metabolomic features within the model.
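Two ingredients of this evaluation are simple to state in code: the Brier score and a single bootstrap resample. The following is a minimal sketch, not the authors' pipeline. Evaluating on the out-of-bag samples is an assumption, since the abstract does not say how each bootstrap iteration was scored, and the function names are hypothetical. The sample size of 52 matches the 26 patients and 26 controls reported.

```python
import random


def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; unused indices form the out-of-bag set."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


# One resample of a 52-subject dataset; a full run would repeat this 1000 times,
# fit a classifier on in_bag, score it on out_of_bag, and average the scores.
rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(52, rng)

# Toy predictions (made-up values) to show the score's scale: 0 is perfect.
toy_brier = brier_score([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0])
```

A Brier score of 0.01, as reported, means the predicted probabilities were on average very close to the true 0/1 labels.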

4.
Metabolites ; 13(12)2023 Dec 18.
Article in English | MEDLINE | ID: mdl-38132885

ABSTRACT

Diabetic retinopathy (DR), a common ocular microvascular complication of diabetes, contributes significantly to diabetes-related vision loss. This study addresses the need for early diagnosis of DR and precise treatment strategies based on the explainable artificial intelligence (XAI) framework. The study integrated clinical, biochemical, and metabolomic biomarkers associated with the following classes in type 2 diabetes (T2D) patients: non-DR (NDR), non-proliferative diabetic retinopathy (NPDR), and proliferative diabetic retinopathy (PDR). To create machine learning (ML) models, the data were split into a validation set (10%) and a discovery set (90%). The validation dataset was used for the hyperparameter optimization and feature selection stages, while the discovery dataset was used to measure the performance of the models. A 10-fold cross-validation technique was used to evaluate the performance of the ML models. Biomarker discovery was performed using minimum redundancy maximum relevance (mRMR), Boruta, and explainable boosting machine (EBM). The proposed predictive framework compares the results of eXtreme Gradient Boosting (XGBoost), natural gradient boosting for probabilistic prediction (NGBoost), and EBM models in determining the DR subclass. The hyperparameters of the models were optimized using Bayesian optimization. Combining EBM feature selection with XGBoost, the optimal model achieved (91.25 ± 1.88)% accuracy, (89.33 ± 1.80)% precision, (91.24 ± 1.67)% recall, (89.37 ± 1.52)% F1-score, and (97.00 ± 0.25)% area under the ROC curve (AUROC). According to the EBM explanation, the six most important biomarkers in determining the course of DR were tryptophan (Trp), phosphatidylcholine diacyl C42:2 (PC.aa.C42.2), butyrylcarnitine (C4), tyrosine (Tyr), hexadecanoyl carnitine (C16), and total dimethylarginine (DMA). The identified biomarkers may provide a better understanding of the progression of DR, paving the way for more precise and cost-effective diagnostic and treatment strategies.

5.
Metabolites ; 13(5)2023 Apr 25.
Article in English | MEDLINE | ID: mdl-37233630

ABSTRACT

Colorectal cancer (CRC) is one of the most common and lethal cancers, and metabolites play a significant role in the development of this complex disease. This study aimed to identify potential biomarkers and targets for the diagnosis and treatment of CRC using high-throughput metabolomics. Metabolite data extracted from the feces of CRC patients and healthy volunteers were normalized with median normalization and Pareto scaling for multivariate analysis. Univariate ROC analysis, the t-test, and analysis of fold changes (FCs) were applied to identify biomarker candidate metabolites in CRC patients. Only metabolites that satisfied both statistical criteria (false-discovery-rate-corrected p-value < 0.05 and AUC > 0.70) were considered in the further analysis. Multivariate analysis was performed on the biomarker candidate metabolites using linear support vector machines (SVM), partial least squares discriminant analysis (PLS-DA), and random forests (RF). The model identified five biomarker candidate metabolites that were significantly differentially expressed (adjusted p-value < 0.05) in CRC patients compared to healthy controls: succinic acid, aminoisobutyric acid, butyric acid, isoleucine, and leucine. Aminoisobutyric acid was the metabolite with the highest discriminatory potential in CRC, with an AUC of 0.806 (95% CI = 0.700-0.897), and was down-regulated in CRC patients. The SVM model showed the strongest discrimination capacity for the five metabolites selected in the CRC screening, with an AUC of 0.985 (95% CI: 0.94-1).
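The two-criterion filter described above (FDR-corrected p-value < 0.05 and AUC > 0.70) can be sketched in plain Python. This is an illustration under assumptions, not the authors' workflow: it uses the Benjamini-Hochberg procedure as the FDR correction (the abstract does not name the method), it treats the AUC as direction-agnostic so that down-regulated metabolites also qualify, and all function names are hypothetical.

```python
def roc_auc(case_values, control_values):
    """AUC as the fraction of case/control pairs where the case value is higher (ties count 0.5)."""
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def select_candidates(names, pvalues, aucs, alpha=0.05, min_auc=0.70):
    """Keep metabolites passing both the adjusted p-value and AUC thresholds."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        name
        for name, q, auc in zip(names, adjusted, aucs)
        if q < alpha and max(auc, 1.0 - auc) > min_auc
    ]


# Made-up example: m2 is down-regulated (AUC below 0.5) but still discriminative.
selected = select_candidates(["m1", "m2", "m3"], [0.001, 0.002, 0.5], [0.85, 0.25, 0.90])
```

In this toy run `m3` is dropped despite its high AUC because its adjusted p-value fails the first criterion, which is the point of requiring both.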

7.
Comput Biol Med ; 154: 106619, 2023 03.
Article in English | MEDLINE | ID: mdl-36738712

ABSTRACT

AIM: COVID-19 has revealed the need for fast and reliable methods to assist clinicians in diagnosing the disease. This article presents a model that applies explainable artificial intelligence (XAI) methods based on machine learning techniques to COVID-19 metagenomic next-generation sequencing (mNGS) samples. METHODS: The data set comprises expression levels of 15,979 genes from 234 patients, of whom 141 (60.3%) were COVID-19 negative and 93 (39.7%) were COVID-19 positive. The least absolute shrinkage and selection operator (LASSO) method was applied to select genes associated with COVID-19. The Support Vector Machine - Synthetic Minority Oversampling Technique (SVM-SMOTE) method was used to handle the class imbalance problem. Logistic regression (LR), SVM, random forest (RF), and extreme gradient boosting (XGBoost) models were constructed to predict COVID-19. An explainable approach based on local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) was applied to determine COVID-19-associated biomarker candidate genes and improve the final model's interpretability. RESULTS: For the diagnosis of COVID-19, the XGBoost model (accuracy: 0.930) outperformed the RF (accuracy: 0.912), SVM (accuracy: 0.877), and LR (accuracy: 0.912) models. According to SHAP, the three most important genes associated with COVID-19 were IFI27, LGR6, and FAM83A. The LIME results showed that a high level of IFI27 gene expression in particular increased the probability of the positive class. CONCLUSIONS: The proposed model (XGBoost) was able to predict COVID-19 successfully. The results show that machine learning combined with LIME and SHAP can explain the biomarker prediction for COVID-19 and give clinicians an intuitive, interpretable view of the impact of risk factors in the model.


Subjects
Artificial Intelligence , COVID-19 , Humans , COVID-19/diagnosis , COVID-19/genetics , Genetic Markers , Risk Factors , Neoplasm Proteins
8.
Cancer Inform ; 21: 11769351221124205, 2022.
Article in English | MEDLINE | ID: mdl-36187912

ABSTRACT

Introduction: Multi-omics data integration provides richer insight than individual omics data analyzed separately. Various promising integrative approaches have been used to analyze multi-omics data for biomedical applications, including disease prediction, disease subtyping, biomarker prediction, and others. Methods: In this paper, we introduce a multi-omics data integration method that combines a gene similarity network (GSN) based on uniform manifold approximation and projection (UMAP) with convolutional neural networks (CNNs). The method uses UMAP to embed gene expression, DNA methylation, and copy number alteration (CNA) data in a lower dimension, creating two-dimensional RGB images. Gene expression is used as a reference to construct the GSN, and the other omics data are then integrated with the gene expression for better prediction. We used CNNs to predict the Gleason score levels of prostate cancer patients and the tumor stage of breast cancer patients. Results: The proposed model achieved near-perfect performance, with accuracy above 99% and all other performance measures at the same level. It outperformed the state-of-the-art iSOM-GSN model, which constructs the GSN map based on the self-organizing map (SOM). Conclusion: The results show that UMAP as an embedding technique can integrate multi-omics maps into the prediction model better than SOM. The proposed model can also be applied to build multi-omics prediction models for other types of cancer.

9.
Urol Oncol ; 40(5): 191.e15-191.e20, 2022 05.
Article in English | MEDLINE | ID: mdl-35307289

ABSTRACT

OBJECTIVE: To examine the ability of machine learning methods to predict upgrading of Gleason score on confirmatory magnetic resonance imaging-guided targeted biopsy (MRI-TB) of the prostate in candidates for active surveillance (AS). SUBJECTS AND METHODS: Our database included 592 patients who underwent prostate multiparametric magnetic resonance imaging in the evaluation for active surveillance. Upgrading to significant prostate cancer on MRI-TB was defined as upgrading to G 3+4 (definition 1 - DF1) and 4+3 (DF2). Machine learning classifiers were applied to both classification problems, DF1 and DF2. RESULTS: Univariate analysis showed that older age and the number of positive cores on pre-MRI-TB were positively correlated with upgrading by DF1 (P-value ≤ 0.05). Upgrading by DF2 was positively correlated with age and the number of positive cores and negatively correlated with body mass index. The AdaBoost model was highly predictive of upgrading by DF1 (AUC 0.952), while for upgrading by DF2 the Random Forest model had a slightly lower but still excellent prediction performance (AUC 0.947). CONCLUSION: We show that machine learning has the potential to be integrated into future diagnostic assessments for patients eligible for AS. Training our models on larger multi-institutional databases is needed to confirm our results and improve the accuracy of the models' predictions.


Subjects
Prostatic Neoplasms , Watchful Waiting , Biopsy , Humans , Image-Guided Biopsy/methods , Machine Learning , Magnetic Resonance Imaging/methods , Male , Neoplasm Grading , Prostatic Neoplasms/diagnostic imaging , Prostatic Neoplasms/pathology , Retrospective Studies
10.
Cancers (Basel) ; 14(4)2022 Feb 13.
Article in English | MEDLINE | ID: mdl-35205681

ABSTRACT

The Nottingham Prognostic Index (NPI) is a prognostic measure that predicts survival in operable primary breast cancer. The NPI value is calculated from the size of the tumor, the number of lymph nodes involved, and the tumor grade. Advances in next-generation sequencing have made it possible to measure different biological indicators, collectively called multi-omics data. The availability of multi-omics data raised the challenge of integrating and analyzing these various biological measures to understand disease progression. High-dimensional embedding techniques are used to present the features in a lower dimension, i.e., in a 2-dimensional map. The dataset consists of three omics: gene expression, copy number alteration (CNA), and mRNA from 1885 female patients. The model creates a gene similarity network (GSN) map for each omic using t-distributed stochastic neighbor embedding (t-SNE); the maps are then merged as input to a residual neural network (ResNet) classification model. The aim of this work was to (i) extract multi-omics biomarkers that are associated with the prognosis and prediction of breast cancer survival; and (ii) build a multi-class prediction model for breast cancer NPI classes. We evaluated this model and compared it to different combinations of high-dimensional embedding techniques and neural networks. The proposed model outperformed the other methods, with an accuracy of 98.48% and an area under the curve (AUC) of 0.9999. Findings in the literature confirm associations between some of the extracted omics features and breast cancer prognosis and survival, including CDCA5, IL17RB, MUC2, NOD2 and NXPH4 from the gene expression dataset; MED30, RAD21, EIF3H and EIF3E from the CNA dataset; and CENPA, MACF1, UGT2B7 and SEMA3B from the mRNA dataset.
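For readers unfamiliar with the index, the commonly published NPI formula is 0.2 x tumor size in centimeters, plus a lymph node stage (1 for no positive nodes, 2 for one to three, 3 for four or more), plus the histological grade (1 to 3). The abstract names the three inputs but not the formula, so the sketch below relies on that standard definition as an assumption; the function names are hypothetical, and the NPI class boundaries used in the study are not given here and are therefore left out.

```python
def lymph_node_stage(positive_nodes: int) -> int:
    """Map the number of positive lymph nodes to the NPI node stage (1, 2, or 3)."""
    if positive_nodes < 0:
        raise ValueError("positive_nodes must be non-negative")
    if positive_nodes == 0:
        return 1
    if positive_nodes <= 3:
        return 2
    return 3


def nottingham_prognostic_index(size_cm: float, positive_nodes: int, grade: int) -> float:
    """NPI = 0.2 * tumor size (cm) + lymph node stage + histological grade."""
    if grade not in (1, 2, 3):
        raise ValueError("grade must be 1, 2, or 3")
    if size_cm < 0:
        raise ValueError("size_cm must be non-negative")
    return 0.2 * size_cm + lymph_node_stage(positive_nodes) + grade


# Made-up patients, for illustration only.
low_risk = nottingham_prognostic_index(size_cm=2.0, positive_nodes=0, grade=2)
high_risk = nottingham_prognostic_index(size_cm=3.0, positive_nodes=5, grade=3)
```

A lower NPI indicates a better prognosis; a classifier such as the one described would predict the NPI class from omics data rather than compute the index from these clinical inputs.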

11.
Cell Mol Gastroenterol Hepatol ; 12(5): 1847-1872.e0, 2021.
Article in English | MEDLINE | ID: mdl-34534703

ABSTRACT

BACKGROUND & AIMS: Circadian rhythms are daily physiological oscillations driven by the circadian clock: a 24-hour transcriptional timekeeper that regulates hormones, inflammation, and metabolism. Circadian rhythms are known to be important for health, but whether their loss contributes to colorectal cancer is not known. We tested the nonredundant clock gene Bmal1 in intestinal homeostasis and tumorigenesis, using the Apcmin model of colorectal cancer. METHODS: Bmal1 mutant, epithelium-conditional Bmal1 mutant, and photoperiod (day/night cycle) disrupted mice bearing the Apcmin allele were assessed for tumorigenesis. Tumors and normal nontransformed tissue were characterized. Intestinal organoids were assessed for circadian transcription rhythms by RNA sequencing, and in vivo and organoid assays were used to test Bmal1-dependent proliferation and self-renewal. RESULTS: Loss of Bmal1 or circadian photoperiod increases tumor initiation. In the intestinal epithelium the clock regulates transcripts involved in regeneration and intestinal stem cell signaling. Tumors have no self-autonomous clock function and only weak clock function in vivo. Apcmin clock-disrupted tumors show high Yes-associated protein 1 (Hippo signaling) activity but show low Wnt (Wingless and Int-1) activity. Intestinal organoid assays show that loss of Bmal1 increases self-renewal in a Yes-associated protein 1-dependent manner. CONCLUSIONS: Bmal1 regulates intestinal stem cell pathways, including Hippo signaling, and the loss of circadian rhythms potentiates tumor initiation. Transcript profiling: GEO accession number: GSE157357.


Subjects
ARNTL Transcription Factors/genetics , Cell Transformation, Neoplastic/genetics , Cell Transformation, Neoplastic/metabolism , Circadian Clocks/genetics , Gene Expression Regulation , Signal Transduction , Stem Cells/metabolism , Animals , Cell Self Renewal/genetics , Circadian Rhythm , Hippo Signaling Pathway , Immunohistochemistry , Mice , Mice, Knockout , Mucous Membrane/immunology , Mucous Membrane/metabolism , Mucous Membrane/pathology , Mutation , YAP-Signaling Proteins/metabolism
12.
BMC Bioinformatics ; 21(Suppl 2): 78, 2020 Mar 11.
Article in English | MEDLINE | ID: mdl-32164523

ABSTRACT

BACKGROUND: Finding the tumor location in the prostate is an essential pathological step for prostate cancer diagnosis and treatment. The location of the tumor (its laterality) can be unilateral, where the tumor affects one side of the prostate, or bilateral, where it affects both sides. However, the tumor extent can be overestimated or underestimated by standard screening methods. In this work, a combination of efficient machine learning methods for feature selection and classification is proposed to analyze gene activity and select genes as relevant biomarkers for samples of different laterality. RESULTS: A data set of 450 samples was used in this study. The samples were divided into three laterality classes (left, right, bilateral). The aim of this work is to understand the genomic activity in each class and to find relevant genes as indicators for each class with nearly 99% accuracy. The system identified groups of differentially expressed genes (RTN1, HLA-DMB, MRI1) that are able to differentiate samples among the three classes. CONCLUSION: The proposed method was able to detect sets of genes that can identify different laterality classes. The resulting genes are found to be strongly correlated with disease progression. HLA-DMB and EIF4G2, which are among the genes that identify left laterality, were previously reported to be in the same pathway, the Allograft rejection SuperPath.


Subjects
Gene Expression Regulation, Neoplastic , Machine Learning , Prostatic Neoplasms/pathology , Area Under Curve , Autoantigens/genetics , Autoantigens/metabolism , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Humans , Magnetic Resonance Imaging , Male , Phosphoproteins/genetics , Phosphoproteins/metabolism , Prostate/diagnostic imaging , Prostatic Neoplasms/diagnostic imaging , Prostatic Neoplasms/genetics , ROC Curve , Ribonuclease P/genetics , Ribonuclease P/metabolism , Serine-Arginine Splicing Factors/genetics , Serine-Arginine Splicing Factors/metabolism
13.
Diagnostics (Basel) ; 9(4)2019 Dec 11.
Article in English | MEDLINE | ID: mdl-31835700

ABSTRACT

(1) Background: Prostate cancer is one of the most common cancers affecting men in North America and worldwide. The Gleason score is a pathological grading system used to assess the potential aggressiveness of the disease in prostate tissue. Advancements in computing and next-generation sequencing technology now allow us to study the genomic profiles of patients in association with their different Gleason scores more accurately and effectively. (2) Methods: In this study, we used a novel machine learning method to analyse gene expression of prostate tumours with different Gleason scores and to identify potential genetic biomarkers for each Gleason group. We obtained a publicly available RNA-Seq dataset of a cohort of 104 prostate cancer patients from the National Center for Biotechnology Information's (NCBI) Gene Expression Omnibus (GEO) repository, and categorised patients based on their Gleason scores to create a hierarchy of disease progression. A hierarchical model with standard classifiers at the different Gleason groups, also known as nodes, was developed to identify and predict nodes based on their mRNA or gene expression. In each node, patient samples were analysed via class imbalance and hybrid feature selection techniques to build the prediction model. The outcome of the analysis at each node was a set of genes that could differentiate that Gleason group from the remaining groups. To validate the proposed method, the set of identified genes was used to classify a second dataset of 499 prostate cancer patients collected from cBioportal. (3) Results: The overall accuracy of this novel method on the first dataset was 93.3%; on the second dataset, the method achieved 87% accuracy. The method also identified genes that were not previously reported as potential biomarkers for specific Gleason groups. In particular, PIAS3 was identified as a potential biomarker for Gleason score 4 + 3 = 7, and UBE2V2 for Gleason score 6. (4) Insight: Previous reports show that the genes predicted by this newly proposed method strongly correlate with prostate cancer development and progression. Furthermore, pathway analysis shows that PIAS3 and UBE2V2 share a protein interaction pathway, the JAK/STAT signaling process.

14.
Front Genet ; 10: 256, 2019.
Article in English | MEDLINE | ID: mdl-30972106

ABSTRACT

Genomic profiles of breast cancer survivors who received similar treatment may provide clues about the key biological processes involved in the cells and help in finding the right treatment. More specifically, such profiling may help personalize treatment based on the patients' gene expression. In this paper, we present a hierarchical machine learning system that predicts the 5-year survivability of patients who underwent a specific therapy. The classes are built by combining two parts: the survivability information and the therapy given. The survivability part indicates whether the patient survived the 5-year interval or died, while the therapy part denotes the therapy received during that interval (hormone therapy, radiotherapy, or surgery), giving six classes in total. The model classifies one class versus the rest at each node, so the tree-based model has five nodes. The model is trained using a set of standard classifiers on a comprehensive study dataset that includes genomic profiles and clinical information of 347 patients. A combination of feature selection methods and a prediction method is applied at each node to identify the genes that can predict the class at that node. The genes identified for each class may serve as potential biomarkers of that class's treatment for better survivability. The results show that the model identifies the classes with high performance measures. An exhaustive analysis of the relevant literature shows that some of the potential biomarkers are strongly related to breast cancer survivability and to cancer in general.

15.
Cancer Inform ; 18: 1176935119835522, 2019.
Article in English | MEDLINE | ID: mdl-30890858

ABSTRACT

Prostate cancer is one of the most common types of cancer among Canadian men. Next-generation sequencing using RNA-Seq provides large amounts of data that may reveal novel and informative biomarkers. We introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have tremendous value in guiding treatment decisions. Analysis of normal versus malignant prostate cancer data sets indicates differential expression of the genes HEATR5B, DDC, and GABPB1-AS1 as potential prostate cancer biomarkers. Our study also supports PTGFR, NREP, SCARNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease.

16.
Evol Bioinform Online ; 14: 1176934318790266, 2018.
Article in English | MEDLINE | ID: mdl-30116102

ABSTRACT

Analyzing the genetic activity associated with breast cancer survival under a specific type of therapy provides a better understanding of the body's response to the treatment, helps select the best course of action, and can lead to the design of drugs based on gene activity. In this work, we use supervised and unsupervised machine learning methods to address a multiclass classification problem in which samples are labeled by the combination of 5-year survivability and treatment; we focus on hormone therapy, radiotherapy, and surgery. The proposed unsupervised hierarchical models are created to find the highest separability between combinations of the classes. The supervised model consists of a combination of feature selection techniques and efficient classifiers used to find a potential set of biomarker genes specific to the response to therapy. The results show that different models achieve different performance scores, with accuracies ranging from 80.9% to 100%. We investigated the roles of many biomarkers through the literature and found that some of the discriminative genes in the computational model, such as ZC3H11A, VAX2, MAF1, and ZFP91, are related to breast cancer and other types of cancer.

17.
J Comput Biol ; 24(8): 746-755, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28414515

ABSTRACT

Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts, so preprocessing of sequences is mandatory before downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its unique k-mers and using that count as the sequence's score; it also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those whose z-score is below a user-defined threshold. The Zseq algorithm provides a better mapping rate and reduces the number of ambiguous bases significantly in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from the other tested methods. A way of estimating the cutoff threshold using labeling rules is also introduced, with promising results.
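The scoring-then-filtering idea can be illustrated with a simplified sketch. This is not the Zseq implementation: it scores each read only by its number of unique k-mers, skips k-mers that contain the ambiguous base `N` (an assumed stand-in for the ambiguity handling the abstract mentions), and ignores the GC-content factor entirely. The function names and default parameter values are hypothetical.

```python
from statistics import mean, pstdev


def unique_kmer_count(read: str, k: int) -> int:
    """Count distinct k-mers in a read, ignoring k-mers that contain an ambiguous base."""
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return sum(1 for kmer in kmers if "N" not in kmer)


def zscore_filter(reads, k=3, threshold=-1.0):
    """Keep reads whose complexity z-score is at least `threshold`.

    First pass: score every read. Second pass: standardize the scores
    and drop reads that fall below the threshold.
    """
    scores = [unique_kmer_count(read, k) for read in reads]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return list(reads)  # all reads equally complex; nothing to filter
    return [read for read, s in zip(reads, scores) if (s - mu) / sigma >= threshold]


# Made-up reads: a repetitive read, a homopolymer, a complex read, and an N-rich read.
toy_reads = ["ACGTACGTAC", "AAAAAAAAAA", "ACGGTCATGC", "ACNNNNNNGT"]
kept_lenient = zscore_filter(toy_reads, k=3, threshold=-1.0)
kept_strict = zscore_filter(toy_reads, k=3, threshold=0.0)
```

With the lenient threshold only the N-rich read is dropped; raising the threshold to 0 also removes the homopolymer, showing how a user-defined cutoff trades read count against complexity.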


Subjects
Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Humans