Results 1 - 20 of 130
1.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37427963

ABSTRACT

Survival analysis is critical to cancer prognosis estimation. High-throughput technologies facilitate the increase in the dimension of genic features, but the number of clinical samples in cohorts is relatively small due to various reasons, including difficulties in participant recruitment and high data-generation costs. The transcriptome is one of the most abundantly available OMIC (referring to high-throughput data, including genomic, transcriptomic, proteomic and epigenomic) data types. This study introduced a multitask graph attention network (GAT) framework, DQSurv, for the survival analysis task. We first used a large dataset of healthy tissue samples to pretrain the GAT-based HealthModel for the quantitative measurement of the gene regulatory relations. The multitask survival analysis framework DQSurv used the idea of transfer learning to initialize the GAT model with the pretrained HealthModel and further fine-tuned this model using two tasks, i.e., the main task of survival analysis and the auxiliary task of gene expression prediction. This refined GAT was denoted as DiseaseModel. We fused the original transcriptomic features with the difference vector between the latent features encoded by the HealthModel and DiseaseModel for the final task of survival analysis. The proposed DQSurv model stably outperformed the existing models for the survival analysis of 10 benchmark cancer types and an independent dataset. The ablation study also supported the necessity of the main modules. We released the code and the pretrained HealthModel to facilitate the feature encodings and survival analysis of transcriptome-based future studies, especially on small datasets. The model and the code are available at http://www.healthinformaticslab.org/supp/.
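
The fusion step described in this abstract can be sketched as follows. This is a minimal illustration and not the released DQSurv code: it assumes that "fused" means plain concatenation and that the difference is taken as DiseaseModel minus HealthModel. The function name `fuse_features` and the toy vectors are hypothetical.

```python
from typing import List


def fuse_features(expression: List[float],
                  health_latent: List[float],
                  disease_latent: List[float]) -> List[float]:
    """Concatenate raw expression features with a latent difference vector.

    Assumptions (not stated in the abstract): fusion is concatenation, and
    the difference is DiseaseModel encoding minus HealthModel encoding.
    """
    if len(health_latent) != len(disease_latent):
        raise ValueError("latent vectors must have the same length")
    # Element-wise difference between the two encodings of the same sample.
    difference = [d - h for h, d in zip(health_latent, disease_latent)]
    return list(expression) + difference


# Toy values, made up for illustration only.
fused = fuse_features([0.5, 1.5], [1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(fused)
```

The fused vector would then be passed to whatever survival model is used downstream; that part is outside this sketch.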


Subjects
Algorithms , Neoplasms , Humans , Proteomics , Survival Analysis
2.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38426310

ABSTRACT

MOTIVATION: Predicting molecular properties is a pivotal task in various scientific domains, including drug discovery, material science, and computational chemistry. This problem is often hindered by the lack of annotated data and imbalanced class distributions, which pose significant challenges in developing accurate and robust predictive models. RESULTS: This study tackles these issues by employing pretrained molecular models within a few-shot learning framework. A novel dynamic contrastive loss function is utilized to further improve model performance in the situation of class imbalance. The proposed MolFeSCue framework not only facilitates rapid generalization from minimal samples, but also employs a contrastive loss function to extract meaningful molecular representations from imbalanced datasets. Extensive evaluations and comparisons of MolFeSCue and state-of-the-art algorithms have been conducted on multiple benchmark datasets, and the experimental data demonstrate our algorithm's effectiveness in molecular representations and its broad applicability across various pretrained models. Our findings underscore MolFeSCue's potential to accelerate advancements in drug discovery. AVAILABILITY AND IMPLEMENTATION: We have made all the source code utilized in this study publicly accessible via GitHub at http://www.healthinformaticslab.org/supp/ or https://github.com/zhangruochi/MolFeSCue. The code (MolFeSCue-v1-00) is also available as the supplementary file of this paper.


Subjects
Algorithms , Benchmarking , Drug Discovery , Models, Molecular , Software
3.
PLoS Comput Biol ; 20(4): e1011945, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38578805

ABSTRACT

Early identification of safe and efficacious disease targets is crucial to alleviating the tremendous cost of drug discovery projects. However, existing experimental methods for identifying new targets are generally labor-intensive and failure-prone. On the other hand, computational approaches, especially machine learning-based frameworks, have shown remarkable application potential in drug discovery. In this work, we propose Progeni, a novel machine learning-based framework for target identification. In addition to fully exploiting the known heterogeneous biological networks from various sources, Progeni integrates literature evidence about the relations between biological entities to construct a probabilistic knowledge graph. Graph neural networks are then employed in Progeni to learn the feature embeddings of biological entities to facilitate the identification of biologically relevant target candidates. A comprehensive evaluation of Progeni demonstrated its superior predictive power over the baseline methods on the target identification task. In addition, our extensive tests showed that Progeni exhibited high robustness to the negative effect of exposure bias, a common phenomenon in recommendation systems, and effectively identified new targets that can be strongly supported by the literature. Moreover, our wet lab experiments successfully validated the biological significance of the top target candidates predicted by Progeni for melanoma and colorectal cancer. All these results suggested that Progeni can identify biologically effective targets and thus provide a powerful and useful tool for advancing the drug discovery process.


Subjects
Computational Biology , Drug Discovery , Machine Learning , Neural Networks, Computer , Humans , Computational Biology/methods , Drug Discovery/methods , Algorithms , Melanoma , Probability , Colorectal Neoplasms
4.
Anal Chem ; 96(21): 8772-8781, 2024 May 28.
Article in English | MEDLINE | ID: mdl-38743842

ABSTRACT

The metabolic signature identification of colorectal cancer is critical for its early diagnosis and therapeutic approaches that will significantly block cancer progression and improve patient survival. Here, we combined an untargeted metabolic analysis strategy based on internal extractive electrospray ionization mass spectrometry and a machine learning approach to analyze metabolites in 173 pairs of cancer samples and matched normal tissue samples to build robust metabolic signature models for diagnostic purposes. Screening and independent validation of metabolic signatures from colorectal cancers via machine learning methods (Logistic Regression_L1 for feature selection and eXtreme Gradient Boosting for classification) were performed to generate a panel of seven signatures with good diagnostic performance (accuracy of 87.74%, sensitivity of 85.82%, and specificity of 89.66%). Moreover, the seven signatures were evaluated according to their ability to distinguish between cancer and normal tissues, with the metabolic molecule PC (30:0) showing good diagnostic performance. In addition, genes associated with PC (30:0) were identified by multiomics analysis (combining metabolic data with transcriptomic data analysis) and our results showed that PC (30:0) could promote the proliferation of the colorectal cancer cell line SW480, revealing the correlation between genetic changes and metabolic dysregulation in cancer. Overall, our results reveal potential determinants affecting metabolite dysregulation, paving the way for a mechanistic understanding of altered tissue metabolites in colorectal cancer and for the design of interventions for manipulating the levels of circulating metabolites.
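
For readers less familiar with the three diagnostic metrics quoted in this abstract, the sketch below shows how accuracy, sensitivity, and specificity are derived from a confusion matrix. The function name `diagnostic_metrics` and the counts are hypothetical; the counts are not the study's data.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, sensitivity, and specificity from confusion counts.

    tp/fn: cancer samples classified correctly/incorrectly.
    tn/fp: normal samples classified correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    if total == 0 or tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one sample in each class")
    return {
        "accuracy": (tp + tn) / total,      # all correct calls over all samples
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = diagnostic_metrics(tp=8, fn=2, tn=9, fp=1)
print(metrics)
```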


Subjects
Colorectal Neoplasms , Machine Learning , Colorectal Neoplasms/metabolism , Colorectal Neoplasms/diagnosis , Humans , Metabolomics , Cell Line, Tumor , Spectrometry, Mass, Electrospray Ionization , Metabolome , Cell Proliferation , Multiomics
5.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35514183

ABSTRACT

Human Leukocyte Antigen (HLA) is a type of molecule residing on the surfaces of most human cells and plays an essential role in the immune system's response to invading agents. The T cell antigen receptors may recognize the HLA-peptide complexes on the surfaces of cancer cells and destroy these cancer cells through cytotoxic T lymphocytes. The computational determination of HLA-binding peptides will facilitate the rapid development of cancer immunotherapies. This study hypothesized that the natural language processing-encoded peptide features may be further enriched by another deep neural network. The hypothesis was tested with the Bi-directional Long Short-Term Memory-extracted features from the pretrained Protein Bidirectional Encoder Representations from Transformers-encoded features of the class I HLA (HLA-I)-binding peptides. The experimental data showed that our proposed HLAB feature engineering algorithm outperformed the existing ones in detecting the HLA-I-binding peptides. The extensive evaluation data show that the proposed HLAB algorithm outperforms all seven existing studies on predicting the peptides binding to the HLA-A*01:01 allele in AUC and achieves the best average AUC values on six of the seven k-mers (k=8,9,...,14, each representing the prediction task for a polypeptide consisting of k amino acids), the exception being the 9-mer prediction tasks. The source code and the fine-tuned feature extraction models are available at http://www.healthinformaticslab.org/supp/resources.php.


Subjects
Histocompatibility Antigens Class I , Peptides , Amino Acids/metabolism , HLA Antigens/chemistry , HLA Antigens/genetics , HLA-A Antigens/metabolism , Histocompatibility Antigens Class I/chemistry , Humans , Peptides/chemistry , Protein Binding
6.
Anal Biochem ; 689: 115495, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38431142

ABSTRACT

RNA modification, N4-acetylcytidine (ac4C), is enzymatically catalyzed by N-acetyltransferase 10 (NAT10) and plays an essential role across tRNA, rRNA, and mRNA. It influences various cellular functions, including mRNA stability and rRNA biosynthesis. Wet-lab detection of ac4C modification sites is highly resource-intensive and costly. Therefore, various machine learning and deep learning techniques have been employed for computational detection of ac4C modification sites. The known ac4C modification sites are limited for training an accurate and stable prediction model. This study introduces GANSamples-ac4C, a novel framework that synergizes transfer learning and generative adversarial network (GAN) to generate synthetic RNA sequences to train a better ac4C modification site prediction model. Comparative analysis reveals that GANSamples-ac4C outperforms existing state-of-the-art methods in identifying ac4C sites. Moreover, our result underscores the potential of synthetic data in mitigating the issue of data scarcity for biological sequence prediction tasks. Another major advantage of GANSamples-ac4C is its interpretable decision logic. Multi-faceted interpretability analyses detect key regions in the ac4C sequences influencing the discriminating decision between positive and negative samples, a pronounced enrichment of G in this region, and ac4C-associated motifs. These findings may offer novel insights for ac4C research. The GANSamples-ac4C framework and its source code are publicly accessible at http://www.healthinformaticslab.org/supp/.


Subjects
Cytidine/analogs & derivatives , Machine Learning , RNA , RNA Stability
7.
Brief Bioinform ; 22(2): 896-904, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32743639

ABSTRACT

The novel coronavirus (2019-nCoV) has recently caused a large-scale outbreak of viral pneumonia both in China and worldwide. In this study, we obtained the entire genome sequence of 777 new coronavirus strains as of 29 February 2020 from a public gene bank. Bioinformatics analysis of these strains indicated that the mutation rate of these new coronaviruses is not high at present, similar to the mutation rate of the severe acute respiratory syndrome (SARS) virus. A comparison of 2019-nCoV and the SARS virus suggested that the S and ORF6 proteins shared a low similarity, while the E protein shared a higher similarity. The 2019-nCoV sequence has similar potential phosphorylation sites and glycosylation sites on the surface protein and the ORF1ab polyprotein as the SARS virus; however, there are differences in potential modification sites between the Chinese strain and some American strains. At the same time, we proposed two possible recombination sites for 2019-nCoV. Based on the results of the skyline, we speculate that the activity of the gene population of 2019-nCoV may be before the end of 2019. As the scope of the 2019-nCoV infection further expands, it may produce different adaptive evolutions due to different environments. Finally, evolutionary genetic analysis can be a useful resource for studying the spread and virulence of 2019-nCoV, which are essential aspects of preventive and precise medicine.


Subjects
COVID-19/classification , Phylogeny , Bayes Theorem , COVID-19/genetics , COVID-19/virology , Evolution, Molecular , Humans , Severe acute respiratory syndrome-related coronavirus/genetics , Severe acute respiratory syndrome-related coronavirus/isolation & purification
8.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33048108

ABSTRACT

MOTIVATION: DNA methylation is a biological process impacting the gene functions without changing the underlying DNA sequence. The DNA methylation machinery usually attaches methyl groups to some specific cytosine residues, which modify the chromatin architectures. Such modifications in the promoter regions will inactivate some tumor-suppressor genes. DNA methylation within the coding region may significantly reduce the transcription elongation efficiency. The gene function may be tuned through the methylation of some cytosines. METHODS: This study hypothesizes that the overall methylation level across a gene may have a better association with the sample labels like diseases than the methylations of individual cytosines. The gene methylation level is formulated as a regression model using the methylation levels of all the cytosines within this gene. A comprehensive evaluation of various feature selection algorithms and classification algorithms is carried out between the gene-level and residue-level methylation levels. RESULTS: A comprehensive evaluation was conducted to compare the gene and cytosine methylation levels for their associations with the sample labels and classification performances. The unsupervised clustering was also improved using the gene methylation levels. Some genes demonstrated statistically significant associations with the class label, even when no residue-level methylation features had statistically significant associations with the class label. In summary, the trained gene methylation levels improved various methylome-based machine learning models. Both methodology development of regression algorithms and experimental validation of the gene-level methylation biomarkers are worthy of further investigation in future studies. The source code, example data files and manual are available at http://www.healthinformaticslab.org/supp/.


Subjects
DNA Methylation , Databases, Nucleic Acid , Machine Learning , Models, Genetic , Humans
9.
BMC Infect Dis ; 23(1): 622, 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735372

ABSTRACT

BACKGROUND: Coronavirus disease 2019 (COVID-19) is a rapidly developing and sometimes lethal pulmonary disease. Accurately predicting COVID-19 mortality will facilitate optimal patient treatment and medical resource deployment, but the clinical practice still needs to address it. Both complete blood counts and cytokine levels were observed to be modified by COVID-19 infection. This study aimed to use inexpensive and easily accessible complete blood counts to build an accurate COVID-19 mortality prediction model. The cytokine fluctuations reflect the inflammatory storm induced by COVID-19, but their levels are not as commonly accessible as complete blood counts. Therefore, this study explored the possibility of predicting cytokine levels based on complete blood counts. METHODS: We used complete blood counts to predict cytokine levels. The predictive model includes an autoencoder, principal component analysis, and linear regression models. We used classifiers such as support vector machine and feature selection models such as adaptive boost to predict the mortality of COVID-19 patients. RESULTS: Complete blood counts and original cytokine levels reached the COVID-19 mortality classification area under the curve (AUC) values of 0.9678 and 0.9111, respectively, and the cytokine levels predicted by the feature set alone reached the classification AUC value of 0.9844. The predicted cytokine levels were more significantly associated with COVID-19 mortality than the original values. CONCLUSIONS: Integrating the predicted cytokine levels and complete blood counts improved a COVID-19 mortality prediction model using complete blood counts only. Both the cytokine level prediction models and the COVID-19 mortality prediction models are publicly available at http://www.healthinformaticslab.org/supp/resources.php .
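
The area under the ROC curve (AUC) reported in this abstract can be computed directly from classifier scores. The sketch below uses the standard pairwise (Mann-Whitney) formulation: the AUC equals the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as half. The function name `auc` and the toy labels and scores are hypothetical and unrelated to the study's data.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: P(score of a positive > score of a negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: label 1 = died, label 0 = survived (made-up scores).
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The double loop is quadratic in the number of samples, which is fine for illustration; rank-based implementations are preferable for large cohorts.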


Subjects
COVID-19 , Humans , Area Under Curve , Cytokines , Linear Models , Principal Component Analysis
10.
Bioinformatics ; 37(15): 2183-2189, 2021 Aug 09.
Article in English | MEDLINE | ID: mdl-33515240

ABSTRACT

MOTIVATION: A feature selection algorithm may select the subset of features with the best associations with the class labels. The recursive feature elimination (RFE) is a heuristic feature screening framework and has been widely used to select the biological OMIC biomarkers. This study proposed a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations. The proposed dRFE was comprehensively compared with 11 existing feature selection algorithms and five classifiers on the eight difficult transcriptome datasets from a previous study, the ten newly collected transcriptome datasets and the five methylome datasets. RESULTS: The experimental data suggested that the regular RFE framework did not perform well, and dRFE outperformed the existing feature selection algorithms in most cases. The dRFE-detected features achieved Acc = 1.0000 for the two methylome datasets GSE53045 and GSE66695. The best prediction accuracies of the dRFE-detected features were 0.9259, 0.9424 and 0.8601 for the other three methylome datasets GSE74845, GSE103186 and GSE80970, respectively. Four transcriptome datasets received Acc = 1.0000 using the dRFE-detected features, and the prediction accuracies for the other six newly collected transcriptome datasets were between 0.6301 and 0.9917. AVAILABILITY AND IMPLEMENTATION: The experiments in this study are implemented and tested using the programming language Python version 3.7.6. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
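
The recursive feature elimination loop mentioned in this abstract can be sketched as below. This is a generic illustration, not the authors' dRFE: the abstract does not say how dRFE varies the elimination, so the `step` callback (how many features to drop given how many remain) is only one hypothetical way to make the schedule flexible. The toy importance score, a per-feature difference of class means, is also an assumption; in practice the importances usually come from a model refitted on the surviving features, such as linear SVM weights.

```python
from typing import Callable, List, Sequence


def mean_difference_importance(X: Sequence[Sequence[float]],
                               y: Sequence[int],
                               active: List[int]) -> List[float]:
    """Toy importance: |mean(class 1) - mean(class 0)| for each active feature."""
    scores = []
    for j in active:
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def recursive_feature_elimination(
        X: Sequence[Sequence[float]],
        y: Sequence[int],
        n_keep: int,
        step: Callable[[int], int],
        importance: Callable = mean_difference_importance) -> List[int]:
    """Repeatedly score the surviving features and drop the weakest ones."""
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        scores = importance(X, y, active)
        # Drop at least one feature, but never go below n_keep.
        n_drop = max(1, min(step(len(active)), len(active) - n_keep))
        weakest_first = sorted(range(len(active)), key=lambda i: scores[i])
        dropped = set(weakest_first[:n_drop])
        active = [f for i, f in enumerate(active) if i not in dropped]
    return active


# Made-up data: 4 samples, 4 features; feature 1 separates the classes best.
X = [[1.0, 5.0, 0.0, 3.0],
     [1.0, 1.0, 0.0, 1.0],
     [1.0, 5.0, 1.0, 3.0],
     [1.0, 1.0, 0.0, 2.0]]
y = [1, 0, 1, 0]
# Hypothetical schedule: drop half of the remaining features per round.
selected = recursive_feature_elimination(X, y, n_keep=1,
                                         step=lambda n: max(1, n // 2))
print(selected)
```

With a fixed `step=lambda n: 1` the same function reduces to the regular one-at-a-time RFE.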

11.
PLoS Comput Biol ; 17(2): e1008676, 2021 02.
Article in English | MEDLINE | ID: mdl-33529200

ABSTRACT

Finding non-standard or new metabolic pathways has important applications in metabolic engineering, synthetic biology and the analysis and reconstruction of metabolic networks. Branched metabolic pathways dominate in metabolic networks and depict a more comprehensive picture of metabolism compared to linear pathways. Although progress has been made in finding branched metabolic pathways, few efforts have been made in identifying branched metabolic pathways via atom group tracking. In this paper, we present a pathfinding method called BPFinder for finding branched metabolic pathways by atom group tracking, which aims to guide the synthetic design of metabolic pathways. BPFinder enumerates linear metabolic pathways by tracking the movements of atom groups in the metabolic network and merges the linear atom group conserving pathways into branched pathways. Two merging rules based on the structure of conserved atom groups are proposed to accurately merge the branched compounds of linear pathways to identify branched pathways. Furthermore, the integrated information of compound similarity, thermodynamic feasibility and conserved atom groups is also used to rank the pathfinding results for feasible branched pathways. Experimental results show that BPFinder is more capable of recovering known branched metabolic pathways as compared to other existing methods, and is able to return biologically relevant branched pathways and discover alternative branched pathways of biochemical interest. The online server of BPFinder is available at http://114.215.129.245:8080/atomic/. The program, source code and data can be downloaded from https://github.com/hyr0771/BPFinder.


Subjects
Computational Biology/methods , Metabolic Engineering/methods , Synthetic Biology/methods , Algorithms , Cell Line, Tumor , Glutamic Acid/metabolism , Humans , Linear Models , Metabolic Networks and Pathways , Models, Biological , Ornithine/metabolism , Predictive Value of Tests , Reproducibility of Results , Software
12.
Skin Res Technol ; 28(5): 677-688, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35639819

ABSTRACT

BACKGROUND: Acne is one of the most common skin lesions in adolescents. Some severe or inflammatory acne leads to scars, which may have major impacts on patients' quality of life or even job prospects. Grading acne plays an important role in diagnosis, and the diagnosis is made by counting the number of acne lesions. It is a labor-intensive job and it is easy for dermatologists to make mistakes, so it is very important to develop automatic diagnosis methods. Ensemble learning may improve the prediction results of the base models, but its time complexity is relatively high. The ensemble pruning strategy may solve this computational challenge by removing the redundant base models. MATERIALS AND METHODS: This study proposed a novel ensemble pruning framework of deep learning models to accurately detect and grade acne using images. First, we train multiple base models and prune the redundant models according to the performance and diversity of the models. Then, we construct the new features of the training data using the base models selected in the previous step. Next, we further remove redundant models using a feature selection algorithm. Finally, we integrate all the remaining base models using classifiers. The ensemble pruning algorithm was proposed to prune the deep learning base models. RESULTS: The experimental data showed that the ensemble pruned framework achieved a prediction accuracy of 85.82% on the acne dataset, better than the existing studies. To verify our method's effectiveness, we tested it on a skin cancer dataset, where it greatly outperformed the state-of-the-art methods. CONCLUSION: The method we proposed is used to grade acne. Our method outperforms state-of-the-art methods on two datasets, and it can also remove redundant models to reduce computational complexity.


Subjects
Acne Vulgaris , Deep Learning , Acne Vulgaris/diagnostic imaging , Adolescent , Algorithms , Humans , Quality of Life
13.
Bioinformatics ; 36(1): 49-55, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31218360

ABSTRACT

MOTIVATION: Cell divisions start from replicating the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. The DNA is replicated starting from the DNA replication origins. A few successful prediction models were generated based on the assumption that the DNA replication origin regions have sequence level features like physicochemical properties significantly different from the other DNA regions. RESULTS: This study proposed a feature selection procedure to further refine the classification model of the DNA replication origins. The experimental data demonstrated that as large as 26% improvement in the prediction accuracy may be achieved on the yeast Saccharomyces cerevisiae. Moreover, the prediction accuracies of the DNA replication origins were improved for all the four yeast genomes investigated in this study. AVAILABILITY AND IMPLEMENTATION: The software sefOri version 1.0 was available at http://www.healthinformaticslab.org/supp/resources.php. An online server was also provided for the convenience of the users, and its web link may be found in the above-mentioned web page. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA Replication , Models, Genetic , Replication Origin , Sequence Analysis, DNA , DNA/chemistry , DNA Replication/genetics , Replication Origin/genetics , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA/methods
14.
Bioinformatics ; 36(5): 1542-1552, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31591638

ABSTRACT

MOTIVATION: Deep neural network (DNN) algorithms have recently been utilized in predicting various biomedical phenotypes, and demonstrated very good prediction performances without selecting features. This study proposed a hypothesis that the DNN models may be further improved by feature selection algorithms. RESULTS: A comprehensive comparative study was carried out by evaluating 11 feature selection algorithms on three conventional DNN algorithms, i.e. convolution neural network (CNN), deep belief network (DBN) and recurrent neural network (RNN), and three recent DNNs, i.e. MobilenetV2, ShufflenetV2 and Squeezenet. Five binary classification methylomic datasets were chosen to calculate the prediction performances of CNN/DBN/RNN models using features selected by the 11 feature selection algorithms. Seventeen binary classification transcriptome and two multi-class transcriptome datasets were also utilized to evaluate how the hypothesis may generalize to different data types. The experimental data supported our hypothesis that feature selection algorithms may improve DNN models, and the DBN models using features selected by SVM-RFE usually achieved the best prediction accuracies on the five methylomic datasets. AVAILABILITY AND IMPLEMENTATION: All the algorithms were implemented and tested under the programming environment Python version 3.6.6. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Neural Networks, Computer , Algorithms
15.
J Chem Inf Model ; 61(6): 2697-2705, 2021 06 28.
Article in English | MEDLINE | ID: mdl-34009965

ABSTRACT

Determining the properties of chemical molecules is essential for screening candidates similar to a specific drug. These candidate molecules are further evaluated for their target binding affinities, side effects, target missing probabilities, etc. Conventional machine learning algorithms demonstrated satisfying prediction accuracies of molecular properties. A molecule cannot be directly loaded into a machine learning model, and a set of engineered features needs to be designed and calculated from a molecule. Such hand-crafted features rely heavily on the experiences of the investigating researchers. The concept of graph neural networks (GNNs) was recently introduced to describe the chemical molecules. The features may be automatically and objectively extracted from the molecules through various types of GNNs, e.g., GCN (graph convolution network), GGNN (gated graph neural network), DMPNN (directed message passing neural network), etc. However, the training of a stable GNN model requires a huge number of training samples and a large amount of computing power, compared with the conventional machine learning strategies. This study proposed the integrated framework XGraphBoost to extract the features using a GNN and build an accurate prediction model of molecular properties using the classifier XGBoost. The proposed framework XGraphBoost fully inherits the merits of the GNN-based automatic molecular feature extraction and XGBoost-based accurate prediction performance. Both classification and regression problems were evaluated using the framework XGraphBoost. The experimental results strongly suggest that XGraphBoost may facilitate the efficient and accurate predictions of various molecular properties. The source code is freely available to academic users at https://github.com/chenxiaowei-vincent/XGraphBoost.git.


Subjects
Machine Learning , Neural Networks, Computer , Algorithms , Software
16.
Int J Mol Sci ; 22(6)2021 Mar 17.
Article in English | MEDLINE | ID: mdl-33802922

ABSTRACT

Enhancers are short genomic regions exerting tissue-specific regulatory roles, usually for remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detections facilitate a better understanding of the transcriptional regulation mechanism. The accurate detection and transcriptional regulation strength evaluation of the enhancers remain a major bioinformatics challenge. Most of the current studies utilized the statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in the two-layer bi-directional long-short term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the delivered classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of the detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. The existing studies may focus on selecting the statistical DNA sequence descriptors with large contributions to the prediction models. This study does not utilize these statistical DNA sequence descriptors. Then the word vector of the SeqPose-encoded features is obtained by using the word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. The previous study generously provided the training and independent test datasets, and the proposed spEnhancer is compared with the three existing state-of-the-art studies using the same experimental procedure. 
The leave-one-out validation data on the training dataset show that the proposed spEnhancer achieves detection performance similar to that of the three existing studies, while spEnhancer achieves the best overall performance metric MCC for both binary classification problems on the independent test dataset. The experimental data show that the strategy of removing redundant positions (SeqPose) may help improve the DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for the novel query enhancers that are not included in the training dataset.
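
The idea of attaching location information to each k-mer, as described in this abstract, can be illustrated with the sketch below. It is not the published SeqPose encoding: the token format `kmer@position`, the function name `positional_kmers`, and the example sequence are all hypothetical, chosen only to show how the same k-mer at two positions becomes two distinct tokens for a downstream embedding layer.

```python
from typing import List


def positional_kmers(sequence: str, k: int) -> List[str]:
    """Split a DNA sequence into overlapping k-mers tagged with their start index."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    return [f"{sequence[i:i + k]}@{i}" for i in range(len(sequence) - k + 1)]


# Made-up sequence: "AC" occurs at positions 0 and 2 and yields two tokens.
tokens = positional_kmers("ACAC", 2)
print(tokens)
```

Position-tagged tokens also make it possible to discard whole positions in a feature selection step, which is the kind of filtering the abstract attributes to the Chi-squared test.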


Subjects
Algorithms , Computational Biology/methods , Enhancer Elements, Genetic , Base Sequence , Databases, Genetic
17.
Medicina (Kaunas) ; 57(2)2021 Jan 22.
Article in English | MEDLINE | ID: mdl-33499377

ABSTRACT

BACKGROUND AND OBJECTIVE: Primary lung cancer is a lethal and rapidly-developing cancer type and is one of the leading causes of cancer deaths. MATERIALS AND METHODS: Statistical methods such as Cox regression are usually used to detect the prognosis factors of a disease. This study investigated survival prediction using machine learning algorithms. The clinical data of 28,458 patients with primary lung cancers were collected from the Surveillance, Epidemiology, and End Results (SEER) database. RESULTS: This study indicated that the survival rate of women with primary lung cancer was often higher than that of men (p < 0.001). Seven popular machine learning algorithms were utilized to evaluate one-year, three-year, and five-year survival prediction. The two classifiers extreme gradient boosting (XGB) and logistic regression (LR) achieved the best prediction accuracies. The variable importance of the trained XGB models suggested that surgical removal (feature "Surgery") made the largest contribution to the one-year survival prediction models, while the metastatic status (feature "N" stage) of the regional lymph nodes was the most important contributor to three-year and five-year survival prediction. The female patients' three-year prognosis model achieved a prediction accuracy of 0.8297 on the independent future samples, while the male model only achieved an accuracy of 0.7329. CONCLUSIONS: These data suggested that male patients may have more complicated factors in lung cancer than females, and it is necessary to develop gender-specific diagnosis and prognosis models.


Subjects
Lung Neoplasms , Machine Learning , Algorithms , Female , Humans , Logistic Models , Lung Neoplasms/diagnosis , Male , Prognosis
18.
Int J Mol Sci ; 21(6)2020 Mar 22.
Article in English | MEDLINE | ID: mdl-32235704

ABSTRACT

With recent advances in single-cell RNA sequencing, enormous transcriptome datasets have been generated. These datasets have furthered our understanding of cellular heterogeneity and its underlying mechanisms in homogeneous populations. Single-cell RNA sequencing (scRNA-seq) data clustering can group cells belonging to the same cell type based on patterns embedded in gene expression. However, scRNA-seq data are high-dimensional, noisy, and sparse, owing to the limitations of existing scRNA-seq technologies. Traditional clustering methods are neither effective nor efficient for high-dimensional and sparse matrix computations. Therefore, several dimension reduction methods have been introduced. To validate a reliable and standard research routine, we conducted a comprehensive review and evaluation of four classical dimension reduction methods and five clustering models. Four experiments were progressively performed on two large scRNA-seq datasets using 20 models. Results showed that the feature selection method contributed positively to the analysis of high-dimensional and sparse scRNA-seq data. Moreover, feature-extraction methods were able to improve clustering performance, although the improvement was not consistent in every setting. Independent component analysis (ICA) performed well in small compressed feature spaces, whereas principal component analysis was steadier than all the other feature-extraction methods. In addition, ICA was not ideal for fuzzy C-means clustering in scRNA-seq data analysis. K-means clustering combined with feature-extraction methods achieved good results.
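
The abstract does not name the feature selection method that was evaluated. Purely as an illustration of the general idea of reducing a high-dimensional, sparse expression matrix before clustering, the sketch below keeps the genes with the highest variance across cells. The variance criterion, the function name, and the toy matrix are assumptions made for this example.

```python
from statistics import pvariance
from typing import List, Sequence


def top_variance_genes(matrix: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the column indices of the k most variable genes.

    matrix: one row per cell, one column per gene.
    """
    n_genes = len(matrix[0])
    variances = [pvariance([cell[g] for cell in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:k])


# Toy matrix (4 cells x 3 genes), hypothetical values.
cells = [
    [0, 5, 1],
    [0, 0, 3],
    [0, 9, 2],
    [1, 2, 1],
]
keep = top_variance_genes(cells, k=2)
reduced = [[cell[g] for g in keep] for cell in cells]
print(keep, reduced)
```

The reduced matrix would then be passed to a feature-extraction or clustering step; nearly constant genes such as the first column carry little information for separating cell types.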


Subjects
Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Animals , Cluster Analysis , Gene Expression Profiling/methods , Mice , Transcriptome
19.
Int J Mol Sci ; 20(11)2019 May 28.
Article in English | MEDLINE | ID: mdl-31141969

ABSTRACT

Breast cancer is estimated to be the leading cancer type among new cases in American women. Core biopsy data have shown a close association between breast hyperplasia and breast cancer. The early diagnosis and treatment of breast hyperplasia are extremely important to prevent breast cancer. The Mongolian medicine RuXian-I is a traditional drug that has achieved a high level of efficacy and a low incidence of side effects in its clinical use. However, a rapid and accurate method for evaluating the efficacy of RuXian-I based on metabolomic data is still lacking. Therefore, we proposed a framework, named the metabolomics deep belief network (MDBN), to analyze breast hyperplasia metabolomic data. We obtained 168 samples of metabolomic data from an animal model experiment of RuXian-I, which were averaged from control groups, treatment groups, and model groups. During training, unlabelled data were used to pretrain the deep belief network models, and labelled data were then used for fine-tuning with a limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. To prevent overfitting, a dropout method was added to the pretraining and fine-tuning procedures. The experimental results showed that the proposed model is superior to other classical classification methods that are based on positive and negative spectra data. Further, the proposed model can be used as an extension of the classification method for metabolomic data. The high classification accuracy across the three groups indicates clear differences and boundaries between them. It can be inferred that the animal model of RuXian-I is well established, which can lay a foundation for subsequent related experiments. This also shows that metabolomic data can be used as a means to verify the effectiveness of RuXian-I in the treatment of breast hyperplasia.


Subjects
Breast Neoplasms/pathology , Metabolomics , Models, Theoretical , Breast Neoplasms/metabolism , Computer Simulation , Female , Humans , Hyperplasia , Mammary Glands, Human/metabolism , Mammary Glands, Human/pathology
20.
BMC Bioinformatics ; 19(1): 452, 2018 Nov 26.
Article in English | MEDLINE | ID: mdl-30477418

ABSTRACT

BACKGROUND: Imaging is one of the major biomedical technologies for investigating the status of a living object. However, biomedical image-based data mining requires extensive knowledge across multiple disciplines, e.g. biology, mathematics and computer science. RESULTS: pyHIVE (a Health-related Image Visualization and Engineering system using Python) was implemented as an image processing system that provides five widely used image feature engineering algorithms. A standard binary classification pipeline was also provided to help researchers build data models immediately after the data are collected. pyHIVE can compute the five image feature engineering algorithms efficiently using multiple computing cores, and also provides modules for principal component analysis (PCA) based preprocessing and normalization. CONCLUSIONS: The demonstrative example shows that the image features generated by pyHIVE achieved very good classification performance on gastrointestinal endoscopic images. The pyHIVE system and the demonstrative example are freely available and maintained at http://www.healthinformaticslab.org/supp/resources.php.


Subjects
Image Processing, Computer-Assisted/methods , Algorithms , Endoscopy, Gastrointestinal , Humans , Principal Component Analysis , Programming Languages