Results 1 - 20 of 68
1.
Front Cell Dev Biol ; 9: 781285, 2021.
Article in English | MEDLINE | ID: mdl-34917619

ABSTRACT

There are many types of cancer. Although they share some hallmarks, such as proliferation and metastasis, they still differ in many respects. They grow in different organs or tissues. Does each cancer have a unique gene expression pattern that distinguishes it from other cancer types? Since the Cancer Genome Atlas (TCGA) project, the number of pan-cancer studies has grown steadily. Researchers want to obtain robust gene expression signatures from pan-cancer patients, but heterogeneity causes large variance among cancer patients. The sample size needed for robust results would be too large to recruit. In this study, we tried another approach to obtain robust pan-cancer biomarkers by using cell line data to reduce the variance. We applied several advanced computational methods to analyze the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles, which included 988 cell lines from 20 cancer types. Two feature selection methods, Boruta and max-relevance and min-redundancy, were applied to the cell line gene expression data one after the other, generating a feature list. This list was fed into the incremental feature selection method, incorporating one classification algorithm, to extract biomarkers and construct optimal classifiers and decision rules. The optimal classifiers provided good performance and can be useful tools for identifying cell lines from different cancer types, whereas the biomarkers (e.g. NCKAP1, TNFRSF12A, LAMB2, FKBP9, PFN2, TOM1L1) and rules identified in this work may provide a meaningful and precise reference for differentiating multiple types of cancer and contribute to the personalized treatment of tumors.
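
The step in which a ranked feature list is fed into incremental feature selection (IFS) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid classifier, the leave-one-out evaluation, the toy data, and all function names are hypothetical choices made to keep the example self-contained.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (assumed classifier)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))

    # sorted() makes tie-breaking deterministic
    return min(sorted(centroids), key=dist)


def loo_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy using only the columns listed in feature_idx."""
    sub = [[row[j] for j in feature_idx] for row in X]
    hits = 0
    for i in range(len(sub)):
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, sub[i]) == y[i]
    return hits / len(sub)


def incremental_feature_selection(X, y, ranked_features):
    """Evaluate the top-1, top-2, ... subsets of the ranked list; return the best."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        acc = loo_accuracy(X, y, ranked_features[:k])
        if acc > best_acc:  # strict '>' keeps the smallest best-performing subset
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 2.0], [1.1, 8.0], [0.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
ranked = [0, 1]  # assumed output of an upstream ranking step (e.g. Boruta, mRMR)
selected, best_acc = incremental_feature_selection(X, y, ranked)
```

In practice the ranking would come from the upstream feature selection methods and the classifier would be whichever algorithm the study evaluates; only the loop over growing prefixes of the ranked list is the IFS idea itself.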

2.
Biomed Res Int ; 2021: 9939134, 2021.
Article in English | MEDLINE | ID: mdl-34307679

ABSTRACT

COVID-19, a severe respiratory disease caused by a new type of coronavirus, SARS-CoV-2, has been spreading all over the world. Patients infected with SARS-CoV-2 may have no pathogenic symptoms, i.e., presymptomatic patients and asymptomatic patients. Both groups of patients could further spread the virus to other susceptible people, thereby making the control of COVID-19 difficult. The two major challenges for COVID-19 diagnosis at present are as follows: (1) patients could share similar symptoms with other respiratory infections, and (2) patients may not have any symptoms but could still spread the virus. Therefore, new biomarkers at different omics levels are required for the large-scale screening and diagnosis of COVID-19. Although some initial analyses could identify a group of candidate gene biomarkers for COVID-19, the previous work still could not identify biomarkers suitable for clinical use in COVID-19, which requires disease-specific diagnosis that separates it from multiple other infectious diseases. As an extension of the previous study, optimized machine learning models were applied in the present study to identify some specific qualitative host biomarkers associated with COVID-19 infection on the basis of a publicly released transcriptomic dataset, which included healthy controls and patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus. This dataset was first analysed by the Boruta and Max-Relevance and Min-Redundancy feature selection methods one after the other, resulting in a feature list. This list was fed into the incremental feature selection method, incorporating one of the classification algorithms, to extract essential biomarkers and build efficient classifiers and classification rules. The capacity of these findings to distinguish COVID-19 from other similar respiratory infectious diseases at the transcriptomic level was also validated, which may improve the efficacy and accuracy of COVID-19 diagnosis.


Subjects
COVID-19 Testing/methods; COVID-19/diagnosis; COVID-19/genetics; Biomarkers/analysis; COVID-19/blood; Databases, Genetic; Gene Expression Profiling/methods; Humans; Influenza, Human; Machine Learning; Mass Screening/methods; Models, Theoretical; Respiratory Tract Infections/blood; Respiratory Tract Infections/diagnosis; SARS-CoV-2/genetics; SARS-CoV-2/pathogenicity; Transcriptome/genetics
3.
Article in English | MEDLINE | ID: mdl-32766217

ABSTRACT

Protein is one of the most significant components of all living creatures. All significant and essential biological structures and functions rely on proteins and their respective biological functions. However, proteins cannot perform their unique biological roles independently. They have to interact with each other to realize the complicated biological processes in all living creatures, including human beings. In other words, proteins depend on interactions (protein-protein interactions) to realize their significant effects. Thus, the significance comparison and quantitative contribution of candidate PPI features must be determined urgently. According to previous studies, 258 physical and chemical characteristics of proteins have been reported and confirmed to definitively affect the interaction efficiency of the related proteins. Among such features, essential physicochemical features of proteins such as stoichiometric balance, protein abundance, molecular weight and charge distribution have been validated to be quite significant and irreplaceable for protein-protein interactions (PPIs). Therefore, in this study, we presented a novel computational framework to identify the key factors affecting PPIs with Boruta feature selection (BFS), Monte Carlo feature selection (MCFS), and incremental feature selection (IFS), and built a quantitative decision-rule system to evaluate the potential PPIs under real conditions with random forest (RF) and RIPPER algorithms, thereby supplying several new insights into the detailed biological mechanisms of complicated PPIs. The main datasets and codes can be downloaded at https://github.com/xypan1232/Mass-PPI.

4.
Biomed Res Int ; 2020: 6384120, 2020.
Article in English | MEDLINE | ID: mdl-32626751

ABSTRACT

Among various risk factors for the initiation and progression of cancer, alternative polyadenylation (APA) is a remarkable endogenous contributor that directly triggers the malignant phenotype of cancer cells. APA affects biological processes at a transcriptional level in various ways. As such, APA can be involved in tumorigenesis through gene expression, protein subcellular localization, or transcription splicing pattern. The APA sites and status of different cancer types may have diverse modification patterns and regulatory mechanisms on transcripts. Potential APA sites were screened by applying several machine learning algorithms on a TCGA-APA dataset. First, a powerful feature selection method, minimum redundancy maximum relevancy, was applied on the dataset, resulting in a feature list. Then, the feature list was fed into the incremental feature selection, which incorporated the support vector machine as the classification algorithm, to extract key APA features and build a classifier. The classifier can classify cancer patients into cancer types with perfect performance. The key APA-modified genes had a potential prognosis ability because of their significant power in the survival analysis of TCGA pan-cancer data.


Subjects
Carcinogenesis/genetics; Gene Expression Regulation, Neoplastic/genetics; Neoplasms; Polyadenylation/genetics; RNA Processing, Post-Transcriptional/genetics; Algorithms; Computational Biology; Databases, Genetic; Humans; Machine Learning; Neoplasms/classification; Neoplasms/genetics; Neoplasms/mortality; Neoplasms/pathology; Support Vector Machine
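
The minimum redundancy maximum relevance ranking mentioned in this abstract can be illustrated with a greedy sketch. It assumes discrete-valued features, the difference form of the criterion (relevance minus mean redundancy), and mutual information estimated from raw counts; these choices, the toy data, and the function names are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def mrmr_rank(columns, target):
    """Greedy mRMR: repeatedly pick the feature with the highest
    relevance-to-target minus mean redundancy with already selected features."""
    remaining = list(range(len(columns)))
    relevance = {j: mutual_information(columns[j], target) for j in remaining}
    selected = []
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: f0 is informative, f1 duplicates f0, f2 is weaker but complementary.
target = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 0, 0, 1, 1, 1, 1, 1]
f1 = list(f0)
f2 = [0, 0, 1, 0, 0, 1, 1, 1]
ranking = mrmr_rank([f0, f1, f2], target)
```

On this toy input the duplicate feature is pushed to the end of the ranking because its redundancy with the first pick cancels its relevance, which is the behaviour the method is designed for.
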
5.
Article in English | MEDLINE | ID: mdl-32528944

ABSTRACT

DNA methylation is an essential epigenetic modification for multiple biological processes. DNA methylation in mammals acts as an epigenetic mark of transcriptional repression. Aberrant levels of DNA methylation can be observed in various types of tumor cells. Thus, DNA methylation has attracted considerable attention among researchers to provide new and feasible tumor therapies. Conventional studies considered single-gene methylation or specific loci as biomarkers for tumorigenesis. However, genome-scale methylated modification has not been completely investigated. Thus, we proposed and compared two novel computational approaches based on multiple machine learning algorithms for the qualitative and quantitative analyses of methylation-associated genes and their dys-methylated patterns. This study contributes to the identification of novel effective genes and the establishment of optimal quantitative rules for aberrant methylation distinguishing tumor cells with different origin tissues.

6.
Cancer Gene Ther ; 27(1-2): 56-69, 2020 02.
Article in English | MEDLINE | ID: mdl-31138902

ABSTRACT

Acute myeloid leukemia (AML) is a type of blood cancer characterized by the rapid growth of immature white blood cells from the bone marrow. Therapy resistance resulting from the persistence of leukemia stem cells (LSCs) is found in numerous patients. Comparative transcriptome studies have been previously conducted to analyze differentially expressed genes between LSC+ and LSC- cells. However, these studies mainly focused on a limited number of genes with the most obvious expression differences between the two cell types. We developed a computational approach incorporating several machine learning algorithms, including Monte Carlo feature selection (MCFS), incremental feature selection (IFS), support vector machine (SVM), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER), to identify gene expression features specific to LSCs. One thousand one hundred fifty-nine features (genes) were first identified, which can be used to build the optimal SVM classifier for distinguishing LSC+ and LSC- cells. Among these 1159 genes, the top 17 genes were identified as LSC-specific biomarkers. In addition, six classification rules were produced by the RIPPER algorithm. The subsequent literature review on these features/genes and the classification rules, together with functional enrichment analyses of the 1159 features/genes, confirmed the relevance of the extracted genes and rules to the characteristics of LSCs.


Subjects
Biomarkers, Tumor/genetics; Leukemia, Myeloid, Acute/genetics; Models, Genetic; Neoplastic Stem Cells/pathology; Support Vector Machine; Biomarkers, Tumor/analysis; Computational Biology/methods; Datasets as Topic; Drug Resistance, Neoplasm/genetics; Feasibility Studies; Gene Expression Profiling/methods; Humans; Leukemia, Myeloid, Acute/drug therapy; Leukemia, Myeloid, Acute/pathology; Monte Carlo Method; Neoplastic Stem Cells/drug effects
7.
Front Genet ; 10: 738, 2019.
Article in English | MEDLINE | ID: mdl-31456818

ABSTRACT

Patient-derived tumor xenograft (PDX) mouse models are widely used for drug screening. The underlying assumption is that PDX tissue is very similar to the original patient tissue and responds to drug treatment in the same way. To investigate whether the primary tumor site information is well preserved in PDX, we analyzed the gene expression profiles of PDX mouse models originating from different tissues, including breast, kidney, large intestine, lung, ovary, pancreas, skin, and soft tissues. The popular Monte Carlo feature selection method was employed to analyze the expression profiles, yielding a feature list. From this list, incremental feature selection and support vector machine (SVM) were adopted to extract distinctively expressed genes in PDXs from different primary tumor sites and build an optimal SVM classifier. In addition, we set up a group of quantitative rules to identify primary tumor sites. A total of 755 genes were extracted by the feature selection procedures, on which the SVM classifier achieved an MCC of 0.986 in classifying primary tumor sites originating from different tissues. Furthermore, we obtained 16 classification rules, which gave a lower accuracy but clear classification procedures. These results validated that primary tumor site specificity was well preserved in PDX, as the PDXs from different primary tumor sites were still very different and these PDX differences were similar to the differences observed in patients with tumors. For example, VIM and ABHD17C were highly expressed in the PDX from breast tissue and also highly expressed in breast cancer patients.

8.
Int J Mol Sci ; 20(9)2019 May 02.
Article in English | MEDLINE | ID: mdl-31052553

ABSTRACT

Small nucleolar RNAs (snoRNAs) are a new type of functional small RNAs involved in the chemical modifications of rRNAs, tRNAs, and small nuclear RNAs. It is reported that they play important roles in tumorigenesis via various regulatory modes. snoRNAs can both participate in the regulation of methylation and pseudouridylation and regulate the expression pattern of their host genes. This research investigated the expression pattern of snoRNAs in eight major cancer types in TCGA via several machine learning algorithms. The expression levels of snoRNAs were first analyzed by a powerful feature selection method, Monte Carlo feature selection (MCFS). A feature list and some informative features were obtained. Then, incremental feature selection (IFS) was applied to the feature list to extract the optimal features/snoRNAs, on which the support vector machine (SVM) yields its best performance. The discriminative snoRNAs included HBII-52-14, HBII-336, SNORD123, HBII-85-29, HBII-420, U3, HBI-43, SNORD116, SNORA73B, SCARNA4, HBII-85-20, etc., on which the SVM can provide a Matthews correlation coefficient (MCC) of 0.881 for predicting these eight cancer types. On the other hand, the informative features were fed into the Johnson reducer and repeated incremental pruning to produce error reduction (RIPPER) algorithms to generate classification rules, which can clearly show different snoRNA expression patterns in different cancer types. The analysis results indicated that the extracted discriminative snoRNAs can be important for identifying cancer samples of different types and that the expression pattern of snoRNAs in different cancer types can be partly uncovered by quantitative recognition rules.


Subjects
Gene Expression Regulation, Neoplastic; Machine Learning; Neoplasms/genetics; RNA, Small Nucleolar/genetics; Algorithms; Humans; Monte Carlo Method; Support Vector Machine
9.
J Clin Med ; 7(10)2018 Oct 13.
Article in English | MEDLINE | ID: mdl-30322114

ABSTRACT

As a common brain cancer derived from glial cells, gliomas have three subtypes: glioblastoma, diffuse astrocytoma, and anaplastic astrocytoma. The subtypes have distinctive clinical features but are closely related to each other. A glioblastoma can be derived from the early stage of diffuse astrocytoma, which can be transformed into anaplastic astrocytoma. Due to the complexity of these dynamic processes, single-cell gene expression profiles are extremely helpful to understand what defines these subtypes. We analyzed the single-cell gene expression profiles of 5057 cells of anaplastic astrocytoma tissues, 261 cells of diffuse astrocytoma tissues, and 1023 cells of glioblastoma tissues with advanced machine learning methods. In detail, a powerful feature selection method, Monte Carlo feature selection (MCFS) method, was adopted to analyze the gene expression profiles of cells, resulting in a feature list. Then, the incremental feature selection (IFS) method was applied to the obtained feature list, with the help of support vector machine (SVM), to extract key features (genes) and construct an optimal SVM classifier. Several key biomarker genes, such as IGFBP2, IGF2BP3, PRDX1, NOV, NEFL, HOXA10, GNG12, SPRY4, and BCL11A, were identified. In addition, the underlying rules of classifying the three subtypes were produced by Johnson reducer algorithm. We found that in diffuse astrocytoma, PRDX1 is highly expressed, and in glioblastoma, the expression level of PRDX1 is low. These rules revealed the difference among the three subtypes, and how they are formed and transformed. These genes are not only biomarkers for glioma subtypes, but also drug targets that may switch the clinical features or even reverse the tumor progression.

10.
Genes (Basel) ; 9(4)2018 Apr 12.
Article in English | MEDLINE | ID: mdl-29649131

ABSTRACT

Atrioventricular septal defect (AVSD) is a clinically significant subtype of congenital heart disease (CHD) that severely influences the health of babies during birth and is associated with Down syndrome (DS). Thus, exploring the differences in functional genes in DS samples with and without AVSD is a critical way to investigate the complex association between AVSD and DS. In this study, we present a computational method to distinguish DS patients with AVSD from those without AVSD using the newly proposed self-normalizing neural network (SNN). First, each patient was encoded by using the copy number of probes on chromosome 21. The encoded features were ranked by the reliable Monte Carlo feature selection (MCFS) method to obtain a ranked feature list. Based on this feature list, we used a two-stage incremental feature selection to construct two series of feature subsets and applied SNNs to build classifiers to identify optimal features. Results show that 2737 optimal features were obtained, and the corresponding optimal SNN classifier constructed on these features yielded a Matthews correlation coefficient (MCC) value of 0.748. For comparison, random forest was also used to build classifiers and uncover optimal features. This method reached an optimal MCC value of 0.582 when the top 132 features were utilized. Finally, we analyzed some key features among the optimal SNN features that are supported by the literature to further reveal their essential roles.
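
For readers comparing the MCC values quoted above, the binary Matthews correlation coefficient can be computed from a confusion matrix as in the sketch below. The function name and the convention of returning 0.0 when the denominator vanishes are assumptions of this example, not details from the paper.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined cases (an empty row or column) map to 0.0
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why it is often preferred over accuracy on imbalanced classes.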

11.
J Cell Biochem ; 119(4): 3394-3403, 2018 04.
Article in English | MEDLINE | ID: mdl-29130544

ABSTRACT

Adult neural stem cells (NSCs) are a group of multi-potent, self-renewing progenitor cells that contribute to the generation of new neurons and oligodendrocytes. Three subtypes of NSCs can be isolated based on the stages of the NSC lineage, including quiescent neural stem cells (qNSCs), activated neural stem cells (aNSCs) and neural progenitor cells (NPCs). Although it is widely accepted that these three groups of NSCs play different roles in the development of the nervous system, their molecular signatures are poorly understood. In this study, we applied the Monte-Carlo Feature Selection (MCFS) method to identify the gene expression signatures, which can yield a Matthews correlation coefficient (MCC) value of 0.918 with a support vector machine evaluated by ten-fold cross-validation. In addition, some classification rules yielded by the MCFS program for distinguishing above three subtypes were reported. Our results not only demonstrate a high classification capacity and subtype-specific gene expression patterns but also quantitatively reflect the pattern of the gene expression levels across the NSC lineage, providing insight into deciphering the molecular basis of NSC differentiation.


Subjects
Astrocytes/cytology; Gene Expression Profiling/methods; Gene Regulatory Networks; Neural Stem Cells/classification; Algorithms; Cell Lineage; Cells, Cultured; Humans; Monte Carlo Method; Support Vector Machine
12.
Genes (Basel) ; 8(10)2017 Oct 02.
Article in English | MEDLINE | ID: mdl-28974058

ABSTRACT

Bone and dental diseases are serious public health problems. Most current clinical treatments for these diseases can produce side effects. Regeneration is a promising therapy for bone and dental diseases, yielding natural tissue recovery with few side effects. Because soft tissues inside the bone and dentin are densely populated with nerves and vessels, the study of bone and dentin regeneration should also consider the co-regeneration of nerves and vessels. In this study, a network-based method to identify co-regeneration genes for bone, dentin, nerve and vessel was constructed based on an extensive network of protein-protein interactions. Three procedures were applied in the network-based method. The first procedure, searching, sought the shortest paths connecting regeneration genes of one tissue type with regeneration genes of other tissues, thereby extracting possible co-regeneration genes. The second procedure, testing, employed a permutation test to evaluate whether possible genes were false discoveries; these genes were excluded by the testing procedure. The last procedure, screening, employed two rules, the betweenness ratio rule and interaction score rule, to select the most essential genes. A total of seventeen genes were inferred by the method, which were deemed to contribute to co-regeneration of at least two tissues. All these seventeen genes were extensively discussed to validate the utility of the method.
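
The searching procedure described in this abstract, which looks for shortest paths between regeneration genes of different tissues, might be sketched as below. The example assumes an unweighted interaction graph and breadth-first search; the study may well have used weighted edges, and the gene names and function names here are hypothetical.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path (list of nodes) in an unweighted graph, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def candidate_genes(graph, genes_a, genes_b):
    """Collect interior nodes of shortest paths linking the two gene sets."""
    known = set(genes_a) | set(genes_b)
    found = set()
    for a in genes_a:
        for b in genes_b:
            path = shortest_path(graph, a, b)
            if path:
                found.update(n for n in path[1:-1] if n not in known)
    return found


# Hypothetical toy network: BONE1 reaches NERVE1 via X (short) or via Y and Z (long).
ppi = {
    "BONE1": ["X", "Y"],
    "X": ["BONE1", "NERVE1"],
    "Y": ["BONE1", "Z"],
    "Z": ["Y", "NERVE1"],
    "NERVE1": ["X", "Z"],
}
candidates = candidate_genes(ppi, ["BONE1"], ["NERVE1"])
```

Only the interior node of the shorter route is reported; the subsequent testing and screening procedures from the abstract would then filter such candidates further.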

13.
Comb Chem High Throughput Screen ; 20(7): 622-628, 2017.
Article in English | MEDLINE | ID: mdl-28292251

ABSTRACT

AIM AND OBJECTIVE: Protein malonylation is a newly discovered post-translational modification. Malonylation is known to be closely associated with type 2 diabetes and to play a regulatory role in fatty acid oxidation and the associated genetic disease. Identifying protein malonylation sites might lay a solid foundation for exploring malonylation function. Due to the limitations of experimental techniques, it is a great challenge to identify malonylation sites quickly and accurately. METHODS: We proposed a computational method to predict malonylation sites and to analyze the malonylation pattern. We first extracted protein segments so that the lysine is at the center of each segment. Then, each segment was encoded by the pseudo amino acid compositions. A support vector machine classifier was trained on a training dataset to distinguish malonylation sites from non-malonylation ones. RESULTS: The leave-one-out test on the training dataset reached an accuracy of 0.7733, and the independent test on the testing dataset reached 0.8889. Furthermore, the classifier also successfully identified 144 of 160 putative malonylation sites. Analyses of the differences between malonylation and non-malonylation segments suggested that lysine malonylation follows a specific pattern, e.g. a lysine whose neighbors are glycine and alanine might be more likely to be malonylated. Therefore, the proposed method is expected to be a promising tool to identify malonylation sites.


Subjects
Amino Acids/metabolism; Computational Biology; Diabetes Mellitus, Type 2/metabolism; Lysine/metabolism; Proteins/metabolism; Algorithms; Amino Acids/chemistry; Databases, Protein; Humans; Lysine/chemistry; Proteins/chemistry
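
The segment extraction step in the METHODS above, where each segment has a lysine at its center, can be sketched as follows. The half-width, the padding character for positions beyond the sequence ends, and the function name are assumptions of this example; the abstract does not state the window length used.

```python
def lysine_windows(sequence, half_width=3, pad="X"):
    """Return (1-based position, segment) pairs with each lysine (K) centered.

    Segments near the termini are padded with `pad` (an assumed convention).
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Index i in `sequence` sits at i + half_width in `padded`,
            # so the window starts at i and spans 2 * half_width + 1 residues.
            windows.append((i + 1, padded[i:i + 2 * half_width + 1]))
    return windows


segments = lysine_windows("MGKAAK", half_width=2)
```

Each segment would then be encoded numerically (the abstract mentions pseudo amino acid compositions) before being passed to the classifier.
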
14.
Biomed Res Int ; 2017: 6132436, 2017.
Article in English | MEDLINE | ID: mdl-28255556

ABSTRACT

As a pathological condition, epilepsy is caused by abnormal neuronal discharge in the brain, which temporarily disrupts cerebral functions. Epilepsy is a chronic disease that occurs at all ages and can seriously affect patients' personal lives. Thus, effective medicines or instruments to treat the disease are urgently needed. Identifying epilepsy-related genes is essential for understanding and treating the disease because the proteins encoded by epilepsy-related genes are candidate drug targets. In this study, a pioneering computational workflow was proposed to predict novel epilepsy-related genes using the random walk with restart (RWR) algorithm. As reported in the literature, the RWR algorithm often produces a number of false positive genes; in this study, a permutation test and functional association tests were implemented to filter the genes identified by the RWR algorithm, which greatly reduced the number of suspected genes and resulted in only thirty-three novel epilepsy genes. Finally, these novel genes were analyzed based on recently published literature. Our findings indicate that all novel genes were closely related to epilepsy. It is believed that the proposed workflow can also be applied to identify genes related to other diseases and deepen our understanding of the mechanisms of these diseases.


Subjects
Epilepsy/genetics; Genetic Association Studies/methods; Algorithms; Databases, Genetic; Humans
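
The random walk with restart step named in this abstract can be illustrated with a small power-iteration sketch. It assumes an undirected, unweighted network in which every node has at least one neighbor, a restart probability of 0.8, and uniform initial weight on the seed genes; these settings, the node names, and the function name are assumptions and are not taken from the paper.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.8, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    `adjacency` maps each node to a list of neighbors; W spreads a node's
    probability equally over its neighbors.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                continue  # isolated nodes pass nothing on
            share = (1.0 - restart) * p[n] / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical chain network: seed gene S, neighbor A, and more distant B.
network = {"S": ["A"], "A": ["S", "B"], "B": ["A"]}
scores = random_walk_with_restart(network, seeds={"S"})
```

Genes closer to the seeds receive higher scores; as the abstract notes, such scores alone tend to yield false positives, which is why the authors add permutation and functional association tests afterwards.
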
15.
Comb Chem High Throughput Screen ; 20(2): 164-173, 2017.
Article in English | MEDLINE | ID: mdl-28029071

ABSTRACT

BACKGROUND: As one of the essential post-translational modifications (PTMs), citrullination or deimination of an arginine residue changes the molecular weight and electrostatic charge of its side-chain. It has been found that citrullination in protein sequences is catalyzed by a Ca2+-dependent enzyme family called peptidylarginine deiminase (PAD), which includes five isotypes: PAD1, 2, 3, 4/5, and 6. Citrullinated proteins participate in many biological processes, e.g. the citrullination of myelin basic protein (MBP) assists the early development of the central nervous system. However, abnormal modifications on citrullinated proteins can also lead to some severe human diseases, including multiple sclerosis and rheumatoid arthritis. OBJECTIVE: Therefore, it is necessary and important to identify the citrullination sites in protein sequences. Information about the location of citrullination sites in protein sequences will be useful for investigating the molecular functions and disease mechanisms related to citrullinated proteins. MATERIALS AND METHODS: In this study, we investigated the peptide segments that contain the citrullination sites at their centers, which were encoded into numeric digits from four aspects. Thus, we obtained a training set with 116 positive samples and 232 negative samples. Then, a reliable feature selection technique, called maximum-relevance-minimum-redundancy (mRMR), was applied to analyze these features, and four algorithms, including random forest (RF), Dagging, nearest neighbor algorithm (NNA), and support vector machine (SVM), together with the incremental feature selection (IFS) method, were adopted to extract important features. RESULTS: Finally, an optimal classifier derived from the RF algorithm was constructed to predict citrullination sites. The 44 most prominent features were comprehensively analyzed and their biological characteristics in citrullination catalysis were also revealed. CONCLUSION: We believe that the biological features obtained in this pioneering work will provide some useful insights into the formation and function of citrullination and that the optimal classifier could be a useful tool to identify citrullination sites in protein sequences.


Subjects
Algorithms; Amino Acid Sequence; Arginine/metabolism; Citrulline/metabolism; Protein Processing, Post-Translational; Binding Sites; Humans; Hydrolases/metabolism; Protein-Arginine Deiminases; Support Vector Machine
16.
Article in English | MEDLINE | ID: mdl-26552438

ABSTRACT

It is crucial to identify the molecular targets of a compound during the course of new drug discovery and drug development. Due to the complexity of biological systems, finding drug targets by biological experiments is very tedious and expensive. In this paper, we used chemical-chemical interactions in the STITCH database to construct a network of drug-drug association. Based on the network, a learning method keeping local and global consistency was presented to infer drug targets. We achieved an accuracy of 57.75% in the first order prediction using leave-one-out cross-validation, which was higher than the accuracy of 53.77% achieved by the local neighbor model. We manually validated 27 absent drug targets in the cross-validation using drug-target interactions from other databases. Applying the presented method to large-scale prediction of unknown targets, we manually confirmed 14 pairs of drug-target interactions among the newly predicted drug targets. These results suggested that the presented method was a promising tool for large-scale identification of drug targets.


Subjects
Databases, Chemical; Molecular Targeted Therapy; Pharmaceutical Preparations/chemistry; High-Throughput Screening Assays; Humans
17.
Comb Chem High Throughput Screen ; 19(2): 136-43, 2016.
Article in English | MEDLINE | ID: mdl-26552441

ABSTRACT

A metabolic pathway is a series of biological processes that provide the molecules and energy an organism needs and can be essential to its survival. Most metabolic pathways require the involvement of compounds, and given a compound it is helpful to know what types of metabolic pathways it participates in. In this study, compounds are first represented by molecular fragments, which are then delivered to a prediction engine called Sequential Minimal Optimization (SMO) for predictions. Maximum relevance and minimum redundancy (mRMR) and incremental feature selection are adopted to extract key features, based on which an optimal prediction engine is established. The proposed method is effective compared with random forest, Dagging, and a popular method that integrates chemical-chemical interactions and chemical-chemical similarities. We also make predictions for some compounds with unknown metabolic pathways and choose 17 compounds for analysis. The results indicate that the proposed method may become a useful tool for predicting and analyzing metabolic pathways.


Subjects
Algorithms; Metabolic Networks and Pathways; Organic Chemicals/metabolism; Computational Biology; Databases, Chemical; High-Throughput Screening Assays; Organic Chemicals/chemistry
18.
Biomed Res Int ; 2014: 438341, 2014.
Article in English | MEDLINE | ID: mdl-25184139

ABSTRACT

Protein S-nitrosylation plays a very important role in a wide variety of cellular biological activities. Accurate prediction of S-nitrosylation sites remains a great challenge. In this paper, we presented a framework to computationally predict S-nitrosylation sites based on kernel sparse representation classification and the minimum Redundancy Maximum Relevance algorithm. As many as 666 features derived from five categories of amino acid properties and one protein structure feature are used for numerical representation of proteins. A total of 529 protein sequences collected from open-access databases and the published literature are used to train and test our predictor. Computational results show that our predictor achieves Matthews correlation coefficients of 0.1634 and 0.2919 for the training set and the testing set, respectively, which are better than those of the k-nearest neighbor algorithm, random forest algorithm, and sparse representation classification algorithm. The experimental results also indicate that 134 optimal features can better represent the peptides of protein S-nitrosylation than the original 666 redundant features. Furthermore, we constructed an independent testing set of 113 protein sequences to evaluate the robustness of our predictor. Experimental results showed that our predictor also yielded good performance on the independent testing set, with a Matthews correlation coefficient of 0.2239.


Subjects
Algorithms; Computational Biology; Protein Processing, Post-Translational; Proteins/chemistry; Amino Acid Sequence; Amino Acids/chemistry; Amino Acids/genetics; Databases, Protein; Protein Structure, Tertiary; Proteins/genetics; Proteins/metabolism; Software
19.
J Biomed Inform ; 48: 130-6, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24486562

ABSTRACT

Extracting information from unstructured clinical narratives is valuable for many clinical applications. Although natural language processing (NLP) methods have been studied extensively in electronic medical records (EMR), few studies have explored NLP for extracting information from Chinese clinical narratives. In this study, we report the development and evaluation of a method for extracting tumor-related information from operation notes of hepatic carcinomas written in Chinese. Using 86 operation notes manually annotated by physicians as the training set, we explored both rule-based and supervised machine-learning approaches. Evaluated on 29 unseen operation notes, our best approach yielded 69.6% in precision, 58.3% in recall and 63.5% in F-score.


Subjects
Artificial Intelligence; Carcinoma/diagnosis; Liver Neoplasms/diagnosis; Natural Language Processing; Algorithms; Carcinoma/pathology; China; Computer Simulation; Computer Systems; Data Mining/methods; Electronic Health Records; Humans; Language; Liver Neoplasms/pathology; Medical Informatics/methods; Software
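
The three figures reported in this abstract are consistent with the balanced F-score, the harmonic mean of precision and recall. The short check below assumes that the balanced (F1) variant is what the authors report; the function name is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as quoted in the abstract.
reported = round(f_score(69.6, 58.3), 1)
```

Rounded to one decimal place, the result matches the 63.5% stated in the abstract.
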
20.
PLoS One ; 9(2): e87791, 2014.
Article in English | MEDLINE | ID: mdl-24498372

ABSTRACT

Cancer, which is a leading cause of death worldwide, places a heavy burden on health-care systems. In this study, an order-prediction model was built to predict a series of cancer drug indications based on chemical-chemical interactions. According to the confidence scores of their interactions, the order from the most likely cancer to the least likely one was obtained for each query drug. The first-order prediction accuracy on the training dataset was 55.93%, evaluated by the jackknife test, while it was 55.56% and 59.09% on a validation test dataset and an independent test dataset, respectively. The proposed method outperformed a popular method based on molecular descriptors. Moreover, it was verified that some drugs were effective for the 'wrong' predicted indications, indicating that some 'wrong' drug indications were actually correct indications. Given these promising results, the method may become a useful tool for the prediction of drug indications.


Subjects
Antineoplastic Agents/pharmacology; Drug Interactions; Informatics/methods; Models, Theoretical; Neoplasms/drug therapy; Humans
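
One plausible reading of the order-prediction idea in this abstract is sketched below: each cancer indication is scored by the strongest interaction confidence between the query drug and any known drug carrying that indication, and indications are then sorted by score. The scoring rule, the data layout, and all drug and cancer names are assumptions for illustration; the abstract does not specify them.

```python
def rank_indications(query, interactions, indications):
    """Order indications from most to least likely for the query drug.

    `interactions` maps frozenset({drug_a, drug_b}) to a confidence score;
    `indications` maps each known drug to its list of cancer indications.
    """
    scores = {}
    for drug, labels in indications.items():
        if drug == query:
            continue  # do not let the query drug vote for itself
        conf = interactions.get(frozenset((query, drug)), 0.0)
        for label in labels:
            scores[label] = max(scores.get(label, 0.0), conf)
    # Highest score first; alphabetical order breaks ties deterministically
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical toy data.
interactions = {
    frozenset(("Q", "D1")): 0.9,
    frozenset(("Q", "D2")): 0.4,
}
indications = {"D1": ["lung"], "D2": ["breast", "lung"], "D3": ["skin"]}
order = rank_indications("Q", interactions, indications)
```

The first element of the returned list corresponds to the first-order prediction whose accuracy the abstract reports.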