Results 1 - 20 of 63
1.
J Med Syst ; 43(8): 235, 2019 Jun 17.
Article in English | MEDLINE | ID: mdl-31209677

ABSTRACT

Cancer is a deadly disease which requires a very complex and costly treatment. Microarray data classification plays an important role in cancer treatment. An efficient gene selection technique to select the more promising genes is necessary for cancer classification. Here, we propose a Two-stage MI-GA Gene Selection algorithm for selecting informative genes in cancer data classification. In the first stage, Mutual Information based gene selection is applied which selects only the genes that have high information related to the cancer. The genes which have high mutual information value are given as input to the second stage. The Genetic Algorithm based gene selection is applied in the second stage to identify and select the optimal set of genes required for accurate classification. For classification, Support Vector Machine (SVM) is used. The proposed MI-GA gene selection approach is applied to Colon, Lung and Ovarian cancer datasets and the results show that the proposed gene selection approach results in higher classification accuracy compared to the existing methods.


Subjects
Algorithms , Databases, Genetic/classification , Gene Expression Profiling , Neoplasms/genetics , Data Mining , Humans , Microarray Analysis , Support Vector Machine
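
The first (filter) stage described in this abstract can be illustrated with a minimal sketch. It assumes expression values have already been discretized into levels such as "lo" and "hi"; the gene names, the data, and the helper names `mutual_information` and `top_genes_by_mi` are hypothetical and are not taken from the paper. The genetic-algorithm stage and the SVM classifier are not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def top_genes_by_mi(expr, labels, k):
    """Rank genes by mutual information with the class label and keep the top k.

    expr maps a gene name to its list of discretized levels, one per sample.
    """
    scores = {gene: mutual_information(levels, labels) for gene, levels in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up toy data: four samples, two classes.
labels = [0, 0, 1, 1]
expr = {
    "geneA": ["lo", "lo", "hi", "hi"],  # tracks the class exactly
    "geneB": ["lo", "hi", "lo", "hi"],  # independent of the class
    "geneC": ["lo", "lo", "lo", "hi"],  # partially informative
}
selected = top_genes_by_mi(expr, labels, 2)
print(selected)
```

In a two-stage scheme of this kind, the genes kept by the filter would then form the search space for the second-stage optimizer.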
2.
Brief Bioinform ; 14(1): 13-26, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22408190

ABSTRACT

A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered. The four classifiers include three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machines (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte-Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with a feature selection can perform well without using the ensemble correction.


Subjects
Computational Biology/methods , Databases, Genetic/classification , Algorithms , Discriminant Analysis , Female , Humans , Support Vector Machine
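
The general idea behind threshold adjustment can be sketched as follows. This is not the SVM-THR procedure evaluated in the article, whose details the abstract does not give; it is a generic illustration that assumes a classifier has already produced real-valued decision scores, and it picks the cut-off that maximizes balanced accuracy instead of using the default of zero. The scores, labels, and function names are made up.

```python
def balanced_accuracy(scores, labels, threshold):
    """Mean of sensitivity and specificity when predicting class 1 iff score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)

def best_threshold(scores, labels):
    """Try every observed score as a cut-off and keep the best one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: balanced_accuracy(scores, labels, t))

# Made-up decision scores: eight majority samples (0) and two minority samples (1).
scores = [-2.1, -1.8, -1.5, -1.2, -1.0, -0.9, -0.7, -0.6, -0.4, 0.3]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

default_ba = balanced_accuracy(scores, labels, 0.0)  # default cut-off misses one minority sample
t = best_threshold(scores, labels)
adjusted_ba = balanced_accuracy(scores, labels, t)
print(default_ba, t, adjusted_ba)
```

In practice the threshold would be tuned on held-out data rather than on the samples used for evaluation, otherwise the adjusted accuracy is optimistic.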
3.
Soc Stud Sci ; 42(2): 214-36, 2012 Apr.
Article in English | MEDLINE | ID: mdl-22848998

ABSTRACT

Cross-species comparison has long been regarded as a stepping-stone for medical research, enabling the discovery and testing of prospective treatments before they undergo clinical trial on humans. Post-genomic medicine has made cross-species comparison crucial in another respect: the 'community databases' developed to collect and disseminate data on model organisms are now often used as a template for the dissemination of data on humans and as a tool for comparing results of medical significance across the human-animal boundary. This paper identifies and discusses four key problems encountered by database curators when integrating human and non-human data within the same database: (1) picking criteria for what counts as reliable evidence, (2) selecting metadata, (3) standardising and describing research materials and (4) choosing nomenclature to classify data. An analysis of these hurdles reveals epistemic disagreement and controversies underlying cross-species comparisons, which in turn highlight important differences in the experimental cultures of biologists and clinicians trying to make sense of these data. By considering database development through the eyes of curators, this study casts new light on the complex conjunctions of biological and clinical practice, model organisms and human subjects, and material and virtual sources of evidence--thus emphasizing the fragmented, localized and inherently translational nature of biomedicine.


Subjects
Biomedical Research , Databases, Genetic , Information Storage and Retrieval , Models, Animal , Species Specificity , Animals , Databases, Genetic/classification , Databases, Genetic/standards , Genomics , Humans , Internet , Quality Control , Reference Standards , Terminology as Topic
4.
Methods Mol Biol ; 2284: 457-466, 2021.
Article in English | MEDLINE | ID: mdl-33835457

ABSTRACT

Circular RNA (or circRNA) is a type of single-stranded, covalently closed circular RNA molecule that plays important roles in diverse biological pathways. A comprehensive, functionally annotated circRNA database will help to understand circRNAs and their functions. CircFunBase is such a web-accessible database that aims to provide a high-quality functional circRNA resource including experimentally validated and computationally predicted functions. CircFunBase provides visualized circRNA-miRNA interaction networks. In addition, a genome browser is provided to visualize the genome context of circRNA. In this chapter, we illustrate examples of searching for circRNA and getting detailed information about circRNA. Moreover, other circRNA-related databases are outlined.


Subjects
Computational Biology/methods , Databases, Genetic/supply & distribution , RNA, Circular/analysis , Data Analysis , Databases, Genetic/classification , Disease/genetics , Gene Regulatory Networks , Humans , RNA, Circular/genetics , RNA, Circular/physiology , Software
5.
BMC Bioinformatics ; 11: 530, 2010 Oct 25.
Article in English | MEDLINE | ID: mdl-20973947

ABSTRACT

BACKGROUND: The Gene Ontology project supports categorization of gene products according to their location of action, the molecular functions that they carry out, and the processes that they are involved in. Although the ontologies are intentionally developed to be taxon neutral, and to cover all species, there are inherent taxon specificities in some branches. For example, the process 'lactation' is specific to mammals and the location 'mitochondrion' is specific to eukaryotes. The lack of an explicit formalization of these constraints can lead to errors and inconsistencies in automated and manual annotation. RESULTS: We have formalized the taxonomic constraints implicit in some GO classes, and specified these at various levels in the ontology. We have also developed an inference system that can be used to check for violations of these constraints in annotations. Using the constraints in conjunction with the inference system, we have detected and removed errors in annotations and improved the structure of the ontology. CONCLUSIONS: Detection of inconsistencies in taxon-specificity enables gradual improvement of the ontologies, the annotations, and the formalized constraints. This is progressively improving the quality of our data. The full system is available for download, and new constraints or proposed changes to constraints can be submitted online at https://sourceforge.net/tracker/?atid=605890&group_id=36855.


Subjects
Classification/methods , Molecular Sequence Annotation/methods , Databases, Genetic/classification , Databases, Protein/classification , Terminology as Topic , Vocabulary, Controlled
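
A constraint check of the kind described in this abstract can be sketched in a few lines. Everything below is a hand-written toy: the taxonomy fragment, the two "only in taxon" constraints (taken from the abstract's own lactation and mitochondrion examples), and the gene names are illustrative, and the names `TAXON_PARENT`, `ONLY_IN_TAXON`, `lineage`, and `violations` are hypothetical. The sketch also omits inheritance of constraints by more specific ontology terms, which a real inference system would need.

```python
# Toy taxonomy: each taxon points to its parent.
TAXON_PARENT = {
    "Homo sapiens": "Mammalia",
    "Mammalia": "Eukaryota",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Escherichia coli": "Bacteria",
}

# Toy constraints: the term may only be used for organisms inside the given taxon.
ONLY_IN_TAXON = {
    "lactation": "Mammalia",
    "mitochondrion": "Eukaryota",
}

def lineage(taxon):
    """Return the taxon followed by all of its ancestors."""
    out = [taxon]
    while taxon in TAXON_PARENT:
        taxon = TAXON_PARENT[taxon]
        out.append(taxon)
    return out

def violations(annotations):
    """Return the (gene, taxon, term) annotations that break a taxon constraint."""
    bad = []
    for gene, taxon, term in annotations:
        required = ONLY_IN_TAXON.get(term)
        if required is not None and required not in lineage(taxon):
            bad.append((gene, taxon, term))
    return bad

annotations = [
    ("geneX", "Homo sapiens", "lactation"),
    ("geneY", "Saccharomyces cerevisiae", "lactation"),
    ("geneZ", "Escherichia coli", "mitochondrion"),
    ("geneW", "Saccharomyces cerevisiae", "mitochondrion"),
]
flagged = violations(annotations)
print(flagged)
```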
6.
Bioinformatics ; 25(12): i63-8, 2009 Jun 15.
Article in English | MEDLINE | ID: mdl-19478018

ABSTRACT

Subjective methods have been reported to adapt a general-purpose ontology for a specific application. For example, Gene Ontology (GO) Slim was created from GO to generate a highly aggregated report of the human-genome annotation. We propose statistical methods to adapt the general purpose, OBO Foundry Disease Ontology (DO) for the identification of gene-disease associations. Thus, we need a simplified definition of disease categories derived from implicated genes. On the basis of the assumption that the DO terms having similar associated genes are closely related, we group the DO terms based on the similarity of gene-to-DO mapping profiles. Two types of binary distance metrics are defined to measure the overall and subset similarity between DO terms. A compactness-scalable fuzzy clustering method is then applied to group similar DO terms. To reduce false clustering, the semantic similarities between DO terms are also used to constrain clustering results. As such, the DO terms are aggregated and the redundant DO terms are largely removed. Using these methods, we constructed a simplified vocabulary list from the DO called Disease Ontology Lite (DOLite). We demonstrated that DOLite results in more interpretable results than DO for gene-disease association tests. The resultant DOLite has been used in the Functional Disease Ontology (FunDO) Web application at http://www.projects.bioinformatics.northwestern.edu/fundo.


Subjects
Computational Biology/methods , Disease/genetics , Vocabulary, Controlled , Data Interpretation, Statistical , Database Management Systems , Databases, Genetic/classification , Genome , Terminology as Topic
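
The abstract does not define its two binary distance metrics, so the sketch below substitutes two common choices as an assumption: Jaccard distance for overall similarity and an overlap-coefficient distance for subset similarity. The term names and gene identifiers are placeholders, not real gene-disease associations.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|: small only when the two gene sets largely coincide."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def subset_distance(a, b):
    """1 - |A & B| / min(|A|, |B|): zero when one gene set is contained in the other."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 1.0
    return 1.0 - len(a & b) / smaller

# Placeholder gene-to-term mapping profiles.
broad_term = {"g1", "g2", "g3", "g4"}
narrow_term = {"g1", "g2"}
unrelated_term = {"g7", "g8"}

print(jaccard_distance(broad_term, narrow_term))    # 0.5
print(subset_distance(broad_term, narrow_term))     # 0.0
print(jaccard_distance(broad_term, unrelated_term)) # 1.0
```

Keeping both measures lets a clustering step distinguish terms that are near-duplicates from terms where one is a specialization of the other.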
7.
Bioinformatics ; 25(11): 1412-8, 2009 Jun 01.
Article in English | MEDLINE | ID: mdl-19376821

ABSTRACT

MOTIVATION: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems. RESULTS: We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language and a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. A KNN system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone. CONCLUSIONS: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable with those observed for manual annotations.


Subjects
Computational Biology/methods , Information Storage and Retrieval/methods , Medical Subject Headings , Database Management Systems/classification , Databases, Genetic/classification , Information Storage and Retrieval/classification , Vocabulary, Controlled
8.
J Biomed Inform ; 43(1): 81-7, 2010 Feb.
Article in English | MEDLINE | ID: mdl-19699316

ABSTRACT

Selecting relevant and discriminative genes for sample classification is a common and critical task in gene expression analysis (e.g. disease diagnosis). It is desirable that gene selection effectively improve the classification performance of the learning algorithm. In general, most widely used gene selection methods choose a single gene subset according to its discriminative power. One deficiency of a single gene subset is that its contribution to classification is limited. This issue can be alleviated to some extent by ensemble gene selection based on random selection. However, the random approach requires an unnecessarily large number of candidate gene subsets, and its reliability is a problem. In this study, we propose a new ensemble method, called ensemble gene selection by grouping (EGSG), to select multiple gene subsets for classification. Rather than selecting randomly, our method chooses salient gene subsets from microarray data by virtue of information theory and the approximate Markov blanket. The effectiveness and accuracy of our method are validated by experiments on five publicly available microarray data sets. The experimental results show that our ensemble gene selection method has classification performance comparable to that of other gene selection methods and is more stable than the random one.


Subjects
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Artificial Intelligence , Cell Line, Tumor , Computational Biology/methods , Computer Simulation , Databases, Genetic/classification , Electronic Data Processing , Gene Expression Regulation, Neoplastic , Humans , Markov Chains , Models, Statistical , Pattern Recognition, Automated/classification , Pattern Recognition, Automated/methods , Reproducibility of Results
9.
J Biomed Sci ; 16: 25, 2009 Feb 24.
Article in English | MEDLINE | ID: mdl-19272192

ABSTRACT

Different microarray techniques have recently been used successfully to investigate useful information for cancer diagnosis at the gene expression level, owing to their ability to measure thousands of gene expression levels in a massively parallel way. One important issue is to improve the classification performance of microarray data. However, it would be ideal if influential genes and even interpretable rules could be explored at the same time to offer biological insight. Introducing the concepts of system design in software engineering, this paper has presented an integrated and effective method (named X-AI) for accurate cancer classification and the acquisition of knowledge from DNA microarray data. This method included a feature selector to systematically extract the relatively important genes so as to reduce the dimension and retain as much of the class discriminatory information as possible. Next, diagonal quadratic discriminant analysis (DQDA) was combined to classify tumors, and generalized rule induction (GRI) was integrated to establish association rules which can give an understanding of the relationships between cancer classes and related genes. Two non-redundant datasets of acute leukemia were used to validate the proposed X-AI, showing significantly high accuracy for discriminating different classes. On the other hand, I have presented the abilities of X-AI to extract relevant genes, as well as to develop interpretable rules. Further, a web server has been established for cancer classification and it is freely available at http://bioinformatics.myweb.hinet.net/xai.htm.


Subjects
Leukemia , Oligonucleotide Array Sequence Analysis , Software , Algorithms , Artificial Intelligence , Computational Biology/methods , Databases, Genetic/classification , Gene Expression Profiling , Humans , Internet , Leukemia/classification , Leukemia/genetics , Molecular Sequence Data , Reproducibility of Results
10.
Arch Virol ; 154(7): 1181-8, 2009.
Article in English | MEDLINE | ID: mdl-19495937

ABSTRACT

In accordance with the Statutes of the International Committee of Taxonomy of Viruses (ICTV), the final stage in the process of making changes to the Universal Scheme of Virus Classification is the ratification of taxonomic proposals by ICTV Members. This can occur either at a Plenary meeting of ICTV, held during an International Congress of Virology meeting, or by circulation of proposals by mail followed by a ballot. Therefore, a list of proposals that had been subjected to the full, multi-stage review process was prepared and presented on the ICTVonline web pages in March 2008. This review process involved input from the ICTV Study Groups and Subcommittees, other interested virologists, and the ICTV Executive Committee. For the first time, the ratification process was performed entirely by email. The proposals were sent electronically via email on 18 March 2008 to ICTV Life Members (11), ICTV Subcommittee Members (74), and ICTV National Representatives (53).


Subjects
Classification/methods , Terminology as Topic , Viruses/classification , Databases, Genetic/classification , International Cooperation , Phylogeny , Societies, Scientific , Viruses/genetics
11.
IEEE J Biomed Health Inform ; 23(4): 1805-1815, 2019 07.
Article in English | MEDLINE | ID: mdl-31283472

ABSTRACT

The discovery of disease-causing genes is a critical step towards understanding the nature of a disease and determining a possible cure for it. In recent years, many computational methods to identify disease genes have been proposed. However, making full use of disease-related (e.g., symptoms) and gene-related (e.g., gene ontology and protein-protein interactions) information to improve the performance of disease gene prediction is still an issue. Here, we develop a heterogeneous disease-gene-related network (HDGN) embedding representation framework for disease gene prediction (called HerGePred). Based on this framework, a low-dimensional vector representation (LVR) of the nodes in the HDGN can be obtained. Then, we propose two specific algorithms, namely, an LVR-based similarity prediction and a random walk with restart on a reconstructed heterogeneous disease-gene network (RW-RDGN), to predict disease genes with high performance. First, to validate the rationality of the framework, we analyze the similarity-based overlap distribution of disease pairs and design an experiment for disease-gene association recovery, the results of which revealed that the LVR of nodes performs well at preserving the local and global network structure of the HDGN. Then, we apply tenfold cross validation and external validation to compare our methods with other well-known disease gene prediction algorithms. The experimental results show that the RW-RDGN performs better than the state-of-the-art algorithm. The prediction results of disease candidate genes are essential for molecular mechanism investigation and experimental validation. The source codes of HerGePred and experimental data are available at https://github.com/yangkuoone/HerGePred.


Subjects
Computational Biology/methods , Databases, Genetic/classification , Disease/genetics , Machine Learning , Algorithms , Humans , Models, Statistical
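
Random walk with restart, the propagation scheme named in this abstract, can be sketched on a small graph. This is a plain version of the technique on a made-up, unweighted network; it does not reproduce the reconstructed heterogeneous network or the embedding step of HerGePred. The node names, the restart probability of 0.7, and the function name are assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until the change is below tol.

    adj maps each node to a list of neighbours; W spreads a node's probability
    evenly over its neighbours. p0 is uniform over the seed nodes.
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            if adj[u]:
                share = (1.0 - restart) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Made-up undirected network: disease d1 linked to genes g1 and g2, g1 linked to g3;
# g4 and g5 form a separate component.
adj = {
    "d1": ["g1", "g2"],
    "g1": ["d1", "g3"],
    "g2": ["d1"],
    "g3": ["g1"],
    "g4": ["g5"],
    "g5": ["g4"],
}
p = random_walk_with_restart(adj, seeds={"d1"})
genes = ["g1", "g2", "g3", "g4", "g5"]
ranked = sorted(genes, key=lambda g: -p[g])
print(ranked)
```

Genes closer to the seed disease receive more probability mass, and genes in a disconnected component receive none, which is the behaviour a candidate-ranking step relies on.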
12.
Article in English | MEDLINE | ID: mdl-29990202

ABSTRACT

Analyzing disease data from the view of combinatorial features may better characterize the disease phenotype. In this study, a novel method is proposed to construct feature combinations and a classification model (CFC-CM) by mining key feature relationships. CFC-CM iteratively tests for differences in the feature relationship between different groups. To do this, it uses a modified $k$-top-scoring pair (M-$k$-TSP) algorithm and then selects the most discriminative feature pairs in the current feature set to infer the combinatorial features and build the classification model. Compared with support vector machines, random forests, least absolute shrinkage and selection operator, elastic net, and M-$k$-TSP, the superior performance of CFC-CM on nine public gene expression datasets validates its potential for more precise identification of complex diseases. Subsequently, CFC-CM was applied to two metabolomics datasets, where it obtained accuracy rates of $88.73\pm 2.06\%$ and $79.11\pm 2.70\%$ in distinguishing between hepatocellular carcinoma and hepatic cirrhosis groups and between acute kidney injury (AKI) and non-AKI samples, results superior to those of the other five methods. In summary, the better results of CFC-CM show that, in contrast to single molecules and combinations constituted by just two features, combinations inferred from an appropriate number of features could better identify complex diseases.


Subjects
Computational Biology/methods , Diagnosis, Computer-Assisted/methods , Metabolome , Metabolomics/methods , Algorithms , Databases, Genetic/classification , Humans , Kidney Diseases/diagnosis , Liver Diseases/diagnosis , Metabolome/genetics , Metabolome/physiology , Support Vector Machine
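
The pair score at the heart of top-scoring-pair methods can be sketched as follows. This shows the basic score only, the absolute difference between classes in how often one feature is lower than another; it is not the modified M-$k$-TSP algorithm or the CFC-CM procedure from the abstract, and it ignores tie-breaking. The feature names and values are made up.

```python
from itertools import combinations

def pair_score(xi, xj, labels):
    """|P(xi < xj | class 0) - P(xi < xj | class 1)| estimated from the samples."""
    def frac(cls):
        idx = [s for s, y in enumerate(labels) if y == cls]
        return sum(1 for s in idx if xi[s] < xj[s]) / len(idx)
    return abs(frac(0) - frac(1))

def top_scoring_pairs(data, labels, k):
    """Score every feature pair and return the k best as (score, feature_a, feature_b)."""
    scored = [
        (pair_score(data[a], data[b], labels), a, b)
        for a, b in combinations(sorted(data), 2)
    ]
    scored.sort(key=lambda t: -t[0])
    return scored[:k]

# Made-up measurements: three features, six samples, two classes.
labels = [0, 0, 0, 1, 1, 1]
data = {
    "f1": [1, 2, 1, 5, 6, 5],
    "f2": [3, 4, 3, 2, 1, 2],
    "f3": [2, 1, 2, 1, 7, 3],
}
best = top_scoring_pairs(data, labels, 2)
print(best)
```

Because the score depends only on the ordering of two features within a sample, it is insensitive to monotone rescaling of the measurements, which is one reason pair-based rules are used on expression data.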
13.
Stud Health Technol Inform ; 136: 863-8, 2008.
Article in English | MEDLINE | ID: mdl-18487840

ABSTRACT

The application of upper ontologies has been repeatedly advocated for supporting interoperability between domain ontologies in order to facilitate shared data use both within and across disciplines. We have developed BioTop as a top-domain ontology to integrate more specialized ontologies in the biomolecular and biomedical domain. In this paper, we report on concrete integration problems of this ontology with the domain-independent Basic Formal Ontology (BFO) concerning the issue of fiat and aggregated objects in the context of different granularity levels. We conclude that the third BFO level must be ignored in order not to obviate cross-granularity integration.


Subjects
Information Storage and Retrieval , Medical Informatics Computing/classification , Systems Integration , Unified Medical Language System , Vocabulary, Controlled , Databases, Genetic/classification , Humans , Medical Informatics Applications , Programming Languages , User-Computer Interface
14.
IEEE J Biomed Health Inform ; 22(5): 1619-1629, 2018 09.
Article in English | MEDLINE | ID: mdl-29990162

ABSTRACT

One of the main challenges in modern medicine is to stratify patients for personalized care. Many different clustering methods have been proposed to solve the problem in both quantitative and biologically meaningful manners. However, existing clustering algorithms suffer from numerous restrictions such as experimental noise, high dimensionality, and poor interpretability. To overcome those limitations altogether, we propose and formulate a multiobjective framework based on evolutionary multiobjective optimization to balance the feature relevance and redundancy for patient stratification. To demonstrate the effectiveness of our proposed algorithms, we benchmark them across 55 synthetic datasets based on a real human transcription regulation network model, 35 real cancer gene expression datasets, and two case studies. Experimental results suggest that the proposed algorithms perform better than recent state-of-the-art methods. In addition, time complexity analysis, convergence analysis, and parameter analysis are conducted to demonstrate the robustness of the proposed methods from different perspectives. Finally, t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied to project the selected feature subsets onto two or three dimensions to visualize the high-dimensional patient stratification data.


Subjects
Databases, Genetic/classification , Electronic Health Records/classification , Medical Informatics/methods , Precision Medicine/methods , Algorithms , Cluster Analysis , Humans , Transcriptome
15.
BMC Bioinformatics ; 8: 144, 2007 May 02.
Article in English | MEDLINE | ID: mdl-17474999

ABSTRACT

BACKGROUND: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared, poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes as recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. RESULTS: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) are used to remove genes based on their individual discriminant weights. CONCLUSION: SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. 
SVM-RCE identifies clusters of correlated genes that when considered together provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.


Subjects
Databases, Genetic/classification , Gene Expression Profiling/classification , Gene Expression Regulation, Neoplastic/genetics , Multigene Family/genetics , Databases, Genetic/statistics & numerical data , Gene Expression/genetics , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Head and Neck Neoplasms/genetics , Humans , Male , Prostatic Neoplasms/genetics
16.
BMC Bioinformatics ; 8: 142, 2007 May 01.
Article in English | MEDLINE | ID: mdl-17472750

ABSTRACT

BACKGROUND: Pre-processing methods for two-sample long oligonucleotide arrays, specifically the Agilent technology, have not been extensively studied. The goal of this study is to quantify some of the sources of error that affect measurement of expression using Agilent arrays and to compare Agilent's Feature Extraction software with pre-processing methods that have become the standard for normalization of cDNA arrays. These include log transformation followed by loess normalization with or without background subtraction and often a between array scale normalization procedure. The larger goal is to define best study design and pre-processing practices for Agilent arrays, and we offer some suggestions. RESULTS: Simple loess normalization without background subtraction produced the lowest variability. However, without background subtraction, fold changes were biased towards zero, particularly at low intensities. ROC analysis of a spike-in experiment showed that differentially expressed genes are most reliably detected when background is not subtracted. Loess normalization and no background subtraction yielded an AUC of 99.7% compared with 88.8% for Agilent processed fold changes. All methods performed well when error was taken into account by t- or z-statistics, AUCs ≥ 99.8%. A substantial proportion of genes showed dye effects, 43% (99% CI: 39%, 47%). However, these effects were generally small regardless of the pre-processing method. CONCLUSION: Simple loess normalization without background subtraction resulted in low variance fold changes that more reliably ranked gene expression than the other methods. While t-statistics and other measures that take variation into account, including Agilent's z-statistic, can also be used to reliably select differentially expressed genes, fold changes are a standard measure of differential expression for exploratory work, cross platform comparison, and biological interpretation and cannot be entirely replaced. 
Although dye effects are small for most genes, many array features are affected. Therefore, an experimental design that incorporates dye swaps or a common reference could be valuable.


Subjects
Oligonucleotide Array Sequence Analysis/methods , Animals , Cell Line, Tumor , Databases, Genetic/classification , Databases, Genetic/statistics & numerical data , Dogs , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Humans , Mice , Oligonucleotide Array Sequence Analysis/statistics & numerical data
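
The observation that log fold changes shrink towards zero when background is left in, especially at low intensities, follows from simple arithmetic and can be checked with a short sketch. The spot intensities and the background level of 100 are made-up numbers, and `log_ratio` is a hypothetical helper; loess normalization itself is not shown.

```python
import math

def log_ratio(red, green, bg_red=0.0, bg_green=0.0):
    """log2 fold change between two channels after optional background subtraction."""
    return math.log2(red - bg_red) - math.log2(green - bg_green)

BACKGROUND = 100.0  # assumed additive background in both channels

# A true 4-fold change (log2 = 2) observed at low and at high intensity.
low_red, low_green = 400.0 + BACKGROUND, 100.0 + BACKGROUND
high_red, high_green = 4000.0 + BACKGROUND, 1000.0 + BACKGROUND

low_raw = log_ratio(low_red, low_green)      # about 1.32, strongly shrunk
high_raw = log_ratio(high_red, high_green)   # about 1.90, mildly shrunk
low_corrected = log_ratio(low_red, low_green, BACKGROUND, BACKGROUND)  # about 2.0
print(low_raw, high_raw, low_corrected)
```

The sketch shows only the bias side of the trade-off; the abstract's point is that skipping subtraction also lowers variability, which is why the uncorrected values ranked genes more reliably.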
17.
BMC Bioinformatics ; 8: 243, 2007 Jul 10.
Article in English | MEDLINE | ID: mdl-17620146

ABSTRACT

BACKGROUND: Uncovering cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. Protein functions, often referred to as its annotations, are believed to manifest themselves through topology of the networks of inter-proteins interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having the general biological significance, such demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interactions datasets. RESULTS: We developed a method for automatic extraction of protein functional annotation from scientific text based on the Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly denser linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. 
An increase in the number and size of GO groups without any noticeable decrease of the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller. CONCLUSION: Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity.


Subjects
Computational Biology/methods , Databases, Genetic/classification , Genes , Pattern Recognition, Automated/methods , Proteins/physiology , Cluster Analysis , Computational Biology/standards , Databases, Genetic/standards , Databases, Protein , Information Storage and Retrieval , Natural Language Processing , Pattern Recognition, Automated/standards , Protein Interaction Mapping , PubMed , Reproducibility of Results , Terminology as Topic
18.
BMC Bioinformatics ; 8: 87, 2007 Mar 12.
Article in English | MEDLINE | ID: mdl-17349060

ABSTRACT

BACKGROUND: To interpret microarray experiments, several ontological analysis tools have been developed. However, current tools are limited to specific organisms. RESULTS: We developed a bioinformatics system to assign the probe set sequences of any organism to a hierarchical functional classification modelled on KEGG ontology. The GeneBins database currently supports the functional classification of expression data from four Affymetrix arrays: Arabidopsis thaliana, Oryza sativa, Glycine max and Medicago truncatula. An online analysis tool to identify relevant functions is also provided. CONCLUSION: GeneBins provides resources to interpret gene expression results from microarray experiments. It is available at http://bioinfoserver.rsbs.anu.edu.au/utils/GeneBins/


Subjects
Databases, Genetic , Gene Expression Profiling/methods , Genome, Plant/genetics , Oligonucleotide Array Sequence Analysis , Computational Biology/methods , Databases, Genetic/classification , Gene Expression Profiling/classification , Genes, Plant/genetics , Oligonucleotide Array Sequence Analysis/classification , Oligonucleotide Array Sequence Analysis/methods , Plants/classification , Plants/genetics , Software