RESUMEN
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide a more user friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
Asunto(s)
Bases de Datos de Proteínas , Humanos , Secuencia de Aminoácidos , Inteligencia Artificial , Internet , Proteínas/química , Programas InformáticosRESUMEN
The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.
Asunto(s)
Colaboración de las Masas/métodos , Curaduría de Datos/métodos , Anotación de Secuencia Molecular/métodos , Secuencia de Aminoácidos/genética , Biología Computacional/métodos , Bases de Datos de Proteínas/tendencias , Humanos , Literatura , Proteínas/metabolismo , Participación de los InteresadosRESUMEN
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Asunto(s)
Bases de Datos de Proteínas , Proteínas/química , Secuencia de Aminoácidos , COVID-19/metabolismo , Internet , Anotación de Secuencia Molecular , Dominios Proteicos , Mapas de Interacción de Proteínas , SARS-CoV-2/metabolismo , Alineación de SecuenciaRESUMEN
BACKGROUND: Sustainable production of high-quality feedstock has been of great interest in bioenergy research. Despite the economic importance, high temperatures and water deficit are limiting factors for the successful cultivation of switchgrass in semi-arid areas. There are limited reports on the molecular basis of combined abiotic stress tolerance in switchgrass, particularly the combination of drought and heat stress. We used transcriptomic approaches to elucidate the changes in the response of switchgrass to drought and high temperature simultaneously. RESULTS: We conducted solely drought treatment in switchgrass plant Alamo AP13 by withholding water after 45 days of growing. For the combination of drought and heat effect, heat treatment (35 °C/25 °C day/night) was imposed after 72 h of the initiation of drought. Samples were collected at 0 h, 72 h, 96 h, 120 h, 144 h, and 168 h after treatment imposition, total RNA was extracted, and RNA-Seq conducted. Out of a total of 32,190 genes, we identified 3912, as drought (DT) responsive genes, 2339 and 4635 as, heat (HT) and drought and heat (DTHT) responsive genes, respectively. There were 209, 106, and 220 transcription factors (TFs) differentially expressed under DT, HT and DTHT respectively. Gene ontology annotation identified the metabolic process as the significant term enriched in DTHT genes. Other biological processes identified in DTHT responsive genes included: response to water, photosynthesis, oxidation-reduction processes, and response to stress. KEGG pathway enrichment analysis on DT and DTHT responsive genes revealed that TFs and genes controlling phenylpropanoid pathways were important for individual as well as combined stress response. For example, hydroxycinnamoyl-CoA shikimate/quinate hydroxycinnamoyl transferase (HCT) from the phenylpropanoid pathway was induced by single DT and combinations of DTHT stress. CONCLUSION: Through RNA-Seq analysis, we have identified unique and overlapping genes in response to DT and combined DTHT stress in switchgrass. The combination of DT and HT stress may affect the photosynthetic machinery and phenylpropanoid pathway of switchgrass which negatively impacts lignin synthesis and biomass production of switchgrass. The biological function of genes identified particularly in response to DTHT stress could further be confirmed by techniques such as single point mutation or RNAi.
Asunto(s)
Adaptación Fisiológica/genética , Deshidratación/genética , Respuesta al Choque Térmico/genética , Panicum/genética , Transcriptoma , Perfilación de la Expresión Génica , Regulación de la Expresión Génica de las Plantas , Genes de PlantasRESUMEN
SUMMARY: The global response to the COVID-19 pandemic has led to a rapid increase of scientific literature on this deadly disease. Extracting knowledge from biomedical literature and integrating it with relevant information from curated biological databases is essential to gain insight into COVID-19 etiology, diagnosis and treatment. We used Semantic Web technology RDF to integrate COVID-19 knowledge mined from literature by iTextMine, PubTator and SemRep with relevant biological databases and formalized the knowledge in a standardized and computable COVID-19 Knowledge Graph (KG). We published the COVID-19 KG via a SPARQL endpoint to support federated queries on the Semantic Web and developed a knowledge portal with browsing and searching interfaces. We also developed a RESTful API to support programmatic access and provided RDF dumps for download. AVAILABILITY AND IMPLEMENTATION: The COVID-19 Knowledge Graph is publicly available under CC-BY 4.0 license at https://research.bioinformatics.udel.edu/covid19kg/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
COVID-19 , Semántica , Humanos , Pandemias , Reconocimiento de Normas Patrones Automatizadas , Bases de Datos FactualesRESUMEN
Protein post-translational modifications (PTMs) play a pivotal role in numerous biological processes by modulating regulation of protein function. We have developed iPTMnet (http://proteininformationresource.org/iPTMnet) for PTM knowledge discovery, employing an integrative bioinformatics approach-combining text mining, data mining, and ontological representation to capture rich PTM information, including PTM enzyme-substrate-site relationships, PTM-specific protein-protein interactions (PPIs) and PTM conservation across species. iPTMnet encompasses data from (i) our PTM-focused text mining tools, RLIMS-P and eFIP, which extract phosphorylation information from full-scale mining of PubMed abstracts and full-length articles; (ii) a set of curated databases with experimentally observed PTMs; and iii) Protein Ontology that organizes proteins and PTM proteoforms, enabling their representation, annotation and comparison within and across species. Presently covering eight major PTM types (phosphorylation, ubiquitination, acetylation, methylation, glycosylation, S-nitrosylation, sumoylation and myristoylation), iPTMnet knowledgebase contains more than 654 500 unique PTM sites in over 62 100 proteins, along with more than 1200 PTM enzymes and over 24 300 PTM enzyme-substrate-site relations. The website supports online search, browsing, retrieval and visual analysis for scientific queries. Several examples, including functional interpretation of phosphoproteomic data, demonstrate iPTMnet as a gateway for visual exploration and systematic analysis of PTM networks and conservation, thereby enabling PTM discovery and hypothesis generation.
Asunto(s)
Bases de Datos de Proteínas , Bases del Conocimiento , Procesamiento Proteico-Postraduccional , Animales , Biología Computacional , Minería de Datos , Enzimas/metabolismo , Humanos , Internet , Fosforilación , Mapas de Interacción de Proteínas , Alineación de SecuenciaRESUMEN
The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.
Asunto(s)
Biología Computacional/métodos , Bases de Datos Genéticas , Proteínas , Animales , Humanos , Proteínas/química , Proteínas/genética , Navegador WebRESUMEN
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.
Asunto(s)
Biología Computacional/métodos , Bases de Datos de Proteínas , Dominios y Motivos de Interacción de Proteínas , Programas Informáticos , Humanos , Anotación de Secuencia Molecular , FilogeniaRESUMEN
BACKGROUND: Protein-protein interaction (PPI) prediction remains a central task in systems biology to achieve a better and holistic understanding of cellular and intracellular processes. Recently, an increasing number of computational methods have shifted from pair-wise prediction to network level prediction. Many of the existing network level methods predict PPIs under the assumption that the training network should be connected. However, this assumption greatly affects the prediction power and limits the application area because the current golden standard PPI networks are usually very sparse and disconnected. Therefore, how to effectively predict PPIs based on a training network that is sparse and disconnected remains a challenge. RESULTS: In this work, we developed a novel PPI prediction method based on deep learning neural network and regularized Laplacian kernel. We use a neural network with an autoencoder-like architecture to implicitly simulate the evolutionary processes of a PPI network. Neurons of the output layer correspond to proteins and are labeled with values (1 for interaction and 0 for otherwise) from the adjacency matrix of a sparse disconnected training PPI network. Unlike autoencoder, neurons at the input layer are given all zero input, reflecting an assumption of no a priori knowledge about PPIs, and hidden layers of smaller sizes mimic ancient interactome at different times during evolution. After the training step, an evolved PPI network whose rows are outputs of the neural network can be obtained. We then predict PPIs by applying the regularized Laplacian kernel to the transition matrix that is built upon the evolved PPI network. The results from cross-validation experiments show that the PPI prediction accuracies for yeast data and human data measured as AUC are increased by up to 8.4 and 14.9% respectively, as compared to the baseline. Moreover, the evolved PPI network can also help us leverage complementary information from the disconnected training network and multiple heterogeneous data sources. Tested by the yeast data with six heterogeneous feature kernels, the results show our method can further improve the prediction performance by up to 2%, which is very close to an upper bound that is obtained by an Approximate Bayesian Computation based sampling method. CONCLUSIONS: The proposed evolution deep neural network, coupled with regularized Laplacian kernel, is an effective tool in completing sparse and disconnected PPI networks and in facilitating integration of heterogeneous data sources.
Asunto(s)
Algoritmos , Aprendizaje Automático , Mapas de Interacción de Proteínas , Área Bajo la Curva , Teorema de Bayes , Humanos , Redes Neurales de la Computación , Proteínas/metabolismo , Curva ROC , Saccharomyces cerevisiae/metabolismoRESUMEN
BACKGROUND: Although hatching is perhaps the most abrupt and profound metabolic challenge that a chicken must undergo; there have been no attempts to functionally map the metabolic pathways induced in liver during the embryo-to-hatchling transition. Furthermore, we know very little about the metabolic and regulatory factors that regulate lipid metabolism in late embryos or newly-hatched chicks. In the present study, we examined hepatic transcriptomes of 12 embryos and 12 hatchling chicks during the peri-hatch period-or the metabolic switch from chorioallantoic to pulmonary respiration. RESULTS: Initial hierarchical clustering revealed two distinct, albeit opposing, patterns of hepatic gene expression. Cluster A genes are largely lipolytic and highly expressed in embryos. While, Cluster B genes are lipogenic/thermogenic and mainly controlled by the lipogenic transcription factor THRSPA. Using pairwise comparisons of embryo and hatchling ages, we found 1272 genes that were differentially expressed between embryos and hatchling chicks, including 24 transcription factors and 284 genes that regulate lipid metabolism. The three most differentially-expressed transcripts found in liver of embryos were MOGAT1, DIO3 and PDK4, whereas THRSPA, FASN and DIO2 were highest in hatchlings. An unusual finding was the "ectopic" and extremely high differentially expression of seven feather keratin transcripts in liver of 16 day embryos, which coincides with engorgement of liver with yolk lipids. Gene interaction networks show several transcription factors, transcriptional co-activators/co-inhibitors and their downstream genes that exert a 'ying-yang' action on lipid metabolism during the embryo-to-hatching transition. These upstream regulators include ligand-activated transcription factors, sirtuins and Kruppel-like factors. CONCLUSIONS: Our genome-wide transcriptional analysis has greatly expanded the hepatic repertoire of regulatory and metabolic genes involved in the embryo-to-hatchling transition. New knowledge was gained on interactive transcriptional networks and metabolic pathways that enable the abrupt switch from ectothermy (embryo) to endothermy (hatchling) in the chicken. Several transcription factors and their coactivators/co-inhibitors appear to exert opposing actions on lipid metabolism, leading to the predominance of lipolysis in embryos and lipogenesis in hatchlings. Our analysis of hepatic transcriptomes has enabled discovery of opposing, interconnected and interdependent transcriptional regulators that provide precise ying-yang or homeorhetic regulation of lipid metabolism during the critical embryo-to-hatchling transition.
Asunto(s)
Pollos/crecimiento & desarrollo , Pollos/metabolismo , Regulación del Desarrollo de la Expresión Génica , Hígado/metabolismo , Animales , Cruzamiento , Embrión de Pollo/crecimiento & desarrollo , Embrión de Pollo/metabolismo , Desarrollo Embrionario , Perfilación de la Expresión Génica , Secuenciación de Nucleótidos de Alto Rendimiento , Hígado/embriología , Hígado/crecimiento & desarrollo , TranscriptomaRESUMEN
The synthesis and processing of mRNA, from transcription to translation initiation, often requires splicing of intragenic material. The final mRNA composition varies based on proteins that modulate splice site selection. EWS-FLI1 is an Ewing sarcoma (ES) oncoprotein with an interactome that we demonstrate to have multiple partners in spliceosomal complexes. We evaluate the effect of EWS-FLI1 on posttranscriptional gene regulation using both exon array and RNA-seq. Genes that potentially regulate oncogenesis, including CLK1, CASP3, PPFIBP1, and TERT, validate as alternatively spliced by EWS-FLI1. In a CLIP-seq experiment, we find that EWS-FLI1 RNA-binding motifs most frequently occur adjacent to intron-exon boundaries. EWS-FLI1 also alters splicing by directly binding to known splicing factors including DDX5, hnRNP K, and PRPF6. Reduction of EWS-FLI1 produces an isoform of γ-TERT that has increased telomerase activity compared with wild-type (WT) TERT. The small molecule YK-4-279 is an inhibitor of EWS-FLI1 oncogenic function that disrupts specific protein interactions, including helicases DDX5 and RNA helicase A (RHA) that alters RNA-splicing ratios. As such, YK-4-279 validates the splicing mechanism of EWS-FLI1, showing alternatively spliced gene patterns that significantly overlap with EWS-FLI1 reduction and WT human mesenchymal stem cells (hMSC). Exon array analysis of 75 ES patient samples shows similar isoform expression patterns to cell line models expressing EWS-FLI1, supporting the clinical relevance of our findings. These experiments establish systemic alternative splicing as an oncogenic process modulated by EWS-FLI1. EWS-FLI1 modulation of mRNA splicing may provide insight into the contribution of splicing toward oncogenesis, and, reciprocally, EWS-FLI1 interactions with splicing proteins may inform the splicing code.
Asunto(s)
Empalme Alternativo/genética , Proteínas de Fusión Oncogénica/metabolismo , Proteína Proto-Oncogénica c-fli-1/metabolismo , Proteína EWS de Unión a ARN/metabolismo , Transducción de Señal/genética , Empalme Alternativo/efectos de los fármacos , Secuencia de Bases , Línea Celular Tumoral , Exones/genética , Humanos , Indoles , Intrones/genética , Proteínas de Fusión Oncogénica/genética , Unión Proteica/efectos de los fármacos , Isoformas de Proteínas/metabolismo , Proteína Proto-Oncogénica c-fli-1/genética , Procesamiento Postranscripcional del ARN/efectos de los fármacos , Procesamiento Postranscripcional del ARN/genética , ARN Mensajero/genética , ARN Mensajero/metabolismo , Proteína EWS de Unión a ARN/genética , Sarcoma de Ewing/genética , Sarcoma de Ewing/patología , Transducción de Señal/efectos de los fármacos , Empalmosomas/efectos de los fármacos , Empalmosomas/metabolismo , Telomerasa/metabolismoRESUMEN
BACKGROUND: The Gene Ontology (GO) is a resource that supplies information about gene product function using ontologies to represent biological knowledge. These ontologies cover three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). GO annotation is a process which assigns gene functional information using GO terms to relevant genes in the literature. It is a common task among the Model Organism Database (MOD) groups. Manual GO annotation relies on human curators assigning gene functional information using GO terms by reading the biomedical literature. This process is very time-consuming and labor-intensive. As a result, many MODs can afford to curate only a fraction of relevant articles. METHODS: GO terms from the CC domain can be essentially divided into two sub-hierarchies: subcellular location terms, and protein complex terms. We cast the task of gene annotation using GO terms from the CC domain as relation extraction between gene and other entities: (1) extract cases where a protein is found to be in a subcellular location, and (2) extract cases where a protein is a subunit of a protein complex. For each relation extraction task, we use an approach based on triggers and syntactic dependencies to extract the desired relations among entities. RESULTS: We tested our approach on the BC4GO test set, a publicly available corpus for GO annotation. Our approach obtains a F1-score of 71%, a precision of 91% and a recall of 58% for predicting GO terms from CC Domain for given genes. CONCLUSIONS: We have described a novel approach of treating gene annotation with GO terms from CC domain as two relation extraction subtasks. Evaluation results show that our approach achieves a F1-score of 71% for predicting GO terms for given genes. Thereby our approach can be used to accelerate the process of GO annotation for the bio-annotators.
Asunto(s)
Biología Computacional , Ontología de Genes , Anotación de Secuencia Molecular , Procesamiento de Lenguaje Natural , HumanosRESUMEN
MOTIVATION: The enormous number of redundant sequenced genomes has hindered efforts to analyze and functionally annotate proteins. As the taxonomy of viruses is not uniformly defined, viral proteomes pose special challenges in this regard. Grouping viruses based on the similarity of their proteins at proteome scale can normalize against potential taxonomic nomenclature anomalies. RESULTS: We present Viral Reference Proteomes (Viral RPs), which are computed from complete virus proteomes within UniProtKB. Viral RPs based on 95, 75, 55, 35 and 15% co-membership in proteome similarity based clusters are provided. Comparison of our computational Viral RPs with UniProt's curator-selected Reference Proteomes indicates that the two sets are consistent and complementary. Furthermore, each Viral RP represents a cluster of virus proteomes that was consistent with virus or host taxonomy. We provide BLASTP search and FTP download of Viral RP protein sequences, and a browser to facilitate the visualization of Viral RPs. AVAILABILITY AND IMPLEMENTATION: http://proteininformationresource.org/rps/viruses/ CONTACT: chenc@udel.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Bases de Datos de Proteínas , Proteoma/análisis , Proteínas Virales/análisis , Secuencia de Aminoácidos , Análisis por Conglomerados , Biología Computacional , Bases del ConocimientoRESUMEN
Protein-protein interactions play essential roles in many biological processes. Acquiring knowledge of the residue-residue contact information of two interacting proteins is not only helpful in annotating functions for proteins, but also critical for structure-based drug design. The prediction of the protein residue-residue contact matrix of the interfacial regions is challenging. In this work, we introduced deep learning techniques (specifically, stacked autoencoders) to build deep neural network models to tackled the residue-residue contact prediction problem. In tandem with interaction profile Hidden Markov Models, which was used first to extract Fisher score features from protein sequences, stacked autoencoders were deployed to extract and learn hidden abstract features. The deep learning model showed significant improvement over the traditional machine learning model, Support Vector Machines (SVM), with the overall accuracy increased by 15% from 65.40% to 80.82%. We showed that the stacked autoencoders could extract novel features, which can be utilized by deep neural networks and other classifiers to enhance learning, out of the Fisher score features. It is further shown that deep neural networks have significant advantages over SVM in making use of the newly extracted features.
Asunto(s)
Secuencia de Aminoácidos/genética , Biología Computacional/métodos , Mapeo de Interacción de Proteínas/métodos , Mapas de Interacción de Proteínas/genética , Aprendizaje Automático , Programas InformáticosRESUMEN
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36,766 member database signatures integrated into 26,238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
Asunto(s)
Bases de Datos de Proteínas , Proteínas/clasificación , Bacterias/metabolismo , Ontología de Genes , Estructura Terciaria de Proteína , Proteínas/genética , Análisis de Secuencia de Proteína , Programas InformáticosRESUMEN
MOTIVATION: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. RESULTS: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (â¼7 times shorter hit list before expansion), faster (â¼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforces its reliability for use in functional annotation.
Asunto(s)
Biología Computacional , Bases de Datos de Proteínas , Dioxigenasas/metabolismo , Proteínas de la Membrana/metabolismo , Proteínas/metabolismo , Análisis de Secuencia de Proteína , Programas Informáticos , Desmetilasa de ARN, Homólogo 5 de AlkB , Análisis por Conglomerados , Dioxigenasas/química , Dioxigenasas/genética , Ontología de Genes , Humanos , Almacenamiento y Recuperación de la Información , Proteínas de la Membrana/química , Proteínas de la Membrana/genética , Anotación de Secuencia Molecular , Proteínas/química , Proteínas/genéticaRESUMEN
MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes.
Asunto(s)
Biología Computacional/métodos , Minería de Datos/métodos , Genes/genética , MicroARNs/genética , Bases de Datos Genéticas , Humanos , MicroARNs/clasificación , Modelos Genéticos , Publicaciones Periódicas como AsuntoRESUMEN
The Protein Ontology (PRO; http://proconsortium.org) formally defines protein entities and explicitly represents their major forms and interrelations. Protein entities represented in PRO corresponding to single amino acid chains are categorized by level of specificity into family, gene, sequence and modification metaclasses, and there is a separate metaclass for protein complexes. All metaclasses also have organism-specific derivatives. PRO complements established sequence databases such as UniProtKB, and interoperates with other biomedical and biological ontologies such as the Gene Ontology (GO). PRO relates to UniProtKB in that PRO's organism-specific classes of proteins encoded by a specific gene correspond to entities documented in UniProtKB entries. PRO relates to the GO in that PRO's representations of organism-specific protein complexes are subclasses of the organism-agnostic protein complex terms in the GO Cellular Component Ontology. The past few years have seen growth and changes to the PRO, as well as new points of access to the data and new applications of PRO in immunology and proteomics. Here we describe some of these developments.
Asunto(s)
Ontologías Biológicas , Bases de Datos de Proteínas , Proteínas/clasificación , Animales , Humanos , Internet , Ratones , Proteínas/químicaRESUMEN
BACKGROUND: Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large amount of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task. RESULTS: A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations. CONCLUSIONS: In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.
Asunto(s)
Investigación Biomédica/métodos , Minería de Datos/métodos , Reconocimiento de Normas Patrones Automatizadas/métodos , Lenguaje , Publicaciones , Factores de TiempoRESUMEN
MOTIVATION: Prediction of protein-protein interaction has become an important part of systems biology in reverse engineering the biological networks for better understanding the molecular biology of the cell. Although significant progress has been made in terms of prediction accuracy, most computational methods only predict whether two proteins interact but not their interacting residues-the information that can be very valuable for understanding the interaction mechanisms and designing modulation of the interaction. In this work, we developed a computational method to predict the interacting residue pairs-contact matrix for interacting protein domains, whose rows and columns correspond to the residues in the two interacting domains respectively and whose values (1 or 0) indicate whether the corresponding residues (do or do not) interact. RESULTS: Our method is based on supervised learning using support vector machines. For each domain involved in a given domain-domain interaction (DDI), an interaction profile hidden Markov model (ipHMM) is first built for the domain family, and then each residue position for a member domain sequence is represented as a 20-dimension vector of Fisher scores, characterizing how similar it is as compared with the family profile at that position. Each element of the contact matrix for a sequence pair is now represented by a feature vector from concatenating the vectors of the two corresponding residues, and the task is to predict the element value (1 or 0) from the feature vector. A support vector machine is trained for a given DDI, using either a consensus contact matrix or contact matrices for individual sequence pairs, and is tested by leave-one-out cross validation. The performance averaged over a set of 115 DDIs collected from the 3 DID database shows significant improvement (sensitivity up to 85%, and specificity up to 85%), as compared with a multiple sequence alignment-based method (sensitivity 57%, and specificity 78%) previously reported in the literature. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.