Search | Biblioteca Virtual em Saúde (Virtual Health Library)

Unsupervised learning and natural language processing highlight research trends in a superbug.

Méndez-Cruz, Carlos-Francisco; Rodríguez-Herrera, Joel; Varela-Vega, Alfredo; Mateo-Estrada, Valeria; Castillo-Ramírez, Santiago.

Front Artif Intell ; 7: 1336071, 2024.

Article in English | MEDLINE | ID: mdl-38576460

ABSTRACT

Introduction: Antibiotic-resistant Acinetobacter baumannii is a very important nosocomial pathogen worldwide. Thousands of studies have been conducted about this pathogen. However, there has not been any attempt to use all this information to highlight the research trends concerning this pathogen. Methods: Here we use unsupervised learning and natural language processing (NLP), two areas of Artificial Intelligence, to analyse the most extensive database of articles created (5,500+ articles, from 851 different journals, published over 3 decades). Results: K-means clustering found 113 theme clusters and these were defined with representative terms automatically obtained with topic modelling, summarising different research areas. The biggest clusters, all with over 100 articles, are biased toward multidrug resistance, carbapenem resistance, clinical treatment, and nosocomial infections. However, we also found that some research areas, such as ecology and non-human infections, have received very little attention. This approach allowed us to study research themes over time unveiling those of recent interest, such as the use of Cefiderocol (a recently approved antibiotic) against A. baumannii. Discussion: In a broader context, our results show that unsupervised learning, NLP and topic modelling can be used to describe and analyse the research themes for important infectious diseases. This strategy should be very useful to analyse other ESKAPE pathogens or any other pathogens relevant to Public Health.
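The clustering step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the four toy "articles", the fixed starting centroids, and the use of top centroid terms as a stand-in for topic modelling are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, n_iter=20):
    """Plain k-means; centroids start at the documents listed in init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=2):
    """Terms with the highest centroid weight, used here to label a cluster."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical toy corpus; the study itself analysed 5,500+ articles.
docs = [
    "carbapenem resistance in clinical isolates",
    "multidrug resistance and carbapenem resistance",
    "soil ecology of environmental isolates",
    "ecology of environmental strains in soil",
]
vocab, vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, init_indices=(0, 2))
cluster0_terms = top_terms(vocab, centroids[0])
```

Starting the centroids at fixed documents keeps the toy run deterministic; a real analysis would normally use a randomised initialisation and a dedicated topic-modelling method to name the clusters.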

RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in Escherichia coli K-12.

Tierrafría, Víctor H; Rioualen, Claire; Salgado, Heladia; Lara, Paloma; Gama-Castro, Socorro; Lally, Patrick; Gómez-Romero, Laura; Peña-Loredo, Pablo; López-Almazo, Andrés G; Alarcón-Carranza, Gabriel; Betancourt-Figueroa, Felipe; Alquicira-Hernández, Shirley; Polanco-Morelos, J Enrique; García-Sotelo, Jair; Gaytan-Nuñez, Estefani; Méndez-Cruz, Carlos-Francisco; Muñiz, Luis J; Bonavides-Martínez, César; Moreno-Hagelsieb, Gabriel; Galagan, James E; Wade, Joseph T; Collado-Vides, Julio.

Microb Genom ; 8(5), 2022 05.

Article in English | MEDLINE | ID: mdl-35584008

ABSTRACT

Genomics has set the basis for a variety of methodologies that produce high-throughput datasets identifying the different players that define gene regulation, particularly regulation of transcription initiation and operon organization. These datasets are available in public repositories, such as the Gene Expression Omnibus or ArrayExpress. However, accessing and navigating such a wealth of data is not straightforward. No resource currently exists that offers all available high- and low-throughput data on transcriptional regulation in Escherichia coli K-12 for easy use both as whole datasets and as individual interactions and regulatory elements. RegulonDB (https://regulondb.ccg.unam.mx) began gathering high-throughput dataset collections in 2009, starting with transcription start sites, then adding ChIP-seq and gSELEX in 2012, with up to 99 different experimental high-throughput datasets available in 2019. In this paper we present a radical upgrade to more than 2000 high-throughput datasets, processed to facilitate their comparison, introducing up-to-date collections of transcription termination sites, transcription units, and transcription factor binding interactions derived from ChIP-seq, ChIP-exo, gSELEX and DAP-seq experiments, in addition to expression profiles derived from RNA-seq experiments. For ChIP-seq experiments we offer both the data as presented by the authors and data uniformly processed in-house, enhancing their comparability as well as the traceability of the methods and the reproducibility of the results. Furthermore, we have expanded the tools available for browsing and visualization across and within datasets. We include comparisons against previously existing knowledge in RegulonDB from classic experiments, a nucleotide-resolution genome viewer, and an interface that enables users to browse datasets by querying their metadata. A particular effort was made to automatically extract detailed experimental growth conditions by implementing an assisted curation strategy applying natural language processing and machine learning. We provide summaries with the total number of interactions found in each experiment, as well as tools to identify common results among different experiments. This is a long-awaited resource to make use of such a wealth of knowledge and advance our understanding of the biology of the model bacterium E. coli K-12.

Subjects

Escherichia coli K12; Escherichia coli; Escherichia coli/genetics; Escherichia coli K12/genetics; Escherichia coli K12/metabolism; Gene Expression Regulation, Bacterial; Operon/genetics; Reproducibility of Results

Improving classification of low-resource COVID-19 literature by using Named Entity Recognition.

Lithgow-Serrano, Oscar; Cornelius, Joseph; Kanjirangat, Vani; Méndez-Cruz, Carlos-Francisco; Rinaldi, Fabio.

Genomics Inform ; 19(3): e22, 2021 Sep.

Article in English | MEDLINE | ID: mdl-34638169

ABSTRACT

Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when there is little labeled data for training. Such is the case of the coronavirus disease 2019 (COVID-19) Clinical repository, a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice, where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named-entity recognition (NER) to improve the classification. We processed the literature with OntoGene's Biomedical Entity Recogniser (OGER) and used the resulting identified Named Entities (NEs) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, the NEs' origin was useful for classifying document types and the NEs' type for classifying clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. To accurately estimate this benefit, further experiments with a larger dataset would be needed.
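The idea of feeding named-entity information to a classifier as extra input features can be sketched as follows. This is only an illustration under stated assumptions: the toy lexicon stands in for a real recogniser such as OGER (whose API is not shown here), and the entity types and database labels are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical toy lexicon: token -> (entity type, database of origin).
# A real system would obtain these annotations from an entity recogniser.
TOY_LEXICON = {
    "remdesivir": ("chemical", "chem_db"),
    "ace2": ("gene", "gene_db"),
    "pneumonia": ("disease", "disease_db"),
}


def extract_features(text):
    """Bag-of-words counts plus counts of entity types and entity origins."""
    tokens = text.lower().split()
    # Baseline features: one count per word.
    features = Counter(f"word={tok}" for tok in tokens)
    # Extra features: type and origin of every recognised entity.
    for tok in tokens:
        if tok in TOY_LEXICON:
            entity_type, origin = TOY_LEXICON[tok]
            features[f"ne_type={entity_type}"] += 1
            features[f"ne_origin={origin}"] += 1
    return dict(features)


features = extract_features("remdesivir for pneumonia")
```

The resulting dictionary can be passed to any classifier that accepts sparse feature dictionaries; dropping the `ne_type=` and `ne_origin=` keys gives the corresponding baseline representation.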

Lisen&Curate: A platform to facilitate gathering textual evidence for curation of regulation of transcription initiation in bacteria.

Díaz-Rodríguez, Martín; Lithgow-Serrano, Oscar; Guadarrama-García, Francisco; Tierrafría, Víctor H; Gama-Castro, Socorro; Solano-Lira, Hilda; Salgado, Heladia; Rinaldi, Fabio; Méndez-Cruz, Carlos-Francisco; Collado-Vides, Julio.

Biochim Biophys Acta Gene Regul Mech ; 1864(11-12): 194753, 2021.

Article in English | MEDLINE | ID: mdl-34461312

ABSTRACT

The number of published papers in biomedical research makes it nearly impossible for a researcher to keep up to date. This is where manually curated databases contribute by facilitating access to knowledge. However, the structure required by databases strongly limits the type of valuable information that can be incorporated. Here, we present Lisen&Curate, a curation system that facilitates linking sentences or parts of sentences (both considered sources) in articles with their corresponding curated objects, so that rich additional information about these objects is easily available to users. These sources will be offered both within RegulonDB and in a new database, L-Regulon. To show the relevance of our work, two senior curators performed a curation of 31 articles on the regulation of transcription initiation of E. coli using Lisen&Curate. As a result, 194 objects were curated and 781 sources were recorded. We also found that these sources are useful for developing automatic approaches to detect objects in articles by observing word frequency patterns and by carrying out an open information extraction task. Sources may help to elaborate a controlled vocabulary of experimental methods. Finally, we discuss our ecosystem of interconnected applications, RegulonDB, L-Regulon, and Lisen&Curate, to facilitate access to knowledge on regulation of transcription initiation in bacteria. We see our proposal as the starting point to change the way experimentalists connect a piece of knowledge with its evidence using RegulonDB.

Subjects

Data Curation/methods; Databases, Genetic; Gene Expression Regulation, Bacterial; Transcription Initiation, Genetic; Escherichia coli/genetics

Knowledge extraction for assisted curation of summaries of bacterial transcription factor properties.

Méndez-Cruz, Carlos-Francisco; Blanchet, Antonio; Godínez, Alan; Arroyo-Fernández, Ignacio; Gama-Castro, Socorro; Martínez-Luna, Sara Berenice; González-Colín, Cristian; Collado-Vides, Julio.

Database (Oxford) ; 2020, 2020 12 11.

Article in English | MEDLINE | ID: mdl-33306798

ABSTRACT

Transcription factors (TFs) play a main role in transcriptional regulation of bacteria, as they regulate transcription of the genetic information encoded in DNA. Thus, the curation of the properties of these regulatory proteins is essential for a better understanding of transcriptional regulation. However, traditional manual curation of article collections to compile descriptions of TF properties takes significant time and effort due to the overwhelming amount of biomedical literature, which increases every day. The development of automatic approaches for knowledge extraction to assist curation is therefore critical. Here, we show an effective approach for knowledge extraction to assist curation of summaries describing bacterial TF properties based on an automatic text summarization strategy. We were able to recover automatically a median 77% of the knowledge contained in manual summaries describing properties of 177 TFs of Escherichia coli K-12 by processing 5961 scientific articles. For 71% of the TFs, our approach extracted new knowledge that can be used to expand manual descriptions. Furthermore, as we trained our predictive model with manual summaries of E. coli, we also generated summaries for 185 TFs of Salmonella enterica serovar Typhimurium from 3498 articles. According to the manual curation of 10 of these Salmonella typhimurium summaries, 96% of their sentences contained relevant knowledge. Our results demonstrate the feasibility to assist manual curation to expand manual summaries with new knowledge automatically extracted and to create new summaries of bacteria for which these curation efforts do not exist. Database URL: The automatic summaries of the TFs of E. coli and Salmonella and the automatic summarizer are available in GitHub (https://github.com/laigen-unam/tf-properties-summarizer.git).

Subjects

Escherichia coli K12; Transcription Factors; Escherichia coli/genetics; Escherichia coli/metabolism; Escherichia coli K12/metabolism; Gene Expression Regulation; Transcription Factors/genetics; Transcription Factors/metabolism; Transcription, Genetic

RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12.

Santos-Zavaleta, Alberto; Salgado, Heladia; Gama-Castro, Socorro; Sánchez-Pérez, Mishael; Gómez-Romero, Laura; Ledezma-Tejeida, Daniela; García-Sotelo, Jair Santiago; Alquicira-Hernández, Kevin; Muñiz-Rascado, Luis José; Peña-Loredo, Pablo; Ishida-Gutiérrez, Cecilia; Velázquez-Ramírez, David A; Del Moral-Chávez, Víctor; Bonavides-Martínez, César; Méndez-Cruz, Carlos-Francisco; Galagan, James; Collado-Vides, Julio.

Nucleic Acids Res ; 47(D1): D212-D220, 2019 01 08.

Article in English | MEDLINE | ID: mdl-30395280

ABSTRACT

RegulonDB, first published 20 years ago, is a comprehensive electronic resource about regulation of transcription initiation of Escherichia coli K-12 with decades of knowledge from classic molecular biology experiments, and recently also from high-throughput genomic methodologies. We curated the literature to keep RegulonDB up to date, and initiated curation of ChIP and gSELEX experiments. We estimate that current knowledge describes between 10% and 30% of the expected total number of transcription factor-gene regulatory interactions in E. coli. RegulonDB provides datasets for interactions for which there is no evidence that they affect expression, as well as expression datasets. We developed a proof-of-concept pipeline to merge binding and expression evidence to identify regulatory interactions. These datasets can be visualized in the RegulonDB JBrowse. We developed the Microbial Conditions Ontology with a controlled vocabulary for the minimal properties needed to reproduce an experiment, which helps to integrate data from high-throughput and classic literature. At a higher level of integration, we report Genetic Sensory-Response Units for 200 transcription factors, including their regulation at the metabolic level, and include summaries for 70 of them. Finally, we summarize our research with natural language processing strategies to enhance our biocuration work.

Subjects

Computational Biology/methods; Escherichia coli K12/genetics; Gene Expression Regulation, Bacterial; Genomics; Gene Ontology; Gene Regulatory Networks; Genomics/methods; High-Throughput Nucleotide Sequencing

First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes.

Méndez-Cruz, Carlos-Francisco; Gama-Castro, Socorro; Mejía-Almonte, Citlalli; Castillo-Villalba, Marco-Polo; Muñiz-Rascado, Luis-José; Collado-Vides, Julio.

Database (Oxford) ; 2017, 2017 01 01.

Article in English | MEDLINE | ID: mdl-29220462

ABSTRACT

Database URL: RegulonDB, http://regulondb.ccg.unam.mx.

Subjects

Databases, Bibliographic; Escherichia coli K12; Regulon; Serine Endopeptidases; Support Vector Machine; Transcription Factors; Protein Domains

Strategies towards digital and semi-automated curation in RegulonDB.

Rinaldi, Fabio; Lithgow, Oscar; Gama-Castro, Socorro; Solano, Hilda; López-Fuentes, Alejandra; Muñiz Rascado, Luis José; Ishida-Gutiérrez, Cecilia; Méndez-Cruz, Carlos-Francisco; Collado-Vides, Julio.

Database (Oxford) ; 2017, 2017 Jan 01.

Article in English | MEDLINE | ID: mdl-28605767

Strategies towards digital and semi-automated curation in RegulonDB.

Rinaldi, Fabio; Lithgow, Oscar; Gama-Castro, Socorro; Solano, Hilda; Lopez, Alejandra; Muñiz Rascado, Luis José; Ishida-Gutiérrez, Cecilia; Méndez-Cruz, Carlos-Francisco; Collado-Vides, Julio.

Database (Oxford) ; 2017(1), 2017 01 01.

Article in English | MEDLINE | ID: mdl-28365731

ABSTRACT

Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, the rate at which new information is being published makes manual curation increasingly unable to cope. Devising new curation strategies that leverage upon data mining and text analysis is, therefore, a promising avenue to help life science databases to cope with the deluge of novel information. In this article, we describe the integration of text mining technologies in the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators. Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities which are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated is being integrated in the system. We tested the module using three sets of scientific articles and six domain experts. All these improvements are gradually enabling us to obtain a high throughput curation process with the same quality as manual curation.
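Ranking sentences by similarity to a query sentence, as the module mentioned in this abstract does, can be sketched in a few lines. The abstract does not specify the similarity measure, so this sketch substitutes plain bag-of-words cosine similarity (a lexical measure, not a semantic one) as a simple stand-in; the sentences, `TfX`, and `geneY` are made-up placeholders.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the word-count vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(query, sentences):
    """Return sentences ordered from most to least similar to the query."""
    return sorted(sentences, key=lambda s: cosine(query, s), reverse=True)


# Hypothetical sentences a curator might want to compare.
sentences = [
    "Cells were grown in LB medium",
    "TfX binds the geneY promoter",
    "TfX represses the geneY operon",
]
ranked = rank_similar("TfX regulates the geneY operon", sentences)
```

A curator-facing tool would show the top of `ranked` next to the sentence being curated, so that related statements from other articles can be reviewed together.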

Subjects

Data Curation/methods; Data Mining/methods; Databases, Factual; Regulon/physiology; Data Curation/standards
