Results 1 - 14 of 14
1.
BMC Bioinformatics ; 21(Suppl 23): 579, 2020 Dec 29.
Article in English | MEDLINE | ID: mdl-33372606

ABSTRACT

BACKGROUND: Entity normalization is an important information extraction task which has gained renewed attention in the last decade, particularly in the biomedical and life science domains. In these domains, and more generally in all specialized domains, this task is still challenging for the latest machine learning-based approaches, which have difficulty handling highly multi-class and few-shot learning problems. To address this issue, we propose C-Norm, a new neural approach which synergistically combines standard and weak supervision, ontological knowledge integration and distributional semantics. RESULTS: Our approach greatly outperforms all methods evaluated on the Bacteria Biotope datasets of BioNLP Open Shared Tasks 2019, without integrating any manually-designed domain-specific rules. CONCLUSIONS: Our results show that relatively shallow neural network methods can perform well in domains that present highly multi-class and few-shot learning problems.


Subject(s)
Algorithms , Neural Networks, Computer , Bacteria/metabolism , Confidence Intervals , Databases as Topic , Ecosystem , Humans , Knowledge , Machine Learning , Phenotype
2.
Food Microbiol ; 81: 63-75, 2019 Aug.
Article in English | MEDLINE | ID: mdl-30910089

ABSTRACT

Information on food microbial diversity is scattered across millions of scientific papers. Researchers need tools to assist their bibliographic search in such large collections. Text mining and knowledge engineering methods are useful to automatically and efficiently find relevant information in Life Science. This work describes how the Alvis text mining platform has been applied to a large collection of PubMed abstracts of scientific papers in the food microbiology domain. The information targeted by our work is microorganisms, their habitats and phenotypes. Two knowledge resources, the NCBI taxonomy and the OntoBiotope ontology were used to detect this information in texts. The result of the text mining process was indexed and is presented through the AlvisIR Food on-line semantic search engine. In this paper, we also show through two illustrative examples the great potential of this new tool to assist in studies on ecological diversity and the origin of microbial presence in food.


Subject(s)
Biodiversity , Computational Biology/methods , Data Mining/methods , Food Microbiology , Algorithms , Biological Ontologies , Databases, Bibliographic , Databases, Factual , Ecosystem , Humans , Information Services , Information Storage and Retrieval , Internet , Literature , MEDLINE , National Library of Medicine (U.S.) , Phenotype , Phylogeny , PubMed , Software , United States
3.
BMC Bioinformatics ; 16 Suppl 10: S1, 2015.
Article in English | MEDLINE | ID: mdl-26202448

ABSTRACT

BACKGROUND: We present the two Bacteria Track tasks of BioNLP 2013 Shared Task (ST): Gene Regulation Network (GRN) and Bacteria Biotope (BB). These tasks were previously introduced in the 2011 BioNLP-ST Bacteria Track as Bacteria Gene Interaction (BI) and Bacteria Biotope (BB). The Bacteria Track was motivated by a need to develop specific BioNLP tools for fine-grained event extraction in bacteria biology. The 2013 tasks expand on the 2011 version by better addressing the biological knowledge modeling needs. New evaluation metrics were designed for the new goals. Moving beyond a list of gene interactions, the goal of the GRN task is to build a gene regulation network from the extracted gene interactions. BB'13 is dedicated to the extraction of bacteria biotopes, i.e. bacterial environmental information, as was BB'11. BB'13 extends the typology of BB'11 to a large diversity of biotopes, as defined by the OntoBiotope ontology. The detection of entities and events is tackled by distinct subtasks in order to measure the progress achieved by the participant systems since 2011. RESULTS: This paper details the corpus preparations and the evaluation metrics, as well as summarizing and discussing the participant results. Five groups participated in each of the two tasks. The high diversity of the participant methods reflects the dynamism of the BioNLP research community. CONCLUSION: The evaluation results suggest new research directions for the improvement and development of Information Extraction for molecular and environmental biology. The Bacteria Track tasks remain publicly open; the BioNLP-ST website provides an online evaluation service, the reference corpora and the evaluation tools.


Subject(s)
Bacteria/genetics , Environmental Microbiology , Epistasis, Genetic , Gene Regulatory Networks , Genes, Bacterial , Information Storage and Retrieval , Humans , Natural Language Processing
4.
PLoS One ; 19(6): e0305475, 2024.
Article in English | MEDLINE | ID: mdl-38870159

ABSTRACT

Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in-field and under controlled conditions document wheat breeding experiments. The cross-referencing of this complementary information is essential. Text from databases and scientific publications has been identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information, e.g. simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. 
A study of the performance of state-of-the-art language models for both named entity recognition and linking tasks trained on the corpus shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.


Subject(s)
Data Mining , Phenotype , Plant Breeding , Triticum , Triticum/genetics , Plant Breeding/methods , Data Mining/methods
5.
Data Brief ; 54: 110404, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38665156

ABSTRACT

There is a growing interest in milk oligosaccharides (MOs) because of their numerous benefits for newborns' and long-term health. A large number of MO structures have been identified in mammalian milk. Mostly described in human milk, the oligosaccharide richness, although less broad, has also been reported for a wide range of mammalian species. The structure of MOs is particularly difficult to report as it results from the combination of 5 monosaccharides linked by various glycosidic bonds forming structurally diverse and complex matrices of linear and branched oligosaccharides. Exploring the literature and extracting relevant information on MO diversity within or across species appears promising to elucidate structure-function role of MOs. Currently, given the complexity of these molecules, the main issues in exploring literature to extract relevant information on MO diversity within or across species relate to the heterogeneity in the way authors refer to these molecules. Herein, we provide a thesaurus (MilkOligoThesaurus) including the names and synonyms of MOs collected from key selected articles on mammalian milk analyses. MilkOligoThesaurus gathers the names of the MOs with a complete description of their monosaccharide composition and structures. When available, each unique MO molecule is linked to its ID from the NCBI PubChem and ChEBI databases. MilkOligoThesaurus is provided in a tabular format. It gathers 245 unique oligosaccharide structures described by 22 features (columns) including the name of the molecule, its abbreviation, the chemical database IDs if available, the monosaccharide composition, chemical information (molecular formula, monoisotopic mass), synonyms, its formula in condensed form, and in abbreviated condensed form, the abbreviated systematic name, the systematic name, the isomer group, and scientific article sources. MilkOligoThesaurus is also provided in the SKOS (Simple Knowledge Organization System) format. 
This thesaurus is a valuable resource gathering MO naming variations that are not found elsewhere for (i) Text and Data Mining to enable automatic annotation and rapid extraction of milk oligosaccharide data from scientific papers; (ii) biology researchers aiming to search for or decipher the structure of milk oligosaccharides based on any of their names, abbreviations or monosaccharide compositions and linkages.

6.
PLoS One ; 18(1): e0272473, 2023.
Article in English | MEDLINE | ID: mdl-36662691

ABSTRACT

The dramatic increase in the number of microbe descriptions in databases, reports, and papers presents a two-fold challenge for accessing the information: integration of heterogeneous data in a standard ontology-based representation and normalization of the textual descriptions by semantic analysis. Recent text mining methods offer powerful ways to extract textual information and generate ontology-based representation. This paper describes the design of the Omnicrobe application that gathers comprehensive information on habitats, phenotypes, and usages of microbes from scientific sources of high interest to the microbiology community. The Omnicrobe database contains around 1 million descriptions of microbe properties. These descriptions are created by analyzing and combining six information sources of various kinds, i.e. biological resource catalogs, sequence databases and scientific literature. The microbe properties are indexed by the Ontobiotope ontology and their taxa are indexed by an extended version of the taxonomy maintained by the National Center for Biotechnology Information. The Omnicrobe application covers all domains of microbiology. With simple or rich ontology-based queries, it provides easy-to-use support in the resolution of scientific questions related to the habitats, phenotypes, and uses of microbes. We illustrate the potential of Omnicrobe with a use case from the food innovation domain.


Subject(s)
Data Mining , Ecosystem , Data Mining/methods , Databases, Factual , Publications , Phenotype
7.
Front Artif Intell ; 6: 1188036, 2023.
Article in English | MEDLINE | ID: mdl-37829659

ABSTRACT

This article describes our study on the alignment of two complementary knowledge graphs useful in agriculture: the thesaurus of cultivated plants in France named French Crop Usage (FCU) and the French national taxonomic repository TAXREF for fauna, flora, and fungi. FCU describes the usages of plants in agriculture: "tomatoes" are crops used for human food, and "grapevines" are crops used for human beverage. TAXREF describes biological taxa and associated scientific names: for example, a tomato species may be "Solanum lycopersicum" or a grapevine species may be "Vitis vinifera". Both knowledge graphs contain vernacular names of plants but those names are ambiguous. Thus, a group of agricultural experts produced some mappings from FCU crops to TAXREF taxa. Moreover, new RDF properties have been defined to declare those new types of mapping relations between plant descriptions. The metadata for the mappings and the mapping set are encoded with the Simple Standard for Sharing Ontological Mappings (SSSOM), a new model which, among other qualities, offers means to report on provenance of particular interest for this study. The produced mappings are available for download in Recherche Data Gouv, the federated national platform for research data in France.

8.
BMC Bioinformatics ; 13 Suppl 11: S3, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22759457

ABSTRACT

BACKGROUND: We present the BioNLP 2011 Shared Task Bacteria Track, the first Information Extraction challenge entirely dedicated to bacteria. It includes three tasks that cover different levels of biological knowledge. The Bacteria Gene Renaming supporting task is aimed at extracting gene renaming and gene name synonymy in PubMed abstracts. The Bacteria Gene Interaction is a gene/protein interaction extraction task from individual sentences. The interactions have been categorized into ten different sub-types, thus giving a detailed account of genetic regulations at the molecular level. Finally, the Bacteria Biotopes task focuses on the localization and environment of bacteria mentioned in textbook articles. We describe the process of creation for the three corpora, including document acquisition and manual annotation, as well as the metrics used to evaluate the participants' submissions. RESULTS: Three teams submitted to the Bacteria Gene Renaming task; the best team achieved an F-score of 87%. For the Bacteria Gene Interaction task, the only participant reached a global F-score of 77%, although the system efficiency varies significantly from one sub-type to another. Three teams submitted to the Bacteria Biotopes task with very different approaches; the best team achieved an F-score of 45%. However, the detailed study of the participating systems' efficiency reveals the strengths and weaknesses of each participating system. CONCLUSIONS: The three tasks of the Bacteria Track offer participants a chance to address a wide range of issues in Information Extraction, including entity recognition, semantic typing and coreference resolution. We found common trends in the most efficient systems: the systematic use of syntactic dependencies and machine learning. Nevertheless, the originality of the Bacteria Biotopes task encouraged the use of interesting novel methods and techniques, such as term compositionality and scopes wider than the sentence.
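For readers unfamiliar with the F-score figures quoted in this abstract, the following is a minimal sketch of how an F-score is commonly derived from true positive, false positive and false negative counts. It assumes the standard balanced definition (harmonic mean of precision and recall); the abstract does not state which variant or matching criteria the shared task used, and the function names and counts below are purely illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from raw counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only, not taken from the shared task results.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} F1={f_score(p, r):.2f}")
```

Because the harmonic mean is pulled towards the lower of the two values, a system cannot obtain a high F-score by maximizing only precision or only recall.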


Subject(s)
Bacteria/genetics , Genes, Bacterial , Information Storage and Retrieval , Epistasis, Genetic , Humans , PubMed , Terminology as Topic
9.
Database (Oxford) ; 2022, 2022 Aug 25.
Article in English | MEDLINE | ID: mdl-36006843

ABSTRACT

Collecting relations between chemicals and drugs is crucial in biomedical research. The pre-trained transformer model, e.g. Bidirectional Encoder Representations from Transformers (BERT), is shown to have limitations on biomedical texts; more specifically, the lack of annotated data makes relation extraction (RE) from biomedical texts very challenging. In this paper, we hypothesize that enriching a pre-trained transformer model with syntactic information may help improve its performance on chemical-drug RE tasks. For this purpose, we propose three syntax-enhanced models based on the domain-specific BioBERT model: Chunking-Enhanced-BioBERT and Constituency-Tree-BioBERT in which constituency information is integrated and a Multi-Task-Learning framework Multi-Task-Syntactic (MTS)-BioBERT in which syntactic information is injected implicitly by adding syntax-related tasks as training objectives. Besides, we test an existing model Late-Fusion which is enhanced by syntactic dependency information and build ensemble systems combining syntax-enhanced models and non-syntax-enhanced models. Experiments are conducted on the BioCreative VII DrugProt corpus, a manually annotated corpus for the development and evaluation of RE systems. Our results reveal that syntax-enhanced models in general degrade the performance of BioBERT in the scenario of biomedical RE but improve the performance when the subject-object distance of candidate semantic relation is long. We also explore the impact of quality of dependency parses. [Our code is available at: https://github.com/Maple177/syntax-enhanced-RE/tree/drugprot (for only MTS-BioBERT); https://github.com/Maple177/drugprot-relation-extraction (for the rest of experiments)] Database URL https://github.com/Maple177/drugprot-relation-extraction.


Subject(s)
Biomedical Research , Data Mining , Data Mining/methods , Databases, Factual , Natural Language Processing , Semantics
10.
Nat Biotechnol ; 25(7): 763-9, 2007 Jul.
Article in English | MEDLINE | ID: mdl-17592475

ABSTRACT

We report here the complete genome sequence of the virulent strain JIP02/86 (ATCC 49511) of Flavobacterium psychrophilum, a widely distributed pathogen of wild and cultured salmonid fish. The genome consists of a 2,861,988-base pair (bp) circular chromosome with 2,432 predicted protein-coding genes. Among these predicted proteins, stress response mediators, gliding motility proteins, adhesins and many putative secreted proteases are probably involved in colonization, invasion and destruction of the host tissues. The genome sequence provides the basis for explaining the relationships of the pathogen to the host and opens new perspectives for the development of more efficient disease control strategies. It also allows for a better understanding of the physiology and evolution of a significant representative of the family Flavobacteriaceae, whose members are associated with an interesting diversity of lifestyles and habitats.


Subject(s)
Biotechnology/methods , Fishes/microbiology , Flavobacterium/metabolism , Genome, Bacterial , Animals , Biofilms , Cell Adhesion , Cell Membrane/metabolism , Flavobacteriaceae Infections/metabolism , Genome , Models, Biological , Open Reading Frames , Parasites
11.
Genomics Inform ; 18(2): e14, 2020 Jun.
Article in English | MEDLINE | ID: mdl-32634868

ABSTRACT

Phenotyping is a major issue for wheat agriculture to meet the challenges of adaptation of wheat varieties to climate change and chemical input reduction in crop. The need to improve the reuse of observations and experimental data has led to the creation of reference ontologies to standardize descriptions of phenotypes and to facilitate their comparison. The scientific literature is largely under-exploited, although extremely rich in phenotype descriptions associated with cultivars and genetic information. In this paper we propose the Wheat Trait Ontology (WTO) that is suitable for the extraction and management of scientific information from scientific papers, and its combination with data from genomic and experimental databases. We describe the principles of WTO construction and show examples of WTO use for the extraction and management of phenotype descriptions obtained from scientific documents.

12.
Genomics Inform ; 17(2): e20, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31307135

ABSTRACT

Entity normalization, or entity linking in the general domain, is an information extraction task that aims to annotate/bind multiple words/expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms, which captures knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance. However, these require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has some limitations: it does not scale well with very large ontologies, it tends to overgeneralize predictions, and it lacks valid representations for out-of-vocabulary words. Here, we propose to assess different methods to reduce the dimensionality in the representation of the ontology. We also propose to calibrate parameters in order to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.
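To make the idea of normalizing a text mention to an ontology concept more concrete, here is a toy sketch that picks the concept whose vector is closest to the mention vector by cosine similarity. This is not the CONTES method itself, whose details are not given in the abstract; the concept identifiers, the three-dimensional vectors and the function names are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical concept vectors; a real system would derive them from an ontology.
CONCEPT_VECTORS = {
    "CONCEPT:soil": [1.0, 0.0, 0.0],
    "CONCEPT:milk": [0.0, 1.0, 0.0],
    "CONCEPT:gut": [0.0, 0.0, 1.0],
}


def normalize(mention_vector: list[float], concepts: dict[str, list[float]]) -> str:
    """Return the identifier of the concept most similar to the mention vector."""
    return max(concepts, key=lambda cid: cosine(mention_vector, concepts[cid]))


# A mention whose (made-up) embedding lies close to the "soil" concept.
print(normalize([0.9, 0.1, 0.0], CONCEPT_VECTORS))
```

With tens of thousands of concepts, comparing against every concept vector becomes costly, which is one plausible reason why reducing the dimensionality of the ontology representation matters.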

13.
Nat Biotechnol ; 23(12): 1527-33, 2005 Dec.
Article in English | MEDLINE | ID: mdl-16273110

ABSTRACT

Lactobacillus sakei is a psychrotrophic lactic acid bacterium found naturally on fresh meat and fish. This microorganism is widely used in the manufacture of fermented meats and has biotechnological potential in biopreservation and food safety. We have explored the 1,884,661-base-pair (bp) circular chromosome of strain 23K encoding 1,883 predicted genes. Genome sequencing revealed a specialized metabolic repertoire, including purine nucleoside scavenging that may contribute to an ability to successfully compete on raw meat products. Many genes appear responsible for robustness during the rigors of food processing, particularly resilience against changing redox and oxygen levels. Genes potentially responsible for biofilm formation and cellular aggregation that may assist the organism to colonize meat surfaces were also identified. This genome project is an initial step for investigating new biotechnological approaches to meat and fish processing and for exploring fundamental aspects of bacterial adaptation to these specific environments.


Subject(s)
Bacterial Proteins/metabolism , Genome, Bacterial/genetics , Lactic Acid/biosynthesis , Lactobacillus/genetics , Lactobacillus/metabolism , Meat/microbiology , Signal Transduction/physiology , Animals , Bacterial Proteins/genetics , Base Sequence , Chromosome Mapping , Food Microbiology , Gene Expression Profiling/methods , Gene Expression Regulation/physiology , Molecular Sequence Data
14.
Article in English | MEDLINE | ID: mdl-27888231

ABSTRACT

Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable, that is, those that have the crucial ability to share information, enabling smooth integration and reusability.


Subject(s)
Data Mining/methods , Databases, Factual , Humans