1.
Bioinformatics ; 38(15): 3850-3852, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35652780

ABSTRACT

MOTIVATION: The importance of clinical data in understanding the pathophysiology of complex disorders has prompted the launch of multiple initiatives designed to generate patient-level data from various modalities. While these studies can reveal important findings relevant to the disease, each study captures different yet complementary aspects and modalities which, when combined, generate a more comprehensive picture of disease etiology. However, achieving this requires a global integration of data across studies, which proves to be challenging given the lack of interoperability of cohort datasets. RESULTS: Here, we present the Data Steward Tool (DST), an application that allows for semi-automatic semantic integration of clinical data into ontologies and global data models and data standards. We demonstrate the applicability of the tool in the field of dementia research by establishing a Clinical Data Model (CDM) in this domain. The CDM currently consists of 277 common variables covering demographics (e.g. age and gender), diagnostics, neuropsychological tests and biomarker measurements. The DST combined with this disease-specific data model shows how interoperability between multiple, heterogeneous dementia datasets can be achieved. AVAILABILITY AND IMPLEMENTATION: The DST source code and Docker images are respectively available at https://github.com/SCAI-BIO/data-steward and https://hub.docker.com/r/phwegner/data-steward. Furthermore, the DST is hosted at https://data-steward.bio.scai.fraunhofer.de/data-steward. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
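The mapping of cohort variables onto a common data model can be illustrated with a minimal sketch. It is not the DST implementation: the mapping tables, the column names and the `harmonize` helper are hypothetical, and the semantic (ontology-based) matching described in the abstract is reduced here to a plain lookup table.

```python
# Hypothetical mapping tables: source column name -> common data model (CDM) variable.
# Real mappings would be curated with ontology support; these are illustrative only.
COHORT_A_TO_CDM = {"AGE_YRS": "age", "SEX": "gender", "MMSE_TOTAL": "mmse"}
COHORT_B_TO_CDM = {"age_at_visit": "age", "gender": "gender", "mmscore": "mmse"}


def harmonize(record, mapping):
    """Rename known source variables to CDM names and report unmapped ones."""
    harmonized = {}
    unmapped = []
    for name, value in record.items():
        if name in mapping:
            harmonized[mapping[name]] = value
        else:
            # Unmapped variables are kept aside for manual review by a data steward.
            unmapped.append(name)
    return harmonized, sorted(unmapped)


row_a, todo_a = harmonize({"AGE_YRS": 71, "SEX": "F", "SITE": 3}, COHORT_A_TO_CDM)
row_b, todo_b = harmonize({"age_at_visit": 68, "mmscore": 27}, COHORT_B_TO_CDM)
```

After harmonization, records from both hypothetical cohorts share the same variable names and can be pooled, while unmapped columns remain visible for curation.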


Subjects
Dementia , Semantics , Humans , Software , Dementia/diagnosis
2.
Cerebellum ; 2023 Mar 31.
Article in English | MEDLINE | ID: mdl-37002505

ABSTRACT

With SCAview, we present a prompt and comprehensive tool that enables scientists to browse large datasets of the most common spinocerebellar ataxias intuitively and without technical effort. The basic concept is a visualization of data, with graphical handling and filtering to select and define subgroups and compare them. Several plot types to visualize all data points resulting from the selected attributes are provided. The underlying synthetic cohort is based on clinical data from five different European and US longitudinal multicenter cohorts in spinocerebellar ataxia type 1, 2, 3, and 6 (SCA1, 2, 3, and 6) comprising > 1400 patients with overall > 5500 visits. First, we developed a common data model to integrate the clinical, demographic, and characterizing data of each source cohort. Second, the available datasets from each cohort were mapped onto the data model. Third, we created a synthetic cohort based on the cleaned dataset. With SCAview, we demonstrate the feasibility of mapping cohort data from different sources onto a common data model. The resulting browser-based visualization tool with fully graphical handling of the data offers researchers the unique possibility to visualize relationships and distributions of clinical data, to define subgroups and to further investigate them without any technical effort. Access to SCAview can be requested via the Ataxia Global Initiative and is free of charge.

3.
Nucleic Acids Res ; 48(D1): D204-D219, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31598718

ABSTRACT

We present the Small RNA Expression Atlas (SEAweb), a web application that allows for the interactive querying, visualization and analysis of known and novel small RNAs across 10 organisms. It contains sRNA and pathogen expression information for over 4200 published samples with standardized search terms and ontologies. In addition, SEAweb allows for the interactive visualization and re-analysis of 879 differential expression and 514 classification comparisons. SEAweb's user model enables sRNA researchers to compare and re-analyze user-specific and published datasets, highlighting common and distinct sRNA expression patterns. We provide evidence for SEAweb's fidelity by (i) generating a set of 591 tissue specific miRNAs across 29 tissues, (ii) finding known and novel bacterial and viral infections across diseases and (iii) determining a Parkinson's disease-specific blood biomarker signature using novel data. We believe that SEAweb's simple semantic search interface, the flexible interactive reports and the user model with rich analysis capabilities will enable researchers to better understand the potential function and diagnostic value of sRNAs or pathogens across tissues, diseases and organisms.


Subjects
Nucleic Acid Databases , Small Untranslated RNA/metabolism , Animals , Bacterial Infections/microbiology , Cattle , Humans , Internet , Mice , Organ Specificity , Parkinson Disease/blood , Bacterial RNA/metabolism , Viral RNA/metabolism , Rats , Virus Diseases/virology
4.
J Med Internet Res ; 24(4): e34072, 2022 04 08.
Article in English | MEDLINE | ID: mdl-35285808

ABSTRACT

BACKGROUND: The current COVID-19 crisis underscores the importance of preprints, as they allow for rapid communication of research results without delay in review. To fully integrate this type of publication into library information systems, we developed preVIEW: a publicly available, central search engine for COVID-19-related preprints, which clearly distinguishes this source from peer-reviewed publications. The relationship between the preprint version and its corresponding journal version should be stored as metadata in both versions so that duplicates can be easily identified and information overload for researchers is reduced. OBJECTIVE: In this work, we investigated the extent to which the relationship information between preprint and corresponding journal publication is present in the published metadata, how it can be further completed, and how it can be used in preVIEW to identify already republished preprints and filter those duplicates in search results. METHODS: We first analyzed the information content available at the preprint servers themselves and the information that can be retrieved via Crossref. Moreover, we developed the algorithm Pre2Pub to find the corresponding reviewed article for each preprint. We integrated the results of those different resources into our search engine preVIEW, presented the information in the result set overview, and added filter options accordingly. RESULTS: Preprints have found their place in publication workflows; however, the link from a preprint to its corresponding journal publication is not completely covered in the metadata of the preprint servers or in Crossref. Our algorithm Pre2Pub is able to find approximately 16% more related journal articles with a precision of 99.27%. We also integrate this information in a transparent way within preVIEW so that researchers can use it in their search.
CONCLUSIONS: The relationship between a preprint and its journal version is valuable information that can help researchers find only previously unknown information in preprints. As long as there is no transparent and complete way to store this relationship in metadata, the Pre2Pub algorithm is a suitable extension to retrieve this information.
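The abstract does not describe how Pre2Pub matches records, so the following is only a hedged sketch of one plausible approach to linking a preprint to its journal version: token-level title similarity combined with an author overlap check. The record layout, the `find_journal_version` helper, the 0.9 threshold and the example DOIs are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split()


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles, compared token by token."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_journal_version(preprint, candidates, threshold=0.9):
    """Return the best candidate sharing an author and a near-identical title, else None."""
    best, best_score = None, 0.0
    for candidate in candidates:
        shares_author = bool(set(preprint["authors"]) & set(candidate["authors"]))
        score = title_similarity(preprint["title"], candidate["title"])
        if shares_author and score >= threshold and score > best_score:
            best, best_score = candidate, score
    return best


# Made-up records; the DOIs are placeholders, not real identifiers.
preprint = {"title": "A semantic search engine for COVID-19 preprints",
            "authors": ["Doe J", "Roe A"]}
candidates = [
    {"title": "A Semantic Search Engine for COVID-19 Preprints.",
     "authors": ["Doe J"], "doi": "10.0000/example.1"},
    {"title": "Deep learning for protein folding",
     "authors": ["Doe J"], "doi": "10.0000/example.2"},
]
match = find_journal_version(preprint, candidates)
```

A high threshold favors precision over recall, which fits the use case of filtering duplicates without hiding unrelated work.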


Subjects
COVID-19 , Algorithms , Humans , Peer Review
5.
Article in German | MEDLINE | ID: mdl-34297162

ABSTRACT

Public health research and epidemiological and clinical studies are necessary to understand the COVID-19 pandemic and to take appropriate action. Therefore, since early 2020, numerous research projects have also been initiated in Germany. However, due to the large amount of information, it is currently difficult to get an overview of the diverse research activities and their results. Based on the "Federated research data infrastructure for personal health data" (NFDI4Health) initiative, the "COVID-19 task force" is able to create easier access to SARS-CoV-2- and COVID-19-related clinical, epidemiological, and public health research data. To this end, the so-called FAIR data principles (findable, accessible, interoperable, reusable) are taken into account and should allow an expedited communication of results. The most essential work of the task force includes the generation of a study portal with metadata, selected instruments, other study documents, and study results as well as a search engine for preprint publications. Additional contents include a concept for the linkage between research and routine data, a service for an enhanced practice of image data, and the application of a standardized analysis routine for harmonized quality assessment. This infrastructure, currently being established, will facilitate the findability and handling of German COVID-19 research. The developments initiated in the context of the NFDI4Health COVID-19 task force are reusable for further research topics, as the challenges addressed are generic for the findability and handling of research data.


Subjects
Biomedical Research/trends , COVID-19 , Information Dissemination , Germany , Humans , Metadata , Pandemics , SARS-CoV-2
6.
Sci Rep ; 14(1): 14412, 2024 06 22.
Article in English | MEDLINE | ID: mdl-38909025

ABSTRACT

Access to individual-level health data is essential for gaining new insights and advancing science. In particular, modern methods based on artificial intelligence rely on the availability of and access to large datasets. In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data, i.e., data generated through a randomised process that have similar statistical properties as the original data, but do not have a one-to-one correspondence with the original individual-level records. In this study, we use a state-of-the-art synthetic data generation method and perform in-depth quality analyses of the generated data for a specific use case in the field of nutrition. We demonstrate the need for careful analyses of synthetic data that go beyond descriptive statistics and provide valuable insights into how to realise the full potential of synthetic datasets. By extending the methods, but also by thoroughly analysing the effects of sampling from a trained model, we are able to largely reproduce significant real-world analysis results in the chosen use case.


Subjects
Data Analysis , Humans , Longitudinal Studies , Artificial Intelligence
7.
Stud Health Technol Inform ; 302: 108-112, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37203619

ABSTRACT

Semantic interoperability, i.e., the ability to automatically interpret the shared information in a meaningful way, is one of the most important requirements for data analysis of different sources. In the area of clinical and epidemiological studies, the target of the National Research Data Infrastructure for Personal Health Data (NFDI4Health), interoperability of data collection instruments such as case report forms (CRFs), data dictionaries and questionnaires is critical. Retrospective integration of semantic codes into study metadata at item-level is important, as ongoing or completed studies contain valuable information, which should be preserved. We present a first version of a Metadata Annotation Workbench to support annotators in dealing with a variety of complex terminologies and ontologies. User-driven development with users from the fields of nutritional epidemiology and chronic diseases ensured that the service fulfills the basic requirements for a semantic metadata annotation software for these NFDI4Health use cases. The web application can be accessed using a web browser and the source code of the software is available with an open-source MIT license.


Subjects
Semantics , Software , Retrospective Studies , Web Browser , Metadata
8.
J Biomed Inform ; 45(5): 885-92, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22554702

ABSTRACT

A significant amount of information about drug-related safety issues such as adverse effects is published in medical case reports that can only be explored by human readers due to their unstructured nature. The work presented here aims at generating a systematically annotated corpus that can support the development and validation of methods for the automatic extraction of drug-related adverse effects from medical case reports. The documents are systematically double annotated in various rounds to ensure consistent annotations. The annotated documents are finally harmonized to generate representative consensus annotations. In order to demonstrate an example use case scenario, the corpus was employed to train and validate models for the classification of informative versus non-informative sentences. A Maximum Entropy classifier trained with simple features and evaluated by 10-fold cross-validation resulted in an F1 score of 0.70, indicating a potentially useful application of the corpus.
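The evaluation protocol named in the abstract (10-fold cross-validation scored by F1) can be sketched as follows. This is a generic illustration, not the authors' code: the classifier itself is omitted, the fold assignment rule (item i goes to fold i mod k) is an assumption, and the counts in the example are made up.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; every item is used for testing exactly once."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Made-up counts: 7 true positives, 3 false positives, 3 false negatives.
example_f1 = f1_score(tp=7, fp=3, fn=3)
folds = list(k_fold_indices(25, k=10))
```

In a full evaluation, a model would be trained on each `train` split, scored on the matching `test` split, and the per-fold F1 values averaged.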


Subjects
Artificial Intelligence , Data Mining/methods , Factual Databases , Drug-Related Side Effects and Adverse Reactions , PubMed , Documentation , Humans , Reproducibility of Results , Semantics
9.
BMC Med Inform Decis Mak ; 12: 148, 2012 Dec 18.
Article in English | MEDLINE | ID: mdl-23249606

ABSTRACT

BACKGROUND: For selection and evaluation of potential biomarkers, inclusion of already published information is of utmost importance. In spite of significant advancements in text- and data-mining techniques, the vast knowledge space of biomarkers in biomedical text has remained unexplored. Existing named entity recognition approaches are not sufficiently selective for the retrieval of biomarker information from the literature. The purpose of this study was to identify textual features that enhance the effectiveness of biomarker information retrieval for different indication areas and diverse end user perspectives. METHODS: A biomarker terminology was created and further organized into six concept classes. Performance of this terminology was optimized towards balanced selectivity and specificity. The information retrieval performance using the biomarker terminology was evaluated based on various combinations of the terminology's six classes. Further validation of these results was performed on two independent corpora representing two different neurodegenerative diseases. RESULTS: The current state of the biomarker terminology contains 119 entity classes supported by 1890 different synonyms. The result of information retrieval shows improved retrieval rate of informative abstracts, which is achieved by including clinical management terms and evidence of gene/protein alterations (e.g. gene/protein expression status or certain polymorphisms) in combination with disease and gene name recognition. When additional filtering through other classes (e.g. diagnostic or prognostic methods) is applied, the typical high number of unspecific search results is significantly reduced. The evaluation results suggest that this approach enables the automated identification of biomarker information in the literature. A demo version of the search engine SCAIView, including the biomarker retrieval, is made available to the public through http://www.scaiview.com/scaiview-academia.html. 
CONCLUSIONS: The approach presented in this paper demonstrates that using a dedicated biomarker terminology for automated analysis of the scientific literature may be helpful as an aid to finding biomarker information in text. Successful extraction of candidate biomarker information from published resources can be considered the first step towards developing novel hypotheses. These hypotheses will be valuable for early decision-making in the drug discovery and development process.


Subjects
Biomarkers , Data Mining , Terminology as Topic , Algorithms , Humans , Search Engine
10.
J Biomed Semantics ; 13(1): 26, 2022 10 27.
Article in English | MEDLINE | ID: mdl-36303237

ABSTRACT

BACKGROUND: Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models are used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results - partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied on COVID-19 preprints shows a performance drop compared to the results on test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize. RESULTS: Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods, among them transfer learning, and show that current state-of-the-art methods work well for a given training and the corresponding test set but experience a significant lack of generalization when applied to new data. CONCLUSIONS: We argue that there is a need for larger annotated data sets for training and testing. Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.


Subjects
COVID-19 , Data Mining , Humans , Data Mining/methods , Natural Language Processing , Machine Learning
11.
Database (Oxford) ; 2022, 2022 07 01.
Article in English | MEDLINE | ID: mdl-35776071

ABSTRACT

preVIEW is a freely available semantic search engine for Coronavirus disease (COVID-19)-related preprint publications. Currently, it contains >43 800 documents indexed with >4000 semantic concepts, annotated automatically. During the last 2 years, the dynamic situation of the corona crisis has demanded dynamic development. Whereas new semantic concepts have been added over time (such as the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest), the service has also been extended with several features improving the usability and user friendliness. Most importantly, the user is now able to give feedback on detected semantic concepts, i.e. a user can mark annotations as true positives or false positives. In addition, we expanded our methods to construct search queries. The presented version of preVIEW also includes links to the peer-reviewed journal articles, if available. With the described system, we participated in the BioCreative VII interactive text-mining track and retrieved promising user-in-the-loop feedback. Additionally, as the occurrence of long-term symptoms after an infection with the virus SARS-CoV-2 (called long COVID) is getting more and more attention, we have recently developed and incorporated a long COVID classifier based on state-of-the-art methods and data manually curated by experts. The service is freely accessible under https://preview.zbmed.de.


Subjects
COVID-19 , Search Engine , COVID-19/complications , COVID-19/epidemiology , Humans , SARS-CoV-2 , Semantics , Post-Acute COVID-19 Syndrome
12.
Int J Med Inform ; 161: 104724, 2022 05.
Article in English | MEDLINE | ID: mdl-35279550

ABSTRACT

BACKGROUND: Health care records provide large amounts of data with real-world and longitudinal aspects, which is advantageous for predictive analyses and improvements in personalized medicine. Text-based records are a main source of information in mental health. Therefore, application of text mining to the electronic health records - especially mental state examination - is a key approach for detection of psychiatric disease phenotypes that relate to treatment outcomes. METHODS: We focused on the mental state examination (MSE) in the patients' discharge summaries as the key part of the psychiatric records. We prepared a sample of 150 text documents that we manually annotated for psychiatric attributes and symptoms. These documents were further divided into training and test sets. We designed and implemented a system to detect the psychiatric attributes automatically and linked the pathologically assessed attributes to AMDP terminology. This workflow uses a pre-trained neural network model, which is fine-tuned on the training set, and validated on the independent test set. Furthermore, a traditional NLP and rule-based component linked the recognized mentions to AMDP terminology. In a further step, we applied the system on a larger clinical dataset of 510 patients to extract their symptoms. RESULTS: The system identified the psychiatric attributes as well as their assessment (normal and pathological) and linked these entities to the AMDP terminology with an F1-score of 86% and 91% on an independent test set, respectively. CONCLUSION: The development of the current text mining system and the results highlight the feasibility of text mining methods applied to MSE in electronic mental health care reports. Our findings pave the way for the secondary use of routine data in the field of mental health, facilitating further clinical data analyses.


Subjects
Deep Learning , Mental Health , Data Mining/methods , Electronic Health Records , Humans , Natural Language Processing , Computer Neural Networks
13.
JAMIA Open ; 5(4): ooac087, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36380848

ABSTRACT

Objective: Healthcare data such as clinical notes are primarily recorded in an unstructured manner. If adequately translated into structured data, they can be utilized for health economics and set the groundwork for better individualized patient care. To structure clinical notes, deep-learning methods, particularly transformer-based models like Bidirectional Encoder Representations from Transformers (BERT), have recently received much attention. Currently, biomedical applications are primarily focused on the English language. While general-purpose German-language models such as GermanBERT and GottBERT have been published, adaptations for biomedical data are unavailable. This study evaluated the suitability of existing and novel transformer-based models for the German biomedical and clinical domain. Materials and Methods: We used 8 transformer-based models and pre-trained 3 new models on a newly generated biomedical corpus, and systematically compared them with each other. We annotated a new dataset of clinical notes and used it with 4 other corpora (BRONCO150, CLEF eHealth 2019 Task 1, GGPONC, and JSynCC) to perform named entity recognition (NER) and document classification tasks. Results: General-purpose language models can be used effectively for biomedical and clinical natural language processing (NLP) tasks; still, our newly trained BioGottBERT model outperformed GottBERT on both clinical NER tasks. However, training new biomedical models from scratch proved ineffective. Discussion: The domain-adaptation strategy's potential is currently limited due to a lack of pre-training data. Since general-purpose language models are only marginally inferior to domain-specific models, both options are suitable for developing German-language biomedical applications. Conclusion: General-purpose language models perform remarkably well on biomedical and clinical NLP tasks. If larger corpora become available in the future, domain-adapting these models may improve performance.

14.
Stud Health Technol Inform ; 281: 78-82, 2021 May 27.
Article in English | MEDLINE | ID: mdl-34042709

ABSTRACT

During the current COVID-19 pandemic, the rapid availability of profound information is crucial in order to derive information about diagnosis, disease trajectory, treatment or to adapt the rules of conduct in public. The increased importance of preprints for COVID-19 research initiated the design of the preprint search engine preVIEW. Conceptually, it is a lightweight semantic search engine focusing on easy inclusion of specialized COVID-19 textual collections and provides a user friendly web interface for semantic information retrieval. In order to support semantic search functionality, we integrated a text mining workflow for indexing with relevant terminologies. Currently, diseases, human genes and SARS-CoV-2 proteins are annotated, and more will be added in future. The system integrates collections from several different preprint servers that are used in the biomedical domain to publish non-peer-reviewed work, thereby enabling one central access point for the users. In addition, our service offers facet searching, export functionality and an API access. COVID-19 preVIEW is publicly available at https://preview.zbmed.de.


Subjects
COVID-19 , Humans , Pandemics , Publishing , SARS-CoV-2 , Semantics
15.
Stud Health Technol Inform ; 287: 78-82, 2021 Nov 18.
Article in English | MEDLINE | ID: mdl-34795085

ABSTRACT

The German Central Health Study Hub COVID-19 is an online service that offers bundled access to COVID-19 related studies conducted in Germany. It combines metadata and other information of epidemiologic, public health and clinical studies into a single data repository for FAIR data access. In addition to study characteristics, the system also allows easy access to study documents, as well as instruments for data collection. Study metadata and survey instruments are decomposed into individual data items and semantically enriched to ease findability. Data from existing clinical trial registries (DRKS, ClinicalTrials.gov and WHO ICTRP) are merged with epidemiological and public health studies that were manually collected and entered. More than 850 studies are listed as of September 2021.


Subjects
COVID-19 , Germany , Humans , Metadata , SARS-CoV-2 , Surveys and Questionnaires
16.
Stud Health Technol Inform ; 281: 794-798, 2021 May 27.
Article in English | MEDLINE | ID: mdl-34042687

ABSTRACT

COVID-19 poses a major challenge to individuals and societies around the world. Yet, it is difficult to obtain a good overview of studies across different medical fields of research such as clinical trials, epidemiology, and public health. Here, we describe a consensus metadata model to facilitate structured searches of COVID-19 studies and resources along with its implementation in three linked complementary web-based platforms. A relational database serves as the central study metadata hub that ensures compatibility with common trial registries (e.g. ICTRP) and standards like HL7 FHIR, CDISC ODM, and DataCite. The Central Search Hub was developed as a single-page application; the other two components with additional frontends are based on the SEEK platform and MICA, respectively. These platforms have different features concerning cohort browsing, item browsing, and access to documents and other study resources to meet divergent user needs. In this way, we want to promote transparent and harmonized COVID-19 research.


Subjects
COVID-19 , Epidemiologic Studies , Humans , Metadata , Registries , SARS-CoV-2
17.
Bioinformatics ; 24(13): i268-76, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586724

ABSTRACT

MOTIVATION: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text as well as its evaluation and the conversion rate with available name-to-structure tools. RESULTS: We present an IUPAC name recognizer with an F(1) measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F(1) measure 81.5%). The remaining recognition problems concern detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. AVAILABILITY: We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component.


Subjects
Abstracting and Indexing/methods , Chemical Dictionaries as Topic , MEDLINE , Natural Language Processing , Pharmaceutical Preparations/classification , Terminology as Topic , Controlled Vocabulary , Artificial Intelligence , Automated Pattern Recognition/methods
18.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-31603193

ABSTRACT

Knowledge of the molecular interactions of biological and chemical entities and their involvement in biological processes or clinical phenotypes is important for data interpretation. Unfortunately, this knowledge is mostly embedded in the literature in such a way that it is unavailable for automated data analysis procedures. Biological expression language (BEL) is a syntax representation allowing for the structured representation of a broad range of biological relationships. It is used in various situations to extract such knowledge and transform it into BEL networks. To support the tedious and time-intensive extraction work of curators with automated methods, we developed the BEL track within the framework of BioCreative Challenges. Within the BEL track, we provide training data and an evaluation environment to encourage the text mining community to tackle the automatic extraction of complex BEL relationships. In 2017 BioCreative VI, the 2015 BEL track was repeated with new test data. Although only minor improvements in text snippet retrieval for given statements were achieved during this second BEL task iteration, a significant increase of BEL statement extraction performance from provided sentences could be seen. The best performing system reached a 32% F-score for the extraction of complete BEL statements and with the given named entities this increased to 49%. This time, besides rule-based systems, new methods involving hierarchical sequence labeling and neural networks were applied for BEL statement extraction.


Subjects
Data Mining , Factual Databases , Computer Neural Networks , Controlled Vocabulary
19.
Bioinformatics ; 23(13): i264-72, 2007 Jul 01.
Article in English | MEDLINE | ID: mdl-17646305

ABSTRACT

UNLABELLED: Knowledge about biological effects of small molecules helps in the understanding of biological processes and supports the development of new therapeutic agents. DrugBank is a high quality database providing such information about drugs that contains annotation of drug effects and classification of therapeutic effects. However, to broaden the scope of such a database in classifying and annotating drugs, systems for automatic extraction of classification terms and the corresponding annotation of drugs are needed. We have developed an approach for the identification of new terms used in unstructured text that provide information about drug properties. It is based on the identification and extraction of phrases corresponding to lexico-syntactic patterns (so-called Hearst patterns) that contain drug names and directly related drug annotation terms. Such phrases could be identified with a high performance in DrugBank text (0.89 F-score) and in Medline abstracts (0.83 F-score). In comparison to DrugBank annotation terminology, a large number of new drug annotation terms could be found. The evaluation of terms extracted from Medline showed that 29-53% of them are new valid drug property terms. They could be assigned to existing and new drug property classes not provided by the DrugBank drug annotation. We come to the conclusion that our system can support database content update by providing additional drug descriptions of pharmacological effects not yet found in databases like DrugBank. Moreover, we propose that automatic normalization of terms improves the annotation and the retrieval of relevant database entries. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
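To make the idea of a Hearst pattern concrete, the sketch below matches one classic pattern, "<class term> such as <item>, <item> and <item>", with a regular expression. It is a simplified illustration rather than the system described in the abstract: the pattern set, the single-word class term, the `extract_such_as` helper and the example sentence are all assumptions.

```python
import re

# One classic Hearst pattern; a real system would use several patterns
# and proper noun-phrase chunking instead of a single-word class term.
SUCH_AS = re.compile(r"\b(\w+)\s+such as\s+([^.;:()]+)")
# Separators inside the enumeration: commas (optionally followed by and/or) or a bare and/or.
ITEM_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_such_as(sentence):
    """Return (item, class_term) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        class_term = match.group(1)
        for item in ITEM_SEPARATOR.split(match.group(2)):
            item = item.strip()
            if item:
                pairs.append((item, class_term))
    return pairs


pairs = extract_such_as(
    "Patients received anticoagulants such as warfarin, heparin and dabigatran."
)
```

Each extracted pair links a drug name to a candidate annotation term, which is the kind of phrase-level evidence the abstract describes collecting from DrugBank text and Medline abstracts.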


Subjects
Artificial Intelligence , Database Management Systems , Factual Databases , Drug Information Services , Natural Language Processing , Periodicals as Topic , Pharmaceutical Preparations/classification , Information Storage and Retrieval/methods
20.
J Bioinform Comput Biol ; 5(6): 1277-96, 2007 Dec.
Article in English | MEDLINE | ID: mdl-18172929

ABSTRACT

The influence of genetic variations on diseases or cellular processes is the main focus of many investigations, and results of biomedical studies are often only accessible through scientific publications. Automatic extraction of this information requires recognition of the gene names and the accompanying allelic variant information. In a previous work, the OSIRIS system for the detection of allelic variation in text based on a query expansion approach was communicated. Challenges associated with this system are the relatively low recall for variation mentions and gene name recognition. To tackle this challenge, we integrate the ProMiner system developed for the recognition and normalization of gene and protein names with a conditional random field (CRF)-based recognition of variation terms in biomedical text. Following the newly developed normalization of variation entities, we can link textual entities to Single Nucleotide Polymorphism database (dbSNP) entries. The performance of this novel approach is evaluated, and improved results in comparison to state-of-the-art systems are reported.


Subjects
Genetic Variation , Algorithms , Alleles , Computational Biology , Nucleic Acid Databases , Humans , Information Storage and Retrieval , MEDLINE , Statistical Models , Single Nucleotide Polymorphism , Terminology as Topic