Results 1 - 20 of 41
1.
Nucleic Acids Res ; 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38572754

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

2.
ArXiv ; 2024 Jan 19.
Article in English | MEDLINE | ID: mdl-38410657

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases, and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

3.
ACS Nano ; 17(24): 25552-25564, 2023 Dec 26.
Article in English | MEDLINE | ID: mdl-38096149

ABSTRACT

Photomemristors have been regarded as one of the most promising candidates for next-generation hardware-based neuromorphic computing due to their potentials of fast data transmission and low power consumption. However, intriguingly, so far, photomemristors seldom display truly nonvolatile memory characteristics with high light sensitivity. Herein, we demonstrate ultrasensitive photomemristors utilizing two-dimensional (2D) Ruddlesden-Popper (RP) perovskites with a highly polar donor-acceptor-type push-pull organic cation, 4-(5-(2-aminoethyl)thiophen-2-yl)benzonitrile+ (EATPCN+), as charge-trapping layers. High linearity and almost zero-decay retention are observed in (EATPCN)2PbI4 devices, which are very distinct from that of the traditional 2D RP perovskite devices consisting of nonpolar organic cations, such as phenethylamine+ (PEA+) and octylamine+ (OA+), and traditional 3D perovskite devices consisting of methylamine+ (MA+). The 2-fold advantages, including desirable spatial crystal arrangement and engineered energetic band alignment, clarify the mechanism of superior performance in (EATPCN)2PbI4 devices. The optimized (EATPCN)2PbI4 photomemristor also shows a memory window of 87.9 V and an on/off ratio of 10⁶ with a retention time of at least 2.4 × 10⁵ s and remains unchanged after >10⁵ writing-reading-erasing-reading endurance cycles. Very low energy consumptions of 1.12 and 6 fJ for both light stimulation and the reading process of each status update are also demonstrated. The extremely low power consumption and high photoresponsivity were simultaneously achieved. The high photosensitivity surpasses that of a state-of-the-art commercial pulse energy meter by several orders of magnitude. With their outstanding linearity and retention, rabbit images have been rebuilt by (EATPCN)2PbI4 photomemristors, which truthfully render the image without fading over time. Finally, by utilizing the powerful ∼8 bits of nonvolatile potentiation and depression levels of (EATPCN)2PbI4 photomemristors, the accuracies of the recognition tasks of CIFAR-10 image classification and MNIST handwritten digit classification have reached 89% and 94.8%, respectively. This study represents the first report of utilizing a functional donor-acceptor type of organic cation in 2D RP perovskites for high-performance photomemristors with characteristics that are not found in current halide perovskites.

4.
Bioinformatics ; 39(10), 2023 Oct 03.
Article in English | MEDLINE | ID: mdl-37878810

ABSTRACT

MOTIVATION: Gene name normalization is an important yet highly complex task in biomedical text mining research, as gene names can be highly ambiguous and may refer to different genes in different species or share similar names with other bioconcepts. This poses a challenge for accurately identifying and linking gene mentions to their corresponding entries in databases such as NCBI Gene or UniProt. While there has been a body of literature on the gene normalization task, few have addressed all of these challenges or make their solutions publicly available to the scientific community. RESULTS: Building on the success of GNormPlus, we have created GNorm2: a more advanced tool with optimized functions and improved performance. GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date. Our tool is freely available for download. AVAILABILITY AND IMPLEMENTATION: https://github.com/ncbi/GNorm2.


Subjects
Data Mining , Data Mining/methods , Databases, Factual
5.
ArXiv ; 2023 Oct 17.
Article in English | MEDLINE | ID: mdl-37904734

ABSTRACT

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.

6.
J Biomed Inform ; 146: 104487, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37673376

ABSTRACT

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

7.
ACS Appl Mater Interfaces ; 15(37): 44033-44042, 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37694918

ABSTRACT

Three organic conjugated small molecules, DTA-DTPZ, Cz-DTPZ, and DTA-me-DTPZ comprising an antiaromatic 5,10-ditolylphenazine (DTPZ) core and electron-donating peripheral substituents with high HOMOs (-4.2 to -4.7 eV) and multiple reversible oxidative potentials are reported. The corresponding films sandwiched between two electrodes show unipolar and switchable hysteresis current-voltage (I-V) characteristics upon voltage sweeping, revealing the prominent features of nonvolatile memristor behaviors. The numerical simulation of the I-V curves suggests that the carriers generated by the oxidized molecules lead to the increment of conductance. However, the accumulated carriers tend to deteriorate the device endurance. The electroactive sites are fully blocked in the dimethylated molecule DTA-me-DTPZ, preventing the irreversible electrochemical reaction, thereby boosting the endurance of the memristor device over 300 cycles. Despite the considerable improvement in endurance, the decrement of on/off ratio from 10⁵ to 10¹ after 250 cycles suggests that the excessive charge carriers (radical cations) remain a problem. Thus, a new strategy of doping an electron-deficient material, CN-T2T, into the unipolar active layer was introduced to further improve the device stability. The device containing DTA-me-DTPZ:CN-T2T (1:1) blend as the active layer retained the endurance and on/off ratio (∼10⁴) upon sweeping 300 cycles. The molecular designs and doping strategy demonstrate effective approaches toward more stable metal-free organic conjugated small-molecule memristors.

8.
ArXiv ; 2023 Jun 19.
Article in English | MEDLINE | ID: mdl-37502629

ABSTRACT

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

9.
Bioinformatics ; 39(5), 2023 May 04.
Article in English | MEDLINE | ID: mdl-37171899

ABSTRACT

MOTIVATION: Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g. gene or disease). RESULTS: We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g. the entire PubMed data). AVAILABILITY AND IMPLEMENTATION: The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.


Subjects
Deep Learning , Data Mining/methods , Software , Language , PubMed
10.
Adv Sci (Weinh) ; 10(10): e2206076, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36748267

ABSTRACT

Although vacuum-deposited metal halide perovskite light-emitting diodes (PeLEDs) have great promise for use in large-area high-color-gamut displays, the efficiency of vacuum-sublimed PeLEDs currently lags that of solution-processed counterparts. In this study, highly efficient vacuum-deposited PeLEDs are prepared through a process of optimizing the stoichiometric ratio of the sublimed precursors under high vacuum and incorporating ultrathin under- and upper-layers for the perovskite emission layer (EML). In contrast to the situation in most vacuum-deposited organic light-emitting devices, the properties of these perovskite EMLs are highly influenced by the presence and nature of the upper- and presublimed materials, thereby allowing us to enhance the performance of the resulting devices. By eliminating Pb⁰ formation and passivating defects in the perovskite EMLs, the PeLEDs achieve an outstanding external quantum efficiency (EQE) of 10.9% when applying a very smooth and flat geometry; it reaches an extraordinarily high value of 21.1% when integrating a light out-coupling structure, breaking through the 10% EQE milestone of vacuum-deposited PeLEDs.

11.
Brief Bioinform ; 25(1), 2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38168838

ABSTRACT

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.


Subjects
Information Storage and Retrieval , Language , Humans , Privacy , Research Personnel
12.
Chem Sci ; 13(44): 12996-13005, 2022 Nov 16.
Article in English | MEDLINE | ID: mdl-36425506

ABSTRACT

Owing to the high technology maturity of thermally activated delayed fluorescence (TADF) emitter design with a specific molecular shape, extremely high-performance organic light-emitting diodes (OLEDs) have recently been achieved via various doping techniques. Recently, undoped OLEDs have drawn immense attention because of their manufacturing cost reduction and procedure simplification. However, capable materials as host emitters are rare and precious because general fluorophores in high-concentration states suffer from serious aggregation-caused quenching (ACQ) and undergo exciton quenching. In this work, a series of diboron materials, CzDBA, iCzDBA, and tBuCzDBA, is introduced to realize the effect of steric hindrance and the molecular aspect ratio via experimental and theoretical studies. We computed transition electric dipole moment (TEDM) and molecular dynamics (MD) simulations as a proof-of-concept model to investigate the molecular stacking in neat films. It is worth noting that the pure tBuCzDBA film with a high horizontal ratio of 92% is employed to achieve a nondoped OLED with an excellent external quantum efficiency of 26.9%. In addition, we demonstrated the first ultrathin emitting layer (1 nm) TADF device, which exhibited outstanding power efficiency. This molecular design and high-performance devices show the potential of power-saving and economical fabrication for advanced OLEDs.

13.
Database (Oxford) ; 2022, 2022 Oct 13.
Article in English | MEDLINE | ID: mdl-36227127

ABSTRACT

The automatic assignment of species information to the corresponding genes in a research article is a critically important step in the gene normalization task, whereby a gene mention is normalized and linked to a database record or an identifier by a text-mining algorithm. Existing methods typically rely on heuristic rules based on gene and species co-occurrence in the article, but their accuracy is suboptimal. We therefore developed a high-performance method, using a novel deep learning-based framework, to identify whether there is a relation between a gene and a species. Instead of the traditional binary classification framework in which all possible pairs of genes and species in the same article are evaluated, we treat the problem as a sequence labeling task such that only a fraction of the pairs needs to be considered. Our benchmarking results show that our approach obtains significantly higher performance compared to that of the rule-based baseline method for the species assignment task (from 65.8% to 81.3% in accuracy). The source code and data for species assignment are freely available. Database URL: https://github.com/ncbi/SpeciesAssignment.


Subjects
Data Mining , Software , Algorithms , Benchmarking , Data Mining/methods , Databases, Factual
14.
Database (Oxford) ; 2022, 2022 Jul 19.
Article in English | MEDLINE | ID: mdl-35856889

ABSTRACT

Automatically extracting interactions between chemical compounds/drugs and genes/proteins is significantly beneficial to drug discovery, drug repurposing, drug design and biomedical knowledge graph construction. To promote the development of the relation extraction between drug and protein, the BioCreative VII challenge organized the DrugProt track. This paper describes the approach we developed for this task. In addition to the conventional text classification framework that has been widely used in relation extraction tasks, we propose a sequence labeling framework for drug-protein relation extraction. We first comprehensively compared the cutting-edge biomedical pre-trained language models for both frameworks. Then, we explored several ensemble methods to further improve the final performance. In the evaluation of the challenge, our best submission (i.e. the ensemble of models in two frameworks via majority voting) achieved the F1-score of 0.795 on the official test set. Further, we realized the sequence labeling framework is more efficient and achieves better performance than the text classification framework. Finally, our ensemble of the sequence labeling models with majority voting achieves the best F1-score of 0.800 on the test set. DATABASE URL: https://github.com/lingluodlut/BioCreativeVII_DrugProt.
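
The majority-voting ensemble mentioned in this abstract can be illustrated with a minimal sketch. The label names, the three-model setup and the tie-breaking rule below are illustrative assumptions and are not taken from the authors' code.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the predictions for one example.

    Ties are broken in favour of the label that appears first in the list,
    which is an arbitrary choice made for this sketch.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three models for four drug-protein candidate pairs.
model_outputs = [
    ["INHIBITOR", "NONE", "SUBSTRATE", "NONE"],
    ["INHIBITOR", "ACTIVATOR", "SUBSTRATE", "NONE"],
    ["NONE", "ACTIVATOR", "NONE", "NONE"],
]

# zip(*...) regroups the predictions so that each tuple holds all votes
# for the same candidate pair.
ensemble = [majority_vote(list(votes)) for votes in zip(*model_outputs)]
print(ensemble)
```

Each candidate pair receives the label chosen by most models, so a single model's error is overruled when the other two agree.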


Subjects
Data Mining , Proteins , Data Mining/methods , Databases, Factual , Language , Publications
15.
Brief Bioinform ; 23(5), 2022 Sep 20.
Article in English | MEDLINE | ID: mdl-35849818

ABSTRACT

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
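
The F-scores quoted in this abstract are typically obtained by comparing predicted relations against gold annotations. The sketch below shows that calculation under stated assumptions: the tuple layout and the example relations are hypothetical, and this is not the BioRED file format or its official evaluation script.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over two collections of relations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical relations: (document id, entity 1, entity 2, relation type).
gold = {
    ("doc1", "GeneA", "DiseaseX", "Association"),
    ("doc1", "ChemB", "DiseaseX", "Negative_Correlation"),
    ("doc2", "GeneC", "ChemB", "Positive_Correlation"),
    ("doc2", "GeneC", "DiseaseY", "Association"),
}
predicted = {
    ("doc1", "GeneA", "DiseaseX", "Association"),
    ("doc2", "GeneC", "ChemB", "Positive_Correlation"),
    ("doc2", "GeneA", "DiseaseY", "Association"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here two of the three predictions are correct and two of the four gold relations are found, so the F-score is the harmonic mean of 2/3 and 1/2.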


Subjects
Algorithms , Data Mining , Proteins , PubMed
16.
ACS Appl Mater Interfaces ; 14(17): 19795-19805, 2022 May 04.
Article in English | MEDLINE | ID: mdl-35417120

ABSTRACT

Highly sensitive X-ray detection is crucial in, for example, medical imaging and secure inspection. Halide perovskite X-ray detectors are promising candidates for detecting highly energetic radiation. In this report, we describe vacuum-deposited Cs-based perovskite X-ray detectors possessing a p-i-n architecture. Because of the built-in potential of the p-i-n structure, these perovskite X-ray detectors were capable of efficient charge collection and displayed an exceptionally high X-ray sensitivity (1.2 C Gy_air⁻¹ cm⁻³) under self-powered, zero-bias conditions. We ascribe the outstanding X-ray sensitivity of the vacuum-deposited CsPbI2Br devices to their prominent charge carrier mobility. Moreover, these devices functioned with a lowest detection limit of 25.69 nGy_air s⁻¹ and possessed excellent stability after exposure to over 3000 times the total dose of a chest X-ray image. For comparison, we also prepared traditional spin-coated CH3NH3-based perovskite devices having a similar device architecture. Their volume sensitivity was only one-fifth of that of the vacuum-deposited CsPbI2Br devices. Thus, all-vacuum deposition appears to be a new strategy for developing perovskite X-ray detectors; with a high practical deposition rate, a balance can be reached between the thickness of the absorbing layer and the fabrication time.

17.
J Biomed Inform ; 129: 104059, 2022 May.
Article in English | MEDLINE | ID: mdl-35351638

ABSTRACT

The study aims at developing a neural network model to improve the performance of Human Phenotype Ontology (HPO) concept recognition tools. We used the terms, definitions, and comments about the phenotypic concepts in the HPO database to train our model. The document to be analyzed is first split into sentences and annotated with a base method to generate candidate concepts. The sentences, along with the candidate concepts, are then fed into the pre-trained model for re-ranking. Our model comprises the pre-trained BlueBERT and a feature selection module, followed by a contrastive loss. We re-ranked the results generated by three robust HPO annotation tools and compared the performance against most of the existing approaches. The experimental results show that our model can improve the performance of the existing methods. Significantly, it boosted 3.0% and 5.6% in F1 score on the two evaluated datasets compared with the base methods. It removed more than 80% of the false positives predicted by the base methods, resulting in up to 18% improvement in precision. Our model utilizes the descriptive data in the ontology and the contextual information in the sentences for re-ranking. The results indicate that the additional information and the re-ranking model can significantly enhance the precision of HPO concept recognition compared with the base method.


Subjects
Language , Neural Networks, Computer , Databases, Factual , Humans , Phenotype
18.
Bioinformatics ; 37(13): 1884-1890, 2021 Jul 27.
Article in English | MEDLINE | ID: mdl-33471061

ABSTRACT

MOTIVATION: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation. RESULTS: In this article, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger without requiring manually annotated training data achieves competitive performance as compared with state-of-the-art supervised methods. AVAILABILITY AND IMPLEMENTATION: The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
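
The dictionary side of the hybrid approach described in this abstract, where candidate phrases are n-grams of the input sentence, can be sketched as follows. This is a simplified illustration and not PhenoTagger's code: the function names, the whitespace tokenizer and the two-entry dictionary are assumptions made for the example.

```python
def ngrams(tokens, max_n):
    """Yield (start, end, phrase) for every n-gram of up to max_n tokens."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            end = start + n
            yield start, end, " ".join(tokens[start:end])


def dictionary_match(sentence, dictionary, max_n=3):
    """Return the n-grams of the sentence that exactly match a dictionary entry."""
    tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
    hits = [
        (start, end, phrase, dictionary[phrase])
        for start, end, phrase in ngrams(tokens, max_n)
        if phrase in dictionary
    ]
    return sorted(hits)  # order matches by token position


# Tiny illustrative dictionary mapping lower-cased terms to HPO identifiers.
hpo_terms = {
    "short stature": "HP:0004322",
    "seizure": "HP:0001250",
}

matches = dictionary_match(
    "The patient presented with short stature and seizure", hpo_terms
)
print(matches)
```

Exact matching of this kind favours precision; in the method described above, a trained classifier scores the same candidate phrases so that unseen synonyms can also be recognized.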

19.
Bioinformatics ; 36(24): 5678-5685, 2021 Apr 05.
Article in English | MEDLINE | ID: mdl-33416851

ABSTRACT

MOTIVATION: A biomedical relation statement is commonly expressed in multiple sentences and consists of many concepts, including gene, disease, chemical and mutation. To automatically extract information from biomedical literature, existing biomedical text-mining approaches typically formulate the problem as a cross-sentence n-ary relation-extraction task that detects relations among n entities across multiple sentences, and use either a graph neural network (GNN) with long short-term memory (LSTM) or an attention mechanism. Recently, Transformer has been shown to outperform LSTM on many natural language processing (NLP) tasks. RESULTS: In this work, we propose a novel architecture that combines Bidirectional Encoder Representations from Transformers with Graph Transformer (BERT-GT), through integrating a neighbor-attention mechanism into the BERT architecture. Unlike the original Transformer architecture, which utilizes the whole sentence(s) to calculate the attention of the current token, the neighbor-attention mechanism in our method calculates its attention utilizing only its neighbor tokens. Thus, each token can pay attention to its neighbor information with little noise. We show that this is critically important when the text is very long, as in cross-sentence or abstract-level relation-extraction tasks. Our benchmarking results show improvements of 5.44% and 3.89% in accuracy and F1-measure over the state-of-the-art on n-ary and chemical-protein relation datasets, suggesting BERT-GT is a robust approach that is applicable to other biomedical relation extraction tasks or datasets. AVAILABILITY AND IMPLEMENTATION: the source code of BERT-GT will be made freely available at https://github.com/ncbi/bert_gt upon publication. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
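
The idea of restricting attention to neighbor tokens, as described in this abstract, can be sketched as a masked softmax. This is a minimal illustration under assumptions, not the BERT-GT implementation: the neighbor sets are supplied by hand, every token is also allowed to attend to itself, and the raw scores are made up.

```python
import math


def neighbor_attention_weights(scores, neighbors):
    """Softmax each row of scores over the token's neighbors only.

    scores[i][j] is the raw attention score of token i for token j and
    neighbors[i] is the set of token indices that token i may attend to.
    Letting a token attend to itself is an assumption of this sketch.
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = neighbors[i] | {i}
        peak = max(row[j] for j in allowed)  # subtract for numerical stability
        exps = [
            math.exp(row[j] - peak) if j in allowed else 0.0
            for j in range(len(row))
        ]
        total = sum(exps)
        weights.append([value / total for value in exps])
    return weights


# Three tokens in a chain: 0 - 1 - 2 (hypothetical neighbor structure).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
neighbors = [{1}, {0, 2}, {1}]

weights = neighbor_attention_weights(scores, neighbors)
for row in weights:
    print([round(value, 3) for value in row])
```

Tokens outside the neighbor set receive exactly zero weight, which is how distant, unrelated tokens are kept from adding noise in long inputs.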

20.
Brief Bioinform ; 21(6): 2219-2238, 2020 Dec 01.
Article in English | MEDLINE | ID: mdl-32602538

ABSTRACT

Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein-protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonly used BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein-protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP_EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681; Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.


Subjects
Data Mining , Information Storage and Retrieval , Natural Language Processing , Benchmarking , Computational Biology/methods , Data Mining/methods , Databases, Factual , Neural Networks, Computer , PubMed , Software , Surveys and Questionnaires