Results 1 - 20 of 46
1.
Sci Data ; 11(1): 982, 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39251610

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.


Subject(s)
Enzymes , Natural Language Processing , Enzymes/chemistry , PubMed , Databases, Protein , Knowledge Bases
2.
Database (Oxford) ; 2024, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39126204

ABSTRACT

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. 
All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.


Subject(s)
Data Curation , Humans , Data Curation/methods , Data Mining/methods , Semantics , PubMed
3.
Database (Oxford) ; 2024, 2024 Aug 08.
Article in English | MEDLINE | ID: mdl-39114977

ABSTRACT

The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-score performances achieved for the Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-score performances achieved for the Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ ; https://codalab.lisn.upsaclay.fr/competitions/13377 ; https://codalab.lisn.upsaclay.fr/competitions/13378


Subject(s)
Data Mining , Natural Language Processing , Humans , Data Mining/methods , Databases, Factual , Semantics
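
The four evaluation levels reported in the abstract above (relation pair, relation type, novelty, and all aspects together) can be illustrated with a small scoring sketch. This is a minimal example under stated assumptions, not the official BioRED track evaluation script: the tuple layout `(doc_id, entity_a, entity_b, rel_type, novelty)`, the names `LEVELS`, `project` and `f1_score`, and the toy identifiers are all hypothetical. It also assumes micro-averaged exact matching and treats entity pairs as ordered.

```python
from typing import Iterable, Tuple

# Hypothetical relation record: (doc_id, entity_a, entity_b, rel_type, novelty)
Relation = Tuple[str, str, str, str, str]

# Which tuple fields must match at each evaluation level (assumed layout).
LEVELS = {
    "pair": (0, 1, 2),
    "type": (0, 1, 2, 3),
    "novelty": (0, 1, 2, 4),
    "all": (0, 1, 2, 3, 4),
}


def project(relations: Iterable[Relation], level: str) -> set:
    """Keep only the fields that count at the given level."""
    fields = LEVELS[level]
    return {tuple(rel[i] for i in fields) for rel in relations}


def f1_score(gold: Iterable[Relation], pred: Iterable[Relation], level: str) -> float:
    """Micro-averaged F1 over exact matches at one evaluation level."""
    gold_set, pred_set = project(gold, level), project(pred, level)
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration.
gold = [
    ("doc1", "D001", "G001", "Association", "Novel"),
    ("doc1", "C001", "G001", "Binding", "No"),
]
pred = [
    ("doc1", "D001", "G001", "Association", "No"),     # right pair and type, wrong novelty
    ("doc1", "C001", "G001", "Binding", "No"),         # fully correct
    ("doc1", "C001", "D001", "Association", "Novel"),  # spurious pair
]

for level in LEVELS:
    print(level, round(f1_score(gold, pred, level), 3))
```

For the same predictions, the score can only stay equal or fall as more fields are required to match, which is consistent with the drop from pair-level to comprehensive F-scores reported in the abstract.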
4.
ArXiv ; 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38903736

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.

5.
ArXiv ; 2024 May 25.
Article in English | MEDLINE | ID: mdl-38903746

ABSTRACT

Gene set knowledge discovery is essential for advancing human functional genomics. Recent studies have shown promising performance by harnessing the power of Large Language Models (LLMs) on this task. Nonetheless, their results are subject to several limitations common in LLMs such as hallucinations. In response, we present GeneAgent, a first-of-its-kind language agent featuring self-verification capability. It autonomously interacts with various biological databases and leverages relevant domain knowledge to improve accuracy and reduce hallucination occurrences. Benchmarking on 1,106 gene sets from different sources, GeneAgent consistently outperforms standard GPT-4 by a significant margin. Moreover, a detailed manual review confirms the effectiveness of the self-verification module in minimizing hallucinations and generating more reliable analytical narratives. To demonstrate its practical utility, we apply GeneAgent to seven novel gene sets derived from mouse B2905 melanoma cell lines, with expert evaluations showing that GeneAgent offers novel insights into gene functions and subsequently expedites knowledge discovery.

6.
Nucleic Acids Res ; 52(W1): W540-W546, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38572754

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.


Subject(s)
PubMed , Artificial Intelligence , Humans , Software , Data Mining/methods , Semantics , Internet
7.
ArXiv ; 2024 Jan 19.
Article in English | MEDLINE | ID: mdl-38410657

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases, and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

8.
ACS Nano ; 17(24): 25552-25564, 2023 Dec 26.
Article in English | MEDLINE | ID: mdl-38096149

ABSTRACT

Photomemristors have been regarded as one of the most promising candidates for next-generation hardware-based neuromorphic computing due to their potentials of fast data transmission and low power consumption. However, intriguingly, so far, photomemristors seldom display truly nonvolatile memory characteristics with high light sensitivity. Herein, we demonstrate ultrasensitive photomemristors utilizing two-dimensional (2D) Ruddlesden-Popper (RP) perovskites with a highly polar donor-acceptor-type push-pull organic cation, 4-(5-(2-aminoethyl)thiophen-2-yl)benzonitrile+ (EATPCN+), as charge-trapping layers. High linearity and almost zero-decay retention are observed in (EATPCN)2PbI4 devices, which are very distinct from that of the traditional 2D RP perovskite devices consisting of nonpolar organic cations, such as phenethylamine+ (PEA+) and octylamine+ (OA+), and traditional 3D perovskite devices consisting of methylamine+ (MA+). The 2-fold advantages, including desirable spatial crystal arrangement and engineered energetic band alignment, clarify the mechanism of superior performance in (EATPCN)2PbI4 devices. The optimized (EATPCN)2PbI4 photomemristor also shows a memory window of 87.9 V and an on/off ratio of 10⁶ with a retention time of at least 2.4 × 10⁵ s and remains unchanged after >10⁵ writing-reading-erasing-reading endurance cycles. Very low energy consumptions of 1.12 and 6 fJ for both light stimulation and the reading process of each status update are also demonstrated. The extremely low power consumption and high photoresponsivity were simultaneously achieved. The high photosensitivity surpasses that of a state-of-the-art commercial pulse energy meter by several orders of magnitude. With their outstanding linearity and retention, rabbit images have been rebuilt by (EATPCN)2PbI4 photomemristors, which truthfully render the image without fading over time. Finally, by utilizing the powerful ∼8 bits of nonvolatile potentiation and depression levels of (EATPCN)2PbI4 photomemristors, the accuracies of the recognition tasks of CIFAR-10 image classification and MNIST handwritten digit classification have reached 89% and 94.8%, respectively. This study represents the first report of utilizing a functional donor-acceptor type of organic cation in 2D RP perovskites for high-performance photomemristors with characteristics that are not found in current halide perovskites.

9.
ArXiv ; 2023 Oct 17.
Article in English | MEDLINE | ID: mdl-37904734

ABSTRACT

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.

10.
Bioinformatics ; 39(10), 2023 Oct 03.
Article in English | MEDLINE | ID: mdl-37878810

ABSTRACT

MOTIVATION: Gene name normalization is an important yet highly complex task in biomedical text mining research, as gene names can be highly ambiguous and may refer to different genes in different species or share similar names with other bioconcepts. This poses a challenge for accurately identifying and linking gene mentions to their corresponding entries in databases such as NCBI Gene or UniProt. While there has been a body of literature on the gene normalization task, few have addressed all of these challenges or make their solutions publicly available to the scientific community. RESULTS: Building on the success of GNormPlus, we have created GNorm2: a more advanced tool with optimized functions and improved performance. GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date. Our tool is freely available for download. AVAILABILITY AND IMPLEMENTATION: https://github.com/ncbi/GNorm2.


Subject(s)
Data Mining , Data Mining/methods , Databases, Factual
11.
J Biomed Inform ; 146: 104487, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37673376

ABSTRACT

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

12.
ACS Appl Mater Interfaces ; 15(37): 44033-44042, 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37694918

ABSTRACT

Three organic conjugated small molecules, DTA-DTPZ, Cz-DTPZ, and DTA-me-DTPZ comprising an antiaromatic 5,10-ditolylphenazine (DTPZ) core and electron-donating peripheral substituents with high HOMOs (-4.2 to -4.7 eV) and multiple reversible oxidative potentials are reported. The corresponding films sandwiched between two electrodes show unipolar and switchable hysteresis current-voltage (I-V) characteristics upon voltage sweeping, revealing the prominent features of nonvolatile memristor behaviors. The numerical simulation of the I-V curves suggests that the carriers generated by the oxidized molecules lead to the increment of conductance. However, the accumulated carriers tend to deteriorate the device endurance. The electroactive sites are fully blocked in the dimethylated molecule DTA-me-DTPZ, preventing the irreversible electrochemical reaction, thereby boosting the endurance of the memristor device over 300 cycles. Despite the considerable improvement in endurance, the decrement of on/off ratio from 10⁵ to 10¹ after 250 cycles suggests that the excessive charge carriers (radical cations) remain a problem. Thus, a new strategy of doping an electron-deficient material, CN-T2T, into the unipolar active layer was introduced to further improve the device stability. The device containing DTA-me-DTPZ:CN-T2T (1:1) blend as the active layer retained the endurance and on/off ratio (∼10⁴) upon sweeping 300 cycles. The molecular designs and doping strategy demonstrate effective approaches toward more stable metal-free organic conjugated small-molecule memristors.

13.
ArXiv ; 2023 Jun 19.
Article in English | MEDLINE | ID: mdl-37502629

ABSTRACT

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

14.
Bioinformatics ; 39(5), 2023 May 04.
Article in English | MEDLINE | ID: mdl-37171899

ABSTRACT

MOTIVATION: Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g. gene or disease). RESULTS: We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g. the entire PubMed data). AVAILABILITY AND IMPLEMENTATION: The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.


Subject(s)
Deep Learning , Data Mining/methods , Software , Language , PubMed
15.
Adv Sci (Weinh) ; 10(10): e2206076, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36748267

ABSTRACT

Although vacuum-deposited metal halide perovskite light-emitting diodes (PeLEDs) have great promise for use in large-area high-color-gamut displays, the efficiency of vacuum-sublimed PeLEDs currently lags that of solution-processed counterparts. In this study, highly efficient vacuum-deposited PeLEDs are prepared through a process of optimizing the stoichiometric ratio of the sublimed precursors under high vacuum and incorporating ultrathin under- and upper-layers for the perovskite emission layer (EML). In contrast to the situation in most vacuum-deposited organic light-emitting devices, the properties of these perovskite EMLs are highly influenced by the presence and nature of the upper- and presublimed materials, thereby allowing us to enhance the performance of the resulting devices. By eliminating Pb⁰ formation and passivating defects in the perovskite EMLs, the PeLEDs achieve an outstanding external quantum efficiency (EQE) of 10.9% when applying a very smooth and flat geometry; it reaches an extraordinarily high value of 21.1% when integrating a light out-coupling structure, breaking through the 10% EQE milestone of vacuum-deposited PeLEDs.

16.
Brief Bioinform ; 25(1), 2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38168838

ABSTRACT

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.


Subject(s)
Information Storage and Retrieval , Language , Humans , Privacy , Research Personnel
17.
Chem Sci ; 13(44): 12996-13005, 2022 Nov 16.
Article in English | MEDLINE | ID: mdl-36425506

ABSTRACT

Owing to the high technology maturity of thermally activated delayed fluorescence (TADF) emitter design with a specific molecular shape, extremely high-performance organic light-emitting diodes (OLEDs) have recently been achieved via various doping techniques. Recently, undoped OLEDs have drawn immense attention because of their manufacturing cost reduction and procedure simplification. However, capable materials as host emitters are rare and precious because general fluorophores in high-concentration states suffer from serious aggregation-caused quenching (ACQ) and undergo exciton quenching. In this work, a series of diboron materials, CzDBA, iCzDBA, and tBuCzDBA, is introduced to realize the effect of steric hindrance and the molecular aspect ratio via experimental and theoretical studies. We computed transition electric dipole moment (TEDM) and molecular dynamics (MD) simulations as a proof-of-concept model to investigate the molecular stacking in neat films. It is worth noting that the pure tBuCzDBA film with a high horizontal ratio of 92% is employed to achieve a nondoped OLED with an excellent external quantum efficiency of 26.9%. In addition, we demonstrated the first ultrathin emitting layer (1 nm) TADF device, which exhibited outstanding power efficiency. This molecular design and high-performance devices show the potential of power-saving and economical fabrication for advanced OLEDs.

18.
Database (Oxford) ; 2022, 2022 Oct 13.
Article in English | MEDLINE | ID: mdl-36227127

ABSTRACT

The automatic assignment of species information to the corresponding genes in a research article is a critically important step in the gene normalization task, whereby a gene mention is normalized and linked to a database record or an identifier by a text-mining algorithm. Existing methods typically rely on heuristic rules based on gene and species co-occurrence in the article, but their accuracy is suboptimal. We therefore developed a high-performance method, using a novel deep learning-based framework, to identify whether there is a relation between a gene and a species. Instead of the traditional binary classification framework in which all possible pairs of genes and species in the same article are evaluated, we treat the problem as a sequence labeling task such that only a fraction of the pairs needs to be considered. Our benchmarking results show that our approach obtains significantly higher performance compared to that of the rule-based baseline method for the species assignment task (from 65.8% to 81.3% in accuracy). The source code and data for species assignment are freely available. Database URL: https://github.com/ncbi/SpeciesAssignment.


Subject(s)
Data Mining , Software , Algorithms , Benchmarking , Data Mining/methods , Databases, Factual
19.
Database (Oxford) ; 2022, 2022 Jul 19.
Article in English | MEDLINE | ID: mdl-35856889

ABSTRACT

Automatically extracting interactions between chemical compounds/drugs and genes/proteins is significantly beneficial to drug discovery, drug repurposing, drug design and biomedical knowledge graph construction. To promote the development of relation extraction between drugs and proteins, the BioCreative VII challenge organized the DrugProt track. This paper describes the approach we developed for this task. In addition to the conventional text classification framework that has been widely used in relation extraction tasks, we propose a sequence labeling framework for drug-protein relation extraction. We first comprehensively compared the cutting-edge biomedical pre-trained language models for both frameworks. Then, we explored several ensemble methods to further improve the final performance. In the evaluation of the challenge, our best submission (i.e. the ensemble of models in two frameworks via majority voting) achieved an F1-score of 0.795 on the official test set. Further, we found that the sequence labeling framework is more efficient and achieves better performance than the text classification framework. Finally, our ensemble of the sequence labeling models with majority voting achieves the best F1-score of 0.800 on the test set. Database URL: https://github.com/lingluodlut/BioCreativeVII_DrugProt.


Subject(s)
Data Mining , Proteins , Data Mining/methods , Databases, Factual , Language , Publications
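
The voting ensemble mentioned in the abstract above can be sketched in a few lines. This is an illustrative example only, not the authors' implementation: the function name `majority_vote`, the label strings, and the tie-breaking rule (fall back to a default "no relation" label) are assumptions made for the sketch.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(model_outputs: Sequence[Sequence[str]], default: str = "NONE") -> List[str]:
    """Combine per-example labels from several models by majority vote.

    model_outputs[m][i] is the label that model m assigns to example i.
    A label wins only if it has strictly more votes than any other label;
    on a tie the (assumed) default label is returned.
    """
    if not model_outputs:
        return []
    n_examples = len(model_outputs[0])
    if any(len(labels) != n_examples for labels in model_outputs):
        raise ValueError("all models must label the same number of examples")

    ensemble = []
    for votes in zip(*model_outputs):
        ranked = Counter(votes).most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        ensemble.append(default if tied else ranked[0][0])
    return ensemble


# Toy predictions from three hypothetical models for four candidate pairs.
outputs = [
    ["INHIBITOR", "NONE", "SUBSTRATE", "ACTIVATOR"],
    ["INHIBITOR", "NONE", "NONE", "INHIBITOR"],
    ["ACTIVATOR", "NONE", "SUBSTRATE", "NONE"],
]
print(majority_vote(outputs))
```

Using an odd number of models reduces ties, but with more than two labels a tie is still possible (as in the last toy example), so some explicit fallback rule is needed.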
20.
Brief Bioinform ; 23(5), 2022 Sep 20.
Article in English | MEDLINE | ID: mdl-35849818

ABSTRACT

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.


Subject(s)
Algorithms , Data Mining , Proteins , PubMed