Results 1 - 20 of 4,683
1.
Annu Rev Neurosci ; 47(1): 277-301, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38669478

ABSTRACT

It has long been argued that only humans could produce and understand language. But now, for the first time, artificial language models (LMs) achieve this feat. Here we survey the new purchase LMs are providing on the question of how language is implemented in the brain. We discuss why, a priori, LMs might be expected to share similarities with the human language system. We then summarize evidence that LMs represent linguistic information similarly enough to humans to enable relatively accurate brain encoding and decoding during language processing. Finally, we examine which LM properties (their architecture, task performance, or training) are critical for capturing human neural responses to language and review studies using LMs as in silico model organisms for testing hypotheses about language. These ongoing investigations bring us closer to understanding the representations and processes that underlie our ability to comprehend sentences and express thoughts in language.


Subjects
Brain; Language; Humans; Brain/physiology; Animals; Artificial Intelligence; Models, Neurological
2.
Trends Biochem Sci ; 48(12): 1014-1018, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37833131

ABSTRACT

Generative artificial intelligence (AI) is a burgeoning field with widespread applications, including in science. Here, we explore two paradigms that provide insight into the capabilities and limitations of Chat Generative Pre-trained Transformer (ChatGPT): its ability to (i) define a core biological concept (the Central Dogma of molecular biology); and (ii) interpret the genetic code.


Subjects
Artificial Intelligence; Genetic Code; Molecular Biology
3.
Proc Natl Acad Sci U S A ; 121(38): e2322764121, 2024 Sep 17.
Article in English | MEDLINE | ID: mdl-39250662

ABSTRACT

Are members of marginalized communities silenced on social media when they share personal experiences of racism? Here, we investigate the role of algorithms, humans, and platform guidelines in suppressing disclosures of racial discrimination. In a field study of actual posts from a neighborhood-based social media platform, we find that when users talk about their experiences as targets of racism, their posts are disproportionately flagged for removal as toxic by five widely used moderation algorithms from major online platforms, including the most recent large language models. We show that human users disproportionately flag these disclosures for removal as well. Next, in a follow-up experiment, we demonstrate that merely witnessing such suppression negatively influences how Black Americans view the community and their place in it. Finally, to address these challenges to equity and inclusion in online spaces, we introduce a mitigation strategy: a guideline-reframing intervention that is effective at reducing silencing behavior across the political spectrum.


Subjects
Racism; Social Media; Humans; Black or African American; Algorithms
4.
Am J Hum Genet ; 110(10): 1661-1672, 2023 Oct 5.
Article in English | MEDLINE | ID: mdl-37741276

ABSTRACT

In the effort to treat Mendelian disorders, correcting the underlying molecular imbalance may be more effective than symptomatic treatment. Identifying treatments that might accomplish this goal requires extensive and up-to-date knowledge of molecular pathways, including drug-gene and gene-gene relationships. To address this challenge, we present "parsing modifiers via article annotations" (PARMESAN), a computational tool that searches PubMed and PubMed Central for information to assemble these relationships into a central knowledge base. PARMESAN then predicts putatively novel drug-gene relationships, assigning an evidence-based score to each prediction. We compare PARMESAN's drug-gene predictions to all of the drug-gene relationships displayed by the Drug-Gene Interaction Database (DGIdb) and show that higher-scoring relationship predictions are more likely to match the directionality (up- versus down-regulation) indicated by this database. PARMESAN had more than 200,000 drug predictions scoring above 8 (as one example cutoff), for more than 3,700 genes. Among these predicted relationships, 210 were registered in DGIdb and 201 (96%) had matching directionality. This publicly available tool provides an automated way to prioritize drug screens to target the most promising drugs to test, thereby saving time and resources in the development of therapeutics for genetic disorders.


Subjects
PubMed; Humans; Databases, Factual
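
The comparison against DGIdb described in this abstract amounts to a filter-and-match count. A minimal sketch follows, assuming each prediction is a `(drug, gene, direction, score)` tuple and the reference is a mapping from `(drug, gene)` to a direction. The data shapes, the function name `directional_agreement`, and the toy values are illustrative assumptions, not PARMESAN's actual interface.

```python
def directional_agreement(predictions, reference, cutoff):
    """Count predictions above `cutoff` that appear in `reference`,
    and how many of those agree on direction ("up" or "down")."""
    n_in_reference = 0
    n_matching = 0
    for drug, gene, direction, score in predictions:
        if score <= cutoff:
            continue  # keep only predictions scoring above the cutoff
        known = reference.get((drug, gene))
        if known is None:
            continue  # pair is not registered in the reference database
        n_in_reference += 1
        if known == direction:
            n_matching += 1
    return n_in_reference, n_matching


# Toy data (hypothetical drugs, genes, and scores).
predictions = [
    ("drugA", "GENE1", "up", 9.5),
    ("drugB", "GENE2", "down", 8.7),
    ("drugC", "GENE3", "up", 3.0),     # below the cutoff
    ("drugD", "GENE4", "down", 10.1),  # absent from the reference
]
reference = {
    ("drugA", "GENE1"): "up",
    ("drugB", "GENE2"): "up",
    ("drugC", "GENE3"): "up",
}

found, matched = directional_agreement(predictions, reference, cutoff=8)
print(found, matched)
```

With the counts reported in the abstract, 201 of 210 registered pairs gives 201 / 210 ≈ 0.957, which rounds to the stated 96%.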
5.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38609331

ABSTRACT

Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6,823 abstracts, 39,488 sentences and 182,937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate the performance of several language models on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.


Subjects
Drug Discovery; Natural Language Processing; Signal Transduction
6.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38324624

ABSTRACT

Connections between circular RNAs (circRNAs) and microRNAs (miRNAs) play a pivotal role in the onset, evolution, diagnosis and treatment of diseases and tumors. Selecting the most promising circRNA-related miRNAs and using them as biological markers or drug targets could help address complex human diseases through preventive strategies, diagnostic procedures and therapeutic approaches. Compared to traditional biological experiments, leveraging computational models to integrate diverse biological data in order to infer potential associations proves to be a more efficient and cost-effective approach. This paper develops a Convolutional Autoencoder model for CircRNA-MiRNA Associations (CA-CMA) prediction. First, the model merges the natural language characteristics of the circRNA and miRNA sequences with the features of circRNA-miRNA interactions. Next, it uses all circRNA-miRNA pairs to construct a molecular association network, which is then fine-tuned with labeled samples to optimize the network parameters. Finally, the prediction outcome is obtained using a deep neural network classifier. This model innovatively combines the likelihood objective that preserves the neighborhood through optimization, to learn the continuous feature representation of words and preserve the spatial information of two-dimensional signals. In 5-fold cross-validation, CA-CMA exhibited exceptional performance compared to numerous prior computational approaches, as evidenced by its mean area under the receiver operating characteristic curve of 0.9138 and a minimal SD of 0.0024. Furthermore, recent literature has confirmed 25 of the top 30 circRNA-miRNA pairs with the highest CA-CMA scores in case studies. The results of these experiments highlight the robustness and versatility of our model.


Subjects
MicroRNAs; Neoplasms; Humans; MicroRNAs/genetics; RNA, Circular/genetics; Likelihood Functions; Neural Networks, Computer; Neoplasms/genetics; Computational Biology/methods
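
The cross-validation summary reported in this abstract (a mean AUC and its SD across folds) can be illustrated with a small sketch. It assumes each fold yields a list of prediction scores and binary labels, computes AUC with the rank-based (Mann-Whitney) definition, and uses the sample standard deviation; the abstract does not say which SD variant was used, so that choice, the function names, and the toy folds are assumptions.

```python
from statistics import mean, stdev


def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_results):
    """Return (mean AUC, sample SD of AUC) over a list of (scores, labels) folds."""
    aucs = [roc_auc(scores, labels) for scores, labels in fold_results]
    return mean(aucs), stdev(aucs)


# Two toy folds (hypothetical scores); a real run would have five.
folds = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),  # perfectly separated
    ([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]),  # one inverted pair out of four
]
mean_auc, sd_auc = summarize_folds(folds)
print(mean_auc, sd_auc)
```

A small SD across folds, as reported, indicates that performance does not depend strongly on which samples land in the held-out fold.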
7.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600668

ABSTRACT

Microbial community analysis is an important field for studying the composition and function of microbial communities. Microbial species annotation is crucial to revealing microorganisms' complex ecological functions in environmental, ecological and host interactions. Currently, widely used methods can suffer from issues such as inaccurate species-level annotations and time and memory constraints, and as sequencing technology advances and sequencing costs decline, microbial species annotation methods with higher classification quality become critical. Therefore, we processed 16S rRNA gene sequences into k-mer sets and then used a trained DNABERT model to generate word vectors. We also designed a parallel network structure consisting of deep and shallow modules to extract the semantic and detailed features of 16S rRNA gene sequences. Our method can accurately and rapidly classify bacterial sequences at the genus and species level of the SILVA database. The database is characterized by long sequence length (1500 base pairs), multiple sequences (428,748 reads) and high similarity. Our method is nearly 20% more accurate at the species level than the currently popular naive Bayes-dominated QIIME 2 annotation method, and the top-5 results at the species level differ from BLAST methods by <2%. In summary, our multi-module deep learning approach overcomes the limitations of existing methods, providing an efficient and accurate solution for microbial species labeling and more reliable data support for microbiology research and application.


Subjects
Deep Learning; Microbiota; RNA, Ribosomal, 16S/genetics; Bayes Theorem; Microbiota/genetics; Bacteria/genetics; Phylogeny
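
The k-mer preprocessing step mentioned in this abstract can be sketched as follows. The example assumes overlapping k-mers with stride 1 and a default of k = 6; the abstract does not state the value of k or the stride, so both are assumptions, as are the function names. The space-joined output mimics the word-like "sentence" that a BERT-style tokenizer typically expects.

```python
def kmerize(sequence, k=6, stride=1):
    """Split a nucleotide sequence into overlapping k-mers."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    # range() is empty when the sequence is shorter than k, giving [].
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def to_sentence(sequence, k=6):
    """Join k-mers with spaces so each k-mer acts as one 'word'."""
    return " ".join(kmerize(sequence, k))


print(kmerize("ACGTAC", k=3))
print(to_sentence("acgtac", k=4))
```

A sequence of length n yields n - k + 1 overlapping k-mers, so a roughly 1500-base 16S sequence produces about 1495 tokens at k = 6, which is why input length matters for this kind of model.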
8.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.


Subjects
Medicine; Humans; Cell Line; Chromatin Immunoprecipitation; Databases, Factual; Language
9.
Mol Cell Proteomics ; 23(1): 100682, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37993103

ABSTRACT

Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.


Subjects
Data Mining; Natural Language Processing; Humans; Data Mining/methods; Databases, Factual; PubMed
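
The precision and recall figures quoted in this abstract are set-overlap measures over extracted (substrate, site) pairs. A minimal sketch, assuming extraction output and curated annotations are both available as collections of `(gene_symbol, site)` tuples; the function name and the toy pairs are illustrative and are not taken from the IDPpub evaluation set.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against a curated gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: two of three extracted pairs are in the curated set,
# and one curated pair was missed.
predicted = [("AKT1", "S473"), ("TP53", "S15"), ("MAPK1", "T185")]
gold = [("AKT1", "S473"), ("TP53", "S15"), ("EGFR", "Y1068")]

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

The drop in recall from 0.94 to 0.77 after normalization and site mapping is consistent with this definition: a correctly extracted pair that fails to map to a reference sequence no longer counts as a true positive.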
10.
Proc Natl Acad Sci U S A ; 120(34): e2221473120, 2023 Aug 22.
Article in English | MEDLINE | ID: mdl-37579152

ABSTRACT

Collective intelligence has emerged as a powerful mechanism to boost decision accuracy across many domains, such as geopolitical forecasting, investment, and medical diagnostics. However, collective intelligence has been mostly applied to relatively simple decision tasks (e.g., binary classifications). Applications in more open-ended tasks with a much larger problem space, such as emergency management or general medical diagnostics, are largely lacking, due to the challenge of integrating unstandardized inputs from different crowd members. Here, we present a fully automated approach for harnessing collective intelligence in the domain of general medical diagnostics. Our approach leverages semantic knowledge graphs, natural language processing, and the SNOMED CT medical ontology to overcome a major hurdle to collective intelligence in open-ended medical diagnostics, namely to identify the intended diagnosis from unstructured text. We tested our method on 1,333 medical cases diagnosed on a medical crowdsourcing platform: The Human Diagnosis Project. Each case was independently rated by ten diagnosticians. Comparing the diagnostic accuracy of single diagnosticians with the collective diagnosis of differently sized groups, we find that our method substantially increases diagnostic accuracy: While single diagnosticians achieved 46% accuracy, pooling the decisions of ten diagnosticians increased this to 76%. Improvements occurred across medical specialties, chief complaints, and diagnosticians' tenure levels. Our results show the life-saving potential of tapping into the collective intelligence of the global medical community to reduce diagnostic errors and increase patient safety.


Subjects
Crowdsourcing; Intelligence; Humans; Diagnostic Errors
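
Once free-text diagnoses have been mapped to standardized concepts, pooling them can be as simple as a vote. The sketch below uses a plain plurality vote over concept identifiers as a stand-in; the study's actual aggregation rule, and its knowledge-graph and SNOMED CT matching steps, are not described in the abstract and are not reproduced here. The identifiers (`"C1"`, `"C2"`, ...) and function names are hypothetical.

```python
from collections import Counter


def pool_diagnoses(concept_ids):
    """Return the most frequent concept ID among raters, or None if empty."""
    if not concept_ids:
        return None
    top, _count = Counter(concept_ids).most_common(1)[0]
    return top


def pooled_accuracy(cases):
    """Fraction of cases where the pooled diagnosis equals the true one.

    `cases` is a list of (rater_concept_ids, true_concept_id) pairs.
    """
    hits = sum(pool_diagnoses(raters) == truth for raters, truth in cases)
    return hits / len(cases)


# Toy cases with hypothetical concept IDs.
cases = [
    (["C1", "C1", "C2", "C1", "C3"], "C1"),  # plurality is correct
    (["C4", "C5", "C5", "C5", "C4"], "C4"),  # plurality is wrong
]
print(pooled_accuracy(cases))
```

The vote only works after normalization: two raters writing the same diagnosis in different words must resolve to the same identifier, which is the hurdle the abstract says the ontology mapping addresses.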
11.
Proc Natl Acad Sci U S A ; 120(10): e2209384120, 2023 Mar 7.
Article in English | MEDLINE | ID: mdl-36848573

ABSTRACT

The machine learning (ML) research community has landed on automated hate speech detection as the vital tool in the mitigation of bad behavior online. However, it is not clear that this is a widely supported view outside of the ML world. Such a disconnect can have implications for whether automated detection tools are accepted or adopted. Here we lend insight into how other key stakeholders understand the challenge of addressing hate speech and the role automated detection plays in solving it. To do so, we develop and apply a structured approach to dissecting the discourses used by online platform companies, governments, and not-for-profit organizations when discussing hate speech. We find that, where hate speech mitigation is concerned, there is a profound disconnect between the computer science research community and other stakeholder groups, which puts progress on this important problem at serious risk. We identify urgent steps that need to be taken to incorporate computational researchers into a single, coherent, multistakeholder community that is working towards civil discourse online.


Subjects
Hate; Speech; Government; Machine Learning; Organizations, Nonprofit
12.
Proc Natl Acad Sci U S A ; 120(8): e2207391120, 2023 Feb 21.
Article in English | MEDLINE | ID: mdl-36787355

ABSTRACT

Traditional substance use (SU) surveillance methods, such as surveys, incur substantial lags. Due to the continuously evolving trends in SU, insights obtained via such methods are often outdated. Social media-based sources have been proposed for obtaining timely insights, but methods leveraging such data cannot typically provide fine-grained statistics about subpopulations, unlike traditional approaches. We address this gap by developing methods for automatically characterizing a large Twitter nonmedical prescription medication use (NPMU) cohort (n = 288,562) in terms of age-group, race, and gender. Our natural language processing and machine learning methods for automated cohort characterization achieved 0.88 precision (95% CI: 0.84 to 0.92) for age-group, 0.90 (95% CI: 0.85 to 0.95) for race, and 94% accuracy (95% CI: 92 to 97) for gender, when evaluated against manually annotated gold-standard data. We compared automatically derived statistics for NPMU of tranquilizers, stimulants, and opioids from Twitter with statistics reported in the National Survey on Drug Use and Health (NSDUH) and the National Emergency Department Sample (NEDS). Distributions automatically estimated from Twitter were mostly consistent with the NSDUH [Spearman r: race: 0.98 (P < 0.005); age-group: 0.67 (P < 0.005); gender: 0.66 (P = 0.27)] and NEDS, with 34/65 (52.3%) of the Twitter-based estimates lying within 95% CIs of estimates from the traditional sources. Explainable differences (e.g., overrepresentation of younger people) were found for age-group-related statistics. Our study demonstrates that accurate subpopulation-specific estimates about SU, particularly NPMU, may be automatically derived from Twitter to obtain earlier insights about targeted subpopulations compared to traditional surveillance approaches.


Subjects
Central Nervous System Stimulants; Social Media; Substance-Related Disorders; Humans; Substance-Related Disorders/epidemiology; Prescriptions; Demography
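
The agreement check reported in this abstract (the share of Twitter-based estimates falling inside the 95% CIs from traditional sources) reduces to a containment count. A minimal sketch, assuming estimates and intervals are aligned lists of proportions; the function name and the toy numbers are illustrative and are not data from the study.

```python
def fraction_within_ci(estimates, intervals):
    """Count estimates that fall inside their paired (low, high) interval.

    Returns (number inside, total, fraction inside).
    """
    if len(estimates) != len(intervals):
        raise ValueError("estimates and intervals must be the same length")
    if not estimates:
        raise ValueError("no estimates given")
    inside = sum(low <= est <= high for est, (low, high) in zip(estimates, intervals))
    return inside, len(estimates), inside / len(estimates)


# Toy subgroup proportions (hypothetical values).
twitter_estimates = [0.42, 0.15, 0.30, 0.08]
survey_intervals = [(0.40, 0.45), (0.20, 0.25), (0.28, 0.33), (0.01, 0.05)]

inside, total, fraction = fraction_within_ci(twitter_estimates, survey_intervals)
print(inside, total, fraction)
```

For the figures in the abstract, 34 / 65 ≈ 0.523, matching the stated 52.3%.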
13.
Proc Natl Acad Sci U S A ; 120(35): e2302269120, 2023 Aug 29.
Article in English | MEDLINE | ID: mdl-37603755

ABSTRACT

This study explores the longevity of artistic reputation. We empirically examine whether artists are more or less venerated after their death. We construct a massive historical corpus spanning 1795 to 2020 and build separate word-embedding models for each five-year period to examine how the reputations of over 3,300 famous artists (including painters, architects, composers, musicians, and writers) evolve after their death. We find that most artists gain their highest reputation right before their death, after which it declines, losing nearly one SD every century. This posthumous decline applies to artists in all domains, includes those who died young or unexpectedly, and contradicts the popular view that artists' reputations endure. Contrary to the Matthew effect, the reputational decline is the steepest for those who had the highest reputations while alive. Two mechanisms (artists' reduced visibility and the public's changing taste) are associated with much of the posthumous reputational decline. This study underscores the fragility of human reputation and shows how the collective memory of artists unfolds over time.
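
Building one embedding model per five-year period first requires bucketing the corpus by period. The sketch below shows only that bucketing step, assuming documents arrive as `(year, text)` pairs and that periods are aligned to the corpus start year of 1795; the alignment, the function names, and the toy documents are assumptions, and the embedding training itself is omitted.

```python
from collections import defaultdict


def period_start(year, origin=1795, width=5):
    """Return the first year of the fixed-width period containing `year`."""
    if year < origin:
        raise ValueError("year precedes the start of the corpus")
    return origin + ((year - origin) // width) * width


def group_by_period(documents, origin=1795, width=5):
    """Group (year, text) documents into {period_start: [texts]}."""
    buckets = defaultdict(list)
    for year, text in documents:
        buckets[period_start(year, origin, width)].append(text)
    return dict(buckets)


# Toy documents (hypothetical text).
docs = [(1795, "a"), (1799, "b"), (1800, "c"), (2020, "d")]
print(group_by_period(docs))
```

Each bucket would then feed its own embedding model, so an artist's position can be compared period by period.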

14.
Proc Natl Acad Sci U S A ; 120(42): e2305290120, 2023 Oct 17.
Article in English | MEDLINE | ID: mdl-37816054

ABSTRACT

Human cognition is underpinned by structured internal representations that encode relationships between entities in the world (cognitive maps). Clinical features of schizophrenia, from thought disorder to delusions, are proposed to reflect disorganization in such conceptual representations. Schizophrenia is also linked to abnormalities in neural processes that support cognitive map representations, including hippocampal replay and high-frequency ripple oscillations. Here, we report a computational assay of semantically guided conceptual sampling and exploit this to test a hypothesis that people with schizophrenia (PScz) exhibit abnormalities in semantically guided cognition that relate to hippocampal replay and ripples. Fifty-two participants [26 PScz (13 unmedicated) and 26 age-, gender-, and intelligence quotient (IQ)-matched nonclinical controls] completed a category- and letter-verbal fluency task, followed by a magnetoencephalography (MEG) scan involving a separate sequence-learning task. We used a pretrained word embedding model of semantic similarity, coupled to a computational model of word selection, to quantify the degree to which each participant's verbal behavior was guided by semantic similarity. Using MEG, we indexed neural replay and ripple power in a post-task rest session. Across all participants, word selection was strongly influenced by semantic similarity. The strength of this influence showed sensitivity to task demands (category > letter fluency) and predicted performance. In line with our hypothesis, the influence of semantic similarity on behavior was reduced in schizophrenia relative to controls, predicted negative psychotic symptoms, and correlated with an MEG signature of hippocampal ripple power (but not replay). The findings bridge a gap between phenomenological and neurocomputational accounts of schizophrenia.


Subjects
Psychotic Disorders; Schizophrenia; Humans; Schizophrenia/diagnosis; Semantics; Verbal Behavior; Learning
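
Semantic similarity between words in an embedding space is commonly measured with cosine similarity. The sketch below computes the mean similarity between consecutive words in a fluency list, which is only a simple descriptive statistic; it is not the computational model of word selection used in the study. The two-dimensional toy vectors and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def mean_consecutive_similarity(words, embeddings):
    """Average cosine similarity between each word and the next one spoken."""
    sims = [cosine(embeddings[a], embeddings[b]) for a, b in zip(words, words[1:])]
    if not sims:
        raise ValueError("need at least two words")
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical; real models use hundreds of dimensions).
embeddings = {"cat": [1.0, 0.0], "dog": [1.0, 1.0], "car": [0.0, 1.0]}
score = mean_consecutive_similarity(["cat", "dog", "car"], embeddings)
print(score)
```

A lower average would indicate that successive words are less semantically related, which is the direction of the group difference the abstract reports.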
15.
Proc Natl Acad Sci U S A ; 120(25): e2220726120, 2023 Jun 20.
Article in English | MEDLINE | ID: mdl-37307492

ABSTRACT

Large-scale language datasets and advances in natural language processing offer opportunities for studying people's cognitions and behaviors. We show how representations derived from language can be combined with laboratory-based word norms to predict implicit attitudes for diverse concepts. Our approach achieves substantially higher correlations than existing methods. We also show that our approach is more predictive of implicit attitudes than are explicit attitudes, and that it captures variance in implicit attitudes that is largely unexplained by explicit attitudes. Overall, our results shed light on how implicit attitudes can be measured by combining standard psychological data with large-scale language data. In doing so, we pave the way for highly accurate computational modeling of what people think and feel about the world around them.


Subjects
Cognition; Emotions; Humans; Computer Simulation; Laboratories; Attitude
16.
Proc Natl Acad Sci U S A ; 120(23): e2216162120, 2023 Jun 6.
Article in English | MEDLINE | ID: mdl-37253013

ABSTRACT

Across the United States, police chiefs, city officials, and community leaders alike have highlighted the need to de-escalate police encounters with the public. This concern about escalation extends from encounters involving use of force to routine car stops, where Black drivers are disproportionately pulled over. Yet, despite the calls for action, we know little about the trajectory of police stops or how escalation unfolds. In study 1, we use methods from computational linguistics to analyze police body-worn camera footage from 577 stops of Black drivers. We find that stops with escalated outcomes (those ending in arrest, handcuffing, or a search) diverge from stops without these outcomes in their earliest moments, even in the first 45 words spoken by the officer. In stops that result in escalation, officers are more likely to issue commands as their opening words to the driver and less likely to tell drivers the reason why they are being stopped. In study 2, we expose Black males to audio clips of the same stops and find differences in how escalated stops are perceived: Participants report more negative emotion, appraise officers more negatively, worry about force being used, and predict worse outcomes after hearing only the officer's initial words in escalated versus non-escalated stops. Our findings show that car stops that end in escalated outcomes sometimes begin in an escalated fashion, with adverse effects for Black male drivers and, in turn, police-community relations.


Subjects
Black or African American; Law Enforcement; Police; Humans; Male; Law Enforcement/methods; United States; Racism; Emotions
17.
Am J Hum Genet ; 109(9): 1591-1604, 2022 Sep 1.
Article in English | MEDLINE | ID: mdl-35998640

ABSTRACT

Diagnosis for rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalence derived by OARD is highly correlated with that annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% were confirmed as true associations via manual review of a list of sampled pairs. Compared to the manually curated annotation, OARD is 100% data driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.


Subjects
Natural Language Processing; Rare Diseases; Electronic Health Records; Humans; Phenotype; Rare Diseases/genetics
18.
Brief Bioinform ; 24(4), 2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37317617

ABSTRACT

Human prescription drug labeling contains a summary of the essential scientific information needed for the safe and effective use of the drug and includes the Prescribing Information, FDA-approved patient labeling (Medication Guides, Patient Package Inserts and/or Instructions for Use), and/or carton and container labeling. Drug labeling contains critical information about drug products, such as pharmacokinetics and adverse events. Automatic information extraction from drug labels may facilitate identifying adverse drug reactions or drug-drug interactions. Natural language processing (NLP) techniques, especially the recently developed Bidirectional Encoder Representations from Transformers (BERT), have exhibited exceptional merits in text-based information extraction. A common paradigm in training BERT is to pretrain the model on large unlabeled generic language corpora, so that the model learns the distribution of the words in the language, and then fine-tune on a downstream task. In this paper, first, we show the uniqueness of language used in drug labels, which therefore cannot be optimally handled by other BERT models. Then, we present PharmBERT, a BERT model specifically pretrained on drug labels (publicly available at Hugging Face). We demonstrate that our model outperforms the vanilla BERT, ClinicalBERT and BioBERT in multiple NLP tasks in the drug label domain. Moreover, by analyzing different layers of PharmBERT, we demonstrate how domain-specific pretraining contributes to its superior performance and gain more insight into how it understands different linguistic aspects of the data.


Subjects
Drug Labeling; Information Storage and Retrieval; Humans; Learning; Natural Language Processing
19.
Brief Bioinform ; 24(4), 2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37344167

ABSTRACT

Adverse drug events (ADEs) are common in clinical practice and can cause significant harm to patients and increase resource use. Natural language processing (NLP) has been applied to automate ADE detection, but NLP systems become less adaptable when drug entities are missing or multiple medications are specified in clinical narratives. Additionally, no Chinese-language NLP system has been developed for ADE detection due to the complexity of Chinese semantics, despite >10 million cases of drug-related adverse events occurring annually in China. To address these challenges, we propose DKADE, a deep learning and knowledge graph-based framework for identifying ADEs. DKADE infers missing drug entities and evaluates their correlations with ADEs by combining medication orders and existing drug knowledge. Moreover, DKADE can automatically screen for new adverse drug reactions. Experimental results show that DKADE achieves an overall F1-score of 91.13%. Furthermore, the adaptability of DKADE is validated using real-world external clinical data. In summary, DKADE is a powerful tool for studying drug safety and automating adverse event monitoring.


Subjects
Deep Learning; Drug-Related Side Effects and Adverse Reactions; Humans; Pattern Recognition, Automated; Semantics; Natural Language Processing
20.
Brief Bioinform ; 25(1), 2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38180830

ABSTRACT

2'-O-methylation (2OM) is the most common post-transcriptional modification of RNA. It plays a crucial role in RNA splicing, RNA stability and innate immunity. Despite advances in high-throughput detection, the chemical stability of 2OM makes it difficult to detect and map in messenger RNA. Therefore, bioinformatics tools have been developed using machine learning (ML) algorithms to identify 2OM sites. These tools have made significant progress, but their performances remain unsatisfactory and need further improvement. In this study, we introduced H2Opred, a novel hybrid deep learning (HDL) model for accurately identifying 2OM sites in human RNA. Notably, this is the first application of HDL in developing four nucleotide-specific models [adenine (A2OM), cytosine (C2OM), guanine (G2OM) and uracil (U2OM)] as well as a generic model (N2OM). H2Opred incorporated both stacked 1D convolutional neural network (1D-CNN) blocks and stacked attention-based bidirectional gated recurrent unit (Bi-GRU-Att) blocks. 1D-CNN blocks learned effective feature representations from 14 conventional descriptors, while Bi-GRU-Att blocks learned feature representations from five natural language processing-based embeddings extracted from RNA sequences. H2Opred integrated these feature representations to make the final prediction. Rigorous cross-validation analysis demonstrated that H2Opred consistently outperforms conventional ML-based single-feature models on five different datasets. Moreover, the generic model of H2Opred demonstrated a remarkable performance on both training and testing datasets, significantly outperforming the existing predictor and the other four nucleotide-specific H2Opred models. To enhance accessibility and usability, we have deployed a user-friendly web server for H2Opred, accessible at https://balalab-skku.org/H2Opred/. This platform will serve as an invaluable tool for accurately predicting 2OM sites within human RNA, thereby facilitating broader applications in relevant research endeavors.


Subjects
Deep Learning; RNA; Humans; RNA/genetics; Base Sequence; Nucleotides; Methylation