Pesquisa | BVS Doenças Infecciosas e Parasitárias

1.

Scaling neural machine translation to 200 languages.

Nature ; 630(8018): 841-846, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38839963

RESUMO

The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world1. Focusing on improving the translation qualities of a relatively small group of high-resource languages comes at the expense of directing research attention to low-resource languages, exacerbating digital inequities in the long run. To break this pattern, here we introduce No Language Left Behind-a single massively multilingual model that leverages transfer learning across languages. We developed a conditional computational model based on the Sparsely Gated Mixture of Experts architecture2-7, which we trained on data obtained with new mining techniques tailored for low-resource languages. Furthermore, we devised multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. We evaluated the performance of our model over 40,000 translation directions using tools created specifically for this purpose-an automatic benchmark (FLORES-200), a human evaluation metric (XSTS) and a toxicity detector that covers every language in our model. Compared with the previous state-of-the-art models, our model achieves an average of 44% improvement in translation quality as measured by BLEU. By demonstrating how to scale NMT to 200 languages and making all contributions in this effort freely available for non-commercial use, our work lays important groundwork for the development of a universal translation system.

Assuntos

Multilinguismo , Processamento de Linguagem Natural , Redes Neurais de Computação , Tradução , Benchmarking

2.

Solving olympiad geometry without human demonstrations.

Trinh, Trieu H; Wu, Yuhuai; Le, Quoc V; He, He; Luong, Thang.

Nature ; 625(7995): 476-482, 2024 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-38233616

RESUMO

Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning1-4, owing to their reputed difficulty among the world's best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges1,5, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.

Assuntos

Matemática , Processamento de Linguagem Natural , Resolução de Problemas , Humanos , Matemática/métodos , Matemática/normas

3.

Siri of the cell: what biology could learn from the iPhone.

Carvunis, Anne-Ruxandra; Ideker, Trey.

Cell ; 157(3): 534-8, 2014 Apr 24.

Artigo em Inglês | MEDLINE | ID: mdl-24766803

RESUMO

Modern genomics is very efficient at mapping genes and gene networks, but how to transform these maps into predictive models of the cell remains unclear. Recent progress in computer science, embodied by intelligent agents such as Siri, inspires an approach for moving from networks to multiscale models able to predict a range of cellular phenotypes and answer biological questions.

Assuntos

Inteligência Artificial , Ontologias Biológicas , Biologia Celular , Modelos Biológicos , Biologia Celular/tendências , Redes Reguladoras de Genes , Processamento de Linguagem Natural , Biologia de Sistemas

4.

Role play with large language models.

Shanahan, Murray; McDonell, Kyle; Reynolds, Laria.

Nature ; 623(7987): 493-498, 2023 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-37938776

RESUMO

As dialogue agents become increasingly human-like in their performance, we must develop effective ways to describe their behaviour in high-level terms without falling into the trap of anthropomorphism. Here we foreground the concept of role play. Casting dialogue-agent behaviour in terms of role play allows us to draw on familiar folk psychological terms, without ascribing human characteristics to language models that they in fact lack. Two important cases of dialogue-agent behaviour are addressed this way, namely, (apparent) deception and (apparent) self-awareness.

Assuntos

Comportamento Imitativo , Processamento de Linguagem Natural , Terminologia como Assunto , Humanos , Enganação , Autoavaliação (Psicologia)

5.

Health system-scale language models are all-purpose prediction engines.

Jiang, Lavender Yao; Liu, Xujin Chris; Nejatian, Nima Pour; Nasir-Moin, Mustafa; Wang, Duo; Abidin, Anas; Eaton, Kevin; Riina, Howard Antony; Laufer, Ilya; Punjabi, Paawan; Miceli, Madeline; Kim, Nora C; Orillac, Cordelia; Schnurman, Zane; Livia, Christopher; Weiss, Hannah; Kurland, David; Neifert, Sean; Dastagirzada, Yosef; Kondziolka, Douglas; Cheung, Alexander T M; Yang, Grace; Cao, Ming; Flores, Mona; Costa, Anthony B; Aphinyanaphongs, Yindalon; Cho, Kyunghyun; Oermann, Eric Karl.

Nature ; 619(7969): 357-362, 2023 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-37286606

RESUMO

Physicians make critical time-constrained decisions every day. Clinical predictive models can help physicians and administrators make decisions by forecasting clinical and operational events. Existing structured data-based clinical predictive models have limited use in everyday practice owing to complexity in data processing, as well as model development and deployment1-3. Here we show that unstructured clinical notes from the electronic health record can enable the training of clinical language models, which can be used as all-purpose clinical predictive engines with low-resistance development and deployment. Our approach leverages recent advances in natural language processing4,5 to train a large language model for medical language (NYUTron) and subsequently fine-tune it across a wide range of clinical and operational predictive tasks. We evaluated our approach within our health system for five such tasks: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay prediction, and insurance denial prediction. We show that NYUTron has an area under the curve (AUC) of 78.7-94.9%, with an improvement of 5.36-14.7% in the AUC compared with traditional models. We additionally demonstrate the benefits of pretraining with clinical text, the potential for increasing generalizability to different sites through fine-tuning and the full deployment of our system in a prospective, single-arm trial. These results show the potential for using clinical language models in medicine to read alongside physicians and provide guidance at the point of care.

Assuntos

Tomada de Decisão Clínica , Registros Eletrônicos de Saúde , Processamento de Linguagem Natural , Médicos , Humanos , Tomada de Decisão Clínica/métodos , Readmissão do Paciente , Mortalidade Hospitalar , Comorbidade , Tempo de Internação , Cobertura do Seguro , Área Sob a Curva , Sistemas Automatizados de Assistência Junto ao Leito/tendências , Ensaios Clínicos como Assunto

6.

Large language models encode clinical knowledge.

Singhal, Karan; Azizi, Shekoofeh; Tu, Tao; Mahdavi, S Sara; Wei, Jason; Chung, Hyung Won; Scales, Nathan; Tanwani, Ajay; Cole-Lewis, Heather; Pfohl, Stephen; Payne, Perry; Seneviratne, Martin; Gamble, Paul; Kelly, Chris; Babiker, Abubakr; Schärli, Nathanael; Chowdhery, Aakanksha; Mansfield, Philip; Demner-Fushman, Dina; Agüera Y Arcas, Blaise; Webster, Dale; Corrado, Greg S; Matias, Yossi; Chou, Katherine; Gottweis, Juraj; Tomasev, Nenad; Liu, Yun; Rajkomar, Alvin; Barral, Joelle; Semturs, Christopher; Karthikesalingam, Alan; Natarajan, Vivek.

Nature ; 620(7972): 172-180, 2023 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-37438534

RESUMO

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model1 (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM2 on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA3, MedMCQA4, PubMedQA5 and Measuring Massive Multitask Language Understanding (MMLU) clinical topics6), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Assuntos

Benchmarking , Simulação por Computador , Conhecimento , Medicina , Processamento de Linguagem Natural , Viés , Competência Clínica , Compreensão , Conjuntos de Dados como Assunto , Licenciamento , Medicina/métodos , Medicina/normas , Segurança do Paciente , Médicos

7.

An autonomous debating system.

Slonim, Noam; Bilu, Yonatan; Alzate, Carlos; Bar-Haim, Roy; Bogin, Ben; Bonin, Francesca; Choshen, Leshem; Cohen-Karlik, Edo; Dankin, Lena; Edelstein, Lilach; Ein-Dor, Liat; Friedman-Melamed, Roni; Gavron, Assaf; Gera, Ariel; Gleize, Martin; Gretz, Shai; Gutfreund, Dan; Halfon, Alon; Hershcovich, Daniel; Hoory, Ron; Hou, Yufang; Hummel, Shay; Jacovi, Michal; Jochim, Charles; Kantor, Yoav; Katz, Yoav; Konopnicki, David; Kons, Zvi; Kotlerman, Lili; Krieger, Dalia; Lahav, Dan; Lavee, Tamar; Levy, Ran; Liberman, Naftali; Mass, Yosi; Menczel, Amir; Mirkin, Shachar; Moshkowich, Guy; Ofek-Koifman, Shila; Orbach, Matan; Rabinovich, Ella; Rinott, Ruty; Shechtman, Slava; Sheinwald, Dafna; Shnarch, Eyal; Shnayderman, Ilya; Soffer, Aya; Spector, Artem; Sznajder, Benjamin; Toledo, Assaf.

Nature ; 591(7850): 379-384, 2021 03.

Artigo em Inglês | MEDLINE | ID: mdl-33731946

RESUMO

Artificial intelligence (AI) is defined as the ability of machines to perform tasks that are usually associated with intelligent beings. Argument and debate are fundamental capabilities of human intelligence, essential for a wide range of human activities, and common to all human societies. The development of computational argumentation technologies is therefore an important emerging discipline in AI research1. Here we present Project Debater, an autonomous debating system that can engage in a competitive debate with humans. We provide a complete description of the system's architecture, a thorough and systematic evaluation of its operation across a wide range of debate topics, and a detailed account of the system's performance in its public debut against three expert human debaters. We also highlight the fundamental differences between debating with humans as opposed to challenging humans in game competitions, the latter being the focus of classical 'grand challenges' pursued by the AI research community over the past few decades. We suggest that such challenges lie in the 'comfort zone' of AI, whereas debating with humans lies in a different territory, in which humans still prevail, and for which novel paradigms are required to make substantial progress.

Assuntos

Inteligência Artificial , Comportamento Competitivo , Dissidências e Disputas , Atividades Humanas , Inteligência Artificial/normas , Humanos , Processamento de Linguagem Natural

8.

Democratizing protein language models with parameter-efficient fine-tuning.

Sledzieski, Samuel; Kshirsagar, Meghana; Baek, Minkyung; Dodhia, Rahul; Lavista Ferres, Juan; Berger, Bonnie.

Proc Natl Acad Sci U S A ; 121(26): e2405840121, 2024 Jun 25.

Artigo em Inglês | MEDLINE | ID: mdl-38900798

RESUMO

Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperform state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.

Assuntos

Proteômica , Proteômica/métodos , Proteínas/química , Proteínas/metabolismo , Processamento de Linguagem Natural , Mapeamento de Interação de Proteínas/métodos , Biologia Computacional/métodos , Humanos , Algoritmos

9.

Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource.

Huang, Ming-Siang; Han, Jen-Chieh; Lin, Pei-Yen; You, Yu-Ting; Tsai, Richard Tzong-Han; Hsu, Wen-Lian.

Brief Bioinform ; 25(3)2024 Mar 27.

Artigo em Inglês | MEDLINE | ID: mdl-38609331

RESUMO

Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.

Assuntos

Descoberta de Drogas , Processamento de Linguagem Natural , Transdução de Sinais

10.

Multi-modal features-based human-herpesvirus protein-protein interaction prediction by using LightGBM.

Yang, Xiaodi; Wuchty, Stefan; Liang, Zeyin; Ji, Li; Wang, Bingjie; Zhu, Jialin; Zhang, Ziding; Dong, Yujun.

Brief Bioinform ; 25(2)2024 Jan 22.

Artigo em Inglês | MEDLINE | ID: mdl-38279649

RESUMO

The identification of human-herpesvirus protein-protein interactions (PPIs) is an essential and important entry point to understand the mechanisms of viral infection, especially in malignant tumor patients with common herpesvirus infection. While natural language processing (NLP)-based embedding techniques have emerged as powerful approaches, the application of multi-modal embedding feature fusion to predict human-herpesvirus PPIs is still limited. Here, we established a multi-modal embedding feature fusion-based LightGBM method to predict human-herpesvirus PPIs. In particular, we applied document and graph embedding approaches to represent sequence, network and function modal features of human and herpesviral proteins. Training our LightGBM models through our compiled non-rigorous and rigorous benchmarking datasets, we obtained significantly better performance compared to individual-modal features. Furthermore, our model outperformed traditional feature encodings-based machine learning methods and state-of-the-art deep learning-based methods using various benchmarking datasets. In a transfer learning step, we show that our model that was trained on human-herpesvirus PPI dataset without cytomegalovirus data can reliably predict human-cytomegalovirus PPIs, indicating that our method can comprehensively capture multi-modal fusion features of protein interactions across various herpesvirus subtypes. The implementation of our method is available at https://github.com/XiaodiYangpku/MultimodalPPI/.

Assuntos

Benchmarking , Citomegalovirus , Humanos , Aprendizado de Máquina , Processamento de Linguagem Natural

11.

Lactylation prediction models based on protein sequence and structural feature fusion.

Yang, Ye-Hong; Yang, Jun-Tao; Liu, Jiang-Feng.

Brief Bioinform ; 25(2)2024 Jan 22.

Artigo em Inglês | MEDLINE | ID: mdl-38385873

RESUMO

Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function, macrophage polarization and nervous system regulation, and has received widespread attention due to the Warburg effect in tumor cells. In this work, we first design a natural language processing method to automatically extract the 3D structural features of Kla sites, avoiding potential biases caused by manually designed structural features. Then, we establish two Kla prediction frameworks, Attention-based feature fusion Kla model (ABFF-Kla) and EBFF-Kla, to integrate the sequence features and the structure features based on the attention layer and embedding layer, respectively. The results indicate that ABFF-Kla and Embedding-based feature fusion Kla model (EBFF-Kla), which fuse features from protein sequences and spatial structures, have better predictive performance than that of models that use only sequence features. Our work provides an approach for the automatic extraction of protein structural features, as well as a flexible framework for Kla prediction. The source code and the training data of the ABFF-Kla and the EBFF-Kla are publicly deposited at: https://github.com/ispotato/Lactylation_model.

Assuntos

Lisina , Processamento de Linguagem Natural , Sequência de Aminoácidos , Domínios Proteicos , Processamento de Proteína Pós-Traducional

12.

IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining.

Savage, Sara R; Zhang, Yaoyun; Jaehnig, Eric J; Liao, Yuxing; Shi, Zhiao; Pham, Huy Anh; Xu, Hua; Zhang, Bing.

Mol Cell Proteomics ; 23(1): 100682, 2024 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-37993103

RESUMO

Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.

Assuntos

Mineração de Dados , Processamento de Linguagem Natural , Humanos , Mineração de Dados/métodos , Bases de Dados Factuais , PubMed

13.

Language model-based B cell receptor sequence embeddings can effectively encode receptor specificity.

Wang, Meng; Patsenker, Jonathan; Li, Henry; Kluger, Yuval; Kleinstein, Steven H.

Nucleic Acids Res ; 52(2): 548-557, 2024 Jan 25.

Artigo em Inglês | MEDLINE | ID: mdl-38109302

RESUMO

High throughput sequencing of B cell receptors (BCRs) is increasingly applied to study the immense diversity of antibodies. Learning biologically meaningful embeddings of BCR sequences is beneficial for predictive modeling. Several embedding methods have been developed for BCRs, but no direct performance benchmarking exists. Moreover, the impact of the input sequence length and paired-chain information on the prediction remains to be explored. We evaluated the performance of multiple embedding models to predict BCR sequence properties and receptor specificity. Despite the differences in model architectures, most embeddings effectively capture BCR sequence properties and specificity. BCR-specific embeddings slightly outperform general protein language models in predicting specificity. In addition, incorporating full-length heavy chains and paired light chain sequences improves the prediction performance of all embeddings. This study provides insights into the properties of BCR embeddings to improve downstream prediction applications for antibody analysis and discovery.

Assuntos

Processamento de Linguagem Natural , Receptores de Antígenos de Linfócitos B , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Imunoglobulinas , Receptores de Antígenos de Linfócitos B/química , Receptores de Antígenos de Linfócitos B/genética , Sequência de Aminoácidos , Humanos

14.

Collective events and individual affect shape autobiographical memory.

Rouhani, Nina; Stanley, Damian; Adolphs, Ralph.

Proc Natl Acad Sci U S A ; 120(29): e2221919120, 2023 07 18.

Artigo em Inglês | MEDLINE | ID: mdl-37432994

RESUMO

How do collective events shape how we remember our lives? We leveraged advances in natural language processing as well as a rich, longitudinal assessment of 1,000 Americans throughout 2020 to examine how memory is influenced by two prominent factors: surprise and emotion. Autobiographical memory for 2020 displayed a unique signature: There was a substantial bump in March, aligning with pandemic onset and lockdowns, consistent across three memory collections 1 y apart. We further investigated how emotion, using both immediate and retrieved measures, predicted the amount and content of autobiographical memory: Negative affect increased recall across all measures, whereas its more clinical indices, depression and posttraumatic stress disorder, selectively increased nonepisodic recall. Finally, in a separate cohort, we found pandemic news to be better remembered, surprising, and negative, while lockdowns compressed remembered time. Our work connects laboratory findings to the real world and delineates the effects of acute versus clinical signatures of negative emotion on memory.

Assuntos

Memória Episódica , Humanos , Emoções , Rememoração Mental , Processamento de Linguagem Natural , Pandemias

15.

Unsupervised embedding of trajectories captures the latent structure of scientific migration.

Murray, Dakota; Yoon, Jisung; Kojaku, Sadamori; Costas, Rodrigo; Jung, Woo-Sung; Milojevic, Stasa; Ahn, Yong-Yeol.

Proc Natl Acad Sci U S A ; 120(52): e2305414120, 2023 Dec 26.

Artigo em Inglês | MEDLINE | ID: mdl-38134198

RESUMO

Human migration and mobility drives major societal phenomena including epidemics, economies, innovation, and the diffusion of ideas. Although human mobility and migration have been heavily constrained by geographic distance throughout the history, advances, and globalization are making other factors such as language and culture increasingly more important. Advances in neural embedding models, originally designed for natural language, provide an opportunity to tame this complexity and open new avenues for the study of migration. Here, we demonstrate the ability of the model word2vec to encode nuanced relationships between discrete locations from migration trajectories, producing an accurate, dense, continuous, and meaningful vector-space representation. The resulting representation provides a functional distance between locations, as well as a "digital double" that can be distributed, re-used, and itself interrogated to understand the many dimensions of migration. We show that the unique power of word2vec to encode migration patterns stems from its mathematical equivalence with the gravity model of mobility. Focusing on the case of scientific migration, we apply word2vec to a database of three million migration trajectories of scientists derived from the affiliations listed on their publication records. Using techniques that leverage its semantic structure, we demonstrate that embeddings can learn the rich structure that underpins scientific migration, such as cultural, linguistic, and prestige relationships at multiple levels of granularity. Our results provide a theoretical foundation and methodological framework for using neural embeddings to represent and understand migration both within and beyond science.

Assuntos

Idioma , Semântica , Humanos , Aprendizado de Máquina , Aprendizagem , Processamento de Linguagem Natural

16.

Review of Natural Language Processing in Pharmacology.

Trajanov, Dimitar; Trajkovski, Vangel; Dimitrieva, Makedonka; Dobreva, Jovana; Jovanovik, Milos; Klemen, Matej; Zagar, Ales; Robnik-Sikonja, Marko.

Pharmacol Rev ; 75(4): 714-738, 2023 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-36931724

RESUMO

Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the past few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP: methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers. SIGNIFICANCE STATEMENT: The main objective of this work is to survey the recent use of NLP in the field of pharmacology in order to provide a comprehensive overview of the current state in the area after the rapid developments that occurred in the past few years. The resulting survey will be useful to practitioners and interested observers in the domain.

Assuntos

Inteligência Artificial , Processamento de Linguagem Natural , Humanos , Armazenamento e Recuperação da Informação , Registros Eletrônicos de Saúde , Registros

17.

OARD: Open annotations for rare diseases and their phenotypes based on real-world data.

Liu, Cong; Ta, Casey N; Havrilla, Jim M; Nestor, Jordan G; Spotnitz, Matthew E; Geneslaw, Andrew S; Hu, Yu; Chung, Wendy K; Wang, Kai; Weng, Chunhua.

Am J Hum Genet ; 109(9): 1591-1604, 2022 09 01.

Artigo em Inglês | MEDLINE | ID: mdl-35998640

RESUMO

Diagnosis for rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalence derived by OARD is highly correlated with those annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% were confirmed true associations via manual review of a list of sampled pairs. Compared to the manual curated annotation, OARD is 100% data driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.

Assuntos

Processamento de Linguagem Natural , Doenças Raras , Registros Eletrônicos de Saúde , Humanos , Fenótipo , Doenças Raras/genética

18.

Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models.

Qiu, Yuchi; Wei, Guo-Wei.

Brief Bioinform ; 24(5)2023 09 20.

Artigo em Inglês | MEDLINE | ID: mdl-37580175

RESUMO

Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.

Assuntos

Inteligência Artificial , Engenharia de Proteínas , Processamento de Linguagem Natural , Anticorpos , Análise de Dados

19.

DKADE: a novel framework based on deep learning and knowledge graph for identifying adverse drug events and related medications.

Feng, Ze-Ying; Wu, Xue-Hong; Ma, Jun-Long; Li, Min; He, Ge-Fei; Cao, Dong-Sheng; Yang, Guo-Ping.

Brief Bioinform ; 24(4)2023 07 20.

Artigo em Inglês | MEDLINE | ID: mdl-37344167

RESUMO

Adverse drug events (ADEs) are common in clinical practice and can cause significant harm to patients and increase resource use. Natural language processing (NLP) has been applied to automate ADE detection, but NLP systems become less adaptable when drug entities are missing or multiple medications are specified in clinical narratives. Additionally, no Chinese-language NLP system has been developed for ADE detection due to the complexity of Chinese semantics, despite Ë10 million cases of drug-related adverse events occurring annually in China. To address these challenges, we propose DKADE, a deep learning and knowledge graph-based framework for identifying ADEs. DKADE infers missing drug entities and evaluates their correlations with ADEs by combining medication orders and existing drug knowledge. Moreover, DKADE can automatically screen for new adverse drug reactions. Experimental results show that DKADE achieves an overall F1-score value of 91.13%. Furthermore, the adaptability of DKADE is validated using real-world external clinical data. In summary, DKADE is a powerful tool for studying drug safety and automating adverse event monitoring.

Assuntos

Aprendizado Profundo , Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos , Humanos , Reconhecimento Automatizado de Padrão , Semântica , Processamento de Linguagem Natural

20.

iAMP-Attenpred: a novel antimicrobial peptide predictor based on BERT feature extraction method and CNN-BiLSTM-Attention combination model.

Xing, Wenxuan; Zhang, Jie; Li, Chen; Huo, Yujia; Dong, Gaifang.

Brief Bioinform ; 25(1)2023 11 22.

Artigo em Inglês | MEDLINE | ID: mdl-38055840

RESUMO

As a kind of small molecule protein that can fight against various microorganisms in nature, antimicrobial peptides (AMPs) play an indispensable role in maintaining the health of organisms and fortifying defenses against diseases. Nevertheless, experimental approaches for AMP identification still demand substantial allocation of human resources and material inputs. Alternatively, computing approaches can assist researchers effectively and promptly predict AMPs. In this study, we present a novel AMP predictor called iAMP-Attenpred. As far as we know, this is the first work that not only employs the popular BERT model in the field of natural language processing (NLP) for AMPs feature encoding, but also utilizes the idea of combining multiple models to discover AMPs. Firstly, we treat each amino acid from preprocessed AMPs and non-AMP sequences as a word, and then input it into BERT pre-training model for feature extraction. Moreover, the features obtained from BERT method are fed to a composite model composed of one-dimensional CNN, BiLSTM and attention mechanism for better discriminating features. Finally, a flatten layer and various fully connected layers are utilized for the final classification of AMPs. Experimental results reveal that, compared with the existing predictors, our iAMP-Attenpred predictor achieves better performance indicators, such as accuracy, precision and so on. This further demonstrates that using the BERT approach to capture effective feature information of peptide sequences and combining multiple deep learning models are effective and meaningful for predicting AMPs.

Assuntos

Aminoácidos , Peptídeos Antimicrobianos , Humanos , Sequência de Aminoácidos , Processamento de Linguagem Natural , Pesquisadores

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA