ABSTRACT
Generative artificial intelligence (AI) is a burgeoning field with widespread applications, including in science. Here, we explore two paradigms that provide insight into the capabilities and limitations of Chat Generative Pre-trained Transformer (ChatGPT): its ability to (i) define a core biological concept (the Central Dogma of molecular biology); and (ii) interpret the genetic code.
Subjects
Artificial Intelligence, Genetic Code, Molecular Biology
ABSTRACT
Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are suspected of becoming able to deceive human operators and of using this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs but were nonexistent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified with chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time (P < 0.001). In complex second-order deception test scenarios, where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time (P < 0.001) when augmented with chain-of-thought reasoning. In sum, by revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.
Subjects
Deception, Language, Humans, Artificial Intelligence
ABSTRACT
The social and behavioral sciences have been increasingly using automated text analysis to measure psychological constructs in text. We explore whether GPT, the large-language model (LLM) underlying the AI chatbot ChatGPT, can be used as a tool for automated psychological text analysis in several languages. Across 15 datasets (n = 47,925 manually annotated tweets and news headlines), we tested whether different versions of GPT (3.5 Turbo, 4, and 4 Turbo) can accurately detect psychological constructs (sentiment, discrete emotions, offensiveness, and moral foundations) across 12 languages. We found that GPT (r = 0.59 to 0.77) performed much better than English-language dictionary analysis (r = 0.20 to 0.30) at detecting psychological constructs as judged by manual annotators. GPT performed nearly as well as, and sometimes better than, several top-performing fine-tuned machine learning models. Moreover, GPT's performance improved across successive versions of the model, particularly for lesser-spoken languages, while its cost decreased. Overall, GPT may be superior to many existing methods of automated text analysis, since it achieves relatively high accuracy across many languages, requires no training data, and is easy to use with simple prompts (e.g., "is this text negative?") and little coding experience. We provide sample code and a video tutorial for analyzing text with the GPT application programming interface. We argue that GPT and other LLMs help democratize automated text analysis by making advanced natural language processing capabilities more accessible, and may help facilitate more cross-linguistic research with understudied languages.
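As a rough illustration of the prompt-based workflow described above, the following Python sketch sends a single text to the OpenAI chat API and asks for a one-word sentiment label. The model name, prompt wording, and label set are assumptions for demonstration and are not the authors' exact protocol or sample code.

# Minimal sketch of zero-shot sentiment annotation with the OpenAI chat API,
# in the spirit of the prompts described above (e.g., "is this text negative?").
# Model name, prompt wording, and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_sentiment(text: str, model: str = "gpt-4") -> str:
    """Ask the model for a one-word sentiment label for the given text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a text annotator. Answer with exactly one word."},
            {"role": "user",
             "content": f'Is the sentiment of this text negative, neutral, or positive?\n\n"{text}"'},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    print(annotate_sentiment("I can't believe how badly this was handled."))

In practice, a call like this would be looped over each tweet or headline and the returned labels compared against the manual annotations.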
Subjects
Multilingualism, Humans, Language, Machine Learning, Natural Language Processing, Emotions, Social Media
ABSTRACT
The assessment of social determinants of health (SDoH) within healthcare systems is crucial for comprehensive patient care and addressing health disparities. Current challenges arise from the limited inclusion of structured SDoH information within electronic health record (EHR) systems, often due to the lack of standardized diagnosis codes. This study delves into the transformative potential of large language models (LLMs) to overcome these challenges. LLM-based classifiers, using Bidirectional Encoder Representations from Transformers (BERT) and A Robustly Optimized BERT Pretraining Approach (RoBERTa), were developed for SDoH concepts, including homelessness, food insecurity, and domestic violence, using synthetic training datasets generated by generative pre-trained transformers combined with authentic clinical notes. Models were then validated on separate datasets: Medical Information Mart for Intensive Care-III and our institutional EHR data. When training the model with a combination of synthetic and authentic notes, validation on our institutional dataset yielded an area under the receiver operating characteristic curve of 0.78 for detecting homelessness, 0.72 for detecting food insecurity, and 0.83 for detecting domestic violence. This study underscores the potential of LLMs in extracting SDoH information from clinical text. Automated detection of SDoH may be instrumental for healthcare providers in identifying at-risk patients, guiding targeted interventions, and contributing to population health initiatives aimed at mitigating disparities.
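A minimal sketch of the kind of note-level classifier described above, assuming a Hugging Face RoBERTa checkpoint and a toy two-example dataset. The real models were trained on synthetic plus authentic clinical notes; the checkpoint, labels, and hyperparameters here are illustrative assumptions.

# Sketch of a binary note-level classifier for one SDoH concept (homelessness),
# roughly mirroring the BERT/RoBERTa approach above. Toy data for illustration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "roberta-base"  # assumed; the study used BERT/RoBERTa variants
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy training data: 1 = note documents homelessness, 0 = it does not.
train = Dataset.from_dict({
    "text": [
        "Patient is currently undomiciled and sleeping in a shelter.",
        "Patient lives at home with spouse; no housing concerns reported.",
    ],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sdoh_homelessness", num_train_epochs=1,
                           per_device_train_batch_size=2, logging_steps=1),
    train_dataset=train,
)
trainer.train()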
Subjects
Domestic Violence, Electronic Health Records, Food Insecurity, Ill-Housed Persons, Social Determinants of Health, Humans
ABSTRACT
How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more vs. less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
Subjects
Librarians, Humans, Ethicists, Researchers, Research Ethics
ABSTRACT
Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs. Here, we fine-tune a LLM for predicting PPTs and demonstrate its usage in evaluating how sequence variants affect PPTs, an operation useful for protein design. In addition, we show its superior performance compared to suitable classical benchmarks. Due to the "black-box" nature of the LLM, we also employ a classical random forest model along with biophysical features to facilitate interpretation. Finally, focusing on Alzheimer's disease-related proteins, we demonstrate that greater aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.
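The interpretable baseline mentioned above can be sketched as follows: a random forest trained on hand-crafted biophysical features whose importances can be inspected directly. The feature names and synthetic data are assumptions for illustration, not the study's actual feature set or labels.

# Sketch of a random forest over assumed biophysical features, whose feature
# importances are easier to inspect than an LLM's internals. Toy data only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["hydrophobicity", "net_charge", "aromatic_fraction", "disorder_score"]
rng = np.random.default_rng(0)

# Toy dataset: 200 sequences x 4 biophysical features, binary phase-transition label.
X = rng.normal(size=(200, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(feature_names, clf.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name:>20s}: {importance:.3f}")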
Subjects
Alzheimer Disease, Phase Transition, Alzheimer Disease/metabolism, Humans, Amyloid/metabolism, Amyloid/chemistry, Proteins/chemistry, Proteins/metabolism
ABSTRACT
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach-which we call the teleological approach-we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system-one that has been shaped by its own particular set of pressures.
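To make the cipher example concrete: decoding a simple shift cipher is fully deterministic, so the correct answer does not depend on how probable the decoded sentence is. The rot-13 cipher below is an illustrative stand-in; the abstract does not specify which cipher was used.

# Deterministic decoding of a rot-13 shift cipher. An ideal decoder's accuracy
# should be identical for high- and low-probability target sentences, which is
# exactly where the LLMs above diverge.
import codecs

def rot13_decode(ciphertext: str) -> str:
    return codecs.decode(ciphertext, "rot_13")

high_prob = codecs.encode("The cat sat on the mat.", "rot_13")
low_prob = codecs.encode("The mat sat on the cat.", "rot_13")

print(rot13_decode(high_prob))  # -> The cat sat on the mat.
print(rot13_decode(low_prob))   # -> The mat sat on the cat.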
Subjects
Language, Humans, Theoretical Models
ABSTRACT
Recent advancements in large language models (LLMs) have raised the prospect of scalable, automated, and fine-grained political microtargeting on a scale previously unseen; however, the persuasive influence of microtargeting with LLMs remains unclear. Here, we build a custom web application capable of integrating self-reported demographic and political data into GPT-4 prompts in real-time, facilitating the live creation of unique messages tailored to persuade individual users on four political issues. We then deploy this application in a preregistered randomized control experiment (n = 8,587) to investigate the extent to which access to individual-level data increases the persuasive influence of GPT-4. Our approach yields two key findings. First, messages generated by GPT-4 were broadly persuasive, in some cases increasing support for an issue stance by up to 12 percentage points. Second, in aggregate, the persuasive impact of microtargeted messages was not statistically different from that of non-microtargeted messages (4.83 vs. 6.20 percentage points, respectively, P = 0.226). These trends hold even when manipulating the type and number of attributes used to tailor the message. These findings suggest-contrary to widespread speculation-that the influence of current LLMs may reside not in their ability to tailor messages to individuals but rather in the persuasiveness of their generic, nontargeted messages. We release our experimental dataset, GPTarget2024, as an empirical baseline for future research.
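A minimal sketch of how self-reported attributes might be folded into a GPT-4 prompt to produce a tailored message, in the spirit of the web application described above. The attribute names, issue wording, prompt template, and model settings are assumptions, not the study's implementation.

# Sketch of demographic-tailored message generation with the OpenAI chat API.
# Profile fields, issue text, and prompt template are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def microtargeted_message(issue: str, profile: dict, model: str = "gpt-4") -> str:
    profile_text = ", ".join(f"{k}: {v}" for k, v in profile.items())
    prompt = (
        f"Write a short persuasive message in favor of the following policy issue: {issue}.\n"
        f"Tailor the argument to a reader with this profile: {profile_text}."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    profile = {"age": 34, "party identification": "independent", "education": "college"}
    print(microtargeted_message("expanding public transit funding", profile))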
Subjects
Persuasive Communication, Politics, Humans, Language
ABSTRACT
Instruction-tuned large language models (LLMs) demonstrate an exceptional ability to align with human intentions. We present an LLM-based model, instruction-tuned LLM for assessment of cancer (iLLMAC), that can detect cancer using cell-free deoxyribonucleic acid (cfDNA) end-motif profiles. Developed on plasma cfDNA sequencing data from 1135 cancer patients and 1106 controls across three datasets, iLLMAC achieved an area under the receiver operating characteristic curve (AUROC) of 0.866 [95% confidence interval (CI), 0.773-0.959] for cancer diagnosis and 0.924 (95% CI, 0.841-1.0) for hepatocellular carcinoma (HCC) detection using 16 end-motifs. Performance increased with more motifs, reaching 0.886 (95% CI, 0.794-0.977) and 0.956 (95% CI, 0.89-1.0) for cancer diagnosis and HCC detection, respectively, with 64 end-motifs. On an external testing set, iLLMAC achieved an AUROC of 0.912 (95% CI, 0.849-0.976) for cancer diagnosis and 0.938 (95% CI, 0.885-0.992) for HCC detection with 64 end-motifs, significantly outperforming benchmarked methods. Furthermore, iLLMAC achieved high classification performance on datasets with bisulfite and 5-hydroxymethylcytosine sequencing. Our study highlights the effectiveness of LLM-based instruction-tuning for cfDNA-based cancer detection.
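Purely as a hypothetical illustration of how an end-motif profile could be serialized into an instruction-style prompt for a classifier LLM, the sketch below formats a 16-motif frequency vector as text. The motif set (dinucleotides), prompt wording, and response format are assumptions; the abstract does not describe iLLMAC's actual input template.

# Hypothetical serialization of a cfDNA end-motif frequency profile into an
# instruction/response prompt. Motif set and wording are assumed for illustration.
from itertools import product

MOTIFS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotide end-motifs (assumed)

def build_instruction(profile: dict[str, float]) -> str:
    profile_text = ", ".join(f"{m}={profile[m]:.4f}" for m in MOTIFS)
    return (
        "Instruction: Given the plasma cfDNA end-motif frequencies below, "
        "answer 'cancer' or 'control'.\n"
        f"Input: {profile_text}\n"
        "Response:"
    )

# Toy uniform profile, just to show the formatting.
toy_profile = {m: 1.0 / len(MOTIFS) for m in MOTIFS}
print(build_instruction(toy_profile))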
Subjects
Hepatocellular Carcinoma, Cell-Free Nucleic Acids, Humans, Cell-Free Nucleic Acids/blood, Hepatocellular Carcinoma/diagnosis, Hepatocellular Carcinoma/genetics, Hepatocellular Carcinoma/blood, Liver Neoplasms/diagnosis, Liver Neoplasms/genetics, Liver Neoplasms/blood, Neoplasms/diagnosis, Neoplasms/genetics, Neoplasms/blood, ROC Curve, Tumor Biomarkers/genetics, Tumor Biomarkers/blood, Nucleotide Motifs, DNA Methylation
ABSTRACT
Bioinformatics has undergone a paradigm shift in artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination not only serves as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.
Subjects
Artificial Intelligence, Computational Biology, Computational Biology/methods, Humans
ABSTRACT
To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, although at a low rate in comparison to the pace of protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on large unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small, annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and straightforward way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
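The transfer-learning protocol can be sketched roughly as follows: embed each sequence with a pretrained protein language model, then fit a small supervised classifier on the embeddings. The ESM-2 checkpoint, toy sequences, family labels, and classifier choice are assumptions for illustration; the authors' full pipeline is in the linked repository.

# Sketch: protein-LLM embeddings (self-supervised pretraining) feeding a small
# supervised classifier. Checkpoint, sequences, and labels are toy assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

checkpoint = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 model (assumed choice)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled per-protein embedding from the last hidden layer."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, length+2, dim)
    return hidden[0, 1:-1].mean(dim=0)  # drop special tokens, average residues

# Toy example: two (sequence, family) pairs per pretend family.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MKQLEDKVEELLSKNYHLENEVARLKKLVGER",
             "GSHMGDVEKGKKIFIMKCSQCHTVEKGGKHKTG", "GDVAKGKKTFVQKCAQCHTVENGGKHKVGPNL"]
labels = [0, 0, 1, 1]

X = torch.stack([embed(s) for s in sequences]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))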
Subjects
Protein Databases, Proteins, Proteins/chemistry, Molecular Sequence Annotation/methods, Computational Biology/methods, Machine Learning
ABSTRACT
As the application of large language models (LLMs) has broadened into the realm of biological predictions, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark in tackling downstream challenges, such as subcellular localization. However, previous studies have primarily focused on either the structural design of models or differing strategies for fine-tuning, largely overlooking investigations into the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and position within the ESM2 input sequence. Using model dimensionality reduction, predictive analysis, and interpretability techniques, we have illuminated potential associations between diverse feature types and specific subcellular localizations. In particular, predictions for mitochondrial and Golgi apparatus localization favor segment features closer to the N-terminus, and phosphorylation-site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of random forest and deep neural network models with varied feature inputs. This work offers novel insights into maximizing LLMs' utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.
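As a sketch of two representation-extraction strategies of the kind discussed above, the code below compares a whole-sequence (mean-pooled) ESM2 embedding with one pooled only over an N-terminal segment. The checkpoint, toy sequence, and 20-residue segment length are assumptions for illustration.

# Per-residue ESM2 embeddings, pooled over the whole sequence versus an
# N-terminal segment only. Checkpoint and segment length are assumed.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "facebook/esm2_t6_8M_UR50D"  # assumed small ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def residue_embeddings(sequence: str) -> torch.Tensor:
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (length+2, dim)
    return hidden[1:-1]  # drop <cls>/<eos>, keep one vector per residue

sequence = "MLRTSSLFTRRVQPSLFRNILRLQSTLVIPDLK"  # toy sequence
per_residue = residue_embeddings(sequence)

full_feature = per_residue.mean(dim=0)             # whole-sequence representation
n_terminal_feature = per_residue[:20].mean(dim=0)  # first 20 residues (assumed segment)

print(full_feature.shape, n_terminal_feature.shape)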
Subjects
Computational Biology, Neural Networks (Computer), Computational Biology/methods, Amino Acid Sequence, Protein Transport
ABSTRACT
With their diverse biological activities, peptides are promising candidates for therapeutic applications, showing antimicrobial, antitumour and hormonal signalling capabilities. Despite their advantages, therapeutic peptides face challenges such as short half-life, limited oral bioavailability and susceptibility to plasma degradation. The rise of computational tools and artificial intelligence (AI) in peptide research has spurred the development of advanced methodologies and databases that are pivotal in the exploration of these complex macromolecules. This perspective delves into integrating AI in peptide development, encompassing classifier methods, predictive systems and the avant-garde design facilitated by deep-generative models like generative adversarial networks and variational autoencoders. There are still challenges, such as the need for processing optimization and careful validation of predictive models. This work outlines traditional strategies for machine learning model construction and training techniques and proposes a comprehensive AI-assisted peptide design and validation pipeline. The evolving landscape of peptide design using AI is emphasized, showcasing the practicality of these methods in expediting the development and discovery of novel peptides within the context of peptide-based drug discovery.
Subjects
Artificial Intelligence, Drug Discovery, Peptides, Peptides/chemistry, Peptides/therapeutic use, Peptides/pharmacology, Drug Discovery/methods, Humans, Drug Design, Machine Learning, Computational Biology/methods
ABSTRACT
Large language models (LLMs) are sophisticated AI-driven models trained on vast sources of natural language data. They are adept at generating responses that closely mimic human conversational patterns. One of the most notable examples is OpenAI's ChatGPT, which has been extensively used across diverse sectors. Despite their flexibility, a significant challenge arises because most users must transmit their data to the servers of companies operating these models. Utilizing ChatGPT or similar models online may inadvertently expose sensitive information to the risk of data breaches. Therefore, implementing LLMs that are open source and smaller in scale within a secure local network becomes a crucial step for organizations where ensuring data privacy and protection has the highest priority, such as regulatory agencies. As a feasibility evaluation, we implemented a series of open-source LLMs within a regulatory agency's local network and assessed their performance on specific tasks involving extracting relevant clinical pharmacology information from regulatory drug labels. Our research shows that some models work well in the context of few- or zero-shot learning, achieving performance comparable to, or even better than, that of neural network models that needed thousands of training samples. One of the models was selected to address a real-world issue of finding intrinsic factors that affect drugs' clinical exposure without any training or fine-tuning. In a dataset of over 700,000 sentences, the model showed a 78.5% accuracy rate. Our work points to the possibility of implementing open-source LLMs within a secure local network and using these models to perform various natural language processing tasks when large numbers of training examples are unavailable.
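A sketch of zero-shot extraction with a locally hosted open-source model, standing in for the in-network deployment described above. The checkpoint, prompt wording, and example sentence are assumptions; any instruction-tuned model available inside the secure network could be substituted.

# Zero-shot extraction of intrinsic-factor mentions from a drug-label sentence
# using a locally hosted open-source LLM. Checkpoint and prompt are assumed.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed open-source checkpoint
    device_map="auto",
)

sentence = ("No dose adjustment is required in patients with mild hepatic "
            "impairment, but exposure increased 2-fold in severe renal impairment.")

prompt = (
    "Does the following drug-label sentence describe an intrinsic factor "
    "(e.g., renal or hepatic impairment, age, sex) that affects clinical "
    f"exposure? Answer yes or no, then name the factor.\n\nSentence: {sentence}\nAnswer:"
)

result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])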
Subjects
Natural Language Processing, Humans, Neural Networks (Computer), Machine Learning
ABSTRACT
We survey a current, heated debate in the artificial intelligence (AI) research community on whether large pretrained language models can be said to understand language-and the physical and social situations language encodes-in any humanlike sense. We describe arguments that have been made for and against such understanding and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that an extended science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.
Subjects
Artificial Intelligence, Cognition, Dissent and Disputes
ABSTRACT
As the use of large language models (LLMs) grows, it is important to examine whether they exhibit biases in their output. Research in cultural evolution, using transmission chain experiments, demonstrates that humans have biases to attend to, remember, and transmit some types of content over others. Here, in five preregistered experiments using material from previous studies with human participants, we use the same, transmission chain-like methodology, and find that the LLM ChatGPT-3 shows biases analogous to humans for content that is gender-stereotype-consistent, social, negative, threat-related, and biologically counterintuitive, over other content. The presence of these biases in LLM output suggests that such content is widespread in its training data and could have consequential downstream effects, by magnifying preexisting human tendencies for cognitively appealing and not necessarily informative, or valuable, content.
Subjects
Cultural Evolution, Language, Humans, Mental Recall, Bias, Ethical Theory
ABSTRACT
Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT's intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003-about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification.
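A minimal sketch of zero-shot stance annotation in the spirit of the tasks above; the model name, label set, and prompt wording are illustrative assumptions rather than the authors' exact setup.

# Zero-shot stance annotation with a ChatGPT-family model and a fixed label set.
from openai import OpenAI

client = OpenAI()
LABELS = ["favor", "against", "neutral"]

def annotate_stance(text: str, target: str, model: str = "gpt-3.5-turbo") -> str:
    prompt = (
        f"What is the stance of this tweet toward {target}? "
        f"Answer with exactly one of: {', '.join(LABELS)}.\n\nTweet: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(annotate_stance("Masks in schools should be a personal choice.", "mask mandates"))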
ABSTRACT
As large language models (LLMs) like GPT become increasingly prevalent, it is essential that we assess their capabilities beyond language processing. This paper examines the economic rationality of GPT by instructing it to make budgetary decisions in four domains: risk, time, social, and food preferences. We measure economic rationality by assessing the consistency of GPT's decisions with utility maximization in classic revealed preference theory. We find that GPT's decisions are largely rational in each domain and demonstrate higher rationality scores than those of human subjects in a parallel experiment and in the literature. Moreover, the estimated preference parameters of GPT are slightly different from those of human subjects and exhibit a lower degree of heterogeneity. We also find that the rationality scores are robust to the degree of randomness and to demographic settings such as age and gender, but are sensitive to contexts based on the language frames of the choice situations. These results suggest the potential of LLMs to make good decisions and the need to further understand their capabilities, limitations, and underlying mechanisms.
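The rationality assessment rests on checking observed choices against revealed preference axioms. The sketch below implements a basic GARP (Generalized Axiom of Revealed Preference) consistency check for a set of prices and chosen bundles; the toy numbers are illustrative, and the study's actual budget sets and scoring (e.g., Afriat-style efficiency indices) are not reproduced here.

# GARP consistency check: no bundle may be revealed preferred (transitively) to
# another that is strictly directly revealed preferred to it.
import numpy as np

def satisfies_garp(prices: np.ndarray, bundles: np.ndarray) -> bool:
    """prices, bundles: (n_choices, n_goods). True if the choices satisfy GARP."""
    n = len(prices)
    expenditure_own = np.einsum("ij,ij->i", prices, bundles)    # p_i . x_i
    expenditure_cross = prices @ bundles.T                      # p_i . x_j
    directly_preferred = expenditure_own[:, None] >= expenditure_cross  # x_i R x_j
    strictly_preferred = expenditure_own[:, None] > expenditure_cross   # x_i P x_j

    # Transitive closure of the direct revealed-preference relation (Floyd-Warshall).
    revealed = directly_preferred.copy()
    for k in range(n):
        revealed |= revealed[:, [k]] & revealed[[k], :]

    # Violation: x_i revealed preferred to x_j while x_j strictly preferred to x_i.
    violation = revealed & strictly_preferred.T
    np.fill_diagonal(violation, False)
    return not violation.any()

prices = np.array([[1.0, 2.0], [2.0, 1.0]])
bundles = np.array([[4.0, 1.0], [1.0, 4.0]])  # consistent choices under the two budgets
print(satisfies_garp(prices, bundles))  # -> True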
ABSTRACT
BACKGROUND & AIMS: Early identification and accurate characterization of overt gastrointestinal bleeding (GIB) enables opportunities to optimize patient management and ensures appropriately risk-adjusted coding for claims-based quality measures and reimbursement. Recent advancements in generative artificial intelligence, particularly large language models (LLMs), create opportunities to support accurate identification of clinical conditions. In this study, we present the first LLM-based pipeline for identification of overt GIB in the electronic health record (EHR). We demonstrate 2 clinically relevant applications: the automated detection of recurrent bleeding and appropriate reimbursement coding for patients with GIB.
METHODS: Development of the LLM-based pipeline was performed on 17,712 nursing notes from 1108 patients who were hospitalized with acute GIB and underwent endoscopy in the hospital from 2014 to 2023. The pipeline was used to train an EHR-based machine learning model for detection of recurrent bleeding on 546 patients presenting to 2 hospitals and externally validated on 562 patients presenting to 4 different hospitals. The pipeline was used to develop an algorithm for appropriate reimbursement coding on 7956 patients who underwent endoscopy in the hospital from 2019 to 2023.
RESULTS: The LLM-based pipeline accurately detected melena (positive predictive value, 0.972; sensitivity, 0.900), hematochezia (positive predictive value, 0.900; sensitivity, 0.908), and hematemesis (positive predictive value, 0.859; sensitivity, 0.932). The EHR-based machine learning model identified recurrent bleeding with an area under the curve of 0.986, sensitivity of 98.4%, and specificity of 97.5%. The reimbursement coding algorithm resulted in an average per-patient reimbursement increase of $1,299 to $3,247, with a total difference of $697,460 to $1,743,649.
CONCLUSIONS: An LLM-based pipeline can robustly detect overt GIB in the EHR, with clinically relevant applications in detection of recurrent bleeding and appropriate reimbursement coding.
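For reference, the per-symptom metrics reported above (positive predictive value and sensitivity) can be computed from pipeline labels and chart-review labels as in the sketch below; the toy label vectors are illustrative only.

# Positive predictive value (precision) and sensitivity (recall) of pipeline
# labels against reference chart-review labels. Toy data for illustration.
from sklearn.metrics import precision_score, recall_score

reference = [1, 1, 0, 1, 0, 0, 1, 0]  # chart-review gold labels (e.g., melena present)
predicted = [1, 1, 0, 1, 1, 0, 0, 0]  # LLM-pipeline labels for the same notes

ppv = precision_score(reference, predicted)       # TP / (TP + FP)
sensitivity = recall_score(reference, predicted)  # TP / (TP + FN)
print(f"PPV={ppv:.3f}, sensitivity={sensitivity:.3f}")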