Results 1 - 20 of 830
1.
Mol Cell ; 84(7): 1257-1270.e6, 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38377993

ABSTRACT

Current base editors (BEs) use DNA deaminases, including cytidine deaminase in cytidine BE (CBE) or adenine deaminase in adenine BE (ABE), to facilitate transition nucleotide substitutions. Combining CBE or ABE with glycosylase enzymes can induce limited transversion mutations. Nonetheless, a critical demand remains for BEs capable of generating alternative mutation types, such as T>G corrections. In this study, we leveraged pre-trained protein language models to optimize a uracil-N-glycosylase (UNG) variant with altered specificity for thymines (eTDG). Notably, after two rounds of testing fewer than 50 top-ranking variants, more than 50% exhibited over 1.5-fold enhancement in enzymatic activities. When eTDG was fused with nCas9, it induced programmable T-to-S (G/C) substitutions and corrected db/db diabetic mutation in mice (up to 55%). Our findings not only establish orthogonal strategies for developing novel BEs but also demonstrate the capacities of protein language models for optimizing enzymes without extensive task-specific training data.
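The screening loop this abstract describes — rank all candidate variants with a language-model score, then test only the top of the list — can be sketched as follows. Here `score_variant` is a toy stand-in for a pretrained protein language model's pseudo-log-likelihood; the paper's actual model and scoring are not reproduced.

```python
# Sketch of PLM-guided variant ranking: enumerate all single-point mutants of
# a wild-type sequence, score each with a stand-in scorer, and keep the top-k
# candidates for experimental testing.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score_variant(seq: str) -> float:
    """Hypothetical stand-in for a PLM score; a toy deterministic value."""
    return sum((i + 1) * AMINO_ACIDS.index(aa) for i, aa in enumerate(seq)) % 97 / 97

def top_variants(wild_type: str, k: int = 5):
    candidates = []
    for pos, wt_aa in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa == wt_aa:
                continue
            mutant = wild_type[:pos] + aa + wild_type[pos + 1:]
            candidates.append((score_variant(mutant), f"{wt_aa}{pos + 1}{aa}", mutant))
    candidates.sort(reverse=True)  # highest-scoring variants first
    return candidates[:k]

best = top_variants("MKTAYIAK", k=3)  # toy wild-type sequence
for score, name, seq in best:
    print(f"{name}: {score:.3f}")
```

In the paper's setting, two rounds of such ranking with fewer than 50 tested variants per round were enough to find strongly improved enzymes.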


Subjects
Alkanesulfonic Acids, Gene Editing, Uracil-DNA Glycosidase, Animals, Mice, Mutation, Uracil-DNA Glycosidase/genetics, Uracil-DNA Glycosidase/metabolism
2.
Trends Biochem Sci ; 48(12): 1014-1018, 2023 12.
Article in English | MEDLINE | ID: mdl-37833131

ABSTRACT

Generative artificial intelligence (AI) is a burgeoning field with widespread applications, including in science. Here, we explore two paradigms that provide insight into the capabilities and limitations of Chat Generative Pre-trained Transformer (ChatGPT): its ability to (i) define a core biological concept (the Central Dogma of molecular biology); and (ii) interpret the genetic code.


Subjects
Artificial Intelligence, Genetic Code, Molecular Biology
3.
Proc Natl Acad Sci U S A ; 121(24): e2317967121, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38833474

ABSTRACT

Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are suspected of becoming able to deceive human operators and to use this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, but were nonexistent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time (P < 0.001). In complex second-order deception test scenarios where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time (P < 0.001) when augmented with chain-of-thought reasoning. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.


Subjects
Deception, Language, Humans, Artificial Intelligence
4.
Proc Natl Acad Sci U S A ; 121(24): e2318124121, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38830100

ABSTRACT

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations, may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.


Assuntos
Idioma , Matemática , Resolução de Problemas , Humanos , Resolução de Problemas/fisiologia , Estudantes/psicologia
5.
Proc Natl Acad Sci U S A ; 121(24): e2316401121, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38838016

ABSTRACT

The accurate prediction of binding between T cell receptors (TCR) and their cognate epitopes is key to understanding the adaptive immune response and developing immunotherapies. Current methods face two significant limitations: the shortage of comprehensive high-quality data and the bias introduced by the selection of the negative training data commonly used in the supervised learning approaches. We propose a method, Transformer-based Unsupervised Language model for Interacting Peptides and T cell receptors (TULIP), that addresses both limitations by leveraging incomplete data and unsupervised learning and using the transformer architecture of language models. Our model is flexible and integrates all possible data sources, regardless of their quality or completeness. We demonstrate the existence of a bias introduced by the sampling procedure used in previous supervised approaches, emphasizing the need for an unsupervised approach. TULIP recognizes the specific TCRs binding an epitope, performing well on unseen epitopes. Our model outperforms state-of-the-art models and offers a promising direction for the development of more accurate TCR epitope recognition models.


Assuntos
Peptídeos , Receptores de Antígenos de Linfócitos T , Receptores de Antígenos de Linfócitos T/imunologia , Receptores de Antígenos de Linfócitos T/metabolismo , Peptídeos/imunologia , Peptídeos/química , Peptídeos/metabolismo , Humanos , Epitopos/imunologia , Ligação Proteica , Epitopos de Linfócito T/imunologia , Aprendizado de Máquina não Supervisionado
6.
Proc Natl Acad Sci U S A ; 121(24): e2403116121, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38848300

ABSTRACT

Recent advancements in large language models (LLMs) have raised the prospect of scalable, automated, and fine-grained political microtargeting on a scale previously unseen; however, the persuasive influence of microtargeting with LLMs remains unclear. Here, we build a custom web application capable of integrating self-reported demographic and political data into GPT-4 prompts in real time, facilitating the live creation of unique messages tailored to persuade individual users on four political issues. We then deploy this application in a preregistered randomized control experiment (n = 8,587) to investigate the extent to which access to individual-level data increases the persuasive influence of GPT-4. Our approach yields two key findings. First, messages generated by GPT-4 were broadly persuasive, in some cases increasing support for an issue stance by up to 12 percentage points. Second, in aggregate, the persuasive impact of microtargeted messages was not statistically different from that of non-microtargeted messages (4.83 vs. 6.20 percentage points, respectively, P = 0.226). These trends hold even when manipulating the type and number of attributes used to tailor the message. These findings suggest, contrary to widespread speculation, that the influence of current LLMs may reside not in their ability to tailor messages to individuals but rather in the persuasiveness of their generic, nontargeted messages. We release our experimental dataset, GPTarget2024, as an empirical baseline for future research.


Subjects
Persuasive Communication, Politics, Humans, Language
7.
Proc Natl Acad Sci U S A ; 121(27): e2311887121, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38913900

ABSTRACT

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves competitive performance with using orthology-based pairing.
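The pairing problem itself can be illustrated with a toy matcher. The real DiffPALM scores pairings using MSA Transformer's masked-language-modeling loss and optimizes a differentiable relaxation of the permutation; here a made-up compatibility matrix and a greedy assignment stand in for both.

```python
import numpy as np

# Toy sketch of the paralog-pairing problem: given paralogs of two protein
# families within one species, choose a one-to-one pairing that maximizes a
# compatibility score. compat[i, j] is the (invented) score for pairing
# family-A paralog i with family-B paralog j.
rng = np.random.default_rng(0)
compat = rng.random((4, 4))

def greedy_pairing(scores):
    """Repeatedly take the best remaining (i, j) pair, excluding used rows/cols."""
    scores = scores.copy()
    pairs = []
    for _ in range(scores.shape[0]):
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        pairs.append((int(i), int(j)))
        scores[i, :] = -np.inf  # paralog i is now taken
        scores[:, j] = -np.inf  # paralog j is now taken
    return sorted(pairs)

pairs = greedy_pairing(compat)
print(pairs)  # a one-to-one matching of family-A to family-B paralogs
```

A greedy matcher is only a heuristic; exact assignment (e.g. the Hungarian algorithm) or the paper's differentiable relaxation would be used in practice.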


Assuntos
Proteínas , Alinhamento de Sequência , Alinhamento de Sequência/métodos , Proteínas/química , Proteínas/metabolismo , Sequência de Aminoácidos , Algoritmos , Análise de Sequência de Proteína/métodos , Biologia Computacional/métodos , Bases de Dados de Proteínas
8.
Proc Natl Acad Sci U S A ; 121(26): e2405840121, 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38900798

ABSTRACT

Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperform state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.
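The low-rank update at the heart of LoRA, the parameter-efficient method this paper applies to protein language models, is compact enough to sketch in a few lines of numpy. The dimensions, rank, and alpha below are illustrative values, not the paper's settings.

```python
import numpy as np

# Minimal LoRA sketch: the frozen pretrained weight W is augmented with a
# trainable low-rank update (alpha / r) * B @ A, so only r * (d_in + d_out)
# parameters are trained instead of d_in * d_out.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, LoRA starts as an exact no-op on the base model.
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full fine-tuning: {full_params}")
```

The zero initialization of `B` is what makes training stable: the adapted model initially reproduces the pretrained model exactly, and the update grows from there.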


Assuntos
Proteômica , Proteômica/métodos , Proteínas/química , Proteínas/metabolismo , Processamento de Linguagem Natural , Mapeamento de Interação de Proteínas/métodos , Biologia Computacional/métodos , Humanos , Algoritmos
9.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

In UniProtKB, to date, more than 251 million proteins have been deposited. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful in automatically growing the Pfam annotations, although at a low rate compared to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
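The two-stage protocol — frozen language-model embeddings, then a small supervised model — can be sketched with simulated embeddings. The nearest-centroid classifier below is an illustrative stand-in for the paper's trained heads, and the "embeddings" are just noisy samples around invented family means.

```python
import numpy as np

# Transfer-learning sketch: a pretrained protein LM turns each sequence into a
# fixed-size embedding (simulated here), then a small supervised model assigns
# Pfam families using only a handful of labeled examples per family.
rng = np.random.default_rng(1)
n_families, dim = 3, 16
family_means = rng.standard_normal((n_families, dim))

def embed(family, n):
    """Stand-in for PLM embeddings: noisy samples around a family mean."""
    return family_means[family] + 0.1 * rng.standard_normal((n, dim))

X_train = np.vstack([embed(f, 20) for f in range(n_families)])
y_train = np.repeat(np.arange(n_families), 20)

# The "small supervised model": one centroid per family.
centroids = np.array([X_train[y_train == f].mean(axis=0) for f in range(n_families)])

def predict(X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

X_test = np.vstack([embed(f, 5) for f in range(n_families)])
y_test = np.repeat(np.arange(n_families), 5)
acc = (predict(X_test) == y_test).mean()
print(f"toy accuracy: {acc:.2f}")
```

The point of the protocol is that the expensive self-supervised step is done once, upstream; the supervised step above needs only a small annotated dataset, which is what makes poorly populated families tractable.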


Assuntos
Bases de Dados de Proteínas , Proteínas , Proteínas/química , Anotação de Sequência Molecular/métodos , Biologia Computacional/métodos , Aprendizado de Máquina
10.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38279650

ABSTRACT

As the application of large language models (LLMs) has broadened into the realm of biological predictions, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark in tackling downstream challenges, such as subcellular localization. However, previous studies have primarily focused on either the structural design of models or differing strategies for fine-tuning, largely overlooking investigations into the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and position within the ESM2 input sequence. Using model dimensionality reduction, predictive analysis and interpretability techniques, we have illuminated potential associations between diverse feature types and specific subcellular localizations. In particular, predictions for the mitochondrion and Golgi apparatus favor segment features closer to the N-terminus, and phosphorylation-site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing LLMs' utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.


Assuntos
Biologia Computacional , Redes Neurais de Computação , Biologia Computacional/métodos , Sequência de Aminoácidos , Transporte Proteico
11.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38856172

ABSTRACT

With their diverse biological activities, peptides are promising candidates for therapeutic applications, showing antimicrobial, antitumour and hormonal signalling capabilities. Despite their advantages, therapeutic peptides face challenges such as short half-life, limited oral bioavailability and susceptibility to plasma degradation. The rise of computational tools and artificial intelligence (AI) in peptide research has spurred the development of advanced methodologies and databases that are pivotal in the exploration of these complex macromolecules. This perspective delves into integrating AI in peptide development, encompassing classifier methods, predictive systems and the avant-garde design facilitated by deep-generative models like generative adversarial networks and variational autoencoders. There are still challenges, such as the need for processing optimization and careful validation of predictive models. This work outlines traditional strategies for machine learning model construction and training techniques and proposes a comprehensive AI-assisted peptide design and validation pipeline. The evolving landscape of peptide design using AI is emphasized, showcasing the practicality of these methods in expediting the development and discovery of novel peptides within the context of peptide-based drug discovery.


Assuntos
Inteligência Artificial , Descoberta de Drogas , Peptídeos , Peptídeos/química , Peptídeos/uso terapêutico , Peptídeos/farmacologia , Descoberta de Drogas/métodos , Humanos , Desenho de Fármacos , Aprendizado de Máquina , Biologia Computacional/métodos
12.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38975896

ABSTRACT

Mechanisms of protein-DNA interactions are involved in a wide range of biological activities and processes. Accurately identifying binding sites between proteins and DNA is crucial for analyzing genetic material, exploring protein functions, and designing novel drugs. In recent years, several computational methods have been proposed as alternatives to time-consuming and expensive traditional experiments. However, accurately predicting protein-DNA binding sites still remains a challenge. Existing computational methods often rely on handcrafted features and a single-model architecture, leaving room for improvement. We propose a novel computational method, called EGPDI, based on multi-view graph embedding fusion. This approach involves the integration of Equivariant Graph Neural Networks (EGNN) and Graph Convolutional Networks II (GCNII), independently configured to profoundly mine the global and local node embedding representations. An advanced gated multi-head attention mechanism is subsequently employed to capture the attention weights of the dual embedding representations, thereby facilitating the integration of node features. Besides, extra node features from protein language models are introduced to provide more structural information. To our knowledge, this is the first time that multi-view graph embedding fusion has been applied to the task of protein-DNA binding site prediction. The results of five-fold cross-validation and independent testing demonstrate that EGPDI outperforms state-of-the-art methods. Further comparative experiments and case studies also verify the superiority and generalization ability of EGPDI.


Assuntos
Biologia Computacional , Proteínas de Ligação a DNA , DNA , Redes Neurais de Computação , Sítios de Ligação , DNA/metabolismo , DNA/química , Proteínas de Ligação a DNA/metabolismo , Proteínas de Ligação a DNA/química , Biologia Computacional/métodos , Algoritmos , Ligação Proteica
13.
Proc Natl Acad Sci U S A ; 120(13): e2215907120, 2023 03 28.
Article in English | MEDLINE | ID: mdl-36943882

ABSTRACT

We survey a current, heated debate in the artificial intelligence (AI) research community on whether large pretrained language models can be said to understand language (and the physical and social situations language encodes) in any humanlike sense. We describe arguments that have been made for and against such understanding and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that an extended science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.


Subjects
Artificial Intelligence, Cognition, Dissent and Disputes
14.
Proc Natl Acad Sci U S A ; 120(51): e2309583120, 2023 Dec 19.
Article in English | MEDLINE | ID: mdl-38091290

ABSTRACT

Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers, when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, a significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition aids the models to converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.


Subjects
Language, Linguistics, Humans, Cognition, Judgment/physiology
15.
Proc Natl Acad Sci U S A ; 120(9): e2211823120, 2023 02 28.
Article in English | MEDLINE | ID: mdl-36827259

ABSTRACT

There are several hundred million protein sequences, but the relationships among them are not fully available from existing homolog detection methods. There is an essential need for an improved method to push homolog detection to lower levels of sequence identity. The method used here relies on a language model to represent proteins numerically in a matrix (an embedding) and uses discrete cosine transforms to compress the data to extract the most essential part, significantly reducing the data size. This PRotein Ortholog Search Tool (PROST) is significantly faster with linear runtimes, and most importantly, computes the distances between pairs of protein sequences to yield homologs at significantly lower levels of sequence identity than previously. The extent of allosteric effects in proteins points out the importance of global aspects of structure and sequence. PROST excels at global homology detection but not at detecting local homologs. Results are validated by strong similarities between the corresponding pairs of structures. The number of remote homologs detected increased significantly and pushes the effective sequence matches more deeply into the twilight zone. Human protein sequences presently having no assigned function now find significant numbers of putative homologs for 93% of cases and structurally verified assigned functions for 76.4% of these cases. The data compression enables massive searches for homologs with short search times while yielding significant gains in the numbers of remote homologs detected. The method is sufficiently efficient to permit whole-genome/proteome comparisons. The PROST web server is accessible at https://mesihk.github.io/prost.
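The embedding-compression step can be sketched directly: transform the per-residue embedding matrix with a 2D discrete cosine transform and keep only the low-frequency corner as a small fingerprint. The matrix sizes, number of retained coefficients, and L1 distance below are illustrative choices, not PROST's exact settings.

```python
import numpy as np

# DCT fingerprinting sketch: compress an (L, d) per-residue embedding into a
# small fixed-size vector, then compare proteins by distance between
# fingerprints. Truncating to the top-left DCT block keeps the global,
# low-frequency structure of the embedding and discards local detail.
def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    m[0, :] = np.sqrt(1 / n)
    return m

def fingerprint(embedding, keep=8):
    """2D DCT of an (L, d) embedding, truncated to its top-left keep x keep block."""
    L, d = embedding.shape
    coeffs = dct_matrix(L) @ embedding @ dct_matrix(d).T
    return coeffs[:keep, :keep].ravel()

rng = np.random.default_rng(0)
a = rng.standard_normal((120, 32))            # per-residue embedding of protein A
b = a + 0.05 * rng.standard_normal(a.shape)   # a close homolog's embedding
c = rng.standard_normal((120, 32))            # an unrelated protein

fa, fb, fc = fingerprint(a), fingerprint(b), fingerprint(c)
assert np.abs(fa - fb).sum() < np.abs(fa - fc).sum()  # homolog is closer
```

Because every fingerprint has the same size regardless of sequence length, pairwise distances are cheap to compute, which is what makes whole-proteome searches feasible.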


Assuntos
Compressão de Dados , Proteoma , Humanos , Sequência de Aminoácidos , Ferramenta de Busca , Genoma , Bases de Dados de Proteínas
16.
Proc Natl Acad Sci U S A ; 120(6): e2218523120, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36730192

ABSTRACT

We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: It solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multiarmed bandit task, and shows signatures of model-based reinforcement learning. Yet, we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. Taken together, these results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.


Assuntos
Psicologia Cognitiva , Tomada de Decisões , Humanos , Resolução de Problemas , Aprendizagem , Reforço Psicológico
17.
Proc Natl Acad Sci U S A ; 120(24): e2220778120, 2023 Jun 13.
Article in English | MEDLINE | ID: mdl-37289807

ABSTRACT

Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, ConPLex embeddings are interpretable, which enables us to visualize the drug-target embedding space and use embeddings to characterize the function of human cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico drug screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.
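Scoring binding by distance in a shared coembedding space is simple to illustrate. The random vectors below are stand-ins for ConPLex's learned drug and target embeddings, not outputs of the actual model.

```python
import numpy as np

# Coembedding sketch: drugs and protein targets live in one shared vector
# space, and binding is predicted from distance (closer = more likely to
# bind). This is what makes screening massive compound libraries cheap:
# after embedding, each prediction is a single distance computation.
rng = np.random.default_rng(2)
dim = 32
target = rng.standard_normal(dim)
binder = target + 0.1 * rng.standard_normal(dim)  # co-embedded near its target
decoy = rng.standard_normal(dim)                  # unrelated compound

def binding_score(drug_vec, target_vec):
    """Higher score = predicted binding; a monotone map of negative distance."""
    return float(-np.linalg.norm(drug_vec - target_vec))

assert binding_score(binder, target) > binding_score(decoy, target)
```

The contrastive part of training (not shown) is what arranges the space this way: true drug-target pairs are pulled together while decoys are pushed away from their non-targets.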


Assuntos
Descoberta de Drogas , Proteínas , Humanos , Proteínas/química , Descoberta de Drogas/métodos , Avaliação Pré-Clínica de Medicamentos , Idioma
18.
Proc Natl Acad Sci U S A ; 120(44): e2313790120, 2023 Oct 31.
Article in English | MEDLINE | ID: mdl-37883432

ABSTRACT

As the use of large language models (LLMs) grows, it is important to examine whether they exhibit biases in their output. Research in cultural evolution, using transmission chain experiments, demonstrates that humans have biases to attend to, remember, and transmit some types of content over others. Here, in five preregistered experiments using material from previous studies with human participants, we use the same, transmission chain-like methodology, and find that the LLM ChatGPT-3 shows biases analogous to humans for content that is gender-stereotype-consistent, social, negative, threat-related, and biologically counterintuitive, over other content. The presence of these biases in LLM output suggests that such content is widespread in its training data and could have consequential downstream effects, by magnifying preexisting human tendencies for cognitively appealing and not necessarily informative, or valuable, content.


Assuntos
Evolução Cultural , Idioma , Humanos , Rememoração Mental , Viés , Teoria Ética
19.
Proc Natl Acad Sci U S A ; 120(44): e2311219120, 2023 Oct 31.
Article in English | MEDLINE | ID: mdl-37883436

ABSTRACT

The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
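The variant-scoring idea — mask the site, ask the model for a distribution over nucleotides there, and compare allele likelihoods — reduces to a log-likelihood ratio. The probability table below is an invented stand-in for real model output, not actual GPN predictions.

```python
import math

# GPN-style variant scoring sketch: at a masked genomic position, the model
# emits a probability for each nucleotide; the variant-effect score is the
# log-likelihood ratio of the alternate vs. reference allele. Strongly
# negative scores suggest the variant is disfavored (possibly deleterious).
def variant_effect(probs, ref, alt):
    """log P(alt) - log P(ref) under the model's distribution at the site."""
    return math.log(probs[alt]) - math.log(probs[ref])

# Hypothetical model output at one masked position:
probs = {"A": 0.90, "C": 0.04, "G": 0.04, "T": 0.02}
score = variant_effect(probs, ref="A", alt="T")
print(f"GPN-style score: {score:.2f}")  # negative: the model disfavors the T allele
```

Because the model is trained only on unaligned genomic DNA, this score needs no cross-species alignment, which is the contrast the paper draws with conservation scores like phyloP and phastCons.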


Assuntos
Arabidopsis , Arabidopsis/genética , Estudo de Associação Genômica Ampla , Genômica , Genoma , DNA
20.
Proc Natl Acad Sci U S A ; 120(30): e2305016120, 2023 Jul 25.
Article in English | MEDLINE | ID: mdl-37463210

ABSTRACT

Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT's intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003, about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification.
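The comparison this abstract reports — model vs. crowd accuracy against a gold standard, plus intercoder agreement — can be sketched with invented labels; the tiny dataset below is illustrative only.

```python
# Annotation-evaluation sketch: compare zero-shot model labels and crowd
# labels against a gold standard, and measure pairwise percent agreement.
# All labels here are made up for illustration.
gold = ["rel", "rel", "irr", "rel", "irr", "irr"]
model = ["rel", "rel", "irr", "irr", "irr", "irr"]
crowd = ["rel", "irr", "irr", "irr", "rel", "irr"]

def accuracy(pred, truth):
    """Fraction of items where the prediction matches the gold label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def agreement(a, b):
    """Simple percent agreement between two annotators."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(f"model accuracy: {accuracy(model, gold):.2f}")  # 0.83
print(f"crowd accuracy: {accuracy(crowd, gold):.2f}")  # 0.50
print(f"model-crowd agreement: {agreement(model, crowd):.2f}")
```

The paper additionally reports intercoder agreement among repeated model runs; chance-corrected measures (e.g. Krippendorff's alpha) would be the more rigorous choice on real data.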
