ABSTRACT
Machine learning, a collection of data-analytical techniques aimed at building predictive models from multi-dimensional datasets, is becoming integral to modern biological research. By enabling one to generate models that learn from large datasets and make predictions on likely outcomes, machine learning can be used to study complex cellular systems such as biological networks. Here, we provide a primer on machine learning for life scientists, including an introduction to deep learning. We discuss opportunities and challenges at the intersection of machine learning and network biology, which could impact disease biology, drug discovery, microbiome research, and synthetic biology.
Subject(s)
Computational Biology/methods , Machine Learning , Algorithms , Databases, Factual , Drug Discovery , Drug-Related Side Effects and Adverse Reactions , Humans , Microbiota , Neural Networks, Computer
ABSTRACT
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations, may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.
Subject(s)
Language , Mathematics , Problem Solving , Humans , Problem Solving/physiology , Students/psychology
ABSTRACT
Recent progress in DNA synthesis and sequencing technology has enabled systematic studies of protein function at a massive scale. We explore a deep mutational scanning study that measured the transcriptional repression function of 43,669 variants of the Escherichia coli LacI protein. We analyze structural and evolutionary aspects that relate to how the function of this protein is maintained, including an in-depth look at the C-terminal domain. We develop a deep neural network to predict transcriptional repression mediated by the lac repressor of Escherichia coli using experimental measurements of variant function. When measured across 10 separate training and validation splits using 5,009 single mutations of the lac repressor, our best-performing model achieved a median Pearson correlation of 0.79, exceeding any previous model. We demonstrate that deep representation learning approaches, first trained in an unsupervised manner across millions of diverse proteins, can be fine-tuned in a supervised fashion using lac repressor experimental datasets to more effectively predict a variant's effect on repression. These findings suggest a deep representation learning model may improve the prediction of other important properties of proteins.
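The evaluation protocol described above (a median Pearson correlation across 10 training and validation splits) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the function and variable names are hypothetical, and a plain linear least-squares fit stands in for the deep neural network, whose architecture the abstract does not specify.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def median_pearson_over_splits(X, y, n_splits=10, val_fraction=0.2, seed=0):
    """Fit a stand-in linear model on repeated random splits and
    return the median validation Pearson r plus the per-split values."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_val = int(n * val_fraction)
    per_split_r = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        val_idx, train_idx = perm[:n_val], perm[n_val:]
        # Least-squares fit with an intercept column (NOT the authors' model).
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        w, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        per_split_r.append(pearson(A_val @ w, y[val_idx]))
    return float(np.median(per_split_r)), per_split_r

# Synthetic stand-in for variant features and measured repression values.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.5 * rng.normal(size=200)

median_r, per_split = median_pearson_over_splits(X, y)
print(f"median Pearson r over {len(per_split)} splits: {median_r:.2f}")
```

Reporting the median rather than a single split's value makes the score less sensitive to one unusually easy or hard validation set.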
Subject(s)
Deep Learning , Escherichia coli Proteins/metabolism , Lac Repressors/metabolism , Transcription, Genetic , Epistasis, Genetic , Escherichia coli Proteins/genetics , Lac Repressors/genetics , Mutation/genetics , Protein Domains , Reproducibility of Results
ABSTRACT
Mixed-status families, whose members have multiple immigration statuses, are common in US immigrant communities. Large-scale worksite raids, an immigration enforcement tactic used throughout US history, returned during the Trump administration. Yet, little research characterizes the impacts of these raids, especially as related to mixed-status families. The current study (1) describes a working definition of a large-scale worksite raid and (2) considers impacts of these raids on mixed-status families. We conducted semistructured interviews in Spanish and English at 6 communities that experienced the largest worksite raids in 2018. Participants were 77 adults who provided material, emotional, or professional support following raids. Qualitative analysis methods were used to develop a codebook and code all interviews. The unpredictability of worksite raids resulted in chaos and confusion, often stemming from potential family separation. Financial crises followed because of the removal of primary financial providers. In response, families rearranged roles to generate income. Large-scale worksite raids result in similar harms to mixed-status families as other enforcement tactics but on a much larger scale. They also uniquely drain community resources, with long-term impacts. Advocacy and policy efforts are needed to mitigate damage and end this practice.
Subject(s)
Emigrants and Immigrants , Emigration and Immigration , Adult , Family Relations , Hispanic or Latino , Humans , Workplace
ABSTRACT
TUT4 and the closely related TUT7 are non-templated poly(U) polymerases required at different stages of development, and their mis-regulation or mutation has been linked to important cancer pathologies. While TUT4(7) interaction with its pre-miRNA targets has been characterized in detail, the molecular bases of the broader target recognition process are unclear. Here, we examine RNA binding by the ZnF domains of the protein. We show that TUT4(7) ZnF2 contains two distinct RNA binding surfaces that are used in the interaction with different RNA nucleobases in different targets, i.e., that this small domain encodes diversity in TUT4(7) selectivity and molecular function. Interestingly and unlike other well-characterized CCHC ZnFs, ZnF2 is not physically coupled to the flanking ZnF3 and acts independently in miRNA recognition, while the remaining CCHC ZnF of TUT4(7), ZnF1, has lost its intrinsic RNA binding capability. Together, our data suggest that the ZnFs of TUT4(7) are independent units for RNA and, possibly, protein-protein interactions that underlie the protein's functional flexibility and are likely to play an important role in building its interaction network.
Subject(s)
DNA-Binding Proteins/metabolism , Epistasis, Genetic , Gene Expression Regulation , MicroRNAs/genetics , RNA-Binding Proteins/metabolism , Zinc Fingers , Base Composition , DNA-Binding Proteins/chemistry , Humans , Magnetic Resonance Spectroscopy , MicroRNAs/chemistry , MicroRNAs/metabolism , Poly U , Protein Interaction Domains and Motifs , RNA-Binding Proteins/chemistry , Structure-Activity Relationship
ABSTRACT
RBM10 is an RNA-binding protein that plays an essential role in development and is frequently mutated in the context of human disease. RBM10 recognizes a diverse set of RNA motifs in introns and exons and regulates alternative splicing. However, the molecular mechanisms underlying this seemingly relaxed sequence specificity are not understood and functional studies have focused on 3′ intronic sites only. Here, we dissect the RNA code recognized by RBM10 and relate it to the splicing regulatory function of this protein. We show that a two-domain RRM1-ZnF unit recognizes a GGA-centered motif enriched in RBM10 exonic sites with high affinity and specificity and test that the interaction with these exonic sequences promotes exon skipping. Importantly, a second RRM domain (RRM2) of RBM10 recognizes a C-rich sequence, which explains its known interaction with the intronic 3′ site of NUMB exon 9 contributing to regulation of the Notch pathway in cancer. Together, these findings explain RBM10's broad RNA specificity and suggest that RBM10 functions as a splicing regulator using two RNA-binding units with different specificities to promote exon skipping.
Subject(s)
RNA-Binding Proteins/physiology , Autoantigens , Base Sequence , Binding Sites , Exons , HEK293 Cells , Humans , Protein Binding , RNA Splicing , RNA, Messenger/chemistry , RNA, Messenger/metabolism , RNA-Binding Proteins/chemistry , Zinc Fingers
ABSTRACT
Defining the RNA target selectivity of the proteins regulating mRNA metabolism is a key issue in RNA biology. Here we present a novel use of principal component analysis (PCA) to extract the RNA sequence preference of RNA binding proteins. We show that PCA can be used to compare the changes in the nuclear magnetic resonance (NMR) spectrum of a protein upon binding a set of quasi-degenerate RNAs and define the nucleobase specificity. We couple this application of PCA to an automated NMR spectra recording and processing protocol and obtain an unbiased and high-throughput NMR method for the analysis of nucleobase preference in protein-RNA interactions. We test the method on the RNA binding domains of three important regulators of RNA metabolism.
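As a rough illustration of the idea described above, the sketch below applies PCA to a small matrix of binding-induced spectral changes, with one row per RNA and one column per protein peak. It is not the authors' protocol: the data are synthetic, the names are hypothetical, and PCA is computed with a plain singular value decomposition.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD: returns scores, component loadings and the
    fraction of variance explained by each retained component."""
    Xc = X - X.mean(axis=0)                      # center each column (peak)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical data: 4 quasi-degenerate RNAs x 6 protein peaks.
# RNAs 0 and 1 carry an assumed preferred nucleobase that perturbs
# peaks 0, 1 and 4; RNAs 2 and 3 do not. Small noise is added.
rng = np.random.default_rng(0)
base_effect = np.array([1.0, 0.8, 0.0, 0.0, 0.5, 0.0])
has_preferred_base = np.array([1, 1, 0, 0])
shifts = np.outer(has_preferred_base, base_effect)
shifts = shifts + 0.01 * rng.normal(size=shifts.shape)

scores, components, explained = pca(shifts)
print("PC1 scores per RNA:", np.round(scores[:, 0], 2))
print("variance explained by PC1:", round(float(explained[0]), 3))
```

In this toy case the first principal component separates the RNAs that contain the assumed preferred nucleobase from those that do not, which is the kind of grouping one would inspect to infer nucleobase preference.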
Subject(s)
High-Throughput Screening Assays/methods , Nuclear Magnetic Resonance, Biomolecular/methods , RNA-Binding Proteins/metabolism , RNA/genetics , RNA/metabolism , Base Sequence , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , High-Throughput Screening Assays/statistics & numerical data , Humans , Models, Molecular , Principal Component Analysis , Protein Interaction Domains and Motifs , RNA-Binding Proteins/chemistry , Recombinant Proteins/chemistry , Recombinant Proteins/metabolism , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , mRNA Cleavage and Polyadenylation Factors/chemistry , mRNA Cleavage and Polyadenylation Factors/metabolism
ABSTRACT
What do we want from machine intelligence? We envision machines that are not just tools for thought but partners in thought: reasonable, insightful, knowledgeable, reliable and trustworthy systems that think with us. Current artificial intelligence systems satisfy some of these criteria, some of the time. In this Perspective, we show how the science of collaborative cognition can be put to work to engineer systems that really can be called 'thought partners', systems built to meet our expectations and complement our limitations. We lay out several modes of collaborative thought in which humans and artificial intelligence thought partners can engage, and we propose desiderata for human-compatible thought partnerships. Drawing on motifs from computational cognitive science, we motivate an alternative scaling path for the design of thought partners and ecosystems around their use through a Bayesian lens, whereby the partners we construct actively build and reason over models of the human and world.
Subject(s)
Artificial Intelligence , Thinking , Humans , Bayes Theorem , Cooperative Behavior , Cognition
ABSTRACT
Bacteria use an array of sigma factors to regulate gene expression during different stages of their life cycles. Full-length, atomic-level structures of sigma factors have been challenging to obtain experimentally as a result of their many regions of intrinsic disorder. AlphaFold has now supplied plausible full-length models for most sigma factors. Here we discuss the current understanding of the structures and functions of sigma factors in the model organism, Bacillus subtilis, and present an X-ray crystal structure of a region of B. subtilis SigE, a sigma factor that plays a critical role in the developmental process of spore formation.
ABSTRACT
Understanding how the RNA-binding domains of a protein regulator are used to recognize its RNA targets is a key problem in RNA biology, but RNA-binding domains with very low affinity do not perform well in the methods currently available to characterize protein-RNA interactions. Here, we propose to use conservative mutations that enhance the affinity of RNA-binding domains to overcome this limitation. As a proof of principle, we have designed and validated an affinity-enhanced K-homology (KH) domain mutant of the fragile X syndrome protein FMRP, a key regulator of neuronal development, and used this mutant to determine the domain's sequence preference and to explain FMRP recognition of specific RNA motifs in the cell. Our results validate our concept and our nuclear magnetic resonance (NMR)-based workflow. While effective mutant design requires an understanding of the underlying principles of RNA recognition by the relevant domain type, we expect the method will be used effectively in many RNA-binding domains.
Subject(s)
Fragile X Mental Retardation Protein , RNA , RNA/genetics , Fragile X Mental Retardation Protein/genetics , Proteins/genetics , Mutation , RNA-Binding Motifs/genetics
ABSTRACT
BACKGROUND: The prevalence of Alzheimer's disease (AD) is projected to increase as the population ages, and current treatments are minimally effective. Transcranial photobiomodulation (t-PBM) with near-infrared (NIR) light penetrates into the cerebral cortex, stimulates the mitochondrial respiratory chain, and increases cerebral blood flow. Preliminary data suggest t-PBM may be efficacious in improving cognition in people with early AD and amnestic mild cognitive impairment (aMCI). METHODS: In this randomized, double-blind, placebo-controlled study with aMCI and early AD participants, we will test the efficacy, safety, and impact on cognition of 24 sessions of t-PBM delivered over 8 weeks. Brain mechanisms of t-PBM in this population will be explored by testing whether the baseline tau burden (measured with 18F-MK6240), or changes in mitochondrial function over 8 weeks (assessed with 31P-MRSI), moderate the changes observed in cognitive functions after t-PBM therapy. We will also use changes in the fMRI Blood-Oxygenation-Level-Dependent (BOLD) signal after a single treatment to demonstrate t-PBM-dependent increases in prefrontal cortex blood flow. CONCLUSION: This study will test whether t-PBM, a low-cost, accessible, and user-friendly intervention, has the potential to improve cognition and function in an aMCI and early AD population.
ABSTRACT
The design choices underlying machine-learning (ML) models present important barriers to entry for many biologists who aim to incorporate ML in their research. Automated machine-learning (AutoML) algorithms can address many challenges that come with applying ML to the life sciences. However, these algorithms are rarely used in systems and synthetic biology studies because they typically do not explicitly handle biological sequences (e.g., nucleotide, amino acid, or glycan sequences) and cannot be easily compared with other AutoML algorithms. Here, we present BioAutoMATED, an AutoML platform for biological sequence analysis that integrates multiple AutoML methods into a unified framework. Users are automatically provided with relevant techniques for analyzing, interpreting, and designing biological sequences. BioAutoMATED predicts gene regulation, peptide-drug interactions, and glycan annotation, and designs optimized synthetic biology components, revealing salient sequence characteristics. By automating sequence modeling, BioAutoMATED allows life scientists to incorporate ML more readily into their work.
Subject(s)
Algorithms , Machine Learning
ABSTRACT
While synthetic biology has revolutionized our approaches to medicine, agriculture, and energy, the design of completely novel biological circuit components beyond naturally-derived templates remains challenging due to poorly understood design rules. Toehold switches, which are programmable nucleic acid sensors, face an analogous design bottleneck; our limited understanding of how sequence impacts functionality often necessitates expensive, time-consuming screens to identify effective switches. Here, we introduce Sequence-based Toehold Optimization and Redesign Model (STORM) and Nucleic-Acid Speech (NuSpeak), two orthogonal and synergistic deep learning architectures to characterize and optimize toeholds. Applying techniques from computer vision and natural language processing, we 'un-box' our models using convolutional filters, attention maps, and in silico mutagenesis. Through transfer-learning, we redesign sub-optimal toehold sensors, even with sparse training data, experimentally validating their improved performance. This work provides sequence-to-function deep learning frameworks for toehold selection and design, augmenting our ability to construct potent biological circuit components and precision diagnostics.
Subject(s)
Biotechnology/methods , Deep Learning , Genetic Engineering/methods , Riboswitch/genetics , Synthetic Biology/methods , Base Sequence/genetics , Computer Simulation , Datasets as Topic , Genome, Human/genetics , Genome, Viral/genetics , Humans , Models, Genetic , Mutagenesis , Natural Language Processing , Structure-Activity Relationship
ABSTRACT
Global changes in bacterial gene expression can be orchestrated by the coordinated activation/deactivation of alternative sigma (σ) factor subunits of RNA polymerase. Sigma factors themselves are regulated in myriad ways, including via anti-sigma factors. Here, we have determined the solution structure of anti-sigma factor CsfB, responsible for inhibition of two alternative sigma factors, σG and σE, during spore formation by Bacillus subtilis. CsfB assembles into a symmetrical homodimer, with each monomer bound to a single Zn2+ ion via a treble-clef zinc finger fold. Directed mutagenesis indicates that dimer formation is critical for CsfB-mediated inhibition of both σG and σE, and we have characterized these interactions in vitro. This work represents an advance in our understanding of how CsfB mediates inhibition of two alternative sigma factors to drive developmental gene expression in a bacterium.