Results 1 - 20 of 51
1.
Open Biol ; 14(6): 230449, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38862018

ABSTRACT

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.


Subjects
5-Methylcytosine , DNA Methylation , Epigenesis, Genetic , Neural Networks, Computer , 5-Methylcytosine/analogs & derivatives , 5-Methylcytosine/chemistry , 5-Methylcytosine/metabolism , Nanopore Sequencing/methods , Nanopores , Humans , Markov Chains , DNA/chemistry , DNA/genetics
2.
Bioinformatics ; 2024 Jun 21.
Article in English | MEDLINE | ID: mdl-38905502

ABSTRACT

SUMMARY: The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased stability. The design of functional overlapping gene pairs is a challenging procedure and computational design tools are used to improve the efficiency to deploy successful designs in genetically engineered systems. GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high-performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome. This new software package can be used to design and test gene entanglements for microbial engineering projects using arbitrary sets of user specified gene pairs. AVAILABILITY AND IMPLEMENTATION: The GENTANGLE source code and its submodules are freely available on GitHub at https://github.com/BiosecSFA/gentangle. The DATANGLE (DATA for genTANGLE) repository contains related data and results, and is freely available on GitHub at https://github.com/BiosecSFA/datangle. The GENTANGLE container is freely available on Singularity Cloud Library at https://cloud.sylabs.io/library/khyox/gentangle/gentangle.sif. The GENTANGLE repository wiki (https://github.com/BiosecSFA/gentangle/wiki), website (https://biosecsfa.github.io/gentangle/) and user manual contain detailed instructions on how to use the different components of software and data, including examples and reproducing the results. The code is licensed under the GNU Affero General Public License version 3 (https://www.gnu.org/licenses/agpl.html). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.
Trends Biochem Sci ; 49(5): 457-469, 2024 May.
Article in English | MEDLINE | ID: mdl-38531696

ABSTRACT

Gene delivery vehicles based on adeno-associated viruses (AAVs) are enabling increasing success in human clinical trials, and they offer the promise of treating a broad spectrum of both genetic and non-genetic disorders. However, delivery efficiency and targeting must be improved to enable safe and effective therapies. In recent years, considerable effort has been invested in creating AAV variants with improved delivery, and computational approaches have been increasingly harnessed for AAV engineering. In this review, we discuss how computationally designed AAV libraries are enabling directed evolution. Specifically, we highlight approaches that harness sequences outputted by next-generation sequencing (NGS) coupled with machine learning (ML) to generate new functional AAV capsids and related regulatory elements, pushing the frontier of what vector engineering and gene therapy may achieve.


Subjects
Dependovirus , Gene Transfer Techniques , Dependovirus/genetics , Humans , Genetic Therapy/methods , Genetic Vectors/metabolism , Genetic Engineering , Animals , Computational Biology/methods
5.
Sci Adv ; 10(4): eadj3786, 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38266077

ABSTRACT

Adeno-associated viruses (AAVs) hold tremendous promise as delivery vectors for gene therapies. AAVs have been successfully engineered (for instance, for more efficient and/or cell-specific delivery to numerous tissues) by creating large, diverse starting libraries and selecting for desired properties. However, these starting libraries often contain a high proportion of variants unable to assemble or package their genomes, a prerequisite for any gene delivery goal. Here, we present and showcase a machine learning (ML) method for designing AAV peptide insertion libraries that achieve fivefold higher packaging fitness than the standard NNK library with negligible reduction in diversity. To demonstrate our ML-designed library's utility for downstream engineering goals, we show that it yields approximately 10-fold more successful variants than the NNK library after selection for infection of human brain tissue, leading to a promising glial-specific variant. Moreover, our design approach can be applied to other types of libraries for AAV and beyond.


Subjects
Dependovirus , Genetic Therapy , Humans , Dependovirus/genetics , Peptide Library , Brain , Machine Learning
7.
Article in English | MEDLINE | ID: mdl-38052497

ABSTRACT

Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.


Subjects
Machine Learning
8.
ACS Synth Biol ; 12(11): 3242-3251, 2023 Nov 17.
Article in English | MEDLINE | ID: mdl-37888887

ABSTRACT

Predicting properties of proteins is of interest for basic biological understanding and protein engineering alike. Increasingly, machine learning (ML) approaches are being used for this task. However, the accuracy of such ML models typically degrades as test proteins stray further from the training data distribution. On the other hand, models that are more data-free, such as biophysics-based models, are typically uniformly accurate over all of the protein space, even if inferior for test points close to the training distribution. Consequently, being able to cohesively blend these two types of information within one model, as appropriate in different parts of the protein space, will improve overall performance. Herein, we tackle just this problem to yield a simple, practical, and scalable approach that can be easily implemented. In particular, we use a Bayesian formulation to integrate biophysical knowledge into neural networks. However, in doing so, a technical challenge arises: Bayesian neural networks (BNNs) enable the user to specify prior information only on the neural network weight parameters, rather than on the function values given to us from a typical biophysics-based model. Consequently, we devise a principled probabilistic method to overcome this challenge. Our approach yields intuitively pleasing results: predictions rely more heavily on the biophysical prior information when the BNN epistemic uncertainty (uncertainty arising from a lack of training data rather than sensor noise) is large and more heavily on the neural network when the epistemic uncertainty is small. We demonstrate this approach on an illustrative synthetic example, on two examples of protein property prediction (fluorescence and binding), and for generality on one small molecule property prediction problem.


Subjects
Machine Learning , Neural Networks, Computer , Bayes Theorem , Proteins
9.
Genome Biol ; 24(1): 218, 2023 Oct 02.
Article in English | MEDLINE | ID: mdl-37784130

ABSTRACT

Characterizing differences in sequences between two conditions, such as with and without drug exposure, using high-throughput sequencing data is a prevalent problem involving quantifying changes in sequence abundances, and predicting such differences for unobserved sequences. A key shortcoming of current approaches is their extremely limited ability to share information across related but non-identical reads. Consequently, they cannot use sequencing data effectively, nor be directly applied in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. We evaluate MBE using both simulated and real data. Overall, MBE improves accuracy compared to current differential analysis methods.

10.
Proc Natl Acad Sci U S A ; 119(43): e2204569119, 2022 Oct 25.
Article in English | MEDLINE | ID: mdl-36256807

ABSTRACT

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting: one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.


Subjects
Algorithms , Machine Learning , Feedback , Uncertainty , Molecular Conformation
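
As a point of reference for the confidence sets described in the abstract above, here is a minimal sketch of standard split conformal prediction. It is not the authors' method: it assumes calibration and test data are exchangeable, which is the very assumption the abstract says fails in the design setting. The function name, the absolute-residual score, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Standard split conformal interval around a point prediction.

    Assumes exchangeable calibration and test points (NOT the dependent
    design setting discussed in the abstract).
    """
    # Nonconformity score: absolute residual on held-out calibration data.
    scores = np.abs(np.asarray(cal_labels, dtype=float)
                    - np.asarray(cal_preds, dtype=float))
    n = scores.size
    # Rank of the finite-sample-corrected quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: the interval is trivial.
        return -np.inf, np.inf
    q = np.sort(scores)[k - 1]
    return test_pred - q, test_pred + q

# Toy calibration set: predictions of 0 with residuals 1, 2, ..., 10.
lower, upper = split_conformal_interval(
    cal_preds=np.zeros(10),
    cal_labels=np.arange(1, 11),
    test_pred=5.0,
    alpha=0.25,
)
print(lower, upper)
```

With very few calibration points the function returns an unbounded interval, which is the honest answer when the requested coverage cannot be certified.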
11.
Nat Biotechnol ; 40(7): 1114-1122, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35039677

ABSTRACT

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.


Subjects
Machine Learning , Proteins , Proteins/chemistry , Proteins/genetics
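
The combination described in the abstract above (ridge regression on site-specific amino acid features plus one density feature) might be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the one-hot encoding, the closed-form solver without an intercept, the `alpha` value, and the density scores (which stand in for log-likelihoods from some evolutionary model) are all hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Site-specific features: one indicator per (position, amino acid)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

def build_features(seqs, density_scores):
    """Append one density feature per sequence to the one-hot site features."""
    site = np.stack([one_hot(s) for s in seqs])
    dens = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([site, dens])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge: w = (X^T X + alpha I)^{-1} X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Toy data: equal-length variants, hypothetical density scores and labels.
seqs = ["ACD", "ACE", "GCD", "GCE"]
density = [-1.0, -2.0, -1.5, -3.0]
fitness = np.array([1.0, 0.5, 0.8, 0.1])

X = build_features(seqs, density)
w = fit_ridge(X, fitness, alpha=0.1)
preds = X @ w
print(preds)
```

In practice the regularization strength would be chosen by cross-validation, and the density feature could be given its own penalty; both choices are left out here for brevity.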
12.
Proc Natl Acad Sci U S A ; 119(1), 2022 Jan 04.
Article in English | MEDLINE | ID: mdl-34937698

ABSTRACT

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters (sequence length, alphabet size, and assumed interactions between sequence positions) on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.


Subjects
Epistasis, Genetic , Learning/physiology , Algorithms , Models, Theoretical
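
To make "sparsity when represented in terms of epistatic interactions" in the abstract above concrete, here is a minimal sketch for the simplest case of a binary alphabet, where the epistatic coefficients are given by a Walsh-Hadamard transform. The binary alphabet, the toy fitness function, and all names are illustrative assumptions; this is not the GNK framework itself.

```python
import numpy as np

def walsh_hadamard(f_values):
    """Normalized fast Walsh-Hadamard transform of a length-2^L vector.

    Entry m of the result is the epistatic coefficient for the set of
    positions whose bits are set in m.
    """
    a = np.array(f_values, dtype=float)
    n = a.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a / n

def spin(i, k):
    """+1 if bit k of sequence index i is 0, else -1."""
    return 1 - 2 * ((i >> k) & 1)

# Toy fitness on L = 3 binary sites: a constant term, one first-order
# effect (site 0), and one pairwise epistatic term (sites 1 and 2).
L = 3
fitness = [1.0 + 2.0 * spin(i, 0) + 3.0 * spin(i, 1) * spin(i, 2)
           for i in range(2 ** L)]

coeffs = walsh_hadamard(fitness)
support = np.flatnonzero(np.abs(coeffs) > 1e-9)
print(support, coeffs[support])
```

Only 3 of the 8 coefficients are nonzero here; the size of that support is the kind of quantity that compressed-sensing sample bounds depend on.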
13.
Nat Commun ; 12(1): 5225, 2021 Sep 01.
Article in English | MEDLINE | ID: mdl-34471113

ABSTRACT

Despite recent advances in high-throughput combinatorial mutagenesis assays, the number of labeled sequences available to predict molecular functions has remained small relative to the vastness of the sequence space combined with the ruggedness of many fitness functions. While deep neural networks (DNNs) can capture high-order epistatic interactions among the mutational sites, they tend to overfit to the small number of labeled sequences available for training. Here, we developed Epistatic Net (EN), a method for spectral regularization of DNNs that exploits evidence that epistatic interactions in many fitness functions are sparse. We built a scalable extension of EN, usable for larger sequences, which enables spectral regularization using fast sparse recovery algorithms informed by coding theory. Results on several biological landscapes show that EN consistently improves the prediction accuracy of DNNs and enables them to outperform competing models which assume other priors. EN estimates the higher-order epistatic interactions of DNNs trained on massive sequence spaces, a computational problem that otherwise takes years to solve.


Subjects
Algorithms , Neural Networks, Computer , Bacteria , Green Fluorescent Proteins
14.
Nat Rev Drug Discov ; 19(5): 353-364, 2020 May.
Article in English | MEDLINE | ID: mdl-31801986

ABSTRACT

Artificial intelligence (AI) tools are increasingly being applied in drug discovery. While some protagonists point to vast opportunities potentially offered by such tools, others remain sceptical, waiting for a clear impact to be shown in drug discovery projects. The reality is probably somewhere in-between these extremes, yet it is clear that AI is providing new challenges not only for the scientists involved but also for the biopharma industry and its established processes for discovering and developing new medicines. This article presents the views of a diverse group of international experts on the 'grand challenges' in small-molecule drug discovery with AI and the approaches to address them.


Subjects
Artificial Intelligence , Drug Design , Drug Discovery/methods , Humans
15.
Nat Biomed Eng ; 2(1): 38-47, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29998038

ABSTRACT

The CRISPR-Cas9 system provides unprecedented genome editing capabilities. However, off-target effects lead to sub-optimal usage and additionally are a bottleneck in the development of therapeutic uses. Herein, we introduce the first machine learning-based approach to off-target prediction, yielding a state-of-the-art model for CRISPR-Cas9 that outperforms all other guide design services. Our approach, Elevation, consists of two interdependent machine learning models: one for scoring individual guide-target pairs, and another that aggregates these guide-target scores into a single, overall summary guide score. Through systematic investigation, we demonstrate that Elevation performs substantially better than competing approaches on both tasks. Additionally, we are the first to systematically evaluate approaches on the guide summary score problem; we show that the most widely used method performs no better than random at times, whereas Elevation consistently outperformed it, sometimes by an order of magnitude. We also introduce an evaluation method that balances errors between active and inactive guides, thereby encapsulating a range of practical use cases; Elevation is consistently superior to other methods across the entire range. Finally, because of the large scale and computational demands of off-target prediction, we have developed a cloud-based service for quick retrieval. This service provides end-to-end guide design by also incorporating our previously reported on-target model, Azimuth (https://crispr.ml).

16.
Nat Biotechnol ; 36(2): 179-189, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29251726

ABSTRACT

Combinatorial genetic screening using CRISPR-Cas9 is a useful approach to uncover redundant genes and to explore complex gene networks. However, current methods suffer from interference between the single-guide RNAs (sgRNAs) and from limited gene targeting activity. To increase the efficiency of combinatorial screening, we employ orthologous Cas9 enzymes from Staphylococcus aureus and Streptococcus pyogenes. We used machine learning to establish S. aureus Cas9 sgRNA design rules and paired S. aureus Cas9 with S. pyogenes Cas9 to achieve dual targeting in a high fraction of cells. We also developed a lentiviral vector and cloning strategy to generate high-complexity pooled dual-knockout libraries to identify synthetic lethal and buffering gene pairs across multiple cell types, including MAPK pathway genes and apoptotic genes. Our orthologous approach also enabled a screen combining gene knockouts with transcriptional activation, which revealed genetic interactions with TP53. The "Big Papi" (paired aureus and pyogenes for interactions) approach described here will be widely applicable for the study of combinatorial phenotypes.


Subjects
CRISPR-Cas Systems/genetics , Epistasis, Genetic/genetics , Genetic Testing , RNA, Guide, Kinetoplastida/genetics , Apoptosis/genetics , Gene Knockout Techniques , Gene Targeting , Humans , Machine Learning , Mitogen-Activated Protein Kinase Kinases/genetics , Signal Transduction/genetics , Staphylococcus aureus/genetics , Streptococcus pyogenes/genetics , Tumor Suppressor Protein p53/genetics
17.
J Comput Biol ; 24(6): 524-535, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28056190

ABSTRACT

Genome-wide association studies commonly examine one trait at a time. Occasionally they examine several related traits with the hope of increasing power; in such a setting, the traits are not generally smoothly varying in any way such as time or space. However, for function-valued traits, the trait is often smoothly varying along the axis of interest, such as space or time. For instance, in the case of longitudinal traits such as growth curves, the axis of interest is time; for spatially varying traits such as chromatin accessibility, it would be position along the genome. Although there have been efforts to perform genome-wide association studies with such function-valued traits, the statistical approaches developed for this purpose often have limitations such as requiring the trait to behave linearly in time or space, or constraining the genetic effect itself to be constant or linear in time. Herein, we present a flexible model for this problem, the Partitioned Gaussian Process, which removes many such limitations and is especially effective as the number of time points increases. The theoretical basis of this model provides machinery for handling missing and unaligned function values such as would occur when not all individuals are measured at the same time points. Furthermore, we make use of algebraic refactorizations to substantially reduce the time complexity of our model beyond the naive implementation. Finally, we apply our approach and several others to synthetic data before closing with some directions for improved modeling and statistical testing.


Subjects
Genome-Wide Association Study/methods , Models, Genetic , Models, Statistical , Quantitative Trait, Heritable , Sequence Analysis, DNA/methods , Computer Simulation , Humans , Normal Distribution , Statistics, Nonparametric
18.
Nat Biotechnol ; 34(2): 184-191, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26780180

ABSTRACT

CRISPR-Cas9-based genetic screens are a powerful new tool in biology. By simply altering the sequence of the single-guide RNA (sgRNA), one can reprogram Cas9 to target different sites in the genome with relative ease, but the on-target activity and off-target effects of individual sgRNAs can vary widely. Here, we use recently devised sgRNA design rules to create human and mouse genome-wide libraries, perform positive and negative selection screens and observe that the use of these rules produced improved results. Additionally, we profile the off-target activity of thousands of sgRNAs and develop a metric to predict off-target sites. We incorporate these findings from large-scale, empirical data to improve our computational design rules and create optimized sgRNA libraries that maximize on-target activity and minimize off-target effects to enable more effective and efficient genetic screens and genome engineering.


Subjects
CRISPR-Cas Systems/genetics , Genetic Engineering/methods , Genomics/methods , RNA, Guide, Kinetoplastida/genetics , Animals , Cell Line, Tumor , Drug Resistance/genetics , Gene Library , Genome/genetics , Humans , Mice
19.
Pac Symp Biocomput ; : 342-6, 2015.
Article in English | MEDLINE | ID: mdl-25592594

ABSTRACT

Advances in molecular profiling and sensor technologies are expanding the scope of personalized medicine beyond genotypes, providing new opportunities for developing richer and more dynamic multi-scale models of individual health. Recent studies demonstrate the value of scoring high-dimensional microbiome, immune, and metabolic traits from individuals to inform personalized medicine. Efforts to integrate multiple dimensions of clinical and molecular data towards predictive multi-scale models of individual health and wellness are already underway. Improved methods for mining and discovery of clinical phenotypes from electronic medical records and technological developments in wearable sensor technologies present new opportunities for mapping and exploring the critical yet poorly characterized "phenome" and "envirome" dimensions of personalized medicine. There are ambitious new projects underway to collect multi-scale molecular, sensor, clinical, behavioral, and environmental data streams from large population cohorts longitudinally to enable more comprehensive and dynamic models of individual biology and personalized health. Personalized medicine stands to benefit from inclusion of rich new sources and dimensions of data. However, realizing these improvements in care relies upon novel informatics methodologies, tools, and systems to make full use of these data to advance both the science and translational applications of personalized medicine.


Subjects
Precision Medicine/trends , Computational Biology , Genotype , Humans , Patient-Specific Modeling , Phenotype , Precision Medicine/statistics & numerical data , Systems Biology
20.
Sci Rep ; 4: 6874, 2014 Nov 12.
Article in English | MEDLINE | ID: mdl-25387525

ABSTRACT

We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.


Subjects
Genome-Wide Association Study/statistics & numerical data , Linear Models , Polymorphism, Single Nucleotide , Software , Algorithms , Animals , Genotype , Humans , Mice , Models, Genetic , Phenotype
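
The genetic similarity matrix (GSM) that the abstract above builds on, and the mixture of two component GSMs it describes, might be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' software: the standardized-genotype estimator is one common convention, and the random genotypes, the "selected" SNP indices, and the mixing weight are hypothetical placeholders.

```python
import numpy as np

def genetic_similarity_matrix(genotypes):
    """Pairwise similarity from an (individuals x SNPs) allele-count matrix.

    Each SNP column is standardized, then K = Z Z^T / (number of SNPs).
    """
    G = np.asarray(genotypes, dtype=float)
    mean = G.mean(axis=0)
    std = G.std(axis=0)
    keep = std > 0  # drop SNPs with no variation in this sample
    Z = (G[:, keep] - mean[keep]) / std[keep]
    return Z @ Z.T / Z.shape[1]

def mixture_gsm(K_all, K_selected, weight):
    """Convex combination of two component GSMs (weight is an assumption)."""
    return (1.0 - weight) * K_all + weight * K_selected

# Toy cohort: 6 individuals, 50 SNPs coded as 0/1/2 minor-allele counts.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(6, 50))

# Hypothetical subset standing in for SNPs that predict the phenotype well.
selected = np.arange(10)

K_all = genetic_similarity_matrix(G)
K_sel = genetic_similarity_matrix(G[:, selected])
K_mix = mixture_gsm(K_all, K_sel, weight=0.3)
print(np.round(K_mix, 2))
```

In a real analysis the selected SNPs and the mixing weight would be chosen from the data, for example by out-of-sample prediction or likelihood, and the resulting matrix would serve as the covariance structure of the linear mixed model.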