Results 1 - 20 of 51
1.
Trends Biochem Sci ; 49(5): 457-469, 2024 May.
Article in English | MEDLINE | ID: mdl-38531696

ABSTRACT

Gene delivery vehicles based on adeno-associated viruses (AAVs) are enabling increasing success in human clinical trials, and they offer the promise of treating a broad spectrum of both genetic and non-genetic disorders. However, delivery efficiency and targeting must be improved to enable safe and effective therapies. In recent years, considerable effort has been invested in creating AAV variants with improved delivery, and computational approaches have been increasingly harnessed for AAV engineering. In this review, we discuss how computationally designed AAV libraries are enabling directed evolution. Specifically, we highlight approaches that harness sequences outputted by next-generation sequencing (NGS) coupled with machine learning (ML) to generate new functional AAV capsids and related regulatory elements, pushing the frontier of what vector engineering and gene therapy may achieve.


Subjects
Dependovirus , Gene Transfer Techniques , Dependovirus/genetics , Humans , Genetic Therapy/methods , Genetic Vectors/metabolism , Genetic Engineering , Animals , Computational Biology/methods
2.
Bioinformatics ; 40(7), 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38905502

ABSTRACT

SUMMARY: The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased stability. The design of functional overlapping gene pairs is a challenging procedure, and computational design tools are used to improve the efficiency to deploy successful designs in genetically engineered systems. GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high-performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome. This new software package can be used to design and test gene entanglements for microbial engineering projects using arbitrary sets of user-specified gene pairs. AVAILABILITY AND IMPLEMENTATION: The GENTANGLE source code and its submodules are freely available on GitHub at https://github.com/BiosecSFA/gentangle. The DATANGLE (DATA for genTANGLE) repository contains related data and results and is freely available on GitHub at https://github.com/BiosecSFA/datangle. The GENTANGLE container is freely available on Singularity Cloud Library at https://cloud.sylabs.io/library/khyox/gentangle/gentangle.sif. The GENTANGLE repository wiki (https://github.com/BiosecSFA/gentangle/wiki), website (https://biosecsfa.github.io/gentangle/), and user manual contain detailed instructions on how to use the different components of software and data, including examples and reproducing the results. The code is licensed under the GNU Affero General Public License version 3 (https://www.gnu.org/licenses/agpl.html).


Subjects
Software , Computational Biology/methods , Microbial Genome , Genetic Engineering/methods
3.
Proc Natl Acad Sci U S A ; 119(1), 2022 Jan 04.
Article in English | MEDLINE | ID: mdl-34937698

ABSTRACT

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters (sequence length, alphabet size, and assumed interactions between sequence positions) on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.


Subjects
Genetic Epistasis , Learning/physiology , Algorithms , Theoretical Models
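The random field model described in this abstract can be illustrated with a small sketch. The code below samples one fitness function in the spirit of a GNK-style model: each position contributes a random value that depends only on the states in its neighborhood, and the fitness of a sequence is the sum of those contributions. The function name `sample_gnk`, the standard-normal draws, and the cyclic neighborhoods are assumptions made for illustration and are not taken from the paper.

```python
import itertools
import random

def sample_gnk(neighborhoods, alphabet_size, seed=0):
    """Sample one fitness function from a GNK-style random field model.

    neighborhoods[j] is the tuple of positions that interact with position j
    (including j itself). Each position gets its own lookup table holding one
    independent standard-normal value per assignment of its neighborhood.
    """
    rng = random.Random(seed)
    tables = []
    for hood in neighborhoods:
        table = {
            assignment: rng.gauss(0.0, 1.0)
            for assignment in itertools.product(range(alphabet_size), repeat=len(hood))
        }
        tables.append(table)

    def fitness(sequence):
        # Sum the local contributions of all positions.
        return sum(
            table[tuple(sequence[i] for i in hood)]
            for hood, table in zip(neighborhoods, tables)
        )

    return fitness, tables

# Hypothetical example: length-4 binary sequences in which every position
# interacts with its right-hand neighbor (cyclically), so neighborhoods have size 2.
L, q = 4, 2
neighborhoods = [(j, (j + 1) % L) for j in range(L)]
fitness, tables = sample_gnk(neighborhoods, q, seed=1)

# Enumerate the full landscape and find its global optimum.
landscape = {s: fitness(s) for s in itertools.product(range(q), repeat=L)}
best = max(landscape, key=landscape.get)
print(best, round(landscape[best], 3))
```

Larger or denser neighborhoods make each lookup table bigger, which is one informal way to see why the assumed interactions between positions affect how many measurements are needed to recover the function.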
4.
Proc Natl Acad Sci U S A ; 119(43): e2204569119, 2022 Oct 25.
Article in English | MEDLINE | ID: mdl-36256807

ABSTRACT

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting, one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.


Subjects
Algorithms , Machine Learning , Feedback , Uncertainty , Molecular Conformation
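For orientation, the sketch below shows standard split conformal prediction, the simplest way to build a confidence set with a finite-sample guarantee. It is not the method of this abstract: split conformal assumes that calibration and test points are exchangeable, which is exactly the assumption that breaks when test inputs are chosen using the trained model. The function name and the example residuals are hypothetical.

```python
import math

def split_conformal_interval(calibration_residuals, prediction, alpha=0.1):
    """Return a (lower, upper) interval around `prediction`.

    calibration_residuals: absolute errors |y_i - f(x_i)| on held-out
    calibration points that were NOT used to fit the model f.
    Under exchangeability of calibration and test points, the interval
    covers the true label with probability at least 1 - alpha.
    """
    n = len(calibration_residuals)
    # Rank of the conformal quantile among the sorted residuals.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the set is unbounded.
        return (-math.inf, math.inf)
    radius = sorted(calibration_residuals)[rank - 1]
    return (prediction - radius, prediction + radius)

# Made-up calibration residuals, for illustration only.
residuals = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
interval = split_conformal_interval(residuals, prediction=2.0, alpha=0.2)
print(interval)
```

The design setting in the abstract needs a modified construction because the designed sequences depend on the training data; this baseline only shows what a finite-sample confidence set looks like in the exchangeable case.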
5.
Nat Methods ; 11(3): 309-11, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24464286

ABSTRACT

In epigenome-wide association studies, cell-type composition often differs between cases and controls, yielding associations that simply tag cell type rather than reveal fundamental biology. Current solutions require actual or estimated cell-type composition, information that is not easily obtainable for many samples of interest. We propose a method, FaST-LMM-EWASher, that automatically corrects for cell-type composition without the need for explicit knowledge of it, and then validate our method by comparison with the state-of-the-art approach. Corresponding software is available from http://www.microsoft.com/science/.


Subjects
Cells , Epigenomics , Genome-Wide Association Study , Humans , Linear Models
6.
Bioinformatics ; 30(22): 3206-14, 2014 Nov 15.
Article in English | MEDLINE | ID: mdl-25075117

ABSTRACT

MOTIVATION: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test, the score test, with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene-gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. RESULTS: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene-gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. AVAILABILITY: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. CONTACT: heckerma@microsoft.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genetic Association Studies/methods , Genetic Variation , Algorithms , Statistical Data Interpretation , Humans , Likelihood Functions , Phenotype , Single Nucleotide Polymorphism
7.
Nucleic Acids Res ; 41(4): 2095-104, 2013 Feb 01.
Article in English | MEDLINE | ID: mdl-23303775

ABSTRACT

DNA methylation has been implicated in a number of diseases and other phenotypes. It is, therefore, of interest to identify and understand the genetic determinants of methylation and epigenomic variation. We investigated the extent to which genetic variation in cis-DNA sequence explains variation in CpG dinucleotide methylation in publicly available data for four brain regions from unrelated individuals, finding that 3-4% of CpG loci assayed were heritable, with a mean estimated narrow-sense heritability of 30% over the heritable loci. Over all loci, the mean estimated heritability was 3%, as compared with a recent twin-based study reporting 18%. Heritable loci were enriched for open chromatin regions and binding sites of CTCF, an influential regulator of transcription and chromatin architecture. Additionally, heritable loci were proximal to genes enriched in several known pathways, suggesting a possible functional role for these loci. Our estimates of heritability are conservative, and we suspect that the number of identified heritable loci will increase as the methylome is assayed across a broader range of cell types and the density of the tested loci is increased. Finally, we show that the number of heritable loci depends on the window size parameter commonly used to identify candidate cis-acting single-nucleotide polymorphism variants.


Subjects
Brain/metabolism , DNA Methylation , Quantitative Trait Loci , Heritable Quantitative Trait , CpG Islands , DNA/chemistry , Humans , Single Nucleotide Polymorphism , Nucleic Acid Regulatory Sequences
8.
Nat Methods ; 8(10): 833-5, 2011 Sep 04.
Article in English | MEDLINE | ID: mdl-21892150

ABSTRACT

We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (http://mscompbio.codeplex.com/).


Subjects
Genome-Wide Association Study , Genetic Models , Algorithms , Computer Simulation , Software
9.
Bioinformatics ; 29(12): 1526-33, 2013 Jun 15.
Article in English | MEDLINE | ID: mdl-23599503

ABSTRACT

MOTIVATION: Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power. RESULTS: We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15 000 individual Crohn's disease case-control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. AVAILABILITY: A Python-based library implementing our approach is available at http://mscompbio.codeplex.com.


Subjects
Genetic Markers , Genome-Wide Association Study/methods , Algorithms , Case-Control Studies , Crohn Disease/genetics , Humans , Linear Models , Phenotype , Single Nucleotide Polymorphism
10.
Article in English | MEDLINE | ID: mdl-38052497

ABSTRACT

Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.


Subjects
Machine Learning
11.
Open Biol ; 14(6): 230449, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38862018

ABSTRACT

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.


Subjects
5-Methylcytosine , DNA Methylation , Genetic Epigenesis , Computer Neural Networks , 5-Methylcytosine/analogs & derivatives , 5-Methylcytosine/chemistry , 5-Methylcytosine/metabolism , Nanopore Sequencing/methods , Nanopores , Humans , Markov Chains , DNA/chemistry , DNA/genetics
12.
Sci Adv ; 10(4): eadj3786, 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38266077

ABSTRACT

Adeno-associated viruses (AAVs) hold tremendous promise as delivery vectors for gene therapies. AAVs have been successfully engineered (for instance, for more efficient and/or cell-specific delivery to numerous tissues) by creating large, diverse starting libraries and selecting for desired properties. However, these starting libraries often contain a high proportion of variants unable to assemble or package their genomes, a prerequisite for any gene delivery goal. Here, we present and showcase a machine learning (ML) method for designing AAV peptide insertion libraries that achieve fivefold higher packaging fitness than the standard NNK library with negligible reduction in diversity. To demonstrate our ML-designed library's utility for downstream engineering goals, we show that it yields approximately 10-fold more successful variants than the NNK library after selection for infection of human brain tissue, leading to a promising glial-specific variant. Moreover, our design approach can be applied to other types of libraries for AAV and beyond.


Subjects
Dependovirus , Genetic Therapy , Humans , Dependovirus/genetics , Peptide Library , Brain , Machine Learning
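The "standard NNK library" used as the baseline in this abstract refers to the degenerate codon NNK, where N stands for any base and K for G or T in IUPAC notation. As a small sketch of what that baseline samples from, the code below enumerates the concrete codons matched by NNK and translates them with the standard genetic code. The variable and function names are illustrative only, and nothing here reproduces the paper's ML-designed library.

```python
from itertools import product

# Standard genetic code, indexed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC degenerate codes used by NNK: N = any base, K = G or T.
NNK_POSITIONS = ("ACGT", "ACGT", "GT")

def nnk_codons():
    """Return every concrete codon matched by the degenerate codon NNK."""
    return ["".join(c) for c in product(*NNK_POSITIONS)]

codons = nnk_codons()
amino_acids = sorted({CODON_TABLE[c] for c in codons} - {"*"})
stop_codons = [c for c in codons if CODON_TABLE[c] == "*"]

print(len(codons), "codons,", len(amino_acids), "amino acids, stops:", stop_codons)
# → 32 codons, 20 amino acids, stops: ['TAG']
```

Because NNK treats every position of a peptide insertion independently and nearly uniformly, it makes no attempt to avoid sequences that fail to package, which is the gap the abstract's designed libraries aim to close.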
13.
J Virol ; 86(9): 5230-43, 2012 May.
Article in English | MEDLINE | ID: mdl-22379086

ABSTRACT

The promiscuous presentation of epitopes by similar HLA class I alleles holds promise for a universal T-cell-based HIV-1 vaccine. However, in some instances, cytotoxic T lymphocytes (CTL) restricted by HLA alleles with similar or identical binding motifs are known to target epitopes at different frequencies, with different functional avidities and with different apparent clinical outcomes. Such differences may be illuminated by the association of similar HLA alleles with distinctive escape pathways. Using a novel computational method featuring phylogenetically corrected odds ratios, we systematically analyzed differential patterns of immune escape across all optimally defined epitopes in Gag, Pol, and Nef in 2,126 HIV-1 clade C-infected adults. Overall, we identified 301 polymorphisms in 90 epitopes associated with HLA alleles belonging to shared supertypes. We detected differential escape in 37 of 38 epitopes restricted by more than one allele, which included 278 instances of differential escape at the polymorphism level. The majority (66 to 97%) of these resulted from the selection of unique HLA-specific polymorphisms rather than differential epitope targeting rates, as confirmed by gamma interferon (IFN-γ) enzyme-linked immunosorbent spot assay (ELISPOT) data. Discordant associations between HLA alleles and viral load were frequently observed between allele pairs that selected for differential escape. Furthermore, the total number of associated polymorphisms strongly correlated with average viral load. These studies confirm that differential escape is a widespread phenomenon and may be the norm when two alleles present the same epitope. Given the clinical correlates of immune escape, such heterogeneity suggests that certain epitopes will lead to discordant outcomes if applied universally in a vaccine.


Subjects
HIV Infections/genetics , HIV Infections/immunology , HIV-1/immunology , HLA Antigens/genetics , HLA Antigens/immunology , Immune Evasion/genetics , Alleles , Epitopes/genetics , Epitopes/immunology , Gene Expression , HIV Infections/virology , Humans , Mutation , Genetic Polymorphism , Viral Load
14.
J Virol ; 86(24): 13202-16, 2012 Dec.
Article in English | MEDLINE | ID: mdl-23055555

ABSTRACT

HLA class I-associated polymorphisms identified at the population level mark viral sites under immune pressure by individual HLA alleles. As such, analysis of their distribution, frequency, location, statistical strength, sequence conservation, and other properties offers a unique perspective from which to identify correlates of protective cellular immunity. We analyzed HLA-associated HIV-1 subtype B polymorphisms in 1,888 treatment-naïve, chronically infected individuals using phylogenetically informed methods and identified characteristics of HLA-associated immune pressures that differentiate protective and nonprotective alleles. Over 2,100 HLA-associated HIV-1 polymorphisms were identified, approximately one-third of which occurred inside or within 3 residues of an optimally defined cytotoxic T-lymphocyte (CTL) epitope. Differential CTL escape patterns between closely related HLA alleles were common and increased with greater evolutionary distance between allele group members. Among 9-mer epitopes, mutations at HLA-specific anchor residues represented the most frequently detected escape type: these occurred nearly 2-fold more frequently than expected by chance and were computationally predicted to reduce peptide-HLA binding nearly 10-fold on average. Characteristics associated with protective HLA alleles (defined using hazard ratios for progression to AIDS from natural history cohorts) included the potential to mount broad immune selection pressures across all HIV-1 proteins except Nef, the tendency to drive multisite and/or anchor residue escape mutations within known CTL epitopes, and the ability to strongly select mutations in conserved regions within HIV's structural and functional proteins. Thus, the factors defining protective cellular immune responses may be more complex than simply targeting conserved viral regions. The results provide new information to guide vaccine design and immunogenicity studies.


Subjects
HIV-1/immunology , Immune Evasion , Cellular Immunity , Alleles , Epitopes/immunology , HLA Antigens/genetics , Humans
15.
J Immunol ; 186(10): 5675-86, 2011 May 15.
Article in English | MEDLINE | ID: mdl-21498667

ABSTRACT

The potential contribution of HLA-A alleles to viremic control in chronic HIV type 1 (HIV-1) infection has been relatively understudied compared with HLA-B. In these studies, we show that HLA-A*7401 is associated with favorable viremic control in extended southern African cohorts of >2100 C-clade-infected subjects. We present evidence that HLA-A*7401 operates an effect that is independent of HLA-B*5703, with which it is in linkage disequilibrium in some populations, to mediate lowered viremia. We describe a novel statistical approach to detecting additive effects between class I alleles in control of HIV-1 disease, highlighting improved viremic control in subjects with HLA-A*7401 combined with HLA-B*57. In common with HLA-B alleles that are associated with effective control of viremia, HLA-A*7401 presents highly targeted epitopes in several proteins, including Gag, Pol, Rev, and Nef, of which the Gag epitopes appear immunodominant. We identify eight novel putative HLA-A*7401-restricted epitopes, of which three have been defined to the optimal epitope. In common with HLA-B alleles linked with slow progression, viremic control through an HLA-A*7401-restricted response appears to be associated with the selection of escape mutants within Gag epitopes that reduce viral replicative capacity. These studies highlight the potentially important contribution of an HLA-A allele to immune control of HIV infection, which may have been concealed by a stronger effect mediated by an HLA-B allele with which it is in linkage disequilibrium. In addition, these studies identify a factor contributing to different HIV disease outcomes in individuals expressing HLA-B*5703.


Subjects
HIV Infections/immunology , HIV-1/immunology , HLA-A Antigens/genetics , HLA-B Antigens/genetics , Viremia/immunology , Africa , Alleles , CD4 Lymphocyte Count , CD8-Positive T-Lymphocytes/immunology , T-Lymphocyte Epitopes/genetics , T-Lymphocyte Epitopes/immunology , Female , Flow Cytometry , HIV Infections/genetics , HIV Infections/virology , HIV-1/genetics , HLA-A Antigens/immunology , HLA-B Antigens/immunology , Humans , Linkage Disequilibrium , Molecular Sequence Data , Protein Sequence Analysis , Viral Load , Human Immunodeficiency Virus gag Gene Products/immunology , Human Immunodeficiency Virus nef Gene Products/immunology , Human Immunodeficiency Virus pol Gene Products/immunology , Human Immunodeficiency Virus rev Gene Products/immunology
16.
Proc Natl Acad Sci U S A ; 107(38): 16465-70, 2010 Sep 21.
Article in English | MEDLINE | ID: mdl-20810919

ABSTRACT

Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. One way of getting at such an understanding is to find out which parts of our DNA, such as single-nucleotide polymorphisms, affect particular intermediary processes such as gene expression. Naively, such associations can be identified using a simple statistical test on all paired combinations of genetic variants and gene transcripts. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. We present a statistical model that jointly corrects for two particular kinds of hidden structure, namely population structure (e.g., race, family relatedness) and microarray expression artifacts (e.g., batch effects), when these confounders are unknown. Applying our method to both real and synthetic, human and mouse data, we demonstrate the need for such a joint correction of confounders, and also the disadvantages of other possible approaches based on those in the current literature. In particular, we show that our class of models has maximum power to detect eQTL on synthetic data, and has the best performance on a bronze standard applied to real data. Lastly, our software and the associations we found with it are available at http://www.microsoft.com/science.


Subjects
Gene Expression , Genomics/statistics & numerical data , Animals , Genetic Databases , Genome-Wide Association Study/statistics & numerical data , Humans , Linear Models , Mice , Genetic Models , Statistical Models , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Single Nucleotide Polymorphism , Quantitative Trait Loci , Software
17.
Genome Biol ; 24(1): 218, 2023 Oct 02.
Article in English | MEDLINE | ID: mdl-37784130

ABSTRACT

Characterizing differences in sequences between two conditions, such as with and without drug exposure, using high-throughput sequencing data is a prevalent problem involving quantifying changes in sequence abundances, and predicting such differences for unobserved sequences. A key shortcoming of current approaches is their extremely limited ability to share information across related but non-identical reads. Consequently, they cannot use sequencing data effectively, nor be directly applied in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. We evaluate MBE using both simulated and real data. Overall, MBE improves accuracy compared to current differential analysis methods.

18.
ACS Synth Biol ; 12(11): 3242-3251, 2023 Nov 17.
Article in English | MEDLINE | ID: mdl-37888887

ABSTRACT

Predicting properties of proteins is of interest for basic biological understanding and protein engineering alike. Increasingly, machine learning (ML) approaches are being used for this task. However, the accuracy of such ML models typically degrades as test proteins stray further from the training data distribution. On the other hand, models that are more data-free, such as biophysics-based models, are typically uniformly accurate over all of the protein space, even if inferior for test points close to the training distribution. Consequently, being able to cohesively blend these two types of information within one model, as appropriate in different parts of the protein space, will improve overall performance. Herein, we tackle just this problem to yield a simple, practical, and scalable approach that can be easily implemented. In particular, we use a Bayesian formulation to integrate biophysical knowledge into neural networks. However, in doing so, a technical challenge arises: Bayesian neural networks (BNNs) enable the user to specify prior information only on the neural network weight parameters, rather than on the function values given to us from a typical biophysics-based model. Consequently, we devise a principled probabilistic method to overcome this challenge. Our approach yields intuitively pleasing results: predictions rely more heavily on the biophysical prior information when the BNN epistemic uncertainty (uncertainty arising from a lack of training data rather than sensor noise) is large and more heavily on the neural network when the epistemic uncertainty is small. We demonstrate this approach on an illustrative synthetic example, on two examples of protein property prediction (fluorescence and binding), and for generality on one small molecule property prediction problem.


Subjects
Machine Learning , Computer Neural Networks , Bayes Theorem , Proteins
19.
Nat Biotechnol ; 40(7): 1114-1122, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35039677

ABSTRACT

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.


Subjects
Machine Learning , Proteins , Proteins/chemistry , Proteins/genetics
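The feature construction described in this abstract can be sketched as follows: each site of a variant is one-hot encoded over the 20 amino acids, and a single scalar from an evolutionary density model is appended. This is a minimal illustration under stated assumptions; the function name `augmented_features`, the example sequence, and the density score are hypothetical, and the density model itself is not implemented here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def augmented_features(sequence, density_score):
    """One-hot encode each site, then append one evolutionary density feature.

    density_score is assumed to be a scalar such as the log-likelihood of the
    sequence under some evolutionary density model (a placeholder input here).
    """
    features = [0.0] * (len(sequence) * len(AMINO_ACIDS))
    for site, aa in enumerate(sequence):
        features[site * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    features.append(density_score)
    return features

# Hypothetical variant and made-up density score, for illustration only.
x = augmented_features("MKV", density_score=-12.5)
print(len(x))  # 3 sites * 20 amino acids + 1 density feature
```

Feature vectors built this way for the labeled variants could then be stacked into a design matrix and passed to any ridge regression solver, with the experimentally measured fitness values as targets.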
20.
J Virol ; 84(19): 9879-88, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20660184

ABSTRACT

Previous studies have identified a central role for HLA-B alleles in influencing control of HIV infection. An alternative possibility is that a small number of HLA-B alleles may have a very strong impact on HIV disease outcome, dominating the contribution of other HLA alleles. Here, we find that even following the exclusion of subjects expressing any of the HLA-B class I alleles (B*57, B*58, and B*18) identified to have the strongest influence on control, the dominant impact of HLA-B alleles on virus set point and absolute CD4 count variation remains significant. However, we also find that the influence of HLA on HIV control in this C-clade-infected cohort from South Africa extends beyond HLA-B as HLA-Cw type remains a significant predictor of virus and CD4 count following exclusion of the strongest HLA-B associations. Furthermore, there is evidence of interdependent protective effects of the HLA-Cw*0401-B*8101, HLA-Cw*1203-B*3910, and HLA-A*7401-B*5703 haplotypes that cannot be explained solely by linkage to a protective HLA-B allele. Analysis of individuals expressing both protective and detrimental alleles shows that even the strongest HLA alleles appear to have an additive rather than dominant effect on HIV control at the individual level. Finally, weak but significant frequency-dependent effects in this cohort can be detected only by looking at an individual's combined HLA allele frequencies. Taken together, these data suggest that although individual HLA alleles, particularly HLA-B, can have a strong impact, HIV control overall is likely to be influenced by the additive effect of some or all of the other HLA alleles present.


Subjects
MHC Class I Genes , HIV Infections/genetics , HIV Infections/immunology , HIV-1 , Adult , Alleles , CD4 Lymphocyte Count , Cohort Studies , Gene Frequency , Genotype , HIV Infections/virology , HIV-1/classification , HIV-1/immunology , HIV-1/physiology , HLA-B Antigens/genetics , HLA-C Antigens/genetics , Haplotypes , Heterozygote , Homozygote , Host-Pathogen Interactions/genetics , Host-Pathogen Interactions/immunology , Humans , South Africa , Viral Load , Virus Replication/immunology