Search | BVS - MINISTÉRIO DA SAÚDE

1.

Inferring sparse structure in genotype-phenotype maps.

Petti, Samantha; Reddy, Gautam; Desai, Michael M.

Genetics ; 225(1), 2023 Aug 31.

Article in English | MEDLINE | ID: mdl-37437111

ABSTRACT

Correlation among multiple phenotypes across related individuals may reflect some pattern of shared genetic architecture: individual genetic loci affect multiple phenotypes (an effect known as pleiotropy), creating observable relationships between phenotypes. A natural hypothesis is that pleiotropic effects reflect a relatively small set of common "core" cellular processes: each genetic locus affects one or a few core processes, and these core processes in turn determine the observed phenotypes. Here, we propose a method to infer such structure in genotype-phenotype data. Our approach, sparse structure discovery (SSD), is based on a penalized matrix decomposition designed to identify latent structure that is low-dimensional (many fewer core processes than phenotypes and genetic loci), locus-sparse (each locus affects few core processes), and/or phenotype-sparse (each phenotype is influenced by few core processes). Our use of sparsity as a guide in the matrix decomposition is motivated by the results of a novel empirical test indicating evidence of sparse structure in several recent genotype-phenotype datasets. First, we use synthetic data to show that our SSD approach can accurately recover core processes if each genetic locus affects few core processes or if each phenotype is affected by few core processes. Next, we apply the method to three datasets spanning adaptive mutations in yeast, a genotoxin robustness assay in human cell lines, and genetic loci identified from a yeast cross, and evaluate the biological plausibility of the core processes identified. More generally, we propose sparsity as a guiding prior for resolving latent structure in empirical genotype-phenotype maps.

Subjects

Saccharomyces cerevisiae, Humans, Genotype, Saccharomyces cerevisiae/genetics, Phenotype, Mutation
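
The abstract of this record describes a penalized matrix decomposition that favors sparse factors but does not state the penalty used. As a minimal sketch, assuming an L1 penalty (an assumption, not something the abstract confirms), the snippet below shows soft-thresholding, the standard proximal step that pushes small matrix entries to exactly zero. The names `soft_threshold`, `sparsify`, and the toy `loadings` matrix are hypothetical and are not taken from the SSD method.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparsify(matrix, lam):
    """Apply soft-thresholding entrywise to a matrix given as a list of rows."""
    return [[soft_threshold(v, lam) for v in row] for row in matrix]


# Hypothetical loci-by-process loadings: rows are loci, columns are core processes.
loadings = [[0.9, -0.05], [0.02, -1.2]]
sparse_loadings = sparsify(loadings, 0.1)
# Entries with magnitude at most 0.1 become exactly zero, so each locus
# ends up loading on a single process in this toy example.
```

In an alternating scheme, a step like this would typically follow each gradient update of a factor matrix; the size of `lam` controls how many entries are zeroed.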

2.

Correction: Constructing Benchmark Test Sets for Biological Sequence Analysis Using Independent Set Algorithms.

Petti, Samantha; Eddy, Sean R.

PLoS Comput Biol ; 19(3): e1010971, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36888579

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pcbi.1009492.].

3.

End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman.

Petti, Samantha; Bhattacharya, Nicholas; Rao, Roshan; Dauparas, Justas; Thomas, Neil; Zhou, Juannan; Rush, Alexander M; Koo, Peter; Ovchinnikov, Sergey.

Bioinformatics ; 39(1), 2023 Jan 01.

Article in English | MEDLINE | ID: mdl-36355460

ABSTRACT

MOTIVATION: Multiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. RESULTS: Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood. AVAILABILITY AND IMPLEMENTATION: Our code and examples are available at: https://github.com/spetti/SMURF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms, Proteins, Humans, Sequence Alignment, Proteins/chemistry, Computer Neural Networks, Amino Acid Sequence
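
The abstract of this record refers to a smooth, differentiable version of Smith-Waterman. One common way to smooth a dynamic program is to replace each `max` with a temperature-scaled log-sum-exp; the sketch below illustrates that idea in plain Python and is not the SMURF implementation. The function names, the linear gap penalty, and the match/mismatch scores are all assumptions chosen for illustration.

```python
import math


def smooth_max(values, temp):
    """Temperature-scaled log-sum-exp: a smooth upper bound on max(values)."""
    m = max(values)
    return m + temp * math.log(sum(math.exp((v - m) / temp) for v in values))


def local_alignment_score(sim, gap, agg):
    """Smith-Waterman-style recurrence over similarity matrix `sim`.

    `gap` is a linear gap penalty; `agg` reduces candidate scores and can be
    the built-in max (hard score) or a smooth max (differentiable score).
    """
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = agg([
                0.0,                               # start a new local alignment
                H[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                H[i - 1][j] - gap,                 # gap in the second sequence
                H[i][j - 1] - gap,                 # gap in the first sequence
            ])
            cells.append(H[i][j])
    return agg(cells)  # best score over all end positions


def similarity(a, b, match=2.0, mismatch=-1.0):
    """Toy similarity matrix; a learned model would supply these scores."""
    return [[match if x == y else mismatch for y in b] for x in a]


sim = similarity("GATTACA", "GCATGCA")
hard = local_alignment_score(sim, 1.0, max)
soft = local_alignment_score(sim, 1.0, lambda v: smooth_max(v, 0.001))
```

As the temperature approaches zero the smooth score approaches the hard score from above; larger temperatures spread gradient across more alignments, which is what lets an upstream model be trained through the alignment step.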

4.

Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Petti, Samantha; Eddy, Sean R.

PLoS Comput Biol ; 18(3): e1009492, 2022 Mar.

Article in English | MEDLINE | ID: mdl-35255082

ABSTRACT

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.

Subjects

Algorithms, Benchmarking, Sequence Analysis
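
The constraint stated in the abstract of this record (every test sequence less than p% identical to every training sequence) can be illustrated with a simple greedy pass. This is a sketch of the constraint only, not either of the authors' algorithms; `percent_identity`, `greedy_split`, and `test_fraction` are hypothetical names, and the sequences are assumed to be pre-aligned to equal length.

```python
def percent_identity(a, b):
    """Percent of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_split(seqs, p, test_fraction=0.3):
    """Assign each sequence to train or test so no cross pair reaches p% identity.

    A sequence that is too similar to members of both sets is discarded.
    """
    train, test, discarded = [], [], []
    for s in seqs:
        ok_train = all(percent_identity(s, t) < p for t in test)
        ok_test = all(percent_identity(s, t) < p for t in train)
        # Prefer the test set only while it is below its target share.
        want_test = len(test) < test_fraction * (len(train) + len(test) + 1)
        if ok_test and (want_test or not ok_train):
            test.append(s)
        elif ok_train:
            train.append(s)
        else:
            discarded.append(s)
    return train, test, discarded


seqs = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG", "GGGGTTTT"]
train, test, discarded = greedy_split(seqs, p=50)
```

Near-identical sequences end up on the same side of the split, which is the property a random split fails to provide. A greedy pass like this depends on input order and may discard more sequences than a more careful graph-based method would.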

5.

Differential privacy in the 2020 US census: what will it do? Quantifying the accuracy/privacy tradeoff.

Petti, Samantha; Flaxman, Abraham.

Gates Open Res ; 3: 1722, 2019.

Article in English | MEDLINE | ID: mdl-32478311

ABSTRACT

Background: The 2020 US Census will use a novel approach to disclosure avoidance to protect respondents' data, called TopDown. This TopDown algorithm was applied to the 2018 end-to-end (E2E) test of the decennial census. The computer code used for this test as well as accompanying exposition has recently been released publicly by the Census Bureau. Methods: We used the available code and data to better understand the error introduced by the E2E disclosure avoidance system when the Census Bureau applied it to 1940 census data, and we developed an empirical measure of privacy loss to compare the error and privacy of the new approach to that of a simple-random-sampling approach to protecting privacy. Results: We found that the empirical privacy loss of TopDown is substantially smaller than the theoretical guarantee for all privacy loss budgets we examined. When run on the 1940 census data, TopDown with a privacy budget of 1.0 was similar in error and privacy loss to a simple random sample of 50% of the US population. When run with a privacy budget of 4.0, it was similar in error and privacy loss to a 90% sample. Conclusions: This work fits into the beginning of a discussion on how to best balance privacy and accuracy in decennial census data collection, and there is a need for continued discussion.
