Search | VHL Regional Portal

Genomic and Synthetic Biology Digital Biosecurity.

Hudson, Corey M; Pattengale, Nicholas D; Iyer, Ravishankar K; Kalbarczyk, Zbigniew T; Alli, Nina.

Pac Symp Biocomput ; 27: 402-406, 2022.

Article in English | MEDLINE | ID: mdl-34890167

ABSTRACT

Trends toward automation of synthetic biology and the individualization of biology and medicine raise varied and critical security issues. Digital biosecurity brings together researchers working in secure algorithms, vulnerability assessments, and emerging threat models. The fundamental goal of this digital biosecurity workshop is to identify and present distinct areas of research around making the next generation of biology safer and more secure. The workshop will include a panel overview of the field, including representatives from academia, industry, and non-profits. It will also include novel presentations from the research community. We expect that attendees will leave this workshop with a new appreciation of the research and implementation challenges in maintaining the digital aspects of biosecurity.

Subjects

Biosecurity , Synthetic Biology , Computational Biology , Genomics , Humans

Decentralized genomics audit logging via permissioned blockchain ledgering.

Pattengale, Nicholas D; Hudson, Corey M.

BMC Med Genomics ; 13(Suppl 7): 102, 2020 07 21.

Article in English | MEDLINE | ID: mdl-32693795

ABSTRACT

BACKGROUND: One of the tasks in the iDASH Secure Genome Analysis Competition in 2018 was to develop blockchain-based immutable logging and querying for a cross-site genomic dataset access audit trail. The specific challenge was to design a time- and space-efficient structure and mechanism for storing and retrieving genomic data access logs, based on MultiChain version 1.0.4 (https://www.multichain.com/). METHODS: Our technique uses the MultiChain stream application programming interface (which affords treating MultiChain as a key-value store) and employs a two-level index, which naturally supports efficient queries of the data for single-clause constraints. The scheme also supports heuristic and binary search techniques for queries containing conjunctions of clause constraints, and timestamp range queries. Of note, all of our techniques have complexity independent of inserted data set size, other than the timestamp ranges, which scale logarithmically with input size. RESULTS: We implemented our insertion and querying techniques in Python, using the MultiChain library Savoir (https://github.com/dxmarkets/savoir), and comprehensively tested our implementation across a benchmark of datasets of varying sizes. We also tested a port of our challenge submission to a newer version of MultiChain (2.0 beta), which natively supports multiple indices. CONCLUSIONS: We presented creative and efficient techniques for storing and querying log file data in MultiChain 1.0.4 and 2.0 beta. We demonstrated that it is feasible to use a permissioned blockchain ledger for genomic query log data when data volume is on the order of hundreds of megabytes and query times of dozens of minutes are acceptable. We demonstrated that evolution in the ledger platform (MultiChain 1 to 2) yielded a 30%-40% increase in insertion efficiency. All source code for this challenge has been made available under a BSD-3 license from https://github.com/sandialabs/idash2018task1/.
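The two-level index idea in METHODS can be pictured with a minimal Python sketch. It is not the authors' implementation (their code is at the repository cited above) and it does not call MultiChain or Savoir. `StreamStore` is a hypothetical in-memory stand-in for an append-only key-value stream, and the key layout (`rec:` and `idx:` prefixes) and the locally held sorted timestamp list are assumptions made purely for illustration.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class StreamStore:
    """Hypothetical in-memory stand-in for an append-only key-value stream."""

    def __init__(self):
        self._items = defaultdict(list)

    def publish(self, key, value):
        self._items[key].append(value)

    def list_items(self, key):
        return list(self._items[key])


class AuditLog:
    """Two-level index: 'idx:field=value' -> record ids, 'rec:id' -> record."""

    def __init__(self, store):
        self.store = store
        self._next_id = 0
        self._timestamps = []  # sorted (timestamp, record id) pairs (assumption)

    def insert(self, record):
        rid = self._next_id
        self._next_id += 1
        self.store.publish(f"rec:{rid}", record)            # level 2: full record
        for field, value in record.items():
            if field != "timestamp":
                self.store.publish(f"idx:{field}={value}", rid)  # level 1: index
        insort(self._timestamps, (record["timestamp"], rid))
        return rid

    def _fetch(self, rid):
        return self.store.list_items(f"rec:{rid}")[0]

    def query(self, field, value):
        """Single-clause query: one index lookup, then one fetch per hit."""
        return [self._fetch(rid) for rid in self.store.list_items(f"idx:{field}={value}")]

    def query_time_range(self, start, end):
        """Inclusive timestamp range query via binary search on the sorted list."""
        lo = bisect_left(self._timestamps, (start, -1))
        hi = bisect_right(self._timestamps, (end, float("inf")))
        return [self._fetch(rid) for _, rid in self._timestamps[lo:hi]]
```

In this sketch a single-clause query costs one index lookup plus one fetch per matching record, while the range query locates its endpoints by binary search, which is where a logarithmic dependence on log size would enter.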

Subjects

Blockchain , Genomics/methods , Algorithms , Electronic Health Records , Humans

The Kernel of Maximum Agreement Subtrees.

Swenson, Krister M; Chen, Eric; Pattengale, Nicholas D; Sankoff, David.

IEEE/ACM Trans Comput Biol Bioinform ; 9(4): 1023-31, 2012.

Article in English | MEDLINE | ID: mdl-22231622

ABSTRACT

A Maximum Agreement SubTree (MAST) is a largest subtree common to a set of trees and serves as a summary of common substructure in the trees. A single MAST can be misleading, however, since there can be an exponential number of MASTs, and two MASTs for the same tree set do not even necessarily share any leaves. In this paper, we introduce the notion of the Kernel Agreement SubTree (KAST), which is the summary of the common substructure in all MASTs, and show that it can be calculated in polynomial time (for trees with bounded degree). Suppose the input trees represent competing hypotheses for a particular phylogeny. We explore the utility of the KAST as a method to discern the common structure of confidence, and as a measure of how confident we are in a given tree set. We also show the trend of the KAST, as compared to other consensus methods, on the set of all trees visited during a Bayesian analysis of flatworm genomes.

Subjects

Algorithms , Computational Biology/methods , Phylogeny , Animals , Bayes Theorem , Bacterial Genome/genetics , Helminth Genome , Genetic Models , Platyhelminths/genetics , Proteobacteria

Uncovering hidden phylogenetic consensus in large data sets.

Pattengale, Nicholas D; Aberer, Andre J; Swenson, Krister M; Stamatakis, Alexandros; Moret, Bernard M E.

IEEE/ACM Trans Comput Biol Bioinform ; 8(4): 902-11, 2011.

Article in English | MEDLINE | ID: mdl-21301032

ABSTRACT

Many of the steps in phylogenetic reconstruction can be confounded by "rogue" taxa: taxa that cannot be placed with assurance anywhere within the tree, indeed, whose location within the tree varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. In this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. As the presence of rogue taxa in a set of bootstrap replicates can lead to deceivingly poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm; applying this procedure to our biological data sets caused a large number of edges to move from "unsupported" to "supported" status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those dealing with scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses).
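The general shape of a greedy rogue-taxon search can be illustrated with a short Python sketch. It is not the algorithm integrated into RAxML. It assumes rooted trees given as sets of clades (each clade a `frozenset` of taxon names) and scores a taxon set by the number of clades found in a strict majority of the pruned trees; that simple count is a stand-in for the relative information criterion, whose exact form the abstract does not give. All function names are hypothetical.

```python
from collections import Counter


def prune(tree, dropped, taxa):
    """Remove dropped taxa from every clade; discard clades that become trivial."""
    kept = taxa - dropped
    pruned = set()
    for clade in tree:
        reduced = clade - dropped
        if 1 < len(reduced) < len(kept):
            pruned.add(reduced)
    return pruned


def majority_clades(trees, dropped, taxa):
    """Count clades present in more than half of the pruned trees."""
    counts = Counter(c for tree in trees for c in prune(tree, dropped, taxa))
    return sum(1 for n in counts.values() if n > len(trees) / 2)


def greedy_rogues(trees, taxa):
    """Drop one taxon at a time while doing so strictly increases the score."""
    taxa = frozenset(taxa)
    dropped = frozenset()
    best = majority_clades(trees, dropped, taxa)
    while len(taxa - dropped) > 3:
        candidates = [
            (majority_clades(trees, dropped | {t}, taxa), t)
            for t in sorted(taxa - dropped)
        ]
        score, taxon = max(candidates, key=lambda pair: pair[0])
        if score <= best:
            break  # no single removal improves the consensus
        dropped |= {taxon}
        best = score
    return sorted(dropped)
```

Because the score here only counts majority clades, it has no explicit penalty for discarding taxa; the search stops simply because removing further taxa eventually stops adding resolved clades.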

Subjects

Algorithms , Computational Biology/methods , Genetic Models , Phylogeny , Cluster Analysis , Consensus Sequence , Genetic Databases

How many bootstrap replicates are necessary?

Pattengale, Nicholas D; Alipour, Masoud; Bininda-Emonds, Olaf R P; Moret, Bernard M E; Stamatakis, Alexandros.

J Comput Biol ; 17(3): 337-54, 2010 Mar.

Article in English | MEDLINE | ID: mdl-20377449

ABSTRACT

Phylogenetic bootstrapping (BS) is a standard technique for inferring confidence values on phylogenetic trees that is based on reconstructing many trees from minor variations of the input data, trees called replicates. BS is used with all phylogenetic reconstruction approaches, but we focus here on one of the most popular, maximum likelihood (ML). Because ML inference is so computationally demanding, it has proved too expensive to date to assess the impact of the number of replicates used in BS on the relative accuracy of the support values. For the same reason, a rather small number (typically 100) of BS replicates are computed in real-world studies. Stamatakis et al. recently introduced a BS algorithm that is 1 to 2 orders of magnitude faster than previous techniques, while yielding qualitatively comparable support values, making an experimental study possible. In this article, we propose stopping criteria (that is, thresholds computed at runtime to determine when enough replicates have been generated), and we report on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of our proposed criteria. We run our tests on 17 diverse real-world DNA datasets (single-gene as well as multi-gene), which include 125-2,554 taxa. We find that our stopping criteria typically stop computations after 100-500 replicates (although the most conservative criterion may continue for several thousand replicates) while producing support values that correlate at better than 99.5% with the reference values on the best ML trees. Significantly, we also find that the stopping criteria can recommend very different numbers of replicates for different datasets of comparable sizes.
Our results are thus twofold: (i) they give the first experimental assessment of the effect of the number of BS replicates on the quality of support values returned through BS, and (ii) they validate our proposals for stopping criteria. Practitioners will no longer have to enter a guess nor worry about the quality of support values; moreover, with most counts of replicates in the 100-500 range, robust BS under ML inference becomes computationally practical for most datasets. The complete test suite is available at http://lcbb.epfl.ch/BS.tar.bz2, and BS with our stopping criteria is included in the latest release of RAxML v7.2.5, available at http://wwwkramer.in.tum.de/exelixis/software.html.
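One way to picture a runtime stopping criterion is a split-half stability check, sketched below in Python. The abstract does not spell out the criteria, so this is an illustration under stated assumptions rather than the rule shipped in RAxML: each replicate is assumed to be a set of hashable bipartition identifiers, and the defaults (100 random splits, correlation at least 0.99 in at least 99% of them) are illustrative values. The names `support` and `enough_replicates` are hypothetical.

```python
import random
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; constant or empty vectors count as 1.0 only if equal."""
    n = len(xs)
    if n == 0:
        return 1.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 1.0 if xs == ys else 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def support(replicates):
    """Fraction of replicates that contain each bipartition."""
    counts = Counter(split for rep in replicates for split in rep)
    return {split: c / len(replicates) for split, c in counts.items()}


def enough_replicates(replicates, permutations=100, min_corr=0.99,
                      min_fraction=0.99, rng=None):
    """Return True if support values agree between random halves of the replicates."""
    if len(replicates) < 2:
        return False
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    passes = 0
    for _ in range(permutations):
        shuffled = list(replicates)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        a = support(shuffled[:half])
        b = support(shuffled[half:2 * half])
        splits = list(set(a) | set(b))
        xs = [a.get(s, 0.0) for s in splits]
        ys = [b.get(s, 0.0) for s in splits]
        if pearson(xs, ys) >= min_corr:
            passes += 1
    return passes >= min_fraction * permutations
```

In practice such a check would be run periodically as replicates accumulate, with bootstrapping halted the first time it returns True.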

Subjects

Computational Biology/methods , Phylogeny , Confidence Intervals , Genetic Databases , Likelihood Functions , Reproducibility of Results , Time Factors

Efficiently computing the Robinson-Foulds metric.

Pattengale, Nicholas D; Gottlieb, Eric J; Moret, Bernard M E.

J Comput Biol ; 14(6): 724-35, 2007.

Article in English | MEDLINE | ID: mdl-17691890

ABSTRACT

The Robinson-Foulds (RF) metric is the measure most widely used in comparing phylogenetic trees; it can be computed in linear time using Day's algorithm. When faced with the need to compare large numbers of large trees, however, even linear time becomes prohibitive. We present a randomized approximation scheme that provides, in sublinear time and with high probability, a (1 + epsilon) approximation of the true RF metric. Our approach is to use a sublinear-space embedding of the trees, combined with an application of the Johnson-Lindenstrauss lemma to approximate vector norms very rapidly. We complement our algorithm by presenting an efficient embedding procedure, thereby resolving an open issue from the preliminary version of this paper. We have also improved the performance of Day's (exact) algorithm in practice by using techniques discovered while implementing our approximation scheme. Indeed, we give a unified framework for edge-based tree algorithms in which implementation tradeoffs are clear. Finally, we present detailed experimental results illustrating the precision and running-time tradeoffs as well as demonstrating the speed of our approach. Our new implementation, FastRF, is available as an open-source tool for phylogenetic analysis.
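For orientation, the RF distance itself can be computed directly from its definition as the number of bipartitions found in exactly one of the two trees. The Python sketch below does that with plain sets; it is neither Day's linear-time algorithm nor the randomized approximation scheme of the paper, and it is not FastRF. Trees are assumed to be nested tuples with string leaf names, and the function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial leaf bipartitions induced by internal edges."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # each split is stored as the side without the anchor
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """RF distance: size of the symmetric difference of the bipartition sets."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

Viewed this way, each tree is a 0/1 vector indexed by bipartitions and the RF distance is the squared Euclidean norm of the difference of two such vectors, which is the kind of quantity the Johnson-Lindenstrauss lemma allows to be approximated in a lower-dimensional space.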

Subjects

Algorithms , Computational Biology , Molecular Evolution , Phylogeny , Proteins/genetics , Protein Databases , Proteins/chemistry , Proteins/classification , Software
