Search | Virtual Health Library (Biblioteca Virtual em Saúde)

1.

skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements.

Zhang, Xiaolei Brian; Oualline, Grace; Shaw, Jim; Yu, Yun William.

Bioinformatics ; 40(Suppl 2): ii155-ii164, 2024 09 01.

Article in English | MEDLINE | ID: mdl-39230688

ABSTRACT

Motivation: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbor antibiotic resistance genes. However, despite cheap and rapid whole-genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often do not agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile genetic elements. Results: Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole-genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65 000 representative genomes in a few minutes and 19 GB memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kb), skandiver's recall was 48% and 47%, MobileElementFinder's was 59% and 17%, and geNomad's was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than that of the state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, does so without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. AVAILABILITY AND IMPLEMENTATION: https://github.com/YoukaiFromAccounting/skandiver.

Subjects

Interspersed Repetitive Sequences, Software, Bacterial Genome, DNA Transposable Elements/genetics, Plasmids/genetics

2.

Fairy: fast approximate coverage for multi-sample metagenomic binning.

Shaw, Jim; Yu, Yun William.

Microbiome ; 12(1): 151, 2024 Aug 14.

Article in English | MEDLINE | ID: mdl-39143609

ABSTRACT

BACKGROUND: Metagenomic binning, the clustering of assembled contigs that belong to the same genome, is a crucial step for recovering metagenome-assembled genomes (MAGs). Contigs are linked by exploiting consistent signatures along a genome, such as read coverage patterns. Using coverage from multiple samples leads to higher-quality MAGs; however, standard pipelines require all-to-all read alignments for multiple samples to compute coverage, becoming a key computational bottleneck. RESULTS: We present fairy ( https://github.com/bluenote-1577/fairy ), an approximate coverage calculation method for metagenomic binning. Fairy is a fast k-mer-based alignment-free method. For multi-sample binning, fairy can be >250× faster than read alignment and accurate enough for binning. Fairy is compatible with several existing binners on host and non-host-associated datasets. Using MetaBAT2, fairy recovers 98.5% of MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning using BWA (>1.5× more >50% complete MAGs on average) while still being faster. For a public sediment metagenome project, we demonstrate that multi-sample binning recovers higher quality Asgard archaea MAGs than single-sample binning and that fairy's results are indistinguishable from read alignment. CONCLUSIONS: Fairy is a new tool for approximately and quickly calculating multi-sample coverage for binning, resolving a computational bottleneck for metagenomics.

Subjects

Metagenome, Metagenomics, Metagenomics/methods, Software, DNA Sequence Analysis/methods, Computational Biology/methods, Archaea/genetics, Archaea/classification, Algorithms
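
Illustrative note: the abstract above describes a k-mer-based, alignment-free way to approximate contig coverage. The sketch below shows the general idea only and is not fairy's actual algorithm: it counts read k-mers and reports the median count over a contig's k-mers. The function names, the choice of the median, and the tiny k are assumptions made for illustration; reverse complements and k-mer subsampling are ignored.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def approx_coverage(contig, reads, k=5):
    """Median read k-mer count over the contig's k-mers (no alignment)."""
    # Count how often each k-mer occurs across all reads.
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    # Look up the count of every k-mer along the contig (0 if unseen).
    counts = [read_counts[km] for km in kmers(contig, k)]
    return median(counts) if counts else 0.0
```

Computing one such value per (contig, sample) pair would give the kind of multi-sample coverage table that binners consume, without any read alignment.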

3.

SlowMoMan: a web app for discovery of important features along user-drawn trajectories in 2D embeddings.

Deol, Kiran; Weber, Griffin M; Yu, Yun William.

Bioinform Adv ; 4(1): vbae095, 2024.

Article in English | MEDLINE | ID: mdl-38962404

ABSTRACT

Motivation: Nonlinear low-dimensional embeddings allow humans to visualize high-dimensional data, as is often seen in bioinformatics, where datasets may have tens of thousands of dimensions. However, relating the axes of a nonlinear embedding to the original dimensions is a nontrivial problem. In particular, humans may identify patterns or interesting subsections in the embedding, but cannot easily identify what those patterns correspond to in the original data. Results: Thus, we present SlowMoMan (SLOW Motions on MANifolds), a web application which allows the user to draw a one-dimensional path onto a 2D embedding. Then, by back-projecting the manifold to the original, high-dimensional space, we sort the original features such that those most discriminative along the manifold are ranked highly. We show a number of pertinent use cases for our tool, including trajectory inference, spatial transcriptomics, and automatic cell classification. Availability and implementation: Software: https://yunwilliamyu.github.io/SlowMoMan/; Code: https://github.com/yunwilliamyu/SlowMoMan.

4.

Floria: fast and accurate strain haplotyping in metagenomes.

Shaw, Jim; Gounot, Jean-Sebastien; Chen, Hanrong; Nagarajan, Niranjan; Yu, Yun William.

Bioinformatics ; 40(Suppl 1): i30-i38, 2024 06 28.

Article in English | MEDLINE | ID: mdl-38940183

ABSTRACT

SUMMARY: Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remain a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes show that Floria is >3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry) while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 min on average per sample and identified several species that have consistent strain heterogeneity. Applying Floria's short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. AVAILABILITY AND IMPLEMENTATION: Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow along with code for reproducing the benchmarks.

Subjects

Metagenome, Metagenomics, Metagenomics/methods, Haplotypes, Software, Humans, Bacterial Genome, Microbiota/genetics, Bacteria/genetics, Bacteria/classification, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods

5.

Erratum: Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic.

Shaw, Jim; Yu, Yun William.

Genome Res ; 34(5): 810, 2024 Jun 25.

Article in English | MEDLINE | ID: mdl-38918042

6.

On Minimizers and Convolutional Filters: Theoretical Connections and Applications to Genome Analysis.

Yu, Yun William.

J Comput Biol ; 31(5): 381-395, 2024 05.

Article in English | MEDLINE | ID: mdl-38687333

ABSTRACT

Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.

Subjects

COVID-19, Neural Networks (Computer), SARS-CoV-2, Humans, COVID-19/virology, SARS-CoV-2/genetics, Algorithms, Telomere/genetics, Computational Biology/methods, Genomics/methods, Deep Learning, Human Genome
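
Illustrative note: the abstract above describes minimizers as min-wise hashing over a rolling window that keeps one k-mer per window. A minimal sketch of that standard scheme follows. The hash function (BLAKE2b), the function names, and the leftmost tie-breaking rule are assumptions chosen for illustration; the article's specific orderings derived from convolutional filters are not reproduced here.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash used as a stand-in for a random k-mer ordering."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w, key=kmer_hash):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Each window holds w consecutive k-mers; the k-mer with the smallest key
    is selected, with ties broken by the leftmost position.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kms) - w + 1):
        best = min(range(start, start + w), key=lambda i: (key(kms[i]), i))
        picked.add((best, kms[best]))
    return sorted(picked)
```

Passing `key=lambda s: s` gives the lexicographic ordering, which makes small examples easy to check by hand.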

7.

Mora: abundance aware metagenomic read re-assignment for disentangling similar strains.

Zheng, Andrew; Shaw, Jim; Yu, Yun William.

BMC Bioinformatics ; 25(1): 161, 2024 Apr 23.

Article in English | MEDLINE | ID: mdl-38649836

ABSTRACT

BACKGROUND: Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community, but correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. RESULTS: We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over-assigning reads to the same genomes. On simulated diverse reads, this allows Mora to achieve F1 scores comparable to other algorithms with lower runtime. However, Mora significantly outshines other algorithms on very similar reads. We show that the high penalty of over-assigning reads to a common reference genome allows Mora to accurately infer correct strains for real data in the form of E. coli reads. CONCLUSIONS: Mora is a fast and accurate read re-assignment algorithm that is modularized, allowing it to be incorporated into general metagenomics and genomics workflows. It is freely available at https://github.com/AfZheng126/MORA .

Subjects

Algorithms, Metagenomics, Metagenomics/methods, Escherichia coli/genetics, DNA Sequence Analysis/methods, Software, Metagenome/genetics, Bacterial Genome
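
Illustrative note: the abstract above says abundances are first estimated with an expectation-maximization (EM) algorithm. The sketch below is a generic textbook EM for mixture proportions, not Mora's implementation. It assumes, hypothetically, that each read comes with a non-empty list of equally good candidate genomes; the function name and input layout are made up for this example.

```python
def em_abundances(candidates, n_genomes, iters=100):
    """Estimate relative genome abundances from ambiguous read assignments.

    candidates[r] is the non-empty list of genome indices that read r
    matches equally well (an assumption of this sketch).
    """
    theta = [1.0 / n_genomes] * n_genomes
    for _ in range(iters):
        counts = [0.0] * n_genomes
        # E-step: split each read across its candidates in proportion
        # to the current abundance estimates.
        for genomes in candidates:
            total = sum(theta[g] for g in genomes)
            for g in genomes:
                counts[g] += theta[g] / total
        # M-step: abundances are the normalized expected read counts.
        theta = [c / len(candidates) for c in counts]
    return theta
```

In this toy setting, a genome supported only by ambiguous reads loses abundance to a genome that also has uniquely assigned reads, which is the intuition behind using abundance estimates to guide re-assignment.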

8.

Image-centric compression of protein structures improves space savings.

Staniscia, Luke; Yu, Yun William.

BMC Bioinformatics ; 24(1): 437, 2023 Nov 21.

Article in English | MEDLINE | ID: mdl-37990290

ABSTRACT

BACKGROUND: Because of the rapid generation of data, the study of compression algorithms to reduce storage and transmission costs is important to bioinformaticians. Much of the focus has been on sequence data, including both genomes and protein amino acid sequences stored in FASTA files. Current standard practice is to use an ordinary lossless compressor such as gzip on a sequential list of atomic coordinates, but this approach expends bits on saving an arbitrary ordering of atoms, and it also prevents reordering the atoms for compressibility. The standard MMTF and BCIF file formats extend this approach with custom encoding of the coordinates. However, the brand new Foldcomp tool introduces a new paradigm of compressing local angles, to great effect. In this article, we explore a different paradigm, showing for the first time that image-based compression using global angles can also significantly improve compression ratios. To this end, we implement a prototype compressor 'PIC', specialized for point clouds of atom coordinates contained in PDB and mmCIF files. PIC maps the 3D data to a 2D 8-bit greyscale image and leverages the well developed PNG image compressor to minimize the size of the resulting image, forming the compressed file. RESULTS: PIC outperforms gzip in terms of compression ratio on proteins over 20,000 atoms in size, with a savings over gzip of up to 37.4% on the proteins compressed. In addition, PIC's compression ratio increases with protein size. CONCLUSION: Image-centric compression as demonstrated by our prototype PIC provides a potential means of constructing 3D structure-aware protein compression software, though future work would be necessary to make this practical.

Subjects

Data Compression, Data Compression/methods, Algorithms, Software, Genome

9.

Fast and robust metagenomic sequence comparison through sparse chaining with skani.

Shaw, Jim; Yu, Yun William.

Nat Methods ; 20(11): 1661-1665, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37735570

ABSTRACT

Sequence comparison tools for metagenome-assembled genomes (MAGs) struggle with high-volume or low-quality data. We present skani ( https://github.com/bluenote-1577/skani ), a method for determining average nucleotide identity (ANI) via sparse approximate alignments. skani outperforms FastANI in accuracy and speed (>20× faster) for fragmented, incomplete MAGs. skani can query genomes against >65,000 prokaryotic genomes in seconds and 6 GB memory. skani unlocks higher-resolution insights for extensive, noisy metagenomic datasets.

Subjects

Metagenome, Prokaryotic Cells, Metagenomics/methods

10.

Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic.

Shaw, Jim; Yu, Yun William.

Genome Res ; 33(7): 1175-1187, 2023 07.

Article in English | MEDLINE | ID: mdl-36990779

ABSTRACT

Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(m n^{f(θ)} log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than [Formula: see text] fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.

Subjects

Algorithms, Heuristics, Computer Simulation, Sequence Alignment, DNA Sequence Analysis/methods
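
Illustrative note: the seed and chain stages of seed-chain-extend, as described in the abstract above, can be sketched as follows. This is a deliberately simplified version under stated assumptions: anchors are exact k-mer matches, and the chain simply maximizes the number of co-linear anchors with a quadratic-time dynamic program. The paper analyzes optimal linear-gap cost chaining, which this sketch does not implement, and the extension step between chained anchors is omitted. All function names are hypothetical.

```python
from collections import defaultdict


def find_anchors(ref, query, k):
    """Seed: return sorted (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return sorted(
        (i, j)
        for j in range(len(query) - k + 1)
        for i in index.get(query[j:j + k], ())
    )


def chain(anchors):
    """Chain: longest list of anchors increasing in both coordinates.

    Expects anchors sorted by (ref_pos, query_pos); runs in O(a^2) time.
    """
    if not anchors:
        return []
    best = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)  # back-pointer to the previous anchor
    for b in range(len(anchors)):
        for a in range(b):
            colinear = anchors[a][0] < anchors[b][0] and anchors[a][1] < anchors[b][1]
            if colinear and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(anchors)), key=best.__getitem__)
    out = []
    while end != -1:
        out.append(anchors[end])
        end = prev[end]
    return out[::-1]
```

A full aligner would then run a gap-filling alignment between consecutive chained anchors, which is the "extend" stage.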

11.

Navigating bottlenecks and trade-offs in genomic data analysis.

Berger, Bonnie; Yu, Yun William.

Nat Rev Genet ; 24(4): 235-250, 2023 04.

Article in English | MEDLINE | ID: mdl-36476810

ABSTRACT

Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.

Subjects

Genome, Genomics, Humans, Chromosome Mapping, Data Analysis

12.

Theory of local k-mer selection with applications to long-read alignment.

Shaw, Jim; Yu, Yun William.

Bioinformatics ; 38(20): 4659-4669, 2022 10 14.

Article in English | MEDLINE | ID: mdl-36124869

ABSTRACT

MOTIVATION: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the 'lowest-ordered' k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well. RESULTS: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality. AVAILABILITY AND IMPLEMENTATION: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms, Software, Mutation, DNA Sequence Analysis/methods
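
Illustrative note: the abstract above mentions open and closed syncmers among the local k-mer selection methods it analyzes. The sketch below shows the usual definitions as commonly stated: a k-mer is an open syncmer if its smallest s-mer sits at a fixed offset t, and a closed syncmer if its smallest s-mer sits at the first or last possible position. The lexicographic default ordering, the leftmost tie-breaking, and the function names are assumptions for illustration; practical tools typically order s-mers by a hash.

```python
def smallest_smer_pos(kmer, s, key=lambda x: x):
    """Index of the leftmost smallest s-mer inside kmer under the given ordering."""
    return min(range(len(kmer) - s + 1), key=lambda i: (key(kmer[i:i + s]), i))


def is_open_syncmer(kmer, s, t=0, key=lambda x: x):
    """True if the smallest s-mer starts at offset t within the k-mer."""
    return smallest_smer_pos(kmer, s, key) == t


def is_closed_syncmer(kmer, s, key=lambda x: x):
    """True if the smallest s-mer is the first or the last s-mer of the k-mer."""
    return smallest_smer_pos(kmer, s, key) in (0, len(kmer) - s)
```

Unlike window minimizers, whether a k-mer is selected here depends only on the k-mer itself, not on its neighbors in the sequence.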

13.

flopp: Extremely Fast Long-Read Polyploid Haplotype Phasing by Uniform Tree Partitioning.

Shaw, Jim; Yu, Yun William.

J Comput Biol ; 29(2): 195-211, 2022 02.

Article in English | MEDLINE | ID: mdl-35041529

ABSTRACT

Resolving haplotypes in polyploid genomes using phase information from sequencing reads is an important and challenging problem. We introduce two new mathematical formulations of polyploid haplotype phasing: (1) the min-sum max tree partition problem, which is a more flexible graphical metric compared with the standard minimum error correction (MEC) model in the polyploid setting, and (2) the uniform probabilistic error minimization model, which is a probabilistic analogue of the MEC model. We incorporate both formulations into a long-read based polyploid haplotype phasing method called flopp. We show that flopp compares favorably with state-of-the-art algorithms: it is up to 30 times faster, with 2 times fewer switch errors, on 6× ploidy simulated data. Further, we show using real nanopore data that flopp can quickly reveal reasonable haplotype structures from the autotetraploid Solanum tuberosum (potato).

Subjects

Algorithms, Haplotypes, Polyploidy, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Plant Genome, Genetic Models, Statistical Models, Multigene Family, Single Nucleotide Polymorphism, DNA Sequence Analysis/statistics & numerical data, Software, Solanum tuberosum/genetics
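
Illustrative note: the abstract above refers to the standard minimum error correction (MEC) model as the baseline for its new formulations. A minimal sketch of the standard MEC score follows, assuming reads and haplotypes are encoded as lists of alleles over the same variant sites, with `None` marking sites a read does not cover. This encoding and the function names are assumptions; the sketch does not implement flopp's own objectives.

```python
def mismatches(read, haplotype):
    """Count allele disagreements over the sites the read covers (None = uncovered)."""
    return sum(1 for r, h in zip(read, haplotype) if r is not None and r != h)


def mec_score(reads, haplotypes):
    """Sum, over reads, of the mismatches to each read's closest haplotype."""
    return sum(min(mismatches(read, hap) for hap in haplotypes) for read in reads)
```

A phasing method under the MEC model searches for the set of haplotypes (one per chromosome copy) that minimizes this score.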

14.

HyperMinHash: MinHash in LogLog space.

Yu, Yun William; Weber, Griffin M.

IEEE Trans Knowl Data Eng ; 34(1): 328-339, 2022 Jan.

Article in English | MEDLINE | ID: mdl-38288326

ABSTRACT

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ϵ on a Jaccard index t, given a random oracle, HyperMinHash needs O(ϵ^{-2} (log log n + log(1/ϵ))) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 10% using 2 MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption.
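
Illustrative note: for context, the plain MinHash sketch that HyperMinHash compresses can be written in a few lines. The version below is a sketch under stated assumptions: it uses one salted BLAKE2b hash per bucket and stores full 64-bit minima. It does not implement HyperMinHash's floating-point bucket encoding, and the function names are made up for this example. Input sets are assumed to be non-empty.

```python
import hashlib


def salted_hash(item, seed):
    """64-bit hash of item under a given seed (one seed per MinHash bucket)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each of num_hashes hash functions, keep the minimum hash over the set."""
    return [min(salted_hash(x, s) for x in items) for s in range(num_hashes)]


def jaccard_estimate(sig_a, sig_b):
    """Fraction of buckets whose minima agree estimates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def merge(sig_a, sig_b):
    """Signature of the union of two sets is the bucket-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]
```

Each bucket here holds a full hash value, which is the O(log n) per-bucket cost that the abstract says HyperMinHash reduces to O(log log n).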

15.

Expected 10-anonymity of HyperLogLog sketches for federated queries of clinical data repositories.

Tao, Ziye; Weber, Griffin M; Yu, Yun William.

Bioinformatics ; 37(Suppl_1): i151-i160, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252969

ABSTRACT

MOTIVATION: The rapid growth of electronic medical records provides immense potential to researchers, but these records are often siloed at separate hospitals. As a result, federated networks have arisen, which allow simultaneously querying medical databases at a group of connected institutions. The most basic such query is the aggregate count, e.g., "How many patients have diabetes?" However, depending on the protocol used to estimate that total, there is always a tradeoff in the accuracy of the estimate against the risk of leaking confidential data. Prior work has shown that it is possible to empirically control that tradeoff by using the HyperLogLog (HLL) probabilistic sketch. RESULTS: In this article, we prove complementary theoretical bounds on the k-anonymity privacy risk of using HLL sketches, as well as exhibit code to efficiently compute those bounds. AVAILABILITY AND IMPLEMENTATION: https://github.com/tzyRachel/K-anonymity-Expectation.

Subjects

Privacy, Research Personnel, Factual Databases, Humans

16.

Levenshtein Distance, Sequence Comparison and Biological Database Search.

Berger, Bonnie; Waterman, Michael S; Yu, Yun William.

IEEE Trans Inf Theory ; 67(6): 3287-3294, 2021 Jun.

Article in English | MEDLINE | ID: mdl-34257466

ABSTRACT

Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. Next, we describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates has resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.
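
Illustrative note: the dynamic programming approach to Levenshtein distance mentioned in the abstract above can be sketched with the classic two-row recurrence. This is the textbook algorithm with unit costs for insertion, deletion, and substitution, not any specific implementation discussed in the review.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b with unit-cost edits."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]
```

The running time is proportional to len(a) × len(b), which is why heuristics are needed for database-scale search.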

17.

Inherent privacy limitations of decentralized contact tracing apps.

Bengio, Yoshua; Ippolito, Daphne; Janda, Richard; Jarvie, Max; Prud'homme, Benjamin; Rousseau, Jean-François; Sharma, Abhinav; Yu, Yun William.

J Am Med Inform Assoc ; 28(1): 193-195, 2021 01 15.

Article in English | MEDLINE | ID: mdl-32584990

ABSTRACT

Recently, there have been many efforts to use mobile apps as an aid in contact tracing to control the spread of the SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) (COVID-19 [coronavirus disease 2019]) pandemic. However, although many apps aim to protect individual privacy, the very nature of contact tracing must reveal some otherwise protected personal information. Digital contact tracing has endemic privacy risks that cannot be removed by technological means, and which may require legal or economic solutions. In this brief communication, we discuss a few of these inherent privacy limitations of any decentralized automatic contact tracing system.

Subjects

COVID-19, Contact Tracing/legislation & jurisprudence, Mobile Applications/legislation & jurisprudence, Privacy, COVID-19/epidemiology, Canada, Contact Tracing/ethics, Contact Tracing/methods, Humans, Mobile Applications/ethics, United States

18.

Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation.

Yu, Yun William; Weber, Griffin M.

J Med Internet Res ; 22(11): e18735, 2020 11 03.

Article in English | MEDLINE | ID: mdl-33141090

ABSTRACT

BACKGROUND: Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks. OBJECTIVE: This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed. METHODS: We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques. RESULTS: In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches, which guaranteed the exact answer. CONCLUSIONS: Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.

Subjects

Data Accuracy, Research Design/standards, Algorithms, Humans, Privacy, Reproducibility of Results
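
Illustrative note: the abstract above builds on HyperLogLog to count patients across sites without double counting. The sketch below shows only the plain HyperLogLog part, with the standard bias constant and small-range correction, and is not the paper's protocol: the hashing, masking, and homomorphic encryption layers are omitted, and the hash function, register count, and function names are assumptions for illustration. The key property is that the register-wise maximum of two sites' sketches equals the sketch of the union of their patients.

```python
import hashlib
import math


def hash64(item):
    """Deterministic 64-bit hash of an identifier."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hll_sketch(items, p=10):
    """Build 2**p registers; each stores the largest leading-zero rank seen."""
    regs = [0] * (1 << p)
    for x in items:
        hv = hash64(x)
        bucket = hv >> (64 - p)                  # top p bits choose the register
        rest = hv & ((1 << (64 - p)) - 1)        # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        regs[bucket] = max(regs[bucket], rank)
    return regs


def hll_union(a, b):
    """Sketch of the union: register-wise maximum (duplicates collapse)."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(regs):
    """Standard HyperLogLog cardinality estimate (alpha valid for m >= 128)."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)  # small-range (linear counting) correction
    return raw
```

For example, if two hospitals each sketch their matching patient identifiers and the hub merges the sketches with `hll_union`, patients seen at both sites are counted once, whereas adding the two aggregate counts would count them twice.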

19.

The need for privacy with public digital contact tracing during the COVID-19 pandemic.

Bengio, Yoshua; Janda, Richard; Yu, Yun William; Ippolito, Daphne; Jarvie, Max; Pilat, Dan; Struck, Brooke; Krastev, Sekoul; Sharma, Abhinav.

Lancet Digit Health ; 2(7): e342-e344, 2020 07.

Article in English | MEDLINE | ID: mdl-32835192

Subjects

COVID-19/prevention & control, Confidentiality, Contact Tracing/ethics, COVID-19/epidemiology, COVID-19/transmission, Contact Tracing/methods, Shared Decision Making, Humans, Informed Consent/ethics, Informed Consent/standards

20.

Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads.

Nazeen, Sumaiya; Yu, Yun William; Berger, Bonnie.

Genome Biol ; 21(1): 47, 2020 02 24.

Article in English | MEDLINE | ID: mdl-32093762

ABSTRACT

Microbial populations exhibit functional changes in response to different ambient environments. Although whole metagenome sequencing promises enough raw data to study those changes, existing tools are limited in their ability to directly compare microbial metabolic function across samples and studies. We introduce Carnelian, an end-to-end pipeline for metabolic functional profiling uniquely suited to finding functional trends across diverse datasets. Carnelian is able to find shared metabolic pathways and concordant functional dysbioses, and to distinguish Enzyme Commission (EC) terms missed by existing methodologies. We demonstrate Carnelian's effectiveness on type 2 diabetes, Crohn's disease, Parkinson's disease, and industrialized and non-industrialized gut microbiome cohorts.

Subjects

Gastrointestinal Microbiome/genetics, Metagenome, Metagenomics/methods, Software, Crohn Disease/genetics, Crohn Disease/microbiology, Type 2 Diabetes Mellitus/genetics, Type 2 Diabetes Mellitus/microbiology, Human Genome, Humans, Metabolic Networks and Pathways, Parkinson Disease/genetics, Parkinson Disease/microbiology
