Results 1 - 20 of 58
1.
Nature ; 617(7960): 312-324, 2023 05.
Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.


Subjects
Human Genome , Genomics , Humans , Diploidy , Human Genome/genetics , Haplotypes/genetics , DNA Sequence Analysis , Genomics/standards , Reference Standards , Cohort Studies , Alleles , Genetic Variation
2.
Nat Methods ; 2024 Oct 21.
Article in English | MEDLINE | ID: mdl-39433878

ABSTRACT

Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.

3.
Bioinformatics ; 2024 Oct 14.
Article in English | MEDLINE | ID: mdl-39400346

ABSTRACT

MOTIVATION: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. RESULTS: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core's best practices. Leveraging biocontainers ensures portability and seamless deployment in HPC environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 E. coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions. AVAILABILITY: Nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/1.1.2/docs/usage. SUPPLEMENTARY: Supplementary data are available at Bioinformatics online.

4.
Bioinformatics ; 40(7)2024 07 01.
Article in English | MEDLINE | ID: mdl-38960860

ABSTRACT

MOTIVATION: The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in a low-dimensional (e.g. two-dimensional) depiction. Due to a pangenome graph's potentially excessive size, this is a significant challenge. RESULTS: In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. AVAILABILITY AND IMPLEMENTATION: We integrated PG-SGD in ODGI which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.


Subjects
Algorithms , Software , Humans , Genomics/methods , Computer Graphics , Genome
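
The sampling idea described in this abstract can be illustrated with a toy sketch. It is not the ODGI implementation: the function names, the 2D start layout, the annealing schedule, and the use of node start offsets as "genomic distance" are all assumptions made for illustration. The update rule is the generic pairwise step used in graph drawing by SGD; the only path-guided part is that node pairs and their target distances are drawn from paths rather than from all-pairs shortest paths.

```python
import math
import random


def path_offsets(node_lengths, path):
    """Nucleotide offset at which each step of a path starts."""
    offsets, position = [], 0
    for node in path:
        offsets.append(position)
        position += node_lengths[node]
    return offsets


def path_guided_layout(node_lengths, paths, iterations=30,
                       samples_per_iteration=200, eps=0.01, seed=0):
    """Toy 2D layout; every path must contain at least two steps."""
    rng = random.Random(seed)
    # Assumed start: nodes in graph order along x, slight jitter on y.
    layout = [[float(i), rng.uniform(-0.5, 0.5)]
              for i in range(len(node_lengths))]
    offsets = [path_offsets(node_lengths, p) for p in paths]

    # Assumed annealing schedule: pair weight is 1/d^2, step size decays
    # exponentially from 1/w_min down to eps/w_max.
    d_max = max(o[-1] for o in offsets)
    d_min = min(node_lengths)
    eta_max = float(d_max) ** 2
    eta_min = eps * float(d_min) ** 2
    decay = math.log(eta_max / eta_min) / max(iterations - 1, 1)

    for t in range(iterations):
        eta = eta_max * math.exp(-decay * t)
        for _ in range(samples_per_iteration):
            # Pick a path, then two distinct steps on that path.
            p = rng.randrange(len(paths))
            i, j = rng.sample(range(len(paths[p])), 2)
            a, b = paths[p][i], paths[p][j]
            if a == b:  # a path may visit the same node twice
                continue
            # Target distance is read off the path, not computed on the graph.
            target = abs(offsets[p][i] - offsets[p][j])
            dx = layout[a][0] - layout[b][0]
            dy = layout[a][1] - layout[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            mu = min(1.0, eta / target ** 2)
            shift = mu * (dist - target) / (2.0 * dist)
            layout[a][0] -= shift * dx
            layout[a][1] -= shift * dy
            layout[b][0] += shift * dx
            layout[b][1] += shift * dy
    return layout


# Hypothetical graph: four nodes of 3, 5, 2 and 4 bp on one linear path.
example_layout = path_guided_layout([3, 5, 2, 4], [[0, 1, 2, 3]])
```

Because each update touches only one sampled pair, the cost per iteration depends on the number of samples rather than on the square of the node count, which is the property the abstract highlights.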
5.
Bioinformatics ; 38(13): 3319-3326, 2022 06 27.
Article in English | MEDLINE | ID: mdl-35552372

ABSTRACT

MOTIVATION: Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult as it is not well-supported by existing tools. Hence, fast and versatile software is required to ask advanced questions to such data in an efficient way. RESULTS: We wrote Optimized Dynamic Genome/Graph Implementation (ODGI), a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. ODGI includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs. AVAILABILITY AND IMPLEMENTATION: ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/pangenome/odgi/blob/master/guix.scm. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome , Software , Genomics , Algorithms , Documentation
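
As a minimal sketch of what a variation graph in Graphical Fragment Assembly (GFA) format looks like, the snippet below reads the three GFA v1 record types that matter here (S for segments, L for links, P for paths) and spells out the sequence of each path. It is plain Python and does not use ODGI; the tiny graph, the path names and the helper functions are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_gfa(text):
    """Collect segments, links and paths from GFA v1 text."""
    segments, links, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":    # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":  # L <from> <orient> <to> <orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":  # P <name> <seg+,seg-,...> <overlaps>
            paths[fields[1]] = [(step[:-1], step[-1])
                                for step in fields[2].split(",")]
    return segments, links, paths


def path_sequence(segments, steps):
    """Concatenate segment sequences along a path, honouring orientation."""
    return "".join(
        segments[name] if orient == "+" else reverse_complement(segments[name])
        for name, orient in steps
    )


# Hypothetical graph: a single-base bubble (T versus A) between two anchors.
EXAMPLE_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tA",
    "S\t4\tGGC",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tref\t1+,2+,4+\t*",
    "P\talt\t1+,3+,4+\t*",
])

segments, links, paths = parse_gfa(EXAMPLE_GFA)
for name, steps in paths.items():
    print(name, path_sequence(segments, steps))
```

The two paths share segments 1 and 4 and differ only in the middle segment, which is how a variation graph stores a variant once while keeping every genome as a walk through the graph.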
6.
New Phytol ; 237(6): 2360-2374, 2023 03.
Article in English | MEDLINE | ID: mdl-36457296

ABSTRACT

To establish persistent infections in host plants, herbivorous invaders, such as root-knot nematodes, must rely on effectors for suppressing damage-induced jasmonate-dependent host defenses. However, at present, the effector mechanisms targeting the biosynthesis of biologically active jasmonates to avoid adverse host responses are unknown. Using yeast two-hybrid, in planta co-immunoprecipitation, and mutant analyses, we identified 12-oxophytodienoate reductase 2 (OPR2) as an important host target of the stylet-secreted effector MiMSP32 of the root-knot nematode Meloidogyne incognita. MiMSP32 has no informative sequence similarities with other functionally annotated genes but was selected for the discovery of novel effector mechanisms based on evidence of positive, diversifying selection. OPR2 catalyzes the conversion of a derivative of 12-oxophytodienoate to jasmonic acid (JA) and operates parallel to 12-oxophytodienoate reductase 3 (OPR3), which controls the main pathway in the biosynthesis of jasmonates. We show that MiMSP32 targets OPR2 to promote parasitism of M. incognita in host plants independent of OPR3-mediated JA biosynthesis. Artificially manipulating the conversion of the 12-oxophytodienoate by OPRs increases susceptibility to multiple unrelated plant invaders. Our study is the first to shed light on a novel effector mechanism targeting this process to regulate the susceptibility of host plants.


Subjects
Oxidoreductases Acting on CH-CH Group Donors , Tylenchoidea , Animals , Oxidoreductases Acting on CH-CH Group Donors/metabolism , Oxidoreductases/metabolism , Biological Transport , Tylenchoidea/physiology , Plant Diseases
7.
PLoS Comput Biol ; 18(5): e1009123, 2022 05.
Article in English | MEDLINE | ID: mdl-35639788

ABSTRACT

Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. VCF can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformations of variant files. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how we can address more complex variation through pangenome graph formats, variation that cannot easily be represented by VCF.


Subjects
Ecosystem , Genetic Variation , Computational Biology , Genetic Variation/genetics , Nucleotides , Software
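
To make the variant classes named in this abstract concrete, here is a minimal sketch that reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and labels each alternate allele by comparing allele lengths. It uses none of the listed tools; the function names and the length-only classification rule are simplifying assumptions, and symbolic or breakend alleles are only flagged, not interpreted.

```python
def classify_allele(ref, alt):
    """Label one REF/ALT pair by allele length only (a simplification)."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "symbolic/other"  # e.g. <DEL>, spanning deletion, breakend
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf_line(line):
    """Return one (chrom, pos, ref, alt, label) tuple per alternate allele."""
    chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
    return [(chrom, int(pos), ref, alt, classify_allele(ref, alt))
            for alt in alts.split(",")]


# Hypothetical record with two alternate alleles at the same site.
record = "chr1\t100\t.\tA\tG,AT\t50\tPASS\t."
for variant in parse_vcf_line(record):
    print(variant)
```

Every record is anchored to a single reference coordinate, which is the constraint the abstract points to when it says that more complex variation is hard to express in VCF.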
8.
J Neurosci ; 41(5): 927-936, 2021 02 03.
Article in English | MEDLINE | ID: mdl-33472826

ABSTRACT

High digital connectivity and a focus on reproducibility are contributing to an open science revolution in neuroscience. Repositories and platforms have emerged across the whole spectrum of subdisciplines, paving the way for a paradigm shift in the way we share, analyze, and reuse vast amounts of data collected across many laboratories. Here, we describe how open access web-based tools are changing the landscape and culture of neuroscience, highlighting six free resources that span subdisciplines from behavior to whole-brain mapping, circuits, neurons, and gene variants.


Subjects
Access to Information , Brain/physiology , Internet/trends , Nerve Net/physiology , Neurons/physiology , Animals , Brain/cytology , Datasets as Topic/trends , Gene Regulatory Networks/physiology , Humans , Nerve Net/cytology
9.
Nature ; 538(7624): 260-264, 2016 Oct 13.
Article in English | MEDLINE | ID: mdl-27698416

ABSTRACT

The gradual accumulation of genetic mutations in human adult stem cells (ASCs) during life is associated with various age-related diseases, including cancer. Extreme variation in cancer risk across tissues was recently proposed to depend on the lifetime number of ASC divisions, owing to unavoidable random mutations that arise during DNA replication. However, the rates and patterns of mutations in normal ASCs remain unknown. Here we determine genome-wide mutation patterns in ASCs of the small intestine, colon and liver of human donors with ages ranging from 3 to 87 years by sequencing clonal organoid cultures derived from primary multipotent cells. Our results show that mutations accumulate steadily over time in all of the assessed tissue types, at a rate of approximately 40 novel mutations per year, despite the large variation in cancer incidence among these tissues. Liver ASCs, however, have different mutation spectra compared to those of the colon and small intestine. Mutational signature analysis reveals that this difference can be attributed to spontaneous deamination of methylated cytosine residues in the colon and small intestine, probably reflecting their high ASC division rate. In liver, a signature with an as-yet-unknown underlying mechanism is predominant. Mutation spectra of driver genes in cancer show high similarity to the tissue-specific ASC mutation spectra, suggesting that intrinsic mutational processes in ASCs can initiate tumorigenesis. Notably, the inter-individual variation in mutation rate and spectra are low, suggesting tissue-specific activity of common mutational processes throughout life.


Subjects
Adult Stem Cells/metabolism , Aging/genetics , Mutation Accumulation , Mutation Rate , Organ Specificity , Adolescent , Adult , Aged , Aged 80 and over , Animals , Child , Preschool Child , Colon/metabolism , DNA Mutational Analysis , Female , Neoplasm Genes/genetics , Humans , Incidence , Small Intestine/metabolism , Liver/metabolism , Male , Mice , Middle Aged , Multipotent Stem Cells/metabolism , Neoplasms/epidemiology , Neoplasms/genetics , Organoids/metabolism , Point Mutation/genetics , Young Adult
10.
Plant J ; 89(6): 1225-1235, 2017 Mar.
Article in English | MEDLINE | ID: mdl-27995664

ABSTRACT

Genetical genomics studies uncover genome-wide genetic interactions between genes and their transcriptional regulators. High-throughput measurement of gene expression in recombinant inbred line populations has enabled investigation of the genetic architecture of variation in gene expression. This has the potential to enrich our understanding of the molecular mechanisms affected by and underlying natural variation. Moreover, it contributes to the systems biology of natural variation, as a substantial number of experiments have resulted in a valuable amount of interconnectable phenotypic, molecular and genotypic data. A number of genetical genomics studies have been published for Arabidopsis thaliana, uncovering many expression quantitative trait loci (eQTLs). However, these complex data are not easily accessible to the plant research community, leaving most of the valuable genetic interactions unexplored as cross-analysis of these studies is a major effort. We address this problem with AraQTL (http://www.bioinformatics.nl/AraQTL/), an easily accessible workbench and database for comparative analysis and meta-analysis of all published Arabidopsis eQTL datasets. AraQTL provides a workbench for comparing, re-using and extending upon the results of these experiments. For example, one can easily screen a physical region for specific local eQTLs that could harbour candidate genes for phenotypic QTLs, or detect gene-by-environment interactions by comparing eQTLs under different conditions.


Subjects
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Quantitative Trait Loci/genetics , Plant Gene Expression Regulation/genetics , Systems Biology , Genetic Transcription/genetics
11.
PLoS Genet ; 11(10): e1005619, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26510153

ABSTRACT

Macrophages display flexible activation states that range between pro-inflammatory (classical activation) and anti-inflammatory (alternative activation). These macrophage polarization states contribute to a variety of organismal phenotypes such as tissue remodeling and susceptibility to infectious and inflammatory diseases. Several macrophage- or immune-related genes have been shown to modulate infectious and inflammatory disease pathogenesis. However, the potential role that differences in macrophage activation phenotypes play in modulating differences in susceptibility to infectious and inflammatory disease is just emerging. We integrated transcriptional profiling and linkage analyses to determine the genetic basis for the differential murine macrophage response to inflammatory stimuli and to infection with the obligate intracellular parasite Toxoplasma gondii. We show that specific transcriptional programs, defined by distinct genomic loci, modulate macrophage activation phenotypes. In addition, we show that the difference between AJ and C57BL/6J macrophages in controlling Toxoplasma growth after stimulation with interferon gamma and tumor necrosis factor alpha mapped to chromosome 3, proximal to the Guanylate binding protein (Gbp) locus that is known to modulate the murine macrophage response to Toxoplasma. Using an shRNA-knockdown strategy, we show that the transcript levels of an RNA helicase, Ddx1, regulate strain differences in the amount of nitric oxide produced by macrophages after stimulation with interferon gamma and tumor necrosis factor. Our results provide a template for discovering candidate genes that modulate macrophage-mediated complex traits.


Subjects
DEAD-box RNA Helicases/genetics , Inflammation/genetics , Macrophage Activation/genetics , Toxoplasmosis/genetics , Genetic Transcription , Animals , Genetic Association Studies , Genetic Linkage , Inflammation/microbiology , Inflammation/pathology , Interferon-gamma/administration & dosage , Interferon-gamma/genetics , Macrophages/microbiology , Macrophages/pathology , Mice , Toxoplasma/pathogenicity , Toxoplasmosis/microbiology , Toxoplasmosis/pathology , Tumor Necrosis Factor-alpha/administration & dosage , Tumor Necrosis Factor-alpha/genetics
12.
Bioinformatics ; 31(12): 2032-4, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-25697820

ABSTRACT

Sambamba is a high-performance robust tool and library for working with SAM, BAM and CRAM sequence alignment files, the most common file formats for aligned next generation sequencing data. Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time. Sambamba is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability. AVAILABILITY AND IMPLEMENTATION: Sambamba is free and open source software, available under a GPLv2 license. Sambamba can be downloaded and installed from http://www.open-bio.org/wiki/Sambamba. Sambamba v0.5.0 was released with doi:10.5281/zenodo.13200.


Subjects
High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Software , Algorithms , Genomics , Humans , Sequence Alignment
14.
BMC Bioinformatics ; 15 Suppl 14: S7, 2014.
Article in English | MEDLINE | ID: mdl-25472764

ABSTRACT

BACKGROUND: Computational biology comprises a wide range of technologies and approaches. Multiple technologies can be combined to create more powerful workflows if the individuals contributing the data or providing tools for its interpretation can find mutual understanding and consensus. Much conversation and joint investigation are required in order to identify and implement the best approaches. Traditionally, scientific conferences feature talks presenting novel technologies or insights, followed up by informal discussions during coffee breaks. In multi-institution collaborations, in order to reach agreement on implementation details or to transfer deeper insights in a technology and practical skills, a representative of one group typically visits the other. However, this does not scale well when the number of technologies or research groups is large. Conferences have responded to this issue by introducing Birds-of-a-Feather (BoF) sessions, which offer an opportunity for individuals with common interests to intensify their interaction. However, parallel BoF sessions often make it hard for participants to join multiple BoFs and find common ground between the different technologies, and BoFs are generally too short to allow time for participants to program together. RESULTS: This report summarises our experience with computational biology Codefests, Hackathons and Sprints, which are interactive developer meetings. They are structured to reduce the limitations of traditional scientific meetings described above by strengthening the interaction among peers and letting the participants determine the schedule and topics. These meetings are commonly run as loosely scheduled "unconferences" (self-organized identification of participants and topics for meetings) over at least two days, with early introductory talks to welcome and organize contributors, followed by intensive collaborative coding sessions. 
We summarise some prominent achievements of those meetings and describe differences in how these are organised, how their audience is addressed, and their outreach to their respective communities. CONCLUSIONS: Hackathons, Codefests and Sprints share a stimulating atmosphere that encourages participants to jointly brainstorm and tackle problems of shared interest in a self-driven proactive environment, as well as providing an opportunity for new participants to get involved in collaborative projects.


Subjects
Computational Biology , Cooperative Behavior , Software , Communication , Internet
15.
Brief Bioinform ; 13(2): 135-42, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22396485

ABSTRACT

During a meeting of the SYSGENET working group 'Bioinformatics', currently available software tools and databases for systems genetics in mice were reviewed and the needs for future developments discussed. The group evaluated interoperability and performed initial feasibility studies. To aid future compatibility of software and exchange of already developed software modules, a strong recommendation was made by the group to integrate HAPPY and R/qtl analysis toolboxes, GeneNetwork and XGAP database platforms, and TIQS and xQTL processing platforms. R should be used as the principal computer language for QTL data analysis in all platforms and a 'cloud' should be used for software dissemination to the community. Furthermore, the working group recommended that all data models and software source code should be made visible in public repositories to allow a coordinated effort on the use of common data structures and file formats.


Subjects
Computational Biology/methods , Factual Databases , Algorithms , Animals , Gene Regulatory Networks , Mice/genetics , Quantitative Trait Loci , Software
16.
bioRxiv ; 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38260597

ABSTRACT

The HXB/BXH family of recombinant inbred rat strains is a unique genetic resource that has been extensively phenotyped over 25 years, resulting in a vast dataset of quantitative molecular and physiological phenotypes. We built a pangenome graph from 10x Genomics Linked-Read data for 31 recombinant inbred rats to study genetic variation and association mapping. The pangenome includes 0.2 Gb of sequence that is not present in the reference mRatBN7.2, confirming the capture of substantial additional variation. We validated variants in challenging regions, including complex structural variants resolving into multiple haplotypes. Phenome-wide association analysis of validated SNPs uncovered variants associated with glucose/insulin levels and hippocampal gene expression. We propose an interaction between Pirl1l1, chromogranin expression, TNF-α levels, and insulin regulation. This study demonstrates the utility of linked-read pangenomes for comprehensive variant detection and mapping phenotypic diversity in a widely used rat genetic reference panel.

17.
bioRxiv ; 2024 Sep 24.
Article in English | MEDLINE | ID: mdl-39386557

ABSTRACT

The public availability of genome datasets, such as The Human Genome Project (HGP), The 1000 Genomes Project, The Cancer Genome Atlas, and the International HapMap Project, has significantly advanced scientific research and medical understanding. Here our goal is to share such genomic information for downstream analysis while protecting the privacy of individuals through Differential Privacy (DP). We introduce synthetic DNA data generation based on pangenomes in combination with Pretrained-Language Models (PTLMs). We introduce two novel tokenization schemes based on pangenome graphs to enhance the modeling of DNA. We evaluated these tokenization methods, and compared them with classical single nucleotide and k-mer tokenizations. We find k-mer tokenization schemes, indicating that our tokenization schemes boost the model's performance consistency with long effective context length (covering longer sequences with the same number of tokens). Additionally, we propose a method to utilize the pangenome graph and make it comply with DP privacy standards. We assess the performance of DP training on the quality of generated sequences with discussion of the trade-offs between privacy and model accuracy. The source code for our work will be published under a free and open source license soon.
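
The two classical baselines mentioned in this abstract, single-nucleotide and k-mer tokenization, can be sketched as follows. This does not reproduce the pangenome-graph tokenization schemes the authors introduce; the function names and the `stride` parameter are assumptions for illustration.

```python
def nucleotide_tokens(seq):
    """One token per base."""
    return list(seq)


def kmer_tokens(seq, k, stride=None):
    """Fixed-length k-mers; stride defaults to k (non-overlapping).

    A trailing fragment shorter than k is dropped.
    """
    stride = k if stride is None else stride
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


sequence = "ACGTACGT"
print(nucleotide_tokens(sequence))       # 8 tokens for 8 bases
print(kmer_tokens(sequence, 4))          # 2 tokens for the same 8 bases
print(kmer_tokens("ACGTA", 3, stride=1)) # overlapping 3-mers
```

With non-overlapping k-mers, the same number of tokens spans k times as many bases, which is the sense of "effective context length" used in the abstract.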

18.
bioRxiv ; 2024 Oct 21.
Article in English | MEDLINE | ID: mdl-39463999

ABSTRACT

We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval augmented generation (RAG) with a focus on aging, dementia, Alzheimer's and diabetes. We uploaded a corpus of three thousand peer reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT 'hallucinations', we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information we collected evaluations and feedback from both domain expert users and 'citizen scientists' on the relevance of GPT responses. A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. When evaluating the responses to their questions, human respondents give a "thumbs-up" 76% of the time. Meanwhile, RAGAS scores 90% on answer relevance on questions posed by experts. And when GPT generates questions, RAGAS scores 74% on answer relevance. With RAGAS we created a benchmark that can be used to continuously assess the performance of our knowledge base. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and human. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.

19.
Cell Genom ; 4(4): 100527, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38537634

ABSTRACT

The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors approximately 9-fold, and increases contiguity 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomics datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.


Subjects
Genome , Genomics , Rats , Animals , Genome/genetics , Molecular Sequence Annotation , Whole Genome Sequencing , Genetic Variation/genetics
20.
BMC Genomics ; 14 Suppl 2: S8, 2013.
Article in English | MEDLINE | ID: mdl-23445565

ABSTRACT

BACKGROUND: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. RESULTS: We present an efficient method for calculating file uniqueness for large scientific data files, that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations. CONCLUSIONS: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.


Subjects
Algorithms , Electronic Data Processing/methods , Information Storage and Retrieval/methods , Software
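
The sampling idea behind probabilistic fingerprinting in this last abstract can be sketched in a few lines. This is not the pfff tool or its C library: the function name, sample count, block size, fixed seed and the choice of SHA-256 are all assumptions. The sketch hashes the file size plus a fixed number of small blocks read at pseudo-random offsets, so the amount of data read does not grow with file size.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, samples=64, block=16, seed=42):
    """Fingerprint a file from its size and a few sampled blocks."""
    size = os.path.getsize(path)
    digest = hashlib.sha256(str(size).encode())
    # A fixed seed gives the same offsets for every file of the same size,
    # so identical files always produce identical fingerprints.
    rng = random.Random(seed)
    with open(path, "rb") as handle:
        if size <= samples * block:
            digest.update(handle.read())  # small file: read it in full
        else:
            offsets = sorted(rng.randrange(size - block + 1)
                             for _ in range(samples))
            for offset in offsets:
                handle.seek(offset)
                digest.update(handle.read(block))
    return digest.hexdigest()
```

Two files that differ only in bytes that were never sampled would receive the same fingerprint, so a scheme like this relies on the data varying throughout the file, which is the property of biological data the abstract says the method exploits.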