Results 1 - 20 of 20
1.
Bioinformatics ; 38(6): 1497-1503, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34999766

ABSTRACT

MOTIVATION: CRAM has established itself as a high compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments. RESULTS: With Illumina data CRAM 3.1 is 7-15% smaller than the equivalent CRAM 3.0 file, and 50-70% smaller than the corresponding BAM file. Long-read technology shows more modest compression due to the presence of high-entropy signals. AVAILABILITY AND IMPLEMENTATION: The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate OpenSource HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
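As a rough illustration of what these percentages mean in practice, the sketch below converts the quoted ranges into file sizes. The 100 GB BAM size and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
def size_after_reduction(original_size, percent_smaller):
    """Return the size left after shrinking by percent_smaller percent."""
    return original_size * (1.0 - percent_smaller / 100.0)


bam_gb = 100.0  # hypothetical BAM size, not a figure from the paper

# CRAM 3.1 is quoted as 50-70% smaller than the corresponding BAM file.
cram31_largest = size_after_reduction(bam_gb, 50)
cram31_smallest = size_after_reduction(bam_gb, 70)


def cram31_range_from_cram30(cram30_size):
    """CRAM 3.1 is quoted as 7-15% smaller than the equivalent CRAM 3.0 file."""
    return (size_after_reduction(cram30_size, 15),
            size_after_reduction(cram30_size, 7))


print(f"CRAM 3.1 from a {bam_gb:.0f} GB BAM: "
      f"{cram31_smallest:.0f}-{cram31_largest:.0f} GB")
```

The same helper can be applied to any starting size; real savings depend on the instrument and data, as the abstract notes for long-read technology.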


Subjects
Data Compression; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; DNA; Base Sequence; Software
2.
Gigascience ; 10(2)2021 02 16.
Article in English | MEDLINE | ID: mdl-33594436

ABSTRACT

BACKGROUND: Since the original publication of the VCF and SAM formats, an explosion of software tools have been created to process these data files. To facilitate this a library was produced out of the original SAMtools implementation, with a focus on performance and robustness. The file formats themselves have become international standards under the jurisdiction of the Global Alliance for Genomics and Health. FINDINGS: We present a software library for providing programmatic access to sequencing alignment and variant formats. It was born out of the widely used SAMtools and BCFtools applications. Considerable improvements have been made to the original code plus many new features including newer access protocols, the addition of the CRAM file format, better indexing and iterators, and better use of threading. CONCLUSION: Since the original Samtools release, performance has been considerably improved, with a BAM read-write loop running 5 times faster and BAM to SAM conversion 13 times faster (both using 16 threads, compared to Samtools 0.1.19). Widespread adoption has seen HTSlib downloaded >1 million times from GitHub and conda. The C library has been used directly by an estimated 900 GitHub projects and has been incorporated into Perl, Python, Rust, and R, significantly expanding the number of uses via other languages. HTSlib is open source and is freely available from htslib.org under MIT/BSD license.


Subjects
High-Throughput Nucleotide Sequencing; Reading; Sequence Alignment; Software; Writing
3.
Gigascience ; 10(2)2021 02 16.
Article in English | MEDLINE | ID: mdl-33590861

ABSTRACT

BACKGROUND: SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. FINDINGS: The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. CONCLUSION: Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.


Subjects
High-Throughput Nucleotide Sequencing; Software; Genome; Genomics
4.
Cell Genom ; 1(2)2021 Nov 10.
Article in English | MEDLINE | ID: mdl-35072136

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.

5.
Bioinformatics ; 35(2): 337-339, 2019 01 15.
Article in English | MEDLINE | ID: mdl-29992288

ABSTRACT

Motivation: The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results: On the Syndip test set, a 17 fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2 to 7.4 fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation: Crumble is OpenSource and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information: Supplementary data are available at Bioinformatics online.
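The dependence of the whole-file reduction on non-quality content can be made concrete with a simple model. The sketch below assumes, purely for illustration, that only the quality portion shrinks and that it shrinks by a uniform 17-fold. The function names are hypothetical, and this is arithmetic on the quoted figures rather than Crumble's algorithm.

```python
def overall_fold(quality_fraction, quality_fold=17.0):
    """Whole-file reduction when only the quality portion shrinks.

    quality_fraction is the share of the original file taken by
    quality values (assumed model input, between 0 and 1).
    """
    remaining = (1.0 - quality_fraction) + quality_fraction / quality_fold
    return 1.0 / remaining


def quality_fraction_for(overall, quality_fold=17.0):
    """Invert overall_fold: quality share implied by a whole-file reduction."""
    return (1.0 - 1.0 / overall) * quality_fold / (quality_fold - 1.0)


for fold in (2.2, 7.4):
    share = quality_fraction_for(fold)
    print(f"{fold}-fold overall implies about {share:.0%} quality data")
```

Under this simplified model, the reported 2.2- to 7.4-fold range would correspond to files in which quality values made up roughly 58% to 92% of the original size.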


Subjects
Data Compression; High-Throughput Nucleotide Sequencing
6.
Sci Rep ; 7(1): 3935, 2017 06 21.
Article in English | MEDLINE | ID: mdl-28638050

ABSTRACT

Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.


Subjects
Genome, Fungal; Genomics; Saccharomyces cerevisiae/genetics; Sequence Analysis, DNA; Genome, Mitochondrial; Genomics/instrumentation; Genomics/methods; High-Throughput Nucleotide Sequencing/instrumentation; High-Throughput Nucleotide Sequencing/methods; Reproducibility of Results; Sequence Analysis, DNA/instrumentation; Sequence Analysis, DNA/methods
7.
Nat Methods ; 13(12): 1005-1008, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27776113

ABSTRACT

High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.


Subjects
Computational Biology/methods; Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Animals; Cacao/genetics; Drosophila melanogaster/genetics; Escherichia coli/genetics; Humans; Pseudomonas aeruginosa/genetics
8.
Bioinformatics ; 30(19): 2818-9, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24930138

ABSTRACT

MOTIVATION: The reference CRAM file format implementation is in Java. We present 'Scramble': a new C implementation of SAM, BAM and CRAM file I/O. RESULTS: The C implementation for CRAM is 1.5-1.7× slower than BAM at decoding but 1.8-2.6× faster at encoding. We see file size savings of 34-55%. AVAILABILITY AND IMPLEMENTATION: Source code is available at http://sourceforge.net/projects/staden/files/io_lib/ under the BSD software licence.


Subjects
Computational Biology/methods; Programming Languages; Computers; Escherichia coli/genetics; Genome, Bacterial; Genome, Human; Humans; Information Storage and Retrieval; Software
9.
PLoS One ; 8(3): e59190, 2013.
Article in English | MEDLINE | ID: mdl-23533605

ABSTRACT

Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.


Subjects
Computational Biology/methods; Data Compression/methods; Sequence Analysis, DNA/methods; Software
10.
Bioinformatics ; 26(14): 1699-703, 2010 Jul 15.
Article in English | MEDLINE | ID: mdl-20513662

ABSTRACT

MOTIVATION: Existing sequence assembly editors struggle with the volumes of data now readily available from the latest generation of DNA sequencing instruments. RESULTS: We describe the Gap5 software along with the data structures and algorithms used that allow it to be scalable. We demonstrate this with an assembly of 1.1 billion sequence fragments and compare the performance with several other programs. We analyse the memory, CPU, I/O usage and file sizes used by Gap5. AVAILABILITY AND IMPLEMENTATION: Gap5 is part of the Staden Package and is available under an Open Source licence from http://staden.sourceforge.net. It is implemented in C and Tcl/Tk. Currently it works on Unix systems only.


Subjects
Sequence Analysis, DNA/methods; Software; Base Sequence; Databases, Factual; Sequence Alignment; User-Computer Interface
11.
Nucleic Acids Res ; 38(Database issue): D39-45, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19906712

ABSTRACT

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe's primary nucleotide sequence archival resource, safeguarding open nucleotide data access, engaging in worldwide collaborative data exchange and integrating with the scientific publication process. ENA has made significant contributions to the collaborative nucleotide archival arena as an active proponent of extending the traditional collaboration to cover capillary and next-generation sequencing information. We have continued to co-develop data and metadata representation formats with our collaborators for both data exchange and public data dissemination. In addition to the DDBJ/EMBL/GenBank feature table format, we share metadata formats for capillary and next-generation sequencing traces and are using and contributing to the NCBI SRA Toolkit for the long-term storage of the next-generation sequence traces. During the course of 2009, ENA has significantly improved sequence submission, search and access functionalities provided at EMBL-EBI. In this article, we briefly describe the content and scope of our archive and introduce major improvements to our services.


Subjects
Computational Biology/methods; Databases, Genetic; Databases, Nucleic Acid; Access to Information; Algorithms; Animals; Computational Biology/trends; DNA/genetics; Europe; Humans; Information Storage and Retrieval/methods; Internet; Software
12.
Genomics ; 95(2): 105-10, 2010 Feb.
Article in English | MEDLINE | ID: mdl-19909804

ABSTRACT

Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting beta-cells by an autoimmune driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments.


Subjects
Chromosomes, Artificial, Bacterial/genetics; Genome; Animals; Chromosomes, Artificial, Bacterial/metabolism; DNA, Complementary/metabolism; Male; Mice; Mice, Inbred NOD; Mice, Inbred Strains; Molecular Sequence Data; Sequence Analysis, DNA
13.
Nucleic Acids Res ; 37(Database issue): D19-25, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18978013

ABSTRACT

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data, a brief summary of contents of the ENA and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.


Subjects
Databases, Nucleic Acid; Sequence Analysis/trends; Internet; Systems Integration
14.
Nucleic Acids Res ; 36(Database issue): D5-12, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18039715

ABSTRACT

The Ensembl Trace Archive (http://trace.ensembl.org/) and the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), known together as the European Nucleotide Archive, continue to see growth in data volume and diversity. Selected major developments of 2007 are presented briefly, along with data submission and retrieval information. In the face of increasing requirements for nucleotide trace, sequence and annotation data archiving, data capture priority decisions have been taken at the European Nucleotide Archive. Priorities are discussed in terms of how reliably information can be captured, the long-term benefits of its capture and the ease with which it can be captured.


Subjects
Databases, Nucleic Acid; Sequence Analysis, DNA; Animals; Archives; Genomics; Internet
15.
Genomics ; 86(6): 753-8, 2005 Dec.
Article in English | MEDLINE | ID: mdl-16257172

ABSTRACT

The majority of gene-targeting experiments in mice are performed in 129Sv-derived embryonic stem (ES) cell lines, which are generally considered to be more reliable at colonizing the germ line than ES cells derived from other strains. Gene targeting is reliant on homologous recombination of a targeting vector with the host ES cell genome. The efficiency of recombination is affected by many factors, including the isogenicity (H. te Riele et al., 1992, Proc. Natl. Acad. Sci. USA 89, 5128-5132) and the length of homologous sequence of the targeting vector and the location of the target locus. Here we describe the double-end sequencing and mapping of 84,507 bacterial artificial chromosomes (BACs) generated from AB2.2 ES cell DNA (129S7/SvEvBrd-Hprtb-m2). We have aligned these BACs against the mouse genome and displayed them on the Ensembl genome browser, DAS: 129S7/AB2.2. This library has an average insert size of 110.68 kb and average depth of genome coverage of 3.63- and 1.24-fold across the autosomes and sex chromosomes, respectively. Over 97% of the mouse genome and 99.1% of Ensembl genes are covered by clones from this library. This publicly available BAC resource can be used for the rapid construction of targeting vectors via recombineering. Furthermore, we show that targeting vectors containing DNA recombineered from this BAC library can be used to target genes efficiently in several 129-derived ES cell lines.


Subjects
Chromosomes, Artificial, Bacterial/genetics; Gene Library; Gene Targeting/methods; Genetic Vectors/genetics; Mice/genetics; Animals; Base Sequence; Chromosome Mapping; Genomics/methods; Molecular Sequence Data; Polymorphism, Single Nucleotide/genetics; Sequence Analysis, DNA; Stem Cells/cytology
16.
Nucleic Acids Res ; 33(18): e152, 2005 Oct 12.
Article in English | MEDLINE | ID: mdl-16221968

ABSTRACT

Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time consuming and are not scaleable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly-characterized genomes as it requires no prior information on sequence variation.


Subjects
Alleles; Genetic Carrier Screening/methods; Genetic Variation; Sequence Analysis, DNA/methods; Algorithms; Base Sequence; Haplotypes; Humans; Male; Molecular Sequence Data; Polymerase Chain Reaction
17.
Nat Genet ; 37(5): 532-6, 2005 May.
Article in English | MEDLINE | ID: mdl-15852006

ABSTRACT

Inbred mouse strains provide the foundation for mouse genetics. By selecting for phenotypic features of interest, inbreeding drives genomic evolution and eliminates individual variation, while fixing certain sets of alleles that are responsible for the trait characteristics of the strain. Mouse strains 129Sv (129S5) and C57BL/6J, two of the most widely used inbred lines, diverged from common ancestors within the last century, yet very little is known about the genomic differences between them. By comparative genomic hybridization and sequence analysis of 129S5 short insert libraries, we identified substantial structural variation, a complex fine-scale haplotype pattern with a continuous distribution of diversity blocks, and extensive nucleotide variation, including nonsynonymous coding SNPs and stop codons. Collectively, these genomic changes denote the level and direction of allele fixation that has occurred during inbreeding and provide a basis for defining what makes these mouse strains unique.


Subjects
Gene Dosage; Genetic Variation; Haplotypes; Polymorphism, Genetic; Animals; Genome; Mice; Mice, Inbred C57BL; Molecular Sequence Data
18.
Genome Res ; 15(1): 174-83, 2005 Jan.
Article in English | MEDLINE | ID: mdl-15590942

ABSTRACT

We present an analysis of the chicken (Gallus gallus) transcriptome based on the full insert sequences for 19,626 cDNAs, combined with 485,337 EST sequences. The cDNA data set has been functionally annotated and describes a minimum of 11,929 chicken coding genes, including the sequence for 2260 full-length cDNAs together with a collection of noncoding (nc) cDNAs that have been stringently filtered to remove untranslated regions of coding mRNAs. The combined collection of cDNAs and ESTs describe 62,546 clustered transcripts and provide transcriptional evidence for a total of 18,989 chicken genes, including 88% of the annotated Ensembl gene set. Analysis of the ncRNAs reveals a set that is highly conserved in chickens and mammals, including sequences for 14 pri-miRNAs encoding 23 different miRNAs. The data sets described here provide a transcriptome toolkit linked to physical clones for bioinformaticians and experimental biologists who wish to use chicken systems as a low-cost, accessible alternative to mammals for the analysis of vertebrate development, immunology, and cell biology.


Subjects
Chickens/genetics; DNA, Complementary/genetics; Expressed Sequence Tags; Gene Library; Transcription, Genetic/genetics; Animals; Cloning, Molecular/methods; Computational Biology/methods; DNA, Complementary/physiology; Humans; MicroRNAs/genetics; RNA, Untranslated/genetics; Sequence Alignment/methods; Sequence Analysis, DNA/methods
19.
Bioinformatics ; 18(1): 3-10, 2002 Jan.
Article in English | MEDLINE | ID: mdl-11836205

ABSTRACT

MOTIVATION: To produce an open and extensible file format for DNA trace data which produces compact files suitable for large-scale storage and efficient use of internet bandwidth. RESULTS: We have created an extensible format named ZTR. For a set of data taken from an ABI-3700 the ZTR format produces trace files which require 61.6% of the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression algorithms used for the trace amplitudes are used within the National Center for Biotechnology Information (NCBI) trace archive. lmb.cam.ac.uk/pub/staden/io_lib/test_data.
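The abstract does not spell out the compression algorithms, so the following is only a generic sketch of one common idea for smooth signals such as trace amplitudes: store the differences between successive samples before applying a general-purpose compressor. The sample values, the 16-bit layout, and the function names are assumptions for illustration; this is not the ZTR encoding itself.

```python
import struct
import zlib


def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    prev = 0
    deltas = []
    for value in samples:
        deltas.append(value - prev)
        prev = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    total = 0
    samples = []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


def pack16(values):
    """Pack integers as signed 16-bit big-endian (assumed layout)."""
    return struct.pack(f">{len(values)}h", *values)


# Hypothetical peak-shaped trace, repeated to give the compressor some data.
trace = [0, 3, 9, 20, 38, 60, 79, 90, 92, 85, 70, 50, 31, 16, 7, 2] * 64

plain_size = len(zlib.compress(pack16(trace)))
delta_size = len(zlib.compress(pack16(delta_encode(trace))))

# The transform is lossless: decoding restores the trace exactly.
assert delta_decode(delta_encode(trace)) == trace
print(f"plain: {plain_size} bytes, delta: {delta_size} bytes")
```

Whether the differenced form compresses better depends on the data, so the printed sizes are best compared on real traces rather than on this synthetic one.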


Subjects
Database Management Systems; Databases, Nucleic Acid/statistics & numerical data; Sequence Analysis, DNA/statistics & numerical data; Algorithms; Computational Biology; Human Genome Project; Humans; Software
20.
Bioinformatics ; 18(1): 194-5, 2002 Jan.
Article in English | MEDLINE | ID: mdl-11836229

ABSTRACT

Trev is a DNA trace editor and viewer, which is available free for UNIX and Microsoft Windows platforms. It can read all the commonly used file formats, including the new, compact ZTR files.


Subjects
Sequence Analysis, DNA/statistics & numerical data; Software; Computational Biology; Computer Graphics; Data Display; Electronic Data Processing