Results 1 - 20 of 20
1.
Bioinformatics ; 38(6): 1497-1503, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34999766

ABSTRACT

MOTIVATION: CRAM has established itself as a high compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments. RESULTS: With Illumina data CRAM 3.1 is 7-15% smaller than the equivalent CRAM 3.0 file, and 50-70% smaller than the corresponding BAM file. Long-read technology shows more modest compression due to the presence of high-entropy signals. AVAILABILITY AND IMPLEMENTATION: The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate OpenSource HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
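To make the percentages in this abstract concrete, the following is a minimal sketch of the arithmetic only. The 100 GB BAM size is a hypothetical input, and the helper names `smaller_by` and `size_range` are illustrative, not part of any CRAM tooling.

```python
def smaller_by(size, percent):
    """Return the size left after shrinking `size` by `percent` percent."""
    return size * (100 - percent) / 100


def size_range(size, low_percent, high_percent):
    """Return (smallest, largest) expected size for a range of reductions."""
    return smaller_by(size, high_percent), smaller_by(size, low_percent)


# Hypothetical example: a 100 GB BAM file (assumed value, not from the paper).
bam_gb = 100

# The abstract reports CRAM 3.1 as 50-70% smaller than the corresponding BAM.
cram31_min, cram31_max = size_range(bam_gb, 50, 70)
print(f"CRAM 3.1 estimate: {cram31_min} to {cram31_max} GB")
```

Under that assumption the CRAM 3.1 file would fall between 30 and 50 GB. Actual savings depend on the instrument and data, as the abstract notes for long-read technology.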


Subject(s)
Data Compression , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , DNA , Base Sequence , Software
2.
Bioinformatics ; 35(2): 337-339, 2019 01 15.
Article in English | MEDLINE | ID: mdl-29992288

ABSTRACT

Motivation: The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results: On the Syndip test set, a 17 fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2 to 7.4 fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation: Crumble is OpenSource and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information: Supplementary data are available at Bioinformatics online.
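The dependence of whole-file savings on the non-quality content can be illustrated with a simple model. This is a sketch under a stated assumption, not the paper's analysis: it supposes that only the quality portion shrinks (by the reported 17-fold) while everything else in the file is unchanged. The function name `overall_fold` and the example fractions are hypothetical.

```python
def overall_fold(quality_fraction, quality_fold=17.0):
    """Estimate whole-file fold reduction when only quality values shrink.

    quality_fraction: share of the original file taken by quality values (0..1).
    quality_fold: fold reduction applied to that share (17 in the abstract).
    Assumes the non-quality portion keeps its original size.
    """
    if not 0.0 <= quality_fraction <= 1.0:
        raise ValueError("quality_fraction must be between 0 and 1")
    remaining = (1.0 - quality_fraction) + quality_fraction / quality_fold
    return 1.0 / remaining


# Hypothetical quality shares; real files vary with their non-quality content.
for fraction in (0.5, 0.7, 0.9):
    print(f"{fraction:.0%} quality -> {overall_fold(fraction):.2f}-fold overall")
```

The model shows why a file dominated by quality values benefits far more than one with substantial non-quality content, which is consistent in direction with the 2.2 to 7.4 fold range reported above.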


Subject(s)
Data Compression , High-Throughput Nucleotide Sequencing
3.
Nat Methods ; 13(12): 1005-1008, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27776113

ABSTRACT

High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.


Subject(s)
Computational Biology/methods , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Cacao/genetics , Drosophila melanogaster/genetics , Escherichia coli/genetics , Humans , Pseudomonas aeruginosa/genetics
4.
Bioinformatics ; 30(19): 2818-9, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24930138

ABSTRACT

MOTIVATION: The reference CRAM file format implementation is in Java. We present 'Scramble': a new C implementation of SAM, BAM and CRAM file I/O. RESULTS: The C implementation for CRAM is 1.5-1.7× slower than BAM at decoding but 1.8-2.6× faster at encoding. We see file size savings of 34-55%. AVAILABILITY AND IMPLEMENTATION: Source code is available at http://sourceforge.net/projects/staden/files/io_lib/ under the BSD software licence.


Subject(s)
Computational Biology/methods , Programming Languages , Computers , Escherichia coli/genetics , Genome, Bacterial , Genome, Human , Humans , Information Storage and Retrieval , Software
5.
Nat Genet ; 37(5): 532-6, 2005 May.
Article in English | MEDLINE | ID: mdl-15852006

ABSTRACT

Inbred mouse strains provide the foundation for mouse genetics. By selecting for phenotypic features of interest, inbreeding drives genomic evolution and eliminates individual variation, while fixing certain sets of alleles that are responsible for the trait characteristics of the strain. Mouse strains 129Sv (129S5) and C57BL/6J, two of the most widely used inbred lines, diverged from common ancestors within the last century, yet very little is known about the genomic differences between them. By comparative genomic hybridization and sequence analysis of 129S5 short insert libraries, we identified substantial structural variation, a complex fine-scale haplotype pattern with a continuous distribution of diversity blocks, and extensive nucleotide variation, including nonsynonymous coding SNPs and stop codons. Collectively, these genomic changes denote the level and direction of allele fixation that has occurred during inbreeding and provide a basis for defining what makes these mouse strains unique.


Subject(s)
Gene Dosage , Genetic Variation , Haplotypes , Polymorphism, Genetic , Animals , Genome , Mice , Mice, Inbred C57BL , Molecular Sequence Data
6.
Nucleic Acids Res ; 38(Database issue): D39-45, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19906712

ABSTRACT

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe's primary nucleotide sequence archival resource, safeguarding open nucleotide data access, engaging in worldwide collaborative data exchange and integrating with the scientific publication process. ENA has made significant contributions to the collaborative nucleotide archival arena as an active proponent of extending the traditional collaboration to cover capillary and next-generation sequencing information. We have continued to co-develop data and metadata representation formats with our collaborators for both data exchange and public data dissemination. In addition to the DDBJ/EMBL/GenBank feature table format, we share metadata formats for capillary and next-generation sequencing traces and are using and contributing to the NCBI SRA Toolkit for the long-term storage of the next-generation sequence traces. During the course of 2009, ENA has significantly improved sequence submission, search and access functionalities provided at EMBL-EBI. In this article, we briefly describe the content and scope of our archive and introduce major improvements to our services.


Subject(s)
Computational Biology/methods , Databases, Genetic , Databases, Nucleic Acid , Access to Information , Algorithms , Animals , Computational Biology/trends , DNA/genetics , Europe , Humans , Information Storage and Retrieval/methods , Internet , Software
7.
Bioinformatics ; 26(14): 1699-703, 2010 Jul 15.
Article in English | MEDLINE | ID: mdl-20513662

ABSTRACT

MOTIVATION: Existing sequence assembly editors struggle with the volumes of data now readily available from the latest generation of DNA sequencing instruments. RESULTS: We describe the Gap5 software along with the data structures and algorithms used that allow it to be scalable. We demonstrate this with an assembly of 1.1 billion sequence fragments and compare the performance with several other programs. We analyse the memory, CPU, I/O usage and file sizes used by Gap5. AVAILABILITY AND IMPLEMENTATION: Gap5 is part of the Staden Package and is available under an Open Source licence from http://staden.sourceforge.net. It is implemented in C and Tcl/Tk. Currently it works on Unix systems only.


Subject(s)
Sequence Analysis, DNA/methods , Software , Base Sequence , Databases, Factual , Sequence Alignment , User-Computer Interface
8.
Nucleic Acids Res ; 37(Database issue): D19-25, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18978013

ABSTRACT

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data, a brief summary of contents of the ENA and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.


Subject(s)
Databases, Nucleic Acid , Sequence Analysis/trends , Internet , Systems Integration
9.
Genomics ; 95(2): 105-10, 2010 Feb.
Article in English | MEDLINE | ID: mdl-19909804

ABSTRACT

Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting beta-cells by an autoimmune driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments.


Subject(s)
Chromosomes, Artificial, Bacterial/genetics , Genome , Animals , Chromosomes, Artificial, Bacterial/metabolism , DNA, Complementary/metabolism , Male , Mice , Mice, Inbred NOD , Mice, Inbred Strains , Molecular Sequence Data , Sequence Analysis, DNA
10.
Gigascience ; 10(2)2021 02 16.
Article in English | MEDLINE | ID: mdl-33594436

ABSTRACT

BACKGROUND: Since the original publication of the VCF and SAM formats, an explosion of software tools have been created to process these data files. To facilitate this a library was produced out of the original SAMtools implementation, with a focus on performance and robustness. The file formats themselves have become international standards under the jurisdiction of the Global Alliance for Genomics and Health. FINDINGS: We present a software library for providing programmatic access to sequencing alignment and variant formats. It was born out of the widely used SAMtools and BCFtools applications. Considerable improvements have been made to the original code plus many new features including newer access protocols, the addition of the CRAM file format, better indexing and iterators, and better use of threading. CONCLUSION: Since the original Samtools release, performance has been considerably improved, with a BAM read-write loop running 5 times faster and BAM to SAM conversion 13 times faster (both using 16 threads, compared to Samtools 0.1.19). Widespread adoption has seen HTSlib downloaded >1 million times from GitHub and conda. The C library has been used directly by an estimated 900 GitHub projects and has been incorporated into Perl, Python, Rust, and R, significantly expanding the number of uses via other languages. HTSlib is open source and is freely available from htslib.org under MIT/BSD license.


Subject(s)
High-Throughput Nucleotide Sequencing , Reading , Sequence Alignment , Software , Writing
11.
Gigascience ; 10(2)2021 02 16.
Article in English | MEDLINE | ID: mdl-33590861

ABSTRACT

BACKGROUND: SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. FINDINGS: The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. CONCLUSION: Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Genome , Genomics
12.
Cell Genom ; 1(2)2021 Nov 10.
Article in English | MEDLINE | ID: mdl-35072136

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.

13.
Nucleic Acids Res ; 36(Database issue): D5-12, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18039715

ABSTRACT

The Ensembl Trace Archive (http://trace.ensembl.org/) and the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), known together as the European Nucleotide Archive, continue to see growth in data volume and diversity. Selected major developments of 2007 are presented briefly, along with data submission and retrieval information. In the face of increasing requirements for nucleotide trace, sequence and annotation data archiving, data capture priority decisions have been taken at the European Nucleotide Archive. Priorities are discussed in terms of how reliably information can be captured, the long-term benefits of its capture and the ease with which it can be captured.


Subject(s)
Databases, Nucleic Acid , Sequence Analysis, DNA , Animals , Archives , Genomics , Internet
14.
Nucleic Acids Res ; 33(18): e152, 2005 Oct 12.
Article in English | MEDLINE | ID: mdl-16221968

ABSTRACT

Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time consuming and are not scaleable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly-characterized genomes as it requires no prior information on sequence variation.


Subject(s)
Alleles , Genetic Carrier Screening/methods , Genetic Variation , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Haplotypes , Humans , Male , Molecular Sequence Data , Polymerase Chain Reaction
15.
Sci Rep ; 7(1): 3935, 2017 06 21.
Article in English | MEDLINE | ID: mdl-28638050

ABSTRACT

Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.


Subject(s)
Genome, Fungal , Genomics , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA , Genome, Mitochondrial , Genomics/instrumentation , Genomics/methods , High-Throughput Nucleotide Sequencing/instrumentation , High-Throughput Nucleotide Sequencing/methods , Reproducibility of Results , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods
16.
PLoS One ; 8(3): e59190, 2013.
Article in English | MEDLINE | ID: mdl-23533605

ABSTRACT

Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.


Subject(s)
Computational Biology/methods , Data Compression/methods , Sequence Analysis, DNA/methods , Software
17.
Genomics ; 86(6): 753-8, 2005 Dec.
Article in English | MEDLINE | ID: mdl-16257172

ABSTRACT

The majority of gene-targeting experiments in mice are performed in 129Sv-derived embryonic stem (ES) cell lines, which are generally considered to be more reliable at colonizing the germ line than ES cells derived from other strains. Gene targeting is reliant on homologous recombination of a targeting vector with the host ES cell genome. The efficiency of recombination is affected by many factors, including the isogenicity (H. te Riele et al., 1992, Proc. Natl. Acad. Sci. USA 89, 5128-5132) and the length of homologous sequence of the targeting vector and the location of the target locus. Here we describe the double-end sequencing and mapping of 84,507 bacterial artificial chromosomes (BACs) generated from AB2.2 ES cell DNA (129S7/SvEvBrd-Hprtb-m2). We have aligned these BACs against the mouse genome and displayed them on the Ensembl genome browser, DAS: 129S7/AB2.2. This library has an average insert size of 110.68 kb and average depth of genome coverage of 3.63- and 1.24-fold across the autosomes and sex chromosomes, respectively. Over 97% of the mouse genome and 99.1% of Ensembl genes are covered by clones from this library. This publicly available BAC resource can be used for the rapid construction of targeting vectors via recombineering. Furthermore, we show that targeting vectors containing DNA recombineered from this BAC library can be used to target genes efficiently in several 129-derived ES cell lines.


Subject(s)
Chromosomes, Artificial, Bacterial/genetics , Gene Library , Gene Targeting/methods , Genetic Vectors/genetics , Mice/genetics , Animals , Base Sequence , Chromosome Mapping , Genomics/methods , Molecular Sequence Data , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA , Stem Cells/cytology
18.
Genome Res ; 15(1): 174-83, 2005 Jan.
Article in English | MEDLINE | ID: mdl-15590942

ABSTRACT

We present an analysis of the chicken (Gallus gallus) transcriptome based on the full insert sequences for 19,626 cDNAs, combined with 485,337 EST sequences. The cDNA data set has been functionally annotated and describes a minimum of 11,929 chicken coding genes, including the sequence for 2260 full-length cDNAs together with a collection of noncoding (nc) cDNAs that have been stringently filtered to remove untranslated regions of coding mRNAs. The combined collection of cDNAs and ESTs describe 62,546 clustered transcripts and provide transcriptional evidence for a total of 18,989 chicken genes, including 88% of the annotated Ensembl gene set. Analysis of the ncRNAs reveals a set that is highly conserved in chickens and mammals, including sequences for 14 pri-miRNAs encoding 23 different miRNAs. The data sets described here provide a transcriptome toolkit linked to physical clones for bioinformaticians and experimental biologists who wish to use chicken systems as a low-cost, accessible alternative to mammals for the analysis of vertebrate development, immunology, and cell biology.


Subject(s)
Chickens/genetics , DNA, Complementary/genetics , Expressed Sequence Tags , Gene Library , Transcription, Genetic/genetics , Animals , Cloning, Molecular/methods , Computational Biology/methods , DNA, Complementary/physiology , Humans , MicroRNAs/genetics , RNA, Untranslated/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods
19.
Bioinformatics ; 18(1): 3-10, 2002 Jan.
Article in English | MEDLINE | ID: mdl-11836205

ABSTRACT

MOTIVATION: To produce an open and extensible file format for DNA trace data which produces compact files suitable for large-scale storage and efficient use of internet bandwidth. RESULTS: We have created an extensible format named ZTR. For a set of data taken from an ABI-3700 the ZTR format produces trace files which require 61.6% of the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression algorithms used for the trace amplitudes are used within the National Center for Biotechnology Information (NCBI) trace archive. lmb.cam.ac.uk/pub/staden/io_lib/test_data.


Subject(s)
Database Management Systems , Databases, Nucleic Acid/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Computational Biology , Human Genome Project , Humans , Software
20.
Bioinformatics ; 18(1): 194-5, 2002 Jan.
Article in English | MEDLINE | ID: mdl-11836229

ABSTRACT

Trev is a DNA trace editor and viewer, which is available free for UNIX and Microsoft Windows platforms. It can read all the commonly used file formats, including the new, compact ZTR files.


Subject(s)
Sequence Analysis, DNA/statistics & numerical data , Software , Computational Biology , Computer Graphics , Data Display , Electronic Data Processing