Search | BVS Regional Portal

1.

CRAM 3.1: advances in the CRAM file format.

Bonfield, James K.

Bioinformatics ; 38(6): 1497-1503, 2022 03 04.

Article in English | MEDLINE | ID: mdl-34999766

ABSTRACT

MOTIVATION: CRAM has established itself as a high compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments. RESULTS: With Illumina data CRAM 3.1 is 7-15% smaller than the equivalent CRAM 3.0 file, and 50-70% smaller than the corresponding BAM file. Long-read technology shows more modest compression due to the presence of high-entropy signals. AVAILABILITY AND IMPLEMENTATION: The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate OpenSource HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Data Compression , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , DNA , Base Sequence , Software

2.

HTSlib: C library for reading/writing high-throughput sequencing data.

Bonfield, James K; Marshall, John; Danecek, Petr; Li, Heng; Ohan, Valeriu; Whitwham, Andrew; Keane, Thomas; Davies, Robert M.

Gigascience ; 10(2), 2021 02 16.

Article in English | MEDLINE | ID: mdl-33594436

ABSTRACT

BACKGROUND: Since the original publication of the VCF and SAM formats, an explosion of software tools have been created to process these data files. To facilitate this a library was produced out of the original SAMtools implementation, with a focus on performance and robustness. The file formats themselves have become international standards under the jurisdiction of the Global Alliance for Genomics and Health. FINDINGS: We present a software library for providing programmatic access to sequencing alignment and variant formats. It was born out of the widely used SAMtools and BCFtools applications. Considerable improvements have been made to the original code plus many new features including newer access protocols, the addition of the CRAM file format, better indexing and iterators, and better use of threading. CONCLUSION: Since the original Samtools release, performance has been considerably improved, with a BAM read-write loop running 5 times faster and BAM to SAM conversion 13 times faster (both using 16 threads, compared to Samtools 0.1.19). Widespread adoption has seen HTSlib downloaded >1 million times from GitHub and conda. The C library has been used directly by an estimated 900 GitHub projects and has been incorporated into Perl, Python, Rust, and R, significantly expanding the number of uses via other languages. HTSlib is open source and is freely available from htslib.org under MIT/BSD license.

Subject(s)

High-Throughput Nucleotide Sequencing , Reading , Sequence Alignment , Software , Writing

3.

Twelve years of SAMtools and BCFtools.

Danecek, Petr; Bonfield, James K; Liddle, Jennifer; Marshall, John; Ohan, Valeriu; Pollard, Martin O; Whitwham, Andrew; Keane, Thomas; McCarthy, Shane A; Davies, Robert M; Li, Heng.

Gigascience ; 10(2), 2021 02 16.

Article in English | MEDLINE | ID: mdl-33590861

ABSTRACT

BACKGROUND: SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. FINDINGS: The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. CONCLUSION: Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Genome , Genomics

4.

GA4GH: International policies and standards for data sharing across genomic research and healthcare.

Rehm, Heidi L; Page, Angela J H; Smith, Lindsay; Adams, Jeremy B; Alterovitz, Gil; Babb, Lawrence J; Barkley, Maxmillian P; Baudis, Michael; Beauvais, Michael J S; Beck, Tim; Beckmann, Jacques S; Beltran, Sergi; Bernick, David; Bernier, Alexander; Bonfield, James K; Boughtwood, Tiffany F; Bourque, Guillaume; Bowers, Sarion R; Brookes, Anthony J; Brudno, Michael; Brush, Matthew H; Bujold, David; Burdett, Tony; Buske, Orion J; Cabili, Moran N; Cameron, Daniel L; Carroll, Robert J; Casas-Silva, Esmeralda; Chakravarty, Debyani; Chaudhari, Bimal P; Chen, Shu Hui; Cherry, J Michael; Chung, Justina; Cline, Melissa; Clissold, Hayley L; Cook-Deegan, Robert M; Courtot, Mélanie; Cunningham, Fiona; Cupak, Miro; Davies, Robert M; Denisko, Danielle; Doerr, Megan J; Dolman, Lena I; Dove, Edward S; Dursi, L Jonathan; Dyke, Stephanie O M; Eddy, James A; Eilbeck, Karen; Ellrott, Kyle P; Fairley, Susan.

Cell Genom ; 1(2), 2021 Nov 10.

Article in English | MEDLINE | ID: mdl-35072136

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.

5.

Crumble: reference free lossy compression of sequence quality values.

Bonfield, James K; McCarthy, Shane A; Durbin, Richard.

Bioinformatics ; 35(2): 337-339, 2019 01 15.

Article in English | MEDLINE | ID: mdl-29992288

ABSTRACT

Motivation: The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results: On the Syndip test set, a 17 fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2 to 7.4 fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation: Crumble is OpenSource and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information: Supplementary data are available at Bioinformatics online.
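The relationship between the 17-fold quality reduction and the 2.2 to 7.4-fold whole-file reduction can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes, for illustration only, that the quality portion shrinks by exactly 17-fold and that the non-quality portion of the file is left unchanged. The function names are hypothetical.

```python
def whole_file_fold(quality_fraction, quality_fold=17.0):
    """Overall size reduction when only the quality portion shrinks.

    quality_fraction: share of the original file taken by quality values (0..1).
    quality_fold: assumed reduction factor for the quality portion.
    """
    remaining = (1.0 - quality_fraction) + quality_fraction / quality_fold
    return 1.0 / remaining


def quality_fraction_for(overall_fold, quality_fold=17.0):
    """Invert whole_file_fold: quality share implied by an overall reduction."""
    return (1.0 - 1.0 / overall_fold) / (1.0 - 1.0 / quality_fold)


if __name__ == "__main__":
    # The two ends of the whole-file range quoted in the abstract.
    for fold in (2.2, 7.4):
        share = quality_fraction_for(fold)
        print(f"{fold}-fold overall -> quality share of about {share:.0%}")
```

Under these assumptions, a 2.2-fold overall reduction corresponds to quality values taking roughly 58% of the original file, and a 7.4-fold reduction to roughly 92%, which is consistent with the abstract's remark that the result depends on the non-quality content.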

Subject(s)

Data Compression , High-Throughput Nucleotide Sequencing

6.

De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms.

Giordano, Francesca; Aigrain, Louise; Quail, Michael A; Coupland, Paul; Bonfield, James K; Davies, Robert M; Tischler, German; Jackson, David K; Keane, Thomas M; Li, Jing; Yue, Jia-Xing; Liti, Gianni; Durbin, Richard; Ning, Zemin.

Sci Rep ; 7(1): 3935, 2017 06 21.

Article in English | MEDLINE | ID: mdl-28638050

ABSTRACT

Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.

Subject(s)

Genome, Fungal , Genomics , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA , Genome, Mitochondrial , Genomics/instrumentation , Genomics/methods , High-Throughput Nucleotide Sequencing/instrumentation , High-Throughput Nucleotide Sequencing/methods , Reproducibility of Results , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods

7.

Comparison of high-throughput sequencing data compression tools.

Numanagic, Ibrahim; Bonfield, James K; Hach, Faraz; Voges, Jan; Ostermann, Jörn; Alberti, Claudio; Mattavelli, Marco; Sahinalp, S Cenk.

Nat Methods ; 13(12): 1005-1008, 2016 Dec.

Article in English | MEDLINE | ID: mdl-27776113

ABSTRACT

High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.

Subject(s)

Computational Biology/methods , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Cacao/genetics , Drosophila melanogaster/genetics , Escherichia coli/genetics , Humans , Pseudomonas aeruginosa/genetics

8.

The Scramble conversion tool.

Bonfield, James K.

Bioinformatics ; 30(19): 2818-9, 2014 Oct.

Article in English | MEDLINE | ID: mdl-24930138

ABSTRACT

MOTIVATION: The reference CRAM file format implementation is in Java. We present 'Scramble': a new C implementation of SAM, BAM and CRAM file I/O. RESULTS: The C implementation of CRAM is 1.5-1.7× slower than BAM at decoding but 1.8-2.6× faster at encoding. We see file size savings of 34-55%. AVAILABILITY AND IMPLEMENTATION: Source code is available at http://sourceforge.net/projects/staden/files/io_lib/ under the BSD software licence.

Subject(s)

Computational Biology/methods , Programming Languages , Computers , Escherichia coli/genetics , Genome, Bacterial , Genome, Human , Humans , Information Storage and Retrieval , Software

9.

Compression of FASTQ and SAM format sequencing data.

Bonfield, James K; Mahoney, Matthew V.

PLoS One ; 8(3): e59190, 2013.

Article in English | MEDLINE | ID: mdl-23533605

ABSTRACT

Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.

Subject(s)

Computational Biology/methods , Data Compression/methods , Sequence Analysis, DNA/methods , Software

10.

Gap5--editing the billion fragment sequence assembly.

Bonfield, James K; Whitwham, Andrew.

Bioinformatics ; 26(14): 1699-703, 2010 Jul 15.

Article in English | MEDLINE | ID: mdl-20513662

ABSTRACT

MOTIVATION: Existing sequence assembly editors struggle with the volumes of data now readily available from the latest generation of DNA sequencing instruments. RESULTS: We describe the Gap5 software along with the data structures and algorithms used that allow it to be scalable. We demonstrate this with an assembly of 1.1 billion sequence fragments and compare the performance with several other programs. We analyse the memory, CPU, I/O usage and file sizes used by Gap5. AVAILABILITY AND IMPLEMENTATION: Gap5 is part of the Staden Package and is available under an Open Source licence from http://staden.sourceforge.net. It is implemented in C and Tcl/Tk. Currently it works on Unix systems only.

Subject(s)

Sequence Analysis, DNA/methods , Software , Base Sequence , Databases, Factual , Sequence Alignment , User-Computer Interface

11.

A genome-wide, end-sequenced 129Sv BAC library resource for targeting vector construction.

Adams, David J; Quail, Michael A; Cox, Tony; van der Weyden, Louise; Gorick, Barbara D; Su, Qin; Chan, Wei-in; Davies, Rob; Bonfield, James K; Law, Frances; Humphray, Sean; Plumb, Bob; Liu, Pentao; Rogers, Jane; Bradley, Allan.

Genomics ; 86(6): 753-8, 2005 Dec.

Article in English | MEDLINE | ID: mdl-16257172

ABSTRACT

The majority of gene-targeting experiments in mice are performed in 129Sv-derived embryonic stem (ES) cell lines, which are generally considered to be more reliable at colonizing the germ line than ES cells derived from other strains. Gene targeting is reliant on homologous recombination of a targeting vector with the host ES cell genome. The efficiency of recombination is affected by many factors, including the isogenicity (H. te Riele et al., 1992, Proc. Natl. Acad. Sci. USA 89, 5128-5132) and the length of homologous sequence of the targeting vector and the location of the target locus. Here we describe the double-end sequencing and mapping of 84,507 bacterial artificial chromosomes (BACs) generated from AB2.2 ES cell DNA (129S7/SvEvBrd-Hprtb-m2). We have aligned these BACs against the mouse genome and displayed them on the Ensembl genome browser, DAS: 129S7/AB2.2. This library has an average insert size of 110.68 kb and average depth of genome coverage of 3.63- and 1.24-fold across the autosomes and sex chromosomes, respectively. Over 97% of the mouse genome and 99.1% of Ensembl genes are covered by clones from this library. This publicly available BAC resource can be used for the rapid construction of targeting vectors via recombineering. Furthermore, we show that targeting vectors containing DNA recombineered from this BAC library can be used to target genes efficiently in several 129-derived ES cell lines.

Subject(s)

Chromosomes, Artificial, Bacterial/genetics , Gene Library , Gene Targeting/methods , Genetic Vectors/genetics , Mice/genetics , Animals , Base Sequence , Chromosome Mapping , Genomics/methods , Molecular Sequence Data , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA , Stem Cells/cytology

12.

Shotgun haplotyping: a novel method for surveying allelic sequence variation.

Lindsay, Sarah J; Bonfield, James K; Hurles, Matthew E.

Nucleic Acids Res ; 33(18): e152, 2005 Oct 12.

Article in English | MEDLINE | ID: mdl-16221968

ABSTRACT

Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time consuming and are not scaleable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly-characterized genomes as it requires no prior information on sequence variation.

Subject(s)

Alleles , Genetic Carrier Screening/methods , Genetic Variation , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Haplotypes , Humans , Male , Molecular Sequence Data , Polymerase Chain Reaction

13.

Transcriptome analysis for the chicken based on 19,626 finished cDNA sequences and 485,337 expressed sequence tags.

Hubbard, Simon J; Grafham, Darren V; Beattie, Kevin J; Overton, Ian M; McLaren, Stuart R; Croning, Michael D R; Boardman, Paul E; Bonfield, James K; Burnside, Joan; Davies, Robert M; Farrell, Elizabeth R; Francis, Matthew D; Griffiths-Jones, Sam; Humphray, Sean J; Hyland, Christopher; Scott, Carol E; Tang, Haizhou; Taylor, Ruth G; Tickle, Cheryll; Brown, William R A; Birney, Ewan; Rogers, Jane; Wilson, Stuart A.

Genome Res ; 15(1): 174-83, 2005 Jan.

Article in English | MEDLINE | ID: mdl-15590942

ABSTRACT

We present an analysis of the chicken (Gallus gallus) transcriptome based on the full insert sequences for 19,626 cDNAs, combined with 485,337 EST sequences. The cDNA data set has been functionally annotated and describes a minimum of 11,929 chicken coding genes, including the sequence for 2260 full-length cDNAs together with a collection of noncoding (nc) cDNAs that have been stringently filtered to remove untranslated regions of coding mRNAs. The combined collection of cDNAs and ESTs describe 62,546 clustered transcripts and provide transcriptional evidence for a total of 18,989 chicken genes, including 88% of the annotated Ensembl gene set. Analysis of the ncRNAs reveals a set that is highly conserved in chickens and mammals, including sequences for 14 pri-miRNAs encoding 23 different miRNAs. The data sets described here provide a transcriptome toolkit linked to physical clones for bioinformaticians and experimental biologists who wish to use chicken systems as a low-cost, accessible alternative to mammals for the analysis of vertebrate development, immunology, and cell biology.

Subject(s)

Chickens/genetics , DNA, Complementary/genetics , Expressed Sequence Tags , Gene Library , Transcription, Genetic/genetics , Animals , Cloning, Molecular/methods , Computational Biology/methods , DNA, Complementary/physiology , Humans , MicroRNAs/genetics , RNA, Untranslated/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods

14.

ZTR: a new format for DNA sequence trace data.

Bonfield, James K; Staden, Rodger.

Bioinformatics ; 18(1): 3-10, 2002 Jan.

Article in English | MEDLINE | ID: mdl-11836205

ABSTRACT

MOTIVATION: To produce an open and extensible file format for DNA trace data which produces compact files suitable for large-scale storage and efficient use of internet bandwidth. RESULTS: We have created an extensible format named ZTR. For a set of data taken from an ABI-3700 the ZTR format produces trace files which require 61.6% of the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression algorithms used for the trace amplitudes are used within the National Center for Biotechnology Information (NCBI) trace archive. lmb.cam.ac.uk/pub/staden/io_lib/test_data.

Subject(s)

Database Management Systems , Databases, Nucleic Acid/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Computational Biology , Human Genome Project , Humans , Software

15.

Trev: a DNA trace editor and viewer.

Bonfield, James K; Beal, Kathryn F; Betts, Matthew J; Staden, Rodger.

Bioinformatics ; 18(1): 194-5, 2002 Jan.

Article in English | MEDLINE | ID: mdl-11836229

ABSTRACT

Trev is a DNA trace editor and viewer, which is available free for UNIX and Microsoft Windows platforms. It can read all the commonly used file formats, including the new, compact ZTR files.

Subject(s)

Sequence Analysis, DNA/statistics & numerical data , Software , Computational Biology , Computer Graphics , Data Display , Electronic Data Processing
