RESUMO
A mechanistic understanding of the biological and technical factors that impact transcript measurements is essential to designing and analyzing single-cell and single-nucleus RNA sequencing experiments. Nuclei contain the same pre-mRNA population as cells, but they contain a small subset of the mRNAs. Nonetheless, early studies argued that single-nucleus analysis yielded results comparable to cellular samples if pre-mRNA measurements were included. However, typical workflows do not distinguish between pre-mRNA and mRNA when estimating gene expression, and variation in their relative abundances across cell types has received limited attention. These gaps are especially important given that incorporating pre-mRNA has become commonplace for both assays, despite known gene length bias in pre-mRNA capture. Here, we reanalyze public data sets from mouse and human to describe the mechanisms and contrasting effects of mRNA and pre-mRNA sampling on gene expression and marker gene selection in single-cell and single-nucleus RNA-seq. We show that pre-mRNA levels vary considerably among cell types, which mediates the degree of gene length bias and limits the generalizability of a recently published normalization method intended to correct for this bias. As an alternative, we repurpose an existing post hoc gene length-based correction method from conventional RNA-seq gene set enrichment analysis. Finally, we show that inclusion of pre-mRNA in bioinformatic processing can impart a larger effect than assay choice itself, which is pivotal to the effective reuse of existing data. These analyses advance our understanding of the sources of variation in single-cell and single-nucleus RNA-seq experiments and provide useful guidance for future studies.
Assuntos
Núcleo Celular , Precursores de RNA , Humanos , Animais , Camundongos , RNA-Seq , RNA Mensageiro/genética , Análise de Sequência de RNA/métodos , Núcleo Celular/genética , Perfilação da Expressão Gênica/métodos , Análise de Célula ÚnicaRESUMO
Genetic and gene expression heterogeneity is an essential hallmark of many tumors, allowing the cancer to evolve and to develop resistance to treatment. Currently, the most commonly used data types for studying such heterogeneity are bulk tumor/normal whole-genome or whole-exome sequencing (WGS, WES); and single-cell RNA sequencing (scRNA-seq), respectively. However, tools are currently lacking to link genomic tumor subclonality with transcriptomic heterogeneity by integrating genomic and single-cell transcriptomic data collected from the same tumor. To address this gap, we developed scBayes, a Bayesian probabilistic framework that uses tumor subclonal structure inferred from bulk DNA sequencing data to determine the subclonal identity of cells from single-cell gene expression (scRNA-seq) measurements. Grouping together cells representing the same genetically defined tumor subclones allows comparison of gene expression across different subclones, or investigation of gene expression changes within the same subclone across time (i.e., progression, treatment response, or relapse) or space (i.e., at multiple metastatic sites and organs). We used simulated data sets, in silico synthetic data sets, as well as biological data sets generated from cancer samples to extensively characterize and validate the performance of our method, as well as to show improvements over existing methods. We show the validity and utility of our approach by applying it to published data sets and recapitulating the findings, as well as arriving at novel insights into cancer subclonal expression behavior in our own data sets. We further show that our method is applicable to a wide range of single-cell sequencing technologies including single-cell DNA sequencing as well as Smart-seq and 10x Genomics scRNA-seq protocols.
Assuntos
Neoplasias , Humanos , Sequenciamento do Exoma , Teorema de Bayes , Neoplasias/genética , Perfilação da Expressão Gênica/métodos , Análise de Sequência de RNA/métodos , Análise de Célula Única/métodosRESUMO
MOTIVATION: In time-critical clinical settings, such as precision medicine, genomic data needs to be processed as fast as possible to arrive at data-informed treatment decisions in a timely fashion. While sequencing throughput has dramatically increased over the past decade, bioinformatics analysis throughput has not been able to keep up with the pace of computer hardware improvement, and consequently has now turned into the primary bottleneck. Modern computer hardware today is capable of much higher performance than current genomic informatics algorithms can typically utilize, therefore presenting opportunities for significant improvement of performance. Accessing the raw sequencing data from BAM files, e.g. is a necessary and time-consuming step in nearly all sequence analysis tools, however existing programming libraries for BAM access do not take full advantage of the parallel input/output capabilities of storage devices. RESULTS: In an effort to stimulate the development of a new generation of faster sequence analysis tools, we developed quickBAM, a software library to accelerate sequencing data access by exploiting the parallelism in commodity storage hardware currently widely available. We demonstrate that analysis software ported to quickBAM consistently outperforms their current versions, in some cases finishing an analysis in under 3 min while the original version took 1.5 h, using the same storage solution. AVAILABILITY AND IMPLEMENTATION: Open source and freely available at https://gitlab.com/yiq/quickbam/, we envision that quickBAM will enable a new generation of high-performance informatics tools, either directly boosting their performance if they are currently data-access bottlenecked, or allow data-access to keep up with further optimizations in algorithms and compute techniques.
Assuntos
Algoritmos , Software , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Genômica , Informática , Análise de Sequência de DNA/métodosRESUMO
GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
Assuntos
Neoplasias da Mama/genética , Genoma Humano , Genômica/métodos , Ferramenta de Busca/métodos , Análise de Sequência de DNA/métodos , Software , Bases de Dados Genéticas , Feminino , Humanos , InternetAssuntos
Leucemia Linfocítica Crônica de Células B , Adenina/análogos & derivados , Benzamidas , Humanos , Leucemia Linfocítica Crônica de Células B/tratamento farmacológico , Leucemia Linfocítica Crônica de Células B/genética , Mutação , Recidiva Local de Neoplasia , Piperidinas , Inibidores de Proteínas Quinases/uso terapêutico , PirazinasRESUMO
SpeedSeq is an open-source genome analysis platform that accomplishes alignment, variant detection and functional annotation of a 50× human genome in 13 h on a low-cost server and alleviates a bioinformatics bottleneck that typically demands weeks of computation with extensive hands-on expert involvement. SpeedSeq offers performance competitive with or superior to current methods for detecting germline and somatic single-nucleotide variants, structural variants, insertions and deletions, and it includes novel functionality for streamlined interpretation.
Assuntos
Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Anotação de Sequência Molecular/métodos , Software , Variação Genética , Humanos , Neoplasias/genética , Polimorfismo de Nucleotídeo Único , Medicina de Precisão/métodos , Fluxo de TrabalhoRESUMO
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
Assuntos
Variação Genética/genética , Genética Populacional , Genoma Humano/genética , Genômica , Alelos , Sítios de Ligação/genética , Sequência Conservada/genética , Evolução Molecular , Genética Médica , Estudo de Associação Genômica Ampla , Haplótipos/genética , Humanos , Motivos de Nucleotídeos , Polimorfismo de Nucleotídeo Único/genética , Grupos Raciais/genética , Deleção de Sequência/genética , Fatores de Transcrição/metabolismoRESUMO
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
Assuntos
Variações do Número de Cópias de DNA/genética , Genética Populacional , Genoma Humano/genética , Genômica , Duplicação Gênica/genética , Predisposição Genética para Doença/genética , Genótipo , Humanos , Mutagênese Insercional/genética , Reprodutibilidade dos Testes , Análise de Sequência de DNA , Deleção de Sequência/genéticaRESUMO
BACKGROUND: Mobile elements (MEs) constitute greater than 50% of the human genome as a result of repeated insertion events during human genome evolution. Although most of these elements are now fixed in the population, some MEs, including ALU, L1, SVA and HERV-K elements, are still actively duplicating. Mobile element insertions (MEIs) have been associated with human genetic disorders, including Crohn's disease, hemophilia, and various types of cancer, motivating the need for accurate MEI detection methods. To comprehensively identify and accurately characterize these variants in whole genome next-generation sequencing (NGS) data, a computationally efficient detection and genotyping method is required. Current computational tools are unable to call MEI polymorphisms with sufficiently high sensitivity and specificity, or call individual genotypes with sufficiently high accuracy. RESULTS: Here we report Tangram, a computationally efficient MEI detection program that integrates read-pair (RP) and split-read (SR) mapping signals to detect MEI events. By utilizing SR mapping in its primary detection module, a feature unique to this software, Tangram is able to pinpoint MEI breakpoints with single-nucleotide precision. To understand the role of MEI events in disease, it is essential to produce accurate individual genotypes in clinical samples. Tangram is able to determine sample genotypes with very high accuracy. Using simulations and experimental datasets, we demonstrate that Tangram has superior sensitivity, specificity, breakpoint resolution and genotyping accuracy, when compared to other, recently developed MEI detection methods. CONCLUSIONS: Tangram serves as the primary MEI detection tool in the 1000 Genomes Project, and is implemented as a highly portable, memory-efficient, easy-to-use C++ computer program, built under an open-source development model.
Assuntos
Algoritmos , Elementos Alu , Cromossomos Humanos Par 22/genética , Biologia Computacional/métodos , Genoma Humano , Genótipo , Humanos , Modelos Genéticos , Sensibilidade e EspecificidadeRESUMO
BACKGROUND: Next generation sequencing is helping to overcome limitations in organisms less accessible to classical or reverse genetic methods by facilitating whole genome mutational analysis studies. One traditionally intractable group, the Apicomplexa, contains several important pathogenic protozoan parasites, including the Plasmodium species that cause malaria.Here we apply whole genome analysis methods to the relatively accessible model apicomplexan, Toxoplasma gondii, to optimize forward genetic methods for chemical mutagenesis using N-ethyl-N-nitrosourea (ENU) and ethylmethane sulfonate (EMS) at varying dosages. RESULTS: By comparing three different lab-strains we show that spontaneously generated mutations reflect genome composition, without nucleotide bias. However, the single nucleotide variations (SNVs) are not distributed randomly over the genome; most of these mutations reside either in non-coding sequence or are silent with respect to protein coding. This is in contrast to the random genomic distribution of mutations induced by chemical mutagenesis. Additionally, we report a genome wide transition vs transversion ratio (ti/tv) of 0.91 for spontaneous mutations in Toxoplasma, with a slightly higher rate of 1.20 and 1.06 for variants induced by ENU and EMS respectively. We also show that in the Toxoplasma system, surprisingly, both ENU and EMS have a proclivity for inducing mutations at A/T base pairs (78.6% and 69.6%, respectively). CONCLUSIONS: The number of SNVs between related laboratory strains is relatively low and managed by purifying selection away from changes to amino acid sequence. From an experimental mutagenesis point of view, both ENU (24.7%) and EMS (29.1%) are more likely to generate variation within exons than would naturally accumulate over time in culture (19.1%), demonstrating the utility of these approaches for yielding proportionally greater changes to the amino acid sequence. These results will not only direct the methods of future chemical mutagenesis in Toxoplasma, but also aid in designing forward genetic approaches in less accessible pathogenic protozoa as well.
Assuntos
Genoma , Toxoplasma/genética , Adenosina/genética , Adenosina/metabolismo , Sequência de Aminoácidos , Pareamento de Bases , Linhagem Celular , Metanossulfonato de Etila/toxicidade , Etilnitrosoureia/toxicidade , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Dados de Sequência Molecular , Pentosiltransferases/genética , Pentosiltransferases/metabolismo , Fenótipo , Mutação Puntual , Proteínas de Protozoários/genética , Proteínas de Protozoários/metabolismo , Timidina/genética , Timidina/metabolismo , Toxoplasma/efeitos dos fármacosRESUMO
MOTIVATION: A common question arises at the beginning of every experiment where RNA-Seq is used to detect differential gene expression between two conditions: How many reads should we sequence? RESULTS: Scotty is an interactive web-based application that assists biologists to design an experiment with an appropriate sample size and read depth to satisfy the user-defined experimental objectives. This design can be based on data available from either pilot samples or publicly available datasets. AVAILABILITY: Scotty can be freely accessed on the web at http://euler.bc.edu/marthlab/scotty/scotty.php
Assuntos
Perfilação da Expressão Gênica/métodos , Análise de Sequência de RNA/métodos , Software , Expressão Gênica , Humanos , InternetRESUMO
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after â¼1,000 sequenced chromosomes per population, whereas â¼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
Assuntos
Demografia , Evolução Molecular , Genes/genética , Variação Genética , Genética Populacional , Modelos Genéticos , Grupos Raciais/genética , Frequência do Gene , Deriva Genética , Genômica/métodos , Humanos , Análise de Sequência de DNARESUMO
As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
Assuntos
Elementos de DNA Transponíveis , Genoma Humano , Polimorfismo de Nucleotídeo Único , Frequência do Gene , Genótipo , Heterozigoto , Humanos , Mutagênese Insercional , Taxa de MutaçãoRESUMO
Bruton's tyrosine kinase (BTK) inhibitors are effective for the treatment of chronic lymphocytic leukemia (CLL) due to BTK's role in B cell survival and proliferation. Treatment resistance is most commonly caused by the emergence of the hallmark BTKC481S mutation that inhibits drug binding. In this study, we aimed to investigate whether the presence of additional CLL driver mutations in cancer subclones harboring a BTKC481S mutation accelerates subclone expansion. In addition, we sought to determine whether BTK-mutated subclones exhibit distinct transcriptomic behavior when compared to other cancer subclones. To achieve these goals, we employ our recently published method (Qiao et al. 2024) that combines bulk DNA sequencing and single-cell RNA sequencing (scRNA-seq) data to genotype individual cells for the presence or absence of subclone-defining mutations. While the most common approach for scRNA-seq includes short-read sequencing, transcript coverage is limited due to the vast majority of the reads being concentrated at the priming end of the transcript. Here, we utilized MAS-seq, a long-read scRNAseq technology, to substantially increase transcript coverage across the entire length of the transcripts and expand the set of informative mutations to link cells to cancer subclones in six CLL patients who acquired BTKC481S mutations during BTK inhibitor treatment. We found that BTK-mutated subclones often acquire additional mutations in CLL driver genes, leading to faster subclone proliferation. When examining subclone-specific gene expression, we found that in one patient, BTK-mutated subclones are transcriptionally distinct from the rest of the malignant B cell population with an overexpression of CLL-relevant genes.
RESUMO
BACKGROUND: Next generation sequencing and advances in genomic enrichment technologies have enabled the discovery of the full spectrum of variants from common to rare alleles in the human population. The application of such technologies can be limited by the amount of DNA available. Whole genome amplification (WGA) can overcome such limitations. Here we investigate applicability of using WGA by comparing SNP and INDEL variant calls from a single genomic/WGA sample pair from two capture separate experiments: a 50 Mbp whole exome capture and a custom capture array of 4 Mbp region on chr12. RESULTS: Our results comparing variant calls derived from genomic and WGA DNA show that the majority of variant SNP and INDEL calls are common to both callsets, both at the site and genotype level and suggest that allele bias plays a minimal role when using WGA DNA in re-sequencing studies. CONCLUSIONS: Although the results of this study are based on a limited sample size, they suggest that using WGA DNA allows the discovery of the vast majority of variants, and achieves high concordance metrics, when comparing to genomic DNA calls.
Assuntos
DNA/genética , Variação Genética/genética , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Técnicas de Amplificação de Ácido Nucleico , Análise de Sequência de DNA/métodos , Alelos , Gráficos por Computador , Genótipo , HumanosRESUMO
BACKGROUND: Toxoplasma gondii has a largely clonal population in North America and Europe, with types I, II and III clonal lineages accounting for the majority of strains isolated from patients. RH, a particular type I strain, is most frequently used to characterize Toxoplasma biology. However, compared to other type I strains, RH has unique characteristics such as faster growth, increased extracellular survival rate and inability to form orally infectious cysts. Thus, to identify candidate genes that could account for these parasite phenotypic differences, we determined genetic differences and differential parasite gene expression between RH and another type I strain, GT1. Moreover, as differences in host cell modulation could affect Toxoplasma replication in the host, we determined differentially modulated host processes among the type I strains through host transcriptional profiling. RESULTS: Through whole genome sequencing, we identified 1,394 single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) between RH and GT1. These SNPs/indels together with parasite gene expression differences between RH and GT1 were used to identify candidate genes that could account for type I phenotypic differences. A polymorphism in dense granule protein, GRA2, determined RH and GT1 differences in the evasion of the interferon gamma response. In addition, host transcriptional profiling identified that genes regulated by NF-ĸB, such as interleukin (IL)-12p40, were differentially modulated by the different type I strains. We subsequently showed that this difference in NF-ĸB activation was due to polymorphisms in GRA15. Furthermore, we observed that RH, but not other type I strains, recruited phosphorylated IĸBα (a component of the NF-ĸB complex) to the parasitophorous vacuole membrane and this recruitment of p- IĸBα was partially dependent on GRA2. CONCLUSIONS: We identified candidate parasite genes that could be responsible for phenotypic variation among the type I strains through comparative genomics and transcriptomics. We also identified differentially modulated host pathways among the type I strains, and these can serve as a guideline for future studies in examining the phenotypic differences among type I strains.
Assuntos
Fenótipo , Toxoplasma/genética , Toxoplasma/fisiologia , Animais , Fibroblastos/parasitologia , Regulação da Expressão Gênica , Genes de Protozoários/genética , Células HEK293 , Humanos , Subunidade p40 da Interleucina-12/metabolismo , Membranas Intracelulares/metabolismo , Membranas Intracelulares/parasitologia , Macrófagos/metabolismo , Macrófagos/parasitologia , Camundongos , NF-kappa B/metabolismo , Polimorfismo de Nucleotídeo Único , Transporte Proteico , Proteínas de Protozoários/genética , Proteínas de Protozoários/metabolismo , Especificidade da Espécie , Toxoplasma/metabolismo , Vacúolos/metabolismoRESUMO
UNLABELLED: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. AVAILABILITY: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA , Cromossomos Humanos Par 17/genética , Variação Genética , Humanos , SoftwareRESUMO
BACKGROUND: DNA capture technologies combined with high-throughput sequencing now enable cost-effective, deep-coverage, targeted sequencing of complete exomes. This is well suited for SNP discovery and genotyping. However there has been little attention devoted to Copy Number Variation (CNV) detection from exome capture datasets despite the potentially high impact of CNVs in exonic regions on protein function. RESULTS: As members of the 1000 Genomes Project analysis effort, we investigated 697 samples in which 931 genes were targeted and sampled with 454 or Illumina paired-end sequencing. We developed a rigorous Bayesian method to detect CNVs in the genes, based on read depth within target regions. Despite substantial variability in read coverage across samples and targeted exons, we were able to identify 107 heterozygous deletions in the dataset. The experimentally determined false discovery rate (FDR) of the cleanest dataset from the Wellcome Trust Sanger Institute is 12.5%. We were able to substantially improve the FDR in a subset of gene deletion candidates that were adjacent to another gene deletion call (17 calls). The estimated sensitivity of our call-set was 45%. CONCLUSIONS: This study demonstrates that exonic sequencing datasets, collected both in population based and medical sequencing projects, will be a useful substrate for detecting genic CNV events, particularly deletions. Based on the number of events we found and the sensitivity of the methods in the present dataset, we estimate on average 16 genic heterozygous deletions per individual genome. Our power analysis informs ongoing and future projects about sequencing depth and uniformity of read coverage required for efficient detection.
Assuntos
Variações do Número de Cópias de DNA , Genoma/genética , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Teorema de Bayes , Exoma , Éxons , Genótipo , Polimorfismo de Nucleotídeo Único , Análise de Sequência de DNARESUMO
MOTIVATION: Analysis of genomic sequencing data requires efficient, easy-to-use access to alignment results and flexible data management tools (e.g. filtering, merging, sorting, etc.). However, the enormous amount of data produced by current sequencing technologies is typically stored in compressed, binary formats that are not easily handled by the text-based parsers commonly used in bioinformatics research. RESULTS: We introduce a software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. AVAILABILITY: BamTools was written in C++, and is supported on Linux, Mac OSX and MS Windows. Source code and documentation are freely available at http://github.org/pezmaster31/bamtools.