RESUMEN
With the increasing availability of long-read sequencing data, high-quality human genome assemblies, and software for fully characterizing tandem repeats, genome-wide genotyping of tandem repeat loci on a population scale becomes more feasible. Such efforts not only expand our knowledge of the tandem repeat landscape in the human genome but also enhance our ability to differentiate pathogenic tandem repeat mutations from benign polymorphisms. To this end, we analyzed 272 genomes assembled using datasets from three public initiatives that employed different long-read sequencing technologies. Here, we report a catalog of over 18 million tandem repeat loci, many of which were previously unannotated. Some of these loci are highly polymorphic, and many of them reside within coding sequences.
RESUMEN
Long-read sequencing technologies have improved significantly since their emergence. Their read lengths, potentially spanning entire transcripts, is advantageous for reconstructing transcriptomes. Existing long-read transcriptome assembly methods are primarily reference-based and to date, there is little focus on reference-free transcriptome assembly. We introduce "RNA-Bloom2 [ https://github.com/bcgsc/RNA-Bloom ]", a reference-free assembly method for long-read transcriptome sequencing data. Using simulated datasets and spike-in control data, we show that the transcriptome assembly quality of RNA-Bloom2 is competitive to those of reference-based methods. Furthermore, we find that RNA-Bloom2 requires 27.0 to 80.6% of the peak memory and 3.6 to 10.8% of the total wall-clock runtime of a competing reference-free method. Finally, we showcase RNA-Bloom2 in assembling a transcriptome sample of Picea sitchensis (Sitka spruce). Since our method does not rely on a reference, it further sets the groundwork for large-scale comparative transcriptomics where high-quality draft genome assemblies are not readily available.
Asunto(s)
ARN , Transcriptoma , Transcriptoma/genética , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Perfilación de la Expresión Génica/métodos , Análisis de Secuencia de ARN/métodosRESUMEN
Detection of short tandem repeat (STR) expansions with standard short-read sequencing is challenging due to the difficulty in mapping multicopy repeat sequences. In this study, we explored how the long-range sequence information of barcode linked-read sequencing (BLRS) can be leveraged to improve repeat-read detection. We also devised a novel algorithm using BLRS barcodes for distance estimation and evaluated its application for STR genotyping. Both approaches were designed for genotyping large expansions (> 1 kb) that cannot be sized accurately by existing methods. Using simulated and experimental data of genomes with STR expansions from multiple BLRS platforms, we validated the utility of barcode and phasing information in attaining better STR genotypes compared to standard short-read sequencing. Although the coverage bias of extremely GC-rich STRs is an important limitation of BLRS, BLRS is an effective strategy for genotyping many other STR loci.
Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Repeticiones de Microsatélite , Algoritmos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Repeticiones de Microsatélite/genética , Análisis de Secuencia de ADN/métodosRESUMEN
Tandem repeat (TR) expansion is the underlying cause of over 40 neurological disorders. Long-read sequencing offers an exciting avenue over conventional technologies for detecting TR expansions. Here, we present Straglr, a robust software tool for both targeted genotyping and novel expansion detection from long-read alignments. We benchmark Straglr using various simulations, targeted genotyping data of cell lines carrying expansions of known diseases, and whole genome sequencing data with chromosome-scale assembly. Our results suggest that Straglr may be useful for investigating disease-associated TR expansions using long-read sequencing.
Asunto(s)
Genotipo , Técnicas de Genotipaje/métodos , Programas Informáticos , Expansión de las Repeticiones de ADN , Enfermedad/genética , Humanos , Secuenciación Completa del Genoma/métodosRESUMEN
BACKGROUND: Screening for short tandem repeat (STR) expansions in next-generation sequencing data can enable diagnosis, optimal clinical management/treatment, and accurate genetic counseling of patients with repeat expansion disorders. We aimed to develop an efficient computational workflow for reliable detection of STR expansions in next-generation sequencing data and demonstrate its clinical utility. METHODS: We characterized the performance of eight STR analysis methods (lobSTR, HipSTR, RepeatSeq, ExpansionHunter, TREDPARSE, GangSTR, STRetch, and exSTRa) on next-generation sequencing datasets of samples with known disease-causing full-mutation STR expansions and genomes simulated to harbor repeat expansions at selected loci and optimized their sensitivity. We then used a machine learning decision tree classifier to identify an optimal combination of methods for full-mutation detection. In Burrows-Wheeler Aligner (BWA)-aligned genomes, the ensemble approach of using ExpansionHunter, STRetch, and exSTRa performed the best (precision = 82%, recall = 100%, F1-score = 90%). We applied this pipeline to screen 301 families of children with suspected genetic disorders. RESULTS: We identified 10 individuals with full-mutations in the AR, ATXN1, ATXN8, DMPK, FXN, or HTT disease STR locus in the analyzed families. Additional candidates identified in our analysis include two probands with borderline ATXN2 expansions between the established repeat size range for reduced-penetrance and full-penetrance full-mutation and seven individuals with FMR1 CGG repeats in the intermediate/premutation repeat size range. In 67 probands with a prior negative clinical PCR test for the FMR1, FXN, or DMPK disease STR locus, or the spinocerebellar ataxia disease STR panel, our pipeline did not falsely identify aberrant expansion. We performed clinical PCR tests on seven (out of 10) full-mutation samples identified by our pipeline and confirmed the expansion status in all, showing absolute concordance between our bioinformatics and molecular findings. CONCLUSIONS: We have successfully demonstrated the application of a well-optimized bioinformatics pipeline that promotes the utility of genome-wide sequencing as a first-tier screening test to detect expansions of known disease STRs. Interrogating clinical next-generation sequencing data for pathogenic STR expansions using our ensemble pipeline can improve diagnostic yield and enhance clinical outcomes for patients with repeat expansion disorders.
Asunto(s)
Expansión de las Repeticiones de ADN , Estudio de Asociación del Genoma Completo , Secuenciación de Nucleótidos de Alto Rendimiento , Repeticiones de Microsatélite , Secuenciación Completa del Genoma , Algoritmos , Alelos , Toma de Decisiones Clínicas , Biología Computacional/métodos , Bases de Datos Genéticas , Árboles de Decisión , Enfermedades Genéticas Congénitas/diagnóstico , Enfermedades Genéticas Congénitas/genética , Sitios Genéticos , Estudio de Asociación del Genoma Completo/métodos , Humanos , Aprendizaje Automático , Técnicas de Diagnóstico Molecular , Mutación , Reproducibilidad de los ResultadosRESUMEN
As more clinically-relevant genomic features of myeloid malignancies are revealed, it has become clear that targeted clinical genetic testing is inadequate for risk stratification. Here, we develop and validate a clinical transcriptome-based assay for stratification of acute myeloid leukemia (AML). Comparison of ribonucleic acid sequencing (RNA-Seq) to whole genome and exome sequencing reveals that a standalone RNA-Seq assay offers the greatest diagnostic return, enabling identification of expressed gene fusions, single nucleotide and short insertion/deletion variants, and whole-transcriptome expression information. Expression data from 154 AML patients are used to develop a novel AML prognostic score, which is strongly associated with patient outcomes across 620 patients from three independent cohorts, and 42 patients from a prospective cohort. When combined with molecular risk guidelines, the risk score allows for the re-stratification of 22.1 to 25.3% of AML patients from three independent cohorts into correct risk groups. Within the adverse-risk subgroup, we identify a subset of patients characterized by dysregulated integrin signaling and RUNX1 or TP53 mutation. We show that these patients may benefit from therapy with inhibitors of focal adhesion kinase, encoded by PTK2, demonstrating additional utility of transcriptome-based testing for therapy selection in myeloid malignancy.
Asunto(s)
Biomarcadores de Tumor/metabolismo , Regulación Neoplásica de la Expresión Génica/genética , Leucemia Mieloide Aguda/diagnóstico , Leucemia Mieloide Aguda/metabolismo , Biomarcadores de Tumor/genética , Línea Celular Tumoral , Estudios de Cohortes , Subunidad alfa 2 del Factor de Unión al Sitio Principal/genética , Subunidad alfa 2 del Factor de Unión al Sitio Principal/metabolismo , Femenino , Fusión Génica , Humanos , Mutación INDEL , Integrinas/genética , Integrinas/metabolismo , Leucemia Mieloide Aguda/genética , Masculino , Polimorfismo de Nucleótido Simple , Pronóstico , Estudios Prospectivos , RNA-Seq , Factores de Riesgo , Transducción de Señal/genética , Análisis de Supervivencia , Transcriptoma , Proteína p53 Supresora de Tumor/genética , Proteína p53 Supresora de Tumor/metabolismo , Secuenciación del Exoma , Secuenciación Completa del GenomaRESUMEN
Despite the rapid advance in single-cell RNA sequencing (scRNA-seq) technologies within the last decade, single-cell transcriptome analysis workflows have primarily used gene expression data while isoform sequence analysis at the single-cell level still remains fairly limited. Detection and discovery of isoforms in single cells is difficult because of the inherent technical shortcomings of scRNA-seq data, and existing transcriptome assembly methods are mainly designed for bulk RNA samples. To address this challenge, we developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, thus enabling unbiased discovery of novel isoforms or foreign transcripts. We compared both assembly strategies of RNA-Bloom against five state-of-the-art reference-free and reference-based transcriptome assembly methods. In our benchmarks on a simulated 384-cell data set, reference-free RNA-Bloom reconstructed 37.9%-38.3% more isoforms than the best reference-free assembler, whereas reference-guided RNA-Bloom reconstructed 4.1%-11.6% more isoforms than reference-based assemblers. When applied to a real 3840-cell data set consisting of more than 4 billion reads, RNA-Bloom reconstructed 9.7%-25.0% more isoforms than the best competing reference-based and reference-free approaches evaluated. We expect RNA-Bloom to boost the utility of scRNA-seq data beyond gene expression analysis, expanding what is informatically accessible now.
Asunto(s)
Perfilación de la Expresión Génica/métodos , RNA-Seq/métodos , Análisis de la Célula Individual/métodos , Transcriptoma/genética , Algoritmos , Animales , Secuencia de Bases , Humanos , Ratones , Isoformas de Proteínas/genética , Programas InformáticosRESUMEN
Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
Asunto(s)
Análisis de Secuencia de ADN/métodos , Programas Informáticos , Animales , Disparidad de Par Base , Humanos , Filogenia , Alineación de Secuencia , Análisis de Secuencia de ADN/normasRESUMEN
SUMMARY: Presence or absence of gene fusions is one of the most important diagnostic markers in many cancer types. Consequently, fusion detection methods using various genomics data types, such as RNA sequencing (RNA-seq) are valuable tools for research and clinical applications. While information-rich RNA-seq data have proven to be instrumental in discovery of a number of hallmark fusion events, bioinformatics tools to detect fusions still have room for improvement. Here, we present Fusion-Bloom, a fusion detection method that leverages recent developments in de novo transcriptome assembly and assembly-based structural variant calling technologies (RNA-Bloom and PAVFinder, respectively). We benchmarked Fusion-Bloom against the performance of five other state-of-the-art fusion detection tools using multiple datasets. Overall, we observed Fusion-Bloom to display a good balance between detection sensitivity and specificity. We expect the tool to find applications in translational research and clinical genomics pipelines. AVAILABILITY AND IMPLEMENTATION: Fusion-Bloom is implemented as a UNIX Make utility, available at https://github.com/bcgsc/pavfinder and released under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Programas Informáticos , Transcriptoma , Genómica , ARN , Análisis de Secuencia de ARNRESUMEN
We report a case of early-onset pancreatic ductal adenocarcinoma in a patient harboring biallelic MUTYH germline mutations, whose tumor featured somatic mutational signatures consistent with defective MUTYH-mediated base excision repair and the associated driver KRAS transversion mutation p.Gly12Cys. Analysis of an additional 730 advanced cancer cases (N = 731) was undertaken to determine whether the mutational signatures were also present in tumors from germline MUTYH heterozygote carriers or if instead the signatures were only seen in those with biallelic loss of function. We identified two patients with breast cancer each carrying a pathogenic germline MUTYH variant with a somatic MUTYH copy loss leading to the germline variant being homozygous in the tumor and demonstrating the same somatic signatures. Our results suggest that monoallelic inactivation of MUTYH is not sufficient for C:G>A:T transversion signatures previously linked to MUTYH deficiency to arise (N = 9), but that biallelic complete loss of MUTYH function can cause such signatures to arise even in tumors not classically seen in MUTYH-associated polyposis (N = 3). Although defective MUTYH is not the only determinant of these signatures, MUTYH germline variants may be present in a subset of patients with tumors demonstrating elevated somatic signatures possibly suggestive of MUTYH deficiency (e.g., COSMIC Signature 18, SigProfiler SBS18/SBS36, SignatureAnalyzer SBS18/SBS36).
Asunto(s)
Neoplasias de la Mama/genética , Carcinoma Ductal Pancreático/genética , ADN Glicosilasas/genética , Mutación , Neoplasias Pancreáticas/genética , Edad de Inicio , ADN Glicosilasas/deficiencia , Femenino , Mutación de Línea Germinal , Humanos , Pérdida de Heterocigocidad , Persona de Mediana Edad , Proteínas Proto-Oncogénicas p21(ras)/genéticaRESUMEN
BACKGROUND: RNA-seq is a powerful and cost-effective technology for molecular diagnostics of cancer and other diseases, and it can reach its full potential when coupled with validated clinical-grade informatics tools. Despite recent advances in long-read sequencing, transcriptome assembly of short reads remains a useful and cost-effective methodology for unveiling transcript-level rearrangements and novel isoforms. One of the major concerns for adopting the proven de novo assembly approach for RNA-seq data in clinical settings has been the analysis turnaround time. To address this concern, we have developed a targeted approach to expedite assembly and analysis of RNA-seq data. RESULTS: Here we present our Targeted Assembly Pipeline (TAP), which consists of four stages: 1) alignment-free gene-level classification of RNA-seq reads using BioBloomTools, 2) de novo assembly of individual targets using Trans-ABySS, 3) alignment of assembled contigs to the reference genome and transcriptome with GMAP and BWA and 4) structural and splicing variant detection using PAVFinder. We show that PAVFinder is a robust gene fusion detection tool when compared to established methods such as Tophat-Fusion and deFuse on simulated data of 448 events. Using the Leucegene acute myeloid leukemia (AML) RNA-seq data and a set of 580 COSMIC target genes, TAP identified a wide range of hallmark molecular anomalies including gene fusions, tandem duplications, insertions and deletions in agreement with published literature results. Moreover, also in this dataset, TAP captured AML-specific splicing variants such as skipped exons and novel splice sites reported in studies elsewhere. Running time of TAP on 100-150 million read pairs and a 580-gene set is one to 2 hours on a 48-core machine. CONCLUSIONS: We demonstrated that TAP is a fast and robust RNA-seq variant detection pipeline that is potentially amenable to clinical applications. TAP is available at http://www.bcgsc.ca/platform/bioinfo/software/pavfinder.
Asunto(s)
Variación Genética , Genómica/métodos , ARN/metabolismo , Interfaz Usuario-Computador , Humanos , Mutación INDEL , Leucemia Mieloide Aguda/genética , Leucemia Mieloide Aguda/patología , ARN/química , ARN/genética , Empalme del ARN , Análisis de Secuencia de ARNRESUMEN
BACKGROUND: Alternative polyadenylation (APA) results in messenger RNA molecules with different 3' untranslated regions (3' UTRs), affecting the molecules' stability, localization, and translation. APA is pervasive and implicated in cancer. Earlier reports on APA focused on 3' UTR length modifications and commonly characterized APA events as 3' UTR shortening or lengthening. However, such characterization oversimplifies the processing of 3' ends of transcripts and fails to adequately describe the various scenarios we observe. RESULTS: We built a cloud-based targeted de novo transcript assembly and analysis pipeline that incorporates our previously developed cleavage site prediction tool, KLEAT. We applied this pipeline to elucidate the APA profiles of 114 genes in 9939 tumor and 729 tissue normal samples from The Cancer Genome Atlas (TCGA). The full set of 10,668 RNA-Seq samples from 33 cancer types has not been utilized by previous APA studies. By comparing the frequencies of predicted cleavage sites between normal and tumor sample groups, we identified 77 events (i.e. gene-cancer type pairs) of tumor-specific APA regulation in 13 cancer types; for 15 genes, such regulation is recurrent across multiple cancers. Our results also support a previous report showing the 3' UTR shortening of FGF2 in multiple cancers. However, over half of the events we identified display complex changes to 3' UTR length that resist simple classification like shortening or lengthening. CONCLUSIONS: Recurrent tumor-specific regulation of APA is widespread in cancer. However, the regulation pattern that we observed in TCGA RNA-seq data cannot be described as straightforward 3' UTR shortening or lengthening. Continued investigation into this complex, nuanced regulatory landscape will provide further insight into its role in tumor formation and development.
Asunto(s)
Neoplasias/genética , ARN Mensajero/genética , Regiones no Traducidas 3' , Nube Computacional , Bases de Datos Genéticas , Factor 2 de Crecimiento de Fibroblastos/genética , Regulación Neoplásica de la Expresión Génica , Humanos , Recurrencia Local de Neoplasia/genética , Neoplasias/patología , Poliadenilación , División del ARN , ARN Mensajero/metabolismo , Programas InformáticosRESUMEN
BACKGROUND: Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. RESULTS: Konnector uses a probabilistic and memory-efficient data structure called Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. CONCLUSIONS: Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
Asunto(s)
Análisis de Secuencia de ADN/métodos , Programas Informáticos , Algoritmos , ADN/química , Secuenciación de Nucleótidos de Alto RendimientoRESUMEN
White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.
Asunto(s)
Genoma de Planta , Familia de Multigenes , Fenoles/metabolismo , Picea/genética , Terpenos/metabolismo , Transferasas Alquil y Aril/metabolismo , Biología Computacional , Sistema Enzimático del Citocromo P-450/metabolismo , TranscriptomaRESUMEN
In eukaryotic cells, alternative cleavage of 3' untranslated regions (UTRs) can affect transcript stability, transport and translation. For polyadenylated (poly(A)) transcripts, cleavage sites can be characterized with short-read sequencing using specialized library construction methods. However, for large-scale cohort studies as well as for clinical sequencing applications, it is desirable to characterize such events using RNA-seq data, as the latter are already widely applied to identify other relevant information, such as mutations, alternative splicing and chimeric transcripts. Here we describe KLEAT, an analysis tool that uses de novo assembly of RNA-seq data to characterize cleavage sites on 3' UTRs. We demonstrate the performance of KLEAT on three cell line RNA-seq libraries constructed and sequenced by the ENCODE project, and assembled using Trans-ABySS. Validating the KLEAT predictions with matched ENCODE RNA-seq and RNA-PET libraries, we show that the tool has over 90% positive predictive value when there are at least three RNA-seq reads supporting a poly(A) tail and requiring at least three RNA-PET reads mapping within 100 nucleotides as validation. We also compare the performance of KLEAT with other popular RNA-seq analysis pipelines that reconstruct 3' UTR ends, and show that it performs favourably, based on an ROC-like curve.
Asunto(s)
Transcriptoma , Regiones no Traducidas 3' , Sitios de Unión , Línea Celular , Biología Computacional , Biblioteca de Genes , Humanos , Curva ROC , Alineación de Secuencia/estadística & datos numéricos , Análisis de Secuencia de ARN/estadística & datos numéricosRESUMEN
BACKGROUND: Chimeric transcripts, including partial and internal tandem duplications (PTDs, ITDs) and gene fusions, are important in the detection, prognosis, and treatment of human cancers. RESULTS: We describe Barnacle, a production-grade analysis tool that detects such chimeras in de novo assemblies of RNA-seq data, and supports prioritizing them for review and validation by reporting the relative coverage of co-occurring chimeric and wild-type transcripts. We demonstrate applications in large-scale disease studies, by identifying PTDs in MLL, ITDs in FLT3, and reciprocal fusions between PML and RARA, in two deeply sequenced acute myeloid leukemia (AML) RNA-seq datasets. CONCLUSIONS: Our analyses of real and simulated data sets show that, with appropriate filter settings, Barnacle makes highly specific predictions for three types of chimeric transcripts that are important in a range of cancers: PTDs, ITDs, and fusions. High specificity makes manual review and validation efficient, which is necessary in large-scale disease studies. Characterizing an extended range of chimera types will help generate insights into progression, treatment, and outcomes for complex diseases.
Asunto(s)
Duplicación de Gen/genética , Perfilación de la Expresión Génica/métodos , Fusión Génica/genética , Genómica , Neoplasias de la Mama/genética , Exones/genética , Humanos , Leucemia Mieloide Aguda/genética , Anotación de Secuencia Molecular , ARN Mensajero/genética , Estadística como AsuntoRESUMEN
Diffuse large B-cell lymphoma (DLBCL) is a genetically heterogeneous cancer composed of at least 2 molecular subtypes that differ in gene expression and distribution of mutations. Recently, application of genome/exome sequencing and RNA-seq to DLBCL has revealed numerous genes that are recurrent targets of somatic point mutation in this disease. Here we provide a whole-genome-sequencing-based perspective of DLBCL mutational complexity by characterizing 40 de novo DLBCL cases and 13 DLBCL cell lines and combining these data with DNA copy number analysis and RNA-seq from an extended cohort of 96 cases. Our analysis identified widespread genomic rearrangements including evidence for chromothripsis as well as the presence of known and novel fusion transcripts. We uncovered new gene targets of recurrent somatic point mutations and genes that are targeted by focal somatic deletions in this disease. We highlight the recurrence of germinal center B-cell-restricted mutations affecting genes that encode the S1P receptor and 2 small GTPases (GNA13 and GNAI2) that together converge on regulation of B-cell homing. We further analyzed our data to approximate the relative temporal order in which some recurrent mutations were acquired and demonstrate that ongoing acquisition of mutations and intratumoral clonal heterogeneity are common features of DLBCL. This study further improves our understanding of the processes and pathways involved in lymphomagenesis, and some of the pathways mutated here may indicate new avenues for therapeutic intervention.
Asunto(s)
Biomarcadores de Tumor/química , Biomarcadores de Tumor/genética , Variaciones en el Número de Copia de ADN/genética , Genoma Humano , Linfoma de Células B Grandes Difuso/genética , Mutación/genética , Subunidades alfa de la Proteína de Unión al GTP G12-G13/química , Subunidades alfa de la Proteína de Unión al GTP G12-G13/genética , Perfilación de la Expresión Génica , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia por Matrices de Oligonucleótidos , ARN Mensajero/genética , Reacción en Cadena en Tiempo Real de la Polimerasa , Reacción en Cadena de la Polimerasa de Transcriptasa Inversa , Células Tumorales CultivadasRESUMEN
BACKGROUND: Many mutations that contribute to the pathogenesis of acute myeloid leukemia (AML) are undefined. The relationships between patterns of mutations and epigenetic phenotypes are not yet clear. METHODS: We analyzed the genomes of 200 clinically annotated adult cases of de novo AML, using either whole-genome sequencing (50 cases) or whole-exome sequencing (150 cases), along with RNA and microRNA sequencing and DNA-methylation analysis. RESULTS: AML genomes have fewer mutations than most other adult cancers, with an average of only 13 mutations found in genes. Of these, an average of 5 are in genes that are recurrently mutated in AML. A total of 23 genes were significantly mutated, and another 237 were mutated in two or more samples. Nearly all samples had at least 1 nonsynonymous mutation in one of nine categories of genes that are almost certainly relevant for pathogenesis, including transcription-factor fusions (18% of cases), the gene encoding nucleophosmin (NPM1) (27%), tumor-suppressor genes (16%), DNA-methylation-related genes (44%), signaling genes (59%), chromatin-modifying genes (30%), myeloid transcription-factor genes (22%), cohesin-complex genes (13%), and spliceosome-complex genes (14%). Patterns of cooperation and mutual exclusivity suggested strong biologic relationships among several of the genes and categories. CONCLUSIONS: We identified at least one potential driver mutation in nearly all AML samples and found that a complex interplay of genetic events contributes to AML pathogenesis in individual patients. The databases from this study are widely available to serve as a foundation for further investigations of AML pathogenesis, classification, and risk stratification. (Funded by the National Institutes of Health.).
Asunto(s)
Leucemia Mieloide Aguda/genética , Mutación , Adulto , Islas de CpG , Metilación de ADN , Epigenómica , Femenino , Expresión Génica , Fusión Génica , Genoma Humano , Humanos , Leucemia Mieloide Aguda/clasificación , Masculino , MicroARNs/genética , Persona de Mediana Edad , Nucleofosmina , Análisis de Secuencia de ADN/métodosRESUMEN
Neuroblastoma is a malignancy of the developing sympathetic nervous system that often presents with widespread metastatic disease, resulting in survival rates of less than 50%. To determine the spectrum of somatic mutation in high-risk neuroblastoma, we studied 240 affected individuals (cases) using a combination of whole-exome, genome and transcriptome sequencing as part of the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative. Here we report a low median exonic mutation frequency of 0.60 per Mb (0.48 nonsilent) and notably few recurrently mutated genes in these tumors. Genes with significant somatic mutation frequencies included ALK (9.2% of cases), PTPN11 (2.9%), ATRX (2.5%, and an additional 7.1% had focal deletions), MYCN (1.7%, causing a recurrent p.Pro44Leu alteration) and NRAS (0.83%). Rare, potentially pathogenic germline variants were significantly enriched in ALK, CHEK2, PINK1 and BARD1. The relative paucity of recurrent somatic mutations in neuroblastoma challenges current therapeutic strategies that rely on frequently altered oncogenic drivers.