ABSTRACT
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
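The idea of representing data sets as sets of k-mers can be illustrated with a minimal sketch. It is not taken from any of the surveyed tools: the function names, the toy data, and the 0.8 threshold are assumptions made for illustration, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets, k):
    """Map each data set name to the set of k-mers found in its reads."""
    return {
        name: set().union(*(kmers(read, k) for read in reads))
        for name, reads in datasets.items()
    }


def query(index, seq, k, threshold=0.8):
    """Return the data sets holding at least `threshold` of the query k-mers."""
    q = kmers(seq, k)
    hits = []
    for name, dataset_kmers in index.items():
        # Fraction of query k-mers present in this data set (hypothetical rule).
        if q and len(q & dataset_kmers) / len(q) >= threshold:
            hits.append(name)
    return sorted(hits)


# Toy collection: two "data sets", each a list of reads.
datasets = {
    "sample_A": ["ACGTACGTGA", "TTGACCA"],
    "sample_B": ["GGGTTTAAAC"],
}
index = build_index(datasets, k=4)
```

Calling `query(index, "ACGTACG", k=4)` reports only `sample_A`, because every 4-mer of the query occurs in one of its reads. Tools at real scale would replace the plain Python sets with compact or approximate structures.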
Subject(s)
Algorithms; Software; High-Throughput Nucleotide Sequencing; Reproducibility of Results
ABSTRACT
BACKGROUND: Internal tandem duplications in the FLT3 gene, termed FLT3-ITDs, are useful molecular markers in acute myeloid leukemia (AML) for patient risk stratification and follow-up. FLT3-ITDs are increasingly screened through high-throughput sequencing (HTS), raising the need for robust and efficient algorithms. We developed a new algorithm, which performs no alignment and uses few resources, to identify and quantify FLT3-ITDs in HTS data. RESULTS: Our algorithm (FiLT3r) focuses on the k-mers from reads covering FLT3 exons 14 and 15. We show that those k-mers carry enough information to accurately detect FLT3-ITD duplications, determine their length, and quantify them. We compare the performance of FiLT3r to state-of-the-art alternatives and to fragment analysis, the gold standard method, on a cohort of 185 AML patients sequenced with capture-based HTS. On this dataset FiLT3r is more precise (no false positives or false negatives) than the other software evaluated. We also assess the software on public RNA-Seq data, which confirms the previous results and shows that FiLT3r requires few resources compared to other software. CONCLUSION: FiLT3r is free software available at https://gitlab.univ-lille.fr/filt3r/filt3r . The repository also contains a Snakefile to reproduce our experiments. We show that FiLT3r detects FLT3-ITDs better than other software while using less memory and time.
Subject(s)
Leukemia, Myeloid, Acute; Tandem Repeat Sequences; Humans; Tandem Repeat Sequences/genetics; Leukemia, Myeloid, Acute/genetics; High-Throughput Nucleotide Sequencing; Exons; Base Sequence; fms-Like Tyrosine Kinase 3/genetics; Mutation
ABSTRACT
MOTIVATION: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. RESULTS: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. AVAILABILITY AND IMPLEMENTATION: https://github.com/kamimrcht/REINDEER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Sequence Analysis, DNA; Software; Algorithms; Humans; Sequence Analysis, RNA
ABSTRACT
BACKGROUND: The evolution of next-generation sequencing (NGS) technologies has led to increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq analysis, each with unique performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which bioinformatic tools are best for their specific needs and how they should be configured. In order to provide some answers to these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq analysis and propose a methodology for systematic evaluation and comparison of performance to help users make well-informed choices. RESULTS: To evaluate RNA-Seq pipelines, we developed a suite of two benchmarking tools. SimCT generates simulated datasets that get as close as possible to specific real biological conditions, accompanied by the list of genomic incidents and mutations that have been inserted. BenchCT then compares the output of any bioinformatics pipeline that has been run against a SimCT dataset with the simulated genomic and transcriptional variations it contains, to give an accurate performance evaluation in addressing a specific biological question. We used these tools to simulate a real-world genomic medicine question involving the comparison of healthy and cancerous cells. Results revealed that performance in addressing a particular biological context varied significantly depending on the choice of tools and settings used. We also found that by combining the output of certain pipelines, substantial performance improvements could be achieved. CONCLUSION: Our research emphasizes the importance of selecting and configuring bioinformatic tools for the specific biological question being investigated to obtain optimal results. Pipeline designers, developers and users should include benchmarking in the context of their biological question as part of their design and quality control process. Our SimBA suite of benchmarking tools provides a reliable basis for comparing the performance of RNA-Seq bioinformatics pipelines in addressing a specific biological question. We would like to see the creation of a reference corpus of data sets that would allow accurate comparison between benchmarks performed by different groups and the publication of more benchmarks based on this public corpus. SimBA software and data set are available at http://cractools.gforge.inria.fr/softwares/simba/ .
Subject(s)
Computational Biology/methods; Computer Simulation; Sequence Analysis, RNA/methods; Software; Gene Fusion; Genome, Human; High-Throughput Nucleotide Sequencing/methods; Humans; INDEL Mutation/genetics; Polymorphism, Single Nucleotide/genetics
ABSTRACT
High-throughput sequencing (HTS) is considered a technical revolution that has improved our knowledge of lymphoid and autoimmune diseases, changing our approach to leukaemia both at diagnosis and during follow-up. As part of an immunoglobulin/T cell receptor-based minimal residual disease (MRD) assessment of acute lymphoblastic leukaemia patients, we assessed the performance and feasibility of the replacement of the first steps of the approach based on DNA isolation and Sanger sequencing, using a HTS protocol combined with bioinformatics analysis and visualization using the Vidjil software. We prospectively analysed the diagnostic and relapse samples of 34 paediatric patients, thus identifying 125 leukaemic clones with recombinations on multiple loci (TRG, TRD, IGH and IGK), including Dd2/Dd3 and Intron/KDE rearrangements. Sequencing failures were halved (14% vs. 34%, P = 0.0007), enabling more patients to be monitored. Furthermore, more markers per patient could be monitored, reducing the probability of false negative MRD results. The whole analysis, from sample receipt to clinical validation, was shorter than our current diagnostic protocol, with equal resources. V(D)J recombination was successfully assigned by the software, even for unusual recombinations. This study emphasizes the progress that HTS with adapted bioinformatics tools can bring to the diagnosis of leukaemia patients.
Subject(s)
Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Precursor Cell Lymphoblastic Leukemia-Lymphoma/diagnosis; Adolescent; Adult; Child; Child, Preschool; Clone Cells; Diagnostic Errors/prevention & control; Gene Rearrangement, T-Lymphocyte; High-Throughput Nucleotide Sequencing/standards; Humans; Infant; Infant, Newborn; Neoplasm, Residual/diagnosis; Prospective Studies; Software; V(D)J Recombination/genetics; Young Adult
ABSTRACT
BACKGROUND: V(D)J recombinations in lymphocytes are essential for immunological diversity. They are also useful markers of pathologies. In leukemia, they are used to quantify the minimal residual disease during patient follow-up. However, the full breadth of lymphocyte diversity is not fully understood. RESULTS: We propose new algorithms that process high-throughput sequencing (HTS) data to extract unnamed V(D)J junctions and gather them into clones for quantification. This analysis is based on a seed heuristic and is fast and scalable because in the first phase, no alignment is performed with germline database sequences. The algorithms were applied to TR γ HTS data from a patient with acute lymphoblastic leukemia, and also on data simulating hypermutations. Our methods identified the main clone, as well as additional clones that were not identified with standard protocols. CONCLUSIONS: The proposed algorithms provide new insight into the analysis of high-throughput sequencing data for leukemia, and also to the quantitative assessment of any immunological profile. The methods described here are implemented in a C++ open-source program called Vidjil.
Subject(s)
Algorithms; High-Throughput Nucleotide Sequencing/methods; Precursor Cell Lymphoblastic Leukemia-Lymphoma/diagnosis; Sequence Analysis, DNA/methods; V(D)J Recombination; Humans; Neoplasm, Residual/diagnosis; Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics; Software
ABSTRACT
Indexing techniques relying on k-mers have proven effective in searching for RNA sequences across thousands of RNA-seq libraries, but without enabling direct RNA quantification. We show here that arbitrary RNA sequences can be quantified in seconds through their decomposition into k-mers, with a precision akin to that of conventional RNA quantification methods. Using an index of the Cancer Cell Line Encyclopedia (CCLE) collection consisting of 1019 RNA-seq samples, we show that k-mer indexing offers a powerful means to reveal non-reference sequences, and variant RNAs induced by specific gene alterations, for instance in splicing factors.
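Quantifying a sequence through its decomposition into k-mers can be sketched as follows. The abstract does not state how per-k-mer counts are aggregated, so the median used here is an assumption, as are the function names and the toy reads.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer occurrence across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def quantify(query, counts, k):
    """Estimate the abundance of `query` from the counts of its k-mers.

    Assumption: the median of the k-mer counts is used as the summary
    statistic; other aggregations (mean, minimum) are equally plausible.
    """
    values = [counts[query[i:i + k]] for i in range(len(query) - k + 1)]
    return median(values) if values else 0


# Toy sample: three short reads, indexed with k = 3.
reads = ["ACGTAC", "ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=3)
abundance = quantify("ACGTA", counts, k=3)
```

Because the counts are computed once and then only looked up, a query costs time proportional to its length, which is consistent with the "in seconds" behaviour the abstract describes for a prebuilt index.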
Subject(s)
Neoplasms; Sequence Analysis, RNA; Humans; Neoplasms/genetics; Sequence Analysis, RNA/methods; Cell Line, Tumor; Software; RNA-Seq/methods
ABSTRACT
Despite the use of midostaurin (MIDO) with intensive chemotherapy (ICT) as the front-line treatment for FLT3-mutated acute myeloid leukemia (AML), complete remission rates are close to 60-70%, and relapses occur in over 40% of cases. Here we studied the molecular mechanisms underlying the refractory/relapsed (R/R) situation in FLT3-mutated AML patients. We conducted a retrospective, multicenter study involving 150 patients with R/R AML harboring FLT3-ITD (n=130) and/or FLT3-TKD (n=26) at diagnosis, assessed by standard methods. Patients were treated in front-line with ICT + MIDO (n=54) or ICT alone (n=96) according to the diagnosis date and label of MIDO. The evolution of FLT3 clones and co-mutations was analyzed in paired diagnosis-R/R samples by targeted high-throughput sequencing. Using a dedicated algorithm for FLT3-ITD detection, 189 FLT3-ITD microclones (allelic ratio [AR] < 0.05) and 225 macroclones (AR ≥ 0.05) were detected at both time points. At R/R disease, the rate of FLT3-ITD persistence was lower in patients treated with ICT + MIDO compared with patients not receiving MIDO (68% vs. 87.5%, P=0.011). In patients receiving ICT + MIDO, detection of multiple FLT3-ITD clones (referred to as "clonal interference") was associated with a higher FLT3-ITD persistence rate at R/R disease (multiple clones: 88% vs. single clones: 57%, P=0.049). Considering both treatment groups, although only 24% of FLT3-ITD microclones detected at diagnosis were retained at relapse, 43% of them became macroclones. Together, these results identify parameters influencing the fitness of FLT3-ITD clones and highlight the importance of using sensitive techniques for FLT3-ITD screening in clinical practice.
ABSTRACT
Within the EuroClonality-NGS group, immune repertoire analysis for target identification in lymphoid malignancies was initially developed using two-stage amplicon approaches, essentially as a progressive modification of preceding methods developed for Sanger sequencing. This approach has, however, limitations with respect to sample handling, adaptation to automation, and risk of contamination by amplicon products. We therefore developed one-step PCR amplicon methods with individual barcoding for batched analysis for IGH, IGK, TRD, TRG, and TRB rearrangements, followed by Vidjil-based data analysis.
Subject(s)
Genes, T-Cell Receptor; High-Throughput Nucleotide Sequencing; Immunoglobulins; Precursor Cell Lymphoblastic Leukemia-Lymphoma; Recombination, Genetic; Genes, T-Cell Receptor/genetics; Genes, T-Cell Receptor/immunology; High-Throughput Nucleotide Sequencing/methods; Humans; Immunoglobulins/genetics; Immunoglobulins/immunology; Neoplasm, Residual/diagnosis; Neoplasm, Residual/genetics; Precursor Cell Lymphoblastic Leukemia-Lymphoma/diagnosis; Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics; Precursor Cell Lymphoblastic Leukemia-Lymphoma/immunology; Recombination, Genetic/genetics; Recombination, Genetic/immunology
ABSTRACT
B cell receptor (BcR) immunoglobulins (IG) display a tremendous diversity due to complex DNA rearrangements, the V(D)J recombination, further enhanced by the somatic hypermutation process. In chronic lymphocytic leukemia (CLL), the mutational load of the clonal BcR IG expressed by the leukemic cells constitutes an important prognostic and predictive biomarker. Here, we provide a reliable methodology capable of determining the mutational status of IG genes in CLL using high-throughput sequencing, starting from leukemic cell DNA or RNA.
Subject(s)
Leukemia, Lymphocytic, Chronic, B-Cell; Genes, Immunoglobulin; High-Throughput Nucleotide Sequencing; Humans; Immunoglobulins/genetics; Leukemia, Lymphocytic, Chronic, B-Cell/genetics; Receptors, Antigen, B-Cell/genetics
ABSTRACT
BACKGROUND: High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer-intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput calls for new algorithmic solutions to query large read collections efficiently. RESULTS: Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq). CONCLUSIONS: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible brick to design innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compliant) license from http://www.atgc-montpellier.fr/ngs/.
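The kind of query quoted above ("given a k-mer, get the reads containing this k-mer") can be sketched with a plain hash-based inverted index. This is not the Gk arrays structure itself, which is designed to use much less memory. It only shows the query semantics, and all names are hypothetical.

```python
from collections import defaultdict


def build_read_index(reads, k):
    """Map each k-mer to the list of (read_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))
    return index


def reads_containing(index, kmer):
    """Ids of the reads that contain `kmer` at least once."""
    return sorted({read_id for read_id, _ in index.get(kmer, [])})


def occurrences(index, kmer):
    """Total number of occurrences of `kmer` over all reads."""
    return len(index.get(kmer, []))


# Toy read collection indexed with k = 3.
reads = ["ACGACG", "TTACG", "GGGG"]
index = build_read_index(reads, k=3)
```

With this toy collection, `reads_containing(index, "ACG")` yields reads 0 and 1, while `occurrences(index, "ACG")` counts three hits because read 0 contains the k-mer twice. That is the distinction between "once" and "at least once" style queries.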
Subject(s)
Algorithms; Computational Biology/methods; Computers; High-Throughput Nucleotide Sequencing; Humans; Software
ABSTRACT
Amplicon-based next-generation sequencing (NGS) of immunoglobulin (IG) and T-cell receptor (TR) gene rearrangements for clonality assessment, marker identification and quantification of minimal residual disease (MRD) in lymphoid neoplasms has been the focus of intense research, development and application. However, standardization and validation in a scientifically controlled multicentre setting is still lacking. Therefore, IG/TR assay development and design, including bioinformatics, was performed within the EuroClonality-NGS working group and validated for MRD marker identification in acute lymphoblastic leukaemia (ALL). Five EuroMRD ALL reference laboratories performed IG/TR NGS in 50 diagnostic ALL samples, and compared results with those generated through routine IG/TR Sanger sequencing. A central polytarget quality control (cPT-QC) was used to monitor primer performance, and a central in-tube quality control (cIT-QC) was spiked into each sample as a library-specific quality control and calibrator. NGS identified 259 (average 5.2/sample, range 0-14) clonal sequences vs. Sanger-sequencing 248 (average 5.0/sample, range 0-14). NGS primers covered possible IG/TR rearrangement types more completely compared with local multiplex PCR sets and enabled sequencing of bi-allelic rearrangements and weak PCR products. The cPT-QC showed high reproducibility across all laboratories. These validated and reproducible quality-controlled EuroClonality-NGS assays can be used for standardized NGS-based identification of IG/TR markers in lymphoid malignancies.
Subject(s)
Gene Rearrangement, T-Lymphocyte/genetics; Genes, T-Cell Receptor/genetics; Genetic Markers/genetics; Immunoglobulins/genetics; Neoplasm, Residual/genetics; Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics; Computational Biology/methods; Genes, Immunoglobulin/genetics; High-Throughput Nucleotide Sequencing/methods; Humans; Receptors, Antigen, T-Cell/genetics; Recombination, Genetic/genetics; Reference Standards; Reproducibility of Results
ABSTRACT
BACKGROUND: Labels are a way to add information to a text, for example functional annotations such as genes on a DNA sequence. V(D)J recombinations are DNA recombinations involving two or three short genes in lymphocytes. Sequencing this short region (500 bp or less) produces labeled sequences and brings insight into the lymphocyte repertoire for onco-hematology or immunology studies. METHODS: We present two indexes for a text with non-overlapping labels. They store the text in a Burrows-Wheeler transform (BWT) and a compressed label sequence in a Wavelet Tree. The label sequence is taken in the order of the text (TL-index) or in the order of the BWT (TLBW-index). Both indexes require space related to the entropy of the labeled text. RESULTS: These indexes allow efficient text-label queries to count and find labeled patterns. The TLBW-index has an overhead on simple label queries but is very efficient on combined pattern-label queries. We implemented the indexes in C++ and compared them against a baseline solution on pseudo-random as well as on V(D)J labeled texts. DISCUSSION: New indexes such as the ones we propose improve the way we index and query labeled texts such as, for instance, lymphocyte repertoires for hematological and immunological studies.
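A combined pattern-label query can be illustrated with an uncompressed sketch: a BWT with backward search locates the pattern, and a plain per-position label list (standing in for the compressed Wavelet Tree of the abstract) filters the hits. The naive suffix sorting and linear-time rank are simplifications, and all names are assumptions. The pattern is assumed not to contain the `$` sentinel.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT: the character preceding each sorted suffix (wraps to the sentinel)."""
    return "".join(text[i - 1] for i in sa)


def backward_search(bwt, pattern):
    """Return the half-open suffix-array interval [lo, hi) matching pattern."""
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total          # number of characters smaller than c
        total += bwt.count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0, 0
        lo = first[c] + bwt[:lo].count(c)   # naive rank, O(n) per step
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_labeled(text, labels, pattern, label):
    """Start positions of `pattern` in `text` whose first letter carries `label`."""
    t = text + "$"                # sentinel, smaller than A, C, G, T
    sa = suffix_array(t)
    lo, hi = backward_search(bwt_from_sa(t, sa), pattern)
    positions = sorted(sa[i] for i in range(lo, hi))
    return [p for p in positions if labels[p] == label]


# Toy labeled text: first half labeled "V", second half labeled "J".
text = "ACGTACGA"
labels = ["V"] * 4 + ["J"] * 4
```

Here `find_labeled(text, labels, "ACG", "V")` keeps only the occurrence at position 0, while the same pattern with label `"J"` keeps the one at position 4. A real index would answer the label part without listing every pattern occurrence first.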
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pone.0166126.].
ABSTRACT
Minimal residual disease (MRD) is known to be an independent prognostic factor in patients with acute lymphoblastic leukemia (ALL). High-throughput sequencing (HTS) is currently used in routine practice for the diagnosis and follow-up of patients with hematological neoplasms. In this retrospective study, we examined the role of immunoglobulin/T-cell receptor-based MRD in patients with ALL by HTS analysis of immunoglobulin H and/or T-cell receptor gamma chain loci in bone marrow samples from 11 patients with ALL, at diagnosis and during follow-up. We assessed the clinical feasibility of using combined HTS and bioinformatics analysis with interactive visualization using Vidjil software. We discuss the advantages and drawbacks of HTS for monitoring MRD. HTS gives a more complete insight of the leukemic population than conventional real-time quantitative PCR (qPCR), and allows identification of new emerging clones at each time point of the monitoring. Thus, HTS monitoring of Ig/TR based MRD is expected to improve the management of patients with ALL.
Subject(s)
High-Throughput Nucleotide Sequencing/methods; Neoplasm, Residual/diagnosis; Precursor Cell Lymphoblastic Leukemia-Lymphoma/diagnosis; Bone Marrow; Clone Cells/pathology; Follow-Up Studies; Genes, T-Cell Receptor gamma; Humans; Immunoglobulin Heavy Chains/genetics; Monitoring, Immunologic; Neoplasm, Residual/genetics; Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics; Retrospective Studies; Software
ABSTRACT
We introduce a k-mer-based computational protocol, DE-kupl, for capturing local RNA variation in a set of RNA-seq libraries, independently of a reference genome or transcriptome. DE-kupl extracts all k-mers with differential abundance directly from the raw data files. This enables the retrieval of virtually all variation present in an RNA-seq data set. This variation is subsequently assigned to biological events or entities such as differential long non-coding RNAs, splice and polyadenylation variants, introns, repeats, editing or mutation events, and exogenous RNA. Applying DE-kupl to human RNA-seq data sets identified multiple types of novel events, reproducibly across independent RNA-seq experiments.
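Extracting k-mers with differential abundance between two groups of libraries can be sketched in a very reduced form. The fold-change filter with a pseudocount below is a crude stand-in chosen for illustration. It is not the selection procedure of DE-kupl, and the names, thresholds, and toy reads are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(cond_a, cond_b, k, min_fold=4.0, pseudo=1.0):
    """Return {k-mer: fold change A/B} for k-mers passing the fold filter.

    Assumption: a simple ratio of pseudocount-adjusted raw counts is used;
    a real analysis would normalise by library size and apply a statistical test.
    """
    a = kmer_counts(cond_a, k)
    b = kmer_counts(cond_b, k)
    selected = {}
    for kmer in set(a) | set(b):
        fold = (a[kmer] + pseudo) / (b[kmer] + pseudo)
        if fold >= min_fold or fold <= 1 / min_fold:
            selected[kmer] = fold
    return selected


# Toy conditions: "ACGT" is enriched in condition A, "TTTT" is unchanged.
cond_a = ["ACGT"] * 7 + ["TTTT"] * 2
cond_b = ["ACGT"] * 1 + ["TTTT"] * 2
result = differential_kmers(cond_a, cond_b, k=4)
```

Because the comparison works directly on k-mers from the raw reads, no reference genome or transcriptome is involved, which is the property the abstract emphasises.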
Subject(s)
Computational Biology/methods; Genetic Variation; RNA/genetics; Software; Alleles; Gene Expression Profiling; Gene Expression Regulation; High-Throughput Nucleotide Sequencing; Humans; Polyadenylation; RNA Splicing; RNA, Antisense; RNA, Long Noncoding/genetics; RNA, Messenger/genetics; Reproducibility of Results; Sequence Analysis, RNA; Transcriptome
ABSTRACT
BACKGROUND: B and T lymphocytes are white blood cells that play a key role in adaptive immunity. A part of their DNA, called the V(D)J recombinations, is specific to each lymphocyte and enables recognition of specific antigens. Today, with new sequencing techniques, one can get billions of DNA sequences from these regions. With dedicated Repertoire Sequencing (RepSeq) methods, it is now possible to picture populations of lymphocytes, and to monitor more accurately the immune response as well as pathologies such as leukemia. METHODS AND RESULTS: Vidjil is an open-source platform for the interactive analysis of high-throughput sequencing data from lymphocyte recombinations. It contains an algorithm gathering reads into clonotypes according to their V(D)J junctions, a web application made of a sample, experiment and patient database, and a visualization for the analysis of clonotypes over time. Vidjil is implemented in C++, Python and JavaScript and licensed under the GPLv3 open-source license. Source code, binaries and a public web server are available at http://www.vidjil.org and at http://bioinfo.lille.inria.fr/vidjil. Using the Vidjil web application consists of four steps: 1. uploading a raw sequence file (typically a FASTQ); 2. running RepSeq analysis software; 3. visualizing the results; 4. annotating the results and saving them for future use. For the end-user, the Vidjil web application needs no specific installation and just requires a connection and a modern web browser. Vidjil is used by labs in hematology or immunology for research and clinical applications.
Subject(s)
Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; V(D)J Recombination/genetics; Web Browser; Algorithms; Base Sequence; Humans; Internet; Lymphocytes/immunology; Lymphocytes/metabolism; Reproducibility of Results; Sequence Homology, Nucleic Acid
ABSTRACT
A large number of RNA-sequencing studies set out to predict mutations, splice junctions or fusion RNAs. We propose a method, CRAC, that integrates genomic locations and local coverage to enable such predictions to be made directly from RNA-seq read analysis. A k-mer profiling approach detects candidate mutations, indels and splice or chimeric junctions in each single read. CRAC increases precision compared with existing tools, reaching 99.5% for splice junctions, without losing sensitivity. Importantly, CRAC predictions improve with read length. In cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted novel recurrent chimeric junctions. CRAC is available at http://crac.gforge.inria.fr.