Results 1 - 20 of 35
1.
J Bioinform Comput Biol ; 22(4): 2450019, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39215522

ABSTRACT

Sequence graphs represent the genetic variation of a pan-genome more concisely and space-efficiently than multiple linear reference genomes. To accelerate the alignment of reads to the graph, an index of the graph-based reference genome is used to obtain candidate locations. However, the potential combinatorial explosion of nodes in the sequence graph considerably increases the index size and the peak memory usage of the alignment process, especially for large-scale datasets. To address this, existing methods typically prune complex regions or extend the seed length, which sacrifices the recall of the alignment algorithm while reducing space usage only slightly. We present the Sparse-index of Graph (SIG) and the alignment algorithm SIG-Aligner, which index and align at a lower memory cost. SIG builds a non-overlapping minimizer index inside the nodes of the sequence graph, and SIG-Aligner filters out most false-positive matches with a method based on the pigeonhole principle. In computational experiments against Giraffe, SIG reduces index memory by 50% to 75% for human pan-genome graphs while preserving superior or comparable alignment accuracy and a faster alignment time.


Subject(s)
Algorithms , Sequence Alignment , Sequence Analysis, DNA , Humans , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , Genome, Human , Software , Genomics/methods , Genomics/statistics & numerical data , Genome
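The two ideas named in the abstract above, minimizer sampling and pigeonhole filtering, can be illustrated with a minimal Python sketch. This is a generic illustration, not the SIG or SIG-Aligner implementation: the function names, the lexicographic k-mer ordering, the sliding (overlapping) windows, and the even split of the read are all assumptions made for the example.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs: the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Leftmost smallest k-mer in the window (lexicographic order is an assumption).
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        # Adjacent windows often share a minimizer; record each position once.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


def split_for_pigeonhole(read: str, max_errors: int) -> list[str]:
    """Split a read into max_errors + 1 pieces.

    Pigeonhole principle: if the read aligns with at most max_errors edits,
    at least one of these pieces must match the reference exactly.
    """
    n = max_errors + 1
    size, extra = divmod(len(read), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        pieces.append(read[start:end])
        start = end
    return pieces


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=2))
    print(split_for_pigeonhole("ACGTACGTAC", max_errors=2))
```

A candidate location that is not supported by at least one exactly matching piece can be discarded before any expensive alignment is attempted, which is the general sense in which a pigeonhole filter removes false-positive seed matches.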
2.
BMC Bioinformatics ; 25(1): 238, 2024 Jul 13.
Article in English | MEDLINE | ID: mdl-39003441

ABSTRACT

MOTIVATION: Alignment of reads to a reference genome sequence is one of the key steps in the analysis of human whole-genome sequencing data obtained through Next-generation sequencing (NGS) technologies. The quality of the subsequent steps of the analysis, such as the results of clinical interpretation of genetic variants or the results of a genome-wide association study, depends on the correct identification of the position of the read as a result of its alignment. The amount of human NGS whole-genome sequencing data is constantly growing. There are a number of human genome sequencing projects worldwide that have resulted in the creation of large-scale databases of genetic variants of sequenced human genomes. Such information about known genetic variants can be used to improve the quality of alignment at the read alignment stage when analysing sequencing data obtained for a new individual, for example, by creating a genomic graph. While existing methods for aligning reads to a linear reference genome have high alignment speed, methods for aligning reads to a genomic graph have greater accuracy in variable regions of the genome. The development of a read alignment method that takes into account known genetic variants in the linear reference sequence index allows combining the advantages of both sets of methods. RESULTS: In this paper, we present the minimap2_index_modifier tool, which enables the construction of a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. The use of the modified minimap2 index improves variant calling quality without modifying the bioinformatics pipeline and without significant additional computational overhead. 
Using the PrecisionFDA Truth Challenge V2 benchmark data (for HG002 short-read data aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14) it was demonstrated that the number of false negative genetic variants decreased by more than 9500, and the number of false positives decreased by more than 7000 when modifying the index with genetic variants from the Human Pangenome Reference Consortium.


Subject(s)
Genetic Variation , Genome, Human , Whole Genome Sequencing , Humans , Whole Genome Sequencing/methods , Genetic Variation/genetics , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide/genetics , Sequence Alignment/methods , Software , Algorithms , Genome-Wide Association Study/methods
3.
Plants (Basel) ; 13(5), 2024 Feb 21.
Article in English | MEDLINE | ID: mdl-38475429

ABSTRACT

The ultimate goal in selecting RNA-Seq alignment software is to perform accurate alignments with a robust algorithm that can detect the various intricacies underlying read-mapping procedures and beyond. Most alignment software tools are typically pre-tuned with human or prokaryotic data, and therefore may not be suitable for applications to other organisms, such as plants. The rapidly growing plant RNA-Seq databases call for the assessment of the alignment tools on curated plant data, which will aid the calibration of these tools for applications to plant transcriptomic data. We therefore focused here on benchmarking RNA-Seq read alignment tools, using simulated data derived from the model organism Arabidopsis thaliana. We assessed the performance of five popular RNA-Seq alignment tools that are currently available, selected by their usage (citation count). By introducing annotated single nucleotide polymorphisms (SNPs) from The Arabidopsis Information Resource (TAIR), we recorded alignment accuracy at both base-level and junction base-level resolutions for each alignment tool. In addition to assessing the performance of the alignment tools at their default settings, accuracies were also recorded by varying the values of numerous parameters, including the confidence threshold and the level of SNP introduction. The performances of the aligners were found to be consistent under various testing conditions at the base-level accuracy; however, the junction base-level assessment produced varying results depending upon the applied algorithm. At the read base-level assessment, the overall performance of the aligner STAR was superior to other aligners, with the overall accuracy reaching over 90% under different test conditions. On the other hand, at the junction base-level assessment, SubRead emerged as the most promising aligner, with an overall accuracy over 80% under most test conditions.

4.
Front Genet ; 14: 997383, 2023.
Article in English | MEDLINE | ID: mdl-36999049

ABSTRACT

RNA sequencing (RNA-seq) has become an exemplary technology in modern biology and clinical science. Its immense popularity is due in large part to the continuous efforts of the bioinformatics community to develop accurate and scalable computational tools to analyze the enormous amounts of transcriptomic data that it produces. RNA-seq analysis enables genes and their corresponding transcripts to be probed for a variety of purposes, such as detecting novel exons or whole transcripts, assessing expression of genes and alternative transcripts, and studying alternative splicing structure. It can be a challenge, however, to obtain meaningful biological signals from raw RNA-seq data because of the enormous scale of the data as well as the inherent limitations of different sequencing technologies, such as amplification bias or biases of library preparation. The need to overcome these technical challenges has pushed the rapid development of novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. These tools, combined with the diverse computational skill sets of biomedical researchers, help to unlock the full potential of RNA-seq. The purpose of this review is to explain basic concepts in the computational analysis of RNA-seq data and define discipline-specific jargon.

5.
Methods Mol Biol ; 2607: 199-214, 2023.
Article in English | MEDLINE | ID: mdl-36449165

ABSTRACT

Alignment of short-read sequencing data to interspersed genomic repeats, such as transposable elements, can be problematic. This is especially true for evolutionarily young elements, which have not sufficiently diverged from each other to produce distinct and uniquely mappable reads. Mapping difficulties pose a challenge for studying the portfolio of epigenetic modifications and other chromatin regulators that bind to transposons and dictate their activity, which are typically studied using chromatin immunoprecipitation followed by sequencing (ChIP-seq). Since ChIP-seq requires chromatin fragmentation to achieve appropriate resolution, longer reads do not appreciably improve mappability. Here, we present an experimental and computational protocol that couples ChIP-seq with 3D genome folding information to produce protein binding profiles with dramatically increased coverage at interspersed repeats.


Subject(s)
Chromatin Immunoprecipitation Sequencing , Chromatin , Protein Binding , Chromatin/genetics , Chromatin Immunoprecipitation , DNA Transposable Elements/genetics
6.
Genome Biol ; 23(1): 260, 2022 12 15.
Article in English | MEDLINE | ID: mdl-36522758

ABSTRACT

Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Sequence Alignment , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Seeds
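As a rough illustration of one of the two seeding ideas combined in the abstract above, the following Python sketch selects closed syncmers: k-mers whose smallest s-mer sits at the first or last s-mer position. It is a simplified example, not strobealign's code; the function name, the lexicographic ordering (real tools typically hash s-mers), the parameter values, and the choice of the closed variant are assumptions, and the strobemer linking step is not shown.

```python
def closed_syncmers(seq: str, k: int, s: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs for k-mers that are closed syncmers.

    A k-mer qualifies when a smallest of its s-mers occurs at the first
    or the last s-mer position. The decision depends only on the k-mer
    itself, not on its neighbours in the sequence.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = min(smers)  # lexicographic order is an assumption
        if smers[0] == smallest or smers[-1] == smallest:
            selected.append((i, kmer))
    return selected


if __name__ == "__main__":
    for pos, kmer in closed_syncmers("TACGTTGCA", k=5, s=2):
        print(pos, kmer)
```

Because selection is a property of the k-mer alone, the same k-mer is chosen in the read and in the reference regardless of surrounding mutations, which is the property that makes syncmers attractive as seeds.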
7.
Int J Mol Sci ; 23(19), 2022 Sep 29.
Article in English | MEDLINE | ID: mdl-36232783

ABSTRACT

Advances in next-generation sequencing technology have led to a dramatic decrease in read-generation cost and an increase in read output. Reconstruction of short DNA sequence reads generated by next-generation sequencing requires a read alignment method that reconstructs a reference genome. In addition, it is essential to analyze the results of read alignments for a biologically meaningful inference. However, read alignment from vast amounts of genomic data from various organisms is challenging in that it involves repeated automatic and manual analysis steps. Here, we devised the cPlot software for read alignment of nucleotide sequences, with automated read alignment and position analysis, which allows visual assessment of the analysis results by the user. cPlot compares sequence similarity of reads by performing multiple read alignments, with FASTA format files as the input. This application provides a web-based interface for the user for facile implementation, without the need for a dedicated computing environment. cPlot identifies the location and order of the sequencing reads by comparing the sequence to a genetically close reference sequence in a way that is effective for visualizing the assembly of short reads generated by NGS and rapid gene map construction.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Algorithms , Base Sequence , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment , Sequence Analysis, DNA/methods
8.
Cells ; 11(4), 2022 02 10.
Article in English | MEDLINE | ID: mdl-35203259

ABSTRACT

Advances in sequencing and assembly technology have led to the creation of genome assemblies for a wide variety of non-model organisms. The rapid production and proliferation of updated, novel assembly versions can create vexing problems for researchers when multiple-genome assembly versions are available at once, requiring researchers to work with more than one reference genome. Multiple-genome assemblies are especially problematic for researchers studying the genetic makeup of individual cells, as single-cell RNA sequencing (scRNAseq) requires sequenced reads to be mapped and aligned to a single reference genome. Using the Astyanax mexicanus, this study highlights how the interpretation of a single-cell dataset from the same sample changes when aligned to its two different available genome assemblies. We found that the number of cells and expressed genes detected were drastically different when aligning to the different assemblies. When the genome assemblies were used in isolation with their respective annotations, cell-type identification was confounded, as some classic cell-type markers were assembly-specific, whilst other genes showed differential patterns of expression between the two assemblies. To overcome the problems posed by multiple-genome assemblies, we propose that researchers align to each available assembly and then integrate the resultant datasets to produce a final dataset in which all genome alignments can be used simultaneously. We found that this approach increased the accuracy of cell-type identification and maximised the amount of data that could be extracted from our single-cell sample by capturing all possible cells and transcripts. As scRNAseq becomes more widely available, it is imperative that the single-cell community is aware of how genome assembly alignment can alter single-cell data and their interpretation, especially when reviewing studies on non-model organisms.


Subject(s)
Genome , Base Sequence , Genome/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, RNA , Exome Sequencing
9.
Open Res Eur ; 2: 75, 2022.
Article in English | MEDLINE | ID: mdl-37645349

ABSTRACT

Background: The maintenance, regulation, and dynamics of heterochromatin in the human malaria parasite, Plasmodium falciparum, has drawn increasing attention due to its regulatory role in mutually exclusive virulence gene expression and the silencing of key developmental regulators. The advent of genome-wide analyses such as chromatin-immunoprecipitation followed by sequencing (ChIP-seq) has been instrumental in understanding chromatin composition; however, even in model organisms, ChIP-seq experiments are susceptible to intrinsic experimental biases arising from underlying chromatin structure. Methods: We performed a control ChIP-seq experiment, re-analyzed previously published ChIP-seq datasets and compared different analysis approaches to characterize biases of genome-wide analyses in P. falciparum. Results: We found that heterochromatic regions in input control samples used for ChIP-seq normalization are systematically underrepresented in regard to sequencing coverage across the P. falciparum genome. This underrepresentation, in combination with a non-specific or inefficient immunoprecipitation, can lead to the identification of false enrichment and peaks across these regions. We observed that such biases can also be seen at background levels in specific and efficient ChIP-seq experiments. We further report on how different read mapping approaches can also skew sequencing coverage within highly similar subtelomeric regions and virulence gene families. To ameliorate these issues, we discuss orthogonal methods that can be used to characterize bona fide chromatin-associated proteins. Conclusions: Our results highlight the impact of chromatin structure on genome-wide analyses in the parasite and the need for caution when characterizing chromatin-associated proteins and features.

10.
Methods Mol Biol ; 2416: 213-237, 2022.
Article in English | MEDLINE | ID: mdl-34870839

ABSTRACT

Over the last decade, RNA-Sequencing (RNA-Seq) has revolutionized the field of transcriptomics due to its sheer advantage over previous technologies for studying gene expression. Even the domain of stem cell bioinformatics has benefited from these advancements. It has helped look deeper into how the process of pluripotency is maintained by stem cells and how it may be exploited for application in regenerative medicine. However, as it is still an evolving technology, there is no single accepted protocol for RNA-Seq data analysis. From a wide array of tools and/or algorithms available for the purpose, researchers tend to develop a pipeline that is best suited for their sample, experimental design, and computational power. In this tutorial, we describe a pipeline based on open-source tools to analyze RNA-Seq data from naïve and primed state human pluripotent stem cell samples. Specifically, we show how RNA-Seq data can be downloaded from databases, processed, and used to identify differentially expressed genes and construct a co-expression network. Further, we also show how the list of interesting genes obtained from differential expression testing or the co-expression network can be analyzed to gain biological insights.


Subject(s)
Pluripotent Stem Cells , Transcriptome , Computational Biology , Gene Expression Profiling , Humans , Sequence Analysis, RNA
11.
Methods Mol Biol ; 2181: 13-34, 2021.
Article in English | MEDLINE | ID: mdl-32729072

ABSTRACT

Computers are able to systematically exploit RNA-seq data, allowing us to efficiently detect RNA editing sites on a genome-wide scale. This chapter introduces a very flexible computational framework for detecting RNA editing sites in plant organelles. This framework comprises three major steps: RNA-seq data processing, RNA read alignment, and RNA editing site detection. Each step is discussed in sufficient detail to be implemented by the reader. As a case study, the framework will be used with publicly available sequencing data to detect C-to-U RNA editing sites in the coding sequences of the mitochondrial genome of Nicotiana tabacum.


Subject(s)
Computational Biology/methods , Genome, Mitochondrial , Mitochondria/genetics , Nicotiana/genetics , RNA Editing/genetics , RNA, Mitochondrial/genetics , Cytidine/chemistry , Cytidine/genetics , High-Throughput Nucleotide Sequencing , Mitochondria/metabolism , RNA, Mitochondrial/metabolism , Software , Nicotiana/metabolism , Transcriptome , Uridine/chemistry , Uridine/genetics
12.
BMC Genomics ; 21(Suppl 6): 500, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349238

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) enables unbiased detection of pathogens by mapping the sequencing reads of a patient sample to the known reference sequence of bacteria and viruses. However, for a new pathogen without a reference sequence of a close relative, or with a high load of mutations compared to its predecessors, read mapping fails due to a low similarity between the pathogen and reference sequence, which in turn leads to insensitive and inaccurate pathogen detection outcomes. RESULTS: We developed MegaPath, which runs fast and provides high sensitivity in detecting new pathogens. In MegaPath, we have implemented and tested a combination of polishing techniques to remove non-informative human reads and spurious alignments. MegaPath applies a global optimization to the read alignments and reassigns the reads incorrectly aligned to multiple species to a unique species. The reassignment not only significantly increased the number of reads aligned to distant pathogens, but also significantly reduced incorrect alignments. MegaPath implements an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm to run fast. CONCLUSIONS: In our benchmarks, MegaPath demonstrated superior sensitivity by detecting eight times more reads from a low-similarity pathogen than other tools. Meanwhile, MegaPath ran much faster than the other state-of-the-art alignment-based pathogen detection tools (and was comparable with the less sensitive profile-based pathogen detection tools). The running time of MegaPath is about 20 min on a typical 1 Gb dataset.


Subject(s)
Metagenomics , Software , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Metagenome , Sequence Alignment , Sequence Analysis, DNA
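For readers unfamiliar with the Smith-Waterman algorithm mentioned in the abstract above, a minimal scalar version in Python is sketched below. It is not MegaPath's SIMD-accelerated implementation; the function name, the linear gap penalty, and the scoring values are assumptions chosen for the example, and only the best local score is returned (no traceback).

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

SIMD implementations compute the same recurrence but fill many matrix cells per instruction, which is where the speed-up comes from.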
13.
Genome Biol ; 21(1): 239, 2020 09 07.
Article in English | MEDLINE | ID: mdl-32894187

ABSTRACT

BACKGROUND: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. RESULTS: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. CONCLUSION: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


Subject(s)
Chromosome Mapping/methods , Sequence Alignment/methods , Algorithms , Animals , Gene Expression Profiling , Mice , Sequence Analysis, RNA , Transcriptome
14.
Genome Biol ; 21(1): 65, 2020 03 11.
Article in English | MEDLINE | ID: mdl-32160922

ABSTRACT

The practical use of graph-based reference genomes depends on the ability to align reads to them. Performing substring queries to paths through these graphs lies at the core of this task. The combination of increasing pattern length and encoded variations inevitably leads to a combinatorial explosion of the search space. Instead of heuristic filtering or pruning steps to reduce the complexity, we propose CHOP, a method that constrains the search space by exploiting haplotype information, bounding the search space to the number of haplotypes so that a combinatorial explosion is prevented. We show that CHOP can be applied to large and complex datasets, by applying it on a graph-based representation of the human genome encoding all 80 million variants reported by the 1000 Genomes Project.


Subject(s)
Genome, Human , Haplotypes , Computer Graphics , Genomics , Humans
15.
Genome Biol ; 20(1): 274, 2019 12 16.
Article in English | MEDLINE | ID: mdl-31842925

ABSTRACT

The alignment of long-read RNA sequencing reads is non-trivial due to high sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass alignment approach, which constructs graph-based alignment skeletons to infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and sequencing errors, which break through bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that deSALT has a greater ability to produce accurate and homogeneous full-length alignments. deSALT is available at: https://github.com/hitbc/deSALT.


Subject(s)
Sequence Alignment/methods , Animals , Humans , Software
16.
BMC Med Inform Decis Mak ; 19(Suppl 6): 265, 2019 12 19.
Article in English | MEDLINE | ID: mdl-31856811

ABSTRACT

BACKGROUND: Many genetic variants have been reported from sequencing projects due to decreasing experimental costs. Compared to the current typical paradigm, read mapping incorporating existing variants can improve the performance of subsequent analysis. This method is supposed to map sequencing reads efficiently to a graphical index with a reference genome and known variation to increase alignment quality and variant calling accuracy. However, storing and indexing various types of variation require costly RAM space. METHODS: Aligning reads to a graph model-based index including the whole set of variants is ultimately an NP-hard problem in theory. Here, we propose a variation-aware read alignment algorithm (VARA), which generates the alignment between read and multiple genomic sequences simultaneously utilizing the schema of the Landau-Vishkin algorithm. VARA dynamically extracts regional variants to construct a pseudo tree-based structure on-the-fly for seed extension without loading the whole genome variation into memory space. RESULTS: We developed the novel high-throughput sequencing read aligner deBGA-VARA by integrating VARA into deBGA. The deBGA-VARA is benchmarked both on simulated reads and the NA12878 sequencing dataset. The experimental results demonstrate that read alignment incorporating genetic variation knowledge can achieve high sensitivity and accuracy. CONCLUSIONS: Due to its efficiency, VARA provides a promising solution for further improvement of variant calling while maintaining small memory footprints. The deBGA-VARA is available at: https://github.com/hitbc/deBGA-VARA.


Subject(s)
Algorithms , Genetic Variation/genetics , Genome, Human/genetics , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA/classification , Benchmarking , Humans , Sequence Analysis, DNA/methods , Software
17.
BMC Genomics ; 20(1): 701, 2019 Sep 09.
Article in English | MEDLINE | ID: mdl-31500583

ABSTRACT

BACKGROUND: The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. RESULTS: A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation. CONCLUSIONS: Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study serves as important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.


Subject(s)
Genetic Variation , Genomics/methods , Benchmarking , Databases, Genetic , Genome, Plant/genetics
18.
J Bioinform Comput Biol ; 17(2): 1950008, 2019 04.
Article in English | MEDLINE | ID: mdl-31057068

ABSTRACT

New-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Alignment of these reads to a reference genome is a core step in next-generation sequencing data analyses such as genetic variation analysis and genome re-sequencing. There is therefore a need for a new approach that aligns these enormous numbers of reads to the reference genome efficiently with respect to both memory and time. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require huge memory for whole-reference-genome indexing and read alignment. Gapped alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, an efficient approach, WIT, is proposed for reference genome indexing and read alignment using the Burrows-Wheeler Transform (BWT) and Wavelet Tree (WT). It supports both exact and approximate alignment. Experiments show that WIT performs best in the case of protein sequence indexing. For indexing the reference genome, WIT requires 0.6 N space (N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25 N and 5 N. Experimentally, it is also observed that, even with such a small index, the alignment time of the proposed approach is comparable to that of BWA, Subread, Kart, and Minimap2. The other alignment parameters, accuracy and confidentiality, are also experimentally shown to be better than those of Minimap2. The source code of the proposed approach WIT is available at http://www.algorithm-skg.com/wit/home.html.


Subject(s)
Genome , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Abstracting and Indexing , Algorithms , Animals , Computational Biology/methods , Genome, Human , Humans , Pan troglodytes/genetics , Proteins/genetics
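The Burrows-Wheeler Transform that the abstract above builds on can be sketched in a few lines of Python, together with the backward search used for exact matching. This is a toy illustration, not the WIT implementation: the function names are hypothetical, the transform is built by naively sorting rotations, and rank queries are answered by counting over string prefixes, which is the operation a wavelet tree would answer efficiently in a real index.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform of text, using '$' as the end-of-text sentinel."""
    text += "$"  # '$' sorts before the letters A-Z and a-z in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text via backward search."""
    counts: dict[str, int] = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text that are strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_str)  # current half-open range of sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Naive rank: occurrences of ch in a prefix of the BWT.
        lo = first[ch] + bwt_str[:lo].count(ch)
        hi = first[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(count_occurrences(transformed, "ana"))
```

The search never touches the original text, only the transformed string and the rank counts, which is why BWT-based indexes can be stored compactly.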
19.
Bioinform Biol Insights ; 13: 1177932218821373, 2019.
Article in English | MEDLINE | ID: mdl-30792576

ABSTRACT

The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve a much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness this potential. To reach this goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme consists of a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads when compared to the best state-of-the-art programs. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.

20.
Methods Mol Biol ; 1908: 37-48, 2019.
Article in English | MEDLINE | ID: mdl-30649719

ABSTRACT

The use of next-generation sequencing and hybridization-based capture for target enrichment has enabled the interrogation of coding regions of several clinically significant cancer genes in tumor specimens, using panels that range from targeted panels of a few to hundreds of genes to whole-exome panels encompassing the coding regions of all genes in the genome. Next-generation sequencing (NGS) technologies produce millions of relatively short segments of sequences or reads that require bioinformatics tools to map reads back to a reference genome using various read alignment tools, as well as to determine differences between single bases (single nucleotide variants or SNVs) or multiple bases (insertions and deletions or indels) between the aligned reads and the reference genome to call variants. In addition to single nucleotide changes or small insertions and deletions, high copy gains and losses can also be gleaned from NGS data to call gene amplifications and deletions. Throughout these processes, numerous quality control metrics can be assessed at each step to ensure that the resulting called variants are of high quality and are accurate. In this chapter we review common tools used to generate reads from Illumina-derived sequence data, align reads, and call variants from hybridization-based targeted NGS panel data generated from tumor FFPE-derived DNA specimens as well as basic quality metrics to assess for each assayed specimen.


Subject(s)
Computational Biology/methods , Exome Sequencing/methods , High-Throughput Nucleotide Sequencing/methods , Mutation , Neoplasms/genetics , DNA, Neoplasm , Humans , Nucleic Acid Hybridization/methods , Paraffin Embedding , Polymorphism, Genetic , RNA, Neoplasm , Tissue Fixation