Pesquisa | Biblioteca Virtual em Saúde

1.

TKSM: highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator.

Karaoglanoglu, Fatih; Orabi, Baraa; Flannigan, Ryan; Chauve, Cedric; Hach, Faraz.

Bioinformatics ; 40(2)2024 02 01.

Artigo em Inglês | MEDLINE | ID: mdl-38273664

RESUMO

MOTIVATION: Transcriptomic long-read (LR) sequencing is an increasingly cost-effective technology for probing various RNA features. Numerous tools have been developed to tackle various transcriptomic sequencing tasks (e.g. isoform and gene fusion detection). However, the lack of abundant gold-standard datasets hinders the benchmarking of such tools. Therefore, the simulation of LR sequencing is an important and practical alternative. While the existing LR simulators aim to imitate the sequencing machine noise and to target specific library protocols, they lack some important library preparation steps (e.g. PCR) and are difficult to modify to new and changing library preparation techniques (e.g. single-cell LRs). RESULTS: We present TKSM, a modular and scalable LR simulator, designed so that each RNA modification step is targeted explicitly by a specific module. This allows the user to assemble a simulation pipeline as a combination of TKSM modules to emulate a specific sequencing design. Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps. AVAILABILITY AND IMPLEMENTATION: TKSM is available as an open source software at https://github.com/vpc-ccg/tksm.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Software , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Simulação por Computador , RNA , Perfilação da Expressão Gênica

2.

Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing.

Orabi, Baraa; Xie, Ning; McConeghy, Brian; Dong, Xuesen; Chauve, Cedric; Hach, Faraz.

Nucleic Acids Res ; 51(2): e11, 2023 01 25.

Artigo em Inglês | MEDLINE | ID: mdl-36478271

RESUMO

Alternative splicing (AS) is an important mechanism in the development of many cancers, as novel or aberrant AS patterns play an important role as an independent onco-driver. In addition, cancer-specific AS is potentially an effective target of personalized cancer therapeutics. However, detecting AS events remains a challenging task, especially if these AS events are novel. This is exacerbated by the fact that existing transcriptome annotation databases are far from being comprehensive, especially with regard to cancer-specific AS. Additionally, traditional sequencing technologies are severely limited by the short length of the generated reads, which rarely spans more than a single splice junction site. Given these challenges, transcriptomic long-read (LR) sequencing presents a promising potential for the detection and discovery of AS. We present Freddie, a computational annotation-independent isoform discovery and detection tool. Freddie takes as input transcriptomic LR sequencing of a sample alongside its genomic split alignment and computes a set of isoforms for the given sample. It then partitions the input reads into sets that can be processed independently and in parallel. For each partition, Freddie segments the genomic alignment of the reads into canonical exon segments. The goal of this segmentation is to be able to represent any potential isoform as a subset of these canonical exons. This segmentation is formulated as an optimization problem and is solved with a dynamic programming algorithm. Then, Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation. The clustering and error-correcting step is formulated as an optimization problem-the Minimum Error Clustering into Isoforms (MErCi) problem-and is solved using integer linear programming (ILP). We compare the performance of Freddie on simulated datasets with other isoform detection tools with varying dependence on annotation databases. We show that Freddie outperforms the other tools in its accuracy, including those given the complete ground truth annotation. We also run Freddie on a transcriptomic LR dataset generated in-house from a prostate cancer cell line with a matched short-read RNA-seq dataset. Freddie results in isoforms with a higher short-read cross-validation rate than the other tested tools. Freddie is open source and available at https://github.com/vpc-ccg/freddie/.

Assuntos

Processamento Alternativo , Software , Transcriptoma , Perfilação da Expressão Gênica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Isoformas de Proteínas/genética , Isoformas de Proteínas/metabolismo , RNA-Seq , Análise de Sequência de RNA/métodos

3.

Optimized high-throughput screening of non-coding variants identified from genome-wide association studies.

Morova, Tunc; Ding, Yi; Huang, Chia-Chi F; Sar, Funda; Schwarz, Tommer; Giambartolomei, Claudia; Baca, Sylvan C; Grishin, Dennis; Hach, Faraz; Gusev, Alexander; Freedman, Matthew L; Pasaniuc, Bogdan; Lack, Nathan A.

Nucleic Acids Res ; 51(3): e18, 2023 02 22.

Artigo em Inglês | MEDLINE | ID: mdl-36546757

RESUMO

The vast majority of disease-associated single nucleotide polymorphisms (SNP) identified from genome-wide association studies (GWAS) are localized in non-coding regions. A significant fraction of these variants impact transcription factors binding to enhancer elements and alter gene expression. To functionally interrogate the activity of such variants we developed snpSTARRseq, a high-throughput experimental method that can interrogate the functional impact of hundreds to thousands of non-coding variants on enhancer activity. snpSTARRseq dramatically improves signal-to-noise by utilizing a novel sequencing and bioinformatic approach that increases both insert size and the number of variants tested per loci. Using this strategy, we interrogated known prostate cancer (PCa) risk-associated loci and demonstrated that 35% of them harbor SNPs that significantly altered enhancer activity. Combining these results with chromosomal looping data we could identify interacting genes and provide a mechanism of action for 20 PCa GWAS risk regions. When benchmarked to orthogonal methods, snpSTARRseq showed a strong correlation with in vivo experimental allelic-imbalance studies whereas there was no correlation with predictive in silico approaches. Overall, snpSTARRseq provides an integrated experimental and computational framework to functionally test non-coding genetic variants.

Assuntos

Estudo de Associação Genômica Ampla , Sequências Reguladoras de Ácido Nucleico , Humanos , Masculino , Predisposição Genética para Doença , Polimorfismo de Nucleotídeo Único , Fatores de Transcrição/genética

4.

Genion, an accurate tool to detect gene fusion from long transcriptomics reads.

Karaoglanoglu, Fatih; Chauve, Cedric; Hach, Faraz.

BMC Genomics ; 23(1): 129, 2022 Feb 14.

Artigo em Inglês | MEDLINE | ID: mdl-35164688

RESUMO

BACKGROUND: The advent of next-generation sequencing technologies empowered a wide variety of transcriptomics studies. A widely studied topic is gene fusion which is observed in many cancer types and suspected of having oncogenic properties. Gene fusions are the result of structural genomic events that bring two genes closely located and result in a fused transcript. This is different from fusion transcripts created during or after the transcription process. These chimeric transcripts are also known as read-through and trans-splicing transcripts. Gene fusion discovery with short reads is a well-studied problem, and many methods have been developed. But the sensitivity of these methods is limited by the technology, especially the short read length. Advances in long-read sequencing technologies allow the generation of long transcriptomics reads at a low cost. Transcriptomic long-read sequencing presents unique opportunities to overcome the shortcomings of short-read technologies for gene fusion detection while introducing new challenges. RESULTS: We present Genion, a sensitive and fast gene fusion detection method that can also detect read-through events. We compare Genion against a recently introduced long-read gene fusion discovery method, LongGF, both on simulated and real datasets. On simulated data, Genion accurately identifies the gene fusions and its clustering accuracy for detecting fusion reads is better than LongGF. Furthermore, our results on the breast cancer cell line MCF-7 show that Genion correctly identifies all the experimentally validated gene fusions. CONCLUSIONS: Genion is an accurate gene fusion caller. Genion is implemented in C++ and is available at https://github.com/vpc-ccg/genion .

Assuntos

Software , Transcriptoma , Fusão Gênica , Genômica , Sequenciamento de Nucleotídeos em Larga Escala

5.

PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data.

Malikic, Salem; Mehrabadi, Farid Rashidi; Ciccolella, Simone; Rahman, Md Khaledur; Ricketts, Camir; Haghshenas, Ehsan; Seidman, Daniel; Hach, Faraz; Hajirasouliha, Iman; Sahinalp, S Cenk.

Genome Res ; 29(11): 1860-1877, 2019 11.

Artigo em Inglês | MEDLINE | ID: mdl-31628256

RESUMO

Available computational methods for tumor phylogeny inference via single-cell sequencing (SCS) data typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, the limitations of SCS technologies including frequent allele dropout and variable sequence coverage may prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to the loss of heterozygosity, deletions, and convergent evolution. In order to address such limitations, we introduce the optimal subperfect phylogeny problem which asks to integrate SCS data with matching bulk sequencing data by minimizing a linear combination of potential false negatives (due to allele dropout or variance in sequence coverage), false positives (due to read errors) among mutation calls, and the number of mutations that violate ISA (real or because of incorrect copy number estimation). We then describe a combinatorial formulation to solve this problem which ensures that several lineage constraints imposed by the use of variant allele frequencies (VAFs, derived from bulk sequence data) are satisfied. We express our formulation both in the form of an integer linear program (ILP) and-as a first in tumor phylogeny reconstruction-a Boolean constraint satisfaction problem (CSP) and solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data while accounting for ISA violating mutations. In contrast to the alternative methods, typically based on probabilistic approaches, PhISCS provides a guarantee of optimality in reported solutions. Using simulated and real data sets, we demonstrate that PhISCS is more general and accurate than all available approaches.

Assuntos

Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Neoplasias/genética , Filogenia , Análise de Célula Única/métodos , Humanos , Neoplasias/patologia

6.

Detecting high-scoring local alignments in pangenome graphs.

Schulz, Tizian; Wittler, Roland; Rahmann, Sven; Hach, Faraz; Stoye, Jens.

Bioinformatics ; 37(16): 2266-2274, 2021 Aug 25.

Artigo em Inglês | MEDLINE | ID: mdl-33532821

RESUMO

MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.

CircMiner: accurate and rapid detection of circular RNA through splice-aware pseudo-alignment scheme.

Asghari, Hossein; Lin, Yen-Yi; Xu, Yang; Haghshenas, Ehsan; Collins, Colin C; Hach, Faraz.

Bioinformatics ; 36(12): 3703-3711, 2020 06 01.

Artigo em Inglês | MEDLINE | ID: mdl-32259207

RESUMO

MOTIVATION: The ubiquitous abundance of circular RNAs (circRNAs) has been revealed by performing high-throughput sequencing in a variety of eukaryotes. circRNAs are related to some diseases, such as cancer in which they act as oncogenes or tumor-suppressors and, therefore, have the potential to be used as biomarkers or therapeutic targets. Accurate and rapid detection of circRNAs from short reads remains computationally challenging. This is due to the fact that identifying chimeric reads, which is essential for finding back-splice junctions, is a complex process. The sensitivity of discovery methods, to a high degree, relies on the underlying mapper that is used for finding chimeric reads. Furthermore, all the available circRNA discovery pipelines are resource intensive. RESULTS: We introduce CircMiner, a novel stand-alone circRNA detection method that rapidly identifies and filters out linear RNA sequencing reads and detects back-splice junctions. CircMiner employs a rapid pseudo-alignment technique to identify linear reads that originate from transcripts, genes or the genome. CircMiner further processes the remaining reads to identify the back-splice junctions and detect circRNAs with single-nucleotide resolution. We evaluated the efficacy of CircMiner using simulated datasets generated from known back-splice junctions and showed that CircMiner has superior accuracy and speed compared to the existing circRNA detection tools. Additionally, on two RNase R treated cell line datasets, CircMiner was able to detect most of consistent, high confidence circRNAs compared to untreated samples of the same cell line. AVAILABILITY AND IMPLEMENTATION: CircMiner is implemented in C++ and is available online at https://github.com/vpc-ccg/circminer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

RNA Circular , RNA , Sequência de Bases , RNA/genética , Splicing de RNA , Análise de Sequência de RNA

8.

Structural variation and fusion detection using targeted sequencing data from circulating cell free DNA.

Gawronski, Alexander R; Lin, Yen-Yi; McConeghy, Brian; LeBihan, Stephane; Asghari, Hossein; Koçkan, Can; Orabi, Baraa; Adra, Nabil; Pili, Roberto; Collins, Colin C; Sahinalp, S Cenk; Hach, Faraz.

Nucleic Acids Res ; 47(7): e38, 2019 04 23.

Artigo em Inglês | MEDLINE | ID: mdl-30759232

RESUMO

MOTIVATION: Cancer is a complex disease that involves rapidly evolving cells, often forming multiple distinct clones. In order to effectively understand progression of a patient-specific tumor, one needs to comprehensively sample tumor DNA at multiple time points, ideally obtained through inexpensive and minimally invasive techniques. Current sequencing technologies make the 'liquid biopsy' possible, which involves sampling a patient's blood or urine and sequencing the circulating cell free DNA (cfDNA). A certain percentage of this DNA originates from the tumor, known as circulating tumor DNA (ctDNA). The ratio of ctDNA may be extremely low in the sample, and the ctDNA may originate from multiple tumors or clones. These factors present unique challenges for applying existing tools and workflows to the analysis of ctDNA, especially in the detection of structural variations which rely on sufficient read coverage to be detectable. RESULTS: Here we introduce SViCT , a structural variation (SV) detection tool designed to handle the challenges associated with cfDNA analysis. SViCT can detect breakpoints and sequences of various structural variations including deletions, insertions, inversions, duplications and translocations. SViCT extracts discordant read pairs, one-end anchors and soft-clipped/split reads, assembles them into contigs, and re-maps contig intervals to a reference genome using an efficient k-mer indexing approach. The intervals are then joined using a combination of graph and greedy algorithms to identify specific structural variant signatures. We assessed the performance of SViCT and compared it to state-of-the-art tools using simulated cfDNA datasets with properties matching those of real cfDNA samples. The positive predictive value and sensitivity of our tool was superior to all the tested tools and reasonable performance was maintained down to the lowest dilution of 0.01% tumor DNA in simulated datasets. Additionally, SViCT was able to detect all known SVs in two real cfDNA reference datasets (at 0.6-5% ctDNA) and predict a novel structural variant in a prostate cancer cohort. AVAILABILITY: SViCT is available at https://github.com/vpc-ccg/svict. Contact:faraz.hach@ubc.ca.

Assuntos

Algoritmos , Ácidos Nucleicos Livres/sangue , Ácidos Nucleicos Livres/genética , Análise Mutacional de DNA/métodos , Mutação , DNA Tumoral Circulante/sangue , DNA Tumoral Circulante/genética , Conjuntos de Dados como Assunto , Humanos , Masculino , Neoplasias da Próstata/genética , Sensibilidade e Especificidade

9.

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data.

Haghshenas, Ehsan; Sahinalp, S Cenk; Hach, Faraz.

Bioinformatics ; 35(1): 20-27, 2019 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-30561550

RESUMO

Motivation: Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity. Results: We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint. Availability and implementation: lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast. Supplementary information: Supplementary data are available at Bioinformatics online.

Assuntos

Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA , Software , Biologia Computacional , Genoma Humano , Humanos

10.

Alignment-free clustering of UMI tagged DNA molecules.

Orabi, Baraa; Erhan, Emre; McConeghy, Brian; Volik, Stanislav V; Le Bihan, Stephane; Bell, Robert; Collins, Colin C; Chauve, Cedric; Hach, Faraz.

Bioinformatics ; 35(11): 1829-1836, 2019 06 01.

Artigo em Inglês | MEDLINE | ID: mdl-30351359

RESUMO

MOTIVATION: Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient. RESULTS: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far less resources than alignment-based methods, and its clusters reduce the number of tentative false positive in downstream variation calling. AVAILABILITY AND IMPLEMENTATION: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Software , Algoritmos , Análise por Conglomerados , DNA , Análise de Sequência de DNA

11.

Comparison of high-throughput sequencing data compression tools.

Numanagic, Ibrahim; Bonfield, James K; Hach, Faraz; Voges, Jan; Ostermann, Jörn; Alberti, Claudio; Mattavelli, Marco; Sahinalp, S Cenk.

Nat Methods ; 13(12): 1005-1008, 2016 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-27776113

RESUMO

High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.

Assuntos

Biologia Computacional/métodos , Compressão de Dados/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Animais , Cacau/genética , Drosophila melanogaster/genética , Escherichia coli/genética , Humanos , Pseudomonas aeruginosa/genética

12.

Fast characterization of segmental duplications in genome assemblies.

Numanagic, Ibrahim; Gökkaya, Alim S; Zhang, Lillian; Berger, Bonnie; Alkan, Can; Hach, Faraz.

Bioinformatics ; 34(17): i706-i714, 2018 09 01.

Artigo em Inglês | MEDLINE | ID: mdl-30423092

RESUMO

Motivation: Segmental duplications (SDs) or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC is comprised of several steps that employ different tools and custom scripts, which makes this strategy difficult and time consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user friendly manner. Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome. Availability and implementation: SEDEF is available at https://github.com/vpc-ccg/sedef.

Assuntos

Genoma , Duplicações Segmentares Genômicas , Algoritmos , Genômica , Humanos

13.

Computational identification of micro-structural variations and their proteogenomic consequences in cancer.

Lin, Yen-Yi; Gawronski, Alexander; Hach, Faraz; Li, Sujun; Numanagic, Ibrahim; Sarrafi, Iman; Mishra, Swati; McPherson, Andrew; Collins, Colin C; Radovich, Milan; Tang, Haixu; Sahinalp, S Cenk.

Bioinformatics ; 34(10): 1672-1681, 2018 05 15.

Artigo em Inglês | MEDLINE | ID: mdl-29267878

RESUMO

Motivation: Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples. Results: We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure. Availability and implementation: MiStrVar is available for download at https://bitbucket.org/compbio/mistrvar, and ProTIE is available at https://bitbucket.org/compbio/protie. Contact: cenksahi@indiana.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Assuntos

Neoplasias da Mama/genética , Fusão Gênica , Proteínas de Neoplasias/genética , Proteogenômica/métodos , Software , Feminino , Perfilação da Expressão Gênica/métodos , Regulação Neoplásica da Expressão Gênica , Humanos , Espectrometria de Massas/métodos , Proteínas de Neoplasias/análise , Análise de Sequência de RNA/métodos

14.

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

Kavak, Pinar; Lin, Yen-Yi; Numanagic, Ibrahim; Asghari, Hossein; Güngör, Tunga; Alkan, Can; Hach, Faraz.

Bioinformatics ; 33(14): i161-i169, 2017 Jul 15.

Artigo em Inglês | MEDLINE | ID: mdl-28881988

RESUMO

MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir . CONTACT: fhach@{sfu.ca, prostatecentre.com } or calkan@cs.bilkent.edu.tr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Genoma Humano , Variação Estrutural do Genoma , Técnicas de Genotipagem/métodos , Mutação INDEL , Análise de Sequência de DNA/métodos , Software , Algoritmos , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos

15.

SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA.

Kockan, Can; Hach, Faraz; Sarrafi, Iman; Bell, Robert H; McConeghy, Brian; Beja, Kevin; Haegert, Anne; Wyatt, Alexander W; Volik, Stanislav V; Chi, Kim N; Collins, Colin C; Sahinalp, S Cenk.

Bioinformatics ; 33(1): 26-34, 2017 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-27531099

RESUMO

MOTIVATION: Successful development and application of precision oncology approaches require robust elucidation of the genomic landscape of a patient's cancer and, ideally, the ability to monitor therapy-induced genomic changes in the tumour in an inexpensive and minimally invasive manner. Thanks to recent advances in sequencing technologies, 'liquid biopsy', the sampling of patient's bodily fluids such as blood and urine, is considered as one of the most promising approaches to achieve this goal. In many cancer patients, and especially those with advanced metastatic disease, deep sequencing of circulating cell free DNA (cfDNA) obtained from patient's blood yields a mixture of reads originating from the normal DNA and from multiple tumour subclones-called circulating tumour DNA or ctDNA. The ctDNA/cfDNA ratio as well as the proportion of ctDNA originating from specific tumour subclones depend on multiple factors, making comprehensive detection of mutations difficult, especially at early stages of cancer. Furthermore, sensitive and accurate detection of single nucleotide variants (SNVs) and indels from cfDNA is constrained by several factors such as the sequencing errors and PCR artifacts, and mapping errors related to repeat regions within the genome. In this article, we introduce SiNVICT, a computational method that increases the sensitivity and specificity of SNV and indel detection at very low variant allele frequencies. SiNVICT has the capability to handle multiple sequencing platforms with different error properties; it minimizes false positives resulting from mapping errors and other technology specific artifacts including strand bias and low base quality at read ends. SiNVICT also has the capability to perform time-series analysis, where samples from a patient sequenced at multiple time points are jointly examined to report locations of interest where there is a possibility that certain clones were wiped out by some treatment while some subclones gained selective advantage. RESULTS: We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA used for the biological data where SNVs and indels could be detected with very high sensitivity were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels that are associated with resistance to cancer therapies and provide patients personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and Freebayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration such as time-series analysis on cfDNA sequencing data. AVAILABILITY AND IMPLEMENTATION: SiNVICT is available at: https://sfu-compbio.github.io/sinvictSupplementary information: Supplementary data are available at Bioinformatics online. CONTACT: cenk@sfu.ca.

Assuntos

Análise Mutacional de DNA/métodos , DNA de Neoplasias/sangue , Mutação INDEL , Neoplasias/genética , Mutação Puntual , Software , Frequência do Gene , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos , Masculino , Neoplasias/sangue , Sensibilidade e Especificidade

16.

CoLoRMap: Correcting Long Reads by Mapping short reads.

Haghshenas, Ehsan; Hach, Faraz; Sahinalp, S Cenk; Chauve, Cedric.

Bioinformatics ; 32(17): i545-i551, 2016 09 01.

Artigo em Inglês | MEDLINE | ID: mdl-27587673

RESUMO

MOTIVATION: Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads. RESULTS: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods. AVAILABILITY AND IMPLEMENTATION: The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap CONTACT: ehaghshe@sfu.ca or cedric.chauve@sfu.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Algoritmos , Análise de Sequência de DNA , Biologia Computacional , Genoma , Linguagens de Programação , Alinhamento de Sequência , Software

17.

Fast and accurate mapping of Complete Genomics reads.

Lee, Donghyuk; Hormozdiari, Farhad; Xin, Hongyi; Hach, Faraz; Mutlu, Onur; Alkan, Can.

Methods ; 79-80: 3-10, 2015 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-25461772

RESUMO

Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform have different strengths and biases. This diversity both makes it possible to use different technologies to correct for shortcomings; but also requires to develop different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage, where many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.

Assuntos

Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Algoritmos , Processamento Eletrônico de Dados/métodos , Genoma Humano , Humanos , Alinhamento de Sequência , Análise de Sequência de DNA , Software

18.

mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications.

Hach, Faraz; Sarrafi, Iman; Hormozdiari, Farhad; Alkan, Can; Eichler, Evan E; Sahinalp, S Cenk.

Nucleic Acids Res ; 42(Web Server issue): W494-500, 2014 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-24810850

RESUMO

High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to reference genome while discounting the mismatches that occur at common SNP locations provided by db-SNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala/métodos , Polimorfismo de Nucleotídeo Único , Software , Genoma Humano , Humanos , Internet , Alinhamento de Sequência

19.

ORMAN: optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms.

Dao, Phuong; Numanagic, Ibrahim; Lin, Yen-Yi; Hach, Faraz; Karakoc, Emre; Donmez, Nilgun; Collins, Colin; Eichler, Evan E; Sahinalp, S Cenk.

Bioinformatics ; 30(5): 644-51, 2014 Mar 01.

Artigo em Inglês | MEDLINE | ID: mdl-24130305

RESUMO

MOTIVATION: RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci. In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN ( O ptimal R esolution of M ultimapping A mbiguity of R N A-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics. RESULTS: On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage. AVAILABILITY: ORMAN is available at http://orman.sf.net

Assuntos

Perfilação da Expressão Gênica/métodos , Isoformas de RNA/metabolismo , Análise de Sequência de RNA/métodos , Software , Algoritmos , Processamento Alternativo , Éxons , Humanos , Isoformas de RNA/química , Alinhamento de Sequência

20.

Alu repeat discovery and characterization within human genomes.

Hormozdiari, Fereydoun; Alkan, Can; Ventura, Mario; Hajirasouliha, Iman; Malig, Maika; Hach, Faraz; Yorukoglu, Deniz; Dao, Phuong; Bakhshi, Marzieh; Sahinalp, S Cenk; Eichler, Evan E.

Genome Res ; 21(6): 840-9, 2011 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-21131385

RESUMO

Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations.

Assuntos

Elementos Alu/genética , Variação Genética , Genoma Humano/genética , Sequência de Bases , População Negra/genética , Biologia Computacional/métodos , Genômica/métodos , Humanos , Dados de Sequência Molecular , Análise de Sequência de DNA/métodos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA