Results 1 - 19 of 19
1.
Biostatistics ; 25(4): 1079-1093, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-38887902

ABSTRACT

Although transcriptomics data is typically used to analyze mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g., healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e., reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, vs. state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. DifferentialRegulation is distributed as a Bioconductor R package.


Subjects
Bayes Theorem; Humans; RNA, Messenger/genetics; Gene Expression Profiling/methods; RNA Splicing/genetics; Gene Expression Regulation; Models, Statistical
2.
Bioinformatics ; 40(Suppl 1): i481-i489, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940134

ABSTRACT

MOTIVATION: Cell-cell interactions (CCIs) consist of cells exchanging signals with themselves and neighboring cells by expressing ligand and receptor molecules and play a key role in cellular development, tissue homeostasis, and other critical biological functions. Since direct measurement of CCIs is challenging, multiple methods have been developed to infer CCIs by quantifying correlations between the gene expression of the ligands and receptors that mediate CCIs, originally from bulk RNA-sequencing data and more recently from single-cell or spatially resolved transcriptomics (SRT) data. SRT has a particular advantage over single-cell approaches, since ligand-receptor correlations can be computed between cells or spots that are physically close in the tissue. However, the transcript counts of individual ligands and receptors in SRT data are generally low, complicating the inference of CCIs from expression correlations. RESULTS: We introduce Copulacci, a count-based model for inferring CCIs from SRT data. Copulacci uses a Gaussian copula to model dependencies between the expression of ligands and receptors from nearby spatial locations even when the transcript counts are low. On simulated data, Copulacci outperforms existing CCI inference methods based on the standard Spearman and Pearson correlation coefficients. Using several real SRT datasets, we show that Copulacci discovers biologically meaningful ligand-receptor interactions that are lowly expressed and undiscoverable by existing CCI inference methods. AVAILABILITY AND IMPLEMENTATION: Copulacci is implemented in Python and available at https://github.com/raphael-group/copulacci.


Subjects
Cell Communication; Transcriptome; Transcriptome/genetics; Humans; Gene Expression Profiling/methods; Single-Cell Analysis/methods; Algorithms; Computational Biology/methods; Ligands
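
The Copulacci abstract above contrasts Gaussian-copula dependence with plain Spearman and Pearson correlation. As a rough illustration of the copula idea only, the sketch below maps each variable to normal scores through its ranks and then takes the Pearson correlation of those scores. This rank-based shortcut is an assumption made for illustration and is not Copulacci's count-based model; the heavy ties produced by low counts are exactly where such a shortcut degrades. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def normal_scores(values):
    """Map ranks into (0, 1), then through the standard normal quantile function."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(vx * vy)


def normal_scores_correlation(ligand, receptor):
    """Correlation of the normal scores of two paired count vectors."""
    return pearson(normal_scores(ligand), normal_scores(receptor))


# Hypothetical ligand and receptor counts at six neighbouring spot pairs.
ligand = [0, 1, 0, 3, 2, 5]
receptor = [1, 2, 1, 6, 4, 9]
rho = normal_scores_correlation(ligand, receptor)
```

Because only ranks enter the calculation, any monotone rescaling of the counts leaves `rho` unchanged.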
3.
JCI Insight ; 9(6), 2024 Feb 15.
Article in English | MEDLINE | ID: mdl-38358826

ABSTRACT

Neuroblastoma is an aggressive pediatric cancer with a high rate of metastasis to the bone marrow (BM). Despite intensive treatments including high-dose chemotherapy, the overall survival rate for children with metastatic neuroblastoma remains dismal. Understanding the cellular and molecular mechanisms of the metastatic tumor microenvironment is crucial for developing new therapies and improving clinical outcomes. Here, we used single-cell RNA-Seq to characterize immune and tumor cell alterations in neuroblastoma BM metastases by comparative analysis with patients without metastases. Our results reveal remodeling of the immune cell populations and reprogramming of gene expression profiles in the metastatic niche. In particular, within the BM metastatic niche, we observed the enrichment of immune cells, including tumor-associated neutrophils, macrophages, and exhausted T cells, as well as an increased number of Tregs and a decreased number of B cells. Furthermore, we highlighted cell communication between tumor cells and immune cell populations, and we identified prognostic markers in malignant cells that are associated with worse clinical outcomes in 3 independent neuroblastoma cohorts. Our results provide insight into the cellular, compositional, and transcriptional shifts underlying neuroblastoma BM metastases that contribute to the development of new therapeutic strategies.


Subjects
Bone Marrow; Neuroblastoma; Humans; Child; Bone Marrow/pathology; Neuroblastoma/genetics; Single-Cell Analysis; Tumor Microenvironment
4.
Genome Med ; 16(1): 1, 2024 01 29.
Article in English | MEDLINE | ID: mdl-38281962

ABSTRACT

BACKGROUND: Despite therapeutic advances, once a cancer has metastasized to the bone, it represents a highly morbid and lethal disease. One third of patients with advanced clear cell renal cell carcinoma (ccRCC) present with bone metastasis at the time of diagnosis. However, the bone metastatic niche in humans, including the immune and stromal microenvironments, has not been well-defined, hindering progress towards identification of therapeutic targets. METHODS: We collected fresh patient samples and performed single-cell transcriptomic profiling of solid metastatic tissue (Bone Met), liquid bone marrow at the vertebral level of spinal cord compression (Involved), and liquid bone marrow from a different vertebral body distant from the tumor site but within the surgical field (Distal), as well as bone marrow from patients undergoing hip replacement surgery (Benign). In addition, we incorporated single-cell data from primary ccRCC tumors (ccRCC Primary) for comparative analysis. RESULTS: The bone marrow of metastatic patients is immune-suppressive, featuring increased, exhausted CD8+ cytotoxic T cells, T regulatory cells, and tumor-associated macrophages (TAM) with distinct transcriptional states in metastatic lesions. Bone marrow stroma from tumor samples demonstrated a tumor-associated mesenchymal stromal cell population (TA-MSC) that appears to be supportive of epithelial-to-mesenchymal transition (EMT), bone remodeling, and a cancer-associated fibroblast (CAF) phenotype. This stromal subset is associated with poor progression-free and overall survival and also markedly upregulates bone remodeling through the dysregulation of RANK/RANKL/OPG signaling activity in bone cells, ultimately leading to bone resorption. CONCLUSIONS: These results provide a comprehensive analysis of the bone marrow niche in the setting of human metastatic cancer and highlight potential therapeutic targets for both cell populations and communication channels.


Subjects
Carcinoma, Renal Cell; Humans; Carcinoma, Renal Cell/genetics; Stromal Cells/pathology; Signal Transduction; Gene Expression Profiling; Single-Cell Analysis; Tumor Microenvironment
5.
Nature ; 623(7986): 432-441, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37914932

ABSTRACT

Chromatin accessibility is essential in regulating gene expression and cellular identity, and alterations in accessibility have been implicated in driving cancer initiation, progression and metastasis (refs. 1-4). Although the genetic contributions to oncogenic transitions have been investigated, epigenetic drivers remain less understood. Here we constructed a pan-cancer epigenetic and transcriptomic atlas using single-nucleus chromatin accessibility data (using single-nucleus assay for transposase-accessible chromatin) from 225 samples and matched single-cell or single-nucleus RNA-sequencing expression data from 206 samples. With over 1 million cells from each platform analysed through the enrichment of accessible chromatin regions, transcription factor motifs and regulons, we identified epigenetic drivers associated with cancer transitions. Some epigenetic drivers appeared in multiple cancers (for example, regulatory regions of ABCC1 and VEGFA; GATA6 and FOX-family motifs), whereas others were cancer specific (for example, regulatory regions of FGF19, ASAP2 and EN1, and the PBX3 motif). Among epigenetically altered pathways, TP53, hypoxia and TNF signalling were linked to cancer initiation, whereas oestrogen response, epithelial-mesenchymal transition and apical junction were tied to metastatic transition. Furthermore, we revealed a marked correlation between enhancer accessibility and gene expression and uncovered cooperation between epigenetic and genetic drivers. This atlas provides a foundation for further investigation of epigenetic dynamics in cancer transitions.


Subjects
Epigenesis, Genetic; Gene Expression Regulation, Neoplastic; Neoplasms; Humans; Cell Hypoxia; Cell Nucleus; Chromatin/genetics; Chromatin/metabolism; Enhancer Elements, Genetic/genetics; Epigenesis, Genetic/genetics; Epithelial-Mesenchymal Transition; Estrogens/metabolism; Gene Expression Profiling; GTPase-Activating Proteins/metabolism; Neoplasm Metastasis; Neoplasms/classification; Neoplasms/genetics; Neoplasms/pathology; Regulatory Sequences, Nucleic Acid/genetics; Single-Cell Analysis; Transcription Factors/metabolism
6.
bioRxiv ; 2023 Oct 13.
Article in English | MEDLINE | ID: mdl-37873258

ABSTRACT

Spatially resolved transcriptomics technologies provide high-throughput measurements of gene expression in a tissue slice, but the sparsity of this data complicates the analysis of spatial gene expression patterns such as gene expression gradients. We address these issues by deriving a topographic map of a tissue slice (analogous to a map of elevation in a landscape) using a novel quantity called the isodepth. Contours of constant isodepth enclose spatial domains with distinct cell type composition, while gradients of the isodepth indicate spatial directions of maximum change in gene expression. We develop GASTON, an unsupervised and interpretable deep learning algorithm that simultaneously learns the isodepth, spatial gene expression gradients, and piecewise linear functions of the isodepth that model both continuous gradients and discontinuous spatial variation in the expression of individual genes. We validate GASTON by showing that it accurately identifies spatial domains and marker genes across several biological systems. In SRT data from the brain, GASTON reveals gradients of neuronal differentiation and firing, and in SRT data from a tumor sample, GASTON infers gradients of metabolic activity and epithelial-mesenchymal transition (EMT)-related gene expression in the tumor microenvironment.

7.
bioRxiv ; 2023 Aug 17.
Article in English | MEDLINE | ID: mdl-37645841

ABSTRACT

Motivation: Although transcriptomics data is typically used to analyse mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g., healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e., reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. Results: Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, versus state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. Availability and implementation: DifferentialRegulation is distributed as a Bioconductor R package.

8.
Nat Commun ; 14(1): 663, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36750562

ABSTRACT

The treatment of low-risk primary prostate cancer entails active surveillance only, while high-risk disease requires multimodal treatment including surgery, radiation therapy, and hormonal therapy. Recurrence and development of metastatic disease remain a clinical problem, without a clear understanding of what drives immune escape and tumor progression. Here, we comprehensively describe the tumor microenvironment of localized prostate cancer in comparison with adjacent normal samples and healthy controls. Single-cell RNA sequencing and high-resolution spatial transcriptomic analyses reveal tumor-context-dependent changes in gene expression. Our data indicate that an immune-suppressive tumor microenvironment associates with suppressive myeloid populations and exhausted T cells, in addition to high stromal angiogenic activity. We infer cell-to-cell relationships from high-throughput ligand-receptor interaction measurements within undissociated tissue sections. Our work thus provides a highly detailed and comprehensive resource of the prostate tumor microenvironment as well as tumor-stromal cell interactions.


Subjects
Prostatic Neoplasms; Transcriptome; Male; Humans; Prostate/pathology; Tumor Microenvironment; Gene Expression Profiling; Prostatic Neoplasms/genetics
9.
Nat Biotechnol ; 41(3): 417-426, 2023 03.
Article in English | MEDLINE | ID: mdl-36163550

ABSTRACT

Genome instability and aberrant alterations of transcriptional programs both play important roles in cancer. Single-cell RNA sequencing (scRNA-seq) has the potential to investigate both genetic and nongenetic sources of tumor heterogeneity in a single assay. Here we present a computational method, Numbat, that integrates haplotype information obtained from population-based phasing with allele and expression signals to enhance detection of copy number variations from scRNA-seq. Numbat exploits the evolutionary relationships between subclones to iteratively infer single-cell copy number profiles and tumor clonal phylogeny. Analysis of 22 tumor samples, including multiple myeloma, gastric, breast and thyroid cancers, shows that Numbat can reconstruct the tumor copy number profile and precisely identify malignant cells in the tumor microenvironment. We identify genetic subpopulations with transcriptional signatures relevant to tumor progression and therapy resistance. Numbat requires neither sample-matched DNA data nor a priori genotyping, and is applicable to a wide range of experimental settings and cancer types.


Subjects
Multiple Myeloma; Transcriptome; Humans; Transcriptome/genetics; DNA Copy Number Variations/genetics; Haplotypes/genetics; Phylogeny; Single-Cell Analysis/methods; Tumor Microenvironment
10.
Bioinformatics ; 38(10): 2773-2780, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561168

ABSTRACT

MOTIVATION: Allelic expression analysis aids in detection of cis-regulatory mechanisms of genetic variation, which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial- or time-dependent AI signals may be dampened or not detected. RESULTS: We introduce a statistical method airpart for identifying differential CTS AI from single-cell RNA-sequencing data, or dynamics AI from other spatially or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. In order to account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower Root Mean Square Error (RMSE) of allelic ratio estimates than existing methods. In real data, airpart identified differential allelic imbalance patterns across cell states and could be used to define trends of AI signal over spatial or time axes. AVAILABILITY AND IMPLEMENTATION: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Allelic Imbalance; Models, Statistical; Alleles; Bayes Theorem; Computer Simulation; Software
11.
Nat Methods ; 19(3): 316-322, 2022 03.
Article in English | MEDLINE | ID: mdl-35277707

ABSTRACT

The rapid growth of high-throughput single-cell and single-nucleus RNA-sequencing (scRNA-seq and snRNA-seq) technologies has produced a wealth of data over the past few years. The size, volume and distinctive characteristics of these data necessitate the development of new computational methods to accurately and efficiently quantify sc/snRNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying sc/snRNA-seq data. In addition to being faster and more memory frugal than other accurate quantification approaches, alevin-fry ameliorates the memory scalability and false-positive expression issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify sc/snRNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate normal gene expression count matrices.


Subjects
Gene Expression Profiling; Single-Cell Analysis; Gene Expression Profiling/methods; RNA, Small Nuclear; RNA-Seq; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Software
12.
Bioinformatics ; 37(12): 1699-1707, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33471073

ABSTRACT

MOTIVATION: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of 'inferential replicates', which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements. RESULTS: We demonstrate that storing only the mean and variance from a set of inferential replicates ('compression') is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate 'pseudo-inferential' replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset. AVAILABILITY AND IMPLEMENTATION: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper's GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
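
The 'compression' step described in the abstract above, keeping only a mean and a variance per gene and later drawing negative binomial pseudo-replicates, can be sketched with simple moment matching. This is a minimal illustration under stated assumptions: the abstract does not specify the parameterization, so the method-of-moments formulas and the function names `compress`, `nb_params` and `nb_moments` are hypothetical and are not the fishpond API.

```python
from statistics import mean, variance


def compress(replicates):
    """Summarise inferential replicates for one gene in one cell as (mean, variance)."""
    return mean(replicates), variance(replicates)


def nb_params(mu, var):
    """Moment-match a negative binomial to (mu, var).

    Returns (size, prob) in the parameterisation where
    mean = size * (1 - prob) / prob and variance = mean / prob.
    """
    if var <= mu:
        # A negative binomial is overdispersed, so it needs variance > mean.
        raise ValueError("negative binomial requires variance > mean")
    size = mu * mu / (var - mu)
    prob = mu / var
    return size, prob


def nb_moments(size, prob):
    """Mean and variance implied by (size, prob); used here as a round-trip check."""
    m = size * (1 - prob) / prob
    return m, m / prob


# Five hypothetical inferential replicates for one gene in one cell.
replicates = [8, 12, 10, 15, 5]
mu, var = compress(replicates)
size, prob = nb_params(mu, var)
```

The resulting `(size, prob)` pair follows the same convention as the `n` and `p` arguments of NumPy's `Generator.negative_binomial`, so it could be passed there to draw pseudo-replicates. Genes whose replicate variance does not exceed the mean would need separate handling, which this sketch leaves out.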

13.
Genome Biol ; 21(1): 239, 2020 09 07.
Article in English | MEDLINE | ID: mdl-32894187

ABSTRACT

BACKGROUND: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. RESULTS: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. CONCLUSION: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


Subjects
Chromosome Mapping/methods; Sequence Alignment/methods; Algorithms; Animals; Gene Expression Profiling; Mice; Sequence Analysis, RNA; Transcriptome
14.
Bioinformatics ; 36(Suppl_1): i102-i110, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657377

ABSTRACT

MOTIVATION: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects. RESULTS: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner allowing transcript-level analysis where it can be confidently supported, and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result. AVAILABILITY AND IMPLEMENTATION: Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Gene Expression Profiling; Software; Algorithms; RNA-Seq; Sequence Analysis, RNA
15.
Bioinformatics ; 36(Suppl_1): i292-i299, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657394

ABSTRACT

MOTIVATION: Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When pre-processing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data, and the strong 3' sampling bias, make it difficult to disambiguate cases where there is no uniquely mapping read to any of the candidate target genes. RESULTS: We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data using the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene-expression patterns, and learn informative, empirical priors which we provide to alevin's gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e., when there is no unique intra-cellular information). We show our new model improves the per-cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups. AVAILABILITY AND IMPLEMENTATION: The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.


Subjects
Information Dissemination; Software; Algorithms; Bayes Theorem; Gene Expression Profiling; RNA-Seq; Sequence Analysis, RNA
16.
Bioinformatics ; 35(14): i136-i144, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510649

ABSTRACT

SUMMARY: With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcode (CB) and UMI selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing; Single-Cell Analysis; Gene Expression Profiling; Sequence Analysis, RNA; Software
17.
Bioinformatics ; 34(13): i169-i177, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949982

ABSTRACT

Motivation: Indexing reference sequences for search (both individual genomes and collections of genomes) is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. Availability and implementation: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Data Visualization; Gene Expression Profiling/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Software; Algorithms; Bacteria/genetics; Genome, Bacterial; Genome, Human; Humans; Sequence Analysis, DNA/methods; Sequence Analysis, RNA/methods
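
The pufferfish abstract above describes the hashing-based end of the design space as giving very fast access to the reference information associated with each k-mer. The sketch below is a deliberately naive version of that idea, a plain hash table from each k-mer to the set of references ("colors") that contain it. It is an illustrative assumption, not pufferfish's data structure: there is no graph compaction, no minimum perfect hashing, no succinct encoding, and reverse complements are ignored. The names `build_kmer_index` and `query` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map every k-mer to the set of reference names ("colors") that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def query(index, read, k):
    """Count, for each reference, how many k-mers of the read occur in it."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        # .get avoids inserting empty entries for k-mers absent from the index.
        for name in index.get(read[i:i + k], ()):
            hits[name] += 1
    return dict(hits)


# Two short hypothetical references that share the substring TACGGA.
references = {"refA": "ACGTACGGA", "refB": "TTACGGATC"}
index = build_kmer_index(references, k=4)
shared = query(index, "TACGGA", k=4)  # k-mers found in both references
unique = query(index, "ACGTAC", k=4)  # k-mers found only in refA
```

A table like this stores every k-mer explicitly, which is the space cost that the abstract says compact representations are meant to reduce.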
18.
Bioinformatics ; 33(21): 3380-3386, 2017 Nov 01.
Article in English | MEDLINE | ID: mdl-29077806

ABSTRACT

MOTIVATION: The past decade has seen an exponential increase in biological sequencing capacity, and there has been a simultaneous effort to help organize and archive some of the vast quantities of sequencing data that are being generated. Although these developments are tremendous from the perspective of maximizing the scientific utility of available data, they come with heavy costs. The storage and transmission of such vast amounts of sequencing data is expensive. RESULTS: We present Quark, a semi-reference-based compression tool designed for RNA-seq data. Quark makes use of a reference sequence when encoding reads, but produces a representation that can be decoded independently, without the need for a reference. This allows Quark to achieve markedly better compression rates than existing reference-free schemes, while still relieving the burden of assuming a specific, shared reference sequence between the encoder and decoder. We demonstrate that Quark achieves state-of-the-art compression rates, and that, typically, only a small fraction of the reference sequence must be encoded along with the reads to allow reference-free decompression. AVAILABILITY AND IMPLEMENTATION: Quark is implemented in C++11, and is available under a GPLv3 license at www.github.com/COMBINE-lab/quark. CONTACT: rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, RNA/methods; Software; Algorithms; Animals; Humans; Mice
19.
Bioinformatics ; 32(12): i192-i200, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307617

ABSTRACT

MOTIVATION: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. RESULTS: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap, our tool implementing quasi-mapping, is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we use to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically meaningful groups. AVAILABILITY AND IMPLEMENTATION: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap. CONTACT: rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Transcriptome; Algorithms; RNA; Sequence Analysis, RNA; Software