1.
Cell ; 184(13): 3573-3587.e29, 2021 06 24.
Article in English | MEDLINE | ID: mdl-34062119

ABSTRACT

The simultaneous measurement of multiple modalities represents an exciting frontier for single-cell genomics and necessitates computational methods that can define cellular states based on multimodal data. Here, we introduce "weighted-nearest neighbor" analysis, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities. We apply our procedure to a CITE-seq dataset of 211,000 human peripheral blood mononuclear cells (PBMCs) with panels extending to 228 antibodies to construct a multimodal reference atlas of the circulating immune system. Multimodal analysis substantially improves our ability to resolve cell states, allowing us to identify and validate previously unreported lymphoid subpopulations. Moreover, we demonstrate how to leverage this reference to rapidly map new datasets and to interpret immune responses to vaccination and coronavirus disease 2019 (COVID-19). Our approach represents a broadly applicable strategy to analyze single-cell multimodal datasets and to look beyond the transcriptome toward a unified and multimodal definition of cellular identity.


Subjects
SARS-CoV-2/immunology; Single-Cell Analysis/methods; 3T3 Cells; Animals; COVID-19/immunology; Cell Line; Gene Expression Profiling/methods; Humans; Immunity/immunology; Leukocytes, Mononuclear/immunology; Lymphocytes/immunology; Mice; Sequence Analysis, RNA/methods; Transcriptome/immunology; Vaccination
2.
Nat Methods ; 19(3): 316-322, 2022 03.
Article in English | MEDLINE | ID: mdl-35277707

ABSTRACT

The rapid growth of high-throughput single-cell and single-nucleus RNA-sequencing (scRNA-seq and snRNA-seq) technologies has produced a wealth of data over the past few years. The size, volume and distinctive characteristics of these data necessitate the development of new computational methods to accurately and efficiently quantify sc/snRNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying sc/snRNA-seq data. In addition to being faster and more memory frugal than other accurate quantification approaches, alevin-fry ameliorates the memory scalability and false-positive expression issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify sc/snRNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate normal gene expression count matrices.


Subjects
Gene Expression Profiling; Single-Cell Analysis; Gene Expression Profiling/methods; RNA, Small Nuclear; RNA-Seq; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Software
3.
Nat Methods ; 18(11): 1333-1341, 2021 11.
Article in English | MEDLINE | ID: mdl-34725479

ABSTRACT

The recent development of experimental methods for measuring chromatin state at single-cell resolution has created a need for computational tools capable of analyzing these datasets. Here we developed Signac, a comprehensive toolkit for the analysis of single-cell chromatin data. Signac enables an end-to-end analysis of single-cell chromatin data, including peak calling, quantification, quality control, dimension reduction, clustering, integration with single-cell gene expression datasets, DNA motif analysis and interactive visualization. Through its seamless compatibility with the Seurat package, Signac facilitates the analysis of diverse multimodal single-cell chromatin data, including datasets that co-assay DNA accessibility with gene expression, protein abundance and mitochondrial genotype. We demonstrate scaling of the Signac framework to analyze datasets containing over 700,000 cells.


Subjects
Bone Marrow Cells/chemistry; Chromatin/genetics; Computational Biology/methods; Leukocytes, Mononuclear/chemistry; Mitochondria/genetics; Single-Cell Analysis/methods; Software; Bone Marrow Cells/metabolism; Chromatin/chemistry; Chromatin/metabolism; Gene Expression Profiling; Humans; Leukocytes, Mononuclear/metabolism; Sequence Analysis, DNA
4.
Bioinformatics ; 38(10): 2773-2780, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561168

ABSTRACT

MOTIVATION: Allelic expression analysis aids in detection of cis-regulatory mechanisms of genetic variation, which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial- or time-dependent AI signals may be dampened or not detected. RESULTS: We introduce a statistical method airpart for identifying differential CTS AI from single-cell RNA-sequencing data, or dynamics AI from other spatially or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. In order to account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower Root Mean Square Error (RMSE) of allelic ratio estimates than existing methods. In real data, airpart identified differential allelic imbalance patterns across cell states and could be used to define trends of AI signal over spatial or time axes. AVAILABILITY AND IMPLEMENTATION: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Allelic Imbalance; Models, Statistical; Alleles; Bayes Theorem; Computer Simulation; Software
5.
Bioinformatics ; 37(12): 1699-1707, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33471073

ABSTRACT

MOTIVATION: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of 'inferential replicates', which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements. RESULTS: We demonstrate that storing only the mean and variance from a set of inferential replicates ('compression') is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate 'pseudo-inferential' replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset. AVAILABILITY AND IMPLEMENTATION: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. 
Analyses and simulated datasets can be found in the paper's GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
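
The compression step described in this abstract can be illustrated with a minimal sketch. It assumes a method-of-moments fit of the negative binomial, which the abstract does not specify; the function names and the replicate values are hypothetical and are not taken from the fishpond package.

```python
from statistics import mean, variance


def compress(replicates):
    """Reduce a list of inferential replicate counts to (mean, variance)."""
    return mean(replicates), variance(replicates)


def nb_parameters(m, v):
    """Method-of-moments negative binomial parameters (size r, success prob p).

    Uses the 'failures before r successes' convention, where
    mean = r * (1 - p) / p and variance = r * (1 - p) / p**2.
    This parameterization is an assumption made for illustration.
    """
    if v <= m:
        # No negative binomial has variance <= mean.
        raise ValueError("negative binomial needs variance > mean")
    p = m / v
    r = m * m / (v - m)
    return r, p


# Hypothetical inferential replicates for one gene in one cell.
infreps = [2, 7, 5, 12, 9]
m, v = compress(infreps)   # only these two numbers need to be stored
r, p = nb_parameters(m, v) # parameters from which new replicates could be drawn
```

Storing two numbers per gene and cell instead of every replicate is what makes the reduction in disk and memory use possible; pseudo-inferential replicates would then be drawn from a negative binomial with these parameters.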

6.
PLoS Comput Biol ; 17(1): e1008585, 2021 01.
Article in English | MEDLINE | ID: mdl-33428615

ABSTRACT

Experimental single-cell approaches are becoming widely used for many purposes, including investigation of the dynamic behaviour of developing biological systems. Consequently, a large number of computational methods for extracting dynamic information from such data have been developed. One example is RNA velocity analysis, in which spliced and unspliced RNA abundances are jointly modeled in order to infer a 'direction of change' and thereby a future state for each cell in the gene expression space. Naturally, the accuracy and interpretability of the inferred RNA velocities depend crucially on the correctness of the estimated abundances. Here, we systematically compare five widely used quantification tools, in total yielding thirteen different quantification approaches, in terms of their estimates of spliced and unspliced RNA abundances in five experimental droplet scRNA-seq data sets. We show that there are substantial differences between the quantifications obtained from different tools, and identify typical genes for which such discrepancies are observed. We further show that these abundance differences propagate to the downstream analysis, and can have a large effect on estimated velocities as well as the biological interpretation. Our results highlight that abundance quantification is a crucial aspect of the RNA velocity analysis workflow, and that both the definition of the genomic features of interest and the quantification algorithm itself require careful consideration.


Subjects
Computational Biology/methods; Gene Expression Profiling/methods; RNA, Messenger; RNA, Small Cytoplasmic; Sequence Analysis, RNA/methods; Algorithms; Animals; Databases, Genetic; Mice; RNA, Messenger/analysis; RNA, Messenger/genetics; RNA, Messenger/metabolism; RNA, Small Cytoplasmic/analysis; RNA, Small Cytoplasmic/genetics; RNA, Small Cytoplasmic/metabolism; Single-Cell Analysis/methods
7.
Bioinformatics ; 36(Suppl_1): i292-i299, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657394

ABSTRACT

MOTIVATION: Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When pre-processing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data, and the strong 3' sampling bias, makes it difficult to disambiguate cases where there is no uniquely mapping read to any of the candidate target genes. RESULTS: We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data using the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene-expression patterns, and learn informative, empirical priors which we provide to alevin's gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e. when there is no unique intra-cellular information). We show our new model improves the per cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups. AVAILABILITY AND IMPLEMENTATION: The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.


Subjects
Information Dissemination; Software; Algorithms; Bayes Theorem; Gene Expression Profiling; RNA-Seq; Sequence Analysis, RNA
8.
Bioinformatics ; 36(Suppl_1): i102-i110, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657377

ABSTRACT

MOTIVATION: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects. RESULTS: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner allowing transcript-level analysis where it can be confidently supported, and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result. AVAILABILITY AND IMPLEMENTATION: Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Gene Expression Profiling; Software; Algorithms; RNA-Seq; Sequence Analysis, RNA
9.
Nucleic Acids Res ; 47(18): e105, 2019 10 10.
Article in English | MEDLINE | ID: mdl-31372651

ABSTRACT

A primary challenge in the analysis of RNA-seq data is to identify differentially expressed genes or transcripts while controlling for technical biases. Ideally, a statistical testing procedure should incorporate the inherent uncertainty of the abundance estimates arising from the quantification step. Most popular methods for RNA-seq differential expression analysis fit a parametric model to the counts for each gene or transcript, and a subset of methods can incorporate uncertainty. Previous work has shown that nonparametric models for RNA-seq differential expression may have better control of the false discovery rate, and adapt well to new data types without requiring reformulation of a parametric model. Existing nonparametric models do not take into account inferential uncertainty, leading to an inflated false discovery rate, in particular at the transcript level. We propose a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty. We compare our method, Swish, with popular differential expression analysis methods. Swish has improved control of the false discovery rate, in particular for transcripts with high inferential uncertainty. We apply Swish to a single-cell RNA-seq dataset, assessing differential expression between sub-populations of cells, and compare its performance to the Wilcoxon test.
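
One way to picture a nonparametric test that uses inferential replicates is to compute a rank statistic in each replicate and average the results. The sketch below does only that, with a Wilcoxon rank-sum statistic; it is an illustration under that assumption and not the Swish implementation, which also has to assess significance. All names and data are hypothetical.

```python
def rank_sum(values, is_group_a):
    """Wilcoxon rank-sum statistic for group A, using midranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return sum(r for r, in_a in zip(ranks, is_group_a) if in_a)


def mean_rank_sum(replicates, is_group_a):
    """Average the rank-sum statistic over all inferential replicates."""
    per_replicate = [rank_sum(rep, is_group_a) for rep in replicates]
    return sum(per_replicate) / len(per_replicate)


# Hypothetical data: 3 inferential replicates x 6 samples for one transcript.
groups = [True, True, True, False, False, False]
infreps = [
    [10, 12, 11, 3, 4, 5],
    [9, 13, 10, 4, 4, 6],
    [11, 5, 12, 6, 3, 4],
]
stat = mean_rank_sum(infreps, groups)
```

In the third replicate one group-A sample drops below a group-B sample, so its rank sum is lower than in the other two; averaging lets that quantification uncertainty pull the overall statistic toward the null.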


Subjects
Gene Expression Profiling/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Algorithms; Cell Lineage/genetics; Gene Expression/genetics; Humans; RNA/genetics; Software
11.
Bioinformatics ; 35(14): i136-i144, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510649

ABSTRACT

SUMMARY: With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcodes (CB) and UMI selection and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing; Single-Cell Analysis; Gene Expression Profiling; Sequence Analysis, RNA; Software
12.
Bioinformatics ; 34(13): i169-i177, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949982

ABSTRACT

Motivation: Indexing reference sequences for search, both individual genomes and collections of genomes, is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grow large. Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimal perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size.
We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. Availability and implementation: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. Supplementary information: Supplementary data are available at Bioinformatics online.
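
To make the "hashing-based" end of this trade-off concrete, here is a deliberately naive sketch of a colored k-mer index: a plain hash map from each k-mer to the set of references that contain it. It is not the pufferfish data structure (no compaction, no minimal perfect hashing, no sampling); the function names and the toy sequences are assumptions for illustration.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map each k-mer to the set of reference names ('colors') containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def query(index, pattern, k):
    """Return the references that contain every k-mer of the pattern."""
    kmers = [pattern[i:i + k] for i in range(len(pattern) - k + 1)]
    if not kmers:
        return set()  # pattern shorter than k: nothing to look up
    hits = [index.get(kmer, set()) for kmer in kmers]
    return set.intersection(*hits)


# Toy references; real inputs would be genomes or transcriptomes.
refs = {"refA": "ACGTACGGA", "refB": "TTACGGACC"}
index = build_kmer_index(refs, k=4)
shared = query(index, "ACGGA", k=4)   # k-mers present in both references
unique = query(index, "CGTAC", k=4)   # k-mers present only in refA
```

Lookups here are fast, but every k-mer is stored explicitly as a string key, which is the space cost that more succinct representations aim to reduce.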


Subjects
Data Visualization; Gene Expression Profiling/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Software; Algorithms; Bacteria/genetics; Genome, Bacterial; Genome, Human; Humans; Sequence Analysis, DNA/methods; Sequence Analysis, RNA/methods
13.
Bioinformatics ; 33(14): i142-i151, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881996

ABSTRACT

MOTIVATION: Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of, e.g., the EM algorithm can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation. RESULTS: We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiency of the compatibility-based factorizations. AVAILABILITY AND IMPLEMENTATION: Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations . CONTACT: rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
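
The compatibility-only factorization mentioned in this abstract can be sketched as an EM loop over equivalence classes, where fragments are grouped by the set of transcripts they are compatible with and only the group counts are kept. The code below is a simplified illustration, not Salmon's implementation: it ignores transcript lengths and the conditional fragment probabilities the abstract refers to, and the function name and counts are hypothetical.

```python
def em_equivalence_classes(classes, n_transcripts, n_iter=200):
    """Estimate per-transcript fragment counts from equivalence-class counts.

    classes: list of (transcript_ids, fragment_count) pairs. Every fragment
    in a class is treated as equally compatible with each listed transcript,
    which is the compatibility-only simplification.
    """
    total = sum(count for _, count in classes)
    # Start from a uniform allocation of fragments.
    est = [total / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        new = [0.0] * n_transcripts
        for ids, count in classes:
            denom = sum(est[t] for t in ids)
            if denom == 0:
                continue
            # Split the class count in proportion to current estimates.
            for t in ids:
                new[t] += count * est[t] / denom
        est = new
    return est


# Hypothetical example: 30 fragments unique to transcript 0,
# 10 unique to transcript 1, and 60 compatible with both.
classes = [((0,), 30), ((0, 1), 60), ((1,), 10)]
est = em_equivalence_classes(classes, n_transcripts=2)
```

Each iteration touches one entry per equivalence class rather than one per fragment, which is why the factorized likelihood is so much cheaper to optimize; the cost is that all fragments in a class are treated identically.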


Subjects
Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Software; Algorithms; Computational Biology/methods; Humans; Likelihood Functions; Models, Biological
14.
Bioinformatics ; 32(12): i192-i200, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307617

ABSTRACT

MOTIVATION: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. RESULTS: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap, our tool implementing quasi-mapping, is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we use to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically meaningful groups. AVAILABILITY AND IMPLEMENTATION: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap. CONTACT: rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Transcriptome; Algorithms; RNA; Sequence Analysis, RNA; Software
15.
Nat Biotechnol ; 42(2): 293-304, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37231261

ABSTRACT

Mapping single-cell sequencing profiles to comprehensive reference datasets provides a powerful alternative to unsupervised analysis. However, most reference datasets are constructed from single-cell RNA-sequencing data and cannot be used to annotate datasets that do not measure gene expression. Here we introduce 'bridge integration', a method to integrate single-cell datasets across modalities using a multiomic dataset as a molecular bridge. Each cell in the multiomic dataset constitutes an element in a 'dictionary', which is used to reconstruct unimodal datasets and transform them into a shared space. Our procedure accurately integrates transcriptomic data with independent single-cell measurements of chromatin accessibility, histone modifications, DNA methylation and protein levels. Moreover, we demonstrate how dictionary learning can be combined with sketching techniques to improve computational scalability and harmonize 8.6 million human immune cell profiles from sequencing and mass cytometry experiments. Our approach, implemented in version 5 of our Seurat toolkit ( http://www.satijalab.org/seurat ), broadens the utility of single-cell reference datasets and facilitates comparisons across diverse molecular modalities.


Subjects
Gene Expression Profiling; Software; Humans; Sequence Analysis, RNA/methods; Gene Expression Profiling/methods; Transcriptome; Single-Cell Analysis/methods
16.
Nat Biotechnol ; 40(8): 1220-1230, 2022 08.
Article in English | MEDLINE | ID: mdl-35332340

ABSTRACT

Technologies that profile chromatin modifications at single-cell resolution offer enormous promise for functional genomic characterization, but the sparsity of the measurements and integrating multiple binding maps represent substantial challenges. Here we introduce single-cell (sc)CUT&Tag-pro, a multimodal assay for profiling protein-DNA interactions coupled with the abundance of surface proteins in single cells. In addition, we introduce single-cell ChromHMM, which integrates data from multiple experiments to infer and annotate chromatin states based on combinatorial histone modification patterns. We apply these tools to perform an integrated analysis across nine different molecular modalities in circulating human immune cells. We demonstrate how these two approaches can characterize dynamic changes in the function of individual genomic elements across both discrete cell states and continuous developmental trajectories, nominate associated motifs and regulators that establish chromatin states and identify extensive and cell-type-specific regulatory priming. Finally, we demonstrate how our integrated reference can serve as a scaffold to map and improve the interpretation of additional scCUT&Tag datasets.


Subjects
Chromatin; Histones; Chromatin/genetics; Chromatin Immunoprecipitation; DNA; Genomics; Histones/genetics; Histones/metabolism; Humans
17.
Genome Biol ; 21(1): 239, 2020 09 07.
Article in English | MEDLINE | ID: mdl-32894187

ABSTRACT

BACKGROUND: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. RESULTS: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. CONCLUSION: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


Subjects
Chromosome Mapping/methods; Sequence Alignment/methods; Algorithms; Animals; Gene Expression Profiling; Mice; Sequence Analysis, RNA; Transcriptome
18.
Genome Biol ; 20(1): 65, 2019 03 27.
Article in English | MEDLINE | ID: mdl-30917859

ABSTRACT

We introduce alevin, a fast end-to-end pipeline to process droplet-based single-cell RNA sequencing data, performing cell barcode detection, read mapping, unique molecular identifier (UMI) deduplication, gene count estimation, and cell barcode whitelisting. Alevin's approach to UMI deduplication considers transcript-level constraints on the molecules from which UMIs may have arisen and accounts for both gene-unique reads and reads that multimap between genes. This addresses the inherent bias in existing tools which discard gene-ambiguous reads and improves the accuracy of gene abundance estimates. Alevin is considerably faster, typically eight times, than existing gene quantification approaches, while also using less memory.
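
The bias this abstract attributes to existing tools can be seen in a small sketch of naive UMI deduplication that counts distinct UMIs per cell barcode and gene and simply drops any read compatible with more than one gene. This is the baseline behaviour being criticized, not alevin's algorithm; the function name, barcodes, UMIs and gene names are all hypothetical.

```python
from collections import defaultdict


def naive_umi_counts(reads):
    """Count distinct UMIs per (cell barcode, gene), dropping gene-ambiguous reads.

    reads: iterable of (cell_barcode, umi, genes), where genes is the set of
    genes the read is compatible with.
    Returns (counts, n_discarded).
    """
    umis = defaultdict(set)
    discarded = 0
    for cell_barcode, umi, genes in reads:
        if len(genes) != 1:
            discarded += 1  # gene-ambiguous read: its information is lost
            continue
        (gene,) = genes
        umis[(cell_barcode, gene)].add(umi)
    counts = {key: len(seen) for key, seen in umis.items()}
    return counts, discarded


reads = [
    ("AAAC", "TTG", {"GeneA"}),
    ("AAAC", "TTG", {"GeneA"}),           # PCR duplicate of the read above
    ("AAAC", "CCA", {"GeneA"}),
    ("AAAC", "GGT", {"GeneA", "GeneB"}),  # multimaps between two genes
    ("CCGT", "ATA", {"GeneB"}),
]
counts, discarded = naive_umi_counts(reads)
```

The multimapping read contributes to neither gene, so genes that share sequence with others are systematically undercounted; resolving such reads instead of discarding them is the improvement the abstract describes.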


Subjects
Sequence Analysis, RNA; Single-Cell Analysis; Software; Animals; DNA Barcoding, Taxonomic; Humans; Mice
19.
AMIA Annu Symp Proc ; 2018: 867-876, 2018.
Article in English | MEDLINE | ID: mdl-30815129

ABSTRACT

The opioid-abuse epidemic in the United States has escalated to national attention due to the dramatic increase in opioid overdose deaths. Analyzing opioid-related social media has the potential to reveal patterns of opioid abuse at a national scale, understand opinions of the public, and provide insights to support prevention and treatment. Reddit is a community-based social media platform with more reliable content curated by the community through voting. In this study, we collected and analyzed all opioid-related discussions from January 2014 to October 2017, comprising 51,537 posts by 16,162 unique users. We analyzed the data to understand the psychological categories of the posts, and performed topic modeling to reveal the major topics of interest. We also characterized the extent of social support each post received through comments and scores. Last, we analyzed statistically significant differences in the posts between anonymous and non-anonymous users.


Subjects
Analgesics, Opioid; Data Analysis; Opioid-Related Disorders/epidemiology; Social Media; Humans; Opioid-Related Disorders/mortality; Opioid-Related Disorders/psychology; Social Media/trends; Social Support; United States/epidemiology
20.
Int J Bioinform Res Appl ; 10(2): 129-44, 2014.
Article in English | MEDLINE | ID: mdl-24589833

ABSTRACT

Content-based image retrieval has gained considerable attention as a useful tool in many applications, texture-based retrieval among them. In this paper, we focus on texture-based image retrieval in the compressed domain using compressive sensing with the help of DC coefficients. Medical imaging is one of the fields most affected, because image databases have grown very large and retrieving the relevant image has become a daunting task. Considering this, we propose a new model of the image retrieval process using compressive sampling, since it allows accurate recovery of an image from far fewer samples than unknowns, does not require a close match between the sampling pattern and the characteristic image structure, and offers increased acquisition speed and enhanced image quality.


Subjects
Data Compression/methods; Diagnostic Imaging/methods; Algorithms; Databases, Factual; Diagnostic Imaging/instrumentation; Humans; Image Enhancement; Information Storage and Retrieval; Pattern Recognition, Automated/methods; Radiography, Thoracic/methods; Software