Results 1 - 20 of 177
1.
Nat Commun ; 12(1): 1029, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33589635

ABSTRACT

A primary challenge in single-cell RNA sequencing (scRNA-seq) studies comes from the massive amount of data and the excess noise level. To address this challenge, we introduce an analysis framework, named single-cell Decomposition using Hierarchical Autoencoder (scDHA), that reliably extracts representative information of each cell. The scDHA pipeline consists of two core modules. The first module is a non-negative kernel autoencoder able to remove genes or components that have insignificant contributions to the part-based representation of the data. The second module is a stacked Bayesian autoencoder that projects the data onto a low-dimensional (compressed) space. To reduce the tendency of neural networks to overfit, we repeatedly perturb the compressed space to learn a more generalized representation of the data. In an extensive analysis, we demonstrate that scDHA outperforms state-of-the-art techniques in many research sub-fields of scRNA-seq analysis, including cell segregation through unsupervised learning, visualization of transcriptome landscape, cell classification, and pseudo-time inference.


Subjects
Neural Networks, Computer; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/statistics & numerical data; Unsupervised Machine Learning/statistics & numerical data; Animals; Bayes Theorem; Benchmarking; Cell Separation/methods; Cerebellum/chemistry; Cerebellum/cytology; Embryo, Mammalian; Humans; Liver/chemistry; Liver/cytology; Lung/chemistry; Lung/cytology; Mice; Mouse Embryonic Stem Cells/chemistry; Mouse Embryonic Stem Cells/cytology; Pancreas/chemistry; Pancreas/cytology; Retina/chemistry; Retina/cytology; Single-Cell Analysis/methods; Visual Cortex/chemistry; Visual Cortex/cytology; Zygote/chemistry; Zygote/cytology
2.
Nat Commun ; 11(1): 4662, 2020 09 16.
Article in English | MEDLINE | ID: mdl-32938926

ABSTRACT

Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, which makes it possible to use even reads that cover only individual variants. We demonstrate HapTree-X's feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X's ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.


Subjects
Allelic Imbalance; Haplotypes; Sequence Analysis, RNA; Algorithms; Databases, Genetic; Diploidy; Humans; K562 Cells; Models, Genetic; Models, Statistical; Polymorphism, Single Nucleotide; Polyploidy; RNA-Seq; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data
3.
Medicine (Baltimore) ; 99(21): e20422, 2020 May 22.
Article in English | MEDLINE | ID: mdl-32481346

ABSTRACT

Primary hepatic carcinoma is one of the most common malignant tumors globally, of which hepatocellular carcinoma (HCC) accounts for 85% to 90%. Due to the high degree of deterioration and low early detection rate of HCC, most patients are diagnosed when they are already in the middle and advanced stages, and the prognosis is generally poor. RNA sequencing data from The Cancer Genome Atlas were used to explore differences in lncRNA expression profiles. LncRNAs were extracted with the gdcRNAtools R package. Multivariate Cox analysis was performed on the screened lncRNAs. The relationship between the lncRNA model and prognosis, as well as the clinical characteristics of patients with HCC, was analyzed. Finally, a predictive nomogram was established in The Cancer Genome Atlas cohort and verified internally. Based on the RNA sequencing survival analysis, a 9-lncRNA prognosis model comprising TMCC1-AS1, AC008892.1, AL031985.3, L34079.2, U95743.1, KDM4A-AS1, SACS-AS1, AC005534.1, and LINC01116 was established. The 9-lncRNA prognosis model was a reliable tool for predicting the prognosis of HCC, and the nomogram of this model could help clinicians choose personalized treatment for HCC patients. This model significantly complements the clinical characteristics of HCC and promotes personalized management of patients; it also provides a new idea for research on the prognosis of HCC.


Subjects
Liver Neoplasms/genetics; Liver Neoplasms/mortality; RNA, Long Noncoding/analysis; Sequence Analysis, RNA/statistics & numerical data; Cohort Studies; Female; Gene Expression Profiling; Humans; Kaplan-Meier Estimate; Liver Neoplasms/epidemiology; Male; Multivariate Analysis; Prognosis; Proportional Hazards Models; RNA, Long Noncoding/classification; ROC Curve; Risk Assessment/methods; Sequence Analysis, RNA/methods; Survival Analysis
4.
Nat Commun ; 11(1): 3155, 2020 06 22.
Article in English | MEDLINE | ID: mdl-32572028

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is a versatile tool for discovering and annotating cell types and states, but the determination and annotation of cell subtypes is often subjective and arbitrary. Often, it is not even clear whether a given cluster is uniform. Here we present an entropy-based statistic, ROGUE, to accurately quantify the purity of identified cell clusters. We demonstrate that our ROGUE metric is broadly applicable, and enables accurate, sensitive and robust assessment of cluster purity on a wide range of simulated and real datasets. Applying this metric to fibroblast, B cell and brain data, we identify additional subtypes and demonstrate the application of ROGUE-guided analyses to detect precise signals in specific subpopulations. ROGUE can be applied to all tested scRNA-seq datasets, and has important implications for evaluating the quality of putative clusters, discovering pure cell subtypes and constructing comprehensive, detailed and standardized single cell atlas.


Subjects
Data Analysis; Sequence Analysis, RNA; Single-Cell Analysis; Humans; Models, Theoretical; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Software
5.
Nat Commun ; 11(1): 1169, 2020 03 03.
Article in English | MEDLINE | ID: mdl-32127540

ABSTRACT

One primary reason that makes single-cell RNA-seq analysis challenging is dropouts, where the data only captures a small fraction of the transcriptome of each cell. Almost all computational algorithms developed for single-cell RNA-seq adopted gene selection, dimension reduction or imputation to address the dropouts. Here, an opposite view is explored. Instead of treating dropouts as a problem to be fixed, we embrace it as a useful signal. We represent the dropout pattern by binarizing single-cell RNA-seq count data, and present a co-occurrence clustering algorithm to cluster cells based on the dropout pattern. We demonstrate in multiple published datasets that the binary dropout pattern is as informative as the quantitative expression of highly variable genes for the purpose of identifying cell types. We expect that recognizing the utility of dropouts provides an alternative direction for developing computational algorithms for single-cell RNA-seq analysis.
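
The binarization step described in this abstract can be illustrated with a minimal sketch. It is not the authors' co-occurrence clustering algorithm: it only shows how a count matrix becomes a detected/not-detected pattern and how co-detection of two genes can be counted. The function names and the toy matrix (rows are cells, columns are genes) are assumptions made for illustration.

```python
def binarize(counts):
    """Return a 0/1 matrix: 1 where a gene was detected (count > 0), else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def co_occurrence(binary, gene_a, gene_b):
    """Count the cells in which both genes are detected."""
    return sum(1 for row in binary if row[gene_a] and row[gene_b])


# Toy count matrix (hypothetical data): 3 cells x 3 genes.
counts = [
    [5, 0, 2],
    [0, 0, 7],
    [3, 1, 0],
]

binary = binarize(counts)
print(binary)
print(co_occurrence(binary, 0, 2))
```

A clustering method built on this idea would work from the binary matrix alone and ignore the magnitude of the non-zero counts.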


Subjects
Algorithms; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/statistics & numerical data; Animals; Cluster Analysis; Databases, Genetic; Gene Ontology; Humans; Leukocytes, Mononuclear/physiology; Mice; Prefrontal Cortex/physiology; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods
6.
Nat Commun ; 11(1): 774, 2020 02 07.
Article in English | MEDLINE | ID: mdl-32034137

ABSTRACT

An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? Here we present a mathematical framework which reveals that, for estimating many important gene properties, the optimal allocation is to sequence at a depth of around one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely-used plug-in estimator, but one developed via empirical Bayes.
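
The trade-off in this abstract can be made concrete with a small worked calculation. The sketch below is an interpretation, not the paper's framework: it assumes the gene of interest accounts for one in every `reads_per_gene_read` reads (a hypothetical parameter), so a depth of that many reads per cell yields about one read per cell for that gene, and the remaining budget determines the number of cells. It does not implement the empirical Bayes estimator.

```python
def allocate_budget(total_reads, reads_per_gene_read, target_reads_per_cell_per_gene=1):
    """Split a fixed read budget into (reads per cell, number of cells).

    reads_per_gene_read is a hypothetical parameter: on average, one read in
    this many is assumed to come from the gene of interest.
    """
    depth_per_cell = target_reads_per_cell_per_gene * reads_per_gene_read
    n_cells = total_reads // depth_per_cell
    return depth_per_cell, n_cells


# Hypothetical budget: 1 billion reads, gene seen once per 10,000 reads.
depth, cells = allocate_budget(total_reads=1_000_000_000, reads_per_gene_read=10_000)
print(depth, cells)
```

Under these assumptions, sequencing deeper than this depth spends budget that could have gone to additional cells.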


Subjects
Computational Biology/methods; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Computational Biology/statistics & numerical data; Gene Expression; Gene Regulatory Networks; In Situ Hybridization, Fluorescence; Models, Theoretical; Reproducibility of Results; S100 Calcium-Binding Protein A4/genetics
7.
Neuron ; 105(6): 1027-1035.e2, 2020 03 18.
Article in English | MEDLINE | ID: mdl-31983538

ABSTRACT

The interplay between viral infection and Alzheimer's disease (AD) has long been an area of interest, but proving causality has been elusive. Several recent studies have renewed the debate concerning the role of herpesviruses, and human herpesvirus 6 (HHV-6) in particular, in AD. We screened for HHV-6 detection across three independent AD brain repositories using (1) RNA sequencing (RNA-seq) datasets and (2) DNA samples extracted from AD and non-AD control brains. The RNA-seq data were screened for pathogens against taxon references from over 25,000 microbes, including 118 human viruses, whereas DNA samples were probed for PCR reactivity to HHV-6A and HHV-6B. HHV-6 demonstrated little specificity to AD brains over controls by either method, whereas other viruses, such as Epstein-Barr virus (EBV) and cytomegalovirus (CMV), were detected at comparable levels. These direct methods of viral detection do not suggest an association between HHV-6 and AD.


Subjects
Alzheimer Disease/virology; Brain/virology; Herpesvirus 6, Human/isolation & purification; Case-Control Studies; Cohort Studies; Female; Herpesvirus 6, Human/genetics; Humans; Male; Sequence Analysis, DNA/statistics & numerical data; Sequence Analysis, RNA/statistics & numerical data
8.
Nucleic Acids Res ; 48(1): 86-95, 2020 01 10.
Article in English | MEDLINE | ID: mdl-31777938

ABSTRACT

Clustering is an essential step in the analysis of single cell RNA-seq (scRNA-seq) data to shed light on tissue complexity including the number of cell types and transcriptomic signatures of each cell type. Due to its importance, novel methods have been developed recently for this purpose. However, different approaches generate varying estimates regarding the number of clusters and the single-cell level cluster assignments. This type of unsupervised clustering is challenging, and it is often hard to gauge which method to use because none of the existing methods outperforms the others across all scenarios. We present SAME-clustering, a mixture model-based approach that takes clustering solutions from multiple methods and selects a maximally diverse subset to produce an improved ensemble solution. We tested SAME-clustering across 15 scRNA-seq datasets generated by different platforms, with number of clusters varying from 3 to 15, and number of single cells from 49 to 32 695. Results show that our SAME-clustering ensemble method yields enhanced clustering, in terms of both cluster assignments and number of clusters. The mixture model ensemble clustering is not limited to clustering scRNA-seq data and may be useful to a wide range of clustering applications.


Subjects
Algorithms; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/statistics & numerical data; Transcriptome; Cluster Analysis; Datasets as Topic; Gene Expression Profiling; Humans; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods
9.
PLoS Comput Biol ; 15(12): e1007537, 2019 12.
Article in English | MEDLINE | ID: mdl-31830035

ABSTRACT

Next-generation sequencing is a cutting edge technology, but to quantify a dynamic range of abundances for different RNA or DNA species requires increasing sampling depth to levels that can be prohibitively expensive due to physical limits on molecular throughput of sequencers. To overcome this problem, we introduce a new general sampling theory which uses biophysical principles to functionally encode the abundance of a species before sampling, SeQUential depletIon and enriCHment (SQUICH). In theory and simulation, SQUICH enables sampling at a logarithmic rate to achieve the same precision as attained with conventional sequencing. A simple proof of principle experimental implementation of SQUICH in a controlled complex system of ~262,000 oligonucleotides already reduces sequencing depth by a factor of 10. SQUICH lays the groundwork for a general solution to a fundamental problem in molecular sampling and enables a new generation of efficient, precise molecular measurement at logarithmic or better sampling depth.


Subjects
High-Throughput Nucleotide Sequencing/methods; Base Sequence; Computational Biology; Computer Simulation; DNA/genetics; High-Throughput Nucleotide Sequencing/statistics & numerical data; Proof of Concept Study; RNA/genetics; Sampling Studies; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/statistics & numerical data; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Species Specificity
10.
PLoS Comput Biol ; 15(12): e1007525, 2019 12.
Article in English | MEDLINE | ID: mdl-31809503

ABSTRACT

Response to acid stress is critical for Escherichia coli to successfully complete its life-cycle by passing through the stomach to colonize the digestive tract. To develop a fundamental understanding of this response, we established a molecular mechanistic description of acid stress mitigation responses in E. coli and integrated them with a genome-scale model of its metabolism and macromolecular expression (ME-model). We considered three known mechanisms of acid stress mitigation: 1) change in membrane lipid fatty acid composition, 2) change in periplasmic protein stability over external pH and periplasmic chaperone protection mechanisms, and 3) change in the activities of membrane proteins. After integrating these mechanisms into an established ME-model, we could simulate their responses in the context of other cellular processes. We validated these simulations using RNA sequencing data obtained from five E. coli strains grown under external pH ranging from 5.5 to 7.0. We found: i) that for the differentially expressed genes accounted for in the ME-model, 80% of the upregulated genes were correctly predicted by the ME-model, and ii) that these genes are mainly involved in translation processes (45% of genes), membrane proteins and related processes (18% of genes), amino acid metabolism (12% of genes), and cofactor and prosthetic group biosynthesis (8% of genes). We also demonstrated several intervention strategies on acid tolerance that can be simulated by the ME-model. We thus established a quantitative framework that describes, on a genome-scale, the acid stress mitigation response of E. coli that has both scientific and practical uses.


Subjects
Escherichia coli/genetics; Escherichia coli/metabolism; Models, Biological; Acids; Computational Biology; Computer Simulation; Escherichia coli/growth & development; Escherichia coli Proteins/metabolism; Fatty Acids/metabolism; Gene Expression Regulation, Bacterial; Genome, Bacterial; Hydrogen-Ion Concentration; Membrane Lipids/metabolism; Models, Genetic; Protein Stability; Sequence Analysis, RNA/statistics & numerical data; Stress, Physiological
11.
PLoS Comput Biol ; 15(12): e1007510, 2019 12.
Article in English | MEDLINE | ID: mdl-31790389

ABSTRACT

Quantifying cell-type proportions and their corresponding gene expression profiles in tissue samples would enhance understanding of the contributions of individual cell types to the physiological states of the tissue. Current approaches that address tissue heterogeneity have drawbacks. Experimental techniques, such as fluorescence-activated cell sorting, and single cell RNA sequencing are expensive. Computational approaches that use expression data from heterogeneous samples are promising, but most of the current methods estimate either cell-type proportions or cell-type-specific expression profiles by requiring the other as input. Although such partial deconvolution methods have been successfully applied to tumor samples, the additional input required may be unavailable. We introduce a novel complete deconvolution method, CDSeq, that uses only RNA-Seq data from bulk tissue samples to simultaneously estimate both cell-type proportions and cell-type-specific expression profiles. Using several synthetic and real experimental datasets with known cell-type composition and cell-type-specific expression profiles, we compared CDSeq's complete deconvolution performance with seven other established deconvolution methods. Complete deconvolution using CDSeq represents a substantial technical advance over partial deconvolution approaches and will be useful for studying cell mixtures in tissue samples. CDSeq is available at GitHub repository (MATLAB and Octave code): https://github.com/kkang7/CDSeq.
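
To make the deconvolution problem concrete, the sketch below shows the linear mixing model that such methods are generally understood to invert: a bulk profile is treated as a proportion-weighted sum of cell-type-specific profiles. This is not CDSeq and performs no deconvolution; the function name and toy numbers are assumptions for illustration only.

```python
def mix(profiles, proportions):
    """Return the bulk expression vector for a mixture of cell types.

    profiles: one expression vector per cell type (same gene order).
    proportions: fraction of each cell type in the sample (should sum to 1).
    """
    n_genes = len(profiles[0])
    return [
        sum(p * profile[g] for p, profile in zip(proportions, profiles))
        for g in range(n_genes)
    ]


# Hypothetical profiles for two cell types over three genes.
profiles = [
    [10.0, 0.0, 4.0],  # cell type A
    [2.0, 8.0, 4.0],   # cell type B
]
proportions = [0.25, 0.75]

bulk = mix(profiles, proportions)
print(bulk)
```

Partial deconvolution recovers `proportions` given `bulk` and `profiles` (or the reverse), whereas complete deconvolution, as described in the abstract, estimates both from bulk data alone.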


Subjects
Gene Expression Profiling/statistics & numerical data; Sequence Analysis, RNA/statistics & numerical data; Unsupervised Machine Learning; Cell Line; Computational Biology/methods; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Humans; Leukocytes/classification; Leukocytes/metabolism; Pattern Recognition, Automated; Transcriptome
12.
PLoS Comput Biol ; 15(8): e1007252, 2019 08.
Article in English | MEDLINE | ID: mdl-31390362

ABSTRACT

Massively parallel RNA sequencing (RNA-seq) in combination with metabolic labeling has become the de facto standard approach to study alterations in RNA transcription, processing or decay. Regardless of advances in the experimental protocols and techniques, every experimentalist needs to specify the key aspects of experimental design: for example, which protocol should be used (biochemical separation vs. nucleotide conversion) and what is the optimal labeling time? In this work, we provide approximate answers to these questions using the asymptotic theory of optimal design. Specifically, we investigate how the variance of degradation rate estimates depends on the time and derive the optimal time for any given degradation rate. Subsequently, we show that an increase in sample numbers should be preferred over an increase in sequencing depth. Lastly, we provide some guidance on use cases when laborious biochemical separation outcompetes recent nucleotide-conversion-based methods (such as SLAMseq) and show how inefficient conversion influences the precision of estimates. Code and documentation can be found at https://github.com/dieterich-lab/DesignMetabolicRNAlabeling.


Subjects
RNA Stability; RNA/genetics; RNA/metabolism; Computational Biology; High-Throughput Nucleotide Sequencing/methods; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Kinetics; MCF-7 Cells; Models, Biological; RNA Processing, Post-Transcriptional; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Transcription, Genetic
13.
PLoS Comput Biol ; 15(8): e1007040, 2019 08.
Article in English | MEDLINE | ID: mdl-31469823

ABSTRACT

Single-cell RNA-sequencing (scRNA-seq) provides new opportunities to gain a mechanistic understanding of many biological processes. Current approaches for single cell clustering are often sensitive to the input parameters and have difficulty dealing with cell types with different densities. Here, we present Panoramic View (PanoView), an iterative method integrated with a novel density-based clustering, Ordering Local Maximum by Convex hull (OLMC), that uses a heuristic approach to estimate the required parameters based on the input data structures. In each iteration, PanoView will identify the most confident cell clusters and repeat the clustering with the remaining cells in a new PCA space. Without adjusting any parameter in PanoView, we demonstrated that PanoView was able to detect major and rare cell types simultaneously and outperformed other existing methods in both simulated datasets and published single-cell RNA-sequencing datasets. Finally, we conducted scRNA-Seq analysis of embryonic mouse hypothalamus, and PanoView was able to reveal known cell types and several rare cell subpopulations.


Subjects
Algorithms; Sequence Analysis, RNA/statistics & numerical data; Animals; Cluster Analysis; Computational Biology; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Hypothalamus/cytology; Hypothalamus/embryology; Hypothalamus/metabolism; Mice; Single-Cell Analysis/statistics & numerical data
14.
PLoS Comput Biol ; 15(8): e1007293, 2019 08.
Article in English | MEDLINE | ID: mdl-31425522

ABSTRACT

The Long interspersed nuclear element 1 (LINE-1) is a primary source of genetic variation in humans and other mammals. Despite its importance, LINE-1 activity remains difficult to study because of its highly repetitive nature. Here, we developed and validated a method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned to the many LINE-1 instances in the genome into these two categories. Using our method, we evaluated well-established cell lines, cell-line compartments and healthy tissues and found that the vast majority (91.7%) of transcriptome reads overlapping LINE-1 derive from pervasive transcription. We validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance. Next, we applied our method to comprehensively measure LINE-1 activity across healthy somatic cells, while backing out the effect of pervasive transcription. Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. This finding contrasts with earlier studies showing that LINE-1 has limited activity in healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that the amount of LINE-1 activity was associated with the amount of cell turnover, with tissues with low cell turnover rates (e.g. the adult central nervous system) showing lower LINE-1 activity. Altogether, our results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome.


Subjects
DNA Transposable Elements/genetics; Long Interspersed Nucleotide Elements/genetics; Models, Genetic; Transcription, Genetic; Animals; Cell Line; Computational Biology; Genetic Techniques/statistics & numerical data; Genome, Human; Humans; Sequence Analysis, RNA/statistics & numerical data
15.
BMC Genomics ; 20(1): 540, 2019 Jul 02.
Article in English | MEDLINE | ID: mdl-31266443

ABSTRACT

BACKGROUND: Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many statistical methods have been proposed to identify differentially expressed genes (DEGs) under two or more conditions with RNA-seq data. However, statistical analyses with RNA-seq data are often limited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as prior distributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicated settings. We herein proposed a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyze more complicated RNA-seq data. The proposed method estimates the technical and biological variances with a linear mixed-effects model, with and without adjusting small sample bias using Bartlett's corrections. RESULTS: We conducted extensive simulations to compare the performance of BALLI with those of existing approaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlled the type-1 error rates at various nominal significance levels and produced better statistical power and precision estimates than those of other competing methods in various scenarios. Furthermore, BALLI was robust to variation of library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value. CONCLUSIONS: BALLI is statistically more efficient and valid than existing methods, and we conclude that it is useful for identifying DEGs in RNA-seq analysis.


Subjects
Cattle/genetics; Computational Biology/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Linear Models; Sequence Analysis, RNA/statistics & numerical data; Animals; Computational Biology/methods; Female; Gene Expression Profiling/methods; Likelihood Functions; Milk; Models, Genetic; Random Allocation; Sample Size; Sequence Analysis, RNA/methods; Software; Transcriptome
16.
J Bioinform Comput Biol ; 17(3): 1940008, 2019 06.
Article in English | MEDLINE | ID: mdl-31288642

ABSTRACT

Fusion genes are involved in cancer, and their detection using RNA-Seq is insufficient given the relatively short read length. Therefore, we proposed a shifted short-read clustering (SSC) method, which focuses on overlapping reads from the same loci and extends them as a representative sequence. To verify its usefulness, we applied the SSC method to RNA-Seq data from four types of cell lines (BT-474, MCF-7, SKBR-3, and T-47D). As the slide width of the SSC method increased to one, two, five, or ten bases, the read length was extended from 201 bases to 217 (108%), 234 (116%), 282 (140%), or 317 (158%) bases, respectively. Furthermore, fusion genes were investigated using STAR-Fusion, a fusion gene detection tool, with and without the SSC method. When one base was shifted by the SSC method, the reads mapped to multiple loci decreased from 9.7% to 4.6%, and the sensitivity of fusion gene detection improved from 47% to 54% on average (BT-474: from 48% to 57%, MCF-7: 49% to 53%, SKBR-3: 50% to 57%, and T-47D: 43% to 50%) compared with the original data. When the reads were shifted more, the positive predictive value also improved. The SSC method could be an effective method for fusion gene detection.
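
The percentages quoted for the extended read lengths can be checked with a short calculation. The sketch below only reproduces the arithmetic from the figures given in the abstract (201 bases before extension; 217, 234, 282, and 317 bases after shifts of 1, 2, 5, and 10 bases); it is not an implementation of the SSC method, and the variable and function names are illustrative.

```python
ORIGINAL_LENGTH = 201  # read length in bases before extension, from the abstract

# Slide width (bases) -> extended read length (bases), from the abstract.
extended = {1: 217, 2: 234, 5: 282, 10: 317}


def relative_length(extended_len, original_len=ORIGINAL_LENGTH):
    """Extended length as a percentage of the original, rounded to an integer."""
    return round(100 * extended_len / original_len)


percentages = {shift: relative_length(n) for shift, n in extended.items()}
print(percentages)
```

The rounded values agree with the 108%, 116%, 140%, and 158% reported in the abstract.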


Subjects
Cluster Analysis; Computational Biology/methods; Gene Fusion; Neoplasms/genetics; RNA-Seq; Cell Line, Tumor; Databases, Genetic; Humans; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data
17.
Nucleic Acids Res ; 47(16): e95, 2019 09 19.
Article in English | MEDLINE | ID: mdl-31226206

ABSTRACT

Cell type identification is essential for single-cell RNA sequencing (scRNA-seq) studies, currently transforming the life sciences. CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is an accurate cell type identification algorithm that is rapid and selective, including the possibility of intermediate or unassigned categories. Evidence for assignment is based on a classification tree of previously available scRNA-seq reference data and includes a confidence score based on the variance in gene expression per cell type. For cell types represented in the reference data, CHETAH's accuracy is as good as existing methods. Its specificity is superior when cells of an unknown type are encountered, such as malignant cells in tumor samples which it pinpoints as intermediate or unassigned. Although designed for tumor samples in particular, the use of unassigned and intermediate types is also valuable in other exploratory studies. This is exemplified in pancreas datasets where CHETAH highlights cell populations not well represented in the reference dataset, including cells with profiles that lie on a continuum between that of acinar and ductal cell types. Having the possibility of unassigned and intermediate cell types is pivotal for preventing misclassification and can yield important biological information for previously unexplored tissues.


Subjects
Algorithms; Cell Lineage/genetics; High-Throughput Nucleotide Sequencing/methods; Neoplasms/genetics; RNA, Messenger/analysis; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/methods; Acinar Cells/immunology; Acinar Cells/pathology; Base Sequence; Cell Lineage/immunology; Cluster Analysis; Datasets as Topic; Dendritic Cells/immunology; Dendritic Cells/pathology; Gene Expression Profiling; Humans; Neoplasms/immunology; Neoplasms/pathology; Organ Specificity; Pancreas/immunology; Pancreas/pathology; RNA, Messenger/genetics; Software; T-Lymphocytes/immunology; T-Lymphocytes/pathology; Tumor Cells, Cultured
18.
Nucleic Acids Res ; 47(16): e93, 2019 09 19.
Article in English | MEDLINE | ID: mdl-31216024

ABSTRACT

Single cell RNA sequencing methods have been increasingly used to understand cellular heterogeneity. Nevertheless, most of these methods suffer from one or more limitations, such as focusing only on polyadenylated RNA, sequencing of only the 3' end of the transcript, an exuberant fraction of reads mapping to ribosomal RNA, and the unstranded nature of the sequencing data. Here, we developed a novel single cell strand-specific total RNA library preparation method addressing all the aforementioned shortcomings. Our method was validated on a microfluidics system using three different cancer cell lines undergoing a chemical or genetic perturbation and on two other cancer cell lines sorted in microplates. We demonstrate that our total RNA-seq method detects an equal or higher number of genes compared to classic polyA[+] RNA-seq, including novel and non-polyadenylated genes. The obtained RNA expression patterns also recapitulate the expected biological signal. Inherent to total RNA-seq, our method is also able to detect circular RNAs. Taken together, SMARTer single cell total RNA sequencing is very well suited for any single cell sequencing experiment in which transcript level information is needed beyond polyadenylated genes.


Subjects
High-Throughput Nucleotide Sequencing/methods; RNA, Circular/analysis; RNA, Messenger/analysis; RNA, Ribosomal/analysis; Single-Cell Analysis/methods; Benchmarking; Cell Line, Tumor; Gene Library; Humans; Microfluidic Analytical Techniques; Poly A/genetics; Poly A/metabolism; RNA, Circular/genetics; RNA, Messenger/genetics; RNA, Ribosomal/genetics; Sequence Analysis, RNA/statistics & numerical data
19.
Pac Symp Biocomput ; 24: 350-361, 2019.
Article in English | MEDLINE | ID: mdl-30963074

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) techniques have been very powerful in analyzing heterogeneous cell populations and identifying cell types. Visualizing scRNA-seq data can help researchers extract meaningful biological information and make new discoveries. While commonly used scRNA-seq visualization methods, such as t-SNE, are useful in detecting cell clusters, they often tear apart the intrinsic continuous structure in gene expression profiles. Topological Data Analysis (TDA) approaches like Mapper capture the shape of data by representing the data as topological networks. TDA approaches are robust to noise and to differences between platforms, while preserving locality and data continuity. Moreover, instead of analyzing the whole dataset, Mapper allows researchers to explore the biological meaning of specific pathways and genes by using different filter functions. In this paper, we applied Mapper to visualize scRNA-seq data. Our method not only captures the clustering structure of cells but also preserves the continuous gene expression topologies of cells. We demonstrated that, when combined with gene co-expression network analysis, our method can reveal differential expression patterns of gene co-expression modules along the Mapper visualization.


Subjects
RNA/genetics , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Algorithms , Computational Biology , Data Interpretation, Statistical , Databases, Genetic/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Humans , Melanoma/genetics , Pancreas/cytology , Pancreas/metabolism
20.
Pac Symp Biocomput ; 24: 362-373, 2019.
Article in English | MEDLINE | ID: mdl-30963075

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile the transcriptomes of a large number of individual cells at a high resolution. These data usually contain measurements of gene expression for many genes in thousands or tens of thousands of cells, though some datasets now reach the million-cell mark. Projecting high-dimensional scRNA-seq data into a low-dimensional space aids downstream analysis and data visualization. Many recent preprints accomplish this using variational autoencoders (VAEs), generative models that learn the underlying structure of data by compressing it into a constrained, low-dimensional space. The low-dimensional spaces generated by VAEs have revealed complex patterns and novel biological signals from large-scale gene expression data and drug response predictions. Here, we evaluate a simple VAE approach for gene expression data, Tybalt, by training and measuring its performance on sets of simulated scRNA-seq data. We find a number of counter-intuitive performance features: for example, deeper neural networks can struggle when datasets contain more observations under some parameter configurations. We show that these methods are highly sensitive to parameter tuning: when tuned, the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches: PCA, ZIFA, UMAP and t-SNE. On the other hand, without tuning, performance can also be remarkably poor on the same data. Our results should discourage authors and reviewers from relying on self-reported performance comparisons to evaluate the relative value of contributions in this area at this time. Instead, because the potential for performance differences due to unequal parameter tuning is so high, we recommend that attempts to compare or benchmark autoencoder methods for scRNA-seq data be performed by disinterested third parties, or by methods developers only on unseen benchmark data that are provided to all participants simultaneously.


Subjects
Gene Expression Profiling/statistics & numerical data , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Cluster Analysis , Computational Biology , Computer Simulation , Humans , Neural Networks, Computer , Transcriptome