Search | Secretaria de Estado da Saúde

Discovery of optimal cell type classification marker genes from single cell RNA sequencing data.

Liu, Angela; Peng, Beverly; Pankajam, Ajith V; Duong, Thu Elizabeth; Pryhuber, Gloria; Scheuermann, Richard H; Zhang, Yun.

bioRxiv ; 2024 Jun 26.

Article in English | MEDLINE | ID: mdl-38712147

ABSTRACT

The use of single cell/nucleus RNA sequencing (scRNA-seq) technologies that quantitatively describe cell transcriptional phenotypes is revolutionizing our understanding of cell biology, leading to new insights in cell type identification, disease mechanisms, and drug development. The tremendous growth in scRNA-seq data has posed new challenges in efficiently characterizing data-driven cell types and identifying quantifiable marker genes for cell type classification. The use of machine learning and explainable artificial intelligence has emerged as an effective approach to study large-scale scRNA-seq data. NS-Forest is a random forest machine learning-based algorithm that aims to provide a scalable data-driven solution to identify minimum combinations of necessary and sufficient marker genes that capture cell type identity with maximum classification accuracy. Here, we describe the latest version, NS-Forest version 4.0, and its companion Python package (https://github.com/JCVenterInstitute/NSForest), with several enhancements to select marker gene combinations that exhibit highly selective expression patterns among closely related cell types and to perform marker gene selection more efficiently for large-scale scRNA-seq data atlases with millions of cells. By modularizing the final decision tree step, NS-Forest v4.0 can be used to compare the performance of user-defined marker genes with the NS-Forest computationally derived marker genes based on the decision tree classifiers. To quantify how well the identified markers exhibit the desired pattern of being exclusively expressed at high levels within their target cell types, we introduce the On-Target Fraction metric that ranges from 0 to 1, with a value of 1 assigned to markers that are expressed only within their target cell types and not in cells of any other cell type. NS-Forest v4.0 outperforms previous versions in its ability to identify markers with higher On-Target Fraction values for closely related cell types and outperforms other marker gene selection approaches at classification, with significantly higher F-beta scores when applied to datasets from three human organs: brain, kidney, and lung.
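As an illustration of the On-Target Fraction idea, the sketch below scores one marker gene. The abstract states only the range (0 to 1) and the meaning of a value of 1; the specific formula used here (median expression in the target cell type divided by the sum of per-cell-type medians) is an assumption made for this example, and the function name `on_target_fraction` is hypothetical, not part of the NS-Forest package API.

```python
import numpy as np

def on_target_fraction(expr, labels, target):
    """Hypothetical sketch of an On-Target Fraction for one marker gene.

    ASSUMPTION: the score is the median expression in the target cell type
    divided by the sum of median expressions over all cell types. The
    abstract does not give the formula; this is for illustration only.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    # Median expression of the gene within each cell type.
    medians = {
        cell_type: float(np.median(expr[labels == cell_type]))
        for cell_type in np.unique(labels).tolist()
    }
    total = sum(medians.values())
    if total == 0:
        return 0.0  # gene not expressed (by median) in any cell type
    return medians[target] / total

# A gene expressed only in cell type "T" scores 1.0.
exclusive = on_target_fraction(
    [5, 6, 7, 0, 0, 1, 0, 0, 0],
    ["T", "T", "T", "A", "A", "A", "B", "B", "B"],
    target="T",
)
# A gene expressed equally in two cell types scores 0.5 for either one.
shared = on_target_fraction([4, 4, 4, 4], ["T", "T", "A", "A"], target="T")
```

Under this assumed definition, a marker shared among closely related cell types is penalized in proportion to how much of its expression falls outside the target type.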

A Cost-effective, High-throughput, Highly Accurate Genotyping Method for Outbred Populations.

Chen, Denghui; Chitre, Apurva S; Nguyen, Khai-Minh H; Cohen, Katarina; Peng, Beverly; Ziegler, Kendra S; Okamoto, Faith; Lin, Bonnie; Johnson, Benjamin B; Sanches, Thiago M; Cheng, Riyan; Polesskaya, Oksana; Palmer, Abraham A.

bioRxiv ; 2024 Jul 18.

Article in English | MEDLINE | ID: mdl-39071405

ABSTRACT

Affordable sequencing and genotyping methods are essential for large-scale genome-wide association studies. While genotyping microarrays and reference panels for imputation are available for human subjects, non-human model systems often lack such options. Our lab previously demonstrated an efficient and cost-effective method to genotype heterogeneous stock rats using double-digest genotyping-by-sequencing. However, low-coverage whole-genome sequencing offers an alternative method that has several advantages. Here, we describe a cost-effective, high-throughput, high-accuracy genotyping method for N/NIH heterogeneous stock rats that can use a combination of sequencing data previously generated by double-digest genotyping-by-sequencing and more recently generated by low-coverage whole-genome sequencing. Using double-digest genotyping-by-sequencing data from 5,745 heterogeneous stock rats (mean 0.21x coverage) and low-coverage whole-genome sequencing data from 8,760 heterogeneous stock rats (mean 0.27x coverage), we can impute 7.32 million bi-allelic single-nucleotide polymorphisms with a concordance rate >99.76% compared to high-coverage (mean 33.26x coverage) whole-genome sequencing data for a subset of the same individuals. Our results demonstrate the feasibility of using sequencing data from double-digest genotyping-by-sequencing or low-coverage whole-genome sequencing for accurate genotyping, and demonstrate techniques that may also be useful for other genetic studies in non-human subjects.
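To make the concordance figure concrete, here is a minimal sketch of how a genotype concordance rate can be computed between imputed calls and high-coverage calls for the same individual. The abstract does not say how concordance was calculated; the encoding (alternate-allele counts 0, 1, 2, with `None` for a missing call), the exclusion of missing sites, and the function name `concordance_rate` are all assumptions for illustration.

```python
def concordance_rate(imputed, truth):
    """Fraction of sites where the imputed genotype equals the reference call.

    ASSUMPTIONS: genotypes are alternate-allele counts (0, 1, 2); a site
    where either call is None (missing) is excluded from the comparison.
    """
    compared = [
        (i, t) for i, t in zip(imputed, truth)
        if i is not None and t is not None
    ]
    if not compared:
        raise ValueError("no sites with calls in both sets")
    matches = sum(1 for i, t in compared if i == t)
    return matches / len(compared)

# Five sites: one is missing in the imputed set, three of the remaining four agree.
rate = concordance_rate([0, 1, 2, 1, None], [0, 1, 2, 2, 1])
```

In practice such a rate would be aggregated over millions of sites and many individuals; the toy input here only shows the arithmetic.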

Genome-wide association study reveals multiple loci for nociception and opioid consumption behaviors associated with heroin vulnerability in outbred rats.

Kuhn, Brittany N; Cannella, Nazzareno; Chitre, Apurva S; Nguyen, Khai-Minh H; Cohen, Katarina; Chen, Denghui; Peng, Beverly; Ziegler, Kendra S; Lin, Bonnie; Johnson, Benjamin B; Missfeldt Sanches, Thiago; Crow, Ayteria D; Lunerti, Veronica; Gupta, Arkobrato; Dereschewitz, Eric; Soverchia, Laura; Hopkins, Jordan L; Roberts, Analyse T; Ubaldi, Massimo; Abdulmalek, Sarah; Kinen, Analia; Hardiman, Gary; Chung, Dongjun; Polesskaya, Oksana; Solberg Woods, Leah C; Ciccocioppo, Roberto; Kalivas, Peter W; Palmer, Abraham A.

bioRxiv ; 2024 Jun 25.

Article in English | MEDLINE | ID: mdl-38712202

ABSTRACT

The increased prevalence of opioid use disorder (OUD) makes it imperative to disentangle the biological mechanisms contributing to individual differences in OUD vulnerability. OUD shows strong heritability; however, genetic variants contributing toward vulnerability remain poorly defined. We performed a genome-wide association study using over 850 male and female heterogeneous stock (HS) rats to identify genes underlying behaviors associated with OUD such as nociception, as well as heroin-taking, extinction and seeking behaviors. By using an animal model of OUD, we were able to identify genetic variants associated with distinct OUD behaviors while maintaining a uniform environment, an experimental design not easily achieved in humans. Furthermore, we used a novel non-linear network-based clustering approach to characterize rats based on OUD vulnerability to assess genetic variants associated with OUD susceptibility. Our findings confirm the heritability of several OUD-like behaviors, including OUD susceptibility. Additionally, several genetic variants associated with nociceptive threshold prior to heroin experience, heroin consumption, escalation of intake, and motivation to obtain heroin were identified. Tom1, a microglial component, was implicated for nociception. Several genes involved in dopaminergic signaling, neuroplasticity and substance use disorders, including Brwd1, Pcp4, Phb1l2 and Mmp15, were implicated for the heroin traits. Additionally, an OUD vulnerable phenotype was associated with genetic variants for consumption and break point, suggesting a specific genetic contribution for OUD-like traits contributing to vulnerability. Together, these findings identify novel genetic markers related to the susceptibility to OUD-relevant behaviors in HS rats.

Machine learning for cell type classification from single nucleus RNA sequencing data.

Le, Huy; Peng, Beverly; Uy, Janelle; Carrillo, Daniel; Zhang, Yun; Aevermann, Brian D; Scheuermann, Richard H.

PLoS One ; 17(9): e0275070, 2022.

Article in English | MEDLINE | ID: mdl-36149937

ABSTRACT

With the advent of single cell/nucleus RNA sequencing (sc/snRNA-seq), the field of cell phenotyping is now a data-driven exercise providing statistical evidence to support cell type/state categorization. However, the task of classifying cells into specific, well-defined categories with the empirical data provided by sc/snRNA-seq remains nontrivial due to the difficulty in determining specific differences between related cell types with close transcriptional similarities, resulting in challenges with matching cell types identified in separate experiments. To investigate possible approaches to overcome these obstacles, we explored the use of supervised machine learning methods (logistic regression, support vector machines, random forests, neural networks, and light gradient boosting machine (LightGBM)) as approaches to classify cell types using snRNA-seq datasets from human brain middle temporal gyrus (MTG) and human kidney. Classification accuracy was evaluated using an F-beta score weighted in favor of precision to account for technical artifacts of gene expression dropout. We examined the impact of hyperparameter optimization and feature selection methods on F-beta score performance. We found that the best performing model for granular cell type classification in both datasets is a multinomial logistic regression classifier and that an effective feature selection step was the most influential factor in optimizing the performance of the machine learning pipelines.
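For readers unfamiliar with the F-beta score mentioned above, the sketch below uses the standard definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. A beta below 1 weights precision more heavily than recall. The abstract does not state which beta the authors used; the default of 0.5 and the function name `f_beta` are assumptions for illustration.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score from true-positive, false-positive, and
    false-negative counts.

    ASSUMPTION: beta=0.5 is an illustrative default that favors precision;
    the value used in the study is not given in the abstract.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# A classifier with high precision (0.8) but low recall (0.5), as can happen
# when dropout hides marker expression in some cells of the target type.
precision_weighted = f_beta(tp=8, fp=2, fn=8, beta=0.5)
balanced = f_beta(tp=8, fp=2, fn=8, beta=1.0)
```

With these counts the precision-weighted score is higher than the balanced F1 score, which is the intended effect: cells missed because of dropout (false negatives) cost less than wrong assignments (false positives).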

Subjects

Machine Learning; RNA; Humans; Logistic Models; RNA, Small Nuclear; Sequence Analysis, RNA/methods; Support Vector Machine
