1.
Development ; 151(11)2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38691188

ABSTRACT

Analysis of single cell transcriptomics (scRNA-seq) data is typically performed after subsetting to highly variable genes (HVGs). Here, we show that Entropy Sorting provides an alternative mathematical framework for feature selection. On synthetic datasets, continuous Entropy Sort Feature Weighting (cESFW) outperforms HVG selection in distinguishing cell-state-specific genes. We apply cESFW to six merged scRNA-seq datasets spanning human early embryo development. Without smoothing or augmenting the raw counts matrices, cESFW generates a high-resolution embedding displaying coherent developmental progression from eight-cell to post-implantation stages and delineating 15 distinct cell states. The embedding highlights sequential lineage decisions during blastocyst development, while unsupervised clustering identifies branch point populations obscured in previous analyses. The first branching region, where morula cells become specified for inner cell mass or trophectoderm, includes cells previously asserted to lack a developmental trajectory. We quantify the relatedness of different pluripotent stem cell cultures to distinct embryo cell types and identify marker genes of naïve and primed pluripotency. Finally, by revealing genes with dynamic lineage-specific expression, we provide markers for staging progression from morula to blastocyst.


Subjects
Cell Lineage; Embryo, Mammalian; Embryonic Development; Entropy; Single-Cell Analysis; Transcriptome; Humans; Transcriptome/genetics; Single-Cell Analysis/methods; Embryonic Development/genetics; Embryo, Mammalian/metabolism; Cell Lineage/genetics; Gene Expression Regulation, Developmental; Blastocyst/metabolism; Blastocyst/cytology; Gene Expression Profiling; Morula/metabolism; Morula/cytology; Pluripotent Stem Cells/metabolism; Pluripotent Stem Cells/cytology
2.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38975891

ABSTRACT

Unsupervised feature selection is a critical step for efficient and accurate analysis of single-cell RNA-seq data. Previous benchmarks used two different criteria to compare feature selection methods: (i) proportion of ground-truth marker genes included in the selected features and (ii) accuracy of cell clustering using ground-truth cell types. Here, we systematically compare the performance of 11 feature selection methods for both criteria. We first demonstrate the discordance between these criteria and suggest using the latter. We then compare the distributions of mean expression of the genes selected by each feature selection method. We show that lowly expressed genes exhibit very high coefficients of variation and are mostly excluded by high-performance methods. In particular, high-deviation- and high-expression-based methods outperform the method widely used in the Seurat package in clustering cells and data visualization. We further show that they also enable a clear separation of the same cell type from different tissues as well as accurate estimation of cell trajectories.


Subjects
Single-Cell Analysis; Single-Cell Analysis/methods; Cluster Analysis; Humans; Gene Expression Profiling/methods; Algorithms; Computational Biology/methods; Sequence Analysis, RNA/methods; RNA-Seq/methods
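
A minimal Python sketch of why lowly expressed genes show high coefficients of variation, as the abstract above reports. It assumes pure Poisson sampling noise, which is an illustrative simplification and not the model used in the benchmark; the function names are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(counts):
    """Empirical CV (population standard deviation / mean) of one gene's counts."""
    mean = statistics.fmean(counts)
    if mean == 0:
        raise ValueError("CV is undefined for a gene with zero mean expression")
    return statistics.pstdev(counts) / mean


def poisson_cv(mean):
    """CV expected under Poisson noise alone (assumption): variance equals the mean."""
    return 1.0 / math.sqrt(mean)


if __name__ == "__main__":
    # Under this assumption the CV grows without bound as the mean falls.
    for mu in (0.01, 1.0, 100.0):
        print(f"mean={mu:g}  expected CV under Poisson={poisson_cv(mu):.2f}")
```

Under this assumption, ranking genes by raw CV alone would favor genes with very low means, which is consistent with the abstract's remark that high-performance methods mostly exclude such genes.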
3.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38385875

ABSTRACT

Metabolomics and foodomics shed light on the molecular processes within living organisms and on complex food composition by leveraging sophisticated analytical techniques to systematically analyze the vast array of molecular features. The traditional feature-picking method often results in arbitrary choices of model, feature ranking, and cut-off, which may lead to suboptimal results. Thus, a Multiple and Optimal Screening Subset (MOSS) approach was developed in this study to achieve a balance between a minimal number of predictors and high predictive accuracy during statistical model setup. The MOSS approach compares five commonly used models in the context of food matrix analysis, specifically bourbons. These models include Student's t-test, receiver operating characteristic curve, partial least squares-discriminant analysis (PLS-DA), random forests, and support vector machines. The approach employs cross-validation to identify promising subset feature candidates that contribute to food characteristic classification. It then determines the optimal subset size by comparing it to the corresponding top-ranked features. Finally, it selects the optimal feature subset by traversing all possible feature candidate combinations. By applying the MOSS approach to 1406 mass spectral features from a collection of 122 bourbon samples, we were able to generate a subset of features for bourbon age prediction with 88% accuracy. Additionally, MOSS increased the area under the curve performance of sweetness prediction to 0.898 with only four predictors, compared with 0.681 for the top-ranked four features, based on the PLS-DA model. Overall, we demonstrated that MOSS provides an efficient and effective approach for selecting optimal features compared with other frequently utilized methods.


Subjects
Metabolomics; Research Design; Discriminant Analysis; Models, Statistical; ROC Curve
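
The final step described in the abstract above, traversing all combinations of feature candidates, can be sketched in Python as an exhaustive subset search. This is only an illustration under stated assumptions, not the MOSS implementation: `best_subset`, the `score` callback and the toy gains are all hypothetical.

```python
from itertools import combinations


def best_subset(candidates, score, max_size):
    """Return (best_score, best_subset) over all non-empty subsets up to max_size.

    `score` is a caller-supplied function (hypothetical here), for example a
    cross-validated accuracy; ties keep the smaller, earlier subset.
    """
    best = (float("-inf"), ())
    for size in range(1, max_size + 1):
        for subset in combinations(candidates, size):
            value = score(subset)
            if value > best[0]:
                best = (value, subset)
    return best


if __name__ == "__main__":
    # Toy score: made-up per-feature gains minus a penalty per feature used.
    gain = {"f1": 0.30, "f2": 0.25, "f3": 0.02, "f4": 0.01}
    toy_score = lambda subset: sum(gain[f] for f in subset) - 0.05 * len(subset)
    print(best_subset(list(gain), toy_score, max_size=4))
```

The number of subsets grows exponentially with the number of candidates, so a search like this is only practical after an earlier step has narrowed the candidate list.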
4.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39007597

ABSTRACT

Thyroid cancer incidence continues to increase even though a large number of inspection tools have been developed recently. Since there is no standard, definitive procedure to follow for thyroid cancer diagnosis, clinicians must conduct various tests. This scrutiny process yields multi-dimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data; both are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose thyroid cancer. To this end, the singularity in learning problems that stems from randomly distributed missing data is treated, and dimensionality reduction with inner and target similarity approaches is developed to select the most informative input datasets. In addition, size reduction with the hierarchical clustering algorithm is performed to eliminate highly similar data samples. Four machine learning algorithms are trained and then tested with unseen data to validate their generalization and robustness. The results yield 100% training and 83% testing preciseness for the unseen data. The computational time efficiency of the algorithms is also examined under equal conditions.


Subjects
Algorithms; Deep Learning; Thyroid Neoplasms; Thyroid Neoplasms/diagnosis; Humans; Machine Learning; Cluster Analysis
5.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38797968

ABSTRACT

A major challenge of precision oncology is the identification and prioritization of suitable treatment options based on molecular biomarkers of the considered tumor. In pursuit of this goal, large cancer cell line panels have successfully been studied to elucidate the relationship between cellular features and treatment response. Due to the high dimensionality of these datasets, machine learning (ML) is commonly used for their analysis. However, choosing a suitable algorithm and set of input features can be challenging. We performed a comprehensive benchmarking of ML methods and dimension reduction (DR) techniques for predicting drug response metrics. Using the Genomics of Drug Sensitivity in Cancer cell line panel, we trained random forests, neural networks, boosting trees and elastic nets for 179 anti-cancer compounds with feature sets derived from nine DR approaches. We compare the results regarding statistical performance, runtime and interpretability. Additionally, we provide strategies for assessing model performance compared with a simple baseline model and measuring the trade-off between models of different complexity. Lastly, we show that complex ML models benefit from using an optimized DR strategy, and that standard models, even when using considerably fewer features, can still be superior in performance.


Subjects
Algorithms; Antineoplastic Agents; Benchmarking; Machine Learning; Humans; Antineoplastic Agents/pharmacology; Antineoplastic Agents/therapeutic use; Neoplasms/drug therapy; Neoplasms/genetics; Neural Networks, Computer; Cell Line, Tumor
6.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39101500

ABSTRACT

Genomic selection (GS) has emerged as an effective technology to accelerate crop hybrid breeding by enabling early selection prior to phenotype collection. Genomic best linear unbiased prediction (GBLUP) is a robust method that has been routinely used in GS breeding programs. However, GBLUP assumes that markers contribute equally to the total genetic variance, which may not be the case. In this study, we developed a novel GS method called GA-GBLUP that leverages the genetic algorithm (GA) to select markers related to the target trait. We defined four fitness functions for optimization, including AIC, BIC, R2, and HAT, to improve the predictability and bin adjacent markers based on the principle of linkage disequilibrium to reduce model dimension. The results demonstrate that the GA-GBLUP model, equipped with R2 and HAT fitness function, produces much higher predictability than GBLUP for most traits in rice and maize datasets, particularly for traits with low heritability. Moreover, we have developed a user-friendly R package, GAGBLUP, for GS, and the package is freely available on CRAN (https://CRAN.R-project.org/package=GAGBLUP).


Subjects
Algorithms; Genomics; Selection, Genetic; Zea mays; Genomics/methods; Zea mays/genetics; Oryza/genetics; Models, Genetic; Plant Breeding/methods; Linkage Disequilibrium; Phenotype; Quantitative Trait Loci; Genome, Plant; Polymorphism, Single Nucleotide; Software
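
Two of the fitness criteria named in the abstract above, AIC and BIC, have standard textbook definitions that can be written in a few lines of Python. The sketch below shows only those general formulas; how GA-GBLUP computes the likelihood, counts parameters, or defines its R2 and HAT criteria is not stated in the abstract, and the function names here are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


if __name__ == "__main__":
    # Made-up values: a model with 2 parameters fitted to 100 observations.
    print(aic(-10.0, 2), bic(-10.0, 2, 100))
```

Lower values indicate a better trade-off between fit and model size for both criteria, so an optimizer that maximizes fitness would presumably work with the negated value; that detail is an assumption.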
7.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39038932

ABSTRACT

MOTIVATION: Drug repositioning, the identification of new therapeutic uses for existing drugs, is crucial for accelerating drug discovery and reducing development costs. Some methods rely on heterogeneous networks, which may not fully capture the complex relationships between drugs and diseases. However, integrating diverse biological data sources offers promise for discovering new drug-disease associations (DDAs). Previous evidence indicates that the combination of information would be conducive to the discovery of new DDAs. However, the challenge lies in effectively integrating different biological data sources to identify the most effective drugs for a certain disease based on drug-disease coupled mechanisms. RESULTS: In response to this challenge, we present MiRAGE, a novel computational method for drug repositioning. MiRAGE leverages a three-step framework, comprising negative sampling using hard negative mining, classification employing random forest models, and feature selection based on feature importance. We evaluate MiRAGE on multiple benchmark datasets, demonstrating its superiority over state-of-the-art algorithms across various metrics. Notably, MiRAGE consistently outperforms other methods in uncovering novel DDAs. Case studies focusing on Parkinson's disease and schizophrenia showcase MiRAGE's ability to identify top candidate drugs supported by previous studies. Overall, our study underscores MiRAGE's efficacy and versatility as a computational tool for drug repositioning, offering valuable insights for therapeutic discoveries and addressing unmet medical needs.


Subjects
Algorithms; Data Mining; Drug Repositioning; Drug Repositioning/methods; Data Mining/methods; Humans; Computational Biology/methods; Schizophrenia/drug therapy; Parkinson Disease/drug therapy; Drug Discovery/methods
8.
Am J Hum Genet ; 109(11): 1974-1985, 2022 Nov 03.
Article in English | MEDLINE | ID: mdl-36206757

ABSTRACT

Almost always, the analysis of single-cell RNA-sequencing (scRNA-seq) data begins with the generation of the low dimensional embedding of the data by principal-component analysis (PCA). Because scRNA-seq data are count data, log transformation is routinely applied to correct skewness prior to PCA, which is often argued to have added bias to data. Alternatively, studies have proposed methods that directly assume a count model and use approximately normally distributed count residuals for PCA. Despite their theoretical advantage of directly modeling count data, these methods are extremely slow for large datasets. In fact, when the data size grows, even the standard log normalization becomes inefficient. Here, we present FastRNA, a highly efficient solution for PCA of scRNA-seq data based on a count model accounting for both batches and cell size factors. Although we assume the same general count model as previous methods, our method uses two orders of magnitude less time and memory than the other count-based methods and an order of magnitude less time and memory than the standard log normalization. This achievement results from our unique algebraic optimization that completely avoids the formation of the large dense residual matrix in memory. In addition, our method enjoys a benefit that the batch effects are eliminated from data prior to PCA. Generating a batch-accounted PC of an atlas-scale dataset with 2 million cells takes less than a minute and 1 GB memory with our method.


Subjects
RNA; Single-Cell Analysis; Humans; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Principal Component Analysis; Exome Sequencing; Gene Expression Profiling
9.
Brief Bioinform ; 25(1)2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38084922

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has revealed important insights into the heterogeneity of malignant cells. However, sample-specific genomic alterations often confound such analysis, resulting in patient-specific clusters that are difficult to interpret. Here, we present a novel approach to address the issue. By normalizing gene expression variances to identify universally variable genes (UVGs), we were able to reduce the formation of sample-specific clusters and identify underlying molecular hallmarks in malignant cells. In contrast to highly variable genes vulnerable to a specific sample bias, UVGs led to better detection of clusters corresponding to distinct malignant cell states. Our results demonstrate the utility of this approach for analyzing scRNA-seq data and suggest avenues for further exploration of malignant cell heterogeneity.


Subjects
Gene Expression Profiling; Single-Cell Analysis; Humans; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Cluster Analysis; Genomics
10.
Brief Bioinform ; 24(4)2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37369636

ABSTRACT

Untargeted metabolomics is gaining widespread applications. The key aspects of the data analysis include modeling complex activities of the metabolic network, selecting metabolites associated with clinical outcome and finding critical metabolic pathways to reveal biological mechanisms. One key roadblock in data analysis has not been well addressed: the matching uncertainty between data features and known metabolites. Given the limitations of the experimental technology, the identities of data features cannot be directly revealed in the data. The predominant approach for mapping features to metabolites is to match the mass-to-charge ratio (m/z) of data features to those derived from theoretical values of known metabolites. The relationship between features and metabolites is not one-to-one since some metabolites share molecular composition, and various adduct ions can be derived from the same metabolite. This matching uncertainty causes unreliable metabolite selection and functional analysis results. Here we introduce an integrated deep learning framework for metabolomics data that takes matching uncertainty into consideration. The model is devised with a gradual sparsification neural network based on the known metabolic network and the annotation relationship between features and metabolites. This architecture characterizes metabolomics data and reflects the modular structure of biological systems. Three goals can be achieved simultaneously without requiring much complex inference and additional assumptions: (1) evaluate metabolite importance, (2) infer feature-metabolite matching likelihood and (3) select disease sub-networks. When applied to a COVID metabolomics dataset and an aging mouse brain dataset, our method found metabolic sub-networks that were easily interpretable.


Subjects
COVID-19; Deep Learning; Animals; Mice; Metabolomics/methods; Metabolome; Metabolic Networks and Pathways
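
The m/z matching step described in the abstract above, and the reason it is not one-to-one, can be illustrated with a short Python sketch. The metabolite table, adduct list, tolerance and function name are illustrative assumptions with approximate masses; they are not taken from the paper.

```python
PROTON = 1.007276      # approximate mass shift of the [M+H]+ adduct, in Da
SODIUM = 22.989218     # approximate mass shift of the [M+Na]+ adduct, in Da

# Approximate monoisotopic masses (Da); leucine and isoleucine share C6H13NO2.
METABOLITES = {"leucine": 131.094629, "isoleucine": 131.094629, "glucose": 180.063388}
ADDUCTS = {"[M+H]+": PROTON, "[M+Na]+": SODIUM}


def match_feature(mz, ppm_tol=5.0):
    """Return every (metabolite, adduct) whose theoretical m/z lies within ppm_tol."""
    hits = []
    for name, mass in METABOLITES.items():
        for adduct, shift in ADDUCTS.items():
            theoretical = mass + shift
            if abs(mz - theoretical) / theoretical * 1e6 <= ppm_tol:
                hits.append((name, adduct))
    return hits


if __name__ == "__main__":
    # One observed feature maps to two metabolites with the same composition.
    print(match_feature(132.1019))
```

A single feature can return several candidates, and a single metabolite can appear under several adducts, which is the matching uncertainty the abstract refers to.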
11.
Brief Bioinform ; 25(1)2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38113078

ABSTRACT

Single-cell chromatin accessibility sequencing (scCAS) technologies have enabled characterizing the epigenomic heterogeneity of individual cells. However, the identification of features of scCAS data that are relevant to underlying biological processes remains a significant gap. Here, we introduce a novel method, Cofea, to fill this gap. Through comprehensive experiments on 5 simulated and 54 real datasets, Cofea demonstrates its superiority in capturing cellular heterogeneity and facilitating downstream analysis. Applying this method to the identification of cell type-specific peaks and candidate enhancers, as well as to pathway enrichment analysis and partitioned heritability analysis, we illustrate the potential of Cofea to uncover functional biological processes.


Subjects
Chromatin; Regulatory Sequences, Nucleic Acid; Chromatin/genetics
12.
Brief Bioinform ; 24(5)2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37649385

ABSTRACT

Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on experimental costs and time, the tendency of proteins to crystallize can be initially determined and screened by modeling. To this end, this study created a new pipeline that uses protein sequence to predict protein crystallization propensity at the protein material production stage, the purification stage and the crystal production stage. The pipeline proposes a new feature selection method that combines Chi-square ($\chi^2$) and recursive feature elimination together with the 12 selected features, followed by linear discriminant analysis for dimensionality reduction. Finally, a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. This new pipeline has been tested on three different datasets, and its accuracy rates are higher than those of existing pipelines. In conclusion, our model provides a new solution for predicting multistage protein crystallization propensity, which is a major challenge in computational biology.


Subjects
Algorithms; Machine Learning; Crystallization; Amino Acid Sequence; Computational Biology
13.
Brief Bioinform ; 24(6)2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37889118

ABSTRACT

Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features, and identifying potential biomarkers is challenging due to the small number of samples in the data, method dependence and non-reproducibility. This paper proposes a novel ensemble feature selection method, named Filter and Wrapper Stacking Ensemble (FWSE), to identify reproducible biomarkers from high-dimensional omics data. In FWSE, filter feature selection methods are run on numerous subsets of the data to eliminate irrelevant features, and then wrapper feature selection methods are applied to rank the top features. The method was validated on four high-dimensional medical datasets related to mental illnesses and cancer. The results indicate that the features selected by FWSE are stable and statistically more significant than the ones obtained by existing methods while also demonstrating biological relevance. Furthermore, FWSE is a generic method, applicable to various high-dimensional datasets in the fields of machine intelligence and bioinformatics.


Subjects
Mental Disorders; Neoplasms; Humans; Algorithms; Artificial Intelligence; Biomarkers; Neoplasms/diagnosis; Neoplasms/genetics
14.
Brief Bioinform ; 25(1)2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38058187

ABSTRACT

The worldwide appearance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated significant concern and posed a considerable challenge to global health. Phosphorylation is a common post-translational modification that affects many vital cellular functions and is closely associated with SARS-CoV-2 infection. Precise identification of phosphorylation sites could provide more in-depth insight into the processes underlying SARS-CoV-2 infection and help alleviate the continuing COVID-19 crisis. Currently, available computational tools for predicting these sites lack accuracy and effectiveness. In this study, we designed an innovative meta-learning model, Meta-Learning for Serine/Threonine Phosphorylation (MeL-STPhos), to precisely identify protein phosphorylation sites. We initially performed a comprehensive assessment of 29 unique sequence-derived features, establishing prediction models for each using 14 renowned machine learning methods, ranging from traditional classifiers to advanced deep learning algorithms. We then selected the most effective model for each feature by integrating the predicted values. Rigorous feature selection strategies were employed to identify the optimal base models and classifier(s) for each cell-specific dataset. To the best of our knowledge, this is the first study to report two cell-specific models and a generic model for phosphorylation site prediction by utilizing an extensive range of sequence-derived features and machine learning algorithms. Extensive cross-validation and independent testing revealed that MeL-STPhos surpasses existing state-of-the-art tools for phosphorylation site prediction. We also developed a publicly accessible platform at https://balalab-skku.org/MeL-STPhos. We believe that MeL-STPhos will serve as a valuable tool for accelerating the discovery of serine/threonine phosphorylation sites and elucidating their role in post-translational regulation.


Subjects
COVID-19; SARS-CoV-2; Humans; Phosphorylation; SARS-CoV-2/metabolism; Serine/metabolism; Threonine/metabolism
15.
Brief Bioinform ; 24(3)2023 May 19.
Article in English | MEDLINE | ID: mdl-37150785

ABSTRACT

A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.


Subjects
Drosophila melanogaster; RNA Editing; Animals; Mice; Humans; Drosophila melanogaster/genetics; Drosophila melanogaster/metabolism; RNA/genetics; Adenosine/genetics; Adenosine/metabolism; Inosine/genetics; Inosine/metabolism
16.
Brief Bioinform ; 25(1)2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38205965

ABSTRACT

DNA methylation profiling is a useful tool to increase the accuracy of a cancer diagnosis. However, a comprehensive R package dedicated to it is lacking. Hence, we developed the R package methylClass for methylation-based classification. Within it, we provide the eSVM (ensemble-based support vector machine) model, which achieves much higher accuracy in methylation data classification than the popular random forest model and overcomes the long running time of the traditional SVM. In addition, some novel feature selection methods are included in the package to improve the classification. Furthermore, because methylation data can be converted to other omics, such as copy number variation data, we also provide functions for multi-omics studies. Testing on four datasets shows the accurate performance of our package, especially eSVM, which can be used in both methylation and multi-omics models and outperforms other methods in both cases. methylClass is available at: https://github.com/yuabrahamliu/methylClass.


Subjects
DNA Copy Number Variations; DNA Methylation; Protein Processing, Post-Translational; Support Vector Machine
17.
Brief Bioinform ; 24(5)2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37507115

ABSTRACT

Single cell RNA-sequencing (scRNA-seq) technology has significantly advanced the understanding of transcriptomic signatures. Although various statistical models have been used to describe the distribution of gene expression across cells, a comprehensive assessment of the different models is missing. Moreover, the growing number of features associated with scRNA-seq datasets creates new challenges for analytical accuracy and computing speed. Here, we developed a Python-based package (TensorZINB) to solve the zero-inflated negative binomial (ZINB) model using the TensorFlow deep learning framework. We used a sequential initialization method to solve the numerical stability issues associated with hurdle and zero-inflated models. A recursive feature selection protocol was used to optimize feature selections for data processing and downstream differentially expressed gene (DEG) analysis. We proposed a class of hybrid models combining nested models to further improve the model's performance. Additionally, we developed a new method to convert a continuous distribution to its equivalent discrete form, so that statistical models can be fairly compared. Finally, we showed that the proposed TensorFlow algorithm (TensorZINB) was numerically stable and that its computing speed and performance were superior to those of existing ZINB solvers. Moreover, we implemented seven hurdle and zero-inflated statistical models in Python and systematically assessed their performance using a real scRNA-seq dataset. We demonstrated that the ZINB model achieved the lowest Akaike information criterion compared with other models tested. Taken together, TensorZINB was accurate, efficient and scalable for the implementation of ZINB and for large-scale scRNA-seq data analysis with DEG identification.


Subjects
Gene Expression Profiling; Models, Statistical; Poisson Distribution; Gene Expression Profiling/methods; RNA; Sequence Analysis, RNA/methods
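
For readers unfamiliar with the zero-inflated negative binomial (ZINB) model named in the abstract above, the following Python sketch writes out its probability mass function and log-likelihood. It assumes the common mean/dispersion parameterization (variance = mu + mu^2/theta) and uses hypothetical function names; it is not the TensorZINB API and says nothing about how that package fits the model.

```python
import math


def nb_log_pmf(y, mu, theta):
    """Log-probability of count y under a negative binomial with mean mu, dispersion theta."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def zinb_pmf(y, mu, theta, pi):
    """ZINB probability: extra zeros with probability pi, NB counts otherwise."""
    nb = math.exp(nb_log_pmf(y, mu, theta))
    return pi + (1 - pi) * nb if y == 0 else (1 - pi) * nb


def zinb_log_likelihood(counts, mu, theta, pi):
    """Sum of log-probabilities for independent counts sharing one parameter set."""
    return sum(math.log(zinb_pmf(y, mu, theta, pi)) for y in counts)


if __name__ == "__main__":
    # Made-up parameters: mean 2, dispersion 1, 30% structural zeros.
    print(zinb_pmf(0, mu=2.0, theta=1.0, pi=0.3))
```

The log-likelihood from such a function is the quantity that enters a criterion like the Akaike information criterion when comparing candidate count models.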
18.
Brief Bioinform ; 24(2)2023 Mar 19.
Article in English | MEDLINE | ID: mdl-36754847

ABSTRACT

Feature gene selection has a significant impact on the performance of cell clustering in single-cell RNA sequencing (scRNA-seq) analysis. A well-rounded feature selection (FS) method should consider relevance, redundancy and complementarity of the features. Yet most existing FS methods focus on gene relevance to the cell types but neglect redundancy and complementarity, which undermines the cell clustering performance. We develop a novel computational method, GeneClust, to select feature genes for scRNA-seq cell clustering. GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. It can work as a plug-in tool for FS with any existing cell clustering method. Extensive benchmark results demonstrate that GeneClust significantly improves the clustering performance. Moreover, GeneClust can group cofunctional genes in biological processes and pathways into clusters, thus providing a means of investigating gene interactions and identifying potential genes relevant to biological characteristics of the dataset. GeneClust is freely available at https://github.com/ToryDeng/scGeneClust.


Subjects
Algorithms; Gene Expression Profiling; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Single-Cell Gene Expression Analysis; Single-Cell Analysis/methods; Cluster Analysis
19.
Methods ; 230: 147-157, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39191338

ABSTRACT

Epigenetics involves reversible modifications in gene expression without altering the genetic code itself. Among these modifications, histone deacetylases (HDACs) play a key role by removing acetyl groups from lysine residues on histones. Overexpression of HDACs is linked to the proliferation and survival of tumor cells. To combat this, HDAC inhibitors (HDACi) are commonly used in cancer treatments. However, pan-HDAC inhibition can lead to numerous side effects. Therefore, isoform-selective HDAC inhibitors, such as HDAC3i, could be advantageous for treating various medical conditions while minimizing off-target effects. To date, computational approaches that use only the SMILES notation without any experimental evidence have become increasingly popular and necessary for the initial discovery of novel potential therapeutic drugs. In this study, we develop an innovative and high-precision stacked-ensemble framework, called Stack-HDAC3i, which can directly identify HDAC3i using only the SMILES notation. Using an up-to-date benchmark dataset, we first employed both molecular descriptors and Mol2Vec embeddings to generate feature representations that cover multi-view information embedded in HDAC3i, such as structural and contextual information. Subsequently, these feature representations were used to train baseline models using nine popular ML algorithms. Finally, the probabilistic features derived from the selected baseline models were fused to construct the final stacked model. Both cross-validation and independent tests showed that Stack-HDAC3i is a high-accuracy prediction model with great generalization ability for identifying HDAC3i. Furthermore, in the independent test, Stack-HDAC3i achieved an accuracy of 0.926 and a Matthews correlation coefficient of 0.850, which are 0.44-6.11% and 0.83-11.90% higher than those of its constituent baseline models, respectively.


Subjects
Histone Deacetylase Inhibitors; Histone Deacetylases; Histone Deacetylase Inhibitors/pharmacology; Histone Deacetylase Inhibitors/chemistry; Histone Deacetylases/metabolism; Histone Deacetylases/genetics; Histone Deacetylases/chemistry; Humans; Machine Learning; Drug Discovery/methods
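
The Matthews correlation coefficient reported in the abstract above is computed from the four cells of a binary confusion matrix. A minimal Python sketch of the standard formula follows; the function name and the example counts are made up for illustration and are not the study's data.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up counts for a balanced test set of 100 compounds.
    print(round(matthews_corrcoef(tp=45, tn=40, fp=10, fn=5), 3))
```

Unlike accuracy, the coefficient ranges from -1 to 1 and uses all four cells, which is why it is often reported alongside accuracy for classifiers evaluated on imbalanced data.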
20.
BMC Biol ; 22(1): 167, 2024 Aug 07.
Article in English | MEDLINE | ID: mdl-39113021

ABSTRACT

BACKGROUND: Single-cell RNA sequencing enables studying cells individually, yet high gene dimensions and low cell numbers challenge analysis. Moreover, only a subset of the genes detected is involved in the biological processes underlying cell-type specific functions. RESULT: In this study, we present COMSE, an unsupervised feature selection framework using community detection to capture informative genes from scRNA-seq data. COMSE identified homogeneous cell substates with high resolution, as demonstrated by distinguishing different cell cycle stages. Evaluations based on real and simulated scRNA-seq datasets showed that COMSE outperformed other methods in cell clustering assignment, even with high dropout rates. We also demonstrate that by identifying communities of genes associated with batch effects, COMSE separates signals reflecting biological differences from noise arising from differences in sequencing protocols, thereby enabling integrated analysis of scRNA-seq datasets from different sources. CONCLUSIONS: COMSE provides an efficient unsupervised framework that selects highly informative genes in scRNA-seq data, improving cell substate identification and cell clustering. It identifies gene subsets that reveal biological and technical heterogeneity, supporting applications like batch effect correction and pathway analysis. It also provides robust results for bulk RNA-seq data analysis.


Subjects
RNA-Seq; Single-Cell Gene Expression Analysis; Animals; Humans; Mice; RNA-Seq/methods