Results 1 - 20 of 1,298
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38388682

ABSTRACT

Proteins play an important role in life activities and are the basic units for performing functions. Accurately annotating functions to proteins is crucial for understanding the intricate mechanisms of life and developing effective treatments for complex diseases. Traditional biological experiments struggle to keep pace with the growing number of known proteins. With the development of high-throughput sequencing technology, a wide variety of biological data provides the possibility to accurately predict protein functions by computational methods. Consequently, many computational methods have been proposed. Due to the diversity of application scenarios, it is necessary to conduct a comprehensive evaluation of these computational methods to determine the suitability of each algorithm for specific cases. In this study, we present a comprehensive benchmark, BeProf, to process data and evaluate representative computational methods. We first collect the latest datasets and analyze the data characteristics. Then, we investigate and summarize 17 state-of-the-art computational methods. Finally, we propose a novel comprehensive evaluation metric, design eight application scenarios and evaluate the performance of existing methods on these scenarios. Based on the evaluation, we provide practical recommendations for different scenarios, enabling users to select the most suitable method for their specific needs. All of these servers can be obtained from https://csuligroup.com/BEPROF and https://github.com/CSUBioGroup/BEPROF.


Subjects
Deep Learning , Benchmarking , Proteins , Algorithms , High-Throughput Nucleotide Sequencing
2.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39120646

ABSTRACT

Cell-type annotation is a critical step in single-cell data analysis. With the development of numerous cell annotation methods, it is necessary to evaluate these methods to help researchers use them effectively. Reference datasets are essential for evaluation, but currently, the cell labels of reference datasets mainly come from computational methods, which may have computational biases and may not reflect the actual cell-type outcomes. This study first constructed an experimentally labeled immune cell-subtype single-cell dataset of the same batch and systematically evaluated 18 cell annotation methods. We assessed those methods under five scenarios, including intra-dataset validation, immune cell-subtype validation, unsupervised clustering, inter-dataset annotation, and unknown cell-type prediction. Accuracy and the adjusted Rand index (ARI) were the evaluation metrics. The results showed that SVM, scBERT, and scDeepSort were the best-performing supervised methods. Seurat was the best-performing unsupervised clustering method, but it could not fully fit the actual cell-type distribution. Our results indicated that experimentally labeled immune cell-subtype datasets revealed the deficiencies of unsupervised clustering methods and provided new dataset support for supervised methods.
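
The two metrics named in this abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions of accuracy and the adjusted Rand index; the function names and toy labels are hypothetical and are not taken from the study's own evaluation code.

```python
from collections import Counter
from math import comb


def accuracy(labels_true, labels_pred):
    """Fraction of cells whose predicted label equals the reference label."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    matches = sum(t == p for t, p in zip(labels_true, labels_pred))
    return matches / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: the two labelings trivially agree
    # Pairs of cells grouped together in both labelings, in the reference, in the prediction.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both labelings
    return (same_both - expected) / (maximum - expected)


# Toy example: four cells, reference labels versus cluster assignments.
reference = ["T", "T", "B", "B"]
clusters = [0, 0, 1, 2]
print(adjusted_rand_index(reference, clusters))
```

Accuracy requires predicted labels in the same vocabulary as the reference, so it suits the supervised methods; ARI only compares groupings, which is why it also applies to unsupervised clustering output.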


Subjects
Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Cluster Analysis , Computational Biology/methods , Molecular Sequence Annotation , RNA-Seq/methods , Single-Cell Gene Expression Analysis
3.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39256200

ABSTRACT

Copy number variations (CNVs) play pivotal roles in disease susceptibility and have been intensively investigated in human disease studies. Long-read sequencing technologies offer opportunities for comprehensive structural variation (SV) detection, and numerous methodologies have been developed recently. Consequently, there is a pressing need to assess these methods and aid researchers in selecting appropriate techniques for CNV detection using long-read sequencing. Hence, we conducted an evaluation of eight CNV calling methods across 22 datasets from nine publicly available samples and 15 simulated datasets, covering multiple sequencing platforms. The overall performance of CNV callers varied substantially and was influenced by the input dataset type, sequencing depth, and CNV type, among others. Specifically, the PacBio CCS sequencing platform outperformed PacBio CLR and Nanopore platforms regarding CNV detection recall rates. A sequencing depth of 10x demonstrated the capability to identify 85% of the CNVs detected in a 50x dataset. Moreover, deletions were more generally detectable than duplications. Among the eight benchmarked methods, cuteSV, Delly, pbsv, and Sniffles2 demonstrated superior accuracy, while SVIM exhibited high recall rates.


Subjects
Algorithms , DNA Copy Number Variations , High-Throughput Nucleotide Sequencing , Humans , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Computational Biology/methods , Genome, Human
4.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39154193

ABSTRACT

Cell segmentation is a fundamental task in analyzing biomedical images. Many computational methods have been developed for cell segmentation and instance segmentation, but their performances are not well understood in various scenarios. We systematically evaluated the performance of 18 segmentation methods to perform cell nuclei and whole cell segmentation using light microscopy and fluorescence staining images. We found that general-purpose methods incorporating the attention mechanism exhibit the best overall performance. We identified various factors influencing segmentation performances, including image channels, choice of training data, and cell morphology, and evaluated the generalizability of methods across image modalities. We also provide guidelines for choosing the optimal segmentation methods in various real application scenarios. We developed Seggal, an online resource for downloading segmentation models already pre-trained with various tissue and cell types, substantially reducing the time and effort for training cell segmentation models.


Subjects
Image Processing, Computer-Assisted , Humans , Image Processing, Computer-Assisted/methods , Computational Biology/methods , Algorithms , Cell Nucleus
5.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38385879

ABSTRACT

Accurate prediction of antibody-antigen complex structures is pivotal in drug discovery, vaccine design and disease treatment and can facilitate the development of more effective therapies and diagnostics. In this work, we first review the antibody-antigen docking (ABAG-docking) datasets. Then, we present the creation and characterization of a comprehensive benchmark dataset of antibody-antigen complexes. We categorize the dataset based on docking difficulty, interface properties and structural characteristics, to provide a diverse set of cases for rigorous evaluation. Compared with Docking Benchmark 5.5, we have added 112 cases, including 14 single-domain antibody (sdAb) cases and 98 monoclonal antibody (mAb) cases, and also increased the proportion of Difficult cases. Our dataset contains diverse cases, including human/humanized antibodies, sdAbs, rodent antibodies and other types, opening the door to better algorithm development. Furthermore, we provide details on the process of building the benchmark dataset and introduce a pipeline for periodic updates to keep it up to date. We also utilize multiple complex prediction methods including ZDOCK, ClusPro, HDOCK and AlphaFold-Multimer for testing and analyzing this dataset. This benchmark serves as a valuable resource for evaluating and advancing docking computational methods in the analysis of antibody-antigen interaction, enabling researchers to develop more accurate and effective tools for predicting and designing antibody-antigen complexes. The non-redundant ABAG-docking structure benchmark dataset is available at https://github.com/Zhaonan99/Antibody-antigen-complex-structure-benchmark-dataset.


Subjects
Algorithms , Benchmarking , Humans , Antibodies, Monoclonal , Antibodies, Monoclonal, Humanized , Antigen-Antibody Complex
6.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38833322

ABSTRACT

Recent advances in tumor molecular subtyping have revolutionized precision oncology, offering novel avenues for patient-specific treatment strategies. However, a comprehensive and independent comparison of these subtyping methodologies remains unexplored. This study introduces 'Themis' (Tumor HEterogeneity analysis on Molecular subtypIng System), an evaluation platform that encapsulates a few representative tumor molecular subtyping methods, including Stemness, Anoikis, Metabolism, and pathway-based classifications, utilizing 38 test datasets curated from The Cancer Genome Atlas (TCGA) and significant studies. Our self-designed quantitative analysis uncovers the relative strengths, limitations, and applicability of each method in different clinical contexts. Crucially, Themis serves as a vital tool in identifying the most appropriate subtyping methods for specific clinical scenarios. It also guides fine-tuning existing subtyping methods to achieve more accurate phenotype-associated results. To demonstrate the practical utility, we apply Themis to a breast cancer dataset, showcasing its efficacy in selecting the most suitable subtyping methods for personalized medicine in various clinical scenarios. This study bridges a crucial gap in cancer research and lays a foundation for future advancements in individualized cancer therapy and patient management.


Subjects
Precision Medicine , Humans , Precision Medicine/methods , Neoplasms/genetics , Neoplasms/classification , Neoplasms/therapy , Biomarkers, Tumor/genetics , Computational Biology/methods , Medical Oncology/methods , Breast Neoplasms/genetics , Breast Neoplasms/classification , Breast Neoplasms/therapy , Female
7.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38605641

ABSTRACT

Simulation of RNA-seq reads is critical in the assessment, comparison, benchmarking and development of bioinformatics tools. Yet the field of RNA-seq simulators has progressed little in the last decade. To address this need we have developed BEERS2, which combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline. BEERS2 takes input transcripts (typically full-length messenger RNA transcripts with polyA tails) from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts in FASTQ, SAM or BAM format, with the SAM and BAM outputs containing the true alignment to the reference genome. It also produces true transcript-level quantification values. BEERS2 is designed to include the effects of polyA selection and RiboZero for ribosomal depletion, hexamer priming sequence biases, GC-content biases in polymerase chain reaction (PCR) amplification, barcode read errors and errors during PCR amplification. These characteristics combine to make BEERS2 the most complete simulation of RNA-seq to date. Finally, we demonstrate the use of BEERS2 by measuring the effect of several settings on the popular Salmon pseudoalignment algorithm.


Subjects
Genome , RNA , RNA-Seq , Sequence Analysis, RNA , Computer Simulation , RNA/genetics , High-Throughput Nucleotide Sequencing
8.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38517697

ABSTRACT

Non-coding variants associated with complex traits can alter the motifs of transcription factor (TF)-deoxyribonucleic acid binding. Although many computational models have been developed to predict the effects of non-coding variants on TF binding, their predictive power lacks systematic evaluation. Here we have evaluated 14 different models built on position weight matrices (PWMs), support vector machines, ordinary least squares and deep neural networks (DNNs), using large-scale in vitro (i.e. SNP-SELEX) and in vivo (i.e. allele-specific binding, ASB) TF binding data. Our results show that the accuracy of each model in predicting SNP effects in vitro significantly exceeds that achieved in vivo. For in vitro variant impact prediction, kmer/gkm-based machine learning methods (deltaSVM_HT-SELEX, QBiC-Pred) trained on in vitro datasets exhibit the best performance. For in vivo ASB variant prediction, DNN-based multitask models (DeepSEA, Sei, Enformer) trained on the ChIP-seq dataset exhibit relatively superior performance. Among the PWM-based methods, tRap demonstrates better performance in both in vitro and in vivo evaluations. In addition, we find that TF classes such as basic leucine zipper factors could be predicted more accurately, whereas those such as C2H2 zinc finger factors are predicted less accurately, aligning with the evolutionary conservation of these TF classes. We also underscore the significance of non-sequence factors such as cis-regulatory element type, TF expression, interactions and post-translational modifications in influencing the in vivo predictive performance of TFs. Our research provides valuable insights into selecting prioritization methods for non-coding variants and further optimizing such models.
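
As a rough illustration of how a PWM-based model can score a variant's effect on TF binding, the sketch below compares the best log-odds motif match overlapping the SNP in the reference and alternate sequences. It is a generic simplification under stated assumptions (uniform background, forward strand only, a made-up three-column matrix); it is not the scoring scheme of tRap or any other tool evaluated in the study, and all names are hypothetical.

```python
from math import log2

BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(pwm, site):
    """Log-odds score of one site; pwm is a list of {base: probability} columns."""
    return sum(log2(column[base] / BACKGROUND) for column, base in zip(pwm, site))


def best_overlapping_score(pwm, sequence, snp_pos):
    """Best score among motif windows that cover position snp_pos (0-based)."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    first = max(0, snp_pos - width + 1)
    last = min(snp_pos, len(sequence) - width)
    return max(log_odds(pwm, sequence[s:s + width]) for s in range(first, last + 1))


def delta_pwm_score(pwm, ref_seq, alt_seq, snp_pos):
    """Alternate-minus-reference score; negative values suggest weakened binding."""
    return (best_overlapping_score(pwm, alt_seq, snp_pos)
            - best_overlapping_score(pwm, ref_seq, snp_pos))


# Hypothetical motif preferring "ACG"; the SNP changes C to T at position 3.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
print(delta_pwm_score(toy_pwm, "TTACGTT", "TTATGTT", snp_pos=3))
```

A production scorer would also scan the reverse complement and use position-specific pseudocounts so that zero probabilities cannot occur.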


Subjects
Polymorphism, Single Nucleotide , Transcription Factors , Binding Sites/genetics , Protein Binding/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , DNA/genetics
9.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38985929

ABSTRACT

Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple 'omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature and no gold standard has emerged yet. In this work, we present a thorough comparison of a selection of six methods, representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As a non-integrative control, random forest was performed on concatenated and separated data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed better than or as well as their non-integrative counterpart. By contrast, DIABLO and the four random forest alternatives outperformed the others across the majority of simulation scenarios. The strengths and limitations of these methods are discussed in detail as well as guidelines for future applications.


Subjects
Computational Biology , Humans , Computational Biology/methods , Algorithms , Genomics/methods , Genomics/statistics & numerical data , Multiomics
10.
Mol Cell Proteomics ; 23(2): 100712, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38182042

ABSTRACT

Data-independent acquisition (DIA) mass spectrometry (MS) has emerged as a powerful technology for high-throughput, accurate, and reproducible quantitative proteomics. This review provides a comprehensive overview of recent advances in both the experimental and computational methods for DIA proteomics, from data acquisition schemes to analysis strategies and software tools. DIA acquisition schemes are categorized based on the design of precursor isolation windows, highlighting wide-window, overlapping-window, narrow-window, scanning quadrupole-based, and parallel accumulation-serial fragmentation-enhanced DIA methods. For DIA data analysis, major strategies are classified into spectrum reconstruction, sequence-based search, library-based search, de novo sequencing, and sequencing-independent approaches. A wide array of software tools implementing these strategies are reviewed, with details on their overall workflows and scoring approaches at different steps. The generation and optimization of spectral libraries, which are critical resources for DIA analysis, are also discussed. Publicly available benchmark datasets covering global proteomics and phosphoproteomics are summarized to facilitate performance evaluation of various software tools and analysis workflows. Continued advances and synergistic developments of versatile components in DIA workflows are expected to further enhance the power of DIA-based proteomics.


Subjects
Proteomics , Software , Proteomics/methods , Mass Spectrometry/methods , Gene Library , Proteome/analysis
11.
Mol Biol Evol ; 41(6)2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38860506

ABSTRACT

Phylogenetic inference based on protein sequence alignment is a widely used procedure. Numerous phylogenetic algorithms have been developed, most of which have many parameters and options. Choosing a program, options, and parameters can be a nontrivial task. No benchmark for comparison of phylogenetic programs on real protein sequences was publicly available. We have developed PhyloBench, a benchmark for evaluating the quality of phylogenetic inference, and used it to test a number of popular phylogenetic programs. PhyloBench is based on natural, not simulated, protein sequences of orthologous evolutionary domains. The measure of accuracy of an inferred tree is its distance to the corresponding species tree. A number of tree-to-tree distance measures were tested. The most reliable results were obtained using the Robinson-Foulds distance. Our results confirmed recent findings that distance methods are more accurate than maximum likelihood (ML) and maximum parsimony. We tested the Bayesian program MrBayes on natural protein sequences and found that, on our datasets, it performs better than ML, but worse than distance methods. Of the methods we tested, the Balanced Minimum Evolution method implemented in FastME yielded the best results on our material. Alignments and reference species trees are available at https://mouse.belozersky.msu.ru/tools/phylobench/ together with a web interface that allows for a semi-automatic comparison of a user's method with a number of popular programs.
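
The Robinson-Foulds distance mentioned in this abstract counts the bipartitions (splits) of the leaf set that appear in one tree but not the other. The sketch below is a minimal, unnormalized version for unrooted topologies; the nested-tuple tree encoding and the function names are assumptions chosen for illustration and are unrelated to PhyloBench's implementation.

```python
def leaf_set(tree):
    """Collect leaf names from a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the set of non-trivial splits induced by the internal edges of a tree."""
    all_leaves = leaf_set(tree)
    reference_leaf = min(all_leaves)  # used to pick one canonical side of each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        rest = all_leaves - clade
        if len(clade) > 1 and len(rest) > 1:
            # Store the side that does not contain the reference leaf.
            splits.add(rest if reference_leaf in clade else clade)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be defined on the same set of leaves")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Toy example: an inferred tree compared with a species tree on five taxa.
species_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(species_tree, inferred_tree))  # 2
```

Because splits are stored in a canonical form, the rooting of the nested tuples does not affect the result, which matches the usual treatment of inferred trees as unrooted.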


Subjects
Algorithms , Phylogeny , Software , Benchmarking , Sequence Alignment/methods , Bayes Theorem , Evolution, Molecular , Computational Biology/methods
12.
Brief Bioinform ; 25(1)2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38037235

ABSTRACT

OBJECTIVE: The performances of popular genome-wide association study (GWAS) models have not been examined yet in a consistent manner under the scenario of genetic admixture, which introduces several challenging aspects: heterogeneity of minor allele frequency (MAF), wide spectrum of case-control ratio, varying effect sizes, etc. METHODS: We generated a cohort of synthetic individuals (N = 19 234) that simulates (i) a large sample size; (ii) two-way admixture (Native American and European ancestry) and (iii) a binary phenotype. We then benchmarked three popular GWAS tools [generalized linear mixed model associated test (GMMAT), scalable and accurate implementation of generalized mixed model (SAIGE) and Tractor] by computing inflation factors and power calculations under different MAFs, case-control ratios, sample sizes and varying ancestry proportions. We also employed a cohort of Peruvians (N = 249) to further examine the performances of the testing models on (i) real genetic and phenotype data and (ii) small sample sizes. RESULTS: In the synthetic cohort, SAIGE performed better than GMMAT and Tractor in terms of type-I error rate, especially under severely unbalanced case-control ratios. On the contrary, power analysis identified Tractor as the best method to pinpoint ancestry-specific causal variants, but it showed decreased power when the effect size displayed limited heterogeneity between ancestries. In the Peruvian cohort, only Tractor identified two suggestive loci (P-value $\le 1 \times 10^{-5}$) associated with Native American ancestry. DISCUSSION: The current study illustrates best practice and limitations for available GWAS tools under the scenario of genetic admixture. Incorporating local ancestry in GWAS analyses boosts power, although careful consideration of complex scenarios (small sample sizes, imbalanced case-control ratio, MAF heterogeneity) is needed.
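
The abstract does not define its inflation factor, so the sketch below assumes the conventional genomic inflation factor: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median of the null distribution (about 0.455). The function name and the synthetic p-values are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """Median observed chi-square (1 df) statistic over its expected median under the null."""
    statistics_1df = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        # A two-sided p-value maps to a squared standard normal quantile.
        z = _STD_NORMAL.inv_cdf(p / 2.0)
        statistics_1df.append(z * z)
    return median(statistics_1df) / EXPECTED_MEDIAN_CHI2


# Evenly spaced p-values mimic a well-calibrated test; squaring them mimics inflation.
calibrated = [(i + 0.5) / 1000 for i in range(1000)]
inflated = [p ** 2 for p in calibrated]
print(genomic_inflation_factor(calibrated))
print(genomic_inflation_factor(inflated))
```

Values near 1 indicate well-controlled type-I error, while values clearly above 1 suggest confounding, for example from unmodeled population structure in an admixed cohort.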


Subjects
Benchmarking , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Gene Frequency , Phenotype , Sample Size , Polymorphism, Single Nucleotide
13.
Brief Bioinform ; 24(6)2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37742051

ABSTRACT

Single-base substitution (SBS) mutational signatures have become standard practice in cancer genomics. In lieu of de novo signature extraction, reference signature assignment allows users to estimate the activities of pre-established SBS signatures within individual malignancies. Several tools have been developed for this purpose, each with differing methodologies. However, due to a lack of standardization, there may be inter-tool variability in signature assignment. We deeply characterized three assignment strategies and five SBS signature assignment tools. We observed that assignment strategy choice can significantly influence results and interpretations. Despite varying recommendations by tools, Refit performed best by reducing overfitting and maximizing reconstruction of the original mutational spectra. Even after uniform application of Refit, tools varied remarkably in signature assignments both qualitatively (Jaccard index = 0.38-0.83) and quantitatively (Kendall tau-b = 0.18-0.76). This phenomenon was exacerbated for 'flat' signatures such as the homologous recombination deficiency signature SBS3. An ensemble approach (EnsembleFit), which leverages output from all five tools, increased SBS3 assignment accuracy in BRCA1/2-deficient breast carcinomas. After generating synthetic mutational profiles for thousands of pan-cancer tumors, EnsembleFit reduced signature activity assignment error by 15.9-24.7% on average using Catalogue of Somatic Mutations In Cancer and non-standard reference signature sets. We have also released the EnsembleFit web portal (https://www.ensemblefit.pittlabgenomics.com) for users to generate or download ensemble-based SBS signature assignments using any strategy and combination of tools. Overall, we show that signature assignment heterogeneity across tools and strategies is non-negligible and propose a viable, ensemble solution.
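
The two agreement measures quoted in this abstract can be sketched as follows. The Jaccard index compares which signatures two tools assign at all, and Kendall tau-b compares how they rank the activities while correcting for ties. How the study paired tools and samples is not stated here, so the dictionaries of per-signature activities below are purely hypothetical.

```python
from itertools import combinations
from math import sqrt


def jaccard_index(set_a, set_b):
    """Size of the intersection over size of the union of two sets."""
    set_a, set_b = set(set_a), set(set_b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty assignments agree completely
    return len(set_a & set_b) / len(set_a | set_b)


def kendall_tau_b(x, y):
    """Kendall rank correlation with the tau-b correction for tied pairs."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one input is constant")
    return (concordant - discordant) / denominator


# Hypothetical activities (mutation counts) assigned by two tools to one tumor.
tool_1 = {"SBS1": 120, "SBS3": 0, "SBS5": 300, "SBS40": 45}
tool_2 = {"SBS1": 100, "SBS3": 80, "SBS5": 310, "SBS40": 0}
signatures = sorted(tool_1)

assigned_1 = {s for s in signatures if tool_1[s] > 0}
assigned_2 = {s for s in signatures if tool_2[s] > 0}
print(jaccard_index(assigned_1, assigned_2))
print(kendall_tau_b([tool_1[s] for s in signatures], [tool_2[s] for s in signatures]))
```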


Subjects
BRCA1 Protein , BRCA2 Protein , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Mutation
14.
Brief Bioinform ; 24(5)2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37635383

ABSTRACT

RNA-binding proteins (RBPs) are central actors of RNA post-transcriptional regulation. Experiments to profile binding sites of RBPs in vivo are limited to transcripts expressed in the experimental cell type, creating the need for computational methods to infer missing binding information. While numerous machine-learning-based methods have been developed for this task, their use of heterogeneous training and evaluation datasets across different sets of RBPs and CLIP-seq protocols makes a direct comparison of their performance difficult. Here, we compile a set of 37 machine learning (primarily deep learning) methods for in vivo RBP-RNA interaction prediction and systematically benchmark a subset of 11 representative methods across hundreds of CLIP-seq datasets and RBPs. Using homogenized sample pre-processing and two negative-class sample generation strategies, we evaluate methods in terms of predictive performance and assess the impact of neural network architectures and input modalities on model performance. We believe that this study will not only enable researchers to choose the optimal prediction method for their tasks at hand, but also aid method developers in developing novel, high-performing methods by introducing a standardized framework for their evaluation.


Subjects
Benchmarking , Chromatin Immunoprecipitation Sequencing , Binding Sites , Machine Learning , RNA/genetics
15.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36592056

ABSTRACT

Circular RNAs (circRNAs) are covalently closed transcripts involved in critical regulatory axes, cancer pathways and disease mechanisms. CircRNA expression measured with RNA-seq has particular characteristics that might hamper the performance of standard biostatistical differential expression assessment methods (DEMs). We compared 38 DEM pipelines configured to fit circRNA expression data's statistical properties, including bulk RNA-seq, single-cell RNA-seq (scRNA-seq) and metagenomics DEMs. The DEMs performed poorly on data sets of typical size. Widely used DEMs, such as DESeq2, edgeR and Limma-Voom, gave scarce results, unreliable predictions or even contravened the expected behaviour with some parameter configurations. Limma-Voom achieved the most consistent performance throughout different benchmark data sets and, as well as SAMseq, reasonably balanced false discovery rate (FDR) and recall rate. Interestingly, a few scRNA-seq DEMs obtained results comparable with the best-performing bulk RNA-seq tools. Almost all DEMs' performance improved when increasing the number of replicates. CircRNA expression studies require careful design, choice of DEM and DEM configuration. This analysis can guide scientists in selecting the appropriate tools to investigate circRNA differential expression with RNA-seq experiments.
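
The balance between false discovery rate and recall described in this abstract can be made concrete with a small sketch. It assumes a simulated benchmark in which the truly differentially expressed circRNAs are known; the function name and identifiers are hypothetical.

```python
def fdr_and_recall(called, truly_de):
    """Empirical false discovery rate and recall of a set of differential-expression calls."""
    called, truly_de = set(called), set(truly_de)
    true_positives = len(called & truly_de)
    false_positives = len(called - truly_de)
    false_negatives = len(truly_de - called)
    # With no calls there are no false discoveries; with no true signal recall is set to 0.
    fdr = false_positives / (true_positives + false_positives) if called else 0.0
    recall = true_positives / (true_positives + false_negatives) if truly_de else 0.0
    return fdr, recall


# Toy example: four circRNAs called by a pipeline, three truly differentially expressed.
calls = {"circ1", "circ2", "circ3", "circ4"}
truth = {"circ1", "circ2", "circ5"}
print(fdr_and_recall(calls, truth))
```

A method that calls very few circRNAs can keep its FDR low while missing most true signals, which is why the two quantities are reported together.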


Subjects
Benchmarking , RNA, Circular , Benchmarking/methods , Sequence Analysis, RNA/methods , RNA-Seq , Metagenomics , RNA/genetics
16.
Brief Bioinform ; 24(5)2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37738402

ABSTRACT

Understanding the function of the human microbiome is important but the development of statistical methods specifically for the microbial gene expression (i.e. metatranscriptomics) is in its infancy. Many currently employed differential expression analysis methods have been designed for different data types and have not been evaluated in metatranscriptomics settings. To address this gap, we undertook a comprehensive evaluation and benchmarking of 10 differential analysis methods for metatranscriptomics data. We used a combination of real and simulated data to evaluate performance (i.e. type I error, false discovery rate and sensitivity) of the following methods: log-normal (LN), logistic-beta (LB), MAST, DESeq2, metagenomeSeq, ANCOM-BC, LEfSe, ALDEx2, Kruskal-Wallis and two-part Kruskal-Wallis. The simulation was informed by supragingival biofilm microbiome data from 300 preschool-age children enrolled in a study of childhood dental disease (early childhood caries, ECC), whereas validations were sought in two additional datasets from the ECC study and an inflammatory bowel disease study. The LB test showed the highest sensitivity in both small and large samples and reasonably controlled type I error. Contrarily, MAST was hampered by inflated type I error. Upon application of the LN and LB tests in the ECC study, we found that genes C8PHV7 and C8PEV7, harbored by the lactate-producing Campylobacter gracilis, had the strongest association with childhood dental disease. This comprehensive model evaluation offers practical guidance for selection of appropriate methods for rigorous analyses of differential expression in metatranscriptomics. Selection of an optimal method increases the possibility of detecting true signals while minimizing the chance of claiming false ones.


Subjects
Benchmarking , Stomatognathic Diseases , Child , Humans , Child, Preschool , Biofilms , Computer Simulation , Lactic Acid
17.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36549922

ABSTRACT

MOTIVATION: Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a valuable resource to learn cis-regulatory elements such as cell-type specific enhancers and transcription factor binding sites. However, cell-type identification of scATAC-seq data is known to be challenging due to the heterogeneity derived from different protocols and the high dropout rate. RESULTS: In this study, we perform a systematic comparison of seven scATAC-seq datasets of mouse brain to benchmark the efficacy of neuronal cell-type annotation from gene sets. We find that redundant marker genes give a dramatic improvement for a sparse scATAC-seq annotation across the data collected from different studies. Interestingly, simple aggregation of such marker genes achieves performance comparable to or higher than that of machine-learning classifiers, suggesting its potential for downstream applications. Based on our results, we reannotated all scATAC-seq data for detailed cell types using robust marker genes. Their meta scATAC-seq profiles are publicly available at https://gillisweb.cshl.edu/Meta_scATAC. Furthermore, we trained a deep neural network to predict chromatin accessibility from only DNA sequence and identified key motifs enriched for each neuronal subtype. Those predicted profiles are visualized together in our database as a valuable resource to explore cell-type specific epigenetic regulation in a sequence-dependent and -independent manner.


Subjects
Chromatin , Epigenesis, Genetic , Animals , Mice , Chromatin/genetics , Regulatory Sequences, Nucleic Acid , Neural Networks, Computer
18.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36575826

ABSTRACT

Drug response prediction is an important problem in personalized cancer therapy. Among various newly developed models, significant improvement in prediction performance has been reported using deep learning methods. However, systematic comparisons of deep learning methods, especially of the transferability from preclinical models to clinical cohorts, are currently lacking. To provide a more rigorous assessment, we benchmarked the performance of six representative deep learning methods for drug response prediction in multiple application scenarios using nine evaluation metrics, including the overall prediction accuracy, predictability of each drug, potential associated factors and transferability to clinical cohorts. Most methods show promising prediction within cell line datasets, and TGSA, with its lower time cost and better performance, is recommended. Although the performance metrics decrease when applying models trained on cell lines to patients, a certain amount of power to distinguish clinical response on some drugs can be maintained using CRDNN and TGSA. With these assessments, we provide guidance for researchers to choose appropriate methods, as well as insights into future directions for the development of more effective methods in clinical scenarios.


Subjects
Deep Learning , Humans , Cell Line
19.
Brief Bioinform ; 24(3)2023 May 19.
Article in English | MEDLINE | ID: mdl-37096588

ABSTRACT

Advances in single-cell transcriptomic technologies have led to increasing use of single-cell RNA sequencing (scRNA-seq) data in large-scale patient cohort studies. The resulting high-dimensional data can be summarized and incorporated into patient outcome prediction models in several ways; however, there is a pressing need to understand the impact of analytical decisions on such model quality. In this study, we evaluate the impact of analytical choices, namely model choices, ensemble learning strategies and integration approaches, on patient outcome prediction using five scRNA-seq COVID-19 datasets. First, we examine the difference in performance between using single-view feature space versus multi-view feature space. Next, we survey multiple learning platforms from classical machine learning to modern deep learning methods. Lastly, we compare different integration approaches when combining datasets is necessary. Through benchmarking such analytical combinations, our study highlights the power of ensemble learning, consistency among different learning methods and robustness to dataset normalization when using multiple datasets as the model input.


Subjects
Benchmarking , COVID-19 , Humans , Gene Expression Profiling , Machine Learning , Sequence Analysis, RNA/methods
20.
Methods ; 224: 1-9, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38295891

ABSTRACT

The Major Histocompatibility Complex (MHC) is a critical element of the vertebrate cellular immune system, responsible for presenting peptides derived from intracellular proteins. MHC-I presentation is pivotal in the immune response and holds considerable potential in the realms of vaccine development and cancer immunotherapy. This study delves into the limitations of current methods and benchmarks for MHC-I presentation. We introduce a novel benchmark designed to assess generalization properties and the reliability of models on unseen MHC molecules and peptides, with a focus on the Human Leukocyte Antigen (HLA), a specific subset of MHC genes present in humans. Finally, we introduce HLABERT, a pretrained language model that outperforms previous methods significantly on our benchmark and establishes a new state-of-the-art on existing benchmarks.


Subjects
Peptides , Proteins , Humans , Reproducibility of Results , Peptides/chemistry , Proteins/metabolism , Major Histocompatibility Complex/genetics , Protein Binding