Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 1.413
Filtrar
Más filtros

Intervalo de año de publicación
1.
Annu Rev Immunol ; 38: 123-145, 2020 04 26.
Artículo en Inglés | MEDLINE | ID: mdl-32045313

RESUMEN

Throughout the body, T cells monitor MHC-bound ligands expressed on the surface of essentially all cell types. MHC ligands that trigger a T cell immune response are referred to as T cell epitopes. Identifying such epitopes enables tracking, phenotyping, and stimulating T cells involved in immune responses in infectious disease, allergy, autoimmunity, transplantation, and cancer. The specific T cell epitopes recognized in an individual are determined by genetic factors such as the MHC molecules the individual expresses, in parallel to the individual's environmental exposure history. The complexity and importance of T cell epitope mapping have motivated the development of computational approaches that predict what T cell epitopes are likely to be recognized in a given individual or in a broader population. Such predictions guide experimental epitope mapping studies and enable computational analysis of the immunogenic potential of a given protein sequence region.


Asunto(s)
Epítopos de Linfocito T/inmunología , Linfocitos T/inmunología , Linfocitos T/metabolismo , Animales , Biomarcadores , Biología Computacional/métodos , Susceptibilidad a Enfermedades , Antígenos de Histocompatibilidad/inmunología , Humanos , Ligandos , Aprendizaje Automático , Unión Proteica
2.
Brief Bioinform ; 25(5)2024 Jul 25.
Artículo en Inglés | MEDLINE | ID: mdl-39129364

RESUMEN

Microsatellite instability (MSI) is a phenomenon seen in several cancer types, which can be used as a biomarker to help guide immune checkpoint inhibitor treatment. To facilitate this, researchers have developed computational tools to categorize samples as having high microsatellite instability, or as being microsatellite stable using next-generation sequencing data. Most of these tools were published with unclear scope and usage, and they have yet to be independently benchmarked. To address these issues, we assessed the performance of eight leading MSI tools across several unique datasets that encompass a wide variety of sequencing methods. While we were able to replicate the original findings of each tool on whole exome sequencing data, most tools had worse receiver operating characteristic and precision-recall area under the curve values on whole genome sequencing data. We also found that they lacked agreement with one another and with commercial MSI software on gene panel data, and that optimal threshold cut-offs vary by sequencing type. Lastly, we tested tools made specifically for RNA sequencing data and found they were outperformed by tools designed for use with DNA sequencing data. Out of all, two tools (MSIsensor2, MANTIS) performed well across nearly all datasets, but when all datasets were combined, their precision decreased. Our results caution that MSI tools can have much lower performance on datasets other than those on which they were originally evaluated, and in the case of RNA sequencing tools, can even perform poorly on the type of data for which they were created.


Asunto(s)
Biología Computacional , Inestabilidad de Microsatélites , Programas Informáticos , Humanos , Biología Computacional/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Neoplasias/genética , Secuenciación del Exoma/métodos
3.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38493343

RESUMEN

Recent advancements in single-cell sequencing technologies have generated extensive omics data in various modalities and revolutionized cell research, especially in the single-cell RNA and ATAC data. The joint analysis across scRNA-seq data and scATAC-seq data has paved the way to comprehending the cellular heterogeneity and complex cellular regulatory networks. Multi-omics integration is gaining attention as an important step in joint analysis, and the number of computational tools in this field is growing rapidly. In this paper, we benchmarked 12 multi-omics integration methods on three integration tasks via qualitative visualization and quantitative metrics, considering six main aspects that matter in multi-omics data analysis. Overall, we found that different methods have their own advantages on different aspects, while some methods outperformed other methods in most aspects. We therefore provided guidelines for selecting appropriate methods for specific scenarios and tasks to help obtain meaningful insights from multi-omics data integration.


Asunto(s)
Benchmarking , Multiómica , Algoritmos , Ciclo Celular , ARN
4.
Brief Bioinform ; 25(3)2024 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-38706320

RESUMEN

The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species-antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in the performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled for handling divergent genomes. Overall, Kover most frequently ranked top among the ML approaches, followed by PhenotypeSeeker and Seq2Geno2Pheno. AMR phenotypes for antibiotic classes such as macrolides and sulfonamides were predicted with the highest accuracies. The quality of predictions varied substantially across species-antibiotic combinations, particularly for beta-lactams; across species, resistance phenotyping of the beta-lactams compound, aztreonam, amoxicillin/clavulanic acid, cefoxitin, ceftazidime and piperacillin/tazobactam, alongside tetracyclines demonstrated more variable performance than the other benchmarked antibiotics. By organism, Campylobacter jejuni and Enterococcus faecium phenotypes were more robustly predicted than those of Escherichia coli, Staphylococcus aureus, Salmonella enterica, Neisseria gonorrhoeae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Acinetobacter baumannii, Streptococcus pneumoniae and Mycobacterium tuberculosis. In addition, our study provides software recommendations for each species-antibiotic combination. It furthermore highlights the need for optimization for robust clinical applications, particularly for strains that diverge substantially from those used for training.


Asunto(s)
Antibacterianos , Fenotipo , Antibacterianos/farmacología , Aprendizaje Automático , Farmacorresistencia Bacteriana/genética , Biología Computacional/métodos , Genoma Bacteriano , Genoma Microbiano , Humanos , Bacterias/genética , Bacterias/efectos de los fármacos
5.
Brief Bioinform ; 25(5)2024 Jul 25.
Artículo en Inglés | MEDLINE | ID: mdl-39082650

RESUMEN

This article provides an in-depth review of computational methods for predicting transcriptional regulators (TRs) with query gene sets. Identification of TRs is of utmost importance in many biological applications, including but not limited to elucidating biological development mechanisms, identifying key disease genes, and predicting therapeutic targets. Various computational methods based on next-generation sequencing (NGS) data have been developed in the past decade, yet no systematic evaluation of NGS-based methods has been offered. We classified these methods into two categories based on shared characteristics, namely library-based and region-based methods. We further conducted benchmark studies to evaluate the accuracy, sensitivity, coverage, and usability of NGS-based methods with molecular experimental datasets. Results show that BART, ChIP-Atlas, and Lisa have relatively better performance. Besides, we point out the limitations of NGS-based methods and explore potential directions for further improvement.


Asunto(s)
Biología Computacional , Secuenciación de Nucleótidos de Alto Rendimiento , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Biología Computacional/métodos , Humanos , Factores de Transcripción/genética , Factores de Transcripción/metabolismo , Regulación de la Expresión Génica
6.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38279646

RESUMEN

N6-methyladenosine (m6A) is the most abundant internal eukaryotic mRNA modification, and is involved in the regulation of various biological processes. Direct Nanopore sequencing of native RNA (dRNA-seq) emerged as a leading approach for its identification. Several software were published for m6A detection and there is a strong need for independent studies benchmarking their performance on data from different species, and against various reference datasets. Moreover, a computational workflow is needed to streamline the execution of tools whose installation and execution remains complicated. We developed NanOlympicsMod, a Nextflow pipeline exploiting containerized technology for comparing 14 tools for m6A detection on dRNA-seq data. NanOlympicsMod was tested on dRNA-seq data generated from in vitro (un)modified synthetic oligos. The m6A hits returned by each tool were compared to the m6A position known by design of the oligos. In addition, NanOlympicsMod was used on dRNA-seq datasets from wild-type and m6A-depleted yeast, mouse and human, and each tool's hits were compared to reference m6A sets generated by leading orthogonal methods. The performance of the tools markedly differed across datasets, and methods adopting different approaches showed different preferences in terms of precision and recall. Changing the stringency cut-offs allowed for tuning the precision-recall trade-off towards user preferences. Finally, we determined that precision and recall of tools are markedly influenced by sequencing depth, and that additional sequencing would likely reveal additional m6A sites. Thanks to the possibility of including novel tools, NanOlympicsMod will streamline the benchmarking of m6A detection tools on dRNA-seq data, improving future RNA modification characterization.


Asunto(s)
Adenina/análogos & derivados , Secuenciación de Nanoporos , Nanoporos , Humanos , Animales , Ratones , ARN/genética , Benchmarking , Análisis de Secuencia de ARN/métodos
7.
Brief Bioinform ; 25(4)2024 May 23.
Artículo en Inglés | MEDLINE | ID: mdl-38935068

RESUMEN

BACKGROUND: We present a novel simulation method for generating connected differential expression signatures. Traditional methods have struggled with the lack of reliable benchmarking data and biases in drug-disease pair labeling, limiting the rigorous benchmarking of connectivity-based approaches. OBJECTIVE: Our aim is to develop a simulation method based on a statistical framework that allows for adjustable levels of parametrization, especially the connectivity, to generate a pair of interconnected differential signatures. This could help to address the issue of benchmarking data availability for connectivity-based drug repurposing approaches. METHODS: We first detailed the simulation process and how it reflected real biological variability and the interconnectedness of gene expression signatures. Then, we generated several datasets to enable the evaluation of different existing algorithms that compare differential expression signatures, providing insights into their performance and limitations. RESULTS: Our findings demonstrate the ability of our simulation to produce realistic data, as evidenced by correlation analyses and the log2 fold-change distribution of deregulated genes. Benchmarking reveals that methods like extreme cosine similarity and Pearson correlation outperform others in identifying connected signatures. CONCLUSION: Overall, our method provides a reliable tool for simulating differential expression signatures. The data simulated by our tool encompass a wide spectrum of possibilities to challenge and evaluate existing methods to estimate connectivity scores. This may represent a critical gap in connectivity-based drug repurposing research because reliable benchmarking data are essential for assessing and advancing in the development of new algorithms. The simulation tool is available as a R package (General Public License (GPL) license) at https://github.com/cgonzalez-gomez/cosimu.


Asunto(s)
Algoritmos , Benchmarking , Simulación por Computador , Descubrimiento de Drogas , Descubrimiento de Drogas/métodos , Humanos , Perfilación de la Expresión Génica/métodos , Biología Computacional/métodos , Reposicionamiento de Medicamentos/métodos , Transcriptoma
8.
Brief Bioinform ; 25(4)2024 May 23.
Artículo en Inglés | MEDLINE | ID: mdl-38783705

RESUMEN

Tumor mutational signatures have gained prominence in cancer research, yet the lack of standardized methods hinders reproducibility and robustness. Leveraging colorectal cancer (CRC) as a model, we explored the influence of computational parameters on mutational signature analyses across 230 CRC cell lines and 152 CRC patients. Results were validated in three independent datasets: 483 endometrial cancer patients stratified by mismatch repair (MMR) status, 35 lung cancer patients by smoking status and 12 patient-derived organoids (PDOs) annotated for colibactin exposure. Assessing various bioinformatic tools, reference datasets and input data sizes including whole genome sequencing, whole exome sequencing and a pan-cancer gene panel, we demonstrated significant variability in the results. We report that the use of distinct algorithms and references led to statistically different results, highlighting how arbitrary choices may induce variability in the mutational signature contributions. Furthermore, we found a differential contribution of mutational signatures between coding and intergenic regions and defined the minimum number of somatic variants required for reliable mutational signature assignment. To facilitate the identification of the most suitable workflows, we developed Comparative Mutational Signature analysis on Coding and Extragenic Regions (CoMSCER), a bioinformatic tool which allows researchers to easily perform comparative mutational signature analysis by coupling the results from several tools and public reference datasets and to assess mutational signature contributions in coding and non-coding genomic regions. In conclusion, our study provides a comparative framework to elucidate the impact of distinct computational workflows on mutational signatures.


Asunto(s)
Neoplasias Colorrectales , Biología Computacional , Mutación , Humanos , Neoplasias Colorrectales/genética , Neoplasias Colorrectales/patología , Biología Computacional/métodos , Flujo de Trabajo , Línea Celular Tumoral , Secuenciación del Exoma/métodos , Femenino , Algoritmos
9.
RNA ; 29(12): 1839-1855, 2023 12.
Artículo en Inglés | MEDLINE | ID: mdl-37816550

RESUMEN

The tremendous rate with which data is generated and analysis methods emerge makes it increasingly difficult to keep track of their domain of applicability, assumptions, limitations, and consequently, of the efficacy and precision with which they solve specific tasks. Therefore, there is an increasing need for benchmarks, and for the provision of infrastructure for continuous method evaluation. APAeval is an international community effort, organized by the RNA Society in 2021, to benchmark tools for the identification and quantification of the usage of alternative polyadenylation (APA) sites from short-read, bulk RNA-sequencing (RNA-seq) data. Here, we reviewed 17 tools and benchmarked eight on their ability to perform APA identification and quantification, using a comprehensive set of RNA-seq experiments comprising real, synthetic, and matched 3'-end sequencing data. To support continuous benchmarking, we have incorporated the results into the OpenEBench online platform, which allows for continuous extension of the set of methods, metrics, and challenges. We envisage that our analyses will assist researchers in selecting the appropriate tools for their studies, while the containers and reproducible workflows could easily be deployed and extended to evaluate new methods or data sets.


Asunto(s)
Benchmarking , ARN , ARN/genética , RNA-Seq , Poliadenilación , Análisis de Secuencia de ARN/métodos
10.
Brief Bioinform ; 24(6)2023 09 22.
Artículo en Inglés | MEDLINE | ID: mdl-37974509

RESUMEN

Local genetic correlation evaluates the correlation of additive genetic effects between different traits across the same genetic variants at a genomic locus. It has been proven informative for understanding the genetic similarities of complex traits beyond that captured by global genetic correlation calculated across the whole genome. Several summary-statistics-based approaches have been developed for estimating local genetic correlation, including $\rho$-hess, SUPERGNOVA and LAVA. However, there has not been a comprehensive evaluation of these methods to offer practical guidelines on the choices of these methods. In this study, we conduct benchmark comparisons of the performance of these three methods through extensive simulation and real data analyses. We focus on two technical difficulties in estimating local genetic correlation: sample overlaps across traits and local linkage disequilibrium (LD) estimates when only the external reference panels are available. Our simulations suggest the likelihood of incorrectly identifying correlated regions and local correlation estimation accuracy are highly dependent on the estimation of the local LD matrix. These observations are corroborated by real data analyses of 31 complex traits. Overall, our findings illuminate the distinct results yielded by different methods applied in post-genome-wide association studies (post-GWAS) local correlation studies. We underscore the sensitivity of local genetic correlation estimates and inferences to the precision of local LD estimation. These observations accentuate the vital need for ongoing refinement in methodologies.


Asunto(s)
Benchmarking , Estudio de Asociación del Genoma Completo , Estudio de Asociación del Genoma Completo/métodos , Polimorfismo de Nucleótido Simple , Fenotipo , Simulación por Computador , Desequilibrio de Ligamiento
11.
Brief Bioinform ; 24(1)2023 01 19.
Artículo en Inglés | MEDLINE | ID: mdl-36611255

RESUMEN

Accurate in silico prediction of conformational B-cell epitopes would lead to major improvements in disease diagnostics, drug design and vaccine development. A variety of computational methods, mainly based on machine learning approaches, have been developed in the last decades to tackle this challenging problem. Here, we rigorously benchmarked nine state-of-the-art conformational B-cell epitope prediction webservers, including generic and antibody-specific methods, on a dataset of over 250 antibody-antigen structures. The results of our assessment and statistical analyses show that all the methods achieve very low performances, and some do not perform better than randomly generated patches of surface residues. In addition, we also found that commonly used consensus strategies that combine the results from multiple webservers are at best only marginally better than random. Finally, we applied all the predictors to the SARS-CoV-2 spike protein as an independent case study, and showed that they perform poorly in general, which largely recapitulates our benchmarking conclusions. We hope that these results will lead to greater caution when using these tools until the biases and issues that limit current methods have been addressed, promote the use of state-of-the-art evaluation methodologies in future publications and suggest new strategies to improve the performance of conformational B-cell epitope prediction methods.


Asunto(s)
Epítopos de Linfocito B , Glicoproteína de la Espiga del Coronavirus , Humanos , Biología Computacional/métodos , Epítopos de Linfocito B/inmunología , SARS-CoV-2 , Glicoproteína de la Espiga del Coronavirus/inmunología
12.
Brief Bioinform ; 24(1)2023 01 19.
Artículo en Inglés | MEDLINE | ID: mdl-36545804

RESUMEN

Monoclonal antibodies are biotechnologically produced proteins with various applications in research, therapeutics and diagnostics. Their ability to recognize and bind to specific molecule structures makes them essential research tools and therapeutic agents. Sequence information of antibodies is helpful for understanding antibody-antigen interactions and ensuring their affinity and specificity. De novo protein sequencing based on mass spectrometry is a valuable method to obtain the amino acid sequence of peptides and proteins without a priori knowledge. In this study, we evaluated six recently developed de novo peptide sequencing algorithms (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo), which were not specifically designed for antibody data. We validated their ability to identify and assemble antibody sequences on three multi-enzymatic data sets. The deep learning-based tools Casanovo and PointNovo showed an increased peptide recall across different enzymes and data sets compared with spectrum-graph-based approaches. We evaluated different error types of de novo peptide sequencing tools and their performance for different numbers of missing cleavage sites, noisy spectra and peptides of various lengths. We achieved a sequence coverage of 97.69-99.53% on the light chains of three different antibody data sets using the de Bruijn assembler ALPS and the predictions from Casanovo. However, low sequence coverage and accuracy on the heavy chains demonstrate that complete de novo protein sequencing remains a challenging issue in proteomics that requires improved de novo error correction, alternative digestion strategies and hybrid approaches such as homology search to achieve high accuracy on long protein sequences.


Asunto(s)
Anticuerpos Monoclonales , Péptidos , Secuencia de Aminoácidos , Anticuerpos Monoclonales/genética , Péptidos/genética , Péptidos/química , Algoritmos , Análisis de Secuencia de Proteína/métodos
13.
Brief Bioinform ; 24(2)2023 03 19.
Artículo en Inglés | MEDLINE | ID: mdl-36847692

RESUMEN

Single-cell ribonucleic acid (RNA)-sequencing (scRNA-seq) is a powerful tool to study cellular heterogeneity. The high dimensional data generated from this technology are complex and require specialized expertise for analysis and interpretation. The core of scRNA-seq data analysis contains several key analytical steps, which include pre-processing, quality control, normalization, dimensionality reduction, integration and clustering. Each step often has many algorithms developed with varied underlying assumptions and implications. With such a diverse choice of tools available, benchmarking analyses have compared their performances and demonstrated that tools operate differentially according to the data types and complexity. Here, we present Integrated Benchmarking scRNA-seq Analytical Pipeline (IBRAP), which contains a suite of analytical components that can be interchanged throughout the pipeline alongside multiple benchmarking metrics that enable users to compare results and determine the optimal pipeline combinations for their data. We apply IBRAP to single- and multi-sample integration analysis using primary pancreatic tissue, cancer cell line and simulated data accompanied with ground truth cell labels, demonstrating the interchangeable and benchmarking functionality of IBRAP. Our results confirm that the optimal pipelines are dependent on individual samples and studies, further supporting the rationale and necessity of our tool. We then compare reference-based cell annotation with unsupervised analysis, both included in IBRAP, and demonstrate the superiority of the reference-based method in identifying robust major and minor cell types. Thus, IBRAP presents a valuable tool to integrate multiple samples and studies to create reference maps of normal and diseased tissues, facilitating novel biological discovery using the vast volume of scRNA-seq data available.


Asunto(s)
Benchmarking , Programas Informáticos , Análisis de Secuencia de ARN/métodos , Análisis de la Célula Individual/métodos , Algoritmos , Perfilación de la Expresión Génica/métodos
14.
Brief Bioinform ; 24(2)2023 03 19.
Artículo en Inglés | MEDLINE | ID: mdl-36804804

RESUMEN

Recent technological and computational advances have made metagenomic assembly a viable approach to achieving high-resolution views of complex microbial communities. In previous benchmarking, short-read (SR) metagenomic assemblers had the highest accuracy, long-read (LR) assemblers generated the most contiguous sequences and hybrid (HY) assemblers balanced length and accuracy. However, no assessments have specifically compared the performance of these assemblers on low-abundance species, which include clinically relevant organisms in the gut. We generated semi-synthetic LR and SR datasets by spiking small and increasing amounts of Escherichia coli isolate reads into fecal metagenomes and, using different assemblers, examined E. coli contigs and the presence of antibiotic resistance genes (ARGs). For ARG assembly, although SR assemblers recovered more ARGs with high accuracy, even at low coverages, LR assemblies allowed for the placement of ARGs within longer, E. coli-specific contigs, thus pinpointing their taxonomic origin. HY assemblies identified resistance genes with high accuracy and had lower contiguity than LR assemblies. Each assembler type's strengths were maintained even when our isolate was spiked in with a competing strain, which fragmented and reduced the accuracy of all assemblies. For strain characterization and determining gene context, LR assembly is optimal, while for base-accurate gene identification, SR assemblers outperform other options. HY assembly offers contiguity and base accuracy, but requires generating data on multiple platforms, and may suffer high misassembly rates when strain diversity exists. Our results highlight the trade-offs associated with each approach for recovering low-abundance taxa, and that the optimal approach is goal-dependent.


Asunto(s)
Metagenoma , Microbiota , Análisis de Secuencia de ADN/métodos , Escherichia coli/genética , Microbiota/genética , Metagenómica/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos
15.
Brief Bioinform ; 24(6)2023 09 22.
Artículo en Inglés | MEDLINE | ID: mdl-37824737

RESUMEN

Transcriptome sequencing has become common in cancer research, resulting in the generation of a substantial volume of RNA sequencing (RNA-Seq) data. The ability to extract immune repertoires from these data is crucial for obtaining information on infiltrating T- and B-lymphocyte clones when dedicated amplicon T-cell/B-cell receptors sequencing (TCR-Seq/BCR-Seq) methods are unavailable. In response to this demand, several dedicated computational methods have been developed, including MiXCR, TRUST and ImRep. In the recent publication in Briefings in Bioinformatics, Peng et al. have conducted an intensive, systematic comparison of the three previously mentioned tools. Although their effort is commendable, we do have a few constructive critiques regarding technical elements of their analysis.


Asunto(s)
Benchmarking , Neoplasias , Humanos , Neoplasias/genética , Linfocitos B , Receptores de Antígenos de Linfocitos T/genética , Análisis de Secuencia de ARN
16.
Brief Bioinform ; 24(3)2023 05 19.
Artículo en Inglés | MEDLINE | ID: mdl-37020334

RESUMEN

RNA alternative splicing, a post-transcriptional stage in eukaryotes, is crucial in cellular homeostasis and disease processes. Due to the rapid development of the next-generation sequencing (NGS) technology and the flood of NGS data, the detection of differential splicing from RNA-seq data has become mainstream. A range of bioinformatic tools has been developed. However, until now, an independent and comprehensive comparison of available algorithms/tools at the event level is still lacking. Here, 21 different tools are subjected to systematic evaluation, based on simulated RNA-seq data where exact differential splicing events are introduced. We observe immense discrepancies among these tools. SUPPA, DARTS, rMATS and LeafCutter outperforme other event-based tools. We also examine the abilities of the tools to identify novel splicing events, which shows that most event-based tools are unsuitable for discovering novel splice sites. To improve the overall performance, we present two methodological approaches i.e. low-expression transcript filtering and tool-pair combination. Finally, a new protocol of selecting tools to perform differential splicing analysis for different analytical tasks (e.g. precision and recall rate) is proposed. Under this protocol, we analyze the distinct splicing landscape in the DUX4/IGH subgroup of B-cell acute lymphoblastic leukemia and uncover the differential splicing of TCF12. All codes needed to reproduce the results are available at https://github.com/mhjiang97/Benchmarking_DS.


Asunto(s)
Benchmarking , Programas Informáticos , RNA-Seq , Análisis de Secuencia de ARN/métodos , Empalme del ARN , Empalme Alternativo
17.
Brief Bioinform ; 24(3)2023 05 19.
Artículo en Inglés | MEDLINE | ID: mdl-37096592

RESUMEN

Since the 1980s, dozens of computational methods have addressed the problem of predicting RNA secondary structure. Among them are those that follow standard optimization approaches and, more recently, machine learning (ML) algorithms. The former were repeatedly benchmarked on various datasets. The latter, on the other hand, have not yet undergone extensive analysis that could suggest to the user which algorithm best fits the problem to be solved. In this review, we compare 15 methods that predict the secondary structure of RNA, of which 6 are based on deep learning (DL), 3 on shallow learning (SL) and 6 control methods on non-ML approaches. We discuss the ML strategies implemented and perform three experiments in which we evaluate the prediction of (I) representatives of the RNA equivalence classes, (II) selected Rfam sequences and (III) RNAs from new Rfam families. We show that DL-based algorithms (such as SPOT-RNA and UFold) can outperform SL and traditional methods if the data distribution is similar in the training and testing set. However, when predicting 2D structures for new RNA families, the advantage of DL is no longer clear, and its performance is inferior or equal to that of SL and non-ML methods.


Asunto(s)
Aprendizaje Automático , ARN , Humanos , ARN/genética , ARN/química , Algoritmos , Benchmarking
18.
Brief Bioinform ; 25(1)2023 11 22.
Artículo en Inglés | MEDLINE | ID: mdl-38113074

RESUMEN

Optimizing and benchmarking data reduction methods for dynamic or spatial visualization and interpretation (DSVI) face challenges due to many factors, including data complexity, lack of ground truth, time-dependent metrics, dimensionality bias and different visual mappings of the same data. Current studies often focus on independent static visualization or interpretability metrics that require ground truth. To overcome this limitation, we propose the MIBCOVIS framework, a comprehensive and interpretable benchmarking and computational approach. MIBCOVIS enhances the visualization and interpretability of high-dimensional data without relying on ground truth by integrating five robust metrics, including a novel time-ordered Markov-based structural metric, into a semi-supervised hierarchical Bayesian model. The framework assesses method accuracy and considers interaction effects among metric features. We apply MIBCOVIS using linear and nonlinear dimensionality reduction methods to evaluate optimal DSVI for four distinct dynamic and spatial biological processes captured by three single-cell data modalities: CyTOF, scRNA-seq and CODEX. These data vary in complexity based on feature dimensionality, unknown cell types and dynamic or spatial differences. Unlike traditional single-summary score approaches, MIBCOVIS compares accuracy distributions across methods. Our findings underscore the joint evaluation of visualization and interpretability, rather than relying on separate metrics. We reveal that prioritizing average performance can obscure method feature performance. Additionally, we explore the impact of data complexity on visualization and interpretability. Specifically, we provide optimal parameters and features and recommend methods, like the optimized variational contractive autoencoder, for targeted DSVI for various data complexities. MIBCOVIS shows promise for evaluating dynamic single-cell atlases and spatiotemporal data reduction models.


Asunto(s)
Benchmarking , Análisis de la Célula Individual , Teorema de Bayes , Análisis de la Célula Individual/métodos
19.
Mol Syst Biol ; 20(2): 75-97, 2024 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-38225382

RESUMEN

Structural resolution of protein interactions enables mechanistic and functional studies as well as interpretation of disease variants. However, structural data is still missing for most protein interactions because we lack computational and experimental tools at scale. This is particularly true for interactions mediated by short linear motifs occurring in disordered regions of proteins. We find that AlphaFold-Multimer predicts with high sensitivity but limited specificity structures of domain-motif interactions when using small protein fragments as input. Sensitivity decreased substantially when using long protein fragments or full length proteins. We delineated a protein fragmentation strategy particularly suited for the prediction of domain-motif interfaces and applied it to interactions between human proteins associated with neurodevelopmental disorders. This enabled the prediction of highly confident and likely disease-related novel interfaces, which we further experimentally corroborated for FBXO23-STX1B, STX1B-VAMP2, ESRRG-PSMC5, PEX3-PEX19, PEX3-PEX16, and SNRPB-GIGYF1 providing novel molecular insights for diverse biological processes. Our work highlights exciting perspectives, but also reveals clear limitations and the need for future developments to maximize the power of Alphafold-Multimer for interface predictions.


Asunto(s)
Proteínas Portadoras , Proteínas , Humanos , Proteínas/metabolismo , Proteínas de la Membrana/metabolismo
20.
BMC Bioinformatics ; 25(1): 213, 2024 Jun 13.
Artículo en Inglés | MEDLINE | ID: mdl-38872097

RESUMEN

BACKGROUND: Automated hypothesis generation (HG) focuses on uncovering hidden connections within the extensive information that is publicly available. This domain has become increasingly popular, thanks to modern machine learning algorithms. However, the automated evaluation of HG systems is still an open problem, especially on a larger scale. RESULTS: This paper presents a novel benchmarking framework Dyport for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This not only assesses hypotheses accuracy but also their potential impact in biomedical research which significantly extends traditional link prediction benchmarks. Applicability of our benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. CONCLUSIONS: Dyport is an open-source benchmarking framework designed for biomedical hypothesis generation systems evaluation, which takes into account knowledge dynamics, semantics and impact. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport .


Asunto(s)
Benchmarking , Benchmarking/métodos , Algoritmos , Investigación Biomédica/métodos , Programas Informáticos , Aprendizaje Automático , Bases de Datos Factuales , Biología Computacional/métodos , Semántica
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA