1.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38797968

ABSTRACT

A major challenge of precision oncology is the identification and prioritization of suitable treatment options based on molecular biomarkers of the considered tumor. In pursuit of this goal, large cancer cell line panels have successfully been studied to elucidate the relationship between cellular features and treatment response. Due to the high dimensionality of these datasets, machine learning (ML) is commonly used for their analysis. However, choosing a suitable algorithm and set of input features can be challenging. We performed a comprehensive benchmarking of ML methods and dimension reduction (DR) techniques for predicting drug response metrics. Using the Genomics of Drug Sensitivity in Cancer cell line panel, we trained random forests, neural networks, boosting trees and elastic nets for 179 anti-cancer compounds with feature sets derived from nine DR approaches. We compare the results regarding statistical performance, runtime and interpretability. Additionally, we provide strategies for assessing model performance compared with a simple baseline model and measuring the trade-off between models of different complexity. Lastly, we show that complex ML models benefit from using an optimized DR strategy, and that standard models, even when using considerably fewer features, can still be superior in performance.


Subject(s)
Algorithms; Antineoplastic Agents; Benchmarking; Machine Learning; Humans; Antineoplastic Agents/pharmacology; Antineoplastic Agents/therapeutic use; Neoplasms/drug therapy; Neoplasms/genetics; Neural Networks, Computer; Cell Line, Tumor
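
The abstract above mentions assessing model performance against a simple baseline model. A minimal sketch of one such comparison follows, assuming the baseline always predicts the mean training response and that error is measured by mean squared error; both choices, the function names and the numbers are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted responses."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def improvement_over_baseline(y_train, y_test, model_pred):
    """Relative MSE reduction of a model versus a mean-predicting baseline.

    The baseline (an assumption of this sketch) always predicts the mean
    training response. A value of 0 means the model is no better than the
    baseline; a value of 1 means a perfect fit on the test data.
    """
    baseline_pred = [mean(y_train)] * len(y_test)
    baseline_mse = mse(y_test, baseline_pred)
    return 1.0 - mse(y_test, model_pred) / baseline_mse


# Made-up drug response values, for illustration only.
y_train = [2.0, 3.0, 4.0, 3.0]
y_test = [2.0, 4.0]
model_pred = [2.5, 3.5]
score = improvement_over_baseline(y_train, y_test, model_pred)
```

A score near zero would suggest that a complex model adds little over the trivial predictor, whatever its absolute error looks like.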
2.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38385873

ABSTRACT

Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function, macrophage polarization and nervous system regulation, and has received widespread attention due to the Warburg effect in tumor cells. In this work, we first design a natural language processing method to automatically extract the 3D structural features of Kla sites, avoiding potential biases caused by manually designed structural features. Then, we establish two Kla prediction frameworks, the attention-based feature fusion Kla model (ABFF-Kla) and the embedding-based feature fusion Kla model (EBFF-Kla), to integrate the sequence features and the structure features based on the attention layer and embedding layer, respectively. The results indicate that ABFF-Kla and EBFF-Kla, which fuse features from protein sequences and spatial structures, have better predictive performance than models that use only sequence features. Our work provides an approach for the automatic extraction of protein structural features, as well as a flexible framework for Kla prediction. The source code and the training data of ABFF-Kla and EBFF-Kla are publicly deposited at: https://github.com/ispotato/Lactylation_model.


Subject(s)
Lysine; Natural Language Processing; Amino Acid Sequence; Protein Domains; Protein Processing, Post-Translational
3.
Brief Bioinform ; 24(5)2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37594311

ABSTRACT

Transmembrane proteins are receptors, enzymes, transporters and ion channels that are instrumental in regulating a variety of cellular activities, such as signal transduction and cell communication. Despite tremendous progress in computational capacities to support protein research, there is still a significant gap in the availability of specialized computational analysis toolkits for transmembrane protein research. Here, we introduce TMKit, an open-source Python programming interface that is modular, scalable and specifically designed for processing transmembrane protein data. TMKit is a one-stop computational analysis tool for transmembrane proteins, enabling users to perform database wrangling, engineer features at the mutational, domain and topological levels, and visualize protein-protein interaction interfaces. In addition, TMKit includes seqNetRR, a high-performance computing library that allows customized construction of a large number of residue connections. This library is particularly well suited for quickly assigning correlation matrix-based features. TMKit should serve as a useful tool for researchers studying transmembrane protein sequences and structures. TMKit is publicly available through https://github.com/2003100127/tmkit and https://tmkit-guide.herokuapp.com/doc/overview.


Subject(s)
Computational Biology; Software; Membrane Proteins/genetics; Amino Acid Sequence; Gene Library
4.
Brief Bioinform ; 24(4)2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37328692

ABSTRACT

Protein complexes are key functional units in cellular processes. High-throughput techniques, such as co-fractionation coupled with mass spectrometry (CF-MS), have advanced protein complex studies by enabling global interactome inference. However, dealing with complex fractionation characteristics to define true interactions is not a simple task, since CF-MS is prone to false positives due to the co-elution of non-interacting proteins by chance. Several computational methods have been designed to analyze CF-MS data and construct probabilistic protein-protein interaction (PPI) networks. Current methods usually first infer PPIs based on handcrafted CF-MS features, and then use clustering algorithms to form potential protein complexes. While powerful, these methods have two weaknesses: handcrafted features based on domain knowledge might introduce bias, and the methods tend to overfit due to the severely imbalanced PPI data. To address these issues, we present a balanced end-to-end learning architecture, Software for Prediction of Interactome with Feature-extraction Free Elution Data (SPIFFED), to integrate feature representation from raw CF-MS data and interactome prediction by a convolutional neural network. SPIFFED outperforms the state-of-the-art methods in predicting PPIs under conventional imbalanced training. When trained with balanced data, SPIFFED had greatly improved sensitivity for true PPIs. Moreover, the ensemble SPIFFED model provides different voting schemes to integrate predicted PPIs from multiple CF-MS datasets. Using clustering software (i.e. ClusterONE), SPIFFED allows users to infer high-confidence protein complexes depending on the CF-MS experimental designs. The source code of SPIFFED is freely available at: https://github.com/bio-it-station/SPIFFED.


Subject(s)
Protein Interaction Mapping; Proteins; Protein Interaction Mapping/methods; Proteins/chemistry; Algorithms; Protein Interaction Maps; Software
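
The abstract above says that an ensemble model offers different voting schemes for integrating PPIs predicted from multiple CF-MS datasets, without naming the schemes. The sketch below illustrates the general idea with three common thresholds (any, majority, unanimous). The function `vote`, the threshold choices and the protein names are hypothetical and should not be read as SPIFFED's actual schemes or API.

```python
from collections import Counter


def vote(predictions, min_votes):
    """Keep pairs predicted as interacting in at least `min_votes` experiments.

    `predictions` is a list of sets; each set holds the protein pairs
    predicted as interacting from one experiment.
    """
    counts = Counter(pair for pairs in predictions for pair in pairs)
    return {pair for pair, n in counts.items() if n >= min_votes}


# Pairs are frozensets so that (A, B) and (B, A) count as the same pair.
exp1 = {frozenset(("A", "B")), frozenset(("B", "C"))}
exp2 = {frozenset(("B", "A")), frozenset(("C", "D"))}
exp3 = {frozenset(("A", "B")), frozenset(("B", "C"))}
experiments = [exp1, exp2, exp3]

union = vote(experiments, min_votes=1)                      # any experiment
majority = vote(experiments, min_votes=2)                   # at least two of three
unanimous = vote(experiments, min_votes=len(experiments))   # all experiments
```

Raising `min_votes` trades sensitivity for confidence: the unanimous set is the smallest and the union is the largest.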
5.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36528802

ABSTRACT

Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, both type-specific and generalized modification predictors show limited performance across multiple species, mainly due to the use of ineffective sequence encoding methods. This paper presents a generalized computational approach, "DNA-MP", that predicts three different DNA modifications more precisely across multiple species. The proposed DNA-MP approach uses a powerful encoding method, "position specific nucleotides occurrence based on modification and non-modification class densities normalized difference" (POCD-ND), to generate statistical representations of DNA sequences, and a deep forest classifier for modification prediction. The POCD-ND encoder generates statistical representations by extracting position-specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with the 32 most widely used encoding methods on 17 benchmark DNA modification prediction datasets of 12 different species using 10 different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms the 32 existing encoders. Furthermore, across the 5-fold cross-validation benchmark datasets and independent test sets combined, the proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modification predictors by an average accuracy of 7% across 4mC datasets, 1.35% across 5hmC datasets and 10% across 6mA datasets. The DNA-MP web application is available to the scientific community at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.


Subject(s)
Epigenesis, Genetic; Machine Learning; Software; Nucleotides; DNA/genetics
6.
Methods ; 226: 127-132, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38604414

ABSTRACT

Protein lysine methylation is a type of post-translational modification that plays an important role in the regulation of both histone and non-histone protein function. Deregulation caused by lysine methyltransferases has been identified as the cause of several diseases, including cancer as well as mental and developmental disorders. Identifying lysine methylation sites is a critical step in both early diagnosis and drug design. This study proposes a new machine learning method called CNN-Meth for predicting lysine methylation sites using a convolutional neural network (CNN). Our model is trained using evolutionary, structural, and physicochemical-based representations along with binary encoding. Unlike previous studies, instead of extracting handcrafted features, we use the CNN to automatically extract features from different representations of amino acids to avoid information loss. Automated feature extraction from these representations of amino acids, as well as a CNN as the classifier, has never been used for this problem. Our results demonstrate that CNN-Meth can significantly outperform previous methods for predicting methylation sites. It achieves 96.0%, 85.1%, 96.4%, and 0.65 in terms of accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC), respectively. CNN-Meth and its source code are publicly available at https://github.com/MLBC-lab/CNN-Meth.


Subject(s)
Lysine; Neural Networks, Computer; Lysine/metabolism; Lysine/chemistry; Methylation; Protein Processing, Post-Translational; Machine Learning; Humans; Histone-Lysine N-Methyltransferase/metabolism; Histone-Lysine N-Methyltransferase/genetics; Histone-Lysine N-Methyltransferase/chemistry; Computational Biology/methods
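
The abstract above reports accuracy, sensitivity, specificity and MCC. As a reminder of how these four metrics relate to a binary confusion matrix, here is a minimal sketch using the standard definitions; the counts are made up for illustration and are not the paper's results.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up counts: 50 true sites and 100 non-sites.
m = binary_metrics(tp=40, fp=10, tn=90, fn=10)
```

Because MCC uses all four cells of the confusion matrix, it can stay moderate even when accuracy is high, which is one reason it is often reported for imbalanced site-prediction data.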
7.
J Neurosci ; 43(18): 3294-3311, 2023 May 03.
Article in English | MEDLINE | ID: mdl-36977581

ABSTRACT

In bistable perception, observers experience alternations between two interpretations of an unchanging stimulus. Neurophysiological studies of bistable perception typically partition neural measurements into stimulus-based epochs and assess neuronal differences between epochs based on subjects' perceptual reports. Computational studies replicate statistical properties of percept durations with modeling principles like competitive attractors or Bayesian inference. However, bridging neuro-behavioral findings with modeling theory requires the analysis of single-trial dynamic data. Here, we propose an algorithm for extracting nonstationary timeseries features from single-trial electrocorticography (ECoG) data. We applied the proposed algorithm to 5-min ECoG recordings from human primary auditory cortex obtained during perceptual alternations in an auditory triplet streaming task (six subjects: four male, two female). We report two ensembles of emergent neuronal features in all trial blocks. One ensemble consists of periodic functions that encode a stereotypical response to the stimulus. The other comprises more transient features and encodes dynamics associated with bistable perception at multiple time scales: minutes (within-trial alternations), seconds (duration of individual percepts), and milliseconds (switches between percepts). Within the second ensemble, we identified a slowly drifting rhythm that correlates with the perceptual states and several oscillators with phase shifts near perceptual switches. Projections of single-trial ECoG data onto these features establish low-dimensional attractor-like geometric structures invariant across subjects and stimulus types. These findings provide supporting neural evidence for computational models with oscillatory-driven attractor-based principles. 
The feature extraction techniques described here generalize across recording modality and are appropriate when hypothesized low-dimensional dynamics characterize an underlying neural system. SIGNIFICANCE STATEMENT: Irrespective of the sensory modality, neurophysiological studies of multistable perception have typically investigated events time-locked to the perceptual switching rather than the time course of the perceptual states per se. Here, we propose an algorithm that extracts neuronal features of bistable auditory perception from large-scale single-trial data while remaining agnostic to the subject's perceptual reports. The algorithm captures the dynamics of perception at multiple timescales, minutes (within-trial alternations), seconds (durations of individual percepts), and milliseconds (timing of switches), and distinguishes attributes of neural encoding of the stimulus from those encoding the perceptual states. Finally, our analysis identifies a set of latent variables that exhibit alternating dynamics along a low-dimensional manifold, similar to trajectories in attractor-based models for perceptual bistability.


Subject(s)
Auditory Perception; Electrocorticography; Humans; Male; Female; Bayes Theorem; Auditory Perception/physiology; Neurons; Visual Perception/physiology
8.
BMC Bioinformatics ; 25(1): 61, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38321434

ABSTRACT

BACKGROUND: The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality, as errors become more prevalent. This creates the need for different approaches to detect and filter errors, and data quality assurance moves from the hardware space to the software preprocessing stages. RESULTS: We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes (NB) demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of reads mapped to the reference genome.
CONCLUSIONS: This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.


Subject(s)
Arabidopsis; Artificial Intelligence; Bayes Theorem; Escherichia coli; Staphylococcus aureus; Software; Algorithms; High-Throughput Nucleotide Sequencing; Machine Learning; Sequence Analysis, DNA
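
The abstract above describes features computed as TF-IDF values from sequencing reads. One plausible reading is that each read is treated as a document and its k-mers as terms; the sketch below follows that reading. The choice of k, the unsmoothed `log(N / df)` weighting and the toy reads are assumptions of this sketch, since the abstract does not specify them.

```python
from collections import Counter
from math import log


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def tfidf(reads, k=3):
    """TF-IDF vector (as a dict) for each read, with k-mers as terms.

    tf  = k-mer count / number of k-mers in the read
    idf = log(number of reads / number of reads containing the k-mer)
    """
    docs = [Counter(kmers(read, k)) for read in reads]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct k-mer once per read,
    # so this counts document frequency rather than total occurrences.
    df = Counter(kmer for doc in docs for kmer in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            kmer: (count / total) * log(n_docs / df[kmer])
            for kmer, count in doc.items()
        })
    return vectors


# Toy reads, for illustration only.
reads = ["ACGTAC", "ACGTAA", "TTTTTT"]
vectors = tfidf(reads, k=3)
```

K-mers shared by many reads get low weights, while k-mers that occur in only one read, as a sequencing error might produce, get high weights. That contrast is what a downstream classifier such as Naive Bayes could exploit.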
9.
Neuroimage ; 287: 120522, 2024 Feb 15.
Article in English | MEDLINE | ID: mdl-38253216

ABSTRACT

Designing a comprehensive four-dimensional resting-state functional magnetic resonance imaging (4D Rs-fMRI) based default mode network (DMN) modeling methodology that reveals the spatio-temporal patterns of the individual DMN is crucial for understanding the cognitive mechanisms of the brain and the pathogenesis of psychiatric disorders. However, existing approaches for DMN modeling have two limitations. They either (1) simply split the spatio-temporal components and ignore the overall character of the spatio-temporal patterns or (2) are biased in the process of feature extraction for DMN modeling, so their spatio-temporal accuracy is not guaranteed. To this end, we propose a novel Spatio-Temporal Brain Attention Skip Network (STBAS-Net) to model the personalized spatio-temporal patterns of the DMN. STBAS-Net consists of spatial and temporal components, where the multi-head attention skip connection block in the spatial component achieves detailed feature extraction and enhancement in the shallow stage. Under the guidance of spatial information, we fuse multiple sources of spatio-temporal information in the temporal component, which exploits the overall spatio-temporal features and achieves mutual constraints of spatio-temporal patterns to characterize the spatio-temporal patterns of the DMN. We verify the proposed STBAS-Net on a publicly released 4D Rs-fMRI dataset and an EMCI dataset. The experimental results show that, compared with existing advanced methods, the proposed network can more accurately model the personalized spatio-temporal patterns of the human brain DMN and successfully identify abnormal spatio-temporal patterns in EMCI patients. This study provides a potential tool for revealing the spatio-temporal patterns of the human brain DMN and is expected to provide an effective methodological framework for future exploration of abnormal brain spatio-temporal patterns and modeling of other functional brain networks.


Subject(s)
Brain Mapping; Default Mode Network; Humans; Brain Mapping/methods; Magnetic Resonance Imaging/methods; Brain/diagnostic imaging; Attention; Nerve Net/diagnostic imaging
10.
Funct Integr Genomics ; 24(5): 139, 2024 Aug 19.
Article in English | MEDLINE | ID: mdl-39158621

ABSTRACT

Recent advancements in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the bulk and density of data. High-dimensional NGS data, characterized by a large number of genomics, transcriptomics, proteomics, and metagenomics features relative to the number of biological samples, pose significant challenges for data analysis, including increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques employed to address these challenges by reducing the dimensionality of the data, thereby enhancing model performance, interpretability, and computational efficiency. Feature selection and feature extraction can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of various statistical, machine learning, and deep learning-based feature selection and extraction techniques specifically tailored for the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. Various techniques, including deep learning architectures, machine learning algorithms, and statistical methods, have been explored for the microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) datasets surveyed here. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review provides better insights for readers to apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.


Subject(s)
High-Throughput Nucleotide Sequencing; Machine Learning; Humans; High-Throughput Nucleotide Sequencing/methods; Deep Learning
11.
Biostatistics ; 24(3): 653-668, 2023 Jul 14.
Article in English | MEDLINE | ID: mdl-35950944

ABSTRACT

Neuroimaging data are an increasingly important part of etiological studies of neurological and psychiatric disorders. However, mitigating the influence of nuisance variables, including confounders, remains a challenge in image analysis. In studies of Alzheimer's disease, for example, an imbalance in disease rates by age and sex may make it difficult to distinguish between structural patterns in the brain (as measured by neuroimaging scans) attributable to disease progression and those characteristic of typical human aging or sex differences. Concerningly, when not properly accounted for, nuisance variables pose threats to the generalizability and interpretability of findings from these studies. Motivated by this critical issue, in this work, we examine the impact of nuisance variables on feature extraction methods and propose Penalized Decomposition Using Residuals (PeDecURe), a new method for obtaining nuisance variable-adjusted features. PeDecURe estimates primary directions of variation which maximize covariance between partially residualized imaging features and a variable of interest (e.g., Alzheimer's diagnosis) while simultaneously mitigating the influence of nuisance variation through a penalty on the covariance between partially residualized imaging features and those variables. Using features derived using PeDecURe's first direction of variation, we train a highly accurate and generalizable predictive model, as evidenced by its robustness in testing samples with different underlying nuisance variable distributions. We compare PeDecURe to commonly used decomposition methods (principal component analysis (PCA) and partial least squares) as well as a confounder-adjusted variation of PCA. We find that features derived from PeDecURe offer greater accuracy and generalizability and lower correlations with nuisance variables compared with the other methods. 
While PeDecURe is primarily motivated by challenges that arise in the analysis of neuroimaging data, it is broadly applicable to data sets with highly correlated features, where novel methods to handle nuisance variables are warranted.


Subject(s)
Alzheimer Disease; Brain; Humans; Male; Female; Brain/diagnostic imaging; Neuroimaging; Least-Squares Analysis; Image Processing, Computer-Assisted; Disease Progression; Alzheimer Disease/diagnostic imaging; Magnetic Resonance Imaging
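
The abstract above builds on residualized imaging features, that is, features from which the linear effect of nuisance variables has been removed. The sketch below shows only that building block, for one feature and one nuisance variable, using ordinary least squares. It is not PeDecURe itself, which the abstract describes as using partially residualized features together with a covariance penalty. The variable names and values are made up.

```python
from statistics import mean


def residualize(feature, nuisance):
    """Residuals of a simple least-squares fit of `feature` on `nuisance`."""
    f_bar, n_bar = mean(feature), mean(nuisance)
    sxx = sum((n - n_bar) ** 2 for n in nuisance)
    sxy = sum((n - n_bar) * (f - f_bar) for n, f in zip(nuisance, feature))
    slope = sxy / sxx
    return [f - f_bar - slope * (n - n_bar) for f, n in zip(feature, nuisance)]


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / (len(a) - 1)


# Made-up values: age as the nuisance variable, a regional volume as the feature.
age = [60.0, 65.0, 70.0, 75.0, 80.0]
volume = [5.0, 4.6, 4.5, 3.9, 3.5]
adjusted = residualize(volume, age)
```

By construction the adjusted feature has zero sample covariance with age, so any remaining association with diagnosis cannot be explained by a linear age effect.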
12.
BMC Plant Biol ; 24(1): 136, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38408925

ABSTRACT

Subsistence farmers and global food security depend on sufficient food production, which aligns with the UN's "Zero Hunger," "Climate Action," and "Responsible Consumption and Production" sustainable development goals. Available methods for early disease detection and classification suffer from overfitting and from the complexity of fine feature extraction during training, and how early signs of green attacks can be identified or classified remains uncertain. Most pest and disease symptoms are seen in plant leaves and fruits, yet their diagnosis by experts in the laboratory is expensive, tedious, labor-intensive, and time-consuming. Notably, how plant pests and diseases can be appropriately detected and prevented in time is a hot topic in smart, sustainable agriculture and remains an open question. In recent years, deep transfer learning has demonstrated tremendous advances in the recognition accuracy of object detection and image classification systems, since these frameworks utilize previously acquired knowledge to solve similar problems more effectively and quickly. Therefore, in this research, we introduce two plant disease detection (PDDNet) models, early fusion (AE) and the lead voting ensemble (LVE), integrated with nine pre-trained convolutional neural networks (CNNs) and fine-tuned by deep feature extraction for efficient plant disease identification and classification. The experiments were carried out on 15 classes of the popular PlantVillage dataset, which has 54,305 image samples of different plant disease species in 38 categories. Hyperparameter fine-tuning was done with popular pre-trained models, including DenseNet201, ResNet101, ResNet50, GoogleNet, AlexNet, ResNet18, EfficientNetB7, NASNetMobile, and ConvNeXtSmall. We test these CNNs on the stated plant disease detection and classification problem, both independently and as part of an ensemble. In the final phase, a logistic regression (LR) classifier is utilized to determine the performance of various CNN model combinations. A comparative analysis was also performed on classifiers, deep learning, the proposed model, and similar state-of-the-art studies. The experiments demonstrated that PDDNet-AE and PDDNet-LVE achieved 96.74% and 97.79%, respectively, in comparison with current CNNs when tested on several plant diseases, indicating strong robustness and generalization capabilities and mitigating current concerns in plant disease detection and classification.


Subject(s)
Neural Networks, Computer; Plant Diseases; Fruit; Machine Learning
13.
Brief Bioinform ; 23(5)2022 Sep 20.
Article in English | MEDLINE | ID: mdl-35945147

ABSTRACT

Liquid biopsy has shown promise for cancer diagnosis due to its minimally invasive nature and the potential for novel biomarker discovery. However, the low concentration of relevant blood-based biosources and the heterogeneity of samples (i.e. the variability of relative abundance of molecules identified) pose major challenges to biomarker discovery. Moreover, the number of molecular measurements or features (e.g. transcript read counts) per sample could be in the order of several thousand, whereas the number of samples is often substantially lower, leading to the curse of dimensionality. These challenges, among others, underscore the importance of a robust biomarker panel identification or feature extraction step wherein relevant molecular measurements are identified prior to classification for cancer detection. In this work, we performed a benchmarking study on 12 feature extraction methods using transcriptomic profiles derived from different blood-based biosources. The methods were assessed both in terms of their predictive performance and the robustness of the biomarker panels in diagnosing cancer or stratifying cancer subtypes. For the comparison, the feature extraction methods are categorized into feature subset selection methods and transformation methods. A transformation feature extraction method, namely partial least squares discriminant analysis, was found to perform consistently better in terms of classification performance. As part of the benchmarking study, a generic pipeline has been created and made available as an R package to ensure reproducibility of the results and allow for easy extension of this study to other datasets (https://github.com/VafaeeLab/bloodbased-pancancer-diagnosis).


Subject(s)
Neoplasms; Transcriptome; Algorithms; Benchmarking; Biomarkers; Humans; Neoplasms/diagnosis; Neoplasms/genetics; Reproducibility of Results
14.
Brief Bioinform ; 23(6)2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36151714

ABSTRACT

The three-dimensional genome structure plays a key role in cellular function and gene regulation. Single-cell Hi-C (high-resolution chromosome conformation capture) technology can capture genome structure information at the cell level, which provides the opportunity to study how genome structure varies among different cell types. Recently, several methods have been designed for single-cell Hi-C clustering. In this manuscript, we perform an in-depth benchmark study of available single-cell Hi-C data clustering methods and implement an evaluation system for multiple clustering frameworks based on both human and mouse datasets. We compare eight methods in terms of visualization and clustering performance. Performance is evaluated using four benchmark metrics: adjusted Rand index, normalized mutual information, homogeneity and Fowlkes-Mallows index. Furthermore, we also evaluate the eight methods for the task of separating cells at different stages of the cell cycle based on single-cell Hi-C data.


Subject(s)
Chromatin; Chromosomes; Humans; Mice; Animals; Cluster Analysis; Genome; Molecular Conformation
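
Two of the four clustering metrics named in the abstract above, the adjusted Rand index and the Fowlkes-Mallows index, can both be computed by counting pairs of cells. The sketch below uses the standard pair-counting definitions; the label vectors are made up, and the quadratic loop over pairs is chosen for clarity rather than speed.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Classify every pair of samples by whether each labeling groups it together."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    return tp / sqrt((tp + fp) * (tp + fn)) if tp else 0.0


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance, in its pair-counting form."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return 2.0 * (tp * tn - fn * fp) / denominator if denominator else 1.0


# Made-up cell-type labels and cluster assignments for six cells.
cell_types = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2]
ari = adjusted_rand_index(cell_types, clusters)
fmi = fowlkes_mallows(cell_types, clusters)
```

Both scores depend only on which cells are grouped together, so renaming the clusters leaves them unchanged.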
15.
Brief Bioinform ; 23(3)2022 May 13.
Article in English | MEDLINE | ID: mdl-35262658

ABSTRACT

B-cell epitopes have the capability to recognize and attach to the surface of antigen receptors to stimulate the immune system against pathogens. Identification of B-cell epitopes from antigens has great significance in several biomedical and biotechnological applications and supports the development of therapeutics, the design and development of epitope-based vaccines, and antibody production. However, the identification of epitopes with experimental mapping approaches is challenging and usually requires extensive laboratory effort. Considerable efforts have been made in the recent past to identify epitopes using computational methods, but without considerable success. In this study, we present LBCEPred, a Python-based web tool (http://lbcepred.pythonanywhere.com/), built with a random forest classifier and statistical moment-based descriptors to predict B-cell epitopes from protein sequences. LBCEPred outperforms all sequence-based models currently in use for B-cell epitope prediction, with an accuracy of 0.868 and an area under the curve of 0.934. Moreover, the prediction performance of the proposed models compared to other state-of-the-art models is 56.3% higher on average for the Matthews correlation coefficient. LBCEPred is an easy-to-use tool, even for novice users, and its models have shown stability and reliability; we therefore believe it will contribute significantly to the research community and to the field of bioinformatics.


Subject(s)
Computational Biology , Epitopes, B-Lymphocyte , Amino Acid Sequence , Computational Biology/methods , Machine Learning , Reproducibility of Results
16.
Brief Bioinform ; 23(3)2022 May 13.
Article in English | MEDLINE | ID: mdl-35368061

ABSTRACT

Ribonucleic acid (RNA) is a pivotal nucleic acid that plays a crucial role in regulating many biological activities. Recently, one study utilized a machine learning algorithm to automatically classify RNA structural events generated by a Mycobacterium smegmatis porin A nanopore trap. Although it achieves desirable classification results, this classic machine learning approach, unlike deep learning (DL) methods, requires domain knowledge to manually extract features, which is complicated, labor-intensive and time-consuming. Meanwhile, the generated original RNA structural events are not strictly equal in length, which is incompatible with the input requirements of DL models. To alleviate this issue, we propose a sequence-to-sequence (S2S) module that transforms unequal-length sequences (UELS) into equal-length sequences. Furthermore, to automatically extract features from the RNA structural events, we propose a sequence-to-sequence neural network based on DL. In addition, we add an attention mechanism to capture vital information for classification, such as dwell time and blockage amplitude. In quantitative and qualitative analyses, the experimental results show an accuracy increase of about 2% over the previous method. The proposed method can also be applied to other nanopore platforms, such as Oxford Nanopore. It is worth noting that the proposed method not only pursues state-of-the-art performance but also provides a general approach to processing nanopore data with UELS.


Subject(s)
Deep Learning , Nanopores , Molecular Weight , Plant Extracts , RNA/chemistry
17.
Brief Bioinform ; 23(3)2022 May 13.
Article in English | MEDLINE | ID: mdl-35383372

ABSTRACT

With advances in sequencing technologies, a huge amount of biological data is now being generated. Analyzing this volume of data manually is infeasible, which creates a strong opportunity for machine learning methods. These methods, however, are practical only when the sequences are converted into feature vectors. Many tools target this task, including iLearnPlus, a Python-based tool that supports a rich set of features. In this paper, we propose a holistic tool that extracts features from biological sequences (i.e. DNA, RNA and protein). These features are the inputs to machine learning models that predict properties, structures or functions of the input sequences. Our tool supports not only all features in iLearnPlus but also 30 additional features from the literature. Moreover, our tool is written in R, which gives bioinformaticians an alternative for transforming sequences into feature vectors. We compared the conversion time of our tool with that of iLearnPlus and found that our tool transforms sequences much faster: by a median of 2.8X for small nucleotide sequences and 6.3X for large sequences. Finally, for amino acid sequences, our tool achieves a median speedup of 23.9X.
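
As a concrete picture of what converting a sequence into a feature vector means, the sketch below computes normalized k-mer frequencies, one of the simplest sequence descriptors. It is written in Python for illustration and does not reproduce the R tool described above or iLearnPlus; the function name, the default alphabet and the example sequence are assumptions.

```python
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer, in a fixed order.

    The vector always has len(alphabet) ** k entries, so sequences of
    different lengths map to feature vectors of the same size.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous symbols such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical DNA fragment; the result has 4 ** 2 = 16 entries.
features = kmer_frequencies("ACGTACGTNNAC", k=2)
print(len(features))
print(features)
```

Passing a 20-letter amino acid alphabet instead of "ACGT" yields the analogous composition features for proteins.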


Subject(s)
Machine Learning , Proteins , DNA/genetics , Humans , Proteins/chemistry , RNA/genetics , Sequence Analysis/methods
18.
Brief Bioinform ; 23(6)2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36094081

ABSTRACT

The identification of long noncoding RNA (lncRNA)-disease associations is of great value for disease diagnosis and treatment, and computational methods are now commonly used to predict potential lncRNA-disease associations. However, existing methods do not sufficiently extract key features during data processing, and their learning models are either not powerful enough or overly complex. Therefore, there is still potential to achieve better predictive performance by improving these two aspects. In this work, we propose LDAformer, a novel lncRNA-disease association prediction method based on topological feature extraction and a Transformer encoder. We construct a heterogeneous network by integrating the associations between lncRNAs, diseases and micro RNAs (miRNAs). Intra-class similarities and inter-class associations are represented as a lncRNA-disease-miRNA weighted adjacency matrix to unify semantics. Next, we design a topological feature extraction process to obtain the multi-hop topological pathway features latent in the adjacency matrix. Finally, to capture the interdependencies between heterogeneous pathways, a Transformer encoder based on the global self-attention mechanism is employed to predict lncRNA-disease associations. The efficient feature extraction and the intuitive, powerful learning model lead to strong performance. The results of computational experiments on two datasets show that our method outperforms the state-of-the-art baseline methods. Additionally, case studies further indicate its capability to discover new associations accurately.


Subject(s)
MicroRNAs , Neoplasms , RNA, Long Noncoding , Humans , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , Computational Biology/methods , Neoplasms/genetics , MicroRNAs/genetics
19.
Brief Bioinform ; 23(1)2022 Jan 17.
Article in English | MEDLINE | ID: mdl-34850821

ABSTRACT

2'-O-methylation (Nm) is a post-transcriptional modification of RNA that is catalyzed by 2'-O-methyltransferase and involves replacing the H on the 2'-hydroxyl group with a methyl group. 2'-O-methylation modification sites are found in a variety of RNA types (miRNA, tRNA, mRNA, etc.), play an important role in biological processes and are associated with different diseases. Few of the underlying functional mechanisms have been elucidated so far, and traditional high-throughput experiments for exploring them are time-consuming and expensive. For a deeper understanding of the relevant biological mechanisms, efficient and accurate recognition tools based on machine learning are needed. We therefore constructed a predictor called NmRF, based on optimal mixed features and a random forest classifier, to identify 2'-O-methylation modification sites. The predictor can identify modification sites in multiple species at the same time. To obtain a better prediction model, a two-step strategy is adopted: the optimal hybrid feature set is obtained by combining the light gradient boosting algorithm with an incremental feature selection strategy. In 10-fold cross-validation, the accuracies for Homo sapiens and Saccharomyces cerevisiae were 89.069% and 93.885%, and the AUCs were 0.9498 and 0.9832, respectively. The rigorous 10-fold cross-validation and independent tests confirm that the proposed method is significantly better than existing tools. A user-friendly web server is accessible at http://lab.malab.cn/~acy/NmRF.


Subject(s)
Computational Biology , Machine Learning , Base Sequence , Computational Biology/methods , Humans , Methylation , RNA/genetics
20.
Brief Bioinform ; 23(1)2022 Jan 17.
Article in English | MEDLINE | ID: mdl-34750626

ABSTRACT

One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350-0.9897), both when applying only mathematical descriptors and when combining them with well-known descriptors from the literature. Finally, with MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by providing descriptors not available in other packages and by allowing non-experts to use feature extraction techniques.
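
To illustrate one of the descriptor families listed above, the sketch below implements a basic chaos game representation of a DNA sequence and bins the resulting points into a fixed-length vector. It is a generic illustration, not MathFeature's implementation; the corner assignment, function names and grid binning are assumptions chosen for this example.

```python
# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game_points(sequence, corners=CORNERS):
    """Map each base to a point by moving halfway toward that base's corner."""
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in sequence.upper():
        if base not in corners:
            continue  # ignore ambiguous symbols such as N
        corner_x, corner_y = corners[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


def cgr_grid_features(sequence, resolution=2, corners=CORNERS):
    """Count points per grid cell, flattened row by row into a fixed-length list."""
    grid = [0] * (resolution * resolution)
    for x, y in chaos_game_points(sequence, corners):
        column = int(x * resolution)  # coordinates stay strictly inside (0, 1)
        row = int(y * resolution)
        grid[row * resolution + column] += 1
    return grid


# Hypothetical fragment: each base contributes one point to the plot.
print(chaos_game_points("AG"))
print(cgr_grid_features("ACGTAC", resolution=2))
```

The grid counts give every sequence a vector of the same length regardless of how long the sequence is, which is the property a machine learning model needs from a numeric representation.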


Subject(s)
Proteins , RNA , Algorithms , Amino Acid Sequence , DNA/genetics , Machine Learning , Proteins/chemistry , RNA/genetics