Results 1 - 13 of 13
1.
Med Biol Eng Comput ; 2024 Apr 16.
Article in English | MEDLINE | ID: mdl-38622438

ABSTRACT

Understanding protein structures is crucial for many areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as the Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, to the best of our knowledge researchers have not effectively and rigorously combined structural and sequence-based features for efficient protein classification. To this end, we propose numerical embeddings that extract relevant features for protein sequences fetched from PDB structures from popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties such as aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scaling features for the sliding windows (e.g., k-mers), which include the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, the hydrophilicity scale, the flexibility of the amino acids, and the hydropathy scale. Multiple-feature selection aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
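
As an illustration of two of the features named in the abstract above, the following minimal sketch computes GRAVY and a sliding-window Kyte-Doolittle hydropathy profile. The function names and the default window size are assumptions made for this example; the authors' actual feature pipeline is not described in the abstract and may differ.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand Average of Hydropathy: mean KD value over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


def window_hydropathy(seq, k=3):
    """Mean KD value of every length-k window (k-mer); k=3 is an arbitrary choice."""
    return [gravy(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

A sequence of length n yields n - k + 1 window scores, so a fixed-length embedding would still need a pooling or binning step, which this sketch leaves out.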

2.
Comput Biol Med ; 170: 107956, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38217977

ABSTRACT

The classification and prediction of T-cell receptors (TCRs) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.


Subject(s)
Computational Biology , Proteins , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Amino Acids/chemistry , Amino Acids/metabolism , Algorithms , Support Vector Machine , Protein Sequence Analysis/methods , Protein Databases
3.
Med Biol Eng Comput ; 61(10): 2607-2626, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37395885

ABSTRACT

The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a compact feature vector generation from virome sequencing data that allows effective downstream analysis. Such generation is based on minimizers, a type of lightweight "signature" of a sequence traditionally used in assembly and read mapping; to our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks. Graphical abstract showing all the steps of the proposed approach: we start by collecting the sequence-based data; data cleaning and preprocessing are then applied; after that, we generate the feature embeddings using the minimizer-based approach; finally, classification and clustering algorithms are applied to the resulting data, and predictions are made on the test set.


Subject(s)
COVID-19 , Virome , Humans , SARS-CoV-2 , Algorithms , DNA Sequence Analysis/methods
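
The minimizer idea mentioned in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions: the minimizer of a k-mer is taken to be its lexicographically smallest length-m substring in the forward direction only, and the function names and the values of k and m are hypothetical. The ViralVectors method itself may use a different ordering or also consider the reversed k-mer.

```python
from collections import Counter


def minimizer(kmer, m):
    """Lexicographically smallest length-m substring of a k-mer (forward only)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizer_counts(seq, k=5, m=3):
    """Count the minimizer of every k-mer in the sequence."""
    return Counter(
        minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1)
    )
```

Because neighbouring k-mers often share the same minimizer, the resulting counts are spread over fewer distinct keys than plain k-mer counts, which is what makes the representation compact.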
4.
Biomolecules ; 13(6)2023 06 02.
Article in English | MEDLINE | ID: mdl-37371514

ABSTRACT

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , DNA Sequence Analysis/methods , Pandemics , High-Throughput Nucleotide Sequencing/methods , Algorithms , Machine Learning
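
To make the notion of spaced k-mers from the abstract above concrete, here is a minimal sketch that extracts them with a binary mask, where "1" keeps a position and "0" skips it. The mask-based formulation, the default mask, and the function name are assumptions for illustration; the abstract does not specify how the study defines its gaps.

```python
from collections import Counter


def spaced_kmers(seq, mask="1101"):
    """Count spaced k-mers: slide a window of len(mask) and keep '1' positions."""
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    span = len(mask)
    return Counter(
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    )
```

Skipping positions means a single substitution or sequencing error inside a gap leaves the extracted k-mer unchanged, which is one plausible reason such embeddings tolerate noisy reads.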
5.
Biology (Basel) ; 12(6)2023 Jun 14.
Article in English | MEDLINE | ID: mdl-37372139

ABSTRACT

Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to limit their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for biological sequence analysis that can effectively analyze the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, which is common in biological sequence datasets and hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, but they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are used to generate synthetic data that closely resembles real data; these generated data can then be employed to enhance the ML models' performance by removing the class imbalance problem for biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve the overall classification performance.

6.
Sci Rep ; 13(1): 4154, 2023 03 13.
Article in English | MEDLINE | ID: mdl-36914815

ABSTRACT

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.


Subject(s)
Computer Simulation , Viral Genome , Machine Learning , Research Design , SARS-CoV-2 , Machine Learning/standards , SARS-CoV-2/classification , SARS-CoV-2/genetics , Viral Genome/genetics , Viral Proteins/genetics , COVID-19/virology , RNA Sequence Analysis
7.
J Comput Biol ; 30(4): 469-491, 2023 04.
Article in English | MEDLINE | ID: mdl-36730750

ABSTRACT

The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties compared with existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.


Subject(s)
COVID-19 , Pangolins , Humans , Animals , Pandemics , SARS-CoV-2/genetics , COVID-19/genetics , High-Throughput Nucleotide Sequencing/methods
8.
J Comput Biol ; 30(4): 432-445, 2023 04.
Article in English | MEDLINE | ID: mdl-36656554

ABSTRACT

With the rapid spread of COVID-19 worldwide, viral genomic data are available on the order of millions of sequences in public databases such as GISAID. This Big Data creates a unique opportunity for analysis toward the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected; the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with the magnitude of such data. In this article, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using k-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , COVID-19/genetics , Viral Genome , Amino Acids/genetics
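
The k-mer representation described in the abstract above can be sketched as a fixed-length count vector with one entry per possible k-mer over the alphabet. The function name, the default k, and the choice to silently skip k-mers containing characters outside the alphabet are assumptions for this example, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def kmer_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Return a count vector of length len(alphabet) ** k for the sequence."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers with characters outside the alphabet
            vec[j] += 1
    return vec
```

With the amino acid alphabet and k = 3, every sequence maps to a vector of 8,000 counts regardless of its length, which is what allows standard classifiers to be applied downstream.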
9.
Genes (Basel) ; 15(1)2023 12 23.
Article in English | MEDLINE | ID: mdl-38254915

ABSTRACT

Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.


Subject(s)
Algorithms , Benchmarking , Amino Acid Sequence , Protein Databases , Language
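
The contact map concept used in the abstract above can be illustrated with a short sketch: two residues are marked as being in contact when their representative atoms lie within a distance threshold. The 8 angstrom cutoff, the use of one coordinate per residue, and the function name are assumptions for this example; the abstract does not state the definition used in the study.

```python
from math import dist  # available in Python 3.8+


def contact_map(coords, threshold=8.0):
    """Binary n x n matrix: 1 where two residues are within `threshold` angstroms."""
    n = len(coords)
    return [
        [1 if dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle carries independent information when it is turned into features.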
10.
Article in English | MEDLINE | ID: mdl-36103437

ABSTRACT

Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between k-mers (sub-sequences of length k) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods; they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.

11.
Med Biol Eng Comput ; 60(7): 1881-1896, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35507111

ABSTRACT

Because of the rapid spread of COVID-19 to almost every part of the globe, huge volumes of data and case studies have been made available, providing researchers with a unique opportunity to find trends and make discoveries like never before by leveraging such big data. This data is of many different varieties and can be of different levels of veracity, e.g., precise, imprecise, uncertain, and missing, making it challenging to extract meaningful information from such data. Yet, efficient analysis of this continuously growing and evolving COVID-19 data is crucial to inform, often in real time, the relevant measures needed for controlling, mitigating, and ultimately avoiding viral spread. Applying machine learning-based algorithms to this big data is a natural approach to this aim, since they can quickly scale to such data and extract the relevant information in the presence of variety and different levels of veracity. This is important for COVID-19 and potential future pandemics in general. In this paper, we design a straightforward encoding of clinical data (on categorical attributes) into a fixed-length feature vector representation and then propose a model that first performs efficient feature selection from such a representation. We apply this approach to two clinical datasets of COVID-19 patients and then apply different machine learning algorithms downstream for classification purposes. We show that with the efficient feature selection algorithm, we can achieve a prediction accuracy of more than 90% in most cases. We also computed the importance of different attributes in the dataset using information gain. This can help policymakers focus on only certain attributes to study this disease rather than focusing on multiple random factors that may not be very informative to patient outcomes.


Subject(s)
COVID-19 , Algorithms , Humans , Machine Learning , Pandemics
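
Two steps from the abstract above, encoding categorical attributes as a fixed-length vector and ranking attributes by information gain, can be sketched as follows. One-hot encoding is assumed here as the "straightforward encoding"; the function names and the toy attribute names in the usage note are hypothetical, and the paper's own encoding and feature selection algorithm may differ.

```python
from collections import Counter
from math import log2


def one_hot(rows):
    """Encode a list of {attribute: category} dicts as binary vectors."""
    columns = sorted({(a, v) for row in rows for a, v in row.items()})
    vectors = [[1 if row.get(a) == v else 0 for a, v in columns] for row in rows]
    return columns, vectors


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on one attribute's values."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

For example, calling `one_hot([{"sex": "F"}, {"sex": "M"}])` gives two columns, one per observed category. An attribute whose values split the labels perfectly has an information gain equal to the label entropy, while an uninformative attribute scores zero.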
12.
Biology (Basel) ; 11(3)2022 Mar 09.
Article in English | MEDLINE | ID: mdl-35336792

ABSTRACT

The study of host specificity has important connections to the question of the origin of SARS-CoV-2 in humans, which led to the COVID-19 pandemic and remains an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real-world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime; in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.

13.
J Comput Biol ; 28(11): 1142-1155, 2021 11.
Article in English | MEDLINE | ID: mdl-34698531

ABSTRACT

In recent years, there has been an increasing number of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique, which allows for capturing more detailed information regarding the specific genetic mutations in each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples. To this end, we developed plastic (PipeLine Amalgamating Single-cell Tree Inference Components), an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) to simplify the input data, (2) to infer tumor phylogenies, and (3) to compare the phylogenies. We have created a pipeline submodule for each of those steps and developed new in-memory data structures that allow for easy and transparent sharing of the information across the tools implementing the above steps. While we use existing open source tools for those steps, we have extended the tool used for simplifying the input data, incorporating two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we have introduced the capability of producing some plots to quickly visualize results.


Subject(s)
Computational Biology/methods , Mutation , Neoplasms/classification , Humans , Internet , Neoplasms/genetics , Phylogeny , DNA Sequence Analysis , Single-Cell Analysis , Software