Search | Regional Portal of the BVS (Virtual Health Library)

1.

Special Issue, Part I 19th International Symposium on Bioinformatics Research and Applications (ISBRA 2023).

Patterson, Murray.

J Comput Biol ; 31(6): 473-474, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38868911

Subjects

Computational Biology; Computational Biology/methods; Humans; Congresses as Topic

2.

From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets.

Ali, Sarwan; Chourasia, Prakash; Patterson, Murray.

Med Biol Eng Comput ; 2024 Apr 16.

Article in English | MEDLINE | ID: mdl-38622438

ABSTRACT

Understanding protein structures is crucial for many areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as the Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, to the best of our knowledge researchers have not effectively and rigorously combined structural and sequence-based features for efficient protein classification. To this end, we propose numerical embeddings that extract relevant features for protein sequences fetched from PDB structures in popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties such as aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scaling features for the sliding windows (e.g., k-mers), which include the Kyte and Doolittle (KD) hydropathy scale, Eisenberg hydrophobicity scale, hydrophilicity scale, flexibility of the amino acids, and hydropathy scale. Multiple-feature selection aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
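As an illustration of two of the features named in this abstract, the following minimal sketch computes GRAVY and a sliding-window hydropathy profile using the standard Kyte and Doolittle scale. The function names `gravy` and `window_hydropathy` and the window size are assumptions made for this example; this is not the authors' implementation.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand Average of Hydropathy: mean KD value over the sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)


def window_hydropathy(seq, k=3):
    """Mean KD value of every length-k window (k-mer) of the sequence."""
    return [gravy(seq[i:i + k]) for i in range(len(seq) - k + 1)]


# Hypothetical toy peptide: one global feature followed by per-window features.
features = [gravy("AILV")] + window_hydropathy("AILV", k=2)
```

A real pipeline would concatenate many such features; the sketch only shows how a scale turns a sequence into numbers.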

3.

PseAAC2Vec protein encoding for TCR protein sequence classification.

Tayebi, Zahra; Ali, Sarwan; Murad, Taslim; Khan, Imdadullah; Patterson, Murray.

Comput Biol Med ; 170: 107956, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38217977

ABSTRACT

The classification and prediction of T-cell receptor (TCR) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.

Subjects

Computational Biology; Proteins; Computational Biology/methods; Proteins/chemistry; Amino Acid Sequence; Amino Acids/chemistry; Amino Acids/metabolism; Algorithms; Support Vector Machine; Sequence Analysis, Protein/methods; Databases, Protein

4.

ViralVectors: compact and scalable alignment-free virome feature generation.

Ali, Sarwan; Chourasia, Prakash; Tayebi, Zahra; Bello, Babatunde; Patterson, Murray.

Med Biol Eng Comput ; 61(10): 2607-2626, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37395885

ABSTRACT

The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a compact feature vector generation from virome sequencing data that allows effective downstream analysis. Such generation is based on minimizers, a type of lightweight "signature" of a sequence, used traditionally in assembly and read mapping; to our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks. Graphical abstract showing all the steps of the proposed approach: we start by collecting the sequence-based data. Then data cleaning and preprocessing are applied. After that, we generate the feature embeddings using the minimizer-based approach. Then classification and clustering algorithms are applied to the resultant data, and predictions are made on the test set.
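The minimizer idea mentioned in this abstract can be sketched as follows. The sketch assumes one common definition (the lexicographically smallest m-mer among the m-mers of a k-mer and of its reversal); the abstract does not specify the exact variant, and the names `minimizer`, `minimizer_counts`, and `to_vector` are hypothetical.

```python
from collections import Counter
from itertools import product


def minimizer(kmer, m):
    """Lexicographically smallest m-mer in the k-mer or in its reversal."""
    candidates = []
    for s in (kmer, kmer[::-1]):
        candidates.extend(s[i:i + m] for i in range(len(s) - m + 1))
    return min(candidates)


def minimizer_counts(seq, k, m):
    """Count the minimizer of every k-mer of the sequence."""
    return Counter(minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1))


def to_vector(counts, alphabet, m):
    """Fixed-length vector indexed by all |alphabet|^m possible m-mers."""
    return [counts["".join(p)] for p in product(alphabet, repeat=m)]


counts = minimizer_counts("ACGTAC", k=4, m=2)
vector = to_vector(counts, "ACGT", m=2)
```

Because the vector is indexed by m-mers rather than k-mers, its length is |alphabet|^m, which is what makes this kind of representation compact.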

Subjects

COVID-19; Virome; Humans; SARS-CoV-2; Algorithms; Sequence Analysis, DNA/methods

5.

Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors.

Sahoo, Bikram; Ali, Sarwan; Chen, Pin-Yu; Patterson, Murray; Zelikovsky, Alexander.

Biomolecules ; 13(6), 2023 06 02.

Article in English | MEDLINE | ID: mdl-37371514

ABSTRACT

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.

Subjects

COVID-19; SARS-CoV-2; Humans; SARS-CoV-2/genetics; Sequence Analysis, DNA/methods; Pandemics; High-Throughput Nucleotide Sequencing/methods; Algorithms; Machine Learning

6.

Exploring the Potential of GANs in Biological Sequence Analysis.

Murad, Taslim; Ali, Sarwan; Patterson, Murray.

Biology (Basel) ; 12(6), 2023 Jun 14.

Article in English | MEDLINE | ID: mdl-37372139

ABSTRACT

Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to limit their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for effectively analyzing the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, which is common in biological sequence datasets and hinders their performance. Although various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are used to generate synthetic data that closely resemble real data; these generated data can then be employed to enhance the ML models' performance by removing the class imbalance problem for biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDJdb, Host), and our results illustrate that GANs can improve the overall classification performance.

7.

Benchmarking machine learning robustness in Covid-19 genome sequence classification.

Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray.

Sci Rep ; 13(1): 4154, 2023 03 13.

Article in English | MEDLINE | ID: mdl-36914815

ABSTRACT

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome (millions of sequences and counting). This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
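To make the notion of a perturbation budget concrete, here is a minimal sketch that substitutes a fixed fraction of positions uniformly at random. This is only the simplest possible noise model, assumed for illustration: it does not reproduce the Illumina or PacBio error profiles (which include insertions and deletions), and the function name `perturb` and its parameters are hypothetical.

```python
import random


def perturb(seq, budget, alphabet="ACGT", seed=0):
    """Replace a `budget` fraction of positions with a different random symbol."""
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    n_errors = int(budget * len(seq))
    positions = rng.sample(range(len(seq)), n_errors)  # distinct positions
    out = list(seq)
    for p in positions:
        # Choose among the other symbols so every selected position changes.
        out[p] = rng.choice([b for b in alphabet if b != out[p]])
    return "".join(out)


original = "ACGT" * 10
noisy = perturb(original, budget=0.25)
```

A robustness benchmark in this spirit would train a classifier on embeddings of clean sequences and compare its accuracy on embeddings of `noisy` sequences at increasing budgets.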

Subjects

Computer Simulation; Genome, Viral; Machine Learning; Research Design; SARS-CoV-2; Machine Learning/standards; SARS-CoV-2/classification; SARS-CoV-2/genetics; Genome, Viral/genetics; Viral Proteins/genetics; COVID-19/virology; Sequence Analysis, RNA

8.

Special Issue: 11th International Computational Advances in Bio and Medical Sciences (ICCABS 2021).

Bansal, Mukul S; Mandoiu, Ion I; Moussa, Marmar; Patterson, Murray; Rajasekaran, Sanguthevar; Skums, Pavel; Zelikovsky, Alexander.

J Comput Biol ; 30(4): 363-365, 2023 Apr.

Article in English | MEDLINE | ID: mdl-36847353

9.

Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data.

Chourasia, Prakash; Ali, Sarwan; Ciccolella, Simone; Vedova, Gianluca Della; Patterson, Murray.

J Comput Biol ; 30(4): 469-491, 2023 04.

Article in English | MEDLINE | ID: mdl-36730750

ABSTRACT

The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.

Subjects

COVID-19; Pangolins; Humans; Animals; Pandemics; SARS-CoV-2/genetics; COVID-19/genetics; High-Throughput Nucleotide Sequencing/methods

10.

Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location.

Ali, Sarwan; Bello, Babatunde; Tayebi, Zahra; Patterson, Murray.

J Comput Biol ; 30(4): 432-445, 2023 04.

Article in English | MEDLINE | ID: mdl-36656554

ABSTRACT

With the rapid spread of COVID-19 worldwide, viral genomic data are available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis toward the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected; the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with the magnitude of such data. In this article, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using k-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
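The k-mer representation described in this abstract can be sketched as a frequency vector over all possible substrings of length k. The function name `kmer_spectrum` and the toy two-letter alphabet are assumptions for illustration; a real spike sequence would use the 20-letter amino acid alphabet, giving 20^k dimensions.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k, alphabet):
    """Count each possible k-mer (in lexicographic alphabet order) in the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


# Toy example over a two-letter alphabet: dimensions are AA, AB, BA, BB.
spectrum = kmer_spectrum("ABAB", k=2, alphabet="AB")
```

The resulting fixed-length vectors can be passed to any standard classifier, with the geographical location as the class label.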

Subjects

COVID-19; SARS-CoV-2; Humans; SARS-CoV-2/genetics; COVID-19/epidemiology; COVID-19/genetics; Genome, Viral; Amino Acids/genetics

11.

When Protein Structure Embedding Meets Large Language Models.

Ali, Sarwan; Chourasia, Prakash; Patterson, Murray.

Genes (Basel) ; 15(1), 2023 12 23.

Article in English | MEDLINE | ID: mdl-38254915

ABSTRACT

Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
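A contact map, the structural ingredient named in this abstract, is a binary matrix marking which residue pairs lie close together in 3D. The sketch below assumes one coordinate per residue (for example the C-alpha atom) and an 8 Å cutoff, a common convention that the abstract does not confirm; `contact_map` and the toy coordinates are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 where two residues are within `threshold` of each other."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical residue positions along a line (units assumed to be angstroms).
cmap = contact_map([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

How the matrix is then turned into an embedding and combined with LLM features is specific to the paper and is not reproduced here.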

Subjects

Algorithms; Benchmarking; Amino Acid Sequence; Databases, Protein; Language

12.

Efficient Approximate Kernel Based Spike Sequence Classification.

Ali, Sarwan; Sahoo, Bikram; Khan, Muhammad Asad; Zelikovsky, Alexander; Khan, Imdad Ullah; Patterson, Murray.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2022 Sep 14.

Article in English | MEDLINE | ID: mdl-36103437

ABSTRACT

Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between k-mers (sub-sequences of length k) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they incur high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods; they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.

13.

Determining Significant Correlation Between Pairs of Extant Characters in a Small Parsimony Framework.

Khandai, Kaustubh; Navarro-Martinez, Cristian; Smith, Brendan; Buonopane, Rebecca; Byun, Soyong Ashley; Patterson, Murray.

J Comput Biol ; 29(10): 1132-1154, 2022 10.

Article in English | MEDLINE | ID: mdl-35723627

ABSTRACT

When studying the evolutionary relationships among a set of species, the principle of parsimony states that a relationship involving the fewest number of evolutionary events is likely the correct one. Due to its simplicity, this principle was formalized in the context of computational evolutionary biology decades ago by, for example, Fitch and Sankoff. Because the parsimony framework does not require a model of evolution, unlike maximum likelihood or Bayesian approaches, it is often a good starting point when no reasonable estimate of such a model is available. In this work, we devise a method for determining if pairs of discrete characters are significantly correlated across all most parsimonious reconstructions, given a set of species on these characters, and an evolutionary tree. The first step of this method is to use Sankoff's algorithm to compute all most parsimonious assignments of ancestral states (of each character) to the internal nodes of the phylogeny. Correlation between a pair of evolutionary events (e.g., absent to present) for a pair of characters is then determined by the (co-) occurrence patterns between the sets of their respective ancestral assignments. The probability of obtaining a correlation this extreme (or more) under a null hypothesis where the events happen randomly on the evolutionary tree is then used to assess the significance of this correlation. We implement this method: parcours (PARsimonious CO-occURrenceS) and use it to identify significantly correlated evolution among vocalizations and morphological characters in the Felidae family.
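The first step named in this abstract, Sankoff's algorithm, can be sketched as a bottom-up dynamic program over the tree. The sketch below computes only the minimum cost per root state; it does not enumerate all most parsimonious assignments as the paper's method does. The nested-tuple tree encoding, the unit cost function, and the name `sankoff` are assumptions for illustration and are not taken from parcours.

```python
def sankoff(tree, states, cost):
    """Return {state: minimum cost of the subtree when its root takes that state}."""
    if isinstance(tree, str):
        # Leaf: the observed state costs nothing, any other state is impossible.
        return {s: (0 if s == tree else float("inf")) for s in states}
    child_tables = [sankoff(child, states, cost) for child in tree]
    return {
        s: sum(min(cost(s, t) + table[t] for t in states) for table in child_tables)
        for s in states
    }


def unit_cost(a, b):
    """Every change of state counts as one evolutionary event."""
    return 0 if a == b else 1


# Hypothetical tree with four species; leaves hold the character state ("0" absent, "1" present).
tree = (("0", "1"), ("1", "1"))
root_costs = sankoff(tree, "01", unit_cost)
parsimony_score = min(root_costs.values())
```

A top-down traceback over the same tables would recover the ancestral assignments that achieve `parsimony_score`, which is the input to the co-occurrence analysis described above.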

Subjects

Algorithms; Computational Biology; Bayes Theorem; Computational Biology/methods; Phylogeny

14.

Efficient analysis of COVID-19 clinical data using machine learning models.

Ali, Sarwan; Zhou, Yijing; Patterson, Murray.

Med Biol Eng Comput ; 60(7): 1881-1896, 2022 Jul.

Article in English | MEDLINE | ID: mdl-35507111

ABSTRACT

Because of the rapid spread of COVID-19 to almost every part of the globe, huge volumes of data and case studies have been made available, providing researchers with a unique opportunity to find trends and make discoveries like never before by leveraging such big data. These data come in many varieties and at different levels of veracity (e.g., precise, imprecise, uncertain, and missing), making it challenging to extract meaningful information from them. Yet, efficient analysis of these continuously growing and evolving COVID-19 data is crucial to inform, often in real time, the relevant measures needed for controlling, mitigating, and ultimately avoiding viral spread. Applying machine learning-based algorithms to this big data is a natural approach to this aim, since they can quickly scale to such data and extract the relevant information in the presence of variety and different levels of veracity. This is important for COVID-19 and for potential future pandemics in general. In this paper, we design a straightforward encoding of clinical data (on categorical attributes) into a fixed-length feature vector representation and then propose a model that first performs efficient feature selection from such a representation. We apply this approach to two clinical datasets of COVID-19 patients and then apply different machine learning algorithms downstream for classification purposes. We show that with the efficient feature selection algorithm, we can achieve a prediction accuracy of more than 90% in most cases. We also computed the importance of different attributes in the dataset using information gain. This can help policymakers focus on only certain attributes to study this disease rather than on multiple random factors that may not be very informative about patient outcomes.
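Information gain, used in this abstract to rank attributes, is the reduction in label entropy obtained by splitting on an attribute. The following is a minimal sketch of the textbook definition for categorical attributes; the function names and the toy attribute values are hypothetical and no real clinical data is used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy attributes: the first separates the outcomes perfectly, the second not at all.
outcome = [1, 1, 0, 0]
informative = information_gain(["yes", "yes", "no", "no"], outcome)
uninformative = information_gain(["yes", "no", "yes", "no"], outcome)
```

Ranking attributes by this score and keeping the top ones is one simple form of feature selection; the paper's own selection algorithm may differ.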

Subjects

COVID-19; Algorithms; Humans; Machine Learning; Pandemics

15.

PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences.

Ali, Sarwan; Bello, Babatunde; Chourasia, Prakash; Punathil, Ria Thazhe; Zhou, Yijing; Patterson, Murray.

Biology (Basel) ; 11(3), 2022 Mar 09.

Article in English | MEDLINE | ID: mdl-35336792

ABSTRACT

The study of host specificity has important connections to the question of the origin of SARS-CoV-2 in humans, which led to the COVID-19 pandemic and remains an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts that can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations and use them in the context of host classification. The results on real-world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime; in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids that are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
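One plausible reading of a PWM-based embedding is sketched below: build a position weight matrix from the k-mers of a sequence, then score each k-mer against it. This is an assumption-laden illustration rather than the published PWM2Vec procedure; the uniform background, the pseudocount, and the name `pwm_scores` are all hypothetical. The output has one score per k-mer, so it is fixed-length only when the input sequences share a common length (for example, after alignment).

```python
import math
from collections import Counter


def pwm_scores(seq, k, alphabet, pseudocount=0.1):
    """Score each k-mer of `seq` against a PWM built from all of its k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    total = len(kmers) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(k):
        column = Counter(kmer[pos] for kmer in kmers)
        # Log-odds of the smoothed frequency at this position versus the background.
        pwm.append({
            a: math.log2(((column[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return [sum(pwm[pos][ch] for pos, ch in enumerate(kmer)) for kmer in kmers]


# Toy nucleotide example; the k-mer "ACG" occurs twice and so scores highest.
scores = pwm_scores("ACGACG", k=3, alphabet="ACGT")
```

Frequent k-mers receive higher scores than rare ones, so the vector records where in the sequence the characteristic patterns occur.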

16.

From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering.

Melnyk, Andrew; Mohebbi, Fatemeh; Knyazev, Sergey; Sahoo, Bikram; Hosseini, Roya; Skums, Pavel; Zelikovsky, Alex; Patterson, Murray.

J Comput Biol ; 28(11): 1113-1129, 2021 11.

Article in English | MEDLINE | ID: mdl-34698508

ABSTRACT

The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a study of the evolution, genomic diversity, and dynamics of a virus in unprecedented detail. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.

Subjects

Cluster Analysis; Computational Biology/methods; Brazil; Databases, Genetic; Entropy; Humans; Monte Carlo Method; South Africa; United Kingdom; United States

17.

Simpler and Faster Development of Tumor Phylogeny Pipelines.

Ali, Sarwan; Ciccolella, Simone; Lucarella, Lorenzo; Vedova, Gianluca Della; Patterson, Murray.

J Comput Biol ; 28(11): 1142-1155, 2021 11.

Article in English | MEDLINE | ID: mdl-34698531

ABSTRACT

In recent years, there has been an increasing number of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique, which allows for capturing more detailed information regarding the specific genetic mutations in each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples. To this end, we developed plastic (PipeLine Amalgamating Single-cell Tree Inference Components), an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) to simplify the input data, (2) to infer tumor phylogenies, and (3) to compare the phylogenies. We have created a pipeline submodule for each of those steps and developed new in-memory data structures that allow for easy and transparent sharing of the information across the tools implementing the above steps. While we use existing open source tools for those steps, we have extended the tool used for simplifying the input data, incorporating two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we have introduced the capability of producing some plots to quickly visualize results.

Subjects

Computational Biology/methods; Mutation; Neoplasms/classification; Humans; Internet; Neoplasms/genetics; Phylogeny; Sequence Analysis, DNA; Single-Cell Analysis; Software

18.

Effective Clustering for Single Cell Sequencing Cancer Data.

Ciccolella, Simone; Patterson, Murray; Bonizzoni, Paola; Della Vedova, Gianluca.

IEEE J Biomed Health Inform ; 25(11): 4068-4078, 2021 11.

Article in English | MEDLINE | ID: mdl-34003758

ABSTRACT

Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes inference difficult, and sometimes infeasible, using current approaches and tools. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence, and uncertainty of the mutations found in the different sequenced cells) and to infer the tree from this reduced-size instance. In this work, we present a new clustering procedure, called celluloid, aimed at clustering such categorical vector or matrix data, here representing SCS instances. We show that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. We demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and considerably raising the upper bound on the size of SCS instances that can be solved in practice. Our approach, celluloid (clustering single cell sequencing data around centroids), is available at https://github.com/AlgoLab/celluloid/ under an MIT license, as well as on the Python Package Index (PyPI) at https://pypi.org/project/celluloid-clust/.

Subjects

Algorithms , Neoplasms , Cluster Analysis , Humans , Mutation/genetics , Neoplasms/genetics , Phylogeny , Software
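
The abstract of entry 18 describes an SCS instance as a matrix of presence, absence, and uncertainty, which is reduced by clustering mutations before tree inference. The following is a minimal sketch of that idea only, not the celluloid implementation: the 0/1/2 encoding, the `conflict_distance` measure, and the `consensus` merge rule are all assumptions made for illustration.

```python
from collections import Counter

# Assumed encoding (hypothetical): 0 = mutation absent, 1 = present, 2 = unknown.
MISSING = 2


def conflict_distance(a, b):
    """Count cells where both mutations are observed and disagree."""
    return sum(1 for x, y in zip(a, b) if MISSING not in (x, y) and x != y)


def consensus(columns):
    """Merge mutation columns cell by cell using the most common observed value.

    A cell stays MISSING only if no column in the cluster observed it.
    """
    merged = []
    for values in zip(*columns):
        known = [v for v in values if v != MISSING]
        merged.append(Counter(known).most_common(1)[0][0] if known else MISSING)
    return merged


# Three mutations observed over four cells (one list per mutation).
m1 = [1, 1, 0, 2]
m2 = [1, 2, 0, 0]
m3 = [0, 0, 1, 1]

# m1 and m2 never disagree where both are observed, so they are merged;
# m3 conflicts with m1 in every jointly observed cell and is kept apart.
reduced = [consensus([m1, m2]), m3]
```

Merging compatible mutations shrinks the number of columns and also fills in some missing entries, which is what makes the downstream inference step cheaper.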

19.

Inferring cancer progression from Single-Cell Sequencing while allowing mutation losses.

Ciccolella, Simone; Ricketts, Camir; Soto Gomez, Mauricio; Patterson, Murray; Silverbush, Dana; Bonizzoni, Paola; Hajirasouliha, Iman; Della Vedova, Gianluca.

Bioinformatics ; 37(3): 326-333, 2021 Apr 20.

Article in English | MEDLINE | ID: mdl-32805010

ABSTRACT

MOTIVATION: In recent years, the well-known Infinite Sites Assumption has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging single-cell sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made. RESULTS: We present Simulated Annealing Single-Cell inference (SASC): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS datasets. In particular, we introduce an extension of the model of evolution where mutations are only accumulated, by allowing also a limited amount of mutation loss in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real datasets and in comparison with some other available methods. AVAILABILITY AND IMPLEMENTATION: The SASC tool is open source and available at https://github.com/sciccolella/sasc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms , Neoplasms , Single-Cell Analysis , Humans , Mutation , Neoplasms/genetics , Phylogeny , Sequence Analysis , Software
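
The Dollo-k model named in the abstract of entry 19 lets each mutation be acquired once and lost a limited number of times. The sketch below checks only those counts on a list of edge events. It is an illustration under stated assumptions, not the SASC code: the `(mutation, kind)` event format and the function name `satisfies_dollo_k` are hypothetical, and the check ignores tree topology (a full check would also require every loss to occur below the corresponding gain).

```python
from collections import Counter


def satisfies_dollo_k(events, k):
    """Return True if every mutation is gained exactly once and lost at most k times.

    events: list of (mutation, kind) pairs, kind being "gain" or "loss",
            one pair per event labelling an edge of the tree (assumed format).
    """
    gains = Counter(m for m, kind in events if kind == "gain")
    losses = Counter(m for m, kind in events if kind == "loss")
    mutations = set(gains) | set(losses)
    return all(gains[m] == 1 and losses[m] <= k for m in mutations)


# Mutation "a" is gained once and lost once; "b" is only gained.
events = [("a", "gain"), ("b", "gain"), ("a", "loss")]
```

With `k = 0` the check reduces to the accumulation-only model that the abstract describes as being extended, so the example above is accepted for `k = 1` and rejected for `k = 0`.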

20.

gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data.

Ciccolella, Simone; Soto Gomez, Mauricio; Patterson, Murray D; Della Vedova, Gianluca; Hajirasouliha, Iman; Bonizzoni, Paola.

BMC Bioinformatics ; 21(Suppl 1): 413, 2020 Dec 09.

Article in English | MEDLINE | ID: mdl-33297943

ABSTRACT

BACKGROUND: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored. RESULTS: We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps. CONCLUSIONS: gpps provides new insights into the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.

Subjects

Computational Biology/methods , DNA Mutational Analysis , Disease Progression , Mutation , Neoplasms/genetics , Neoplasms/pathology , Base Sequence , Evolution, Molecular , Humans , Phylogeny , Single-Cell Analysis
