Results 1 - 20 of 30
1.
Bioinformatics ; 37(3): 326-333, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32805010

ABSTRACT

MOTIVATION: In recent years, the well-known Infinite Sites Assumption has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging single-cell sequencing (SCS) techniques have shown evidence of widespread recurrence and, especially, loss of mutations in several tumor samples. While established computational methods exist that infer phylogenies with mutation losses, there is still room for improvement. RESULTS: We present Simulated Annealing Single-Cell inference (SASC): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS datasets. In particular, we extend the model of evolution in which mutations are only accumulated by also allowing a limited number of mutation losses in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real datasets and in comparison with other available methods. AVAILABILITY AND IMPLEMENTATION: The SASC tool is open source and available at https://github.com/sciccolella/sasc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
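The simulated-annealing core underlying approaches like SASC can be sketched generically. A minimal Python skeleton with the usual Metropolis acceptance rule; the objective (Hamming distance to a hidden bit pattern) and neighbor move are toy stand-ins, not SASC's actual tree-scoring function:

```python
import math
import random

def simulated_annealing(state, neighbor, cost, t0=1.0, cooling=0.995, steps=2000, seed=0):
    """Generic simulated-annealing skeleton: always accept improving moves,
    accept worsening moves with probability exp(-delta / T), cool T geometrically."""
    rng = random.Random(seed)
    cur, cur_cost = state, cost(state)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        cand_cost = cost(cand)
        delta = cand_cost - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cand, cand_cost
        t *= cooling
    return best, best_cost

# Toy stand-in for a tree-scoring objective: distance to a hidden pattern.
target = [1, 0, 1, 1, 0, 1, 0, 0]
hamming = lambda s: sum(a != b for a, b in zip(s, target))

def flip_one(s, rng):
    """Neighbor move: flip one randomly chosen bit."""
    i = rng.randrange(len(s))
    return s[:i] + [s[i] ^ 1] + s[i + 1:]

best, best_cost = simulated_annealing([0] * 8, flip_one, hamming)
```

In SASC the state would be a candidate Dollo-k tree and the neighbor move a tree rearrangement; the acceptance/cooling loop is the same.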


Subjects
Algorithms, Neoplasms, Single-Cell Analysis, Humans, Mutation, Neoplasms/genetics, Phylogeny, Sequence Analysis, Software
2.
BMC Bioinformatics ; 21(Suppl 1): 413, 2020 Dec 09.
Article in English | MEDLINE | ID: mdl-33297943

ABSTRACT

BACKGROUND: Cancer progression reconstruction is an important development stemming from the field of phylogenetics. In this context, reconstructing the phylogeny that represents the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: single-cell DNA sequencing data have great specificity but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer, a phenomenon that is currently widely ignored. RESULTS: We present a new tool, gpps, that reconstructs a tumor phylogeny from single-cell sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps. CONCLUSIONS: gpps provides new insights into the analysis of intra-tumor heterogeneity by proposing a new progression model for cancer phylogeny reconstruction on single-cell data.
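The "lost at most a fixed number of times" constraint (Dollo-k) is easy to state on edge labels. A minimal sketch; encoding the tree as a flat list of edge events is an illustrative assumption, not gpps's input format:

```python
from collections import Counter

def is_dollo_k(edge_events, k):
    """Check the Dollo-k condition on a labeled tumor phylogeny:
    every mutation is gained at most once and lost at most k times.
    edge_events: iterable of (mutation, 'gain' | 'loss') edge labels."""
    gains, losses = Counter(), Counter()
    for mutation, kind in edge_events:
        (gains if kind == "gain" else losses)[mutation] += 1
    return (all(g <= 1 for g in gains.values())
            and all(l <= k for l in losses.values()))

ok = is_dollo_k([("M1", "gain"), ("M1", "loss")], k=1)           # one loss: valid
bad = is_dollo_k([("M1", "gain"), ("M1", "loss"), ("M1", "loss")], k=1)  # two losses
```

Setting k=0 recovers the perfect-phylogeny (no-loss) model that the Infinite Sites Assumption implies.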


Subjects
Computational Biology/methods, DNA Mutational Analysis, Disease Progression, Mutation, Neoplasms/genetics, Neoplasms/pathology, Base Sequence, Molecular Evolution, Humans, Phylogeny, Single-Cell Analysis
3.
BMC Bioinformatics ; 19(1): 252, 2018 07 03.
Article in English | MEDLINE | ID: mdl-29970002

ABSTRACT

BACKGROUND: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes, owing to their ability to span several variants along the genome. These long reads are, however, also characterized by a high error rate, an issue that may be mitigated with larger sets of reads when the error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. RESULTS: Here, we propose a new method for assembling haplotypes that combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary to find a feasible solution. This allows our method to significantly reduce the required computational resources, making it possible to consider datasets with higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage, with lower runtimes. CONCLUSIONS: Our method leverages the long-range information of sequencing reads, which yields assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, our method is able to deal with higher coverages, to better correct the errors in the original reads and obtain more accurate haplotypes as a result.
AVAILABILITY: HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).
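The objective such dynamic programs minimize is the minimum error correction (MEC) score of a bipartition of reads into two haplotypes. A small sketch of the score itself (the read encoding as {site: allele} dicts and the 0/1 assignment tuple are illustrative, not HapCHAT's internal representation):

```python
from collections import Counter

def mec_cost(reads, assignment):
    """Minimum error correction (MEC) cost of a read bipartition:
    for each haplotype and each variant site, every minority allele
    among the reads assigned to that haplotype must be corrected."""
    cost = 0
    for side in (0, 1):
        columns = {}
        for read, hap in zip(reads, assignment):
            if hap != side:
                continue
            for pos, allele in read.items():
                columns.setdefault(pos, []).append(allele)
        for alleles in columns.values():
            cost += len(alleles) - Counter(alleles).most_common(1)[0][1]
    return cost

# Four reads over three variant sites, encoded as {site: allele}.
reads = [{0: 0, 1: 0}, {1: 0, 2: 0}, {0: 1, 1: 1}, {1: 1, 2: 1}]
perfect = mec_cost(reads, (0, 0, 1, 1))  # consistent split: no corrections
mixed = mec_cost(reads, (0, 1, 0, 1))    # conflicting split: corrections needed
```

Haplotype assemblers search over such bipartitions column by column, which is why their complexity is governed by coverage (reads overlapping a site) rather than read length.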


Subjects
Haplotypes/genetics, DNA Sequence Analysis/methods, Algorithms, Humans
4.
BMC Bioinformatics ; 17(Suppl 11): 342, 2016 Sep 22.
Article in English | MEDLINE | ID: mdl-28185544

ABSTRACT

BACKGROUND: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest given current trends in sequencing technology, which are producing longer fragments. RESULTS: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity, exploring a number of possible solutions that is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the results enjoy the same high accuracy as those provided by WHATSHAP, which increases with coverage. CONCLUSIONS: Due to its structure and its management of large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed by exploiting a high-level parallel programming framework.
The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.


Subjects
Algorithms, Computational Biology/methods, Human Genome, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/methods, Single Nucleotide Polymorphism/genetics, DNA Sequence Analysis/methods, Population Genetics, Genomics/methods, Humans
5.
Med Biol Eng Comput ; 62(8): 2449-2483, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38622438

ABSTRACT

Understanding protein structures is crucial for various areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as the Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, to the best of our knowledge researchers have not effectively and rigorously combined structural and sequence-based features for efficient protein classification. To this end, we propose numerical embeddings that extract relevant features for protein sequences fetched from PDB structures in popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties such as aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at a given pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scaling features for sliding windows (e.g., k-mers), including the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, a hydrophilicity scale, amino acid flexibility, and a hydropathy scale. Multiple-feature selection aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
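One of the listed features, GRAVY, is simply the mean Kyte-Doolittle hydropathy over all residues of a sequence; a minimal sketch:

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def gravy(seq):
    """Grand Average of Hydropathy: mean KD value over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)

g = gravy("AILV")  # all-hydrophobic toy peptide, ~3.575
```

Each of the other scalar physicochemical properties yields one coordinate of the embedding in the same way, so a sequence maps to a fixed-length numerical vector.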


Subjects
Protein Databases, Proteins, Proteins/chemistry, Proteins/metabolism, Computational Biology/methods, Algorithms, Protein Conformation
6.
Comput Biol Med ; 170: 107956, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38217977

ABSTRACT

The classification and prediction of T-cell receptors (TCRs) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.
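The sliding-window idea behind such encodings can be illustrated with any per-residue property scale. A sketch using a simple side-chain charge scale; the scale values and window size are illustrative assumptions, not the PseAAC2Vec definition:

```python
# Approximate side-chain charge at neutral pH (residues not listed ~ 0).
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0, "H": 0.1}

def window_profile(seq, scale, k=2):
    """Average a per-residue property over each sliding k-mer,
    giving a local-property profile of length len(seq) - k + 1."""
    return [sum(scale.get(aa, 0.0) for aa in seq[i:i + k]) / k
            for i in range(len(seq) - k + 1)]

profile = window_profile("DEKR", CHARGE)  # negative, neutral, positive windows
```

Concatenating such profiles (or summary statistics of them) across several property scales yields the fixed-length feature vectors fed to SVM or random-forest classifiers.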


Subjects
Computational Biology, Proteins, Computational Biology/methods, Proteins/chemistry, Amino Acid Sequence, Amino Acids/chemistry, Amino Acids/metabolism, Algorithms, Support Vector Machine, Protein Sequence Analysis/methods, Protein Databases
7.
BMC Bioinformatics ; 14 Suppl 15: S4, 2013.
Article in English | MEDLINE | ID: mdl-24564205

ABSTRACT

BACKGROUND: Models of ancestral gene order reconstruction have progressively integrated different evolutionary patterns and processes such as unequal gene content, gene duplications, and, implicitly, sequence evolution via reconciled gene trees. These models have so far ignored lateral gene transfer, even though in unicellular organisms it can have an important confounding effect and can be a rich source of information on the function of genes through the detection of transfers of clusters of genes. RESULTS: We report an algorithm, together with its implementation, DeCoLT, that reconstructs ancestral genome organization based on reconciled gene trees, which summarize information on sequence evolution, gene origination, duplication, loss, and lateral transfer. DeCoLT optimizes, in polynomial time, the number of rearrangements, computed as the number of gains and breakages of adjacencies between pairs of genes. We apply DeCoLT to 1099 gene families from 36 cyanobacteria genomes. CONCLUSIONS: DeCoLT is able to reconstruct adjacencies in 35 ancestral bacterial genomes with a thousand gene families in a few hours, and detects clusters of co-transferred genes. DeCoLT may also be used with any relationship between genes instead of adjacencies, to reconstruct ancestral interactions, functions, or complexes. AVAILABILITY: http://pbil.univ-lyon1.fr/software/DeCoLT/
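The rearrangement count DeCoLT minimizes, gains plus breakages of gene adjacencies, can be sketched on flat gene orders (DeCoLT itself works over reconciled gene trees; this flat-order version is a simplification that also ignores gene orientation):

```python
def adjacencies(order):
    """Unordered adjacency set of a linear gene order (signs ignored)."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

def rearrangement_cost(ancestor, descendant):
    """Gains plus breakages of adjacencies between two gene orders:
    adjacencies present only in the descendant are gains, adjacencies
    present only in the ancestor are breakages."""
    a, d = adjacencies(ancestor), adjacencies(descendant)
    gains = len(d - a)
    breakages = len(a - d)
    return gains + breakages

# Transposing B and C breaks A-B and C-D, and gains A-C and B-D.
cost = rearrangement_cost(list("ABCD"), list("ACBD"))
```

As the conclusion notes, nothing in this objective is specific to physical adjacency: replacing the adjacency set with any pairwise gene relation reuses the same gain/breakage count.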


Subjects
Molecular Evolution, Horizontal Gene Transfer, Genome, Algorithms, Cyanobacteria/genetics, Gene Duplication, Software
8.
BMC Bioinformatics ; 14 Suppl 15: S18, 2013.
Article in English | MEDLINE | ID: mdl-24564758

ABSTRACT

BACKGROUND: We study the problem of mapping proteins between two protein families in the presence of paralogs. This problem occurs as a difficult subproblem in coevolution-based computational approaches for protein-protein interaction prediction. RESULTS: Similar to prior approaches, our method is based on the idea that coevolution implies equal rates of sequence evolution among the interacting proteins, and we provide a first attempt to quantify this notion in a formal statistical manner. We call the units that are central to this quantification scheme the units of coevolution. A unit consists of two mapped protein pairs and its score quantifies the coevolution of the pairs. This quantification allows us to provide a maximum likelihood formulation of the paralog mapping problem and to cast it into a binary quadratic programming formulation. CONCLUSION: CUPID, our software tool based on a Lagrangian relaxation of this formulation, makes it, for the first time, possible to compute state-of-the-art quality pairings in a few minutes of runtime. In summary, we suggest a novel alternative to the earlier available approaches, which is statistically sound and computationally feasible.


Subjects
Proteins/analysis, Software, Amino Acid Sequence, Molecular Sequence Data, Proteins/chemistry, Sequence Alignment, Protein Sequence Analysis
9.
Biology (Basel) ; 12(6)2023 Jun 14.
Article in English | MEDLINE | ID: mdl-37372139

ABSTRACT

Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for biological sequence analysis that can effectively analyze the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, generally associated with biological sequence datasets, which hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data; however, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handling the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resembles the real data; the generated data can then be employed to enhance the performance of ML models by removing the class imbalance problem in biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve the overall classification performance.

10.
Genes (Basel) ; 15(1)2023 12 23.
Article in English | MEDLINE | ID: mdl-38254915

ABSTRACT

Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
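The contact-map construction the paper leverages can be sketched directly from residue coordinates. A minimal pure-Python version; the 8 Å distance threshold is a common convention assumed here, not a value stated in the abstract:

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary contact map: residues i, j are in contact when their
    (e.g. C-alpha) coordinates lie within `threshold` angstroms."""
    n = len(coords)
    cm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cm[i][j] = cm[j][i] = 1
    return cm

def upper_triangle(cm):
    """Flatten the strict upper triangle into a fixed-length Euclidean vector
    (for a fixed number of residues), ready to concatenate with sequence features."""
    n = len(cm)
    return [cm[i][j] for i in range(n) for j in range(i + 1, n)]

# Three toy C-alpha positions: only the first two are within 8 angstroms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
vec = upper_triangle(contact_map(coords))
```

Such a vector captures 3D structure in a form that can be concatenated with LLM-derived and hand-engineered sequence features, as the abstract describes.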


Subjects
Algorithms, Benchmarking, Amino Acid Sequence, Protein Databases, Language