Results 1 - 20 of 28
1.
Int J Mol Sci ; 25(9), 2024 May 03.
Article in English | MEDLINE | ID: mdl-38732207

ABSTRACT

Prediction of binding sites for transcription factors is important to understand how the latter regulate gene expression and how this regulation can be modulated for therapeutic purposes. A considerable number of studies address this issue with different approaches, Machine Learning being one of the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method to embed the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, empowered by bidirectional long short-term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and HeLa). The results show that the proposed method can predict transcription factor binding sites (TFBS) within the five different cell lines very well; moreover, cross-cell predictions provide satisfactory results as well. Experiments conducted across cell lines are reinforced by the analysis of five additional lines used only to test the model trained using the others. The results confirm that prediction performance across cell lines remains very high, allowing an extensive cross-transcription factor analysis to be performed from which several indications of interest for molecular biology may be drawn.


Subjects
Deep Learning, Transcription Factors, Humans, Transcription Factors/metabolism, Transcription Factors/genetics, Binding Sites, Computational Biology/methods, HeLa Cells, Protein Binding, Chromatin Immunoprecipitation Sequencing/methods, Cell Line
2.
Virus Res ; 317: 198814, 2022 08.
Article in English | MEDLINE | ID: mdl-35588940

ABSTRACT

Adaptive immune response is triggered when specific pathogen peptides called epitopes are recognised as exogenous according to the paradigm of self/non-self. To be recognized by immune cells, epitopes have to be exposed (presented) on the surface of the cell. Predicting if a peptide is exposed is important to shed light on the rules that govern immune response and, thus, identify potential targets and design vaccines and drugs. We focused on peptides exposed on the cell surface and made accessible to the immune system through the MHC Class I complex. Before this can happen, three successive selection steps have to take place: a) Proteasome cleavage, b) TAP Transport, and c) binding to MHC class I. Starting from a set of 211 reference viruses with a human host, we computed the set of unique peptides occurring in the corresponding proteomes. Then, we obtained the probability values of Proteasome Cleavage, TAP Transport and Binding to MHC Class I associated with those peptides through established prediction software tools. Such values were analysed in conjunction with two other features that could play a major role: the distance from self, strictly linked to the concept of nullomers, and the sequence entropy, measuring the complexity of the peptide amino acid composition. The analysis confirmed and extended previous results on a larger, more significant and consistent data set; we showed that the higher the distance from self, the higher the score of TAP Transport and binding to MHC class I; no significant association was instead found between distance from self and Proteasome Cleavage. Additionally, amino acid peptide composition entropy was significantly associated with the other features. In particular, higher entropies were linked with higher scores of Proteasome Cleavage, TAP Transport, Binding to MHC Class I, and higher distance from self. The relationship among the three selection steps provided evidence of a tight inter-correlation, clearly suggesting it could be the product of a co-evolutive process. We believe that these results give new insights on the complex processes that regulate peptide presentation through MHC class I, and unveil the mechanisms that allow the immune system to distinguish self and viral non-self peptides.
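As an illustration of the sequence entropy feature mentioned in this abstract, the following is a minimal sketch that assumes the measure is the Shannon entropy of the residue frequencies within a peptide. The abstract does not give the exact definition, so the function name and the two example 9-mers are hypothetical.

```python
import math
from collections import Counter


def sequence_entropy(peptide: str) -> float:
    """Shannon entropy (in bits) of the amino acid composition of a peptide."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    counts = Counter(peptide)
    n = len(peptide)
    # Sum -p * log2(p) over the residue types that occur in the peptide.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical 9-mers, chosen only to show the two extremes.
low = sequence_entropy("AAAAAAAAA")    # a single residue type
high = sequence_entropy("ACDEFGHIK")   # nine distinct residue types
```

Under this assumption a peptide made of one repeated residue has zero entropy, while a peptide of nine distinct residues reaches the maximum for its length, log2(9) bits.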


Subjects
Proteasome Endopeptidase Complex, Viruses, ATP-Binding Cassette Transporters/genetics, Amino Acids, Antigen Presentation, Entropy, Epitopes, Histocompatibility Antigens Class I/metabolism, Humans, Peptides, Proteasome Endopeptidase Complex/metabolism, Viruses/metabolism
3.
Biometrics ; 78(4): 1592-1603, 2022 12.
Article in English | MEDLINE | ID: mdl-34437713

ABSTRACT

Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.


Subjects
Pediatric Obesity, Child, Humans, Algorithms, Sample Size, Probability
4.
J Comput Graph Stat ; 30(3): 566-577, 2021.
Article in English | MEDLINE | ID: mdl-36406776

ABSTRACT

Recent advances in mathematical programming have made Mixed Integer Optimization a competitive alternative to popular regularization methods for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. Here we propose MIP-BOOST, a revision of standard Mixed Integer Programming feature selection that reduces the computational burden of tuning the critical sparsity bound parameter and improves performance in the presence of feature collinearity and of signals that vary in nature and strength. The final outcome is a more efficient and effective L0 Feature Selection method for applications of realistic size and complexity, grounded on rigorous cross-validation tuning and exact optimization of the associated Mixed Integer Program. Computational viability and improved performance in realistic scenarios are achieved through three independent but synergistic proposals. Supplementary materials including additional results, pseudocode, and computer code are available online.
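For readers unfamiliar with the term, L0 feature selection in linear regression is usually written as a cardinality-constrained least-squares problem. The formulation below is the generic textbook form, not necessarily the exact program solved by MIP-BOOST, and the symbols (response y, design matrix X, coefficients β, sparsity bound k) are assumptions introduced here for illustration.

```latex
\min_{\beta \in \mathbb{R}^{p}} \; \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad \lVert \beta \rVert_0 \le k
```

Here the L0 "norm" counts the nonzero entries of β, so k plays the role of the sparsity bound parameter that the abstract describes as critical to tune.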

5.
Math Med Biol ; 37(2): 183-211, 2020 05 29.
Article in English | MEDLINE | ID: mdl-31162541

ABSTRACT

The present study aims to clarify the role of the fraction of patients under antiretroviral therapy (ART) achieving viral suppression (VS) (i.e. having plasma viral load below the detectability threshold) on the human immunodeficiency virus (HIV) epidemic in Italy. Based on the hypothesis that VS makes the virus untransmittable, we extend a previous model and we develop a time-varying ordinary differential equation model with immigration and treatment, where the naive and non-naive populations of infected are distinguished, and different compartments account for treated subjects virally suppressed and not suppressed. Moreover, naive and non-naive individuals with acquired immune deficiency syndrome (AIDS) are considered separately. Clinical data stored in the nationwide database Antiviral Response Cohort Analysis are used to reconstruct the history of the fraction of virally suppressed patients since highly active ART introduction, as well as to assess some model parameters. Other parameters are set according to the literature and the final model calibration is obtained by fitting epidemic data over the years 2003-2015. Predictions on the evolution of the HIV epidemic up to the end of 2035 are made assuming different future trends of the fraction of virally suppressed patients and different eligibility criteria for treatment. Increasing the VS fraction is found to reduce the incidence, the new cases of AIDS and the deaths from AIDS per year, especially in combination with early ART initiation. The asymptotic properties of a time-invariant formulation of the model are studied, and the existence and global asymptotic stability of a unique positive equilibrium are proved.


Subjects
Anti-HIV Agents/therapeutic use, HIV Infections/drug therapy, Biological Models, Highly Active Antiretroviral Therapy, Computational Biology, Computer Simulation, Factual Databases, Epidemics/statistics & numerical data, HIV/drug effects, HIV/physiology, HIV Infections/epidemiology, HIV Infections/virology, Humans, Incidence, Italy/epidemiology, Mathematical Concepts, Viral RNA/blood, Time Factors, Viral Load/drug effects, Viremia/drug therapy, Viremia/virology, Virus Replication/drug effects
6.
P R Health Sci J ; 38(1): 46-53, 2019 03.
Article in English | MEDLINE | ID: mdl-30924915

ABSTRACT

OBJECTIVE: The objectives of this research were to develop an epidemiological profile of tobacco use in the Puerto Rico lesbian, gay, bisexual, transgender, and transsexual (LGBTT) populations and identify whether there are any statistically significant differences (in terms of health conditions and risk factors) between LGBTT smokers (LGBTT-S), LGBTT non-smokers (LGBTT-NS), general-population non-smokers (GP-NS), and general-population smokers (GP-S). METHODS: Using the Puerto Rico Behavioral Risk Factor Surveillance System database (2013-2015), we conducted a univariate analysis to obtain an epidemiological profile, and a bivariate analysis was performed to compare LGBTT-S, LGBTT-NS, GP-NS, and GP-S. Finally, to determine the odds ratios (ORs), an age-adjusted logistic regression model with a 95% level of reliability was used. RESULTS: During the period of 2013 through 2015, the Puerto Rico LGBTT population was reported to have a higher tobacco use prevalence than the general population had (21.6% vs. 10.8%). The LGBTT-S were more likely to have depression (OR: 2.63, p = 0.030) than the LGBTT-NS were. Likewise, LGBTT-S were more likely to suffer from COPD (OR: 4.81, p = 0.014), depression (OR: 3.27, p = 0.002), and heart attack (OR: 0.12, p = 0.038) than were GP-NS. Finally, LGBTT-S were more likely to suffer from COPD (OR: 5.07, p = 0.013) and heart attack (OR: 0.13, p = 0.046) than GP-S were. CONCLUSION: The results of this research demonstrate that tobacco use is one of the most critical public health issues affecting the LGBTT populations in Puerto Rico. For that reason, specific interventions and treatments directed to LGBTT populations are needed to help reduce the impact of this addiction on the health of their members.


Subjects
Public Health, Sexual and Gender Minorities/statistics & numerical data, Tobacco Use/epidemiology, Adolescent, Adult, Aged, Behavioral Risk Factor Surveillance System, Cross-Sectional Studies, Female, Humans, Male, Middle Aged, Prevalence, Puerto Rico/epidemiology, Risk Factors, Young Adult
7.
BioData Min ; 11: 22, 2018.
Article in English | MEDLINE | ID: mdl-30386434

ABSTRACT

BACKGROUND: In the Next Generation Sequencing (NGS) era a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to allow enhanced accessibility. The combination of heterogeneous NGS genomic data is an open challenge: the analysis of data from different experiments is a fundamental practice for the study of diseases. In this work, we propose to combine DNA methylation and RNA sequencing NGS experiments at gene level for supervised knowledge extraction in cancer. METHODS: We retrieve DNA methylation and RNA sequencing datasets from The Cancer Genome Atlas (TCGA), focusing on the Breast Invasive Carcinoma (BRCA), the Thyroid Carcinoma (THCA), and the Kidney Renal Papillary Cell Carcinoma (KIRP). We combine the RNA sequencing gene expression values with the gene methylation quantity, a new measure that we define for representing the methylation quantity associated with a gene. Additionally, we propose to analyze the combined data through tree- and rule-based classification algorithms (C4.5, Random Forest, RIPPER, and CAMUR). RESULTS: We extract more than 15,000 classification models (composed of gene sets), which distinguish the tumoral samples from the normal ones with an average accuracy of 95%. From the integrated experiments we obtain about 5000 classification models that consider both the gene measures related to the RNA sequencing and the DNA methylation experiments. CONCLUSIONS: We compare the sets of genes obtained from the classifications on RNA sequencing and DNA methylation data with the genes obtained from the integration of the two experiments. The comparison results in several genes that are in common among the single experiments and the integrated ones (733 for BRCA, 35 for KIRP, and 861 for THCA) and 509 genes that are in common among the different experiments. Finally, we investigate the possible relationships among the different analyzed tumors by extracting a core set of 13 genes that appear in all tumors. A preliminary functional analysis confirms the relation of part of those genes (5 out of 13 and 279 out of 509) with cancer, suggesting that further studies should focus on the newly identified ones.

8.
BMC Bioinformatics ; 19(Suppl 10): 354, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30367574

ABSTRACT

BACKGROUND: The high growth of Next Generation Sequencing data currently demands new knowledge extraction methods. In particular, the RNA sequencing gene expression experimental technique stands out for case-control studies on cancer, which can be addressed with supervised machine learning techniques able to extract human interpretable models composed of genes, and their relation to the investigated disease. State of the art rule-based classifiers are designed to extract a single classification model, possibly composed of few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could be potentially involved in the analyzed tumor. This comprehensive and open access knowledge base is required to disseminate novel insights about cancer. RESULTS: We propose CamurWeb, a new method and web-based software that is able to extract multiple and equivalent classification models in the form of logic formulas ("if then" rules) and to create a knowledge base of these rules that can be queried and analyzed. The method is based on an iterative classification procedure and an adaptive feature elimination technique that enables the computation of many rule-based models related to the cancer under study. Additionally, CamurWeb includes a user friendly interface for running the software, querying the results, and managing the performed experiments. The user can create her profile, upload her gene expression data, run the classification analyses, and interpret the results with predefined queries. In order to validate the software we apply it to all publicly available RNA sequencing datasets from The Cancer Genome Atlas database, obtaining a large open access knowledge base about cancer. CamurWeb is available at http://bioinformatics.iasi.cnr.it/camurweb. CONCLUSIONS: The experiments prove the validity of CamurWeb, obtaining many classification models and thus several genes that are associated with 21 different cancer types. Finally, the comprehensive knowledge base about cancer and the software tool are released online; interested researchers have free access to them for further studies and to design biological experiments in cancer research.


Subjects
Neoplastic Gene Expression Regulation, Knowledge Bases, Neoplasms/genetics, Software, Base Sequence, Neoplasm Genes, Human Genome, Humans, RNA Sequence Analysis
9.
J Integr Bioinform ; 15(4), 2018 Oct 26.
Article in English | MEDLINE | ID: mdl-30367805

ABSTRACT

Finding similarities and differences between metagenomic samples within large repositories has been a significant issue for researchers. In recent years, content-based retrieval has been suggested by various studies from different perspectives. In this study, a content-based retrieval framework for identifying relevant metagenomic samples is developed. The framework consists of feature extraction, selection methods and similarity measures for whole metagenome sequencing samples. Performance of the developed framework was evaluated on given samples. A ground truth was used to evaluate the system performance: if the system retrieves patients with the same disease (called positive samples), they are labeled as relevant samples, otherwise as irrelevant. The experimental results show that relevant experiments can be detected by using different fingerprinting approaches. We observed that the Latent Semantic Analysis (LSA) method is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source code and executable files are available at www.baskent.edu.tr/~hogul/WMS_retrieval.rar.


Subjects
High-Throughput Nucleotide Sequencing/methods, Metagenome, Microbiota, DNA Sequence Analysis/methods, Software, Algorithms, Humans
10.
BMC Med Inform Decis Mak ; 18(1): 35, 2018 05 31.
Article in English | MEDLINE | ID: mdl-29855305

ABSTRACT

BACKGROUND: Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by progressive dementia, for which no cure is currently known. Early detection of patients affected by AD can be obtained by analyzing their electroencephalography (EEG) signals, which show a reduction of complexity, a perturbation of synchrony, and a slowing down of the rhythms. METHODS: In this work, we apply a procedure that exploits feature extraction and classification techniques to EEG signals, with the aim of distinguishing patients affected by AD from those affected by Mild Cognitive Impairment (MCI) and from healthy control (HC) samples. Specifically, we perform a time-frequency analysis by applying both the Fourier and Wavelet Transforms on 109 samples belonging to AD, MCI, and HC classes. The classification procedure is designed with the following steps: (i) preprocessing of EEG signals; (ii) feature extraction by means of the Discrete Fourier and Wavelet Transforms; and (iii) classification with tree-based supervised methods. RESULTS: By applying our procedure, we are able to extract reliable human-interpretable classification models that automatically assign the patients to their class. In particular, by exploiting Wavelet feature extraction we achieve 83%, 92%, and 79% accuracy when dealing with HC vs AD, HC vs MCI, and MCI vs AD classification problems, respectively. CONCLUSIONS: Finally, by comparing the classification performance of the two feature extraction methods, we find that Wavelet analysis outperforms Fourier. Hence, we suggest it in combination with supervised methods for automatic patient classification based on EEG signals to aid the medical diagnosis of dementia.
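Step (ii) of the procedure, feature extraction by means of the Discrete Fourier Transform, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the band-power feature, the function names, the 8-13 Hz and 13-30 Hz band limits (the conventional alpha and beta ranges) and the synthetic sine wave are all assumptions made here.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) Discrete Fourier Transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def band_power(signal, sampling_rate, low_hz, high_hz):
    """Sum of squared DFT magnitudes for bins with frequency in [low_hz, high_hz)."""
    n = len(signal)
    spectrum = dft(signal)
    power = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        freq = k * sampling_rate / n
        if low_hz <= freq < high_hz:
            power += abs(spectrum[k]) ** 2
    return power


# Synthetic 10 Hz sine sampled at 128 Hz for one second (not real EEG data).
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
alpha = band_power(signal, fs, 8, 13)   # band that contains the 10 Hz component
beta = band_power(signal, fs, 13, 30)   # band that does not
```

For this synthetic signal nearly all of the power falls in the 8-13 Hz band. Features of this kind, computed per channel and per band, are one plausible input for the tree-based classifiers of step (iii).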


Subjects
Alzheimer Disease/diagnosis, Classification/methods, Cognitive Dysfunction/diagnosis, Electroencephalography/methods, Computer-Assisted Signal Processing, Aged, Aged 80 and over, Alzheimer Disease/physiopathology, Cognitive Dysfunction/physiopathology, Female, Humans, Male, Middle Aged
11.
Genes Nutr ; 13: 12, 2018.
Article in English | MEDLINE | ID: mdl-29736190

ABSTRACT

BACKGROUND: The multidisciplinary nature of nutrition research is one of its main strengths. At the same time, however, it presents a major obstacle to integrate data analysis, especially for the terminological and semantic interpretations that specific research fields or communities are used to. To date, a proper ontology to structure and formalize the concepts used for the description of nutritional studies is still lacking. RESULTS: We have developed the Ontology for Nutritional Studies (ONS) by harmonizing selected pre-existing de facto ontologies with novel health and nutritional terminology classifications. The ONS is the result of a scholarly consensus of 51 research centers in nine European countries. The ontology classes and relations are commonly encountered while conducting, storing, harmonizing, integrating, describing, and searching nutritional studies. The ONS facilitates the description and specification of complex nutritional studies as demonstrated with two application scenarios. CONCLUSIONS: The ONS is the first systematic effort to provide a solid and extensible formal ontology framework for nutritional studies. Integration of new information can be easily achieved by the addition of extra modules (i.e., nutrigenomics, metabolomics, nutrikinetics, and quality appraisal). The ONS provides a unified and standardized terminology for nutritional studies as a resource for nutrition researchers who might not necessarily be familiar with ontologies and standardization concepts.

12.
Math Biosci Eng ; 15(1): 181-207, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29161832

ABSTRACT

In the present paper we propose a simple time-varying ODE model to describe the evolution of the HIV epidemic in Italy. The model considers a single population of susceptibles, without distinction of high-risk groups within the general population, and accounts for the presence of immigration and emigration, modelling their effects on both the general demography and the dynamics of the infected subpopulations. To represent the intra-host disease progression, the untreated infected population is distributed over four compartments in cascade according to the CD4 counts. A further compartment is added to represent infected people under antiretroviral therapy. The per capita exit rate from treatment, due to voluntary interruption or failure of therapy, is assumed variable with time. The values of the model parameters not reported in the literature are assessed by fitting available epidemiological data over the decade 2003-2012. Predictions until year 2025 are computed, highlighting the impact on public health of the early initiation of antiretroviral therapy. The benefits of this change in treatment eligibility consist in reducing the HIV incidence rate, the rate of new AIDS cases, and the rate of death from AIDS. Analytical results about properties of the model in its time-invariant form are provided; in particular, the global stability of the equilibrium points is established both in the absence and in the presence of infected individuals among immigrants.
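The cascade structure described in this abstract (four untreated compartments ordered by CD4 count, plus one treated compartment) can be sketched with a deliberately reduced toy model. It is not the published model: it omits susceptibles, migration, deaths, exit from treatment and time-varying rates, and every rate value and function name below is hypothetical.

```python
def euler_step(state, dt, progression, uptake):
    """One explicit-Euler step for a toy cascade I1 -> I2 -> I3 -> I4 with treatment T.

    state is [I1, I2, I3, I4, T]; rates are per year and purely illustrative.
    """
    infected = state[:4]
    d = [0.0] * 5
    for j in range(4):
        d[j] -= uptake * infected[j]            # start of therapy
        d[4] += uptake * infected[j]
        if j < 3:
            d[j] -= progression * infected[j]   # progression to the next CD4 stage
            d[j + 1] += progression * infected[j]
    return [s + dt * ds for s, ds in zip(state, d)]


def simulate(state, years, dt=0.01, progression=0.3, uptake=0.2):
    """Integrate the toy model for the given number of years."""
    for _ in range(round(years / dt)):
        state = euler_step(state, dt, progression, uptake)
    return state


# Hypothetical initial condition: 1000 people in the first untreated stage.
final = simulate([1000.0, 0.0, 0.0, 0.0, 0.0], years=10)
```

Because the toy model has no inflow or outflow, the total population is conserved, which gives a quick sanity check on the integration. A faithful implementation would add the demographic and treatment-exit terms and calibrate the rates against epidemiological data, as the abstract describes.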


Subjects
Acquired Immunodeficiency Syndrome/epidemiology, Acquired Immunodeficiency Syndrome/transmission, Anti-Retroviral Agents/pharmacology, Epidemics, HIV Infections/epidemiology, HIV Infections/transmission, Adult, Aged, Algorithms, Highly Active Antiretroviral Therapy, CD4-Positive T-Lymphocytes/cytology, Disease Progression, Emigrants and Immigrants, Female, Humans, Incidence, Italy/epidemiology, Male, Middle Aged, Theoretical Models, Public Health, Reproducibility of Results, Time Factors
13.
Oncotarget ; 8(61): 103340-103363, 2017 Nov 28.
Article in English | MEDLINE | ID: mdl-29262566

ABSTRACT

Increasing evidence points to a key role played by epithelial-mesenchymal transition (EMT) in cancer progression and drug resistance. In this study, we used wet-lab and in silico approaches to investigate whether EMT phenotypes are associated with resistance to targeted therapy in a non-small cell lung cancer model system harboring activating mutations of the epidermal growth factor receptor. The combination of different analysis techniques allowed us to describe intermediate/hybrid and complete EMT phenotypes in HCC827- and HCC4006-derived drug-resistant human cancer cell lines, respectively. Interestingly, intermediate/hybrid EMT phenotypes, collective cell migration and increased stem-like ability are associated with resistance to the epidermal growth factor receptor inhibitor erlotinib in HCC827-derived cell lines. Moreover, the use of three complementary approaches for gene expression analysis supported the identification of a small EMT-related gene list, which may have otherwise been overlooked by standard stand-alone methods for gene expression analysis.

14.
BioData Min ; 9: 38, 2016.
Article in English | MEDLINE | ID: mdl-27980679

ABSTRACT

BACKGROUND: Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignments-, phylogenetic trees-, statistical- and character-based methods. RESULTS: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences. CONCLUSIONS: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived by the relative location and abundance of such subsequences along the different regions.

15.
Bioinformatics ; 32(5): 697-704, 2016 03 01.
Article in English | MEDLINE | ID: mdl-26519501

ABSTRACT

MOTIVATION: Nowadays, knowledge extraction methods for Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State of the art algorithms compute a single classification model that contains few features (genes). On the contrary, our goal is to elicit a higher amount of knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class. RESULTS: We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and performs the classification procedure again until a stopping criterion is verified. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. AVAILABILITY AND IMPLEMENTATION: dmb.iasi.cnr.it/camur.php CONTACT: emanuel@iasi.cnr.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
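The iterative idea summarized in RESULTS (learn a rule-based model, enumerate the power set of the genes it uses, remove each combination from the data, learn again) can be sketched in a simplified, single-level form. This is not the CAMUR implementation: `genes_used_by_model` is a hypothetical stand-in for a rule learner, `max_models` is a stand-in for the stopping criterion, and the toy learner and gene names are invented for the example.

```python
from itertools import chain, combinations


def non_empty_subsets(genes):
    """All non-empty subsets of a gene collection, smallest first."""
    items = sorted(genes)
    return list(
        chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))
    )


def explore_models(all_genes, genes_used_by_model, max_models=10):
    """Collect alternative gene sets by re-running a learner with subsets removed.

    genes_used_by_model is a hypothetical callable: given the set of genes still
    available, it returns the set of genes appearing in the extracted rules.
    """
    first = frozenset(genes_used_by_model(set(all_genes)))
    models = [first]
    for subset in non_empty_subsets(first):
        if len(models) >= max_models:       # stand-in for the stopping criterion
            break
        remaining = set(all_genes) - set(subset)
        model = frozenset(genes_used_by_model(remaining))
        if model and model not in models:
            models.append(model)
    return models


# Toy stand-in for a rule learner: it picks the two alphabetically first genes.
def toy_learner(available):
    return set(sorted(available)[:2])


models = explore_models({"G1", "G2", "G3", "G4"}, toy_learner)
```

With the toy learner, removing each subset of the first model's genes forces a different pair of genes to be selected, so the loop returns four distinct gene sets. A real rule learner would also report the accuracy of each model so that only equivalent ones are kept.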


Subjects
Neoplasms, Algorithms, High-Throughput Nucleotide Sequencing, Humans, RNA, RNA Sequence Analysis
16.
J Theor Biol ; 391: 13-20, 2016 Feb 21.
Article in English | MEDLINE | ID: mdl-26656109

ABSTRACT

Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe at present in nature. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that it could be reasonable to have a well tuned amino acid sequence indistinguishable from a random one. In order to study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.
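One simple way to obtain random sequences that keep the length and amino acid composition of a natural protein is to shuffle its residues. The abstract does not state how the random set was generated, so the sketch below is only an assumption. The default of ten copies mirrors the 1047 to 10,470 ratio reported above, and the example sequence is made up.

```python
import random


def shuffled_copies(sequence, copies=10, seed=0):
    """Random permutations of a sequence; each keeps its length and residue counts."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    residues = list(sequence)
    return ["".join(rng.sample(residues, len(residues))) for _ in range(copies)]


# Made-up peptide used only to demonstrate the call.
natural = "MKTAYIAKQR"
randomized = shuffled_copies(natural)
```

Shuffling preserves single-residue frequencies exactly while destroying the order of residues, which is why it is a common null model when the quantity of interest is the association between pairs of amino acids.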


Subjects
Molecular Evolution, Genetic Models, Proteins/chemistry, Proteins/genetics, Amino Acid Sequence
17.
BioData Min ; 8: 39, 2015.
Article in English | MEDLINE | ID: mdl-26664519

ABSTRACT

Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences. In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules). We apply LAF successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just few k-mers. State of the art methods to adjust the frequency of k-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
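The k-mer frequencies on which alignment-free methods rely can be computed in a few lines. The sketch below is a generic illustration and not the LAF code; the function name and the short example sequence are assumptions.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer (overlapping substring of length k)."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Short made-up sequence; real genomes would be read from FASTA files.
freqs = kmer_frequencies("ACGTACGA", 2)
```

A rule of the kind described in the abstract would then compare a handful of these frequencies against thresholds, for example "if the frequency of a given k-mer exceeds some value, assign the sample to a given taxon", with the k-mers and thresholds learned from labeled genomes.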

18.
BMC Neurosci ; 16: 28, 2015 Apr 29.
Article in English | MEDLINE | ID: mdl-25925689

ABSTRACT

BACKGROUND: Many approaches exist to integrate protein-protein interaction data with other sources of information, most notably gene co-expression data, to obtain information on network dynamics. Of particular interest are groups of interacting gene products that form a protein complex. We applied new tools to characterize the pathogenesis and dynamic events of an Alzheimer's-like neurodegenerative model, the AD11 mouse, which expresses an anti-NGF monoclonal antibody. The goal was to quantify the impact of neurodegeneration on protein complexes by measuring the correlation between gene expression data with different metrics. RESULTS: Data were extracted from the gene expression profile of AD11 brain, obtained by Agilent microarray at 1, 3, 6, and 15 months of age. For genes coding for proteins in complexes, the correlation matrix of pairwise expression was computed. The dynamics between correlation matrices at different time points were evaluated with two measures: a paired t-test between average correlation levels and a normalized Euclidean distance with z-score. We unveiled a differential wiring of interactions in a set of complexes whose network structure discriminates between transgenic and control mice. Furthermore, we analyzed the dynamics of gene expression values by looking at changes in gene-to-gene correlation over time and identified those complexes that exhibit different time-dependent behaviour in transgenic and control mice. The most significant changes in correlation dynamics are concentrated in the early stage of disease, with higher correlation in AD11 mice than in controls. Many complexes go through dynamic changes over time, pointing to the role of the dysfunctional immunoproteasome as an early event in neurodegenerative disease. Furthermore, this analysis highlights key events in the neurodegeneration process of the AD11 model by identifying significant differences in the co-expression values of other complexes, such as the parvulin complex, which has a role in protein misfolding and proteostasis, and of complexes involved in transcriptional mechanisms. CONCLUSIONS: We have proposed a novel approach to analyze the network structure of protein complexes, using two different measures to evaluate the dynamics of gene-gene correlation matrices from gene expression profiles. The methodology made it possible to investigate the re-organization of interactions within protein complexes in the AD11 model of neurodegeneration.
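To make the comparison of correlation matrices more concrete, here is a minimal sketch that computes pairwise Pearson correlations for the genes of one complex at two time points and a Euclidean distance between the two sets of correlations. The normalization used here (dividing by the number of gene pairs) is an assumption, since the abstract does not define it, and the z-score and paired t-test steps are not reproduced. Gene names and expression values are made up.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(expression):
    """Map each gene pair to the correlation of the two expression profiles.

    expression: dict mapping gene name -> list of values, one per sample.
    """
    return {(g, h): pearson(expression[g], expression[h])
            for g, h in combinations(sorted(expression), 2)}


def normalized_distance(corr_a, corr_b):
    """Euclidean distance between two correlation sets, scaled by the pair count.

    Assumption: this normalization is illustrative, not the paper's definition.
    """
    pairs = sorted(corr_a)
    squared = sum((corr_a[p] - corr_b[p]) ** 2 for p in pairs)
    return math.sqrt(squared / len(pairs))


# Hypothetical three-gene complex measured at two time points.
early = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
late = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 2, 3, 4]}
rewiring = normalized_distance(pairwise_correlations(early),
                               pairwise_correlations(late))
```

A larger value of `rewiring` indicates that the co-expression pattern inside the complex changed more between the two time points.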


Subjects
Alzheimer Disease/metabolism , Brain/metabolism , Aging/metabolism , Animals , Databases, Protein , Disease Models, Animal , Female , Gene Expression , Gene Expression Profiling/methods , Mice, Transgenic , Microarray Analysis , Time Factors
19.
BMC Res Notes ; 7: 869, 2014 Dec 03.
Article in English | MEDLINE | ID: mdl-25465386

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) machines extract a large number of short DNA fragments (reads) from a biological sample. These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, and mutation analysis. METHODS: We propose a method to evaluate the similarity between reads. This method does not rely on aligning the reads; it is based on the distance between the frequencies of their fixed-length substrings (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and BLAST. The comparison rests on a simple assumption: the most accurate distance is obtained when the reference sequence is known in advance. Therefore, we first align the reads on the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The ability to reproduce the ideal distance correctly is evaluated over samples of read pairs from Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors cross-validated over different samples. RESULTS: We present experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time-consuming alignment-based distances. CONCLUSIONS: Alignment-free distances may be used effectively for read comparison and may provide a significant speed-up in several NGS-based processes (e.g., DNA assembly, read classification).
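The two quantities compared in this abstract can be sketched in a few lines: an alignment-free distance between the k-mer frequency profiles of two reads, and the overlap of two reads once their positions on the reference are known. The abstract does not state which distance between frequency vectors was used, so the Euclidean distance below is an assumption, as are the function names, the default k, and the example coordinates.

```python
import math
from collections import Counter


def kmer_profile(read, k):
    """Relative k-mer frequencies of a read (empty if the read is shorter than k)."""
    total = len(read) - k + 1
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def alignment_free_distance(read_a, read_b, k=3):
    """Euclidean distance between the k-mer profiles of two reads.

    Assumption: Euclidean distance stands in for the unspecified measure.
    """
    pa, pb = kmer_profile(read_a, k), kmer_profile(read_b, k)
    kmers = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(m, 0.0) - pb.get(m, 0.0)) ** 2 for m in kmers))


def reference_overlap(start_a, len_a, start_b, len_b):
    """Number of reference positions covered by both aligned reads."""
    end = min(start_a + len_a, start_b + len_b)
    return max(0, end - max(start_a, start_b))


# Hypothetical reads and reference coordinates.
distance = alignment_free_distance("ACGTACGTAC", "CGTACGTACG")
overlap = reference_overlap(0, 100, 60, 100)
```

Under the abstract's assumption, a threshold on `distance` would be judged by how well it predicts whether `overlap` is large for the same pair of reads.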


Subjects
Algorithms , DNA, Bacterial/genetics , DNA, Fungal/genetics , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , DNA, Bacterial/chemistry , DNA, Fungal/chemistry , Escherichia coli/genetics , High-Throughput Nucleotide Sequencing , Humans , Saccharomyces cerevisiae/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods
20.
Genomics ; 104(2): 79-86, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25058025

ABSTRACT

Little work has been done on the composition of conserved non-coding elements (CNEs), which are identified by comparing two or more genomes and are found in all metazoan genomes. Here we present an analysis of CNEs with a methodology that takes into account word occurrence at various length scales, in the form of a feature vector representation, together with rule-based classifiers. We apply our approach to both protein-coding exons and CNEs originating from human, insect (Drosophila melanogaster), and worm (Caenorhabditis elegans) genomes, which are either identified in the present study or obtained from the literature. Alignment-free feature vector representation of sequences combined with rule-based classification methods leads to successful classification of the different CNE classes. Biologically meaningful results are derived by comparison with the genomic signatures approach, and classification rates for a variety of functional elements of the genomes, along with surrogates, are presented.


Subjects
Caenorhabditis elegans/genetics , DNA, Intergenic/genetics , Drosophila melanogaster/genetics , Sequence Analysis, DNA/methods , Animals , Conserved Sequence/genetics , Evolution, Molecular , Exons , Genomics , Humans , Sequence Alignment