Results 1 - 20 of 28
1.
Int J Mol Sci ; 25(9)2024 May 03.
Article in English | MEDLINE | ID: mdl-38732207

ABSTRACT

Prediction of binding sites for transcription factors is important to understand how the latter regulate gene expression and how this regulation can be modulated for therapeutic purposes. A considerable number of studies address this issue with different approaches, Machine Learning being one of the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method to embed the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, empowered by bidirectional long short-term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and HeLa). The results show that the proposed method can predict transcription factor binding sites (TFBS) within the five different cell lines very well; moreover, cross-cell predictions provide satisfactory results as well. Experiments conducted across cell lines are reinforced by the analysis of five additional lines used only to test the model trained using the others. The results confirm that prediction performance across cell lines remains very high, allowing an extensive cross-transcription factor analysis to be performed from which several indications of interest for molecular biology may be drawn.


Subject(s)
Deep Learning; Transcription Factors; Humans; Transcription Factors/metabolism; Transcription Factors/genetics; Binding Sites; Computational Biology/methods; HeLa Cells; Protein Binding; Chromatin Immunoprecipitation Sequencing/methods; Cell Line
2.
Biometrics ; 78(4): 1592-1603, 2022 12.
Article in English | MEDLINE | ID: mdl-34437713

ABSTRACT

Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.


Subject(s)
Pediatric Obesity; Child; Humans; Algorithms; Sample Size; Probability
3.
BMC Bioinformatics ; 19(Suppl 10): 354, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30367574

ABSTRACT

BACKGROUND: The rapid growth of Next Generation Sequencing data currently demands new knowledge extraction methods. In particular, the RNA sequencing gene expression experimental technique stands out for case-control studies on cancer, which can be addressed with supervised machine learning techniques able to extract human-interpretable models composed of genes, and their relation to the investigated disease. State-of-the-art rule-based classifiers are designed to extract a single classification model, possibly composed of few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could be potentially involved in the analyzed tumor. This comprehensive and open access knowledge base is required to disseminate novel insights about cancer. RESULTS: We propose CamurWeb, a new method and web-based software that is able to extract multiple and equivalent classification models in the form of logic formulas ("if then" rules) and to create a knowledge base of these rules that can be queried and analyzed. The method is based on an iterative classification procedure and an adaptive feature elimination technique that enables the computation of many rule-based models related to the cancer under study. Additionally, CamurWeb includes a user-friendly interface for running the software, querying the results, and managing the performed experiments. The user can create her profile, upload her gene expression data, run the classification analyses, and interpret the results with predefined queries. In order to validate the software we apply it to all publicly available RNA sequencing datasets from The Cancer Genome Atlas database, obtaining a large open access knowledge base about cancer. CamurWeb is available at http://bioinformatics.iasi.cnr.it/camurweb .
CONCLUSIONS: The experiments prove the validity of CamurWeb, obtaining many classification models and thus several genes that are associated with 21 different cancer types. Finally, the comprehensive knowledge base about cancer and the software tool are released online; interested researchers have free access to them for further studies and to design biological experiments in cancer research.


Subject(s)
Gene Expression Regulation, Neoplastic; Knowledge Bases; Neoplasms/genetics; Software; Base Sequence; Genes, Neoplasm; Genome, Human; Humans; Sequence Analysis, RNA
4.
BMC Med Inform Decis Mak ; 18(1): 35, 2018 05 31.
Article in English | MEDLINE | ID: mdl-29855305

ABSTRACT

BACKGROUND: Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by a progressive dementia, for which no cure is currently known. An early detection of patients affected by AD can be obtained by analyzing their electroencephalography (EEG) signals, which show a reduction of the complexity, a perturbation of the synchrony, and a slowing down of the rhythms. METHODS: In this work, we apply a procedure that exploits feature extraction and classification techniques to EEG signals, whose aim is to distinguish patients affected by AD from those affected by Mild Cognitive Impairment (MCI) and from healthy control (HC) samples. Specifically, we perform a time-frequency analysis by applying both the Fourier and Wavelet Transforms on 109 samples belonging to AD, MCI, and HC classes. The classification procedure is designed with the following steps: (i) preprocessing of EEG signals; (ii) feature extraction by means of the Discrete Fourier and Wavelet Transforms; and (iii) classification with tree-based supervised methods. RESULTS: By applying our procedure, we are able to extract reliable human-interpretable classification models that allow patients to be automatically assigned to their class. In particular, by exploiting a Wavelet feature extraction we achieve 83%, 92%, and 79% accuracy when dealing with HC vs AD, HC vs MCI, and MCI vs AD classification problems, respectively. CONCLUSIONS: Finally, by comparing the classification performances with both feature extraction methods, we find that Wavelet analysis outperforms Fourier analysis. Hence, we suggest it in combination with supervised methods for automatic patient classification based on EEG signals, to aid the medical diagnosis of dementia.


Subject(s)
Alzheimer Disease/diagnosis; Classification/methods; Cognitive Dysfunction/diagnosis; Electroencephalography/methods; Signal Processing, Computer-Assisted; Aged; Aged, 80 and over; Alzheimer Disease/physiopathology; Cognitive Dysfunction/physiopathology; Female; Humans; Male; Middle Aged
5.
Bioinformatics ; 32(5): 697-704, 2016 03 01.
Article in English | MEDLINE | ID: mdl-26519501

ABSTRACT

MOTIVATION: Nowadays, knowledge extraction methods for Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). On the contrary, our goal is to elicit a larger amount of knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class. RESULTS: We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and performs the classification procedure again until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. AVAILABILITY AND IMPLEMENTATION: dmb.iasi.cnr.it/camur.php CONTACT: emanuel@iasi.cnr.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
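
The iterative loop described in this abstract can be illustrated with a minimal Python sketch. It is not the CAMUR implementation: the rule learner is a hypothetical one-gene threshold rule, features are eliminated one at a time rather than by power-set combinations, and the names `best_rule`, `extract_models`, and the `min_accuracy` stopping criterion are assumptions made for illustration.

```python
def best_rule(samples, labels, features):
    """Find the single-feature threshold rule with the highest training accuracy."""
    best = None  # (accuracy, feature, threshold, case_if_above)
    for feature in features:
        for threshold in sorted({s[feature] for s in samples}):
            for case_if_above in (True, False):
                predicted = [int((s[feature] >= threshold) == case_if_above)
                             for s in samples]
                accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
                if best is None or accuracy > best[0]:
                    best = (accuracy, feature, threshold, case_if_above)
    return best


def extract_models(samples, labels, min_accuracy=0.9):
    """Repeat: fit a rule, record it, remove its feature, until accuracy drops."""
    remaining = sorted(samples[0])  # feature (gene) names in a fixed order
    models = []
    while remaining:
        accuracy, feature, threshold, case_if_above = best_rule(samples, labels, remaining)
        if accuracy < min_accuracy:  # stopping criterion (hypothetical choice)
            break
        models.append((feature, threshold, case_if_above, accuracy))
        remaining.remove(feature)  # simplified elimination: one gene at a time
    return models


# Toy expression data: G1 and G2 separate cases (1) from controls (0); G3 does not.
samples = [
    {"G1": 5, "G2": 9, "G3": 1},
    {"G1": 6, "G2": 8, "G3": 2},
    {"G1": 1, "G2": 3, "G3": 1},
    {"G1": 2, "G2": 2, "G3": 2},
]
labels = [1, 1, 0, 0]
models = extract_models(samples, labels)
```

On this toy data the loop yields two equivalent models, one per informative gene, and stops when only the uninformative gene is left.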


Subject(s)
Neoplasms; Algorithms; High-Throughput Nucleotide Sequencing; Humans; RNA; Sequence Analysis, RNA
6.
J Theor Biol ; 391: 13-20, 2016 Feb 21.
Article in English | MEDLINE | ID: mdl-26656109

ABSTRACT

Random mutations and natural selection have driven the evolution of protein amino acid sequences that we observe at present in nature. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that it could be reasonable to have a well-tuned amino acid sequence indistinguishable from a random one. In order to study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.
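
One way to generate random sequences that preserve length and amino acid composition is to shuffle each natural sequence. The sketch below assumes this shuffling approach and assumes ten random sequences per natural one (suggested only by the ratio 10,470 / 1047); the abstract does not state the authors' actual procedure, and `shuffled_decoys` is a hypothetical helper name.

```python
import random


def shuffled_decoys(sequence, n_decoys=10, seed=0):
    """Return n_decoys random permutations of a protein sequence.

    Shuffling keeps the length and the amino acid counts identical
    to those of the input sequence.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    decoys = []
    for _ in range(n_decoys):
        residues = list(sequence)
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys


# Hypothetical short peptide used only as an illustration.
decoys = shuffled_decoys("MKTAYIAKQR")
```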


Subject(s)
Evolution, Molecular; Models, Genetic; Proteins/chemistry; Proteins/genetics; Amino Acid Sequence
7.
BMC Neurosci ; 16: 28, 2015 Apr 29.
Article in English | MEDLINE | ID: mdl-25925689

ABSTRACT

BACKGROUND: Many approaches exist to integrate protein-protein interaction data with other sources of information, most notably with gene co-expression data, to obtain information on network dynamics. It is of interest to look at groups of interacting gene products that form a protein complex. We were interested in applying new tools to the characterization of pathogenesis and dynamic events of an Alzheimer's-like neurodegenerative model, the AD11 mice, expressing an anti-NGF monoclonal antibody. The goal was to quantify the impact of neurodegeneration on protein complexes, by measuring the correlation between gene expression data by different metrics. RESULTS: Data were extracted from the gene expression profile of AD11 brain, obtained by Agilent microarray, at 1, 3, 6, 15 months of age. For genes coding for proteins in complexes, the correlation matrix of pairwise expression was computed. The dynamics between correlation matrices at different time points was evaluated with two measures: a paired t-test between average correlation levels and a normalized Euclidean distance with z-score. We unveiled a differential wiring of interactions in a set of complexes, whose network structure discriminates between transgenic and control mice. Furthermore, we analyzed the dynamics of gene expression values, by looking at changes in gene-to-gene correlation over time and identified those complexes that exhibit a different time-dependent behaviour between transgenic and controls. The most significant changes in correlation dynamics are concentrated in the early stage of disease, with higher correlation in AD11 mice compared to controls. Many complexes go through dynamic changes over time, showing the role of the dysfunctional immunoproteasome as an early neurodegenerative disease event.
Furthermore, this analysis shows key events in the neurodegeneration process of the AD11 model, by identifying significant differences in co-expression values of other complexes, such as parvulin complex, with a role in protein misfolding and proteostasis, and of complexes involved in transcriptional mechanisms. CONCLUSIONS: We have proposed a novel approach to analyze the network structure of protein complexes, by two different measures to evaluate the dynamics of gene-gene correlation matrices from gene expression profiles. The methodology was able to investigate the re-organization of interactions within protein complexes in the AD11 model of neurodegeneration.


Subject(s)
Alzheimer Disease/metabolism; Brain/metabolism; Aging/metabolism; Animals; Databases, Protein; Disease Models, Animal; Female; Gene Expression; Gene Expression Profiling/methods; Mice, Transgenic; Microarray Analysis; Time Factors
8.
Genomics ; 104(2): 79-86, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25058025

ABSTRACT

Little work has been done on the analysis of the composition of conserved non-coding elements (CNEs) that are identified by comparisons of two or more genomes and are found to exist in all metazoan genomes. Here we present the analysis of CNEs with a methodology that takes into account word occurrence at various length scales in the form of feature vector representation and rule-based classifiers. We implement our approach on both protein-coding exons and CNEs, originating from human, insect (Drosophila melanogaster) and worm (Caenorhabditis elegans) genomes, that are either identified in the present study or obtained from the literature. Alignment-free feature vector representation of sequences combined with rule-based classification methods leads to successful classification of the different CNE classes. Biologically meaningful results are derived by comparison with the genomic signatures approach, and classification rates for a variety of functional elements of the genomes along with surrogates are presented.
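
An alignment-free feature vector based on word occurrence at several length scales can be sketched as normalized k-mer frequencies concatenated over a few values of k. The choice of k = 1, 2, 3, the normalization, and the function names are assumptions for illustration; the abstract does not give the exact representation used.

```python
from itertools import product


def kmer_vector(seq, k, alphabet="ACGT"):
    """Relative frequency of every length-k word over the alphabet, in a fixed order."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words with ambiguous symbols such as N
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[w] / total for w in words]


def multiscale_vector(seq, ks=(1, 2, 3)):
    """Concatenate k-mer frequency vectors for several word lengths."""
    vector = []
    for k in ks:
        vector.extend(kmer_vector(seq, k))
    return vector


features = multiscale_vector("ACGTACGT")
```

The resulting fixed-length vector (4 + 16 + 64 entries here) can be passed to any classifier, rule-based or otherwise, without aligning the sequences.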


Subject(s)
Caenorhabditis elegans/genetics; DNA, Intergenic/genetics; Drosophila melanogaster/genetics; Sequence Analysis, DNA/methods; Animals; Conserved Sequence/genetics; Evolution, Molecular; Exons; Genomics; Humans; Sequence Alignment
9.
Virol J ; 9: 58, 2012 Mar 02.
Article in English | MEDLINE | ID: mdl-22385517

ABSTRACT

BACKGROUND: Differences in genomic sequences are crucial for the classification of viruses into different species. In this work, viral DNA sequences belonging to the human polyomaviruses BKPyV, JCPyV, KIPyV, WUPyV, and MCPyV are analyzed using a logic data mining method in order to identify the nucleotides which are able to distinguish the five different human polyomaviruses. RESULTS: The approach presented in this work is successful as it discovers several logic rules that effectively characterize the five studied polyomaviruses. The identified logic rules are able to separate precisely one viral type from the others and to assign an unknown DNA sequence to one of the five analyzed polyomaviruses. CONCLUSIONS: The data mining analysis is performed by considering the complete sequences of the viruses and the sequences of the different gene regions separately, obtaining in both cases extremely high correct recognition rates.


Subject(s)
Computational Biology/methods; DNA, Viral/chemistry; Data Mining; Polyomavirus/classification; Polyomavirus/genetics; Base Sequence; Humans
10.
Virus Res ; 317: 198814, 2022 08.
Article in English | MEDLINE | ID: mdl-35588940

ABSTRACT

Adaptive immune response is triggered when specific pathogen peptides called epitopes are recognised as exogenous according to the paradigm of self/non-self. To be recognized by immune cells, epitopes have to be exposed (presented) on the surface of the cell. Predicting if a peptide is exposed is important to shed light on the rules that govern immune response and, thus, identify potential targets and design vaccines and drugs. We focused on peptides exposed on the cell surface and made accessible to the immune system through the MHC Class I complex. Before this can happen, three successive selection steps have to take place: a) Proteasome cleavage, b) TAP Transport, and c) binding to MHC Class I. Starting from a set of 211 human-host reference viruses, we computed the set of unique peptides occurring in the corresponding proteomes. Then, we obtained the probability values of Proteasome Cleavage, TAP Transport and Binding to MHC Class I associated with those peptides through established prediction software tools. Such values were analysed in conjunction with two other features that could play a major role: the distance from self, strictly linked to the concept of nullomers, and the sequence entropy, measuring the complexity of the peptide amino acid composition. The analysis confirmed and extended previous results on a larger, more significant and consistent data set; we showed that the higher the distance from self, the higher the score of TAP Transport and binding to MHC Class I; no significant association was instead found between distance from self and Proteasome Cleavage. Additionally, amino acid peptide composition entropy was significantly associated with the other features. In particular, higher entropies were linked with higher scores of Proteasome Cleavage, TAP Transport, Binding to MHC Class I, and higher distance from self.
The relationship among the three selection steps provided evidence of a tight inter-correlation, clearly suggesting it could be the product of a co-evolutionary process. We believe that these results give new insights into the complex processes that regulate peptide presentation through MHC Class I, and unveil the mechanisms that allow the immune system to distinguish self and viral non-self peptides.
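
The sequence entropy feature mentioned above can be illustrated with the Shannon entropy of a peptide's amino acid composition. This is a sketch under the assumption that the standard Shannon formula in bits is meant; the abstract does not specify the exact definition, and `sequence_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def sequence_entropy(peptide):
    """Shannon entropy (bits) of the amino acid composition of a peptide."""
    counts = Counter(peptide)
    n = len(peptide)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


low = sequence_entropy("AAAA")   # a single residue type: no compositional complexity
high = sequence_entropy("ACDE")  # four equally frequent residue types
```

Under this definition a peptide made of one repeated residue has entropy 0, and entropy grows as the composition becomes more varied.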


Subject(s)
Proteasome Endopeptidase Complex; Viruses; ATP-Binding Cassette Transporters/genetics; Amino Acids; Antigen Presentation; Entropy; Epitopes; Histocompatibility Antigens Class I/metabolism; Humans; Peptides; Proteasome Endopeptidase Complex/metabolism; Viruses/metabolism
11.
J Comput Graph Stat ; 30(3): 566-577, 2021.
Article in English | MEDLINE | ID: mdl-36406776

ABSTRACT

Recent advances in mathematical programming have made Mixed Integer Optimization a competitive alternative to popular regularization methods for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. Here we propose MIP-BOOST, a revision of standard Mixed Integer Programming feature selection that reduces the computational burden of tuning the critical sparsity bound parameter and improves performance in the presence of feature collinearity and of signals that vary in nature and strength. The final outcome is a more efficient and effective L0 Feature Selection method for applications of realistic size and complexity, grounded on rigorous cross-validation tuning and exact optimization of the associated Mixed Integer Program. Computational viability and improved performance in realistic scenarios are achieved through three independent but synergistic proposals. Supplementary materials including additional results, pseudocode, and computer code are available online.

12.
Math Med Biol ; 37(2): 183-211, 2020 05 29.
Article in English | MEDLINE | ID: mdl-31162541

ABSTRACT

The present study aims to clarify the role of the fraction of patients under antiretroviral therapy (ART) achieving viral suppression (VS) (i.e. having plasma viral load below the detectability threshold) on the human immunodeficiency virus (HIV) epidemic in Italy. Based on the hypothesis that VS makes the virus untransmittable, we extend a previous model and we develop a time-varying ordinary differential equation model with immigration and treatment, where the naive and non-naive populations of infected are distinguished, and different compartments account for treated subjects virally suppressed and not suppressed. Moreover, naive and non-naive individuals with acquired immune deficiency syndrome (AIDS) are considered separately. Clinical data stored in the nationwide database Antiviral Response Cohort Analysis are used to reconstruct the history of the fraction of virally suppressed patients since highly active ART introduction, as well as to assess some model parameters. Other parameters are set according to the literature and the final model calibration is obtained by fitting epidemic data over the years 2003-2015. Predictions on the evolution of the HIV epidemic up to the end of 2035 are made assuming different future trends of the fraction of virally suppressed patients and different eligibility criteria for treatment. Increasing the VS fraction is found to reduce the incidence, the new cases of AIDS and the deaths from AIDS per year, especially in combination with early ART initiation. The asymptotic properties of a time-invariant formulation of the model are studied, and the existence and global asymptotic stability of a unique positive equilibrium are proved.


Subject(s)
Anti-HIV Agents/therapeutic use; HIV Infections/drug therapy; Models, Biological; Antiretroviral Therapy, Highly Active; Computational Biology; Computer Simulation; Databases, Factual; Epidemics/statistics & numerical data; HIV/drug effects; HIV/physiology; HIV Infections/epidemiology; HIV Infections/virology; Humans; Incidence; Italy/epidemiology; Mathematical Concepts; RNA, Viral/blood; Time Factors; Viral Load/drug effects; Viremia/drug therapy; Viremia/virology; Virus Replication/drug effects
13.
BMC Bioinformatics ; 10 Suppl 14: S7, 2009 Nov 10.
Article in English | MEDLINE | ID: mdl-19900303

ABSTRACT

BACKGROUND: According to many field experts, specimen classification based on morphological keys needs to be supported with automated techniques based on the analysis of DNA fragments. The most successful results in this area are those obtained from a particular fragment of mitochondrial DNA, the gene cytochrome c oxidase I (COI) (the "barcode"). Since 2004 the Consortium for the Barcode of Life (CBOL) has promoted the collection of barcode specimens and the development of methods to analyze the barcode for several tasks, among which is the identification of rules to correctly classify an individual into its species by reading its barcode. RESULTS: We adopt a Logic Mining method based on two optimization models and present the results obtained on two datasets where a number of COI fragments are used to describe the individuals that belong to different species. The proposed method exhibits high correct recognition rates on a training-testing split of the available data using a small proportion of the information available (e.g., correct recognition approx. 97% when only 20 sites of the 648 available are used). The method is able to provide compact formulas on the values (A, C, G, T) at the selected sites that synthesize the characteristics of each species, which is relevant information for taxonomists. CONCLUSION: We have presented a Logic Mining technique designed to analyze barcode data and to provide detailed output of interest to the taxonomists and the barcode community represented in the CBOL Consortium. The method has proven to be effective, efficient and precise.


Subject(s)
Classification/methods; Computational Biology/methods; Electronic Data Processing; Sequence Analysis, DNA/methods; Animals; Humans
14.
P R Health Sci J ; 38(1): 46-53, 2019 03.
Article in English | MEDLINE | ID: mdl-30924915

ABSTRACT

OBJECTIVE: The objectives of this research were to develop an epidemiological profile of tobacco use in the Puerto Rico lesbian, gay, bisexual, transgender, and transsexual (LGBTT) populations and identify whether there are any statistically significant differences (in terms of health conditions and risk factors) between LGBTT smokers (LGBTT-S), LGBTT non-smokers (LGBTT-NS), general-population non-smokers (GP-NS), and general-population smokers (GP-S). METHODS: Using the Puerto Rico Behavioral Risk Factor Surveillance System database (2013-2015), we conducted a univariate analysis to obtain an epidemiological profile, and a bivariate analysis was performed to compare LGBTT-S, LGBTT-NS, GP-NS, and GP-S. Finally, to determine the odds ratios (ORs), an age-adjusted logistic regression model with a 95% level of reliability was used. RESULTS: During the period of 2013 through 2015, the Puerto Rico LGBTT population was reported to have a higher tobacco use prevalence than the general population had (21.6% vs. 10.8%). The LGBTT-S were more likely to have depression (OR: 2.63, p = 0.030) than the LGBTT-NS were. Likewise, LGBTT-S were more likely to suffer from COPD (OR: 4.81, p = 0.014), depression (OR: 3.27, p = 0.002), and heart attack (OR: 0.12, p = 0.038) than were GP-NS. Finally, LGBTT-S were more likely to suffer from COPD (OR: 5.07, p = 0.013) and heart attack (OR: 0.13, p = 0.046) than GP-S were. CONCLUSION: The results of this research demonstrate that tobacco use is one of the most critical public health issues affecting the LGBTT populations in Puerto Rico. For that reason, specific interventions and treatments directed to LGBTT populations are needed to help to reduce the impact of this addiction on the health of their members.
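
As a reading aid for the odds ratios reported above, the sketch below computes a crude (unadjusted) odds ratio from a 2x2 table. It is not the age-adjusted logistic regression used in the study, and the counts are hypothetical, not taken from the paper.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Crude odds ratio from a 2x2 table (no age adjustment)."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed have the condition.
example = odds_ratio(20, 80, 10, 90)
```

An odds ratio above 1 corresponds to higher odds of the condition in the exposed group, and a value below 1 to lower odds.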


Subject(s)
Public Health; Sexual and Gender Minorities/statistics & numerical data; Tobacco Use/epidemiology; Adolescent; Adult; Aged; Behavioral Risk Factor Surveillance System; Cross-Sectional Studies; Female; Humans; Male; Middle Aged; Prevalence; Puerto Rico/epidemiology; Risk Factors; Young Adult
15.
BioData Min ; 11: 22, 2018.
Article in English | MEDLINE | ID: mdl-30386434

ABSTRACT

BACKGROUND: In the Next Generation Sequencing (NGS) era a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to allow an enhanced accessibility. The combination of heterogeneous NGS genomic data is an open challenge: the analysis of data from different experiments is a fundamental practice for the study of diseases. In this work, we propose to combine DNA methylation and RNA sequencing NGS experiments at gene level for supervised knowledge extraction in cancer. METHODS: We retrieve DNA methylation and RNA sequencing datasets from The Cancer Genome Atlas (TCGA), focusing on the Breast Invasive Carcinoma (BRCA), the Thyroid Carcinoma (THCA), and the Kidney Renal Papillary Cell Carcinoma (KIRP). We combine the RNA sequencing gene expression values with the gene methylation quantity, a new measure that we define for representing the methylation quantity associated with a gene. Additionally, we propose to analyze the combined data through tree- and rule-based classification algorithms (C4.5, Random Forest, RIPPER, and CAMUR). RESULTS: We extract more than 15,000 classification models (composed of gene sets), which allow tumoral samples to be distinguished from normal ones with an average accuracy of 95%. From the integrated experiments we obtain about 5000 classification models that consider both the gene measures related to the RNA sequencing and the DNA methylation experiments. CONCLUSIONS: We compare the sets of genes obtained from the classifications on RNA sequencing and DNA methylation data with the genes obtained from the integration of the two experiments. The comparison results in several genes that are in common among the single experiments and the integrated ones (733 for BRCA, 35 for KIRP, and 861 for THCA) and 509 genes that are in common among the different experiments.
Finally, we investigate the possible relationships among the different analyzed tumors by extracting a core set of 13 genes that appear in all tumors. A preliminary functional analysis confirms the relation of part of those genes (5 out of 13 and 279 out of 509) with cancer, suggesting that further studies focus on the newly identified ones.

16.
J Integr Bioinform ; 15(4)2018 Oct 26.
Article in English | MEDLINE | ID: mdl-30367805

ABSTRACT

Finding similarities and differences between metagenomic samples within large repositories has been a significant issue for researchers. Over recent years, content-based retrieval has been suggested by various studies from different perspectives. In this study, a content-based retrieval framework for identifying relevant metagenomic samples is developed. The framework consists of feature extraction, selection methods and similarity measures for whole metagenome sequencing samples. Performance of the developed framework was evaluated on given samples. A ground truth was used to evaluate the system performance such that if the system retrieves patients with the same disease (called positive samples), they are labeled as relevant samples, otherwise as irrelevant. The experimental results show that relevant experiments can be detected by using different fingerprinting approaches. We observed that the Latent Semantic Analysis (LSA) method is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source code and executable files are available at www.baskent.edu.tr/~hogul/WMS_retrieval.rar.


Subject(s)
High-Throughput Nucleotide Sequencing/methods; Metagenome; Microbiota; Sequence Analysis, DNA/methods; Software; Algorithms; Humans
17.
Math Biosci Eng ; 15(1): 181-207, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29161832

ABSTRACT

In the present paper we propose a simple time-varying ODE model to describe the evolution of the HIV epidemic in Italy. The model considers a single population of susceptibles, without distinction of high-risk groups within the general population, and accounts for the presence of immigration and emigration, modelling their effects on both the general demography and the dynamics of the infected subpopulations. To represent the intra-host disease progression, the untreated infected population is distributed over four compartments in cascade according to the CD4 counts. A further compartment is added to represent infected people under antiretroviral therapy. The per capita exit rate from treatment, due to voluntary interruption or failure of therapy, is assumed variable with time. The values of the model parameters not reported in the literature are assessed by fitting available epidemiological data over the decade 2003-2012. Predictions until year 2025 are computed, highlighting the impact on public health of the early initiation of antiretroviral therapy. The benefits of this change in treatment eligibility consist in reducing the HIV incidence rate, the rate of new AIDS cases, and the rate of death from AIDS. Analytical results about properties of the model in its time-invariant form are provided; in particular, the global stability of the equilibrium points is established both in the absence and in the presence of infected individuals among immigrants.
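
The compartmental structure described above (susceptibles, four untreated stages in cascade, one treated compartment, a time-varying exit rate from treatment) can be sketched with forward-Euler integration. This is a toy illustration, not the paper's model: every parameter value, the shape of `art_exit`, the assumption that treated people do not transmit, and the choice to return people who leave treatment to the first untreated stage are placeholders introduced here.

```python
def simulate(years=10.0, dt=0.01):
    """Forward-Euler integration of a toy HIV compartmental model."""
    # Every numerical value below is a hypothetical placeholder, not a fitted parameter.
    inflow, mu = 1000.0, 0.01         # recruitment incl. net migration; natural death rate
    beta = 0.3                        # transmission rate
    progression = [0.2, 0.2, 0.2]     # stage k -> stage k+1 as CD4 count falls
    aids_death = 0.3                  # extra death rate in the last stage
    art_start = [0.0, 0.1, 0.3, 0.5]  # treatment initiation rate per stage

    def art_exit(t):
        # Time-varying per capita exit rate from treatment (assumed shape).
        return 0.05 / (1.0 + 0.1 * t)

    S, I, T = 100000.0, [10.0, 0.0, 0.0, 0.0], 0.0
    for step in range(int(round(years / dt))):
        t = step * dt
        N = S + sum(I) + T
        # Assumption: only untreated infected individuals transmit.
        new_infections = beta * S * sum(I) / N
        dS = inflow - new_infections - mu * S
        dI = []
        for k in range(4):
            gain = new_infections if k == 0 else progression[k - 1] * I[k - 1]
            loss = (progression[k] if k < 3 else aids_death) + art_start[k] + mu
            dI.append(gain - loss * I[k])
        # Assumption: people leaving treatment re-enter the first untreated stage.
        dI[0] += art_exit(t) * T
        dT = sum(a * x for a, x in zip(art_start, I)) - (art_exit(t) + mu) * T
        S += dt * dS
        I = [x + dt * dx for x, dx in zip(I, dI)]
        T += dt * dT
    return S, I, T


S, I, T = simulate()
```

Raising the early-stage entries of `art_start` is one way to mimic earlier treatment initiation in such a sketch and compare the resulting trajectories.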


Subject(s)
Acquired Immunodeficiency Syndrome/epidemiology , Acquired Immunodeficiency Syndrome/transmission , Anti-Retroviral Agents/pharmacology , Epidemics , HIV Infections/epidemiology , HIV Infections/transmission , Adult , Aged , Algorithms , Antiretroviral Therapy, Highly Active , CD4-Positive T-Lymphocytes/cytology , Disease Progression , Emigrants and Immigrants , Female , Humans , Incidence , Italy/epidemiology , Male , Middle Aged , Models, Theoretical , Public Health , Reproducibility of Results , Time Factors
18.
Genes Nutr ; 13: 12, 2018.
Article in English | MEDLINE | ID: mdl-29736190

ABSTRACT

BACKGROUND: The multidisciplinary nature of nutrition research is one of its main strengths. At the same time, however, it presents a major obstacle to integrated data analysis, especially because of the terminological and semantic interpretations that specific research fields or communities are used to. To date, a proper ontology to structure and formalize the concepts used for the description of nutritional studies is still lacking. RESULTS: We have developed the Ontology for Nutritional Studies (ONS) by harmonizing selected pre-existing de facto ontologies with novel health and nutritional terminology classifications. The ONS is the result of a scholarly consensus of 51 research centers in nine European countries. The ontology classes and relations are commonly encountered while conducting, storing, harmonizing, integrating, describing, and searching nutritional studies. The ONS facilitates the description and specification of complex nutritional studies, as demonstrated with two application scenarios. CONCLUSIONS: The ONS is the first systematic effort to provide a solid and extensible formal ontology framework for nutritional studies. Integration of new information can be easily achieved by the addition of extra modules (i.e., nutrigenomics, metabolomics, nutrikinetics, and quality appraisal). The ONS provides a unified and standardized terminology for nutritional studies as a resource for nutrition researchers who might not necessarily be familiar with ontologies and standardization concepts.

19.
Oncotarget ; 8(61): 103340-103363, 2017 Nov 28.
Article in English | MEDLINE | ID: mdl-29262566

ABSTRACT

Increasing evidence points to a key role played by epithelial-mesenchymal transition (EMT) in cancer progression and drug resistance. In this study, we used wet and in silico approaches to investigate whether EMT phenotypes are associated with resistance to targeted therapy in a non-small cell lung cancer model system harboring activating mutations of the epidermal growth factor receptor. The combination of different analysis techniques allowed us to describe intermediate/hybrid and complete EMT phenotypes in HCC827- and HCC4006-derived drug-resistant human cancer cell lines, respectively. Interestingly, intermediate/hybrid EMT phenotypes, collective cell migration, and increased stem-like ability are associated with resistance to the epidermal growth factor receptor inhibitor erlotinib in HCC827-derived cell lines. Moreover, the use of three complementary approaches for gene expression analysis supported the identification of a small EMT-related gene list, which may otherwise have been overlooked by standard stand-alone methods for gene expression analysis.

20.
BioData Min ; 9: 38, 2016.
Article in English | MEDLINE | ID: mdl-27980679

ABSTRACT

BACKGROUND: Continuous improvements in next generation sequencing technologies have led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species has emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignment-, phylogenetic tree-, statistical- and character-based methods. RESULTS: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences. CONCLUSIONS: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and that moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived from the relative location and abundance of such subsequences along the different regions.
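To make the notion of a characterizing subsequence concrete, the sketch below enumerates every fixed-length subsequence that occurs in all sequences of one class and in no sequence of any other class, then uses those subsequences as presence rules for classification. It is a brute-force baseline under simplifying assumptions, not the genetic algorithm the abstract proposes, and all names are hypothetical.

```python
def kmers(sequence, k):
    """Set of all overlapping length-k subsequences of one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def characterizing_subsequences(sequences_by_class, k):
    """Map each class to the k-mers shared by all its sequences and absent elsewhere."""
    result = {}
    for label, sequences in sequences_by_class.items():
        if not sequences:
            result[label] = set()
            continue
        # Subsequences present in every sequence of this class.
        shared = set.intersection(*[kmers(s, k) for s in sequences])
        # Subsequences seen in any sequence of any other class.
        elsewhere = set()
        for other_label, other_sequences in sequences_by_class.items():
            if other_label != label:
                for s in other_sequences:
                    elsewhere |= kmers(s, k)
        result[label] = shared - elsewhere
    return result


def classify(sequence, signatures):
    """Rule-based call: the class with the most signature hits, or None if no hit."""
    best_label, best_hits = None, 0
    for label, subsequences in signatures.items():
        hits = sum(1 for sub in subsequences if sub in sequence)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label
```

Exhaustive enumeration is only practical for short sequences and a fixed length; a search heuristic such as the genetic algorithm described above is the kind of approach that avoids scanning every candidate up to the length bound.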
