Results 1 - 20 of 83
1.
BMC Bioinformatics; 25(1): 180, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38720249

ABSTRACT

BACKGROUND: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. RESULTS: Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. 
CONCLUSIONS: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, facilitating rapid prototyping and benchmarking of algorithms for this type of dataset, and the pressing need to advance low-fraction variant calling techniques.
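The sensitivity limit described above can be illustrated with a toy binomial simulation (a sketch for intuition only, unrelated to the NEAT-based pipeline of the paper; all names are made up): at a fixed depth, the number of reads supporting a variant present at a low tumor fraction is binomially distributed, so very low fractions leave only a handful of supporting reads even at ultra-deep coverage.

```python
import random
from statistics import mean

def simulated_variant_support(depth, tumor_fraction, trials=2000, seed=42):
    """Average number of reads supporting a heterozygous somatic variant,
    given the sequencing depth and the fraction of tumor DNA in the sample."""
    rng = random.Random(seed)
    p = tumor_fraction / 2  # heterozygous variant, diluted by normal-cell DNA
    return mean(sum(rng.random() < p for _ in range(depth)) for _ in range(trials))
```

At a 1% tumor fraction, even 1000x coverage leaves on average only about five supporting reads, which is why tuned caller parameters matter for such data.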


Subject(s)
Benchmarking, High-Throughput Nucleotide Sequencing, Neoplasms, High-Throughput Nucleotide Sequencing/methods, Humans, Neoplasms/genetics, Mutation, Algorithms, Neoplasm DNA/genetics, DNA Sequence Analysis/methods, Computational Biology/methods
2.
Int J Mol Sci; 25(4), 2024 Feb 08.
Article in English | MEDLINE | ID: mdl-38396735

ABSTRACT

The in-silico strategy of identifying novel uses for already existing drugs, known as drug repositioning, has enhanced drug discovery. Previous studies have shown a positive correlation between expression changes induced by the anticancer agent trabectedin and those caused by irinotecan, a topoisomerase I inhibitor. Leveraging the availability of transcriptional datasets, we developed a general in-silico drug-repositioning approach that we applied to investigate novel trabectedin synergisms. We set up a workflow that allows identifying genes selectively modulated by a drug, as well as possible novel drug interactions. To show its effectiveness, we selected trabectedin as a case-study drug. We retrieved eight transcriptional cancer datasets including controls and samples treated with trabectedin or its analog lurbinectedin. We compared the gene signature associated with each dataset to the 476,251 signatures from the Connectivity Map database. The most significant connections referred to mitomycin-c, topoisomerase II inhibitors, a PKC inhibitor, a Chk1 inhibitor, an antifungal agent, and an antagonist of the glutamate receptor. Genes coherently modulated by the drugs were involved in cell cycle, PPARalpha, and Rho GTPases pathways. Our in-silico approach for drug synergism identification showed that trabectedin modulates specific pathways that are shared with other drugs, suggesting possible synergisms.
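The signature-matching step can be sketched with a toy rank-based concordance score (an illustrative stand-in, not the actual Connectivity Map statistic): genes up-regulated by the query drug should sit near the top of a reference expression ranking, down-regulated genes near the bottom.

```python
def connectivity_score(query_up, query_down, ranked_genes):
    """Toy connectivity score in [-1, 1]; +1 means perfectly concordant profiles.
    ranked_genes is ordered from most up- to most down-regulated."""
    n = len(ranked_genes)
    pos = {g: i for i, g in enumerate(ranked_genes)}
    up = [1 - 2 * pos[g] / (n - 1) for g in query_up if g in pos]
    down = [2 * pos[g] / (n - 1) - 1 for g in query_down if g in pos]
    scores = up + down
    return sum(scores) / len(scores) if scores else 0.0
```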


Subject(s)
Antineoplastic Agents, Tetrahydroisoquinolines, Trabectedin/pharmacology, Trabectedin/therapeutic use, Tetrahydroisoquinolines/pharmacology, Dioxoles/pharmacology, Drug Synergism
3.
BMC Bioinformatics; 24(1): 395, 2023 Oct 20.
Article in English | MEDLINE | ID: mdl-37864168

ABSTRACT

BACKGROUND: Transcription factors (TF) play a crucial role in the regulation of gene transcription; alterations of their activity and of their binding to DNA regions are strongly involved in the onset and development of cancer and other diseases. For proper biomedical investigation, it is hence essential to correctly trace TF-dense DNA areas, which host multiple bindings of distinct factors, and to select DNA high occupancy target (HOT) zones, which show the highest accumulation of such bindings. Indeed, systematic and replicable analysis of HOT zones in a large variety of cells and tissues would allow further understanding of their characteristics and could clarify their functional role. RESULTS: Here, we propose, thoroughly explain and discuss a full computational procedure to study in depth DNA dense areas of transcription factor accumulation and identify HOT zones. This methodology, developed as a computationally efficient parametric algorithm implemented in an R/Bioconductor package, uses a systematic approach with two alternative methods to examine transcription factor bindings and provide comparative and fully reproducible assessments. It offers different resolutions by introducing three distinct types of accumulation, which can analyze DNA from single-base to region-oriented levels, and a moving window, which can estimate the influence of the neighborhood for each DNA base under examination. CONCLUSIONS: We quantitatively assessed the full procedure by using our implemented software package, named TFHAZ, in two example applications of biological interest, proving its full reliability and relevance.
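The notion of per-base accumulation with a moving window can be sketched as follows (a minimal illustration of the idea, not the TFHAZ implementation, which is an R/Bioconductor package):

```python
def base_accumulation(intervals, genome_len):
    """Count, for each base, how many TF binding intervals cover it
    (intervals are half-open [start, end) on a toy linear genome)."""
    acc = [0] * genome_len
    for start, end in intervals:
        for i in range(start, end):
            acc[i] += 1
    return acc

def hot_zones(acc, window, threshold):
    """Return bases whose windowed average accumulation meets the threshold,
    i.e. a crude 'high occupancy target' zone call."""
    half = window // 2
    n = len(acc)
    zones = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        if sum(acc[lo:hi]) / (hi - lo) >= threshold:
            zones.append(i)
    return zones
```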


Subject(s)
Gene Expression Regulation, Transcription Factors, Transcription Factors/metabolism, Reproducibility of Results, DNA/genetics, Protein Binding, Binding Sites/genetics
4.
J Biomed Inform; 144: 104457, 2023 08.
Article in English | MEDLINE | ID: mdl-37488024

ABSTRACT

BACKGROUND AND OBJECTIVE: Many classification tasks in translational bioinformatics and genomics are characterized by the high dimensionality of potential features and unbalanced sample distribution among classes. This can affect classifier robustness and increase the risk of overfitting, the curse of dimensionality and generalization leaks; furthermore, and most importantly, it can prevent obtaining the adequate patient stratification required for precision medicine in facing complex diseases like cancer. Setting up a feature selection strategy able to extract only proper predictive features by removing irrelevant, redundant, and noisy ones is crucial to achieving valuable results on the desired task. METHODS: We propose a new feature selection approach, called ReRa, based on supervised Relevance-Redundancy assessments. ReRa consists of a customized step of relevance-based filtering, to identify a reduced subset of meaningful features, followed by a supervised similarity-based procedure to minimize redundancy. This latter step innovatively uses a combination of global and class-specific similarity assessments to remove redundant features while preserving those differentiated across classes, even when these classes are strongly unbalanced. RESULTS: We compared ReRa with several existing feature selection methods to obtain feature spaces on which to perform breast cancer patient subtyping using several classifiers: we considered two use cases based on gene or transcript isoform expression. In the vast majority of the assessed scenarios, when using ReRa-selected feature spaces, the performances were significantly increased compared to simple feature filtering, LASSO regularization, or even mRMR, another relevance-redundancy method.
The two use cases represent an insightful example of translational application, taking advantage of ReRa's capabilities to investigate and enhance a clinically relevant patient stratification task, which could easily be applied to other cancer types and diseases as well. CONCLUSIONS: The ReRa approach has the potential to improve the performance of machine learning models used in unbalanced classification scenarios. Compared to other relevance-redundancy approaches like mRMR, ReRa does not require tuning the number of preserved features, ensures efficiency and scalability over huge initial dimensionalities, and allows re-evaluating all previously selected features at each iteration of the redundancy assessment, ultimately preserving only the most relevant and class-differentiated features.
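The relevance-then-redundancy scheme can be sketched as a greedy filter (a simplified illustration, not the actual ReRa algorithm; in particular, the class-specific similarity assessment and iterative re-evaluation are omitted here):

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def relevance_redundancy_select(features, labels, k, redundancy_cut=0.9):
    """Rank features by |correlation with the labels| (relevance), then keep a
    feature only if it is not too correlated with one already kept (redundancy)."""
    ranked = sorted(features, key=lambda f: abs(pearson(features[f], labels)),
                    reverse=True)
    kept = []
    for f in ranked:
        if all(abs(pearson(features[f], features[g])) < redundancy_cut for g in kept):
            kept.append(f)
        if len(kept) == k:
            break
    return kept
```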


Subject(s)
Algorithms, Breast Neoplasms, Humans, Female, Computational Biology/methods, Genomics, Proteomics, Breast Neoplasms/diagnosis, Breast Neoplasms/genetics
5.
Genome Med; 15(1): 37, 2023 05 15.
Article in English | MEDLINE | ID: mdl-37189167

ABSTRACT

BACKGROUND: Transcriptional classification has been used to stratify colorectal cancer (CRC) into molecular subtypes with distinct biological and clinical features. However, it is not clear whether such subtypes represent discrete, mutually exclusive entities or molecular/phenotypic states with potential overlap. Therefore, we focused on the CRC Intrinsic Subtype (CRIS) classifier and evaluated whether assigning multiple CRIS subtypes to the same sample provides additional clinically and biologically relevant information. METHODS: A multi-label version of the CRIS classifier (multiCRIS) was applied to newly generated RNA-seq profiles from 606 CRC patient-derived xenografts (PDXs), together with human CRC bulk and single-cell RNA-seq datasets. Biological and clinical associations of single- and multi-label CRIS were compared. Finally, a machine learning-based multi-label CRIS predictor (ML2CRIS) was developed for single-sample classification. RESULTS: Surprisingly, about half of the CRC cases could be significantly assigned to more than one CRIS subtype. Single-cell RNA-seq analysis revealed that multiple CRIS membership can be a consequence of the concomitant presence of cells of different CRIS classes or, less frequently, of cells with a hybrid phenotype. Multi-label assignments were found to improve prediction of CRC prognosis and response to treatment. Finally, the ML2CRIS classifier was shown to retain the same biological and clinical associations in the context of single-sample classification. CONCLUSIONS: These results show that CRIS subtypes retain their biological and clinical features even when concomitantly assigned to the same CRC sample. This approach could potentially be extended to other cancer types and classification systems.
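The multi-label idea, assigning every subtype whose score passes a class-specific threshold, can be sketched as follows (hypothetical scores and thresholds for illustration; this is not the ML2CRIS model):

```python
def multilabel_assign(class_scores, thresholds):
    """Assign a sample to every subtype whose score passes its class-specific
    threshold, so a sample may receive more than one label; fall back to the
    single best-scoring label when nothing passes."""
    labels = [c for c, s in class_scores.items() if s >= thresholds[c]]
    return labels or [max(class_scores, key=class_scores.get)]
```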


Subject(s)
Colorectal Neoplasms, Animals, Humans, Colorectal Neoplasms/pathology, Prognosis, Animal Disease Models, Tumor Biomarkers/genetics
6.
Genome Biol; 24(1): 79, 2023 04 18.
Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.


Subject(s)
Algorithms, Epigenomics, Genomics/methods
7.
BMC Bioinformatics; 23(1): 401, 2022 Sep 29.
Article in English | MEDLINE | ID: mdl-36175857

ABSTRACT

BACKGROUND: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics. RESULTS: Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. The provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities. CONCLUSIONS: The proposed data integration pipeline and the data set extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data, and allow biologists and bioinformaticians to easily perform scalable analysis on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current trend toward large nation-wide and cross-nation sequencing and variation initiatives, we expect an ever-growing need for the kind of computational support proposed here.
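The kind of partition-then-summarize query such a service supports can be sketched in a few lines (a toy in-memory stand-in with invented sample and variant names; the real VarSum is an HTTP API over the integrated repository):

```python
def summarize_partition(individuals, variants, **filters):
    """Select individuals whose metadata match all the given filters, then
    count, per variant, how many of the selected individuals carry it."""
    selected = {i for i, meta in individuals.items()
                if all(meta.get(k) == v for k, v in filters.items())}
    return {v: len(carriers & selected) for v, carriers in variants.items()}
```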


Subject(s)
Genomics, Metadata, Computational Biology, Genotype, Humans, Software
8.
BMC Bioinformatics; 23(1): 123, 2022 Apr 07.
Article in English | MEDLINE | ID: mdl-35392801

ABSTRACT

BACKGROUND: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures. RESULTS: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. Most importantly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions. CONCLUSIONS: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while combining and analyzing heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
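The flavor of region-based operation that GMQL (and hence RGMQL) exposes can be sketched as a naive interval join (illustrative only; RGMQL itself is an R/Bioconductor package and its engine does not work this way internally):

```python
def region_join(reference, experiment):
    """Pair each reference region with the experiment regions that overlap it.
    Regions are (chromosome, start, end) with half-open [start, end) coordinates."""
    pairs = []
    for chrom, r_start, r_end in reference:
        for c2, e_start, e_end in experiment:
            if chrom == c2 and r_start < e_end and e_start < r_end:
                pairs.append(((chrom, r_start, r_end), (c2, e_start, e_end)))
    return pairs
```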


Subject(s)
Metadata, Software, Big Data, Cloud Computing, Genomics
9.
BMC Bioinformatics; 23(1): 151, 2022 Apr 26.
Article in English | MEDLINE | ID: mdl-35473556

ABSTRACT

BACKGROUND: Histone Mark Modifications (HMs) are crucial actors in gene regulation, as they actively remodel chromatin to modulate transcriptional activity: aberrant combinatorial patterns of HMs have been connected with several diseases, including cancer. HMs are, however, reversible modifications: understanding their role in disease would allow the design of 'epigenetic drugs' for specific, non-invasive treatments. Standard statistical techniques have not been entirely successful in extracting representative features from raw HM signals over gene locations. On the other hand, deep learning approaches allow for effective automatic feature extraction, but at the expense of model interpretation. RESULTS: Here, we propose ShallowChrome, a novel computational pipeline to model transcriptional regulation via HMs in both an accurate and interpretable way. We attain state-of-the-art results on the binary classification of gene transcriptional states over 56 cell types from the REMC database, largely outperforming recent deep learning approaches. We interpret our models by extracting insightful gene-specific regulative patterns, and we analyze them for the specific case of the PAX5 gene over three differentiated blood cell lines. Finally, we compare the patterns we obtained with the characteristic emission patterns of ChromHMM, and show that ShallowChrome is able to coherently rank groups of chromatin states w.r.t. their transcriptional activity. CONCLUSIONS: In this work we demonstrate that it is possible to model HM-modulated gene expression regulation in a highly accurate, yet interpretable way. Our feature extraction algorithm leverages data downstream of the identification of enriched regions to retrieve gene-wise, statistically significant and dynamically located features for each HM. These features are highly predictive of gene transcriptional state, and allow for accurate modeling by computationally efficient logistic regression models.
These models allow direct inspection and rigorous interpretation, helping to formulate quantifiable hypotheses.
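The final modeling step, logistic regression from histone-mark features to a binary transcriptional state, can be sketched from scratch (toy data and a minimal training loop; the paper's actual contribution is the feature extraction, which is not reproduced here):

```python
from math import exp

def train_logistic(X, y, lr=0.5, epochs=200):
    """Minimal logistic regression via stochastic gradient descent: predict a
    gene's binary transcriptional state from its (toy) HM feature vector."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1 / (1 + exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Threshold the predicted probability at 0.5."""
    return 1 / (1 + exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b))) >= 0.5
```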


Subject(s)
Histone Code, Histones, Chromatin, Gene Expression, Histones/metabolism, Post-Translational Protein Processing
10.
Article in English | MEDLINE | ID: mdl-32750853

ABSTRACT

The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well-recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, learning their specific metadata definitions as a result of long and tedious efforts, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and present the resulting repository, which already integrates several important sources and is exposed by means of practical user interfaces to respond to biological researchers' needs.
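The normalization step of such a metadata pipeline can be sketched as a simple vocabulary mapping (a minimal illustration with invented attribute names; META-BASE's transformation process is far richer):

```python
def normalize_metadata(raw, synonyms):
    """Map source-specific attribute names onto a unified vocabulary and
    trim/lowercase the values; keys without a synonym keep their own name."""
    out = {}
    for key, value in raw.items():
        unified = synonyms.get(key.strip().lower(), key.strip().lower())
        out[unified] = value.strip().lower()
    return out
```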


Subject(s)
Genomics, Metadata, Computational Biology, Information Storage and Retrieval
11.
IEEE/ACM Trans Comput Biol Bioinform; 19(4): 1956-1967, 2022.
Article in English | MEDLINE | ID: mdl-34166199

ABSTRACT

Traditional drug experiments to find synergistic drug pairs are time-consuming and expensive due to the numerous possible combinations of drugs that have to be examined. Thus, computational methods that can give suggestions for synergistic drug investigations are of great interest. Here, we propose a Non-negative Matrix Tri-Factorization (NMTF) based approach that leverages the integration of different data types for predicting synergistic drug pairs in multiple specific cell lines. Our computational framework relies on a network-based representation of available data about drug synergism, which also allows integrating genomic information about cell lines. We computationally evaluate the performance of our method in finding missing relationships between synergistic drug pairs and cell lines, and in computing synergy scores between drug pairs in a specific cell line, and we estimate the benefit of adding cell line genomic data to the network. Our approach obtains very good performance (Average Precision Score equal to 0.937, Pearson's correlation coefficient equal to 0.760) when cell line genomic data and rich data about synergistic drugs in a cell line are considered. Finally, we systematically searched our top-scored predictions in the available literature and in the NCI ALMANAC, a well-known database of drug combination experiments, confirming the soundness of our findings.
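The factorization idea can be sketched with the simpler two-factor NMF and classic multiplicative updates (the paper uses a tri-factorization with cell-line genomic integration, which this toy version omits): a low-rank approximation of the synergy matrix is what lets missing entries be estimated.

```python
import random

def nmf_sketch(R, k, iters=500, seed=0):
    """Simplified two-factor non-negative matrix factorization: approximate
    R ~ W @ H using Lee-Seung multiplicative updates (Frobenius loss)."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    def product():
        return [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(m)]
                for i in range(n)]
    for _ in range(iters):
        WH = product()
        for t in range(k):          # update H given W
            for j in range(m):
                num = sum(W[i][t] * R[i][j] for i in range(n))
                den = sum(W[i][t] * WH[i][j] for i in range(n)) + 1e-9
                H[t][j] *= num / den
        WH = product()
        for i in range(n):          # update W given H
            for t in range(k):
                num = sum(H[t][j] * R[i][j] for j in range(m))
                den = sum(H[t][j] * WH[i][j] for j in range(m)) + 1e-9
                W[i][t] *= num / den
    return W, H
```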


Subject(s)
Algorithms, Computational Biology, Computational Biology/methods, Factual Databases, Drug Synergism, Genomics
12.
Article in English | MEDLINE | ID: mdl-33270566

ABSTRACT

Breast Cancer comprises multiple subtypes implicated in prognosis. Existing stratification methods rely on the expression quantification of small gene sets. Next Generation Sequencing promises large amounts of omic data in the next years. In this scenario, we explore the potential of machine learning and, particularly, deep learning for breast cancer subtyping. Due to the paucity of publicly available data, we leverage pan-cancer and non-cancer data to design semi-supervised settings. We make use of multi-omic data, including microRNA expressions and copy number alterations, and we provide an in-depth investigation of several supervised and semi-supervised architectures. The obtained accuracy results show that simpler models perform at least as well as the deep semi-supervised approaches on our task over gene expression data. When multi-omic data types are combined together, the performance of deep models shows little (if any) improvement in accuracy, indicating the need for further analysis on larger datasets of multi-omic data as and when they become available. From a biological perspective, our linear model mostly confirms known gene-subtype annotations. Conversely, deep approaches model non-linear relationships, which is reflected in a more varied and still unexplored set of representative omic features that may prove useful for breast cancer subtyping.


Subject(s)
Breast Neoplasms, Deep Learning, Breast Neoplasms/genetics, DNA Copy Number Variations, Female, Humans, Machine Learning, Supervised Machine Learning
13.
Bioinformatics; 38(5): 1183-1190, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34864898

ABSTRACT

MOTIVATION: Approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) represent the standard for the identification of binding sites of DNA-associated proteins, including transcription factors and histone marks. Public repositories of omics data contain a huge number of experimental ChIP-seq data, but their reuse and integrative analysis across multiple conditions remain a daunting task. RESULTS: We present the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE), an efficient computational method able to integrate and take advantage of the valuable and numerous, but heterogeneous, ChIP-seq data publicly available in big data repositories. Leveraging natural language processing techniques, it integrates omics data samples with semantic annotations from selected biomedical ontologies; then, using hidden Markov models, it identifies combinations of static and dynamic functional elements throughout the genome for the corresponding samples. CombSAFE allows analyzing the whole genome by clustering patterns of regions with similar functional elements and through enrichment analyses to discover ontological terms significantly associated with them. Moreover, it allows comparing the functional states of a specific genomic region to analyze their different behavior across the various semantic annotations. Such findings can provide novel insights by identifying unexpected combinations of functional elements in different biological conditions. AVAILABILITY AND IMPLEMENTATION: The Python implementation of the CombSAFE pipeline is freely available for non-commercial use at: https://github.com/DEIB-GECO/CombSAFE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
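The hidden-state segmentation step can be illustrated with textbook Viterbi decoding on a made-up two-state model (not the CombSAFE implementation; the state names, observations and probabilities below are invented):

```python
def viterbi(obs, states, start, trans, emit):
    """Textbook Viterbi decoding: the most likely hidden-state path for an
    observation sequence, the core of HMM-based genome segmentation."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            row[s] = V[-1][prev] * trans[prev][s] * emit[s][o]
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]
```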


Subject(s)
Genomics, Semantics, DNA Sequence Analysis/methods, Genomics/methods, Chromatin Immunoprecipitation Sequencing, Genome
14.
BMC Bioinformatics; 22(1): 571, 2021 Nov 27.
Article in English | MEDLINE | ID: mdl-34837938

ABSTRACT

BACKGROUND: In-depth analysis of the regulation networks of genes aberrantly expressed in cancer is essential for better understanding tumors and identifying key genes that could be therapeutically targeted. RESULTS: We developed a quantitative analysis approach to investigate the main biological relationships among different regulatory elements and target genes; we applied it to Ovarian Serous Cystadenocarcinoma and 177 target genes belonging to three main pathways (DNA REPAIR, STEM CELLS and GLUCOSE METABOLISM) relevant for this tumor. Combining data from ENCODE and TCGA datasets, we built a predictive linear model for the regulation of each target gene, assessing the relationships between its expression, promoter methylation, expression of genes in the same or in the other pathways and of putative transcription factors. We proved the reliability and significance of our approach in a similar tumor type (basal-like Breast cancer) and using a different existing algorithm (ARACNe), and we obtained experimental confirmations on potentially interesting results. CONCLUSIONS: The analysis of the proposed models allowed us to disclose the relations between a gene and its related biological processes and the interconnections between the different gene sets, and to evaluate the relevant regulatory elements at the single-gene level. This led to the identification of already known regulators and/or gene correlations, and to the unveiling of a set of still unknown, potentially interesting biological relationships for pharmacological and clinical use.
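The per-gene predictive linear model can be sketched with plain gradient-descent least squares on toy data (illustrative only; the feature meanings in the comments, e.g. "promoter methylation" and "TF expression", are placeholders, not the paper's actual predictors):

```python
def fit_linear(X, y, lr=0.2, epochs=5000):
    """Full-batch gradient descent for a linear model: think of the columns of
    X as, e.g., promoter methylation and a putative TF's expression, and y as
    the target gene's expression."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grads, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j, xj in enumerate(xi):
                grads[j] += err * xj
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, grads)]
        b -= lr * gb / n
    return w, b
```

On data generated as y = 2*x1 - x2, the fitted coefficients recover the generating weights, which is exactly the kind of interpretable readout the abstract describes.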


Subject(s)
Neoplastic Gene Expression Regulation, Gene Regulatory Networks, Algorithms, Gene Expression Profiling, Gene Expression Regulation, Reproducibility of Results, Transcription Factors/metabolism
15.
Comput Methods Programs Biomed; 209: 106329, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34418814

ABSTRACT

BACKGROUND AND OBJECTIVE: Chronic Kidney Disease (CKD) is a condition characterized by a progressive loss of kidney function over time, caused by many diseases. The most effective weapons against CKD are early diagnosis and treatment, which in most cases can only postpone the onset of complete kidney failure. CKD is graded based on the estimated Glomerular Filtration Rate (eGFR), and this grading helps to stratify patients for risk, follow-up and management planning. This study aims to effectively predict how soon a CKD patient will need to be dialyzed, thus allowing personalized care and strategic planning of treatment. METHODS: To accurately predict the time frame within which a CKD patient will necessarily have to be dialyzed, a computational model based on a supervised machine learning approach is developed. Many techniques, regarding both the information extraction and model training phases, are compared in order to understand which approaches are most effective. The different models compared are trained on data extracted from the Electronic Medical Records of the Vimercate Hospital. RESULTS: As the final model, we propose a set of Extremely Randomized Trees classifiers considering 27 features, including creatinine level, urea, red blood cell count, eGFR trend (notably, not the most important feature), age and associated comorbidities. In predicting the occurrence of complete renal failure within the next year rather than later, it obtains a test accuracy of 94%, specificity of 91% and sensitivity of 96%. More and shorter time-frame intervals, down to a granularity of 6 months, can be specified without significantly worsening the model performance. CONCLUSIONS: The developed computational model provides nephrologists with valuable support in predicting the patient's clinical pathway. The model's promising results, coupled with the knowledge and experience of the clinicians, can effectively lead to better personalized care and strategic planning of both patients' needs and hospital resources.
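The "extremely randomized" idea, a random feature and a random cut point per split, can be sketched with an ensemble of depth-1 stumps (a toy flavor of the method on made-up data; the actual model uses full trees over 27 clinical features):

```python
import random

def fit_random_stumps(X, y, n_trees=51, seed=7):
    """Ensemble of depth-1 trees with fully random splits: each stump picks a
    random feature and a random cut point, then each side predicts the
    majority class of the training samples it receives."""
    rng = random.Random(seed)
    majority = lambda ys: max(set(ys), key=ys.count)
    stumps = []
    for _ in range(n_trees):
        f = rng.randrange(len(X[0]))
        vals = [x[f] for x in X]
        cut = rng.uniform(min(vals), max(vals))
        left = [yi for x, yi in zip(X, y) if x[f] <= cut] or y
        right = [yi for x, yi in zip(X, y) if x[f] > cut] or y
        stumps.append((f, cut, majority(left), majority(right)))
    return stumps

def predict_stumps(stumps, x):
    """Majority vote over all stumps."""
    votes = [l if x[f] <= cut else r for f, cut, l, r in stumps]
    return max(set(votes), key=votes.count)
```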


Subject(s)
Chronic Renal Insufficiency, Glomerular Filtration Rate, Humans, Chronic Renal Insufficiency/diagnosis, Supervised Machine Learning
16.
Brief Bioinform; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-34020536

ABSTRACT

MOTIVATION: With the spreading of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations are facing the need to share NGS data resources and to easily access and process comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: We present a well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data are federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives the distribution of data and processing. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.


Subject(s)
Datasets as Topic, Genomics, Information Dissemination, High-Throughput Nucleotide Sequencing/methods, Humans, Programming Languages
17.
NPJ Syst Biol Appl; 7(1): 17, 2021 03 12.
Article in English | MEDLINE | ID: mdl-33712625

ABSTRACT

The complexity of cancer has always been a major obstacle to understanding the source of this disease. However, by appreciating this complexity, we can shed some light on crucial gene associations across and within specific cancer types. In this study, we develop a general framework to infer relevant gene biomarkers and their gene-to-gene associations using multiple gene co-expression networks for each cancer type. Specifically, we infer computationally and biologically interesting communities of genes from kidney renal clear cell carcinoma, liver hepatocellular carcinoma, and prostate adenocarcinoma data sets of The Cancer Genome Atlas (TCGA) database. The gene communities are extracted through a data-driven pipeline and then evaluated through both functional analyses and literature findings. Furthermore, we provide a computational validation of their relevance for each cancer type by comparing the performance of normal/cancer classification for our identified gene sets and other gene signatures, including the typically used differentially expressed genes. The hallmark of this study is its approach based on gene co-expression networks from different similarity measures: using a combination of multiple gene networks and then fusing normal and cancer networks for each cancer type, we can gain better insight into the overall structure of the cancer-type-specific network.
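The co-expression-network construction can be sketched as thresholded Pearson correlation followed by connected components as "communities" (a minimal stand-in for the paper's multi-network fusion and community detection, with made-up gene names):

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def coexpression_communities(expr, threshold=0.8):
    """Connect genes whose |correlation| passes the threshold, then return
    the connected components of the resulting network as gene communities."""
    genes = list(expr)
    adj = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        if abs(pearson(expr[a], expr[b])) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, comms = set(), []
    for g in genes:
        if g in seen:
            continue
        stack, comp = [g], set()
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comms.append(comp)
    return comms
```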


Subject(s)
Biomarkers, Tumor , Liver Neoplasms , Biomarkers, Tumor/genetics , Computational Biology , Gene Expression Profiling , Genes, Neoplasm , Humans , Liver Neoplasms/genetics , Male
18.
BioData Min ; 14(1): 14, 2021 Feb 12.
Article in English | MEDLINE | ID: mdl-33579334

ABSTRACT

BACKGROUND: Structured biological information about genes and proteins is a valuable resource for improving the discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, the features and functions of the genes and proteins of many organisms. However, such valuable annotations are not always reliable and are sometimes incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from the GO annotations of an evolutionarily related, better-studied source organism. RESULTS: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of the known annotations to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm on the reduced annotation set to predict, i.e., to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations of an organism from their perturbed version, also effectively predicts new unknown annotations for that organism. Moreover, the prediction model can discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single-model technique. We tested the method with the GO annotations of five different organisms: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum.
The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. CONCLUSIONS: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. These predictions are very valuable both to speed up annotation curation, by focusing it on the prioritized new annotations, and to complement the known annotations available.
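The perturbation step at the heart of the method (delete a random fraction of known annotations, then train a model to rebuild them) can be sketched as follows. This is a minimal sketch of the idea only, not the GeFF code; the function name, fraction, and annotation pairs are invented.

```python
# Illustrative sketch of the annotation-perturbation step: randomly remove a
# fraction of known (gene, GO term) pairs, yielding a reduced training set and
# the held-out "deleted" pairs the model should learn to rebuild.
import random

def perturb_annotations(annotations, delete_fraction=0.2, seed=42):
    """Split a set of (gene, go_term) pairs into (reduced, deleted)."""
    rng = random.Random(seed)          # seeded for reproducible perturbations
    pairs = sorted(annotations)        # stable order before sampling
    n_delete = int(len(pairs) * delete_fraction)
    deleted = set(rng.sample(pairs, n_delete))
    return annotations - deleted, deleted

known = {("BRCA1", "GO:0006281"), ("TP53", "GO:0006915"),
         ("MYC", "GO:0006355"), ("EGFR", "GO:0007169"),
         ("KRAS", "GO:0007265")}
reduced, deleted = perturb_annotations(known, delete_fraction=0.2)
```

Training a classifier on `reduced` and scoring its recovery of `deleted` gives exactly the kind of internal ground truth the abstract describes, without requiring new curated annotations.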

19.
Brief Bioinform ; 22(1): 30-44, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32496509

ABSTRACT

Thousands of new experimental datasets become available every day; in many cases, they are produced within the scope of large cooperative efforts involving a variety of laboratories spread all over the world, and are typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery; it consists of subsequent steps of data extraction, normalization, matching and enrichment. Once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players, analyse the issues they face in solving the genomic data integration challenges, and evaluate the computational environments that they provide for following up data integration with visualization and analysis tools.
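The normalization and matching steps named above can be illustrated with a toy example: metadata records from two sources use different keys and spellings, and mapping them onto a shared vocabulary lets records about the same assay be matched. The synonym tables and field names here are invented for illustration, not taken from any real integration system.

```python
# Toy sketch of metadata normalization and matching across heterogeneous
# sources: map synonymous keys and values onto one controlled vocabulary.
KEY_SYNONYMS = {"cell line": "cell_line", "cellline": "cell_line",
                "assay type": "assay", "technique": "assay"}
VALUE_SYNONYMS = {"chip-seq": "ChIP-seq", "chipseq": "ChIP-seq"}

def normalize(record):
    """Return a record with keys and values rewritten to the shared vocabulary."""
    out = {}
    for key, value in record.items():
        key = KEY_SYNONYMS.get(key.strip().lower(), key.strip().lower())
        value = VALUE_SYNONYMS.get(value.strip().lower(), value.strip())
        out[key] = value
    return out

a = normalize({"Cell Line": "K562", "Assay Type": "ChIPseq"})
b = normalize({"cellline": "K562", "technique": "chip-seq"})
matched = a == b  # the two heterogeneous records describe the same assay
```

Real integrators replace these hand-written tables with curated ontologies, but the shape of the step (normalize first, then match) is the same.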


Subject(s)
Data Management/methods , Genome, Human , Genomics/methods , Humans , Metadata
20.
Brief Bioinform ; 22(2): 664-675, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33348368

ABSTRACT

With the outbreak of the COVID-19 disease, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching the genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special attention to COVID-19, emphasizing SARS-CoV-2-specific data and services. At the same time, new organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation, while laying the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with our knowledge of the COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants are distributed in time and space, relating them to the phenotypic impact evidenced in the literature.
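The surveillance idea sketched at the end of this abstract (observing how common variants distribute in time) can be illustrated with a toy frequency tally. The observations and variant labels below are invented for the example; a real system would draw them from the integrated sequence repositories the abstract discusses.

```python
# Hedged sketch of integrative variant surveillance: given (month, variant)
# observations, compute each variant's share of that month's sequences.
from collections import Counter, defaultdict

def variant_frequencies(observations):
    """Map (month, variant) -> fraction of that month's sequenced samples."""
    per_month = defaultdict(Counter)
    for month, variant in observations:
        per_month[month][variant] += 1
    return {(m, v): c / sum(counts.values())
            for m, counts in per_month.items()
            for v, c in counts.items()}

obs = [("2020-03", "D614G"), ("2020-03", "wild-type"),
       ("2020-04", "D614G"), ("2020-04", "D614G"), ("2020-04", "wild-type")]
freqs = variant_frequencies(obs)  # D614G rises from 1/2 to 2/3 of samples
```

Adding a location key to the grouping would give the spatial dimension the abstract mentions, and joining frequencies against literature-derived phenotype annotations would complete the surveillance loop it envisions.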


Subject(s)
COVID-19/therapy , COVID-19/epidemiology , COVID-19/virology , Genes, Viral , Humans , Information Storage and Retrieval , Pandemics , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification