Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 83
Filtrar
1.
BMC Bioinformatics ; 25(1): 180, 2024 May 08.
Artigo em Inglês | MEDLINE | ID: mdl-38720249

RESUMO

BACKGROUND: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. RESULTS: Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. CONCLUSIONS: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, facilitating rapid prototyping and benchmarking of algorithms for such dataset type, as well as the important need of advancing low-fraction variant calling techniques.


Assuntos
Benchmarking , Sequenciamento de Nucleotídeos em Larga Escala , Neoplasias , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos , Neoplasias/genética , Mutação , Algoritmos , DNA de Neoplasias/genética , Análise de Sequência de DNA/métodos , Biologia Computacional/métodos
2.
Int J Mol Sci ; 25(4)2024 Feb 08.
Artigo em Inglês | MEDLINE | ID: mdl-38396735

RESUMO

The in-silico strategy of identifying novel uses for already existing drugs, known as drug repositioning, has enhanced drug discovery. Previous studies have shown a positive correlation between expression changes induced by the anticancer agent trabectedin and those caused by irinotecan, a topoisomerase I inhibitor. Leveraging the availability of transcriptional datasets, we developed a general in-silico drug-repositioning approach that we applied to investigate novel trabectedin synergisms. We set a workflow allowing the identification of genes selectively modulated by a drug and possible novel drug interactions. To show its effectiveness, we selected trabectedin as a case-study drug. We retrieved eight transcriptional cancer datasets including controls and samples treated with trabectedin or its analog lurbinectedin. We compared gene signature associated with each dataset to the 476,251 signatures from the Connectivity Map database. The most significant connections referred to mitomycin-c, topoisomerase II inhibitors, a PKC inhibitor, a Chk1 inhibitor, an antifungal agent, and an antagonist of the glutamate receptor. Genes coherently modulated by the drugs were involved in cell cycle, PPARalpha, and Rho GTPases pathways. Our in-silico approach for drug synergism identification showed that trabectedin modulates specific pathways that are shared with other drugs, suggesting possible synergisms.


Assuntos
Antineoplásicos , Tetra-Hidroisoquinolinas , Trabectedina/farmacologia , Trabectedina/uso terapêutico , Tetra-Hidroisoquinolinas/farmacologia , Dioxóis/farmacologia , Sinergismo Farmacológico
3.
BMC Bioinformatics ; 24(1): 395, 2023 Oct 20.
Artigo em Inglês | MEDLINE | ID: mdl-37864168

RESUMO

BACKGROUND: Transcription factors (TF) play a crucial role in the regulation of gene transcription; alterations of their activity and binding to DNA areas are strongly involved in cancer and other disease onset and development. For proper biomedical investigation, it is hence essential to correctly trace TF dense DNA areas, having multiple bindings of distinct factors, and select DNA high occupancy target (HOT) zones, showing the highest accumulation of such bindings. Indeed, systematic and replicable analysis of HOT zones in a large variety of cells and tissues would allow further understanding of their characteristics and could clarify their functional role. RESULTS: Here, we propose, thoroughly explain and discuss a full computational procedure to study in-depth DNA dense areas of transcription factor accumulation and identify HOT zones. This methodology, developed as a computationally efficient parametric algorithm implemented in an R/Bioconductor package, uses a systematic approach with two alternative methods to examine transcription factor bindings and provide comparative and fully-reproducible assessments. It offers different resolutions by introducing three distinct types of accumulation, which can analyze DNA from single-base to region-oriented levels, and a moving window, which can estimate the influence of the neighborhood for each DNA base under exam. CONCLUSIONS: We quantitatively assessed the full procedure by using our implemented software package, named TFHAZ, in two example applications of biological interest, proving its full reliability and relevance.


Assuntos
Regulação da Expressão Gênica , Fatores de Transcrição , Fatores de Transcrição/metabolismo , Reprodutibilidade dos Testes , DNA/genética , Ligação Proteica , Sítios de Ligação/genética
4.
Brief Bioinform ; 22(1): 30-44, 2021 01 18.
Artigo em Inglês | MEDLINE | ID: mdl-32496509

RESUMO

Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players and analyse the issues in solving the genomic data integration challenges, as well as evaluate the computational environments that they provide to follow up data integration by means of visualization and analysis tools.


Assuntos
Gerenciamento de Dados/métodos , Genoma Humano , Genômica/métodos , Humanos , Metadados
5.
Brief Bioinform ; 22(3)2021 05 20.
Artigo em Inglês | MEDLINE | ID: mdl-34020536

RESUMO

MOTIVATION: With the spreading of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations are facing the need of sharing NGS data resources and easily accessing and processing comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.


Assuntos
Conjuntos de Dados como Assunto , Genômica , Disseminação de Informação , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos , Linguagens de Programação
6.
Brief Bioinform ; 22(2): 664-675, 2021 03 22.
Artigo em Inglês | MEDLINE | ID: mdl-33348368

RESUMO

With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of COVID-19 pandemics, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for contrasting possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets will be available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.


Assuntos
COVID-19/terapia , COVID-19/epidemiologia , COVID-19/virologia , Genes Virais , Humanos , Armazenamento e Recuperação da Informação , Pandemias , SARS-CoV-2/genética , SARS-CoV-2/isolamento & purificação
7.
Bioinformatics ; 38(5): 1183-1190, 2022 02 07.
Artigo em Inglês | MEDLINE | ID: mdl-34864898

RESUMO

MOTIVATION: Approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) represent the standard for the identification of binding sites of DNA-associated proteins, including transcription factors and histone marks. Public repositories of omics data contain a huge number of experimental ChIP-seq data, but their reuse and integrative analysis across multiple conditions remain a daunting task. RESULTS: We present the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE), an efficient computational method able to integrate and take advantage of the valuable and numerous, but heterogeneous, ChIP-seq data publicly available in big data repositories. Leveraging natural language processing techniques, it integrates omics data samples with semantic annotations from selected biomedical ontologies; then, using hidden Markov models, it identifies combinations of static and dynamic functional elements throughout the genome for the corresponding samples. CombSAFE allows analyzing the whole genome, by clustering patterns of regions with similar functional elements and through enrichment analyses to discover ontological terms significantly associated with them. Moreover, it allows comparing functional states of a specific genomic region to analyze their different behavior throughout the various semantic annotations. Such findings can provide novel insights by identifying unexpected combinations of functional elements in different biological conditions. AVAILABILITY AND IMPLEMENTATION: The Python implementation of the CombSAFE pipeline is freely available for non-commercial use at: https://github.com/DEIB-GECO/CombSAFE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica , Semântica , Análise de Sequência de DNA/métodos , Genômica/métodos , Sequenciamento de Cromatina por Imunoprecipitação , Genoma
8.
J Biomed Inform ; 144: 104457, 2023 08.
Artigo em Inglês | MEDLINE | ID: mdl-37488024

RESUMO

BACKGROUND AND OBJECTIVE: Many classification tasks in translational bioinformatics and genomics are characterized by the high dimensionality of potential features and unbalanced sample distribution among classes. This can affect classifier robustness and increase the risk of overfitting, curse of dimensionality and generalization leaks; furthermore and most importantly, this can prevent obtaining adequate patient stratification required for precision medicine in facing complex diseases, like cancer. Setting up a feature selection strategy able to extract only proper predictive features by removing irrelevant, redundant, and noisy ones is crucial to achieving valuable results on the desired task. METHODS: We propose a new feature selection approach, called ReRa, based on supervised Relevance-Redundancy assessments. ReRa consists of a customized step of relevance-based filtering, to identify a reduced subset of meaningful features, followed by a supervised similarity-based procedure to minimize redundancy. This latter step innovatively uses a combination of global and class-specific similarity assessments to remove redundant features while preserving those differentiated across classes, even when these classes are strongly unbalanced. RESULTS: We compared ReRa with several existing feature selection methods to obtain feature spaces on which performing breast cancer patient subtyping using several classifiers: we considered two use cases based on gene or transcript isoform expression. In the vast majority of the assessed scenarios, when using ReRa-selected feature spaces, the performances were significantly increased compared to simple feature filtering, LASSO regularization, or even MRmr - another Relevance-Redundancy method. The two use cases represent an insightful example of translational application, taking advantage of ReRa capabilities to investigate and enhance a clinically-relevant patient stratification task, which could be easily applied also to other cancer types and diseases. CONCLUSIONS: ReRa approach has the potential to improve the performance of machine learning models used in an unbalanced classification scenario. Compared to another Relevance-Redundancy approach like MRmr, ReRa does not require tuning the number of preserved features, ensures efficiency and scalability over huge initial dimensionalities and allows re-evaluation of all previously selected features at each iteration of the redundancy assessment, to ultimately preserve only the most relevant and class-differentiated features.


Assuntos
Algoritmos , Neoplasias da Mama , Humanos , Feminino , Biologia Computacional/métodos , Genômica , Proteômica , Neoplasias da Mama/diagnóstico , Neoplasias da Mama/genética
9.
BMC Bioinformatics ; 23(1): 123, 2022 Apr 07.
Artigo em Inglês | MEDLINE | ID: mdl-35392801

RESUMO

BACKGROUND: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures. RESULTS: We propose RGMQL, a R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax, by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. But mostly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions. CONCLUSIONS: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user.


Assuntos
Metadados , Software , Big Data , Computação em Nuvem , Genômica
10.
BMC Bioinformatics ; 23(1): 401, 2022 Sep 29.
Artigo em Inglês | MEDLINE | ID: mdl-36175857

RESUMO

BACKGROUND: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics. RESULTS: Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. Provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities. CONCLUSIONS: The proposed data integration pipeline and data set extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data, and allow biologists and bioinformaticians to easily perform scalable analysis on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current tendency to large (cross)nation-wide sequencing and variation initiatives, we expect an ever growing need for the kind of computational support hereby proposed.


Assuntos
Genômica , Metadados , Biologia Computacional , Genótipo , Humanos , Software
11.
BMC Bioinformatics ; 23(1): 151, 2022 Apr 26.
Artigo em Inglês | MEDLINE | ID: mdl-35473556

RESUMO

BACKGROUND: Histone Mark Modifications (HMs) are crucial actors in gene regulation, as they actively remodel chromatin to modulate transcriptional activity: aberrant combinatorial patterns of HMs have been connected with several diseases, including cancer. HMs are, however, reversible modifications: understanding their role in disease would allow the design of 'epigenetic drugs' for specific, non-invasive treatments. Standard statistical techniques were not entirely successful in extracting representative features from raw HM signals over gene locations. On the other hand, deep learning approaches allow for effective automatic feature extraction, but at the expense of model interpretation. RESULTS: Here, we propose ShallowChrome, a novel computational pipeline to model transcriptional regulation via HMs in both an accurate and interpretable way. We attain state-of-the-art results on the binary classification of gene transcriptional states over 56 cell-types from the REMC database, largely outperforming recent deep learning approaches. We interpret our models by extracting insightful gene-specific regulative patterns, and we analyse them for the specific case of the PAX5 gene over three differentiated blood cell lines. Finally, we compare the patterns we obtained with the characteristic emission patterns of ChromHMM, and show that ShallowChrome is able to coherently rank groups of chromatin states w.r.t. their transcriptional activity. CONCLUSIONS: In this work we demonstrate that it is possible to model HM-modulated gene expression regulation in a highly accurate, yet interpretable way. Our feature extraction algorithm leverages on data downstream the identification of enriched regions to retrieve gene-wise, statistically significant and dynamically located features for each HM. These features are highly predictive of gene transcriptional state, and allow for accurate modeling by computationally efficient logistic regression models. These models allow a direct inspection and a rigorous interpretation, helping to formulate quantifiable hypotheses.


Assuntos
Código das Histonas , Histonas , Cromatina , Expressão Gênica , Histonas/metabolismo , Processamento de Proteína Pós-Traducional
12.
BMC Bioinformatics ; 22(1): 571, 2021 Nov 27.
Artigo em Inglês | MEDLINE | ID: mdl-34837938

RESUMO

BACKGROUND: In-depth analysis of regulation networks of genes aberrantly expressed in cancer is essential for better understanding tumors and identifying key genes that could be therapeutically targeted. RESULTS: We developed a quantitative analysis approach to investigate the main biological relationships among different regulatory elements and target genes; we applied it to Ovarian Serous Cystadenocarcinoma and 177 target genes belonging to three main pathways (DNA REPAIR, STEM CELLS and GLUCOSE METABOLISM) relevant for this tumor. Combining data from ENCODE and TCGA datasets, we built a predictive linear model for the regulation of each target gene, assessing the relationships between its expression, promoter methylation, expression of genes in the same or in the other pathways and of putative transcription factors. We proved the reliability and significance of our approach in a similar tumor type (basal-like Breast cancer) and using a different existing algorithm (ARACNe), and we obtained experimental confirmations on potentially interesting results. CONCLUSIONS: The analysis of the proposed models allowed disclosing the relations between a gene and its related biological processes, the interconnections between the different gene sets, and the evaluation of the relevant regulatory elements at single gene level. This led to the identification of already known regulators and/or gene correlations and to unveil a set of still unknown and potentially interesting biological relationships for their pharmacological and clinical use.


Assuntos
Regulação Neoplásica da Expressão Gênica , Redes Reguladoras de Genes , Algoritmos , Perfilação da Expressão Gênica , Regulação da Expressão Gênica , Reprodutibilidade dos Testes , Fatores de Transcrição/metabolismo
13.
Bioinformatics ; 36(4): 1007-1013, 2020 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-31504203

RESUMO

MOTIVATION: Genome regulatory networks have different layers and ways to modulate cellular processes, such as cell differentiation, proliferation, and adaptation to external stimuli. Transcription factors and other chromatin-associated proteins act as combinatorial protein complexes that control gene transcription. Thus, identifying functional interaction networks among these proteins is a fundamental task to understand the genome regulation framework. RESULTS: We developed a novel approach to infer interactions among transcription factors in user-selected genomic regions, by combining the computation of association rules and of a novel Importance Index on ChIP-seq datasets. The hallmark of our method is the definition of the Importance Index, which provides a relevance measure of the interaction among transcription factors found associated in the computed rules. Examples on synthetic data explain the index use and potential. A straightforward pre-processing pipeline enables the easy extraction of input data for our approach from any set of ChIP-seq experiments. Applications on ENCODE ChIP-seq data prove that our approach can reliably detect interactions between transcription factors, including known interactions that validate our approach. AVAILABILITY AND IMPLEMENTATION: A R/Bioconductor package implementing our association rules and Importance Index-based method is available at http://bioconductor.org/packages/release/bioc/html/TFARM.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genoma , Mineração de Dados , Genômica , Software , Fatores de Transcrição
14.
BMC Bioinformatics ; 21(1): 464, 2020 Oct 19.
Artigo em Inglês | MEDLINE | ID: mdl-33076821

RESUMO

BACKGROUND: Genome browsers are widely used for locating interesting genomic regions, but their interactive use is obviously limited to inspecting short genomic portions. An ideal interaction is to provide patterns of regions on the browser, and then extract other genomic regions over the whole genome where such patterns occur, ranked by similarity. RESULTS: We developed SimSearch, an optimized pattern-search method and an open source plugin for the Integrated Genome Browser (IGB), to find genomic region sets that are similar to a given region pattern. It provides efficient visual genome-wide analytics computation in large datasets; the plugin supports intuitive user interactions for selecting an interesting pattern on IGB tracks and visualizing the computed occurrences of similar patterns along the entire genome. SimSearch also includes functions for the annotation and enrichment of results, and is enhanced with a Quickload repository including numerous epigenomic feature datasets from ENCODE and Roadmap Epigenomics. The paper also includes some use cases to show multiple genome-wide analyses of biological interest, which can be easily performed by taking advantage of the presented approach. CONCLUSIONS: The novel SimSearch method provides innovative support for effective genome-wide pattern search and visualization; its relevance and practical usefulness is demonstrated through a number of significant use cases of biological interest. The SimSearch IGB plugin, documentation, and code are freely available at https://deib-geco.github.io/simsearch-app/ and https://github.com/DEIB-GECO/simsearch-app/ .


Assuntos
Algoritmos , Epigenômica , Genoma , Reconhecimento Automatizado de Padrão , Navegador , Humanos , Anotação de Sequência Molecular , Ligação Proteica , Sequências Reguladoras de Ácido Nucleico/genética , Fatores de Transcrição/metabolismo
15.
Anal Chem ; 92(13): 8874-8882, 2020 07 07.
Artigo em Inglês | MEDLINE | ID: mdl-32501676

RESUMO

Metabolomics and lipidomics studies are becoming increasingly popular but available tools for automated data analysis are still limited. The major issue in untargeted metabolomics is linked to the lack of efficient ranking methods allowing accurate identification of metabolites. Herein, we provide a user-friendly open-source software, named SMfinder, for the robust identification and quantification of small molecules. The software introduces an MS2 false discovery rate approach, which is based on single spectral permutation and increases identification accuracy. SMfinder can be efficiently applied to shotgun and targeted analysis in metabolomics and lipidomics without requiring extensive in-house acquisition of standards as it provides accurate identification by using available MS2 libraries in instrument independent manner. The software, downloadable at www.ifom.eu/SMfinder, is suitable for untargeted, targeted, and flux analysis.


Assuntos
Lipidômica/métodos , Metabolômica/métodos , Interface Usuário-Computador , Isótopos de Carbono/química , Isótopos de Carbono/metabolismo , Linhagem Celular Tumoral , Humanos , Lipídeos/análise , Metaboloma
16.
Bioinformatics ; 35(5): 729-736, 2019 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-30101316

RESUMO

MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Epigenômica , Genoma , Genômica
17.
Brief Bioinform ; 18(3): 498-510, 2017 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-27075479

RESUMO

Metabolomics is a rapidly growing field consisting of the analysis of a large number of metabolites at a system scale. The two major goals of metabolomics are the identification of the metabolites characterizing each organism state and the measurement of their dynamics under different situations (e.g. pathological conditions, environmental factors). Knowledge about metabolites is crucial for the understanding of most cellular phenomena, but this information alone is not sufficient to gain a comprehensive view of all the biological processes involved. Integrated approaches combining metabolomics with transcriptomics and proteomics are thus required to obtain much deeper insights than any of these techniques alone. Although this information is available, multilevel integration of different 'omics' data is still a challenge. The handling, processing, analysis and integration of these data require specialized mathematical, statistical and bioinformatics tools, and several technical problems hampering a rapid progress in the field exist. Here, we review four main tools for number of users or provided features (MetaCoreTM, MetaboAnalyst, InCroMAP and 3Omics) out of the several available for metabolomic data analysis and integration with other 'omics' data, highlighting their strong and weak aspects; a number of related issues affecting data analysis and integration are also identified and discussed. Overall, we provide an objective description of how some of the main currently available software packages work, which may help the experimental practitioner in the choice of a robust pipeline for metabolomic data analysis and integration.


Assuntos
Metabolômica , Proteômica
18.
Brief Bioinform ; 18(3): 367-381, 2017 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-27013647

RESUMO

Enriched region (ER) identification is a fundamental step in several next-generation sequencing (NGS) experiment types. Yet, although NGS experimental protocols recommend producing replicate samples for each evaluated condition and their consistency is usually assessed, typically pipelines for ER identification do not consider available NGS replicates. This may alter genome-wide descriptions of ERs, hinder significance of subsequent analyses on detected ERs and eventually preclude biological discoveries that evidence in replicate could support. MuSERA is a broadly useful stand-alone tool for both interactive and batch analysis of combined evidence from ERs in multiple ChIP-seq or DNase-seq replicates. Besides rigorously combining sample replicates to increase statistical significance of detected ERs, it also provides quantitative evaluations and graphical features to assess the biological relevance of each determined ER set within its genomic context; they include genomic annotation of determined ERs, nearest ER distance distribution, global correlation assessment of ERs and an integrated genome browser. We review MuSERA rationale and implementation, and illustrate how sets of significant ERs are expanded by applying MuSERA on replicates for several types of NGS data, including ChIP-seq of transcription factors or histone marks and DNase-seq hypersensitive sites. We show that MuSERA can determine a new, enhanced set of ERs for each sample by locally combining evidence on replicates, and prove how the easy-to-use interactive graphical displays and quantitative evaluations that MuSERA provides effectively support thorough inspection of obtained results and evaluation of their biological content, facilitating their understanding and biological interpretations. MuSERA is freely available at http://www.bioinformatics.deib.polimi.it/MuSERA/.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Imunoprecipitação da Cromatina , Genoma , Genômica , Software
19.
BMC Bioinformatics ; 18(1): 6, 2017 Jan 03.
Artigo em Inglês | MEDLINE | ID: mdl-28049410

RESUMO

BACKGROUND: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of high-throughout experiments, mainly Next Generation Sequencing, for more than 30 cancer types. RESULTS: We propose TCGA2BED a software tool to search and retrieve TCGA data, and convert them in the structured BED format for their seamless use and integration. Additionally, it supports the conversion in CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1,V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen meta data in attribute-value text format. CONCLUSIONS: The availability of the valuable TCGA data in BED format reduces the time spent in taking advantage of them: it is possible to efficiently and effectively deal with huge amounts of cancer genomic data integratively, and to search, retrieve and extend them with additional information. The BED format facilitates the investigators allowing several knowledge discovery analyses on all tumor types in TCGA with the final aim of understanding pathological mechanisms and aiding cancer treatments.


Assuntos
Neoplasias/genética , Interface Usuário-Computador , Variações do Número de Cópias de DNA , Metilação de DNA , Bases de Dados Genéticas , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Internet , MicroRNAs/química , MicroRNAs/metabolismo , Neoplasias/patologia , Análise de Sequência de DNA
20.
BMC Bioinformatics ; 18(1): 536, 2017 Dec 04.
Artigo em Inglês | MEDLINE | ID: mdl-29202689

RESUMO

BACKGROUND: With the wide-spreading of public repositories of NGS processed data, the availability of user-friendly and effective tools for data exploration, analysis and visualization is becoming very relevant. These tools enable interactive analytics, an exploratory approach for the seamless "sense-making" of data through on-the-fly integration of analysis and visualization phases, suggested not only for evaluating processing results, but also for designing and adapting NGS data analysis pipelines. RESULTS: This paper presents abstractions for supporting the early analysis of NGS processed data and their implementation in an associated tool, named GenoMetric Space Explorer (GeMSE). This tool serves the needs of the GenoMetric Query Language, an innovative cloud-based system for computing complex queries over heterogeneous processed data. It can also be used starting from any text files in standard BED, BroadPeak, NarrowPeak, GTF, or general tab-delimited format, containing numerical features of genomic regions; metadata can be provided as text files in tab-delimited attribute-value format. GeMSE allows interactive analytics, consisting of on-the-fly cycling among steps of data exploration, analysis and visualization that help biologists and bioinformaticians in making sense of heterogeneous genomic datasets. By means of an explorative interaction support, users can trace past activities and quickly recover their results, seamlessly going backward and forward in the analysis steps and comparative visualizations of heatmaps. CONCLUSIONS: GeMSE effective application and practical usefulness is demonstrated through significant use cases of biological interest. GeMSE is available at http://www.bioinformatics.deib.polimi.it/GeMSE/ , and its source code is available at https://github.com/Genometric/GeMSE under GPLv3 open-source license.


Assuntos
Bases de Dados Genéticas , Genômica/métodos , Metadados , Células A549 , Dexametasona/farmacologia , Etanol/farmacologia , Humanos , Modelos Teóricos , Reconhecimento Automatizado de Padrão , Mapeamento de Interação de Proteínas , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA