Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 53
Filtrar
Más filtros

Bases de datos
Tipo del documento
Intervalo de año de publicación
1.
BMC Genomics ; 25(Suppl 3): 817, 2024 Aug 30.
Artículo en Inglés | MEDLINE | ID: mdl-39210256

RESUMEN

BACKGROUND: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as control or to discover new knowledge. However, different repositories adhere to the different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding sites data - ENCODE and Cistrome - process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks, but can offer valuable insight on the quality of the data. RESULTS: We provide evidence that data points with high signalValue(s) (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to said high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome. CONCLUSIONS: The signalValue feature is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends to downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation.


Asunto(s)
Factores de Transcripción , Humanos , Factores de Transcripción/metabolismo , Factores de Transcripción/genética , Sitios de Unión , Unión Proteica , Biología Computacional/métodos , Aprendizaje Automático , Bases de Datos Genéticas , Algoritmos , Genómica/métodos
2.
J Med Internet Res ; 26: e52655, 2024 May 30.
Artículo en Inglés | MEDLINE | ID: mdl-38814687

RESUMEN

BACKGROUND: Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list. This type of search, however, does not provide visual support to the task and is not suited to expressing complex queries or compensating for missing specifications. OBJECTIVE: This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19-related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. METHODS: We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications' abstracts using terms selected from the Unified Medical Language System and the Ontology of Coronavirus Infectious Disease. Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. RESULTS: We built a large co-occurrence network, consisting of 128,249 entities and 47,198,965 relationships; the GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a bibliography of publications, which are globally ranked; and each publication is further associated with the specific parts of the query that it explains, thereby allowing the user to understand each aspect of the matching. CONCLUSIONS: Our approach supports the process of query formulation and evidence search upon a large text corpus; it can be reapplied to any scientific domain where documents corpora and curated ontologies are made available.


Asunto(s)
Algoritmos , COVID-19 , SARS-CoV-2 , COVID-19/epidemiología , Humanos , Pandemias , Almacenamiento y Recuperación de la Información/métodos , Investigación Biomédica/métodos , Unified Medical Language System , Motor de Búsqueda
3.
Int J Mol Sci ; 25(17)2024 Sep 04.
Artículo en Inglés | MEDLINE | ID: mdl-39273521

RESUMEN

The vast corpus of heterogeneous biomedical data stored in databases, ontologies, and terminologies presents a unique opportunity for drug design. Integrating and fusing these sources is essential to develop data representations that can be analyzed using artificial intelligence methods to generate novel drug candidates or hypotheses. Here, we propose Non-Negative Matrix Tri-Factorization as an invaluable tool for integrating and fusing data, as well as for representation learning. Additionally, we demonstrate how representations learned by Non-Negative Matrix Tri-Factorization can effectively be utilized by traditional artificial intelligence methods. While this approach is domain-agnostic and applicable to any field with vast amounts of structured and semi-structured data, we apply it specifically to computational pharmacology and drug repurposing. This field is poised to benefit significantly from artificial intelligence, particularly in personalized medicine. We conducted extensive experiments to evaluate the performance of the proposed method, yielding exciting results, particularly compared to traditional methods. Novel drug-target predictions have also been validated in the literature, further confirming their validity. Additionally, we tested our method to predict drug synergism, where constructing a classical matrix dataset is challenging. The method demonstrated great flexibility, suggesting its applicability to a wide range of tasks in drug design and discovery.


Asunto(s)
Reposicionamiento de Medicamentos , Reposicionamiento de Medicamentos/métodos , Humanos , Inteligencia Artificial , Biología Computacional/métodos , Aprendizaje Automático , Algoritmos , Descubrimiento de Drogas/métodos , Multiómica
4.
Brief Bioinform ; 22(1): 30-44, 2021 01 18.
Artículo en Inglés | MEDLINE | ID: mdl-32496509

RESUMEN

Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players and analyse the issues in solving the genomic data integration challenges, as well as evaluate the computational environments that they provide to follow up data integration by means of visualization and analysis tools.


Asunto(s)
Manejo de Datos/métodos , Genoma Humano , Genómica/métodos , Humanos , Metadatos
5.
Brief Bioinform ; 22(2): 664-675, 2021 03 22.
Artículo en Inglés | MEDLINE | ID: mdl-33348368

RESUMEN

With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of COVID-19 pandemics, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for contrasting possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets will be available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.


Asunto(s)
COVID-19/terapia , COVID-19/epidemiología , COVID-19/virología , Genes Virales , Humanos , Almacenamiento y Recuperación de la Información , Pandemias , SARS-CoV-2/genética , SARS-CoV-2/aislamiento & purificación
6.
Brief Bioinform ; 22(3)2021 05 20.
Artículo en Inglés | MEDLINE | ID: mdl-34020536

RESUMEN

MOTIVATION: With the spreading of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations are facing the need of sharing NGS data resources and easily accessing and processing comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.


Asunto(s)
Conjuntos de Datos como Asunto , Genómica , Difusión de la Información , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Humanos , Lenguajes de Programación
7.
Bioinformatics ; 38(7): 1988-1994, 2022 03 28.
Artículo en Inglés | MEDLINE | ID: mdl-35040923

RESUMEN

MOTIVATION: The ongoing evolution of SARS-CoV-2 and the rapid emergence of variants of concern at distinct geographic locations have relevant implications for the implementation of strategies for controlling the COVID-19 pandemic. Combining the growing body of data and the evidence on potential functional implications of SARS-CoV-2 mutations can suggest highly effective methods for the prioritization of novel variants of potential concern, e.g. increasing in frequency locally and/or globally. However, these analyses may be complex, requiring the integration of different data and resources. We claim the need for a streamlined access to up-to-date and high-quality genome sequencing data from different geographic regions/countries, and the current lack of a robust and consistent framework for the evaluation/comparison of the results. RESULTS: To overcome these limitations, we developed ViruClust, a novel tool for the comparison of SARS-CoV-2 genomic sequences and lineages in space and time. ViruClust is made available through a powerful and intuitive web-based user interface. Sophisticated large-scale analyses can be executed with a few clicks, even by users without any computational background. To demonstrate potential applications of our method, we applied ViruClust to conduct a thorough study of the evolution of the most prevalent lineage of the Delta SARS-CoV-2 variant, and derived relevant observations. By allowing the seamless integration of different types of functional annotations and the direct comparison of viral genomes and genetic variants in space and time, ViruClust represents a highly valuable resource for monitoring the evolution of SARS-CoV-2, facilitating the identification of variants and/or mutations of potential concern. AVAILABILITY AND IMPLEMENTATION: ViruClust is openly available at http://gmql.eu/viruclust/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
COVID-19 , SARS-CoV-2 , Humanos , SARS-CoV-2/genética , Pandemias , Mapeo Cromosómico
8.
Nucleic Acids Res ; 49(D1): D817-D824, 2021 01 08.
Artículo en Inglés | MEDLINE | ID: mdl-33045721

RESUMEN

ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; but ViruSurf contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described from their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes the sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.


Asunto(s)
COVID-19/prevención & control , Biología Computacional/métodos , Curaduría de Datos/métodos , Bases de Datos Genéticas , Genoma Viral/genética , SARS-CoV-2/genética , COVID-19/epidemiología , COVID-19/virología , Variación Genética , Humanos , Almacenamiento y Recuperación de la Información/métodos , Internet , Pandemias , SARS-CoV-2/fisiología , Interfaz Usuario-Computador
9.
Nucleic Acids Res ; 49(15): e90, 2021 09 07.
Artículo en Inglés | MEDLINE | ID: mdl-34107016

RESUMEN

Variant visualization plays an important role in supporting the viral evolution analysis, extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide. Moreover, VirusViz accepts sequences of new viral populations as multi-FASTA files plus corresponding metadata in CSV format; a bioinformatic pipeline builds a suitable input for VirusViz by extracting the nucleotide and amino acid variants. Pages of VirusViz provide metadata summarization, variant descriptions, and variant visualization with rich options for zooming, highlighting variants or regions of interest, and switching from nucleotides to amino acids; sequences can be grouped, groups can be comparatively analyzed. For SARS-CoV-2, we manually collect mutations with known or predicted levels of severity/virulence, as indicated in linked research articles; such critical mutations are reported when observed in sequences. The system includes light-weight project management for downloading, resuming, and merging data analysis sessions. VirusViz is freely available at http://gmql.eu/virusviz/.


Asunto(s)
COVID-19/virología , Visualización de Datos , SARS-CoV-2/química , SARS-CoV-2/genética , Secuencia de Aminoácidos , Secuencia de Bases , Bases de Datos Factuales , Humanos , Bases del Conocimiento , SARS-CoV-2/clasificación , Sudáfrica/epidemiología , Estados Unidos/epidemiología
10.
BMC Bioinformatics ; 22(1): 250, 2021 May 15.
Artículo en Inglés | MEDLINE | ID: mdl-33992077

RESUMEN

BACKGROUND: A pair of genes is defined as synthetically lethal if defects on both cause the death of the cell but a defect in only one of the two is compatible with cell viability. Ideally, if A and B are two synthetic lethal genes, inhibiting B should kill cancer cells with a defect on A, and should have no effects on normal cells. Thus, synthetic lethality can be exploited for highly selective cancer therapies, which need to exploit differences between normal and cancer cells. RESULTS: In this paper, we present a new method for predicting synthetic lethal (SL) gene pairs. As neighbouring genes in the genome have highly correlated profiles of copy number variations (CNAs), our method clusters proximal genes with a similar CNA profile, then predicts mutually exclusive group pairs, and finally identifies the SL gene pairs within each group pairs. For mutual-exclusion testing we use a graph-based method which takes into account the mutation frequencies of different subjects and genes. We use two different methods for selecting the pair of SL genes; the first is based on the gene essentiality measured in various conditions by means of the "Gene Activity Ranking Profile" GARP score; the second leverages the annotations of gene to biological pathways. CONCLUSIONS: This method is unique among current SL prediction approaches, it reduces false-positive SL predictions compared to previous methods, and it allows establishing explicit collateral lethality relationship of gene pairs within mutually exclusive group pairs.


Asunto(s)
Variaciones en el Número de Copia de ADN , Genes Letales , ADN
11.
Bioinformatics ; 36(Suppl_2): i700-i708, 2020 12 30.
Artículo en Inglés | MEDLINE | ID: mdl-33381846

RESUMEN

MOTIVATION: The relationship between gene co-expression and chromatin conformation is of great biological interest. Thanks to high-throughput chromosome conformation capture technologies (Hi-C), researchers are gaining insights on the tri-dimensional organization of the genome. Given the high complexity of Hi-C data and the difficult definition of gene co-expression networks, the development of proper computational tools to investigate such relationship is rapidly gaining the interest of researchers. One of the most fascinating questions in this context is how chromatin topology correlates with gene co-expression and which physical interaction patterns are most predictive of co-expression relationships. RESULTS: To address these questions, we developed a computational framework for the prediction of co-expression networks from chromatin conformation data. We first define a gene chromatin interaction network where each gene is associated to its physical interaction profile; then, we apply two graph embedding techniques to extract a low-dimensional vector representation of each gene from the interaction network; finally, we train a classifier on gene embedding pairs to predict if they are co-expressed. Both graph embedding techniques outperform previous methods based on manually designed topological features, highlighting the need for more advanced strategies to encode chromatin information. We also establish that the most recent technique, based on random walks, is superior. Overall, our results demonstrate that chromatin conformation and gene regulation share a non-linear relationship and that gene topological embeddings encode relevant information, which could be used also for downstream analysis. AVAILABILITY AND IMPLEMENTATION: The source code for the analysis is available at: https://github.com/marcovarrone/gene-expression-chromatin. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Cromatina , Cromosomas , Cromatina/genética , Genoma , Conformación Molecular , Programas Informáticos
12.
BMC Bioinformatics ; 21(1): 464, 2020 Oct 19.
Artículo en Inglés | MEDLINE | ID: mdl-33076821

RESUMEN

BACKGROUND: Genome browsers are widely used for locating interesting genomic regions, but their interactive use is obviously limited to inspecting short genomic portions. An ideal interaction is to provide patterns of regions on the browser, and then extract other genomic regions over the whole genome where such patterns occur, ranked by similarity. RESULTS: We developed SimSearch, an optimized pattern-search method and an open source plugin for the Integrated Genome Browser (IGB), to find genomic region sets that are similar to a given region pattern. It provides efficient visual genome-wide analytics computation in large datasets; the plugin supports intuitive user interactions for selecting an interesting pattern on IGB tracks and visualizing the computed occurrences of similar patterns along the entire genome. SimSearch also includes functions for the annotation and enrichment of results, and is enhanced with a Quickload repository including numerous epigenomic feature datasets from ENCODE and Roadmap Epigenomics. The paper also includes some use cases to show multiple genome-wide analyses of biological interest, which can be easily performed by taking advantage of the presented approach. CONCLUSIONS: The novel SimSearch method provides innovative support for effective genome-wide pattern search and visualization; its relevance and practical usefulness is demonstrated through a number of significant use cases of biological interest. The SimSearch IGB plugin, documentation, and code are freely available at https://deib-geco.github.io/simsearch-app/ and https://github.com/DEIB-GECO/simsearch-app/ .


Asunto(s)
Algoritmos , Epigenómica , Genoma , Reconocimiento de Normas Patrones Automatizadas , Navegador Web , Humanos , Anotación de Secuencia Molecular , Unión Proteica , Secuencias Reguladoras de Ácidos Nucleicos/genética , Factores de Transcripción/metabolismo
13.
Bioinformatics ; 35(5): 729-736, 2019 03 01.
Artículo en Inglés | MEDLINE | ID: mdl-30101316

RESUMEN

MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Programas Informáticos , Epigenómica , Genoma , Genómica
14.
BMC Bioinformatics ; 20(1): 560, 2019 Nov 08.
Artículo en Inglés | MEDLINE | ID: mdl-31703553

RESUMEN

BACKGROUND: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. RESULTS: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. CONCLUSIONS: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.


Asunto(s)
Análisis de Datos , Bases de Datos Genéticas , Genómica , Programas Informáticos , Elementos de Facilitación Genéticos/genética , Genoma , Estudio de Asociación del Genoma Completo , Humanos , Reproducibilidad de los Resultados , Factores de Transcripción/metabolismo
15.
BMC Bioinformatics ; 18(1): 536, 2017 Dec 04.
Artículo en Inglés | MEDLINE | ID: mdl-29202689

RESUMEN

BACKGROUND: With the wide-spreading of public repositories of NGS processed data, the availability of user-friendly and effective tools for data exploration, analysis and visualization is becoming very relevant. These tools enable interactive analytics, an exploratory approach for the seamless "sense-making" of data through on-the-fly integration of analysis and visualization phases, suggested not only for evaluating processing results, but also for designing and adapting NGS data analysis pipelines. RESULTS: This paper presents abstractions for supporting the early analysis of NGS processed data and their implementation in an associated tool, named GenoMetric Space Explorer (GeMSE). This tool serves the needs of the GenoMetric Query Language, an innovative cloud-based system for computing complex queries over heterogeneous processed data. It can also be used starting from any text files in standard BED, BroadPeak, NarrowPeak, GTF, or general tab-delimited format, containing numerical features of genomic regions; metadata can be provided as text files in tab-delimited attribute-value format. GeMSE allows interactive analytics, consisting of on-the-fly cycling among steps of data exploration, analysis and visualization that help biologists and bioinformaticians in making sense of heterogeneous genomic datasets. By means of an explorative interaction support, users can trace past activities and quickly recover their results, seamlessly going backward and forward in the analysis steps and comparative visualizations of heatmaps. CONCLUSIONS: GeMSE effective application and practical usefulness is demonstrated through significant use cases of biological interest. GeMSE is available at http://www.bioinformatics.deib.polimi.it/GeMSE/ , and its source code is available at https://github.com/Genometric/GeMSE under GPLv3 open-source license.


Asunto(s)
Bases de Datos Genéticas , Genómica/métodos , Metadatos , Células A549 , Dexametasona/farmacología , Etanol/farmacología , Humanos , Modelos Teóricos , Reconocimiento de Normas Patrones Automatizadas , Mapeo de Interacción de Proteínas , Programas Informáticos
16.
BMC Bioinformatics ; 18(1): 6, 2017 Jan 03.
Artículo en Inglés | MEDLINE | ID: mdl-28049410

RESUMEN

BACKGROUND: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of high-throughout experiments, mainly Next Generation Sequencing, for more than 30 cancer types. RESULTS: We propose TCGA2BED a software tool to search and retrieve TCGA data, and convert them in the structured BED format for their seamless use and integration. Additionally, it supports the conversion in CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1,V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen meta data in attribute-value text format. CONCLUSIONS: The availability of the valuable TCGA data in BED format reduces the time spent in taking advantage of them: it is possible to efficiently and effectively deal with huge amounts of cancer genomic data integratively, and to search, retrieve and extend them with additional information. The BED format facilitates the investigators allowing several knowledge discovery analyses on all tumor types in TCGA with the final aim of understanding pathological mechanisms and aiding cancer treatments.


Asunto(s)
Neoplasias/genética , Interfaz Usuario-Computador , Variaciones en el Número de Copia de ADN , Metilación de ADN , Bases de Datos Genéticas , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Internet , MicroARNs/química , MicroARNs/metabolismo , Neoplasias/patología , Análisis de Secuencia de ADN
17.
Methods ; 111: 3-11, 2016 12 01.
Artículo en Inglés | MEDLINE | ID: mdl-27637471

RESUMEN

While a huge amount of (epi)genomic data of multiple types is becoming available by using Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats which have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM supports as well all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and makes the ground for providing data interoperability, e.g., achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery.


Asunto(s)
Minería de Datos/métodos , Genómica/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Variaciones en el Número de Copia de ADN/genética , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Humanos , Secuencias Reguladoras de Ácidos Nucleicos/genética , Alineación de Secuencia/métodos , Sitio de Iniciación de la Transcripción
18.
Bioinformatics ; 31(12): 1881-8, 2015 Jun 15.
Artículo en Inglés | MEDLINE | ID: mdl-25649616

RESUMEN

MOTIVATION: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art 'big data' computing strategies, with abstraction levels beyond available tool capabilities. RESULTS: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic 'big data' analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. Based on Hadoop framework and Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. AVAILABILITY AND IMPLEMENTATION: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.


Asunto(s)
Indización y Redacción de Resúmenes , Biología Computacional/métodos , Bases de Datos Factuales , Genómica/métodos , Ensayos Analíticos de Alto Rendimiento/métodos , Programas Informáticos , Inmunoprecipitación de Cromatina , Epigenómica , Histonas/metabolismo , Humanos , Análisis de Secuencia de ADN/métodos , Factores de Transcripción/metabolismo
19.
BMC Bioinformatics ; 15 Suppl 1: S3, 2014.
Artículo en Inglés | MEDLINE | ID: mdl-24564278

RESUMEN

BACKGROUND: The huge amount of biomedical-molecular data increasingly produced is providing scientists with potentially valuable information. Yet, such data quantity makes difficult to find and extract those data that are most reliable and most related to the biomedical questions to be answered, which are increasingly complex and often involve many different biomedical-molecular aspects. Such questions can be addressed only by comprehensively searching and exploring different types of data, which frequently are ordered and provided by different data sources. Search Computing has been proposed for the management and integration of ranked results from heterogeneous search services. Here, we present its novel application to the explorative search of distributed biomedical-molecular data and the integration of the search results to answer complex biomedical questions. RESULTS: A set of available bioinformatics search services has been modelled and registered in the Search Computing framework, and a Bioinformatics Search Computing application (Bio-SeCo) using such services has been created and made publicly available at http://www.bioinformatics.deib.polimi.it/bio-seco/seco/. It offers an integrated environment which eases search, exploration and ranking-aware combination of heterogeneous data provided by the available registered services, and supplies global results that can support answering complex multi-topic biomedical questions. CONCLUSIONS: By using Bio-SeCo, scientists can explore the very large and very heterogeneous biomedical-molecular data available. They can easily make different explorative search attempts, inspect obtained results, select the most appropriate, expand or refine them and move forward and backward in the construction of a global complex biomedical query on multiple distributed sources that could eventually find the most relevant results. Thus, it provides an extremely useful automated support for exploratory integrated bio search, which is fundamental for Life Science data driven knowledge discovery.


Asunto(s)
Tecnología Biomédica/métodos , Algoritmos , Biología Computacional , Genómica , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia de ADN
20.
BMC Bioinformatics ; 15 Suppl 1: S2, 2014.
Artículo en Inglés | MEDLINE | ID: mdl-24564249

RESUMEN

Many efforts exist to design and implement approaches and tools for data capture, integration and analysis in the life sciences. Challenges are not only the heterogeneity, size and distribution of information sources, but also the danger of producing too many solutions for the same problem. Methodological, technological, infrastructural and social aspects appear to be essential for the development of a new generation of best practices and tools. In this paper, we analyse and discuss these aspects from different perspectives, by extending some of the ideas that arose during the NETTAB 2012 Workshop, making reference especially to the European context. First, relevance of using data and software models for the management and analysis of biological data is stressed. Second, some of the most relevant community achievements of the recent years, which should be taken as a starting point for future efforts in this research domain, are presented. Third, some of the main outstanding issues, challenges and trends are analysed. The challenges related to the tendency to fund and create large scale international research infrastructures and public-private partnerships in order to address the complex challenges of data intensive science are especially discussed. The needs and opportunities of Genomic Computing (the integration, search and display of genomic information at a very specific level, e.g. at the level of a single DNA region) are then considered. In the current data and network-driven era, social aspects can become crucial bottlenecks. How these may best be tackled to unleash the technical abilities for effective data integration and validation efforts is then discussed. Especially the apparent lack of incentives for already overwhelmed researchers appears to be a limitation for sharing information and knowledge with other scientists. We point out as well how the bioinformatics market is growing at an unprecedented speed due to the impact that new powerful in silico analysis promises to have on better diagnosis, prognosis, drug discovery and treatment, towards personalized medicine. An open business model for bioinformatics, which appears to be able to reduce undue duplication of efforts and support the increased reuse of valuable data sets, tools and platforms, is finally discussed.


Asunto(s)
Biología Computacional/métodos , Programas Informáticos , Algoritmos , Animales , Biología Computacional/tendencias , Conducta Cooperativa , Genoma , Genómica , Humanos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA