Results 1 - 20 of 23
1.
J Intensive Care Med ; 39(5): 420-428, 2024 May.
Article in English | MEDLINE | ID: mdl-37926984

ABSTRACT

Purpose: This study aimed to investigate the effects of inspired oxygen fraction (FiO2) and positive end-expiratory pressure (PEEP) on gas exchange in mechanically ventilated patients with COVID-19. Methods: Two FiO2 (100%, 40%) were tested at 3 decreasing levels of PEEP (15, 10, and 5 cmH2O). At each FiO2 and PEEP, gas exchange, respiratory mechanics, hemodynamics, and the distribution of ventilation and perfusion were assessed with electrical impedance tomography. The impact of FiO2 on the intrapulmonary shunt (delta shunt) was analyzed as the difference between the calculated shunt at FiO2 100% (shunt) and venous admixture at FiO2 40% (venous admixture). Results: Fourteen patients were studied. Decreasing PEEP from 15 to 10 cmH2O did not change shunt (24 [21-28] vs 27 [24-29]%) or venous admixture (18 [15-26] vs 23 [18-34]%) while partial pressure of arterial oxygen (FiO2 100%) was higher at PEEP 15 (262 [198-338] vs 256 [147-315] mmHg; P < .05). Instead when PEEP was decreased from 10 to 5 cmH2O, shunt increased to 36 [30-39]% (P < .05) and venous admixture increased to 33 [30-43]% (P < .05) and partial pressure of arterial oxygen (100%) decreased to 109 [76-177] mmHg (P < .05). At PEEP 15, administration of 100% FiO2 resulted in a shunt greater than venous admixture at 40% FiO2, ((24 [21-28] vs 18 [15-26]%, P = .005), delta shunt 5.5% (2.3-8.8)). Compared to PEEP 10, PEEP of 5 and 15 cmH2O resulted in decreased global and pixel-level compliance. Cardiac output at FiO2 100% resulted higher at PEEP 5 (5.4 [4.4-6.5]) compared to PEEP 10 (4.8 [4.1-5.5], P < .05) and PEEP 15 cmH2O (4.7 [4.5-5.4], P < .05). Conclusion: In this study, PEEP of 15 cmH2O, despite resulting in the highest oxygenation, was associated with overdistension. PEEP of 5 cmH2O was associated with increased shunt and alveolar collapse. Administration of 100% FiO2 was associated with an increase in intrapulmonary shunt in the setting of high PEEP. Trial registration: NCT05132933.


Subject(s)
COVID-19; Lung Diseases; Respiratory Distress Syndrome; Humans; Respiration, Artificial; Respiratory Distress Syndrome/therapy; COVID-19/complications; COVID-19/therapy; Lung/diagnostic imaging; Positive-Pressure Respiration/methods; Respiratory Mechanics; Oxygen
2.
Database (Oxford) ; 2023, 2023 07 06.
Article in English | MEDLINE | ID: mdl-37410916

ABSTRACT

With the progression of the COVID-19 pandemic, large datasets of SARS-CoV-2 genome sequences were collected to closely monitor the evolution of the virus and identify the novel variants/strains. By analyzing genome sequencing data, health authorities can 'hunt' novel emerging variants of SARS-CoV-2 as early as possible, and then monitor their evolution and spread. We designed VariantHunter, a highly flexible and user-friendly tool for systematically monitoring the evolution of SARS-CoV-2 at global and regional levels. In VariantHunter, amino acid changes are analyzed over an interval of 4 weeks in an arbitrary geographical area (continent, country, or region); for every week in the interval, the prevalence is computed and changes are ranked based on their increase or decrease in prevalence. VariantHunter supports two main types of analysis: lineage-independent and lineage-specific. The former considers all the available data and aims to discover new viral variants. The latter evaluates specific lineages/viral variants to identify novel candidate designations (sub-lineages and sub-variants). Both analyses use simple statistics and visual representations (diffusion charts and heatmaps) to track viral evolution. A dataset explorer allows users to visualize available data and refine their selection. VariantHunter is a web application free to all users. The two types of supported analysis (lineage-independent and lineage-specific) allow user-friendly monitoring of the viral evolution, empowering genomic surveillance without requiring any computational background. Database URL http://gmql.eu/variant_hunter/.
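The weekly prevalence ranking described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not VariantHunter's implementation: the function names, the input layout (one `(week_index, set_of_changes)` pair per sequence), and the choice of "last-week prevalence minus first-week prevalence" as the ranking score are all assumptions made for the example; the abstract only states that changes are ranked by their increase or decrease in prevalence over a 4-week interval.

```python
from collections import defaultdict


def weekly_prevalence(records, n_weeks=4):
    """Compute, for each amino acid change, its prevalence in every week.

    records: iterable of (week_index, set_of_changes), one entry per sequence
             (hypothetical input layout, chosen for this sketch).
    Returns a dict mapping change -> list of n_weeks fractions in [0, 1].
    """
    totals = [0] * n_weeks                      # sequences collected per week
    counts = defaultdict(lambda: [0] * n_weeks)  # sequences carrying each change
    for week, changes in records:
        totals[week] += 1
        for change in changes:
            counts[change][week] += 1
    return {
        change: [c / t if t else 0.0 for c, t in zip(per_week, totals)]
        for change, per_week in counts.items()
    }


def rank_by_trend(prevalence):
    """Order changes from largest increase to largest decrease.

    The score (last week minus first week) is an assumed, simplified metric.
    """
    return sorted(
        prevalence,
        key=lambda change: prevalence[change][-1] - prevalence[change][0],
        reverse=True,
    )


# Illustrative data: two sequences per week, labels are placeholders.
records = [
    (0, {"S:N501Y", "S:A222V"}), (0, {"S:A222V"}),
    (1, {"S:N501Y"}), (1, {"S:A222V"}),
    (2, {"S:N501Y"}), (2, {"S:N501Y", "S:A222V"}),
    (3, {"S:N501Y"}), (3, {"S:N501Y"}),
]
prevalence = weekly_prevalence(records)
ranking = rank_by_trend(prevalence)
```

In this toy input the first label rises across the four weeks and the second falls, so the rising change is ranked first. A production tool would also restrict the records to the selected geographical area and, for lineage-specific analysis, to one lineage before computing prevalence.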


Subject(s)
COVID-19; SARS-CoV-2; Humans; SARS-CoV-2/genetics; COVID-19/epidemiology; Pandemics; Chromosome Mapping
4.
Genome Biol ; 24(1): 79, 2023 04 18.
Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.


Subject(s)
Algorithms; Epigenomics; Genomics/methods
5.
BMC Genom Data ; 24(1): 13, 2023 03 03.
Article in English | MEDLINE | ID: mdl-36869294

ABSTRACT

BACKGROUND: Genome Wide Association Studies (GWAS) are based on the observation of genome-wide sets of genetic variants - typically single-nucleotide polymorphisms (SNPs) - in different individuals that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than to making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions. RESULTS: To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets that includes several heterogeneous data types in the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multi-sample processing queries that respond to important biological questions. These are then made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, epigenetic signals. CONCLUSIONS: As a result of our work on GWAS datasets, we enable 1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository; 2) their big data processing by means of the GenoMetric Query Language and associated system. Future large-scale tertiary data analysis may extensively benefit from the addition of GWAS results to inform several different downstream analysis workflows.


Subject(s)
Genome-Wide Association Study; Genomics; Humans; Epigenomics; Big Data; Data Analysis
6.
JAMA Netw Open ; 5(10): e2238871, 2022 10 03.
Article in English | MEDLINE | ID: mdl-36301541

ABSTRACT

Importance: Data on the association of COVID-19 vaccination with intensive care unit (ICU) admission and outcomes of patients with SARS-CoV-2-related pneumonia are scarce. Objective: To evaluate whether COVID-19 vaccination is associated with preventing ICU admission for COVID-19 pneumonia and to compare baseline characteristics and outcomes of vaccinated and unvaccinated patients admitted to an ICU. Design, Setting, and Participants: This retrospective cohort study on regional data sets reports: (1) daily number of administered vaccines and (2) data of all consecutive patients admitted to an ICU in Lombardy, Italy, from August 1 to December 15, 2021 (Delta variant predominant). Vaccinated patients received either mRNA vaccines (BNT162b2 or mRNA-1273) or adenoviral vector vaccines (ChAdOx1-S or Ad26.COV2). Incident rate ratios (IRRs) were computed from August 1, 2021, to January 31, 2022; ICU and baseline characteristics and outcomes of vaccinated and unvaccinated patients admitted to an ICU were analyzed from August 1 to December 15, 2021. Exposures: COVID-19 vaccination status (no vaccination, mRNA vaccine, adenoviral vector vaccine). Main Outcomes and Measures: The incidence IRR of ICU admission was evaluated, comparing vaccinated people with unvaccinated, adjusted for age and sex. The baseline characteristics at ICU admission of vaccinated and unvaccinated patients were investigated. The association between vaccination status at ICU admission and mortality at ICU and hospital discharge were also studied, adjusting for possible confounders. Results: Among the 10 107 674 inhabitants of Lombardy, Italy, at the time of this study, the median [IQR] age was 48 [28-64] years and 5 154 914 (51.0%) were female. 
Of the 7 863 417 individuals who were vaccinated (median [IQR] age: 53 [33-68] years; 4 010 343 [51.4%] female), 6 251 417 (79.5%) received an mRNA vaccine, 550 439 (7.0%) received an adenoviral vector vaccine, and 1 061 561 (13.5%) received a mix of vaccines and 4 497 875 (57.2%) were boosted. Compared with unvaccinated people, IRR of individuals who received an mRNA vaccine within 120 days from the last dose was 0.03 (95% CI, 0.03-0.04; P < .001), whereas IRR of individuals who received an adenoviral vector vaccine after 120 days was 0.21 (95% CI, 0.19-0.24; P < .001). There were 553 patients admitted to an ICU for COVID-19 pneumonia during the study period: 139 patients (25.1%) were vaccinated and 414 (74.9%) were unvaccinated. Compared with unvaccinated patients, vaccinated patients were older (median [IQR]: 72 [66-76] vs 60 [51-69] years; P < .001), primarily male individuals (110 patients [79.1%] vs 252 patients [60.9%]; P < .001), with more comorbidities (median [IQR]: 2 [1-3] vs 0 [0-1] comorbidities; P < .001) and had higher ratio of arterial partial pressure of oxygen (Pao2) and fraction of inspiratory oxygen (FiO2) at ICU admission (median [IQR]: 138 [100-180] vs 120 [90-158] mm Hg; P = .007). Factors associated with ICU and hospital mortality were higher age, premorbid heart disease, lower Pao2/FiO2 at ICU admission, and female sex (this factor only for ICU mortality). ICU and hospital mortality were similar between vaccinated and unvaccinated patients. Conclusions and Relevance: In this cohort study, mRNA and adenoviral vector vaccines were associated with significantly lower risk of ICU admission for COVID-19 pneumonia. ICU and hospital mortality were not associated with vaccinated status. These findings suggest a substantial reduction of the risk of developing COVID-19-related severe acute respiratory failure requiring ICU admission among vaccinated people.
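For readers unfamiliar with the incidence rate ratio (IRR) reported above, the crude (unadjusted) quantity can be sketched as below. This is only an illustration of the definition: the study adjusted its IRRs for age and sex, which this sketch does not do, and the function name, parameter names, and numbers in the usage line are hypothetical and are not taken from the paper.

```python
def incidence_rate_ratio(events_exposed, person_time_exposed,
                         events_unexposed, person_time_unexposed):
    """Crude incidence rate ratio: rate in the exposed group divided by
    the rate in the unexposed group.

    Each rate is events per unit of person-time (for example, ICU admissions
    per person-day of follow-up). No adjustment for confounders is applied.
    """
    rate_exposed = events_exposed / person_time_exposed
    rate_unexposed = events_unexposed / person_time_unexposed
    return rate_exposed / rate_unexposed


# Hypothetical counts, for illustration only (not data from the study):
# 3 admissions over 1000 person-days vs 10 admissions over 100 person-days.
example_irr = incidence_rate_ratio(3, 1000, 10, 100)
```

An IRR below 1 means the exposed (here, vaccinated) group had a lower admission rate than the unexposed group over the same kind of follow-up time.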


Subject(s)
COVID-19; Pneumonia; Humans; Male; Female; Middle Aged; Adult; COVID-19/epidemiology; COVID-19/prevention & control; SARS-CoV-2; Critical Illness/therapy; COVID-19 Vaccines; Retrospective Studies; Cohort Studies; BNT162 Vaccine; Intensive Care Units; Pneumonia/epidemiology; Oxygen; mRNA Vaccines
7.
BMC Bioinformatics ; 23(1): 401, 2022 Sep 29.
Article in English | MEDLINE | ID: mdl-36175857

ABSTRACT

BACKGROUND: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics. RESULTS: Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. Provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities. CONCLUSIONS: The proposed data integration pipeline and data set extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data, and allow biologists and bioinformaticians to easily perform scalable analysis on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current tendency to large (cross)nation-wide sequencing and variation initiatives, we expect an ever growing need for the kind of computational support hereby proposed.


Subject(s)
Genomics; Metadata; Computational Biology; Genotype; Humans; Software
8.
BioTech (Basel) ; 11(1), 2022 Mar 21.
Article in English | MEDLINE | ID: mdl-35822815

ABSTRACT

With the spread of COVID-19, sequencing laboratories started to share hundreds of sequences daily. However, the lack of a commonly agreed standard across deposition databases hindered the exploration and study of all the viral sequences collected worldwide in a practical and homogeneous way. During the first months of the pandemic, we developed an automatic procedure to collect, transform, and integrate viral sequences of SARS-CoV-2, MERS, SARS-CoV, Ebola, and Dengue from four major database institutions (NCBI, COG-UK, GISAID, and NMDC). This data pipeline allowed the creation of the data exploration interfaces VirusViz and EpiSurf, as well as ViruSurf, one of the largest databases of integrated viral sequences. Almost two years after the first release of the repository, the original pipeline underwent a thorough refinement process and became more efficient, scalable, and general (currently, it also includes epitopes from the IEDB). Thanks to these improvements, we constantly update and expand our integrated repository, encompassing about 9.1 million SARS-CoV-2 sequences at present (March 2022). This pipeline made it possible to design and develop fundamental resources for any researcher interested in understanding the biological mechanisms behind the viral infection. In addition, it plays a crucial role in many analytic and visualization tools, such as ViruSurf, EpiSurf, VirusViz, and VirusLab.

9.
Bioinformatics ; 38(7): 1988-1994, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35040923

ABSTRACT

MOTIVATION: The ongoing evolution of SARS-CoV-2 and the rapid emergence of variants of concern at distinct geographic locations have relevant implications for the implementation of strategies for controlling the COVID-19 pandemic. Combining the growing body of data and the evidence on potential functional implications of SARS-CoV-2 mutations can suggest highly effective methods for the prioritization of novel variants of potential concern, e.g. increasing in frequency locally and/or globally. However, these analyses may be complex, requiring the integration of different data and resources. We claim the need for a streamlined access to up-to-date and high-quality genome sequencing data from different geographic regions/countries, and the current lack of a robust and consistent framework for the evaluation/comparison of the results. RESULTS: To overcome these limitations, we developed ViruClust, a novel tool for the comparison of SARS-CoV-2 genomic sequences and lineages in space and time. ViruClust is made available through a powerful and intuitive web-based user interface. Sophisticated large-scale analyses can be executed with a few clicks, even by users without any computational background. To demonstrate potential applications of our method, we applied ViruClust to conduct a thorough study of the evolution of the most prevalent lineage of the Delta SARS-CoV-2 variant, and derived relevant observations. By allowing the seamless integration of different types of functional annotations and the direct comparison of viral genomes and genetic variants in space and time, ViruClust represents a highly valuable resource for monitoring the evolution of SARS-CoV-2, facilitating the identification of variants and/or mutations of potential concern. AVAILABILITY AND IMPLEMENTATION: ViruClust is openly available at http://gmql.eu/viruclust/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
COVID-19; SARS-CoV-2; Humans; SARS-CoV-2/genetics; Pandemics; Chromosome Mapping
10.
Article in English | MEDLINE | ID: mdl-33270566

ABSTRACT

Breast Cancer comprises multiple subtypes implicated in prognosis. Existing stratification methods rely on the expression quantification of small gene sets. Next Generation Sequencing promises large amounts of omic data in the next years. In this scenario, we explore the potential of machine learning and, particularly, deep learning for breast cancer subtyping. Due to the paucity of publicly available data, we leverage on pan-cancer and non-cancer data to design semi-supervised settings. We make use of multi-omic data, including microRNA expressions and copy number alterations, and we provide an in-depth investigation of several supervised and semi-supervised architectures. Obtained accuracy results show simpler models to perform at least as well as the deep semi-supervised approaches on our task over gene expression data. When multi-omic data types are combined together, performance of deep models shows little (if any) improvement in accuracy, indicating the need for further analysis on larger datasets of multi-omic data as and when they become available. From a biological perspective, our linear model mostly confirms known gene-subtype annotations. Conversely, deep approaches model non-linear relationships, which is reflected in a more varied and still unexplored set of representative omic features that may prove useful for breast cancer subtyping.


Subject(s)
Breast Neoplasms; Deep Learning; Breast Neoplasms/genetics; DNA Copy Number Variations; Female; Humans; Machine Learning; Supervised Machine Learning
11.
Article in English | MEDLINE | ID: mdl-32750853

ABSTRACT

The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well-recognized in the bioinformatics community because, in common practice, repositories are accessed one-by-one, learning their specific metadata definitions as a result of long and tedious efforts, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and propose the resulting repository (already integrating several important sources), which is exposed by means of practical user interfaces to respond to biological researchers' needs.


Subject(s)
Genomics; Metadata; Computational Biology; Information Storage and Retrieval
12.
Database (Oxford) ; 2021, 2021 09 29.
Article in English | MEDLINE | ID: mdl-34585726

ABSTRACT

EpiSurf is a Web application for selecting viral populations of interest and then analyzing how their amino acid changes are distributed along epitopes. Viral sequences are searched within ViruSurf, which stores curated metadata and amino acid changes imported from the most widely used deposition sources for viral databases (GenBank, COVID-19 Genomics UK (COG-UK) and Global initiative on sharing all influenza data (GISAID)). Epitopes are searched within the open source Immune Epitope Database or directly proposed by users by indicating their start and stop positions in the context of a given viral protein. Amino acid changes of selected populations are joined with epitopes of interest; a result table summarizes, for each epitope, statistics about the overlapping amino acid changes and about the sequences carrying such alterations. The results may also be inspected by the VirusViz Web application; epitope regions are highlighted within the given viral protein, and changes can be comparatively inspected. For sequences mutated within the epitope, we also offer a complete view of the distribution of amino acid changes, optionally grouped by the location, collection date or lineage. Thanks to these functionalities, EpiSurf supports the user-friendly testing of epitope conservancy within selected populations of interest, which can be of utmost relevance for designing vaccines, drugs or serological assays. EpiSurf is available at two endpoints. Database URL: http://gmql.eu/episurf/ (for searching GenBank and COG-UK sequences) and http://gmql.eu/episurf_gisaid/ (for GISAID sequences).
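The join between amino acid changes and epitopes described in the abstract above can be sketched as a positional overlap test followed by per-epitope aggregation. This is a minimal sketch under stated assumptions, not EpiSurf's code: the `Epitope` and `AminoAcidChange` classes, their field names, the 1-based inclusive coordinates, and the two summary statistics are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epitope:
    name: str
    protein: str
    start: int  # 1-based, inclusive (assumed convention)
    stop: int   # inclusive


@dataclass(frozen=True)
class AminoAcidChange:
    sequence_id: str
    protein: str
    position: int
    ref: str
    alt: str


def summarize_epitopes(epitopes, changes):
    """For each epitope, count the distinct amino acid changes that fall
    inside it and the distinct sequences carrying at least one of them."""
    summary = {}
    for ep in epitopes:
        hits = [
            c for c in changes
            if c.protein == ep.protein and ep.start <= c.position <= ep.stop
        ]
        summary[ep.name] = {
            "distinct_changes": len({(c.position, c.ref, c.alt) for c in hits}),
            "mutated_sequences": len({c.sequence_id for c in hits}),
        }
    return summary


# Illustrative input; identifiers and coordinates are placeholders.
epitopes = [
    Epitope("ep1", "Spike", 480, 505),
    Epitope("ep2", "Spike", 600, 620),
]
changes = [
    AminoAcidChange("seq1", "Spike", 501, "N", "Y"),
    AminoAcidChange("seq2", "Spike", 501, "N", "Y"),
    AminoAcidChange("seq2", "Spike", 484, "E", "K"),
    AminoAcidChange("seq3", "Spike", 614, "D", "G"),
    AminoAcidChange("seq3", "N", 203, "R", "K"),
]
result = summarize_epitopes(epitopes, changes)
```

The nested loop is quadratic in the worst case, which is acceptable for a sketch; over millions of sequences an interval index or a database join on protein and position range would be the more likely choice.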


Subject(s)
Amino Acid Substitution; Antigens, Viral/chemistry; Epitopes/chemistry; Internet; Metadata; SARS-CoV-2/chemistry; Search Engine; Software; Amino Acids/chemistry; Amino Acids/immunology; Antigens, Viral/immunology; COVID-19/virology; Epitopes/immunology; Humans; SARS-CoV-2/immunology
13.
Nucleic Acids Res ; 49(15): e90, 2021 09 07.
Article in English | MEDLINE | ID: mdl-34107016

ABSTRACT

Variant visualization plays an important role in supporting the viral evolution analysis, extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide. Moreover, VirusViz accepts sequences of new viral populations as multi-FASTA files plus corresponding metadata in CSV format; a bioinformatic pipeline builds a suitable input for VirusViz by extracting the nucleotide and amino acid variants. Pages of VirusViz provide metadata summarization, variant descriptions, and variant visualization with rich options for zooming, highlighting variants or regions of interest, and switching from nucleotides to amino acids; sequences can be grouped, groups can be comparatively analyzed. For SARS-CoV-2, we manually collect mutations with known or predicted levels of severity/virulence, as indicated in linked research articles; such critical mutations are reported when observed in sequences. The system includes light-weight project management for downloading, resuming, and merging data analysis sessions. VirusViz is freely available at http://gmql.eu/virusviz/.


Subject(s)
COVID-19/virology; Data Visualization; SARS-CoV-2/chemistry; SARS-CoV-2/genetics; Amino Acid Sequence; Base Sequence; Databases, Factual; Humans; Knowledge Bases; SARS-CoV-2/classification; South Africa/epidemiology; United States/epidemiology
14.
Brief Bioinform ; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-34020536

ABSTRACT

MOTIVATION: With the spreading of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations are facing the need of sharing NGS data resources and easily accessing and processing comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.


Subject(s)
Datasets as Topic; Genomics; Information Dissemination; High-Throughput Nucleotide Sequencing/methods; Humans; Programming Languages
15.
Brief Bioinform ; 22(2): 664-675, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33348368

ABSTRACT

With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of COVID-19 pandemics, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for contrasting possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets will be available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.


Subject(s)
COVID-19/therapy; COVID-19/epidemiology; COVID-19/virology; Genes, Viral; Humans; Information Storage and Retrieval; Pandemics; SARS-CoV-2/genetics; SARS-CoV-2/isolation & purification
16.
Brief Bioinform ; 22(1): 30-44, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32496509

ABSTRACT

Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players and analyse the issues in solving the genomic data integration challenges, as well as evaluate the computational environments that they provide to follow up data integration by means of visualization and analysis tools.


Subject(s)
Data Management/methods; Genome, Human; Genomics/methods; Humans; Metadata
17.
Nucleic Acids Res ; 49(D1): D817-D824, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33045721

ABSTRACT

ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; but ViruSurf contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described from their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes the sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.


Subject(s)
COVID-19/prevention & control; Computational Biology/methods; Data Curation/methods; Databases, Genetic; Genome, Viral/genetics; SARS-CoV-2/genetics; COVID-19/epidemiology; COVID-19/virology; Genetic Variation; Humans; Information Storage and Retrieval/methods; Internet; Pandemics; SARS-CoV-2/physiology; User-Computer Interface
18.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-31820804

ABSTRACT

Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes with very limited capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suited available ontologies. The user of GenoSurf provides as input the search terms, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing resulting files are updated in real time while the search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata of several major valuable data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified with their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.


Subject(s)
Databases, Genetic , Metadata , Semantics , Software , Female , Humans , Knowledge Bases , User-Computer Interface
19.
BMC Bioinformatics ; 20(1): 560, 2019 Nov 08.
Article in English | MEDLINE | ID: mdl-31703553

ABSTRACT

BACKGROUND: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in building efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. RESULTS: We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL also provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. CONCLUSIONS: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
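
The idea of operations that "implicitly apply to thousands of files" while keeping regions and metadata together can be sketched in plain Python. The sketch below does not use PyGMQL and does not reproduce its API: the `Region` and `Sample` classes, the functions `select_meta` and `select_regions`, and the example values are hypothetical stand-ins chosen only to show a dataset-level metadata selection followed by a region filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int
    stop: int
    score: float


@dataclass
class Sample:
    meta: dict      # attribute -> value pairs describing the file
    regions: list   # Region objects contained in the file


def select_meta(dataset, **conditions):
    """Keep the samples whose metadata match every key=value condition."""
    return [s for s in dataset
            if all(s.meta.get(k) == v for k, v in conditions.items())]


def select_regions(dataset, predicate):
    """Apply one region predicate to every sample, keeping metadata attached."""
    return [Sample(s.meta, [r for r in s.regions if predicate(r)])
            for s in dataset]


# Hypothetical three-file dataset; real datasets hold thousands of samples.
dataset = [
    Sample({"assay": "ChIP-seq", "cell": "K562"},
           [Region("chr1", 100, 200, 0.9), Region("chr1", 500, 650, 0.2)]),
    Sample({"assay": "ChIP-seq", "cell": "HepG2"},
           [Region("chr2", 10, 80, 0.7)]),
    Sample({"assay": "RNA-seq", "cell": "K562"},
           [Region("chr1", 150, 300, 0.95)]),
]

chip = select_meta(dataset, assay="ChIP-seq")
strong = select_regions(chip, lambda r: r.score >= 0.5)
region_counts = [len(s.regions) for s in strong]
print(region_counts)
```

Each call is written once against the whole dataset rather than per file, which is the property the abstract emphasises; a distributed engine would execute the same logical steps across a cluster.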


Subject(s)
Data Analysis , Databases, Genetic , Genomics , Software , Enhancer Elements, Genetic/genetics , Genome , Genome-Wide Association Study , Humans , Reproducibility of Results , Transcription Factors/metabolism
20.
Bioinformatics ; 35(5): 729-736, 2019 Mar 01.
Article in English | MEDLINE | ID: mdl-30101316

ABSTRACT

MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. RESULTS: The new system has a modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Epigenomics , Genome , Genomics