ABSTRACT
BACKGROUND: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data, ENCODE and Cistrome, process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks but can offer valuable insight into the quality of the data. RESULTS: We provide evidence that data points with high signalValue(s) (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to said high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome. CONCLUSIONS: The signalValue feature is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation.
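The signalValue filter described in this abstract can be illustrated with a minimal sketch. It assumes the standard 10-column BED narrowPeak layout, in which signalValue is the seventh column, and it assumes the "top 25%" threshold is taken per file at the 75th percentile; the abstract does not say how the authors compute the cutoff. The function names `read_narrowpeak` and `top_quartile` are hypothetical.

```python
import io
from statistics import quantiles


def read_narrowpeak(handle):
    """Parse BED narrowPeak lines into dicts.

    Assumes the standard 10-column layout, where the 7th column
    (index 6) holds signalValue.
    """
    peaks = []
    for line in handle:
        line = line.strip()
        # Skip blank lines and optional track/browser/comment headers.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        peaks.append({
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "signal": float(fields[6]),
        })
    return peaks


def top_quartile(peaks):
    """Keep peaks whose signalValue is at or above the 75th percentile.

    The per-file percentile is an assumption of this sketch.
    """
    signals = [p["signal"] for p in peaks]
    q3 = quantiles(signals, n=4, method="inclusive")[2]
    return [p for p in peaks if p["signal"] >= q3]


if __name__ == "__main__":
    # Toy input with made-up coordinates and signal values.
    toy = "\n".join(
        f"chr1\t{i * 100}\t{i * 100 + 50}\tpeak{i}\t0\t.\t{float(i)}\t-1\t-1\t25"
        for i in range(1, 9)
    )
    kept = top_quartile(read_narrowpeak(io.StringIO(toy)))
    print([p["signal"] for p in kept])
```

Applying the same filter to the ENCODE and the Cistrome file for one sample before intersecting them is one way to restrict a comparison to high-signal peaks.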
Subjects
Transcription Factors , Humans , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Protein Binding , Computational Biology/methods , Machine Learning , Databases, Genetic , Algorithms , Genomics/methods
ABSTRACT
The vast corpus of heterogeneous biomedical data stored in databases, ontologies, and terminologies presents a unique opportunity for drug design. Integrating and fusing these sources is essential to develop data representations that can be analyzed using artificial intelligence methods to generate novel drug candidates or hypotheses. Here, we propose Non-Negative Matrix Tri-Factorization as an invaluable tool for integrating and fusing data, as well as for representation learning. Additionally, we demonstrate how representations learned by Non-Negative Matrix Tri-Factorization can effectively be utilized by traditional artificial intelligence methods. While this approach is domain-agnostic and applicable to any field with vast amounts of structured and semi-structured data, we apply it specifically to computational pharmacology and drug repurposing. This field is poised to benefit significantly from artificial intelligence, particularly in personalized medicine. We conducted extensive experiments to evaluate the performance of the proposed method, yielding exciting results, particularly compared to traditional methods. Novel drug-target predictions have also been validated in the literature, further confirming their validity. Additionally, we tested our method to predict drug synergism, where constructing a classical matrix dataset is challenging. The method demonstrated great flexibility, suggesting its applicability to a wide range of tasks in drug design and discovery.
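For reference, one common formulation of Non-Negative Matrix Tri-Factorization is written out below. The abstract does not state which variant or optimization scheme the authors use, so the symbols ($X$, $F$, $S$, $G$, ranks $k_1$, $k_2$) and the multiplicative update rules are assumptions taken from the standard unconstrained formulation. In a drug-target setting, $X$ could be a drug-by-target association matrix, with $F$ and $G$ grouping drugs and targets and $S$ linking the two groupings.

```latex
% Objective: approximate a non-negative matrix X by three non-negative factors.
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \lVert X - F S G^{\top} \rVert_F^2,
\qquad
X \in \mathbb{R}_{\ge 0}^{n \times m},\;
F \in \mathbb{R}_{\ge 0}^{n \times k_1},\;
S \in \mathbb{R}_{\ge 0}^{k_1 \times k_2},\;
G \in \mathbb{R}_{\ge 0}^{m \times k_2}

% Multiplicative updates (\odot and the fraction bar are element-wise).
F \leftarrow F \odot \frac{X G S^{\top}}{F S G^{\top} G S^{\top}},
\qquad
G \leftarrow G \odot \frac{X^{\top} F S}{G S^{\top} F^{\top} F S},
\qquad
S \leftarrow S \odot \frac{F^{\top} X G}{F^{\top} F S G^{\top} G}
```

Under this formulation, candidate associations are typically read from large entries of the reconstruction $F S G^{\top}$ at positions where $X$ is zero.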
Subjects
Drug Repositioning , Drug Repositioning/methods , Humans , Artificial Intelligence , Computational Biology/methods , Machine Learning , Algorithms , Drug Discovery/methods , Multiomics
ABSTRACT
MOTIVATION: With the spread of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations face the need to share NGS data resources and to easily access and process comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that open the centralized GMQL system to federated computing. RESULTS: We present a well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.
Subjects
Datasets as Topic , Genomics , Information Dissemination , High-Throughput Nucleotide Sequencing/methods , Humans , Programming Languages
ABSTRACT
With the outbreak of the COVID-19 disease, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.
Subjects
COVID-19/therapy , COVID-19/epidemiology , COVID-19/virology , Genes, Viral , Humans , Information Storage and Retrieval , Pandemics , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification
ABSTRACT
MOTIVATION: The ongoing evolution of SARS-CoV-2 and the rapid emergence of variants of concern at distinct geographic locations have relevant implications for the implementation of strategies for controlling the COVID-19 pandemic. Combining the growing body of data and the evidence on potential functional implications of SARS-CoV-2 mutations can suggest highly effective methods for the prioritization of novel variants of potential concern, e.g. increasing in frequency locally and/or globally. However, these analyses may be complex, requiring the integration of different data and resources. We claim the need for a streamlined access to up-to-date and high-quality genome sequencing data from different geographic regions/countries, and the current lack of a robust and consistent framework for the evaluation/comparison of the results. RESULTS: To overcome these limitations, we developed ViruClust, a novel tool for the comparison of SARS-CoV-2 genomic sequences and lineages in space and time. ViruClust is made available through a powerful and intuitive web-based user interface. Sophisticated large-scale analyses can be executed with a few clicks, even by users without any computational background. To demonstrate potential applications of our method, we applied ViruClust to conduct a thorough study of the evolution of the most prevalent lineage of the Delta SARS-CoV-2 variant, and derived relevant observations. By allowing the seamless integration of different types of functional annotations and the direct comparison of viral genomes and genetic variants in space and time, ViruClust represents a highly valuable resource for monitoring the evolution of SARS-CoV-2, facilitating the identification of variants and/or mutations of potential concern. AVAILABILITY AND IMPLEMENTATION: ViruClust is openly available at http://gmql.eu/viruclust/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Pandemics , Chromosome Mapping
ABSTRACT
ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; but ViruSurf contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described from their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes the sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.
Subjects
COVID-19/prevention & control , Computational Biology/methods , Data Curation/methods , Databases, Genetic , Genome, Viral/genetics , SARS-CoV-2/genetics , COVID-19/epidemiology , COVID-19/virology , Genetic Variation , Humans , Information Storage and Retrieval/methods , Internet , Pandemics , SARS-CoV-2/physiology , User-Computer Interface
ABSTRACT
Variant visualization plays an important role in supporting the viral evolution analysis, extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide. Moreover, VirusViz accepts sequences of new viral populations as multi-FASTA files plus corresponding metadata in CSV format; a bioinformatic pipeline builds a suitable input for VirusViz by extracting the nucleotide and amino acid variants. Pages of VirusViz provide metadata summarization, variant descriptions, and variant visualization with rich options for zooming, highlighting variants or regions of interest, and switching from nucleotides to amino acids; sequences can be grouped, groups can be comparatively analyzed. For SARS-CoV-2, we manually collect mutations with known or predicted levels of severity/virulence, as indicated in linked research articles; such critical mutations are reported when observed in sequences. The system includes light-weight project management for downloading, resuming, and merging data analysis sessions. VirusViz is freely available at http://gmql.eu/virusviz/.
Subjects
COVID-19/virology , Data Visualization , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , Amino Acid Sequence , Base Sequence , Databases, Factual , Humans , Knowledge Bases , SARS-CoV-2/classification , South Africa/epidemiology , United States/epidemiology
ABSTRACT
BACKGROUND: A pair of genes is defined as synthetically lethal if defects in both cause the death of the cell but a defect in only one of the two is compatible with cell viability. Ideally, if A and B are two synthetic lethal genes, inhibiting B should kill cancer cells with a defect in A, and should have no effect on normal cells. Thus, synthetic lethality can be exploited for highly selective cancer therapies, which need to exploit differences between normal and cancer cells. RESULTS: In this paper, we present a new method for predicting synthetic lethal (SL) gene pairs. As neighbouring genes in the genome have highly correlated profiles of copy number alterations (CNAs), our method clusters proximal genes with a similar CNA profile, then predicts mutually exclusive group pairs, and finally identifies the SL gene pairs within each group pair. For mutual-exclusion testing we use a graph-based method which takes into account the mutation frequencies of different subjects and genes. We use two different methods for selecting the pair of SL genes; the first is based on the gene essentiality measured in various conditions by means of the "Gene Activity Ranking Profile" (GARP) score; the second leverages the annotations of genes to biological pathways. CONCLUSIONS: This method is unique among current SL prediction approaches: it reduces false-positive SL predictions compared to previous methods, and it allows establishing explicit collateral lethality relationships between gene pairs within mutually exclusive group pairs.
Subjects
DNA Copy Number Variations , Genes, Lethal , DNA
ABSTRACT
MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
High-Throughput Nucleotide Sequencing , Software , Epigenomics , Genome , Genomics
ABSTRACT
BACKGROUND: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. RESULTS: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. CONCLUSIONS: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
Subjects
Data Analysis , Databases, Genetic , Genomics , Software , Enhancer Elements, Genetic/genetics , Genome , Genome-Wide Association Study , Humans , Reproducibility of Results , Transcription Factors/metabolism
ABSTRACT
While a huge amount of (epi)genomic data of multiple types is becoming available by using Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats which have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM supports as well all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and makes the ground for providing data interoperability, e.g., achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery.
Subjects
Data Mining/methods , Genomics/methods , Sequence Analysis, DNA/methods , Software , DNA Copy Number Variations/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Regulatory Sequences, Nucleic Acid/genetics , Sequence Alignment/methods , Transcription Initiation Site
ABSTRACT
MOTIVATION: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art 'big data' computing strategies, with abstraction levels beyond available tool capabilities. RESULTS: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic 'big data' analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. Based on Hadoop framework and Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. AVAILABILITY AND IMPLEMENTATION: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.
Subjects
Abstracting and Indexing , Computational Biology/methods , Databases, Factual , Genomics/methods , High-Throughput Screening Assays/methods , Software , Chromatin Immunoprecipitation , Epigenomics , Histones/metabolism , Humans , Sequence Analysis, DNA/methods , Transcription Factors/metabolism
ABSTRACT
BACKGROUND: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, in both economic and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. METHODS: We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis), and proposed a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. RESULTS: We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods were able to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. CONCLUSIONS: Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides the best results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool in supporting scientists in the curation process of gene functional annotations.
Subjects
Algorithms , Computational Biology/methods , Drosophila Proteins/genetics , Drosophila melanogaster/genetics , Gene Ontology , Molecular Sequence Annotation , Animals , Cluster Analysis
ABSTRACT
Introduction: Data-driven medicine is essential for enhancing the accessibility and quality of the healthcare system. The availability of data plays a crucial role in achieving this goal. Methods: We propose implementing a robust data infrastructure of FAIRification and data fusion for clinical, genomic, and imaging data. This will be embedded within the framework of a distributed analytics platform for healthcare data analysis, utilizing the Personal Health Train paradigm. Results: This infrastructure will ensure the findability, accessibility, interoperability, and reusability of data, metadata, and results among multiple medical centers participating in the BETTER Horizon Europe project. The project focuses on studying rare diseases, such as intellectual disability and inherited retinal dystrophies. Conclusion: The anticipated impacts will benefit a wide range of healthcare practitioners and potentially influence health policymakers.
ABSTRACT
INTRODUCTION: Epitopes are specific structures in antigens that are recognized by the immune system. They are widely used in the context of immunology-related applications, such as vaccine development, drug design, and the diagnosis, treatment, and prevention of disease. The SARS-CoV-2 virus has represented the main point of interest within the viral and genomic surveillance community in the last four years. Its ability to mutate and acquire new characteristics while it reorganizes into new variants has been analyzed from many perspectives. The importance of understanding how epitopes are impacted by mutations that accumulate at the protein level should not be underrated. METHODS: With a focus on Omicron-named SARS-CoV-2 lineages, including the last WHO-designated Variants of Interest, we propose a data retrieval, integration, and analysis workflow for conducting a database-wide study on the impact of lineages' characterizing mutations on all T cell and B cell linear epitopes collected in the Immune Epitope Database (IEDB) for SARS-CoV-2. RESULTS: Our workflow allows us to showcase novel qualitative and quantitative results on 1) coverage of viral proteins by deposited epitopes; 2) distribution of epitopes that are mutated across Omicron variants; 3) distribution of Omicron characterizing mutations across epitopes. Results are discussed based on the type of epitope, the response frequency of the assays, and the sample size. Our proposed workflow can be reproduced at any point in time, given updated variant characterizations and epitopes from IEDB, thereby providing an on-demand quantitative landscape of the impact of mutations. CONCLUSION: A big data-driven analysis such as the one provided here can inform the next genomic surveillance policies in combatting SARS-CoV-2 and future epidemic viruses.
Subjects
COVID-19 , Epitopes, B-Lymphocyte , Epitopes, T-Lymphocyte , Mutation , SARS-CoV-2 , SARS-CoV-2/immunology , SARS-CoV-2/genetics , Epitopes, T-Lymphocyte/immunology , Epitopes, T-Lymphocyte/genetics , Humans , Epitopes, B-Lymphocyte/immunology , COVID-19/immunology , COVID-19/virology
ABSTRACT
Statins, widely used cardiovascular drugs that lower cholesterol by inhibiting HMG-CoA reductase, have been increasingly recognized for their potential anticancer properties. This study elucidates the underlying mechanism, revealing that statins exploit Synthetic Lethality, a principle where the co-occurrence of two non-lethal events leads to cell death. Our computational analysis of approximately 37,000 SL pairs identified statins as potential drugs targeting genes involved in SL pairs with metastatic genes. In vitro validation on various cancer cell lines confirmed the anticancer efficacy of statins. This data-driven drug repurposing strategy provides a molecular basis for the anticancer effects of statins, offering translational opportunities in oncology.
Subjects
Antineoplastic Agents , Hydroxymethylglutaryl-CoA Reductase Inhibitors , Humans , Hydroxymethylglutaryl-CoA Reductase Inhibitors/pharmacology , Antineoplastic Agents/pharmacology , Cell Line, Tumor , Neoplasms/drug therapy , Neoplasms/genetics , Neoplasms/pathology , Drug Repositioning/methods
ABSTRACT
With the progression of the COVID-19 pandemic, large datasets of SARS-CoV-2 genome sequences were collected to closely monitor the evolution of the virus and identify the novel variants/strains. By analyzing genome sequencing data, health authorities can 'hunt' novel emerging variants of SARS-CoV-2 as early as possible, and then monitor their evolution and spread. We designed VariantHunter, a highly flexible and user-friendly tool for systematically monitoring the evolution of SARS-CoV-2 at global and regional levels. In VariantHunter, amino acid changes are analyzed over an interval of 4 weeks in an arbitrary geographical area (continent, country, or region); for every week in the interval, the prevalence is computed and changes are ranked based on their increase or decrease in prevalence. VariantHunter supports two main types of analysis: lineage-independent and lineage-specific. The former considers all the available data and aims to discover new viral variants. The latter evaluates specific lineages/viral variants to identify novel candidate designations (sub-lineages and sub-variants). Both analyses use simple statistics and visual representations (diffusion charts and heatmaps) to track viral evolution. A dataset explorer allows users to visualize available data and refine their selection. VariantHunter is a web application free to all users. The two types of supported analysis (lineage-independent and lineage-specific) allow user-friendly monitoring of the viral evolution, empowering genomic surveillance without requiring any computational background. Database URL http://gmql.eu/variant_hunter/.
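The weekly prevalence ranking described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not VariantHunter's implementation: the abstract says changes are ranked by their increase or decrease in prevalence but does not give the statistic, so the sketch uses the difference between the last and the first week. The function name `rank_changes`, the labels `change_A` and `change_B`, and all counts are hypothetical.

```python
def rank_changes(weekly_counts, weekly_totals):
    """Rank amino acid changes by their prevalence trend over an interval.

    weekly_counts: {change: [sequences carrying the change, one per week]}
    weekly_totals: [sequences collected in the area, one per week]

    Returns (change, weekly prevalences, last-minus-first difference)
    tuples, largest increase first. The ranking statistic is an
    assumption of this sketch.
    """
    ranked = []
    for change, counts in weekly_counts.items():
        # Prevalence = share of that week's sequences carrying the change.
        prevalence = [c / t if t else 0.0 for c, t in zip(counts, weekly_totals)]
        ranked.append((change, prevalence, prevalence[-1] - prevalence[0]))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


if __name__ == "__main__":
    # Made-up counts for a 4-week interval in one geographical area.
    totals = [100, 120, 150, 200]
    counts = {"change_A": [5, 12, 30, 80], "change_B": [40, 42, 45, 50]}
    for change, prevalence, delta in rank_changes(counts, totals):
        print(change, [round(p, 3) for p in prevalence], round(delta, 3))
```

Restricting the input counts to sequences of a single lineage would give a lineage-specific view; passing all sequences gives a lineage-independent one.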
Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Pandemics , Chromosome Mapping
ABSTRACT
BACKGROUND: SARS-CoV-2 viremia has been found to be a potential prognostic factor in patients hospitalized for COVID-19. OBJECTIVE: We aimed to assess the association between SARS-CoV-2 viremia and mortality in COVID-19 hospitalized patients during different epidemic periods. METHODS: A prospective COVID-19 registry was queried to extract all COVID-19 patients with an available SARS-CoV-2 viremia measurement performed at hospital admission between March 2020 and January 2022. SARS-CoV-2 viremia was assessed by means of the GeneFinder™ COVID-19 Plus RealAmp Kit assay and the SARS-CoV-2 ELITe MGB® Kit, using a cycle threshold <45 to define positivity. Univariable and multivariable logistic regression models were built to assess the association between positive SARS-CoV-2 viremia and death. RESULTS: Four hundred and forty-five out of 2,822 COVID-19 patients had an available SARS-CoV-2 viremia measurement; they were predominantly male (64.9%), with a median age of 65 years (IQR 55-75). Patients with a positive SARS-CoV-2 viremia (86/445; 19.3%) more frequently presented with severe or critical disease (67.4% vs 57.1%) when compared to those with a negative SARS-CoV-2 viremia. Deceased subjects (88/445; 19.8%) were older [75 (IQR 68-82) vs 63 (IQR 54-72)] and more frequently showed a detectable SARS-CoV-2 viremia at admission (60.2% vs 22.7%) when compared to survivors. In univariable analysis, a positive SARS-CoV-2 viremia was associated with higher odds of death [OR 5.16 (95% CI 3.15-8.45)], which was confirmed in the multivariable analysis adjusted for age, biological sex, and disease severity [AOR 6.48 (95% CI 4.05-10.45)]. The association between positive SARS-CoV-2 viremia and death was consistent in the period 1 February 2021-31 January 2022 [AOR 5.86 (95% CI 3.43-10.16)] and in subgroup analysis according to disease severity: mild/moderate [AOR 6.45 (95% CI 2.84-15.17)] and severe/critical COVID-19 patients [AOR 6.98 (95% CI 3.68-13.66)]. CONCLUSIONS: SARS-CoV-2 viremia was associated with COVID-19 mortality and should be considered in the initial assessment of COVID-19 hospitalized patients.
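The unadjusted odds ratio reported above can be reproduced with a short worked example. The 2x2 counts used here are not stated in the abstract; they are an assumption, back-calculated from the reported percentages (60.2% of 88 deceased is about 53 viremia-positive, 22.7% of 357 survivors is about 81). The Wald (log-scale) confidence interval is also an assumption about how the interval was obtained, and the adjusted estimates cannot be reproduced without patient-level data.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald 95% confidence interval.

    The interval is computed on the log scale using the standard error
    sqrt(1/a + 1/b + 1/c + 1/d) and then exponentiated.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Counts reconstructed from the reported percentages (assumption):
    # deceased: 53 viremia-positive, 35 negative;
    # survivors: 81 viremia-positive, 276 negative.
    or_, lo, hi = odds_ratio_ci(53, 35, 81, 276)
    print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # OR 5.16 (95% CI 3.15-8.45)
```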
Subjects
COVID-19 , Male , Humans , Middle Aged , Aged , SARS-CoV-2 , Viremia , Hospitalization , Prospective Studies
ABSTRACT
A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.
Subjects
Algorithms , Epigenomics , Genomics/methods
ABSTRACT
The inflation of SARS-CoV-2 lineages with a high number of accumulated mutations (such as the recent case of Omicron) has raised concerns about the evolutionary capacity of this virus. Here, we propose a computational study to examine non-synonymous mutations gathered within genomes of SARS-CoV-2 from the beginning of the pandemic until February 2022. We provide both qualitative and quantitative descriptions of this corpus, focusing on statistically significant co-occurring and mutually exclusive mutations within single genomes. Then, we examine in depth the distributions of mutations over defined lineages and compare those of frequently co-occurring mutation pairs. Based on this comparison, we study mutations' convergence/divergence on the phylogenetic tree. As a result, we identify 1,818 co-occurring pairs of non-synonymous mutations showing at least one event of convergent evolution and 6,625 co-occurring pairs with at least one event of divergent evolution. Notable examples of both types are shown by means of a tree-based representation of lineages, visually capturing mutations' behaviors. Our method confirms several well-known cases; moreover, the provided evidence suggests that our workflow can explain aspects of the future mutational evolution of SARS-CoV-2.
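One way to test whether two mutations co-occur more often than chance is sketched below. The abstract does not name the statistical test the authors use, so the one-sided hypergeometric tail (equivalent to a one-sided Fisher exact test) is an assumption of this sketch, and the function name `cooccurrence_pvalue` is hypothetical.

```python
from math import comb


def cooccurrence_pvalue(n_genomes, n_a, n_b, n_both):
    """Probability of seeing at least n_both genomes with both mutations
    if mutations A and B were distributed independently.

    n_genomes: total genomes examined
    n_a, n_b:  genomes carrying mutation A, mutation B
    n_both:    genomes carrying both

    Upper tail of a hypergeometric distribution; a small value suggests
    co-occurrence. The lower tail (at most n_both) would instead test
    mutual exclusivity.
    """
    upper = min(n_a, n_b)
    total = comb(n_genomes, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_genomes - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 genomes, 5 carry A, 5 carry B, all 5 overlap.
    print(cooccurrence_pvalue(10, 5, 5, 5))
```

With many mutation pairs tested at once, the resulting p-values would need a multiple-testing correction before any pair is called significant.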