Results 1 - 20 of 29
1.
Brief Bioinform ; 22(2): 664-675, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33348368

ABSTRACT

With the outbreak of COVID-19, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for COVID-19, which have been deposited into the most important repositories of viral sequences. Organizations that were already active in the virus domain are now dedicating special attention to the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while laying the research groundwork for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand COVID-19 and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.


Subjects
COVID-19/therapy , COVID-19/epidemiology , COVID-19/virology , Genes, Viral , Humans , Information Storage and Retrieval , Pandemics , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification
2.
Brief Bioinform ; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-34020536

ABSTRACT

MOTIVATION: With the spreading of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations are facing the need of sharing NGS data resources and easily accessing and processing comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.


Subjects
Datasets as Topic , Genomics , Information Dissemination , High-Throughput Nucleotide Sequencing/methods , Humans , Programming Languages
3.
Bioinformatics ; 38(7): 1988-1994, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35040923

ABSTRACT

MOTIVATION: The ongoing evolution of SARS-CoV-2 and the rapid emergence of variants of concern at distinct geographic locations have relevant implications for the implementation of strategies for controlling the COVID-19 pandemic. Combining the growing body of data and the evidence on potential functional implications of SARS-CoV-2 mutations can suggest highly effective methods for the prioritization of novel variants of potential concern, e.g. increasing in frequency locally and/or globally. However, these analyses may be complex, requiring the integration of different data and resources. We claim the need for a streamlined access to up-to-date and high-quality genome sequencing data from different geographic regions/countries, and the current lack of a robust and consistent framework for the evaluation/comparison of the results. RESULTS: To overcome these limitations, we developed ViruClust, a novel tool for the comparison of SARS-CoV-2 genomic sequences and lineages in space and time. ViruClust is made available through a powerful and intuitive web-based user interface. Sophisticated large-scale analyses can be executed with a few clicks, even by users without any computational background. To demonstrate potential applications of our method, we applied ViruClust to conduct a thorough study of the evolution of the most prevalent lineage of the Delta SARS-CoV-2 variant, and derived relevant observations. By allowing the seamless integration of different types of functional annotations and the direct comparison of viral genomes and genetic variants in space and time, ViruClust represents a highly valuable resource for monitoring the evolution of SARS-CoV-2, facilitating the identification of variants and/or mutations of potential concern. AVAILABILITY AND IMPLEMENTATION: ViruClust is openly available at http://gmql.eu/viruclust/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Pandemics , Chromosome Mapping
4.
Nucleic Acids Res ; 49(D1): D817-D824, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33045721

ABSTRACT

ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; but ViruSurf contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described from their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes the sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.


Subjects
COVID-19/prevention & control , Computational Biology/methods , Data Curation/methods , Databases, Genetic , Genome, Viral/genetics , SARS-CoV-2/genetics , COVID-19/epidemiology , COVID-19/virology , Genetic Variation , Humans , Information Storage and Retrieval/methods , Internet , Pandemics , SARS-CoV-2/physiology , User-Computer Interface
5.
Nucleic Acids Res ; 49(15): e90, 2021 09 07.
Article in English | MEDLINE | ID: mdl-34107016

ABSTRACT

Variant visualization plays an important role in supporting the viral evolution analysis, extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide. Moreover, VirusViz accepts sequences of new viral populations as multi-FASTA files plus corresponding metadata in CSV format; a bioinformatic pipeline builds a suitable input for VirusViz by extracting the nucleotide and amino acid variants. Pages of VirusViz provide metadata summarization, variant descriptions, and variant visualization with rich options for zooming, highlighting variants or regions of interest, and switching from nucleotides to amino acids; sequences can be grouped, groups can be comparatively analyzed. For SARS-CoV-2, we manually collect mutations with known or predicted levels of severity/virulence, as indicated in linked research articles; such critical mutations are reported when observed in sequences. The system includes light-weight project management for downloading, resuming, and merging data analysis sessions. VirusViz is freely available at http://gmql.eu/virusviz/.


Subjects
COVID-19/virology , Data Visualization , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , Amino Acid Sequence , Base Sequence , Databases, Factual , Humans , Knowledge Bases , SARS-CoV-2/classification , South Africa/epidemiology , United States/epidemiology
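
The abstract above mentions a pipeline that extracts nucleotide variants from submitted sequences. The sketch below illustrates the general idea only; it is not the VirusViz pipeline, whose internals are not described here. It assumes the sequence has already been aligned to the reference (gaps written as `-`), and the function name `call_substitutions` is hypothetical.

```python
from typing import List, Tuple


def call_substitutions(reference: str, aligned: str) -> List[Tuple[int, str, str]]:
    """Return (1-based reference position, ref base, alt base) for each mismatch.

    Assumption: both strings come from a pairwise alignment and have the
    same length. Gap characters ('-') and ambiguous 'N' bases are skipped,
    so only substitutions are reported (no insertions or deletions).
    """
    if len(reference) != len(aligned):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0  # position on the ungapped reference
    for ref_base, alt_base in zip(reference.upper(), aligned.upper()):
        if ref_base != "-":
            ref_pos += 1
        if "-" in (ref_base, alt_base) or "N" in (ref_base, alt_base):
            continue
        if ref_base != alt_base:
            variants.append((ref_pos, ref_base, alt_base))
    return variants


# Toy sequences, invented for illustration.
toy_variants = call_substitutions("ACGTACGT", "ACCTA-GA")
print(toy_variants)
```

Amino acid variants would additionally require translating each coding region, which this sketch leaves out.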
6.
BMC Bioinformatics ; 22(1): 250, 2021 May 15.
Article in English | MEDLINE | ID: mdl-33992077

ABSTRACT

BACKGROUND: A pair of genes is defined as synthetically lethal if defects in both cause the death of the cell but a defect in only one of the two is compatible with cell viability. Ideally, if A and B are two synthetic lethal genes, inhibiting B should kill cancer cells with a defect in A, and should have no effect on normal cells. Thus, synthetic lethality can be exploited for highly selective cancer therapies, which need to exploit differences between normal and cancer cells. RESULTS: In this paper, we present a new method for predicting synthetic lethal (SL) gene pairs. As neighbouring genes in the genome have highly correlated profiles of copy number alterations (CNAs), our method clusters proximal genes with a similar CNA profile, then predicts mutually exclusive group pairs, and finally identifies the SL gene pairs within each group pair. For mutual-exclusion testing we use a graph-based method which takes into account the mutation frequencies of different subjects and genes. We use two different methods for selecting the pair of SL genes; the first is based on the gene essentiality measured in various conditions by means of the "Gene Activity Ranking Profile" (GARP) score; the second leverages the annotations of genes to biological pathways. CONCLUSIONS: This method is unique among current SL prediction approaches; it reduces false-positive SL predictions compared to previous methods, and it allows establishing explicit collateral lethality relationships of gene pairs within mutually exclusive group pairs.


Subjects
DNA Copy Number Variations , Genes, Lethal , DNA
7.
Bioinformatics ; 35(5): 729-736, 2019 03 01.
Article in English | MEDLINE | ID: mdl-30101316

ABSTRACT

MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing , Software , Epigenomics , Genome , Genomics
8.
BMC Bioinformatics ; 20(1): 560, 2019 Nov 08.
Article in English | MEDLINE | ID: mdl-31703553

ABSTRACT

BACKGROUND: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. RESULTS: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. CONCLUSIONS: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.


Subjects
Data Analysis , Databases, Genetic , Genomics , Software , Enhancer Elements, Genetic/genetics , Genome , Genome-Wide Association Study , Humans , Reproducibility of Results , Transcription Factors/metabolism
9.
Methods ; 111: 3-11, 2016 12 01.
Article in English | MEDLINE | ID: mdl-27637471

ABSTRACT

While a huge amount of (epi)genomic data of multiple types is becoming available by using Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats which have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM supports as well all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and makes the ground for providing data interoperability, e.g., achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery.


Subjects
Data Mining/methods , Genomics/methods , Sequence Analysis, DNA/methods , Software , DNA Copy Number Variations/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Regulatory Sequences, Nucleic Acid/genetics , Sequence Alignment/methods , Transcription Initiation Site
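
The abstract above describes a data model that links genomic feature data to the metadata of the experiment that produced them. A minimal sketch of that idea in Python follows. The class names, field names, and the helper `select_by_metadata` are assumptions made for illustration and are not the GDM specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """One genomic region: coordinates plus format-specific values."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"
    values: Tuple = ()  # e.g. a peak score, or ref/alt alleles


@dataclass
class Sample:
    """A sample couples its regions with attribute-value metadata."""
    sample_id: str
    regions: List[Region] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def select_by_metadata(samples: List[Sample], **conditions: str) -> List[Sample]:
    """Keep the samples whose metadata match every given attribute."""
    return [
        s for s in samples
        if all(s.metadata.get(key) == value for key, value in conditions.items())
    ]


# Toy samples, invented for illustration: one peak and one variant.
samples = [
    Sample("s1", [Region("chr1", 100, 250, "+", (12.5,))],
           {"assay": "ChIP-seq", "source_format": "NARROWPEAK"}),
    Sample("s2", [Region("chr1", 180, 181, "*", ("A", "G"))],
           {"assay": "WGS", "source_format": "VCF"}),
]
chipseq = select_by_metadata(samples, assay="ChIP-seq")
print([s.sample_id for s in chipseq])
```

Keeping the coordinates in fixed fields and everything else in `values` is one way to let regions from different file formats share a single structure.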
10.
Bioinformatics ; 31(12): 1881-8, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-25649616

ABSTRACT

MOTIVATION: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art 'big data' computing strategies, with abstraction levels beyond available tool capabilities. RESULTS: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic 'big data' analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. Based on Hadoop framework and Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. AVAILABILITY AND IMPLEMENTATION: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.


Subjects
Abstracting and Indexing , Computational Biology/methods , Databases, Factual , Genomics/methods , High-Throughput Screening Assays/methods , Software , Chromatin Immunoprecipitation , Epigenomics , Histones/metabolism , Humans , Sequence Analysis, DNA/methods , Transcription Factors/metabolism
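
The abstract above describes queries over genomic region data. The plain-Python sketch below shows the kind of region computation such a query expresses: counting, for each reference region, how many experiment regions overlap it. It is not GMQL syntax and not the GMQL implementation. The function name `count_overlaps` and the half-open coordinate convention are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, stop), half-open [start, stop)


def count_overlaps(reference: List[Interval], experiment: List[Interval]) -> List[int]:
    """For each reference region, count the experiment regions overlapping it.

    Two half-open intervals on the same chromosome overlap when each one
    starts before the other ends. This is a quadratic scan for clarity;
    a scalable engine would sort or index the regions instead.
    """
    counts = []
    for chrom, start, stop in reference:
        n = sum(
            1 for c, s, e in experiment
            if c == chrom and s < stop and e > start
        )
        counts.append(n)
    return counts


# Toy coordinates, invented for illustration.
genes = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks = [("chr1", 150, 160), ("chr1", 190, 210), ("chr1", 200, 300), ("chr2", 50, 100)]
overlap_counts = count_overlaps(genes, peaks)
print(overlap_counts)
```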
11.
BMC Bioinformatics ; 16 Suppl 6: S4, 2015.
Article in English | MEDLINE | ID: mdl-25916950

ABSTRACT

BACKGROUND: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, in terms of both money and time, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. METHODS: We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis), and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. RESULTS: We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods showed their ability to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. CONCLUSIONS: Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides the best results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool in supporting scientists in the curation process of gene functional annotations.


Subjects
Algorithms , Computational Biology/methods , Drosophila Proteins/genetics , Drosophila melanogaster/genetics , Gene Ontology , Molecular Sequence Annotation , Animals , Cluster Analysis
12.
Database (Oxford) ; 2023, 2023 07 06.
Article in English | MEDLINE | ID: mdl-37410916

ABSTRACT

With the progression of the COVID-19 pandemic, large datasets of SARS-CoV-2 genome sequences were collected to closely monitor the evolution of the virus and identify the novel variants/strains. By analyzing genome sequencing data, health authorities can 'hunt' novel emerging variants of SARS-CoV-2 as early as possible, and then monitor their evolution and spread. We designed VariantHunter, a highly flexible and user-friendly tool for systematically monitoring the evolution of SARS-CoV-2 at global and regional levels. In VariantHunter, amino acid changes are analyzed over an interval of 4 weeks in an arbitrary geographical area (continent, country, or region); for every week in the interval, the prevalence is computed and changes are ranked based on their increase or decrease in prevalence. VariantHunter supports two main types of analysis: lineage-independent and lineage-specific. The former considers all the available data and aims to discover new viral variants. The latter evaluates specific lineages/viral variants to identify novel candidate designations (sub-lineages and sub-variants). Both analyses use simple statistics and visual representations (diffusion charts and heatmaps) to track viral evolution. A dataset explorer allows users to visualize available data and refine their selection. VariantHunter is a web application free to all users. The two types of supported analysis (lineage-independent and lineage-specific) allow user-friendly monitoring of the viral evolution, empowering genomic surveillance without requiring any computational background. Database URL http://gmql.eu/variant_hunter/.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Pandemics , Chromosome Mapping
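
The abstract above describes computing the weekly prevalence of each amino acid change over a 4-week interval and ranking changes by how much their prevalence rises or falls. A minimal sketch of that computation follows. The abstract does not state the exact ranking statistic, so the difference between last-week and first-week prevalence is an assumption, as are the function names and the toy counts.

```python
from typing import Dict, List, Tuple


def weekly_prevalence(carrying: List[int], totals: List[int]) -> List[float]:
    """Fraction of sequences carrying a change, week by week."""
    return [c / t if t else 0.0 for c, t in zip(carrying, totals)]


def rank_changes(counts: Dict[str, List[int]], totals: List[int]) -> List[Tuple[str, float]]:
    """Rank changes by prevalence difference (last week minus first week).

    Assumption: this simple difference stands in for whatever statistic
    the tool actually uses. Largest increase comes first.
    """
    scored = []
    for change, carrying in counts.items():
        prevalence = weekly_prevalence(carrying, totals)
        scored.append((change, prevalence[-1] - prevalence[0]))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration: sequences collected per week in one
# area, and how many of them carry each (hypothetical) change.
totals = [200, 250, 300, 400]
counts = {
    "change_A": [20, 50, 120, 240],    # rising
    "change_B": [190, 240, 290, 390],  # stable, already dominant
    "change_C": [100, 100, 90, 80],    # declining
}
ranking = rank_changes(counts, totals)
for change, delta in ranking:
    print(f"{change}: {delta:+.3f}")
```

Restricting `counts` and `totals` to the sequences of a single lineage would give the lineage-specific flavour of the same computation.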
13.
PLoS One ; 18(4): e0281052, 2023.
Article in English | MEDLINE | ID: mdl-37115764

ABSTRACT

BACKGROUND: SARS-CoV-2 viremia has been found to be a potential prognostic factor in patients hospitalized for COVID-19. OBJECTIVE: We aimed to assess the association between SARS-CoV-2 viremia and mortality in COVID-19 hospitalized patients during different epidemic periods. METHODS: A prospective COVID-19 registry was queried to extract all COVID-19 patients with an available SARS-CoV-2 viremia test performed at hospital admission between March 2020 and January 2022. SARS-CoV-2 viremia was assessed by means of the GeneFinder™ COVID-19 Plus RealAmp Kit assay and the SARS-CoV-2 ELITe MGB® Kit, using a cycle threshold <45 to define positivity. Univariable and multivariable logistic regression models were built to assess the association between SARS-CoV-2 positive viremia and death. RESULTS: Four hundred and forty-five out of 2,822 COVID-19 patients had an available SARS-CoV-2 viremia, predominantly males (64.9%) with a median age of 65 years (IQR 55-75). Patients with a positive SARS-CoV-2 viremia (86/445; 19.3%) more frequently presented with a severe or critical disease (67.4% vs 57.1%) when compared to those with a negative SARS-CoV-2 viremia. Deceased subjects (88/445; 19.8%) were older [75 (IQR 68-82) vs 63 (IQR 54-72)] and more frequently showed a detectable SARS-CoV-2 viremia at admission (60.2% vs 22.7%) when compared to survivors. In univariable analysis a positive SARS-CoV-2 viremia was associated with higher odds of death [OR 5.16 (95% CI 3.15-8.45)], which was confirmed in the multivariable analysis adjusted for age, biological sex, and disease severity [AOR 6.48 (95% CI 4.05-10.45)]. The association between positive SARS-CoV-2 viremia and death was consistent in the period 1 February 2021-31 January 2022 [AOR 5.86 (95% CI 3.43-10.16)] and in subgroup analysis according to disease severity: mild/moderate [AOR 6.45 (95% CI 2.84-15.17)] and severe/critical COVID-19 patients [AOR 6.98 (95% CI 3.68-13.66)]. CONCLUSIONS: SARS-CoV-2 viremia was associated with COVID-19 mortality and should be considered in the initial assessment of COVID-19 hospitalized patients.


Subjects
COVID-19 , Male , Humans , Middle Aged , Aged , SARS-CoV-2 , Viremia , Hospitalization , Prospective Studies
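
The abstract above reports unadjusted odds ratios with 95% confidence intervals. For a single binary exposure, the odds ratio from a univariable logistic regression equals the cross-product ratio of the 2x2 table, so the quantity can be sketched without a regression library. The counts below are hypothetical and are not the study's data; the Wald interval is a common approximation, and the abstract does not say which interval method the authors used.

```python
import math
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed and died        b: exposed and survived
    c: unexposed and died      d: unexposed and survived
    All four cells must be non-zero for the Wald standard error.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to exercise the function.
or_value, ci_low, ci_high = odds_ratio_ci(a=30, b=20, c=40, d=110)
print(f"OR {or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

Adjusted odds ratios (AOR) additionally require fitting a multivariable model, which this sketch does not attempt.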
14.
Genome Biol ; 24(1): 79, 2023 04 18.
Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.


Subjects
Algorithms , Epigenomics , Genomics/methods
15.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 1956-1967, 2022.
Article in English | MEDLINE | ID: mdl-34166199

ABSTRACT

Traditional drug experiments to find synergistic drug pairs are time-consuming and expensive due to the numerous possible combinations of drugs that have to be examined. Thus, computational methods that can give suggestions for synergistic drug investigations are of great interest. Here, we propose a Non-negative Matrix Tri-Factorization (NMTF) based approach that leverages the integration of different data types for predicting synergistic drug pairs in multiple specific cell lines. Our computational framework relies on a network-based representation of available data about drug synergism, which also allows integrating genomic information about cell lines. We computationally evaluate the performance of our method in finding missing relationships between synergistic drug pairs and cell lines, and in computing synergy scores between drug pairs in a specific cell line, and we estimate the benefit of adding cell line genomic data to the network. Our approach obtains very good performance (Average Precision Score equal to 0.937, Pearson's correlation coefficient equal to 0.760) when cell line genomic data and rich data about synergistic drugs in a cell line are considered. Finally, we systematically searched for our top-scored predictions in the available literature and in the NCI ALMANAC, a well-known database of drug combination experiments, which supports the validity of our findings.


Subjects
Algorithms , Computational Biology , Computational Biology/methods , Databases, Factual , Drug Synergism , Genomics
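
The abstract above reports two evaluation measures: Average Precision for recovering missing drug pair and cell line relationships, and Pearson's correlation for predicted synergy scores. A minimal sketch of both measures follows, written in plain Python. It does not implement NMTF, and the toy labels and scores are invented for illustration.

```python
import math
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy example: 1 marks a truly synergistic pair, scores are model outputs.
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(toy_labels, toy_scores)
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(f"AP = {ap:.3f}, r = {r:.3f}")
```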
16.
BioTech (Basel) ; 11(1), 2022 Mar 21.
Article in English | MEDLINE | ID: mdl-35822815

ABSTRACT

With the spread of COVID-19, sequencing laboratories started to share hundreds of sequences daily. However, the lack of a commonly agreed standard across deposition databases hindered the exploration and study of all the viral sequences collected worldwide in a practical and homogeneous way. During the first months of the pandemic, we developed an automatic procedure to collect, transform, and integrate viral sequences of SARS-CoV-2, MERS, SARS-CoV, Ebola, and Dengue from four major database institutions (NCBI, COG-UK, GISAID, and NMDC). This data pipeline allowed the creation of the data exploration interfaces VirusViz and EpiSurf, as well as ViruSurf, one of the largest databases of integrated viral sequences. Almost two years after the first release of the repository, the original pipeline underwent a thorough refinement process and became more efficient, scalable, and general (currently, it also includes epitopes from the IEDB). Thanks to these improvements, we constantly update and expand our integrated repository, encompassing about 9.1 million SARS-CoV-2 sequences at present (March 2022). This pipeline made it possible to design and develop fundamental resources for any researcher interested in understanding the biological mechanisms behind the viral infection. In addition, it plays a crucial role in many analytic and visualization tools, such as ViruSurf, EpiSurf, VirusViz, and VirusLab.

17.
Comput Struct Biotechnol J ; 20: 4238-4250, 2022.
Article in English | MEDLINE | ID: mdl-35945925

ABSTRACT

The inflation of SARS-CoV-2 lineages with a high number of accumulated mutations (such as the recent case of Omicron) has raised concerns about the evolutionary capacity of this virus. Here, we propose a computational study to examine non-synonymous mutations gathered within genomes of SARS-CoV-2 from the beginning of the pandemic until February 2022. We provide both qualitative and quantitative descriptions of this corpus, focusing on statistically significant co-occurring and mutually exclusive mutations within single genomes. Then, we examine in depth the distributions of mutations over defined lineages and compare those of frequently co-occurring mutation pairs. Based on this comparison, we study mutations' convergence/divergence on the phylogenetic tree. As a result, we identify 1,818 co-occurring pairs of non-synonymous mutations showing at least one event of convergent evolution and 6,625 co-occurring pairs with at least one event of divergent evolution. Notable examples of both types are shown by means of a tree-based representation of lineages, visually capturing mutations' behaviors. Our method confirms several well-known cases; moreover, the provided evidence suggests that our workflow can explain aspects of the future mutational evolution of SARS-CoV-2.
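
The abstract above refers to statistically significant co-occurring and mutually exclusive mutation pairs but does not name the test used. One common choice, assumed here purely for illustration, is a one-sided Fisher's exact test based on the hypergeometric distribution. The function names and the toy counts are hypothetical.

```python
from math import comb
from typing import Tuple


def hypergeom_pmf(k: int, n_genomes: int, n_a: int, n_b: int) -> float:
    """P(exactly k genomes carry both mutations) if A and B are independent."""
    return comb(n_a, k) * comb(n_genomes - n_a, n_b - k) / comb(n_genomes, n_b)


def pair_pvalues(n_genomes: int, n_a: int, n_b: int, n_both: int) -> Tuple[float, float]:
    """One-sided p-values for co-occurrence and for mutual exclusivity.

    Co-occurrence: probability of seeing n_both or more shared genomes.
    Mutual exclusivity: probability of seeing n_both or fewer.
    """
    lowest = max(0, n_a + n_b - n_genomes)
    highest = min(n_a, n_b)
    if not lowest <= n_both <= highest:
        raise ValueError("n_both is inconsistent with the marginal counts")
    p_cooccur = sum(hypergeom_pmf(k, n_genomes, n_a, n_b) for k in range(n_both, highest + 1))
    p_exclusive = sum(hypergeom_pmf(k, n_genomes, n_a, n_b) for k in range(lowest, n_both + 1))
    return p_cooccur, p_exclusive


# Toy counts: 20 genomes, 5 carry mutation A, 5 carry B, all 5 carry both.
p_co, p_ex = pair_pvalues(n_genomes=20, n_a=5, n_b=5, n_both=5)
print(f"co-occurrence p = {p_co:.2e}, mutual exclusivity p = {p_ex:.3f}")
```

With millions of genomes and many mutation pairs, a correction for multiple testing would also be needed; that step is omitted here.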

18.
Article in English | MEDLINE | ID: mdl-33270566

ABSTRACT

Breast cancer comprises multiple subtypes implicated in prognosis. Existing stratification methods rely on the expression quantification of small gene sets. Next Generation Sequencing promises large amounts of omic data in the next years. In this scenario, we explore the potential of machine learning and, particularly, deep learning for breast cancer subtyping. Due to the paucity of publicly available data, we leverage pan-cancer and non-cancer data to design semi-supervised settings. We make use of multi-omic data, including microRNA expressions and copy number alterations, and we provide an in-depth investigation of several supervised and semi-supervised architectures. The accuracy results obtained show that simpler models perform at least as well as the deep semi-supervised approaches on our task over gene expression data. When multi-omic data types are combined, the performance of deep models shows little (if any) improvement in accuracy, indicating the need for further analysis on larger datasets of multi-omic data as and when they become available. From a biological perspective, our linear model mostly confirms known gene-subtype annotations. Conversely, deep approaches model non-linear relationships, which is reflected in a more varied and still unexplored set of representative omic features that may prove useful for breast cancer subtyping.


Subjects
Breast Neoplasms , Deep Learning , Breast Neoplasms/genetics , DNA Copy Number Variations , Female , Humans , Machine Learning , Supervised Machine Learning
19.
Front Bioeng Biotechnol ; 10: 945474, 2022.
Article in English | MEDLINE | ID: mdl-36686258

ABSTRACT

Mesenchymal stem cells (MSCs) are known to be ideal candidates for clinical applications where not only regenerative potential but also immunomodulation ability is fundamental. Over the last few years, increasing efforts have been put into the design and fabrication of 3D synthetic niches, conceived to emulate the native tissue microenvironment and aimed at efficiently controlling the MSC phenotype in vitro. In this context, our group patented an engineered microstructured scaffold, called Nichoid. It is fabricated through two-photon polymerization, a technique enabling the creation of 3D structures with control of scaffold geometry at the cell level and spatial resolution beyond the diffraction limit, down to 100 nm. The Nichoid's capacity to maintain higher levels of stemness as compared to 2D substrates, with no need for adding exogenous soluble factors, has already been demonstrated in MSCs, neural precursors, and murine embryonic stem cells. In this work, we evaluated how three-dimensionality can influence the whole gene expression profile in rat MSCs. Our results show that only 4 days after cell seeding, gene activation is significantly affected: 654 genes are differentially expressed (392 upregulated and 262 downregulated) between cells cultured in 3D Nichoids and in 2D controls. The functional enrichment analysis shows that differentially expressed genes are mainly enriched in pathways related to the actin cytoskeleton, extracellular matrix (ECM), and, in particular, cell adhesion molecules (CAMs), thus confirming the important role of cell morphology and adhesions in determining the MSC phenotype. In conclusion, our results suggest that the Nichoid, thanks to its exclusive architecture and 3D cell adhesion properties, is not only a useful tool for governing cell stemness but could also be a means for controlling immune-related MSC features specifically involved in cell migration.

20.
BioTech (Basel) ; 10(4)2021 Nov 06.
Article in English | MEDLINE | ID: mdl-35822801

ABSTRACT

Since the beginning of 2020, the COVID-19 pandemic has posed unprecedented challenges to viral data analysis and connected host disease diagnostic methods. We propose VirusLab, a flexible system for analysing SARS-CoV-2 viral sequences and relating them to metadata or clinical information about the host. VirusLab capitalizes on two existing resources: ViruSurf, a database of public SARS-CoV-2 sequences supporting metadata-driven search, and VirusViz, a tool for visual analysis of search results. VirusLab is designed to take advantage of these resources within a server-side architecture that: (i) covers pipelines based on approaches already in use (ARTIC, Galaxy) but entirely customizable upon user request; (ii) predigests analysis of raw sequencing data from different platforms (Oxford Nanopore and Illumina); (iii) gives access to public archive datasets; (iv) supplies user-friendly reporting, making it a tool that can also be integrated into a business environment. VirusLab can be installed and hosted within the premises of any organization where information about SARS-CoV-2 sequences can be safely integrated with information about hosts (e.g., clinical metadata). A system such as VirusLab is not currently available in the landscape of similar providers: our results show that VirusLab is a powerful tool to generate tabular/graphical and machine-readable reports that can be integrated into more complex pipelines. We foresee that the proposed system can support many research-oriented and therapeutic scenarios within hospitals or the tracing of viral sequences and their mutational processes within organizations for viral surveillance.
