1.
Nucleic Acids Res ; 52(D1): D891-D899, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37953337

ABSTRACT

Ensembl (https://www.ensembl.org) is a freely available genomic resource that has produced high-quality annotations, tools, and services for vertebrates and model organisms for more than two decades. In recent years, there has been a dramatic shift in the genomic landscape, with a large increase in the number and phylogenetic breadth of high-quality reference genomes, alongside major advances in the pan-genome representations of higher species. In order to support these efforts and accelerate downstream research, Ensembl continues to focus on scaling for the rapid annotation of new genome assemblies, developing new methods for comparative analysis, and expanding the depth and quality of our genome annotations. This year we have continued our expansion to support global biodiversity research, doubling the number of annotated genomes we support on our Rapid Release site to over 1700, driven by our close collaboration with biodiversity projects such as Darwin Tree of Life. We have also strengthened support for key agricultural species, including the first regulatory builds for farmed animals, and have updated key tools and resources that support the global scientific community, notably the Ensembl Variant Effect Predictor. Ensembl data, software, and tools are freely available.


Subject(s)
Databases, Genetic , Genomics , Animals , Genome , Molecular Sequence Annotation , Phylogeny , Software , Humans
2.
Nucleic Acids Res ; 51(D1): D933-D941, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36318249

ABSTRACT

Ensembl (https://www.ensembl.org) has produced high-quality genomic resources for vertebrates and model organisms for more than twenty years. During that time, our resources, services and tools have continually evolved in line with both the publicly available genome data and the downstream research and applications that utilise the Ensembl platform. In recent years we have witnessed a dramatic shift in the genomic landscape. There has been a large increase in the number of high-quality reference genomes through global biodiversity initiatives. In parallel, there have been major advances towards pangenome representations of higher species, where many alternative genome assemblies representing different breeds, cultivars, strains and haplotypes are now available. In order to support these efforts and accelerate downstream research, it is our goal at Ensembl to create high-quality annotations, tools and services for species across the tree of life. Here, we report our resources for popular reference genomes, the dramatic growth of our annotations (including haplotypes from the first human pangenome graphs), updates to the Ensembl Variant Effect Predictor (VEP), interactive protein structure predictions from AlphaFold DB, and the beta release of our new website.


Subject(s)
Databases, Genetic , Software , Animals , Humans , Molecular Sequence Annotation , Genomics , Genome
3.
Bioinformatics ; 38(19): 4488-4496, 2022 09 30.
Article in English | MEDLINE | ID: mdl-35929781

ABSTRACT

MOTIVATION: Experimental testing and manual curation are the most precise ways of assigning Gene Ontology (GO) terms describing protein functions. However, they are expensive, time-consuming and cannot cope with the exponential growth of data generated by high-throughput sequencing methods. Hence, researchers need reliable computational systems to help fill the gap with automatic function prediction. The results of the last Critical Assessment of Function Annotation challenge revealed that GO term prediction remains a very challenging task. Recent developments in deep learning are pushing the frontiers of protein research, yielding new knowledge thanks to the integration of data from multiple sources. However, the deep models developed so far for functional prediction are mainly focused on sequence data and have not yet achieved breakthrough performance. RESULTS: We propose DeeProtGO, a novel deep-learning model for predicting GO annotations by integrating protein knowledge. DeeProtGO was trained to solve 18 different prediction problems, defined by the three GO sub-ontologies, the type of proteins, and the taxonomic kingdom. Our experiments reported higher prediction quality when more protein knowledge is integrated. We also benchmarked DeeProtGO against state-of-the-art methods on public datasets, and showed it can effectively improve the prediction of GO annotations. AVAILABILITY AND IMPLEMENTATION: DeeProtGO and a use case are available at https://github.com/gamerino/DeeProtGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Gene Ontology , Computational Biology/methods , Molecular Sequence Annotation , Proteins/metabolism
4.
Elife ; 11, 2022 01 12.
Article in English | MEDLINE | ID: mdl-35018885

ABSTRACT

Estrogen (E2) and Progesterone (Pg), via their specific receptors (ERalpha and PR), are major determinants in the development and progression of endometrial carcinomas. However, their precise mechanism of action and the role of other transcription factors involved are not entirely clear. Using Ishikawa endometrial cancer cells, we report that E2 treatment exposes a set of progestin-dependent PR binding sites which include both E2 and progestin target genes. ChIP-seq results from hormone-treated cells revealed a non-random distribution of PAX2 binding in the vicinity of these estrogen-promoted PR sites. Altered expression of hormone-regulated genes in PAX2 knockdown cells suggests a role for PAX2 in fine-tuning ERalpha and PR interplay in transcriptional regulation. Analysis of long-range interactions by Hi-C coupled with ATAC-seq data showed that these regions, which we call 'progestin control regions' (PgCRs), exhibited an open chromatin state even before hormone exposure and were non-randomly associated with regulated genes. Nearly 20% of genes potentially influenced by PgCRs were found to be altered during progression of endometrial cancer. Our findings suggest that endometrial response to progestins in differentiated endometrial tumor cells results in part from binding of PR together with PAX2 to accessible chromatin regions. What keeps these regions open remains to be studied.


Subject(s)
Endometrial Neoplasms , Receptors, Progesterone , Cell Line, Tumor , Chromatin , Endometrial Neoplasms/genetics , Endometrial Neoplasms/metabolism , Endometrial Neoplasms/pathology , Estradiol/pharmacology , Estrogen Receptor alpha/genetics , Female , Humans , PAX2 Transcription Factor/genetics , Progesterone , Receptors, Progesterone/genetics , Receptors, Progesterone/metabolism
5.
Bioinformatics ; 36(24): 5571-5581, 2021 04 05.
Article in English | MEDLINE | ID: mdl-33244583

ABSTRACT

MOTIVATION: The Severe Acute Respiratory Syndrome-Coronavirus 2 (SARS-CoV-2) has recently emerged as the agent responsible for the pandemic outbreak of the coronavirus disease 2019. This virus is closely related to coronaviruses infecting bats and Malayan pangolins, species suspected to be an intermediate host in the passage to humans. Several genomic mutations affecting viral proteins have been identified, contributing to the understanding of the recent animal-to-human transmission. However, the capacity of SARS-CoV-2 to encode functional putative microRNAs (miRNAs) remains largely unexplored. RESULTS: We have used deep learning to discover 12 candidate stem-loop structures hidden in the viral protein-coding genome. Among the precursors, the expression of eight mature miRNA-like sequences was confirmed in small RNA-seq data from SARS-CoV-2 infected human cells. Predicted miRNAs are likely to target a subset of human genes of which 109 are transcriptionally deregulated upon infection. Remarkably, 28 of those genes potentially targeted by SARS-CoV-2 miRNAs are down-regulated in infected human cells. Interestingly, most of them have been related to respiratory diseases and viral infection, including several afflictions previously associated with SARS-CoV-1 and SARS-CoV-2. The comparison of SARS-CoV-2 pre-miRNA sequences with those from bat and pangolin coronaviruses suggests that single nucleotide mutations could have helped its progenitors jump inter-species boundaries, allowing the gain of novel mature miRNAs targeting human mRNAs. Our results suggest that the recent acquisition of novel miRNA-like sequences in the SARS-CoV-2 genome may have contributed to modulating the transcriptional reprogramming of the new host upon infection. AVAILABILITY AND IMPLEMENTATION: https://github.com/sinc-lab/sarscov2-mirna-discovery. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
COVID-19 , Coronavirus , Animals , Betacoronavirus , Coronavirus/genetics , Genome, Viral , Humans , Pandemics , SARS-CoV-2
6.
Genes (Basel) ; 11(3), 2020 03 06.
Article in English | MEDLINE | ID: mdl-32155892

ABSTRACT

Sunflower germplasm collections are valuable resources for broadening the genetic base of commercial hybrids and ameliorating the risk of climate events. Currently, the most studied sunflower pre-breeding collections worldwide belong to INTA (Argentina), INRA (France), and USDA-UBC (United States of America-Canada). In this work, we assess the amount and distribution of genetic diversity (GD) available within and between these collections to estimate the distribution pattern of global diversity. A mixed genotyping strategy was implemented, by combining proprietary genotyping-by-sequencing data with public whole-genome-sequencing data, to generate an integrative 11,834-common single nucleotide polymorphism matrix including the three breeding collections. In general, the GD estimates obtained were moderate. An analysis of molecular variance provided evidence of population structure between breeding collections. However, the optimal number of subpopulations, studied via discriminant analysis of principal components (K = 12), the Bayesian STRUCTURE algorithm (K = 6) and distance-based methods (K = 9), remains unclear, since no single unifying characteristic is apparent for any of the inferred groups. Different overall patterns of linkage disequilibrium (LD) were observed across chromosomes, with Chr10, Chr17, Chr5, and Chr2 showing the highest LD. This work represents the largest and most comprehensive inter-breeding collection analysis of genomic diversity for cultivated sunflower conducted to date.


Subject(s)
Helianthus/genetics , Linkage Disequilibrium , Polymorphism, Genetic , Seed Bank , Chromosomes, Plant/genetics , Plant Breeding/methods
7.
J Biomed Inform ; 103: 103378, 2020 03.
Article in English | MEDLINE | ID: mdl-31972288

ABSTRACT

Alternative splicing alterations have been widely related to several human diseases, revealing the importance of their study for the success of translational medicine. Differential splicing (DS) occurrence has been mainly analyzed through exon-based approaches over RNA-seq data. Although these strategies allow identifying differentially spliced genes, they ignore the identity of the affected gene isoforms, which is crucial to understand the underlying pathological processes behind alternative splicing changes. Moreover, although several isoform quantification tools for RNA-seq data have recently been developed, DS tools have not taken advantage of them. Here, the NBSplice R package for differential splicing analysis by means of isoform expression data is presented. It estimates differences in the relative expression of gene transcripts between experimental conditions to infer changes in gene alternative splicing patterns. The developed tool was evaluated using a synthetic RNA-seq dataset with controlled differential splicing. NBSplice accurately predicted DS occurrence, outperforming current methods in terms of accuracy, sensitivity, F-score, and false discovery rate control. The usefulness of our development was demonstrated by the analysis of a real cancer dataset, revealing new differentially spliced genes that could be studied in the search for new colorectal cancer biomarkers.


Subject(s)
Alternative Splicing , RNA Splicing , Exons , Gene Expression Profiling , Humans , Protein Isoforms/genetics , RNA-Seq , Sequence Analysis, RNA
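
The abstract of entry 7 describes inferring splicing changes from differences in the relative expression of a gene's transcripts between conditions. A minimal sketch of that quantity alone, in Python, is given here; the function name, the input layout, and the toy counts are hypothetical, and the sketch does not reproduce the statistical model or the API of the NBSplice package.

```python
from collections import defaultdict

def isoform_proportions(counts, gene_of):
    """Return each isoform's share of its gene's total expression.

    counts:  dict mapping isoform id -> expression value (hypothetical layout)
    gene_of: dict mapping isoform id -> gene id
    """
    totals = defaultdict(float)
    for isoform, value in counts.items():
        totals[gene_of[isoform]] += value
    # A gene with zero total expression gets proportion 0.0 for every isoform.
    return {
        isoform: (value / totals[gene_of[isoform]] if totals[gene_of[isoform]] > 0 else 0.0)
        for isoform, value in counts.items()
    }

# Toy data (invented for illustration): one gene with two isoforms.
gene_of = {"g1.a": "g1", "g1.b": "g1"}
condition_a = isoform_proportions({"g1.a": 30, "g1.b": 10}, gene_of)
condition_b = isoform_proportions({"g1.a": 10, "g1.b": 30}, gene_of)

# Change in relative expression per isoform between the two conditions.
delta = {iso: condition_b[iso] - condition_a[iso] for iso in gene_of}
```

In this toy case the gene's total expression is identical in both conditions, so a gene-level comparison would see no change, while the per-isoform proportions shift. Deciding whether such a shift is statistically significant is the part a dedicated tool handles and is not attempted here.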
8.
J Biomed Inform ; 93: 103157, 2019 05.
Article in English | MEDLINE | ID: mdl-30928514

ABSTRACT

The availability of large-scale repositories and integrated cancer genome efforts have created unprecedented opportunities to study and describe cancer biology. In this sense, the aim of translational researchers is the integration of multiple omics data to achieve a better identification of homogeneous subgroups of patients in order to develop adequate diagnostic and treatment strategies from the personalized medicine perspective. So far, existing integrative methods have grouped together omics data information, leaving out individual omics data phenotypic interpretation. Here, we present the Massive and Integrative Gene Set Analysis (MIGSA) R package. This tool can analyze several high throughput experiments in a comprehensive way through a functional analysis strategy, relating a phenotype to its biological function counterpart defined by means of gene sets. By simultaneously querying different multiple omics data from the same or different groups of patients, common and specific functional patterns for each studied phenotype can be obtained. The usefulness of MIGSA was demonstrated by applying the package to functionally characterize the intrinsic breast cancer PAM50 subtypes. For each subtype, specific functional transcriptomic profiles and gene sets enriched by transcriptomic and proteomic data were identified. To achieve this, transcriptomic and proteomic data from 28 datasets were analyzed using MIGSA. As a result, enriched gene sets and important genes were consistently found as related to a specific subtype across experiments or data types and thus can be used as molecular signature biomarkers.


Subject(s)
Breast Neoplasms/genetics , Biomarkers, Tumor/metabolism , Breast Neoplasms/classification , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Datasets as Topic , Female , Humans
9.
Brief Bioinform ; 20(2): 471-481, 2019 03 22.
Article in English | MEDLINE | ID: mdl-29040385

ABSTRACT

Over the last few years, RNA-seq has been used to study alterations in alternative splicing related to several diseases. Bioinformatics workflows used to perform these studies can be divided into two groups, those finding changes in the absolute isoform expression and those studying differential splicing. Many computational methods for transcriptomics analysis have been developed, evaluated and compared; however, there are not enough reports of systematic and objective assessment of processing pipelines as a whole. Moreover, comparative studies have been performed considering separately the changes in absolute or relative isoform expression levels. Consequently, no consensus exists about the best practices and appropriate workflows to analyse alternative and differential splicing. To assist in choosing an adequate pipeline, we present here a benchmarking of nine commonly used workflows to detect differential isoform expression and splicing. We evaluated the workflows' performance over different experimental scenarios where changes in absolute and relative isoform expression occurred simultaneously. In addition, the effects of the number of isoforms per gene and the magnitude of the expression change on pipeline performance were also evaluated. Our results suggest that workflow performance is influenced by the number of replicates per condition and the heterogeneity of the conditions. In general, workflows based on DESeq2, DEXSeq, Limma and NOISeq performed well over a wide range of transcriptomics experiments. In particular, we suggest the use of workflows based on Limma when high precision is required, and DESeq2 and DEXSeq pipelines to prioritize sensitivity. When several replicates per condition are available, NOISeq and Limma pipelines are indicated.


Subject(s)
Alternative Splicing , Benchmarking/methods , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Neoplasm Proteins/genetics , Prostatic Neoplasms/genetics , Sequence Analysis, RNA/methods , Case-Control Studies , Gene Expression Profiling , Humans , Male , Neoplasm Proteins/metabolism , Prostate/metabolism , Prostatic Neoplasms/metabolism , Protein Isoforms , Workflow
10.
Hum Mutat ; 38(5): 494-502, 2017 05.
Article in English | MEDLINE | ID: mdl-28236343

ABSTRACT

Targeted sequencing (TS) is growing as a screening methodology used in research and medical genetics to identify genomic alterations causing human diseases. In general, a list of possible genomic variants is derived from mapped reads through a variant calling step. This processing step is usually based on variant coverage, although it may be affected by several factors. Therefore, undercovered relevant clinical variants may not be reported, affecting pathology diagnosis or treatment. Thus, a prior quality control of the experiment is critical to determine variant detection accuracy and to avoid erroneous medical conclusions. There are several quality control tools, but they are focused on issues related to whole-genome sequencing. However, in TS, quality control should assess experiment, gene, and genomic region performances based on achieved coverages. Here, we propose the TarSeqQC R package for quality control in TS experiments. The tool is freely available at the Bioconductor repository. TarSeqQC was used to analyze two datasets; low-performance primer pools and features were detected, enhancing the quality of experiment results. Read count profiles were also explored, showing TarSeqQC's effectiveness as an exploration tool. Our proposal may be a valuable bioinformatic tool for routine TS experiments in both research and medical genetics.


Subject(s)
Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Software , Computational Biology/standards , Datasets as Topic , Genomics/standards , Humans , Neoplasms/genetics , Quality Control , Reproducibility of Results , Software/standards , User-Computer Interface
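
The abstract of entry 10 argues that quality control in targeted sequencing should judge each feature by the coverage it achieved. A minimal sketch of that idea in Python follows; the function name, the input layout, the choice of the median as summary, and the threshold of 100 reads are all assumptions made for illustration, and none of this reflects the TarSeqQC interface.

```python
from statistics import median

def summarize_coverage(per_base_depth, min_median=100):
    """Summarize coverage per feature and flag low performers.

    per_base_depth: dict mapping feature id (e.g. an amplicon) -> list of
                    per-base read depths (hypothetical layout)
    min_median:     assumed cutoff; a real threshold depends on the assay
    """
    report = {}
    for feature, depths in per_base_depth.items():
        med = median(depths)
        report[feature] = {"median": med, "low": med < min_median}
    return report

# Toy depths (invented for illustration).
report = summarize_coverage({
    "amplicon_1": [250, 260, 240],
    "amplicon_2": [10, 20, 15],
})
low_features = sorted(f for f, r in report.items() if r["low"])
```

Features flagged this way are candidates for re-examination before variant calling, since a variant located in an undercovered region may go unreported.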
11.
Bioinformatics ; 33(5): 693-700, 2017 03 01.
Article in English | MEDLINE | ID: mdl-28062443

ABSTRACT

Motivation: The PAM50 classifier is used to assign patients to the highest correlated breast cancer subtype irrespective of the obtained value. Nonetheless, all subtype correlations are required to build the risk of recurrence (ROR) score, currently used in therapeutic decisions. Present subtype uncertainty estimations are not accurate, seldom considered or require a population-based approach for this context. Results: Here we present a novel single-subject non-parametric uncertainty estimation based on PAM50's gene label permutations. Simulation results (n = 5228) showed that only 61% of subjects can be reliably 'Assigned' to the PAM50 subtype, whereas 33% should be 'Not Assigned' (NA), leaving the rest to tight 'Ambiguous' correlations between subtypes. Excluding the NA subjects from the analysis improved the discrimination of subtype survival curves, yielding a higher proportion of low and high ROR values. Conversely, all NA subjects showed similar survival behaviour regardless of the original PAM50 assignment. We propose to incorporate our PAM50 uncertainty estimation to support therapeutic decisions. Availability and Implementation: Source code can be found in 'pbcmc' R package at Bioconductor. Contacts: cristobalfresno@gmail.com or efernandez@bdmg.com.ar. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Breast Neoplasms/diagnosis , Computational Biology/methods , Neoplasm Recurrence, Local , Uncertainty , Female , Humans , Prognosis , Risk
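
The abstract of entry 11 describes a single-subject, non-parametric uncertainty estimate built from permutations of gene labels. The general technique can be sketched in Python as a permutation test on the correlation between one subject's expression profile and one subtype centroid. The function names, the toy vectors, and the use of a one-sided empirical p-value are assumptions for illustration; the sketch is not the 'pbcmc' implementation and does not encode its 'Assigned', 'Not Assigned' or 'Ambiguous' rules.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def permutation_pvalue(profile, centroid, n_perm=1000, seed=0):
    """Empirical one-sided p-value for the profile/centroid correlation.

    Shuffling the profile breaks the gene-to-value pairing, which is
    equivalent to permuting gene labels for this single subject.
    """
    rng = random.Random(seed)
    observed = pearson(profile, centroid)
    shuffled = list(profile)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(shuffled, centroid) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy vectors (invented): eight genes, profile closely tracking the centroid.
profile = [1, 2, 3, 4, 5, 6, 7, 8]
centroid = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0]
observed, p_value = permutation_pvalue(profile, centroid)
```

Repeating the test against each subtype centroid would give one p-value per subtype for the same subject. How those values are combined into a final call is a design decision that the abstract does not detail, so it is left out of the sketch.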