ABSTRACT
The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.
Subjects
COVID-19/prevention & control; Computational Biology/methods; Databases, Genetic; Genomics/methods; Molecular Sequence Annotation/methods; SARS-CoV-2/genetics; Animals; COVID-19/epidemiology; COVID-19/virology; Epidemics; Humans; Internet; Mice; Pseudogenes/genetics; RNA, Long Noncoding/genetics; SARS-CoV-2/metabolism; SARS-CoV-2/physiology; Transcription, Genetic/genetics
ABSTRACT
Targeted protein degradation (TPD) strategies have revolutionized how scientists tackle challenging protein targets deemed undruggable with traditional small molecule inhibitors. Many promising campaigns to inhibit proteins have failed due to factors surrounding inhibition selectivity and targeting of compounds to specific tissues and cell types. One of the major improvements that PROTAC (proteolysis targeting chimera) and molecular glue technology can exert is highly selective control of target inhibition. Multiple studies have shown that PROTACs can gain selectivity for their protein targets beyond that of their parent ligands via optimization of linker length and stabilization of ternary complexes. Due to the bifunctional nature of PROTACs, the tissue selective nature of E3 ligases can be exploited to uncover novel targeting mechanisms. In this review, we provide critical analysis of the recent progress towards making selective PROTAC molecules and new PROTAC technologies that will continue to push the boundaries of achieving selectivity. These efforts have wide implications in the future of treating disease as they will broaden the possible targets that can be addressed by small molecules, like undruggable proteins or broadly active targets that would benefit from degradation in specific tissue types.
Subjects
Proteolysis; Ubiquitin-Protein Ligases; Ligands; Ubiquitin-Protein Ligases/metabolism
ABSTRACT
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
Subjects
Databases, Genetic; Genome, Human/genetics; Genomics; Pseudogenes/genetics; Animals; Computational Biology; Humans; Internet; Mice; Molecular Sequence Annotation; Software
ABSTRACT
Tumor-specific neoantigens are mutated self-peptides presented by major histocompatibility complex (MHC) molecules on tumor cells and are necessary to elicit the host's anti-cancer cytotoxic T cell responses. They can be specifically recognized by neoantigen-specific T cell receptors (TCRs). However, current wet-lab assays for identifying peptide-MHC binding are too expensive and time-consuming to meet clinical needs. In this study, we developed an in silico method based on a deep convolutional neural network (CNN) model, iConMHC, to predict peptide-MHC binding affinity. Unlike other in silico methods that learn only from the properties of amino acids in neoantigen peptides alone and/or MHCs alone, iConMHC learns from the physical and chemical interaction properties of pairwise amino acids from the two molecules. These properties, such as contact potentials and distances in folded proteins, directly affect neoantigen-MHC binding affinity. In addition, iConMHC is a pan-allele model that is capable of making predictions for all MHC alleles. Even for rare MHC alleles without training data, iConMHC can make predictions with reasonable accuracy. We benchmarked iConMHC against other commonly used MHC-I binding predictors and found that our model performs better than most of the pan-allele models.
Subjects
Deep Learning; Histocompatibility Antigens Class I/metabolism; Peptides/metabolism; Alleles; Amino Acid Sequence; Antigens, Neoplasm/chemistry; Antigens, Neoplasm/metabolism; Computer Simulation; Databases, Protein; Histocompatibility Antigens Class I/chemistry; Histocompatibility Antigens Class I/genetics; Humans; Neural Networks, Computer; Peptides/chemistry; Protein Binding; Reproducibility of Results
ABSTRACT
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
Subjects
Caenorhabditis elegans/genetics; Drosophila melanogaster/genetics; Gene Expression Profiling; Transcriptome/genetics; Animals; Caenorhabditis elegans/embryology; Caenorhabditis elegans/growth & development; Chromatin/genetics; Cluster Analysis; Drosophila melanogaster/growth & development; Gene Expression Regulation, Developmental/genetics; Histones/metabolism; Humans; Larva/genetics; Larva/growth & development; Models, Genetic; Molecular Sequence Annotation; Promoter Regions, Genetic/genetics; Pupa/genetics; Pupa/growth & development; RNA, Untranslated/genetics; Sequence Analysis, RNA
ABSTRACT
Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism's genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (~15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.
Subjects
Caenorhabditis elegans/genetics; Drosophila melanogaster/genetics; Phylogeny; Pseudogenes/genetics; Animals; Evolution, Molecular; Genetic Association Studies; Humans; Molecular Sequence Annotation; Promoter Regions, Genetic/genetics; Sequence Homology, Nucleic Acid
ABSTRACT
In primates and other animals, reverse transcription of mRNA followed by genomic integration creates retroduplications. Expressed retroduplications are either "retrogenes" coding for functioning proteins, or expressed "processed pseudogenes," which can function as noncoding RNAs. To date, little is known about the variation in retroduplications in terms of their presence or absence across individuals in the human population. We have developed new methodologies that allow us to identify "novel" retroduplications (i.e., those not present in the reference genome), to find their insertion points, and to genotype them. Using these methods, we catalogued and analyzed 174 retroduplication variants in almost one thousand humans, which were sequenced as part of Phase 1 of The 1000 Genomes Project Consortium. The accuracy of our data set was corroborated by (1) multiple lines of sequencing evidence for retroduplication (e.g., depth of coverage in exons vs. introns), (2) experimental validation, and (3) the fact that we can reconstruct a correct phylogenetic tree of human subpopulations based solely on retroduplications. We also show that parent genes of retroduplication variants tend to be expressed at the M-to-G1 transition in the cell cycle and that M-to-G1 expressed genes have more copies of fixed retroduplications than genes expressed at other times. These findings suggest that cell division is coupled to retrotransposition and, perhaps, is even a requirement for it.
Subjects
Cell Division/genetics; Gene Duplication; Retroelements/genetics; Computational Biology/methods; Evolution, Molecular; Genome, Human; Genotype; Humans; Phylogeny; Pseudogenes; Reproducibility of Results; Sequence Analysis, DNA
ABSTRACT
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
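As a rough illustration of how locus counts like those quoted above can be tallied from an annotation file, the sketch below counts `gene` records per gene type in GTF-formatted lines. It assumes the standard nine-column, tab-separated GTF layout with a `gene_type` attribute. The helper names (`parse_attributes`, `count_gene_types`) and the toy records are hypothetical and are not taken from any GENCODE release.

```python
from collections import Counter


def parse_attributes(field):
    """Turn a GTF attribute column (key "value"; pairs) into a dict.

    Simplification: repeated keys keep only the last value, and values
    containing a semicolon are not handled.
    """
    attributes = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def count_gene_types(lines):
    """Count records whose feature column is 'gene', grouped by gene_type."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "gene":
            continue  # skip malformed lines and non-gene features
        gene_type = parse_attributes(columns[8]).get("gene_type", "unknown")
        counts[gene_type] += 1
    return counts


# Toy records with made-up identifiers, for illustration only.
toy_gtf = [
    "##description: toy records, not real annotation data",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "TOY1"; gene_type "protein_coding"; gene_name "toyA";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "TOY1"; transcript_id "TOYT1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t2000\t2600\t.\t-\t.\tgene_id "TOY2"; gene_type "lncRNA"; gene_name "toyB";',
    'chr2\tENSEMBL\tgene\t50\t400\t.\t+\t.\tgene_id "TOY3"; gene_type "protein_coding"; gene_name "toyC";',
]

print(count_gene_types(toy_gtf))
```

With a real file, the same function could be fed an open file handle instead of the toy list, since it only iterates over lines.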
Subjects
Databases, Genetic; Genome, Human; Genomics/methods; Molecular Sequence Annotation; Animals; Computational Biology/methods; DNA, Complementary/chemistry; DNA, Complementary/genetics; Evolution, Molecular; Exons; Genetic Loci; Humans; Internet; Models, Molecular; Open Reading Frames; Pseudogenes; Quality Control; RNA Splice Sites; RNA, Long Noncoding; Reproducibility of Results; Untranslated Regions
ABSTRACT
Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.
Subjects
Evolution, Molecular; Gene Expression Regulation; Transcriptome; Alternative Splicing; Animals; Biological Evolution; Cell Line; Epigenesis, Genetic; Gene Expression Profiling; Gene Library; Genome; Histones/chemistry; Humans; Mice; Mice, Inbred C57BL; Models, Genetic; Oligonucleotides, Antisense; Phenotype; Sequence Analysis, RNA
ABSTRACT
The Bayesian network model is widely used for reverse engineering of biological network structures. An advantage of this model is its capability to integrate prior knowledge into the model learning process, which can improve the quality of the network reconstruction outcome. Some previous works have explored this area with a focus on using prior knowledge of direct molecular links, and a few recent ones have proposed examining the effects of molecular orderings. In this study, we propose a Bayesian network model that can integrate both direct links and orderings. Random weights are assigned to these two types of prior knowledge to alleviate bias toward certain types of information. We evaluate model performance using both synthetic data and biological data for the RAF signaling network, and illustrate the significant improvement in network structure reconstruction of the proposed models over existing methods. We also examine the correlation between the improvement and the abundance of ordering prior knowledge. To address the issue of generating prior knowledge, we propose an approach to automatically extract potential molecular orderings from knowledge resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and Gene Ontology (GO) annotation.
Subjects
Bayes Theorem; Computational Biology/methods; Gene Regulatory Networks; Genome; Models, Biological; Signal Transduction; Databases, Genetic; MAP Kinase Signaling System; raf Kinases/metabolism
ABSTRACT
Many statistical methods have been developed to screen for differentially expressed genes associated with specific phenotypes in microarray data. However, it remains a major challenge to synthesize the observed expression patterns with abundant biological knowledge for a more complete understanding of the biological functions among genes. Various methods including clustering analysis on genes, neural networks, Bayesian networks and pathway analysis have been developed toward this goal. In most of these procedures, the activation and inhibition relationships among genes have hardly been utilized in the modeling steps. We propose two novel Bayesian models to integrate the microarray data with the putative pathway structures obtained from the KEGG database and the directional gene-gene interactions in the medical literature. We define the symmetric Kullback-Leibler divergence of a pathway, and use it to identify the pathway(s) most supported by the microarray data. A Markov chain Monte Carlo sampling algorithm is given for posterior computation in the hierarchical model. The proposed method is shown to select the most supported pathway in an illustrative example. Finally, we apply the methodology to a real microarray data set to understand the gene expression profile of the osteoblast lineage at defined stages of differentiation. We observe that our method correctly identifies the pathways that are reported to play essential roles in modulating bone mass.
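For readers unfamiliar with the quantity named above, the following sketch computes the textbook symmetric Kullback-Leibler divergence between two discrete distributions, KL(p||q) + KL(q||p). This is only the general formula; how the abstract's authors define the divergence "of a pathway" is not stated here, so the function names and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # the term 0 * log(0 / q_i) is taken as 0
        if q_i == 0:
            raise ValueError("q must be positive wherever p is positive")
        total += p_i * math.log(p_i / q_i)
    return total


def symmetric_kl(p, q):
    """Symmetrised divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical distributions, chosen only to exercise the functions.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(symmetric_kl(p, q))
```

Unlike the one-directional divergence, the symmetrised form gives the same value regardless of which distribution is treated as the reference, which makes it convenient for ranking candidates.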
ABSTRACT
BACKGROUND: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. RESULTS: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. CONCLUSIONS: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
Subjects
Genome, Human; Pseudogenes; Transcription, Genetic; Animals; Binding Sites; Chromatin/chemistry; Chromatin/genetics; Humans; Models, Genetic; Models, Statistical; Molecular Sequence Annotation; Phylogeny; Primates; RNA Polymerase II/metabolism; Regulatory Sequences, Nucleic Acid; Selection, Genetic; Sequence Analysis, DNA; Transcription Factors/metabolism
ABSTRACT
A Bayesian network model can be used to study the structures of gene regulatory networks. It has the ability to integrate information from both prior knowledge and experimental data. In this study, we propose an approach to efficiently integrate global ordering information into model learning, where the ordering information specifies the indirect relationships among genes. We demonstrate that, compared with a traditional Bayesian network model that uses only local prior knowledge, utilising additional global ordering knowledge can significantly improve the model's performance. The magnitude of this improvement depends on the abundance of global ordering information and on data quality.
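One way to picture global ordering knowledge is as a set of (upstream, downstream) gene pairs that a candidate network should connect by some directed path, not necessarily a direct edge. The sketch below scores a candidate edge list by the fraction of such pairs it satisfies. This is an illustrative assumption about how ordering constraints could be checked, not the scoring scheme of the study, and all function and gene names are hypothetical.

```python
from collections import deque


def has_path(edges, source, target):
    """Breadth-first search for a directed path from source to target."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, []).append(child)
    queue = deque([source])
    seen = {source}
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour == target:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def ordering_agreement(edges, orderings):
    """Fraction of (upstream, downstream) pairs linked by a directed path."""
    if not orderings:
        return 0.0
    satisfied = sum(has_path(edges, up, down) for up, down in orderings)
    return satisfied / len(orderings)


# Hypothetical candidate network and ordering knowledge.
candidate_edges = [("geneA", "geneB"), ("geneB", "geneC")]
known_orderings = [("geneA", "geneC"), ("geneC", "geneA")]
print(ordering_agreement(candidate_edges, known_orderings))
```

In a structure-learning setting, a score of this kind could be folded into a structure prior so that candidate graphs agreeing with more orderings are favoured; the weighting would be a modelling choice.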
Subjects
Computational Biology/methods; Gene Regulatory Networks/genetics; Algorithms; Bayes Theorem; Databases, Factual
ABSTRACT
Microarray experiments produce expression patterns for thousands of genes at once. On the other hand, biomedical literature contains large amounts of gene regulation relationship information accumulated over the years. One obvious requirement is an automated way of comparing microarray data with the collection of known gene regulation relationships. Such an automated comparison is imperative because it can help biologists rapidly understand the context of a given microarray experiment. In addition, the consistency measure can be used to either validate or refute the hypothesis being tested using the microarray experiment. In this paper we present a systematic way of examining the consistency between a given set of microarray data and known gene regulation relationships. We first introduce a simple gene regulation network model with two separate algorithms designed to isolate a maximally consistent network. Subsequently, we extend the model to take into account multiple regulating factors for a single gene while highlighting both consistencies and inconsistencies. We illustrate the effectiveness of our approach with two practical examples, one that picks the peroxisome proliferator-activated receptor (PPAR) pathway as highly consistent from multiple pathways of Kyoto encyclopedia of genes and genomes (KEGG), and another that isolates key regulatory relationships involving nfkb1 and others known for macrophage's counter response to inflammation.
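The basic idea of checking expression data against known regulation relationships can be sketched as follows, under strong simplifying assumptions: each gene's change is discretised to +1 (up), -1 (down) or 0 (unchanged), an activation is consistent when regulator and target move in the same direction, and an inhibition is consistent when they move in opposite directions. This is not the paper's algorithm for isolating a maximally consistent network; the names and data below are hypothetical.

```python
def edge_consistent(sign, regulator_change, target_change):
    """Return True/False for an informative edge, or None if either gene is unchanged.

    sign is +1 for activation and -1 for inhibition; changes are +1, -1 or 0.
    """
    if regulator_change == 0 or target_change == 0:
        return None  # the data say nothing about this relationship
    return regulator_change * sign == target_change


def consistency_counts(relations, changes):
    """Count consistent and inconsistent known relationships.

    relations: iterable of (regulator, target, sign) triples.
    changes: dict mapping gene name to its discretised expression change.
    """
    consistent = 0
    inconsistent = 0
    for regulator, target, sign in relations:
        verdict = edge_consistent(
            sign, changes.get(regulator, 0), changes.get(target, 0)
        )
        if verdict is True:
            consistent += 1
        elif verdict is False:
            inconsistent += 1
    return consistent, inconsistent


# Hypothetical relationships and discretised expression changes.
known_relations = [
    ("geneA", "geneB", +1),  # geneA activates geneB
    ("geneA", "geneC", -1),  # geneA inhibits geneC
    ("geneB", "geneD", +1),  # geneB activates geneD
]
observed_changes = {"geneA": +1, "geneB": +1, "geneC": +1, "geneD": 0}
print(consistency_counts(known_relations, observed_changes))
```

Edges touching an unchanged or unmeasured gene are left out of both counts, so the two numbers describe only the relationships the experiment can actually speak to.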