RESUMO
The major transcript variants of human protein-coding genes are annotated to a certain degree of accuracy combining manual curation, transcript data, and proteomics evidence. However, there is considerable disagreement on the annotation of about 2000 genes-they can be protein-coding, noncoding, or pseudogenes-and on the annotation of most of the predicted alternative transcripts. Pure transcriptome mapping approaches seem to be limited in discriminating functional expression from noise. These limitations have partially been overcome by dedicated algorithms to detect alternative spliced micro-exons and wobble splice variants. Recently, knowledge about splice mechanism and protein structure are incorporated into an algorithm to predict neighboring homologous exons, often spliced in a mutually exclusive manner. Predicted exons are evaluated by transcript data, structural compatibility, and evolutionary conservation, revealing hundreds of novel coding exons and splice mechanism re-assignments. The emerging human pan-genome is necessitating distinctive annotations incorporating differences between individuals and between populations.
Assuntos
Genoma Humano/genética , Proteínas/genética , Algoritmos , Processamento Alternativo/genética , Animais , Éxons/genética , Genômica/métodos , Humanos , Splicing de RNA/genética , Transcriptoma/genéticaRESUMO
Mutually exclusive splicing of exons is a mechanism of functional gene and protein diversification with pivotal roles in organismal development and diseases such as Timothy syndrome, cardiomyopathy and cancer in humans. In order to obtain a first genomewide estimate of the extent and biological role of mutually exclusive splicing in humans, we predicted and subsequently validated mutually exclusive exons (MXEs) using 515 publically available RNA-Seq datasets. Here, we provide evidence for the expression of over 855 MXEs, 42% of which represent novel exons, increasing the annotated human mutually exclusive exome more than fivefold. The data provide strong evidence for the existence of large and multi-cluster MXEs in higher vertebrates and offer new insights into MXE evolution. More than 82% of the MXE clusters are conserved in mammals, and five clusters have homologous clusters in Drosophila Finally, MXEs are significantly enriched in pathogenic mutations and their spatio-temporal expression might predict human disease pathology.
Assuntos
Splicing de RNA/genética , Animais , Análise por Conglomerados , Doença/genética , Evolução Molecular , Éxons/genética , Loci Gênicos , Genoma Humano , Humanos , Mamíferos/genética , Mutação/genética , Dobramento de Proteína , RNA Mensageiro/genética , RNA Mensageiro/metabolismoRESUMO
Eukaryotic genomes are the basis for understanding the complexity of life from populations to the molecular level. Recent technological innovations have revolutionized the speed of data generation enabling the sequencing of eukaryotic genomes and transcriptomes within days. The database diArk (http://www.diark.org) has been developed with the aim to provide access to all available assembled genomes and transcriptomes. In September 2014, diArk contains about 2600 eukaryotes with 6000 genome and transcriptome assemblies, of which 22% are not available via NCBI/ENA/DDBJ. Several indicators for the quality of the assemblies are provided to facilitate their comparison for selecting the most appropriate dataset for further studies. diArk has a user-friendly web interface with extensive options for filtering and browsing the sequenced eukaryotes. In this new version of the database we have also integrated species, for which transcriptome assemblies are available, and we provide more analyses of assemblies.
Assuntos
Bases de Dados Genéticas , Perfilação da Expressão Gênica , Genômica , Eucariotos/genética , Internet , Análise de Sequência de RNARESUMO
UNLABELLED: Waggawagga is a web-based tool for the comparative visualization of coiled-coil predictions and the detection of stable single α-helices (SAH domains). Overview schemes show the predicted coiled-coil regions found in the query sequence and provide sliders, which can be used to select segments for detailed helical wheel and helical net views. A window-based score has been developed to predict SAH domains. Export to several bitmap and vector graphics formats is supported. AVAILABILITY AND IMPLEMENTATION: http://waggawagga.motorprotein.de
Assuntos
Gráficos por Computador , Miosinas/química , Estrutura Secundária de Proteína , Software , Sequência de Aminoácidos , Humanos , Dados de Sequência MolecularRESUMO
Heterologous protein expression is an important method for analysing cellular functions of proteins, in genetic circuit engineering and in overexpressing proteins for biopharmaceutical applications and structural biology research. The degeneracy of the genetic code, which enables a single protein to be encoded by a multitude of synonymous gene sequences, plays an important role in regulating protein expression, but substantial uncertainty exists concerning the details of this phenomenon. Here we analyse the influence of a profiled codon usage adaptation approach on protein expression levels in the eukaryotic model organism Saccharomyces cerevisiae. We selected green fluorescent protein (GFP) and human α-synuclein (αSyn) as representatives for stable and intrinsically disordered proteins and representing a benchmark and a challenging test case. A new approach was implemented to design typical genes resembling the codon usage of any subset of endogenous genes. Using this approach, synthetic genes for GFP and αSyn were generated, heterologously expressed and evaluated in yeast. We demonstrate that GFP is expressed at high levels, and that the toxic αSyn can be adapted to endogenous, low-level expression. The new software is publicly available as a web-application for performing host-specific protein adaptations to a set of the most commonly used model organisms ( https://odysseus.motorprotein.de ).
Assuntos
Saccharomyces cerevisiae , Software , Códon/genética , Expressão Gênica , Proteínas de Fluorescência Verde/genética , Humanos , Saccharomyces cerevisiae/genéticaRESUMO
Coiled-coil regions were among the first protein motifs described structurally and theoretically. The simplicity of the motif promises that coiled-coil regions can be detected with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools with respect to the most comprehensive reference data set available, the entire Protein Data Bank, down to each amino acid and its secondary structure. Apart from the 30-fold difference in minimum and maximum number of coiled coils predicted the tools strongly vary in where they predict coiled-coil regions. Accordingly, there is a high number of false predictions and missed, true coiled-coil regions. The evaluation of the binary classification metrics in comparison with naïve coin-flip models and the calculation of the Matthews correlation coefficient, the most reliable performance metric for imbalanced data sets, suggests that the tested tools' performance is close to random. This implicates that the tools' predictions have only limited informative value. Coiled-coil predictions are often used to interpret biochemical data and are part of in-silico functional genome annotation. Our results indicate that these predictions should be treated very cautiously and need to be supported and validated by experimental evidence.
Assuntos
Motivos de Aminoácidos , Modelos Moleculares , Estrutura Secundária de Proteína , Sequência de Aminoácidos , Bases de Dados de Proteínas/estatística & dados numéricos , SoftwareRESUMO
Stable single-alpha helices (SAH-domains) function as rigid connectors and constant force springs between structural domains, and can provide contact surfaces for protein-protein and protein-RNA interactions. SAH-domains mainly consist of charged amino acids and are monomeric and stable in polar solutions, characteristics which distinguish them from coiled-coil domains and intrinsically disordered regions. Although the number of reported SAH-domains is steadily increasing, genome-wide analyses of SAH-domains in eukaryotic genomes are still missing. Here, we present Waggawagga-CLI, a command-line tool for predicting and analysing SAH-domains in protein sequence datasets. Using Waggawagga-CLI we predicted SAH-domains in 24 datasets from eukaryotes across the tree of life. SAH-domains were predicted in 0.5 to 3.5% of the protein-coding content per species. SAH-domains are particularly present in longer proteins supporting their function as structural building block in multi-domain proteins. In human, SAH-domains are mainly used as alternative building blocks not being present in all transcripts of a gene. Gene ontology analysis showed that yeast proteins with SAH-domains are particular enriched in macromolecular complex subunit organization, cellular component biogenesis and RNA metabolic processes, and that they have a strong nuclear and ribonucleoprotein complex localization and function in ribosome and nucleic acid binding. Human proteins with SAH-domains have roles in all types of RNA processing and cytoskeleton organization, and are predicted to function in RNA binding, protein binding involved in cell and cell-cell adhesion, and cytoskeletal protein binding. Waggawagga-CLI allows the user to adjust the stabilizing and destabilizing contribution of amino acid interactions in i,i+3 and i,i+4 spacings, and provides extensive flexibility for user-designed analyses.
Assuntos
Domínios Proteicos , Conjuntos de Dados como Assunto , Eucariotos , Evolução Molecular , Humanos , RNA Mensageiro/genética , Saccharomyces cerevisiae/metabolismo , Proteínas de Saccharomyces cerevisiae/metabolismoRESUMO
The diArk Eukaryotic Genome Database is a manually curated and updated repository of available eukaryotic genome and transcriptome assemblies. diArk is a key resource for researchers interested in comparative eukaryotic genomics, and the entry point to browsing sequenced eukaryotes in general and to find the most closely related species to the own organism of interest in particular. The exponentially increasing number of sequenced species demands sophisticated search and data presentation tools. In this chapter we describe how to navigate the diArk database keeping a first-time user in mind.
Assuntos
Bases de Dados Genéticas , Eucariotos/metabolismo , Genoma , Genômica , Transcriptoma , Biologia Computacional/métodos , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , NavegadorRESUMO
Stable single-alpha helices (SAHs) are versatile structural elements in many prokaryotic and eukaryotic proteins acting as semi-flexible linkers and constant force springs. This way SAH-domains function as part of the lever of many different myosins. Canonical myosin levers consist of one or several IQ-motifs to which light chains such as calmodulin bind. SAH-domains provide flexibility in length and stiffness to the myosin levers, and may be particularly suited for myosins working in crowded cellular environments. Although the function of the SAH-domains in human class-6 and class-10 myosins has well been characterised, the distribution of the SAH-domain in all myosin subfamilies and across the eukaryotic tree of life remained elusive. Here, we analysed the largest available myosin sequence dataset consisting of 7919 manually annotated myosin sequences from 938 species representing all major eukaryotic branches using the SAH-prediction algorithm of Waggawagga, a recently developed tool for the identification of SAH-domains. With this approach we identified SAH-domains in more than one third of the supposed 79 myosin subfamilies. Depending on the myosin class, the presence of SAH-domains can range from a few to almost all class members indicating complex patterns of independent and taxon-specific SAH-domain gain and loss.
Assuntos
Miosinas/metabolismo , Conformação Proteica em alfa-Hélice/fisiologia , Domínios Proteicos/fisiologia , Algoritmos , Sequência de Aminoácidos , Animais , Calmodulina/metabolismo , Drosophila , Humanos , Ligação ProteicaRESUMO
[This corrects the article DOI: 10.1371/journal.pone.0174639.].