Results 1 - 20 of 686
1.
BMC Bioinformatics ; 20(1): 454, 2019 Sep 05.
Article in English | MEDLINE | ID: mdl-31488049

ABSTRACT

BACKGROUND: As genome sequencing projects grow rapidly, the diversity of organisms with recently assembled genome sequences peaks at an unprecedented scale, thereby highlighting the need to make gene functional annotations fast and efficient. However, the (high) quality of such annotations must be guaranteed, as this is the first indicator of the genomic potential of every organism. Automatic procedures help accelerate the annotation process, but they decrease the confidence in and reliability of the outcomes. Manually curating a genome-wide annotation of the functions of genes, enzymes and transporter proteins is a highly time-consuming, tedious and impractical task, even for the most proficient curator. Hence, a semi-automated procedure, which balances the two approaches, will increase the reliability of the annotation while speeding up the process. In fact, a prior analysis of the annotation algorithm may improve its performance by tuning its parameters, hastening the downstream processing and the manual curation of assigning functions to genes encoding proteins. RESULTS: Here, SamPler, a novel strategy to select parameters for gene functional annotation routines, is presented. This semi-automated method is based on the manual curation of a randomly selected set of genes/proteins. Then, in a multi-dimensional array, this sample is used to assess the automatic annotations for all possible combinations of the algorithm's parameters. These assessments allow creating an array of confusion matrices, for which several metrics are calculated (accuracy, precision and negative predictive value) and used to reach optimal values for the parameters. CONCLUSIONS: The potential of this methodology is demonstrated with four genome functional annotations performed in merlin, an in-house user-friendly computational framework for genome-scale metabolic annotation and model reconstruction. For that, SamPler was implemented as a new plugin for the merlin tool.


Subjects
Algorithms , Molecular Sequence Annotation/methods , Bacteria/genetics , Chromosome Mapping , Databases, Protein , Reproducibility of Results
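The SamPler abstract above describes scoring automatic annotations against a manually curated sample by building confusion matrices for each parameter combination and computing accuracy, precision and negative predictive value. The following is a minimal sketch of that idea, not the plugin's actual implementation: it assumes a single hypothetical parameter (a score threshold above which an automatic annotation is accepted), and the function names are illustrative.

```python
def confusion_counts(predicted, curated):
    """Count (TP, FP, TN, FN) for boolean predictions against curated labels."""
    pairs = list(zip(predicted, curated))
    tp = sum(p and c for p, c in pairs)
    fp = sum(p and not c for p, c in pairs)
    tn = sum(not p and not c for p, c in pairs)
    fn = sum(not p and c for p, c in pairs)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision and negative predictive value (0.0 if undefined)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,
    }


def best_threshold(scores, curated_correct, thresholds):
    """Evaluate every candidate threshold (hypothetical parameter) and
    return the one with the highest accuracy plus all metric tables."""
    results = {}
    for t in thresholds:
        predicted = [s >= t for s in scores]  # accept annotation if score >= t
        results[t] = metrics(*confusion_counts(predicted, curated_correct))
    best = max(results, key=lambda t: results[t]["accuracy"])
    return best, results


# Toy curated sample: annotation scores and whether the curator agreed.
scores = [0.9, 0.8, 0.4, 0.3]
curated = [True, True, False, False]
best, table = best_threshold(scores, curated, [0.2, 0.5, 0.95])
```

Selecting on accuracy alone is a simplification; the abstract indicates that several metrics are used together to reach the parameter values.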
2.
Nat Protoc ; 14(10): 3013-3031, 2019 10.
Article in English | MEDLINE | ID: mdl-31520072

ABSTRACT

Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a 'bait' gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR-Cas systems using the 'CRISPRicity' metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.


Subjects
Computational Biology/methods , Genes, Archaeal , Genes, Bacterial , Genomics/methods , CRISPR-Cas Systems , Genome, Archaeal , Genome, Bacterial , Molecular Sequence Annotation/methods , Open Reading Frames , Operon
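The protocol abstract above ranks candidate genes by how often they occur inside versus outside the bait neighborhoods. A rough sketch of one such relevance score follows. It is an assumption made for illustration, not the published metric: it uses only the fraction of a gene family's occurrences that fall inside bait neighborhoods and ignores the distance-to-bait component the abstract also mentions. All names and counts are hypothetical.

```python
def neighborhood_fraction(in_neighborhood, total):
    """Fraction of a gene family's occurrences located in bait neighborhoods."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= in_neighborhood <= total:
        raise ValueError("in_neighborhood must lie between 0 and total")
    return in_neighborhood / total


def rank_candidates(counts):
    """counts maps family -> (occurrences near baits, occurrences overall).
    Returns (family, fraction) pairs, highest fraction first."""
    scored = {fam: neighborhood_fraction(i, n) for fam, (i, n) in counts.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts: famA is found almost exclusively next to baits.
ranked = rank_candidates({"famA": (9, 10), "famB": (1, 100)})
```

A family that is abundant across the database but rarely near a bait scores low, which is the intuition behind separating bait-specific genes from widespread ones.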
3.
BMC Bioinformatics ; 20(1): 473, 2019 Sep 14.
Article in English | MEDLINE | ID: mdl-31521110

ABSTRACT

BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite . CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.


Subjects
Molecular Sequence Annotation/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Markov Chains
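The HH-suite abstract above refers to the Viterbi algorithm. For readers unfamiliar with it, the sketch below shows the textbook scalar, log-space Viterbi recursion for a plain HMM. It is deliberately not the profile-profile, SIMD-vectorized variant the abstract describes, and the two-state model is a made-up toy example.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best log-probability, best state path) for the observations."""
    if not obs:
        raise ValueError("obs must not be empty")
    # best[s]: log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict of back-pointers per step after the first
    for o in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
            pointers[s] = p
        back.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):  # trace the best path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


def to_log(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in table.items()}


# Hypothetical toy model: state A mostly emits 'x', state B mostly emits 'y'.
states = ("A", "B")
log_start = to_log({"A": 0.5, "B": 0.5})
log_trans = {"A": to_log({"A": 0.8, "B": 0.2}), "B": to_log({"A": 0.2, "B": 0.8})}
log_emit = {"A": to_log({"x": 0.9, "y": 0.1}), "B": to_log({"x": 0.1, "y": 0.9})}

score, path = viterbi("xxyy", states, log_start, log_trans, log_emit)
```

Working in log-space avoids numerical underflow on long sequences.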
4.
Nat Commun ; 10(1): 3100, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31308405

ABSTRACT

Of the 473 genes in the genome of the bacterium with the smallest genome generated to date, 149 genes have unknown function, emphasising a universal problem: less than 1% of proteins have experimentally determined annotations. Here, we combine the results from state-of-the-art in silico methods for functional annotation and assign functions to 66 of the 149 proteins. Proteins that are still not annotated lack orthologues, lack protein domains, and/or are membrane proteins. Twenty-four likely transporter proteins are identified, indicating the importance of nutrient uptake into and waste disposal out of the minimal bacterial cell in a nutrient-rich environment after removal of metabolic enzymes. Hence, the environment shapes the nature of a minimal genome. Our findings also show that the combination of multiple different state-of-the-art in silico methods for annotating proteins is able to predict functions, even for difficult-to-characterise proteins, and to identify crucial gaps for further development.


Subjects
Adaptation, Biological/genetics , Bacteria/genetics , Genome, Bacterial/genetics , Computational Biology/methods , Genes, Essential/genetics , Molecular Sequence Annotation/methods , Software
5.
Methods Mol Biol ; 1962: 29-51, 2019.
Article in English | MEDLINE | ID: mdl-31020553

ABSTRACT

The Genome Sequence Annotation Server (GenSAS, https://www.gensas.org ) is a secure, web-based genome annotation platform for structural and functional annotation, as well as manual curation. Requiring no installation by users, GenSAS integrates popular command line-based, annotation tools under a single, easy-to-use, online interface. GenSAS integrates JBrowse and Apollo, so users can view annotation data and manually curate gene models. Users are guided step by step through the annotation process by embedded instructions and a more in-depth GenSAS User's Guide. In addition to a genome assembly file, users can also upload organism-specific transcript, protein, and RNA-seq read evidence for use in the annotation process. The latest versions of the NCBI RefSeq transcript and protein databases and the SwissProt and TrEMBL protein databases are provided for all users. GenSAS projects can be shared with other GenSAS users enabling collaborative annotation. Once annotation is complete, GenSAS generates the final files of the annotated gene models in common file formats for use with other annotation tools, submission to a repository, and use in publications.


Subjects
Databases, Genetic , Genome , Molecular Sequence Annotation/methods , Software , Data Curation , Eukaryota , Internet , Sequence Analysis, RNA , Species Specificity , User-Computer Interface
6.
Methods Mol Biol ; 1962: 53-64, 2019.
Article in English | MEDLINE | ID: mdl-31020554

ABSTRACT

FunGAP is a Python-wrapped fungal genome annotation pipeline running under the Linux/Unix operating system. The annotation procedure used in FunGAP requires two inputs: a genome assembly and RNA-seq reads. FunGAP aims to predict the most feasible gene from all plausible gene models obtained from various gene prediction programs using multiple strategies such as ab initio, EST-, and/or homology-based methods. This guide covers how to run FunGAP from the command line and use various options for practical gene prediction. Users can choose options for quality control of the input sequences, selection of the model database, filtration of predicted gene models, and post-processing steps such as checking genome completeness and transposable elements. Using FunGAP, the user will acquire a high-quality fungal gene prediction for post-genome sequencing analysis.


Subjects
Genes, Fungal , Molecular Sequence Annotation/methods , Software , Databases, Genetic , Genome, Fungal , Sequence Analysis, RNA , User-Computer Interface
7.
Methods Mol Biol ; 1962: 65-95, 2019.
Article in English | MEDLINE | ID: mdl-31020555

ABSTRACT

BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.


Subjects
Genome , Molecular Sequence Annotation/methods , Software , Amino Acid Sequence , Genomics/methods , Internet , User-Computer Interface
8.
Methods Mol Biol ; 1962: 139-160, 2019.
Article in English | MEDLINE | ID: mdl-31020558

ABSTRACT

Comparing multiple related genomes can help to improve their structural annotation. The accuracy and consistency of the predicted exon-intron structures of the protein coding genes can be higher when considering all genomes at once rather than annotating one genome at a time. The comparative gene prediction algorithm of AUGUSTUS performs such a multi-genome annotation. A multiple alignment of genomes is used to exploit evolutionary clues to conservation and negative selection. Further, AUGUSTUS exploits the fact that orthologous genes typically have congruent exon-intron structures. Comparative AUGUSTUS simultaneously predicts the genes in all input genomes. In this chapter we walk the reader through a small example from eight vertebrate species, including the construction of an alignment of the input genomes and how to integrate RNA-Seq evidence from multiple species for gene finding.


Subjects
Algorithms , Genome , Molecular Sequence Annotation/methods , Vertebrates/genetics , Animals , Computational Biology/methods , Databases, Genetic , Evolution, Molecular , Sequence Analysis, RNA/methods , User-Computer Interface
9.
Methods Mol Biol ; 1962: 179-191, 2019.
Article in English | MEDLINE | ID: mdl-31020560

ABSTRACT

Alignment-based gene identification methods utilize sequence conservation between orthologous protein-coding genes to annotate genes in newly sequenced genomes. CESAR is an approach that makes use of existing genome alignments to transfer genes from one genome to other aligned genomes, and thus generates comparative gene annotations. To accurately detect conserved exons that exhibit an intact reading frame and consensus splice sites, CESAR produces a new alignment between orthologous exons, taking information about the exon's reading frame and splice site positions into account. Furthermore, CESAR is able to detect most evolutionary splice site shifts, which helps to annotate exon boundaries at high precision. Here, we describe how to apply CESAR to generate comparative gene annotations for one or many species, and discuss the strengths and limitations of this approach. CESAR is available at https://github.com/hillerlab/CESAR2.0 .


Subjects
Exons , Molecular Sequence Annotation/methods , RNA Splice Sites , Sequence Analysis, DNA/methods , Software , Animals , Base Sequence , Conserved Sequence , Data Display , Evolution, Molecular , Genome , Genomics/methods , Humans , Mice , Reading Frames
10.
Methods Mol Biol ; 1962: 193-206, 2019.
Article in English | MEDLINE | ID: mdl-31020561

ABSTRACT

Scipio and WebScipio are homology-based gene prediction software designed for annotating multigenic families and for transferring annotations from one species to closely related species. The strengths include the power to cope with sequencing-related problems such as sequencing errors and assemblies with short contigs, but also the ability to correctly predict genes with unusually long introns and/or rather short exons. WebScipio is connected to diArk, the largest collection of eukaryotic genome assemblies, and thereby offers a very convenient way to correct existing annotations and to extend protein family datasets. WebScipio is also a key resource for researchers interested in mutually exclusive splicing, allowing users to search for alternative exons not only in introns but also in up- and downstream regions in case of incompleteness of the search sequence. In this chapter, I describe how to use Scipio and WebScipio, keeping a first-time user in mind.


Subjects
Exons , Genomics/methods , Molecular Sequence Annotation/methods , Software , Alternative Splicing , Databases, Genetic , Multigene Family , Species Specificity , User-Computer Interface
11.
Methods Mol Biol ; 1962: 215-226, 2019.
Article in English | MEDLINE | ID: mdl-31020563

ABSTRACT

DDBJ Fast Annotation and Submission Tool (DFAST) is a genome annotation pipeline for prokaryotes, which also assists data submission to the public sequence database. It is available both as a web service and as a stand-alone tool that runs on local machines. DFAST can annotate a typical-sized bacterial genome within 5 min. The default annotation workflow contains a gene prediction phase for protein coding sequence, rRNA, tRNA, and CRISPR, and a functional annotation phase to infer protein functions. DFAST generates result files in standard annotation formats and data files for submission to DNA Data Bank of Japan (DDBJ). In this chapter, the annotation workflow and applications of DFAST are introduced.


Subjects
Databases, Nucleic Acid , Molecular Sequence Annotation/methods , Prokaryotic Cells , Software , Clustered Regularly Interspaced Short Palindromic Repeats , Data Display , Genome, Bacterial , Internet , Proteins/genetics , Pseudogenes , Publishing , RNA, Ribosomal , RNA, Transfer , Workflow
12.
Methods Mol Biol ; 1962: 227-245, 2019.
Article in English | MEDLINE | ID: mdl-31020564

ABSTRACT

Genomics drives the current progress in molecular biology, generating unprecedented volumes of data. The scientific value of these sequences depends on the ability to evaluate their completeness using a biologically meaningful approach. Here, we describe the use of the BUSCO tool suite to assess the completeness of genomes, gene sets, and transcriptomes, using their gene content as a complementary method to common technical metrics. The chapter introduces the concept of universal single-copy genes, which underlies the BUSCO methodology, covers the basic requirements to set up the tool, and provides guidelines to properly design the analyses, run the assessments, and interpret and utilize the results.


Subjects
Genomics/methods , Molecular Sequence Annotation/methods , Software , Databases, Genetic , Gene Dosage , Genome , Internet , Markov Chains , Transcriptome
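The BUSCO abstract above describes assessing completeness from the content of universal single-copy genes. As a small illustration of how such an assessment can be summarized, the sketch below turns counts of expected genes found complete (single-copy or duplicated), fragmented or missing into percentages. The counts are invented, and the one-line notation is only assumed to resemble the tool's short summary; it is not taken from the abstract.

```python
def busco_style_summary(single, duplicated, fragmented, missing):
    """Format completeness percentages from counts of expected marker genes."""
    n = single + duplicated + fragmented + missing
    if n == 0:
        raise ValueError("at least one marker gene is required")

    def pct(count):
        return 100.0 * count / n

    complete = single + duplicated  # complete = single-copy + duplicated
    return (
        f"C:{pct(complete):.1f}%[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{n}"
    )


# Invented counts for a 100-gene marker set.
summary = busco_style_summary(single=90, duplicated=5, fragmented=3, missing=2)
```

A high duplicated fraction can hint at assembly redundancy, while a high missing fraction suggests an incomplete assembly or gene set.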
13.
Methods Mol Biol ; 1962: 269-281, 2019.
Article in English | MEDLINE | ID: mdl-31020567

ABSTRACT

Comprehensive structural characterization of protein-coding gene repertoires is a crucial step to identify differences and commonalities in comparative genomics contexts. This requires a descriptive set of standardized parameters as well as summary statistics of, e.g., gene lengths and exon counts. We developed the tool COGNATE to gather this data from a given structural annotation file in combination with the corresponding genome assembly with a single simple command line call. COGNATE relies on clearly stated parameter definitions and thus serves to enhance dataset comparability. Here, it is shown how the tool can be used; special attention is given to input formatting.


Subjects
Genomics/methods , Molecular Sequence Annotation/methods , Software , Exons , Genome , Internet
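The COGNATE abstract above mentions summary statistics such as gene lengths and exon counts derived from a structural annotation file. The sketch below shows how such numbers could be gathered from GFF3-style lines. It is an independent illustration, not COGNATE's code or its parameter definitions, and it assumes tab-separated nine-column records in which exons carry a `Parent=` attribute.

```python
from statistics import mean, median


def summarize_gff3(lines):
    """Collect gene lengths and per-transcript exon counts from GFF3-style lines."""
    gene_lengths = []
    exons_per_parent = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature record
        ftype, start, end, attrs = cols[2], int(cols[3]), int(cols[4]), cols[8]
        if ftype == "gene":
            gene_lengths.append(end - start + 1)  # coordinates are 1-based, inclusive
        elif ftype == "exon":
            fields = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
            for parent in fields.get("Parent", "").split(","):
                if parent:
                    exons_per_parent[parent] = exons_per_parent.get(parent, 0) + 1
    counts = list(exons_per_parent.values())
    return {
        "genes": len(gene_lengths),
        "mean_gene_length": mean(gene_lengths) if gene_lengths else None,
        "median_gene_length": median(gene_lengths) if gene_lengths else None,
        "mean_exons_per_transcript": mean(counts) if counts else None,
    }


# Hypothetical three-record annotation.
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tsrc\texon\t1\t40\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsrc\texon\t60\t100\t.\t+\t.\tID=e2;Parent=t1",
]
stats = summarize_gff3(example)
```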
14.
Methods Mol Biol ; 1958: 47-71, 2019.
Article in English | MEDLINE | ID: mdl-30945213

ABSTRACT

Secondary structure elements (SSEs) are inherent parts of protein structures, and their arrangement is characteristic for each protein family. Therefore, annotation of SSEs can facilitate orientation in the vast number of homologous structures which is now available for many protein families. It also provides a way to identify and annotate the key regions, like active sites and channels, and subsequently answer the key research questions, such as understanding of molecular function and its variability. This chapter introduces the concept of SSE annotation and describes the workflow for obtaining SSE annotation for the members of a selected protein family using program SecStrAnnotator.


Subjects
Amino Acid Motifs , Computational Biology/methods , Molecular Sequence Annotation/methods , Proteins/chemistry , Algorithms , Catalytic Domain/genetics , Proteins/genetics , Software
15.
Methods Cell Biol ; 151: 115-126, 2019.
Article in English | MEDLINE | ID: mdl-30948003

ABSTRACT

Echinoderms have some of the most complete reconstructed developmental gene regulatory networks (GRN) of any embryo, accounting for the formation of most embryo tissues and organs. Yet, many nodes (genes and regulators) and their regulatory interactions are still to be uncovered. Traditionally, knockdown/knockout experiments are performed to determine regulator-gene interactions, which are individually validated by cis-regulatory analysis. Differential RNA-seq, combined with perturbation analysis, allows for genome-wide reconstruction of a GRN around given regulators; however, this level of resolution cannot determine direct interactions. ChIP-chip or ChIP-seq is better equipped for determining, genome-wide, whether binding of a given transcription factor (TF) to cis-regulatory elements occurs. Antibodies for the TFs of interest must be available, and if not, this presents a limiting factor. ATAC-seq identifies regions of open chromatin that are typically trimethylated at H3K4, H3K36 and H3K79 (Kouzarides, 2007), for a given time point, condition, or tissue. This technology combined with RNA-seq and perturbation analysis provides high resolution of the possible functional interactions occurring during development. Additionally, ATAC-seq is less expensive than ChIP-seq, requires less starting material, and provides a global view of regulatory regions. This chapter provides detailed steps to identify potential regulatory relationships between the nodes of a GRN, given a well assembled genome, annotated with gene models, and ATAC-seq data combined with RNA-seq and knockdown experiments.


Subjects
Gene Regulatory Networks/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Animals , Chromatin/genetics , Echinodermata/genetics , Echinodermata/growth & development , Molecular Sequence Annotation/methods , RNA/genetics
16.
Methods Cell Biol ; 151: 65-88, 2019.
Article in English | MEDLINE | ID: mdl-30948032

ABSTRACT

Echinoderms are important research models for a wide range of biological questions. In particular, echinoderm embryos are exemplary models for dissecting the molecular and cellular processes that drive development and testing how these processes can be modified through evolution to produce the extensive morphological diversity observed in the phylum. Modern attempts to characterize these processes depend on some level of genomic analysis; from querying annotated gene sets to functional genomics experiments to identify candidate cis-regulatory sequences. Given how essential these data have become, it is important that researchers using available datasets or performing their own genome-scale experiments understand the nature and limitations of echinoderm genomic analyses. In this chapter we highlight the current state of echinoderm genomic data and provide methodological considerations for common approaches, including analysis of transcriptome and functional genomics datasets.


Subjects
Echinodermata/genetics , Embryonic Development/genetics , Gene Expression Profiling/methods , Genomics/methods , Animals , Echinodermata/growth & development , Genome/genetics , Genomics/trends , Molecular Sequence Annotation/methods
17.
Gene ; 699: 43-53, 2019 May 30.
Article in English | MEDLINE | ID: mdl-30858139

ABSTRACT

Ribes diacanthum Pall. (Grossulariaceae), a species with dioecious, unisexual flowers, has great economic and medicinal value and is widespread in northeastern China. After the initiation of intact floral organs, male flowers develop an abnormal stigma, and female flowers develop fading stamens incapable of pollination. To explore the genes governing dioecious unisexual floral development in R. diacanthum, we used high-throughput sequencing to obtain transcriptome data for male and female inflorescences and analyzed expression patterns of candidate genes at various developmental stages of male and female flowers. The combined transcriptomic data were successfully assembled into 72,791 transcripts (N50 = 1467) and 48,600 unigenes (N50 = 1378); 62% of the unigenes were annotated by the NR, Swissprot, KEGG, GO and COG databases based on orthology. Analysis of the differentially expressed genes (DEGs) showed that 2785 annotated genes were differentially expressed, and significantly more genes were male-biased than were female-biased in expression in the inflorescences. Both male and female flowers were found to be complete hermaphroditic flowers during early floral development; sex determination was a late event. Several MADS-box genes such as comp53946_c0 (putative AGL11) might be directly correlated with the establishment of sexual dimorphism. The sex-specific transcripts and genes identified may regulate coordinated events during floral development and be involved in the molecular regulation of dioecious, unisexual floral development in R. diacanthum. The transcriptome from the male and the female inflorescences will provide a valuable reference for further functional research on the development of dioecious, unisexual flowers.


Subjects
Flowers/genetics , Inflorescence/genetics , Ribes/genetics , Transcriptome/genetics , China , Gene Expression Profiling/methods , Gene Expression Regulation, Plant/genetics , Genes, Plant/genetics , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Annotation/methods , Plant Proteins/genetics , Pollination/genetics
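The Ribes diacanthum abstract above reports N50 values for the assembled transcripts and unigenes. N50 is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation, using invented lengths rather than the study's data:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:  # reached half of the total bases
            return length


# Invented lengths: total is 54, and 10 + 9 + 8 = 27 reaches half.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```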
18.
Genes Genomics ; 41(5): 599-612, 2019 05.
Article in English | MEDLINE | ID: mdl-30840180

ABSTRACT

BACKGROUND: Sinonovacula constricta is an economically important bivalve species in China, Korea and Japan that widely resides in estuarine and coastal areas where salinity fluctuates rapidly. However, little is known about its adaptation mechanisms to acute salt stresses. OBJECTIVE: To reveal the underlying molecular mechanisms involved in acute salt stresses in juvenile S. constricta. METHODS: Nine cDNA libraries (in triplicate for each trial) were established from juvenile S. constricta, which were subjected to low salinity (5 psu), optimal salinity (15 psu) and high salinity (25 psu) for 6 h, respectively. RESULTS: Illumina sequencing generated a total of 478,587,310 clean reads, which were assembled into 427,057 transcripts of 246,672 unigenes. Compared with the control, 1259 and 2163 differentially expressed genes (DEGs) were identified under acute low and high salt stresses, respectively. GO and KEGG enrichment analyses of DEGs revealed that several key metabolic modulations were mainly responsible for the response to acute salt stresses. According to the significantly highlighted KEGG pathways, some key DEGs were identified and discussed in detail. Notably, some potential osmolytes were further inferred on this basis. CONCLUSION: Here, we report a unique comparative transcriptome analysis of juvenile S. constricta in response to acute salt stresses. The identified DEGs and their significantly enriched GO terms and KEGG pathways are critical for understanding and further investigating the underlying physical and biochemical performance, and could ultimately facilitate S. constricta breeding. In addition, the transcriptome data greatly enrich the genetic information available for S. constricta, which is valuable for promoting molecular biology research on this species.


Subjects
Bivalvia/genetics , Gene Expression Profiling/methods , Salt Stress/genetics , Adaptation, Biological/genetics , Animals , Bivalvia/physiology , China , Japan , Molecular Sequence Annotation/methods , Republic of Korea , Salt Tolerance/genetics , Salt Tolerance/physiology
19.
PLoS Comput Biol ; 15(2): e1006790, 2019 02.
Article in English | MEDLINE | ID: mdl-30726205

ABSTRACT

Genome annotation is the process of identifying the location and function of a genome's encoded features. Improving the biological accuracy of annotation is a complex and iterative process requiring researchers to review and incorporate multiple sources of information such as transcriptome alignments, predictive models based on sequence profiles, and comparisons to features found in related organisms. Because rapidly decreasing costs are enabling an ever-growing number of scientists to incorporate sequencing as a routine laboratory technique, there is widespread demand for tools that can assist in the deliberative analytical review of genomic information. To this end, we present Apollo, an open source software package that enables researchers to efficiently inspect and refine the precise structure and role of genomic features in a graphical browser-based platform. Some of Apollo's newer user interface features include support for real-time collaboration, allowing distributed users to simultaneously edit the same encoded features while also instantly seeing the updates made by other researchers on the same region in a manner similar to Google Docs. Its technical architecture enables Apollo to be integrated into multiple existing genomic analysis pipelines and heterogeneous laboratory workflow platforms. Finally, we consider the implications that Apollo and related applications may have on how the results of genome research are published and made accessible.


Subjects
Computational Biology/methods , Molecular Sequence Annotation/methods , Chromosome Mapping/methods , Database Management Systems , Genome/genetics , Genomics , Information Storage and Retrieval , Internet , Software , User-Computer Interface
20.
BMC Bioinformatics ; 19(Suppl 13): 551, 2019 Feb 04.
Article in English | MEDLINE | ID: mdl-30717662

ABSTRACT

BACKGROUND: Small open reading frames (smORF/sORFs) that encode short protein sequences are often overlooked during the standard gene prediction process thus leading to many sORFs being left undiscovered and/or misannotated. For many genomes, a second round of sORF targeted gene prediction can complement the existing annotation. In this study, we specifically targeted the identification of ORFs encoding for 80 amino acid residues or less from 31 fungal genomes. We then compared the predicted sORFs and analysed those that are highly conserved among the genomes. RESULTS: A first set of sORFs was identified from existing annotations that fitted the maximum of 80 residues criterion. A second set was predicted using parameters that specifically searched for ORF candidates of 80 codons or less in the exonic, intronic and intergenic sequences of the subject genomes. A total of 1986 conserved sORFs were predicted and characterized. CONCLUSIONS: It is evident that numerous open reading frames that could potentially encode for polypeptides consisting of 80 amino acid residues or less are overlooked during standard gene prediction and annotation. From our results, additional targeted reannotation of genomes is clearly able to complement standard genome annotation to identify sORFs. Due to the lack of, and limitations with experimental validation, we propose that a simple conservation analysis can provide an acceptable means of ensuring that the predicted sORFs are sufficiently clear of gene prediction artefacts.


Subjects
Computational Biology/methods , Conserved Sequence , Genome, Fungal , Molecular Sequence Annotation/methods , Open Reading Frames/genetics , Amino Acid Sequence , Gene Ontology , Phylogeny
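The sORF abstract above describes a targeted search for ORF candidates of 80 codons or less. A simplified sketch of such a scan follows. It is an assumption-laden illustration rather than the study's pipeline: it checks only the three forward reading frames, treats ATG as the sole start codon, and counts coding codons with the stop codon excluded.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_codons=80, min_codons=1):
    """Return (start, end, n_codons) for forward-strand ORFs within the size limit.

    start is the 0-based index of the ATG; end is the index just past the stop.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # includes ATG, excludes the stop
                if min_codons <= n_codons <= max_codons:
                    found.append((start, i + 3, n_codons))
                start = None  # look for the next ORF in this frame
    return found


# Toy sequence containing ATG AAA TTT TAG in the third reading frame.
hits = find_sorfs("CCATGAAATTTTAGCC")
```

A fuller scan would also cover the reverse complement and, as the abstract notes, would still need a conservation check to separate plausible sORFs from prediction artefacts.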