Abstract
We present Conrad, the first comparative gene predictor based on semi-Markov conditional random fields (SMCRFs). Unlike the best standalone gene predictors, which are based on generalized hidden Markov models (GHMMs) and trained by maximum likelihood, Conrad is discriminatively trained to maximize annotation accuracy. In addition, unlike the best annotation pipelines, which rely on heuristic and ad hoc decision rules to combine standalone gene predictors with additional information such as ESTs and protein homology, Conrad encodes all sources of information as features and treats all features equally in the training and inference algorithms. Conrad outperforms the best standalone gene predictors in cross-validation and whole chromosome testing on two fungi with vastly different gene structures. The performance improvement arises from the SMCRF's discriminative training methods and its ability to easily incorporate diverse types of information by encoding them as feature functions. On Cryptococcus neoformans, configuring Conrad to reproduce the predictions of a two-species phylo-GHMM closely matches the performance of Twinscan. Enabling discriminative training increases performance, and adding new feature functions further increases performance, achieving a level of accuracy that is unprecedented for this organism. Similar results are obtained on Aspergillus nidulans when comparing Conrad with Fgenesh. SMCRFs are a promising framework for gene prediction because of their highly modular nature, which simplifies the process of designing and testing potential indicators of gene structure. Conrad's implementation of SMCRFs advances the state of the art in gene prediction in fungi and provides a robust platform for both current application and future research.
Subjects
Algorithms , Aspergillus nidulans/genetics , Cryptococcus neoformans/genetics , Fungal Genes , Software , Artificial Intelligence , Fungal Chromosomes , Discriminant Analysis , Likelihood Functions , Markov Chains , Reference Standards
Abstract
Whole-genome assembly is now used routinely to obtain high-quality draft sequence for the genomes of species with low levels of polymorphism. However, genome assembly remains extremely challenging for highly polymorphic species. The difficulty arises because two divergent haplotypes are sequenced together, making it difficult to distinguish alleles at the same locus from paralogs at different loci. We present here a method for assembling highly polymorphic diploid genomes that involves assembling the two haplotypes separately and then merging them to obtain a reference sequence. Our method was developed to assemble the genome of the sea squirt Ciona savignyi, which was sequenced to a depth of 12.7× from a single wild individual. By comparing finished clones of the two haplotypes, we determined that the sequenced individual had an extremely high heterozygosity rate, averaging 4.6% with significant regional variation and rearrangements at all physical scales. Applied to these data, our method produced a reference assembly covering 157 Mb, with N50 contig and scaffold sizes of 47 kb and 989 kb, respectively. Alignment of ESTs indicates that 88% of loci are present at least once and 81% exactly once in the reference assembly. Our method represented loci in a single copy more reliably and achieved greater contiguity than a conventional whole-genome assembly method.
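The N50 sizes quoted in this abstract follow the usual assembly-statistics definition: the length L such that contigs (or scaffolds) of length at least L together contain at least half of the assembled bases. The sketch below is only an illustration of that definition; the function name `n50` and the toy lengths are assumptions made here and are not taken from the paper or its data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the largest length L such that sequences of length >= L
    account for at least half of the total assembled length.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not data from the assembly).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

The same function applies to scaffold lengths; only the input list changes.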
Subjects
Algorithms , Genome , Urochordata/genetics , Animals , Base Sequence , Molecular Cloning/methods , Diploidy , Expressed Sequence Tags , Haplotypes/genetics , Heterozygote , Molecular Sequence Data , Polymerase Chain Reaction/methods , Reproducibility of Results
Abstract
With the recent completion of a high-quality sequence of the human genome, the challenge is now to understand the functional elements that it encodes. Comparative genomic analysis offers a powerful approach for finding such elements by identifying sequences that have been highly conserved during evolution. Here, we propose an initial strategy for detecting such regions by generating low-redundancy sequence from a collection of 16 eutherian mammals, beyond the 7 for which genome sequence data are already available. We show that such sequence can be accurately aligned to the human genome and used to identify most of the highly conserved regions. Although not a long-term substitute for generating high-quality genomic sequences from many mammalian species, this strategy represents a practical initial approach for rapidly annotating the most evolutionarily conserved sequences in the human genome, providing a key resource for the systematic study of human genome function.