ABSTRACT
Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and assessments were then performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one consisted of 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
Subjects
Genome, Genomics/methods, Sequence Alignment/methods, Software, Animals, Computational Biology/methods, Computer Simulation, Datasets as Topic, Genome-Wide Association Study, Humans, Mammals/genetics, Phylogeny, Reproducibility of Results
ABSTRACT
Low-cost short-read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/.
Subjects
Genome/physiology, Genomics/methods, DNA Sequence Analysis/methods
ABSTRACT
Using the ability of poorly differentiated cells to natively internalize fragments of extracellular double-stranded DNA as a marker, we isolated a tumorigenic subpopulation present in Krebs-2 ascites that demonstrated the features of tumor-inducing cancer stem cells. By combining a TAMRA-labeled DNA probe with RNA-seq technology, we identified a set of 168 genes specifically expressed in TAMRA-positive cells (tumor-initiating stem cells); these genes remained silent in TAMRA-negative cancer cells. TAMRA+ cells displayed gene expression signatures characteristic of both stem cells and cancer cells. The observed expression differences between TAMRA+ and TAMRA- cells were validated by real-time PCR. The results corroborated the biological data indicating that TAMRA+ murine Krebs-2 tumor cells are tumor-initiating stem cells. The approach developed can be applied to profile any poorly differentiated cell type that is capable of inherent internalization of double-stranded DNA.