ABSTRACT
Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
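The inheritance-consistency idea in this abstract can be illustrated with a much simpler per-site check. The sketch below is an assumption-laden simplification: it only tests whether each child's unphased genotype can be formed from one maternal and one paternal allele, whereas the study uses haplotype transmission across the whole pedigree. The function names `mendelian_consistent` and `site_is_consistent` are hypothetical and not taken from the paper.

```python
from itertools import product


def mendelian_consistent(mother, father, child):
    """Return True if the child's unphased genotype can be formed by
    taking one allele from the mother and one from the father."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


def site_is_consistent(mother, father, children):
    """A site passes only if every child is consistent with both parents."""
    return all(mendelian_consistent(mother, father, c) for c in children)


# Hypothetical genotypes at one site (alleles as strings).
mother = ("A", "G")
father = ("A", "A")
children = [("A", "A"), ("A", "G"), ("G", "A")]

print(site_is_consistent(mother, father, children))        # all children pass
print(mendelian_consistent(mother, father, ("G", "G")))    # father has no G allele
```

A per-site check like this cannot detect errors that happen to be Mendelian-consistent; using phased haplotypes across many children, as the abstract describes, is a stricter filter.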
Subjects
Genome, Human/genetics; Genomics; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Algorithms; Databases, Genetic; Exome/genetics; Genotype; Humans; INDEL Mutation/genetics; Pedigree; Polymorphism, Single Nucleotide; Software
ABSTRACT
Identifying large expansions of short tandem repeats (STRs), such as those that cause amyotrophic lateral sclerosis (ALS) and fragile X syndrome, is challenging for short-read whole-genome sequencing (WGS) data. A solution to this problem is an important step toward integrating WGS into precision medicine. We developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Compared against this truth data, ExpansionHunter correctly classified all (212/212, 95% CI [0.98, 1.00]) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2786/2789, 95% CI [0.997, 1.00]) of the wild-type samples were correctly classified as wild type by this method with the remaining three samples identified as possible expansions. We further applied our algorithm to a set of 152 samples in which every sample had one of eight different pathogenic repeat expansions, including those associated with fragile X syndrome, Friedreich's ataxia, and Huntington's disease, and correctly flagged all but one of the known repeat expansions. Thus, ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.
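The confidence intervals quoted above can be approximated with a standard binomial interval. The abstract does not say which method the authors used, so the following is only a sketch that assumes a Wilson score interval; the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% interval.
    """
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Counts taken from the abstract.
lo_exp, hi_exp = wilson_interval(212, 212)      # expanded samples
lo_wt, hi_wt = wilson_interval(2786, 2789)      # wild-type samples

print(f"expanded:  [{lo_exp:.2f}, {hi_exp:.2f}]")
print(f"wild type: [{lo_wt:.3f}, {hi_wt:.3f}]")
```

Under this assumption the lower bounds round to 0.98 and 0.997, in line with the reported values; an exact (Clopper-Pearson) interval would give similar numbers.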
Subjects
Amyotrophic Lateral Sclerosis/genetics; DNA Repeat Expansion; Whole Genome Sequencing/methods; Algorithms; C9orf72 Protein/genetics; Databases, Genetic; Humans; Precision Medicine; Sensitivity and Specificity; Software
ABSTRACT
All cancers carry somatic mutations. A subset of these somatic alterations, termed driver mutations, confer selective growth advantage and are implicated in cancer development, whereas the remainder are passengers. Here we have sequenced the genomes of a malignant melanoma and a lymphoblastoid cell line from the same person, providing the first comprehensive catalogue of somatic mutations from an individual cancer. The catalogue provides remarkable insights into the forces that have shaped this cancer genome. The dominant mutational signature reflects DNA damage due to ultraviolet light exposure, a known risk factor for malignant melanoma, whereas the uneven distribution of mutations across the genome, with a lower prevalence in gene footprints, indicates that DNA repair has been preferentially deployed towards transcribed regions. The results illustrate the power of a cancer genome sequence to reveal traces of the DNA damage, repair, mutation and selection processes that were operative years before the cancer became symptomatic.
Subjects
Genes, Neoplasm/genetics; Genome, Human/genetics; Mutation/genetics; Neoplasms/genetics; Adult; Cell Line, Tumor; DNA Damage/genetics; DNA Mutational Analysis; DNA Repair/genetics; Gene Dosage/genetics; Humans; Loss of Heterozygosity/genetics; Male; Melanoma/etiology; Melanoma/genetics; MicroRNAs/genetics; Mutagenesis, Insertional/genetics; Neoplasms/etiology; Polymorphism, Single Nucleotide/genetics; Precision Medicine; Sequence Deletion/genetics; Ultraviolet Rays
ABSTRACT
DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
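The ">30x average depth of paired 35-base reads" figure follows the usual relation depth = (number of reads × read length) / genome size. The sketch below works that arithmetic through; the genome size of 3.1 Gb is an assumed round figure for the human genome, not a value from the abstract, and the function names are hypothetical.

```python
import math


def mean_depth(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


GENOME_SIZE = 3_100_000_000  # assumed human genome size in bases
READ_LENGTH = 35             # read length stated in the abstract

n = reads_needed(30, READ_LENGTH, GENOME_SIZE)
print(f"{n:,} reads give {mean_depth(n, READ_LENGTH, GENOME_SIZE):.2f}x depth")
```

With these assumptions, roughly 2.7 billion 35-base reads are needed for 30x depth, which shows why short-read resequencing depends on very high read counts per experiment.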
Subjects
Genome, Human/genetics; Genomics/methods; Sequence Analysis, DNA/methods; Chromosomes, Human, X/genetics; Consensus Sequence/genetics; Genomics/economics; Genotype; Humans; Male; Nigeria; Polymorphism, Single Nucleotide/genetics; Sensitivity and Specificity; Sequence Analysis, DNA/economics
ABSTRACT
To analyse the myogenic transcriptome and identify novel genes involved in muscle development in an in vivo context, we have constructed a muscle-specific cDNA library from GFP-expressing myoblasts purified by fluorescence-activated cell sorting of transgenic zebrafish embryos. We have generated 153,428 EST sequences from this library that have been clustered into consensus sequences, mapped to the genome assembly Zv6 and analysed for protein homology. Expression analysis of a randomly picked sample of clones using whole mount in situ hybridisation identified 30 genes that are expressed specifically within the myotome, one third of which represent novel sequences. These genes have been assigned to syn-expression groups. The sequencing of the myoblast-enriched cDNA library has significantly increased the number of zebrafish ESTs, facilitating the prediction of new spliced transcripts in the genome assembly and providing a transcriptome of an in vivo myoblast cell.
Subjects
Gene Expression Profiling/methods; Gene Library; Myoblasts/metabolism; Sequence Analysis, DNA/methods; Zebrafish/genetics; Animals; Animals, Genetically Modified; Embryo, Nonmammalian; Expressed Sequence Tags; Genomics/methods; Green Fluorescent Proteins/genetics; Green Fluorescent Proteins/metabolism; Organ Specificity/genetics; Zebrafish/embryology; Zebrafish/metabolism
ABSTRACT
The neotropical butterflies Heliconius melpomene and H. erato are Müllerian mimics that display the same warningly colored wing patterns in local populations, yet show pattern diversity between geographic regions. Linkage mapping has previously shown that convergent red wing phenotypes in these species are controlled by loci on homologous chromosomes. Here, AFLP bulk segregant analysis using H. melpomene crosses identified genetic markers tightly linked to two red wing-patterning loci. These markers were used to screen an H. melpomene BAC library, and a tile path was assembled spanning one locus completely and part of the second. Concurrently, a similar strategy was used to identify a BAC clone tightly linked to the locus controlling the mimetic red wing phenotypes in H. erato. A methionine-rich storage protein (MRSP) gene was identified within this BAC clone, and comparative genetic mapping shows that red wing color loci are in homologous regions of the genome of H. erato and H. melpomene. Subtle differences in these convergent phenotypes imply they evolved independently using somewhat different developmental routes, but are nonetheless regulated by the same switch locus. Genetic mapping of MRSP in a third related species, the "tiger" patterned H. numata, shows no association with wing patterning and no evidence for genomic translocation of wing-patterning loci.
Subjects
Adaptation, Biological/genetics; Butterflies/physiology; Evolution, Molecular; Genes, Insect; Genetic Variation; Wings, Animal/anatomy & histology; Amplified Fragment Length Polymorphism Analysis; Animals; Biomimetics; Body Patterning; Chromosome Mapping; Chromosomes, Artificial, Bacterial; Crosses, Genetic; Genetic Drift; Genetic Linkage; Genetic Markers; Phenotype; Predatory Behavior; Selection, Genetic; Wings, Animal/physiology
ABSTRACT
We studied whether similar developmental genetic mechanisms are involved in both convergent and divergent evolution. Mimetic insects are known for their diversity of patterns as well as their remarkable evolutionary convergence, and they have played an important role in controversies over the respective roles of selection and constraints in adaptive evolution. Here we contrast three butterfly species, all classic examples of Müllerian mimicry. We used a genetic linkage map to show that a locus, Yb, which controls the presence of a yellow band in geographic races of Heliconius melpomene, maps precisely to the same location as the locus Cr, which has very similar phenotypic effects in its co-mimic H. erato. Furthermore, the same genomic location acts as a "supergene", determining multiple sympatric morphs in a third species, H. numata. H. numata is a species with a very different phenotypic appearance, whose many forms mimic different unrelated ithomiine butterflies in the genus Melinaea. Other unlinked colour pattern loci map to a homologous linkage group in the co-mimics H. melpomene and H. erato, but they are not involved in mimetic polymorphism in H. numata. Hence, a single region from the multilocus colour pattern architecture of H. melpomene and H. erato appears to have gained control of the entire wing-pattern variability in H. numata, presumably as a result of selection for mimetic "supergene" polymorphism without intermediates. Although we cannot at this stage confirm the homology of the loci segregating in the three species, our results imply that a conserved yet relatively unconstrained mechanism underlying pattern switching can affect mimicry in radically different ways. We also show that adaptive evolution, both convergent and diversifying, can occur by the repeated involvement of the same genomic regions.
Subjects
Biology; Body Patterning; Butterflies/physiology; Models, Biological; Animals; Biodiversity; Chromosomes, Artificial, Bacterial; Conserved Sequence; Crosses, Genetic; Female; Male; Microsatellite Repeats; Models, Genetic; Molecular Sequence Data; Phenotype
ABSTRACT
The human gene deleted in malignant brain tumors 1 (DMBT1) is considered to play a role in tumorigenesis and pathogen defense. It encodes a protein with multiple scavenger receptor cysteine-rich (SRCR) domains, which are involved in recognition and binding of a broad spectrum of bacterial pathogens. The SRCR domains are encoded by highly homologous repetitive exons, whose number in humans may vary from 8 to 13 due to genetic polymorphism. Here, we characterized the porcine DMBT1 gene on the mRNA and genomic level. We assembled a 4.5 kb porcine DMBT1 cDNA sequence from RT-PCR amplified seminal vesicle RNA. The porcine DMBT1 cDNA contains an open reading frame of 4050 nt. The transcript gives rise to a putative polypeptide of 1349 amino acids with a calculated mass of 147.9 kDa. Compared to human DMBT1, it contains only four N-terminal SRCR domains. Northern blotting revealed transcripts of approximately 4.7 kb in size in the tissues analyzed. Analysis of ESTs suggested the existence of secreted and transmembrane variants. The porcine DMBT1 gene spans about 54 kb on chromosome 14q28-q29. In contrast to the characterized cDNA, the genomic BAC clone only contained 3 exons coding for N-terminal SRCR domains. In different mammalian DMBT1 orthologs large interspecific differences in the number of SRCR exons and utilization of the transmembrane exon exist. Our data suggest that the porcine DMBT1 gene may share with the human DMBT1 gene additional intraspecific variations in the number of SRCR-coding exons.
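The reported ORF length, polypeptide length and calculated mass are mutually consistent, as a short calculation shows. The sketch below assumes that the 4050 nt open reading frame includes the stop codon; the function name `orf_to_protein_length` is illustrative.

```python
def orf_to_protein_length(orf_nt, includes_stop=True):
    """Number of amino acids encoded by an ORF of the given length."""
    if orf_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = orf_nt // 3
    # The stop codon does not encode an amino acid.
    return codons - 1 if includes_stop else codons


aa_length = orf_to_protein_length(4050)       # figures from the abstract
avg_residue_mass = 147.9 * 1000 / aa_length   # Da per residue

print(aa_length)                  # 1349
print(round(avg_residue_mass, 1)) # 109.6
```

An average of about 110 Da per residue is typical for proteins, so the 147.9 kDa figure is plausible for a 1349-residue polypeptide.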
Subjects
Brain Neoplasms/genetics; Gene Deletion; Swine/genetics; Tumor Suppressor Proteins/chemistry; Tumor Suppressor Proteins/genetics; Amino Acid Sequence; Animals; Base Sequence; Chromosomes, Mammalian; Cysteine/chemistry; DNA, Complementary/genetics; Exons; Expressed Sequence Tags; Frameshift Mutation; Gene Duplication; Genetic Variation; Genome; Introns; Molecular Sequence Data; Molecular Weight; Open Reading Frames; Polymorphism, Genetic; Protein Binding; Protein Structure, Tertiary; RNA Splicing; RNA, Messenger/genetics; Tumor Suppressor Proteins/metabolism
ABSTRACT
The newly available pig genome sequence has provided new information to fine map quantitative trait loci (QTL) in order to eventually identify causal variants. With targeted genomic sequencing efforts, we were able to obtain high quality BAC sequences that cover a region on pig chromosome 17 where a number of meat quality QTL have been previously discovered. Sequences from 70 BAC clones were assembled to form an 8-Mbp contig. Subsequently, we successfully mapped five previously identified QTL, three for meat color and two for lactate related traits, to the contig. With an additional 25 genetic markers that were identified by sequence comparison, we were able to carry out further linkage disequilibrium analysis to narrow down the genomic locations of these QTL, which allowed identification of the chromosomal regions that likely contain the causative variants. This research has provided one practical approach to combine genetic and molecular information for QTL mining.
ABSTRACT
BACKGROUND: We describe here the sequencing, annotation and comparative analysis of an 8 Mb region of pig chromosome 17, which provides a useful test region to assess coverage and quality for the pig genome sequencing project. We report our findings comparing the annotation of draft sequence assembled at different depths of coverage. RESULTS: Within this region we annotated 71 loci, of which 53 are orthologous to human known coding genes. When compared to the syntenic regions in human (20q13.13-q13.33) and mouse (chromosome 2, 167.5 Mb-178.3 Mb), this region was found to be highly conserved with respect to gene order. The most notable difference between the three species is the presence of a large expansion of zinc finger coding genes and pseudogenes on mouse chromosome 2 between Edn3 and Phactr3 that is absent from pig and human. All of our annotation has been made publicly available in the Vertebrate Genome Annotation browser, VEGA. We assessed the impact of coverage on sequence assembly across this region and found, as expected, that increased sequence depth resulted in fewer, longer contigs. One-third of our annotated loci could not be fully re-aligned back to the low coverage version of the sequence, principally because the transcripts are fragmented over several contigs. CONCLUSION: We have demonstrated the considerable advantages of sequencing at increased read depths and discuss the implications that lower coverage sequence may have on subsequent comparative and functional studies, particularly those involving complex loci such as GNAS.
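The observation that deeper coverage gives fewer, longer contigs matches what the classical Lander-Waterman model predicts once coverage is above roughly 1x. The sketch below is not the authors' analysis: it applies that idealised model (uniformly random reads, no repeats) to an 8 Mb region, and the read length of 700 bases and minimum detectable overlap of 50 bases are assumed values chosen only for illustration.

```python
import math


def expected_contigs(coverage, region_bp, read_len, min_overlap):
    """Expected number of contigs under the Lander-Waterman model.

    N reads of length L cover a region of size G at depth c = N*L/G;
    with sigma = 1 - T/L (T = minimum overlap), the expected number of
    contigs is N * exp(-c * sigma).
    """
    n_reads = coverage * region_bp / read_len
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-coverage * sigma)


REGION = 8_000_000  # 8 Mb region, as in the abstract
for c in (2, 4, 8):
    n = expected_contigs(c, REGION, read_len=700, min_overlap=50)
    print(f"{c}x coverage: about {n:.0f} contigs expected")
```

Real assemblies deviate from this model because of repeats and cloning bias, but the steep drop in contig count with depth is the same qualitative trend the abstract reports.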
Subjects
Genome; Sus scrofa/genetics; Animals; Base Sequence; Chromosomes, Artificial, Bacterial/genetics; Chromosomes, Mammalian/genetics; Conserved Sequence; Cytochrome P-450 Enzyme System/genetics; Gene Order; Genome, Human/genetics; Humans; Mice; Molecular Chaperones/genetics; Molecular Sequence Data; Protein Tyrosine Phosphatase, Non-Receptor Type 1/genetics; Sequence Analysis, DNA; Vesicular Transport Proteins/genetics
ABSTRACT
BACKGROUND: The domestic pig is being increasingly exploited as a system for modeling human disease. It also has substantial economic importance for meat-based protein production. Physical clone maps have underpinned large-scale genomic sequencing and enabled focused cloning efforts for many genomes. Comparative genetic maps indicate that there is more structural similarity between pig and human than, for example, mouse and human, and we have used this close relationship between human and pig as a way of facilitating map construction. RESULTS: Here we report the construction of the most highly continuous bacterial artificial chromosome (BAC) map of any mammalian genome, for the pig (Sus scrofa domestica) genome. The map provides a template for the generation and assembly of high-quality anchored sequence across the genome. The physical map integrates previous landmark maps with restriction fingerprints and BAC end sequences from over 260,000 BACs derived from 4 BAC libraries and takes advantage of alignments to the human genome to improve the continuity and local ordering of the clone contigs. We estimate that over 98% of the euchromatin of the 18 pig autosomes and the X chromosome along with localized coverage on Y is represented in 172 contigs, with chromosome 13 (218 Mb) represented by a single contig. The map is accessible through pre-Ensembl, where links to marker and sequence data can be found. CONCLUSION: The map will enable immediate electronic positional cloning of genes, benefiting the pig research community and further facilitating use of the pig as an alternative animal model for human disease. The clone map and BAC end sequence data can also help to support the assembly of maps and genome sequences of other artiodactyls.
Subjects
Genome; Physical Chromosome Mapping; Sus scrofa/genetics; Animals; Base Sequence; Chromosomes, Artificial, Bacterial/genetics; Chromosomes, Mammalian; Cloning, Molecular; Gene Library; Molecular Sequence Data
ABSTRACT
This study describes a new tool for accurate and reliable high-throughput detection of copy number variation in the human genome. We have constructed a large-insert clone DNA microarray covering the entire human genome in tiling path resolution that we have used to identify copy number variation in human populations. Crucial to this study has been the development of a robust array platform and analytic process for the automated identification of copy number variants (CNVs). The array consists of 26,574 clones covering 93.7% of euchromatic regions. Clones were selected primarily from the published "Golden Path," and mapping was confirmed by fingerprinting and BAC-end sequencing. Array performance was extensively tested by a series of validation assays. These included determining the hybridization characteristics of each individual clone on the array by chromosome-specific add-in experiments. Estimation of data reproducibility and false-positive/negative rates was carried out using self-self hybridizations, replicate experiments, and independent validations of CNVs. Based on these studies, we developed a variance-based automatic copy number detection analysis process (CNVfinder) and have demonstrated its robustness by comparison with the SW-ARRAY method.
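The abstract describes CNVfinder only as "variance-based", so its actual algorithm cannot be reproduced from this text. As a generic illustration of the idea, the sketch below flags clones whose log2 ratio lies more than a chosen number of robust standard deviations from the median. The function name, the use of the median absolute deviation, the threshold of 3 SD and the example ratios are all assumptions, not details from the paper.

```python
import statistics


def call_copy_number(log2_ratios, n_sd=3.0):
    """Label each clone 'gain', 'loss' or 'normal' from its log2 ratio.

    Spread is estimated robustly as 1.4826 * MAD so that a few true
    CNV clones do not inflate the threshold.
    """
    center = statistics.median(log2_ratios)
    mad = statistics.median([abs(x - center) for x in log2_ratios])
    sd = 1.4826 * mad
    if sd == 0:
        raise ValueError("no variation in the data; cannot set a threshold")
    threshold = n_sd * sd
    calls = []
    for x in log2_ratios:
        if x - center > threshold:
            calls.append("gain")
        elif center - x > threshold:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


# Hypothetical log2 test/reference ratios for ten clones.
ratios = [0.02, -0.03, 0.01, 0.00, 0.58, -0.04, 0.03, -0.95, 0.01, -0.02]
print(call_copy_number(ratios))
```

In this toy data set only the clones at 0.58 and -0.95 exceed the threshold; a production method would also need to handle clone-specific hybridization behaviour, which the abstract says was characterised separately.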
Subjects
Gene Dosage; Genetic Variation; Genome, Human; Algorithms; Chromosome Mapping; DNA/genetics; DNA Fingerprinting; Euchromatin/chemistry; False Negative Reactions; False Positive Reactions; Gene Expression Profiling; Humans; Nucleic Acid Hybridization; Oligonucleotide Array Sequence Analysis; Reproducibility of Results
ABSTRACT
Genome sequencing of the model legumes, Medicago truncatula and Lotus japonicus, provides an opportunity for large-scale sequence-based comparison of two genomes in the same plant family. Here we report synteny comparisons between these species, including details about chromosome relationships, large-scale synteny blocks, microsynteny within blocks, and genome regions lacking clear correspondence. The Lotus and Medicago genomes share a minimum of 10 large-scale synteny blocks, each with substantial collinearity and frequently extending the length of whole chromosome arms. The proportion of genes syntenic and collinear within each synteny block is relatively homogeneous. Medicago-Lotus comparisons also indicate similar and largely homogeneous gene densities, although gene-containing regions in Mt occupy 20-30% more space than Lj counterparts, primarily because of larger numbers of Mt retrotransposons. Because the interpretation of genome comparisons is complicated by large-scale genome duplications, we describe synteny, synonymous substitutions and phylogenetic analyses to identify and date a probable whole-genome duplication event. There is no direct evidence for any recent large-scale genome duplication in either Medicago or Lotus but instead a duplication predating speciation. Phylogenetic comparisons place this duplication within the Rosid I clade, clearly after the split between legumes and Salicaceae (poplar).
Subjects
Genome, Plant/genetics; Lotus/genetics; Medicago truncatula/genetics; Phylogeny; Chromosomes, Plant/genetics; Gene Duplication; Genes, Plant/genetics; Synteny/genetics
ABSTRACT
We present an analysis of the chicken (Gallus gallus) transcriptome based on the full insert sequences for 19,626 cDNAs, combined with 485,337 EST sequences. The cDNA data set has been functionally annotated and describes a minimum of 11,929 chicken coding genes, including the sequence for 2260 full-length cDNAs together with a collection of noncoding (nc) cDNAs that have been stringently filtered to remove untranslated regions of coding mRNAs. The combined collection of cDNAs and ESTs describe 62,546 clustered transcripts and provide transcriptional evidence for a total of 18,989 chicken genes, including 88% of the annotated Ensembl gene set. Analysis of the ncRNAs reveals a set that is highly conserved in chickens and mammals, including sequences for 14 pri-miRNAs encoding 23 different miRNAs. The data sets described here provide a transcriptome toolkit linked to physical clones for bioinformaticians and experimental biologists who wish to use chicken systems as a low-cost, accessible alternative to mammals for the analysis of vertebrate development, immunology, and cell biology.
Subjects
Chickens/genetics; DNA, Complementary/genetics; Expressed Sequence Tags; Gene Library; Transcription, Genetic/genetics; Animals; Cloning, Molecular/methods; Computational Biology/methods; DNA, Complementary/physiology; Humans; MicroRNAs/genetics; RNA, Untranslated/genetics; Sequence Alignment/methods; Sequence Analysis, DNA/methods
ABSTRACT
The future systematic mapping of variants that confer susceptibility to common diseases requires the construction of a fully informative polymorphism map. Ideally, every base pair of the genome would be sequenced in many individuals. Here, we report 4.75 Mb of contiguous sequence for each of two common haplotypes of the major histocompatibility complex (MHC), to which susceptibility to >100 diseases has been mapped. The autoimmune disease-associated-haplotypes HLA-A3-B7-Cw7-DR15 and HLA-A1-B8-Cw7-DR3 were sequenced in their entirety through a bacterial artificial chromosome (BAC) cloning strategy using the consanguineous cell lines PGF and COX, respectively. The two sequences were annotated to encompass all described splice variants of expressed genes. We defined the complete variation content of the two haplotypes, revealing >18,000 variations between them. Average SNP densities ranged from less than one SNP per kilobase to >60. Acquisition of complete and accurate sequence data over polymorphic regions such as the MHC from large-insert cloned DNA provides a definitive resource for the construction of informative genetic maps, and avoids the limitation of chromosome regions that are refractory to PCR amplification.