ABSTRACT
There is a concerted effort by a number of public and private groups to identify a large set of human single-nucleotide polymorphisms (SNPs). As of March 2001, 2.84 million SNPs have been deposited in the public database, dbSNP, at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/SNP/). The 2.84 million SNPs can be grouped into 1.65 million non-redundant SNPs. As part of the International SNP Map Working Group, we recently published a high-density SNP map of the human genome consisting of 1.42 million SNPs (ref. 3). In addition, numerous SNPs are maintained in proprietary databases. Our survey of more than 1,200 SNPs indicates that more than 80% of The SNP Consortium (TSC) and Washington University candidate SNPs are polymorphic and that approximately 50% of the candidate SNPs from these two sources are common SNPs (with a minor allele frequency of ≥20%) in any given population.
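The "common SNP" criterion in this abstract is a threshold on minor allele frequency. A minimal sketch of that classification follows. The function names, the category labels, and the input format (a mapping from allele to observed chromosome count) are illustrative assumptions and are not taken from the survey itself.

```python
def minor_allele_frequency(allele_counts):
    """Return the frequency of the second most common allele (0.0 if only one allele is seen).

    allele_counts is a hypothetical input format: {"A": 70, "G": 30}.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no chromosomes were genotyped")
    frequencies = sorted((count / total for count in allele_counts.values()), reverse=True)
    return frequencies[1] if len(frequencies) > 1 else 0.0


def classify_candidate(allele_counts, common_threshold=0.20):
    """Label a candidate SNP using the 20% minor allele frequency cut-off quoted above."""
    maf = minor_allele_frequency(allele_counts)
    if maf == 0.0:
        return "monomorphic"      # candidate did not validate in this sample
    if maf >= common_threshold:
        return "common"           # polymorphic with MAF >= 20%
    return "low-frequency"        # polymorphic, but below the cut-off
```

For example, a candidate genotyped as 70 A and 30 G chromosomes has a minor allele frequency of 0.30 and would be labelled "common" under this sketch.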
Subject(s)
Single Nucleotide Polymorphism, DNA/genetics, Humans, Polymerase Chain Reaction
ABSTRACT
Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2, 3, 4, 5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence as a template on which to layer often unmapped, fragmentary sequence data and to use base quality values to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
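The idea of weighing base quality values in a Bayesian calculation can be illustrated with a deliberately simplified sketch. This is not the POLYBAYES algorithm: it assumes a single alignment column, independent reads, sequencing errors spread evenly over the three other bases, two equally frequent alleles under the polymorphic hypothesis, and an arbitrary prior of 0.001. All function names and parameters are hypothetical.

```python
from itertools import combinations

BASES = "ACGT"


def phred_to_error(quality):
    """Convert a Phred-scaled base quality into an error probability."""
    return 10 ** (-quality / 10)


def call_probability(called, true_base, quality):
    """P(observed call | true base), with errors split evenly over the 3 other bases."""
    error = phred_to_error(quality)
    return 1 - error if called == true_base else error / 3


def polymorphism_posterior(column, prior=0.001):
    """Posterior probability that a column of (base, phred_quality) calls is polymorphic."""
    # Monomorphic hypothesis: every read derives from the same, unknown true base.
    p_mono = 0.0
    for base in BASES:
        likelihood = 1.0
        for called, quality in column:
            likelihood *= call_probability(called, base, quality)
        p_mono += likelihood / len(BASES)

    # Polymorphic hypothesis (assumed): two alleles, each read drawn from either at 50%.
    allele_pairs = list(combinations(BASES, 2))
    p_poly = 0.0
    for first, second in allele_pairs:
        likelihood = 1.0
        for called, quality in column:
            likelihood *= 0.5 * call_probability(called, first, quality) \
                        + 0.5 * call_probability(called, second, quality)
        p_poly += likelihood / len(allele_pairs)

    # Bayes' rule over the two hypotheses.
    evidence = prior * p_poly + (1 - prior) * p_mono
    return prior * p_poly / evidence
```

Under these assumptions, a lone mismatching base with a low quality value barely moves the posterior, while the same mismatch at high quality raises it noticeably, which is the behaviour the abstract attributes to rigorous treatment of base quality. For deep alignments the products should be accumulated in log space to avoid underflow.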
Subject(s)
Genetic Techniques, Single Nucleotide Polymorphism, Algorithms, Alleles, Bayes Theorem, Statistical Data Interpretation, Expressed Sequence Tags, Genetic Variation, Human Genome, Humans, Sequence Alignment, Software
Subject(s)
Antineoplastic Agents/therapeutic use, Imatinib Mesylate/therapeutic use, BCR-ABL Positive Chronic Myelogenous Leukemia/drug therapy, Chronic Myelomonocytic Leukemia/drug therapy, Aged, Humans, BCR-ABL Positive Chronic Myelogenous Leukemia/genetics, BCR-ABL Positive Chronic Myelogenous Leukemia/pathology, Chronic Myelomonocytic Leukemia/genetics, Chronic Myelomonocytic Leukemia/pathology, Male, Mutation
ABSTRACT
Large-scale genomic sequencing requires a software infrastructure to support and integrate applications that are not directly compatible. We describe a suite of software tools built around the Common Assembly Format (CAF), a comprehensive representation of a sequence assembly as a text file. These tools form the backbone of sequencing informatics at the Sanger Centre and the Genome Sequencing Center. The CAF format is intentionally flexible, and our Perl and C libraries, which parse and manipulate it, provide powerful tools for creating new applications as well as wrappers to incorporate other software. The tools are available free by anonymous FTP from ftp://ftp.sanger.ac.uk/pub/badger/.
Subject(s)
Base Sequence, Genome, DNA Sequence Analysis/methods, Algorithms, Computational Biology/methods, Factual Databases, Gene Library, Sequence Alignment
ABSTRACT
We describe a map of 1.42 million single-nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exons (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
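A statistic such as "85% of exons are within 5 kb of the nearest SNP" can be computed from exon intervals and SNP coordinates on a common reference. The sketch below shows one way to do so with a binary search. The function names, the closed-interval coordinate convention, and the single-chromosome input are assumptions made for illustration and do not describe how the map's authors performed the calculation.

```python
from bisect import bisect_left


def distance_to_nearest_snp(exon_start, exon_end, sorted_snps):
    """Gap in bases between an exon [exon_start, exon_end] and the nearest SNP.

    Returns 0 if a SNP lies inside the exon, or None if there are no SNPs at all.
    """
    index = bisect_left(sorted_snps, exon_start)
    best = None
    if index < len(sorted_snps):
        # First SNP at or after the exon start: inside the exon, or downstream of it.
        position = sorted_snps[index]
        best = 0 if position <= exon_end else position - exon_end
    if index > 0:
        # Last SNP strictly before the exon start.
        upstream_gap = exon_start - sorted_snps[index - 1]
        best = upstream_gap if best is None else min(best, upstream_gap)
    return best


def fraction_of_exons_near_snp(exons, snp_positions, max_distance=5000):
    """Fraction of exons whose nearest SNP is at most max_distance bases away."""
    sorted_snps = sorted(snp_positions)
    near = 0
    for start, end in exons:
        gap = distance_to_nearest_snp(start, end, sorted_snps)
        if gap is not None and gap <= max_distance:
            near += 1
    return near / len(exons)
```

With toy coordinates such as SNPs at positions 1000 and 20000, an exon spanning 3000 to 3500 is 2000 bases from its nearest SNP and counts as "near", whereas an exon spanning 10000 to 10500 is 9000 bases away and does not.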