Results 1 - 17 of 17
1.
Nature ; 625(7996): 735-742, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38030727

ABSTRACT

Noncoding DNA is central to our understanding of human gene regulation and complex diseases [1,2], and measuring the evolutionary sequence constraint can establish the functional relevance of putative regulatory elements in the human genome [3-9]. Identifying the genomic elements that have become constrained specifically in primates has been hampered by the faster evolution of noncoding DNA compared to protein-coding DNA [10], the relatively short timescales separating primate species [11], and the previously limited availability of whole-genome sequences [12]. Here we construct a whole-genome alignment of 239 species, representing nearly half of all extant species in the primate order. Using this resource, we identified human regulatory elements that are under selective constraint across primates and other mammals at a 5% false discovery rate. We detected 111,318 DNase I hypersensitivity sites and 267,410 transcription factor binding sites that are constrained specifically in primates but not across other placental mammals and validate their cis-regulatory effects on gene expression. These regulatory elements are enriched for human genetic variants that affect gene expression and complex traits and diseases. Our results highlight the important role of recent evolution in regulatory sequence elements differentiating primates, including humans, from other placental mammals.


Subjects
Conserved Sequence, Molecular Evolution, Genome, Primates, Animals, Female, Humans, Pregnancy, Conserved Sequence/genetics, Deoxyribonuclease I/metabolism, DNA/genetics, DNA/metabolism, Genome/genetics, Mammals/classification, Mammals/genetics, Placenta, Primates/classification, Primates/genetics, Nucleic Acid Regulatory Sequences/genetics, Reproducibility of Results, Transcription Factors/metabolism, Proteins/genetics, Gene Expression Regulation/genetics
2.
Bioinformatics ; 34(3): 516-518, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29028893

ABSTRACT

Motivation: Whole genome sequencing is becoming a diagnostics of choice for the identification of rare inherited and de novo copy number variants in families with various pediatric and late-onset genetic diseases. However, joint variant calling in pedigrees is hampered by the complexity of consensus breakpoint alignment across samples within an arbitrary pedigree structure. Results: We have developed a new tool, Canvas SPW, for the identification of inherited and de novo copy number variants from pedigree sequencing data. Canvas SPW supports a number of family structures and provides a wide range of scoring and filtering options to automate and streamline identification of de novo variants. Availability and implementation: Canvas SPW is available for download from https://github.com/Illumina/canvas. Contact: sivakhno@illumina.com. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
DNA Copy Number Variations, Genomics/methods, Pedigree, DNA Sequence Analysis/methods, Software, Humans
3.
Bioinformatics ; 33(1): 142-144, 2017 01 01.
Article in English | MEDLINE | ID: mdl-27634946

ABSTRACT

MOTIVATION: Ancestry and Kinship Toolkit (AKT) is a statistical genetics tool for analysing large cohorts of whole-genome sequenced samples. It can rapidly detect related samples, characterize sample ancestry, calculate correlation between variants, check Mendel consistency and perform data clustering. AKT brings together the functionality of many state-of-the-art methods, with a focus on speed and a unified interface. We believe it will be an invaluable tool for the curation of large WGS datasets. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://illumina.github.io/akt CONTACTS: joconnell@illumina.com or rudy.d.arthur@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
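
Illustrative note (not part of the indexed record): the Mendel consistency check mentioned in the abstract can be sketched for a single biallelic site as below. This is a minimal sketch and not AKT's implementation; the function names and the encoding of genotypes as alternate-allele counts (0, 1, 2) are assumptions made for illustration.

```python
def transmissible(genotype):
    """Alleles (0 = reference, 1 = alternate) a parent with this
    alternate-allele count (0, 1 or 2) can pass to a child."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def mendel_consistent(mother, father, child):
    """True if the child's alternate-allele count can be formed by
    taking exactly one allele from each parent."""
    possible = {m + f for m in transmissible(mother) for f in transmissible(father)}
    return child in possible


print(mendel_consistent(1, 1, 2))  # True: both heterozygous parents pass the alternate allele
print(mendel_consistent(0, 0, 1))  # False: neither parent carries the alternate allele
```

A trio that fails this check at many sites points to sample swaps or genotyping errors rather than real de novo mutation, which is why such checks are useful for cohort curation.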


Subjects
Human Genome, Pedigree, DNA Sequence Analysis/methods, Software, Cluster Analysis, Family, Female, Humans, Male, Phylogeny
4.
Bioinformatics ; 33(2): 280-282, 2017 01 15.
Article in English | MEDLINE | ID: mdl-27605106

ABSTRACT

MOTIVATION: Large-scale rearrangements and copy number changes combined with different modes of clonal evolution create extensive somatic genome diversity, making it difficult to develop versatile and scalable variant calling tools and create well-calibrated benchmarks. RESULTS: We developed a new simulation framework tHapMix that enables the creation of tumour samples with different ploidy, purity and polyclonality features. It easily scales to simulation of hundreds of somatic genomes, while re-use of real read data preserves noise and biases present in sequencing platforms. We further demonstrate tHapMix utility by creating a simulated set of 140 somatic genomes and showing how it can be used in training and testing of somatic copy number variant calling tools. AVAILABILITY AND IMPLEMENTATION: tHapMix is distributed under an open source license and can be downloaded from https://github.com/Illumina/tHapMix CONTACT: sivakhno@illumina.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
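
Illustrative note (not part of the indexed record): the effect of sample purity on observed copy number, one of the features the abstract says can be varied, can be sketched with a simple mixing formula. This is an assumption-laden sketch rather than tHapMix code: the function name, the single-clone model and the diploid normal baseline are all illustrative choices.

```python
def expected_copy_number(tumour_cn, purity, normal_cn=2):
    """Average copy number observed in a bulk sample in which a
    fraction `purity` of cells are tumour cells with copy number
    `tumour_cn` and the rest are normal cells with `normal_cn`."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return purity * tumour_cn + (1.0 - purity) * normal_cn


# A four-copy amplification in a 50% pure sample looks like three copies on average.
print(expected_copy_number(4, 0.5))
```

Lower purity pulls every tumour copy number state toward the normal baseline, which is one reason benchmarks need samples spanning a range of purities.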


Subjects
DNA Copy Number Variations, Genomics/methods, Haplotypes, Neoplasms/genetics, Ploidies, Software, Computer Simulation, Neoplasm DNA, Genome, Humans
5.
Bioinformatics ; 32(15): 2306-12, 2016 08 01.
Article in English | MEDLINE | ID: mdl-27153730

ABSTRACT

MOTIVATION: Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD)-based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed. RESULTS: Our method is hundreds of times faster than other methods on the same data set and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low- and high-coverage samples. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/illumina/marvin CONTACT: rarthur@illumina.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
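
Illustrative note (not part of the indexed record): the idea of modelling LD with a Gaussian distribution can be sketched in its simplest two-site form, where the expected dosage at a target site is conditioned on the dosage observed at one neighbouring site. This is not the published algorithm; the function names, the toy panel and the restriction to a single neighbour are assumptions made for illustration.

```python
from statistics import mean


def fit_pair(target, neighbour):
    """Estimate means, covariance and neighbour variance from panel dosages."""
    mu_t, mu_n = mean(target), mean(neighbour)
    n = len(target)
    cov = sum((t - mu_t) * (x - mu_n) for t, x in zip(target, neighbour)) / n
    var = sum((x - mu_n) ** 2 for x in neighbour) / n
    return mu_t, mu_n, cov, var


def refine(target, neighbour, observed_neighbour):
    """Conditional mean of the target dosage under a bivariate Gaussian:
    E[t | x] = mu_t + cov(t, x) / var(x) * (x - mu_x)."""
    mu_t, mu_n, cov, var = fit_pair(target, neighbour)
    if var == 0:
        return mu_t  # neighbour is uninformative, fall back to the panel mean
    return mu_t + cov / var * (observed_neighbour - mu_n)


# Toy panel (hypothetical): the two sites are in perfect LD.
panel_target = [0, 1, 2, 1, 0, 2]
panel_neighbour = [0, 1, 2, 1, 0, 2]
print(refine(panel_target, panel_neighbour, 2))
```

Conditioning a Gaussian needs only means and covariances, which hints at why such a model can be much cheaper to evaluate than a hidden Markov model.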


Subjects
Genotype, Linkage Disequilibrium, Software, Algorithms, Humans, Normal Distribution
6.
Bioinformatics ; 32(8): 1220-2, 2016 04 15.
Article in English | MEDLINE | ID: mdl-26647377

ABSTRACT

We describe Manta, a method to discover structural variants and indels from next generation sequencing data. Manta is optimized for rapid germline and somatic analysis, calling structural variants, medium-sized indels and large insertions on standard compute hardware in less than a tenth of the time that comparable methods require to identify only subsets of these variant types: for example NA12878 at 50× genomic coverage is analyzed in less than 20 min. Manta can discover and score variants based on supporting paired and split-read evidence, with scoring models optimized for germline analysis of diploid individuals and somatic analysis of tumor-normal sample pairs. Call quality is similar to or better than comparable methods, as determined by pedigree consistency of germline calls and comparison of somatic calls to COSMIC database variants. Manta consistently assembles a higher fraction of its calls to base-pair resolution, allowing for improved downstream annotation and analysis of clinical significance. We provide Manta as a community resource to facilitate practical and routine structural variant analysis in clinical and research sequencing scenarios. AVAILABILITY AND IMPLEMENTATION: Manta is released under the open-source GPLv3 license. Source code, documentation and Linux binaries are available from https://github.com/Illumina/manta. CONTACT: csaunders@illumina.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing, INDEL Mutation, Neoplasms/genetics, Neoplasm DNA, Genome, Genomics, Humans, Software
7.
PeerJ ; 3: e996, 2015.
Article in English | MEDLINE | ID: mdl-26056623

ABSTRACT

Scaffolding errors and incorrect repeat disambiguation during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub or PyPI, the Python Package Index; a tutorial and user documentation are also available.

8.
Bioinformatics ; 31(12): 2035-7, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-25661542

ABSTRACT

MOTIVATION: Mate pair protocols add to the utility of paired-end sequencing by boosting the genomic distance spanned by each pair of reads, potentially allowing larger repeats to be bridged and resolved. The Illumina Nextera Mate Pair (NMP) protocol uses a circularization-based strategy that leaves behind 38-bp adapter sequences, which must be computationally removed from the data. While 'adapter trimming' is a well-studied area of bioinformatics, existing tools do not fully exploit the particular properties of NMP data and discard more data than is necessary. RESULTS: We present NxTrim, a tool that strives to discard as little sequence as possible from NMP reads. NxTrim makes full use of the sequence on both sides of the adapter site to build 'virtual libraries' of mate pairs, paired-end reads and single-ended reads. For bacterial data, we show that aggregating these datasets allows a single NMP library to yield an assembly whose quality compares favourably to that obtained from regular paired-end reads. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/sequencing/NxTrim
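
Illustrative note (not part of the indexed record): the basic step of keeping the sequence on both sides of an adapter, instead of discarding one side, can be sketched as below. This is a simplified sketch and not NxTrim itself: the adapter string is a short hypothetical stand-in for the real 38-bp sequence, matching is exact only, and the function name is an assumption.

```python
# Hypothetical short stand-in; the real NMP adapter is 38 bp long.
ADAPTER = "CTGTCT"


def split_at_adapter(read, adapter=ADAPTER):
    """Return (left, right) sequence flanking the first exact adapter
    match, or (read, "") when the adapter is absent."""
    pos = read.find(adapter)
    if pos == -1:
        return read, ""
    return read[:pos], read[pos + len(adapter):]


left, right = split_at_adapter("AAAACTGTCTTTTT")
print(left, right)
```

Both returned pieces remain usable: depending on where the adapter falls, the flanks can be routed to mate-pair, paired-end or single-end output, which is the spirit of the 'virtual libraries' described in the abstract. A practical tool would also need to tolerate sequencing errors and partial adapters at read ends.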


Subjects
Bacteria/genetics, Computational Biology/methods, Bacterial Genome, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Software, Gene Library
9.
Bioinformatics ; 30(19): 2796-801, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24950811

ABSTRACT

MOTIVATION: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input. RESULTS: We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from where we can search for 1, 10, 100, 1000 and a million of 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization. AVAILABILITY AND IMPLEMENTATION: BEETL-fastq is part of the BEETL library, available as a github repository at github.com/BEETL/BEETL.
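
Illustrative note (not part of the indexed record): the behaviour described in the abstract, returning the full FASTQ record of every read that contains a query k-mer, can be sketched with a plain linear scan. This is not how BEETL-fastq works internally (it searches a compressed index); the function names and the tiny example data are assumptions made for illustration.

```python
def parse_fastq(text):
    """Yield (header, sequence, separator, qualities) for each 4-line record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        yield header, seq, plus, qual


def search(text, kmer):
    """Return the complete records whose sequence contains `kmer`."""
    return [rec for rec in parse_fastq(text) if kmer in rec[1]]


# Two hypothetical reads in FASTQ format.
EXAMPLE = "@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nTTTTGGGG\n+\nIIIIIIII\n"

for record in search(EXAMPLE, "TTGG"):
    print("\n".join(record))
```

Because whole records come back, the output is itself valid FASTQ and can be piped into downstream tools. The linear scan touches every read per query, whereas an indexed search avoids that cost on terabyte-scale archives.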


Subjects
Data Compression/methods, Neoplasms/genetics, DNA Sequence Analysis/methods, Algorithms, Computer Simulation, DNA, Genome, Human Genome, Genotype, High-Throughput Nucleotide Sequencing, Humans, Software
10.
Bioinformatics ; 30(1): 24-30, 2014 Jan 01.
Article in English | MEDLINE | ID: mdl-23661694

ABSTRACT

MOTIVATION: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and, hence, are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. As our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is, therefore, applicable to data from metagenomics and de novo experiments as well as to re-sequencing data. RESULTS: We show that a conservative smoothing strategy affecting 75% of the quality scores above Q2 leads to an overall quality score compression of 1 bit per value with a negligible effect on variant calling. A compression of 0.68 bit per quality value is achieved using a more aggressive smoothing strategy, again with a very small effect on variant calling. AVAILABILITY: Code to construct the BWT and LCP-array on large genomic data sets is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos , Animais , Caenorhabditis elegans , Compressão de Dados , Genoma , Genômica , Humanos , Metagenômica , Análise de Sequência de DNA
11.
Am J Hum Genet ; 93(4): 687-96, 2013 Oct 03.
Article in English | MEDLINE | ID: mdl-24094745

ABSTRACT

High-throughput sequencing technologies produce short sequence reads that can contain phase information if they span two or more heterozygote genotypes. This information is not routinely used by current methods that infer haplotypes from genotype data. We have extended the SHAPEIT2 method to use phase-informative sequencing reads to improve phasing accuracy. Our model incorporates the read information in a probabilistic model through base quality scores within each read. The method is primarily designed for high-coverage sequence data or data sets that already have genotypes called. One important application is phasing of single samples sequenced at high coverage for use in medical sequencing and studies of rare diseases. Our method can also use existing panels of reference haplotypes. We tested the method by using a mother-father-child trio sequenced at high-coverage by Illumina together with the low-coverage sequence data from the 1000 Genomes Project (1000GP). We found that use of phase-informative reads increases the mean distance between switch errors by 22% from 274.4 kb to 328.6 kb. We also used male chromosome X haplotypes from the 1000GP samples to simulate sequencing reads with varying insert size, read length, and base error rate. When using short 100 bp paired-end reads, we found that using mixtures of insert sizes produced the best results. When using longer reads with high error rates (5-20 kb read with 4%-15% error per base), phasing performance was substantially improved.
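
Illustrative note (not part of the indexed record): the switch errors used as the accuracy measure in the abstract can be counted as below. This is a minimal sketch, not code from SHAPEIT2; the function name and the encoding of one haplotype as a string of 0/1 alleles at heterozygous sites are assumptions made for illustration.

```python
def count_switch_errors(truth, inferred):
    """Count positions where the inferred haplotype changes from
    matching the true haplotype to matching its complement, or back.
    Both inputs list one haplotype's alleles at heterozygous sites."""
    if len(truth) != len(inferred):
        raise ValueError("haplotypes must cover the same sites")
    agrees = [t == i for t, i in zip(truth, inferred)]
    return sum(1 for prev, cur in zip(agrees, agrees[1:]) if prev != cur)


# The inferred haplotype follows the truth for three sites, then flips.
print(count_switch_errors("010110", "010001"))
```

A completely complemented haplotype yields zero switches, since the labelling of the two haplotypes is arbitrary. Fewer switches over the same span means a longer mean distance between switch errors, the quantity reported in the abstract.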


Subjects
Human Genome, Haplotypes/genetics, DNA Sequence Analysis/methods, Child, Fathers, Female, Genotype, Humans, Male, Genetic Models, Mothers, Single Nucleotide Polymorphism
12.
BMC Bioinformatics ; 14 Suppl 5: S2, 2013.
Article in English | MEDLINE | ID: mdl-23734710

ABSTRACT

Environmental shotgun sequencing (ESS) has potential to give greater insight into microbial communities than targeted sequencing of 16S regions, but requires much higher sequence coverage. The advent of next-generation sequencing has made it feasible for the Human Microbiome Project and other initiatives to generate ESS data on a large scale, but computationally efficient methods for analysing such data sets are needed. Here we present metaBEETL, a fast taxonomic classifier for environmental shotgun sequences. It uses a Burrows-Wheeler Transform (BWT) index of the sequencing reads and an indexed database of microbial reference sequences. Unlike other BWT-based tools, our method has no upper limit on the number or the total size of the reference sequences in its database. By capturing sequence relationships between strains, our reference index also allows us to classify reads which are not unique to an individual strain but are nevertheless specific to some higher phylogenetic order. Tested on datasets with known taxonomic composition, metaBEETL gave results that are competitive with existing similarity-based tools: due to normalization steps which other classifiers lack, the taxonomic profile computed by metaBEETL closely matched the true environmental profile. At the same time, its moderate running time and low memory footprint allow metaBEETL to scale well to large data sets. Code to construct the BWT indexed database and for the taxonomic classification is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.


Subjects
High-Throughput Nucleotide Sequencing/methods, Metagenomics/methods, Microbiota, DNA Sequence Analysis/methods, Algorithms, Environmental Microbiology, Humans, Phylogeny
13.
Bioinformatics ; 28(11): 1415-9, 2012 Jun 01.
Article in English | MEDLINE | ID: mdl-22556365

ABSTRACT

MOTIVATION: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. RESULTS: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm. We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realized without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3 Gb of sequence to fit into only 8.2 GB of space (trimming a small proportion of low-quality bases from the reads improves the compression still further). This is >4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections. AVAILABILITY: Code to construct the BWT and SAP-array on large genomic datasets is part of the BEETL library, available as a github repository at https://github.com/BEETL/BEETL.
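
Illustrative note (not part of the indexed record): the Burrows-Wheeler transform and the compression figure quoted in the abstract can be sketched as below. The transform is computed naively by sorting all rotations, which is exactly the approach that does not scale to large collections and is not the BEETL algorithm. The bits-per-base arithmetic assumes 1 GB = 10^9 bytes and 1 Gb = 10^9 bases; the function name is an assumption.

```python
def bwt(text):
    """Naive BWT: sort every rotation of `text` and read off the last
    column. `text` must end with a unique sentinel such as '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana$"))  # annb$aa

# Figures taken from the abstract: 135.3 Gb of sequence stored in 8.2 GB.
bits_per_base = (8.2e9 * 8) / 135.3e9
print(round(bits_per_base, 3))  # 0.485, consistent with "under 0.5 bits per base"
```

The transform groups identical characters into runs (the two n's and the trailing a's above), which is what makes the output more compressible than the input.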


Subjects
Algorithms, Data Compression/methods, Nucleic Acid Databases, Human Genome, Genomics/methods, Computer Simulation, Escherichia coli/genetics, Humans, DNA Sequence Analysis
14.
Cell ; 148(4): 780-91, 2012 Feb 17.
Article in English | MEDLINE | ID: mdl-22341448

ABSTRACT

The Tasmanian devil (Sarcophilus harrisii), the largest marsupial carnivore, is endangered due to a transmissible facial cancer spread by direct transfer of living cancer cells through biting. Here we describe the sequencing, assembly, and annotation of the Tasmanian devil genome and whole-genome sequences for two geographically distant subclones of the cancer. Genomic analysis suggests that the cancer first arose from a female Tasmanian devil and that the clone has subsequently genetically diverged during its spread across Tasmania. The devil cancer genome contains more than 17,000 somatic base substitution mutations and bears the imprint of a distinct mutational process. Genotyping of somatic mutations in 104 geographically and temporally distributed Tasmanian devil tumors reveals the pattern of evolution and spread of this parasitic clonal lineage, with evidence of a selective sweep in one geographical area and persistence of parallel lineages in other populations.


Subjects
Facial Neoplasms/veterinary, Genomic Instability, Marsupialia/genetics, Mutation, Animals, Clonal Evolution, Endangered Species, Facial Neoplasms/epidemiology, Facial Neoplasms/genetics, Facial Neoplasms/pathology, Female, Genome-Wide Association Study, Male, Molecular Sequence Data, Tasmania/epidemiology
15.
Bioinformatics ; 26(24): 3051-8, 2010 Dec 15.
Article in English | MEDLINE | ID: mdl-20966003

ABSTRACT

MOTIVATION: Copy number abnormalities (CNAs) represent an important type of genetic mutation that can lead to abnormal cell growth and proliferation. New high-throughput sequencing technologies promise comprehensive characterization of CNAs. In contrast to microarrays, where probe design follows a carefully developed protocol, reads represent a random sample from a library and may be prone to representation biases due to GC content and other factors. The discrimination between true and false positive CNAs becomes an important issue. RESULTS: We present a novel approach, called CNAseg, to identify CNAs from second-generation sequencing data. It uses depth of coverage to estimate copy number states and flowcell-to-flowcell variability in cancer and normal samples to control the false positive rate. We tested the method using the COLO-829 melanoma cell line sequenced to 40-fold coverage. An extensive simulation scheme was developed to recreate different scenarios of copy number changes and depth of coverage by altering a real dataset with spiked-in CNAs. Comparison to alternative approaches using both real and simulated datasets showed that CNAseg achieves superior precision and improved sensitivity estimates. AVAILABILITY: The CNAseg package and test data are available at http://www.compbio.group.cam.ac.uk/software.html.
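
Illustrative note (not part of the indexed record): the use of depth of coverage to estimate copy number can be sketched as a per-window tumour-to-normal ratio. This is a deliberately simplified sketch and not the CNAseg method, which also segments the genome and models flowcell-to-flowcell variability; the function name, the median normalisation and the toy depths are assumptions.

```python
from statistics import median


def copy_number_estimates(tumour_depth, normal_depth, ploidy=2):
    """Estimate copy number per window from read depth in matched
    tumour and normal samples. Ratios are scaled so that the median
    window corresponds to the baseline `ploidy`."""
    ratios = [t / n for t, n in zip(tumour_depth, normal_depth)]
    baseline = median(ratios)
    return [ploidy * r / baseline for r in ratios]


# Hypothetical depths for four windows; the third window is amplified.
tumour = [100, 100, 200, 100]
normal = [100, 100, 100, 100]
print(copy_number_estimates(tumour, normal))
```

Dividing by the matched normal cancels window-specific biases such as GC content that affect both samples, but random depth fluctuations remain, which is why controlling false positives is the central problem the abstract describes.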


Subjects
Algorithms, DNA Copy Number Variations, Neoplasms/genetics, Base Composition, Tumor Cell Line, Human Genome, Humans, Mutation, DNA Sequence Analysis
16.
Nature ; 463(7278): 191-6, 2010 Jan 14.
Article in English | MEDLINE | ID: mdl-20016485

ABSTRACT

All cancers carry somatic mutations. A subset of these somatic alterations, termed driver mutations, confer selective growth advantage and are implicated in cancer development, whereas the remainder are passengers. Here we have sequenced the genomes of a malignant melanoma and a lymphoblastoid cell line from the same person, providing the first comprehensive catalogue of somatic mutations from an individual cancer. The catalogue provides remarkable insights into the forces that have shaped this cancer genome. The dominant mutational signature reflects DNA damage due to ultraviolet light exposure, a known risk factor for malignant melanoma, whereas the uneven distribution of mutations across the genome, with a lower prevalence in gene footprints, indicates that DNA repair has been preferentially deployed towards transcribed regions. The results illustrate the power of a cancer genome sequence to reveal traces of the DNA damage, repair, mutation and selection processes that were operative years before the cancer became symptomatic.


Subjects
Neoplasm Genes/genetics, Human Genome/genetics, Mutation/genetics, Neoplasms/genetics, Adult, Tumor Cell Line, DNA Damage/genetics, DNA Mutational Analysis, DNA Repair/genetics, Gene Dosage/genetics, Humans, Loss of Heterozygosity/genetics, Male, Melanoma/etiology, Melanoma/genetics, MicroRNAs/genetics, Insertional Mutagenesis/genetics, Neoplasms/etiology, Single Nucleotide Polymorphism/genetics, Precision Medicine, Sequence Deletion/genetics, Ultraviolet Rays
17.
Nature ; 456(7218): 53-9, 2008 Nov 06.
Article in English | MEDLINE | ID: mdl-18987734

ABSTRACT

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.


Subjects
Human Genome/genetics, Genomics/methods, DNA Sequence Analysis/methods, Human X Chromosomes/genetics, Consensus Sequence/genetics, Genomics/economics, Genotype, Humans, Male, Nigeria, Single Nucleotide Polymorphism/genetics, Sensitivity and Specificity, DNA Sequence Analysis/economics