ABSTRACT
A mechanistic understanding of the biological and technical factors that impact transcript measurements is essential to designing and analyzing single-cell and single-nucleus RNA sequencing experiments. Nuclei contain the same pre-mRNA population as cells, but they contain a small subset of the mRNAs. Nonetheless, early studies argued that single-nucleus analysis yielded results comparable to cellular samples if pre-mRNA measurements were included. However, typical workflows do not distinguish between pre-mRNA and mRNA when estimating gene expression, and variation in their relative abundances across cell types has received limited attention. These gaps are especially important given that incorporating pre-mRNA has become commonplace for both assays, despite known gene length bias in pre-mRNA capture. Here, we reanalyze public data sets from mouse and human to describe the mechanisms and contrasting effects of mRNA and pre-mRNA sampling on gene expression and marker gene selection in single-cell and single-nucleus RNA-seq. We show that pre-mRNA levels vary considerably among cell types, which mediates the degree of gene length bias and limits the generalizability of a recently published normalization method intended to correct for this bias. As an alternative, we repurpose an existing post hoc gene length-based correction method from conventional RNA-seq gene set enrichment analysis. Finally, we show that inclusion of pre-mRNA in bioinformatic processing can impart a larger effect than assay choice itself, which is pivotal to the effective reuse of existing data. These analyses advance our understanding of the sources of variation in single-cell and single-nucleus RNA-seq experiments and provide useful guidance for future studies.
Subjects
Cell Nucleus, RNA Precursors, Humans, Animals, Mice, RNA-Seq, Messenger RNA/genetics, RNA Sequence Analysis/methods, Cell Nucleus/genetics, Gene Expression Profiling/methods, Single-Cell Analysis
ABSTRACT
Genetic and gene expression heterogeneity is an essential hallmark of many tumors, allowing the cancer to evolve and to develop resistance to treatment. Currently, the most commonly used data types for studying such heterogeneity are bulk tumor/normal whole-genome or whole-exome sequencing (WGS, WES); and single-cell RNA sequencing (scRNA-seq), respectively. However, tools are currently lacking to link genomic tumor subclonality with transcriptomic heterogeneity by integrating genomic and single-cell transcriptomic data collected from the same tumor. To address this gap, we developed scBayes, a Bayesian probabilistic framework that uses tumor subclonal structure inferred from bulk DNA sequencing data to determine the subclonal identity of cells from single-cell gene expression (scRNA-seq) measurements. Grouping together cells representing the same genetically defined tumor subclones allows comparison of gene expression across different subclones, or investigation of gene expression changes within the same subclone across time (i.e., progression, treatment response, or relapse) or space (i.e., at multiple metastatic sites and organs). We used simulated data sets, in silico synthetic data sets, as well as biological data sets generated from cancer samples to extensively characterize and validate the performance of our method, as well as to show improvements over existing methods. We show the validity and utility of our approach by applying it to published data sets and recapitulating the findings, as well as arriving at novel insights into cancer subclonal expression behavior in our own data sets. We further show that our method is applicable to a wide range of single-cell sequencing technologies including single-cell DNA sequencing as well as Smart-seq and 10x Genomics scRNA-seq protocols.
Subjects
Neoplasms, Humans, Exome Sequencing, Bayes Theorem, Neoplasms/genetics, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Single-Cell Analysis/methods
ABSTRACT
MOTIVATION: In time-critical clinical settings, such as precision medicine, genomic data needs to be processed as fast as possible to arrive at data-informed treatment decisions in a timely fashion. While sequencing throughput has dramatically increased over the past decade, bioinformatics analysis throughput has not been able to keep up with the pace of computer hardware improvement, and consequently has now turned into the primary bottleneck. Modern computer hardware is capable of much higher performance than current genomic informatics algorithms can typically utilize, which presents opportunities for significant performance improvement. Accessing the raw sequencing data from BAM files, for example, is a necessary and time-consuming step in nearly all sequence analysis tools; however, existing programming libraries for BAM access do not take full advantage of the parallel input/output capabilities of storage devices. RESULTS: In an effort to stimulate the development of a new generation of faster sequence analysis tools, we developed quickBAM, a software library that accelerates sequencing data access by exploiting the parallelism in widely available commodity storage hardware. We demonstrate that analysis software ported to quickBAM consistently outperforms its original version, in some cases finishing an analysis in under 3 min while the original version took 1.5 h, using the same storage solution. AVAILABILITY AND IMPLEMENTATION: quickBAM is open source and freely available at https://gitlab.com/yiq/quickbam/. We envision that quickBAM will enable a new generation of high-performance informatics tools, either by directly boosting their performance if they are currently bottlenecked by data access, or by allowing data access to keep up with further optimizations in algorithms and compute techniques.
Subjects
Algorithms, Software, High-Throughput Nucleotide Sequencing/methods, Genomics, Informatics, DNA Sequence Analysis/methods
ABSTRACT
GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
Subjects
Breast Neoplasms/genetics, Human Genome, Genomics/methods, Search Engine/methods, DNA Sequence Analysis/methods, Software, Genetic Databases, Female, Humans, Internet
ABSTRACT
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Subjects
Genetic Variation/genetics, Human Genome/genetics, Physical Chromosome Mapping, Amino Acid Sequence, Genetic Predisposition to Disease, Medical Genetics, Population Genetics, Genome-Wide Association Study, Genomics, Genotype, Haplotypes/genetics, Homozygote, Humans, Molecular Sequence Data, Mutation Rate, Single Nucleotide Polymorphism/genetics, Quantitative Trait Loci/genetics, DNA Sequence Analysis, Sequence Deletion/genetics
ABSTRACT
BACKGROUND: Pedigree files are ubiquitously used within bioinformatics and genetics studies to convey critical information about relatedness, sex and affected status of study samples. While the text-based format of ped files is efficient for computational methods, it is not immediately intuitive to a bioinformatician or geneticist trying to understand family structures, many of which encode the affected status of individuals across multiple generations. Visualizing pedigrees as connected nodes with descriptive shapes and shading provides a far more interpretable format for recognizing visual patterns and intuiting family structures. Despite these advantages of a visual pedigree, it remains difficult to quickly and accurately visualize a pedigree given a pedigree text file. RESULTS: Here we describe ped_draw, a command-line and web tool, as a simple and easy solution to pedigree visualization. Ped_draw is capable of drawing complex multi-generational pedigrees and conforms to the accepted standards for depicting pedigrees visually. The command-line tool can be used as a simple one-line command, utilizing graphviz to generate an image file. The web tool, https://peddraw.github.io, allows the user to paste a pedigree file, type to construct a pedigree file in the text box, or upload a pedigree file. Users can save the generated image file in various formats. CONCLUSIONS: We believe ped_draw is a useful pedigree drawing tool that improves on current methods due to its ease of use and approachability. Ped_draw allows users with various levels of expertise to quickly and easily visualize pedigrees.
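As an illustration of the idea described in this abstract, the following minimal Python sketch turns pedigree text into Graphviz DOT source. It is not ped_draw's implementation. It assumes the common six-column ped layout (family ID, individual ID, father ID, mother ID, sex coded 1 = male / 2 = female, phenotype coded 2 = affected) with "0" marking an unknown parent; the function name ped_to_dot is hypothetical.

```python
def ped_to_dot(ped_text):
    """Convert six-column pedigree text into Graphviz DOT source (sketch)."""
    # Conventional pedigree symbols: square = male, circle = female,
    # diamond = unknown sex. The sex codes are an assumption (1/2).
    shapes = {"1": "box", "2": "circle"}
    lines = ["digraph pedigree {"]
    for row in ped_text.splitlines():
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fam, ind, father, mother, sex, pheno = row.split()[:6]
        shape = shapes.get(sex, "diamond")
        # Affected individuals (assumed phenotype code "2") are shaded.
        fill, font = ("black", "white") if pheno == "2" else ("white", "black")
        lines.append(
            f'  "{ind}" [shape={shape}, style=filled, '
            f"fillcolor={fill}, fontcolor={font}];"
        )
        # Draw one edge from each known parent to the child.
        for parent in (father, mother):
            if parent != "0":
                lines.append(f'  "{parent}" -> "{ind}";')
    lines.append("}")
    return "\n".join(lines)


example_ped = """\
fam1 dad 0 0 1 1
fam1 mom 0 0 2 1
fam1 kid dad mom 1 2
"""
dot_source = ped_to_dot(example_ped)
print(dot_source)
```

The printed DOT text can be saved to a file and rendered with the Graphviz dot program, for example with `dot -Tpng pedigree.dot -o pedigree.png`.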
Subjects
Computational Biology/methods, Pedigree, Software, Humans
ABSTRACT
Intracranial germ cell tumours (IGCTs) are a group of rare heterogeneous brain tumours that are clinically and histologically similar to the more common gonadal GCTs. IGCTs show great variation in their geographical and gender distribution, histological composition and treatment outcomes. The incidence of IGCTs is historically five- to eightfold greater in Japan and other East Asian countries than in Western countries, with peak incidence near the time of puberty. About half of the tumours are located in the pineal region. The male-to-female incidence ratio is approximately 3-4:1 overall, but is even higher for tumours located in the pineal region. Owing to the scarcity of tumour specimens available for research, little is currently known about this rare disease. Here we report the analysis of 62 cases by next-generation sequencing, single nucleotide polymorphism array and expression array. We find the KIT/RAS signalling pathway frequently mutated in more than 50% of IGCTs, including novel recurrent somatic mutations in KIT, its downstream mediators KRAS and NRAS, and its negative regulator CBL. Novel somatic alterations in the AKT/mTOR pathway included copy number gains of the AKT1 locus at 14q32.33 in 19% of patients, with corresponding upregulation of AKT1 expression. We identified loss-of-function mutations in BCORL1, a transcriptional co-repressor and tumour suppressor. We report significant enrichment of novel and rare germline variants in JMJD1C, which codes for a histone demethylase and is a coactivator of the androgen receptor, among Japanese IGCT patients. This study establishes a molecular foundation for understanding the biology of IGCTs and suggests potentially promising therapeutic strategies focusing on the inhibition of KIT/RAS activation and the AKT1/mTOR pathway.
Subjects
Brain Neoplasms/genetics, Germ-Line Mutation/genetics, Mutation/genetics, Germ Cell and Embryonal Neoplasms/genetics, Adult, Brain Neoplasms/pathology, Child, Female, Humans, Japan, Male, Germ Cell and Embryonal Neoplasms/pathology, Oncogene Protein v-akt/genetics, Proto-Oncogene Proteins c-kit/genetics, Reproducibility of Results, Signal Transduction/genetics, TOR Serine-Threonine Kinases/genetics, Young Adult, ras Proteins/genetics
ABSTRACT
PURPOSE: EPHB4 variants were recently reported to cause capillary malformation-arteriovenous malformation 2 (CM-AVM2). CM-AVM2 mimics RASA1-related CM-AVM1 and hereditary hemorrhagic telangiectasia (HHT), as clinical features include capillary malformations (CMs), telangiectasia, and arteriovenous malformations (AVMs). Epistaxis, another clinical feature that overlaps with HHT, was reported in several cases. Based on the clinical overlap of CM-AVM2 and HHT, we hypothesized that patients considered clinically suspicious for HHT with no variant detected in an HHT gene (ENG, ACVRL1, or SMAD4) may have an EPHB4 variant. METHODS: Exome sequencing or a next-generation sequencing panel including EPHB4 was performed on individuals with previously negative molecular genetic testing for the HHT genes and/or RASA1. RESULTS: An EPHB4 variant was identified in ten unrelated cases. Seven cases had a pathogenic EPHB4 variant, including one with mosaicism. Three cases had an EPHB4 variant of uncertain significance. The majority had epistaxis (6/10 cases) and telangiectasia (8/10 cases), as well as CMs. Two of ten cases had a central nervous system AVM. CONCLUSIONS: Our results emphasize the importance of considering CM-AVM2 as part of the clinical differential for HHT and other vascular malformation syndromes. Yet, these cases highlight significant differences in the cutaneous presentations of CM-AVM2 versus HHT.
Subjects
Capillaries/abnormalities, Genetic Testing, EphB4 Receptor/genetics, Hereditary Hemorrhagic Telangiectasia/genetics, Vascular Malformations/genetics, Type II Activin Receptors/genetics, Adolescent, Capillaries/pathology, Child, Endoglin/genetics, Female, Humans, Male, Mutation, Smad4 Protein/genetics, Hereditary Hemorrhagic Telangiectasia/diagnosis, Hereditary Hemorrhagic Telangiectasia/pathology, Vascular Malformations/pathology, Exome Sequencing
Subjects
B-Cell Chronic Lymphocytic Leukemia, Adenine/analogs & derivatives, Benzamides, Humans, B-Cell Chronic Lymphocytic Leukemia/drug therapy, B-Cell Chronic Lymphocytic Leukemia/genetics, Mutation, Local Neoplasm Recurrence, Piperidines, Protein Kinase Inhibitors/therapeutic use, Pyrazines
ABSTRACT
INTRODUCTION: Hereditary haemorrhagic telangiectasia (HHT) is a genetically heterogeneous disorder caused by mutations in the genes ENG, ACVRL1, and SMAD4. Yet the genetic cause remains unknown for some families even after exhaustive exome analysis. We hypothesised that non-coding regions of the known HHT genes may harbour variants that disrupt splicing in these cases. METHODS: DNA from 35 individuals with clinical findings of HHT and 2 healthy controls from 13 families underwent whole genome sequencing. Additionally, 87 unrelated cases suspected to have HHT were evaluated using a custom designed next-generation sequencing panel to capture the coding and non-coding regions of ENG, ACVRL1 and SMAD4. Individuals from both groups had tested negative previously for a mutation in the coding region of known HHT genes. Samples were sequenced on a HiSeq2500 instrument and data were analysed to identify novel and rare variants. RESULTS: Eight cases had a novel non-coding ACVRL1 variant that disrupted splicing. One family had an ACVRL1 intron 9:chromosome 3 translocation, the first reported case of a translocation causing HHT. The other seven cases had a variant located within a ~300 bp CT-rich 'hotspot' region of ACVRL1 intron 9 that disrupted splicing. CONCLUSIONS: Despite the difficulty of interpreting deep intronic variants, our study highlights the importance of non-coding regions in the disease mechanism of HHT, particularly the CT-rich hotspot region of ACVRL1 intron 9. The addition of this region to HHT molecular diagnostic testing algorithms will improve clinical sensitivity.
Subjects
Type II Activin Receptors/genetics, Genomics, Introns, Mutation, RNA Splicing, Hereditary Hemorrhagic Telangiectasia/diagnosis, Hereditary Hemorrhagic Telangiectasia/genetics, Base Sequence, Case-Control Studies, Chromosome Mapping, Computational Biology/methods, Female, Genetic Association Studies/methods, Genetic Predisposition to Disease, Genomics/methods, High-Throughput Nucleotide Sequencing, Humans, Male, Multigene Family, Pedigree, Untranslated RNA, DNA Sequence Analysis, Genetic Translocation
ABSTRACT
SpeedSeq is an open-source genome analysis platform that accomplishes alignment, variant detection and functional annotation of a 50× human genome in 13 h on a low-cost server and alleviates a bioinformatics bottleneck that typically demands weeks of computation with extensive hands-on expert involvement. SpeedSeq offers performance competitive with or superior to current methods for detecting germline and somatic single-nucleotide variants, structural variants, insertions and deletions, and it includes novel functionality for streamlined interpretation.
Subjects
Human Genome, High-Throughput Nucleotide Sequencing/methods, Molecular Sequence Annotation/methods, Software, Genetic Variation, Humans, Neoplasms/genetics, Single Nucleotide Polymorphism, Precision Medicine/methods, Workflow
ABSTRACT
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
Subjects
Genetic Variation/genetics, Population Genetics, Human Genome/genetics, Genomics, Alleles, Binding Sites/genetics, Conserved Sequence/genetics, Molecular Evolution, Medical Genetics, Genome-Wide Association Study, Haplotypes/genetics, Humans, Nucleotide Motifs, Single Nucleotide Polymorphism/genetics, Racial Groups/genetics, Sequence Deletion/genetics, Transcription Factors/metabolism
ABSTRACT
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
Subjects
DNA Copy Number Variations/genetics, Population Genetics, Human Genome/genetics, Genomics, Gene Duplication/genetics, Genetic Predisposition to Disease/genetics, Genotype, Humans, Insertional Mutagenesis/genetics, Reproducibility of Results, DNA Sequence Analysis, Sequence Deletion/genetics
ABSTRACT
The simultaneous targeting of host and pathogen processes represents an untapped approach for the treatment of intracellular infections. Hypoxia-inducible factor-1 (HIF-1) is a host cell transcription factor that is activated by and required for the growth of the intracellular protozoan parasite Toxoplasma gondii at physiological oxygen levels. Parasite activation of HIF-1 is blocked by inhibiting the family of closely related Activin-Like Kinase (ALK) host cell receptors ALK4, ALK5, and ALK7, which was determined in part by use of an ALK4,5,7 inhibitor named SB505124. Besides inhibiting HIF-1 activation, SB505124 also potently blocks parasite replication under normoxic conditions. To determine whether SB505124 inhibition of parasite growth was exclusively due to inhibition of ALK4,5,7 or because the drug inhibited a second kinase, SB505124-resistant parasites were isolated by chemical mutagenesis. Whole-genome sequencing of these mutants revealed mutations in the Toxoplasma MAP kinase, TgMAPK1. Allelic replacement of mutant TgMAPK1 alleles into wild-type parasites was sufficient to confer SB505124 resistance. SB505124 independently impacts TgMAPK1 and ALK4,5,7 signaling since drug resistant parasites could not activate HIF-1 in the presence of SB505124 or grow in HIF-1 deficient cells. In addition, TgMAPK1 kinase activity is inhibited by SB505124. Finally, mice treated with SB505124 had significantly lower tissue burdens following Toxoplasma infection. These data therefore identify SB505124 as a novel small molecule inhibitor that acts by inhibiting two distinct targets, host HIF-1 and TgMAPK1.
Subjects
Type I Activin Receptors/antagonists & inhibitors, Hypoxia-Inducible Factor 1/antagonists & inhibitors, Mitogen-Activated Protein Kinase 1/antagonists & inhibitors, Toxoplasma/growth & development, Animals, Base Sequence, Benzodioxoles/pharmacology, Catalytic Domain/drug effects, Catalytic Domain/genetics, Protozoan DNA/genetics, Drug Resistance/genetics, Protozoan Genome/genetics, Host-Parasite Interactions/genetics, Hypoxia-Inducible Factor 1/genetics, Imidazoles/pharmacology, Mice, Inbred C57BL Mice, Mitogen-Activated Protein Kinase 1/genetics, Protozoan Proteins/antagonists & inhibitors, Protozoan Proteins/genetics, Pyridines/pharmacology, DNA Sequence Analysis, Signal Transduction/drug effects, Signal Transduction/genetics, Toxoplasma/genetics
ABSTRACT
BACKGROUND: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. RESULTS: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. CONCLUSIONS: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
Subjects
Exome/genetics, INDEL Mutation/genetics, Mutagenesis, Computational Biology, Human Genome, High-Throughput Nucleotide Sequencing, Human Genome Project, Humans, Machine Learning
ABSTRACT
BACKGROUND: Mobile elements (MEs) constitute greater than 50% of the human genome as a result of repeated insertion events during human genome evolution. Although most of these elements are now fixed in the population, some MEs, including ALU, L1, SVA and HERV-K elements, are still actively duplicating. Mobile element insertions (MEIs) have been associated with human genetic disorders, including Crohn's disease, hemophilia, and various types of cancer, motivating the need for accurate MEI detection methods. To comprehensively identify and accurately characterize these variants in whole genome next-generation sequencing (NGS) data, a computationally efficient detection and genotyping method is required. Current computational tools are unable to call MEI polymorphisms with sufficiently high sensitivity and specificity, or call individual genotypes with sufficiently high accuracy. RESULTS: Here we report Tangram, a computationally efficient MEI detection program that integrates read-pair (RP) and split-read (SR) mapping signals to detect MEI events. By utilizing SR mapping in its primary detection module, a feature unique to this software, Tangram is able to pinpoint MEI breakpoints with single-nucleotide precision. To understand the role of MEI events in disease, it is essential to produce accurate individual genotypes in clinical samples. Tangram is able to determine sample genotypes with very high accuracy. Using simulations and experimental datasets, we demonstrate that Tangram has superior sensitivity, specificity, breakpoint resolution and genotyping accuracy, when compared to other, recently developed MEI detection methods. CONCLUSIONS: Tangram serves as the primary MEI detection tool in the 1000 Genomes Project, and is implemented as a highly portable, memory-efficient, easy-to-use C++ computer program, built under an open-source development model.
Subjects
Algorithms, Alu Elements, Human Chromosome Pair 22/genetics, Computational Biology/methods, Human Genome, Genotype, Humans, Genetic Models, Sensitivity and Specificity
ABSTRACT
BACKGROUND: Next generation sequencing is helping to overcome limitations in organisms less accessible to classical or reverse genetic methods by facilitating whole genome mutational analysis studies. One traditionally intractable group, the Apicomplexa, contains several important pathogenic protozoan parasites, including the Plasmodium species that cause malaria. Here we apply whole genome analysis methods to the relatively accessible model apicomplexan, Toxoplasma gondii, to optimize forward genetic methods for chemical mutagenesis using N-ethyl-N-nitrosourea (ENU) and ethylmethane sulfonate (EMS) at varying dosages. RESULTS: By comparing three different lab-strains we show that spontaneously generated mutations reflect genome composition, without nucleotide bias. However, the single nucleotide variations (SNVs) are not distributed randomly over the genome; most of these mutations reside either in non-coding sequence or are silent with respect to protein coding. This is in contrast to the random genomic distribution of mutations induced by chemical mutagenesis. Additionally, we report a genome-wide transition vs transversion ratio (ti/tv) of 0.91 for spontaneous mutations in Toxoplasma, with slightly higher ratios of 1.20 and 1.06 for variants induced by ENU and EMS, respectively. We also show that in the Toxoplasma system, surprisingly, both ENU and EMS have a proclivity for inducing mutations at A/T base pairs (78.6% and 69.6%, respectively). CONCLUSIONS: The number of SNVs between related laboratory strains is relatively low and managed by purifying selection away from changes to amino acid sequence. From an experimental mutagenesis point of view, both ENU (24.7%) and EMS (29.1%) are more likely to generate variation within exons than would naturally accumulate over time in culture (19.1%), demonstrating the utility of these approaches for yielding proportionally greater changes to the amino acid sequence.
These results will not only direct the methods of future chemical mutagenesis in Toxoplasma, but also aid in designing forward genetic approaches in less accessible pathogenic protozoa.
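The transition vs transversion ratio (ti/tv) reported in this abstract is a simple count ratio, which the following Python sketch illustrates. It is not the authors' pipeline. It assumes SNVs are given as (reference base, alternate base) pairs; the function name ti_tv_ratio and the toy input are hypothetical.

```python
# Transitions exchange purine for purine (A<->G) or pyrimidine for
# pyrimidine (C<->T); every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for (ref, alt) base pairs."""
    ti = 0
    tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a variant; ignore
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Toy input (made up for illustration): two transitions, two transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
toy_ratio = ti_tv_ratio(toy_snvs)
print(toy_ratio)
```

If all twelve possible base changes were equally likely, four would be transitions and eight transversions, so the expected ratio would be 0.5.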
Subjects
Genome, Toxoplasma/genetics, Adenosine/genetics, Adenosine/metabolism, Amino Acid Sequence, Base Pairing, Cell Line, Ethyl Methanesulfonate/toxicity, Ethylnitrosourea/toxicity, High-Throughput Nucleotide Sequencing, Humans, Molecular Sequence Data, Pentosyltransferases/genetics, Pentosyltransferases/metabolism, Phenotype, Point Mutation, Protozoan Proteins/genetics, Protozoan Proteins/metabolism, Thymidine/genetics, Thymidine/metabolism, Toxoplasma/drug effects
ABSTRACT
MOTIVATION: High-throughput biological research requires simultaneous visualization as well as analysis of genomic data, e.g. read alignments, variant calls and genomic annotations. Traditionally, such integrative analysis required desktop applications operating on locally stored data. Many current terabyte-size datasets generated by large public consortia projects, however, are already only feasibly stored at specialist genome analysis centers. As even small laboratories can afford very large datasets, local storage and analysis are becoming increasingly limiting, and it is likely that most such datasets will soon be stored remotely, e.g. in the cloud. These developments will require web-based tools that enable users to access, analyze and view vast remotely stored data with a level of sophistication and interactivity that approximates desktop applications. As rapidly dropping cost enables researchers to collect data intended to answer questions in very specialized contexts, developers must also provide software libraries that empower users to implement customized data analyses and data views for their particular application. Such specialized, yet lightweight, applications would empower scientists to better answer specific biological questions than possible with general-purpose genome browsers currently available. RESULTS: Using recent advances in core web technologies (HTML5), we developed Scribl, a flexible genomic visualization library specifically targeting coordinate-based data such as genomic features, DNA sequence and genetic variants. Scribl simplifies the development of sophisticated web-based graphical tools that approach the dynamism and interactivity of desktop applications. AVAILABILITY AND IMPLEMENTATION: Software is freely available online at http://chmille4.github.com/Scribl/ and is implemented in JavaScript with all modern browsers supported.
Subjects
Computer Graphics, Genomics/methods, Software, Human Chromosomes, Humans, Internet
ABSTRACT
MOTIVATION: A common question arises at the beginning of every experiment where RNA-Seq is used to detect differential gene expression between two conditions: How many reads should we sequence? RESULTS: Scotty is an interactive web-based application that helps biologists design an experiment with an appropriate sample size and read depth to satisfy user-defined experimental objectives. This design can be based on data from either pilot samples or publicly available datasets. AVAILABILITY: Scotty can be freely accessed on the web at http://euler.bc.edu/marthlab/scotty/scotty.php
Subjects
Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Software, Gene Expression, Humans, Internet
ABSTRACT
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ~1,000 sequenced chromosomes per population, whereas ~2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.