Results 1 - 20 of 26
1.
Bioinform Adv ; 3(1): vbac101, 2023.
Article in English | MEDLINE | ID: mdl-36726731

ABSTRACT

Summary: Nanopore reads encode information on the methylation status of cytosines in CpG dinucleotides. The length of the reads makes it comparatively easy to look at patterns consisting of multiple loci; here, we exploit this property to search for regions where one can define subpopulations of molecules based on methylation patterns. As an example, we run our clustering algorithm on known imprinted genes; we also scan chromosome 15 looking for windows corresponding to heterogeneous methylation. Our software can also compute the covariance of methylation across these regions while taking into account the mixture of different types of reads. Availability and implementation: https://github.com/EmanueleRaineri/cvlr. Contact: simon.heath@cnag.crg.eu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

2.
Nucleic Acids Res ; 48(8): 4066-4080, 2020 05 07.
Article in English | MEDLINE | ID: mdl-32182345

ABSTRACT

We introduce an R package and a web-based visualization tool for the representation, analysis and integration of epigenomic data in the context of 3D chromatin interaction networks. GARDEN-NET allows for the projection of user-submitted genomic features on pre-loaded chromatin interaction networks, exploiting the functionalities of the ChAseR package to explore the features in combination with chromatin network topology properties. We demonstrate the approach using published epigenomic and chromatin structure datasets in haematopoietic cells, including a collection of gene expression, DNA methylation and histone modifications data in primary healthy myeloid cells from hundreds of individuals. These datasets allow us to test the robustness of chromatin assortativity, which highlights which epigenomic features, alone or in combination, are more strongly associated with 3D genome architecture. We find evidence that genomic regions with specific histone modifications, DNA methylation, and gene expression levels form preferential contacts in 3D nuclear space, to an extent that depends on the cell type and lineage. Finally, we examine replication timing data and find it to be the genomic feature most strongly associated with overall 3D chromatin organization at multiple scales, consistent with previous results from the literature.


Subject(s)
Chromatin/metabolism , Epigenesis, Genetic , Hematopoietic Stem Cells/metabolism , Software , B-Lymphocytes/metabolism , DNA Methylation , DNA Replication Timing , Gene Expression , Histone Code , Humans , Neutrophils/metabolism , Promoter Regions, Genetic
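As a rough illustration of the chromatin assortativity idea in the abstract above, the sketch below computes a Newman-style assortativity coefficient: the Pearson correlation of a feature's values at the two ends of each network edge. The function name, the toy network, and the feature values are hypothetical, and ChAseR's own implementation and API may well differ.

```python
def feature_assortativity(edges, feature):
    """Pearson correlation of a node feature across the two ends of each edge.

    Each undirected edge is counted in both directions, so the result does
    not depend on the order in which the two nodes of an edge are listed.
    Raises ZeroDivisionError if the feature is constant over all edge ends.
    """
    xs, ys = [], []
    for u, v in edges:
        xs += [feature[u], feature[v]]
        ys += [feature[v], feature[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var


# Hypothetical toy network: fragments a-b and c-d are in contact.
toy_edges = [("a", "b"), ("c", "d")]
toy_feature = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 2.0}
print(feature_assortativity(toy_edges, toy_feature))
```

Values near 1 would indicate that fragments with similar feature levels tend to contact each other; values near -1 would indicate the opposite.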
3.
PLoS Comput Biol ; 15(11): e1007496, 2019 11.
Article in English | MEDLINE | ID: mdl-31765368

ABSTRACT

The sheer size of the human genome makes it improbable that identical somatic mutations at the exact same position are observed in multiple tumours solely by chance. The scarcity of cancer driver mutations also precludes positive selection as the sole explanation. Therefore, recurrent mutations may be highly informative of characteristics of mutational processes. To explore the potential, we use recurrence as a starting point to cluster >2,500 whole genomes of a pan-cancer cohort. We describe each genome with 13 recurrence-based and 29 general mutational features. Using principal component analysis we reduce the dimensionality and create independent features. We apply hierarchical clustering to the first 18 principal components followed by k-means clustering. We show that the resulting 16 clusters capture clinically relevant cancer phenotypes. High levels of recurrent substitutions separate the clusters that we link to UV-light exposure and deregulated activity of POLE from the one representing defective mismatch repair, which shows high levels of recurrent insertions/deletions. Recurrence of both mutation types characterizes cancer genomes with somatic hypermutation of immunoglobulin genes and the cluster of genomes exposed to gastric acid. Low levels of recurrence are observed for the cluster where tobacco-smoke exposure induces mutagenesis and the one linked to increased activity of cytidine deaminases. Notably, the majority of substitutions are recurrent in a single tumour type, while recurrent insertions/deletions point to shared processes between tumour types. Recurrence also reveals susceptible sequence motifs, including TT[C>A]TTT and AAC[T>G]T for the POLE and 'gastric-acid exposure' clusters, respectively. Moreover, we refine knowledge of mutagenesis, including increased C/G deletion levels in general for lung tumours and specifically in midsize homopolymer sequence contexts for microsatellite instable tumours. 
Our findings are an important step towards the development of a generic cancer diagnostic test for clinical practice based on whole-genome sequencing that could replace multiple diagnostics currently in use.


Subject(s)
Computational Biology/methods , Neoplasms/classification , Neoplasms/genetics , Cohort Studies , Databases, Nucleic Acid , Genetic Predisposition to Disease/genetics , Genome, Human/genetics , Humans , INDEL Mutation/genetics , Mutagenesis/genetics , Mutation/genetics , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA/methods , Sequence Deletion/genetics
4.
Nucleic Acids Res ; 47(6): 2778-2792, 2019 04 08.
Article in English | MEDLINE | ID: mdl-30799488

ABSTRACT

The concept of tissue-specific gene expression posits that lineage-determining transcription factors (LDTFs) determine the open chromatin profile of a cell via collaborative binding, providing molecular beacons to signal-dependent transcription factors (SDTFs). However, the guiding principles of LDTF binding, chromatin accessibility and enhancer activity have not yet been systematically evaluated. We sought to study these features of the macrophage genome by the combination of experimental (ChIP-seq, ATAC-seq and GRO-seq) and computational approaches. We show that Random Forest and Support Vector Regression machine learning methods can accurately predict chromatin accessibility using the binding patterns of the LDTF PU.1 and four other key TFs of macrophages (IRF8, JUNB, CEBPA and RUNX1). None of these TFs alone was sufficient to predict open chromatin, indicating that TF binding is widespread at closed or weakly opened chromatin regions. Analysis of the PU.1 cistrome revealed that two-thirds of PU.1 binding occurs at chromatin with low accessibility. We termed these sites labelled regulatory elements (LREs), which may represent a dormant state of a future enhancer and contribute to macrophage cellular plasticity. Collectively, our work demonstrates the existence of LREs occupied by various key TFs, regulating specific gene expression programs triggered by divergent macrophage polarizing stimuli.


Subject(s)
Chromatin Assembly and Disassembly/physiology , Macrophages/metabolism , Regulatory Sequences, Nucleic Acid , Transcription Factors/metabolism , Animals , Cells, Cultured , Computational Biology , Gene Expression Regulation/physiology , Genome , Machine Learning , Mice , Mice, Inbred C57BL , Protein Binding/physiology , Staining and Labeling/methods , Transcriptional Activation/physiology
5.
Biotechnol Bioeng ; 116(3): 677-692, 2019 03.
Article in English | MEDLINE | ID: mdl-30512195

ABSTRACT

The existence of dynamic cellular phenotypes in changing environmental conditions is of major interest for cell biologists who aim to understand the mechanism and sequence of regulation of gene expression. In the context of therapeutic protein production by Chinese Hamster Ovary (CHO) cells, a detailed temporal understanding of cell-line behavior and control is necessary to achieve a more predictable and reliable process performance. Of particular interest are data on dynamic, temporally resolved transcriptional regulation of genes in response to altered substrate availability and culture conditions. In this study, the gene transcription dynamics throughout a 9-day batch culture of CHO cells was examined by analyzing histone modifications and gene expression profiles in regular 12- and 24-hr intervals, respectively. Three levels of regulation were observed: (a) the presence or absence of DNA methylation in the promoter region provides an ON/OFF switch; (b) a temporally resolved correlation is observed between the presence of active transcription- and promoter-specific histone marks and the expression level of the respective genes; and (c) a major mechanism of gene regulation is identified by interaction of coding genes with long non-coding RNA (lncRNA), as observed in the regulation of the expression level of both neighboring coding/lnc gene pairs and of gene pairs where the lncRNA is able to form RNA-DNA-DNA triplexes. Such triplex-forming regions were predominantly found in the promoter or enhancer region of the targeted coding gene. Significantly, the coding genes with the highest degree of variation in expression during the batch culture are characterized by a larger number of possible triplex-forming interactions with differentially expressed lncRNAs. This indicates a specific role of lncRNA-triplexes in enabling rapid and large changes in transcription. 
A more comprehensive understanding of these regulatory mechanisms will provide an opportunity for new tools to control cellular behavior and to engineer enhanced phenotypes.


Subject(s)
Batch Cell Culture Techniques/methods , Epigenesis, Genetic/genetics , Gene Expression Regulation/genetics , Adaptation, Physiological , Animals , CHO Cells , Cricetinae , Cricetulus , Gene Expression Profiling , RNA, Long Noncoding/genetics , Transcriptome
6.
Theor Popul Biol ; 123: 70-79, 2018 09.
Article in English | MEDLINE | ID: mdl-29964061

ABSTRACT

We introduce the conditional Site Frequency Spectrum (SFS) for a genomic region linked to a focal mutation of known frequency. An exact expression for its expected value is provided for the neutral model without recombination. Its relation with the expected SFS for two sites, 2-SFS, is discussed. These spectra derive from the coalescent approach of Fu (1995) for finite samples, which is reviewed. Remarkably simple expressions are obtained for the linked SFS of a large population, which are also solutions of the multi-allelic Kolmogorov equations. These formulae are the immediate extensions of the well known single site θ∕f neutral SFS. Besides the general interest in these spectra, they relate to relevant biological cases, such as structural variants and introgressions. As an application, a recipe to adapt Tajima's D and other SFS-based neutrality tests to a non-recombining region containing a neutral marker is presented.


Subject(s)
Genetics, Population/methods , Models, Genetic , Mutation Rate , Evolution, Molecular , Linkage Disequilibrium , Selection, Genetic
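For reference, the standard single-site neutral spectrum that the abstract above extends can be written as follows. This is the textbook result for the standard neutral model, not the paper's conditional spectrum. The notation is assumed here: ξ_k is the number of sites at which the derived allele appears k times in a sample of n sequences, S is the number of segregating sites, and θ is the population-scaled mutation rate of the region.

```latex
\mathbb{E}[\xi_k] = \frac{\theta}{k}, \qquad k = 1, \dots, n-1,
\qquad\text{so that}\qquad
\mathbb{E}[S] = \sum_{k=1}^{n-1} \mathbb{E}[\xi_k] = \theta\, a_n,
\quad a_n = \sum_{k=1}^{n-1} \frac{1}{k}.
```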
8.
Genome Res ; 27(1): 95-106, 2017 01.
Article in English | MEDLINE | ID: mdl-27821408

ABSTRACT

The impact of RNA structures in coding sequences (CDS) within mRNAs is poorly understood. Here, we identify a novel and highly conserved mechanism of translational control involving RNA structures within coding sequences and the DEAD-box helicase Dhh1. Using yeast genetics and genome-wide ribosome profiling analyses, we show that this mechanism, initially derived from studies of the Brome Mosaic virus RNA genome, extends to yeast and human mRNAs highly enriched in membrane and secreted proteins. All Dhh1-dependent mRNAs, viral and cellular, share key common features. First, they contain long and highly structured CDSs, including a region located around nucleotide 70 after the translation initiation site; second, they are directly bound by Dhh1 with a specific binding distribution; and third, complementary experimental approaches suggest that they are activated by Dhh1 at the translation initiation step. Our results show that ribosome translocation is not the only unwinding force of CDS and uncover a novel layer of translational control that involves RNA helicases and RNA folding within CDS providing novel opportunities for regulation of membrane and secretome proteins.


Subject(s)
DEAD-box RNA Helicases/genetics , Peptide Chain Initiation, Translational , Protein Biosynthesis , RNA/genetics , Saccharomyces cerevisiae Proteins/genetics , Bromovirus/genetics , Exons/genetics , Gene Expression Regulation/genetics , Humans , Nucleic Acid Conformation , Open Reading Frames/genetics , RNA, Messenger/genetics , Ribosomes/genetics , Saccharomyces cerevisiae/genetics
9.
Cancer Cell ; 30(5): 806-821, 2016 Nov 14.
Article in English | MEDLINE | ID: mdl-27846393

ABSTRACT

We analyzed the in silico purified DNA methylation signatures of 82 mantle cell lymphomas (MCL) in comparison with cell subpopulations spanning the entire B cell lineage. We identified two MCL subgroups, respectively carrying epigenetic imprints of germinal-center-inexperienced and germinal-center-experienced B cells, and we found that DNA methylation profiles during lymphomagenesis are largely influenced by the methylation dynamics in normal B cells. An integrative epigenomic approach revealed 10,504 differentially methylated regions in regulatory elements marked by H3K27ac in MCL primary cases, including a distant enhancer showing de novo looping to the MCL oncogene SOX11. Finally, we observed that the magnitude of DNA methylation changes per case is highly variable and serves as an independent prognostic factor for MCL outcome.


Subject(s)
DNA Methylation , Enhancer Elements, Genetic , Epigenomics/methods , High-Throughput Nucleotide Sequencing/methods , Lymphoma, Mantle-Cell/genetics , B-Lymphocytes/metabolism , Cell Line, Tumor , Cell Lineage , Computer Simulation , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Humans , SOXC Transcription Factors/genetics
10.
Cell Rep ; 17(8): 2101-2111, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27851971

ABSTRACT

DNA methylation and the localization and post-translational modification of nucleosomes are interdependent factors that contribute to the generation of distinct phenotypes from genetically identical cells. With 112 whole-genome bisulfite sequencing datasets from the BLUEPRINT Epigenome Project, we analyzed the global development of DNA methylation patterns during lineage commitment and maturation of a range of immune system effector cells and the cancers that arise from them. We show clear trends in methylation patterns that are distinct in the innate and adaptive arms of the human immune system, both globally and in relation to consistently positioned nucleosomes. Most notable are a progressive loss of methylation in developing lymphocytes and the consistent occurrence of non-CG methylation in specific cell types. Cancer samples from the two lineages are further polarized, suggesting the involvement of distinct lineage-specific epigenetic mechanisms. We anticipate broad utility for this resource as a basis for further comparative epigenetic analyses.


Subject(s)
Adaptive Immunity/genetics , DNA Methylation/genetics , Immunity, Innate/genetics , B-Lymphocytes/metabolism , Base Sequence , Binding Sites , CCCTC-Binding Factor , Dinucleoside Phosphates/genetics , Exons/genetics , Humans , Lymphocytes/metabolism , Myeloid Cells/metabolism , Nucleosomes
11.
Nat Commun ; 6: 10001, 2015 Dec 09.
Article in English | MEDLINE | ID: mdl-26647970

ABSTRACT

As whole-genome sequencing for cancer genome analysis becomes a clinical tool, a full understanding of the variables affecting sequencing analysis output is required. Here, using tumour-normal sample pairs from two different types of cancer, chronic lymphocytic leukaemia and medulloblastoma, we conduct a benchmarking exercise within the context of the International Cancer Genome Consortium. We compare sequencing methods, analysis pipelines and validation methods. We show that using PCR-free methods and increasing sequencing depth to ~100× shows benefits, as long as the tumour:control coverage ratio remains balanced. We observe widely varying mutation call rates and low concordance among analysis pipelines, reflecting the artefact-prone nature of the raw data and lack of standards for dealing with the artefacts. However, we show that, using the benchmark mutation set we have created, many issues are in fact easy to remedy and have an immediate positive impact on mutation detection accuracy.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Leukemia, Lymphoid/genetics , Medulloblastoma/genetics , Mutation , Genome, Human , Humans
12.
Nat Genet ; 47(7): 746-56, 2015 Jul.
Article in English | MEDLINE | ID: mdl-26053498

ABSTRACT

We analyzed the DNA methylome of ten subpopulations spanning the entire B cell differentiation program by whole-genome bisulfite sequencing and high-density microarrays. We observed that non-CpG methylation disappeared upon B cell commitment, whereas CpG methylation changed extensively during B cell maturation, showing an accumulative pattern and affecting around 30% of all measured CpG sites. Early differentiation stages mainly displayed enhancer demethylation, which was associated with upregulation of key B cell transcription factors and affected multiple genes involved in B cell biology. Late differentiation stages, in contrast, showed extensive demethylation of heterochromatin and methylation gain at Polycomb-repressed areas, and genes with apparent functional impact in B cells were not affected. This signature, which has previously been linked to aging and cancer, was particularly widespread in mature cells with an extended lifespan. Comparing B cell neoplasms with their normal counterparts, we determined that they frequently acquire methylation changes in regions already undergoing dynamic methylation during normal B cell differentiation.


Subject(s)
B-Lymphocytes/physiology , DNA Methylation , Epigenesis, Genetic/immunology , Base Sequence , Cell Differentiation , Cells, Cultured , CpG Islands , Gene Expression Regulation, Leukemic , Genome, Human , Humans , Leukemia, B-Cell/genetics , Sequence Analysis, DNA
13.
Genome Res ; 25(4): 478-87, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25644835

ABSTRACT

While analyzing the DNA methylome of multiple myeloma (MM), a plasma cell neoplasm, by whole-genome bisulfite sequencing and high-density arrays, we observed a highly heterogeneous pattern globally characterized by regional DNA hypermethylation embedded in extensive hypomethylation. In contrast to the widely reported DNA hypermethylation of promoter-associated CpG islands (CGIs) in cancer, hypermethylated sites in MM, as opposed to normal plasma cells, were located outside CpG islands and were unexpectedly associated with intronic enhancer regions defined in normal B cells and plasma cells. Both RNA-seq and in vitro reporter assays indicated that enhancer hypermethylation is globally associated with down-regulation of its host genes. ChIP-seq and DNase-seq further revealed that DNA hypermethylation in these regions is related to enhancer decommissioning. Hypermethylated enhancer regions overlapped with binding sites of B cell-specific transcription factors (TFs) and the degree of enhancer methylation inversely correlated with expression levels of these TFs in MM. Furthermore, hypermethylated regions in MM were methylated in stem cells and gradually became demethylated during normal B-cell differentiation, suggesting that MM cells either reacquire epigenetic features of undifferentiated cells or maintain an epigenetic signature of a putative myeloma stem cell progenitor. Overall, we have identified DNA hypermethylation of developmentally regulated enhancers as a new type of epigenetic modification associated with the pathogenesis of MM.


Subject(s)
DNA Methylation/genetics , Enhancer Elements, Genetic/genetics , Multiple Myeloma/genetics , Neoplastic Stem Cells/cytology , Plasma Cells/cytology , Cell Differentiation/genetics , Cell Line, Tumor , CpG Islands/genetics , DNA, Neoplasm/genetics , Down-Regulation/genetics , Epigenesis, Genetic/genetics , Gene Expression Regulation, Neoplastic , Genome, Human/genetics , Humans , Promoter Regions, Genetic , Transcription Factors/biosynthesis , Transcription Factors/genetics
14.
PLoS One ; 9(5): e97349, 2014.
Article in English | MEDLINE | ID: mdl-24824426

ABSTRACT

We apply a known algorithm for the exact computation of inequalities between Beta distributions to assess whether a given position in a genome is differentially methylated across samples. We discuss the advantages brought by the adoption of this solution with respect to two approximations (Fisher's test and Z score). The same formalism presented here can be applied in a similar way to variant calling.


Subject(s)
DNA Methylation/genetics , Genome/genetics , Models, Genetic , Bayes Theorem , Genomics/methods , Probability
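To make the comparison in the abstract above concrete, here is a minimal sketch of one known closed form for P(X > Y) with X ~ Beta(a1, b1) and Y ~ Beta(a2, b2), valid when a1 is a positive integer. It is not claimed to be the algorithm the paper uses. Treating a = methylated reads + 1 and b = unmethylated reads + 1 (a uniform prior) is an assumption made for the example, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_greater(a1, b1, a2, b2):
    """P(X > Y) for X ~ Beta(a1, b1) and Y ~ Beta(a2, b2).

    Finite-sum closed form; a1 must be a positive integer.
    Terms are combined in log space to avoid overflow for large counts.
    """
    total = 0.0
    for i in range(a1):
        total += exp(
            log_beta(a2 + i, b1 + b2)
            - log(b1 + i)
            - log_beta(1 + i, b1)
            - log_beta(a2, b2)
        )
    return total


# Assumed example: 8 of 10 reads methylated in sample 1, 3 of 10 in sample 2.
p = prob_greater(8 + 1, 2 + 1, 3 + 1, 7 + 1)
print(p)
```

A probability close to 0 or 1 would suggest the position is differentially methylated between the two samples; where to place the threshold is left to the user.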
15.
BMC Genomics ; 14: 363, 2013 May 31.
Article in English | MEDLINE | ID: mdl-23721540

ABSTRACT

BACKGROUND: The only known albino gorilla, named Snowflake, was a male wild-born individual from Equatorial Guinea who lived at the Barcelona Zoo for almost 40 years. He was diagnosed with non-syndromic oculocutaneous albinism, i.e. white hair, light eyes, pink skin, photophobia and reduced visual acuity. Despite previous efforts to explain the genetic cause, it is still unknown. Here, we study the genetic cause of his albinism and, making use of whole-genome sequencing data, find a higher inbreeding coefficient compared to other gorillas. RESULTS: We successfully identified the causal genetic variant for Snowflake's albinism, a non-synonymous single nucleotide variant located in a transmembrane region of SLC45A2. This transporter is known to be involved in oculocutaneous albinism type 4 (OCA4) in humans. We provide experimental evidence that shows that this amino acid replacement alters the membrane spanning capability of this transmembrane region. Finally, we provide a comprehensive study of genome-wide patterns of autozygosity revealing that Snowflake's parents were related, this being the first report of inbreeding in a wild-born Western lowland gorilla. CONCLUSIONS: In this study we demonstrate how the use of whole-genome sequencing can be extended to link genotype and phenotype in non-model organisms and how, with the expected decrease in sequencing cost, it can be a powerful tool in conservation genetics (e.g., inbreeding and genetic diversity).


Subject(s)
Genomics , Gorilla gorilla/genetics , High-Throughput Nucleotide Sequencing , Inbreeding , Amino Acid Sequence , Animals , Female , Heterozygote , Male , Membrane Transport Proteins/chemistry , Membrane Transport Proteins/genetics , Microsatellite Repeats/genetics , Molecular Sequence Data , Mutation , Sequence Analysis, DNA
16.
BMC Genomics ; 14: 148, 2013 Mar 05.
Article in English | MEDLINE | ID: mdl-23497037

ABSTRACT

BACKGROUND: In contrast to international pig breeds, the Iberian breed has not been admixed with Asian germplasm. This makes it an important model to study both domestication and relevance of Asian genes in the pig. Besides, Iberian pigs exhibit high meat quality as well as appetite and propensity to obesity. Here we provide a genome-wide analysis of nucleotide and structural diversity in a reduced representation library from a pool (n=9 sows) and shotgun genomic sequence from a single sow of the highly inbred Guadyerbas strain. In the pool, we applied newly developed tools to account for the peculiarities of these data. RESULTS: A total of 254,106 SNPs in the pool (79.6 Mb covered) and 643,783 in the Guadyerbas sow (1.47 Gb covered) were called. The nucleotide diversity (1.31 × 10^-3 per bp in autosomes) is very similar to that reported in wild boar. A much lower than expected diversity in the X chromosome was confirmed (1.79 × 10^-4 per bp in the individual and 5.83 × 10^-4 per bp in the pool). A strong (0.70) correlation between recombination and variability was observed, but not with gene density or GC content. Multicopy regions affected about 4% of annotated pig genes in their entirety, and 2% of the genes partially. Genes within the lowest variability windows comprised interferon genes and, in chromosome X, genes involved in behavior like HTR2C or MECP2. A modified Hudson-Kreitman-Aguadé test for pools also indicated an accelerated evolution in genes involved in behavior, as well as in spermatogenesis and in lipid metabolism. CONCLUSIONS: This work illustrates the strength of current sequencing technologies to picture a comprehensive landscape of variability in livestock species, and to pinpoint regions containing genes potentially under selection. Among those genes, we report genes involved in behavior, including feeding behavior, and lipid metabolism. The pig X chromosome is an outlier in terms of nucleotide diversity, which suggests selective constraints. 
Our data further confirm the importance of structural variation in the species, including Iberian pigs, and allowed us to identify new paralogs for known gene families.


Subject(s)
Animals, Inbred Strains/genetics , Chromosome Mapping , Polymorphism, Single Nucleotide/genetics , Swine/genetics , Animals , Breeding , Genetic Variation , Nucleotides/genetics
17.
Nucleic Acids Res ; 40(20): 10073-83, 2012 Nov 01.
Article in English | MEDLINE | ID: mdl-22962361

ABSTRACT

High-throughput sequencing of cDNA libraries constructed from cellular RNA complements (RNA-Seq) naturally provides a digital quantitative measurement for every expressed RNA molecule. Nature, impact and mutual interference of biases in different experimental setups are, however, still poorly understood, mostly due to the lack of data from intermediate protocol steps. We analysed multiple RNA-Seq experiments, involving different sample preparation protocols and sequencing platforms: we broke them down into their common (and currently indispensable) technical components (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing), investigating how such different steps influence abundance and distribution of the sequenced reads. For each of those steps, we developed universally applicable models, which can be parameterised by empirical attributes of any experimental protocol. Our models are implemented in a computer simulation pipeline called the Flux Simulator, and we show that read distributions generated by different combinations of these models closely reproduce the evidence obtained from the corresponding experimental setups. We further demonstrate that our in silico RNA-Seq provides insights about hidden precursors that determine the final configuration of reads along gene bodies; enhancing or compensatory effects that explain apparently controversial observations can be observed. Moreover, our simulations identify hitherto unreported sources of systematic bias from RNA hydrolysis, a fragmentation technique currently employed by most RNA-Seq protocols.


Subject(s)
Computer Simulation , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Sequence Analysis, RNA , Hydrolysis , RNA/metabolism
18.
BMC Bioinformatics ; 13: 239, 2012 Sep 20.
Article in English | MEDLINE | ID: mdl-22992255

ABSTRACT

BACKGROUND: Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may have a much larger relevance than in individual SNP calling: while their impact in individual sequencing can be reduced by setting a restriction on a minimum number of reads per allele, this would have a strong and undesired effect in pools because it is unlikely that alleles at low frequency in the pool will be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g. cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool, and vice versa, there can be more than one read - or, more likely, none - from a true singleton. RESULTS: To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. 
We used it to compare the performance of snape to that of other packages. CONCLUSIONS: We present software that helps in calling SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. In order to test the behaviour of our software, we generated (through simulated coalescence) artificial genomes and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and VarScan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Alleles , Bayes Theorem , Gene Frequency , Genome , Humans , Software
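The Bayesian reasoning in the abstract above can be illustrated with a deliberately simplified sketch: a posterior over the minor allele count k among N pooled chromosomes, combining the neutral prior P(k) ∝ 1/k with a binomial read likelihood and a symmetric per-read error rate. This is not snape's actual model or interface. The function names, the default error rate, and the choice to condition on the site being segregating (k between 1 and N-1) are all assumptions made for illustration.

```python
def pooled_maf_posterior(alt_reads, total_reads, n_chrom, error=0.01):
    """Posterior over minor allele count k (1..n_chrom-1) in a pool.

    Prior: P(k) proportional to 1/k (standard neutral expectation).
    Likelihood: binomial in the reads, with a symmetric per-read error rate.
    The binomial coefficient is the same for every k, so it is omitted.
    """
    weights = {}
    for k in range(1, n_chrom):
        f = k / n_chrom
        # Probability that a single read shows the minor allele.
        p_alt = f * (1 - error) + (1 - f) * error
        likelihood = p_alt ** alt_reads * (1 - p_alt) ** (total_reads - alt_reads)
        weights[k] = likelihood / k
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def posterior_mean_frequency(posterior, n_chrom):
    """Posterior mean of the minor allele frequency f = k / n_chrom."""
    return sum(k / n_chrom * p for k, p in posterior.items())


# Assumed example: 5 minor-allele reads out of 100, pool of 50 chromosomes.
post = pooled_maf_posterior(alt_reads=5, total_reads=100, n_chrom=50)
print(posterior_mean_frequency(post, 50))
```

Swapping the `1 / k` weight for another function of k is the natural place to try a different prior, which is the flexibility the abstract describes.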
19.
Genetics ; 191(4): 1397-401, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22661328

ABSTRACT

Missing data are common in DNA sequences obtained through high-throughput sequencing. Furthermore, samples of low quality or problems in the experimental protocol often cause a loss of data even with traditional sequencing technologies. Here we propose modified estimators of variability and neutrality tests that can be naturally applied to sequences with missing data, without the need to remove bases or individuals from the analysis. Modified statistics include the Watterson estimator θW, Tajima's D, Fay and Wu's H, and HKA. We develop a general framework to take missing data into account in frequency spectrum-based neutrality tests and we derive the exact expression for the variance of these statistics under the neutral model. The neutrality tests proposed here can also be used as summary statistics to describe the information contained in other classes of data like DNA microarrays.


Subject(s)
Gene Frequency , Genetic Variation , Models, Genetic , Algorithms , Computer Simulation , Data Interpretation, Statistical , Genetics, Population
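For orientation, the sketch below implements the classical, complete-data versions of two of the statistics named in the abstract above: the Watterson estimator and Tajima's D (using the standard constants from Tajima, 1989). The paper's modified estimators for missing data are not reproduced here. The function and parameter names are hypothetical.

```python
from math import sqrt


def watterson_theta(seg_sites, n):
    """Watterson estimator: S divided by the harmonic number a1 = sum_{i<n} 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return seg_sites / a1


def tajimas_d(pi, seg_sites, n):
    """Classical Tajima's D for n sequences with no missing data.

    pi        -- mean number of pairwise differences
    seg_sites -- number of segregating sites S (must be > 0)
    """
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Assumed example values: 4 sequences, 11 segregating sites, pi = 3.0.
print(watterson_theta(11, 4), tajimas_d(3.0, 11, 4))
```

D is zero when the pairwise-difference estimate equals the Watterson estimate and negative when pairwise diversity is lower than the number of segregating sites would predict.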
20.
PLoS One ; 7(1): e30377, 2012.
Article in English | MEDLINE | ID: mdl-22276185

ABSTRACT

We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).


Subject(s)
Algorithms , Computational Biology/methods , Genome, Human/genetics , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Humans
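The notion of mappability in the abstract above can be illustrated with a naive exact-match sketch: for each position, count how often the k-mer starting there occurs in the sequence and take the inverse of that count. Treating mappability as this inverse frequency, ignoring mismatches, and ignoring the reverse strand are simplifying assumptions for the example. The GEM program itself is mapping-based and supports mismatches; this sketch does not reproduce it, and the function name is hypothetical.

```python
from collections import Counter


def kmer_mappability(sequence, k):
    """Per-position mappability with zero mismatches, forward strand only.

    Returns one value per k-mer start position: 1.0 for a unique k-mer,
    1/c for a k-mer that occurs c times in the sequence.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Toy sequence in which the 3-mer "ACG" occurs twice.
print(kmer_mappability("ACGTACGA", 3))
# → [0.5, 1.0, 1.0, 1.0, 0.5, 1.0]
```

Storing every k-mer in memory does not scale to whole genomes, which is one reason a dedicated tool is needed in practice.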
...