Results 1 - 20 of 54
1.
Brief Bioinform; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38877886

ABSTRACT

Single-cell sequencing has revolutionized our ability to dissect the heterogeneity within tumor populations. In this study, we present LoRA-TV (Low Rank Approximation with Total Variation), a novel method for clustering tumor cells based on the read depth profiles derived from single-cell sequencing data. Traditional analysis pipelines process read depth profiles of each cell individually. By aggregating shared genomic signatures distributed among individual cells using low-rank optimization and robust smoothing, the proposed method enhances clustering performance. Results from analyses of both simulated and real data demonstrate its effectiveness compared with state-of-the-art alternatives, as supported by improvements in the adjusted Rand index and computational efficiency.
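
The abstract reports clustering quality with the adjusted Rand index (ARI). As a reminder of what that score measures, here is a minimal sketch of the standard pair-counting ARI in plain Python. It is not the authors' evaluation code; the function name and the handling of degenerate inputs are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat clusterings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumption: fewer than two items counts as trivial agreement

    # Contingency counts: number of cells sharing each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)
```

A score of 1.0 means the two clusterings agree up to a renaming of the labels, and values near 0 indicate agreement no better than chance.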


Subject(s)
Neoplasms; Single-Cell Analysis; Single-Cell Analysis/methods; Humans; Neoplasms/genetics; Neoplasms/pathology; Cluster Analysis; Algorithms; Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Genomics/methods
2.
Am J Hum Genet; 109(6): 1065-1076, 2022 Jun 02.
Article in English | MEDLINE | ID: mdl-35609568

ABSTRACT

The human genome contains tens of thousands of large tandem repeats and hundreds of genes that show common and highly variable copy-number changes. Due to their large size and repetitive nature, these variable number tandem repeats (VNTRs) and multicopy genes are generally recalcitrant to standard genotyping approaches and, as a result, this class of variation is poorly characterized. However, several recent studies have demonstrated that copy-number variation of VNTRs can modify local gene expression, epigenetics, and human traits, indicating that many have a functional role. Here, using read depth from whole-genome sequencing to profile copy number, we report results of a phenome-wide association study (PheWAS) of VNTRs and multicopy genes in a discovery cohort of ∼35,000 samples, identifying 32 traits associated with copy number of 38 VNTRs and multicopy genes at 1% FDR. We replicated many of these signals in an independent cohort and observed that VNTRs showing trait associations were significantly enriched for expression QTLs with nearby genes, providing strong support for our results. Fine-mapping studies indicated that in the majority (∼90%) of cases, the VNTRs and multicopy genes we identified represent the causal variants underlying the observed associations. Furthermore, several lie in regions where prior SNV-based GWASs have failed to identify any significant associations with these traits. Our study indicates that copy number of VNTRs and multicopy genes contributes to diverse human traits and suggests that complex structural variants potentially explain some of the so-called "missing heritability" of SNV-based GWASs.


Subject(s)
DNA Copy Number Variations; Minisatellite Repeats; DNA Copy Number Variations/genetics; Genome, Human; Genome-Wide Association Study; Humans; Minisatellite Repeats/genetics; Phenotype
3.
BMC Genomics; 24(1): 616, 2023 Oct 16.
Article in English | MEDLINE | ID: mdl-37845620

ABSTRACT

BACKGROUND: Elucidating genome-wide structural variants, including copy number variations (CNVs), has gained increased significance in recent times owing to their contribution to genetic diversity and association with important pathophysiological states. The present study aimed to elucidate the high-resolution CNV map of six different global buffalo breeds using whole genome resequencing data at two coverages (10X and 30X). Post-quality control, the sequence reads were aligned to the latest draft release of the Bubaline genome. The genome-wide CNVs were elucidated using a read-depth approach in CNVnator with different bin sizes. Adjacent CNVs were concatenated into copy number variation regions (CNVRs) in different breeds and their genomic coverage was elucidated. RESULTS: Overall, the average size of CNVR was lower at 30X coverage, providing finer details. Most of the CNVRs were either deletion or duplication type, while mixed events were comparatively fewer in all breeds. The average CNVR size was lower at 30X coverage (0.013 Mb) than at 10X (0.201 Mb), with the finest variants in Banni buffaloes. The maximum number of CNVs was observed in Murrah (2627) and Pandharpuri (25,688) at 10X and 30X coverages, respectively, whereas the minimum number of CNVs was scored in Surti at both coverages (2092 and 17,373). On the other hand, the highest and lowest numbers of CNVRs were scored in Jaffarabadi (833 and 10,179 events) and Surti (783 and 7553 events) at both coverages. Deletion events outnumbered duplications in all breeds at both coverages. Gene profiling of common overlapped genes and longest CNVRs provided important insights into the evolutionary history of these breeds and indicated the genomic regions under selection in respective breeds. CONCLUSION: The present study is the first of its kind to elucidate the high-resolution CNV map in major buffalo populations using a read-depth approach on whole genome resequencing data. The results revealed important insights into the divergence of major global buffalo breeds along the evolutionary timescale.
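
The step of concatenating adjacent CNVs into CNVRs can be pictured as interval merging. The sketch below is only an illustration under stated assumptions: the abstract does not give the study's merging rule, so the tuple layout, the `max_gap` parameter, and the labelling of a region as "mixed" when deletions and duplications are merged are all hypothetical.

```python
def merge_cnvs_into_cnvrs(cnvs, max_gap=0):
    """Merge overlapping or adjacent CNV calls into CNV regions (CNVRs).

    cnvs: iterable of (chrom, start, end, kind) tuples, where kind is
          "deletion" or "duplication" (hypothetical input layout).
    max_gap: largest distance between two calls that still merges them
             (hypothetical parameter; 0 merges overlapping or touching calls).
    Returns a list of (chrom, start, end, kind); kind becomes "mixed" when
    calls of different types end up in the same region.
    """
    regions = []
    for chrom, start, end, kind in sorted(cnvs):
        if regions:
            r_chrom, r_start, r_end, r_kind = regions[-1]
            if chrom == r_chrom and start <= r_end + max_gap:
                merged_kind = r_kind if r_kind == kind else "mixed"
                regions[-1] = (r_chrom, r_start, max(r_end, end), merged_kind)
                continue
        regions.append((chrom, start, end, kind))
    return regions


calls = [
    ("chr1", 100, 200, "deletion"),
    ("chr1", 150, 300, "duplication"),
    ("chr1", 500, 600, "deletion"),
    ("chr2", 100, 200, "duplication"),
]
cnvrs = merge_cnvs_into_cnvrs(calls)
```

Sorting first lets a single left-to-right pass compare each call only with the most recently built region.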


Subject(s)
Buffaloes; DNA Copy Number Variations; Animals; Buffaloes/genetics; Genome; Sequence Analysis, DNA; Genomics/methods
4.
BMC Genomics; 24(1): 43, 2023 Jan 25.
Article in English | MEDLINE | ID: mdl-36698077

ABSTRACT

BACKGROUND: Epigenomic profiling assays such as ChIP-seq have been widely used to map the genome-wide enrichment profiles of chromatin-associated proteins and posttranslational histone modifications. Sequencing depth is a key parameter in experimental design and quality control. However, due to variable sequencing depth requirements across experimental conditions, it can be challenging to determine optimal sequencing depth, particularly for projects involving multiple targets or cell types. RESULTS: We developed the peaksat R package to provide target read depth estimates for epigenomic experiments based on the analysis of peak saturation curves. We applied peaksat to establish the distinctive read depth requirements for ChIP-seq studies of histone modifications in different cell lines. Using peaksat, we were able to estimate the target read depth required per library to obtain high-quality peak calls for downstream analysis. In addition, peaksat was applied to other sequence-enrichment methods including CUT&RUN and ATAC-seq. CONCLUSION: peaksat addresses a need for researchers to make informed decisions about whether their sequencing data has been generated to an adequate depth and subsequently sufficient meaningful peaks, and failing that, how many more reads would be required per library. peaksat is applicable to other sequence-based methods that include calling peaks in their analysis.


Subject(s)
Chromatin Immunoprecipitation Sequencing; High-Throughput Nucleotide Sequencing; Chromatin Immunoprecipitation Sequencing/methods; Sequence Analysis, DNA/methods; Gene Library
5.
Brief Bioinform; 22(6), 2021 Nov 05.
Article in English | MEDLINE | ID: mdl-34151932

ABSTRACT

Whole-genome sequencing (WGS) of parent-offspring trios has become widely used to identify causal copy number variations (CNVs) in rare and complex diseases. Existing CNV detection approaches usually do not make effective use of Mendelian inheritance in parent-offspring trios and yield low accuracy. In this study, we propose a novel integrated approach, TrioCNV2, for jointly detecting CNVs from WGS data of the parent-offspring trio. TrioCNV2 first makes use of the read depth and discordant read pairs to infer approximate locations of CNVs and then employs the split read and local de novo assembly approaches to refine the breakpoints. We use the real WGS data of two parent-offspring trios to demonstrate TrioCNV2's performance and compare it with other CNV detection approaches. The software TrioCNV2 is implemented using a combination of Java and R and is freely available from the website at https://github.com/yongzhuang/TrioCNV2.


Subject(s)
Computational Biology/methods; DNA Copy Number Variations; Genetic Association Studies/methods; Genetic Predisposition to Disease; Software; Whole Genome Sequencing; Algorithms; Chromosome Breakpoints; Family; Humans; Reproducibility of Results; Web Browser; Whole Genome Sequencing/methods; Workflow
6.
BMC Bioinformatics; 23(1): 85, 2022 Mar 05.
Article in English | MEDLINE | ID: mdl-35247967

ABSTRACT

BACKGROUND: A typical Copy Number Variations (CNVs) detection process based on the depth of coverage in the Whole Exome Sequencing (WES) data consists of several steps: (I) calculating the depth of coverage in sequencing regions, (II) quality control, (III) normalizing the depth of coverage, (IV) calling CNVs. Previous tools performed one normalization process for each chromosome: all the coverage depths in the sequencing regions from a given chromosome were normalized in a single run. METHODS: Herein, we present the new CNVind tool for calling CNVs, where the normalization process is conducted separately for each of the sequencing regions. The total number of normalizations is equal to the number of sequencing regions in the investigated dataset. For example, when analyzing a dataset composed of n sequencing regions, CNVind performs n independent depth of coverage normalizations. Before each normalization, the application selects the k most correlated sequencing regions, using Pearson's correlation of the depth of coverage as the distance metric. Then, the resulting subgroup of [Formula: see text] sequencing regions is normalized; the results of all n independent normalizations are combined; finally, the segmentation and CNV calling process is performed on the resultant dataset. RESULTS AND CONCLUSIONS: We used WES data from the 1000 Genomes project to evaluate the impact of independent normalization on CNV calling performance and compared the results with state-of-the-art tools: CODEX and exomeCopy. The results showed that independent normalization significantly improves the specificity of rare CNV detection. For example, for the investigated dataset, we reduced the number of FP calls from over 15,000 to around 5000 while maintaining a constant number of TP calls equal to about 150 CNVs. However, independent normalization of each sequencing region is a computationally expensive process; therefore, our pipeline is customized and can easily be run in a cloud computing environment, on a computer cluster, or on a single-CPU server. To our knowledge, the presented application is the first attempt to implement an innovative approach to independent normalization of the depth of WES data coverage.
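
The region-selection step described in the METHODS part can be sketched as follows. This is a rough illustration, not CNVind's code: the data layout (a dict from region name to per-sample depths), the function names, the choice to rank by signed rather than absolute correlation, and the decision to place the target region in its own group are all assumptions, and the normalization itself is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def most_correlated_regions(depth, target, k):
    """Names of the k regions whose depth profile correlates best with `target`."""
    scores = [
        (pearson(depth[target], values), name)
        for name, values in depth.items()
        if name != target
    ]
    scores.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scores[:k]]


def normalization_groups(depth, k):
    """One group per region: the region itself followed by its k best neighbours."""
    return {
        region: [region] + most_correlated_regions(depth, region, k)
        for region in depth
    }


# Toy depth-of-coverage matrix: region name -> depth in each of four samples.
depth = {
    "r1": [10, 20, 30, 40],
    "r2": [11, 19, 31, 42],
    "r3": [40, 30, 20, 10],
    "r4": [5, 5, 6, 5],
}
groups = normalization_groups(depth, k=2)
```

With n regions this builds n groups, which is why the abstract describes the approach as computationally expensive: each group is then normalized independently.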


Subject(s)
DNA Copy Number Variations; Exome; Algorithms; Cloud Computing; High-Throughput Nucleotide Sequencing/methods; Exome Sequencing
7.
Hum Mutat; 42(5): 530-536, 2021 May.
Article in English | MEDLINE | ID: mdl-33600021

ABSTRACT

Aggregate population genomics data from large cohorts are vital for assessing germline variant pathogenicity. However, there are no specifications on how sequencing quality metrics should be considered, and whether exome-derived and genome-derived allele frequencies should be considered in isolation. Germline genome sequence data were simulated for nine read-depths to identify a minimum acceptable read-depth for detecting variants. gnomAD exome-derived and genome-derived datasets were assessed for read-depth, for six key cancer genes selected for variant curation by ClinGen expert panels. Non-Finnish European allele frequency (AF) or filter AF of coding variants in these genes, assigned into frequency bins using modified ACMG-AMP criteria, was compared between exome-derived and genome-derived datasets. A 30X read-depth achieved acceptable precision and recall for detection of substitutions, but poor recall for small insertions/deletions. Exome-derived and genome-derived datasets exhibited low read-depth for different gene exons. Individual variants were mostly assigned to non-divergent AF bins (>95%) or filter AF bins (>97%). Two major bin divergences were resolved by applying the minimal acceptable read-depth threshold. These findings show the importance of assessing read-depth separately for population datasets sourced from different short-read sequencing technologies before assigning a frequency-based ACMG-AMP classification code for variant interpretation.


Subject(s)
Genome, Human; Neoplasms; Gene Frequency; Genetic Testing; Genetic Variation; Genomics; Germ Cells; Humans; Neoplasms/genetics
8.
BMC Genomics; 22(1): 446, 2021 Jun 15.
Article in English | MEDLINE | ID: mdl-34126923

ABSTRACT

BACKGROUND: The combination of sodium bisulfite treatment with highly-parallel sequencing is a common method for quantifying DNA methylation across the genome. The power to detect between-group differences in DNA methylation using bisulfite-sequencing approaches is influenced by both experimental (e.g. read depth, missing data and sample size) and biological (e.g. mean level of DNA methylation and difference between groups) parameters. There is, however, no consensus about the optimal thresholds for filtering bisulfite sequencing data with implications for the reproducibility of findings in epigenetic epidemiology. RESULTS: We used a large reduced representation bisulfite sequencing (RRBS) dataset to assess the distribution of read depth across DNA methylation sites and the extent of missing data. To investigate how various study variables influence power to identify DNA methylation differences between groups, we developed a framework for simulating bisulfite sequencing data. As expected, sequencing read depth, group size, and the magnitude of DNA methylation difference between groups all impacted upon statistical power. The influence on power was not dependent on one specific parameter, but reflected the combination of study-specific variables. As a resource to the community, we have developed a tool, POWEREDBiSeq, which utilizes our simulation framework to predict study-specific power for the identification of DNAm differences between groups, taking into account user-defined read depth filtering parameters and the minimum sample size per group. CONCLUSIONS: Our data-driven approach highlights the importance of filtering bisulfite-sequencing data by minimum read depth and illustrates how the choice of threshold is influenced by the specific study design and the expected differences between groups being compared. The POWEREDBiSeq tool, which can be applied to different types of bisulfite sequencing data (e.g. RRBS, whole genome bisulfite sequencing (WGBS), targeted bisulfite sequencing and amplicon-based bisulfite sequencing), can help users identify the level of data filtering needed to optimize power and aims to improve the reproducibility of bisulfite sequencing studies.


Subject(s)
DNA Methylation; Sulfites; Epigenomics; High-Throughput Nucleotide Sequencing; Reproducibility of Results; Sequence Analysis, DNA
9.
Am J Hum Genet; 102(1): 142-155, 2018 Jan 04.
Article in English | MEDLINE | ID: mdl-29304372

ABSTRACT

A remaining hurdle to whole-genome sequencing (WGS) becoming a first-tier genetic test has been accurate detection of copy-number variations (CNVs). Here, we used several datasets to empirically develop a detailed workflow for identifying germline CNVs >1 kb from short-read WGS data using read depth-based algorithms. Our workflow is comprehensive in that it addresses all stages of the CNV-detection process, including DNA library preparation, sequencing, quality control, reference mapping, and computational CNV identification. We used our workflow to detect rare, genic CNVs in individuals with autism spectrum disorder (ASD), and 120/120 such CNVs tested using orthogonal methods were successfully confirmed. We also identified 71 putative genic de novo CNVs in this cohort, which had a confirmation rate of 70%; the remainder were incorrectly identified as de novo due to false positives in the proband (7%) or parental false negatives (23%). In individuals with an ASD diagnosis in which both microarray and WGS experiments were performed, our workflow detected all clinically relevant CNVs identified by microarrays, as well as additional potentially pathogenic CNVs < 20 kb. Thus, CNVs of clinical relevance can be discovered from WGS with a detection rate exceeding microarrays, positioning WGS as a single assay for genetic variation detection.


Subject(s)
DNA Copy Number Variations/genetics; Whole Genome Sequencing; Workflow; Algorithms; Child; Female; Haplotypes/genetics; Humans; Male; Reproducibility of Results; Sequence Analysis, DNA
10.
BMC Bioinformatics; 21(1): 506, 2020 Nov 07.
Article in English | MEDLINE | ID: mdl-33160308

ABSTRACT

BACKGROUND: Hi-C and its variant techniques have been developed to capture the spatial organization of chromatin. Normalization of the Hi-C contact map is essential for accurate modeling and interpretation of high-throughput chromatin conformation capture (3C) experiments. Hi-C correction tools were originally developed to normalize systematic biases of karyotypically normal cell lines. However, a vast majority of available Hi-C datasets are derived from cancer cell lines that carry multi-level DNA copy number variations (CNVs). CNV regions display over- or under-representation of interaction frequencies compared to CN-neutral regions. Therefore, it is necessary to remove CNV-driven bias from chromatin interaction data of cancer cell lines to generate a euploid-equivalent contact map. RESULTS: We developed the HiCNAtra framework to compute high-resolution CNV profiles from Hi-C or 3C-seq data of cancer cell lines and to correct chromatin contact maps for systematic biases including CNV-associated bias. First, we introduce a novel 'entire-fragment' counting method for better estimation of the read depth (RD) signal from Hi-C reads that recapitulates the whole-genome sequencing (WGS)-derived coverage signal. Second, HiCNAtra employs a multimodal-based hierarchical CNV calling approach, which outperformed the OneD and HiNT tools, to accurately identify CNVs of cancer cell lines. Third, incorporating CNV information with other systematic biases, HiCNAtra simultaneously estimates the contribution of each bias and explicitly corrects the interaction matrix using Poisson regression. HiCNAtra normalization abolishes CNV-induced artifacts from the contact map, generating a heatmap with homogeneous signal. When benchmarked against the OneD, CAIC, and ICE methods using the MCF7 cancer cell line, the HiCNAtra-corrected heatmap achieves the least 1D signal variation without deforming the inherent chromatin interaction signal. Additionally, HiCNAtra-corrected contact frequencies have minimum correlations with each of the systematic bias sources compared to OneD's explicit method. Visual inspection of CNV profiles and contact maps of cancer cell lines reveals that HiCNAtra is the most robust Hi-C correction tool for ameliorating CNV-induced bias. CONCLUSIONS: HiCNAtra is a Hi-C-based computational tool that provides an analytical and visualization framework for DNA copy number profiling and chromatin contact map correction of karyotypically abnormal cell lines. HiCNAtra is open-source software implemented in MATLAB and is available at https://github.com/AISKhalil/HiCNAtra .


Subject(s)
Computational Biology/methods; DNA Copy Number Variations; Neoplasms/pathology; Chromatin/metabolism; High-Throughput Nucleotide Sequencing; Humans; MCF-7 Cells; Neoplasms/genetics; User-Computer Interface
11.
BMC Bioinformatics; 21(1): 147, 2020 Apr 16.
Article in English | MEDLINE | ID: mdl-32299346

ABSTRACT

BACKGROUND: Detection of DNA copy number alterations (CNAs) is critical to understand genetic diversity, genome evolution and pathological conditions such as cancer. Cancer genomes are plagued with widespread multi-level structural aberrations of chromosomes that pose challenges to discover CNAs of different length scales, and distinct biological origins and functions. Although several computational tools are available to identify CNAs using read depth (RD) signal, they fail to distinguish between large-scale and focal alterations due to inaccurate modeling of the RD signal of cancer genomes. Additionally, RD signal is affected by overdispersion-driven biases at low coverage, which significantly inflate false detection of CNA regions. RESULTS: We have developed the CNAtra framework to hierarchically discover and classify 'large-scale' and 'focal' copy number gain/loss from a single whole-genome sequencing (WGS) sample. CNAtra first utilizes a multimodal-based distribution to estimate the copy number (CN) reference from the complex RD profile of the cancer genome. We implemented a Savitzky-Golay smoothing filter and Modified Varri segmentation to capture the change points of the RD signal. We then developed a CN state-driven merging algorithm to identify the large segments with distinct copy numbers. Next, we identified focal alterations in each large segment using coverage-based thresholding to mitigate the adverse effects of signal variations. Using cancer cell lines and patient datasets, we confirmed CNAtra's ability to detect and distinguish the segmental aneuploidies and focal alterations. We used realistic simulated data for benchmarking the performance of CNAtra against other single-sample detection tools, where we artificially introduced CNAs in the original cancer profiles. We found that CNAtra is superior in terms of precision, recall and f-measure. CNAtra shows the highest sensitivity, 93% and 97%, for detecting large-scale and focal alterations, respectively. Visual inspection of CNAs revealed that CNAtra is the most robust detection tool for low-coverage cancer data. CONCLUSIONS: CNAtra is a single-sample CNA detection tool that provides an analytical and visualization framework for CNA profiling without relying on any reference control. It can detect chromosome-level segmental aneuploidies and high-confidence focal alterations, even from low-coverage data. CNAtra is open-source software implemented in MATLAB®. It is freely available at https://github.com/AISKhalil/CNAtra.


Subject(s)
Algorithms; DNA Copy Number Variations/genetics; Neoplasms/genetics; Whole Genome Sequencing/methods; Humans
12.
BMC Bioinformatics; 20(1): 266, 2019 May 28.
Article in English | MEDLINE | ID: mdl-31138108

ABSTRACT

BACKGROUND: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The tools reported consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential aspect of the entire process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance. METHODS: We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on CNV calling performance of three chosen state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as reference set and random selection) as well as two clustering methods (k-means and k nearest neighbours (kNN) with a variable number of clusters or group sizes) have been evaluated to discover the best performing sample selection method. RESULTS AND CONCLUSIONS: The performed experiments have shown that the appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that smart reduction of reference sample size may significantly increase the algorithms' precision while having a negligible negative effect on sensitivity. We observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than the kNN-based solution.


Subject(s)
Algorithms; DNA Copy Number Variations/genetics; Benchmarking; Databases, Genetic; Female; Humans; Male; Reference Standards; Sample Size
13.
BMC Biotechnol; 19(1): 31, 2019 Jun 04.
Article in English | MEDLINE | ID: mdl-31164119

ABSTRACT

BACKGROUND: Copy number variation (CNV) plays an important role in human genetic diversity and has been associated with multiple complex disorders. Here we investigate a CNV on chromosome 10q11.22 that spans NPY4R, the gene for the appetite-regulating pancreatic polypeptide receptor Y4. This genomic region has been challenging to map due to multiple repeated elements and its precise organization has not yet been resolved. Previous studies using microarrays were interpreted to show that the most common copy number was 2 per genome. RESULTS: We have investigated 18 individuals from the 1000 Genomes project using the well-established method of read depth analysis and the new droplet digital PCR (ddPCR) method. We find that the most common copy number for NPY4R is 4. The estimated number of copies ranged from three to seven based on read depth analyses with Control-FREEC and CNVnator, and from four to seven based on ddPCR. We suggest that the difference between our results and those published previously can be explained by methodological differences such as reference gene choice, data normalization and method reliability. Three high-quality archaic human genomes (two Neanderthal and one Denisova) display four copies of the NPY4R gene indicating that a duplication occurred prior to the human-Neanderthal/Denisova split. CONCLUSIONS: We conclude that ddPCR is a sensitive and reliable method for CNV determination, that it can be used for read depth calibration in CNV studies based on already available whole-genome sequencing data, and that further investigation of NPY4R copy number variation and its consequences are necessary due to the role of Y4 receptor in food intake regulation.
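
The core idea behind read depth copy number estimation is a ratio against a reference of known copy number. The sketch below shows only that arithmetic, under the assumption of a diploid reference region; the function and parameter names are hypothetical, and tools such as Control-FREEC and CNVnator add GC correction, binning, and segmentation that are not represented here.

```python
def estimate_copy_number(region_depth, reference_depth, reference_copies=2):
    """Estimate copy number from mean read depth.

    region_depth: mean depth over the region of interest (e.g. a gene).
    reference_depth: mean depth over a region assumed to have
                     `reference_copies` copies (assumption: diploid, 2).
    """
    if reference_depth <= 0:
        raise ValueError("reference_depth must be positive")
    return reference_copies * region_depth / reference_depth


# A region sequenced at twice the depth of a diploid reference suggests 4 copies.
raw_estimate = estimate_copy_number(region_depth=58.0, reference_depth=30.0)
called_copies = round(raw_estimate)
```

Because the estimate scales directly with the reference depth, the choice of reference gene and the normalization applied can shift the called copy number, which is consistent with the methodological explanation the abstract offers for earlier discrepancies.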


Subject(s)
DNA Copy Number Variations/genetics; Gene Dosage; Polymerase Chain Reaction/methods; Receptors, Neuropeptide Y/genetics; Sequence Analysis, DNA/methods; Genome, Human/genetics; Genomics/methods; Humans; Reproducibility of Results
14.
Mol Ecol; 28(4): 721-730, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30582650

ABSTRACT

Ribosomal DNA (rDNA) copy number variation (CNV) has major physiological implications for all organisms, but how it varies for fungi, an ecologically ubiquitous and important group of microorganisms, has yet to be systematically investigated. Here, we examine rDNA CNV using an in silico read depth approach for 91 fungal taxa with sequenced genomes and assess copy number conservation across phylogenetic scales and ecological lifestyles. rDNA copy number varied considerably across fungi, ranging from an estimated 14 to 1,442 copies (mean = 113, median = 82), and copy number similarity was inversely correlated with phylogenetic distance. No correlations were found between rDNA CNV and fungal trophic mode, ecological guild or genome size. Taken together, these results show that, like other microorganisms, fungi exhibit substantial variation in rDNA copy number, which is linked to their phylogeny in a scale-dependent manner.


Subject(s)
DNA Copy Number Variations/genetics; Phylogeny; DNA, Ribosomal/genetics; Ecology; Fungi/classification; Fungi/genetics; Genome, Fungal/genetics; Life Style
15.
Biometrics; 75(1): 210-221, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30168593

ABSTRACT

DNA methylation studies have enabled researchers to understand methylation patterns and their regulatory roles in biological processes and disease. However, only a limited number of statistical approaches have been developed to provide formal quantitative analysis. Specifically, a few available methods do identify differentially methylated CpG (DMC) sites or regions (DMR), but they suffer from limitations that arise mostly due to challenges inherent in bisulfite sequencing data. These challenges include: (1) that read-depths vary considerably among genomic positions and are often low; (2) both methylation and autocorrelation patterns change as regions change; and (3) CpG sites are distributed unevenly. Furthermore, there are several methodological limitations: almost none of these tools is capable of comparing multiple groups and/or working with missing values, and only a few allow continuous or multiple covariates. The last of these is of great interest among researchers, as the goal is often to find which regions of the genome are associated with several exposures and traits. To tackle these issues, we have developed an efficient DMC identification method based on Hidden Markov Models (HMMs) called "DMCHMM" which is a three-step approach (model selection, prediction, testing) aiming to address the aforementioned drawbacks. Our proposed method is different from other HMM methods since it profiles methylation of each sample separately, hence exploiting inter-CpG autocorrelation within samples, and it is more flexible than previous approaches by allowing multiple hidden states. Using simulations, we show that DMCHMM has the best performance among several competing methods. An analysis of cell-separated blood methylation profiles is also provided.


Subject(s)
CpG Islands/genetics; DNA Methylation; Markov Chains; Sulfites; Algorithms; Animals; Binding Sites; Blood Cells/metabolism; Computer Simulation/economics; Computer Simulation/statistics & numerical data; Humans; Sequence Analysis, DNA/methods
16.
J Med Genet; 55(11): 735-743, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30061371

ABSTRACT

BACKGROUND: Copy number variation (CNV) analysis is an integral component of the study of human genomes in both research and clinical settings. Array-based CNV analysis is the current first-tier approach in clinical cytogenetics. Decreasing costs in high-throughput sequencing and cloud computing have opened doors for the development of sequencing-based CNV analysis pipelines with fast turnaround times. We carry out a systematic and quantitative comparative analysis for several low-coverage whole-genome sequencing (WGS) strategies to detect CNV in the human genome. METHODS: We compared the CNV detection capabilities of WGS strategies (short insert, 3 kb insert mate pair and 5 kb insert mate pair) each at 1×, 3× and 5× coverages relative to each other and to 17 currently used high-density oligonucleotide arrays. For benchmarking, we used a set of gold standard (GS) CNVs generated for the 1000 Genomes Project CEU subject NA12878. RESULTS: Overall, low-coverage WGS strategies detect drastically more GS CNVs compared with arrays and are accompanied by smaller percentages of CNV calls without validation. Furthermore, we show that WGS (at ≥1× coverage) is able to detect all seven GS deletion CNVs >100 kb in NA12878, whereas only one is detected by most arrays. Lastly, we show that the much larger 15 Mbp Cri du chat deletion can be readily detected with short-insert paired-end WGS at even just 1× coverage. CONCLUSIONS: CNV analysis using low-coverage WGS is efficient and outperforms the array-based analysis that is currently used for clinical cytogenetics.


Subject(s)
Comparative Genomic Hybridization , DNA Copy Number Variations , Genome, Human , Genomics , Whole Genome Sequencing , Comparative Genomic Hybridization/methods , Comparative Genomic Hybridization/standards , Genetic Association Studies/methods , Genetic Association Studies/standards , Genetic Predisposition to Disease , Genetic Testing , Genomics/methods , Genomics/standards , Humans , Reference Standards , Reproducibility of Results , Sensitivity and Specificity
17.
BMC Bioinformatics ; 19(1): 423, 2018 Nov 14.
Article in English | MEDLINE | ID: mdl-30428853

ABSTRACT

BACKGROUND: RNA-Sequencing analysis methods are rapidly evolving, and the tool choice for each step of one common workflow, differential expression analysis, which includes read alignment, expression modeling, and differentially expressed gene identification, has a dramatic impact on performance characteristics. Although a number of workflows are emerging as high performers that are robust to diverse input types, the relative performance characteristics of these workflows when either read depth or sample number is limited (a common occurrence in real-world practice) remain unexplored. RESULTS: Here, we evaluate the impact of varying read depth and sample number on the performance of differential gene expression identification workflows, as measured by precision (the fraction of genes identified as differentially expressed that are truly differentially expressed) and by recall (the fraction of truly differentially expressed genes that are identified). We focus our analysis on 30 high-performing workflows, systematically varying the read depth and number of biological replicates of patient monocyte samples provided as input. We find that, for most workflows, read depth has little effect on workflow performance when held above two million reads per sample, with reduced workflow performance below this threshold. The greatest impact of decreased sample number is seen below seven samples per group, when more heterogeneity in workflow performance is observed. The choice of differential expression identification tool, in particular, has a large impact on the response to limited inputs. CONCLUSIONS: Among the tested workflows, the recall/precision balance remains relatively stable at a range of read depths and sample numbers, although some workflows are more sensitive to input restriction. At ranges typically recommended for biological studies, performance is affected more by the number of biological replicates than by read depth. Caution should be used when selecting analysis workflows and interpreting results from low sample number experiments, as all workflows exhibit poorer performance at lower sample numbers near typically reported values, with variable impact on recall versus precision. These analyses highlight the performance characteristics of common differential gene expression workflows at varying read depths and sample numbers, and provide empirical guidance in experimental and analytical design.
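
The two performance measures named in this abstract can be illustrated with a short sketch. It computes precision and recall from a set of genes called differentially expressed and a set of truly differentially expressed genes. The gene names and the function name are hypothetical, and this is only the generic definition of the metrics, not the evaluation code used in the study.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a set of calls against a truth set.

    precision = true positives / all called genes
    recall    = true positives / all truly differentially expressed genes
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical gene identifiers, for illustration only.
called_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
true_de_genes = {"GENE_A", "GENE_B", "GENE_E"}

precision, recall = precision_recall(called_genes, true_de_genes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four calls are correct (precision 0.5) and two of the three true genes are recovered (recall about 0.67), which shows how a workflow can trade one measure against the other.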


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , RNA/genetics , Sequence Analysis, RNA/methods , Workflow , Humans
18.
BMC Bioinformatics ; 17: 384, 2016 Sep 17.
Article in English | MEDLINE | ID: mdl-27639558

ABSTRACT

BACKGROUND: Variations in DNA copy number make an important contribution to the development of several diseases, including autism, schizophrenia and cancer. Single-cell sequencing technology allows the dissection of genomic heterogeneity at the single-cell level, thereby providing important evolutionary information about cancer cells. In contrast to traditional bulk sequencing, single-cell sequencing requires the amplification of the whole genome of a single cell to accumulate enough samples for sequencing. However, the amplification process inevitably introduces amplification bias, resulting in an over-dispersed portion of the sequencing data. A recent study has shown that the over-dispersed portion of single-cell sequencing data can be modelled well by negative binomial distributions. RESULTS: We developed a read-depth based method, nbCNV, to detect copy number variants (CNVs). The nbCNV method uses two constraints, sparsity and smoothness, to fit the CNV patterns under the assumption that the read signals follow a negative binomial distribution. The problem of CNV detection was formulated as a quadratic optimization problem and was solved by an efficient numerical solution based on the classical alternating direction minimization method. CONCLUSIONS: Extensive experiments to compare nbCNV with existing benchmark models were conducted on both simulated data and empirical single-cell sequencing data. The results of those experiments demonstrate that nbCNV achieves superior performance and high robustness for the detection of CNVs in single-cell sequencing data.
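
The over-dispersion assumption mentioned above can be sketched with the negative binomial distribution in its mean and dispersion parameterization, where the variance is mu + mu²/r and therefore always exceeds the Poisson variance mu. This is only an illustration of the distributional assumption, not the nbCNV optimization itself; the function names and the values of mu and r are hypothetical.

```python
import math


def nb_log_pmf(k, mu, r):
    """Log-probability of observing count k under a negative binomial
    with mean mu and dispersion (size) parameter r."""
    p = r / (r + mu)  # success probability in the (r, p) parameterization
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1.0 - p)
    )


def nb_variance(mu, r):
    """Variance of the negative binomial: mu + mu^2 / r (> mu for finite r)."""
    return mu + mu * mu / r


# Hypothetical mean read count per bin and dispersion, for illustration only.
mu, r = 50.0, 5.0

# The probabilities over a wide range of counts should sum to about 1.
total = sum(math.exp(nb_log_pmf(k, mu, r)) for k in range(2000))
print(f"total probability ~ {total:.6f}")
print(f"Poisson variance = {mu}, negative binomial variance = {nb_variance(mu, r)}")
```

A smaller r gives heavier over-dispersion, while letting r grow large recovers the Poisson case, which is one way to see why the negative binomial suits amplification-biased read counts.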


Subject(s)
DNA Copy Number Variations/genetics , High-Throughput Nucleotide Sequencing/methods , Single-Cell Analysis/methods , Software , Binomial Distribution , Cluster Analysis , Computer Simulation , Humans , Sequence Analysis, DNA
19.
Biostatistics ; 14(3): 600-11, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23428932

ABSTRACT

Copy number variations (CNVs) are a significant source of genetic variation and are frequently associated with diseases such as cancers and autism. High-throughput sequencing data are increasingly being used to detect and quantify CNVs; however, the distributional properties of the data are not fully understood. A hidden Markov model (HMM) is proposed using inhomogeneous emission distributions based on negative binomial regression to account for the sequencing biases. The model is tested on whole genome sequencing data and simulated data sets. An algorithm for CNV detection is implemented in the R package CNVfinder. The model based on negative binomial regression is shown to provide a good fit to the data and competitive performance compared with methods based on normalization of read counts.


Subject(s)
DNA Copy Number Variations , Models, Genetic , Models, Statistical , Algorithms , Binomial Distribution , Biostatistics , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Markov Chains , Software
20.
Genome Biol Evol ; 16(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38946312

ABSTRACT

Recent years have seen a dramatic increase in the number of canine genome assemblies available. Duplications are an important source of evolutionary novelty and are also prone to misassembly. We explored the duplication content of nine canine genome assemblies using both genome self-alignment and read-depth approaches. We find that 8.58% of the genome is duplicated in the canFam4 assembly, derived from the German Shepherd Dog Mischka, including 90.15% of unplaced contigs. Highlighting the continued difficulty in properly assembling duplications, less than half of read-depth and assembly alignment duplications overlap, but the mCanLor1.2 Greenland wolf assembly shows greater concordance. Further study shows the presence of multiple segments that have alignments to four or more duplicate copies. These high-recurrence duplications correspond to gene retrocopies. We identified 3,892 candidate retrocopies from 1,316 parental genes in the canFam4 assembly and find that ∼8.82% of duplicated base pairs involve a retrocopy, confirming this mechanism as a major driver of gene duplication in canines. Similar patterns are found across eight other recent canine genome assemblies, with metrics supporting a greater quality of the PacBio HiFi mCanLor1.2 assembly. Comparison between the wolf and other canine assemblies found that 92% of retrocopy insertions are shared between assemblies. By calculating the number of generations since genome divergence, we estimate that new retrocopy insertions appear, on average, in 1 out of 3,514 births. Our analyses illustrate the impact of retrogene formation on canine genomes and highlight the variable representation of duplicated sequences among recently completed canine assemblies.


Subject(s)
Gene Duplication , Genome , Dogs/genetics , Animals , Genomics , Evolution, Molecular , Retroelements