Results 1 - 10 of 10
1.
Nucleic Acids Res ; 47(13): e77, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31045217

ABSTRACT

The availability of genome-wide epigenomic datasets enables in-depth studies of epigenetic modifications and their relationships with chromatin structures and gene expression. Various alignment tools have been developed to align nucleotide or protein sequences in order to identify structurally similar regions. However, there are currently no alignment methods specifically designed for comparing multi-track epigenomic signals and detecting common patterns that may explain functional or evolutionary similarities. We propose a new local alignment algorithm, EpiAlign, designed to compare chromatin state sequences learned from multi-track epigenomic signals and to identify locally aligned chromatin regions. EpiAlign is a dynamic programming algorithm whose novelty lies in incorporating the varying lengths and frequencies of chromatin states. We demonstrate the efficacy of EpiAlign through extensive simulations and studies on real data from the NIH Roadmap Epigenomics project. EpiAlign is able to extract recurrent chromatin state patterns along a single epigenome, and many of these patterns carry cell-type-specific characteristics. EpiAlign can also detect common chromatin state patterns across multiple epigenomes, and it will serve as a useful tool to group and distinguish epigenomic samples based on genome-wide or local chromatin state patterns.


Subject(s)
Chromatin/ultrastructure , Computational Biology/methods , Epigenomics/methods , Sequence Alignment , Algorithms , Base Sequence , Brain Chemistry , Chromatin/genetics , DNA Methylation , Databases, Genetic , Datasets as Topic , Gene Ontology , Humans , Nerve Tissue Proteins/biosynthesis , Nerve Tissue Proteins/chemistry , Nerve Tissue Proteins/genetics , Software
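
The EpiAlign abstract above describes a dynamic programming local alignment over chromatin state sequences that takes run lengths into account. The following is a minimal sketch of that general idea, not EpiAlign's actual scoring scheme: it run-length compresses each state sequence and applies a Smith-Waterman-style recurrence. The function names, the `match`, `mismatch`, and `gap` values, and the length-ratio scoring rule are all assumptions made for illustration; state frequencies, which the abstract also mentions, are not modeled here.

```python
from itertools import groupby


def compress(states):
    """Run-length encode a state sequence, e.g. 'AAABB' -> [('A', 3), ('B', 2)]."""
    return [(state, len(list(run))) for state, run in groupby(states)]


def local_align(seq_a, seq_b, match=2.0, mismatch=-1.0, gap=-1.0):
    """Best local alignment score between two run-length-compressed sequences.

    Illustrative assumption: two runs of the same state score `match` scaled
    by (shorter run length / longer run length), so runs of similar length
    score higher. This is not the scoring used by EpiAlign.
    """
    a, b = compress(seq_a), compress(seq_b)
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            (state_a, len_a), (state_b, len_b) = a[i - 1], b[j - 1]
            if state_a == state_b:
                score = match * min(len_a, len_b) / max(len_a, len_b)
            else:
                score = mismatch
            H[i][j] = max(
                0.0,                      # start a new local alignment
                H[i - 1][j - 1] + score,  # align the two runs
                H[i - 1][j] + gap,        # skip a run in seq_a
                H[i][j - 1] + gap,        # skip a run in seq_b
            )
            best = max(best, H[i][j])
    return best
```

Working on compressed runs rather than individual bins keeps the table small when states persist over long stretches, which is one plausible reason to treat run length as part of the score rather than as repeated symbols.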
2.
Science ; 384(6698): eadh7688, 2024 May 24.
Article in English | MEDLINE | ID: mdl-38781356

ABSTRACT

RNA splicing is highly prevalent in the brain and has strong links to neuropsychiatric disorders; yet, the role of cell-type-specific splicing and transcript-isoform diversity during human brain development has not been systematically investigated. In this work, we leveraged single-molecule long-read sequencing to deeply profile the full-length transcriptome of the germinal zone and cortical plate regions of the developing human neocortex at tissue and single-cell resolution. We identified 214,516 distinct isoforms, of which 72.6% were novel (not previously annotated in Gencode version 33), and uncovered a substantial contribution of transcript-isoform diversity, regulated by RNA-binding proteins, in defining cellular identity in the developing neocortex. We leveraged this comprehensive isoform-centric gene annotation to reprioritize thousands of rare de novo risk variants and elucidate genetic risk mechanisms for neuropsychiatric disorders.


Subject(s)
Mental Disorders , Neocortex , Neurogenesis , Protein Isoforms , RNA Splicing , Single-Cell Analysis , Transcriptome , Humans , Alternative Splicing , Genetic Predisposition to Disease , Mental Disorders/genetics , Molecular Sequence Annotation , Neocortex/metabolism , Neocortex/embryology , Protein Isoforms/genetics , Protein Isoforms/metabolism , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism , Neurogenesis/genetics
3.
bioRxiv ; 2023 Jul 25.
Article in English | MEDLINE | ID: mdl-37546812

ABSTRACT

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE test for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to find cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.

4.
Res Sq ; 2023 Aug 02.
Article in English | MEDLINE | ID: mdl-37577698

ABSTRACT

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used both to define cell clusters as potential cell types and to identify DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.

5.
bioRxiv ; 2023 Aug 29.
Article in English | MEDLINE | ID: mdl-37693523

ABSTRACT

A central task in expression quantitative trait locus (eQTL) analysis is to identify cis-eGenes (henceforth "eGenes"), i.e., genes whose expression levels are regulated by at least one local genetic variant. Among the existing eGene identification methods, FastQTL is considered the gold standard but is computationally expensive as it requires thousands of permutations for each gene. Alternative methods such as eigenMT and TreeQTL have lower power than FastQTL. In this work, we propose ClipperQTL, which reduces the number of permutations needed from thousands to 20 for data sets with large sample sizes (> 450) by using the contrastive strategy developed in Clipper; for data sets with smaller sample sizes, it uses the same permutation-based approach as FastQTL. We show that ClipperQTL performs as well as FastQTL and runs about 500 times faster if the contrastive strategy is used and 50 times faster if the conventional permutation-based approach is used. The R package ClipperQTL is available at https://github.com/heatherjzhou/ClipperQTL.
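
The ClipperQTL abstract above contrasts its approach with conventional permutation-based eGene identification. As a rough illustration of why that conventional approach is costly, here is a minimal sketch of a gene-level permutation p-value: the test statistic is the strongest absolute correlation between a gene's expression and any of its local variants, and the null distribution is obtained by shuffling expression across samples. This is not the FastQTL or ClipperQTL implementation; the function names, the choice of Pearson correlation as the statistic, and the default `n_perm` are assumptions for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def gene_stat(expr, genotypes):
    """Strongest absolute correlation between expression and any local variant."""
    return max(abs(pearson(expr, g)) for g in genotypes)


def permutation_pvalue(expr, genotypes, n_perm=1000, seed=0):
    """Gene-level p-value: share of shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = gene_stat(expr, genotypes)
    shuffled = list(expr)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the expression-genotype link
        if gene_stat(shuffled, genotypes) >= observed:
            hits += 1
    # The +1 terms keep the p-value strictly positive.
    return (1 + hits) / (1 + n_perm)
```

Each permutation recomputes the statistic over every local variant, so the cost per gene grows with `n_perm` times the number of variants; cutting the number of permutations from thousands to 20, as the abstract reports, reduces that work proportionally.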

6.
Blood Cancer Discov ; 4(3): 228-245, 2023 05 01.
Article in English | MEDLINE | ID: mdl-37067905

ABSTRACT

RNA splicing dysregulation underlies the onset and progression of cancers. In chronic lymphocytic leukemia (CLL), spliceosome mutations leading to aberrant splicing occur in ∼20% of patients. However, the mechanism for splicing defects in spliceosome-unmutated CLL cases remains elusive. Through an integrative transcriptomic and proteomic analysis, we discover that proteins involved in RNA splicing are posttranscriptionally upregulated in CLL cells, resulting in splicing dysregulation. The abundance of splicing complexes is an independent risk factor for poor prognosis. Moreover, increased splicing factor expression is highly correlated with the abundance of METTL3, an RNA methyltransferase that deposits N6-methyladenosine (m6A) on mRNA. METTL3 is essential for cell growth in vitro and in vivo and controls splicing factor protein expression in a methyltransferase-dependent manner through m6A modification-mediated ribosome recycling and decoding. Our results uncover METTL3-mediated m6A modification as a novel regulatory axis in driving splicing dysregulation and contributing to aggressive CLL. SIGNIFICANCE: METTL3 controls widespread splicing factor abundance via translational control of m6A-modified mRNA, contributes to RNA splicing dysregulation and disease progression in CLL, and serves as a potential therapeutic target in aggressive CLL. See related commentary by Janin and Esteller, p. 176. This article is highlighted in the In This Issue feature, p. 171.


Subject(s)
Alternative Splicing , Leukemia, Lymphocytic, Chronic, B-Cell , Humans , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Proteomics , Methyltransferases/genetics , Methyltransferases/metabolism , RNA Splicing Factors/genetics , RNA Splicing Factors/metabolism , RNA, Messenger/genetics , RNA, Messenger/metabolism
7.
Genome Biol ; 23(1): 79, 2022 03 15.
Article in English | MEDLINE | ID: mdl-35292087

ABSTRACT

When identifying differentially expressed genes between two conditions using human population RNA-seq samples, we found by permutation analysis that two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates (FDRs). Expanding the analysis to limma-voom, NOISeq, dearseq, and the Wilcoxon rank-sum test, we found that FDR control often fails for all methods except the Wilcoxon rank-sum test. In particular, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for population-level RNA-seq studies with large sample sizes, we recommend the Wilcoxon rank-sum test.


Subject(s)
Computational Biology , Gene Expression Profiling , Computational Biology/methods , Gene Expression Profiling/methods , Humans , RNA-Seq , Sample Size , Sequence Analysis, RNA/methods
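
The abstract above recommends the Wilcoxon rank-sum test for large-sample studies. For readers unfamiliar with it, the following is a minimal sketch of the two-sided test for a single gene, using the large-sample normal approximation. It is an illustration rather than the authors' analysis code: the function name is hypothetical, ties receive midranks, and the variance omits the tie correction for brevity.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (U statistic for x, two-sided p-value). Ties get midranks;
    the variance formula ignores the tie correction (a simplification).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the block of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return u, p
```

Because the test uses only ranks, a few extreme expression values in one condition cannot dominate the statistic. In a genome-wide setting the per-gene p-values would still need a multiple-testing adjustment before thresholding at a target FDR.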
8.
Genome Biol ; 22(1): 288, 2021 10 11.
Article in English | MEDLINE | ID: mdl-34635147

ABSTRACT

High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion for ensuring the reliability of the analysis is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software , Chromatin Immunoprecipitation Sequencing/methods , Chromosomes , Computer Simulation , Data Interpretation, Statistical , Humans , Mass Spectrometry , Peptides/chemistry , Proteomics/methods , RNA-Seq/methods , Single-Cell Analysis
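
The Clipper abstract above states that FDR can be controlled without p-values but does not spell out how. One way such p-value-free control can work is sketched below, under the assumption that each feature gets a signed contrast score (large positive values suggest a real difference, and scores of uninteresting features are roughly symmetric around zero). The number of scores below -t then estimates the number of false positives above t, and the smallest t whose estimated false discovery proportion is at most q becomes the cutoff. The function names and the contrast-score input are assumptions for illustration, not Clipper's API.

```python
def contrast_threshold(scores, q):
    """Smallest t with (1 + #{c <= -t}) / max(#{c >= t}, 1) <= q, else None."""
    candidates = sorted({abs(c) for c in scores if c != 0})
    for t in candidates:
        negatives = sum(1 for c in scores if c <= -t)  # proxy for false positives
        positives = sum(1 for c in scores if c >= t)   # features that would be called
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return None


def discoveries(scores, q=0.05):
    """Indices of features whose contrast score reaches the threshold."""
    t = contrast_threshold(scores, q)
    if t is None:
        return []
    return [j for j, c in enumerate(scores) if c >= t]
```

For example, with thirty scores of 5.0 and four small scores (0.5, -0.5, 0.3, -0.3) at q = 0.05, the cutoffs 0.3 and 0.5 give estimated proportions of 3/32 and 2/31, both above 0.05, while 5.0 gives 1/30, so only the thirty large scores are reported.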
9.
Sci Adv ; 6(46), 2020 11.
Article in English | MEDLINE | ID: mdl-33177077

ABSTRACT

Data-driven discovery of cancer driver genes, including tumor suppressor genes (TSGs) and oncogenes (OGs), is imperative for cancer prevention, diagnosis, and treatment. Although epigenetic alterations are important for tumor initiation and progression, most known driver genes were identified based on genetic alterations alone. Here, we developed an algorithm, DORGE (Discovery of Oncogenes and tumor suppressoR genes using Genetic and Epigenetic features), to identify TSGs and OGs by integrating comprehensive genetic and epigenetic data. DORGE identified histone modifications as strong predictors for TSGs, and it found missense mutations, super enhancers, and methylation differences as strong predictors for OGs. We extensively validated DORGE-predicted cancer driver genes using independent functional genomics data. We also found that DORGE-predicted dual-functional genes (both TSGs and OGs) are enriched at hubs in protein-protein interaction and drug-gene networks. Overall, our study has deepened the understanding of epigenetic mechanisms in tumorigenesis and revealed previously undetected cancer driver genes.


Subject(s)
Genes, Tumor Suppressor , Oncogenes , Cell Transformation, Neoplastic/genetics , DNA Methylation , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Humans
10.
PLoS One ; 13(4): e0196226, 2018.
Article in English | MEDLINE | ID: mdl-29702671

ABSTRACT

Copy number variations (CNVs) are gains and losses of DNA sequence in a genome. High-throughput platforms such as microarrays and next-generation sequencing (NGS) technologies have been applied to detect genome-wide copy number losses. Although progress has been made with both approaches, the accuracy and consistency of CNV calling from the two platforms remain in dispute. In this study, we perform a deep analysis of copy number losses in 254 human DNA samples that have both SNP microarray data and NGS data publicly available from the HapMap Project and the 1000 Genomes Project, respectively. We show that the copy number losses reported by the HapMap Project and the 1000 Genomes Project have < 30% overlap, even though state-of-the-art calling methods were employed and both projects require their reports to have cross-platform (e.g. PCR, microarray and high-throughput sequencing) experimental support. On the other hand, when copy number losses are called directly from HapMap microarray data by an accurate algorithm, CNVhac, almost all of them have lower read mapping depth in the NGS data; furthermore, 88% of them are supported by sequences with breakpoints in the NGS data. Our results suggest that microarrays are capable of calling CNVs and that the unessential requirement of additional cross-platform support may introduce false negatives. The inconsistency between CNV reports from the HapMap Project and the 1000 Genomes Project might result from inadequate information contained in microarray data, inconsistent detection criteria, or the filtering effect of cross-platform support. Statistical tests on CNVs called by CNVhac show that microarray data can offer reliable CNV reports, and that the majority of CNV candidates can be confirmed by raw sequences. Therefore, the CNV candidates given by a good caller could be highly reliable without cross-platform support, so additional experimental information should be applied when needed rather than as a requirement.


Subject(s)
Computational Biology/methods , DNA Copy Number Variations , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Algorithms , Genome, Human , HapMap Project , High-Throughput Nucleotide Sequencing/methods , Human Genome Project , Humans , Oligonucleotide Array Sequence Analysis/methods