Results 1 - 20 of 103
1.
PLoS Genet ; 20(9): e1011412, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39348415

ABSTRACT

Rare variants, which make up the vast majority of human genetic variation, are likely to have more deleterious effects in human disease than common variants. Here we present the carrier statistic, a statistical framework that prioritizes disease-related rare variants by integrating gene expression data. By quantifying the impact of rare variants on gene expression, the carrier statistic can prioritize those rare variants that have large functional consequences in patients. Through simulation studies and analysis of a real multi-omics dataset, we demonstrated that the carrier statistic is applicable to studies with limited sample sizes (a few hundred) and achieves substantially higher sensitivity than existing rare-variant association methods. Application to Alzheimer's disease reveals 16 rare variants within 15 genes with extreme carrier statistics. We also found a strong excess of rare variants among the top prioritized genes in patients compared with healthy individuals. The carrier statistic method can be applied to various rare variant types and is adaptable to other omics data modalities, offering a powerful tool for investigating the molecular mechanisms underlying complex diseases.
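The abstract does not give the formula for the carrier statistic, so the following is only a minimal sketch of the general idea of scoring a rare variant by its effect on expression. It assumes, purely for illustration, that the score is the standardized difference in one gene's expression between carriers and non-carriers; the function name `carrier_z_score` and the toy data are hypothetical and not taken from the paper.

```python
from statistics import mean, stdev


def carrier_z_score(expression, carriers):
    """Standardized expression shift of carriers versus non-carriers.

    expression: dict mapping sample id -> expression level of one gene
    carriers:   set of sample ids that carry the rare variant
    (Assumed scoring rule for illustration; not the published definition.)
    """
    carrier_values = [v for s, v in expression.items() if s in carriers]
    other_values = [v for s, v in expression.items() if s not in carriers]
    if not carrier_values or len(other_values) < 2:
        raise ValueError("need at least one carrier and two non-carriers")
    spread = stdev(other_values)  # sample standard deviation of non-carriers
    if spread == 0:
        raise ValueError("non-carrier expression has zero variance")
    return (mean(carrier_values) - mean(other_values)) / spread


# Toy data (hypothetical): sample s5 carries the variant.
expression = {"s1": 9.0, "s2": 10.0, "s3": 11.0, "s4": 10.0, "s5": 14.0}
score = carrier_z_score(expression, {"s5"})
print(f"carrier score: {score:.3f}")
```

Under this assumed rule, variants whose carriers sit far in the tails of the non-carrier expression distribution receive extreme scores, which is the kind of ranking the abstract describes.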


Subject(s)
Alzheimer Disease; Genetic Predisposition to Disease; Genetic Variation; Humans; Alzheimer Disease/genetics; Genome-Wide Association Study/methods; Gene Expression/genetics; Computer Simulation
2.
Proc Natl Acad Sci U S A ; 121(23): e2322376121, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38809705

ABSTRACT

In this article, we develop CausalEGM, a deep learning framework for nonlinear dimension reduction and generative modeling of the dependency among covariate features affecting treatment and response. CausalEGM can be used for estimating causal effects in both binary and continuous treatment settings. By learning a bidirectional transformation between the high-dimensional covariate space and a low-dimensional latent space and then modeling the dependencies of different subsets of the latent variables on the treatment and response, CausalEGM can extract the latent covariate features that affect both treatment and response. By conditioning on these features, one can mitigate the confounding effect of the high-dimensional covariates on the estimation of the causal relation between treatment and response. In a series of experiments, the proposed method is shown to achieve superior performance over existing methods in both binary and continuous treatment settings. The improvement is substantial when the sample size is large and the covariates are high-dimensional. Finally, we establish excess risk bounds and consistency results for our method and discuss how our approach relates to and improves upon other dimension reduction approaches in causal inference.

3.
Proc Natl Acad Sci U S A ; 120(28): e2305236120, 2023 07 11.
Article in English | MEDLINE | ID: mdl-37399400

ABSTRACT

Plasma cell-free DNA (cfDNA) is a noninvasive biomarker for cell death of all organs. Deciphering the tissue origin of cfDNA can reveal abnormal cell death caused by disease, which has great clinical potential in disease detection and monitoring. Despite this promise, sensitive and accurate quantification of tissue-derived cfDNA remains challenging for existing methods because of the limited characterization of tissue methylation and the reliance on unsupervised methods. To fully exploit the clinical potential of tissue-derived cfDNA, here we present one of the largest comprehensive and high-resolution methylation atlases, based on 521 noncancer tissue samples spanning 29 major types of human tissues. We systematically identified fragment-level tissue-specific methylation patterns and extensively validated them in orthogonal datasets. Based on the rich tissue methylation atlas, we develop the first supervised tissue deconvolution approach, a deep-learning-powered model, cfSort, for sensitive and accurate tissue deconvolution in cfDNA. On the benchmarking data, cfSort showed superior sensitivity and accuracy compared to the existing methods. We further demonstrated the clinical utility of cfSort with two potential applications: aiding disease diagnosis and monitoring treatment side effects. The tissue-derived cfDNA fraction estimated by cfSort reflected the clinical outcomes of the patients. In summary, the tissue methylation atlas and cfSort enhanced the performance of tissue deconvolution in cfDNA, thus facilitating cfDNA-based disease detection and longitudinal treatment monitoring.


Subject(s)
Cell-Free Nucleic Acids; Deep Learning; Humans; Cell-Free Nucleic Acids/genetics; DNA Methylation; Biomarkers; Promoter Regions, Genetic; Biomarkers, Tumor/genetics
4.
Hum Mol Genet ; 32(21): 3105-3120, 2023 10 17.
Article in English | MEDLINE | ID: mdl-37584462

ABSTRACT

DNA methyltransferase type 1 (DNMT1) is a major enzyme involved in maintaining the methylation pattern after DNA replication. Mutations in DNMT1 have been associated with autosomal dominant cerebellar ataxia, deafness and narcolepsy (ADCA-DN). We used fibroblasts, induced pluripotent stem cells (iPSCs) and induced neurons (iNs) generated from patients with ADCA-DN and controls, to explore the epigenomic and transcriptomic effects of mutations in DNMT1. We show cell type-specific changes in gene expression and DNA methylation patterns. DNA methylation and gene expression changes were negatively correlated in iPSCs and iNs. In addition, we identified a group of genes associated with clinical phenotypes of ADCA-DN, including PDGFB and PRDM8 for cerebellar ataxia, psychosis and dementia and NR2F1 for deafness and optic atrophy. Furthermore, ZFP57, which is required to maintain gene imprinting through DNA methylation during early development, was hypomethylated in promoters and exhibited upregulated expression in patients with ADCA-DN in both iPSC and iNs. Our results provide insight into the functions of DNMT1 and the molecular changes associated with ADCA-DN, with potential implications for genes associated with related phenotypes.


Subject(s)
Cerebellar Ataxia; Deafness; Humans; Cerebellar Ataxia/genetics; DNA (Cytosine-5-)-Methyltransferases/genetics; Transcriptome/genetics; Epigenomics; DNA (Cytosine-5-)-Methyltransferase 1/genetics; DNA Methylation/genetics; Deafness/genetics; Mutation; DNA
5.
Nucleic Acids Res ; 51(D1): D159-D166, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36215037

ABSTRACT

Elucidating the role of 3D architecture of DNA in gene regulation is crucial for understanding cell differentiation, tissue homeostasis and disease development. Among various chromatin conformation capture methods, HiChIP has received increasing attention for its significant improvement over other methods in profiling of regulatory (e.g. H3K27ac) and structural (e.g. cohesin) interactions. To facilitate the studies of 3D regulatory interactions, we developed a HiChIP interactions database, HiChIPdb (http://health.tsinghua.edu.cn/hichipdb/). The current version of HiChIPdb contains ∼262M annotated HiChIP interactions from 200 high-throughput HiChIP samples across 108 cell types. The functionalities of HiChIPdb include: (i) standardized categorization of HiChIP interactions in a hierarchical structure based on organ, tissue and cell line and (ii) comprehensive annotations of HiChIP interactions with regulatory genes and GWAS Catalog SNPs. To the best of our knowledge, HiChIPdb is the first comprehensive database that utilizes a unified pipeline to map the functional interactions across diverse cell types and tissues in different resolutions. We believe this database has the potential to advance cutting-edge research in regulatory mechanisms in development and disease by removing the barrier in data aggregation, preprocessing, and analysis.


Subject(s)
Chromatin; DNA; Cell Line; Chromatin/genetics; Gene Expression Regulation; Sequence Analysis, DNA/methods; Databases, Genetic
6.
Proc Natl Acad Sci U S A ; 119(1)2022 01 04.
Article in English | MEDLINE | ID: mdl-34930827

ABSTRACT

Abdominal aortic aneurysm (AAA) is a common degenerative cardiovascular disease whose pathobiology is not clearly understood. The cellular heterogeneity and cell-type-specific gene regulation of vascular cells in human AAA have not been well-characterized. Here, we performed analysis of whole-genome sequencing data in AAA patients versus controls with the aim of detecting disease-associated variants that may affect gene regulation in human aortic smooth muscle cells (AoSMC) and human aortic endothelial cells (HAEC), two cell types of high relevance to AAA disease. To support this analysis, we generated H3K27ac HiChIP data for these cell types and inferred cell-type-specific gene regulatory networks. We observed that AAA-associated variants were most enriched in regulatory regions in AoSMC, compared with HAEC and CD4+ cells. The cell-type-specific regulation defined by this HiChIP data supported the importance of ERG and the KLF family of transcription factors in AAA disease. The analysis of regulatory elements that contain noncoding variants and also are differentially open between AAA patients and controls revealed the significance of the interleukin-6-mediated signaling pathway. This finding was further validated by including information from the deleteriousness effect of nonsynonymous single-nucleotide variants in AAA patients and additional control data from the Medical Genome Reference Bank dataset. These results shed important insights into AAA pathogenesis and provide a model for cell-type-specific analysis of disease-associated variants.


Subject(s)
Aortic Aneurysm, Abdominal/genetics; Gene Regulatory Networks; Case-Control Studies; Cells, Cultured; Down-Regulation; Humans; Interleukin-6/metabolism; Kruppel-Like Transcription Factors/genetics; Transcriptional Regulator ERG/genetics
7.
Proc Natl Acad Sci U S A ; 118(15)2021 04 13.
Article in English | MEDLINE | ID: mdl-33833061

ABSTRACT

Density estimation is one of the fundamental problems in both statistics and machine learning. In this study, we propose Roundtrip, a computational framework for general-purpose density estimation based on deep generative neural networks. Roundtrip retains the generative power of deep generative models, such as generative adversarial networks (GANs) while it also provides estimates of density values, thus supporting both data generation and density estimation. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings where target density is modeled by learning a manifold induced from a base density (e.g., Gaussian distribution). Roundtrip provides a statistical framework for GAN models where an explicit evaluation of density values is feasible. In numerical experiments, Roundtrip exceeds state-of-the-art performance in a diverse range of density estimation tasks.

8.
Genome Res ; 30(4): 622-634, 2020 04.
Article in English | MEDLINE | ID: mdl-32188700

ABSTRACT

A time course experiment is a widely used design in the study of cellular processes such as differentiation or response to stimuli. In this paper, we propose time course regulatory analysis (TimeReg) as a method for the analysis of gene regulatory networks based on paired gene expression and chromatin accessibility data from a time course. TimeReg can be used to prioritize regulatory elements, to extract core regulatory modules at each time point, to identify key regulators driving changes of the cellular state, and to causally connect the modules across different time points. We applied the method to analyze paired chromatin accessibility and gene expression data from a retinoic acid (RA)-induced mouse embryonic stem cells (mESCs) differentiation experiment. The analysis identified 57,048 novel regulatory elements regulating cerebellar development, synapse assembly, and hindbrain morphogenesis, which substantially extended our knowledge of cis-regulatory elements during differentiation. Using single-cell RNA-seq data, we showed that the core regulatory modules can reflect the properties of different subpopulations of cells. Finally, the driver regulators are shown to be important in clarifying the relations between modules across adjacent time points. As a second example, our method on Ascl1-induced direct reprogramming from fibroblast to neuron time course data identified Id1/2 as driver regulators of early stage of reprogramming.


Subject(s)
Chromatin Assembly and Disassembly; Chromatin/genetics; Gene Expression Regulation; Mouse Embryonic Stem Cells/metabolism; Algorithms; Animals; Cell Differentiation/drug effects; Cell Differentiation/genetics; Cell Lineage; Cellular Reprogramming/genetics; Cellular Reprogramming Techniques; Chromatin/metabolism; Computational Biology/methods; Gene Expression Profiling/methods; Gene Regulatory Networks; Mice; Mouse Embryonic Stem Cells/drug effects; Transcription Factors/metabolism; Transcriptome; Tretinoin/pharmacology
9.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34180954

ABSTRACT

Multi-omics data allow us to select a small set of informative markers for the discrimination of specific cell types and the study of cellular heterogeneity. However, it is often challenging to choose an optimal marker panel from the high-dimensional molecular profiles for a large number of cell types. Here, we propose a method called Mixed Integer programming Model to Identify Cell type-specific marker panel (MIMIC). MIMIC maintains the hierarchical topology among different cell types and simultaneously maximizes the specificity of a fixed number of selected markers. MIMIC was benchmarked on the mouse ENCODE RNA-seq dataset, with 29 diverse tissues, for 43 surface markers (SMs) and 1345 transcription factors (TFs). MIMIC could select biologically meaningful markers and is robust to different accuracy criteria. It shows advantages over the standard single gene-based approaches and widely used dimension reduction methods, such as multidimensional scaling and t-SNE, both in accuracy and in biological interpretation. Furthermore, the combination of SMs and TFs achieves better specificity than SMs or TFs alone. Applying MIMIC to a large collection of 641 RNA-seq samples covering 231 cell types identifies a panel of TFs and SMs that reveal the modularity of cell type association networks. Finally, the scalability of MIMIC is demonstrated by selecting enhancer markers from mouse ENCODE data. MIMIC is freely available at https://github.com/MengZou1/MIMIC.


Subject(s)
Biomarkers; Computational Biology; Flow Cytometry/methods; Gene Expression Profiling/methods; Organ Specificity; Software; Algorithms; Computational Biology/methods; Databases, Genetic; Gene Expression Regulation; Humans; Organ Specificity/genetics; Reproducibility of Results
10.
Proc Natl Acad Sci U S A ; 117(35): 21364-21372, 2020 09 01.
Article in English | MEDLINE | ID: mdl-32817564

ABSTRACT

A person's genome typically contains millions of variants which represent the differences between this personal genome and the reference human genome. The interpretation of these variants, i.e., the assessment of their potential impact on a person's phenotype, is currently of great interest in human genetics and medicine. We have developed a prioritization tool called OpenCausal which takes as inputs 1) a personal genome and 2) a reference context-specific TF expression profile and returns a list of noncoding variants prioritized according to their impact on chromatin accessibility for any given genomic region of interest. We applied OpenCausal to 6,430 samples across 18 tissues derived from the GTEx project and found that the variants prioritized by OpenCausal are highly enriched for eQTLs and caQTLs. We further propose a strategy to integrate the predicted open scores with genome-wide association studies (GWAS) data to prioritize putative causal variants and regulatory elements for a given risk locus (i.e., fine-mapping analysis). As an initial example, we applied this method to a GWAS dataset of human height and found that the prioritized putative variants and elements are correlated with the phenotype (i.e., heights of individuals) better than others.


Subject(s)
Genetic Techniques; Genetic Variation; Genome, Human; Models, Genetic; Regulatory Elements, Transcriptional; Body Height/genetics; Gene Expression Profiling; Genome-Wide Association Study; Humans; Quantitative Trait Loci; Software; Transcription Factors/metabolism
11.
Proc Natl Acad Sci U S A ; 117(9): 4864-4873, 2020 03 03.
Article in English | MEDLINE | ID: mdl-32071206

ABSTRACT

In both Turner syndrome (TS) and Klinefelter syndrome (KS), copy number aberrations of the X chromosome lead to various developmental symptoms. We report a comparative analysis of TS vs. KS regarding differences at the genomic network level measured in primary samples by analyzing gene expression, DNA methylation, and chromatin conformation. X-chromosome inactivation (XCI) silences transcription from one X chromosome in female mammals, on which most genes are inactive, and some genes escape from XCI. In TS, almost all differentially expressed escape genes are down-regulated but most differentially expressed inactive genes are up-regulated. In KS, differentially expressed escape genes are up-regulated while the majority of inactive genes appear unchanged. Interestingly, 94 differentially expressed genes (DEGs) overlapped between the TS vs. female and KS vs. male comparisons, and these almost uniformly display expression changes in opposite directions. DEGs on the X chromosome and the autosomes are coexpressed in both syndromes, indicating that there are molecular ripple effects of the changes in X chromosome dosage. Six potential candidate genes (RPS4X, SEPT6, NKRF, CXorf57, NAA10, and FLNA) for KS are identified on Xq, as well as candidate central genes on Xp for TS. Only promoters of inactive genes are differentially methylated in both syndromes while escape gene promoters remain unchanged. The intrachromosomal contact map of the X chromosome in TS exhibits the structure of an active X chromosome. The discovery of shared DEGs indicates the existence of common molecular mechanisms for gene regulation in TS and KS that transmit the gene dosage changes to the transcriptome.


Subject(s)
Gene Dosage; Gene Expression Regulation; Genomics; Klinefelter Syndrome/genetics; Turner Syndrome/genetics; X Chromosome; Animals; Chromatin/chemistry; Chromosomes, Human, X; DNA Methylation; Female; Filamins; Humans; Karyotype; Male; Mammals/genetics; N-Terminal Acetyltransferase A; N-Terminal Acetyltransferase E; Protein Serine-Threonine Kinases/genetics; Receptor, PAR-2; Repressor Proteins/genetics; Septins; Transcriptome/genetics; X Chromosome Inactivation
12.
Nucleic Acids Res ; 47(10): e60, 2019 06 04.
Article in English | MEDLINE | ID: mdl-30869141

ABSTRACT

Interactions between regulatory elements are of crucial importance for the understanding of transcriptional regulation and the interpretation of disease mechanisms. Hi-C technique has been developed for genome-wide detection of chromatin contacts. However, unless extremely deep sequencing is performed on a very large number of input cells, which is technically limited and expensive, current Hi-C experiments do not have high enough resolution to resolve contacts between regulatory elements. Here, we develop DeepTACT, a bootstrapping deep learning model, to integrate genome sequences and chromatin accessibility data for the prediction of chromatin contacts between regulatory elements. DeepTACT can infer not only promoter-enhancer interactions, but also promoter-promoter interactions. In tests based on promoter capture Hi-C data, DeepTACT shows better performance over existing methods. DeepTACT analysis also identifies a class of hub promoters, which are correlated with transcriptional activation across cell lines, enriched in housekeeping genes, functionally related to fundamental biological processes, and capable of reflecting cell similarity. Finally, the utility of chromatin contacts in the study of human diseases is illustrated by the association of IFNA2 to coronary artery disease via an integrative analysis of GWAS data and interactions predicted by DeepTACT.


Subject(s)
Algorithms; Chromatin/genetics; Computational Biology/methods; Deep Learning; Promoter Regions, Genetic/genetics; Regulatory Sequences, Nucleic Acid/genetics; Cells, Cultured; Chromatin/metabolism; Gene Expression Regulation; Genome-Wide Association Study; High-Throughput Nucleotide Sequencing; Humans
13.
Proc Natl Acad Sci U S A ; 115(30): 7723-7728, 2018 07 24.
Article in English | MEDLINE | ID: mdl-29987051

ABSTRACT

When different types of functional genomics data are generated on single cells from different samples of cells from the same heterogeneous population, the clustering of cells in the different samples should be coupled. We formulate this "coupled clustering" problem as an optimization problem and propose the method of coupled nonnegative matrix factorizations (coupled NMF) for its solution. The method is illustrated by the integrative analysis of single-cell RNA-sequencing (RNA-seq) and single-cell ATAC-sequencing (ATAC-seq) data.


Subject(s)
Databases, Genetic; Models, Genetic; Sequence Analysis, RNA/methods; Animals; Humans
14.
Nucleic Acids Res ; 46(15): e89, 2018 09 06.
Article in English | MEDLINE | ID: mdl-29897492

ABSTRACT

The detection of tumor-derived cell-free DNA in plasma is one of the most promising directions in cancer diagnosis. The major challenge in such an approach is identifying the tiny amount of tumor DNA among the total cell-free DNA in blood. Here we propose an ultrasensitive cancer detection method, termed 'CancerDetector', using the DNA methylation profiles of cell-free DNAs. The key to our method is to probabilistically model the joint methylation states of multiple adjacent CpG sites on an individual sequencing read, in order to exploit the pervasive nature of DNA methylation for signal amplification. Therefore, CancerDetector can sensitively identify a trace amount of tumor cfDNAs in plasma, at the level of individual reads. We evaluated CancerDetector on simulated data and showed a high concordance between the predicted and true tumor fractions. Testing CancerDetector on real plasma data demonstrated its high sensitivity and specificity in detecting tumor cfDNAs. In addition, the predicted tumor fraction showed great consistency with tumor size and survival outcome. Note that all of these tests were performed on sequencing data at low to medium coverage (1× to 10×). Therefore, CancerDetector holds great potential to detect cancer early and cost-effectively.
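As a rough illustration of read-level probabilistic modeling, the sketch below scores each read's joint CpG methylation pattern under a "tumor" and a "normal" class and then picks the mixture weight (tumor fraction) that maximizes the likelihood of all reads. It is a simplified sketch, not the published CancerDetector model: it assumes CpGs on a read are independent given the class, assumes a single known methylation rate per class, and uses a plain grid search. All function names and numbers are hypothetical.

```python
import math


def read_likelihood(read, rate):
    """P(read | class), assuming each CpG is methylated independently with prob. `rate`."""
    likelihood = 1.0
    for state in read:  # 1 = methylated CpG, 0 = unmethylated CpG
        likelihood *= rate if state == 1 else 1.0 - rate
    return likelihood


def estimate_tumor_fraction(reads, tumor_rate, normal_rate, steps=1000):
    """Grid-search the mixture weight that maximizes the read-level log-likelihood."""
    best_theta, best_loglik = 0.0, -math.inf
    for i in range(steps + 1):
        theta = i / steps
        loglik = 0.0
        for read in reads:
            mix = (theta * read_likelihood(read, tumor_rate)
                   + (1.0 - theta) * read_likelihood(read, normal_rate))
            loglik += math.log(mix) if mix > 0 else -math.inf
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta


# Toy marker region (hypothetical): hypermethylated in tumor, unmethylated in normal plasma.
reads = [[1, 1, 1, 1]] + [[0, 0, 0, 0]] * 9
theta = estimate_tumor_fraction(reads, tumor_rate=0.9, normal_rate=0.1)
print(f"estimated tumor fraction: {theta:.3f}")
```

Scoring four adjacent CpGs jointly separates the two classes far more sharply than any single CpG would, which is the signal amplification the abstract refers to.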


Subject(s)
Algorithms; Cell-Free Nucleic Acids/genetics; Computational Biology/methods; DNA Methylation; Neoplasms/diagnosis; Cell-Free Nucleic Acids/chemistry; CpG Islands/genetics; DNA, Neoplasm/chemistry; DNA, Neoplasm/genetics; High-Throughput Nucleotide Sequencing/methods; Humans; Neoplasms/blood; Neoplasms/genetics; ROC Curve; Reproducibility of Results
15.
Proc Natl Acad Sci U S A ; 114(25): E4914-E4923, 2017 06 20.
Article in English | MEDLINE | ID: mdl-28576882

ABSTRACT

The rapid increase of genome-wide datasets on gene expression, chromatin states, and transcription factor (TF) binding locations offers an exciting opportunity to interpret the information encoded in genomes and epigenomes. This task can be challenging as it requires joint modeling of context-specific activation of cis-regulatory elements (REs) and the effects on transcription of associated regulatory factors. To meet this challenge, we propose a statistical approach based on paired expression and chromatin accessibility (PECA) data across diverse cellular contexts. In our approach, we model (i) the localization to REs of chromatin regulators (CRs) based on their interaction with sequence-specific TFs, (ii) the activation of REs due to CRs that are localized to them, and (iii) the effect of TFs bound to activated REs on the transcription of target genes (TGs). The transcriptional regulatory network inferred by PECA provides a detailed view of how trans- and cis-regulatory elements work together to affect gene expression in a context-specific manner. We illustrate the feasibility of this approach by analyzing paired expression and accessibility data from the mouse Encyclopedia of DNA Elements (ENCODE) and explore various applications of the resulting model.


Subject(s)
Chromatin/genetics; Gene Expression Regulation/genetics; Gene Regulatory Networks/genetics; Animals; Binding Sites/genetics; Chromatin Assembly and Disassembly/genetics; Enhancer Elements, Genetic/genetics; Humans; Mice; Protein Binding/genetics; Regulatory Elements, Transcriptional/genetics; Transcription Factors/genetics
16.
Nucleic Acids Res ; 45(14): e132, 2017 Aug 21.
Article in English | MEDLINE | ID: mdl-28586438

ABSTRACT

Third-generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors. COSINE computes the context similarity of two stretches of nucleobases from the similarity between the distributions of their short k-mers (k = 3-4) along the sequences. The results on simulated and real data show that COSINE achieves high sensitivity and specificity under a wide range of read accuracies. When the error rate is high, COSINE can offer substantial advantages over existing alignment methods.
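To make the idea of comparing k-mer distributions concrete, here is a minimal sketch that builds the relative-frequency profile of overlapping 3-mers for two sequences and compares the profiles with cosine similarity. The abstract does not specify the similarity measure or how profiles are computed along the sequences, so cosine similarity over whole-sequence profiles is an assumption made for illustration; the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmer_profile(sequence, k=3):
    """Relative frequency of every overlapping k-mer in `sequence`."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse k-mer frequency vectors."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


# Toy sequences (hypothetical): a reference stretch, a copy with one
# substitution, and an unrelated stretch.
reference = kmer_profile("ACGTACGTAC")
noisy = kmer_profile("ACGTACCTAC")
unrelated = kmer_profile("TTTTGGGGCC")

print(cosine_similarity(reference, noisy))
print(cosine_similarity(reference, unrelated))
```

Because a single substitution changes only the few k-mers that overlap it, the profile of a noisy read stays close to that of its source, whereas an unrelated stretch shares few or no k-mers.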


Subject(s)
Algorithms; Computational Biology/methods; Sequence Alignment/methods; Software; Base Sequence; High-Throughput Nucleotide Sequencing/methods; High-Throughput Nucleotide Sequencing/statistics & numerical data; Reproducibility of Results
17.
Nucleic Acids Res ; 45(10): 5666-5677, 2017 Jun 02.
Article in English | MEDLINE | ID: mdl-28472398

ABSTRACT

Transcription factors (TFs) play crucial roles in regulating gene expression through interactions with specific DNA sequences. Recently, the sequence motifs of almost 400 human TFs have been identified using high-throughput SELEX sequencing. However, there remain a large number of TFs (∼800) with no high-throughput-derived binding motifs. Computational methods capable of associating known motifs with such TFs would avoid considerable experimental effort and enable a deeper understanding of transcriptional regulatory functions. We present a method to associate known motifs with TFs (MATLAB code is available in Supplementary Materials). Our method is based on a probabilistic framework that not only exploits DNA-binding domains and specificities, but also integrates open chromatin, gene expression and genomic data to accurately infer monomeric and homodimeric binding motifs. Our analysis resulted in the assignment of motifs to 200 TFs with no SELEX-derived motifs, roughly a 50% increase compared to the existing coverage.


Subject(s)
Algorithms; Chromatin/chemistry; DNA/chemistry; Gene Expression Regulation; Models, Statistical; Transcription Factors/genetics; Binding Sites; Chromatin/metabolism; DNA/genetics; DNA/metabolism; Genome, Human; Humans; Nucleotide Motifs; Protein Binding; SELEX Aptamer Technique; Transcription Factors/metabolism
18.
Proc Natl Acad Sci U S A ; 113(51): 14662-14667, 2016 12 20.
Article in English | MEDLINE | ID: mdl-27930330

ABSTRACT

Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.


Subject(s)
Brain/metabolism; High-Throughput Nucleotide Sequencing; Principal Component Analysis; Sequence Analysis, RNA; Algorithms; Brain Mapping; Computer Simulation; Data Interpretation, Statistical; Exons; Humans; Models, Statistical; Software; Transcriptome
19.
PLoS Comput Biol ; 13(12): e1005875, 2017 12.
Article in English | MEDLINE | ID: mdl-29281633

ABSTRACT

Mass cytometry (CyTOF) has greatly expanded the capability of cytometry. It is now easy to generate multiple CyTOF samples in a single study, with each sample containing single-cell measurement on 50 markers for more than hundreds of thousands of cells. Current methods do not adequately address the issues concerning combining multiple samples for subpopulation discovery, and these issues can be quickly and dramatically amplified with increasing number of samples. To overcome this limitation, we developed Partition-Assisted Clustering and Multiple Alignments of Networks (PAC-MAN) for the fast automatic identification of cell populations in CyTOF data closely matching that of expert manual-discovery, and for alignments between subpopulations across samples to define dataset-level cellular states. PAC-MAN is computationally efficient, allowing the management of very large CyTOF datasets, which are increasingly common in clinical studies and cancer studies that monitor various tissue samples for each subject.


Subject(s)
Single-Cell Analysis/statistics & numerical data; Animals; Biomarkers/analysis; Cluster Analysis; Computational Biology; Computer Simulation; Data Interpretation, Statistical; Databases, Factual; Flow Cytometry/statistics & numerical data; Gene Expression; Humans; Mice
20.
Nucleic Acids Res ; 43(18): e116, 2015 Oct 15.
Article in English | MEDLINE | ID: mdl-26040699

ABSTRACT

We developed an innovative hybrid sequencing approach, IDP-fusion, to detect fusion genes, determine fusion sites and identify and quantify fusion isoforms. IDP-fusion is the first method to study gene fusion events by integrating Third Generation Sequencing long reads and Second Generation Sequencing short reads. We applied IDP-fusion to PacBio data and Illumina data from the MCF-7 breast cancer cells. Compared with the existing tools, IDP-fusion detects fusion genes at higher precision and a very low false positive rate. The results show that IDP-fusion will be useful for unraveling the complexity of multiple fusion splices and fusion isoforms within tumorigenesis-relevant fusion genes.


Subject(s)
Carcinogenesis/genetics; Gene Expression Profiling; Gene Fusion; High-Throughput Nucleotide Sequencing/methods; Breast Neoplasms/genetics; Breast Neoplasms/metabolism; Female; Humans; MCF-7 Cells; Protein Isoforms/genetics; Protein Isoforms/metabolism; Sequence Alignment