Search | Secretaria de Estado da Saúde

1.

Bayesian LASSO for population stratification correction in rare haplotype association studies.

Liu, Zilu; Turkmen, Asuman Seda; Lin, Shili.

Stat Appl Genet Mol Biol ; 23(1), 2024 01 01.

Article in English | MEDLINE | ID: mdl-38235525

ABSTRACT

Population stratification (PS) is one major source of confounding in both single nucleotide polymorphism (SNP) and haplotype association studies. To address PS, principal component regression (PCR) and linear mixed model (LMM) are the current standards for SNP associations, which are also commonly borrowed for haplotype studies. However, the underfitting and overfitting problems introduced by PCR and LMM, respectively, have yet to be addressed. Furthermore, there have been only a few theoretical approaches proposed to address PS specifically for haplotypes. In this paper, we propose a new method under the Bayesian LASSO framework, QBLstrat, to account for PS in identifying rare and common haplotypes associated with a continuous trait of interest. QBLstrat utilizes a large number of principal components (PCs) with appropriate priors to sufficiently correct for PS, while shrinking the estimates of unassociated haplotypes and PCs. We compare the performance of QBLstrat with the Bayesian counterparts of PCR and LMM and a current method, haplo.stats. Extensive simulation studies and real data analyses show that QBLstrat is superior in controlling false positives while maintaining competitive power for identifying true positives under PS.

Subjects

Models, Genetic; Polymorphism, Single Nucleotide; Haplotypes; Bayes Theorem; Phenotype; Genome-Wide Association Study
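
The shrinkage that the abstract above attributes to the Bayesian LASSO can be illustrated with a deliberately simplified sketch. It assumes a single coefficient with a Gaussian likelihood and a Laplace (double-exponential) prior, for which the posterior mode reduces to soft thresholding. The function name and the parameters `sigma2` and `lam` are illustrative assumptions, not part of QBLstrat, which works with full posterior distributions rather than a mode.

```python
def laplace_posterior_mode(beta_hat, sigma2, lam):
    """Posterior mode of beta when beta_hat ~ N(beta, sigma2) and beta has
    a Laplace prior with density proportional to exp(-lam * |beta|).

    Minimising (beta_hat - beta)**2 / (2 * sigma2) + lam * abs(beta)
    yields soft thresholding with threshold lam * sigma2.
    """
    threshold = lam * sigma2
    if beta_hat > threshold:
        return beta_hat - threshold
    if beta_hat < -threshold:
        return beta_hat + threshold
    return 0.0


# Hypothetical unpenalised estimates for three haplotypes or PCs.
estimates = [2.5, -0.3, 0.1]
shrunk = [laplace_posterior_mode(b, sigma2=1.0, lam=0.5) for b in estimates]
print(shrunk)
```

In this toy setting the large effect is pulled toward zero by the threshold, while the two small effects are set to zero, which mirrors the idea of shrinking unassociated haplotypes and PCs.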

2.

Population stratification correction using Bayesian shrinkage priors for genetic association studies.

Liu, Zilu; Turkmen, Asuman S; Lin, Shili.

Ann Hum Genet ; 87(6): 302-315, 2023 11.

Article in English | MEDLINE | ID: mdl-37771252

ABSTRACT

INTRODUCTION: Population stratification (PS) is a major source of confounding in population-based genetic association studies of quantitative traits. Principal component regression (PCR) and linear mixed model (LMM) are two commonly used approaches to account for PS in association studies. Previous studies have shown that LMM can be interpreted as including all principal components (PCs) as random-effect covariates. However, including all PCs in LMM may dilute the influence of relevant PCs in some scenarios, while including only a few preselected PCs in PCR may fail to fully capture the genetic diversity. MATERIALS AND METHODS: To address these shortcomings, we introduce Bayestrat, a method to detect associated variants with PS correction under the Bayesian LASSO framework. To adjust for PS, Bayestrat accommodates a large number of PCs and utilizes appropriate shrinkage priors to shrink the effects of nonassociated PCs. RESULTS: Simulation results show that Bayestrat consistently controls type I error rates and achieves higher power compared to its non-shrinkage counterparts, especially when the number of PCs included in the model is large. As a demonstration of the utility of Bayestrat, we apply it to the Multi-Ethnic Study of Atherosclerosis (MESA). Variants and genes associated with serum triglyceride or HDL cholesterol are identified in our analyses. DISCUSSION: The automatic and self-selection features of Bayestrat make it particularly suited to situations with complex underlying PS scenarios, where it is unknown a priori which PCs are potential confounders, yet the number that needs to be considered could be large in order to fully account for PS.

Subjects

Genome-Wide Association Study; Models, Genetic; Humans; Bayes Theorem; Genetic Association Studies; Computer Simulation; Linear Models; Phenotype

3.

Are dropout imputation methods for scRNA-seq effective for scHi-C data?

Han, Chenggong; Xie, Qing; Lin, Shili.

Brief Bioinform ; 22(4), 2021 07 20.

Article in English | MEDLINE | ID: mdl-33201180

ABSTRACT

The prevalence of dropout events is a serious problem for single-cell Hi-C (scHiC) data due to insufficient sequencing depth and data coverage, which brings difficulties in downstream studies such as clustering and structural analysis. Complicating things further is the fact that dropouts are confounded with structural zeros due to underlying properties, leading to observed zeros being a mixture of both types of events. Although a great deal of progress has been made in imputing dropout events for single cell RNA-sequencing (RNA-seq) data, little has been done in identifying structural zeros and imputing dropouts for scHiC data. In this paper, we adapted several methods from the single-cell RNA-seq literature for inference on observed zeros in scHiC data and evaluated their effectiveness. Through an extensive simulation study and real data analysis, we have shown that a couple of the adapted single-cell RNA-seq algorithms can be powerful for correctly identifying structural zeros and accurately imputing dropout values. Downstream analysis using the imputed values showed considerable improvement for clustering cells of the same types together over clustering results before imputation.

Subjects

Algorithms; Computer Simulation; RNA, Small Cytoplasmic; RNA-Seq; Single-Cell Analysis; Software; Humans; RNA, Small Cytoplasmic/genetics; RNA, Small Cytoplasmic/metabolism

4.

HiCImpute: A Bayesian hierarchical model for identifying structural zeros and enhancing single cell Hi-C data.

Xie, Qing; Han, Chenggong; Jin, Victor; Lin, Shili.

PLoS Comput Biol ; 18(6): e1010129, 2022 06.

Article in English | MEDLINE | ID: mdl-35696429

ABSTRACT

Single cell Hi-C techniques enable one to study cell to cell variability in chromatin interactions. However, single cell Hi-C (scHi-C) data suffer severely from sparsity, that is, the existence of excess zeros due to insufficient sequencing depth. Complicating the matter further is the fact that not all zeros are created equal: some are due to loci truly not interacting because of the underlying biological mechanism (structural zeros); others are indeed due to insufficient sequencing depth (sampling zeros or dropouts), especially for loci that interact infrequently. Differentiating between structural zeros and dropouts is important since correct inference would improve downstream analyses such as clustering and discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single cell Hi-C literature, where the issue of sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, in this paper, we propose HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes spatial dependencies of scHi-C 2D data structure into account while also borrowing information from similar single cells and bulk data, when such are available. Through an extensive set of analyses of synthetic and real data, we demonstrate the ability of HiCImpute for identifying structural zeros with high sensitivity, and for accurate imputation of dropout values. Downstream analyses using data improved from HiCImpute yielded much more accurate clustering of cell types compared to using observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data have led to the identification of subtypes within each of the excitatory neuronal cells of L4 and L5 in the prefrontal cortex.

Subjects

Chromatin; Chromosomes; Bayes Theorem; Cluster Analysis; Spatial Analysis
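
The distinction between structural zeros and dropouts described in the abstract above can be made concrete with a toy calculation. The sketch assumes a plain zero-inflated Poisson for a single contact count, which is far simpler than the hierarchical model of HiCImpute. The function name and the parameters `pi_structural` and `lam` are hypothetical.

```python
import math


def prob_structural_zero(pi_structural, lam):
    """P(structural zero | observed count is 0) under a zero-inflated Poisson.

    pi_structural: prior probability that the locus pair never interacts.
    lam: Poisson mean of the count when the pair does interact.
    """
    if not 0.0 <= pi_structural <= 1.0:
        raise ValueError("pi_structural must lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    # A zero arises either structurally or as a Poisson draw equal to 0.
    sampling_zero = (1.0 - pi_structural) * math.exp(-lam)
    total = pi_structural + sampling_zero
    return pi_structural / total if total > 0.0 else 0.0


# Same prior, different interaction intensities (hypothetical values).
rarely_interacting = prob_structural_zero(pi_structural=0.2, lam=0.1)
often_interacting = prob_structural_zero(pi_structural=0.2, lam=5.0)
print(rarely_interacting, often_interacting)
```

Under these assumptions, a zero at a pair that would otherwise interact often is much more likely to be structural than a zero at a pair that interacts rarely, which is why dropouts are hardest to separate from structural zeros for infrequently interacting loci.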

5.

Sparse estimation in semiparametric finite mixture of varying coefficient regression models.

Khalili, Abbas; Shokoohi, Farhad; Asgharian, Masoud; Lin, Shili.

Biometrics ; 79(4): 3445-3457, 2023 12.

Article in English | MEDLINE | ID: mdl-37066855

ABSTRACT

Finite mixtures of regressions (FMRs) are commonly used to model heterogeneous effects of covariates on a response variable in settings where there are unknown underlying subpopulations. FMRs, however, cannot accommodate situations where covariates' effects also vary according to an "index" variable, a setting known as finite mixture of varying coefficient regression (FM-VCR). Although complex, this situation occurs in real data applications: the osteocalcin (OCN) data analyzed in this manuscript presents a heterogeneous relationship where the effect of a genetic variant on OCN in each hidden subpopulation varies over time. Oftentimes, the number of covariates with varying coefficients also presents a challenge: in the OCN study, genetic variants on the same chromosome are considered jointly. The relative proportions of hidden subpopulations may also change over time. Nevertheless, existing methods cannot provide suitable solutions for accommodating all these features in real data applications. To fill this gap, we develop statistical methodologies based on regularized local-kernel likelihood for simultaneous parameter estimation and variable selection in sparse FM-VCR models. We study large-sample properties of the proposed methods. We then carry out a simulation study to evaluate the performance of various penalties adopted for our regularized approach and ascertain the ability of a BIC-type criterion for estimating the number of subpopulations. Finally, we applied the FM-VCR model to analyze the OCN data and identified several covariates, including genetic variants, that have age-dependent effects on OCN.

Subjects

Models, Statistical; Computer Simulation; Likelihood Functions

6.

Detecting X-linked common and rare variant effects in family-based sequencing studies.

Turkmen, Asuman S; Lin, Shili.

Genet Epidemiol ; 45(1): 36-45, 2021 02.

Article in English | MEDLINE | ID: mdl-32864779

ABSTRACT

The breakthroughs in next generation sequencing have allowed us to access data consisting of both common and rare variants, and in particular to investigate the impact of rare genetic variation on complex diseases. Although rare genetic variants are thought to be important components in explaining genetic mechanisms of many diseases, discovering these variants remains challenging, and most studies are restricted to population-based designs. Further, despite the shift in the field of genome-wide association studies (GWAS) towards studying rare variants due to the "missing heritability" phenomenon, little is known about rare X-linked variants associated with complex diseases. For instance, there is evidence that X-linked genes are highly involved in brain development and cognition when compared with autosomal genes; however, like most GWAS for other complex traits, previous GWAS for mental diseases have provided poor resources to deal with identification of rare variant associations on X-chromosome. In this paper, we address the two issues described above by proposing a method that can be used to test X-linked variants using sequencing data on families. Our method is much more general than existing methods, as it can be applied to detect both common and rare variants, and is applicable to autosomes as well. Our simulation study shows that the method is efficient, and exhibits good operational characteristics. An application to the University of Miami Study on Genetics of Autism and Related Disorders also yielded encouraging results.

Subjects

Genes, X-Linked; Genome-Wide Association Study; Genetic Variation; High-Throughput Nucleotide Sequencing; Humans; Models, Genetic; Multifactorial Inheritance

7.

Evaluation and comparison of methods for recapitulation of 3D spatial chromatin structures.

Park, Jincheol; Lin, Shili.

Brief Bioinform ; 20(4): 1205-1214, 2019 07 19.

Article in English | MEDLINE | ID: mdl-29091999

ABSTRACT

How chromosomes fold and how distal genomic elements interact with one another at a genomic scale have been actively pursued in the past decade following the seminal work describing the Chromosome Conformation Capture (3C) assay. Essentially, 3C-based technologies produce two-dimensional (2D) contact maps that capture interactions between genomic fragments. Accordingly, a plethora of analytical methods have been proposed to take a 2D contact map as input to recapitulate the underlying whole genome three-dimensional (3D) structure of the chromatin. However, their performance in terms of several factors, including data resolution and ability to handle contact map features, has not been sufficiently evaluated. This task is taken up in this article, in which we consider several recent and/or well-regarded methods, both optimization-based and model-based, for their aptness of producing 3D structures using contact maps generated based on a population of cells. These methods are evaluated and compared using both simulated and real data. Several criteria have been used. For simulated data sets, the focus is on accurate recapitulation of the entire structure given the existence of the gold standard. For real data sets, comparison with distances measured by Fluorescence in situ Hybridization and consistency with several genomic features of known biological functions are examined.

Subjects

Chromatin/chemistry; Chromatin/genetics; Animals; Chromatin/ultrastructure; Chromosomes, Human/chemistry; Chromosomes, Human/genetics; Chromosomes, Human/ultrastructure; Computational Biology/methods; Computer Simulation; Databases, Genetic; Genome, Human; Humans; Imaging, Three-Dimensional/methods; In Situ Hybridization, Fluorescence; Mice; Models, Genetic; Molecular Conformation

8.

Three-dimensional analysis reveals altered chromatin interaction by enhancer inhibitors harbors TCF7L2-regulated cancer gene signature.

Gerrard, Diana L; Wang, Yao; Gaddis, Malaina; Zhou, Yufan; Wang, Junbai; Witt, Heather; Lin, Shili; Farnham, Peggy J; Jin, Victor X; Frietze, Seth E.

J Cell Biochem ; 120(3): 3056-3070, 2019 03.

Article in English | MEDLINE | ID: mdl-30548288

ABSTRACT

Distal regulatory elements influence the activity of gene promoters through chromatin looping. Chromosome conformation capture (3C) methods permit identification of chromatin contacts across different regions of the genome. However, due to limitations in the resolution of these methods, the detection of functional chromatin interactions remains a challenge. In the current study, we employ an integrated approach to define and characterize the functional chromatin contacts of human pancreatic cancer cells. We applied tethered chromatin capture to define classes of chromatin domains on a genome-wide scale. We identified three types of structural domains (topologically associated, boundary, and gap) and investigated the functional relationships of these domains with respect to chromatin state and gene expression. We uncovered six distinct sub-domains associated with epigenetic states. Interestingly, specific epigenetically active domains are sensitive to treatment with histone acetyltransferase (HAT) inhibitors and show a decrease in H3K27 acetylation levels. To examine whether the subdomains that change upon drug treatment are functionally linked to transcription factor regulation, we compared TCF7L2 chromatin binding and gene regulation to HAT inhibition. We identified a subset of coding RNA genes that together can stratify pancreatic cancer patients into distinct survival groups. Overall, this study describes a process to evaluate the functional features of chromosome architecture, reveals the impact of epigenetic inhibitors on chromosome architecture, and identifies genes that may provide insight into disease outcome.

Subjects

Benzoates/pharmacology; Bridged Bicyclo Compounds, Heterocyclic/pharmacology; Chromatin/metabolism; Gene Regulatory Networks; Pancreatic Neoplasms/genetics; Pyrazoles/pharmacology; Pyrimidinones/pharmacology; Transcription Factor 7-Like 2 Protein/metabolism; Cell Line, Tumor; Chromatin/chemistry; Chromatin/genetics; Chromatin Assembly and Disassembly; Chromosome Mapping; Epigenesis, Genetic/drug effects; Epigenomics; Gene Expression Regulation, Neoplastic/drug effects; Gene Regulatory Networks/drug effects; Humans; Nitrobenzenes; Pancreatic Neoplasms/metabolism; Pyrazolones; Transcription Factor 7-Like 2 Protein/genetics

9.

Agonist and antagonist switch DNA motifs recognized by human androgen receptor in prostate cancer.

Chen, Zhong; Lan, Xun; Thomas-Ahner, Jennifer M; Wu, Dayong; Liu, Xiangtao; Ye, Zhenqing; Wang, Liguo; Sunkel, Benjamin; Grenade, Cassandra; Chen, Junsheng; Zynger, Debra L; Yan, Pearlly S; Huang, Jiaoti; Nephew, Kenneth P; Huang, Tim H-M; Lin, Shili; Clinton, Steven K; Li, Wei; Jin, Victor X; Wang, Qianben.

EMBO J ; 34(4): 502-16, 2015 Feb 12.

Article in English | MEDLINE | ID: mdl-25535248

ABSTRACT

Human transcription factors recognize specific DNA sequence motifs to regulate transcription. It is unknown whether a single transcription factor is able to bind to distinctly different motifs on chromatin, and if so, what determines the usage of specific motifs. By using a motif-resolution chromatin immunoprecipitation-exonuclease (ChIP-exo) approach, we find that agonist-liganded human androgen receptor (AR) and antagonist-liganded AR bind to two distinctly different motifs, leading to distinct transcriptional outcomes in prostate cancer cells. Further analysis on clinical prostate tissues reveals that the binding of AR to these two distinct motifs is involved in prostate carcinogenesis. Together, these results suggest that unique ligands may switch DNA motifs recognized by ligand-dependent transcription factors in vivo. Our findings also provide a broad mechanistic foundation for understanding ligand-specific induction of gene expression profiles.

Subjects

Androgen Receptor Antagonists/chemistry; Androgens/chemistry; DNA/metabolism; Prostatic Neoplasms/metabolism; Receptors, Androgen/metabolism; Androgen Receptor Antagonists/metabolism; Androgens/metabolism; Cell Proliferation/physiology; Chromatin Immunoprecipitation; Electrophoretic Mobility Shift Assay; Humans; Male; Reverse Transcriptase Polymerase Chain Reaction

10.

A Family-Based Rare Haplotype Association Method for Quantitative Traits.

Datta, Ananda S; Lin, Shili; Biswas, Swati.

Hum Hered ; 83(4): 175-195, 2018.

Article in English | MEDLINE | ID: mdl-30799419

ABSTRACT

BACKGROUND: The variants identified in genome-wide association studies account for only a small fraction of disease heritability. A key to this "missing heritability" is believed to be rare variants. Specifically, we focus on rare haplotype variant (rHTV). The existing methods for detecting rHTV are mostly population-based, and as such, are susceptible to population stratification and admixture, leading to an inflated false-positive rate. Family-based methods are more robust in this respect. METHODS: We propose a method for detecting rHTVs associated with quantitative traits called family-based quantitative Bayesian LASSO (famQBL). FamQBL can analyze any type of pedigree and is based on a mixed model framework. We regularize the haplotype effects using Bayesian LASSO and estimate the posterior distributions using Markov chain Monte Carlo methods. RESULTS: We conduct simulation studies, including analyses of Genetic Analysis Workshop 18 simulated data, to study the properties of famQBL and compare with a standard family-based haplotype association test implemented in FBAT (family-based association test) software. We find famQBL to be more powerful than FBAT with well-controlled false-positive rates. We also apply famQBL to the Framingham Heart Study data and detect an rHTV associated with diastolic blood pressure. CONCLUSION: FamQBL can help uncover rHTVs associated with quantitative traits.

Subjects

Cardiovascular Diseases/genetics; Genome-Wide Association Study; Haplotypes; Polymorphism, Single Nucleotide; Quantitative Trait, Heritable; Bayes Theorem; Cardiovascular Diseases/pathology; Female; Humans; Male; Models, Genetic; Pedigree; Phenotype

11.

Are rare variants really independent?

Turkmen, Asuman; Lin, Shili.

Genet Epidemiol ; 41(4): 363-371, 2017 05.

Article in English | MEDLINE | ID: mdl-28300291

ABSTRACT

Recent advances in genotyping with high-density markers allow researchers access to genomic variants including rare ones. Linkage disequilibrium (LD) is widely used to provide insight into evolutionary history. It is also the basis for association mapping in humans and other species. Better understanding of the genomic LD structure may lead to better-informed statistical tests that can improve the power of association studies. Although rare variant associations with common diseases (RVCD) have been extensively studied recently, there is very limited understanding of, and even controversy about, LD structures among rare variants and between rare and common variants. In fact, many popular RVCD tests assume that rare variants are independent. In this report, we show that two commonly used LD measures are not capable of detecting LD when rare variants are involved. We present this argument from two perspectives, both the LD measures themselves and the computational issues associated with them. To address these issues, we propose an alternative LD measure, the polychoric correlation, that was originally designed for detecting associations among categorical variables. Using simulated data as well as the 1000 Genomes data, we explore the performances of LD measures in detail and discuss their implications in association studies.

Subjects

Genetic Variation; Genome-Wide Association Study; Chromosomes, Human, Pair 21/genetics; Computer Simulation; Gene Frequency/genetics; Genotype; Humans; Linkage Disequilibrium/genetics; Polymorphism, Single Nucleotide/genetics
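
The difficulty with standard LD measures for rare variants, raised in the abstract above, can be illustrated with a small calculation. The abstract does not name the two measures it examines; this sketch assumes the widely used D' and r^2, computed from haplotype and allele frequencies. The function name and the example frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab is the frequency of the haplotype carrying allele A and allele B;
    p_a and p_b are the marginal frequencies of alleles A and B.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # D' rescales D by its largest attainable magnitude given the margins.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = d / d_max
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


# Hypothetical example: a rare allele (1%) that always sits on the same
# haplotype as a common allele (50%).
d, d_prime, r2 = ld_measures(p_ab=0.01, p_a=0.01, p_b=0.5)
print(f"D={d:.4f}  D'={d_prime:.2f}  r^2={r2:.4f}")
```

In this example the rare allele is completely dependent on the common one, yet r^2 is only about 0.01 while D' sits at its maximum of 1. Neither value alone describes the dependence well when allele frequencies are this unbalanced.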

12.

Statistical methods for detecting differentially methylated regions based on MethylCap-seq data.

Ayyala, Deepak N; Frankhouser, David E; Ganbat, Javkhlan-Ochir; Marcucci, Guido; Bundschuh, Ralf; Yan, Pearlly; Lin, Shili.

Brief Bioinform ; 17(6): 926-937, 2016 11.

Article in English | MEDLINE | ID: mdl-26454095

ABSTRACT

DNA methylation is a well-established epigenetic mark, whose pattern throughout the genome, especially in the promoter or CpG islands, may be modified in a cell at a disease stage. Recently developed probabilistic approaches allow distributing methylation signals at nucleotide resolution from MethylCap-seq data. Standard statistical methods for detecting differential methylation suffer from 'curse of dimensionality' and sparsity in signals, resulting in high false-positive rates. Strong correlation of signals between CG sites also yields spurious results. In this article, we review applicability of high-dimensional mean vector tests for detection of differentially methylated regions (DMRs) and compare and contrast such tests with other methods for detecting DMRs. Comprehensive simulation studies are conducted to highlight the performance of these tests under different settings. Based on our observation, we make recommendations on the optimal test to use. We illustrate the superiority of mean vector tests in detecting cancer-related canonical gene pathways, which are significantly enriched for acute myeloid leukemia and ovarian cancer.

Subjects

DNA Methylation; CpG Islands; Epigenomics; Humans; Leukemia, Myeloid, Acute; Promoter Regions, Genetic

13.

Indirect effect inference and application to GAW20 data.

Li, Liming; Wang, Chan; Lu, Tianyuan; Lin, Shili; Hu, Yue-Qing.

BMC Genet ; 19(Suppl 1): 67, 2018 09 17.

Article in English | MEDLINE | ID: mdl-30255768

ABSTRACT

BACKGROUND: Association studies using a single type of omics data have been successful in identifying disease-associated genetic markers, but the underlying mechanisms are unaddressed. To provide a possible explanation of how these genetic factors affect the disease phenotype, integration of multiple omics data is needed. RESULTS: We propose a novel method, LIPID (likelihood inference proposal for indirect estimation), that uses both single nucleotide polymorphism (SNP) and DNA methylation data jointly to analyze the association between a trait and SNPs. The total effect of SNPs is decomposed into direct and indirect effects, where the indirect effects are the focus of our investigation. Simulation studies show that LIPID performs better in various scenarios than existing methods. Application to the GAW20 data also leads to encouraging results, as the genes identified appear to be biologically relevant to the phenotype studied. CONCLUSIONS: The proposed LIPID method is shown to be meritorious in extensive simulations and in real-data analyses.

Subjects

Genome-Wide Association Study; Genomics/methods; DNA Methylation; Humans; Hypertriglyceridemia/drug therapy; Hypertriglyceridemia/genetics; Hypoglycemic Agents/therapeutic use; Polymorphism, Single Nucleotide
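
The decomposition of a total effect into direct and indirect parts, mentioned in the abstract above, can be written out for the simplest case. The equations below assume a linear mediation model with one SNP $X$, one methylation measure $M$, and a trait $Y$. The abstract does not state LIPID's model, so the symbols and the linear form are illustrative assumptions only.

```latex
\begin{align}
M &= \alpha_0 + a\,X + \varepsilon_M, \\
Y &= \beta_0 + c'\,X + b\,M + \varepsilon_Y, \\
\underbrace{c}_{\text{total effect}}
  &= \underbrace{c'}_{\text{direct effect}}
   + \underbrace{a\,b}_{\text{indirect effect via } M}.
\end{align}
```

Substituting the first equation into the second gives $Y = (\beta_0 + b\,\alpha_0) + (c' + a\,b)\,X + (b\,\varepsilon_M + \varepsilon_Y)$, so the coefficient on $X$ is the sum of the direct and indirect effects.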

14.

Detecting associations of rare variants with common diseases: collapsing or haplotyping?

Wang, Meng; Lin, Shili.

Brief Bioinform ; 16(5): 759-68, 2015 Sep.

Article in English | MEDLINE | ID: mdl-25596401

ABSTRACT

In recent years, a myriad of new statistical methods have been proposed for detecting associations of rare single-nucleotide variants (SNVs) with common diseases. These methods can be generally classified as 'collapsing' or 'haplotyping' based. The former is the predominant class, composed of most of the rare variant association methods proposed to date. However, recent works have suggested that haplotyping-based methods may offer advantages and can even be more powerful than collapsing methods in certain situations. In this article, we review and compare collapsing- versus haplotyping-based methods/software in terms of both power and type I error. For collapsing methods, we consider three approaches: Combined Multivariate and Collapsing, Sequence Kernel Association Test and Family-Based Association Test (FBAT): the first two are population based and are among the most popular; the last test is family based, a modification from the popular FBAT to accommodate rare SNVs. For haplotyping-based methods, we include Logistic Bayesian Lasso (LBL) for population data and family-based LBL (famLBL) for family (trio) data. These two methods are selected, as they can be used to test association for specific rare and common haplotypes. Our results show that haplotype methods can be more powerful than collapsing methods if there are interacting SNVs leading to larger haplotype effects. Even if only common SNVs are genotyped, haplotype methods can still detect specific rare haplotypes that tag rare causal SNVs. As expected, family-based methods are robust, whereas population-based methods are susceptible, to population substructure. However, the population-based haplotype approach appears to have smaller inflation of type I error than its collapsing counterparts.

Subjects

Genetic Predisposition to Disease; Haplotypes; Bayes Theorem; Case-Control Studies; Humans; Polymorphism, Single Nucleotide
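
The "collapsing" idea referred to in the abstract above can be sketched in a few lines. The example assumes genotypes coded as minor-allele counts (0, 1, 2) and a minor allele frequency cutoff of 1%. It shows only the collapsing step and is not an implementation of any of the tests compared in the article. The function name, the cutoff, and the data are hypothetical.

```python
def collapse_rare_variants(genotypes, maf, threshold=0.01):
    """Return one indicator per individual: 1 if the individual carries at
    least one minor allele at any variant whose MAF is below threshold.

    genotypes: list of rows, one per individual, of minor-allele counts.
    maf: minor allele frequency of each variant (one per column).
    """
    rare_columns = [j for j, freq in enumerate(maf) if freq < threshold]
    return [int(any(row[j] > 0 for j in rare_columns)) for row in genotypes]


# Hypothetical data: 4 individuals, 3 variants (the second one is common).
genotypes = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 2, 1],
    [0, 0, 0],
]
maf = [0.005, 0.30, 0.002]
carriers = collapse_rare_variants(genotypes, maf)
print(carriers)
```

The resulting indicator can then be compared between cases and controls with a standard test. Haplotype-based methods instead keep track of which alleles occur together on the same chromosome, information that this collapsing step discards.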

15.

Logistic Bayesian LASSO for genetic association analysis of data from complex sampling designs.

Zhang, Yuan; Hofmann, Jonathan N; Purdue, Mark P; Lin, Shili; Biswas, Swati.

J Hum Genet ; 62(9): 819-829, 2017 Sep.

Article in English | MEDLINE | ID: mdl-28424482

ABSTRACT

Detecting gene-environment interactions with rare variants is critical in dissecting the etiology of common diseases. Interactions with rare haplotype variants (rHTVs) are of particular interest. At the same time, complex sampling designs, such as stratified random sampling, are becoming increasingly popular for designing case-control studies, especially for recruiting controls. The US Kidney Cancer Study (KCS) is an example, wherein all available cases were included while the controls at each site were randomly selected from the population by frequency matching with cases based on age, sex and race. There is currently no rHTV association method that can account for such a complex sampling design. To fill this gap, we consider logistic Bayesian LASSO (LBL), an existing rHTV approach for case-control data, and show that its model can easily accommodate the complex sampling design. We study two extensions that include stratifying variables either as main effects only or with additional modeling of their interactions with haplotypes. We conduct extensive simulation studies to compare the complex sampling methods with the original LBL methods. We find that, when there is no interaction between haplotype and stratifying variables, both extensions perform well while the original LBL methods lead to inflated type I error rates. However, when such an interaction exists, it is necessary to include the interaction effect in the model to control the type I error rate. Finally, we analyze the KCS data and find a significant interaction between (current) smoking and a specific rHTV in the N-acetyltransferase 2 gene.

Subjects

Bayes Theorem; Gene-Environment Interaction; Genetic Association Studies/methods; Haplotypes; Models, Genetic; Algorithms; Computer Simulation; Gene Frequency; Genetic Variation; Humans; Inheritance Patterns; Odds Ratio

16.

A random effect model for reconstruction of spatial chromatin structure.

Park, Jincheol; Lin, Shili.

Biometrics ; 73(1): 52-62, 2017 03.

Article in English | MEDLINE | ID: mdl-27214023

ABSTRACT

A gene may be controlled by distal enhancers and repressors, not merely by regulatory elements in its promoter. Spatial organization of chromosomes is the mechanism that brings genes and their distal regulatory elements into close proximity. Recent molecular techniques, coupled with Next Generation Sequencing (NGS) technology, enable genome-wide detection of physical contacts between distant genomic loci. In particular, Hi-C is an NGS-aided assay for the study of genome-wide spatial interactions. The availability of such data makes it possible to reconstruct the underlying three-dimensional (3D) spatial chromatin structure. In this article, we present the Poisson Random effect Architecture Model (PRAM) for such an inference. The main feature of PRAM that separates it from previous methods is that it addresses the issue of over-dispersion and takes correlations among contact counts into consideration, thereby achieving greater consistency with observed data. PRAM was applied to Hi-C data to illustrate its performance and to compare the predicted distances with those measured by a Fluorescence In Situ Hybridization (FISH) validation experiment. Further, PRAM was compared to other methods in the literature based on both real and simulated data.

Subjects

Chromatin/chemistry; Models, Biological; Models, Statistical; Spatial Analysis; Gene Expression Regulation; In Situ Hybridization, Fluorescence; Poisson Distribution

17.

Detecting rare and common haplotype-environment interaction under uncertainty of gene-environment independence assumption.

Zhang, Yuan; Lin, Shili; Biswas, Swati.

Biometrics ; 73(1): 344-355, 2017 03.

Article in English | MEDLINE | ID: mdl-27478935

ABSTRACT

Finding rare variants and gene-environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer: there is evidence that variants on Chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior rendering the assumption of G-E independence inappropriate. With the motivation of detecting GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I) by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, to incorporate uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet has well-controlled type I errors in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.

Subjects

Biometry/methods , Gene-Environment Interaction , Lung Neoplasms/etiology , Smoking/adverse effects , Human Chromosome Pair 15 , Computer Simulation , Statistical Data Interpretation , Genetic Variation , Haplotypes , Humans , Logistic Models , Lung Neoplasms/genetics , Genetic Models , Risk , Smoking/genetics

18.

Impact of data resolution on three-dimensional structure inference methods.

Park, Jincheol; Lin, Shili.

BMC Bioinformatics ; 17: 70, 2016 Feb 06.

Article in English | MEDLINE | ID: mdl-26852142

ABSTRACT

BACKGROUND: Assays that are capable of detecting genome-wide chromatin interactions have produced massive amounts of data and led to a greater understanding of the chromosomal three-dimensional (3D) structure. As technology becomes more sophisticated, data of increasingly high resolution are being produced, going from the initial 1 megabase (Mb) resolution to the current 10 kilobase (Kb) or even 1 Kb resolution. The availability of genome-wide interaction data necessitates development of analytical methods to recover the underlying 3D spatial chromatin structure, but challenges abound. Most of the methods were proposed for analyzing data at low resolution (1 Mb). Their behaviors are thus unknown for higher resolution data. For such data, one of the key features is the high proportion of "0" contact counts among all available data, in other words, the excess of zeros. RESULTS: To address the issue of excess of zeros, in this paper, we propose a truncated Random effect EXpression (tREX) method that can handle data at various resolutions. We then assess the performance of tREX and a number of leading existing methods for recovering the underlying chromatin 3D structure. This was accomplished by creating in-silico data to mimic multiple levels of resolution and submitting the methods to a "stress test". Finally, we applied tREX and the comparison methods to a Hi-C dataset for which FISH measurements are available to evaluate estimation accuracy. CONCLUSION: The proposed tREX method achieves consistently good performance in all 30 simulated settings considered. It is not only robust to resolution level and underlying parameters, but also insensitive to model misspecification. This conclusion is based on observations made in terms of 3D structure estimation accuracy and preservation of topologically associated domains. Application of the methods to the human lymphoblastoid cell line data on chromosomes 14 and 22 further substantiates the superior performance of tREX: the constructed 3D structure from tREX is consistent with the FISH measurements, and the corresponding distances predicted by tREX have higher correlation with the FISH measurements than any of the comparison methods. SOFTWARE: An open-source R-package is available at http://www.stat.osu.edu/~statgen/Software/tRex.

Subjects

Chromatin/chemistry , Human Chromosomes/chemistry , Computer Simulation , Lymphocytes/chemistry , Theoretical Models , Software , Cultured Cells , Human Genome , Humans , Fluorescence In Situ Hybridization

19.

GrammR: graphical representation and modeling of count data with application in metagenomics.

Ayyala, Deepak Nag; Lin, Shili.

Bioinformatics ; 31(10): 1648-54, 2015 May 15.

Article in English | MEDLINE | ID: mdl-25609792

ABSTRACT

MOTIVATION: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects: the measure of dissimilarity between samples/taxa and the algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. RESULTS: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree and is hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.

Subjects

Algorithms , Bacteria/genetics , Computational Biology/methods , Computer Graphics , Information Storage and Retrieval/methods , Metagenomics , Statistical Models , Bacteria/classification , Cluster Analysis , Gastrointestinal Tract/microbiology , Humans , Multivariate Analysis , Phylogeny

20.

Block-based association tests for rare variants using Kullback-Leibler divergence.

Zhu, Degang; Hu, Yue-Qing; Lin, Shili.

J Hum Genet ; 61(11): 965-975, 2016 Nov.

Article in English | MEDLINE | ID: mdl-27412875

ABSTRACT

Although genome-wide association studies have successfully detected numerous associations between common variants and complex diseases, these variants typically can only explain a small part of the heritable component of a disease. With the advent of next-generation sequencing, attention has turned to rare variants. Recently, a variety of approaches for detecting associations of rare variants have been proposed, including the Kullback-Leibler divergence-based tests (KLTs) for detecting genotypic differences between cases and controls. However, few of these approaches consider linkage disequilibrium (LD) structure among rare variants and common variants. In this study, we propose two block-based association tests for testing the effects of rare variants on a disease. The main idea for this approach comes from the hypothesis that a region of interest may consist of two or more LD blocks such that single-nucleotide variants (SNVs) within each block are correlated, whereas SNVs in different blocks are independent or weakly correlated. Under this hypothesis, we propose two tests that are generalizations of the KLTs by taking the block structure into account. A simulation study under various scenarios shows that the proposed methods have well-controlled type I error rates and outperform some leading methods in the literature. Moreover, application to the Dallas Heart Study data demonstrates the feasibility and performance of the two proposed methods in a realistic setting.

Subjects

Genetic Association Studies/methods , Genetic Variation , Genetic Models , Algorithms , Computer Simulation , Datasets as Topic , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Humans , Linkage Disequilibrium , Single Nucleotide Polymorphism , Reproducibility of Results
