Results 1 - 9 of 9
1.
Proc Natl Acad Sci U S A ; 114(38): 10166-10171, 2017 09 19.
Article in English | MEDLINE | ID: mdl-28874526

ABSTRACT

Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.


Subjects
Confidentiality , DNA Fingerprinting , Genetic Models , Phenotype , Whole Genome Sequencing , Adult , Age Factors , Algorithms , Body Size , Cohort Studies , Data Anonymization , Female , Humans , Male , Middle Aged , Pigmentation/genetics , Young Adult
2.
PLoS Genet ; 12(3): e1005849, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26943367

ABSTRACT

Although genome-wide association studies (GWASs) have discovered numerous novel genetic variants associated with many complex traits and diseases, those genetic variants typically explain only a small fraction of phenotypic variance. Factors that account for phenotypic variance include environmental factors and gene-by-environment interactions (GEIs). Recently, several studies have conducted genome-wide gene-by-environment association analyses and demonstrated important roles of GEIs in complex traits. One of the main challenges in these association studies is to control the effects of population structure, which may cause spurious associations. Many studies have analyzed how population structure influences statistics of genetic variants and developed several statistical approaches to correct for population structure. However, the impact of population structure on GEI statistics in GWASs has not been extensively studied, nor have there been methods designed to correct for population structure on GEI statistics. In this paper, we show both analytically and empirically that population structure may cause spurious GEIs and use both simulation and two GWAS datasets to support our finding. We propose a statistical approach based on mixed models to account for population structure on GEI statistics. We find that our approach effectively controls population structure on statistics for GEIs as well as for genetic variants.


Subjects
Gene-Environment Interaction , Population Genetics , Human Genome , Genome-Wide Association Study/methods , Computer Simulation , Humans , Genetic Models , Phenotype , Single Nucleotide Polymorphism/genetics
3.
Bioinformatics ; 31(12): i206-13, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072484

ABSTRACT

MOTIVATION: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' to be variants that are responsible for the association signal at a locus. Whereas association studies benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. RESULTS: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation but also has an average of 10% higher recall rate compared with the existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. AVAILABILITY AND IMPLEMENTATION: Software is freely available for download at genetics.cs.ucla.edu/caviar.
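The idea of a minimally sized gene set that captures the causal genes with probability ρ can be illustrated with a small sketch. This is not the CAVIAR-Gene algorithm: it assumes, for simplicity, a single causal gene and per-gene posterior probabilities that have already been computed. The function name `credible_gene_set` and all gene names other than Apoa2 are hypothetical.

```python
def credible_gene_set(posteriors, rho=0.95):
    """Return the smallest set of genes whose posterior mass reaches rho.

    Assumption (not from the abstract): exactly one gene is causal, so
    per-gene posteriors can simply be summed after normalization.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    total = sum(posteriors.values())
    # Rank genes from most to least probable.
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for gene, p in ranked:
        chosen.append(gene)
        mass += p / total
        if mass >= rho:
            break
    return chosen


# Hypothetical posteriors for four genes at one locus.
example = {"Apoa2": 0.6, "GeneB": 0.25, "GeneC": 0.1, "GeneD": 0.05}
selected = credible_gene_set(example, rho=0.8)
```

Taking genes in decreasing order of posterior probability yields the smallest such set under the single-causal-gene assumption; allowing several causal variants per locus, as the abstract describes, requires reasoning over sets of variants instead.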


Subjects
Genes , Genome-Wide Association Study , Algorithms , Animals , Apolipoprotein A-II/genetics , Statistical Data Interpretation , Humans , Linkage Disequilibrium , Mice , Single Nucleotide Polymorphism
4.
J Comput Biol ; 22(5): 451-62, 2015 May.
Article in English | MEDLINE | ID: mdl-25526526

ABSTRACT

Ever since its introduction, the haplotype copy model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is equally likely to contribute a priori to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copy model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copy model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy as well as to a lower computational runtime than the standard approach. Finally, we show our proposed model can be used to localize individuals on the genetic-geographical map on the basis of their genotype data.
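The copying process described above can be sketched as a forward pass over a hidden Markov model whose hidden state is the reference haplotype currently being copied. The sketch below is a generic haplotype copy model, not the authors' implementation: the `switch` and `error` parameters, the function name, and the use of a single `prior` vector as a stand-in for geographic proximity are all assumptions made for illustration. A uniform `prior` gives the standard model; a non-uniform one makes some haplotypes a priori more likely to be copied.

```python
import numpy as np


def copy_model_likelihood(target, panel, switch=0.01, error=0.001, prior=None):
    """Likelihood of `target` as a mosaic of the haplotypes in `panel`.

    target: sequence of 0/1 alleles, length M.
    panel:  K x M array of 0/1 reference haplotypes.
    prior:  optional length-K weights for the copying prior (assumed here
            to encode closeness on a genetic-geographic map).
    """
    panel = np.asarray(panel)
    target = np.asarray(target)
    n_haps, n_sites = panel.shape
    if prior is None:
        prior = np.full(n_haps, 1.0 / n_haps)
    else:
        prior = np.asarray(prior, dtype=float)
        prior = prior / prior.sum()

    def emission(site):
        # Copy the allele faithfully, except for a small error rate.
        return np.where(panel[:, site] == target[site], 1.0 - error, error)

    alpha = prior * emission(0)
    for site in range(1, n_sites):
        # Stay on the same haplotype, or switch according to the prior.
        alpha = ((1.0 - switch) * alpha + switch * prior * alpha.sum()) * emission(site)
    return float(alpha.sum())


panel = [[0, 0, 0], [1, 1, 1]]
uniform = copy_model_likelihood([0, 0, 0], panel)
informed = copy_model_likelihood([0, 0, 0], panel, prior=[0.9, 0.1])
```

In this toy panel the target matches the first haplotype, so a prior that favors that haplotype raises the likelihood relative to the uniform prior.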


Subjects
Algorithms , Human Genome , Haplotypes , Genetic Models , Racial Groups/genetics , Human Chromosome Pair 22 , Genetic Variation , Population Genetics , Genome-Wide Association Study , Geography , Humans , Linkage Disequilibrium , Markov Chains , Single Nucleotide Polymorphism
5.
G3 (Bethesda) ; 4(12): 2505-18, 2014 Nov 03.
Article in English | MEDLINE | ID: mdl-25371484

ABSTRACT

Ancestry analysis from genetic data plays a critical role in studies of human disease and evolution. Recent work has introduced explicit models for the geographic distribution of genetic variation and has shown that such explicit models yield superior accuracy in ancestry inference over nonmodel-based methods. Here we extend such work to introduce a method that models admixture between ancestors from multiple sources across a geographic continuum. We devise efficient algorithms based on hidden Markov models to localize on a map the recent ancestors (e.g., grandparents) of admixed individuals, jointly with assigning ancestry at each locus in the genome. We validate our methods by using empirical data from individuals with mixed European ancestry from the Population Reference Sample study and show that our approach is able to localize their recent ancestors within an average of 470 km of the reported locations of their grandparents. Furthermore, simulations from real Population Reference Sample genotype data show that our method attains high accuracy in localizing recent ancestors of admixed individuals in Europe (an average of 550 km from their true location for localization of two ancestries in Europe, four generations ago). We explore the limits of ancestry localization under our approach and find that performance decreases as the number of distinct ancestries and generations since admixture increases. Finally, we build a map of expected localization accuracy across admixed individuals according to the location of origin within Europe of their ancestors.


Subjects
Genetic Models , Algorithms , Diploidy , Gene Frequency , Genetic Loci , Genetic Variation , Human Genome , Haplotypes , Humans , Linkage Disequilibrium , Markov Chains , White Population/genetics
6.
PLoS Genet ; 10(10): e1004722, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25357204

ABSTRACT

Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulated data. We validate our findings using a large-scale meta-analysis of four blood lipid traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.


Subjects
Algorithms , Chromosome Mapping/methods , Genome-Wide Association Study/methods , Humans , Linkage Disequilibrium , Theoretical Models , Single Nucleotide Polymorphism/genetics
7.
Bioinformatics ; 29(18): 2245-52, 2013 Sep 15.
Article in English | MEDLINE | ID: mdl-23825370

ABSTRACT

MOTIVATION: Haplotypes, defined as the sequence of alleles on one chromosome, are crucial for many genetic analyses. As experimental determination of haplotypes is extremely expensive, haplotypes are traditionally inferred using computational approaches from genotype data, i.e. the mixture of the genetic information from both haplotypes. The best-performing approaches for haplotype inference rely on Hidden Markov Models, with the underlying assumption that the haplotypes of a given individual can be represented as a mosaic of segments from other haplotypes in the same population. Such algorithms use this model to predict the most likely haplotypes that explain the observed genotype data conditional on a reference panel of haplotypes. With rapid advances in short read sequencing technologies, sequencing is quickly establishing itself as a powerful approach for collecting genetic variation information. As opposed to traditional genotyping-array technologies that independently call genotypes at polymorphic sites, short read sequencing often collects haplotypic information; a read spanning more than one polymorphic locus (multi-single nucleotide polymorphic read) contains information on the haplotype from which the read originates. However, this information is generally ignored in existing approaches for haplotype phasing and genotype-calling from short read data. RESULTS: In this article, we propose a novel framework for haplotype inference from short read sequencing that leverages multi-single nucleotide polymorphic reads together with a reference panel of haplotypes. The basis of our approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. We devised an efficient sampling method within a probabilistic model to achieve performance superior to that of existing methods. Using simulated sequencing reads from real individual genotypes in the HapMap and 1000 Genomes projects, we show that our method is highly accurate and computationally efficient. Our haplotype predictions improve accuracy over the basic haplotype copying model by ∼20% with comparable computational time, and over another recently proposed approach, Hap-SeqX, by ∼10% with significantly reduced computational time and memory usage. AVAILABILITY: Software is publicly available at http://genetics.cs.ucla.edu/harsh CONTACT: bpasaniuc@mednet.ucla.edu or eeskin@cs.ucla.edu.


Subjects
Haplotypes , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism , DNA Sequence Analysis/methods , Algorithms , Alleles , Human Genome , Genotyping Techniques , HapMap Project , Humans , Statistical Models , Software
8.
J Comput Biol ; 20(3): 224-36, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23421794

ABSTRACT

Copy number variations (CNVs) are widely known to be an important mediator for diseases and traits. The development of high-throughput sequencing (HTS) technologies has provided great opportunities to identify CNV regions in mammalian genomes. In a typical experiment, millions of short reads obtained from a genome of interest are mapped to a reference genome. The mapping information can be used to identify CNV regions. One important challenge in analyzing the mapping information is the large fraction of reads that can be mapped to multiple positions. Most existing methods either only consider reads that can be uniquely mapped to the reference genome or randomly place a read to one of its mapping positions. Therefore, these methods have low power to detect CNVs located within repeated sequences. In this study, we propose a probabilistic model, CNVeM, that utilizes the inherent uncertainty of read mapping. We use maximum likelihood to estimate locations and copy numbers of copied regions and implement an expectation-maximization (EM) algorithm. One important contribution of our model is that we can distinguish between regions in the reference genome that differ from each other by as little as 0.1%. As our model aims to predict the copy number of each nucleotide, we can predict the CNV boundaries with high resolution. We apply our method to simulated datasets and achieve higher accuracy compared to CNVnator. Moreover, we apply our method to real data from which we detected known CNVs. To our knowledge, this is the first attempt to predict CNVs at nucleotide resolution and to utilize uncertainty of read mapping.
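How an expectation-maximization loop can use the uncertainty of multi-mapped reads, rather than discarding or randomly placing them, is shown in the toy sketch below. It is not CNVeM: it estimates only the relative share of reads per candidate region and ignores sequence differences, per-nucleotide copy number, and sequencing error. The function name and the region labels "A" and "B" are hypothetical.

```python
def em_read_assignment(reads, n_iter=50):
    """Estimate per-region read shares from ambiguously mapped reads.

    reads: list of lists; each inner list holds the distinct candidate
           regions that one read maps to.
    Returns a dict mapping region -> estimated share of reads.
    """
    regions = sorted({region for read in reads for region in read})
    weights = {region: 1.0 / len(regions) for region in regions}
    for _ in range(n_iter):
        # E-step: split each read across its candidate regions in
        # proportion to the current region weights.
        counts = dict.fromkeys(regions, 0.0)
        for read in reads:
            norm = sum(weights[region] for region in read)
            for region in read:
                counts[region] += weights[region] / norm
        # M-step: re-estimate the weights from the fractional counts.
        total = sum(counts.values())
        weights = {region: counts[region] / total for region in regions}
    return weights


# Two reads map uniquely to region "A"; one read maps to both "A" and "B".
shares = em_read_assignment([["A"], ["A"], ["A", "B"]])
```

Here the ambiguous read is progressively attributed to region "A", because the uniquely mapped reads provide evidence for "A" and none for "B".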


Subjects
DNA Copy Number Variations/genetics , High-Throughput Nucleotide Sequencing , Statistical Models , Uncertainty , Algorithms , Animals , Mammalian Chromosomes/genetics , Computer Simulation , Nucleic Acid Databases , Genome/genetics , Humans , Mice , Time Factors
9.
Nat Genet ; 44(6): 725-31, 2012 May 20.
Article in English | MEDLINE | ID: mdl-22610118

ABSTRACT

Characterizing genetic diversity within and between populations has broad applications in studies of human disease and evolution. We propose a new approach, spatial ancestry analysis, for the modeling of genotypes in two- or three-dimensional space. In spatial ancestry analysis (SPA), we explicitly model the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space. We show that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone. We apply our SPA method to a European and a worldwide population genetic variation data set and identify SNPs showing large gradients in allele frequency, and we suggest these as candidate regions under selection. These regions include SNPs in the well-characterized LCT region, as well as at loci including FOXP2, OCA2 and LRP1B.
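Modeling allele frequency as a continuous function of geographic position, and then localizing an individual from genotypes alone, can be sketched as follows. The abstract says only that the function is continuous; the logistic form, the binomial genotype model, the grid search, and every name and parameter value below are assumptions chosen for illustration.

```python
import numpy as np


def allele_frequency(position, slopes, intercepts):
    """Assumed logistic allele-frequency surface, one value per SNP."""
    return 1.0 / (1.0 + np.exp(-(slopes @ position + intercepts)))


def log_likelihood(position, genotypes, slopes, intercepts):
    """Log-likelihood of 0/1/2 genotypes at a position (constants dropped)."""
    freq = allele_frequency(position, slopes, intercepts)
    return float(np.sum(genotypes * np.log(freq) + (2 - genotypes) * np.log(1.0 - freq)))


def localize(genotypes, slopes, intercepts, grid):
    """Return the grid point that maximizes the genotype log-likelihood."""
    scores = [log_likelihood(point, genotypes, slopes, intercepts) for point in grid]
    return grid[int(np.argmax(scores))]


# Two hypothetical SNPs: one whose frequency rises along x, one along y.
slopes = np.array([[1.0, 0.0], [0.0, 1.0]])
intercepts = np.zeros(2)
genotypes = np.array([2, 0])  # two copies of SNP 1's allele, none of SNP 2's

axis = np.linspace(-2.0, 2.0, 5)
grid = np.array([[x, y] for x in axis for y in axis])
best = localize(genotypes, slopes, intercepts, grid)
```

With these toy coefficients, carrying two copies of the first allele and none of the second places the individual at the grid corner with the largest x and smallest y. A SNP with a steep slope in this parameterization corresponds to a large gradient in allele frequency.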


Subjects
Demography , Genetic Variation , Genetic Models , Gene Frequency , Population Genetics , Genotype , Humans , Single Nucleotide Polymorphism , Genetic Selection , White Population/genetics