2.
PLoS Comput Biol ; 16(3): e1007732, 2020 03.
Article in English | MEDLINE | ID: mdl-32191703

ABSTRACT

The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes and don't account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element to understand genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN.
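
As a rough illustration of the mixture-model step described above, the following sketch computes the posterior class probabilities of one gene family under a multivariate Bernoulli mixture. It is not the PPanGGOLiN implementation: the function name, the parameter layout, and the example values are assumptions, and the Markov Random Field smoothing over the graph is left out.

```python
import math

def bernoulli_mixture_posterior(presence, priors, theta):
    """E-step for one gene family in a multivariate Bernoulli mixture.

    presence: list of 0/1 flags, one per genome (hypothetical input layout).
    priors:   mixing proportion of each partition.
    theta:    theta[k][j] = P(gene present in genome j | partition k), in (0, 1).
    Returns the posterior probability of each partition.
    """
    log_weights = []
    for prior_k, theta_k in zip(priors, theta):
        log_w = math.log(prior_k)
        for x_j, t in zip(presence, theta_k):
            log_w += math.log(t if x_j else 1.0 - t)
        log_weights.append(log_w)
    # Normalize in log space to avoid underflow with many genomes.
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

# A family present in all three genomes is assigned almost entirely to the
# first (persistent-like) class under these illustrative parameters.
post = bernoulli_mixture_posterior([1, 1, 1], [0.5, 0.5], [[0.9] * 3, [0.1] * 3])
```

In a full Expectation-Maximization loop this step would alternate with a re-estimation of the priors and of theta from the posteriors.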


Subject(s)
Genome, Bacterial/genetics; Genomics/methods; Software; Algorithms; Bacteria/classification; Bacteria/genetics; Multivariate Analysis
3.
BMC Bioinformatics ; 19(1): 459, 2018 Nov 29.
Article in English | MEDLINE | ID: mdl-30497371

ABSTRACT

BACKGROUND: Genome-Wide Association Studies (GWAS) seek to identify causal genomic variants associated with rare human diseases. The classical statistical approach for detecting these variants is based on univariate hypothesis testing, with healthy individuals being tested against affected individuals at each locus. Given that an individual's genotype is characterized by up to one million SNPs, this approach lacks precision, since it may yield a large number of false positives that can lead to erroneous conclusions about genetic associations with the disease. One way to improve the detection of true genetic associations is to reduce the number of hypotheses to be tested by grouping SNPs. RESULTS: We propose a dimension-reduction approach which can be applied in the context of GWAS by making use of the haplotype structure of the human genome. We compare our method with standard univariate and group-based approaches on both synthetic and real GWAS data. CONCLUSION: We show that reducing the dimension of the predictor matrix by aggregating SNPs gives a greater precision in the detection of associations between the phenotype and genomic regions.


Subject(s)
Genome-Wide Association Study; Polymorphism, Single Nucleotide/genetics; Algorithms; Area Under Curve; Case-Control Studies; Computer Simulation; Gene Frequency/genetics; Humans; Linkage Disequilibrium/genetics; Numerical Analysis, Computer-Assisted; Phenotype; ROC Curve; Spondylitis, Ankylosing/genetics
4.
BMC Bioinformatics ; 18(1): 54, 2017 Jan 23.
Article in English | MEDLINE | ID: mdl-28114904

ABSTRACT

BACKGROUND: A large amount of research has been devoted to the detection and investigation of epistatic interactions in genome-wide association studies (GWASs). Most of the literature focuses on low-order interactions between single-nucleotide polymorphisms (SNPs) with significant main effects. RESULTS: In this paper we propose an original approach for detecting epistasis at the gene level, without systematically filtering on significant genes. We first compute interaction variables for each gene pair by finding its Eigen-Epistasis component, defined as the linear combination of Gene SNPs having the highest correlation with the phenotype. The selection of significant effects is done using a penalized regression method based on Group Lasso controlling the False Discovery Rate. CONCLUSION: The method is tested against two recent alternative proposals from the literature using synthetic data, and performs well in different settings. We demonstrate the power of our approach by detecting new gene-gene interactions in three genome-wide association studies.


Subject(s)
Computational Biology/methods; Epistasis, Genetic; Computer Simulation; Genome-Wide Association Study; Genotype; Humans; Inflammatory Bowel Diseases/genetics; Models, Theoretical; Phenotype; Polymorphism, Single Nucleotide; Principal Component Analysis; Thyroid Neoplasms/genetics
5.
BMC Bioinformatics ; 16: 148, 2015 May 08.
Article in English | MEDLINE | ID: mdl-25951947

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) aim at finding genetic markers that are significantly associated with a phenotype of interest. Single nucleotide polymorphism (SNP) data from the entire genome are collected for many thousands of SNP markers, leading to high-dimensional regression problems where the number of predictors greatly exceeds the number of observations. Moreover, these predictors are statistically dependent, in particular due to linkage disequilibrium (LD). We propose a three-step approach that explicitly takes advantage of the grouping structure induced by LD in order to identify common variants which may have been missed by single marker analyses (SMA). In the first step, we perform a hierarchical clustering of SNPs with an adjacency constraint using LD as a similarity measure. In the second step, we apply a model selection approach to the obtained hierarchy in order to define LD blocks. Finally, we perform Group Lasso regression on the inferred LD blocks. We investigate the efficiency of this approach compared to state-of-the-art regression methods: haplotype association tests, SMA, and Lasso and Elastic-Net regressions. RESULTS: Our results on simulated data show that the proposed method performs better than state-of-the-art approaches as soon as the number of causal SNPs within an LD block exceeds 2. Our results on semi-simulated data and a previously published HIV data set illustrate the relevance of the proposed method and its robustness to a real LD structure. The method is implemented in the R package BALD (Blockwise Approach using Linkage Disequilibrium), available from http://www.math-evry.cnrs.fr/publications/logiciels . CONCLUSIONS: Our results show that the proposed method is efficient not only at the level of LD blocks, by accurately inferring the underlying block structure, but also at the level of individual SNPs. Thus, this study demonstrates the importance of tailored integration of biological knowledge in high-dimensional genomic studies such as GWAS.
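
The abstract does not state which LD measure feeds the clustering step. As a hedged illustration only, the sketch below computes one common choice, the squared Pearson correlation (r²) between the genotype dosages of two SNPs. The function name and the 0/1/2 dosage coding are assumptions, not part of the BALD package.

```python
def ld_r2(snp_a, snp_b):
    """Squared Pearson correlation between two genotype dosage vectors.

    snp_a, snp_b: equal-length lists of allele counts (0, 1 or 2) per individual.
    Returns 0.0 when either SNP is monomorphic in the sample (zero variance).
    """
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov * cov / (var_a * var_b)

# Perfectly correlated (or anti-correlated) SNPs give r2 = 1.
r2_same = ld_r2([0, 1, 2, 1], [0, 1, 2, 1])
```

A matrix of such pairwise values between neighbouring SNPs could then serve as the similarity input to an adjacency-constrained clustering.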


Subject(s)
Algorithms; Genome-Wide Association Study/methods; Haplotypes/genetics; Linkage Disequilibrium; Models, Theoretical; Polymorphism, Single Nucleotide/genetics; Genetic Markers/genetics; Humans
6.
BMC Ecol Evol ; 23(1): 46, 2023 09 01.
Article in English | MEDLINE | ID: mdl-37658324

ABSTRACT

BACKGROUND: Plankton seascape genomics studies have revealed different trends from large-scale weak differentiation to microscale structures. Previous studies have underlined the influence of the environment and seascape on species differentiation and adaptation. However, these studies have generally focused on a few single species, sparse molecular markers, or local scales. Here, we investigated the genomic differentiation of plankton at the macro-scale in a holistic approach using Tara Oceans metagenomic data together with a reference-free computational method. RESULTS: We reconstructed the FST-based genomic differentiation of 113 marine planktonic taxa occurring in the North and South Atlantic Oceans, Southern Ocean, and Mediterranean Sea. These taxa belong to various taxonomic clades spanning Metazoa, Chromista, Chlorophyta, Bacteria, and viruses. Globally, population genetic connectivity was significantly higher within oceanic basins and lower in bacteria and unicellular eukaryotes than in zooplankton. Using mixed linear models, we tested six abiotic factors influencing connectivity, including Lagrangian travel time, as proxies of oceanic current effects. We found that oceanic currents were the main population genetic connectivity drivers, together with temperature and salinity. Finally, we classified the 113 taxa into parameter-driven groups and showed that plankton taxa belonging to the same taxonomic rank such as phylum, class or order presented genomic differentiation driven by different environmental factors. CONCLUSION: Our results validate the isolation-by-current hypothesis for a non-negligible proportion of taxa and highlight the role of other physicochemical parameters in large-scale plankton genetic connectivity. The reference-free approach used in this study offers a new systematic framework to analyse the population genomics of non-model and undocumented marine organisms from a large-scale and holistic point of view.


Subject(s)
Acclimatization; Plankton; Animals; Plankton/genetics; Zooplankton/genetics; Genomics; Atlantic Ocean; Eukaryota
7.
Front Genet ; 13: 859462, 2022.
Article in English | MEDLINE | ID: mdl-35734430

ABSTRACT

Motivation: Identifying new genetic associations in non-Mendelian complex diseases is an increasingly difficult challenge. These diseases sometimes appear to have a significant component of heritability requiring explanation, and this missing heritability may be due to the existence of subtypes involving different genetic factors. Taking genetic information into account in clinical trials might potentially have a role in guiding the process of subtyping a complex disease. Most methods dealing with multiple sources of information rely on data transformation, and in disease subtyping, the two main strategies used are 1) the clustering of clinical data followed by posterior genetic analysis and 2) the concomitant clustering of clinical and genetic variables. Both of these strategies have limitations that we propose to address. Contribution: This work proposes an original method for disease subtyping on the basis of both longitudinal clinical variables and high-dimensional genetic markers via a sparse mixture-of-regressions model. The added value of our approach lies in its interpretability in relation to two aspects. First, our model links both clinical and genetic data with regard to their initial nature (i.e., without transformation) and does not require post-processing where the original information is accessed a second time to interpret the subtypes. Second, it can address large-scale problems because of a variable selection step that is used to discard genetic variables that may not be relevant for subtyping. Results: The proposed method was validated on simulations. A dataset from a cohort of Parkinson's disease patients was also analyzed. Several subtypes of the disease and genetic variants that potentially have a role in this typology were identified. Software availability: The R code for the proposed method, named DiSuGen, and a tutorial are available for download (see the references).

8.
Stat Appl Genet Mol Biol ; 9: Article 15, 2010.
Article in English | MEDLINE | ID: mdl-20196750

ABSTRACT

We present a weighted-LASSO method to infer the parameters of a first-order vector auto-regressive model that describes time course expression data generated by directed gene-to-gene regulation networks. These networks are assumed to have prior internal connectivity structures that drive the inference method. This prior structure can be either derived from prior biological knowledge or inferred by the method itself. We illustrate the performance of this structure-based penalization both on synthetic data and on two canonical regulatory networks (the yeast cell cycle regulation network and the E. coli S.O.S. DNA repair network).


Subject(s)
Gene Expression Profiling/statistics & numerical data; Gene Regulatory Networks; Oligonucleotide Array Sequence Analysis/statistics & numerical data; Regression Analysis; Algorithms; Biostatistics; Cell Cycle/genetics; Escherichia coli/genetics; Escherichia coli/metabolism; Likelihood Functions; Models, Genetic; Models, Statistical; SOS Response, Genetics/genetics; Saccharomyces cerevisiae/cytology; Saccharomyces cerevisiae/genetics
9.
Nat Commun ; 12(1): 1173, 2021 02 19.
Article in English | MEDLINE | ID: mdl-33608509

ABSTRACT

Antimicrobial resistance is a major global health threat and its development is promoted by antibiotic misuse. While disk diffusion antibiotic susceptibility testing (AST, also called antibiogram) is broadly used to test for antibiotic resistance in bacterial infections, it faces strong criticism because of inter-operator variability and the complexity of interpretative reading. Automatic reading systems address these issues, but are not always adapted or available to resource-limited settings. We present an artificial intelligence (AI)-based, offline smartphone application for antibiogram analysis. The application captures images with the phone's camera, and the user is guided throughout the analysis on the same device by a user-friendly graphical interface. An embedded expert system validates the coherence of the antibiogram data and provides interpreted results. The fully automatic measurement procedure of our application's reading system achieves an overall agreement of 90% on susceptibility categorization against a hospital-standard automatic system and 98% against manual measurement (gold standard), with reduced inter-operator variability. The application's performance showed that the automatic reading of antibiotic resistance testing is entirely feasible on a smartphone. Moreover, our application is suited for resource-limited settings, and therefore has the potential to significantly increase patients' access to AST worldwide.


Subject(s)
Artificial Intelligence; Drug Resistance, Microbial; Microbial Sensitivity Tests/methods; Mobile Applications; Smartphone; Anti-Bacterial Agents/pharmacology; Bacterial Infections; Drug Resistance, Microbial/drug effects; Humans; Image Processing, Computer-Assisted; Machine Learning; Software
10.
Bioinformatics ; 25(3): 417-8, 2009 Feb 01.
Article in English | MEDLINE | ID: mdl-19073589

ABSTRACT

SUMMARY: The R package SIMoNe (Statistical Inference for MOdular NEtworks) enables inference of gene-regulatory networks based on partial correlation coefficients from microarray experiments. Modelling gene expression data with a Gaussian graphical model (hereafter GGM), the algorithm estimates non-zero entries of the concentration matrix, in a sparse and possibly high-dimensional setting. Its originality lies in the fact that it searches for a latent modular structure to drive the inference procedure through adaptive penalization of the concentration matrix. AVAILABILITY: Under the GNU General Public Licence at http://cran.r-project.org/web/packages/simone/
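
To make the link between the concentration matrix and partial correlations concrete, here is a minimal sketch of the standard identity rho_ij = -k_ij / sqrt(k_ii * k_jj). It is written in Python rather than R and does not use or reproduce the SIMoNe API; the function name and the toy matrix are assumptions for illustration.

```python
import math

def partial_correlations(concentration):
    """Convert a concentration (inverse covariance) matrix to partial correlations.

    concentration: square list of lists with a strictly positive diagonal.
    Off-diagonal entry (i, j) becomes -k_ij / sqrt(k_ii * k_jj); the diagonal
    is set to 1.0 by convention.
    """
    size = len(concentration)
    result = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                scale = math.sqrt(concentration[i][i] * concentration[j][j])
                result[i][j] = -concentration[i][j] / scale
    return result

# A zero entry in the concentration matrix means no edge between the two genes.
rho = partial_correlations([[2.0, -1.0, 0.0], [-1.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
```

The sparsity pattern of the estimated concentration matrix therefore reads directly as the edge set of the inferred network.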


Subject(s)
Algorithms; Gene Regulatory Networks; Software; Computer Simulation; Databases, Genetic; Gene Expression Profiling
11.
Front Microbiol ; 11: 649, 2020.
Article in English | MEDLINE | ID: mdl-32351481

ABSTRACT

We consider the problem of incorporating evolutionary information (e.g., taxonomic or phylogenetic trees) in the context of metagenomics differential analysis. Recent results published in the literature propose different ways to leverage the tree structure to increase the detection rate of differentially abundant taxa. Here, we propose instead to use a different hierarchical structure, in the form of a correlation-based tree, as it may capture the structure of the data better than the phylogeny. We first show that the correlation tree and the phylogeny are significantly different before turning to the impact of tree choice on detection rates. Using synthetic data, we show that the tree does have an impact: smoothing p-values according to the phylogeny leads to detection rates equal or inferior to those obtained by smoothing according to the correlation tree. However, both trees are outperformed by the classical, non-hierarchical, Benjamini-Hochberg (BH) procedure in terms of detection rates. Other procedures may use the hierarchical structure with profit but do not control the False Discovery Rate (FDR) a priori and remain inferior to a classical Benjamini-Hochberg procedure with the same nominal FDR. On real datasets, no hierarchical procedure had a significantly higher detection rate than BH. Intuition suggests that the use of hierarchical structures should increase the detection rate of differentially abundant taxa in microbiome studies. However, our results suggest that current hierarchical procedures are still inferior to standard methods and more effective procedures remain to be invented.
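
For readers unfamiliar with the baseline discussed above, the following sketch shows the classical Benjamini-Hochberg step-up procedure. It illustrates the standard method only, not the hierarchical procedures evaluated in the paper; the function name and the example p-values are assumptions.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking hypotheses rejected at nominal FDR alpha.

    Step-up rule: find the largest rank k (1-based, p-values sorted ascending)
    with p_(k) <= alpha * k / m, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected

# Illustrative p-values for ten taxa; only the two smallest pass the step-up rule.
mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042,
                           0.060, 0.074, 0.205, 0.212, 0.216])
```

Note that the rule is step-up: a p-value that fails its own threshold is still rejected if a larger p-value further down the ranking passes.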

12.
PLoS One ; 15(12): e0244637, 2020.
Article in English | MEDLINE | ID: mdl-33378381

ABSTRACT

The availability of large metagenomic data offers great opportunities for the population genomic analysis of uncultured organisms, which represent a large part of the unexplored biosphere and play a key ecological role. However, the majority of these organisms lack a reference genome or transcriptome, which constitutes a technical obstacle for classical population genomic analyses. We introduce the metavariant species (MVS) model, in which a species is represented only by intra-species nucleotide polymorphism. We designed a method combining reference-free variant calling, multiple density-based clustering and maximum-weighted independent set algorithms to cluster intra-species variants into MVSs directly from multisample metagenomic raw reads without a reference genome or read assembly. The frequencies of the MVS variants are then used to compute population genomic statistics such as FST, in order to estimate genomic differentiation between populations and to identify loci under natural selection. The MVS construction was tested on simulated and real metagenomic data. MVSs showed the required quality for robust population genomics and allowed an accurate estimation of genomic differentiation (ΔFST < 0.0001 and <0.03 on simulated and real data respectively). Loci predicted under natural selection on real data were all detected by MVSs. MVSs represent a new paradigm that may simplify and enhance holistic approaches for population genomics and the evolution of microorganisms.
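
The abstract does not specify which FST estimator is computed from the variant frequencies. As a hedged illustration, the sketch below uses the textbook heterozygosity-based definition, FST = (H_T - H_S) / H_T, for one biallelic locus in two populations of equal weight. The function name and the equal-weight assumption are illustrative choices, not the authors' method.

```python
def wright_fst(freq_pop1, freq_pop2):
    """FST at one biallelic locus from the allele frequency in two populations.

    H_S is the mean within-population expected heterozygosity and H_T the
    expected heterozygosity of the pooled population (equal weights assumed).
    Returns 0.0 when the locus is fixed for the same allele in both populations.
    """
    h_s = (2 * freq_pop1 * (1 - freq_pop1) + 2 * freq_pop2 * (1 - freq_pop2)) / 2
    mean_freq = (freq_pop1 + freq_pop2) / 2
    h_t = 2 * mean_freq * (1 - mean_freq)
    if h_t == 0:
        return 0.0
    return (h_t - h_s) / h_t

# Identical frequencies give no differentiation; opposite fixation gives FST = 1.
fst_value = wright_fst(0.2, 0.8)
```

Genome-wide values are usually obtained by combining many loci, and sample-size corrections matter with real read counts; both are omitted here.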


Subject(s)
Computational Biology/methods; Genetic Variation; Metagenomics/methods; Cluster Analysis; Genetics, Population; Models, Genetic; Selection, Genetic; Software
13.
Ecol Evol ; 10(16): 8894-8905, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32884665

ABSTRACT

Acclimation allowed by variation in gene or allele expression in natural populations is increasingly understood as a decisive mechanism, as much as adaptation, for species evolution. However, for small eukaryotic organisms, such as zooplankton species, classical methods face numerous challenges. Here, we propose the concept of allelic differential expression at the population-scale (psADE) to investigate the variation in allele expression in natural populations. We developed a novel approach to detect psADE based on metagenomic and metatranscriptomic data from environmental samples. This approach was applied to the widespread marine copepod, Oithona similis, by combining samples collected during the Tara Oceans expedition (2009-2013) and de novo transcriptome assemblies. Among a total of 25,768 single nucleotide variants (SNVs) of O. similis, 572 (2.2%) were affected by psADE in at least one population (FDR < 0.05). The distribution of SNVs under psADE in different populations is significantly shaped by population genomic differentiation (Pearson r = 0.87, p = 5.6 × 10⁻³⁰), supporting a partial genetic control of psADE. Moreover, a significant proportion of SNVs (0.6%) were under both selection and psADE (p < .05), supporting the hypothesis that natural selection and psADE tend to impact common loci. Population-scale allelic differential expression offers new insights into the gene regulation control in populations and its link with natural selection.

14.
Front Genet ; 11: 581594, 2020.
Article in English | MEDLINE | ID: mdl-33329721

ABSTRACT

Genome-Wide Association Studies (GWAS) explain only a small fraction of heritability for most complex human phenotypes. Genomic heritability estimates the variance explained by the SNPs on the whole genome using mixed models and accounts for the many small contributions of SNPs in the explanation of a phenotype. This paper approaches heritability from a machine learning perspective, and examines the close link between mixed models and ridge regression. Our contribution is two-fold. First, we propose estimating genomic heritability using a predictive approach via ridge regression and Generalized Cross Validation (GCV). We show that this is consistent with classical mixed model based estimation. Second, we derive simple formulae that express prediction accuracy as a function of the ratio n/p, where n is the population size and p the total number of SNPs. These formulae clearly show that a high heritability does not imply an accurate prediction when p > n. Both the estimation of heritability via GCV and the prediction accuracy formulae are validated using simulated data and real data from UK Biobank.

15.
Algorithms Mol Biol ; 15: 13, 2020.
Article in English | MEDLINE | ID: mdl-32625242

ABSTRACT

MOTIVATION: Association studies have been widely used to search for associations between common genetic variants and a given phenotype. However, it is now generally accepted that genes and environment must be examined jointly when estimating phenotypic variance. In this work we consider two types of biological markers: genotypic markers, which characterize an observation in terms of inherited genetic information, and metagenomic markers, which are related to the environment. Both types of markers are available in their millions and can be used to characterize any observation uniquely. OBJECTIVE: Our focus is on detecting interactions between groups of genetic and metagenomic markers in order to gain a better understanding of the complex relationship between environment and genome in the expression of a given phenotype. CONTRIBUTIONS: We propose a novel approach for efficiently detecting interactions between complementary datasets in a high-dimensional setting with a reduced computational cost. The method, named SICOMORE, reduces the dimension of the search space by selecting a subset of supervariables in the two complementary datasets. These supervariables are given by a weighted group structure defined on sets of variables at different scales. A Lasso selection is then applied on each type of supervariable to obtain a subset of potential interactions that will be explored via linear model testing. RESULTS: We compare SICOMORE with other approaches in simulations, with varying sample sizes, noise, and numbers of true interactions. SICOMORE exhibits convincing results in terms of recall, as well as competitive performance with respect to running time. The method is also used to detect interactions between genomic markers in Medicago truncatula and metagenomic markers in its rhizosphere bacterial community. SOFTWARE AVAILABILITY: An R package is available [4], along with its documentation and associated scripts, allowing the reader to reproduce the results presented in the paper.

16.
Algorithms Mol Biol ; 14: 22, 2019.
Article in English | MEDLINE | ID: mdl-31807137

ABSTRACT

BACKGROUND: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. But a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of 10⁴ to 10⁵ for each chromosome. RESULTS: By assuming that the similarity between physically distant objects is negligible, we are able to propose an implementation of adjacency-constrained HAC with quasi-linear complexity. This is achieved by pre-calculating specific sums of similarities, and storing candidate fusions in a min-heap. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds. AVAILABILITY AND IMPLEMENTATION: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).
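
The adjacency constraint itself can be shown with a deliberately naive sketch: at each step only neighbouring clusters are candidates for merging. This is not the adjclust algorithm. It uses average similarity as the linkage (an assumption; the abstract does not name the linkage), recomputes every candidate at each step, and has none of the pre-computed sums or min-heap that give the published method its quasi-linear complexity.

```python
def adjacency_constrained_hac(similarity, n_clusters):
    """Naive adjacency-constrained agglomerative clustering.

    similarity: square list of lists, ordered by genomic position.
    n_clusters: number of contiguous clusters to return (at least 1).
    Returns a list of clusters, each a list of consecutive locus indices.
    """
    clusters = [[i] for i in range(len(similarity))]

    def average_link(left, right):
        total = sum(similarity[i][j] for i in left for j in right)
        return total / (len(left) * len(right))

    while len(clusters) > n_clusters:
        # Only pairs of neighbouring clusters (k, k + 1) may be merged.
        best = max(range(len(clusters) - 1),
                   key=lambda k: average_link(clusters[k], clusters[k + 1]))
        clusters[best:best + 2] = [clusters[best] + clusters[best + 1]]
    return clusters

# Toy similarity matrix with two obvious blocks: loci {0, 1} and {2, 3}.
toy = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
blocks = adjacency_constrained_hac(toy, 2)
```

Because clusters are always contiguous intervals, the result is a segmentation of the chromosome rather than an arbitrary partition.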

17.
Sci Rep ; 9(1): 7550, 2019 05 17.
Article in English | MEDLINE | ID: mdl-31101892

ABSTRACT

High-throughput RNA-sequencing has become the gold standard method for whole-transcriptome gene expression analysis, and is widely used in numerous applications to study cell and tissue transcriptomes. It is also being increasingly used in a number of clinical applications, including expression profiling for diagnostics and alternative transcript detection. However, despite its many advantages, RNA sequencing can be challenging in some situations, for instance in cases of low input amounts or degraded RNA samples. Several protocols have been proposed to overcome these challenges, and many are available as commercial kits. In this study, we systematically test three recent commercial technologies for RNA-seq library preparation (TruSeq, SMARTer and SMARTer Ultra-Low) on human biological reference materials, using standard (1 mg), low (100 ng and 10 ng) and ultra-low (<1 ng) input amounts, and for mRNA and total RNA, stranded and unstranded. The results are analyzed using read quality and alignment metrics, gene detection and differential gene expression metrics. Overall, we show that the TruSeq kit performs well with an input amount of 100 ng, while the SMARTer kit shows decreased performance for inputs of 100 and 10 ng, and the SMARTer Ultra-Low kit performs relatively well for input amounts <1 ng. All the results are discussed in detail, and we provide guidelines for biologists for the selection of an RNA-seq library preparation kit.


Subject(s)
Exome Sequencing/methods; Gene Expression Profiling/methods; RNA-Seq/methods; Transcriptome/genetics; Humans; RNA, Messenger/genetics; Reagent Kits, Diagnostic
18.
PLoS One ; 7(10): e45685, 2012.
Article in English | MEDLINE | ID: mdl-23077494

ABSTRACT

Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. In this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and, using the gap statistic, estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, namely data from the HapMap and Pan-Asian SNP consortia. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or those produced by the Admixture program. The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution to infer fine-scale genetic patterns.


Subject(s)
Cluster Analysis; Population Groups; Algorithms; Haplotypes; Humans; Models, Theoretical; Polymorphism, Single Nucleotide
19.
PLoS One ; 6(12): e28845, 2011.
Article in English | MEDLINE | ID: mdl-22216125

ABSTRACT

Genome-Wide Association Studies are powerful tools to detect genetic variants associated with diseases. Their results have, however, been questioned, in part because of the bias induced by population stratification. This is a consequence of systematic differences in allele frequencies due to the difference in sample ancestries that can lead to either false positive or false negative findings. Many strategies are available to account for stratification but their performances differ, for instance according to the type of population structure, the disease susceptibility locus minor allele frequency, the degree of sampling imbalance, or the sample size. We focus on the type of population structure and propose a comparison of the most commonly used methods to deal with stratification, namely Genomic Control, Principal Component based methods such as implemented in Eigenstrat, adjusted Regressions and Meta-Analyses strategies. Our assessment of the methods is based on a large simulation study, involving several scenarios corresponding to many types of population structures. We focused on both false positive rate and power to determine which methods perform the best. Our analysis showed that if there is no population structure, none of the tests introduced bias or decreased power, except for the Meta-Analyses. When the population is stratified, adjusted Logistic Regressions and Eigenstrat are the best solutions to account for stratification even though only the Logistic Regressions are able to consistently maintain correct false positive rates. This study provides more details about these methods. Their advantages and limitations in different stratification scenarios are highlighted in order to propose practical guidelines to account for population stratification in Genome-Wide Association Studies.
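
Of the methods compared above, Genomic Control is the simplest to write down. The sketch below follows the usual textbook form: the inflation factor lambda is the median of the observed 1-df chi-square statistics divided by the theoretical median (about 0.4549), and each statistic is divided by lambda. The function name is an assumption, and flooring lambda at 1 is a common convention rather than something stated in the abstract.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Estimate the inflation factor and return corrected test statistics.

    chi2_stats: list of 1-df chi-square association statistics, one per SNP.
    Returns (lambda, corrected statistics); lambda is floored at 1 so that
    statistics are never inflated by the correction.
    """
    inflation = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)
    return inflation, [stat / inflation for stat in chi2_stats]

# Illustrative statistics whose median is twice the null median.
lam, corrected = genomic_control([0.2, 0.9098728, 3.0])
```

Because a single factor rescales every SNP, this correction cannot account for stratification effects that vary across loci, which is one reason regression- and PCA-based adjustments are compared against it.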


Subject(s)
Genome-Wide Association Study; Gene Frequency; Humans
20.
BMC Proc ; 2 Suppl 4: S4, 2008 Dec 17.
Article in English | MEDLINE | ID: mdl-19091051

ABSTRACT

BACKGROUND: Identifying gene functional modules is an important step towards elucidating gene functions at a global scale. Clustering algorithms mostly rely on co-expression of genes, that is, they group together genes with similar expression profiles. RESULTS: We propose to cluster genes by co-regulation rather than by co-expression. We therefore present an inference algorithm for detecting co-regulated groups from gene expression data and introduce a method to cluster genes given that inferred regulatory structure. Finally, we propose to validate the clustering through a score based on the GO enrichment of the obtained groups of genes. CONCLUSION: We evaluate the methods on S. cerevisiae stress response data and obtain better scores than clustering obtained directly from gene expression.
