Results 1 - 20 of 45
1.
Eur J Cancer ; 202: 113978, 2024 May.
Article in English | MEDLINE | ID: mdl-38471290

ABSTRACT

BACKGROUND: The PAOLA-1/ENGOT-ov25 trial showed that maintenance olaparib plus bevacizumab increases survival of advanced ovarian cancer patients with homologous recombination deficiency (HRD). However, decentralized solutions to test for HRD in clinical routine are scarce. The goal of this study was to retrospectively validate, on tumor samples from the PAOLA-1 trial, the decentralized SeqOne assay, which relies on shallow Whole Genome Sequencing (sWGS) to capture genomic instability and on targeted sequencing to determine BRCA status. METHODS: The study comprised 368 patients from the PAOLA-1 trial. The SeqOne assay was compared to the Myriad MyChoice HRD test (Myriad Genetics), and results were analyzed with respect to Progression-Free Survival (PFS). RESULTS: We found 95% concordance between the HRD statuses of the two tests (95% Confidence Interval (CI): 92%-97%). The Positive Percentage Agreement (PPA) of the sWGS test was 95% (95% CI: 91%-97%), as was its Negative Percentage Agreement (NPA) (95% CI: 89%-98%). In patients with HRD-positive tumors treated with olaparib plus bevacizumab, the PFS Hazard Ratio (HR) was 0.38 (95% CI: 0.26-0.54) with the SeqOne assay and 0.32 (95% CI: 0.22-0.45) with the Myriad assay. In patients with HRD-negative tumors, the HR was 0.99 (95% CI: 0.68-1.42) and 1.05 (95% CI: 0.70-1.57) with the SeqOne and Myriad assays, respectively. Among patients with BRCA-wildtype tumors, those with HRD-positive tumors benefited from olaparib plus bevacizumab maintenance, with HRs of 0.48 (95% CI: 0.29-0.79) and 0.38 (95% CI: 0.23-0.63) with the SeqOne and Myriad assays, respectively. CONCLUSION: The SeqOne assay offers a clinically validated approach to detect HRD.


Subjects
Ovarian Neoplasms , Humans , Female , Bevacizumab/therapeutic use , Retrospective Studies , Ovarian Neoplasms/drug therapy , Ovarian Neoplasms/genetics , Carcinoma, Ovarian Epithelial , Homologous Recombination
2.
Trials ; 24(1): 380, 2023 Jun 06.
Article in English | MEDLINE | ID: mdl-37280655

ABSTRACT

Adjustment for prognostic covariates increases the statistical power of randomized trials. The factors influencing this increase in power are well known for trials with continuous outcomes. Here, we study which factors influence power and sample size requirements in time-to-event trials. We consider both parametric simulations and simulations derived from The Cancer Genome Atlas (TCGA) cohort of hepatocellular carcinoma (HCC) patients to assess how sample size requirements are reduced by covariate adjustment. Simulations demonstrate that the benefit of covariate adjustment increases with the prognostic performance of the adjustment covariate (C-index) and with the cumulative incidence of the event in the trial. For a covariate with intermediate prognostic performance (C-index = 0.65), the reduction in sample size varies from 3.1% when the cumulative incidence is 10% to 29.1% when it is 90%. Broadening eligibility criteria usually reduces statistical power, but our simulations show that power can be maintained with adequate covariate adjustment. In a simulation of adjuvant trials in HCC, we find that the number of patients screened for eligibility can be divided by 2.4 when broadening eligibility criteria. Finally, we find that the Cox-Snell [Formula: see text] is a conservative estimate of the reduction in sample size requirements provided by covariate adjustment. Overall, more systematic adjustment for prognostic covariates leads to more efficient and inclusive clinical trials, especially when cumulative incidence is large, as in metastatic and advanced cancers. Code and results are available at https://github.com/owkin/CovadjustSim .
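As an aside, the textbook relation behind such reductions (for continuous outcomes) is that adjusting for a covariate explaining a fraction R² of outcome variance scales the required sample size by roughly (1 − R²); the abstract reports that the Cox-Snell R² gives a conservative version of this reduction in the time-to-event setting. A minimal sketch with hypothetical numbers, not the trial's:

```python
def adjusted_sample_size(n_unadjusted: int, r_squared: float) -> int:
    """Approximate sample size needed after adjusting for a covariate
    explaining a fraction `r_squared` of outcome variance (continuous-
    outcome rule of thumb; the abstract finds the Cox-Snell R^2 version
    conservative for time-to-event trials)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return round(n_unadjusted * (1.0 - r_squared))

# Hypothetical example: a trial sized at 1000 patients without adjustment
print(adjusted_sample_size(1000, 0.10))  # -> 900
```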


Subjects
Carcinoma, Hepatocellular , Liver Neoplasms , Humans , Carcinoma, Hepatocellular/genetics , Computer Simulation , Liver Neoplasms/therapy , Prognosis , Sample Size , Clinical Trials as Topic
3.
BMC Med Res Methodol ; 22(1): 335, 2022 12 28.
Article in English | MEDLINE | ID: mdl-36577946

ABSTRACT

BACKGROUND: An external control arm is a cohort of control patients collected from data external to a single-arm trial. To provide an unbiased estimate of efficacy, the clinical profiles of patients from the single and external arms should be aligned, typically using propensity score approaches. Alternative approaches infer efficacy from comparisons between the outcomes of single-arm patients and machine-learning predictions of control patient outcomes. These methods include G-computation and Doubly Debiased Machine Learning (DDML), but their evaluation for External Control Arm (ECA) analysis has been insufficient. METHODS: We consider both numerical simulations and a trial replication procedure to evaluate the different statistical approaches: propensity score matching, Inverse Probability of Treatment Weighting (IPTW), G-computation, and DDML. The replication study relies on five type 2 diabetes randomized clinical trials granted by the Yale University Open Data Access (YODA) project. From the pool of five trials, observational experiments are artificially built by replacing the control arm of one trial with an arm originating from another trial and containing similarly treated patients. RESULTS: Numerical simulations show that DDML has the smallest bias, followed by G-computation, while G-computation usually minimizes mean squared error. The mean squared error of DDML varies but improves with increasing sample size. For hypothesis testing, all methods control type I error, and DDML is the most conservative. G-computation is the best method in terms of statistical power; DDML has comparable power at [Formula: see text] but lower power at smaller sample sizes.
The replication procedure also indicates that G-computation minimizes mean squared error, whereas DDML has intermediate performance between G-computation and the propensity score approaches. The confidence intervals of G-computation are the narrowest, whereas those of DDML are the widest for small sample sizes, confirming its conservative nature. CONCLUSIONS: For external control arm analyses, methods based on outcome prediction models can reduce estimation error and increase statistical power compared to propensity score approaches.
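The G-computation idea described above — fit an outcome model on the external controls, predict each single-arm patient's counterfactual control outcome, and average the observed-minus-predicted differences — can be sketched as follows. This is a toy linear setting with a least-squares outcome model, not the paper's trial analyses; all data are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an external control arm (X_ctrl, y_ctrl) and a
# single-arm trial (X_trt, y_trt) sharing the same three covariates.
n, p = 200, 3
coef = np.array([1.0, -0.5, 0.2])
X_ctrl = rng.normal(size=(n, p))
y_ctrl = X_ctrl @ coef + rng.normal(size=n)
true_effect = 2.0
X_trt = rng.normal(size=(n, p))
y_trt = X_trt @ coef + true_effect + rng.normal(size=n)

# G-computation step 1: fit an outcome model on the external controls.
design_ctrl = np.column_stack([np.ones(n), X_ctrl])
beta, *_ = np.linalg.lstsq(design_ctrl, y_ctrl, rcond=None)

# Step 2: predict the counterfactual (untreated) outcome of each treated
# patient and average the observed-minus-predicted differences.
y_trt_pred = np.column_stack([np.ones(n), X_trt]) @ beta
effect_hat = float(np.mean(y_trt - y_trt_pred))
print(effect_hat)  # close to the simulated effect of 2.0
```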


Subjects
Diabetes Mellitus, Type 2 , Humans , Bias , Computer Simulation , Diabetes Mellitus, Type 2/therapy , Machine Learning , Propensity Score , Research Design , Randomized Controlled Trials as Topic
4.
Nat Commun ; 12(1): 634, 2021 01 27.
Article in English | MEDLINE | ID: mdl-33504775

ABSTRACT

The SARS-CoV-2 pandemic has put pressure on intensive care units, making the identification of predictors of disease severity a priority. We collect 58 clinical and biological variables, together with chest CT scan data, from 1003 coronavirus-infected patients from two French hospitals. We train a deep learning model based on CT scans to predict severity, and then construct the multimodal AI-severity score, which combines the deep learning model with 5 clinical and biological variables (age, sex, oxygenation, urea, platelet count). We show that neural network analysis of CT scans brings unique prognostic information, although it is correlated with other markers of severity (oxygenation, LDH, and CRP), which explains the measurable but limited 0.03 increase in AUC obtained when adding CT-scan information to clinical variables. When comparing AI-severity with 11 existing severity scores, we find significantly improved prognostic performance; AI-severity can therefore rapidly become a reference scoring approach.


Subjects
COVID-19/diagnosis , COVID-19/physiopathology , Deep Learning , Neural Networks, Computer , Tomography, X-Ray Computed/methods , Artificial Intelligence , COVID-19/classification , Humans , Models, Biological , Multivariate Analysis , Prognosis , Radiologists , Severity of Illness Index
5.
Eur J Hum Genet ; 29(2): 325-337, 2021 02.
Article in English | MEDLINE | ID: mdl-33005019

ABSTRACT

Taste is essential for the interaction of animals with their food and has co-evolved with diet. Humans have peopled a large range of environments and present a wide range of diets, but little is known about the diversity and evolution of human taste perception. We measured taste recognition thresholds across populations differing in lifestyle (hunter-gatherers and farmers from Central Africa, and nomadic herders and farmers from Central Asia). We also generated genome-wide genotype data and performed association studies and selection scans in order to link phenotypic variation in taste sensitivity with genetic variation. We found that hunter-gatherers have lower overall sensitivity, as well as lower sensitivity to quinine and fructose, than their farming neighbors. In parallel, there is strong population divergence in genes associated with tongue morphogenesis and genes involved in the taste signal transduction pathway in the African populations. We find signals of recent selection in bitter taste-receptor genes in all four populations. Enrichment analysis of the association scans for the various tastes confirmed already documented associations and revealed novel GO terms that are good candidates for involvement in taste perception. Our framework allowed us to gain insight into the genetic basis of variation in taste sensitivity across populations and lifestyles.


Subjects
Genome , Life Style , Taste Perception/genetics , Taste/genetics , Adolescent , Adult , Asian People , Black People , Genotype , Humans , Middle Aged , Phenotype , Young Adult
6.
Bioinformatics ; 36(16): 4449-4457, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32415959

ABSTRACT

MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls, including (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers, and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr. RESULTS: For example, we find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. We therefore recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in analyses of genetic data, as well as of other omics data. AVAILABILITY AND IMPLEMENTATION: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
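The projection step behind the shrinkage warning can be sketched with a plain SVD on random stand-in matrices (the paper works with real genotypes and corrected projections in bigsnpr, not this naive version). Even pure noise shows the effect: new samples project only partially onto PCs fitted to the reference panel, so their naive PC scores are systematically smaller than the reference scores.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical centered/scaled genotype matrices (individuals x variants),
# random stand-ins for a reference panel and new samples to project.
G_ref = rng.normal(size=(100, 500))
G_new = rng.normal(size=(10, 500))

# PCA of the reference panel via SVD: G_ref = U S V^T
U, S, Vt = np.linalg.svd(G_ref, full_matrices=False)
scores_ref = U * S            # reference PC scores (= G_ref @ Vt.T)

# Naive projection of new samples onto the reference PCs; on real data
# the abstract warns these projected scores shrink toward zero on
# higher PCs, and bigsnpr provides corrected (unbiased) projections.
scores_new = G_new @ Vt.T

# Even on noise, projected new samples have smaller scores on average:
print(np.linalg.norm(scores_ref, axis=1).mean(),
      np.linalg.norm(scores_new, axis=1).mean())
```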


Subjects
Genetics, Population , Software , Algorithms , Humans , Linkage Disequilibrium , Principal Component Analysis
7.
Environ Health Perspect ; 128(5): 55001, 2020 05.
Article in English | MEDLINE | ID: mdl-32379489

ABSTRACT

BACKGROUND: Mediation analysis is used in epidemiology to identify pathways through which exposures influence health. The advent of high-throughput (omics) technologies offers opportunities to perform mediation analysis with a high-dimensional pool of covariates. OBJECTIVE: We aimed to highlight some biostatistical issues of this expanding field of high-dimensional mediation. DISCUSSION: The mediation techniques used for a single mediator cannot be generalized in a straightforward manner to high-dimensional mediation. Causal knowledge of the relations between covariates is required for mediation analysis, and it is expected to be more limited as dimension and system complexity increase. The methods developed for high dimensions can be distinguished according to whether mediators are considered separately or as a whole. Methods considering each potential mediator separately do not allow efficient identification of indirect effects when mutual influences exist among the mediators, as expected for many biological (e.g., epigenetic) parameters. In this context, methods considering all potential mediators simultaneously, based, for example, on data reduction techniques, are better adapted to the causal inference framework. Their cost is a possible inability to single out the causal mediators. Moreover, the ability of the mediators to predict the outcome can be overestimated, in particular because many machine-learning algorithms are optimized to increase predictive ability rather than their aptitude for causal inference. Given the lack of an overarching validated framework and the generally complex causal structure of high-dimensional data, analysis of high-dimensional mediation currently requires great caution and effort to incorporate a priori biological knowledge. https://doi.org/10.1289/EHP6240.
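The single-mediator machinery this discussion starts from is the product-of-coefficients estimate of an indirect effect (path a from exposure to mediator, times path b from mediator to outcome). A toy sketch with simulated data and ordinary least squares — merely the quantity whose high-dimensional generalization the commentary examines, not any method it proposes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical single-mediator setting: exposure X -> mediator M -> outcome Y
n = 1000
X = rng.normal(size=n)
M = 0.6 * X + rng.normal(size=n)             # path a (true 0.6)
Y = 0.5 * M + 0.2 * X + rng.normal(size=n)   # path b (true 0.5) + direct effect

# Path a: slope of M regressed on X
a = np.polyfit(X, M, 1)[0]

# Path b: coefficient of M in Y ~ 1 + M + X
design = np.column_stack([np.ones(n), M, X])
b = np.linalg.lstsq(design, Y, rcond=None)[0][1]

# Product-of-coefficients indirect effect, close to 0.6 * 0.5 = 0.3
print(a * b)
```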


Subjects
Mediation Analysis , Humans , Models, Statistical
8.
Mol Biol Evol ; 37(7): 2153-2154, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32343802

ABSTRACT

pcadapt is a user-friendly R package for performing genome scans for local adaptation. Here, we present version 4 of pcadapt, which substantially improves computational efficiency while providing similar results. This improvement is made possible by using a different format for storing genotypes and a different algorithm for computing the principal components of the genotype matrix, which is the most computationally demanding step of the pcadapt method. These changes are seamlessly integrated into the existing pcadapt package, and users will experience a large reduction in computation time (by a factor of 20-60 in our analyses) compared with previous versions.


Subjects
Adaptation, Biological , Genomics/methods , Software
9.
BMC Bioinformatics ; 21(1): 16, 2020 Jan 13.
Article in English | MEDLINE | ID: mdl-31931698

ABSTRACT

BACKGROUND: Cell-type heterogeneity of tumors is a key factor in tumor progression and response to chemotherapy. Tumor cell-type heterogeneity, defined as the proportions of the various cell types in a tumor, can be inferred from DNA methylation of surgical specimens. However, confounding factors known to be associated with methylation values, such as age and sex, complicate accurate inference of cell-type proportions. While reference-free algorithms have been developed to infer cell-type proportions from DNA methylation, a comparative evaluation of the performance of these methods is still lacking. RESULTS: Here we use simulations to evaluate several computational pipelines based on the software packages MeDeCom, EDec, and RefFreeEWAS. We identify accounting for confounders, feature selection, and the choice of the number of estimated cell types as critical steps for inferring cell-type proportions. We find that removing methylation probes correlated with confounder variables reduces the inference error by 30-35%, and that selecting cell-type-informative probes has a similar effect. We show that Cattell's rule, based on the scree plot, is a powerful tool to determine the number of cell types. Once these pre-processing steps are done, the three deconvolution methods provide comparable results. We observe that all the algorithms' performance improves when inter-sample variation in cell-type proportions is large or when the number of available samples is large. We find that under specific circumstances the methods are sensitive to the initialization method, suggesting that averaging different solutions or optimizing initialization is an avenue for future research. CONCLUSION: Based on the lessons learned, to facilitate pipeline validation and catalyze further pipeline improvement by the community, we develop a benchmark pipeline for inference of cell-type proportions and implement it in the R package medepir.
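Cattell's rule, mentioned above for choosing the number of cell types, reads the scree plot and keeps the components that precede its flat tail. One hypothetical formalization of that visual rule follows; the `threshold` cutoff is illustrative and not the paper's exact criterion:

```python
import numpy as np

def cattell_rule(eigenvalues, threshold=0.1):
    """Choose the number of components from a scree plot: keep the
    components that come before the first 'flat' gap, i.e., the first
    drop between consecutive eigenvalues that is small relative to the
    largest drop. A hypothetical formalization of Cattell's rule."""
    eig = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    gaps = -np.diff(eig)                     # drops between consecutive eigenvalues
    small = gaps < threshold * gaps.max()    # gaps belonging to the flat tail
    return int(np.argmax(small))             # index of the first flat gap

# Hypothetical scree: three strong components, then a flat tail
eig = [10.0, 6.0, 3.0, 0.4, 0.35, 0.3, 0.28]
print(cattell_rule(eig))  # -> 3
```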


Subjects
Computational Biology/standards , DNA Methylation , Neoplasms/genetics , Algorithms , Computational Biology/methods , Computer Simulation , Humans , Software
10.
Ecol Evol ; 9(22): 12658-12675, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31788205

ABSTRACT

Invasive species can encounter environments different from those of their source populations, which may trigger rapid adaptive changes after introduction (niche shift hypothesis). To test this hypothesis, we investigated whether postintroduction evolution is correlated with contrasting environmental conditions between the European invasive and source ranges of the Asian tiger mosquito Aedes albopictus. Comparison of the environmental niches occupied in the European and source population ranges revealed more than 96% overlap between invasive and source niches, supporting niche conservatism. However, we found evidence for postintroduction genetic evolution by reanalyzing a published ddRADseq genomic dataset from 90 European invasive populations using genotype-environment association (GEA) methods and generalized dissimilarity modeling (GDM). Three loci, including a putative heat-shock protein gene, exhibited significant allelic turnover along the gradient of winter precipitation that could be associated with ongoing range expansion. Wing morphometric traits correlated weakly with environmental gradients within Europe, but wing size differed between invasive and source populations located in different climatic areas. Niche similarities between the source and invasive ranges might have facilitated the establishment of populations. Nonetheless, we found evidence for environmentally induced adaptive changes after introduction. The ability of invasive populations to evolve rapidly (genetic shift), together with a large proportion of unfilled potential suitable areas (80%), paves the way for further spread of Ae. albopictus in Europe.

11.
Am J Hum Genet ; 105(6): 1213-1221, 2019 12 05.
Article in English | MEDLINE | ID: mdl-31761295

ABSTRACT

Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p value thresholds are tested to maximize the predictive ability of the derived polygenic scores. Along with this p value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p value threshold to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy, with an average AUC increase of 0.035 over standard C+T.
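The core C+T scoring step — after clumping, sum GWAS effect sizes over the SNPs whose p-value passes the threshold, weighted by allele counts — can be sketched as below. Clumping and the other hyper-parameters the abstract tunes are assumed handled upstream, and all data here are tiny hypothetical stand-ins:

```python
import numpy as np

def clumping_thresholding_score(genotypes, betas, pvalues, p_threshold):
    """Compute a C+T polygenic score: for SNPs passing the p-value
    threshold, sum (GWAS effect size x allele count) per individual.
    Clumping (removal of correlated SNPs) is assumed already done."""
    keep = pvalues < p_threshold
    return genotypes[:, keep] @ betas[keep]

# Hypothetical data: 4 individuals, 5 SNPs (0/1/2 allele counts)
G = np.array([[0, 1, 2, 0, 1],
              [2, 0, 1, 1, 0],
              [1, 1, 0, 2, 2],
              [0, 2, 1, 1, 1]])
beta = np.array([0.5, -0.2, 0.1, 0.3, 0.0])
pval = np.array([1e-8, 0.2, 1e-4, 1e-6, 0.9])

# Only SNPs 1, 3 and 4 pass the 1e-3 threshold
print(clumping_thresholding_score(G, beta, pval, 1e-3))  # -> [0.2 1.4 1.1 0.4]
```

Varying `p_threshold` over a grid (and, in the paper, the clumping hyper-parameters as well) yields the many candidate scores that SCT then stacks.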


Subjects
Algorithms , Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide , Biological Specimen Banks , Case-Control Studies , Computer Simulation , Humans , Models, Genetic , United Kingdom
12.
Evolution ; 73(9): 1793-1808, 2019 09.
Article in English | MEDLINE | ID: mdl-31313825

ABSTRACT

Adaptation to environmental conditions within the native range of exotic species can condition the invasion success of these species outside their range. The striking success of the Asian tiger mosquito, Aedes albopictus, in invading temperate regions has been attributed to the winter survival of diapause eggs in cold environments. In this study, we evaluate genetic polymorphisms (SNPs) and wing morphometric variation among three biogeographical regions of the native range of A. albopictus. Reconstructed demographic histories of populations show an initial expansion in Southeast Asia and suggest that marine regression during the late Pleistocene and climate warming after the last glacial period favored expansion of populations in southern and northern regions, respectively. Searching for genomic signatures of selection, we identified significantly differentiated SNPs, several of which are located in, or within 20 kb of, candidate genes for cold adaptation. These genes are involved in cellular and metabolic processes, and several of them have been shown to be differentially expressed under diapausing conditions. The three biogeographical regions also differ in wing size and shape, and wing size increases with latitude, supporting Bergmann's rule. The adaptive genetic and morphometric variation observed along the climatic gradient of the A. albopictus native range suggests that colonization of northern latitudes promoted adaptation to cold environments prior to its worldwide invasion.


Subjects
Adaptation, Physiological , Aedes/genetics , Aedes/physiology , Cold Temperature , Animals , China , Climate , Ecosystem , Female , Genetics, Population , Geography , Japan , Malaysia , Male , Ovum/physiology , Polymorphism, Single Nucleotide , Population Density , Seasons , Wings, Animal
13.
Mol Ecol ; 28(9): 2360-2377, 2019 05.
Article in English | MEDLINE | ID: mdl-30849200

ABSTRACT

Multiple introductions are key to the establishment and persistence of introduced species. However, little is known about the contribution of genetic admixture to the invasive potential of populations. To address this issue, we studied the recent invasion of the Asian tiger mosquito (Aedes albopictus) in Europe. Combining genome-wide single nucleotide polymorphisms and historical knowledge in an approximate Bayesian computation framework, we reconstruct the colonization routes and establish the demographic dynamics of the invasion. The colonization of Europe involved at least three independent introductions, in Albania, North Italy and Central Italy, that subsequently acted as dispersal centres throughout Europe. We show that the topology of human transportation networks shaped demographic histories, with North Italy and Central Italy being the main dispersal centres in Europe. Introduction modalities conditioned the levels of genetic diversity in invading populations; genetically diverse and admixed populations promoted more secondary introductions and spread farther than single-source invasions. This genomic study provides further crucial insights into the role of genetic diversity, promoted by modern trade, in driving biological invasions.


Subjects
Aedes/physiology , Genetic Variation , Introduced Species , Aedes/genetics , Animals , Bayes Theorem , Europe , Genetics, Population , Italy , Polymorphism, Single Nucleotide , Population Density
14.
Genetics ; 212(1): 65-74, 2019 05.
Article in English | MEDLINE | ID: mdl-30808621

ABSTRACT

Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The "Clumping+Thresholding" (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve their predictive performance as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, the improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUC values of 89% and 82.5%, respectively.
Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in only a few minutes. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
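Penalized logistic regression of the kind used above can be sketched in a few lines with plain gradient descent on an L2-penalized logistic loss. This is a toy stand-in, not the bigstatsr implementation (which scales to biobanks and tunes its hyper-parameters automatically); the data, penalty strength, and learning rate below are all illustrative:

```python
import numpy as np

def penalized_logistic_regression(X, y, lam=1.0, lr=0.1, n_iter=500):
    """Fit L2-penalized logistic regression by batch gradient descent.
    `lam` is the (hypothetical) penalty strength, `lr` the step size."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        grad = X.T @ (prob - y) / n + lam * w / n    # loss + ridge gradient
        w -= lr * grad
    return w

# Hypothetical data: case/control status driven by the first SNP only
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(float)

w = penalized_logistic_regression(X, y)
print(np.argmax(np.abs(w)))  # the largest weight falls on the causal SNP 0
```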


Subjects
Algorithms , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Models, Genetic , Multifactorial Inheritance , Polymorphism, Single Nucleotide , Celiac Disease/genetics , Female , Humans , Male
15.
PeerJ ; 6: e5325, 2018.
Article in English | MEDLINE | ID: mdl-30294507

ABSTRACT

Secondary contact is the reestablishment of gene flow between sister populations that have diverged. For instance, at the end of the Quaternary glaciations in Europe, secondary contact occurred during the northward expansion of populations that had found refugia in the southern peninsulas. With the advent of multi-locus markers, secondary contact can be investigated using various molecular signatures, including gradients of allele frequency, admixture clines, and local increases of genetic differentiation. We use coalescent simulations to investigate whether molecular data provide enough information to distinguish between secondary contact following range expansion and an alternative evolutionary scenario consisting of a barrier to gene flow in an isolation-by-distance model. We find that an excess of linkage disequilibrium and of genetic diversity at the suture zone is a unique signature of secondary contact. We also find that the directionality index ψ, which was proposed to study range expansion, is informative for distinguishing between the two hypotheses. However, although evidence for secondary contact is usually conveyed by statistics related to admixture coefficients, we find that they can be confounded by isolation-by-distance. We recommend accounting for the spatial repartition of individuals when investigating secondary contact in order to better reflect the complex spatio-temporal evolution of populations and species.

16.
Mol Biol Evol ; 35(9): 2318-2326, 2018 09 01.
Article in English | MEDLINE | ID: mdl-29931083

ABSTRACT

Admixture between populations provides an opportunity to study biological adaptation and phenotypic variation. Admixture studies rely on local ancestry inference for admixed individuals, which consists of computing, at each locus, the number of copies that originate from each ancestral source population. Existing software packages for local ancestry inference are tuned to provide accurate results on human data and recent admixture events. Here, we introduce Loter, an open-source software package that requires no biological parameter besides haplotype data, making local ancestry inference available for a wide range of species. Using simulations, we compare the performance of Loter to HAPMIX, LAMP-LD, and RFMix. HAPMIX is the only software severely impacted by imperfect haplotype reconstruction. Loter is the software least impacted by increasing admixture time on simulated admixed human genotypes. For simulations of admixed Populus genotypes, Loter and LAMP-LD are robust to increasing admixture times, by contrast with RFMix. When comparing the lengths of reconstructed and true ancestry tracts, the accuracy of Loter and LAMP-LD is again more robust than that of RFMix to increasing admixture times. We apply Loter to individuals resulting from admixture between Populus trichocarpa and Populus balsamifera, and the lengths of ancestry tracts indicate that admixture took place ∼100 generations ago. We expect that providing a rapid and parameter-free software for local ancestry inference will make genomic studies of admixture processes more accessible.


Subjects
Genetic Techniques , Software , Haplotypes , Humans , Populus/genetics
17.
Mol Ecol Resour ; 18(6): 1223-1233, 2018 Nov.
Article in English | MEDLINE | ID: mdl-29802785

ABSTRACT

Ordination is a common tool in ecology that aims at representing complex biological information in a reduced space. In landscape genetics, ordination methods such as principal component analysis (PCA) have been used to detect adaptive variation based on genomic data. Taking advantage of environmental data in addition to genotype data, redundancy analysis (RDA) is another ordination approach that is useful for detecting adaptive variation. This study proposes a test statistic based on RDA to search for loci under selection. We compare redundancy analysis to pcadapt, which is a nonconstrained ordination method, and to a latent factor mixed model (LFMM), which is a univariate genotype-environment association method. Individual-based simulations identify evolutionary scenarios where RDA genome scans have greater statistical power than genome scans based on PCA. By constraining the analysis with environmental variables, RDA performs better than PCA in identifying adaptive variation when selection gradients are weakly correlated with population structure. In addition, we show that, while RDA and LFMM have similar power to identify genetic markers associated with environmental variables, the RDA-based procedure has the advantage of identifying the main selective gradients as a combination of environmental variables. To give a concrete illustration of RDA in population genomics, we apply this method to the detection of outliers and selective gradients in an SNP data set of Populus trichocarpa (Geraldes et al., ). The RDA-based approach identifies the main selective gradient contrasting southern and coastal populations with northern and continental populations on the north-western American coast.


Subjects
Biological Adaptation, Genetic Variation, Population Genetics/methods, Genomics/methods, Biostatistics/methods, Computational Biology/methods, Genetic Loci, Genotype, Single Nucleotide Polymorphism
18.
Bioinformatics ; 34(16): 2781-2787, 2018 08 15.
Article in English | MEDLINE | ID: mdl-29617937

ABSTRACT

Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge, severely slowing down genomic analyses, rendering some software obsolete and limiting researchers' access to diverse analysis tools. Results: Here we present two R packages, bigstatsr and bigsnpr, which allow the analysis of large-scale genomic data to be performed within R. To address large data size, the packages use memory-mapping to access data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the commonly used tools, either through transparent system calls to existing software or through updated or improved implementations of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove single nucleotide polymorphisms in linkage disequilibrium, and algorithms to learn polygenic risk scores on millions of single nucleotide polymorphisms. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing polygenic risk scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500 000 individuals and 1 million markers on a single desktop computer. Availability and implementation: https://privefl.github.io/bigstatsr/ and https://privefl.github.io/bigsnpr/. Supplementary information: Supplementary data are available at Bioinformatics online.
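The memory-mapping strategy described above can be illustrated with `numpy.memmap`: the matrix lives on disk and only the chunk of columns currently being processed is pulled into RAM. A toy sketch (file layout, matrix sizes, and chunk size are arbitrary choices for illustration, not bigstatsr internals):

```python
import numpy as np
import tempfile, os

# Write a genotype-like matrix to disk, then analyze it via memory-mapping
path = os.path.join(tempfile.mkdtemp(), "geno.dat")
n, p = 1000, 500
rng = np.random.default_rng(2)
full = rng.integers(0, 3, size=(n, p)).astype(np.float64)
full.tofile(path)

# Open the file as a read-only memory-mapped array: no full load into RAM
mm = np.memmap(path, dtype=np.float64, mode="r", shape=(n, p))
col_means = np.zeros(p)
chunk = 100                               # columns processed per pass
for j in range(0, p, chunk):
    col_means[j:j + chunk] = mm[:, j:j + chunk].mean(axis=0)
```

The chunked, file-backed computation gives the same result as the in-memory one, which is what makes the approach transparent to downstream analyses.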


Subjects
Genomics, Algorithms, Human Genome, Humans, Multifactorial Inheritance, Single Nucleotide Polymorphism, Software
19.
Mol Ecol Resour ; 17(1): 67-77, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27601374

ABSTRACT

The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved in terms of both statistical approach and software implementation. We present results obtained with the robust Mahalanobis distance, a new statistic for genome scans available in versions 2.0 and later of the package. When hierarchical population structure occurs, the Mahalanobis distance is more powerful than the communality statistic implemented in the first version of the package. Using simulated data, we compare pcadapt to other computer programs for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is close to the nominal false discovery rate, set at 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely affected by the presence of admixed individuals, whereas pcadapt is not. Lastly, we find that pcadapt and hapflk are the most powerful in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
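The PCA-plus-Mahalanobis idea behind pcadapt can be sketched as follows: regress each SNP on the top K principal components, then measure how far its vector of z-scores lies from the bulk of loci. This is a simplified, non-robust version for illustration only; the package itself uses a robust covariance estimate and handles missing data:

```python
import numpy as np

def pcadapt_like_scan(G):
    """Toy PCA-based genome scan: regress each SNP on the top K = 2
    principal components and compute a (non-robust) Mahalanobis distance
    on the resulting association scores. Sketch only, not pcadapt."""
    K = 2
    Gs = (G - G.mean(0)) / G.std(0)
    U, s, Vt = np.linalg.svd(Gs, full_matrices=False)
    scores = U[:, :K]                     # orthonormal PC scores (n x K)
    Z = scores.T @ Gs                     # K x p locus/PC association scores
    Zc = Z - Z.mean(axis=1, keepdims=True)
    cov = np.cov(Zc)
    d2 = np.einsum('ij,ij->j', Zc, np.linalg.solve(cov, Zc))
    pvals = np.exp(-d2 / 2.0)             # chi-square (df = 2) survival fn
    return d2, pvals

# Two populations of 50: modest differentiation at 100 neutral loci,
# strong differentiation at the last 2 loci (planted outliers)
rng = np.random.default_rng(3)
pA = np.full(102, 0.4); pB = np.full(102, 0.6)
pA[100:], pB[100:] = 0.05, 0.95
G = np.vstack([rng.binomial(2, pA, size=(50, 102)),
               rng.binomial(2, pB, size=(50, 102))]).astype(float)
d2, pvals = pcadapt_like_scan(G)
```

The planted outlier loci receive much larger Mahalanobis distances than the neutral background, which is the signal the genome scan thresholds on.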


Subjects
Biological Adaptation, Biostatistics/methods, Computational Biology/methods, Population Genetics/methods, Genetic Selection, Software, Principal Component Analysis
20.
Mol Ecol ; 25(20): 5029-5042, 2016 10.
Article in English | MEDLINE | ID: mdl-27565448

ABSTRACT

Finding genetic signatures of local adaptation is of great interest for many population genetic studies. Common approaches to separating selective loci from their genomic background focus on extreme values of the fixation index, FST, across loci. However, computing the fixation index becomes challenging when the population is genetically continuous, when predefining subpopulations is difficult, and in the presence of admixed individuals in the sample. In this study, we present a new method to identify loci under selection based on an extension of the FST statistic to samples with admixed individuals. In our approach, FST values are computed from the ancestry coefficients obtained with ancestry estimation programs. More specifically, we use factor models to estimate FST, and we compare our neutrality tests with those derived from a principal component analysis approach. The performance of the tests is illustrated using simulated data and by re-analysing genomic data from European lines of the plant species Arabidopsis thaliana and human genomic data from the population reference sample POPRES.
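For context, the classical allele-frequency route that this ancestry-coefficient approach is designed to replace can be illustrated with Hudson's FST estimator between two predefined populations (the paper instead derives FST from ancestry coefficients, avoiding the need to predefine populations):

```python
import numpy as np

def hudson_fst(p1, p2, n1, n2):
    """Hudson's per-locus FST estimator from sample allele frequencies
    p1, p2 and sample sizes n1, n2 (numbers of sampled chromosomes).
    Standard estimator shown for context, not the paper's method."""
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den

# Fixed difference between populations gives FST = 1;
# identical frequencies give an FST estimate near 0
fst_fixed = hudson_fst(1.0, 0.0, 100, 100)
fst_equal = hudson_fst(0.5, 0.5, 100, 100)
```

Note that, as the abstract explains, this estimator requires assigning individuals to discrete populations up front, which is exactly what breaks down for continuous or admixed samples.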


Subjects
Population Genetics/methods, Genomics/methods, Biological Adaptation/genetics, Arabidopsis/genetics, Computer Simulation, Gene Frequency, Genetic Loci, Human Genome, Humans, Genetic Models, Single Nucleotide Polymorphism, Genetic Selection