Results 1 - 20 of 25
1.
Am Stat ; 77(4): 432-442, 2023.
Article in English | MEDLINE | ID: mdl-38045013

ABSTRACT

The graphical representation of the correlation matrix by means of different multivariate statistical methods is reviewed, a comparison of the different procedures is presented with the use of an example data set, and an improved representation with better fit is proposed. Principal component analysis is widely used for making pictures of correlation structure, though as shown a weighted alternating least squares approach that avoids the fitting of the diagonal of the correlation matrix outperforms both principal component analysis and principal factor analysis in approximating a correlation matrix. Weighted alternating least squares is a very strong competitor for principal component analysis, in particular if the correlation matrix is the focus of the study, because it improves the representation of the correlation matrix, often at the expense of only a minor percentage of explained variance for the original data matrix, if the latter is mapped onto the correlation biplot by regression. In this article, we propose to combine weighted alternating least squares with an additive adjustment of the correlation matrix, and this is seen to lead to further improved approximation of the correlation matrix.

2.
Forensic Sci Int Genet ; 58: 102680, 2022 05.
Article in English | MEDLINE | ID: mdl-35313226

ABSTRACT

The Hardy-Weinberg law is shown to be transitive in the sense that a multi-allelic polymorphism that is in equilibrium will retain its equilibrium status if any allele together with its corresponding genotypes is deleted from the population. Similarly, the transitivity principle also applies if alleles are joined, which leads to the summation of allele frequencies and their corresponding genotype frequencies. These basic polymorphism properties are intuitive, but they have apparently not been formalized or investigated. This article provides a straightforward proof of the transitivity principle, and its usefulness in genetic data analysis is explored, using high-quality autosomal microsatellite databases from the US National Institute of Standards and Technology. We address the reduction of multi-allelic polymorphisms to variants with fewer alleles, two in the limit. Equilibrium test results obtained with the original and reduced polymorphisms are generally observed to be coherent, in particular when results obtained with length-based and sequence-based microsatellites are compared. We exploit the transitivity principle in order to identify disequilibrium-related alleles, and show its usefulness for detecting population substructure and genotyping problems that relate to null alleles and allele imbalance.


Subject(s)
Polymorphism, Genetic , Alleles , Gene Frequency , Genotype , Humans
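
The transitivity property described in this abstract can be checked numerically. The sketch below is only an illustration under stated assumptions: the three allele frequencies are made up, and the helper names (`hw_genotype_freqs`, `delete_allele`, `allele_freqs_from_genotypes`) are hypothetical, not taken from the article. It builds Hardy-Weinberg genotype frequencies, deletes one allele together with its genotypes, renormalises, and confirms that the reduced polymorphism is still in equilibrium.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def hw_genotype_freqs(allele_freqs):
    """Hardy-Weinberg genotype frequencies for a dict allele -> frequency."""
    freqs = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[a], allele_freqs[b]
        # Homozygotes have frequency p^2, heterozygotes 2pq.
        freqs[(a, b)] = p * p if a == b else 2 * p * q
    return freqs


def delete_allele(genotype_freqs, allele):
    """Drop every genotype carrying `allele` and renormalise the rest."""
    kept = {g: f for g, f in genotype_freqs.items() if allele not in g}
    total = sum(kept.values())
    return {g: f / total for g, f in kept.items()}


def allele_freqs_from_genotypes(genotype_freqs):
    """Allele frequencies implied by a set of genotype frequencies."""
    out = {}
    for (a, b), f in genotype_freqs.items():
        out[a] = out.get(a, 0) + f / 2
        out[b] = out.get(b, 0) + f / 2
    return out


# Made-up tri-allelic example, exact arithmetic via Fraction.
p = {"A": Fraction(1, 2), "B": Fraction(3, 10), "C": Fraction(1, 5)}
reduced = delete_allele(hw_genotype_freqs(p), "C")
reduced_alleles = allele_freqs_from_genotypes(reduced)
still_in_equilibrium = reduced == hw_genotype_freqs(reduced_alleles)
print(reduced_alleles, still_in_equilibrium)
```

Exact fractions are used so that the equilibrium check is an equality rather than a tolerance comparison.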
3.
Am J Epidemiol ; 190(10): 1977-1992, 2021 10 01.
Article in English | MEDLINE | ID: mdl-33861317

ABSTRACT

Genotype-phenotype association studies often combine phenotype data from multiple studies to increase statistical power. Harmonization of the data usually requires substantial effort due to heterogeneity in phenotype definitions, study design, data collection procedures, and data-set organization. Here we describe a centralized system for phenotype harmonization that includes input from phenotype domain and study experts, quality control, documentation, reproducible results, and data-sharing mechanisms. This system was developed for the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program, which is generating genomic and other -omics data for more than 80 studies with extensive phenotype data. To date, 63 phenotypes have been harmonized across thousands of participants (recruited in 1948-2012) from up to 17 studies per phenotype. Here we discuss challenges in this undertaking and how they were addressed. The harmonized phenotype data and associated documentation have been submitted to National Institutes of Health data repositories for controlled access by the scientific community. We also provide materials to facilitate future harmonization efforts by the community, which include 1) the software code used to generate the 63 harmonized phenotypes, enabling others to reproduce, modify, or extend these harmonizations to additional studies, and 2) the results of labeling thousands of phenotype variables with controlled vocabulary terms.


Subject(s)
Genetic Association Studies/methods , Phenomics/methods , Precision Medicine/methods , Data Aggregation , Humans , Information Dissemination , National Heart, Lung, and Blood Institute (U.S.) , Phenotype , Program Evaluation , United States
4.
Mol Ecol Resour ; 21(5): 1547-1557, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33687797

ABSTRACT

Statistical methodology for testing the Hardy-Weinberg equilibrium at X chromosomal variants has recently experienced considerable development. Up to a few years ago, testing X chromosomal variants for equilibrium was basically done by applying autosomal test procedures to females only. At present, male alleles can be taken into account in asymptotic and exact test procedures for both the bi- and multiallelic case. However, current X chromosomal exact procedures for multiple alleles rely on a classical full enumeration algorithm and are computationally expensive, and in practice not feasible for more than three alleles. In this article, we extend the autosomal network algorithm for exact Hardy-Weinberg testing with multiple alleles to the X chromosome, achieving considerable reduction in computation times for multiallelic variants with up to five alleles. The performance of the X chromosomal network algorithm is assessed in a simulation study. Beyond four alleles, a permutation test is, in general, the more feasible approach. A detailed description of the algorithm is given, and examples of X chromosomal indels and microsatellites are discussed.


Subject(s)
Algorithms , Chromosomes, Human, X/genetics , Models, Genetic , Alleles , Female , Gene Frequency , Humans , Male
5.
Heredity (Edinb) ; 126(3): 537-547, 2021 03.
Article in English | MEDLINE | ID: mdl-33452467

ABSTRACT

The detection of family relationships in genetic databases is of interest in various scientific disciplines such as genetic epidemiology, population and conservation genetics, forensic science, and genealogical research. Nowadays, screening genetic databases for related individuals forms an important aspect of standard quality control procedures. Relatedness research is usually based on an allele sharing analysis of identity by state (IBS) or identity by descent (IBD) alleles. Existing IBS/IBD methods mainly aim to identify first-degree (parent-offspring or full-sibling) and second-degree (half-sibling, avuncular, or grandparent-grandchild) pairs. Little attention has been paid to the detection of relationships in between the first and second degree, such as three-quarter siblings (3/4S), who share fewer alleles than first-degree relatives but more alleles than second-degree relatives. With the progressively increasing sample sizes used in genetic research, it becomes more likely that such relationships are present in the database under study. In this paper, we extend existing likelihood ratio (LR) methodology to accurately infer the existence of 3/4S, distinguishing them from full siblings and second-degree relatives. We use bootstrap confidence intervals to express uncertainty in the LRs. Our proposal accounts for linkage disequilibrium (LD) by using marker pruning, and we validate our methodology with a pedigree-based simulation study accounting for both LD and recombination. An empirical genome-wide array data set from the GCAT Genomes for Life cohort project is used to illustrate the method.


Subject(s)
Databases, Genetic , Siblings , Alleles , Genotype , Humans , Pedigree
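
To make the "in between" position of three-quarter siblings concrete, the sketch below compares kinship coefficients derived from IBD probabilities (k0, k1, k2). The probabilities are the standard textbook values for non-inbred pairs and are an assumption here; they are not quoted from the abstract. The function names are hypothetical, and this is not the likelihood ratio method of the article.

```python
from fractions import Fraction as F

# Probabilities of sharing 0, 1 or 2 alleles identical by descent
# (assumed textbook values for non-inbred pairs).
IBD = {
    "full siblings": (F(1, 4), F(1, 2), F(1, 4)),
    "three-quarter siblings": (F(3, 8), F(1, 2), F(1, 8)),
    "half siblings": (F(1, 2), F(1, 2), F(0)),
}


def kinship(k0, k1, k2):
    """Kinship coefficient implied by the IBD probabilities."""
    return k1 / 4 + k2 / 2


def expected_ibd_sharing(k0, k1, k2):
    """Expected proportion of alleles shared identical by descent."""
    return k1 / 2 + k2


for name, ks in IBD.items():
    print(name, kinship(*ks), expected_ibd_sharing(*ks))
```

Under these assumptions the kinship coefficient of three-quarter siblings (3/16) lies between that of full siblings (1/4) and half siblings (1/8).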
6.
J Appl Stat ; 47(11): 2011-2024, 2020.
Article in English | MEDLINE | ID: mdl-33041421

ABSTRACT

Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasising the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.
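
As a minimal illustration of the eigenvalue-based overall goodness-of-fit mentioned above, the sketch below computes two commonly used ratios for a k-dimensional MDS solution. The function name `mds_gof` and the example eigenvalues are hypothetical, and the point-wise and pairwise statistics proposed in the article are not reproduced here.

```python
def mds_gof(eigenvalues, k):
    """Overall goodness-of-fit of a k-dimensional metric MDS solution.

    Returns two ratios of the sum of the k largest eigenvalues:
    one relative to the sum of absolute eigenvalues, and one relative
    to the sum of the positive eigenvalues only. The two differ when
    the distance matrix is non-Euclidean (negative eigenvalues).
    """
    ev = sorted(eigenvalues, reverse=True)
    top = sum(ev[:k])
    g_abs = top / sum(abs(v) for v in ev)
    g_pos = top / sum(max(v, 0) for v in ev)
    return g_abs, g_pos


# Made-up eigenvalues with one negative value (non-Euclidean case).
g_abs, g_pos = mds_gof([6, 3, 1, 0, -2], k=2)
print(g_abs, g_pos)
```

For a Euclidean distance matrix all eigenvalues are non-negative and the two ratios coincide.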

7.
NAR Genom Bioinform ; 2(4): lqaa094, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33575638

ABSTRACT

Measurements in sequencing studies are mostly based on counts. There is a lack of theoretical developments for the analysis and modelling of this type of data. Some thoughts in this direction are presented, which might serve as a seed. The main issues addressed are the compositional character of multinomial probabilities and the corresponding representation in orthogonal (isometric) coordinates, and modelling distributions for sequencing data taking into account possible effects of amplification techniques.
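
The orthogonal (isometric) coordinates mentioned above can be sketched in a few lines. The example assumes one particular choice of basis (pivot coordinates), which the abstract does not specify, and uses made-up count data; `clr` and `ilr_pivot` are illustrative names.

```python
import math


def clr(x):
    """Centred log-ratio transform of a vector of positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]


def ilr_pivot(x):
    """Isometric log-ratio (pivot) coordinates of a D-part composition.

    Coordinate i contrasts part i with the geometric mean of the
    parts that follow it; the result has D - 1 components.
    """
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        r = len(rest)
        coords.append(math.sqrt(r / (r + 1)) * (logs[i] - sum(rest) / r))
    return coords


counts = [120.0, 30.0, 45.0, 5.0]          # made-up strictly positive counts
z = ilr_pivot(counts)
z_scaled = ilr_pivot([10 * c for c in counts])
sq_norm_clr = sum(v * v for v in clr(counts))
sq_norm_ilr = sum(v * v for v in z)
print(z, sq_norm_clr, sq_norm_ilr)
```

Two properties are visible in the output: the coordinates do not change when all counts are multiplied by a constant, and the squared norm of the ilr coordinates equals that of the clr vector. Zero counts would need separate treatment before taking logarithms.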

8.
Nucleic Acids Res ; 47(21): e136, 2019 12 02.
Article in English | MEDLINE | ID: mdl-31501877

ABSTRACT

Analysis of RNA sequencing (RNA-seq) data from related individuals is widely used in clinical and molecular genetics studies. Prediction of kinship from RNA-seq data would be useful for confirming the expected relationships in family based studies and for highlighting samples from related individuals in case-control or population based studies. Currently, reconstruction of pedigrees is largely based on SNPs or microsatellites, obtained from genotyping arrays, whole genome sequencing and whole exome sequencing. Potential problems with using RNA-seq data for kinship detection are the low proportion of the genome that it covers, the highly skewed coverage of exons of different genes depending on expression level and allele-specific expression. In this study we assess the use of RNA-seq data to detect kinship between individuals, through pairwise identity by descent (IBD) estimates. First, we obtained high quality SNPs after successive filters to minimize the effects due to allelic imbalance as well as errors in sequencing, mapping and genotyping. Then, we used these SNPs to calculate pairwise IBD estimates. By analysing both real and simulated RNA-seq data we show that it is possible to identify up to second degree relationships using RNA-seq data of even low to moderate sequencing depth.


Subject(s)
Base Sequence/genetics , Genome, Human , Pedigree , RNA/genetics , Sequence Analysis, RNA , Databases, Genetic , Humans , Polymorphism, Single Nucleotide/genetics
9.
Heredity (Edinb) ; 123(5): 549-564, 2019 11.
Article in English | MEDLINE | ID: mdl-31142813

ABSTRACT

Standard statistical tests for Hardy-Weinberg equilibrium assume the equality of allele frequencies in the sexes, whereas tests for the equality of allele frequencies in the sexes assume Hardy-Weinberg equilibrium. This produces a circularity in the testing of genetic variants, which has recently been resolved with new frequentist likelihood and exact procedures. In this paper, we tackle the same problem by posing it as a Bayesian model comparison problem. We formulate an exhaustive set of ten alternative scenarios for biallelic genetic variants. Using Dirichlet and Beta priors for genotype and allele frequencies, we derive marginal likelihoods for all scenarios, and select the most likely scenario using the posterior probabilities that each of these scenarios is the one in place. Different from the usual frequentist testing approach, the Bayesian approach allows one to compare any number of models, and not just two at a time, and the models compared do not have to be nested. We illustrate our Bayesian approach with genetic data from the 1000 Genomes Project and through a simulation study.


Subject(s)
Alleles , Gene Frequency , Genotype , Models, Genetic , Animals , Female , Humans , Male
10.
Front Genet ; 10: 341, 2019.
Article in English | MEDLINE | ID: mdl-31068965

ABSTRACT

The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores that allele sharing data across individuals has in reality a higher dimensionality, nor does it take into account the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are essentially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a looking glass for more remote relationships that are harder to classify. Datasets from the 1000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with the classical plots in a simulation study. In a non-inbred homogeneous population the classification rate of the log-ratio principal component approach outperforms the classical graphics across the whole allele frequency spectrum, using only identity by state. In these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree.
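
The identity-by-state counts underlying the classical plots can be sketched as follows. This is an assumed minimal setup, not the article's biplot method: genotypes of bi-allelic variants are coded as the number of copies (0, 1 or 2) of one allele, the two genotype vectors are made up, and `ibs_counts` and `ibs_summary` are hypothetical names.

```python
import math


def ibs_counts(g1, g2):
    """Count variants at which two individuals share 0, 1 or 2 alleles IBS.

    Genotypes are coded 0, 1 or 2 (copies of one allele at a
    bi-allelic variant), so the number shared is 2 - |g1 - g2|.
    """
    counts = [0, 0, 0]
    for a, b in zip(g1, g2):
        counts[2 - abs(a - b)] += 1
    return counts


def ibs_summary(counts):
    """Proportions (p0, p1, p2) plus mean and standard deviation of IBS."""
    total = sum(counts)
    props = [c / total for c in counts]
    mean = props[1] + 2 * props[2]
    var = props[1] + 4 * props[2] - mean ** 2
    return props, mean, math.sqrt(var)


ind1 = [0, 1, 2, 2, 1, 0]   # made-up genotypes
ind2 = [0, 1, 0, 1, 1, 2]
counts = ibs_counts(ind1, ind2)
props, mean_ibs, sd_ibs = ibs_summary(counts)
print(counts, props, mean_ibs, sd_ibs)
```

The proportions (p0, p1, p2) sum to one, which is the compositional constraint that motivates a log-ratio treatment; the mean and standard deviation are the pair of statistics typically shown in a two-dimensional scatterplot.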

11.
J Geochem Explor ; 194: 120-133, 2018 Nov.
Article in English | MEDLINE | ID: mdl-33510550

ABSTRACT

The study of the relationships between two compositions is of paramount importance in geochemical data analysis. This paper develops a compositional version of canonical correlation analysis, called CoDA-CCO, for this purpose. We consider two approaches, using the centred log-ratio transformation and the calculation of all possible pairwise log-ratios within sets. The relationships between both approaches are pointed out, and their merits are discussed. The related covariance matrices are structurally singular, and this is efficiently dealt with by using generalized inverses. We develop compositional canonical biplots and detail their properties. The canonical biplots are shown to be powerful tools for discovering the most salient relationships between two compositions. Some guidelines for compositional canonical biplot construction are discussed. A geochemical data set with X-ray fluorescence spectrometry measurements on major oxides and trace elements of European floodplains is used to illustrate the proposed method. The relationships between an analysis based on centred log-ratios and on isometric log-ratios are also shown.

12.
Genet Epidemiol ; 42(1): 34-48, 2018 02.
Article in English | MEDLINE | ID: mdl-29071737

ABSTRACT

Standard statistical tests for equality of allele frequencies in males and females and tests for Hardy-Weinberg equilibrium are tightly linked by their assumptions. Tests for equality of allele frequencies assume Hardy-Weinberg equilibrium, whereas the usual chi-square or exact test for Hardy-Weinberg equilibrium assumes equality of allele frequencies in the sexes. In this paper, we propose ways to break this interdependence in the assumptions of the two tests by proposing an omnibus exact test that can test both hypotheses jointly, as well as a likelihood ratio approach that permits these phenomena to be tested both jointly and separately. The tests are illustrated with data from the 1000 Genomes Project.


Subject(s)
Alleles , Gene Frequency , Genetic Markers/genetics , Models, Genetic , Consanguinity , Female , Genome, Human/genetics , Genomics , Humans , Male
13.
Mol Ecol Resour ; 18(3): 461-473, 2018 May.
Article in English | MEDLINE | ID: mdl-29288525

ABSTRACT

Statistical tests for Hardy-Weinberg equilibrium are important elementary tools in genetic data analysis. X-chromosomal variants have long been tested by applying autosomal test procedures to females only, and gender is usually not considered when testing autosomal variants for equilibrium. Recently, we proposed specific X-chromosomal exact test procedures for bi-allelic variants that include the hemizygous males, as well as autosomal tests that consider gender. In this study, we extend the previous work to variants with multiple alleles. A full enumeration algorithm is used for the exact calculations of tri-allelic variants. For variants with many alternate alleles, we use a permutation test. Some empirical examples with data from the 1000 Genomes Project are discussed.


Subject(s)
Chromosomes, Human, X/genetics , Computer Simulation , Gene Frequency , Genetic Variation , Algorithms , Chromosomes, Human, Pair 7/chemistry , Chromosomes, Human, X/chemistry , Female , Genotype , Humans , Male , Models, Genetic , Sex Factors
14.
Hum Genet ; 136(6): 727-741, 2017 06.
Article in English | MEDLINE | ID: mdl-28374190

ABSTRACT

Statistical tests for Hardy-Weinberg equilibrium have been an important tool for detecting genotyping errors in the past, and remain important in the quality control of next generation sequence data. In this paper, we analyze complete chromosomes of the 1000 Genomes Project by using exact test procedures for autosomal and X-chromosomal variants. We find that the rate of disequilibrium largely exceeds what might be expected by chance alone for all chromosomes. Observed disequilibrium is, in about 60% of the cases, due to heterozygote excess. We suggest that most excess disequilibrium can be explained by sequencing problems, and hypothesize mechanisms that can explain exceptional heterozygosities. We report higher rates of disequilibrium for the MHC region on chromosome 6, regions flanking centromeres and p-arms of acrocentric chromosomes. We also detected long-range haplotypes and areas with incidental high disequilibrium. We report disequilibrium to be related to read depth, with variants having extreme read depths being more likely to be out of equilibrium. Disequilibrium rates were found to be 11 times higher in segmental duplications and simple tandem repeat regions. The variants with significant disequilibrium are seen to be concentrated in these areas. For next generation sequence data, Hardy-Weinberg disequilibrium seems to be a major indicator for copy number variation.


Subject(s)
Genome-Wide Association Study , DNA Copy Number Variations , Humans , Linkage Disequilibrium
15.
Mol Ecol Resour ; 17(6): 1271-1282, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28374569

ABSTRACT

Studies of relatedness have been crucial in molecular ecology over the last decades. Good evidence of this is the fact that studies of population structure, evolution of social behaviours, genetic diversity and quantitative genetics all involve relatedness research. The main aim of this article was to review the most common graphical methods used in allele sharing studies for detecting and identifying family relationships. Both IBS- and IBD-based allele sharing studies are considered. Furthermore, we propose two additional graphical methods from the field of compositional data analysis: the ternary diagram and scatterplots of isometric log-ratios of IBS and IBD probabilities. We illustrate all graphical tools with genetic data from the HGDP-CEPH diversity panel, using mainly 377 microsatellites genotyped for 25 individuals from the Maya population of this panel. We enhance all graphics with convex hulls obtained by simulation and use these to confirm the documented relationships. The proposed compositional graphics are shown to be useful in relatedness research, as they also single out the most prominent related pairs. The ternary diagram is advocated for its ability to display all three allele sharing probabilities simultaneously. The log-ratio plots are advocated as an attempt to overcome the problems with the Euclidean distance interpretation in the classical graphics.


Subject(s)
Computer Graphics , Genetic Variation , Genetics, Population/methods , Genotyping Techniques/methods , Alleles , Genotype , Humans
16.
Tob Control ; 26(2): 149-152, 2017 03.
Article in English | MEDLINE | ID: mdl-26888824

ABSTRACT

OBJECTIVE: To analyse the correlation between the implementation of tobacco control policies and tobacco consumption, particularly of rolling tobacco, the use of electronic cigarettes (e-cigarettes) and the intent to quit smoking in 27 countries of the European Union. DESIGN: Ecological study with the country as the unit of analysis. DATA SOURCES: We used the data from tobacco control activities, measured by the Tobacco Control Scale (TCS), in 27 European countries, in 2010, and the prevalence of tobacco consumption data from the Eurobarometer of 2012. ANALYSIS: Spearman correlation coefficients (rsp) and their 95% CIs. RESULTS: There was a negative correlation between TCS and prevalence of smoking (rsp=-0.41; 95% CI -0.67 to -0.07). We also found a negative correlation (rsp=-0.31) between TCS and the prevalence of ever e-cigarette users, but it was not statistically significant. Among former cigarette smokers, there was a positive and statistically significant correlation between TCS and the consumption of hand-rolled tobacco (rsp=0.46; 95% CI 0.06 to 0.70). We observed a similar correlation between TCS and other tobacco products (cigars and pipe) among former cigarette smokers. There was a significant positive correlation between TCS and intent to quit smoking in the past 12 months (rsp=0.66; 95% CI 0.36 to 0.87). CONCLUSIONS: The level of smoke-free legislation among European countries is correlated with a decrease in the prevalence of smoking of conventional cigarettes and an increase in the intent to quit smoking within the past 12 months. However, the consumption of other tobacco products, particularly hand-rolled tobacco, is positively correlated with TCS among former cigarette smokers. Therefore, tobacco control policies should also consider other tobacco products, such as rolling tobacco, cigars and pipes.


Subject(s)
Electronic Nicotine Delivery Systems/statistics & numerical data , Smoking Cessation/psychology , Smoking Prevention/legislation & jurisprudence , Smoking/epidemiology , Europe/epidemiology , European Union , Humans , Intention , Prevalence , Smoke-Free Policy/legislation & jurisprudence , Statistics, Nonparametric , Tobacco Products/statistics & numerical data
17.
G3 (Bethesda) ; 5(11): 2365-73, 2015 Sep 15.
Article in English | MEDLINE | ID: mdl-26377959

ABSTRACT

This paper addresses the issue of exact-test based statistical inference for Hardy-Weinberg equilibrium in the presence of missing genotype data. Missing genotypes often are discarded when markers are tested for Hardy-Weinberg equilibrium, which can lead to bias in the statistical inference about equilibrium. Single and multiple imputation can improve inference on equilibrium. We develop tests for equilibrium in the presence of missingness by using both inbreeding coefficients (or, equivalently, χ² statistics) and exact p-values. The analysis of a set of markers with a high missing rate from the GENEVA project on prematurity shows that exact inference on equilibrium can be altered considerably when missingness is taken into account. For markers with a high missing rate (>5%), we found that both single and multiple imputation tend to diminish evidence for Hardy-Weinberg disequilibrium. Depending on the imputation method used, 6-13% of the test results changed qualitatively at the 5% level.


Subject(s)
Linkage Disequilibrium , Models, Genetic , Algorithms , Data Accuracy , Genetics, Population/methods , Inbreeding
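
The equivalence between inbreeding coefficients and χ² statistics mentioned in this abstract can be illustrated for a bi-allelic marker with complete data, where the uncorrected χ² statistic equals n times the squared inbreeding coefficient. The sketch below uses made-up genotype counts and hypothetical function names; it does not cover the missing-data or imputation procedures of the article.

```python
import math


def inbreeding_coefficient(n_aa, n_ab, n_bb):
    """Sample inbreeding coefficient f for a bi-allelic marker."""
    n_a = 2 * n_aa + n_ab          # allele counts
    n_b = 2 * n_bb + n_ab
    return (4 * n_aa * n_bb - n_ab ** 2) / (n_a * n_b)


def chi_square_hwe(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic for HWE, no continuity correction."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


counts = (10, 50, 40)              # made-up genotype counts AA, AB, BB
n = sum(counts)
f = inbreeding_coefficient(*counts)
x2 = chi_square_hwe(*counts)
print(f, x2, n * f ** 2)
```

A negative f indicates heterozygote excess and a positive f a lack of heterozygotes; the χ² statistic discards this sign. Monomorphic markers are not handled in this sketch.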
18.
Stat Appl Genet Mol Biol ; 12(4): 433-48, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23934608

ABSTRACT

OBJECTIVE: Exact tests for Hardy-Weinberg equilibrium are widely used in genetic association studies. We evaluate the mid p-value, unknown in the genetics literature, as an alternative for the standard p-value in the exact test. METHOD: The type 1 error rate and the power of the exact test are calculated for different sample sizes, significance levels, minor allele counts and degrees of deviation from equilibrium. Three different p-values are considered: the standard two-sided p-value, the doubled one-sided p-value and the mid p-value. Practical implications of using the mid p-value are discussed with HapMap datasets and a data set on colon cancer. RESULTS: The mid p-value is shown to have a type 1 error rate that is always closer to the nominal level, and to have better power. Differences between the standard p-value and the mid p-value can be large for insignificant results, and are smaller for significant results. The analysis of empirical databases shows that the mid p-value uncovers more significant markers, and that the equilibrium null distribution is not tenable for both databases. CONCLUSION: The standard exact p-value is overly conservative, in particular for small minor allele frequencies. The mid p-value ameliorates this problem by bringing the rejection rate closer to the nominal level, at the price of occasionally exceeding the nominal level.


Subject(s)
Genome-Wide Association Study/methods , Models, Genetic , Algorithms , Bayes Theorem , Colonic Neoplasms/genetics , Computer Simulation , Data Interpretation, Statistical , Gene Frequency , Genetic Markers , HapMap Project , Humans , Polymorphism, Single Nucleotide
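
The difference between the standard and the mid p-value can be sketched for a bi-allelic autosomal marker. The example assumes the usual conditional distribution of the heterozygote count given the allele counts, and one common convention for the mid p-value (the standard two-sided p-value minus half the probability of the observed sample). The function names and genotype counts are illustrative and are not taken from the article.

```python
from fractions import Fraction
from math import factorial


def het_distribution(n_a, n):
    """P(n_AB | n, n_A) under Hardy-Weinberg equilibrium.

    n is the number of individuals and n_a the count of allele A.
    The heterozygote count must have the same parity as n_a.
    """
    n_b = 2 * n - n_a
    dist = {}
    for n_ab in range(n_a % 2, min(n_a, n_b) + 1, 2):
        n_aa = (n_a - n_ab) // 2
        n_bb = (n_b - n_ab) // 2
        num = factorial(n) * factorial(n_a) * factorial(n_b) * 2 ** n_ab
        den = (factorial(2 * n) * factorial(n_aa)
               * factorial(n_ab) * factorial(n_bb))
        dist[n_ab] = Fraction(num, den)
    return dist


def exact_p_values(n_aa, n_ab, n_bb):
    """Standard two-sided and mid p-value of the exact test."""
    n = n_aa + n_ab + n_bb
    dist = het_distribution(2 * n_aa + n_ab, n)
    p_obs = dist[n_ab]
    # Sum over all outcomes that are as likely as, or less likely than,
    # the observed one; exact fractions make the comparison safe.
    standard = sum(p for p in dist.values() if p <= p_obs)
    mid = standard - p_obs / 2
    return standard, mid


standard_p, mid_p = exact_p_values(12, 30, 58)   # made-up counts
print(float(standard_p), float(mid_p))
```

Because the mid p-value subtracts half the probability of the observed sample, it is always smaller than the standard p-value, which is consistent with the less conservative behaviour described in the abstract.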
19.
PLoS One ; 8(12): e83316, 2013.
Article in English | MEDLINE | ID: mdl-24391752

ABSTRACT

In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.


Subject(s)
Biostatistics/methods , Genetic Association Studies/statistics & numerical data , Genotype , Algorithms , Case-Control Studies , Colonic Neoplasms/genetics , Computer Simulation , Gene Frequency , Genetic Markers , Humans , Logistic Models , Models, Genetic , Polymorphism, Single Nucleotide
20.
PLoS One ; 6(3): e17913, 2011 Mar 28.
Article in English | MEDLINE | ID: mdl-21464928

ABSTRACT

Recombination varies greatly among species, as illustrated by the poor conservation of the recombination landscape between humans and chimpanzees. Thus, shorter evolutionary time frames are needed to understand the evolution of recombination. Here, we analyze its recent evolution in humans. We calculated the recombination rates between adjacent pairs of 636,933 common single-nucleotide polymorphism loci in 28 worldwide human populations and analyzed them in relation to genetic distances between populations. We found a strong and highly significant correlation between similarity in the recombination rates corrected for effective population size and genetic differentiation between populations. This correlation is observed at the genome-wide level, but also for each chromosome and when genetic distances and recombination similarities are calculated independently from different parts of the genome. Moreover, and more relevant, this relationship is robustly maintained when considering presence/absence of recombination hotspots. Simulations show that this correlation cannot be explained by biases in the inference of recombination rates caused by haplotype sharing among similar populations. This result indicates a rapid pace of evolution of recombination, within the time span of differentiation of modern humans.


Subject(s)
Genetics, Population , Recombination, Genetic , Chromosomes, Human/genetics , Computer Simulation , Gene Frequency/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Population Density