1.
Hum Mol Genet ; 33(1): 38-47, 2023 Dec 12.
Article in English | MEDLINE | ID: mdl-37740403

ABSTRACT

Breast cancer (BC) risk is suspected to be linked to thyroid disorders; however, observational studies exploring the association between BC and thyroid disorders have given conflicting results. We proposed an alternative approach by investigating the shared genetic risk factors between BC and several thyroid traits. We report a positive genetic correlation between BC and thyroxine (FT4) levels (corr = 0.13, p-value = 2.0 × 10⁻⁴) and a negative genetic correlation between BC and thyroid-stimulating hormone (TSH) levels (corr = -0.09, p-value = 0.03). These associations are more striking when restricting the analysis to estrogen receptor-positive BC. Moreover, the polygenic risk scores (PRS) for FT4 and hyperthyroidism are positively associated with BC risk (OR = 1.07, 95% CI: 1.00-1.13, p-value = 2.8 × 10⁻² and OR = 1.04, 95% CI: 1.00-1.08, p-value = 3.8 × 10⁻², respectively), while the PRS for TSH is inversely associated with BC risk (OR = 0.93, 95% CI: 0.89-0.97, p-value = 2.0 × 10⁻³). Using the PLACO method, we detected 49 loci associated with both BC and thyroid traits (p-value < 5 × 10⁻⁸), in the vicinity of 130 genes. Additional colocalization and gene-set enrichment analyses showed a convincing causal role for a known pleiotropic locus at 2q35 and revealed an additional one at 8q22.1 associated with both BC and thyroid cancer. We also found two new pleiotropic loci at 14q32.33 and 17q21.31 that were associated with both TSH levels and BC risk. Enrichment analyses and evidence of regulatory signals also highlighted brain tissues and immune system as candidates for obtaining associations between BC and TSH levels. Overall, our study sheds light on the complex interplay between BC and thyroid traits and provides evidence of shared genetic risk between those conditions.


Subject(s)
Breast Neoplasms , Thyroid Gland , Humans , Female , Breast Neoplasms/genetics , Thyrotropin/genetics , Thyroxine/genetics , Risk Factors , Genetic Risk Score
2.
Nat Methods ; 18(11): 1304-1316, 2021 11.
Article in English | MEDLINE | ID: mdl-34725484

ABSTRACT

Glycoproteomics is a powerful yet analytically challenging research tool. Software packages aiding the interpretation of complex glycopeptide tandem mass spectra have appeared, but their relative performance remains untested. Conducted through the HUPO Human Glycoproteomics Initiative, this community study, comprising both developers and users of glycoproteomics software, evaluates solutions for system-wide glycopeptide analysis. The same mass spectrometry-based glycoproteomics datasets from human serum were shared with participants and the relative team performance for N- and O-glycopeptide data analysis was comprehensively established by orthogonal performance tests. Although the results were variable, several high-performance glycoproteomics informatics strategies were identified. Deep analysis of the data revealed key performance-associated search parameters and led to recommendations for improved 'high-coverage' and 'high-accuracy' glycoproteomics search solutions. This study concludes that diverse software packages for comprehensive glycopeptide data analysis exist, points to several high-performance search strategies and specifies key variables that will guide future software developments and assist informatics decision-making in glycoproteomics.


Subject(s)
Glycopeptides/blood , Glycoproteins/blood , Informatics/methods , Proteome/analysis , Proteomics/methods , Research Personnel/statistics & numerical data , Software , Glycosylation , Humans , Proteome/metabolism , Tandem Mass Spectrometry
3.
BMC Med Res Methodol ; 22(1): 9, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34996381

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) have identified genetic variants associated with multiple complex diseases. We can leverage this phenomenon, known as pleiotropy, to integrate multiple data sources in a joint analysis. Often integrating additional information such as gene pathway knowledge can improve statistical efficiency and biological interpretation. In this article, we propose statistical methods which incorporate both gene pathway and pleiotropy knowledge to increase statistical power and identify important risk variants affecting multiple traits. METHODS: We propose novel feature selection methods for the group variable selection in multi-task regression problem. We develop penalised likelihood methods exploiting different penalties to induce structured sparsity at a gene (or pathway) and SNP level across all studies. We implement an alternating direction method of multipliers (ADMM) algorithm for our penalised regression methods. The performance of our approaches are compared to a subset based meta analysis approach on simulated data sets. A bootstrap sampling strategy is provided to explore the stability of the penalised methods. RESULTS: Our methods are applied to identify potential pleiotropy in an application considering the joint analysis of thyroid and breast cancers. The methods were able to detect eleven potential pleiotropic SNPs and six pathways. A simulation study found that our method was able to detect more true signals than a popular competing method while retaining a similar false discovery rate. CONCLUSION: We developed feature selection methods for jointly analysing multiple logistic regression tasks where prior grouping knowledge is available. Our method performed well on both simulation studies and when applied to a real data analysis of multiple cancers.


Subject(s)
Genome-Wide Association Study , Genomics , Algorithms , Genomics/methods , Humans , Phenotype , Polymorphism, Single Nucleotide
4.
BMC Bioinformatics ; 22(1): 86, 2021 Feb 24.
Article in English | MEDLINE | ID: mdl-33627076

ABSTRACT

BACKGROUND: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated to multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and a pathway-level approach in the case where several independent GWAS on independent traits are available. The method is based on a generalization of the sparse group Partial Least Squares (sgPLS) to take into account groups of variables, and a Lasso penalization that links all independent data sets. This method, called joint-sgPLS, is able to convincingly detect signal at the variable level and at the group level. RESULTS: Our method has the advantage to propose a global readable model while coping with the architecture of data. It can outperform traditional methods and provides a wider insight in terms of a priori information. We compared the performance of the proposed method to other benchmark methods on simulated data and gave an example of application on real data with the aim to highlight common susceptibility variants to breast and thyroid cancers. CONCLUSION: The joint-sgPLS shows interesting properties for detecting a signal. As an extension of the PLS, the method is suited for data with a large number of variables. The choice of Lasso penalization copes with architectures of groups of variables and observations sets. Furthermore, although the method has been applied to a genetic study, its formulation is adapted to any data with high number of variables and an exposed a priori architecture in other application fields.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Least-Squares Analysis , Phenotype
5.
Stat Med ; 40(6): 1498-1518, 2021 03 15.
Article in English | MEDLINE | ID: mdl-33368447

ABSTRACT

An increasing number of genome-wide association studies (GWAS) summary statistics is made available to the scientific community. Exploiting these results from multiple phenotypes would permit identification of novel pleiotropic associations. In addition, incorporating prior biological information in GWAS such as group structure information (gene or pathway) has shown some success in classical GWAS approaches. However, this has not been widely explored in the context of pleiotropy. We propose a Bayesian meta-analysis approach (termed GCPBayes) that uses summary-level GWAS data across multiple phenotypes to detect pleiotropy at both group-level (gene or pathway) and within group (eg, at the SNP level). We consider both continuous and Dirac spike and slab priors for group selection. We also use a Bayesian sparse group selection approach with hierarchical spike and slab priors that enables us to select important variables both at the group level and within group. GCPBayes uses a Bayesian statistical framework based on Markov chain Monte Carlo (MCMC) Gibbs sampling. It can be applied to multiple types of phenotypes for studies with overlapping or nonoverlapping subjects, and takes into account heterogeneity in the effect size and allows for the opposite direction of the genetic effects across traits. Simulations show that the proposed methods outperform benchmark approaches such as ASSET and CPBayes in the ability to retrieve pleiotropic associations at both SNP and gene-levels. To illustrate the GCPBayes method, we investigate the shared genetic effects between thyroid cancer and breast cancer in candidate pathways.


Subject(s)
Genome-Wide Association Study , Neoplasms , Bayes Theorem , Genomics , Group Structure , Humans , Models, Genetic , Polymorphism, Single Nucleotide
6.
Crit Care ; 25(1): 199, 2021 06 09.
Article in English | MEDLINE | ID: mdl-34108029

ABSTRACT

BACKGROUND: Heterogeneous respiratory system static compliance (CRS) values and levels of hypoxemia in patients with novel coronavirus disease (COVID-19) requiring mechanical ventilation have been reported in previous small-case series or studies conducted at a national level. METHODS: We designed a retrospective observational cohort study with rapid data gathering from the international COVID-19 Critical Care Consortium study to comprehensively describe CRS (calculated as tidal volume / [airway plateau pressure - positive end-expiratory pressure (PEEP)]) and its association with ventilatory management and outcomes of COVID-19 patients on mechanical ventilation (MV), admitted to intensive care units (ICU) worldwide. RESULTS: We studied 745 patients from 22 countries, who required admission to the ICU and MV from January 14 to December 31, 2020, and presented at least one value of CRS within the first seven days of MV. Median (IQR) age was 62 (52-71), patients were predominantly males (68%) and from Europe/North and South America (88%). CRS, within 48 h from endotracheal intubation, was available in 649 patients and was neither associated with the duration from onset of symptoms to commencement of MV (p = 0.417) nor with PaO2/FiO2 (p = 0.100). Females presented lower CRS than males (95% CI of CRS difference between females-males: -11.8 to -7.4 mL/cmH2O, p < 0.001), and although females presented higher body mass index (BMI), association of BMI with CRS was marginal (p = 0.139). Ventilatory management varied across CRS range, resulting in a significant association between CRS and driving pressure (estimated decrease -0.31 cmH2O/L per mL/cmH2O of CRS, 95% CI -0.48 to -0.14, p < 0.001). Overall, 28-day ICU mortality, accounting for the competing risk of being discharged within the period, was 35.6% (SE 1.7). Cox proportional hazard analysis demonstrated that CRS (+10 mL/cmH2O) was only associated with being discharged from the ICU within 28 days (HR 1.14, 95% CI 1.02-1.28, p = 0.018). CONCLUSIONS: This multicentre report provides a comprehensive account of CRS in COVID-19 patients on MV. CRS measured within 48 h from commencement of MV has marginal predictive value for 28-day mortality, but was associated with being discharged from ICU within the same period. Trial documentation: Available at https://www.covid-critical.com/study. TRIAL REGISTRATION: ACTRN12620000421932.


Subject(s)
COVID-19/complications , COVID-19/therapy , Lung Compliance/physiology , Respiration, Artificial/methods , Respiratory Distress Syndrome/etiology , Respiratory Distress Syndrome/therapy , Adult , Cohort Studies , Critical Care/methods , Europe , Female , Humans , Intensive Care Units , Male , Middle Aged , Retrospective Studies , Severity of Illness Index
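The compliance formula quoted in the abstract above (tidal volume divided by the difference between plateau pressure and PEEP) can be illustrated with a minimal Python sketch. The function name, argument names and example values are illustrative assumptions, not taken from the study.

```python
def static_compliance(tidal_volume_ml, plateau_pressure_cmh2o, peep_cmh2o):
    """Respiratory system static compliance (CRS) in mL/cmH2O.

    CRS = tidal volume / (airway plateau pressure - PEEP),
    as stated in the abstract. All names here are hypothetical.
    """
    driving_pressure = plateau_pressure_cmh2o - peep_cmh2o
    if driving_pressure <= 0:
        raise ValueError("plateau pressure must exceed PEEP")
    return tidal_volume_ml / driving_pressure


# Made-up example values: 420 mL tidal volume, plateau 24 cmH2O, PEEP 10 cmH2O
print(static_compliance(420, 24, 10))  # → 30.0
```

The denominator is the driving pressure, which is why the abstract can report an association between CRS and driving pressure directly.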
8.
Glob Chang Biol ; 26(5): 2785-2797, 2020 05.
Article in English | MEDLINE | ID: mdl-32115808

ABSTRACT

Anticipating future changes of an ecosystem's dynamics requires knowledge of how its key communities respond to current environmental regimes. The Great Barrier Reef (GBR) is under threat, with rapid changes of its reef-building hard coral (HC) community structure already evident across broad spatial scales. While several underlying relationships between HC and multiple disturbances have been documented, responses of other benthic communities to disturbances are not well understood. Here we used statistical modelling to explore the effects of broad-scale climate-related disturbances on benthic communities to predict their structure under scenarios of increasing disturbance frequency. We parameterized a multivariate model using the composition of benthic communities estimated by 145,000 observations from the northern GBR between 2012 and 2017. During this time, surveyed reefs were variously impacted by two tropical cyclones and two heat stress events that resulted in extensive HC mortality. This unprecedented sequence of disturbances was used to estimate the effects of discrete versus interacting disturbances on the compositional structure of HC, soft corals (SC) and algae. Discrete disturbances increased the prevalence of algae relative to HC while the interaction between cyclones and heat stress was the main driver of the increase in SC relative to algae and HC. Predictions from disturbance scenarios included relative increases in algae versus SC that varied by the frequency and types of disturbance interactions. However, high uncertainty of compositional changes in the presence of several disturbances shows that responses of algae and SC to the decline in HC need further research. Better understanding of the effects of multiple disturbances on benthic communities as a whole is essential for predicting the future status of coral reefs and managing them in the light of new environmental regimes. The approach we develop here opens new opportunities for reaching this goal.


Subject(s)
Anthozoa , Cyclonic Storms , Animals , Coral Reefs , Ecosystem
9.
Stat Med ; 39(28): 4201-4217, 2020 12 10.
Article in English | MEDLINE | ID: mdl-32844489

ABSTRACT

Identification of biomarkers is an emerging area in oncology. In this article, we develop an efficient statistical procedure for the classification of protein markers according to their effect on cancer progression. A high-dimensional time-course dataset of protein markers for 80 patients motivates us for developing the model. The threshold value is formulated as a level of a marker having maximum impact on cancer progression. The classification algorithm technique for high-dimensional time-course data is developed and the algorithm is validated by comparing random components using both proportional hazard and accelerated failure time frailty models. The study elucidates the application of two separate joint modeling techniques using auto regressive-type model and mixed effect model for time-course data and proportional hazard model for survival data with proper utilization of Bayesian methodology. Also, a prognostic score is developed on the basis of few selected genes with application on patients. This study facilitates to identify relevant biomarkers from a set of markers.


Subject(s)
Algorithms , Medical Oncology , Bayes Theorem , Biomarkers , Humans , Proportional Hazards Models
10.
Environ Sci Technol ; 54(21): 13719-13730, 2020 11 03.
Article in English | MEDLINE | ID: mdl-32856893

ABSTRACT

Anomaly detection (AD) in high-volume environmental data requires one to tackle a series of challenges associated with the typical low frequency of anomalous events, the broad-range of possible anomaly types, and local nonstationary environmental conditions, suggesting the need for flexible statistical methods that are able to cope with unbalanced high-volume data problems. Here, we aimed to detect anomalies caused by technical errors in water-quality (turbidity and conductivity) data collected by automated in situ sensors deployed in contrasting riverine and estuarine environments. We first applied a range of artificial neural networks that differed in both learning method and hyperparameter values, then calibrated models using a Bayesian multiobjective optimization procedure, and selected and evaluated the "best" model for each water-quality variable, environment, and anomaly type. We found that semi-supervised classification was better able to detect sudden spikes, sudden shifts, and small sudden spikes, whereas supervised classification had higher accuracy for predicting long-term anomalies associated with drifts and periods of otherwise unexplained high variability.


Subject(s)
Neural Networks, Computer , Water , Bayes Theorem , Water Quality
11.
BMC Med Res Methodol ; 19(1): 79, 2019 04 16.
Article in English | MEDLINE | ID: mdl-30991962

ABSTRACT

BACKGROUND: In medical research, explanatory continuous variables are frequently transformed or converted into categorical variables. If the coding is unknown, many tests can be used to identify the "optimal" transformation. This common process, involving the problems of multiple testing, requires a correction of the significance level. Liquet and Commenges proposed an asymptotic correction of significance level in the context of generalized linear models (GLM) (Liquet and Commenges, Stat Probab Lett 71:33-38, 2005). This procedure has been developed for dichotomous and Box-Cox transformations. Furthermore, Liquet and Riou suggested the use of resampling methods to estimate the significance level for transformations into categorical variables with more than two levels (Liquet and Riou, BMC Med Res Methodol 13:75, 2013). RESULTS: CPMCGLM provides users with both methods of p-value adjustment. Furthermore, they are available for a large set of transformations. This paper aims to give the user an overview of the methodological context and to explain in detail the use of the CPMCGLM R package through its application to a real epidemiological dataset. CONCLUSION: We present here the CPMCGLM R package, which provides efficient methods for the correction of type-I error rate in the context of generalized linear models. This is the first and the only available package in R providing such methods applied to this context. This package is designed to help researchers, who work principally in the field of biostatistics and epidemiology, to analyze their data in the context of optimal cutoff point determination.


Subject(s)
Algorithms , Biometry/methods , Computational Biology/methods , Linear Models , Cholesterol, HDL/blood , Dementia/blood , Female , Humans , Male , Reproducibility of Results
12.
Int J Cancer ; 143(6): 1335-1347, 2018 09 15.
Article in English | MEDLINE | ID: mdl-29667176

ABSTRACT

Recent prospective studies have shown that dysregulation of the immune system may precede the development of B-cell lymphomas (BCL) in immunocompetent individuals. However, to date, the studies were restricted to a few immune markers, which were considered separately. Using a nested case-control study within two European prospective cohorts, we measured plasma levels of 28 immune markers in samples collected a median of 6 years before diagnosis (range 2.01-15.97) in 268 incident cases of BCL (including multiple myeloma [MM]) and matched controls. Linear mixed models and partial least square analyses were used to analyze the association between levels of immune marker and the incidence of BCL and its main histological subtypes and to investigate potential biomarkers predictive of the time to diagnosis. Linear mixed model analyses identified associations linking lower levels of fibroblast growth factor-2 (FGF-2, p = 7.2 × 10⁻⁴) and transforming growth factor alpha (TGF-α, p = 6.5 × 10⁻⁵) and BCL incidence. Analyses stratified by histological subtypes identified inverse associations for MM subtype including FGF-2 (p = 7.8 × 10⁻⁷), TGF-α (p = 4.08 × 10⁻⁵), fractalkine (p = 1.12 × 10⁻³), monocyte chemotactic protein-3 (p = 1.36 × 10⁻⁴), macrophage inflammatory protein 1-alpha (p = 4.6 × 10⁻⁴) and vascular endothelial growth factor (p = 4.23 × 10⁻⁵). Our results also provided marginal support for already reported associations between chemokines and diffuse large BCL (DLBCL) and cytokines and chronic lymphocytic leukemia (CLL). Case-only analyses showed that Granulocyte-macrophage colony stimulating factor levels were consistently higher closer to diagnosis, which provides further evidence of its role in tumor progression. In conclusion, our study suggests a role of growth-factors in the incidence of MM and of chemokine and cytokine regulation in DLBCL and CLL.


Subject(s)
Biomarkers/blood , Lymphoma, Large B-Cell, Diffuse/blood , Multiple Myeloma/blood , Adult , Aged , Case-Control Studies , Chemokine CCL7/blood , Chemokine CX3CL1/blood , Europe , Female , Fibroblast Growth Factor 2/blood , Follow-Up Studies , Humans , Incidence , Lymphoma, Large B-Cell, Diffuse/diagnosis , Lymphoma, Large B-Cell, Diffuse/epidemiology , Lymphoma, Large B-Cell, Diffuse/immunology , Male , Middle Aged , Multiple Myeloma/diagnosis , Multiple Myeloma/epidemiology , Multiple Myeloma/immunology , Multivariate Analysis , Prognosis , Prospective Studies , Transforming Growth Factor alpha/blood , Vascular Endothelial Growth Factor A/blood
13.
Stat Med ; 37(23): 3338-3356, 2018 10 15.
Article in English | MEDLINE | ID: mdl-29888397

ABSTRACT

Integrative analysis of high dimensional omics datasets has been studied by many authors in recent years. By incorporating prior known relationships among the variables, these analyses have been successful in elucidating the relationships between different sets of omics data. In this article, our goal is to identify important relationships between genomic expression and cytokine data from a human immunodeficiency virus vaccine trial. We proposed a flexible partial least squares technique, which incorporates group and subgroup structure in the modelling process. Our new method accounts for both grouping of genetic markers (eg, gene sets) and temporal effects. The method generalises existing sparse modelling techniques in the partial least squares methodology and establishes theoretical connections to variable selection methods for supervised and unsupervised problems. Simulation studies are performed to investigate the performance of our methods over alternative sparse approaches. Our R package sgspls is available at https://github.com/matt-sutton/sgspls.


Subject(s)
Least-Squares Analysis , Models, Statistical , AIDS Vaccines/therapeutic use , Algorithms , Biostatistics , Clinical Trials as Topic/statistics & numerical data , Computer Simulation , Genomics/methods , Humans , Likelihood Functions , Multivariate Analysis , Regression Analysis
14.
Bioinformatics ; 32(1): 35-42, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26358727

ABSTRACT

MOTIVATION: The association between two blocks of 'omics' data brings challenging issues in computational biology due to their size and complexity. Here, we focus on a class of multivariate statistical methods called partial least square (PLS). Sparse version of PLS (sPLS) operates integration of two datasets while simultaneously selecting the contributing variables. However, these methods do not take into account the important structural or group effects due to the relationship between markers among biological pathways. Hence, considering the predefined groups of markers (e.g. genesets), this could improve the relevance and the efficacy of the PLS approach. RESULTS: We propose two PLS extensions called group PLS (gPLS) and sparse gPLS (sgPLS). Our algorithm enables to study the relationship between two different types of omics data (e.g. SNP and gene expression) or between an omics dataset and multivariate phenotypes (e.g. cytokine secretion). We demonstrate the good performance of gPLS and sgPLS compared with the sPLS in the context of grouped data. Then, these methods are compared through an HIV therapeutic vaccine trial. Our approaches provide parsimonious models to reveal the relationship between gene abundance and the immunological response to the vaccine. AVAILABILITY AND IMPLEMENTATION: The approach is implemented in a comprehensive R package called sgPLS available on the CRAN. CONTACT: b.liquet@uq.edu.au SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Genomics/methods , AIDS Vaccines/immunology , Computer Simulation , Humans , Least-Squares Analysis , Sample Size
15.
Stat Med ; 35(16): 2687-714, 2016 07 20.
Article in English | MEDLINE | ID: mdl-26914402

ABSTRACT

Multiple endpoints are increasingly used in clinical trials. The significance of some of these clinical trials is established if at least r null hypotheses are rejected among m that are simultaneously tested. The usual approach in multiple hypothesis testing is to control the family-wise error rate, which is defined as the probability that at least one type-I error is made. More recently, the q-generalized family-wise error rate has been introduced to control the probability of making at least q false rejections. For procedures controlling this global type-I error rate, we define a type-II r-generalized family-wise error rate, which is directly related to the r-power defined as the probability of rejecting at least r false null hypotheses. We obtain very general power formulas that can be used to compute the sample size for single-step and step-wise procedures. These are implemented in our R package rPowerSampleSize available on the CRAN, making them directly available to end users. Complexities of the formulas are presented to gain insight into computation time issues. Comparison with Monte Carlo strategy is also presented. We compute sample sizes for two clinical trials involving multiple endpoints: one designed to investigate the effectiveness of a drug against acute heart failure and the other for the immunogenicity of a vaccine strategy against pneumococcus. Copyright © 2016 John Wiley & Sons, Ltd.


Subject(s)
Research Design , Sample Size , Humans , Monte Carlo Method , Probability
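The r-power described in the abstract above (the probability of rejecting at least r false null hypotheses among m tested) can be approximated by simulation. The sketch below is a plain Monte Carlo estimate under strong simplifying assumptions that are not from the paper: independent endpoints, unit variances, a two-arm design with equal arm sizes, and a single-step two-sided Bonferroni procedure. It does not use the closed-form formulas or the rPowerSampleSize package; all names are hypothetical.

```python
import random
from statistics import NormalDist


def r_power_monte_carlo(effects, n_per_arm, r, alpha=0.05, n_sim=20000, seed=0):
    """Estimate P(at least r false nulls are rejected) with Bonferroni.

    effects: standardized mean differences, one per endpoint
             (0 means the null hypothesis is true for that endpoint).
    Assumes independent endpoints with unit variance (illustrative only).
    """
    rng = random.Random(seed)
    m = len(effects)
    # Two-sided Bonferroni critical value for each of the m tests
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * m))
    scale = (n_per_arm / 2) ** 0.5  # z-statistic mean is delta * sqrt(n/2)
    hits = 0
    for _ in range(n_sim):
        rejected_false = 0
        for delta in effects:
            if delta == 0:
                continue  # true nulls do not count toward the r-power
            z = rng.gauss(delta * scale, 1.0)
            if abs(z) > z_crit:
                rejected_false += 1
        if rejected_false >= r:
            hits += 1
    return hits / n_sim


# Smallest arm size (on a coarse grid) giving an estimated 2-power of at
# least 0.8 for three endpoints, two of which have a true effect of 0.5.
for n in range(20, 401, 20):
    if r_power_monte_carlo([0.5, 0.5, 0.0], n, r=2) >= 0.8:
        print("n per arm:", n)
        break
```

A grid search like this is slow compared with analytic power formulas, which is presumably why the abstract discusses computation time and compares against a Monte Carlo strategy.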
16.
PLoS Genet ; 9(8): e1003657, 2013.
Article in English | MEDLINE | ID: mdl-23950726

ABSTRACT

Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex Linkage Disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants. This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.


Subject(s)
Algorithms , Biological Evolution , Genome-Wide Association Study , Quantitative Trait Loci/genetics , Bayes Theorem , Exome/genetics , Gene Expression , Humans , Linkage Disequilibrium , Phenotype , Polymorphism, Single Nucleotide/genetics
17.
J Stat Softw ; 69(2)2016 Jan 29.
Article in English | MEDLINE | ID: mdl-29568242

ABSTRACT

Technological advances in molecular biology over the past decade have given rise to high dimensional and complex datasets offering the possibility to investigate biological associations between a range of genomic features and complex phenotypes. The analysis of this novel type of data generated unprecedented computational challenges which ultimately led to the definition and implementation of computationally efficient statistical models that were able to scale to genome-wide data, including Bayesian variable selection approaches. While extensive methodological work has been carried out in this area, only a few methods capable of handling hundreds of thousands of predictors were implemented and distributed. Among these we recently proposed GUESS, a computationally optimised algorithm making use of graphics processing unit capabilities, which can accommodate multiple outcomes. In this paper we propose R2GUESS, an R package wrapping the original C++ source code. In addition to providing a user-friendly interface to the original code that automates its parametrisation and data handling, R2GUESS also incorporates many features to explore the data, to extend statistical inferences from the native algorithm (e.g., effect size estimation, significance assessment), and to visualize outputs from the algorithm. We first detail the model and its parametrisation, and describe in detail its optimised implementation. Based on two examples we finally illustrate its statistical performance and flexibility.

18.
J Biopharm Stat ; 24(2): 378-97, 2014.
Article in English | MEDLINE | ID: mdl-24605975

ABSTRACT

The use of two or more primary correlated endpoints is becoming increasingly common. A mandatory approach when analyzing data from such clinical trials is to control the family-wise error rate (FWER). In this context, we provide formulas for computation of sample size and for data analysis. Two approaches are discussed: an individual method based on a union-intersection procedure and a global procedure, based on a multivariate model that can take into account adjustment variables. These methods are illustrated with simulation studies and applications. An R package known as rPowerSampleSize is also available.


Subject(s)
Clinical Trials as Topic , Computer Simulation , Endpoint Determination/methods , Clinical Trials as Topic/statistics & numerical data , Computer Simulation/statistics & numerical data , Endpoint Determination/statistics & numerical data , Humans , Sample Size
19.
BMC Med Res Methodol ; 13: 75, 2013 Jun 08.
Article in English | MEDLINE | ID: mdl-23758852

ABSTRACT

BACKGROUND: In statistical modeling, finding the most favorable coding for an exploratory quantitative variable involves many tests. This process involves multiple testing problems and requires the correction of the significance level. METHODS: For each coding, a test on the nullity of the coefficient associated with the new coded variable is computed. The selected coding corresponds to that associated with the largest test statistic (or equivalently the smallest p-value). In the context of the Generalized Linear Model, Liquet and Commenges (Stat Probab Lett 71:33-38, 2005) proposed an asymptotic correction of the significance level. This procedure, based on the score test, has been developed for dichotomous and Box-Cox transformations. In this paper, we suggest the use of resampling methods to estimate the significance level for categorical transformations with more than two levels and, by definition, those that involve more than one parameter in the model. The categorical transformation is a more flexible way to explore the unknown shape of the effect between an explanatory and a dependent variable. RESULTS: The simulations we ran in this study showed good performance of the proposed methods. These methods were illustrated using the data from a study of the relationship between cholesterol and dementia. CONCLUSION: The algorithms were implemented using R, and the associated CPMCGLM R package is available on the CRAN.


Subject(s)
Computer Simulation , Epidemiologic Research Design , Linear Models , Aged , Algorithms , Cholesterol, HDL/blood , Data Interpretation, Statistical , Dementia/blood , Epidemiologic Factors , Humans , Multivariate Analysis , Reproducibility of Results , Risk Factors , Sample Size
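The multiple-testing problem described in the abstract above (trying several codings and keeping the one with the largest test statistic) can be illustrated with a generic permutation correction. This is a minimal sketch of the general max-statistic resampling idea for dichotomous codings of a binary outcome, not the CPMCGLM implementation or its API; the two-proportion z statistic, the function names and the toy data are all assumptions made for illustration.

```python
import random


def max_abs_z(x, y, cutpoints):
    """Largest absolute two-proportion z statistic over candidate cutpoints."""
    n = len(y)
    p = sum(y) / n  # pooled outcome proportion
    best = 0.0
    for c in cutpoints:
        hi = [yi for xi, yi in zip(x, y) if xi > c]
        lo = [yi for xi, yi in zip(x, y) if xi <= c]
        if not hi or not lo or p in (0.0, 1.0):
            continue  # statistic undefined for this coding
        se = (p * (1 - p) * (1 / len(hi) + 1 / len(lo))) ** 0.5
        z = abs(sum(hi) / len(hi) - sum(lo) / len(lo)) / se
        best = max(best, z)
    return best


def adjusted_p_value(x, y, cutpoints, n_perm=2000, seed=0):
    """Permutation p-value for the best coding, corrected for the search.

    The outcome is shuffled and the whole search over cutpoints is
    repeated, so the null distribution accounts for taking the maximum.
    """
    rng = random.Random(seed)
    observed = max_abs_z(x, y, cutpoints)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if max_abs_z(x, y_perm, cutpoints) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Toy data (made up): the outcome switches from 0 to 1 halfway along x.
x = list(range(40))
y = [0] * 20 + [1] * 20
print(adjusted_p_value(x, y, cutpoints=[9.5, 19.5, 29.5], n_perm=500))
```

Because the search over cutpoints is repeated inside every permutation, the resulting p-value does not need a further Bonferroni-type adjustment for the number of codings tried.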
20.
PLoS One ; 18(6): e0287705, 2023.
Article in English | MEDLINE | ID: mdl-37384667

ABSTRACT

Compositional data are a special kind of data, represented as a proportion carrying relative information. Although this type of data is widely spread, no solution exists to deal with the cases where the classes are not well balanced. After describing compositional data imbalance, this paper proposes an adaptation of the original Synthetic Minority Oversampling TEchnique (SMOTE) to deal with compositional data imbalance. The new approach, called SMOTE for Compositional Data (SMOTE-CD), generates synthetic examples by computing a linear combination of selected existing data points, using compositional data operations. The performance of the SMOTE-CD is tested with three different regressors (Gradient Boosting tree, Neural Networks, Dirichlet regressor) applied to two real datasets and to synthetic generated data, and the performance is evaluated using accuracy, cross-entropy, F1-score, R2 score and RMSE. The results show improvements across all metrics, but the impact of oversampling on performance varies depending on the model and the data. In some cases, oversampling may lead to a decrease in performance for the majority class. However, for the real data, the best performance across all models is achieved when oversampling is used. Notably, the F1-score is consistently increased with oversampling. Unlike the original technique, the performance is not improved when combining oversampling of the minority classes and undersampling of the majority class. The Python package smote-cd implements the method and is available online.


Subject(s)
Acclimatization , Benchmarking , Entropy , Minority Groups , Neural Networks, Computer
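The abstract above says SMOTE-CD builds synthetic examples as a linear combination of existing points using compositional data operations. A minimal sketch of one such combination follows, assuming the standard Aitchison operations (perturbation and powering), under which interpolating between compositions x and y with weight lam gives the closure of x_i^(1-lam) * y_i^lam. Whether the smote-cd package uses exactly this form is an assumption; the function names are hypothetical and this is not the package's API.

```python
import random


def closure(parts):
    """Rescale strictly positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def compositional_interpolate(x, y, lam):
    """Point on the Aitchison 'line' from x (lam=0) to y (lam=1).

    Equivalent to x perturbed by lam-powered (y minus x) in the simplex.
    Zero parts are rejected because powers and ratios break down there.
    """
    if any(v <= 0 for v in list(x) + list(y)):
        raise ValueError("all parts must be strictly positive")
    return closure([xi ** (1 - lam) * yi ** lam for xi, yi in zip(x, y)])


def synthetic_sample(x, neighbour, rng):
    """SMOTE-style draw: a random position between a point and a neighbour."""
    return compositional_interpolate(x, neighbour, rng.random())


rng = random.Random(0)
a = [0.2, 0.3, 0.5]
b = [0.6, 0.3, 0.1]
s = synthetic_sample(a, b, rng)
print(s, sum(s))  # parts stay positive and sum to 1 (up to rounding)
```

Interpolating in this geometry keeps every synthetic point inside the simplex, whereas a naive Euclidean interpolation followed by ad hoc rescaling would not respect the relative (ratio) information the abstract emphasises.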