ABSTRACT
Breast cancer (BC) risk is suspected to be linked to thyroid disorders; however, observational studies exploring the association between BC and thyroid disorders have given conflicting results. We propose an alternative approach by investigating the shared genetic risk factors between BC and several thyroid traits. We report a positive genetic correlation between BC and thyroxine (FT4) levels (corr = 0.13, p-value = 2.0 × 10⁻⁴) and a negative genetic correlation between BC and thyroid-stimulating hormone (TSH) levels (corr = -0.09, p-value = 0.03). These associations are more striking when restricting the analysis to estrogen receptor-positive BC. Moreover, the polygenic risk scores (PRS) for FT4 and hyperthyroidism are positively associated with BC risk (OR = 1.07, 95% CI: 1.00-1.13, p-value = 2.8 × 10⁻² and OR = 1.04, 95% CI: 1.00-1.08, p-value = 3.8 × 10⁻², respectively), while the PRS for TSH is inversely associated with BC risk (OR = 0.93, 95% CI: 0.89-0.97, p-value = 2.0 × 10⁻³). Using the PLACO method, we detected 49 loci associated with both BC and thyroid traits (p-value < 5 × 10⁻⁸), in the vicinity of 130 genes. Additional colocalization and gene-set enrichment analyses showed a convincing causal role for a known pleiotropic locus at 2q35 and revealed a new one at 8q22.1 associated with both BC and thyroid cancer. We also found two new pleiotropic loci, at 14q32.33 and 17q21.31, associated with both TSH levels and BC risk. Enrichment analyses and evidence of regulatory signals also highlighted brain tissues and the immune system as candidate mediators of the association between BC and TSH levels. Overall, our study sheds light on the complex interplay between BC and thyroid traits and provides evidence of shared genetic risk between these conditions.
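The polygenic risk scores above combine per-variant GWAS weights with individual genotype dosages. A toy sketch of that computation and a crude odds ratio follows; the SNP identifiers, weights, dosages and case labels are all invented for illustration, and the study itself used proper regression models on real cohort data:

```python
# Minimal PRS sketch: score = sum over SNPs of (effect-allele dosage x weight).
# All SNP IDs, weights, dosages and case labels below are hypothetical.

weights = {"rs_a": 0.12, "rs_b": -0.08, "rs_c": 0.05}  # per-allele log-OR weights

def prs(dosages):
    """dosages: dict SNP id -> effect-allele count (0, 1 or 2)."""
    return sum(weights[s] * dosages.get(s, 0) for s in weights)

# Toy cohort: (genotype dosages, is_case)
cohort = [
    ({"rs_a": 2, "rs_b": 0, "rs_c": 2}, True),
    ({"rs_a": 2, "rs_b": 1, "rs_c": 1}, True),
    ({"rs_a": 1, "rs_b": 0, "rs_c": 2}, True),
    ({"rs_a": 0, "rs_b": 2, "rs_c": 0}, False),
    ({"rs_a": 1, "rs_b": 2, "rs_c": 1}, False),
    ({"rs_a": 0, "rs_b": 1, "rs_c": 0}, False),
]

scores = [prs(d) for d, _ in cohort]
cut = sorted(scores)[len(scores) // 2]        # median split: "high" vs "low" PRS

def count(high, case):
    return sum(1 for d, c in cohort
               if (prs(d) >= cut) == high and c == case)

# Crude 2x2 odds ratio (0.5 added to each cell to avoid division by zero)
odds_ratio = ((count(True, True) + 0.5) * (count(False, False) + 0.5)) / \
             ((count(True, False) + 0.5) * (count(False, True) + 0.5))
print(round(odds_ratio, 2))
```

In practice the PRS would be standardized and entered into a logistic regression with covariates; the median split and 2x2 table here only illustrate the direction of the association.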
Subject(s)
Breast Neoplasms , Thyroid Gland , Humans , Female , Breast Neoplasms/genetics , Thyrotropin/genetics , Thyroxine/genetics , Risk Factors , Genetic Risk Score
ABSTRACT
Glycoproteomics is a powerful yet analytically challenging research tool. Software packages aiding the interpretation of complex glycopeptide tandem mass spectra have appeared, but their relative performance remains untested. Conducted through the HUPO Human Glycoproteomics Initiative, this community study, comprising both developers and users of glycoproteomics software, evaluates solutions for system-wide glycopeptide analysis. The same mass spectrometry-based glycoproteomics datasets from human serum were shared with participants, and the relative team performance for N- and O-glycopeptide data analysis was comprehensively established by orthogonal performance tests. Although the results were variable, several high-performance glycoproteomics informatics strategies were identified. Deep analysis of the data revealed key performance-associated search parameters and led to recommendations for improved 'high-coverage' and 'high-accuracy' glycoproteomics search solutions. This study concludes that diverse software packages for comprehensive glycopeptide data analysis exist, points to several high-performance search strategies and specifies key variables that will guide future software developments and assist informatics decision-making in glycoproteomics.
Subject(s)
Glycopeptides/blood , Glycoproteins/blood , Informatics/methods , Proteome/analysis , Proteomics/methods , Research Personnel/statistics & numerical data , Software , Glycosylation , Humans , Proteome/metabolism , Tandem Mass Spectrometry
ABSTRACT
BACKGROUND: Genome-wide association studies (GWAS) have identified genetic variants associated with multiple complex diseases. We can leverage this phenomenon, known as pleiotropy, to integrate multiple data sources in a joint analysis. Integrating additional information such as gene pathway knowledge can often improve statistical efficiency and biological interpretation. In this article, we propose statistical methods which incorporate both gene pathway and pleiotropy knowledge to increase statistical power and identify important risk variants affecting multiple traits. METHODS: We propose novel feature selection methods for group variable selection in the multi-task regression problem. We develop penalised likelihood methods exploiting different penalties to induce structured sparsity at a gene (or pathway) and SNP level across all studies. We implement an alternating direction method of multipliers (ADMM) algorithm for our penalised regression methods. The performance of our approaches is compared with a subset-based meta-analysis approach on simulated data sets. A bootstrap sampling strategy is provided to explore the stability of the penalised methods. RESULTS: Our methods are applied to identify potential pleiotropy in an application considering the joint analysis of thyroid and breast cancers. The methods detected eleven potential pleiotropic SNPs and six pathways. A simulation study found that our method was able to detect more true signals than a popular competing method while retaining a similar false discovery rate. CONCLUSION: We developed feature selection methods for jointly analysing multiple logistic regression tasks where prior grouping knowledge is available. Our method performed well in both simulation studies and a real data analysis of multiple cancers.
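The ADMM algorithm mentioned above alternates a ridge-like coefficient update, a shrinkage step, and a dual-variable update. The following toy sketch applies ADMM to a plain lasso penalty on two predictors; it is a simplified stand-in for the authors' grouped penalties, and all data and tuning constants are made up:

```python
# Toy ADMM for the lasso (a simplified stand-in for the grouped penalties
# discussed above). All data and tuning constants are illustrative.

def soft(v, k):                        # scalar soft-thresholding operator
    return (abs(v) - k) * (1 if v > 0 else -1) if abs(v) > k else 0.0

# Design with two predictors; the response depends only on the first one.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, -1.0, 1.0, -1.0]
b  = [2.0, 4.0, 6.0, 8.0]              # b = 2 * x1 exactly

g11 = sum(a * a for a in x1)
g12 = sum(a * c for a, c in zip(x1, x2))
g22 = sum(c * c for c in x2)
atb = [sum(a * y for a, y in zip(x1, b)), sum(c * y for c, y in zip(x2, b))]

lam, rho = 1.0, 1.0
z = [0.0, 0.0]; u = [0.0, 0.0]
for _ in range(200):
    # x-update: solve (A'A + rho I) x = A'b + rho (z - u), 2x2 closed form
    r0 = atb[0] + rho * (z[0] - u[0]); r1 = atb[1] + rho * (z[1] - u[1])
    p, q, s = g11 + rho, g12, g22 + rho
    det = p * s - q * q
    x = [(s * r0 - q * r1) / det, (p * r1 - q * r0) / det]
    # z-update: elementwise soft-thresholding; u-update: dual ascent
    z = [soft(x[i] + u[i], lam / rho) for i in range(2)]
    u = [u[i] + x[i] - z[i] for i in range(2)]

print([round(v, 2) for v in z])        # irrelevant predictor shrunk to zero
```

For the group-structured penalties of the paper, the scalar soft-thresholding step would be replaced by a groupwise shrinkage of whole blocks of coefficients.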
Subject(s)
Genome-Wide Association Study , Genomics , Algorithms , Genomics/methods , Humans , Phenotype , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose novel gene- and pathway-level approaches for the case where several independent GWAS on independent traits are available. The method is based on a generalization of sparse group Partial Least Squares (sgPLS) that takes groups of variables into account, and on a Lasso penalization that links all independent data sets. This method, called joint-sgPLS, convincingly detects signal at both the variable level and the group level. RESULTS: Our method has the advantage of proposing a globally readable model while coping with the architecture of the data. It can outperform traditional methods and provides wider insight by exploiting a priori information. We compared the performance of the proposed method with benchmark methods on simulated data and gave an example of application to real data, with the aim of highlighting common susceptibility variants to breast and thyroid cancers. CONCLUSION: The joint-sgPLS shows interesting properties for signal detection. As an extension of PLS, the method is suited to data with a large number of variables. The Lasso penalization copes with architectures of groups of variables and sets of observations. Furthermore, although the method has been applied to a genetic study, its formulation is suitable for any data with a high number of variables and a known a priori group structure, in other application fields.
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Least-Squares Analysis , Phenotype
ABSTRACT
An increasing number of genome-wide association study (GWAS) summary statistics are made available to the scientific community. Exploiting these results across multiple phenotypes would permit the identification of novel pleiotropic associations. In addition, incorporating prior biological information into GWAS, such as group structure information (gene or pathway), has shown some success in classical GWAS approaches. However, this has not been widely explored in the context of pleiotropy. We propose a Bayesian meta-analysis approach (termed GCPBayes) that uses summary-level GWAS data across multiple phenotypes to detect pleiotropy both at the group level (gene or pathway) and within groups (e.g., at the SNP level). We consider both continuous and Dirac spike-and-slab priors for group selection. We also use a Bayesian sparse group selection approach with hierarchical spike-and-slab priors that enables the selection of important variables both at the group level and within groups. GCPBayes uses a Bayesian statistical framework based on Markov chain Monte Carlo (MCMC) Gibbs sampling. It can be applied to multiple types of phenotypes for studies with overlapping or non-overlapping subjects, takes into account heterogeneity in effect sizes, and allows for opposite directions of genetic effects across traits. Simulations show that the proposed methods outperform benchmark approaches such as ASSET and CPBayes in the ability to retrieve pleiotropic associations at both the SNP and gene levels. To illustrate the GCPBayes method, we investigate the shared genetic effects between thyroid cancer and breast cancer in candidate pathways.
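At the heart of a Dirac spike-and-slab prior is a posterior inclusion probability that weighs the marginal likelihood of a zero effect against a Gaussian slab. The one-effect sketch below illustrates that computation in closed form; the data and hyperparameters are invented, and this is a drastic simplification of the hierarchical group priors and Gibbs sampling used by GCPBayes:

```python
import math

# Posterior inclusion probability for one effect under a Dirac spike-and-slab
# prior: beta = 0 with prob 1 - pi, beta ~ N(0, tau^2) with prob pi.
# Model: y_i = beta * x_i + eps_i, eps_i ~ N(0, sigma^2). Marginalising beta
# under the slab gives y ~ N(0, sigma^2 I + tau^2 x x'); the log marginal
# likelihood ratio slab/spike then has the closed form used below.

def inclusion_prob(x, y, sigma2=1.0, tau2=1.0, pi=0.5):
    S = sum(a * a for a in x)                  # x'x
    xy = sum(a * b for a, b in zip(x, y))      # x'y
    log_ratio = (-0.5 * math.log(1.0 + tau2 * S / sigma2)
                 + 0.5 * tau2 * xy * xy / (sigma2 * (sigma2 + tau2 * S)))
    return 1.0 / (1.0 + (1.0 - pi) / pi * math.exp(-log_ratio))

x = [0.5, -1.0, 1.5, 2.0, -0.5, 1.0]
y_signal = [0.6, -1.1, 1.4, 2.2, -0.4, 0.9]    # roughly y = x: strong effect
y_null   = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2]    # little relation to x

print(round(inclusion_prob(x, y_signal), 3))   # high: strong evidence
print(round(inclusion_prob(x, y_null), 3))     # low: little evidence
```

In the full method this quantity is sampled within a Gibbs sweep, jointly over groups of SNPs and across phenotypes, rather than evaluated once.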
Subject(s)
Genome-Wide Association Study , Neoplasms , Bayes Theorem , Genomics , Group Structure , Humans , Models, Genetic , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Heterogeneous respiratory system static compliance (CRS) values and levels of hypoxemia in patients with novel coronavirus disease (COVID-19) requiring mechanical ventilation have been reported in previous small case series or studies conducted at a national level. METHODS: We designed a retrospective observational cohort study with rapid data gathering from the international COVID-19 Critical Care Consortium study to comprehensively describe CRS, calculated as tidal volume/[airway plateau pressure - positive end-expiratory pressure (PEEP)], and its association with ventilatory management and outcomes of COVID-19 patients on mechanical ventilation (MV) admitted to intensive care units (ICU) worldwide. RESULTS: We studied 745 patients from 22 countries who required admission to the ICU and MV from January 14 to December 31, 2020, and presented at least one value of CRS within the first seven days of MV. Median (IQR) age was 62 (52-71), and patients were predominantly male (68%) and from Europe or North and South America (88%). CRS within 48 h from endotracheal intubation was available in 649 patients and was associated neither with the duration from onset of symptoms to commencement of MV (p = 0.417) nor with PaO2/FiO2 (p = 0.100). Females presented lower CRS than males (95% CI of the female-male CRS difference: -11.8 to -7.4 mL/cmH2O, p < 0.001), and although females presented higher body mass index (BMI), the association of BMI with CRS was marginal (p = 0.139). Ventilatory management varied across the CRS range, resulting in a significant association between CRS and driving pressure (estimated decrease -0.31 cmH2O/L per mL/cmH2O of CRS, 95% CI -0.48 to -0.14, p < 0.001). Overall, 28-day ICU mortality, accounting for the competing risk of being discharged within the period, was 35.6% (SE 1.7).
Cox proportional hazards analysis demonstrated that CRS (+10 mL/cmH2O) was associated only with being discharged from the ICU within 28 days (HR 1.14, 95% CI 1.02-1.28, p = 0.018). CONCLUSIONS: This multicentre report provides a comprehensive account of CRS in COVID-19 patients on MV. CRS measured within 48 h from commencement of MV has marginal predictive value for 28-day mortality but was associated with being discharged from the ICU within the same period. Trial documentation: Available at https://www.covid-critical.com/study . TRIAL REGISTRATION: ACTRN12620000421932.
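The compliance definition used throughout, tidal volume divided by driving pressure, is straightforward to compute. A small sketch with made-up ventilator readings (the numbers are illustrative, not patient data):

```python
# Static respiratory system compliance as defined in the study:
# CRS = tidal volume / (airway plateau pressure - PEEP),
# where the denominator is the driving pressure. Values are illustrative.

def crs(tidal_volume_ml, plateau_cmh2o, peep_cmh2o):
    driving_pressure = plateau_cmh2o - peep_cmh2o
    if driving_pressure <= 0:
        raise ValueError("plateau pressure must exceed PEEP")
    return tidal_volume_ml / driving_pressure       # mL/cmH2O

# Example: VT 420 mL, plateau 26 cmH2O, PEEP 12 cmH2O
value = crs(420, 26, 12)
print(round(value, 1))  # 30.0 mL/cmH2O
```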
Subject(s)
COVID-19/complications , COVID-19/therapy , Lung Compliance/physiology , Respiration, Artificial/methods , Respiratory Distress Syndrome/etiology , Respiratory Distress Syndrome/therapy , Adult , Cohort Studies , Critical Care/methods , Europe , Female , Humans , Intensive Care Units , Male , Middle Aged , Retrospective Studies , Severity of Illness Index
ABSTRACT
Anticipating future changes of an ecosystem's dynamics requires knowledge of how its key communities respond to current environmental regimes. The Great Barrier Reef (GBR) is under threat, with rapid changes of its reef-building hard coral (HC) community structure already evident across broad spatial scales. While several underlying relationships between HC and multiple disturbances have been documented, responses of other benthic communities to disturbances are not well understood. Here we used statistical modelling to explore the effects of broad-scale climate-related disturbances on benthic communities to predict their structure under scenarios of increasing disturbance frequency. We parameterized a multivariate model using the composition of benthic communities estimated from 145,000 observations from the northern GBR between 2012 and 2017. During this time, surveyed reefs were variously impacted by two tropical cyclones and two heat stress events that resulted in extensive HC mortality. This unprecedented sequence of disturbances was used to estimate the effects of discrete versus interacting disturbances on the compositional structure of HC, soft corals (SC) and algae. Discrete disturbances increased the prevalence of algae relative to HC, while the interaction between cyclones and heat stress was the main driver of the increase in SC relative to algae and HC. Predictions from disturbance scenarios included relative increases in algae versus SC that varied by the frequency and types of disturbance interactions. However, the high uncertainty of compositional changes in the presence of several disturbances shows that the responses of algae and SC to the decline in HC need further research. Better understanding of the effects of multiple disturbances on benthic communities as a whole is essential for predicting the future status of coral reefs and managing them in the light of new environmental regimes.
The approach we develop here opens new opportunities for reaching this goal.
Subject(s)
Anthozoa , Cyclonic Storms , Animals , Coral Reefs , Ecosystem
ABSTRACT
Identification of biomarkers is an emerging area in oncology. In this article, we develop an efficient statistical procedure for the classification of protein markers according to their effect on cancer progression. A high-dimensional time-course dataset of protein markers for 80 patients motivated the development of the model. The threshold value is formulated as the level of a marker having maximum impact on cancer progression. A classification algorithm for high-dimensional time-course data is developed, and the algorithm is validated by comparing random components using both proportional hazards and accelerated failure time frailty models. The study elucidates the application of two separate joint modeling techniques, using an autoregressive-type model and a mixed-effects model for the time-course data and a proportional hazards model for the survival data, within a Bayesian framework. In addition, a prognostic score is developed on the basis of a few selected genes and applied to patients. This study facilitates the identification of relevant biomarkers from a set of candidate markers.
Subject(s)
Algorithms , Medical Oncology , Bayes Theorem , Biomarkers , Humans , Proportional Hazards Models
ABSTRACT
Anomaly detection (AD) in high-volume environmental data requires one to tackle a series of challenges associated with the typically low frequency of anomalous events, the broad range of possible anomaly types, and local nonstationary environmental conditions, suggesting the need for flexible statistical methods that are able to cope with unbalanced high-volume data problems. Here, we aimed to detect anomalies caused by technical errors in water-quality (turbidity and conductivity) data collected by automated in situ sensors deployed in contrasting riverine and estuarine environments. We first applied a range of artificial neural networks that differed in both learning method and hyperparameter values, then calibrated models using a Bayesian multiobjective optimization procedure, and finally selected and evaluated the "best" model for each water-quality variable, environment, and anomaly type. We found that semi-supervised classification was better able to detect sudden spikes, sudden shifts, and small sudden spikes, whereas supervised classification had higher accuracy for predicting long-term anomalies associated with drifts and periods of otherwise unexplained high variability.
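To make the "sudden spike" anomaly type concrete, here is a deliberately simple rule-based detector on a toy turbidity series: a point is flagged when it deviates from a rolling median by more than a multiple of the rolling median absolute deviation. This is only an illustrative baseline; the study itself trained supervised and semi-supervised neural network classifiers, which this sketch does not attempt to reproduce:

```python
# Toy "sudden spike" detector: flag points deviating from a rolling median
# by more than k times the rolling MAD. Illustrative baseline only; the
# study used neural network classifiers. Data below are invented.

def spikes(series, window=5, k=5.0):
    flagged = []
    half = window // 2
    for i, v in enumerate(series):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        neigh = sorted(series[lo:hi])
        med = neigh[len(neigh) // 2]                     # rolling median
        mad = sorted(abs(u - med) for u in neigh)[len(neigh) // 2]
        if abs(v - med) > k * max(mad, 0.1):             # floor avoids zero MAD
            flagged.append(i)
    return flagged

turbidity = [1.0, 1.1, 0.9, 1.0, 9.5, 1.1, 1.0, 0.9, 1.2, 1.0]
print(spikes(turbidity))   # index of the sudden spike
```

Drifts and long-term anomalies would defeat such a local rule, which is one reason the study turned to supervised learning for those anomaly types.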
Subject(s)
Neural Networks, Computer , Water , Bayes Theorem , Water Quality
ABSTRACT
BACKGROUND: In medical research, explanatory continuous variables are frequently transformed or converted into categorical variables. If the coding is unknown, many tests can be used to identify the "optimal" transformation. This common process, which involves multiple testing, requires a correction of the significance level. Liquet and Commenges proposed an asymptotic correction of the significance level in the context of generalized linear models (GLM) (Liquet and Commenges, Stat Probab Lett 71:33-38, 2005). This procedure has been developed for dichotomous and Box-Cox transformations. Furthermore, Liquet and Riou suggested the use of resampling methods to estimate the significance level for transformations into categorical variables with more than two levels (Liquet and Riou, BMC Med Res Methodol 13:75, 2013). RESULTS: CPMCGLM provides users with both methods of p-value adjustment. Furthermore, they are available for a large set of transformations. This paper aims to give the user an overview of the methodological context and to explain in detail the use of the CPMCGLM R package through its application to a real epidemiological dataset. CONCLUSION: We present here the CPMCGLM R package, which provides efficient methods for the correction of the type-I error rate in the context of generalized linear models. This is the first and only R package providing such methods in this context. It is designed to help researchers, principally in the fields of biostatistics and epidemiology, to analyze their data in the context of optimal cutoff point determination.
Subject(s)
Algorithms , Biometry/methods , Computational Biology/methods , Linear Models , Cholesterol, HDL/blood , Dementia/blood , Female , Humans , Male , Reproducibility of Results
ABSTRACT
Recent prospective studies have shown that dysregulation of the immune system may precede the development of B-cell lymphomas (BCL) in immunocompetent individuals. However, to date, these studies were restricted to a few immune markers, which were considered separately. Using a nested case-control study within two European prospective cohorts, we measured plasma levels of 28 immune markers in samples collected a median of 6 years before diagnosis (range 2.01-15.97) in 268 incident cases of BCL (including multiple myeloma [MM]) and matched controls. Linear mixed models and partial least squares analyses were used to analyze the association between immune marker levels and the incidence of BCL and its main histological subtypes, and to investigate potential biomarkers predictive of the time to diagnosis. Linear mixed model analyses identified associations between lower levels of fibroblast growth factor-2 (FGF-2, p = 7.2 × 10⁻⁴) and transforming growth factor alpha (TGF-α, p = 6.5 × 10⁻⁵) and BCL incidence. Analyses stratified by histological subtype identified inverse associations for the MM subtype, including FGF-2 (p = 7.8 × 10⁻⁷), TGF-α (p = 4.08 × 10⁻⁵), fractalkine (p = 1.12 × 10⁻³), monocyte chemotactic protein-3 (p = 1.36 × 10⁻⁴), macrophage inflammatory protein 1-alpha (p = 4.6 × 10⁻⁴) and vascular endothelial growth factor (p = 4.23 × 10⁻⁵). Our results also provided marginal support for previously reported associations between chemokines and diffuse large BCL (DLBCL) and between cytokines and chronic lymphocytic leukemia (CLL). Case-only analyses showed that granulocyte-macrophage colony-stimulating factor levels were consistently higher closer to diagnosis, which provides further evidence of its role in tumor progression. In conclusion, our study suggests a role of growth factors in the incidence of MM and of chemokine and cytokine regulation in DLBCL and CLL.
Subject(s)
Biomarkers/blood , Lymphoma, Large B-Cell, Diffuse/blood , Multiple Myeloma/blood , Adult , Aged , Case-Control Studies , Chemokine CCL7/blood , Chemokine CX3CL1/blood , Europe , Female , Fibroblast Growth Factor 2/blood , Follow-Up Studies , Humans , Incidence , Lymphoma, Large B-Cell, Diffuse/diagnosis , Lymphoma, Large B-Cell, Diffuse/epidemiology , Lymphoma, Large B-Cell, Diffuse/immunology , Male , Middle Aged , Multiple Myeloma/diagnosis , Multiple Myeloma/epidemiology , Multiple Myeloma/immunology , Multivariate Analysis , Prognosis , Prospective Studies , Transforming Growth Factor alpha/blood , Vascular Endothelial Growth Factor A/blood
ABSTRACT
Integrative analysis of high dimensional omics datasets has been studied by many authors in recent years. By incorporating prior known relationships among the variables, these analyses have been successful in elucidating the relationships between different sets of omics data. In this article, our goal is to identify important relationships between genomic expression and cytokine data from a human immunodeficiency virus vaccine trial. We proposed a flexible partial least squares technique, which incorporates group and subgroup structure in the modelling process. Our new method accounts for both grouping of genetic markers (eg, gene sets) and temporal effects. The method generalises existing sparse modelling techniques in the partial least squares methodology and establishes theoretical connections to variable selection methods for supervised and unsupervised problems. Simulation studies are performed to investigate the performance of our methods over alternative sparse approaches. Our R package sgspls is available at https://github.com/matt-sutton/sgspls.
Subject(s)
Least-Squares Analysis , Models, Statistical , AIDS Vaccines/therapeutic use , Algorithms , Biostatistics , Clinical Trials as Topic/statistics & numerical data , Computer Simulation , Genomics/methods , Humans , Likelihood Functions , Multivariate Analysis , Regression Analysis
ABSTRACT
MOTIVATION: The association between two blocks of 'omics' data brings challenging issues in computational biology due to their size and complexity. Here, we focus on a class of multivariate statistical methods called partial least squares (PLS). The sparse version of PLS (sPLS) integrates two datasets while simultaneously selecting the contributing variables. However, these methods do not take into account important structural or group effects arising from the relationships between markers within biological pathways. Hence, considering predefined groups of markers (e.g. gene sets) could improve the relevance and efficacy of the PLS approach. RESULTS: We propose two PLS extensions called group PLS (gPLS) and sparse gPLS (sgPLS). Our algorithm enables the study of the relationship between two different types of omics data (e.g. SNP and gene expression) or between an omics dataset and multivariate phenotypes (e.g. cytokine secretion). We demonstrate the good performance of gPLS and sgPLS compared with sPLS in the context of grouped data. These methods are then compared on data from an HIV therapeutic vaccine trial. Our approaches provide parsimonious models that reveal the relationship between gene abundance and the immunological response to the vaccine. AVAILABILITY AND IMPLEMENTATION: The approach is implemented in a comprehensive R package called sgPLS available on CRAN. CONTACT: b.liquet@uq.edu.au SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
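The group selection in gPLS/sgPLS rests on a groupwise soft-thresholding of the PLS loading vector: a whole group of weights is shrunk by its Euclidean norm, so weak groups vanish entirely. A toy sketch of that operator follows; the grouping and numbers are invented, and the real sgPLS update also involves the SVD of the cross-covariance matrix, which is omitted here:

```python
import math

# Groupwise soft-thresholding: each group u_g of loadings is scaled by
# max(0, 1 - lam / ||u_g||_2), so weak groups are zeroed out entirely.

def group_soft_threshold(u, groups, lam):
    out = list(u)
    for idx in groups:
        norm = math.sqrt(sum(u[i] ** 2 for i in idx))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * u[i]
    return out

# Loading vector with two hypothetical "gene sets": one strong, one weak
u = [0.8, 0.6, 0.1, -0.05]
groups = [(0, 1), (2, 3)]              # variable indices of each group

shrunk = group_soft_threshold(u, groups, lam=0.3)
print([round(v, 3) for v in shrunk])   # weak group (last two entries) zeroed
```

Within-group sparsity (the "s" in sgPLS) would add a second, elementwise shrinkage inside each surviving group.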
Subject(s)
Algorithms , Genomics/methods , AIDS Vaccines/immunology , Computer Simulation , Humans , Least-Squares Analysis , Sample Size
ABSTRACT
Multiple endpoints are increasingly used in clinical trials. The significance of some of these clinical trials is established if at least r null hypotheses are rejected among the m that are simultaneously tested. The usual approach in multiple hypothesis testing is to control the family-wise error rate, defined as the probability of making at least one type-I error. More recently, the q-generalized family-wise error rate has been introduced to control the probability of making at least q false rejections. For procedures controlling this global type-I error rate, we define a type-II r-generalized family-wise error rate, which is directly related to the r-power, defined as the probability of rejecting at least r false null hypotheses. We obtain very general power formulas that can be used to compute the sample size for single-step and step-wise procedures. These are implemented in our R package rPowerSampleSize, available on CRAN, making them directly available to end users. The computational complexity of the formulas is discussed to give insight into computation time issues, and a comparison with a Monte Carlo strategy is also presented. We compute sample sizes for two clinical trials involving multiple endpoints: one designed to investigate the effectiveness of a drug against acute heart failure, and the other the immunogenicity of a vaccine strategy against pneumococcus. Copyright © 2016 John Wiley & Sons, Ltd.
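The r-power, the probability of rejecting at least r of the false null hypotheses, can always be estimated by simulation, which is the Monte Carlo strategy the paper compares against its exact formulas. A hedged sketch under a simple single-step Bonferroni procedure with one-sided z-tests (the effect size, m, r and critical value are arbitrary choices for illustration, and the paper's formulas cover far more general procedures):

```python
import random

# Monte Carlo estimate of r-power: the probability of rejecting at least
# r false null hypotheses among m endpoints, here with a single-step
# Bonferroni procedure on one-sided z-tests. All settings are illustrative.

def r_power(m, r, delta, z_crit, n_sim=5000, seed=7):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # each endpoint's statistic ~ N(delta, 1) under the alternative
        rejections = sum(1 for _ in range(m) if rng.gauss(delta, 1.0) > z_crit)
        hits += rejections >= r
    return hits / n_sim

# Bonferroni: per-test alpha = 0.05 / 5 = 0.01 one-sided -> z_crit ~ 2.326
power = r_power(m=5, r=2, delta=3.0, z_crit=2.326)
print(round(power, 3))   # high for this strong common effect size
```

In a sample-size calculation one would increase n (and hence delta, which scales with the square root of n) until this estimate reaches the target r-power.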
Subject(s)
Research Design , Sample Size , Humans , Monte Carlo Method , Probability
ABSTRACT
Genome-wide association studies (GWAS) have yielded significant advances in defining the genetic architecture of complex traits and diseases. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS, where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex linkage disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and to study the genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175) compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Among the new findings provided by GUESS, we revealed a strong association of SORT1 with the TG-APOB phenotypic group and of LIPC with the TG-HDL phenotypic group, associations that were overlooked in the larger meta-GWAS and not revealed by competing approaches, and that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on graphics processing units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants.
This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Subject(s)
Algorithms , Biological Evolution , Genome-Wide Association Study , Quantitative Trait Loci/genetics , Bayes Theorem , Exome/genetics , Gene Expression , Humans , Linkage Disequilibrium , Phenotype , Polymorphism, Single Nucleotide/genetics
ABSTRACT
Technological advances in molecular biology over the past decade have given rise to high-dimensional and complex datasets, offering the possibility to investigate biological associations between a range of genomic features and complex phenotypes. The analysis of this novel type of data has generated unprecedented computational challenges, which ultimately led to the definition and implementation of computationally efficient statistical models able to scale to genome-wide data, including Bayesian variable selection approaches. While extensive methodological work has been carried out in this area, only a few methods capable of handling hundreds of thousands of predictors have been implemented and distributed. Among these, we recently proposed GUESS, a computationally optimised algorithm making use of graphics processing unit capabilities, which can accommodate multiple outcomes. In this paper we propose R2GUESS, an R package wrapping the original C++ source code. In addition to providing a user-friendly interface to the original code, automating its parametrisation and data handling, R2GUESS also incorporates many features to explore the data, to extend statistical inference from the native algorithm (e.g., effect size estimation, significance assessment), and to visualize its outputs. We first detail the model and its parametrisation, and describe its optimised implementation in detail. Based on two examples, we finally illustrate its statistical performance and flexibility.
ABSTRACT
The use of two or more primary correlated endpoints is becoming increasingly common. A mandatory approach when analyzing data from such clinical trials is to control the family-wise error rate (FWER). In this context, we provide formulas for computation of sample size and for data analysis. Two approaches are discussed: an individual method based on a union-intersection procedure and a global procedure, based on a multivariate model that can take into account adjustment variables. These methods are illustrated with simulation studies and applications. An R package known as rPowerSampleSize is also available.
Subject(s)
Clinical Trials as Topic , Computer Simulation , Endpoint Determination/methods , Clinical Trials as Topic/statistics & numerical data , Computer Simulation/statistics & numerical data , Endpoint Determination/statistics & numerical data , Humans , Sample Size
ABSTRACT
Through spectral unmixing, hyperspectral imaging (HSI) in fluorescence-guided brain tumor surgery has enabled the detection and classification of tumor regions invisible to the human eye. Prior unmixing work has focused on determining a minimal set of viable fluorophore spectra known to be present in the brain and on effectively reconstructing human data without overfitting. With these endmembers, non-negative least squares regression (NNLS) was commonly used to compute the abundances. However, HSI images are heterogeneous, so one small set of endmember spectra may not fit all pixels well. Additionally, NNLS is the maximum likelihood estimator only if the measurement noise is normally distributed, and it does not enforce sparsity, which leads to overfitting and unphysical results. In this paper, we analyzed 555,666 HSI fluorescence spectra from 891 ex vivo measurements of patients with various brain tumors to show that a Poisson distribution models the measured data 82% better than a Gaussian in terms of the Kullback-Leibler divergence, and that the endmember abundance vectors are sparse. With this knowledge, we introduce (1) a library of 9 endmember spectra, including PpIX (620 nm and 634 nm photostates), NADH, FAD, flavins, lipofuscin, melanin, elastin, and collagen, (2) a sparse, non-negative Poisson regression algorithm to perform physics-informed unmixing with this library without overfitting, and (3) a highly realistic spectral measurement simulation with known endmember abundances. The new unmixing method was then tested on the human and simulated data and compared with four other candidate methods. It outperforms previous methods with 25% lower error in the computed abundances on the simulated data than NNLS, lower reconstruction error on human data, better sparsity, and a 31 times faster runtime than state-of-the-art Poisson regression.
This method and library of endmember spectra can enable more accurate spectral unmixing, better aiding the surgeon during brain tumor resection.
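Sparse non-negative Poisson regression of this kind can be sketched with multiplicative (Richardson-Lucy style) updates, which keep abundances non-negative by construction and accommodate an L1 penalty in the denominator. The two "endmember" spectra below are invented stand-ins, not the paper's fluorophore library, and the paper's actual algorithm may differ:

```python
# Multiplicative-update sketch of sparse non-negative Poisson regression:
# minimise the Poisson negative log-likelihood of y ~ Poisson(E a) plus an
# L1 penalty lam * sum(a). Updates keep the abundances a >= 0 automatically.
# The endmember matrix E and measurement y below are invented examples.

def poisson_unmix(y, E, lam=0.0, n_iter=500):
    n, k = len(E), len(E[0])
    a = [1.0] * k                                # positive initialisation
    col_sum = [sum(E[i][j] for i in range(n)) for j in range(k)]
    for _ in range(n_iter):
        fit = [sum(E[i][j] * a[j] for j in range(k)) for i in range(n)]
        for j in range(k):
            num = sum(E[i][j] * y[i] / fit[i] for i in range(n))
            a[j] *= num / (col_sum[j] + lam)     # lam > 0 promotes sparsity
    return a

# Two hypothetical endmember spectra over five wavelength channels
E = [[1.0, 0.1],
     [0.8, 0.3],
     [0.5, 0.6],
     [0.2, 0.9],
     [0.1, 1.0]]
true_a = [2.0, 0.5]
y = [sum(E[i][j] * true_a[j] for j in range(2)) for i in range(5)]  # noise-free

a_hat = poisson_unmix(y, E, lam=0.0)
print([round(v, 2) for v in a_hat])    # approximately recovers true_a
```

With Poisson-distributed counts instead of this noise-free toy input, the same updates maximise the Poisson likelihood, which is the model the paper argues fits HSI fluorescence data better than a Gaussian.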
ABSTRACT
BACKGROUND: In statistical modeling, finding the most favorable coding for an explanatory quantitative variable involves many tests. This process raises a multiple testing problem and requires correction of the significance level. METHODS: For each coding, a test of the nullity of the coefficient associated with the newly coded variable is computed. The selected coding corresponds to the one associated with the largest test statistic (or, equivalently, the smallest p-value). In the context of the generalized linear model, Liquet and Commenges (Stat Probab Lett 71:33-38, 2005) proposed an asymptotic correction of the significance level. This procedure, based on the score test, has been developed for dichotomous and Box-Cox transformations. In this paper, we suggest the use of resampling methods to estimate the significance level for categorical transformations with more than two levels and, by definition, those that involve more than one parameter in the model. The categorical transformation is a more flexible way to explore the unknown shape of the effect between an explanatory and a dependent variable. RESULTS: The simulations we ran in this study showed good performance of the proposed methods. These methods were illustrated using data from a study of the relationship between cholesterol and dementia. CONCLUSION: The algorithms were implemented in R, and the associated CPMCGLM R package is available on CRAN.
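The resampling correction described here amounts to comparing the observed maximum test statistic over all candidate codings with its permutation distribution. A simplified sketch follows, using the squared correlation with a dichotomisation indicator as the per-coding statistic; the data and cutpoints are invented, and the actual package works with score tests in the full GLM framework, which this toy version does not implement:

```python
import random

# Permutation-based adjusted p-value for the "best coding" of a continuous
# covariate: compute a statistic for every candidate dichotomisation, keep
# the maximum, and compare it with the maxima obtained under permuted
# outcomes. Data and cutoffs are invented; the real method uses score tests.

def stat(x, y, cut):
    """Squared correlation between y and the indicator 1{x > cut}."""
    g = [1.0 if v > cut else 0.0 for v in x]
    n = len(x)
    mg, my = sum(g) / n, sum(y) / n
    sxy = sum((a - mg) * (b - my) for a, b in zip(g, y))
    sxx = sum((a - mg) ** 2 for a in g)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def adjusted_pvalue(x, y, cuts, n_perm=500, seed=3):
    t_obs = max(stat(x, y, c) for c in cuts)       # best coding's statistic
    rng = random.Random(seed)
    yp = list(y)
    worse = 0
    for _ in range(n_perm):
        rng.shuffle(yp)
        worse += max(stat(x, yp, c) for c in cuts) >= t_obs
    return (worse + 1) / (n_perm + 1)              # add-one permutation p-value

x = [0.5, 1.2, 1.9, 2.4, 3.1, 3.8, 4.4, 5.0, 5.7, 6.3]
y = [0.0, 0.1, 0.3, 0.2, 0.8, 1.1, 0.9, 1.3, 1.2, 1.5]   # increases with x
p_adj = adjusted_pvalue(x, y, cuts=[2.0, 3.0, 4.0])
print(p_adj)   # small: the association survives the multiplicity correction
```

Taking the maximum over codings inside each permutation is what makes the adjusted p-value honest about the implicit multiple testing, which a naive minimum p-value over codings would ignore.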