Results 1 - 20 of 28
1.
Biom J ; 66(4): e2200334, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38747086

ABSTRACT

Many data sets exhibit a natural group structure due to contextual similarities or high correlations of variables, such as lipid markers that are interrelated based on biochemical principles. Knowledge of such groupings can be used through bi-level selection methods to identify relevant feature groups and highlight their predictive members. One of the best known approaches of this kind combines the classical Least Absolute Shrinkage and Selection Operator (LASSO) with the Group LASSO, resulting in the Sparse Group LASSO (SGL). We propose the Sparse Group Penalty (SGP) framework, which allows for a flexible combination of different SGL-style shrinkage conditions. Analogous to SGL, we investigated the combination of the Smoothly Clipped Absolute Deviation (SCAD), the Minimax Concave Penalty (MCP) and the Exponential Penalty (EP) with their group versions, resulting in the Sparse Group SCAD, the Sparse Group MCP, and the novel Sparse Group EP (SGE). These shrinkage operators provide refined control of the effect of group formation on the selection process through a tuning parameter. In simulation studies, SGPs were compared with other bi-level selection methods (Group Bridge, composite MCP, and Group Exponential LASSO) for variable and group selection evaluated with the Matthews correlation coefficient. We demonstrated the advantages of the new SGE in identifying parsimonious models, but also identified scenarios that highlight the limitations of the approach. The performance of the techniques was further investigated in a real-world use case for the selection of regulated lipids in a randomized clinical trial.
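For orientation, the Sparse Group LASSO penalty that this abstract builds on is commonly written as a mix of an element-wise L1 term and a group-wise L2 term. The sketch below evaluates that penalty for a given coefficient vector. The function name, the `alpha` mixing convention, and the square-root group-size weights follow the usual textbook formulation and are assumptions here, not details taken from the paper; the SCAD, MCP, and EP variants are not shown.

```python
import math

def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Evaluate lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2).

    beta   : list of coefficients
    groups : list of index lists, one per group (assumed non-overlapping)
    lam    : overall penalty strength (>= 0)
    alpha  : mixing weight in [0, 1]; 1 gives the lasso, 0 gives the group lasso
    """
    l1_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        # sqrt(group size) weight: an assumed, commonly used convention
        group_term += math.sqrt(len(idx)) * group_norm
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)

# Toy example: the first group is active, the second is entirely zero.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
penalty = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5)
lasso_only = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=1.0)
```

Setting `alpha` closer to 1 favours sparsity within groups, while values closer to 0 favour dropping whole groups; this is the kind of trade-off the tuning parameter mentioned in the abstract controls.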


Subject(s)
Biometry , Biometry/methods , Humans
2.
Comput Biol Med ; 172: 108236, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38471351

ABSTRACT

The diagnosis of cancer based on gene expression profile data has attracted extensive attention in the field of biomedical science. This type of data usually has the characteristics of high dimensionality and noise. In this paper, a hybrid gene selection method based on clustering and sparse learning is proposed to choose the key genes with high precision. We first propose a filter method that combines the k-means clustering algorithm with the signal-to-noise ratio ranking method; a weighted gene co-expression network is then applied to the reduced data set to identify modules corresponding to biological pathways. Moreover, we choose the key genes by using group bridge and sparse group lasso as wrapper methods. Finally, we conduct numerical experiments on six cancer datasets. The numerical results show that our proposed method achieves good performance in gene selection and cancer classification.
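The signal-to-noise ratio ranking used in the filter step can be illustrated with a small sketch. It assumes the common two-class definition, |mean_a - mean_b| / (sd_a + sd_b), which the abstract does not spell out. The k-means step is omitted, and all function and gene names are hypothetical.

```python
from statistics import mean, stdev

def snr_score(class_a, class_b):
    """Two-class signal-to-noise ratio |mean_a - mean_b| / (sd_a + sd_b)."""
    diff = abs(mean(class_a) - mean(class_b))
    spread = stdev(class_a) + stdev(class_b)
    if spread == 0.0:
        # No within-class variation: perfectly separating if the means differ.
        return float("inf") if diff > 0 else 0.0
    return diff / spread

def rank_genes_by_snr(expression, labels):
    """Return gene names sorted from highest to lowest SNR.

    expression : dict mapping gene name -> list of expression values, one per sample
    labels     : list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        class_a = [v for v, y in zip(values, labels) if y == 0]
        class_b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = snr_score(class_a, class_b)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data with hypothetical gene names.
expression = {
    "gene_b": [2.0, 3.0, 1.0, 2.5, 1.5, 3.5],  # classes overlap
    "gene_a": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],  # classes well separated
}
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_genes_by_snr(expression, labels)
```

In a filter of this kind, only the top-ranked genes would be passed on to the later network and wrapper stages.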


Subject(s)
Algorithms , Neoplasms , Humans , Gene Regulatory Networks , Neoplasms/genetics , Neoplasms/metabolism , Cluster Analysis
3.
J Am Stat Assoc ; 118(543): 2088-2100, 2023.
Article in English | MEDLINE | ID: mdl-38143787

ABSTRACT

Though Gaussian graphical models have been widely used in many scientific fields, relatively limited progress has been made to link graph structures to external covariates. We propose a Gaussian graphical regression model, which regresses both the mean and the precision matrix of a Gaussian graphical model on covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can determine how genetic variants and clinical conditions modulate the subject-level network structures, and recover both the population-level and subject-level gene networks. Our framework encourages sparsity of covariate effects on both the mean and the precision matrix. In particular for the precision matrix, we stipulate simultaneous sparsity, i.e., group sparsity and element-wise sparsity, on effective covariates and their effects on network edges, respectively. We establish variable selection consistency first under the case with known mean parameters and then a more challenging case with unknown means depending on external covariates, and establish in both cases the ℓ2 convergence rates and the selection consistency of the estimated precision parameters. The utility and efficacy of our proposed method are demonstrated through simulation studies and an application to a co-expression QTL study with brain cancer patients.

4.
BMC Med Res Methodol ; 23(1): 254, 2023 10 28.
Article in English | MEDLINE | ID: mdl-37898791

ABSTRACT

BACKGROUND: A substantial body of clinical research involving individuals infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has evaluated the association between in-hospital biomarkers and severe SARS-CoV-2 outcomes, including intubation and death. However, most existing studies considered each of multiple biomarkers independently and focused analysis on baseline or peak values. METHODS: We propose a two-stage analytic strategy combining functional principal component analysis (FPCA) and sparse-group LASSO (SGL) to characterize associations between biomarkers and 30-day mortality rates. Unlike prior reports, our proposed approach leverages: 1) time-varying biomarker trajectories, 2) multiple biomarkers simultaneously, and 3) the pathophysiological grouping of these biomarkers. We apply this method to a retrospective cohort of 12,941 patients hospitalized at Massachusetts General Hospital or Brigham and Women's Hospital and conduct simulation studies to assess performance. RESULTS: Renal, inflammatory, and cardio-thrombotic biomarkers were associated with 30-day mortality rates among hospitalized SARS-CoV-2 patients. Sex-stratified analysis revealed that hematological biomarkers were associated with higher mortality in men while this association was not identified in women. In simulation studies, our proposed method maintained high true positive rates and outperformed alternative approaches using baseline or peak values only with respect to false positive rates. CONCLUSIONS: The proposed two-stage approach is a robust strategy for identifying biomarkers that associate with disease severity among SARS-CoV-2-infected individuals. By leveraging information on multiple, grouped biomarkers' longitudinal trajectories, our method offers an important first step in unraveling disease etiology and defining meaningful risk strata.


Subject(s)
COVID-19 , SARS-CoV-2 , Male , Humans , Female , Retrospective Studies , Principal Component Analysis , Hospitalization , Biomarkers
5.
Biology (Basel) ; 11(10)2022 Oct 12.
Article in English | MEDLINE | ID: mdl-36290397

ABSTRACT

With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning methods offer the potential for gene (feature) selection from high-dimensional scRNA-seq data. Even though there exist multiple machine learning methods that appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performances across data sets, where each poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) in terms of area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.

6.
Comput Methods Programs Biomed ; 225: 107082, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36055040

ABSTRACT

BACKGROUND AND OBJECTIVE: Functional brain graph (FBG), by describing the interactions between different brain regions, provides an effective representation of fMRI data for identifying mild cognitive impairment (MCI), an early stage of Alzheimer's Disease (AD). Prior to the identification task, selecting features from the estimated FBG is a necessary step for reducing computational cost, alleviating the risk of overfitting, and finding potential biomarkers of brain diseases. In practice, either node-based features (e.g., local clustering coefficients) or edge-based features (e.g., adjacency weights) are generally considered in current studies. Despite their popularity, these schemes can only capture one granularity (node or edge) of information in the FBG, which might be insufficient for the classification task and the interpretation of the classification result. METHODS: To address this issue, in this paper, we propose to jointly select nodes and edges from the estimated FBGs. Specifically, we first assign the edges to different node groups. Then, sparse group least absolute shrinkage and selection operator (sgLASSO) is used to select groups (nodes) and edges in the groups towards a better classification performance. Such a technique enables us to simultaneously locate discriminative brain regions, as well as connections between these brain regions, making the classification results more interpretable. RESULTS: Experimental results show that the proposed method achieves better classification performance than state-of-the-art methods. Moreover, by exploring brain network "features" that contributed most to MCI identification, we discover potential biomarkers for MCI diagnosis. CONCLUSION: A novel method for jointly selecting nodes and edges from the estimated functional brain graphs (FBGs) is proposed.


Subject(s)
Alzheimer Disease , Cognitive Dysfunction , Alzheimer Disease/diagnostic imaging , Biomarkers , Brain/diagnostic imaging , Cognitive Dysfunction/diagnostic imaging , Humans , Magnetic Resonance Imaging/methods
7.
J Comput Chem ; 43(20): 1342-1354, 2022 07 30.
Article in English | MEDLINE | ID: mdl-35656889

ABSTRACT

Machine learning methods have helped to advance a wide range of scientific and technological fields in recent years, including computational chemistry. As chemical systems can become complex and high-dimensional, feature selection can be critical but challenging for developing reliable machine learning based prediction models, especially for proteins as bio-macromolecules. In this study, we applied the sparse group lasso (SGL) method as a general feature selection method to develop a classification model for an allosteric protein in different functional states. This results in a much improved model with comparable accuracy (Acc) and only 28 selected features, compared to 289 selected features in a previous study. The Acc reaches 91.50% with 1936 selected features, which is far higher than that of baseline methods. In addition, grouping protein amino acids into secondary structures provides additional interpretability of the selected features. The selected features are verified as associated with key allosteric residues through comparison with both experimental and computational works about the model protein, and demonstrate the effectiveness and necessity of applying rigorous feature selection and evaluation methods on complex chemical systems.


Subject(s)
Machine Learning , Proteins , Algorithms , Proteins/chemistry
8.
IEEE Trans Inf Theory ; 68(9): 5975-6002, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36865503

ABSTRACT

We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model - an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

9.
Sensors (Basel) ; 23(1)2022 Dec 21.
Article in English | MEDLINE | ID: mdl-36616640

ABSTRACT

Accurate prediction of aviation safety levels is significant for the efficient early warning and prevention of incidents. However, the causal mechanism and temporal character of aviation accidents are complex and not fully understood, which increases the operation cost of accurate aviation safety prediction. This paper adopts an innovative statistical method involving a least absolute shrinkage and selection operator (LASSO) and long short-term memory (LSTM). We compiled and calculated 138 monthly aviation insecure events collected from the Aviation Safety Reporting System (ASRS) and took minor accidents as the predictor. Firstly, this paper introduced the group variables and the weight matrix into LASSO to realize the adaptive variable selection. Furthermore, it took the selected variable into multistep stacked LSTM (MSSLSTM) to predict the monthly accidents in 2020. Finally, the proposed method was compared with multiple existing variable selection and prediction methods. The results demonstrate that the RMSE (root mean square error) of the MSSLSTM is reduced by 41.98%, compared with the original model; on the other hand, the key variable selected by the adaptive sparse group lasso (ADSGL) can reduce the elapsed time by 42.67% (13 s). This shows that aviation safety prediction based on ADSGL and MSSLSTM can improve the prediction efficiency of the model while keeping excellent generalization ability and robustness.


Subject(s)
Accidents, Aviation , Aviation , Accidents , Accidents, Aviation/prevention & control
10.
Comput Biol Med ; 141: 105154, 2022 02.
Article in English | MEDLINE | ID: mdl-34952336

ABSTRACT

Cancer diagnosis based on gene expression profile data has attracted extensive attention in computational biology and medicine. It suffers from three challenges in practical applications: noise, gene grouping, and adaptive gene selection. This paper aims to solve the above problems by developing the logistic regression with adaptive sparse group lasso penalty (LR-ASGL). A noise information processing method for cancer gene expression profile data is first presented via robust principal component analysis. Genes are then divided into groups by performing weighted gene co-expression network analysis on the clean matrix. By approximating the relative value of the noise size, gene reliability criterion and robust evaluation criterion are proposed. Finally, LR-ASGL is presented for simultaneous cancer diagnosis and adaptive gene selection. The performance of the proposed method is compared with the other four methods in three simulation settings: Gaussian noise, uniformly distributed noise, and mixed noise. The acute leukemia data are adopted as an experimental example to demonstrate the advantages of LR-ASGL in prediction and gene selection.


Subject(s)
Leukemia , Neoplasms , Computational Biology/methods , Humans , Leukemia/diagnosis , Leukemia/genetics , Logistic Models , Neoplasms/metabolism , Reproducibility of Results
11.
Can J Stat ; 49(1): 182-202, 2021 Mar.
Article in English | MEDLINE | ID: mdl-34566241

ABSTRACT

A multi-stage variable selection method is introduced for detecting association signals in structured brain-wide and genome-wide association studies (brain-GWAS). Compared to conventional single-voxel-to-single-SNP approaches, our approach is more efficient and powerful in selecting the important signals by integrating anatomic and gene grouping structures in the brain and the genome, respectively. It avoids a large number of multiple comparisons while effectively controlling false discoveries. Validity of the proposed approach is demonstrated by both theoretical investigation and numerical simulations. We apply the proposed method to a brain-GWAS using ADNI PET imaging and genomic data. We confirm previously reported association signals and also find several novel SNPs and genes that either are associated with brain glucose metabolism or have their association significantly modified by Alzheimer's disease status.

12.
Stat Med ; 40(20): 4473-4491, 2021 09 10.
Article in English | MEDLINE | ID: mdl-34031919

ABSTRACT

This article concerns robust modeling of the survival time for cancer patients. Accurate prediction of patient survival time is crucial to the development of effective therapeutic strategies. To this end, we propose a unified Expectation-Maximization approach combined with the L1-norm penalty to perform variable selection and parameter estimation simultaneously in the accelerated failure time model with right-censored survival data of moderate sizes. Our approach accommodates general loss functions, and reduces to the well-known Buckley-James method when the squared-error loss is used without regularization. To mitigate the effects of outliers and heavy-tailed noise in real applications, we recommend the use of robust loss functions under the general framework. Furthermore, our approach can be extended to incorporate group structure among covariates. We conduct extensive simulation studies to assess the performance of the proposed methods with different loss functions and apply them to an ovarian carcinoma study as an illustration.


Subject(s)
Computer Simulation , Neoplasms/mortality , Humans , Survival Analysis
13.
Entropy (Basel) ; 22(11)2020 Nov 05.
Article in English | MEDLINE | ID: mdl-33287025

ABSTRACT

Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming data piling problems in high-dimensional settings. Especially when various sparsity structures are assumed in these settings, variable selection in multicategory classification poses great challenges. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive minimizer uniqueness for the penalized MgDWD loss function and consistency properties for the proposed classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.
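The proximal operator of a sparse group lasso penalty has a well-known closed form: soft-threshold each coefficient, then shrink each group as a whole. The sketch below shows that generic operator only. It is not the MgDWD algorithm from the paper, whose loss and update details are not given in the abstract. The function names are hypothetical, and `lam_l1` and `lam_group` are assumed to already include any step size or group weights.

```python
import math

def soft_threshold(x, t):
    """Scalar soft-thresholding: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def prox_sparse_group_lasso(v, groups, lam_l1, lam_group):
    """Proximal operator of lam_l1 * ||.||_1 + lam_group * sum_g ||._g||_2.

    v      : list of values (for example, the result of a gradient step)
    groups : list of index lists that partition the indices of v
    """
    out = [0.0] * len(v)
    for idx in groups:
        # Step 1: element-wise soft-thresholding (within-group sparsity).
        shrunk = [soft_threshold(v[i], lam_l1) for i in idx]
        # Step 2: group-wise shrinkage; the whole group is zeroed when its
        # norm does not exceed lam_group (group sparsity).
        norm = math.sqrt(sum(s * s for s in shrunk))
        if norm > lam_group:
            scale = 1.0 - lam_group / norm
            for i, s in zip(idx, shrunk):
                out[i] = scale * s
    return out

v = [3.0, -4.0, 0.5, -0.2]
groups = [[0, 1], [2, 3]]
result = prox_sparse_group_lasso(v, groups, lam_l1=1.0, lam_group=1.0)
```

In a proximal gradient scheme, this operator would be applied after each gradient step on the smooth loss; in the toy call, the second group is removed entirely while the first is only shrunk.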

14.
Hum Brain Mapp ; 41(17): 4997-5014, 2020 12.
Article in English | MEDLINE | ID: mdl-32813309

ABSTRACT

Major depressive disorder (MDD) is a leading cause of disability; its symptoms interfere with social, occupational, interpersonal, and academic functioning. However, the diagnosis of MDD is still made by a phenomenological approach. The advent of neuroimaging techniques allowed numerous studies to use resting-state functional magnetic resonance imaging (rs-fMRI) and estimate functional connectivity for brain-disease identification. Recently, attempts have been made to investigate effective connectivity (EC) that represents causal relations among regions of interest. In the meantime, to identify meaningful phenotypes for clinical diagnosis, graph-based approaches such as graph convolutional networks (GCNs) have been leveraged recently to explore complex pairwise similarities in imaging/nonimaging features among subjects. In this study, we validate the use of EC for MDD identification by estimating its measures via a group sparse representation along with a structured equation modeling approach in a whole-brain data-driven manner from rs-fMRI. To distinguish drug-naïve MDD patients from healthy controls, we utilize spectral GCNs based on a population graph to successfully integrate EC and nonimaging phenotypic information. Furthermore, we devise a novel sensitivity analysis method to investigate the discriminant connections for MDD identification in our trained GCNs. Our experimental results validated the effectiveness of our method in various scenarios, and we identified altered connectivities associated with the diagnosis of MDD.


Subject(s)
Cerebral Cortex/physiopathology , Connectome/methods , Deep Learning , Depressive Disorder, Major/diagnostic imaging , Depressive Disorder, Major/physiopathology , Magnetic Resonance Imaging/methods , Nerve Net/physiopathology , Adult , Cerebral Cortex/diagnostic imaging , Female , Humans , Male , Middle Aged , Nerve Net/diagnostic imaging , Prospective Studies , Young Adult
15.
Genet Epidemiol ; 44(5): 408-424, 2020 07.
Article in English | MEDLINE | ID: mdl-32342572

ABSTRACT

Mediation analysis attempts to determine whether the relationship between an independent variable (e.g., exposure) and an outcome variable can be explained, at least partially, by an intermediate variable, called a mediator. Most methods for mediation analysis focus on one mediator at a time, although multiple mediators can be jointly analyzed by structural equation models (SEMs) that account for correlations among the mediators. We extend the use of SEMs for the analysis of multiple mediators by creating a sparse group lasso penalized model such that the penalty considers the natural groupings of parameters that determine mediation, as well as encourages sparseness of the model parameters. This provides a way to simultaneously evaluate many mediators and select those that have the most impact, a feature of modern penalized models. Simulations are used to illustrate the benefits and limitations of our approach, and application to a study of DNA methylation and reactive cortisol stress following childhood trauma discovered two novel methylation loci that mediate the association of childhood trauma scores with reactive cortisol stress levels. Our new methods are incorporated into R software called regmed.
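A minimal sketch of why sparsity selects mediators, assuming the common linear product-of-coefficients formulation (not spelled out in the abstract): each mediator has an exposure-to-mediator coefficient and a mediator-to-outcome coefficient, and its mediated effect is their product, so it vanishes when either coefficient is shrunk to zero. The code does not call the regmed package, and all names and values are hypothetical.

```python
def indirect_effects(alpha, beta):
    """Mediated effect per mediator: alpha[m] * beta[m].

    alpha : dict, mediator name -> exposure-to-mediator coefficient
    beta  : dict, mediator name -> mediator-to-outcome coefficient
    """
    return {name: alpha[name] * beta[name] for name in alpha}

def selected_mediators(alpha, beta, tol=1e-8):
    """Mediators whose estimated mediated effect is non-zero (beyond tol)."""
    effects = indirect_effects(alpha, beta)
    return [name for name, effect in effects.items() if abs(effect) > tol]

# Hypothetical penalized estimates: m2 and m3 each have one path shrunk to zero.
alpha = {"m1": 0.5, "m2": 0.0, "m3": 0.3}
beta = {"m1": 0.4, "m2": 0.7, "m3": 0.0}
effects = indirect_effects(alpha, beta)
kept = selected_mediators(alpha, beta)
total_indirect = sum(effects.values())
```

Under this reading, treating the two coefficients of a mediator as one group is a natural way to let a penalty drop a mediator as a unit.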


Subject(s)
DNA Methylation , Models, Genetic , Models, Statistical , Software , Child , Computational Biology , Computer Simulation , Humans , Hydrocortisone/metabolism , Wounds and Injuries/metabolism
16.
Front Neurosci ; 14: 243, 2020.
Article in English | MEDLINE | ID: mdl-32300289

ABSTRACT

[This corrects the article DOI: 10.3389/fnins.2020.00060.].

17.
Front Neurosci ; 14: 60, 2020.
Article in English | MEDLINE | ID: mdl-32116508

ABSTRACT

Recent works have shown that the resting-state brain functional connectivity hypernetwork, where multiple nodes can be connected, is an effective technique for brain disease diagnosis and classification research. The lasso method was used to construct hypernetworks by solving sparse linear regression models in previous research. However, constructing a hypernetwork based on the lasso method simply selects a single variable, in that it lacks the ability to interpret the grouping effect. Considering the group structure problem, a previous study proposed to create a hypernetwork based on the elastic net and the group lasso methods, and the results showed that the former method had the best classification performance. However, the highly correlated variables selected by the elastic net method were not necessarily in the active set in the group. Therefore, we extended our research to address this issue. Herein, we propose a new method that introduces the sparse group lasso method to improve the construction of the hypernetwork by solving the group structure problem of the brain regions. We used the traditional lasso, group lasso method, and sparse group lasso method to construct a hypernetwork in patients with depression and normal subjects. Meanwhile, other clustering coefficients (clustering coefficients based on pairs of nodes) were also introduced to extract features with traditional clustering coefficients. Two types of features with significant differences obtained after feature selection were subjected to multi-kernel learning for feature fusion and classification using each method, respectively. The network topology results revealed differences among the three networks, where the hypernetwork using the lasso method was the strictest; the group lasso, most lenient; and the sgLasso method, moderate. The network topology of the sparse group lasso method was similar to that of the group lasso method but different from the lasso method.
The classification results show that the sparse group lasso method achieves the best classification accuracy by using multi-kernel learning, which indicates that better classification performance can be achieved when the group structure exists and is properly extended.
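One common way to turn sparse regression results into a hypernetwork is to regress each brain region on all the others and let the region, together with the regions that receive non-zero coefficients, form one hyperedge. The sketch below assumes that construction, which the abstract does not describe in detail, and takes an already estimated coefficient matrix as input. The function names and toy values are hypothetical.

```python
def hyperedges_from_coefficients(coef, tol=1e-8):
    """Build one hyperedge per region from a sparse coefficient matrix.

    coef[i][j] is the weight of region j in the regression that predicts
    region i; the diagonal is ignored.
    """
    edges = []
    for i, row in enumerate(coef):
        members = {i}
        members.update(j for j, w in enumerate(row) if j != i and abs(w) > tol)
        edges.append(frozenset(members))
    return edges

def incidence_matrix(edges, n_nodes):
    """Rows are nodes, columns are hyperedges; 1 marks membership."""
    return [[1 if node in edge else 0 for edge in edges] for node in range(n_nodes)]

# Toy 4-region example; zeros stand for coefficients removed by the penalty.
coef = [
    [0.0, 0.6, 0.0, 0.0],
    [0.5, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.4, 0.0],
]
edges = hyperedges_from_coefficients(coef)
incidence = incidence_matrix(edges, n_nodes=4)
node_degree = [sum(row) for row in incidence]
```

A stricter penalty leaves fewer non-zero coefficients and therefore smaller hyperedges, which is one way to read the "strictest" versus "most lenient" comparison in the abstract.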

18.
Front Genet ; 11: 155, 2020.
Article in English | MEDLINE | ID: mdl-32194631

ABSTRACT

Identification of genetic variants associated with complex traits is a critical step for improving plant resistance and breeding. Although the majority of existing methods for variant detection have good predictive performance in the average case, they cannot precisely identify the variants present in a small number of target genes. In this paper, we propose a weighted sparse group lasso (WSGL) method to select both common and low-frequency variants in groups. Under the biologically realistic assumption that complex traits are influenced by a few single loci in a small number of genes, our method involves a sparse group lasso approach to simultaneously select associated groups along with the loci within each group. To increase the probability of selecting low-frequency variants, biological prior information is introduced in the model by re-weighting lasso regularization based on weights calculated from input data. Experimental results from both simulation and real data of single nucleotide polymorphisms (SNPs) associated with Arabidopsis flowering traits demonstrate the superiority of WSGL over other competitive approaches for genetic variant detection.

19.
J Theor Biol ; 486: 110098, 2020 02 07.
Article in English | MEDLINE | ID: mdl-31786183

ABSTRACT

At present, with the in-depth study of gene expression data, the significant role of tumor classification in clinical medicine has become more apparent, in particular with regard to the sparse characteristics of gene expression data within and between groups. Therefore, this paper focuses on the study of tumor classification based on the sparsity characteristics of genes. On this basis, we propose a new method of tumor classification: Sparse Group Lasso (least absolute shrinkage and selection operator) and Support Vector Machine (SGL-SVM). First, the primary selection of feature genes is performed on the normalized tumor datasets using the Kruskal-Wallis rank sum test. Second, a sparse group Lasso is used for further selection. Finally, the support vector machine serves as the classifier. We validate the proposed method on microarray and NGS datasets, respectively. First, on three two-class and five multi-class microarray datasets, it is tested by 10-fold cross-validation and compared with three other classifiers. SGL-SVM is then applied to BRCA and GBM datasets and tested by 5-fold cross-validation. Satisfactory accuracy is obtained in the above experiments and compared with other proposed methods. The experimental results show that the proposed method achieves a higher classification accuracy and selects fewer feature genes, and that it can be widely applied in classification for high-dimensional and small-sample tumor datasets. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/SGL-SVM/.
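The first screening step, the Kruskal-Wallis rank sum test, can be sketched in a few lines. The code below computes the H statistic for one gene across classes, using average ranks for ties but omitting the tie correction factor, so it is an illustration rather than a replacement for a statistics library. It is not taken from the linked repository, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def kruskal_wallis_h(samples):
    """H statistic (no tie correction); samples is a list of lists, one per class."""
    pooled = [v for s in samples for v in s]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    pos = 0
    for s in samples:
        rank_sum = sum(ranks[pos:pos + len(s)])
        weighted += rank_sum ** 2 / len(s)
        pos += len(s)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

# One gene measured in three tumor classes (toy values).
h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
h_mixed = kruskal_wallis_h([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
```

Genes with larger H values differ more across classes, so a screening step of this kind would keep the highest-scoring genes before the sparse group Lasso and SVM stages.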


Subject(s)
Neoplasms , Support Vector Machine , Algorithms , Gene Expression Profiling , Humans , Microarray Analysis , Neoplasms/genetics , Software
20.
Neuroimage ; 206: 116317, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31678502

ABSTRACT

Predicting the progression of Alzheimer's Disease (AD) has been held back for decades due to the lack of sufficient longitudinal data required for the development of novel machine learning algorithms. This study proposes a novel machine learning algorithm for predicting the progression of Alzheimer's disease using a distributed multimodal, multitask learning method. More specifically, each individual task is defined as a regression model, which predicts cognitive scores at a single time point. Since the prediction tasks for multiple intervals are related to each other in chronological order, multitask regression models have been developed to track the relationship between subsequent tasks. Furthermore, since subjects have various combinations of recording modalities together with other genetic, neuropsychological and demographic risk factors, special attention is given to the fact that each modality may experience a specific sparsity pattern. The model is hence generalized by exploiting multiple individual multitask regression coefficient matrices for each modality. The outcome for each independent modality-specific learner is then integrated with complementary information, known as risk factor parameters, revealing the most prevalent trends of the multimodal data. This new feature space is then used as input to the gradient boosting kernel in search of a more accurate prediction. The proposed model not only captures the complex relationships between the different feature representations, but also ignores any unrelated information which might skew the regression coefficients. Comparative assessments are made between the performance of the proposed method and that of several other well-established methods using different multimodal platforms. The results indicate that capturing the interrelatedness between the different modalities and extracting only relevant information in the data, even in an incomplete longitudinal dataset, yields minimized prediction errors.


Subject(s)
Alzheimer Disease/diagnostic imaging , Alzheimer Disease/physiopathology , Cognitive Dysfunction/diagnostic imaging , Cognitive Dysfunction/physiopathology , Disease Progression , Machine Learning , Aged , Aged, 80 and over , Female , Humans , Longitudinal Studies , Magnetic Resonance Imaging , Male , Mental Status and Dementia Tests , Neuropsychological Tests , Positron-Emission Tomography , Regression Analysis