ABSTRACT
Feature selection methods are essential for accurate disease classification and identifying informative biomarkers. While information-theoretic methods have been widely used, they often exhibit limitations such as high computational costs. Our previously proposed method, ClearF, addresses these issues by using reconstruction error from low-dimensional embeddings as a proxy for the entropy term in the mutual information. However, ClearF still has limitations, including a nontransparent bottleneck layer selection process, which can result in unstable feature selection. To address these limitations, we propose ClearF++, which simplifies the bottleneck layer selection and incorporates feature-wise clustering to enhance biomarker detection. We compare its performance with other commonly used methods such as MultiSURF and IFS, as well as ClearF, across multiple benchmark datasets. Our results demonstrate that ClearF++ consistently outperforms these methods in terms of prediction accuracy and stability, even with limited samples. We also observe that employing the Deep Embedded Clustering (DEC) algorithm for feature-wise clustering improves performance, indicating its suitability for handling complex data structures with limited samples. ClearF++ offers an improved biomarker prioritization approach with enhanced prediction performance and faster execution. Its stability and effectiveness with limited samples make it particularly valuable for biomedical data analysis.
ABSTRACT
Prurigo nodularis (PN) is a chronic dermatosis characterized by intensely itchy nodules. However, little is known about the nature and extent of PN in Asian populations. This study aimed to describe the epidemiology, comorbidities, and prescription patterns of PN in Koreans based on a large dermatology outpatient cohort. Patients with PN were identified from the Catholic Medical Center (CMC) clinical data warehouse. Anonymized data on age, sex, diagnostic codes, prescriptions, visit dates, and other relevant parameters were collected. Pearson correlation analysis was used to calculate the correlation between PN prevalence and patient age. Conditional logistic regression modeling was used to estimate the comorbidity risk of PN. A total of 3591 patients with PN were identified at the Catholic Medical Center Health System dermatology outpatient clinic in the period 2007-2020. A comparison of the study patients with age- and sex-matched controls (dermatology outpatients without PN) indicated that PN was associated with various comorbidities, including chronic kidney disease (adjusted odds ratio (aOR), 1.48; 95% confidence interval (CI), 1.29-1.70), dyslipidemia (aOR, 1.88; 95% CI, 1.56-2.27), type 2 diabetes mellitus (aOR, 1.37; 95% CI, 1.22-1.54), arterial hypertension (aOR, 1.50; 95% CI, 1.30-1.73), autoimmune thyroiditis (aOR, 2.43; 95% CI, 1.42-4.16), non-Hodgkin's lymphoma (aOR, 1.95; 95% CI, 1.23-3.07), and atopic dermatitis (aOR, 2.16; 95% CI, 1.91-2.45). Regarding prescription patterns, topical steroids were the most frequently prescribed, followed by topical calcineurin inhibitors; oral antihistamines were the most commonly prescribed systemic agents for PN. PN is a relatively rare but significant disease among Korean dermatology outpatients, with a higher comorbidity burden than in dermatology outpatients without PN. There is a great need for breakthroughs in PN treatment.
ABSTRACT
BACKGROUND: Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for this purpose, and several studies have employed information-theoretic approaches. However, most of these methods require a long processing time. In addition, information-theoretic methods discretize continuous features, which can lead to a loss of information. RESULTS: In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data and follows a principle similar to that of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given a multi-class dataset, such as a case-control study dataset, low-dimensional embedding is first applied to each class, and also to the entire dataset, to obtain compressed representations. Reconstruction is then performed to calculate the error of each feature, and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information-theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS: The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.
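The scoring idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the embedding method or the exact score formula, so everything specific here is an assumption: PCA (via SVD) stands in for the low-dimensional embedding, and the score is taken as the whole-dataset reconstruction error minus the class-size-weighted class-wise error, by analogy with I(X;Y) = H(X) - H(X|Y). The function names and the synthetic data are hypothetical and are not taken from ClearF itself.

```python
import numpy as np

def pca_reconstruction_error(X, k):
    """Per-feature mean squared reconstruction error of a rank-k PCA embedding."""
    Xc = X - X.mean(axis=0)                      # center within this subset
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                 # top-k principal directions, shape (d, k)
    residual = Xc - Xc @ V @ V.T                 # part the embedding cannot reconstruct
    return (residual ** 2).mean(axis=0)

def reconstruction_scores(X, y, k=1):
    """Assumed score: error on the whole dataset minus weighted class-wise error."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # put features on a common scale
    total = pca_reconstruction_error(X, k)       # stands in for the H(X) term
    within = np.zeros(X.shape[1])
    for label in np.unique(y):                   # stands in for the H(X|Y) term
        X_class = X[y == label]
        within += len(X_class) / len(y) * pca_reconstruction_error(X_class, k)
    return total - within

# Synthetic case-control data: feature 0 shifts with the class label,
# features 1-4 share a latent factor that is unrelated to the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
signal = np.where(y == 1, 2.0, -2.0) + rng.normal(size=y.size)
shared = rng.normal(size=y.size)
background = shared[:, None] + 0.3 * rng.normal(size=(y.size, 4))
X = np.column_stack([signal, background])

scores = reconstruction_scores(X, y)
print(scores.round(2))
```

In this toy setting the class-related feature should receive the largest score, because its spread shrinks once each class is embedded separately, while the class-independent features are reconstructed about equally well either way.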
Subject(s)
Biomarkers/metabolism, Computational Biology/methods, Supervised Machine Learning, Benchmarking
ABSTRACT
BACKGROUND: Aspirin-exacerbated respiratory disease (AERD) is a chronic medical condition that encompasses asthma, nasal polyposis, and hypersensitivity to aspirin and other non-steroidal anti-inflammatory drugs. Several previous studies have shown that part of the genetic effects of the disease may be induced by the interaction of multiple genetic variants. However, the heavy computational cost and the complexity of the underlying biological mechanism have prevented a thorough investigation of epistatic interactions, and most previous studies have therefore considered only a small number of genetic variants at a time. METHODS: In this study, we propose a gene-network-based analysis framework to identify genetic risk factors from a genome-wide association study (GWAS) dataset. We first derive multiple single nucleotide polymorphism (SNP)-based epistasis networks that consider marginal and epistatic effects by using different information-theoretic measures. Each SNP epistasis network is converted into a gene-gene interaction network, and the resulting gene networks are combined into one for downstream analysis. The integrated network is validated against the existing knowledge bases DisGeNET, for known gene-disease associations, and GeneMANIA, for biological function prediction. RESULTS: We demonstrated our proposed method on a Korean GWAS dataset, which has genotype information on 440,094 SNPs for 188 cases and 247 controls. The topological properties of the generated networks were examined for scale-freeness, and we further performed various statistical analyses in the Allergy and Asthma Portal (AAP) using the selected genes from our integrated network. CONCLUSIONS: Our results reveal that there are several gene modules in the network that are of biological significance and have evidence for controlling susceptibility and being related to the treatment of AERD.
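A minimal sketch of the two steps named in the METHODS part, under stated assumptions: the abstract does not say which information-theoretic measures were used, so mutual information (marginal effect) and information gain I(A,B;C) - I(A;C) - I(B;C) (pairwise epistatic effect) are used here as common examples. The rule for collapsing SNP edges into gene edges (keep the maximum weight per gene pair), the function names, and the SNP-to-gene mapping are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y): the marginal effect of one SNP on the phenotype."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def information_gain(snp_a, snp_b, phenotype):
    """Information the SNP pair carries about the phenotype beyond both marginal effects."""
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))

def gene_network(snp_edges, snp_to_gene):
    """Collapse weighted SNP-SNP edges into gene-gene edges (assumed rule: max weight)."""
    edges = {}
    for (a, b), weight in snp_edges.items():
        gene_a, gene_b = snp_to_gene.get(a), snp_to_gene.get(b)
        if gene_a is None or gene_b is None or gene_a == gene_b:
            continue                              # skip unmapped SNPs and within-gene pairs
        key = tuple(sorted((gene_a, gene_b)))
        edges[key] = max(edges.get(key, 0.0), weight)
    return edges

# Toy XOR-like example: neither SNP is informative alone, but the pair is.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
print(mutual_information(snp_a, phenotype), information_gain(snp_a, snp_b, phenotype))
```

The toy example shows why pairwise measures matter: a purely epistatic pair has zero marginal mutual information for each SNP, so a single-variant scan would miss it.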