ABSTRACT
With the increase in the number of parameters that can be detected at the single-cell level using flow and mass cytometry, there has been a paradigm shift in how data sets are handled and analyzed. Cytometry Shared Resource Laboratories (SRLs) already take on the responsibility of ensuring users have resources and training in experimental design and instrument operation to promote high-quality data acquisition. However, the role of SRLs downstream, during data handling and analysis, is not as well defined and agreed upon. Best practices dictate a central role for SRLs in this process, as they are in a pivotal position to support research in this context, but key considerations about how to fill this role effectively need to be addressed. Two surveys and one workshop at CYTO 2022 in Philadelphia, PA, were conducted to gain insight into which strategies SRLs are successfully employing to support high-dimensional data analysis and where SRLs and their users see limitations and long-term challenges in this area. Recommendations for high-dimensional data analysis support provided by SRLs will be offered and discussed.
Subjects
Laboratories, Research Design, Data Accuracy, Flow Cytometry/methods
ABSTRACT
Characterization of target abundance on cells has broad translational applications. Among the approaches for assessing membrane target expression is quantification of the number of target-specific antibodies (Ab) bound per cell (ABC). ABC determination on relevant cell subsets in complex and limited biological samples necessitates multidimensional immunophenotyping, for which the high-order multiparameter capabilities of mass cytometry provide considerable advantages. In the present study, we describe the implementation of CyTOF® for the concomitant quantification of membrane markers on diverse types of immune cells in human whole blood. Specifically, our protocol relies on establishing the Bmax of saturable Ab binding on cells, which is then converted into ABC according to the metal's transmission efficiency and the number of metal atoms per Ab. Using this method, we calculated ABC values for CD4 and CD8 within the expected range for circulating T cells and in concordance with the ABC obtained in the same samples by flow cytometry. Furthermore, we successfully conducted multiplex measurements of the ABC for CD28, CD16, CD32a, and CD64 on >15 immune cell subsets in human whole blood samples. We developed a high-dimensional data analysis workflow enabling semi-automated Bmax calculation in all examined cell subsets to facilitate ABC reporting across populations. In addition, we investigated the impact of the metal isotope used and of acquisition batch effects on the ABC evaluation with CyTOF®. In summary, our findings demonstrate that mass cytometry is a valuable tool for concurrent quantitative analysis of multiple targets in specific and rare cell types, thus increasing the number of biomeasures obtained from a single sample.
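A minimal sketch of the kind of Bmax-to-ABC conversion described above, with all numbers as illustrative placeholders rather than the study's calibration values; the function name and parameters are our own for illustration.

```python
# Sketch only: convert a fitted Bmax (mean metal counts per cell at antibody
# saturation) into antibodies bound per cell (ABC), given an assumed instrument
# transmission efficiency and the number of metal atoms carried per antibody.

def abc_from_bmax(bmax_counts: float,
                  transmission_efficiency: float,
                  atoms_per_antibody: float) -> float:
    """ABC = (detected atoms / transmission efficiency) / atoms per antibody."""
    total_atoms = bmax_counts / transmission_efficiency   # atoms introduced per cell
    return total_atoms / atoms_per_antibody

# Placeholder values: Bmax of 500 counts, ~1e-4 ions detected per atom introduced,
# ~100 metal atoms per antibody -> ~50,000 antibodies bound per cell.
print(abc_from_bmax(500.0, 1e-4, 100.0))
```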
Subjects
Antibodies, T-Lymphocytes, Humans, Flow Cytometry/methods, Immunophenotyping
ABSTRACT
BACKGROUND: Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible; therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated analyses require experience in setting the algorithms' hyperparameters and expert knowledge about the analysed biological processes. Moreover, feature engineering is needed to obtain valuable results because of the numerous features measured. RESULTS: We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to alternative solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices, the Dice Index, Rand Index and EXIMS score, focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of the unsupervised analyses. Algorithms were validated on mass spectrometry imaging (MSI) datasets: 2D human cancer tissue samples and 3D mouse kidney images. The DiviK algorithm performed the best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering mostly second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial η² effect size: 0.141 versus 0.345; Kendall's concordance index: 0.424 versus 0.138 for d(0, 0, 0)). CONCLUSIONS: DiviK could be the default choice in the exploration of MSI data. Thanks to its unique, GMM-based local optimisation of the feature space and deglomerative schema, DiviK results do not strongly depend on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK shows high scalability: it can process at once big omics data with more than 1.5 million instances and a few thousand features. Finally, due to its simplicity, DiviK is easily generalisable to an even more flexible framework. It is therefore helpful for other -omics data (such as single-cell spatial transcriptomics) or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under the Apache 2.0 license at https://github.com/gmrukwa/divik .
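A simplified sketch of the stepwise, deglomerative idea behind DiviK, not the reference implementation (see the linked repository for that): at each node, a plain variance filter stands in for the GMM-based local feature selection, and the region is split with k-means before recursing.

```python
import numpy as np
from sklearn.cluster import KMeans

def stepwise_segment(X, indices=None, min_size=200, top_frac=0.2, depth=0, max_depth=4):
    """Recursively split a region with k-means after local feature filtering."""
    if indices is None:
        indices = np.arange(X.shape[0])
    if len(indices) < min_size or depth >= max_depth:
        return {i: "" for i in indices}              # leaf: no further splitting
    local = X[indices]
    # Local, data-driven feature selection for this region only; a simple
    # variance filter stands in for DiviK's GMM-based feature filtering.
    n_keep = max(1, int(top_frac * X.shape[1]))
    keep = np.argsort(local.var(axis=0))[-n_keep:]
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(local[:, keep])
    labels = {}
    for k in (0, 1):
        child = stepwise_segment(X, indices[split == k], min_size, top_frac,
                                 depth + 1, max_depth)
        labels.update({i: f"{k}" + ("." + v if v else "") for i, v in child.items()})
    return labels

# Toy usage: 1,000 "spectra" with 500 features each.
X = np.random.default_rng(0).random((1000, 500))
region_labels = stepwise_segment(X)                  # hierarchical labels, e.g. "0.1"
```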
Subjects
Algorithms, Metabolomics, Animals, Mice, Humans, Cluster Analysis, Mass Spectrometry, Big Data
ABSTRACT
BACKGROUND: The severity of an influenza infection is influenced by both host and viral characteristics. This study aims to assess the relevance of viral genomic data for the prediction of severe influenza A(H3N2) infections among patients hospitalized for severe acute respiratory infection (SARI), in view of risk assessment and patient management. METHODS: 160 influenza A(H3N2)-positive samples from the 2016-2017 season originating from the Belgian SARI surveillance were selected for whole genome sequencing. Predictor variables for severity were selected using a penalized elastic net logistic regression model from a combined host and genomic dataset, including patient information and nucleotide mutations identified in the viral genome. The goodness-of-fit of the model combining host and genomic data was compared using a likelihood-ratio test with the model including host data only. Internal validation of model discrimination was conducted by calculating the optimism-adjusted area under the Receiver Operating Characteristic curve (AUC) for both models. RESULTS: The model including viral mutations in addition to the host characteristics had an improved fit (χ² = 12.03, df = 3, p = 0.007). The optimism-adjusted AUC increased from 0.671 to 0.732. CONCLUSIONS: Adding genomic data (selected season-specific mutations in the viral genome) to the model containing host characteristics improved the prediction of severe influenza infection among hospitalized SARI patients, thereby offering the potential for translation into a prospective strategy to perform early-season risk assessment or to guide individual patient management.
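A hedged sketch of the modelling steps described above, using synthetic stand-in data; the column names, data layout, and penalty settings are assumptions, not the study's actual code. An elastic-net logistic model screens candidate mutations, and a likelihood-ratio test compares a host-only model with a host-plus-mutations model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the combined host + genomic dataset: binary 'severe'
# outcome, numerically encoded host covariates, and 0/1 mutation indicators.
rng = np.random.default_rng(0)
n = 160
data = pd.DataFrame({
    "age": rng.integers(1, 95, n),
    "sex": rng.integers(0, 2, n),
    "comorbidity": rng.integers(0, 2, n),
    **{f"mut_{i}": rng.integers(0, 2, n) for i in range(20)},
})
data["severe"] = rng.integers(0, 2, n)

host_cols = ["age", "sex", "comorbidity"]
mutation_cols = [c for c in data.columns if c.startswith("mut_")]

# 1) Elastic-net logistic regression to select candidate mutations.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.5, max_iter=5000)
enet.fit(data[host_cols + mutation_cols], data["severe"])
coefs = dict(zip(host_cols + mutation_cols, enet.coef_[0]))
selected = [c for c in mutation_cols if coefs[c] != 0]

# 2) Likelihood-ratio test: host-only model versus host + selected mutations.
m0 = sm.Logit(data["severe"], sm.add_constant(data[host_cols]).astype(float)).fit(disp=0)
m1 = sm.Logit(data["severe"], sm.add_constant(data[host_cols + selected]).astype(float)).fit(disp=0)
lr_stat = 2 * (m1.llf - m0.llf)
p_value = stats.chi2.sf(lr_stat, df=max(len(selected), 1))
print(lr_stat, p_value)
```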
Subjects
Human Influenza, Viral Genome, Genomics, Humans, Influenza A Virus H3N2 Subtype/genetics, Human Influenza/diagnosis, Prospective Studies
ABSTRACT
Penalized regression methods are an attractive tool for high-dimensional data analysis, but their widespread adoption has been hampered by the difficulty of applying inferential tools. In particular, the question "How reliable is the selection of those features?" has proved difficult to address. In part, this difficulty arises from defining false discoveries in the classical, fully conditional sense, which is possible in low dimensions but does not scale well to high-dimensional settings. Here, we consider the analysis of marginal false discovery rates (mFDRs) for penalized regression methods. Restricting attention to the mFDR permits straightforward estimation of the number of selections that would likely have occurred by chance alone, and therefore provides a useful summary of selection reliability. Theoretical analysis and simulation studies demonstrate that this approach is quite accurate when the correlation among predictors is mild, and only slightly conservative when the correlation is stronger. Finally, the practical utility of the proposed method and its considerable advantages over other approaches are illustrated using gene expression data from The Cancer Genome Atlas and genome-wide association study data from the Myocardial Applied Genomics Network.
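As an informal restatement of the quantity described above (notation ours, not necessarily the paper's): at a given regularization level λ, the mFDR estimate compares the expected number of selections that would have occurred by chance alone with the number actually selected,

$$\widehat{\mathrm{mFDR}}(\lambda) \;=\; \frac{\widehat{\mathbb{E}}\big[\#\{\text{features selected at } \lambda \text{ by chance alone}\}\big]}{\#\{\text{features selected at } \lambda\}}.$$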
Subjects
Biostatistics/methods, Statistical Data Interpretation, Statistical Models, Regression Analysis, Gene Expression/genetics, Genome-Wide Association Study, Humans
ABSTRACT
The arrival of mass cytometry (MC) and, more recently, spectral flow cytometry (SFC) has revolutionized the study of cellular, functional and phenotypic diversity, significantly increasing the number of characteristics measurable at the single-cell level. As a consequence, new computational techniques such as dimensionality reduction and/or clustering algorithms are necessary to analyze, clean, visualize, and interpret these high-dimensional data sets. In this small comparison study, we investigated splenocytes from the same sample by either MC or SFC and compared both high-dimensional data sets using expert gating, t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP) analysis, and FlowSOM. When we downsampled each data set to equivalent cell numbers and parameters, our analysis yielded highly comparable results. Differences between the data sets became apparent only when each was assessed with its maximum number of parameters, owing to differences in the number of recorded events and in the number of parameters each technology can measure. Overall, our small comparison study suggests that mass cytometry and spectral flow cytometry yield comparable results when analyzed manually or with high-dimensional clustering and dimensionality reduction algorithms such as FlowSOM, t-SNE, or UMAP. However, large-scale studies combined with an in-depth technical analysis will be needed to assess differences between these technologies in more detail. © 2020 International Society for Advancement of Cytometry.
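A hedged sketch of the kind of comparison described above, on synthetic stand-in matrices (event counts, marker counts, and UMAP settings are placeholders, not the study's values): downsample both data sets to a common number of events over the shared marker panel and embed each with UMAP so the maps can be compared side by side.

```python
import numpy as np
import umap

rng = np.random.default_rng(0)

# Stand-ins for the two acquisitions of the same splenocyte sample:
# events x shared-markers matrices (real data would be transformed expression
# values restricted to the markers common to both panels).
mc_data = rng.normal(size=(5000, 25))
sfc_data = rng.normal(size=(8000, 25))

def downsample(events, n):
    return events[rng.choice(events.shape[0], size=n, replace=False)]

n_common = min(mc_data.shape[0], sfc_data.shape[0])
mc_sub = downsample(mc_data, n_common)
sfc_sub = downsample(sfc_data, n_common)

mc_embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(mc_sub)
sfc_embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(sfc_sub)
# The two 2-D embeddings (and, e.g., cluster frequencies computed on the same
# downsampled matrices) can then be compared between the technologies.
```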
Subjects
Algorithms, Data Analysis, Cluster Analysis, Flow Cytometry
ABSTRACT
Cytometry by time-of-flight (CyTOF) has emerged as a high-throughput single-cell technology able to provide large samples of protein readouts. There already exists a large pool of advanced high-dimensional analysis algorithms that explore the observed heterogeneous distributions and draw intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline on the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, and not in conjunction with high-dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.
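A minimal illustration of the point above (our own example, not the paper's code): CyTOF ion counts are integers, and randomization spreads each count by adding uniform noise purely to reduce overplotting in biaxial displays. The noisy values should feed visualization only; the untouched counts should feed clustering and dimensionality reduction.

```python
import numpy as np

rng = np.random.default_rng(42)
raw_counts = rng.poisson(lam=5.0, size=(10_000, 30))    # events x channels (integer counts)

# Randomization: uniform noise in [-1, 0) (or similar) added for plotting only.
randomized = raw_counts + rng.uniform(-1.0, 0.0, raw_counts.shape)

# Downstream analysis (clustering, t-SNE/UMAP, network reconstruction) should
# use the raw counts, e.g. after an arcsinh transform, not the randomized values.
transformed_for_analysis = np.arcsinh(raw_counts / 5.0)
```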
Subjects
Algorithms, Flow Cytometry/methods, Mononuclear Leukocytes/cytology, B-Lymphocytes/cytology, B-Lymphocytes/metabolism, Buffy Coat/cytology, Buffy Coat/metabolism, Cluster Analysis, Humans, Mononuclear Leukocytes/metabolism, Multivariate Analysis, Neural Networks (Computer), Random Allocation, Single-Cell Analysis, T-Lymphocytes/cytology, T-Lymphocytes/metabolism
ABSTRACT
The popularity of penalized regression in high-dimensional data analysis has led to a demand for new inferential tools for these models. False discovery rate control is widely used in high-dimensional hypothesis testing, but has only recently been considered in the context of penalized regression. Almost all of this work, however, has focused on lasso-penalized linear regression. In this paper, we derive a general method for controlling the marginal false discovery rate that can be applied to any penalized likelihood-based model, such as logistic regression and Cox regression. Our approach is fast, flexible and can be used with a variety of penalty functions including lasso, elastic net, MCP, and MNet. We derive theoretical results under which the proposed method is valid, and use simulation studies to demonstrate that the approach is reasonably robust, albeit slightly conservative, when these assumptions are violated. Despite being conservative, we show that our method often offers more power to select causally important features than existing approaches. Finally, the practical utility of the method is demonstrated on gene expression datasets with binary and time-to-event outcomes.
Subjects
Biometry/methods, False Positive Reactions, Gene Expression Profiling, Humans, Likelihood Functions, Lung Neoplasms/epidemiology, Lung Neoplasms/genetics, Regression Analysis, Smoking, Survival Analysis
ABSTRACT
We consider a research scenario motivated by integrating multiple sources of information for better knowledge discovery in diverse dynamic biological processes. Given two longitudinal high-dimensional datasets for a group of subjects, we want to extract shared latent trends and identify relevant features. To solve this problem, we present a new statistical method named joint principal trend analysis (JPTA). We demonstrate the utility of JPTA through simulations and applications to gene expression data of the mammalian cell cycle and longitudinal transcriptional profiling data in response to influenza viral infections.
Subjects
Biometry/methods, Statistical Data Interpretation, Statistical Models, Cell Cycle/genetics, Computer Simulation, Datasets as Topic, Gene Expression Profiling, Humans, Human Influenza/genetics
ABSTRACT
We consider an independence feature screening technique for identifying explanatory variables that locally contribute to the response variable in high-dimensional regression analysis. Without requiring a specific parametric form of the underlying data model, our approach accommodates a wide spectrum of nonparametric and semiparametric model families. To detect the local contributions of explanatory variables, our approach constructs empirical likelihood locally in conjunction with marginal nonparametric regressions. Since our approach actually requires no estimation, it is advantageous in scenarios such as the single-index models where even specification and identification of a marginal model is an issue. By automatically incorporating the level of variation of the nonparametric regression and directly assessing the strength of data evidence supporting local contribution from each explanatory variable, our approach provides a unique perspective for solving feature screening problems. Theoretical analysis shows that our approach can handle data dimensionality growing exponentially with the sample size. With extensive theoretical illustrations and numerical examples, we show that the local independence screening approach performs promisingly.
ABSTRACT
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points, while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation between different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification performance. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness.
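An illustrative sketch with our own, arbitrary parameter choices: hubness is measured as the skewness of the k-occurrence distribution (how often each point appears among other points' k nearest neighbours) under different fractional ℓp "norms", computed directly with NumPy.

```python
import numpy as np
from scipy.stats import skew

def k_occurrence_skewness(X: np.ndarray, p: float, k: int = 10) -> float:
    # Pairwise fractional Minkowski distances (p < 1 gives a quasi-norm).
    diffs = np.abs(X[:, None, :] - X[None, :, :])
    dists = (diffs ** p).sum(axis=-1) ** (1.0 / p)
    np.fill_diagonal(dists, np.inf)                  # exclude self-neighbours
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest per point
    counts = np.bincount(neighbours.ravel(), minlength=X.shape[0])
    return skew(counts)                              # high skew = strong hubness

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                       # 300 points in 50 dimensions
for p in (0.5, 1.0, 2.0):
    print(p, round(k_occurrence_skewness(X, p), 2))
```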
ABSTRACT
Robust data normalization and analysis are pivotal in biomedical research to ensure that observed differences in populations are directly attributable to the target variable, rather than disparities between control and study groups. ArsHive addresses this challenge using advanced algorithms to normalize populations (e.g., control and study groups) and perform statistical evaluations between demographic, clinical, and other variables within biomedical datasets, resulting in more balanced and unbiased analyses. The tool's functionality extends to comprehensive data reporting, which elucidates the effects of data processing, while maintaining dataset integrity. Additionally, ArsHive is complemented by A.D.A. (Autonomous Digital Assistant), which employs OpenAI's GPT-4 model to assist researchers with inquiries, enhancing the decision-making process. In this proof-of-concept study, we tested ArsHive on three different datasets derived from proprietary data, demonstrating its effectiveness in managing complex clinical and therapeutic information and highlighting its versatility for diverse research fields.
ABSTRACT
BACKGROUND: Active surveillance pharmacovigilance is an emerging approach to identify medications with unanticipated effects. We previously developed a framework called pharmacopeia-wide association studies (PharmWAS) that limits false positive medication associations through high-dimensional confounding adjustment and set enrichment. We aimed to assess the transportability and generalizability of the PharmWAS framework by using medical claims data to reproduce known medication associations with Clostridioides difficile infection (CDI) or gastrointestinal bleeding (GIB). METHODS: We conducted case-control studies using Optum's de-identified Clinformatics Data Mart Database of individuals enrolled in large commercial and Medicare Advantage health plans in the United States. Individuals with CDI (from 2010 to 2015) or GIB (from 2010 to 2021) were matched to controls by age and sex. We identified all medications utilized prior to diagnosis and analysed the association of each with CDI or GIB using conditional logistic regression adjusted for risk factors for the outcome and a high-dimensional propensity score. FINDINGS: For the CDI study, we identified 55,137 cases, 220,543 controls, and 290 medications to analyse. Antibiotics with Gram-negative spectrum, including ciprofloxacin (aOR 2.83), ceftriaxone (aOR 2.65), and levofloxacin (aOR 1.60), were strongly associated. For the GIB study, we identified 450,315 cases, 1,801,260 controls, and 354 medications to analyse. Antiplatelets, anticoagulants, and non-steroidal anti-inflammatory drugs, including ticagrelor (aOR 2.81), naproxen (aOR 1.87), and rivaroxaban (aOR 1.31), were strongly associated. INTERPRETATION: These studies demonstrate the generalizability and transportability of the PharmWAS pharmacovigilance framework. With additional validation, PharmWAS could complement traditional passive surveillance systems to identify medications that unexpectedly provoke or prevent high-impact conditions. FUNDING: U.S. National Institute of Diabetes and Digestive and Kidney Diseases.
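A hedged sketch of one step of a PharmWAS-style analysis on synthetic matched data; the column names, covariate set, and propensity-score construction are assumptions, not the study's code. For a single candidate medication, a conditional logistic regression is fitted within matched case-control strata, adjusting for a known risk factor and a high-dimensional propensity score estimated from many empirical covariates.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from statsmodels.discrete.conditional_models import ConditionalLogit

# Synthetic matched case-control data: 'stratum' identifies each matched set
# (1 case + 4 controls), 'drug_x' is the exposure of interest.
rng = np.random.default_rng(0)
n_strata, per_stratum = 500, 5
n = n_strata * per_stratum
data = pd.DataFrame({
    "stratum": np.repeat(np.arange(n_strata), per_stratum),
    "case": np.tile([1, 0, 0, 0, 0], n_strata),
    "drug_x": rng.integers(0, 2, n),
    "risk_factor": rng.integers(0, 2, n),
    **{f"cov_{i}": rng.integers(0, 2, n) for i in range(30)},
})

# High-dimensional propensity score: probability of exposure given covariates.
cov_cols = [c for c in data.columns if c.startswith("cov_")]
ps = LogisticRegression(max_iter=2000).fit(data[cov_cols], data["drug_x"])
data["hd_ps"] = ps.predict_proba(data[cov_cols])[:, 1]

# Conditional logistic regression within matched strata; exp(coef) of drug_x
# plays the role of the adjusted odds ratio (aOR) for the medication-outcome pair.
exog = data[["drug_x", "risk_factor", "hd_ps"]].astype(float)
fit = ConditionalLogit(data["case"], exog, groups=data["stratum"]).fit()
print(np.exp(fit.params))
```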
Subjects
Clostridioides difficile, Clostridium Infections, Gastrointestinal Hemorrhage, Pharmacovigilance, Humans, Clostridium Infections/epidemiology, Clostridium Infections/etiology, Clostridium Infections/drug therapy, Case-Control Studies, Male, Gastrointestinal Hemorrhage/chemically induced, Gastrointestinal Hemorrhage/epidemiology, Gastrointestinal Hemorrhage/etiology, Female, Aged, Middle Aged, Anti-Bacterial Agents/adverse effects, Anti-Bacterial Agents/therapeutic use, United States/epidemiology, Risk Factors, Adult, Aged 80 and Over
ABSTRACT
Neonatal brain inflammation produced by intraperitoneal (i.p.) injection of lipopolysaccharide (LPS) results in long-lasting brain dopaminergic injury and motor disturbances in adult rats. The goal of the present work was to investigate whether the dopaminergic injury induced by neonatal systemic LPS exposure (1 or 2 mg/kg, i.p. injection on postnatal day 5, P5, in male rats) affects methamphetamine (METH)-induced behavioral sensitization, an indicator of drug addiction. On P70, subjects underwent a treatment schedule of five once-daily subcutaneous (s.c.) administrations of METH (0.5 mg/kg) (P70-P74) to induce behavioral sensitization. Ninety-six hours following the fifth treatment of METH (P78), the rats received one dose of 0.5 mg/kg METH (s.c.) to reinstate behavioral sensitization. Hyperlocomotion is a critical index of drug abuse, and METH administration has been shown to produce remarkable locomotor-enhancing effects. Therefore, a random forest model was used as the detector to extract the feature interaction patterns among the collected high-dimensional locomotor data. Our approach identified neonatal systemic LPS exposure dose and METH treatment dates as features significantly associated with METH-induced behavioral sensitization, reinstated behavioral sensitization, and perinatal inflammation in this experimental model of drug addiction. Overall, the analysis suggests that the implemented machine learning strategies are sensitive enough to detect interaction patterns in locomotor activity. Neonatal LPS exposure also enhanced the METH-induced reduction of dopamine transporter expression and [3H]dopamine uptake, reduced mitochondrial complex I activity, and elevated interleukin-1β and cyclooxygenase-2 concentrations in the P78 rat striatum. These results indicate that neonatal systemic LPS exposure produces a persistent dopaminergic lesion leading to a long-lasting change in the brain reward system, as indicated by the enhanced METH-induced behavioral sensitization and reinstated behavioral sensitization later in life. These findings indicate that early-life brain inflammation may enhance susceptibility to the development of drug addiction later in life, which provides new insights for developing potential therapeutic treatments for drug addiction.
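An illustrative sketch of the machine-learning step described above, on synthetic stand-in data (feature names, data layout, and thresholds are assumptions): a random forest is fitted to the locomotor recordings and features such as LPS dose and METH treatment day are ranked by permutation importance, the kind of signal used to flag interaction patterns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one row per recording session, with the LPS dose, the METH
# treatment day, engineered locomotor features, and a binary sensitization label.
rng = np.random.default_rng(0)
n = 600
frame = pd.DataFrame({
    "lps_dose": rng.choice([0.0, 1.0, 2.0], n),
    "meth_day": rng.integers(70, 79, n),
    **{f"locomotor_{i}": rng.normal(size=n) for i in range(20)},
})
y = (frame["lps_dose"] * 0.8 + frame["meth_day"] * 0.1 + rng.normal(size=n) > 8.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(frame, y, random_state=0, stratify=y)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

imp = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)
ranking = pd.Series(imp.importances_mean, index=frame.columns).sort_values(ascending=False)
print(ranking.head(5))   # dose and treatment day should rank near the top here
```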
Subjects
Newborn Animals, Lipopolysaccharides, Machine Learning, Methamphetamine, Animals, Methamphetamine/pharmacology, Methamphetamine/toxicity, Rats, Male, Lipopolysaccharides/toxicity, Animal Behavior/drug effects, Central Nervous System Stimulants/pharmacology, Encephalitis/chemically induced, Encephalitis/metabolism, Neuroinflammatory Diseases/drug therapy, Neuroinflammatory Diseases/chemically induced, Neuroinflammatory Diseases/metabolism, Locomotion/drug effects, Locomotion/physiology, Female, Sprague-Dawley Rats, Motor Activity/drug effects
ABSTRACT
BACKGROUND: The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized transcriptomic research by enabling the exploration of gene expression at the individual-cell level. This advancement sheds light on how cells differentiate and evolve over time. Effectively classifying cell types within scRNA-seq datasets is essential for understanding the intricate cell compositions within tissues and elucidating the origins of various diseases. Challenges persist in the field, emphasizing the need for precise categorization across diverse datasets, handling of aggregated cell data, and management of the complexity of high-dimensional data spaces. METHODOLOGY: XgCPred is a novel approach combining XGBoost with convolutional neural networks (CNNs) to provide more accurate cell type classification in single-cell RNA-seq data. This combination exploits the ability of CNNs to detect spatial hierarchies in gene expression images and the strength of XGBoost on large volumes of data. XgCPred utilizes an image representation of gene expression based on the hierarchical organization of genes in the KEGG BRITE database. RESULTS: Rigorous testing of XgCPred across multiple scRNA-seq datasets, each presenting unique challenges such as varying cell counts, gene expression diversity, and cellular heterogeneity, demonstrated its superiority over earlier methods. The algorithm shows remarkable accuracy and precision in cell type annotation, achieving near-perfect classification scores in some cases. These results underscore its capability to manage data variability effectively. CONCLUSIONS: XgCPred distinguishes itself through dependable and accurate cell type classification across a range of scRNA-seq datasets. Its effectiveness stems from sophisticated data handling and its ability to adapt to the complexities inherent in scRNA-seq data. XgCPred delivers reliable cell annotations essential for further biological analysis and research, marking a significant advancement in genomic studies. With scRNA-seq datasets growing in size and complexity, XgCPred offers a scalable and potent solution for cell type identification, potentially enhancing our understanding of cellular biology and aiding in the precise detection of diseases. XgCPred is a useful tool for genomic research and tailored therapy because it addresses current constraints on computational efficiency and generalizability.
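A heavily simplified, hedged sketch of the two-stage idea, not the authors' code or architecture: a small CNN embeds each cell's gene-expression "image" (genes arranged on a 2-D grid, for example by a pathway hierarchy), and an XGBoost classifier trained on the embeddings predicts cell types. Shapes, layer sizes, and the random inputs are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn
import xgboost as xgb

class ExpressionImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, embed_dim)

    def forward(self, x):                       # x: (batch, 1, H, W)
        return self.head(self.features(x).flatten(1))

# Placeholder expression images and cell-type labels.
images = torch.rand(1000, 1, 32, 32)
labels = np.random.randint(0, 5, size=1000)

encoder = ExpressionImageEncoder()
with torch.no_grad():                           # untrained encoder, illustration only
    embeddings = encoder(images).numpy()

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(embeddings, labels)
predicted = clf.predict(embeddings)
```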
Subjects
RNA-Seq, Single-Cell Analysis, Single-Cell Analysis/methods, Humans, RNA-Seq/methods, Neural Networks (Computer), Software, RNA Sequence Analysis/methods, Gene Expression Profiling/methods, Single-Cell Gene Expression Analysis
ABSTRACT
This paper pioneers a novel approach in electromagnetic (EM) system analysis by synergistically combining Bayesian Neural Networks (BNNs) informed by Latin Hypercube Sampling (LHS) with advanced thermal-mechanical surrogate modeling within COMSOL simulations for high-frequency low-pass filter modeling. Our methodology transcends traditional EM characterization by integrating physical dimension variability, thermal effects, mechanical deformation, and real-world operational conditions, thereby achieving a significant leap in predictive modeling fidelity. Through rigorous evaluation using Mean Squared Error (MSE), Maximum Learning Error (MLE), and Maximum Test Error (MTE) metrics, as well as comprehensive validation on unseen data, the model's robustness and generalization capability are demonstrated. This research challenges conventional methods, offering a nuanced understanding of multiphysical phenomena to enhance reliability and resilience in electronic component design and optimization. The integration of thermal variables alongside dimensional parameters marks a novel paradigm in filter performance analysis, significantly improving simulation accuracy. Our findings not only contribute to the body of knowledge in EM diagnostics and complex-environment analysis but also pave the way for future investigations into the fusion of machine learning with computational physics, promising transformative impacts across various applications, from telecommunications to medical devices.
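An illustrative sketch of the sampling step only (parameter names, ranges, and sample count are placeholders, not the paper's design space): Latin Hypercube Sampling generates design points over geometric and thermal parameters at which the filter would be simulated, and the resulting input-response pairs could then train an uncertainty-aware surrogate model.

```python
import numpy as np
from scipy.stats import qmc

# Assumed design variables: two physical dimensions (mm) and one temperature (°C).
lower = np.array([0.8, 4.0, 20.0])
upper = np.array([1.2, 6.0, 120.0])

sampler = qmc.LatinHypercube(d=3, seed=0)
unit_samples = sampler.random(n=200)                 # 200 points in [0, 1)^3
design = qmc.scale(unit_samples, lower, upper)       # scaled to physical ranges

# Each row of `design` would correspond to one COMSOL run; the collected
# responses (e.g. insertion loss over frequency) become the surrogate's targets.
print(design[:3])
```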
ABSTRACT
In drug safety, the development of statistical methods for multiplicity adjustments has exploited potential relationships among adverse events (AEs) according to underlying medical features. Due to the coarseness of the biological features used to group AEs together, which serves as the basis for the adjustment, it is possible that a single adverse event is simultaneously described by multiple biological features. However, existing methods are limited in that they are not structurally flexible enough to exploit this multi-dimensional characteristic of an adverse event accurately. In order to preserve the complex dependencies present in clinical safety data, a Bayesian approach for modeling the risk differentials of the AEs between the treatment and comparator arms is proposed that provides a more appropriate clinical description of the drug's safety profile. The proposed procedure uses an Ising prior to unite medically related AEs. The proposed method and an existing Bayesian method are applied to a clinical dataset, and the signals from the two methods are presented. Results from a small simulation study are also presented.
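For orientation, a generic Ising prior on binary signal indicators γ_j (γ_j = 1 when AE j is flagged as showing an elevated risk differential) that encourages medically related AEs to be flagged jointly can be written as below; the exact parameterization used in the paper may differ, so treat this as a sketch.

$$p(\boldsymbol{\gamma}) \;\propto\; \exp\!\Big(\sum_{j} \theta_j\,\gamma_j \;+\; \sum_{(j,k)\in E} \omega_{jk}\,\gamma_j\,\gamma_k\Big), \qquad \gamma_j \in \{0,1\},$$

where E collects pairs of AEs sharing at least one medical feature and ω_{jk} > 0 rewards flagging related AEs together.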
Subjects
Adverse Drug Reaction Reporting Systems/statistics & numerical data, Bayes Theorem, Statistical Models, Biometry/methods, Clinical Trials as Topic/statistics & numerical data, Computer Simulation, Factual Databases/statistics & numerical data, Drug-Related Side Effects and Adverse Reactions, Humans, Markov Chains, Monte Carlo Method
ABSTRACT
We study a marginal empirical likelihood approach in scenarios when the number of variables grows exponentially with the sample size. The marginal empirical likelihood ratios as functions of the parameters of interest are systematically examined, and we find that the marginal empirical likelihood ratio evaluated at zero can be used to differentiate whether an explanatory variable is contributing to a response variable or not. Based on this finding, we propose a unified feature screening procedure for linear models and the generalized linear models. Different from most existing feature screening approaches that rely on the magnitudes of some marginal estimators to identify true signals, the proposed screening approach is capable of further incorporating the level of uncertainties of such estimators. Such a merit inherits the self-studentization property of the empirical likelihood approach, and extends the insights of existing feature screening methods. Moreover, we show that our screening approach is less restrictive to distributional assumptions, and can be conveniently adapted to be applied in a broad range of scenarios such as models specified using general moment conditions. Our theoretical results and extensive numerical examples by simulations and data analysis demonstrate the merits of the marginal empirical likelihood approach.
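For orientation, the standard marginal empirical likelihood ratio for the j-th feature evaluated at zero can be written as below; the precise score z_{ij} used in the paper (for example, centred covariate-response products) may differ, so this is a generic sketch rather than the paper's exact formulation.

$$\ell_j(0) \;=\; -2\,\log\,\max\Big\{\prod_{i=1}^{n} n\,p_i \;:\; p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i\, z_{ij} = 0 \Big\},$$

with large values of ℓ_j(0) indicating strong data evidence against a zero marginal association, which is what the screening rule exploits.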
ABSTRACT
Data sets derived from practical experiments often pose challenges for (robust) statistical methods. In high-dimensional data sets, more variables than observations are recorded, and often some observations do not follow the structure of the data majority. In order to handle such data with outlying observations, a variety of robust regression and classification methods have been developed for low-dimensional data. The high-dimensional case, however, is more challenging, and the variety of robust methods is much more limited. The choice of method depends on the specific data structure, and numerical problems are more likely to occur. We give an overview of selected robust methods as well as implementations, and demonstrate their application with two high-dimensional data sets from tribology. We show that robust statistical methods combined with appropriate pre-processing and sampling strategies yield increased prediction performance and insight into data differing from the majority.
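A generic illustration, not tied to the tribology data sets discussed above: robust pre-processing combined with a robust, regularized regression so that a handful of outlying observations does not dominate a p >> n fit. All sizes and parameters are arbitrary placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
n, p = 60, 300                                   # more variables than observations
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=n)
y[:5] += 25.0                                    # a few gross outliers in the response

# Median/IQR scaling plus a Huber loss limits the influence of the outliers.
model = make_pipeline(RobustScaler(),
                      HuberRegressor(epsilon=1.35, alpha=1e-2, max_iter=2000))
model.fit(X, y)
print(model.score(X, y))                         # in-sample fit, illustration only
```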
ABSTRACT
Dimensional reduction (DR) maps high-dimensional data into a lower-dimensional latent space while minimizing defined optimization objectives. The two independent branches of DR are feature selection (FS) and feature projection (FP). FS focuses on selecting a critical subset of dimensions but risks destroying the data distribution (structure). FP, on the other hand, combines all the input features into a lower-dimensional space, aiming to maintain the data structure, but lacks interpretability and sparsity. Moreover, FS and FP have traditionally been incompatible categories and have not been unified into a common framework. Therefore, we consider that the ideal DR approach combines FS and FP into a unified end-to-end manifold learning framework, simultaneously performing fundamental feature discovery while maintaining the intrinsic relationships between data samples in the latent space. This paper proposes a unified framework named the Unified Dimensional Reduction Network (UDRN) to integrate FS and FP in an end-to-end way. Furthermore, a novel network architecture is designed to implement the FS and FP tasks separately, using a stacked feature selection network and a feature projection network. In addition, a stronger manifold assumption and a novel loss function are proposed, and the loss function can leverage the priors of data augmentation to enhance the generalization ability of UDRN. Finally, comprehensive experimental results on four image and four biological datasets, including very high-dimensional data, demonstrate the advantages of UDRN over existing methods (FS, FP, and FS&FP pipelines), especially in downstream tasks such as classification and visualization.
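A conceptual sketch only, our own simplification rather than the UDRN architecture or loss: a feature-selection stage learns sparse per-feature gates, a projection stage maps the gated features to a low-dimensional latent space, and an L1 term on the gates encourages sparsity while a crude distance-matching term stands in for the manifold-preserving objective.

```python
import torch
import torch.nn as nn

class GatedReducer(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 2):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_features))        # FS stage
        self.project = nn.Sequential(                                    # FP stage
            nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, latent_dim))

    def forward(self, x):
        gates = torch.sigmoid(self.gate_logits)
        return self.project(x * gates), gates

X = torch.randn(512, 1000)                       # toy data: 512 samples, 1000 features
model = GatedReducer(n_features=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(100):
    z, gates = model(X)
    # Toy structure-preservation proxy: match normalized pairwise distances
    # between input and latent space on a random mini-batch.
    idx = torch.randperm(X.shape[0])[:64]
    d_in = torch.cdist(X[idx], X[idx])
    d_out = torch.cdist(z[idx], z[idx])
    loss = ((d_in / d_in.mean() - d_out / d_out.mean()) ** 2).mean() + 1e-3 * gates.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```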