The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE).
Avval, Tahereh G; Moeini, Behnam; Carver, Victoria; Fairley, Neal; Smith, Emily F; Baltrusaitis, Jonas; Fernandez, Vincent; Tyler, Bonnie J; Gallagher, Neal; Linford, Matthew R.
Affiliation
  • Avval TG; Department of Chemistry and Biochemistry, Brigham Young University, C100 BNSN, Provo, Utah 84602, United States.
  • Moeini B; Department of Chemistry and Biochemistry, Brigham Young University, C100 BNSN, Provo, Utah 84602, United States.
  • Carver V; Department of Chemistry and Biochemistry, Brigham Young University, C100 BNSN, Provo, Utah 84602, United States.
  • Fairley N; Casa Software Ltd., Bay House, 5 Grosvenor Terrace, Teignmouth, Devon TQ14 8NE, U.K.
  • Smith EF; Nanoscale and Microscale Research Centre (NMRC) and School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, U.K.
  • Baltrusaitis J; Department of Chemical and Biomolecular Engineering, Lehigh University, B336 Iacocca Hall, 111 Research Drive, Bethlehem, Pennsylvania 18015, United States.
  • Fernandez V; Institut des Matériaux Jean Rouxel, IMN, Université de Nantes, CNRS, F-44000 Nantes, France.
  • Tyler BJ; Institut für Physik, Westfälische Wilhelms-Universität, 48149 Münster, Germany.
  • Gallagher N; Eigenvector Research, Inc., Manson, Washington 98831, United States.
  • Linford MR; Department of Chemistry and Biochemistry, Brigham Young University, C100 BNSN, Provo, Utah 84602, United States.
J Chem Inf Model; 61(9): 4173-4189, 2021 Sep 27.
Article in En | MEDLINE | ID: mdl-34499501
ABSTRACT
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the "critical pair," which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
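The abstract names the summary statistics but does not give their formulas, so the following is only a minimal sketch under stated assumptions. It assumes PRE is the Shannon entropy (in bits) of a spectrum normalized to unit sum, that STD is the population standard deviation, and that DS-PRE applies PRE to equal-width contiguous segments of a spectrum. The function names `pre`, `summary_stats`, and `ds_pre` are hypothetical, and X4 is omitted because the abstract does not define it.

```python
import math
from statistics import mean, pstdev


def pre(spectrum):
    """Shannon entropy in bits of a spectrum normalized to unit sum.

    Assumption: this is how PRE is computed; the abstract does not say.
    Intensities must be non-negative; zero channels contribute nothing.
    """
    if any(v < 0 for v in spectrum):
        raise ValueError("intensities must be non-negative")
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have a positive total intensity")
    probs = [v / total for v in spectrum]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def summary_stats(spectrum):
    """Reduce one spectrum to one number per statistic."""
    return {
        "pre": pre(spectrum),
        "mean": mean(spectrum),
        "std": pstdev(spectrum),  # assumption: population STD
        "one_norm": sum(abs(v) for v in spectrum),
        "range": max(spectrum) - min(spectrum),
        "ssq": sum(v * v for v in spectrum),
    }


def ds_pre(spectrum, n_segments):
    """PRE of each contiguous segment of the spectrum.

    Assumption: equal-width segments; the paper may divide spectra differently.
    A segment with zero total intensity raises ValueError via pre().
    """
    n = len(spectrum)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [pre(spectrum[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Under these assumptions, each spectrum collapses to a single PRE value, or to a short vector of segment-wise values with `ds_pre`. A vector of that kind could plausibly serve as the feature input to a k-nearest neighbor classifier, as the abstract describes.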
Subjects

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms Language: En Publication year: 2021 Document type: Article