ABSTRACT
Time series are an extremely abundant data type arising in many areas of scientific research, including the biological sciences. Any method that compares time series data relies on a pairwise distance between trajectories, and the choice of distance measure determines the accuracy and speed of the comparison. This paper introduces an optimal transport type distance for comparing time series trajectories that may lie in spaces of different dimensions and/or have differing numbers of points, possibly unequally spaced along each trajectory. The construction is based on a modified Gromov-Wasserstein distance optimization program, reducing the problem to a Wasserstein distance on the real line. The resulting program has a closed-form solution and can be computed quickly due to the scalability of the one-dimensional Wasserstein distance. We discuss theoretical properties of this distance measure and empirically demonstrate its performance on several datasets with a range of characteristics commonly found in biologically relevant data. We also use the proposed distance to show that averaging oscillatory time series trajectories with the recently proposed Fused Gromov-Wasserstein barycenter retains more characteristics in the averaged trajectory than traditional averaging does, which demonstrates the applicability of Fused Gromov-Wasserstein barycenters to biological time series. Fast and user-friendly software for computing the proposed distance and related applications is provided. The proposed distance allows fast and meaningful comparison of biological time series and can be used efficiently in a wide range of applications.
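The closed-form, one-dimensional Wasserstein step mentioned in the abstract can be illustrated with a minimal sketch. The function `wasserstein_1d` below computes the p-Wasserstein distance between two uniformly weighted samples on the real line by matching sorted values, which is the standard closed form in one dimension. The helper `step_lengths`, which turns a trajectory into a one-dimensional sample of consecutive step lengths, is a hypothetical choice made only for illustration; the abstract does not specify how trajectories are mapped to the real line, so this should not be read as the paper's construction.

```python
import math

def step_lengths(trajectory):
    """Hypothetical 1D summary: Euclidean lengths of consecutive steps.

    Works for points of any dimension, so trajectories living in
    different spaces yield comparable one-dimensional samples.
    """
    return [math.dist(a, b) for a, b in zip(trajectory, trajectory[1:])]

def wasserstein_1d(u, v, p=1):
    """p-Wasserstein distance between two uniformly weighted 1D samples.

    In one dimension the optimal coupling matches sorted values, so no
    optimization is needed. Samples may have different sizes.
    """
    u, v = sorted(u), sorted(v)
    n, m = len(u), len(v)
    # Mass is tracked in integer units of 1/(n*m): each point of u
    # carries m units and each point of v carries n units.
    i = j = 0
    rem_u, rem_v = m, n
    total = 0.0
    while i < n and j < m:
        moved = min(rem_u, rem_v)
        total += moved * abs(u[i] - v[j]) ** p
        rem_u -= moved
        rem_v -= moved
        if rem_u == 0:
            i += 1
            rem_u = m
        if rem_v == 0:
            j += 1
            rem_v = n
    return (total / (n * m)) ** (1.0 / p)

# Two trajectories in different dimensions and with different lengths.
traj_2d = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0)]
traj_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0), (1.0, 2.0, 2.0)]
d = wasserstein_1d(step_lengths(traj_2d), step_lengths(traj_3d))
```

Sorting dominates the cost, so the comparison runs in O(n log n) time, which is consistent with the scalability the abstract attributes to the one-dimensional case.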
Subjects
Algorithms, Mathematical Concepts, Time Factors, Biological Models, Software
ABSTRACT
Computational subphenotyping, a data-driven approach to understanding disease subtypes, is a prominent topic in medical research. Numerous ongoing studies are dedicated to developing advanced computational subphenotyping methods for cross-sectional data. However, the potential of time-series data has been underexplored until now. Here, we propose a Multivariate Levenshtein Distance (MLD) that can account for correlation among multiple discrete features in time-series data. Our algorithm has two distinct components: an optimal threshold score that enhances sensitivity in discriminating between pairs of instances, and the MLD itself. We applied the proposed distance metric within the k-means clustering algorithm to derive temporal subphenotypes from time-series data of biomarkers and treatment administrations from 1039 critically ill patients with COVID-19, and compared its effectiveness with that of standard methods. In conclusion, the Multivariate Levenshtein Distance metric is a novel method for quantifying the distance between instances described by multiple discrete features over time, and it demonstrates superior clustering performance among competing time-series distance metrics.
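To make the idea of an edit distance over several discrete features concrete, here is a minimal sketch. It is not the MLD defined by the authors, whose exact formulation and threshold score are not given in the abstract. It assumes each time point is a tuple of discrete feature values and uses a substitution cost equal to the fraction of features that differ, with unit cost for insertions and deletions; the function name `multivariate_edit_distance` and this cost scheme are illustrative assumptions.

```python
def multivariate_edit_distance(a, b):
    """Levenshtein-style distance between two sequences of feature tuples.

    Assumption (not from the paper): substituting one time point for
    another costs the fraction of features whose values differ, while
    inserting or deleting a time point costs 1.
    """
    def sub_cost(x, y):
        return sum(xi != yi for xi, yi in zip(x, y)) / len(x)

    # prev[j] holds the distance between the processed prefix of a
    # and the first j time points of b.
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete x
                curr[j - 1] + 1.0,             # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute x with y
            ))
        prev = curr
    return prev[-1]

# Two patients, each observed at two time points on two discrete features
# (hypothetical labels: a biomarker level and a treatment flag).
patient_a = [("high", "on"), ("low", "on")]
patient_b = [("high", "off"), ("low", "on")]
dist = multivariate_edit_distance(patient_a, patient_b)
```

With single-feature tuples the sketch reduces to the classic Levenshtein distance, which is a convenient sanity check.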
Subjects
COVID-19, Critical Illness, Humans, Time Factors, Cross-Sectional Studies, Algorithms
ABSTRACT
Choosing a suitable gridded climate dataset is a significant challenge in hydro-climatic research, particularly in areas lacking long-term, reliable, and dense records. This study used the most common method, the Perkins skill score (PSS), together with two advanced time series similarity algorithms, short time series distance (STS) and cross-correlation distance (CCD), for the first time to evaluate, compare, and rank five gridded climate datasets, namely, Climate Research Unit (CRU), TERRA Climate (TERRA), Climate Prediction Center (CPC), European Reanalysis V.5 (ERA5), and Climatologies at high resolution for Earth's land surface areas (CHELSA), according to their ability to replicate the in situ rainfall and temperature data in Nigeria. The performance of the methods was evaluated by comparing their rankings with the ranking obtained using compromise programming (CP) based on four statistical criteria in replicating in situ rainfall, maximum temperature, and minimum temperature at 26 locations distributed over Nigeria. Both methods identified CRU as the best gridded climate dataset for Nigeria, followed by CHELSA, TERRA, ERA5, and CPC. The integrated STS values using the group decision-making method for CRU rainfall, maximum temperature, and minimum temperature were 17, 10.1, and 20.8, respectively, while the CCD values for those variables were 17.7, 11, and 12.2, respectively. The CP based on conventional statistical metrics supported the results obtained using STS and CCD. CRU's Pbias was between 0.5 and 1; KGE ranged from 0.5 to 0.9; NSE ranged from 0.3 to 0.8; and NRMSE was between -30 and 68.2, which were much better than the values for the other products. The findings establish the ability of STS and CCD to evaluate the performance of climate data while avoiding complex and time-consuming multi-criteria decision algorithms based on multiple statistical metrics.
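Two of the measures named in the abstract have compact textbook definitions, sketched below under stated assumptions. The Perkins skill score is taken as the summed bin-wise minimum of the relative frequencies of the observed and gridded series, and the short time series distance as the Euclidean norm of the difference between the piecewise slopes of two series sampled at the same times. The bin count, the function names, and the sample values are illustrative assumptions; the study's group decision-making integration and the CCD are not shown.

```python
import math

def perkins_skill_score(obs, sim, n_bins=10):
    """Overlap of two empirical distributions, from 0 (disjoint) to 1 (identical).

    Both series are binned on a shared range; the score is the sum over
    bins of the smaller of the two relative frequencies.
    """
    lo = min(min(obs), min(sim))
    hi = max(max(obs), max(sim))
    if hi == lo:
        return 1.0  # every value is identical, so the overlap is complete
    width = (hi - lo) / n_bins

    def rel_freq(values):
        counts = [0] * n_bins
        for v in values:
            k = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum
            counts[k] += 1
        return [c / len(values) for c in counts]

    return sum(min(p, q) for p, q in zip(rel_freq(obs), rel_freq(sim)))

def sts_distance(x, y, t=None):
    """Short time series distance between two series on shared time stamps.

    Compares the slopes between consecutive samples, so a constant
    offset between the series does not contribute.
    """
    if t is None:
        t = list(range(len(x)))  # assume unit spacing when no times are given
    total = 0.0
    for k in range(len(x) - 1):
        dt = t[k + 1] - t[k]
        total += ((y[k + 1] - y[k]) / dt - (x[k + 1] - x[k]) / dt) ** 2
    return math.sqrt(total)

# Illustrative values only, not data from the study.
station = [0.0, 12.5, 30.1, 8.4, 0.0, 2.2]
gridded = [1.0, 10.0, 27.5, 9.0, 0.5, 3.0]
pss = perkins_skill_score(station, gridded)
sts = sts_distance(station, gridded)
```

A higher PSS and a lower STS both indicate closer agreement with the in situ record, so rankings from the two measures must be oriented consistently before they are compared.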