greylock: A Python Package for Measuring The Composition of Complex Datasets.

Nguyen, Phuc; Arora, Rohit; Hill, Elliot D; Braun, Jasper; Morgan, Alexandra; Quintana, Liza M; Mazzoni, Gabrielle; Lee, Ghee Rye; Arnaout, Rima; Arnaout, Ramy

ArXiv ; 2023 Dec 29.

Article in En | MEDLINE | ID: mdl-39070042

ABSTRACT

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these measures have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and they are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework and, going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. We end with several examples (immunomics, metagenomics, computational pathology, and medical imaging) illustrating greylock's applicability across a range of dataset types and fields.
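To make the two kinds of measure named in the abstract concrete, the following is a minimal NumPy sketch of a frequency-sensitive Hill number and of a similarity-sensitive counterpart in the form commonly attributed to Leinster and Cobbold. It is not greylock's API: the function names `hill_number` and `similarity_sensitive_diversity`, the argument names `p`, `Z`, and `q`, and the dense-matrix approach are all assumptions made for illustration, and the sketch ignores the large-dataset considerations the package is designed to handle.

```python
import numpy as np


def hill_number(p, q):
    """Frequency-sensitive diversity of order q (hypothetical helper, not greylock's API).

    p: non-negative element counts or frequencies; q: viewpoint parameter.
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # normalize to relative frequencies
    p = p[p > 0]             # absent elements contribute nothing
    if np.isclose(q, 1.0):
        # q = 1 is defined as a limit: the exponential of Shannon entropy
        return float(np.exp(-np.sum(p * np.log(p))))
    if np.isinf(q):
        # q = infinity depends only on the most frequent element
        return float(1.0 / p.max())
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))


def similarity_sensitive_diversity(p, Z, q):
    """Similarity-sensitive diversity of order q (hypothetical helper).

    Z is an assumed dense similarity matrix with Z[i, j] in [0, 1] and
    Z[i, i] = 1; Z @ p is the "ordinariness" of each element.
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p = p / p.sum()
    zp = Z @ p               # expected similarity of each element to the dataset
    keep = p > 0
    p, zp = p[keep], zp[keep]
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p * np.log(zp))))
    if np.isinf(q):
        return float(1.0 / zp.max())
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


if __name__ == "__main__":
    freqs = [0.25, 0.25, 0.25, 0.25]
    # Toy similarity matrix (illustrative values only): elements 0 and 1 are near-duplicates
    Z = np.array([
        [1.0, 0.9, 0.0, 0.0],
        [0.9, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    print(hill_number(freqs, q=1))
    print(similarity_sensitive_diversity(freqs, Z, q=1))
```

In this sketch, an identity similarity matrix (no two distinct elements are similar) reduces the similarity-sensitive measure to the ordinary Hill number, while a matrix of all ones (every element identical to every other) gives a diversity of 1 regardless of frequencies.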

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Language: En | Journal: ArXiv | Year: 2023 | Type: Article
