Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review.
Abu Alfeilat, Haneen Arafat; Hassanat, Ahmad B A; Lasassmeh, Omar; Tarawneh, Ahmad S; Alhasanat, Mahmoud Bashir; Eyal Salman, Hamzeh S; Prasath, V B Surya.
Affiliation
  • Abu Alfeilat HA; Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.
  • Hassanat ABA; Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.
  • Lasassmeh O; Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.
  • Tarawneh AS; Department of Algorithm and Their Applications, Eötvös Loránd University, Budapest, Hungary.
  • Alhasanat MB; Department of Geomatics, Faculty of Environmental Design, King Abdulaziz University, Jeddah, Saudi Arabia.
  • Eyal Salman HS; Department of Civil Engineering, Faculty of Engineering, Al-Hussein Bin Talal University, Maan, Jordan.
  • Prasath VBS; Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.
Big Data; 7(4): 221-248, 2019 Dec.
Article in En | MEDLINE | ID: mdl-31411491
ABSTRACT
The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: which of the many available distance and similarity measures should be used with the KNN classifier? This review attempts to answer this question by evaluating the performance (measured by accuracy, precision, and recall) of KNN with a large number of distance measures, tested on a number of real-world data sets, with and without different levels of added noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed nonconvex distance performed best on most data sets compared with the other tested distances. In addition, the performance of KNN with this top-performing distance degraded by only ∼20% even as the noise level reached 90%, and this held for most of the other distances as well. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.
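The dependence on the distance measure described above is easy to see in code: the distance function is a pluggable component of KNN, and swapping it changes which neighbors vote. The following is a minimal illustrative sketch (not the paper's implementation; function names and the toy data are our own, and only two standard distances are shown):

```python
import numpy as np

def euclidean(a, b):
    # L2 distance between two feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # L1 (city-block) distance
    return np.sum(np.abs(a - b))

def knn_predict(X_train, y_train, x, k=3, dist=euclidean):
    # Compute the distance from the query point to every training
    # example, then take a majority vote among the k nearest neighbors.
    distances = [dist(x, xi) for xi in X_train]
    nearest = np.argsort(distances)[:k]
    votes = [y_train[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy usage: two well-separated classes in 2-D
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = [0, 0, 1, 1]
print(knn_predict(X, y, np.array([0.05, 0.1]), k=3))
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3, dist=manhattan))
```

On real, noisy data the ranking of neighbors can differ between distance functions, which is exactly the effect the review quantifies.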
Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Big Data Language: En Journal: Big Data Year: 2019 Type: Article Affiliation country: Jordan