K Nearest Neighbour — Machine Learning (Classification Algorithm)

Pratham Saraf
5 min read · Aug 20, 2022


We can use the K Nearest Neighbour (KNN) method when we wish to categorise things into predetermined classes. Since the approach falls under supervised learning, these classes are already known in advance.

IN LAYMAN’S TERMS, WHAT IS THE KNN ALGORITHM?

Simply put, we are given a training data set. For a new test point, the KNN algorithm finds the training points closest to it and assigns the test point to the class held by the majority of those neighbours.

The K in the KNN method denotes the number of neighbours that are taken into account; it can be any positive whole number.

Take a look at the example below, where the blue squares represent regular products, the red triangles represent anomalous products, and the green circle represents the product we wish to classify. If K is 3 (the inner circle), the product is categorised as an anomaly because the red triangles hold the majority of the vote; if K is 5 (the outer circle), it is categorised as normal because the blue squares now hold the majority.

To sum up, KNN classifies a point by looking at its nearest neighbours and assigning it to whichever class holds the majority among them.
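To make the intuition concrete, here is a minimal from-scratch sketch of the nearest-neighbour majority vote, using NumPy and a tiny toy data set invented purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Distance from the test point to every training point (Euclidean)
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: label 0 = normal product, label 1 = anomaly
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=3))  # predicts 1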

HOW SHOULD DISTANCES BETWEEN POINTS BE CONSIDERED?

This raises another question: how should the distances between the points be measured? There are a few distance measures that are commonly used with the KNN method.

Euclidean Distance: If we consider two points (x₁, y₁) and (x₂, y₂), the Euclidean distance between them is

d = √((x₂ − x₁)² + (y₂ − y₁)²)

This distance, also known as the L₂ norm, is the shortest (straight-line) distance between the two points.

https://hlab.stanford.edu/brian/making7.gif

Manhattan distance: The sum of the absolute differences between the coordinates of the two points, d = |x₂ − x₁| + |y₂ − y₁|, also known as the L₁ norm.

Minkowski distance: For two points it is calculated as

d = (|x₂ − x₁|^p + |y₂ − y₁|^p)^(1/p)

If we set p = 1 it becomes the Manhattan distance, and when p = 2 it becomes the Euclidean distance.
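As a quick check, here is a short NumPy sketch (with two example points picked arbitrarily) showing that the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean distances:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

def minkowski(a, b, p):
    # (sum of |a_i - b_i|^p) ^ (1/p)
    return (np.abs(a - b) ** p).sum() ** (1 / p)

print(minkowski(a, b, 1))             # 7.0, the Manhattan distance
print(minkowski(a, b, 2))             # 5.0, the Euclidean distance
print(np.abs(a - b).sum())            # same result as p = 1
print(np.sqrt(((a - b) ** 2).sum()))  # same result as p = 2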

HOW IS THE VALUE OF K CHOSEN?

The next obvious question is how to choose the value of K to use in the algorithm. Before learning how to pick a good value of K, it is important to understand what a decision boundary is.

https://raw.githubusercontent.com/guoqi228/logistic_regression_matlab/master/fig_2_decision_boundary.png

Consider a data set consisting of circle and plus labels. If we try to generalise the pattern of the classification, we can draw a line or a curve (the blue line in the figure) that separates the majority of the pluses from the circles. This curve guides the algorithm when classifying a new point and is known as the decision boundary.

Now, if we consider another data set whose points are distributed as shown below and take K = 1, the decision boundary will look something like this:

If a test point falls close to the circle at the centre, it will be classified as a circle rather than as a cross, even though crosses are in the majority around that region. Such a model is known as an overfitted model.

At the other extreme, if we choose a very large value of K, we may end up classifying everything as the majority class (squares in this case), because the decision boundary drifts towards whichever label dominates the data set. Such a model is known as an underfitted model, so it is important to select an optimal value of K.
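The effect is easy to see by drawing the boundary for a very small and a very large K. The sketch below uses a synthetic two-class data set from scikit-learn, chosen here only for illustration (it is not the data set discussed above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# A noisy two-class toy data set in two dimensions
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Grid of points covering the feature space, used to draw the boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))

for k in (1, 100):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    plt.title(f"KNN decision boundary, K = {k}")
    plt.show()

With K = 1 the boundary wraps around individual noisy points (overfitting), while a very large K smooths it out so far that it mostly follows the majority class (underfitting).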

Now we may ask: how can we find the perfect K so that our model is neither overfitted nor underfitted?
To answer that, there is no step-by-step procedure for calculating the perfect value of K.

Instead, we go by approximation: we try several candidate values of K and keep the one that performs best.

Let’s understand this with an example.

As usual, let’s start by importing the libraries and then reading the dataset:

# Import the dataset loader, the KNN classifier and the train/test splitter
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the built-in breast cancer dataset
dataset = datasets.load_breast_cancer()

This data set contains features describing tumours, along with a label indicating whether each tumour is malignant or benign.
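If you want a quick look at what was loaded, something like the following (an optional check, not part of the original walkthrough) prints the shape of the data and the class names:

print(dataset.data.shape)         # (569, 30): 569 samples, 30 features
print(dataset.target_names)       # ['malignant' 'benign']
print(dataset.feature_names[:5])  # first few feature names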

X_train, X_test, Y_train, Y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=0)

We then split the data into training and testing sets, holding out 20% of the samples for testing.

clf = KNeighborsClassifier()
clf.fit(X_train, Y_train)

Next, we create a KNeighborsClassifier and fit it using the X_train and Y_train data.

clf.score(X_test, Y_test)

0.9385964912280702

Scoring the classifier on the test set gives an accuracy of about 0.94. By default, KNeighborsClassifier uses K = 5 (the n_neighbors parameter).
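To connect this back to the question of choosing K, one possible way to continue the example (a sketch, not part of the original article) is to try several values of n_neighbors and compare the accuracy of each:

# Try a few candidate values of K and compare accuracy
for k in (1, 3, 5, 7, 9, 15, 25):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, Y_train)
    print(k, clf.score(X_test, Y_test))

In practice the candidate values would usually be compared on a separate validation set or with cross-validation, so that the final test score remains an unbiased estimate.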
