Cross-validation and K Fold Cross Validation

Pratham saraf
4 min read · Aug 26, 2022


This article is a continuation of the previous article on KNN.

Now, to find the optimum value of K, we can take our data and carve a validation set out of the training portion: 60% of the data is used to train the model and a separate 20% is used to evaluate different values of K, so the integrity of the test data is never compromised. This process is known as cross-validation.
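As a rough sketch of this idea (the exact split proportions and the second call to train_test_split are my assumptions for illustration, not part of the original article):

from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
# first hold out 20% of the data as the untouched test set
X_rest, X_test, Y_rest, Y_test = train_test_split(data.data, data.target, test_size = 0.2, random_state = 0)
# then split the remainder so that 60% of the original data trains the model
# and the other 20% acts as a validation set for choosing K (0.25 of 80% = 20%)
X_train, X_val, Y_train, Y_val = train_test_split(X_rest, Y_rest, test_size = 0.25, random_state = 0)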

However, it is also said that the more training data is available, the better the model becomes, and we have already lost 40% of our data to the validation and test sets. To overcome this, we use K fold cross-validation.

Consider a box denoting our training data.

K fold cross-validation says that we can split our training data into P parts, use P - 1 parts for training, and keep the remaining part for testing.

In simpler words, we use one part at a time for testing and the rest for training: if the 1st part is used for testing, then the 2nd part onwards up to the end is used for training; next, the 2nd part is used for testing and the remaining parts for training, and so on.

After calculating the scores for the various parts, we take the average of those scores and consider it the score for that particular value of K.
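For instance, a minimal sketch of this rotation with scikit-learn's KFold (the choice of 5 parts and the dummy data are assumptions for illustration only):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 dummy samples with 2 features each
kf = KFold(n_splits = 5)          # split the data into 5 parts
for train_index, test_index in kf.split(X):
    # in each round, 4 parts are used for training and the remaining 1 for testing
    print("train:", train_index, "test:", test_index)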

Let's see this using an example

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

Again, importing the libraries and datasets that are built into scikit-learn

dataset = datasets.load_breast_cancer()

We will use the breast cancer dataset again

X_train, X_test, Y_train, Y_test = train_test_split(dataset.data, dataset.target, test_size = 0.2, random_state = 0)

Splitting the data into training and test sets

clf = KNeighborsClassifier()
clf.fit(X_train, Y_train)

Using the K-neighbors classifier and fitting it to the data

clf.score(X_test, Y_test)

Upon scoring the algorithm we get a score of

0.93859649122807021

Now, to tune K with cross-validation, we use a for loop that repeatedly scores the data for various values of K and stores the results in lists, so that it is easier to plot them with matplotlib.

x_axis = []
y_axis = []
for i in range(1, 26, 2):
    clf = KNeighborsClassifier(n_neighbors = i)
    score = cross_val_score(clf, X_train, Y_train)
    x_axis.append(i)
    y_axis.append(score.mean())
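A small note: when no fold count is passed, recent versions of scikit-learn's cross_val_score default to 5-fold cross-validation; the number of parts can be set explicitly through the cv parameter, for example (the value 5 here is just an illustrative choice):

score = cross_val_score(clf, X_train, Y_train, cv = 5)  # explicitly use 5 folds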

We use odd values of K because if, among the neighbours of an unknown point, we found an equal number of points with label A and label B, it would be difficult to classify the point; considering an odd number of neighbours avoids such ties.

import matplotlib.pyplot as plt

plt.plot(x_axis, y_axis)
plt.show()

We import matplotlib and plot the variation of the score against the various values of K.

From this plot, we can see that the best accuracy occurs near K = 7, so we will use K = 7.
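If we prefer to pick K programmatically instead of reading it off the plot, a small sketch along these lines would work (reusing the x_axis and y_axis lists built above; the variable names are just for illustration):

import numpy as np

best_index = np.argmax(y_axis)  # position of the highest mean cross-validation score
best_k = x_axis[best_index]     # the corresponding value of K
print("Best K:", best_k, "with mean CV score:", y_axis[best_index])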

Now let's try to implement this algorithm from scratch ourselves.

Using K fold cross-validation we found that K = 7 gives the best result, so let's run the algorithm with K = 7 and score it on the same random state using scikit-learn.

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

Importing libraries and datasets from sklearn

dataset = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(dataset.data, dataset.target, test_size = 0.2, random_state = 0)

Loading the dataset and splitting the data into training and testing sets in an 80:20 ratio. Since we will be implementing this algorithm ourselves, we split the data with a fixed random state so we get the same split every time.

clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, Y_train)

Calling the classifier with the value of K set to 7, then calling the fit function to train the model.

clf.score(X_test, Y_test)

Upon scoring the algorithm we get a score of

0.94736842105263153

Limitations of KNN

KNN is an extremely strong algorithm and is also known as a “lazy learner.” It does, however, have the following limitations:

1. Doesn’t function well with huge datasets: Because KNN is a distance-based method, the cost of computing the distance between a new point and every existing point is quite high, which degrades the algorithm’s speed.

2. Doesn’t function well with a large number of dimensions: Same as above. The cost of calculating distance grows expensive in higher dimensional space, affecting performance.

3. Sensitive to outliers and missing data: Because KNN is sensitive to outliers and missing values, we must first impute missing values and remove outliers before using the KNN method.
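For the third point, a minimal sketch of such preprocessing with scikit-learn (the median imputation strategy and the scaling step are my assumptions, not part of the original example):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# fill missing values and put features on a comparable scale before KNN,
# since distance computations are sensitive to both
clf = make_pipeline(SimpleImputer(strategy = "median"),
                    StandardScaler(),
                    KNeighborsClassifier(n_neighbors = 7))

The pipeline can then be fitted and scored exactly like the plain classifier above.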
