Khalida

KNN Project

This project is about applying a simple KNN classification algorithm.

Get the Data

We’ll use the famous iris data set for this project. It’s a small data set with flower features that can be used to attempt to predict the species of an iris flower.

Use the ISLR library to get the iris data set (iris actually ships with base R's datasets package, so it is available either way). Check the head of the iris data frame.

rm(list = ls()) 

cat("\014")  # ctrl+L

library(ISLR)
df_iris = data.frame(iris)
head(df_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(df_iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
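
Before modeling, it can also help to look at each feature's range and confirm there are no missing values; summary() and is.na() cover both:

```r
# Five-number summary plus mean for every column
summary(df_iris)
# Any missing values anywhere in the data?
any(is.na(df_iris))
## [1] FALSE
```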

Standardize Data

Standardizing features is crucial for KNN, because the algorithm relies on distances between points, and features on larger scales would dominate. The iris features happen to all be on the same order of magnitude, but let's standardize anyway for practice, even though it isn't strictly necessary for this data.

Use scale() to standardize the feature columns of the iris dataset.

var(df_iris[,3])
## [1] 3.116278
library(dplyr)
df <- select(df_iris, -5)
feature_scales <- scale(df)

Check that the scaling worked by checking the variance of one of the new columns.

var(feature_scales[,3])
## [1] 1
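
As a quick illustration, scale() with its defaults is just subtracting each column's mean and dividing by its standard deviation; we can confirm that by hand for one column:

```r
# Manual z-score for Petal.Length should match the scale() output exactly
manual <- (df$Petal.Length - mean(df$Petal.Length)) / sd(df$Petal.Length)
all.equal(as.numeric(feature_scales[, "Petal.Length"]), manual)
## [1] TRUE
```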

Join the standardized data with the response/target/label column (the column with the species names).

df2 <- cbind(feature_scales, df_iris[5])

Train and Test Splits

Use the caTools library to split your standardized data into train and test sets. Use a 70/30 split.

library(caTools)
set.seed(42)
sample <- sample.split(df2$Species, SplitRatio = .70)
train <- subset(df2, sample == TRUE)
test <- subset(df2, sample == FALSE)
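
Before modeling, it's worth a quick sanity check: sample.split stratifies on the label, so each species should be split 70/30, giving 105 training rows and 45 test rows in total.

```r
# Row counts for the 70/30 split of 150 observations
nrow(train)
## [1] 105
nrow(test)
## [1] 45
# Each species should contribute equally to the training set
table(train$Species)
```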

Build a KNN model.

Call the class library.

library(class)

Use the knn function to predict the Species of the test set, with k = 1.

predicted_species <- knn(train[1:4],test[1:4],train$Species,k=1)
predicted_species
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  versicolor virginica 
## [37] virginica  virginica  versicolor versicolor virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

The misclassification rate:

mean(test$Species != predicted_species)
## [1] 0.06666667
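
A confusion matrix gives a more detailed view than the overall error rate; from the predictions above, the only mistakes are virginica flowers labeled as versicolor:

```r
# Rows are predicted labels, columns are the true labels
table(Predicted = predicted_species, Actual = test$Species)
```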

Choosing a K Value

Although our data set is quite small for us to really get a feel for choosing a good K value, let's practice. Create a plot of the error (misclassification) rate for k values ranging from 1 to 10.

prediction <- NULL
error_rate <- NULL

for(i in 1:10){
    set.seed(42)
    prediction <- knn(train[1:4],test[1:4],train$Species,k=i)
    error_rate[i] <- mean(test$Species != prediction)
}

Load ggplot2 into memory:

library(ggplot2)
k_df <- data.frame(k = 1:10, error_rate = error_rate)
pl <- ggplot(k_df, aes(x = k, y = error_rate)) + geom_point()
pl + geom_line(lty = 6, color = 'blue')

We notice that the error drops to its lowest for k values between 2 and 6, then begins to jump back up again; this is due to how small the data set is. At k = 10 you begin to approach setting k to roughly 10% of the training data, which is quite large.
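
The best k can also be read off programmatically instead of from the plot; which.min returns the first k that achieves the minimum error:

```r
# Smallest k with the lowest misclassification rate, and that rate
which.min(error_rate)
min(error_rate)
```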