This project implements a simple k-nearest neighbors (KNN) classifier.
We’ll use the famous iris data set. It’s a small data set of flower measurements (sepal and petal length and width) that can be used to predict the species of an iris flower.
rm(list = ls())  # remove all objects from the workspace
cat("\014")      # clear the console (same as Ctrl+L in RStudio)
Load the iris data set and check the head of the iris data frame. (iris ships with base R's datasets package, so it is available even without ISLR.)
library(ISLR)
df_iris = data.frame(iris)
head(df_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(df_iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
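Before modelling, it can help to confirm that the classes are balanced and that nothing is missing. A minimal check using only base R (the variable names `class_counts` and `has_missing` are introduced here for illustration):

```r
# Count the observations per species; iris has 50 flowers of each
class_counts <- table(iris$Species)
class_counts

# TRUE if any cell in the data frame is NA
has_missing <- anyNA(iris)
has_missing
```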
Standardizing features is important for KNN because the algorithm classifies by distance, so a feature measured on a larger scale would dominate the result. The iris features are all of the same order of magnitude, so standardization is not strictly necessary for this data, but let's go ahead and do it anyway.
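To see why scale matters for a distance-based method, here is a small sketch. It assumes, purely for illustration, that one feature had been recorded in millimetres instead of centimetres; `x_cm`, `x_mixed`, and the result names are hypothetical.

```r
x_cm <- iris[, 1:4]
x_mixed <- x_cm
x_mixed$Sepal.Length <- x_mixed$Sepal.Length * 10  # same measurement, expressed in mm

# Raw Euclidean distances between the first three flowers depend on the unit
d_cm <- dist(x_cm[1:3, ])
d_mixed <- dist(x_mixed[1:3, ])
d_cm
d_mixed

# After standardizing, the unit no longer matters: both versions give the same values
same_after_scaling <- isTRUE(all.equal(scale(x_cm), scale(x_mixed),
                                       check.attributes = FALSE))
same_after_scaling
```

Because KNN picks neighbours by distance, the unscaled version would let the millimetre column decide most of the neighbours, while the standardized version treats all four features on equal footing.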
Use scale() to standardize the feature columns of the iris dataset.
var(df_iris[,3])
## [1] 3.116278
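library(dplyr)
df <- select(df_iris, -5)    # drop the Species column (column 5), keeping the four numeric features
feature_scales <- scale(df)  # center each column to mean 0 and scale it to standard deviation 1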
Check that the scaling worked by checking the variance of one of the new columns.
var(feature_scales[,3])
## [1] 1
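The variance of 1 follows from what scale() does by default: it subtracts the column mean and divides by the column standard deviation. A quick sketch that reproduces one column by hand (`z_manual` and `z_scale` are names chosen here for illustration):

```r
x <- iris$Petal.Length

# Manual z-score: (value - mean) / standard deviation
z_manual <- (x - mean(x)) / sd(x)

# The same column as produced by scale() with its default arguments
z_scale <- scale(iris[, 1:4])[, "Petal.Length"]

all.equal(z_manual, unname(z_scale))
c(mean = mean(z_manual), var = var(z_manual))
```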
Join the standardized data with the response/target/label column (the column with the species names).
df2 <- cbind(feature_scales, df_iris[5])
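library(caTools)
set.seed(42)  # make the split reproducible
# 70/30 train/test split that preserves the species proportions;
# named split_mask so it does not mask base::sample()
split_mask <- sample.split(df2$Species, SplitRatio = .70)
train <- subset(df2, split_mask == TRUE)
test <- subset(df2, split_mask == FALSE)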
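For readers without caTools, a stratified 70/30 split can be sketched in base R. This is an illustrative alternative, not a description of how sample.split() works internally, and it will not select the same rows; `train_demo`, `test_demo`, and `idx` are hypothetical names.

```r
set.seed(42)

# Group the row indices by species, then draw 70% of the rows within each group
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}))

train_demo <- iris[idx, ]
test_demo <- iris[-idx, ]

# Each species contributes 35 training rows and 15 test rows
table(train_demo$Species)
table(test_demo$Species)
```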
library(class)
Use the knn function to predict the Species of the test set with k = 1.
predicted_species <- knn(train[1:4],test[1:4],train$Species,k=1)
predicted_species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica virginica virginica virginica versicolor virginica
## [37] virginica virginica versicolor versicolor virginica virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
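To make the idea behind knn() concrete, here is a minimal from-scratch sketch for a single query point. The helper `knn_one` is hypothetical and is not the implementation in the class package: it uses Euclidean distance and a simple majority vote, and on a tied vote it takes the first winning level, whereas class::knn breaks ties at random.

```r
# Hypothetical helper: classify one query row by majority vote
# among its k nearest training rows (Euclidean distance).
knn_one <- function(train_x, train_y, query, k = 1) {
  train_x <- as.matrix(train_x)
  query <- as.numeric(query)
  # Distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  # Count the labels of the k nearest rows and return the most frequent one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

x <- scale(iris[, 1:4])
y <- iris$Species

# Classify the first flower, using all the other flowers as training data
label_first <- knn_one(x[-1, ], y[-1], x[1, ], k = 3)
label_first
```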
The misclassification rate:
mean(test$Species != predicted_species)
## [1] 0.06666667
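A confusion matrix shows where the errors fall. The sketch below rebuilds the label vectors from the printed output above rather than rerunning the model. It assumes the 45 test rows are ordered by species with 15 per class, which is what a stratified 70/30 split of 50 flowers per species gives when subset() keeps row order. Under that assumption, positions 35, 39, and 40 are virginica flowers predicted as versicolor.

```r
species <- c("setosa", "versicolor", "virginica")

# Assumed true labels of the test set: 15 flowers per species, in order
actual <- factor(rep(species, each = 15), levels = species)

# Predictions as printed above: identical except at positions 35, 39 and 40
predicted <- actual
predicted[c(35, 39, 40)] <- "versicolor"

conf_mat <- table(actual = actual, predicted = predicted)
conf_mat

# The misclassification rate is the share of off-diagonal counts
error_demo <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
error_demo
```

With the real objects, `table(test$Species, predicted_species)` gives the same kind of table directly.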
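# Repeat the prediction for k = 1 to 10 and record the misclassification rate
error_rate <- numeric(10)
for (i in 1:10) {
  set.seed(42)  # knn() breaks ties at random, so fix the seed for reproducibility
  prediction <- knn(train[1:4], test[1:4], train$Species, k = i)
  error_rate[i] <- mean(test$Species != prediction)
}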
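The same search over k can be written with sapply(), which returns the error vector directly and makes it easy to pick the k with the lowest error. This is a self-contained sketch: it assumes a plain random 70/30 split rather than the caTools split used above, so its numbers will differ, and `error_by_k`, `best_k`, and `train_idx` are names introduced here.

```r
library(class)  # recommended package that ships with standard R installations

set.seed(42)
x <- scale(iris[, 1:4])
y <- iris$Species

# Illustrative split: 105 random training rows, the remaining 45 for testing
train_idx <- sample(nrow(x), size = 105)

k_values <- 1:10
error_by_k <- sapply(k_values, function(k) {
  pred <- knn(x[train_idx, ], x[-train_idx, ], cl = y[train_idx], k = k)
  mean(pred != y[-train_idx])
})

# Smallest k that achieves the lowest error rate
best_k <- k_values[which.min(error_by_k)]
```

Note that choosing k on the same test set used to report the error tends to give an optimistic estimate; a separate validation set or cross-validation is the usual remedy.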
Load ggplot2 and plot the error rate against k.
library(ggplot2)
error_df <- data.frame(k = 1:10, error_rate = error_rate)  # one row per value of k
pl <- ggplot(error_df, aes(x = k, y = error_rate)) + geom_point()
pl + geom_line(lty = 6, color = 'blue')