---
title: "Example for Data Analysis"
vignette: >
  %\VignetteIndexEntry{Example for Data Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  rmarkdown::html_document:
    number_sections: false
    fig_caption: true
    toc: false
    theme: cosmo
    highlight: tango
---
```{r include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = " ", fig.width = 7, fig.height = 7, fig.align = "center")
```
```{r include = FALSE}
library(liver)
library(ggplot2)
library(pROC)
```
The **liver** package provides a suite of helper functions and a collection of datasets used in the book [Data Science Foundations and Machine Learning with R: From Data to Decisions](https://book-data-science-r.netlify.app). It is designed to make data science techniques accessible to people with minimal coding experience. This vignette demonstrates the package's functionality using the *churn_mlc* dataset, which ships with the package. For more examples and details, please refer to the book.
```{r}
data(churn_mlc)
str(churn_mlc)
```
The output shows that *churn_mlc* is a `data.frame` with `r ncol(churn_mlc)` variables and `r nrow(churn_mlc)` observations.
# Partitioning the dataset
We randomly partition the *churn_mlc* dataset into two groups: a training set (80%) and a test set (20%), using the `partition` function from the *liver* package:
```{r}
set.seed(42)
splits = partition(data = churn_mlc, ratio = c(0.8, 0.2))
train_set = splits$part1
test_set = splits$part2
test_labels = test_set$churn
```
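Under the hood, a random split like this amounts to sampling row indices without replacement. Here is a minimal base-R sketch of an 80/20 split on a hypothetical row count `n` (illustrative only, not the internals of `partition`):

```{r}
# Sketch of an 80/20 random split on n rows (illustrative only)
set.seed(42)
n = 100
train_idx = sample(seq_len(n), size = round(0.8 * n))  # 80 training rows
test_idx  = setdiff(seq_len(n), train_idx)             # the remaining 20 rows
c(length(train_idx), length(test_idx))
```

Because the indices are sampled without replacement and the test set is the complement, no row appears in both parts.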
# Classification by kNN algorithm
The *churn_mlc* dataset has `r ncol(churn_mlc) - 1` predictors along with the target variable `churn`. Here we use the following predictors:
`account_length`, `voice_plan`, `voice_messages`, `intl_plan`, `intl_mins`, `day_mins`, `eve_mins`, `night_mins`, and `customer_calls`.
First, based on the above predictors, we find the k-nearest neighbors for the test set, using the training set, with k = 6:
```{r}
formula = churn ~ account_length + voice_plan + voice_messages + intl_plan + intl_mins +
day_mins + eve_mins + night_mins + customer_calls
predict_knn = kNN(formula, train = train_set, test = test_set, k = 6)
```
To report the confusion matrix:
```{r, fig.align = 'center', fig.height = 3, fig.width = 3}
conf.mat(predict_knn, test_labels)
conf.mat.plot(predict_knn, test_labels)
```
To report the Mean Squared Error (MSE):
```{r}
mse(predict_knn, test_labels)
```
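For intuition, the MSE is simply the average squared difference between predicted and actual values. A minimal sketch on toy 0/1 vectors (using a hypothetical helper `mse_manual`, not the *liver* implementation):

```{r}
# MSE = mean((predicted - actual)^2), shown on toy 0/1 vectors
mse_manual = function(pred, actual) mean((pred - actual)^2)
mse_manual(c(1, 0, 1, 1), c(1, 1, 1, 0))  # (0 + 1 + 0 + 1) / 4 = 0.5
```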
# Classification by kNN algorithm with data transformation
The predictors used in the previous part are not on the same scale. For example, the variable `day_mins` ranges from `r min(churn_mlc$day_mins)` to `r max(churn_mlc$day_mins)`, whereas the variable `voice_plan` is binary. In this case, the values of `day_mins` will overwhelm the contribution of `voice_plan` in the distance calculation. To avoid this, we apply min-max normalization to the predictors by setting `scaler = "minmax"`:
```{r}
predict_knn_trans = kNN(formula, train = train_set, test = test_set, k = 6, scaler = "minmax")
```
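For clarity, min-max normalization rescales each numeric predictor to the interval [0, 1] via (x - min(x)) / (max(x) - min(x)). A minimal sketch of the transformation itself (a hypothetical helper, not the scaler used internally by `kNN`):

```{r}
# Min-max normalization: rescale x to [0, 1]
minmax = function(x) (x - min(x)) / (max(x) - min(x))
minmax(c(2, 5, 8, 11))
```

After this rescaling, a wide-ranging variable such as `day_mins` and a binary variable such as `voice_plan` contribute on comparable scales to the kNN distance.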
To compare the confusion matrices of the two models:
```{r fig.show = "hold", fig.align = 'default', out.width = "46%"}
conf.mat.plot(predict_knn_trans, test_labels)
conf.mat.plot(predict_knn, test_labels)
```
To report the ROC curve, we need the predicted class probabilities, which we obtain by setting `type = "prob"`:
```{r}
prob_knn = kNN(formula, train = train_set, test = test_set, k = 6, type = "prob")[, 1]
prob_knn_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 6, type = "prob")[, 1]
```
To compare model performance on the raw versus the transformed data, we plot the ROC curves together with the AUC (Area Under the Curve), using the `roc` and `ggroc` functions from the **pROC** package:
```{r, message = F, fig.align = "center"}
roc_knn = roc(test_labels, prob_knn)
roc_knn_trans = roc(test_labels, prob_knn_trans)
ggroc(list(roc_knn, roc_knn_trans), linewidth = 0.8) +
theme_minimal() + ggtitle("ROC plots with AUC") +
scale_color_manual(values = c("red", "blue"),
labels = c(paste("AUC=", round(auc(roc_knn), 3), "; Raw data; "),
paste("AUC=", round(auc(roc_knn_trans), 3), "; Transformed data"))) +
theme(legend.title = element_blank()) +
theme(legend.position = c(.7, .3), text = element_text(size = 17)) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```
# Optimal value of k for the kNN algorithm
To find the optimal value of `k` based on *Accuracy*, we run the k-nearest neighbors algorithm for values of k from 1 to 30 and compute the *Accuracy* of each model with the `kNN.plot()` function:
```{r fig.align = "center" }
kNN.plot(formula, train = train_set, ratio = c(0.7, 0.3), scaler = "minmax",
k.max = 30, set.seed = 3)
```
The plot shows that *Accuracy* is maximized at k = 6; higher values of *Accuracy* indicate better predictions.