Naive Bayes Classifier

Machine Learning Supervised Learning

When the “naive” one outperform the conventional

Jasper Lok true
09-04-2021

Photo by Shiva Smyth from Pexels

Conventionally, logistic regression is often used to solve classification problems.

In this post, I will explore an alternative popular machine learning algorithms, i.e. naive bayes classifier to solve classification problems.

Before diving into naive bayes classifier, let’s understand the difference between discriminative model and generative model.

Discriminative Model vs Generative Model

Discriminative models directly model the posterior probability \(Pr(Y|X)\). One good example of such model is logistic regression models.

For generative models, instead of directly model the posterior probability, this type of model will model the join distribution of X & Y before predicting the posterior probability given as \(Pr(Y|X)\) (Ottesen 2017). Naive bayes classifier is an example of such model.

Naive Bayes Classifier

In this algorithm, the prior knowledge is being included when making predictions. Essentially the idea for this algorithm is given the features we observed in the predictors, what is likelihood the ‘y’ belong to the particular class?

Following is the mathematical expression for naive bayes classifier obtained from Scikit-learn (scikit-learn):

\[\hat{y} = arg max_y P(y)\prod_{i=1}^n P(x_i|y)\]

One of the key assumptions in Naive Bayes classifier is all the predictors are assumed to be independent from one another (Kuhn and Johnson 2013). This assumption has effectively simplified the calculations, allowing one to multiply the conditional probabilities together as shown in the mathematical expression above.

Although the predictors are unlikely to be independent from one another, this model has a record of decent model performance.

Nevertheless, as what Dr Brownlee suggested in his post of tactics to combat imbalanced classes (Brownlee, n.d.), it is probably worthwhile to try different machine learning algorithms, instead of always sticking to our favorite algorithm. This would enable us in selecting the algorithm has a higher accuracy.

Discrim R Package

In this demonstration, naive bayes wrapper function from discrim package will be used.

Discrim packages contains various discriminant analysis models, including linear discriminant models and Naive Bayes Classifier.

To contrast the model performance, I will be using logistic_reg function from parnsip package to build logistic regression models.

Demonstration

For the dataset, I will be using a travel insurance dataset from Kaggle.

Photo by Annie Spratt on Unsplash

Setup the environment

As usual, I will start by calling all the relevant packages I would need in the analysis later on.

packages <- c('tidyverse', 'readr', 'skimr', 'tidymodels', 'discrim', 
              'naivebayes', 'glmnet', 'tictoc', 'vip', 'shapr', 
              'DALEXtra', 'funModeling', 'plotly', 'readxl', 'ggmosaic')

for(p in packages){
  if(!require (p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Import Data

I will import the dataset obtained from Kaggle website.

df <- read_csv("data/travel insurance.csv") %>%
  rename("Commission" = "Commision (in value)")

Data Checking & Wrangling

As usual, I will start with some basic data quality checking.

skim(df)
Table 1: Data summary
Name df
Number of rows 63326
Number of columns 11
_______________________
Column type frequency:
character 7
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Agency 0 1.00 3 3 0 16 0
Agency Type 0 1.00 8 13 0 2 0
Distribution Channel 0 1.00 6 7 0 2 0
Product Name 0 1.00 9 36 0 26 0
Claim 0 1.00 2 3 0 2 0
Destination 0 1.00 4 42 0 149 0
Gender 45107 0.29 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Duration 0 1 49.32 101.79 -2 9 22.00 53.00 4881.0 ▇▁▁▁▁
Net Sales 0 1 40.70 48.85 -389 18 26.53 48.00 810.0 ▁▇▁▁▁
Commission 0 1 9.81 19.80 0 0 0.00 11.55 283.5 ▇▁▁▁▁
Age 0 1 39.97 14.02 0 35 36.00 43.00 118.0 ▁▇▂▁▁

Gender

There is excessive missing value for Gender, i.e. only about 28.8% of the data contains values in Gender column. Hence, I will drop this columns since this variable is unlikely to be going to be meaningful in explaining the target variable.

Destinations

There are about 149 unique values under destinations. This could be an issue when we use this feature to build machine learning model as there are too many unique categories.

To further group the destinations, I have extracted the mapping between destinations and continents from internet and imported the mapping results into the environment. This allows me to join with the existing dataset.

Below is code chunk I have imported into the environment and performed a left join with the original dataset:

destination_state <- read_excel("data/Destination_Continent_Mapping.xlsx")

df <- df %>%
  left_join(destination_state, by = c("Destination" = "Destination"))

Next, I use bar chart to plot the claim proportion by continent.

ggplot(df, aes(Continent, fill = Claim)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Claim Proportion by Continents")

As the proportion of claim policies is rather low, it is hard to compare the claim proportion across different continents.

df %>%
  group_by(Continent, Claim) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = Claim, names_prefix = "Claim_", values_from = count) %>%
  mutate(total = Claim_No + Claim_Yes,
         Claim_perc = Claim_Yes / total * 100)
# A tibble: 6 x 5
# Groups:   Continent [6]
  Continent     Claim_No Claim_Yes total Claim_perc
  <chr>            <int>     <int> <int>      <dbl>
1 Africa             298         5   303      1.65 
2 Asia             49596       771 50367      1.53 
3 Europe            4991        63  5054      1.25 
4 North America     3041        45  3086      1.46 
5 Oceania           4234        42  4276      0.982
6 South America      239         1   240      0.417

Product Name

I will plot the claim proportion by product names.

ggplot(df, aes(x = `Product Name`, fill = Claim)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Claim Proportion by Product Name")

There are too many categories under product name. This would affect the performance of machine learning models if we use this variable to build models.

Hence, I will further group the product names so that this feature does not have too many unique categories.

As I am unable to find any further information on the product, hence I have kept products with top 90% of the sales as their original naming and renamed the remaining products as ‘Others’.

Once the product names are being recoded, I will join the recoded product names back to the original dataset.

df_prod <- df %>%
  group_by(`Product Name`) %>%
  summarise(count = n(),
            claim_ind = sum(Claim == "Yes")) %>%
  mutate(claim_perc = claim_ind/count) %>%
  arrange(desc(count)) %>%
  mutate(claim_cum_perc = cumsum(count)/sum(count),
         product_name_recoded = case_when(claim_cum_perc > 0.9 ~ "Others",
                                          TRUE ~ as.character(`Product Name`))) %>%
  select(-c("claim_perc", "claim_cum_perc", "claim_ind", "count")) %>%
  ungroup()

df <- df %>%
  left_join(df_prod, by = c("Product Name" = "Product Name"))
df %>%
  filter(Duration < 1000) %>%
  ggplot(aes(product_name_recoded, fill = Claim)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Claim Proportion of Product Name Recoded")

As shown above, we can see that claim proportion for Bronze Plan and Others are higher than other plans.

Duration

Below is the duration density plot:

duration_mean <- df %>%
  summarise(duration_mean = mean(Duration))

ggplot(df, aes(Duration)) +
  geom_density() +
  geom_vline(data = duration_mean, aes(xintercept = duration_mean), linetype="dashed", size=0.5) +
  annotate("text", x = 800, y = 0.025, label = paste0("Mean of Duration: ", round(duration_mean$duration_mean, 2))) +
  labs(title = "Density of Duration")

As shown in the graph, most of the trips were very short trip.

If we were removed any data points with duration longer than 1000 and plot duration density between claim and no claim policies, following is the chart:

duration_mean_claim <- df %>%
  filter(Duration < 1000, Claim == "Yes") %>%
  summarise(mean_dur = mean(Duration),
            mean_dur_log = log(mean_dur + 1))

duration_mean_Noclaim <- df %>%
  filter(Duration < 1000, Claim == "No") %>%
  summarise(mean_dur = mean(Duration),
            mean_dur_log = log(mean_dur + 1))
  
  
df %>%
  filter(Duration < 1000) %>%
  ggplot(aes(x = log(Duration + 1), color = Claim)) +
  geom_density(alpha = 0.2) + 
  geom_vline(data = duration_mean_claim, aes(xintercept = mean_dur_log), linetype="dashed", size=0.5) +
  annotate("text", x = 5.65, y = 0.32, label = "Average Duration - ") +
  annotate("text", x = 5.65, y = 0.3, label = "Claim Policies") +
  geom_vline(data = duration_mean_Noclaim, aes(xintercept = mean_dur_log), linetype="solid", size=0.5) +
  annotate("text", x = 3, y = 0.03, label = "Average Duration - ") +
  annotate("text", x = 3, y = 0.01, label = "No Claim Policies") +
  labs(title = "Density of Duration between Claim and No Claim Policies")

It seems like the average duration of the policies with no claim is shorter than the average duration of the duration with claim. This is logical since the duration is likely to represent the coverage duration of the travel insurance. Typically the coverage of the travel insurance starts from the day the insurance is purchased till the day the trip is being completed. Therefore, with a higher duration, it is more likely the customers would make a claim during the coverage period.

Meanwhile, I do note that there are negative values within this variable, which the values do not seem to be reasonable. Therefore, I will remove them from the dataset.

Sales

In the earlier data summary, there are some policies with sales with zero or even negative. This is not logical as the sales of insurance policies would be positive. As lack of information, I would remove these policies from the dataset so that it will not affect the model performance.

netSales_mean_claim <- df %>%
  filter(Claim == "Yes",
         `Net Sales` > 0) %>%
  summarise(netSales_mean_claim = mean(`Net Sales`))

netSales_mean_Noclaim <- df %>%
  filter(Claim == "No",
         `Net Sales` > 0) %>%
  summarise(netSales_mean_Noclaim = mean(`Net Sales`))

df %>%
  filter(`Net Sales` > 0) %>%
  ggplot(aes(`Net Sales`, color = Claim)) +
  geom_density() +
  geom_vline(data = netSales_mean_claim, aes(xintercept = netSales_mean_claim), linetype="dashed", size=0.5) +
  annotate("text", x = 240, y = 0.03, label = "Average Sales - Claim Policies") +
  geom_vline(data = netSales_mean_Noclaim, aes(xintercept = netSales_mean_Noclaim), linetype="solid", size=0.5) +
  annotate("text", x = 200, y = 0.02, label = "Average Sales - No Claim Policies") +
  labs(title = "Density of Sales between Claim and No Claim Policies")

Commission

Similarly, I will plot the commission distribution between claim and no claim policies.

comm_mean_claim <- df %>%
  filter(Claim == "Yes") %>%
  summarise(comm_mean_claim = mean(Commission))

comm_mean_Noclaim <- df %>%
  filter(Claim == "No") %>%
  summarise(comm_mean_Noclaim = mean(Commission))

df %>%
  ggplot(aes(Commission, color = Claim)) +
  geom_density() +
  geom_vline(data = comm_mean_claim, aes(xintercept = comm_mean_claim), linetype="dashed", size=0.5) +
  annotate("text", x = 80, y = 0.1, label = "Average Commission - Claim Policies") +
  geom_vline(data = comm_mean_Noclaim, aes(xintercept = comm_mean_Noclaim), linetype="solid", size=0.5) +
  annotate("text", x = 100, y = 0.15, label = "Average Commission - No Claim Policies") +
  labs(title = "Density of Commission between Claim and No Claim Policies")

From the graph, the average commission for policies that have made a claim is lower than the average commission for policies that does not make a claim.

Age

age_mean_claim <- df %>%
  filter(Claim == "Yes") %>%
  summarise(age_mean_claim = mean(Age))

age_mean_Noclaim <- df %>%
  filter(Claim == "No") %>%
  summarise(age_mean_Noclaim = mean(Age))

df %>%
  ggplot(aes(Age, color = Claim)) +
  geom_density() +
  geom_vline(data = age_mean_claim, aes(xintercept = age_mean_claim), linetype="dashed", size=0.5) +
  annotate("text", x = 20, y = 0.1, label = "Average Age - ") +
  annotate("text", x = 20, y = 0.088, label = "Claim Policies") +
  geom_vline(data = age_mean_Noclaim, aes(xintercept = age_mean_Noclaim), linetype="solid", size=0.5) +
  annotate("text", x = 68, y = 0.15, label = "Average Age - No Claim Policies") +
  labs(title = "Density of Age between Claim and No Claim Policies")

Unlike the previous numerical variables, it seems like the average age between claim and no claim policies is quite minimal.

Also, it seems like there are weird ages within the dataset as well. There are 984 policies with age 118, which seem to be what we know as ‘magic value’ in feature engineering. One possible explanation for this is the system will default the age to be 118 if the age is not being input into the system.

df %>%
  filter(Age > 100) %>%
  tally()
# A tibble: 1 x 1
      n
  <int>
1   984

As I do not have any further information to estimate the ages for these policies, I will remove them from the dataset before building the model.

Data Cleaning

Before building the model, I will proceed and clean the data.

df_1 <- df %>%
  select(-c(Gender, `Product Name`, Destination)) %>%
  filter(`Net Sales` > 0,
         Duration > 0,
         Age < 100) %>%
  rename_with(~gsub(" ", "_", .x, fixed = TRUE))

Model Building

Below is the generic parameters set for this analysis:

set.seed(1234)

prop_split <- 0.6

grid_num <- 5

In this section, I will split the data into training and testing dataset. Once this is done, I will prepare the dataset for cross validation later.

df_split <- initial_split(df_1, prop = prop_split, strata = Claim)
df_train <- training(df_split)
df_test <- testing(df_split)

df_folds <- vfold_cv(df_train, strata = Claim)

Next, I will define the formula and dataset to be used to train the model.

gen_recipe <- recipe(Claim ~ ., data = df_train)

Okay, let’s start the analysis by using our classic machine learning model for classification problem - logistic regression!

Logistic Regression

To record how long it takes to build a logistic regression model, I use tic & toc functions from tictoc package to capture the time spent in building model.

tic("Time to Build Logistic Regression")

First, I will define the data pre-processing steps to be performed.

logit_recipe <- gen_recipe %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

Then, I will define the model to be built. I have also indicated the parameters to be tuned by using tune function to mark the parameters.

logit_spec <- logistic_reg(mixture = tune(), penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

Once this is done, I will chain all the created objects as a workflow.

logit_workflow <- workflow() %>% 
  add_recipe(logit_recipe) %>%
  add_model(logit_spec)

Then, I will start performing cross validation to find the best set of parameters.

logit_grid <- tidyr::crossing(penalty = c(1,2,3,4,5,6,7,8,9,10), mixture = c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1))

logit_tune <- tune_grid(logit_workflow, 
                        resample = df_folds, 
                        grid = logit_grid)

The model is then fitted with the best set of parameters.

logit_fit <- logit_workflow %>%
  finalize_workflow(select_best(logit_tune)) %>%
  last_fit(df_split)

The predicted values are then “collected” so that I could use them to calculate the necessary model performance.

logit_pred <- logit_fit %>%
  collect_predictions()

Once the predicted values are being “collected”, toc function is used to calculate how long it takes to fit the model and make the necessary predictions.

toc()
Time to Build Logistic Regression: 199.63 sec elapsed

Accuracy is typically used to measure classification model.

accuracy(logit_pred, 
         truth = Claim,
         estimate = .pred_class)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.984

Based on the accuracy results, it seems like this logistic model is a pretty decent model given it has such a high accuracy. More than 98% of the policies are being accurately classified.

However, it can be very misleading if our decision is just based on accuracy measurement. Accuracy looks the number of policies are being accurately classified whether they have claimed or not. It does not differentiate the true positive and true negative in this case.

However, in insurance, we are more interested in the policies that have claimed. One way is to compute the confusion matrix for the relevant model.

conf_mat(logit_pred, 
         truth = Claim,
         estimate = .pred_class)
          Truth
Prediction    No   Yes
       No  23539   372
       Yes     0     0

Oh no! According to the results under confusion matrix, our model always predict that the policies will not make a claim. This is bad as this would result in the expected claim severely unstated.

Now let’s us plot the ROC curve.

autoplot(roc_curve(logit_pred, 
                   truth = Claim, 
                   estimate = .pred_No))

The ROC curve for this model lie at the diagonal line, suggesting that this fitted model has no predictive power. It is as good as we take a random guess whether the selected policy will make a claim.

If we were to visualize the claim count by using a histogram, we can see that the proportion of policyholders claimed is very low.

ggplot(df_1, aes(Claim)) +
  geom_histogram(stat = "count")

Naive Bayes Classifier

Okay, let’s move on and build a naive bayes classifier model.

First, I will use tic function to indicate where I should start measuring the time taken to build the model.

tic("Time to Build Naive Bayes Classifier")

Then, I will start building the model, same as how I did it for logistic regression.

naive_recipe <- gen_recipe %>%
  step_zv(all_predictors())

naive_spec <- naive_Bayes(smoothness = tune(), Laplace = tune()) %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

naive_workflow <- workflow() %>%
  add_recipe(naive_recipe) %>%
  add_model(naive_spec)

naive_tune <- tune_grid(naive_workflow, 
                        resample = df_folds, 
                        grid = grid_num)

naive_fit <- naive_workflow %>%
  finalize_workflow(select_best(naive_tune)) %>%
  last_fit(df_split)

naive_pred <- naive_fit %>%
  collect_predictions()
toc()
Time to Build Naive Bayes Classifier: 26.5 sec elapsed

Note that the naive bayes classifier takes lesser time to build the model. In fact, it took less time to build naive bayes classifier than build logistic regression.

This is consistent with what we have discussed in the earlier section of this post.

accuracy(naive_pred, 
         truth = Claim,
         estimate = .pred_class)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.941

Similarly, let’s check the confusion matrix results for our naive bayes classifier.

conf_mat(naive_pred, 
         truth = Claim,
         estimate = .pred_class)
          Truth
Prediction    No   Yes
       No  22391   252
       Yes  1148   120

Note that the model no longer always predict policyholders do not claim from their travel insurance.

When we plot the ROC curve, we noted that the ROC curve no longer lie on the diagonal line. This suggests that although the accuracy of this fitted model is lower than logistic regression, this fitted model does have some levels of predictive power.

autoplot(roc_curve(naive_pred, 
                   truth = Claim, 
                   estimate = .pred_No))

Overall, we can see that naive bayes classifier performs better than logistic regression in this data.

One of the possible reason why naive bayes classifier outperforms logistic regression is due to the small dataset. (Ng and Jordan) discussed that how the model performance for logistic regression will outperform naive bayes classifier when the training size reaches infinity.

However, for this scenario, the dataset seems to be too small for logistic regression to converge and outperform naive bayes classifier. The performance of logistic regression model might improve further when we increase the training data.

Conclusion

That’s all for the day! We have finished “sorting” the policies in whether they are expected to claim or does not claim.

Photo by emrecan arık on Unsplash

Thanks for reading the post until the end.

Do check out on the documentation page if you want to find out more on naive bayes classifier.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Brownlee, Jason, ed. n.d. “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset.” https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.
Kuhn, Max, and Kjell Johnson, eds. 2013. Applied Predictive Modeling. Springer.
Ng, Andrew, and Michael Jordan, eds. “On Discriminative Vs. Genererative Classifiers: A Comparison of Logistic Regression and Naive Bayes.” http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf.
Ottesen, Christopher. 2017. “Comparison Between Naïve Bayes and Logistic Regression.” https://dataespresso.com/en/2017/10/24/comparison-between-naive-bayes-and-logistic-regression/.
scikit-learn, ed. “1.9 Naive Bayes.” https://scikit-learn.org/stable/modules/naive_bayes.html.

References