Encoding Categorical Variables Using GLM

Machine Learning Supervised Learning
Jasper Lok https://jasperlok.netlify.app/
04-10-2023

In this post, I will be exploring the GLM encoding method.

Photo by cottonbro studio

What is encoding?

Some of the machine learning models require numerical variables.

For example, the generalized linear model algorithm cannot use categorical variables directly; they must first be converted into a numeric form before the model can be fit.

Therefore, often we need to encode the categorical variables into something that can be used by the machine learning algorithm.

Dummy encoding

One of the common approaches is to create dummy variables/indicators to capture the information in the categorical variables.

Each indicator will capture one level from the categorical variable.

One of the main drawbacks of this encoding method is that the number of indicator columns grows quickly as the number of levels increases.
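To make the expansion concrete, here is a minimal base-R sketch (the factor below is made up for illustration): model.matrix() turns a factor with k levels into k - 1 indicator columns plus the intercept, so high-cardinality variables inflate the design matrix quickly.

```r
# A toy factor with 5 levels
plan <- factor(c("Gold", "Silver", "Bronze", "Gold", "Basic", "Platinum"))

# model.matrix() expands the factor into indicator columns,
# using the first level as the reference group
dummies <- model.matrix(~ plan)

# 5 levels -> intercept + 4 indicator columns = 5 columns
ncol(dummies)
```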

Therefore, we will consider an alternative encoding approach below.

GLM encoding

Under this method, the categorical variables are converted into a set of scores derived from the generalized linear model.
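Conceptually, for a binary outcome this amounts to replacing each level with its estimated log-odds of the outcome. A minimal base-R sketch with made-up data: a no-intercept logistic regression fits one coefficient per level, and each coefficient recovers the within-level log-odds.

```r
set.seed(42)

# Toy data: a categorical predictor and a 0/1 outcome
toy <- data.frame(
  plan  = sample(c("Gold", "Silver", "Bronze"), 300, replace = TRUE),
  claim = rbinom(300, 1, 0.3)
)

# A no-intercept logistic regression gives one coefficient per level
fit <- glm(claim ~ plan - 1, data = toy, family = binomial())

# Each coefficient equals the within-level log-odds of the outcome
p <- tapply(toy$claim, toy$plan, mean)
manual_log_odds <- log(p / (1 - p))

# The two sets of values match (both are in alphabetical level order)
all.equal(unname(coef(fit)), as.numeric(manual_log_odds), tolerance = 1e-6)
```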

I will demonstrate how this can be done by using functions from the tidymodels ecosystem.

Important notes

Issues of Encoding

Kuhn and Johnson (2019) note that an increased likelihood of overfitting is one of the issues with such effect encoding methods.

To overcome this issue, the authors suggest shrinking the per-level estimates toward the overall mean through partial pooling, e.g., via mixed or Bayesian models (available in the embed package as step_lencode_mixed and step_lencode_bayes).

Demonstration

Setup the environment

In this demonstration, I will be using the embed package to perform the encoding.

pacman::p_load(tidyverse, janitor, readr, tidymodels, embed, skimr, themis)

Import Data

I will be using the insurance dataset from Kaggle for this demonstration.

df <- 
  read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2021-08-31-naive-bayes/data/travel%20insurance.csv") %>%
  # rename the column
  rename("Commission" = "Commision (in value)") %>%
  # clean the column names
  clean_names() %>% 
  # mutate the variable to the necessary format
  mutate(claim = factor(claim, levels = c("Yes", "No"))) %>% 
  select(-gender)

For simplicity, I will drop the “Gender” column, since it contains many missing values.

Encoding

Next, I will perform encoding on one of the columns and see how values differ under different encoding approaches.

First, I will define the recipe with GLM encoding.

gen_recipe_glmEcode <-
  recipe(claim ~ . , data = df) %>%
  step_lencode_glm(product_name, outcome = vars(claim)) %>% 
  step_nzv(all_predictors()) 

Note that to indicate the outcome variable, we need to wrap the target variable with the vars() function.

gen_recipe_glmEcode %>% 
  prep(df) %>% 
  tidy(number = 1)
# A tibble: 27 x 4
   level                           value terms        id              
   <chr>                           <dbl> <chr>        <chr>           
 1 1 way Comprehensive Plan        -5.91 product_name lencode_glm_8VU~
 2 2 way Comprehensive Plan        -4.52 product_name lencode_glm_8VU~
 3 24 Protect                     -16.6  product_name lencode_glm_8VU~
 4 Annual Gold Plan                -2.11 product_name lencode_glm_8VU~
 5 Annual Silver Plan              -2.09 product_name lencode_glm_8VU~
 6 Annual Travel Protect Gold      -2.20 product_name lencode_glm_8VU~
 7 Annual Travel Protect Platinum  -2.81 product_name lencode_glm_8VU~
 8 Annual Travel Protect Silver    -3.02 product_name lencode_glm_8VU~
 9 Basic Plan                      -5.47 product_name lencode_glm_8VU~
10 Bronze Plan                     -2.91 product_name lencode_glm_8VU~
# ... with 17 more rows

If I were to build a simple GLM model with the selected encoded variable, below are the coefficients from the model:

coef_glm <-
  glm(claim ~ product_name 
      ,data = df
      ,family = binomial()) %>% 
  tidy()

coef_glm
# A tibble: 26 x 5
   term                              estim~1 std.e~2 statis~3  p.value
   <chr>                               <dbl>   <dbl>    <dbl>    <dbl>
 1 (Intercept)                         5.91    0.334  17.7    3.55e-70
 2 product_name2 way Comprehensive ~  -1.39    0.344  -4.05   5.21e- 5
 3 product_name24 Protect             10.7   153.      0.0698 9.44e- 1
 4 product_nameAnnual Gold Plan       -3.80    0.406  -9.37   7.53e-21
 5 product_nameAnnual Silver Plan     -3.82    0.344 -11.1    1.19e-28
 6 product_nameAnnual Travel Protec~  -3.71    0.472  -7.87   3.46e-15
 7 product_nameAnnual Travel Protec~  -3.10    0.682  -4.54   5.52e- 6
 8 product_nameAnnual Travel Protec~  -2.89    0.611  -4.73   2.25e- 6
 9 product_nameBasic Plan             -0.444   0.394  -1.13   2.60e- 1
10 product_nameBronze Plan            -3.01    0.341  -8.81   1.28e-18
# ... with 16 more rows, and abbreviated variable names 1: estimate,
#   2: std.error, 3: statistic

From the result, we can see that the coefficients of the different levels are similar to those from the encoding step, except that the signs of the coefficients are flipped.

This is likely due to a difference in which outcome level is treated as the “positive” class. It is not a big issue, since a model fitted on the encoded variable will simply learn a coefficient with the opposite sign.

Also, note that the coefficients generated from the GLM are relative to the reference group.

For example, in this demonstration, the category of “1 way Comprehensive Plan” is being used as the reference group.

Hence, to match the coefficient of the encoded variable, we need to add the intercept and the coefficient of the selected level together (apart from the flipped sign noted above).

# intercept + 2 way Comprehensive Plan
coef_glm[1,2] + coef_glm[2,2] 
  estimate
1 4.518108
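The same arithmetic can be verified on a small made-up example: with the default treatment coding, the intercept plus a level's coefficient equals that level's log-odds.

```r
set.seed(123)

# Toy data: a two-level predictor and a 0/1 outcome
toy <- data.frame(
  plan  = sample(c("A", "B"), 400, replace = TRUE),
  claim = rbinom(400, 1, 0.25)
)

# Default treatment coding: "A" is the reference level
fit <- glm(claim ~ plan, data = toy, family = binomial())

# Log-odds of level "B" = intercept + "planB" coefficient
log_odds_B <- unname(coef(fit)["(Intercept)"] + coef(fit)["planB"])

# Compare against the empirical log-odds computed directly from the data
p_B <- mean(toy$claim[toy$plan == "B"])
all.equal(log_odds_B, log(p_B / (1 - p_B)), tolerance = 1e-6)
```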

Below is a plot of the encoded values for each level:

gen_recipe_glmEcode %>% 
  prep(df) %>% 
  tidy(number = 1) %>% 
  ggplot(aes(fct_reorder(level, value), value)) +
  geom_col() +
  xlab("Product Name") +
  coord_flip() +
  theme_minimal()

Note that the encoding includes a new factor level called “..new”.

According to the documentation page, this level represents the average effect and is used to encode levels that were not seen during training.
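In effect, the encoding behaves like a lookup table with a fallback: known levels map to their scores, and anything unseen maps to the “..new” value. A simplified base-R sketch (the scores and fallback value below are made up for illustration):

```r
# Hypothetical level -> score mapping, plus a fallback for unseen levels
scores <- c("Gold" = -2.1, "Silver" = -2.9, "Bronze" = -3.0, "..new" = -2.7)

encode <- function(x, scores) {
  out <- scores[x]                     # look up each level's score
  out[is.na(out)] <- scores[["..new"]] # unseen levels fall back to "..new"
  unname(out)
}

# "Platinum" is unseen, so it receives the "..new" value: -2.1 -2.7 -3.0
encode(c("Gold", "Platinum", "Bronze"), scores)
```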

If we plot the proportion of claims within each product name, we can observe that more negative encoded values correspond to a lower likelihood of claim.

df %>% 
  mutate(product_name = forcats::fct_reorder(.f = product_name, 
                                             .x = claim,
                                             .fun = function(.x) mean(.x == "Yes"),
                                             .desc = FALSE)) %>%
  ggplot(aes(product_name, fill = claim)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  theme_minimal()

This provides the model with the relative likelihood of the outcome across the different levels.

Model Building

For the model building, I will build two models with different ways of handling the categorical variables (i.e., dummy encoding and GLM encoding).

First, I will define the common parameters for the model building.

set.seed(1234)
prop_split <- 0.6
grid_num <- 5
metrics_optim <- metric_set(sens)

Split data into training and testing

Next, I will split the data into training and testing datasets for model fitting later.

df_split <-
  initial_split(df, prop = prop_split, strata = claim)

df_train <- training(df_split)
df_test <- testing(df_split)

df_folds <- vfold_cv(df_train, strata = claim)

Define the Model Recipes

# dummy
xgb_recipe_dummy <-
  recipe(claim ~ . , data = df_train) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_smote(claim)

# glm encoding
xgb_recipe_glmEncode <-
  recipe(claim ~ . , data = df_train) %>%
  step_lencode_glm(all_nominal_predictors(), outcome = vars(claim)) %>% 
  step_nzv(all_predictors()) %>% 
  step_smote(claim)

Model Specifications

Next, I will define the model specification.

xgb_spec <-
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

Workflow

Then, I will create the workflow for two models.

# dummy
xgb_wk_dummy <-
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(xgb_recipe_dummy)

# glm encoding
xgb_wk_glmEncode <-
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(xgb_recipe_glmEncode)

Cross Validation

Once the recipes and models are defined, I will start tuning the models.

# dummy
xgb_tune_dummy <-
  tune_grid(xgb_wk_dummy,
            resamples = df_folds,
            grid = grid_num,
            metrics = metrics_optim)

# glm encoding
xgb_tune_glmEncode <-
  tune_grid(xgb_wk_glmEncode,
            resamples = df_folds,
            grid = grid_num,
            metrics = metrics_optim)

Extract the last fit

Once the parameters are tuned, I will finalize each workflow with its best parameters and use last_fit() to fit on the training data and evaluate on the testing data.

# dummy
xgb_fit_dummy <-
  xgb_wk_dummy %>% 
  finalize_workflow(select_best(xgb_tune_dummy)) %>%
  last_fit(df_split)

# glm encoding
xgb_fit_glmEncode <-
 xgb_wk_glmEncode %>% 
  finalize_workflow(select_best(xgb_tune_glmEncode)) %>%
  last_fit(df_split)

Model Predictions

In this section, I will extract the predictions from the models.

First, I will define the list of models.

model_name_list <- c("dummy", "glmEncode")
model_list <- c("Dummy Encode", "GLM Encode")

Then, I will use a for loop to compute the predictions for the models.

predictions <- tibble()

for (i in seq_along(model_list)) {
  predictions <- 
    predictions %>% 
    bind_rows(
      get(paste0("xgb_fit_", model_name_list[i])) %>% 
        collect_predictions() %>% 
        mutate(model = model_list[i]) %>% 
        relocate(model, .before = id)
    )
}
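As an aside, the same accumulation can be written without growing an object inside the loop, by building the pieces in a list and binding the rows once at the end. A generic base-R sketch of the pattern (the toy data frames below stand in for the collected predictions):

```r
model_list <- c("Dummy Encode", "GLM Encode")

# Toy stand-ins for the per-model prediction tables
preds_by_model <- list(
  data.frame(.pred_class = c("Yes", "No")),
  data.frame(.pred_class = c("No", "No"))
)

# Tag each table with its model name, then bind all rows in one step
predictions <- do.call(
  rbind,
  lapply(seq_along(model_list), function(i) {
    cbind(model = model_list[i], preds_by_model[[i]])
  })
)

predictions
```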

Model Performance

In this section, I will compare the performance of the two models.

I will compute the ROC curve for the two models.

predictions %>%
  group_by(model) %>%
  roc_curve(claim, .pred_Yes) %>%
  autoplot()

Based on the results above, it seems that the performance of the two models is rather similar.

Lastly, I will specify the metrics to be used to measure the model performance.

multi_metric <- 
  metric_set(accuracy, sensitivity, specificity, ppv)

Then, I will calculate the model performances for different models.

predictions %>% 
  group_by(model) %>% 
  multi_metric(truth = claim,
                estimate = .pred_class) %>%
  bind_rows(
    predictions %>%
      group_by(model) %>% 
      roc_auc(truth = claim,
              estimate = .pred_Yes)
  ) %>%
  select(-`.estimator`) %>%
  pivot_wider(names_from = .metric,
              values_from = .estimate)
# A tibble: 2 x 6
  model        accuracy sensitivity specificity    ppv roc_auc
  <chr>           <dbl>       <dbl>       <dbl>  <dbl>   <dbl>
1 Dummy Encode    0.913       0.432       0.921 0.0751   0.819
2 GLM Encode      0.907       0.461       0.913 0.0738   0.809

From the results above, we observe that the two encoding approaches give broadly similar performance: the dummy-encoded model has slightly higher accuracy, specificity, and ROC AUC, while the GLM-encoded model has slightly higher sensitivity.

Nevertheless, the purpose of this post is to explore how to encode categorical variables by using GLM.

Hence I won’t be exploring how to improve the model performance.

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!


References

Kuhn, Max, and Kjell Johnson. 2019. Feature Engineering and Selection: A Practical Approach for Predictive Models. Taylor & Francis Group. https://bookdown.org/max/FES/categorical-supervised-encoding.html.