In this post, I will be exploring the GLM encoding method.

Photo by cottonbro studio
Some machine learning models require numerical inputs.
For example, the generalized linear model algorithm is unable to handle categorical variables when fitting the model.
Therefore, we often need to encode categorical variables into something the machine learning algorithm can use.
One common approach is to create dummy variables (indicators) to capture the information in the categorical variables.
Each indicator captures one level of the categorical variable.
One of the main drawbacks of this encoding method is that the number of indicator columns grows quickly as the number of levels increases.
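As a quick illustration of this drawback, here is a small base-R sketch (the `city` factor is hypothetical) showing that a factor with k levels produces k - 1 indicator columns:

```r
# a hypothetical categorical variable with 10 levels
city <- factor(paste0("city_", 1:10))

# create the dummy/indicator columns (the reference level is dropped)
dummies <- model.matrix(~ city)

# 10 levels produce 9 indicator columns (excluding the intercept)
ncol(dummies) - 1
# returns 9
```

A variable with 100 levels would add 99 columns on its own, which widens the data quickly.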
Therefore, I will explore an alternative encoding method below.
Under this method, the categorical variables are converted into a set of scores derived from a generalized linear model.
I will demonstrate how this can be done using functions from the tidymodels packages.
Kuhn and Johnson (2019) mention that an increased likelihood of overfitting is one of the issues with such effect encoding methods.
To overcome this issue, the authors suggest the following methods:
Using different datasets for training and encoding
Using resampling (e.g., cross-validation)
In this demonstration, I will be using the embed package to perform the encoding.
pacman::p_load(tidyverse, janitor, readr, tidymodels, embed, skimr, themis)
I will be using the insurance dataset from Kaggle for this demonstration.
df <-
  read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2021-08-31-naive-bayes/data/travel%20insurance.csv") %>%
  # rename the column
  rename("Commission" = "Commision (in value)") %>%
  # clean the column names
  clean_names() %>%
  # convert the variable to the necessary format
  mutate(claim = factor(claim, levels = c("Yes", "No"))) %>%
  select(-gender)
For simplicity, I will drop the “Gender” column since it contains many missing values.
Next, I will perform encoding on one of the columns and see how values differ under different encoding approaches.
First, I will define the recipe with GLM encoding.
gen_recipe_glmEcode <-
  recipe(claim ~ ., data = df) %>%
  step_lencode_glm(product_name, outcome = vars(claim)) %>%
  step_nzv(all_predictors())
Note that to indicate the outcome variable, we need to wrap the target variable with the vars function.
gen_recipe_glmEcode %>%
  prep(df) %>%
  tidy(number = 1)
# A tibble: 27 x 4
   level                          value terms        id              
   <chr>                          <dbl> <chr>        <chr>           
 1 1 way Comprehensive Plan       -5.91 product_name lencode_glm_8VU~
 2 2 way Comprehensive Plan       -4.52 product_name lencode_glm_8VU~
 3 24 Protect                    -16.6  product_name lencode_glm_8VU~
 4 Annual Gold Plan               -2.11 product_name lencode_glm_8VU~
 5 Annual Silver Plan             -2.09 product_name lencode_glm_8VU~
 6 Annual Travel Protect Gold     -2.20 product_name lencode_glm_8VU~
 7 Annual Travel Protect Platinum -2.81 product_name lencode_glm_8VU~
 8 Annual Travel Protect Silver   -3.02 product_name lencode_glm_8VU~
 9 Basic Plan                     -5.47 product_name lencode_glm_8VU~
10 Bronze Plan                    -2.91 product_name lencode_glm_8VU~
# ... with 17 more rows
If I were to build a simple GLM model with the selected categorical variable, below are the coefficients from the model:
# A tibble: 26 x 5
term estim~1 std.e~2 statis~3 p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.91 0.334 17.7 3.55e-70
2 product_name2 way Comprehensive ~ -1.39 0.344 -4.05 5.21e- 5
3 product_name24 Protect 10.7 153. 0.0698 9.44e- 1
4 product_nameAnnual Gold Plan -3.80 0.406 -9.37 7.53e-21
5 product_nameAnnual Silver Plan -3.82 0.344 -11.1 1.19e-28
6 product_nameAnnual Travel Protec~ -3.71 0.472 -7.87 3.46e-15
7 product_nameAnnual Travel Protec~ -3.10 0.682 -4.54 5.52e- 6
8 product_nameAnnual Travel Protec~ -2.89 0.611 -4.73 2.25e- 6
9 product_nameBasic Plan -0.444 0.394 -1.13 2.60e- 1
10 product_nameBronze Plan -3.01 0.341 -8.81 1.28e-18
# ... with 16 more rows, and abbreviated variable names 1: estimate,
# 2: std.error, 3: statistic
From the results, we can see that the coefficients of the different levels are similar to the coefficients from the encoding step, except that the signs are flipped.
This is probably due to a difference in which class is deemed the “positive” class under the classification model, which is not a big issue, since the information carried by the encoded variable is unchanged.
Also, note that the coefficients generated by the GLM are relative to the reference group.
For example, in this demonstration, the “1 way Comprehensive Plan” category is used as the reference group.
Hence, to match the coefficient of the encoded variable, we need to add the intercept and the coefficient of the selected level together.
# intercept + 2 way Comprehensive Plan
coef_glm[1,2] + coef_glm[2,2]
estimate
1 4.518108
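As a quick sanity check, plugging in the rounded values printed above confirms that the encoded value is simply the GLM value with the sign flipped:

```r
# intercept + coefficient of "2 way Comprehensive Plan" (from the GLM output above)
glm_value <- 5.91 + (-1.39)

# encoded value of the same level (from the tidy() output above)
encoded_value <- -4.52

# equal up to the sign (and rounding)
isTRUE(all.equal(encoded_value, -glm_value))
# returns TRUE
```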
Below is a plot of the encoded values for each level:
gen_recipe_glmEcode %>%
  prep(df) %>%
  tidy(number = 1) %>%
  ggplot(aes(fct_reorder(level, value), value)) +
  geom_col() +
  xlab("Product Name") +
  coord_flip() +
  theme_minimal()

Note that the encoding includes a new factor level called “..new”.
According to the documentation page, this level represents the average effect and is used when a previously unseen level appears at prediction time.
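For instance, the value assigned to unseen levels can be inspected by filtering the tidied recipe (continuing from the prepped recipe above):

```r
# the "..new" level holds the value assigned to levels unseen during training
gen_recipe_glmEcode %>%
  prep(df) %>%
  tidy(number = 1) %>%
  filter(level == "..new")
```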
If we plot the proportion of claims within each product name, we can observe that a more negative encoded value corresponds to a lower likelihood of claim.
df %>%
  mutate(product_name = forcats::fct_reorder(.f = product_name,
                                             .x = claim,
                                             .fun = function(.x) mean(.x == "Yes"),
                                             .desc = FALSE)) %>%
  ggplot(aes(product_name, fill = claim)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  theme_minimal()

This provides a sense of the “relativity” of the outcome across the different levels.
For the model building, I will build two models with different ways of handling categorical variables (i.e., one-hot encoding and GLM encoding).
First, I will define the common parameters for the model building.
set.seed(1234)
prop_split <- 0.6
grid_num <- 5
metrics_optim <- metric_set(sens)
Next, I will split the data into training and testing datasets for model fitting later.
df_split <-
initial_split(df, prop = prop_split, strata = claim)
df_train <- training(df_split)
df_test <- testing(df_split)
df_folds <- vfold_cv(df_train, strata = claim)
# dummy
xgb_recipe_dummy <-
  recipe(claim ~ ., data = df_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_smote(claim)
# glm encoding
xgb_recipe_glmEncode <-
  recipe(claim ~ ., data = df_train) %>%
  step_lencode_glm(all_nominal_predictors(), outcome = vars(claim)) %>%
  step_nzv(all_predictors()) %>%
  step_smote(claim)
Next, I will specify the model specification.
xgb_spec <-
  boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Then, I will create the workflow for two models.
# dummy
xgb_wk_dummy <-
  workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(xgb_recipe_dummy)
# glm encoding
xgb_wk_glmEncode <-
  workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(xgb_recipe_glmEncode)
Once the recipes and models are defined, I will start tuning the models.
# dummy
xgb_tune_dummy <-
  tune_grid(xgb_wk_dummy,
            resamples = df_folds,
            grid = grid_num,
            metrics = metrics_optim)
# glm encoding
xgb_tune_glmEncode <-
  tune_grid(xgb_wk_glmEncode,
            resamples = df_folds,
            grid = grid_num,
            metrics = metrics_optim)
Once the parameters are tuned, I will extract the parameters that would give us the best fit from each model.
# dummy
xgb_fit_dummy <-
  xgb_wk_dummy %>%
  finalize_workflow(select_best(xgb_tune_dummy)) %>%
  last_fit(df_split)
# glm encoding
xgb_fit_glmEncode <-
  xgb_wk_glmEncode %>%
  finalize_workflow(select_best(xgb_tune_glmEncode)) %>%
  last_fit(df_split)
In the next section, I will extract the predictions from the models.
First, I will define the list of models.
Then, I will use a for loop to compute the predictions for each model.
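The loop itself is not shown in the post; a minimal sketch (assuming the two last_fit() objects above, and using the model labels that appear in the results) could look like this:

```r
# a named list of the fitted models
model_list <- list("Dummy Encode" = xgb_fit_dummy,
                   "GLM Encode" = xgb_fit_glmEncode)

# collect the test-set predictions of each model into one tibble,
# tagging each row with the model name
predictions <- tibble()
for (model_name in names(model_list)) {
  predictions <-
    predictions %>%
    bind_rows(
      model_list[[model_name]] %>%
        collect_predictions() %>%
        mutate(model = model_name)
    )
}
```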
In this section, I will compare the performance of the different models.
First, I will compute the ROC curve for each of the two models.
predictions %>%
  group_by(model) %>%
  roc_curve(claim, .pred_Yes) %>%
  autoplot()

Based on the results above, it seems like the performances of the two models are rather similar.
Lastly, I will specify the metrics to be used to measure the model performance.
multi_metric <-
metric_set(accuracy, sensitivity, specificity, ppv)
Then, I will calculate the model performances for different models.
predictions %>%
  group_by(model) %>%
  multi_metric(truth = claim,
               estimate = .pred_class) %>%
  bind_rows(
    predictions %>%
      group_by(model) %>%
      roc_auc(truth = claim,
              .pred_Yes)
  ) %>%
  select(-`.estimator`) %>%
  pivot_wider(names_from = .metric,
              values_from = .estimate)
# A tibble: 2 x 6
model accuracy sensitivity specificity ppv roc_auc
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Dummy Encode 0.913 0.432 0.921 0.0751 0.819
2 GLM Encode 0.907 0.461 0.913 0.0738 0.809
From the results above, we observe the following:
In general, the model performances are rather similar across the two models
While the overall performance under the dummy encoding method is slightly better than under GLM encoding, the sensitivity is slightly worse for dummy encoding
Based on the positive predictive value, both models perform badly at picking up the claim policies
Nevertheless, the purpose of this post is to explore how to encode categorical variables by using GLM.
Hence, I won’t be exploring how to improve the model performance here.
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!
