Generalized Linear Model

Machine Learning Supervised Learning

Back to Actuaries' Most Beloved Model

Jasper Lok true
09-24-2022

In this post, I will be exploring actuaries' most beloved model, the generalized linear model (GLM).

Photo by Jason Leung on Unsplash

This will set the scene for future sharing on other topics, such as generalized additive models, frequency-severity modeling and so on.

Generalized Linear Models (GLM)

The GLM is an extension of the linear model.

In the linear model, we assume the relationship between the independent and dependent variables is linear.

However, this assumption is not appropriate in many scenarios (e.g., insurance premium setting). This is where GLM can help us in our modeling work.

The idea behind GLM is that, with an appropriate link function applied to the mean of the target variable, the relationship between the independent variables and the transformed mean remains linear.

(kjytay 2020) A GLM model consists of the following 3 parts:

- a random component: the response \(y_i\) follows a distribution from the exponential family (e.g., Gaussian, Poisson, binomial);
- a systematic component: the linear predictor \(\eta_i = \beta^T x_i\);
- a link function \(g\) connecting the two, so that \(g(\mathbb{E}[y_i]) = \beta^T x_i\).

For comparison, the ordinary linear model is the special case with a Gaussian response and identity link:

\[y_i=\beta^Tx_i+\epsilon_i\]

Assumptions

Below are some of the assumptions when using GLM models:

- The observations are independent of one another.
- The response follows a distribution from the exponential family.
- The relationship between the independent variables and the link-transformed mean of the response is linear.

The model performance might be affected if any of these assumptions is violated.

Family

Below are some of the families supported under GLM:

- Gaussian, for continuous responses
- Binomial, for binary responses
- Poisson, for count responses
- Gamma, for positive, right-skewed continuous responses (e.g., claim severity)

(Molnar 2022) summarizes the appropriate families for the different predictions we are trying to make. This is the link to the book.

Offset function

The offset enters the linear predictor in an additive way, with its coefficient fixed at 1.

This post provides a clear example on how the offset function works in a Poisson regression context.
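Concretely, in a Poisson frequency model with a log link and exposure \(e_i\), the offset enters the linear predictor as a fixed additive term:

\[\log\big(\mathbb{E}[y_i]\big) = \log(e_i) + \beta^T x_i \quad\Longleftrightarrow\quad \mathbb{E}[y_i] = e_i \exp\big(\beta^T x_i\big)\]

so the model estimates a claim rate per unit of exposure, scaled up by each policy's actual exposure.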

You might then ask: instead of passing a feature into the offset function, why can't we just divide the target variable by the offset variable?

The short answer is that the likelihood function would change if we divided the target variable by the offset variable. (Tiwari 2020) provides the mathematical proof of how the likelihood changes in that case.

One important thing to note when using an offset is that it should be on the same scale as the linear predictor. For example, when using the Poisson family, the default link function is log, so the offset should be the log of the exposure variable.

Note that the offset function should not be confused with the weight function in GLM.

In short, the following are the differences between offset and weight:

- An offset shifts the linear predictor by a known amount, with its coefficient fixed at 1 (e.g., log exposure in a claim frequency model).
- A weight scales each observation's contribution to the likelihood, reflecting how much information that observation carries (e.g., the number of claims underlying an average severity).
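To make the offset behavior concrete, below is a minimal self-contained sketch using base R's glm (rather than glmnet) on simulated data; all variable names here are hypothetical:

```r
# Simulate policy-level claim counts where the true rate scales with exposure
set.seed(123)
n <- 2000
exposure <- runif(n, 0.1, 1)                 # fraction of the year on risk
x <- rnorm(n)                                # a single rating factor
y <- rpois(n, exposure * exp(-2 + 0.3 * x))  # true coefficients: -2 and 0.3

# log(exposure) enters as an offset: its coefficient is fixed at 1,
# so the model estimates a claim rate per unit of exposure
fit <- glm(y ~ x, family = poisson(), offset = log(exposure))
coef(fit)  # estimates should land close to the true values
```

Because the offset's coefficient is fixed at 1, a policy in force for half a year contributes exactly half the expected claims of an otherwise identical full-year policy.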

Pros and cons of using GLM models

Of course, there is no perfect model. Sometimes the very features that make a model advantageous in one setting become disadvantages in another.

The following are the advantages and disadvantages of using GLM models:

Advantages of using GLM

- The coefficients are directly interpretable and translate naturally into the relativities used in pricing.
- Well-established statistical theory supports inference on the fitted coefficients.
- It accommodates non-Gaussian responses, such as counts and skewed severities.

Disadvantages of using GLM

- The relationship between the predictors and the link-transformed mean must still be linear; non-linear effects and interactions have to be specified manually.
- It can be sensitive to outliers.
- Its predictive performance may lag behind more flexible machine learning models.

Application of GLM models

According to (Michel, Donatien, and Trufin 2019), GLMs were originally introduced to improve the accuracy of motor insurance pricing. Today, GLMs are used in many insurance contexts.

(Michel, Donatien, and Trufin 2019) also list some of the typical insurance use cases of GLMs in the book.

With that, let's start the demonstration!

Photo by Clemens van Lay on Unsplash

Demonstration

Setup the environment

In this demonstration, I will be using the glmnet package to build the GLM model. The documentation can be found under this link.

At the point of writing, it seems the tidymodels package does not support all the functionalities of the glmnet package.

According to this issue ticket, tidymodels is unable to support the offset function for the time being. glmnet requires users to specify the offset variable in both the training and testing datasets, whereas tidymodels only allows users to specify an offset for the training dataset.

Therefore, I will use the glmnet package directly to build the model and the tidymodels package to prepare the data for model building.

I will call the relevant packages to set up the environment.

pacman::p_load(tidyverse, readr, tidymodels, corrplot, glmnet,
              glue)

Dataset

For the demonstration, I will be using the car insurance dataset from Chapter 1 of Predictive Modeling Applications in Actuarial Science - Volume 1.

The dataset can be found from the book website.

Photo by why kei on Unsplash

I will first import the data into the environment.

I will also specify that the years-of-licence variable should be imported as character, instead of as a continuous variable.

df <- read_csv("data/sim-modeling-dataset.csv",
               col_types = list(yrs.lic = col_character()))

This is a rather clean dataset to work with, since there are no missing values.

Exploratory Data Analysis (EDA)

Before building the model, I will perform some simple EDA to guide the data cleaning.

I will first select all the numeric variables.

df_num <- df %>%
  select_if(is.numeric)

Then I will pass them into the corrplot function.

corrplot(cor(df_num, use="pairwise.complete.obs"), 
         method = "number", 
         type = "upper", 
         tl.cex = 0.65, 
         number.cex = 0.65, 
         diag = FALSE)

Note that some variables are perfectly correlated.

Based on the variable names, it seems these variables are measuring the same thing.

Hence, I will remove the perfectly correlated variables.

df_1 <- df %>%
  select(-c(year, drv.age, veh.age, clm.incurred))
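As an aside, such pairs can also be flagged programmatically instead of being read off the plot. Below is a small self-contained sketch with toy data (hypothetical column names):

```r
# Toy data frame where column b is an exact multiple of column a
set.seed(1)
toy <- data.frame(a = rnorm(10), c = rnorm(10))
toy$b <- 2 * toy$a                       # perfectly correlated with a

# flag pairs with near-perfect correlation (upper triangle avoids duplicates)
cm <- cor(toy)
pairs <- which(abs(cm) > 0.999 & upper.tri(cm), arr.ind = TRUE)
pairs  # one hit: the (a, b) pair
```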

Also, glmnet is unable to handle non-numeric variables, since it expects a numeric matrix of predictors.

To overcome this, I use the step_dummy function from the recipes package to encode the non-numeric variables with one-hot encoding. The all_nominal_predictors function tells step_dummy to pick up all the nominal independent variables in the dataset.

Then I will pass the recipe into the prep function. The current dataset will then be passed into the bake function to generate the pre-processed data.

df_recoded <- df_1 %>%
  recipe(clm.count ~ .) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  prep() %>%
  bake(df_1)
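For intuition, base R's model.matrix performs the same one-hot expansion; here is a toy example with a hypothetical factor column:

```r
# One-hot encode a factor: the "- 1" drops the intercept so every level
# gets its own indicator column
toy <- data.frame(fuel = factor(c("Diesel", "Gasoline", "LPG", "Diesel")))
onehot <- model.matrix(~ fuel - 1, data = toy)
onehot  # 4 rows x 3 columns: fuelDiesel, fuelGasoline, fuelLPG
```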

Next, I will be building a model to predict the claim count from the various policies.

Model Building

Data Splitting

I will split the dataset into training and testing dataset.

df_split <- initial_split(df_recoded, prop = 0.6, strata = clm.count)

As usual, I will use the training and testing functions to carve the dataset into training and testing sets.

df_train <- training(df_split)
df_test <- testing(df_split)

Building Model

glmnet requires the independent variables to be in matrix format. Hence, I will pass the object into data.matrix to convert the tibble into a matrix.

# predictors need to be in matrix format
x <- df_train %>%
  select(-c(clm.count, exposure)) %>%
  data.matrix()

Similarly, I will “cut out” the target variable from the training dataset.

y <- df_train$clm.count

I will also convert the test dataset into matrix format.

newx <- df_test %>%
  select(-c(clm.count, exposure)) %>%
  data.matrix()

Next, I will start building the model.

As I am trying to predict the claim count, I will specify Poisson as the family in the glmnet function.

glmnet_model <- glmnet(x, 
                       y, 
                       family = "poisson", 
                       offset = log(df_train$exposure))

The data documentation does not tell us whether the claim counts and claim amounts have already been adjusted for exposure. Hence, I will assume the claim counts and claim amounts are not adjusted for the exposure of the policies.

To adjust for the exposure, I will specify what should be the “offset” in the model.

Also, the glmnet package allows us to fit regularized models. The algorithm fits a lasso model by default if we do not specify the value of alpha (i.e., alpha = 1).

To fit a ridge model, we just need to set the alpha to be 0 as shown below.

glmnet(x, 
       y, 
       family = "poisson", 
       offset = log(df_train$exposure),
       alpha = 0)
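For reference, alpha is the elastic-net mixing parameter in the penalized objective that glmnet minimizes, where alpha = 1 gives the lasso penalty and alpha = 0 the ridge penalty:

\[\min_{\beta}\; -\frac{1}{n}\,\ell(\beta) \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]\]

Intermediate values of alpha blend the two penalties, shrinking coefficients while still allowing some to be set exactly to zero.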

The tidy function from the tidymodels package can be used to convert the output into a tidy data format so that it's easier to read and to perform data wrangling on.

glmnet_model %>% 
  tidy()
# A tibble: 1,714 x 5
   term         step estimate lambda dev.ratio
   <chr>       <dbl>    <dbl>  <dbl>     <dbl>
 1 (Intercept)     1    -1.82 0.0263  2.53e-14
 2 (Intercept)     2    -1.77 0.0240  3.54e- 3
 3 (Intercept)     3    -1.73 0.0218  6.50e- 3
 4 (Intercept)     4    -1.69 0.0199  8.96e- 3
 5 (Intercept)     5    -1.65 0.0181  1.10e- 2
 6 (Intercept)     6    -1.62 0.0165  1.27e- 2
 7 (Intercept)     7    -1.59 0.0151  1.41e- 2
 8 (Intercept)     8    -1.56 0.0137  1.53e- 2
 9 (Intercept)     9    -1.54 0.0125  1.71e- 2
10 (Intercept)    10    -1.53 0.0114  1.88e- 2
# ... with 1,704 more rows

The results above show the different lambda values being used to fit the model.

The model results also contain dev.ratio, which represents the proportion of (null) deviance explained.
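For intuition, dev.ratio is analogous to one minus the ratio of residual to null deviance in an unpenalized fit; below is a self-contained sketch with base R's glm and simulated data:

```r
# Proportion of null deviance explained, computed from a base R glm fit
set.seed(7)
x <- rnorm(100)
y <- rpois(100, exp(0.5 * x))           # simulated Poisson counts
fit <- glm(y ~ x, family = poisson())

dev_ratio <- 1 - fit$deviance / fit$null.deviance
dev_ratio  # between 0 and 1; higher means more deviance explained
```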

Cross Validation

Alternatively, glmnet package also has functions to help us in performing cross validation.

Based on the documentation, the cross-validation function in the glmnet package does not seem to allow us to iterate over different alpha values.

To overcome this, I will loop the cross validation over different alpha values.

alpha_list <- c(0, 0.25, 0.5, 0.75, 1)

for (i in alpha_list){
  assign(glue("cv_glmnet_", i), cv.glmnet(x, 
                       y, 
                       family = "poisson",
                       intercept = FALSE,
                       type.measure = "deviance", 
                       alpha = i,
                       nfolds = 20, 
                       offset = log(df_train$exposure))
         )
  
  print(glue("cv_glmnet_", i))
  print(get(glue("cv_glmnet_", i)))
}
cv_glmnet_0

Call:  cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance",      nfolds = 20, family = "poisson", intercept = FALSE, alpha = i) 

Measure: Poisson Deviance 

    Lambda Index Measure       SE Nonzero
min  0.824   100  0.3988 0.008554      43
1se 13.429    70  0.4070 0.007821      43
cv_glmnet_0.25

Call:  cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance",      nfolds = 20, family = "poisson", intercept = FALSE, alpha = i) 

Measure: Poisson Deviance 

     Lambda Index Measure       SE Nonzero
min 0.00330   100  0.3898 0.005572      29
1se 0.03703    74  0.3949 0.005402       6
cv_glmnet_0.5

Call:  cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance",      nfolds = 20, family = "poisson", intercept = FALSE, alpha = i) 

Measure: Poisson Deviance 

    Lambda Index Measure       SE Nonzero
min 0.0681    60  0.4026 0.006298       1
1se 0.8395    33  0.4082 0.005454       2
cv_glmnet_0.75

Call:  cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance",      nfolds = 20, family = "poisson", intercept = FALSE, alpha = i) 

Measure: Poisson Deviance 

    Lambda Index Measure       SE Nonzero
min 0.0498    59  0.4025 0.005125       1
1se 0.5100    34  0.4069 0.004504       1
cv_glmnet_1

Call:  cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance",      nfolds = 20, family = "poisson", intercept = FALSE, alpha = i) 

Measure: Poisson Deviance 

    Lambda Index Measure       SE Nonzero
min 0.0374    59  0.4025 0.006287       1
1se 0.4607    32  0.4084 0.005440       1

We can pull out the coefficients by calling the coef function.

coef(cv_glmnet_0.25)
45 x 1 sparse Matrix of class "dgCMatrix"
                                   s1
(Intercept)              .           
row.id                   .           
driver.age              -0.0006220311
yrs.licensed            -0.0268724105
ncd.level               -0.0720340462
region                   .           
vehicle.age              .           
vehicle.value            .           
seats                    .           
ccm                      .           
hp                       .           
weight                   .           
length                   .           
width                    .           
height                  -0.8287184062
prior.claims             0.0342454915
nb.rb_NB                 .           
nb.rb_RB                -0.0175072183
driver.gender_Female     .           
driver.gender_Male       .           
marital.status_Divorced  .           
marital.status_Married   .           
marital.status_Single    .           
marital.status_Widow     .           
yrs.lic_X1               .           
yrs.lic_X2               .           
yrs.lic_X3               .           
yrs.lic_X4               .           
yrs.lic_X5               .           
yrs.lic_X6               .           
yrs.lic_X7               .           
yrs.lic_X8               .           
yrs.lic_X9               .           
yrs.lic_X9.              .           
body.code_A              .           
body.code_B              .           
body.code_C              .           
body.code_D              .           
body.code_E              .           
body.code_F              .           
body.code_G              .           
body.code_H              .           
fuel.type_Diesel         .           
fuel.type_Gasoline       .           
fuel.type_LPG            .           

Note that, over here, I am pulling the coefficients from the regularized model at the lambda for which the cross-validation error is within one standard error of the minimum (the default, s = "lambda.1se").

Alternatively, we could extract the coefficients of the regularized model at the lambda with the minimum cross-validation error by setting s to "lambda.min" in the argument.

coef(cv_glmnet_0.25, s = "lambda.min")
45 x 1 sparse Matrix of class "dgCMatrix"
                                   s1
(Intercept)              .           
row.id                   3.077315e-06
driver.age              -6.003862e-03
yrs.licensed            -7.089763e-02
ncd.level               -9.562086e-02
region                  -7.018851e-04
vehicle.age             -1.777227e-02
vehicle.value           -7.633382e-03
seats                    .           
ccm                      .           
hp                       6.876185e-04
weight                   .           
length                   5.860735e-03
width                    2.131378e-01
height                  -7.334423e-01
prior.claims             1.281979e-01
nb.rb_NB                 .           
nb.rb_RB                -1.769263e-01
driver.gender_Female     .           
driver.gender_Male      -8.771084e-02
marital.status_Divorced -9.900298e-02
marital.status_Married  -7.775612e-03
marital.status_Single    .           
marital.status_Widow     5.797761e-01
yrs.lic_X1               2.569229e-02
yrs.lic_X2               .           
yrs.lic_X3              -4.162422e-02
yrs.lic_X4               7.175686e-02
yrs.lic_X5               .           
yrs.lic_X6              -2.272347e-01
yrs.lic_X7               .           
yrs.lic_X8              -1.465572e-01
yrs.lic_X9               3.793601e-01
yrs.lic_X9.              .           
body.code_A              .           
body.code_B             -2.891222e-01
body.code_C              .           
body.code_D              .           
body.code_E              8.646049e-03
body.code_F             -7.096296e-03
body.code_G              1.862983e-01
body.code_H              1.299067e-01
fuel.type_Diesel         .           
fuel.type_Gasoline      -9.764875e-02
fuel.type_LPG            .           

Alternatively, we could use the tidy function to tidy up the objects and extract the output from the steps we are interested in.

tidy(cv_glmnet_0.25$glmnet.fit) %>%
  filter(step == 75)
# A tibble: 6 x 5
  term          step  estimate lambda dev.ratio
  <chr>        <dbl>     <dbl>  <dbl>     <dbl>
1 driver.age      75 -0.000946 0.0337     0.589
2 yrs.licensed    75 -0.0309   0.0337     0.589
3 ncd.level       75 -0.0748   0.0337     0.589
4 height          75 -0.812    0.0337     0.589
5 prior.claims    75  0.0424   0.0337     0.589
6 nb.rb_RB        75 -0.0291   0.0337     0.589

Prediction

To obtain the predictions, I will pass "response" to the type argument.

Since I used an offset to fit the model, the offset needs to be specified again in the predict function, as per the documentation (kjytay 2019).

glmnet_predict <- predict(glmnet_model, 
                          type = "response", 
                          newx = newx, 
                          newoffset = log(df_test$exposure))

Note that, according to this Stack Overflow ticket, without performing cross-validation, glmnet will generate predictions under all lambda values.

After performing cross-validation, the prediction by default uses the largest lambda whose cross-validation error is within one standard error of the smallest error (s = "lambda.1se").

cv_predict_glmnet <- predict(cv_glmnet_0.25, 
                             newx = newx, 
                             newoffset = log(df_test$exposure), 
                             type = "response")

cv_predict_glmnet %>% 
  as_tibble()
# A tibble: 16,304 x 1
   lambda.1se
        <dbl>
 1     0.121 
 2     0.156 
 3     0.0130
 4     0.158 
 5     0.158 
 6     0.159 
 7     0.194 
 8     0.153 
 9     0.197 
10     0.193 
# ... with 16,294 more rows

I have passed the predictions to the as_tibble function to convert the object into a tibble. This prevents R from printing all the predictions when we call the object.

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Markus Spiske on Unsplash

References

kjytay. 2019. "A Deep Dive into Glmnet: Offset." https://www.r-bloggers.com/2019/01/a-deep-dive-into-glmnet-offset/.
———. 2020. "Glmnet V4.0: Generalizing the Family Parameter." https://statisticaloddsandends.wordpress.com/2020/05/14/glmnet-v4-0-generalizing-the-family-parameter/.
Michel, Denuit, Hainaut Donatien, and Julien Trufin. 2019. Effective Statistical Learning Methods for Actuaries I: GLMs and Extensions. Springer.
Molnar, Christoph. 2022. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/extend-lm.html#more-lm-extension.
Tiwari, Ajay. 2020. "Offsetting the Model — Logic to Implementation." https://towardsdatascience.com/offsetting-the-model-logic-to-implementation-7e333bc25798.