Back to Actuaries' Most Beloved Model
In this post, I will be exploring actuaries' most beloved model, the generalized linear model (GLM).

Photo by Jason Leung on Unsplash
This will set the scene for future posts on other topics, such as the generalized additive model, frequency-severity modeling and so on.
GLM is an extension of the linear model.
In the linear model, we assume the relationship between the independent and dependent variables is linear.
However, this assumption is not appropriate in many scenarios (eg. insurance premium setting). This is where GLM can help us in our modeling work.
The idea of GLM is that, with an appropriate link function applied to the mean of the target variable, the relationship between the independent variables and the transformed mean remains linear.
(kjytay 2020) A GLM consists of the following three parts:
Linear predictor \(\eta_i=\beta_0+\beta^Tx_i\)
Link function \(\eta_i=g(\mu_i)\)
Random component \(y_i\sim f(y|\mu_i)\)
Combining the three parts, the general form of the GLM can be written as:
\[g(\mu_i)=\beta_0+\beta^Tx_i,\qquad \mu_i=E[y_i]\]
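For example, for claim count data the three components above are typically a Poisson random component with the canonical log link:
\[\eta_i=\beta_0+\beta^Tx_i,\qquad \log(\mu_i)=\eta_i,\qquad y_i\sim\text{Poisson}(\mu_i)\]
so the expected count is \(\mu_i=e^{\beta_0+\beta^Tx_i}\), which is always positive.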
Below are some of the assumptions made when using GLMs:
The predictors (a.k.a. independent variables) are independent from one another
The dependent variable is assumed to follow one of the distributions from the exponential family (eg. normal, Poisson, binomial)
Errors are independent from one another
The model performance might be affected if any of these assumptions is violated.
Below are some of the families supported under GLM:
Gaussian
Binomial
Poisson
Multinomial
Cox
Tweedie
(Molnar 2022) summarizes the appropriate families for the different predictions we are trying to make. This is the link to the book.
One important feature when modeling insurance data is the offset. The offset enters the score (i.e. the linear predictor) in an additive way.
This post provides a clear example of how the offset works in a Poisson regression context.
Then you might ask: instead of passing a feature into the offset argument, why can't we just divide the target variable by the offset variable?
The short answer is that the likelihood function would change if we divide the target variable by the offset variable. (Tiwari 2020) provides the mathematical proof of how the likelihood changes in that case.
One important thing to note when using an offset is that it should be on the scale of the linear predictor. For example, when using the Poisson family, the default link function is log, so the offset variable should be log-transformed as well.
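As a minimal sketch of this point, here is a Poisson GLM fitted with base R's glm on simulated data (the exposure variable here is invented for illustration): the offset is passed on the log scale so that it enters the log-linear predictor additively.

```r
set.seed(42)
n <- 500
exposure <- runif(n, 0.1, 1)   # fraction of the year each policy is in force
x <- rnorm(n)
# Simulate counts where the expected count is proportional to exposure:
# log E[y] = log(exposure) + 0.5 + 0.3 * x
y <- rpois(n, lambda = exposure * exp(0.5 + 0.3 * x))

# The Poisson default link is log, so the offset is log(exposure),
# i.e. on the scale of the linear predictor
m <- glm(y ~ x, family = poisson, offset = log(exposure))
coef(m)  # estimates should be close to the true values 0.5 and 0.3
```

Passing the raw exposure instead of log(exposure) would put the offset on the wrong scale and distort the fit.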
Note that the offset should not be confused with the weight argument in GLM.
In short, the following are the differences between offset and weight:
Offset adjusts the target so that the targets of the different data points are in the same unit of measurement
Weight assigns more importance to some observations than to the rest
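The API difference can be sketched with base R's glm on simulated data (in glmnet the analogous arguments are offset and weights): an offset adds a fixed term to the linear predictor, while a weight rescales each observation's contribution to the likelihood.

```r
set.seed(1)
n <- 300
exposure <- runif(n, 0.2, 1)
x <- rnorm(n)
y <- rpois(n, exposure * exp(0.2 * x))

# Offset: log(exposure) enters the linear predictor with its coefficient
# fixed at 1, putting targets with different exposures on the same unit
m_offset <- glm(y ~ x, family = poisson, offset = log(exposure))

# Weight: each observation's log-likelihood contribution is multiplied by
# its weight, giving high-exposure policies more influence (a different model!)
m_weight <- glm(y ~ x, family = poisson, weights = exposure)

coef(m_offset)
coef(m_weight)
```

The two fits give different coefficients because they answer different questions: the offset model estimates a rate per unit of exposure, while the weight model merely trusts high-exposure observations more.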
Of course, there is no perfect model. Sometimes the advantages a model brings can also be its disadvantages.
Following are the advantages and disadvantages of using GLMs:
Advantages of using GLM
Does not assume the relationship between independent and dependent variables is linear
Easier to interpret than some other models, such as neural network
Faces less resistance from business users, since GLM can produce rating factors, which have been widely used
Disadvantages of using GLM
Sensitive to outliers
Unable to handle categorical variables directly; they must first be encoded (eg. via one-hot encoding)
Model performance will be affected if any of the model assumptions mentioned above is violated
(Michel, Donatien, and Trufin 2019) note that GLMs were originally introduced to improve the accuracy of motor insurance pricing. Today GLMs are used in many insurance contexts.
(Michel, Donatien, and Trufin 2019) also list some of the typical insurance use cases of GLMs in the book:
Claim count modeling
Claim amount modeling
Graduation of mortality and morbidity rates
Elasticity modeling
Loss reserving
Competitive analysis
Now, let’s start the demonstration!

Photo by Clemens van Lay on Unsplash
In this demonstration, I will be using glmnet package to
build the GLM model. The documentation can be found under this link.
At the time of writing, it seems like the tidymodels
package doesn’t support all functionalities of the glmnet
package.
According to this issue
ticket, it seems like tidymodels is unable to support
an offset for the time being. glmnet requires the users
to specify the variable to be used as the offset in both the training and
testing datasets. However, tidymodels only allows the users
to specify an offset in the training dataset.
Therefore, I will use the glmnet package directly to
build the model, and the tidymodels package to prepare the
data for model building.
I will call the relevant packages to set up the environment.
pacman::p_load(tidyverse, readr, tidymodels, corrplot, glmnet,
glue)
For the demonstration, I will be using the car insurance dataset from Chapter 1 of Predictive Modeling Applications in Actuarial Science - Volume 1.
The dataset can be found from the book website.

I will first import the data into environment.
I will also specify that the years of driving licence variable should be imported as character, instead of as a continuous variable.
df <- read_csv("data/sim-modeling-dataset.csv",
col_types = list(yrs.lic = col_character()))
This is a rather clean dataset to work with, since there are no missing values.
Before building the model, I will perform some simple EDA and data cleaning.
I will first select all the numeric variables.
df_num <- df %>%
select_if(is.numeric)
Then I will pass them into corrplot function.
corrplot(cor(df_num, use="pairwise.complete.obs"),
method = "number",
type = "upper",
tl.cex = 0.65,
number.cex = 0.65,
diag = FALSE)

Note that some variables are perfectly correlated.
Based on the variable names, it seems like these variables are measuring the same things.
Hence I will remove the highly correlated variables.
df_1 <- df %>%
select(-c(year, drv.age, veh.age, clm.incurred))
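As a side note, perfectly correlated pairs can also be spotted programmatically. Here is a base-R sketch on toy data (the post simply removes the duplicates by name instead):

```r
set.seed(7)
d <- data.frame(a = rnorm(50))
d$b <- 2 * d$a          # a perfect duplicate of a (correlation 1)
d$c <- rnorm(50)        # unrelated noise

cm <- abs(cor(d))
cm[lower.tri(cm, diag = TRUE)] <- 0   # keep each pair only once
which(cm > 0.99, arr.ind = TRUE)      # flags the (a, b) pair
```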
Also, GLM is unable to handle non-numeric variables directly.
To overcome this, I will use the step_dummy function
in the recipes package to encode the non-numeric variables using the one-hot
encoding method. The all_nominal_predictors function tells
step_dummy to select all the nominal independent variables in the
dataset.
Then I will pass the recipe into the prep function. The
current dataset will then be passed into the bake function to
generate the pre-processed data.
df_recoded <- df_1 %>%
recipe(clm.count ~ .) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
prep() %>%
bake(df_1)
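To see what one-hot encoding produces, here is a tiny base-R sketch using model.matrix on hypothetical data (step_dummy with one_hot = TRUE yields the same kind of 0/1 indicator columns):

```r
toy <- data.frame(fuel = c("Diesel", "Gasoline", "LPG", "Diesel"))
# "~ 0 + fuel" drops the intercept so every level gets its own indicator column
model.matrix(~ 0 + fuel, data = toy)
# 4 rows x 3 columns: fuelDiesel, fuelGasoline, fuelLPG
```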
Next, I will be building a model to predict the claim count from the various policies.
I will split the dataset into training and testing dataset.
df_split <- initial_split(df_recoded, prop = 0.6, strata = clm.count)
As usual, I will use the training and testing
functions to carve the dataset into training and testing datasets.
df_train <- training(df_split)
df_test <- testing(df_split)
glmnet requires the independent variables to be in
matrix format. Hence I will pass the object into
data.matrix to convert the tibble table into matrix
format.
# predictors need to be in matrix format
x <- df_train %>%
select(-c(clm.count, exposure)) %>%
data.matrix()
Similarly, I will “cut out” the target variable from the training dataset.
y <- df_train$clm.count
I will also convert the test dataset into matrix format.
newx <- df_test %>%
select(-c(clm.count, exposure)) %>%
data.matrix()
Next, I will start building the model.
As I am trying to predict the claim count, I will specify Poisson as the family in the glmnet function.
glmnet_model <- glmnet(x,
y,
family = "poisson",
offset = log(df_train$exposure))
The data do not tell us whether the claim counts and claim amounts have been adjusted for the exposure. Hence, I will assume the claim counts and claim amounts are not adjusted for the exposure of the policies.
To adjust for the exposure, I have specified the log of the exposure as the “offset” in the model above.
Also, the glmnet package allows us to fit regularized models.
The algorithm fits a lasso model if we don’t specify the value of
alpha.
To fit a ridge model, we just need to set alpha to 0, as shown below.
glmnet(x,
y,
family = "poisson",
offset = log(df_train$exposure),
alpha = 0)
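For reference, the objective glmnet minimises is the penalised negative log-likelihood, where alpha mixes the two penalty types:
\[\min_{\beta_0,\beta}\;-\frac{1}{n}\,\ell(\beta_0,\beta)\;+\;\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]\]
Setting \(\alpha=1\) gives the lasso penalty, \(\alpha=0\) gives the ridge penalty, and values in between give the elastic net.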
The tidy function from the tidymodels package can
be used to convert the output into tidy data format so that it’s easier
to read and perform data wrangling.
glmnet_model %>%
tidy()
# A tibble: 1,714 x 5
term step estimate lambda dev.ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1 -1.82 0.0263 2.53e-14
2 (Intercept) 2 -1.77 0.0240 3.54e- 3
3 (Intercept) 3 -1.73 0.0218 6.50e- 3
4 (Intercept) 4 -1.69 0.0199 8.96e- 3
5 (Intercept) 5 -1.65 0.0181 1.10e- 2
6 (Intercept) 6 -1.62 0.0165 1.27e- 2
7 (Intercept) 7 -1.59 0.0151 1.41e- 2
8 (Intercept) 8 -1.56 0.0137 1.53e- 2
9 (Intercept) 9 -1.54 0.0125 1.71e- 2
10 (Intercept) 10 -1.53 0.0114 1.88e- 2
# ... with 1,704 more rows
The results above show the different lambda values used to fit the model.
The model results also contain dev.ratio, which
represents the proportion of deviance explained.
Alternatively, the glmnet package also has functions to help
us perform cross validation.
From the documentation, it doesn’t seem like the cross validation
function in the glmnet package allows us to iterate over
different alpha values.
To overcome this, I will loop the cross validation over different alpha values.
alpha_list <- c(0, 0.25, 0.5, 0.75, 1)
for (i in alpha_list){
assign(glue("cv_glmnet_", i), cv.glmnet(x,
y,
family = "poisson",
intercept = FALSE,
type.measure = "deviance",
alpha = i,
nfolds = 20,
offset = log(df_train$exposure))
)
print(glue("cv_glmnet_", i))
print(get(glue("cv_glmnet_", i)))
}
cv_glmnet_0
Call: cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance", nfolds = 20, family = "poisson", intercept = FALSE, alpha = i)
Measure: Poisson Deviance
Lambda Index Measure SE Nonzero
min 0.824 100 0.3988 0.008554 43
1se 13.429 70 0.4070 0.007821 43
cv_glmnet_0.25
Call: cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance", nfolds = 20, family = "poisson", intercept = FALSE, alpha = i)
Measure: Poisson Deviance
Lambda Index Measure SE Nonzero
min 0.00330 100 0.3898 0.005572 29
1se 0.03703 74 0.3949 0.005402 6
cv_glmnet_0.5
Call: cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance", nfolds = 20, family = "poisson", intercept = FALSE, alpha = i)
Measure: Poisson Deviance
Lambda Index Measure SE Nonzero
min 0.0681 60 0.4026 0.006298 1
1se 0.8395 33 0.4082 0.005454 2
cv_glmnet_0.75
Call: cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance", nfolds = 20, family = "poisson", intercept = FALSE, alpha = i)
Measure: Poisson Deviance
Lambda Index Measure SE Nonzero
min 0.0498 59 0.4025 0.005125 1
1se 0.5100 34 0.4069 0.004504 1
cv_glmnet_1
Call: cv.glmnet(x = x, y = y, offset = log(df_train$exposure), type.measure = "deviance", nfolds = 20, family = "poisson", intercept = FALSE, alpha = i)
Measure: Poisson Deviance
Lambda Index Measure SE Nonzero
min 0.0374 59 0.4025 0.006287 1
1se 0.4607 32 0.4084 0.005440 1
We can pull out the coefficients by calling the coef
function.
coef(cv_glmnet_0.25)
45 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) .
row.id .
driver.age -0.0006220311
yrs.licensed -0.0268724105
ncd.level -0.0720340462
region .
vehicle.age .
vehicle.value .
seats .
ccm .
hp .
weight .
length .
width .
height -0.8287184062
prior.claims 0.0342454915
nb.rb_NB .
nb.rb_RB -0.0175072183
driver.gender_Female .
driver.gender_Male .
marital.status_Divorced .
marital.status_Married .
marital.status_Single .
marital.status_Widow .
yrs.lic_X1 .
yrs.lic_X2 .
yrs.lic_X3 .
yrs.lic_X4 .
yrs.lic_X5 .
yrs.lic_X6 .
yrs.lic_X7 .
yrs.lic_X8 .
yrs.lic_X9 .
yrs.lic_X9. .
body.code_A .
body.code_B .
body.code_C .
body.code_D .
body.code_E .
body.code_F .
body.code_G .
body.code_H .
fuel.type_Diesel .
fuel.type_Gasoline .
fuel.type_LPG .
Note that over here, I am pulling the coefficients from the regularized model whose cross validation error is within one standard error of the minimum.
Alternatively, we could extract the coefficients of the regularized model with the minimum cross validation error by setting s to “lambda.min” in the argument.
coef(cv_glmnet_0.25, s = "lambda.min")
45 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) .
row.id 3.077315e-06
driver.age -6.003862e-03
yrs.licensed -7.089763e-02
ncd.level -9.562086e-02
region -7.018851e-04
vehicle.age -1.777227e-02
vehicle.value -7.633382e-03
seats .
ccm .
hp 6.876185e-04
weight .
length 5.860735e-03
width 2.131378e-01
height -7.334423e-01
prior.claims 1.281979e-01
nb.rb_NB .
nb.rb_RB -1.769263e-01
driver.gender_Female .
driver.gender_Male -8.771084e-02
marital.status_Divorced -9.900298e-02
marital.status_Married -7.775612e-03
marital.status_Single .
marital.status_Widow 5.797761e-01
yrs.lic_X1 2.569229e-02
yrs.lic_X2 .
yrs.lic_X3 -4.162422e-02
yrs.lic_X4 7.175686e-02
yrs.lic_X5 .
yrs.lic_X6 -2.272347e-01
yrs.lic_X7 .
yrs.lic_X8 -1.465572e-01
yrs.lic_X9 3.793601e-01
yrs.lic_X9. .
body.code_A .
body.code_B -2.891222e-01
body.code_C .
body.code_D .
body.code_E 8.646049e-03
body.code_F -7.096296e-03
body.code_G 1.862983e-01
body.code_H 1.299067e-01
fuel.type_Diesel .
fuel.type_Gasoline -9.764875e-02
fuel.type_LPG .
Alternatively, we could use tidy function to tidy up the
objects and extract the output from the steps we are interested in.
tidy(cv_glmnet_0.25$glmnet.fit) %>%
filter(step == 75)
# A tibble: 6 x 5
term step estimate lambda dev.ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 driver.age 75 -0.000946 0.0337 0.589
2 yrs.licensed 75 -0.0309 0.0337 0.589
3 ncd.level 75 -0.0748 0.0337 0.589
4 height 75 -0.812 0.0337 0.589
5 prior.claims 75 0.0424 0.0337 0.589
6 nb.rb_RB 75 -0.0291 0.0337 0.589
To obtain the predictions, I will pass “response” to the type argument.
Since I have used an offset to fit the model, the offset needs to be specified again in the predict function, based on the documentation (kjytay 2019).
Note that according to this stack overflow ticket, without performing cross validation, glmnet will generate the predictions under all lambda values.
After performing cross validation, the algorithm will by default choose the largest lambda whose cross validation error is within one standard error of the smallest error.
cv_predict_glmnet <- predict(cv_glmnet_0.25,
newx = newx,
newoffset = log(df_test$exposure),
type = "response")
cv_predict_glmnet %>%
as_tibble()
# A tibble: 16,304 x 1
lambda.1se
<dbl>
1 0.121
2 0.156
3 0.0130
4 0.158
5 0.158
6 0.159
7 0.194
8 0.153
9 0.197
10 0.193
# ... with 16,294 more rows
I have passed the predictions to the as_tibble function to
convert the object into a tibble table. This is to prevent R from printing
out all the predictions when we call the object.
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Markus Spiske on Unsplash