A fish regression? Google Translate from French to English to check what “poisson” means

In this post, I will explore a special type of regression model: Poisson regression.
Before jumping into Poisson regression, let’s look at what count regression is.
As the name suggests, the target variable for this type of regression is a count.
Count data are intrinsically heteroskedastic, right-skewed, and have a variance that increases with the mean of the distribution (Hilbe 2011).
Hence, such data will not fulfill the usual linear regression assumptions.
In this post, I will focus on Poisson regression, which rests on two assumptions:
The dependent variable follows a Poisson distribution.
The mean and variance of the distribution are equal.
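The equal mean and variance assumption is easy to see on simulated data (a quick base R sketch; the seed, sample size, and lambda are arbitrary):

```r
# For a Poisson distribution, the mean and variance are both equal
# to the rate parameter lambda; on a large simulated sample the two
# sample moments should be close to lambda and to each other.
set.seed(123)
x <- rpois(100000, lambda = 4)
mean(x) # close to 4
var(x)  # also close to 4
```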
I will use the Poisson log loss from the yardstick package to check how well the fitted model captures the underlying pattern.
I will build the Poisson regressions with both the base R glm function and the tidymodels package.
pacman::p_load(tidyverse, CASdatasets, tidymodels, poissonreg, janitor, DHARMa)
First, I will import the claim frequency data from the CASdatasets package.
data(freMTPLfreq)
df <-
freMTPLfreq %>%
clean_names() %>%
mutate(policy_id = as.character(policy_id)
,clm_rate = claim_nb/exposure)
rm(freMTPLfreq)
Great! Now we will start building the model.
I will build models to predict the claim rate of the dataset by two different methods:
1st method: Use an offset
2nd method: Use weights
First, I will build a Poisson regression with an offset.
Note that we need to log-transform the offset variable since the link function for the Poisson family is the log.
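The offset model can be fitted along these lines (a sketch using the predictors that appear in the model summaries below; log(exposure) enters as the offset):

```r
# Poisson regression on claim counts, with log(exposure) as an
# offset so that the model effectively targets the claim rate.
glm_fit <-
  glm(claim_nb ~
      power
      + car_age
      + driver_age
      + brand
      + gas
      + region
      + density
      + offset(log(exposure))
      ,data = df
      ,family = poisson)
```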
We can extract the coefficients of the model by using the tidy function.
I will also take the exponential of the estimates so that it is easier to interpret them.
glm_fit %>%
tidy() %>%
mutate(exp_estimate = exp(estimate)) %>%
relocate(exp_estimate, .before = std.error)
# A tibble: 31 x 6
term estimate exp_estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.00 0.136 0.0584 -34.2 4.69e-256
2 powere 0.0643 1.07 0.0280 2.29 2.18e- 2
3 powerf 0.0810 1.08 0.0273 2.96 3.04e- 3
4 powerg 0.0489 1.05 0.0272 1.80 7.25e- 2
5 powerh 0.0928 1.10 0.0389 2.39 1.70e- 2
6 poweri 0.189 1.21 0.0430 4.40 1.07e- 5
7 powerj 0.168 1.18 0.0438 3.84 1.25e- 4
8 powerk 0.232 1.26 0.0561 4.14 3.40e- 5
9 powerl 0.0978 1.10 0.0834 1.17 2.41e- 1
10 powerm 0.149 1.16 0.119 1.25 2.11e- 1
# ... with 21 more rows
From the results above, we note the following:
All else being equal, the claim risk decreases by approximately 1% for each additional year of driver age.
Not all regions have the same effect on the claim rate; some regions tend to have a higher influence than others.
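For a log link, exponentiated coefficients are multiplicative rate ratios, which is where readings like the 1% figure come from (a quick base R illustration; the -0.01 value is only a stand-in for the driver_age estimate):

```r
# With a log link, exp(coefficient) is a rate ratio.
# A powere estimate of 0.0643 implies roughly a 7% higher claim
# rate than the baseline power class, all else equal:
exp(0.0643)
# A driver_age coefficient near -0.01 implies roughly a 1% lower
# claim rate per additional year of age:
exp(-0.01)
```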
I will use the DHARMa package to check the model assumptions.
simulated_residuals <- simulateResiduals(glm_fit)
plot(simulated_residuals)

Although the KS test result is significant, the fitted model does not seem too bad.
When fitting a Poisson regression, one concern is whether there is any sign of excess zeros in the data.
To test this, I will use the testZeroInflation function.
testZeroInflation(glm_fit)

DHARMa zero-inflation test via comparison to expected zeros
with simulation under H0 = fitted model
data: simulationOutput
ratioObsSim = 1.0009, p-value < 2.2e-16
alternative hypothesis: two.sided
As the p-value is less than 0.05, there is statistical evidence that there are more zeros than the model expects.
However, as mentioned in the DHARMa package vignette, overdispersion can also produce excess zeros.
Hence, I will next perform a dispersion test by using the testDispersion function.
testDispersion(glm_fit)

DHARMa nonparametric dispersion test via sd of residuals
fitted vs. simulated
data: simulationOutput
dispersion = 1.0568, p-value < 2.2e-16
alternative hypothesis: two.sided
As the p-value is less than 0.05, we reject the null hypothesis; since the dispersion statistic is greater than 1, this is statistical evidence of overdispersion.
I will generate the predictions by using the augment function.
I will also round the predicted claim counts to the nearest integer.
glm_pred <-
glm_fit %>%
augment(df, type.predict = "response") %>%
mutate(pred_round = round(.fitted))
glm_pred
# A tibble: 413,169 x 18
policy_id claim_nb expos~1 power car_age drive~2 brand gas region
<chr> <int> <dbl> <fct> <int> <int> <fct> <fct> <fct>
1 1 0 0.09 g 0 46 Japa~ Dies~ Aquit~
2 2 0 0.84 g 0 46 Japa~ Dies~ Aquit~
3 3 0 0.52 f 2 38 Japa~ Regu~ Nord-~
4 4 0 0.45 f 2 38 Japa~ Regu~ Nord-~
5 5 0 0.15 g 0 41 Japa~ Dies~ Pays-~
6 6 0 0.75 g 0 41 Japa~ Dies~ Pays-~
7 7 0 0.81 d 1 27 Japa~ Regu~ Aquit~
8 8 0 0.05 d 0 27 Japa~ Regu~ Aquit~
9 9 0 0.76 d 9 23 Fiat Regu~ Nord-~
10 10 0 0.34 i 0 44 Japa~ Regu~ Ile-d~
# ... with 413,159 more rows, 9 more variables: density <int>,
# clm_rate <dbl>, .fitted <dbl>, .resid <dbl>, .std.resid <dbl>,
# .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, pred_round <dbl>, and
# abbreviated variable names 1: exposure, 2: driver_age
Alternatively, we could extract fitted.values from the model object if we are interested in the fitted values for the training dataset.
glm_fit$fitted.values %>%
as_tibble()
# A tibble: 413,169 x 1
value
<dbl>
1 0.00592
2 0.0552
3 0.0374
4 0.0324
5 0.00994
6 0.0497
7 0.0555
8 0.00347
9 0.0790
10 0.0383
# ... with 413,159 more rows
I will run a chi-square test by using the anova function.
anova(glm_fit, test = "Chisq")
Analysis of Deviance Table
Model: poisson, link: log
Response: claim_nb
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 413168 105613
power 11 66.25 413157 105546 6.262e-10 ***
car_age 1 77.87 413156 105469 < 2.2e-16 ***
driver_age 1 414.08 413155 105054 < 2.2e-16 ***
brand 6 76.06 413149 104978 2.317e-14 ***
gas 1 17.84 413148 104961 2.397e-05 ***
region 9 199.84 413139 104761 < 2.2e-16 ***
density 1 63.93 413138 104697 1.289e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results, we note the following:
Although all the variables have p-values less than 0.05, some variables seem to be more important than others.
driver_age seems to be the most important variable, as it produces the biggest reduction in deviance when added to the model.
As mentioned in the earlier section, I will use the poisson_log_loss function to check the model fit.
poisson_log_loss(glm_pred
,truth = claim_nb
,estimate = .fitted)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 poisson_log_loss standard 0.165
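Under the hood, this metric is (as I understand it) the mean negative Poisson log-likelihood of the observed counts given the predicted means. A hand-rolled version on toy vectors might look like this (a sketch; the truth and estimate values are made up):

```r
# Hand-rolled Poisson log loss: average the negative Poisson
# log-likelihood of each observed count at its predicted mean.
truth    <- c(0, 1, 0, 2, 0)
estimate <- c(0.1, 0.5, 0.2, 1.0, 0.3)
mean(-dpois(truth, lambda = estimate, log = TRUE))
```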
Next, I will plot bar charts showing the actual and predicted claim counts side by side.
graph_pred_act_df <-
tibble("claim_count" = c(0, 1, 2, 3, 4)) %>%
# frequency count based on actual claim count
left_join(
glm_pred %>%
group_by(claim_nb) %>%
summarise(`actual claim count` = n())
,by = c("claim_count" = "claim_nb")
) %>%
# frequency count based on predicted claim count
left_join(
glm_pred %>%
group_by(pred_round) %>%
summarise(`predicted claim count` = n())
,by = c("claim_count" = "pred_round")
) %>%
# replace any NA with 0
mutate_at(vars(everything())
,function(x) replace_na(x, 0)) %>%
pivot_longer(!claim_count
,names_to = "claim_type"
,values_to = "count")
# visualize the bar chart
graph_pred_act_df %>%
ggplot(aes(claim_count, count, fill = claim_type)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
labs(title = "Actual vs Predicted Claim Count"
,subtitle = "Poisson Regression with Log(Exposure) Offset")

Based on the chart, the model did poorly at capturing the policies with claims.
If we plot the actual claim rate against the predicted claim rate as scatter points, we can see that some of the actual claim rates are much higher than the rest.
ggplot(glm_pred, aes(clm_rate, .fitted/exposure)) +
geom_point()

If we extract these policies, it seems they each made at least one claim within 3 days of policy inception.
glm_pred %>%
filter(clm_rate > 100) %>%
arrange(exposure)
# A tibble: 28 x 18
policy_id claim_nb expos~1 power car_age drive~2 brand gas region
<chr> <int> <dbl> <fct> <int> <int> <fct> <fct> <fct>
1 171344 1 0.00274 d 16 20 Fiat Regu~ Centre
2 188359 1 0.00274 d 1 43 Volk~ Regu~ Centre
3 249311 1 0.00274 d 8 28 Rena~ Regu~ Centre
4 24492 1 0.00546 f 7 28 Rena~ Dies~ Ile-d~
5 26305 1 0.00546 d 15 33 Rena~ Regu~ Centre
6 360082 1 0.00546 d 20 19 Rena~ Regu~ Centre
7 72149 1 0.00548 e 10 55 Rena~ Regu~ Pays-~
8 96106 1 0.00548 f 8 51 Rena~ Dies~ Centre
9 102385 1 0.00548 f 11 34 Opel~ Regu~ Centre
10 155312 1 0.00548 i 10 43 Rena~ Regu~ Centre
# ... with 18 more rows, 9 more variables: density <int>,
# clm_rate <dbl>, .fitted <dbl>, .resid <dbl>, .std.resid <dbl>,
# .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, pred_round <dbl>, and
# abbreviated variable names 1: exposure, 2: driver_age
Nevertheless, as we are not trying to build the most accurate model here, I will leave the model as it is in this post.
While reading different materials on building Poisson regressions, I happened to come across this post on how a model with an offset and a model with weights give the same results if we pass in all the relevant parameters.
Hence, to satisfy my own curiosity, I have built another Poisson regression by using weights.
I will use clm_rate as the target variable and indicate exposure as the weights.
glm_fit_rate <-
glm(clm_rate ~
power
+ car_age
+ driver_age
+ brand
+ gas
+ region
+ density
,weights = exposure
,data = df
,family = poisson)
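That the offset and weights parameterizations coincide can also be checked on simulated data (a self-contained base R sketch; the simulated coefficients and sample size are arbitrary):

```r
# Simulate Poisson counts with varying exposure, then fit the same
# model two ways: counts with a log-exposure offset, and rates with
# exposure as the prior weights. The coefficient estimates agree.
# (The rate fit warns about non-integer responses; that is expected.)
set.seed(42)
n <- 5000
x <- rnorm(n)
expo <- runif(n, 0.1, 1)
y <- rpois(n, lambda = expo * exp(-1 + 0.5 * x))

fit_offset <- glm(y ~ x + offset(log(expo)), family = poisson)
fit_weight <- glm(I(y / expo) ~ x, weights = expo, family = poisson)

all.equal(coef(fit_offset), coef(fit_weight))
```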
If we extract the model info, we will see that the estimates and all other model info are the same as for the Poisson regression with offset that we built earlier.
glm_fit_rate %>%
tidy() %>%
mutate(exp_coef = exp(estimate)) %>%
relocate(exp_coef, .before = std.error)
# A tibble: 31 x 6
term estimate exp_coef std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.00 0.136 0.0584 -34.2 4.79e-256
2 powere 0.0643 1.07 0.0280 2.29 2.18e- 2
3 powerf 0.0810 1.08 0.0273 2.96 3.04e- 3
4 powerg 0.0489 1.05 0.0272 1.80 7.26e- 2
5 powerh 0.0928 1.10 0.0389 2.39 1.70e- 2
6 poweri 0.189 1.21 0.0430 4.40 1.07e- 5
7 powerj 0.168 1.18 0.0438 3.84 1.25e- 4
8 powerk 0.232 1.26 0.0561 4.14 3.40e- 5
9 powerl 0.0978 1.10 0.0834 1.17 2.41e- 1
10 powerm 0.149 1.16 0.119 1.25 2.11e- 1
# ... with 21 more rows
This shows that we could model the claim rate directly if the claim rate is the interest of the study. However, the weights need to be specified when we model the claim rate directly.
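The chart below is built from glm_pred_weight, which can be constructed along these lines (a sketch; since the weighted model predicts a rate, the predicted count is the rate multiplied by exposure before rounding):

```r
# Predictions from the weighted (rate) model: here .fitted is the
# predicted claim rate, so multiply by exposure to recover a
# predicted claim count, then round to the nearest integer.
glm_pred_weight <-
  glm_fit_rate %>%
  augment(df, type.predict = "response") %>%
  mutate(pred_round = round(.fitted * exposure))
```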
graph_pred_act_df <-
tibble("claim_count" = c(0, 1, 2, 3, 4)) %>%
left_join(
glm_pred_weight %>%
group_by(claim_nb) %>%
summarise(`actual claim count` = n())
,by = c("claim_count" = "claim_nb")
) %>%
left_join(
glm_pred_weight %>%
group_by(pred_round) %>%
summarise(`predicted claim count` = n())
,by = c("claim_count" = "pred_round")
) %>%
mutate_at(vars(everything())
,function(x) replace_na(x, 0)) %>%
pivot_longer(!claim_count
,names_to = "claim_type"
,values_to = "count")
graph_pred_act_df %>%
ggplot(aes(claim_count, count, fill = claim_type)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
labs(title = "Actual vs Predicted Claim Count")

case_weights function in tidymodels
Tidymodels also offers a weighting function that allows us to build a Poisson model with weights.
First, I will need to mutate the exposure column to indicate that this column should be used as weights in the modeling.
df_rate <-
df %>%
mutate(exposure = importance_weights(exposure))
Note that, at the point of writing this post, tidymodels has two types of weight functions:
frequency_weights, which only allows integers
importance_weights, which allows non-negative doubles
As the exposure is not an integer, I have used the importance_weights function here.
Once the column is mutated, we can see that the column type has changed when we print the tibble.
df_rate %>%
slice_head(n = 10)
policy_id claim_nb exposure power car_age driver_age
1 1 0 0.09 g 0 46
2 2 0 0.84 g 0 46
3 3 0 0.52 f 2 38
4 4 0 0.45 f 2 38
5 5 0 0.15 g 0 41
6 6 0 0.75 g 0 41
7 7 0 0.81 d 1 27
8 8 0 0.05 d 0 27
9 9 0 0.76 d 9 23
10 10 0 0.34 i 0 44
brand gas region
1 Japanese (except Nissan) or Korean Diesel Aquitaine
2 Japanese (except Nissan) or Korean Diesel Aquitaine
3 Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais
4 Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais
5 Japanese (except Nissan) or Korean Diesel Pays-de-la-Loire
6 Japanese (except Nissan) or Korean Diesel Pays-de-la-Loire
7 Japanese (except Nissan) or Korean Regular Aquitaine
8 Japanese (except Nissan) or Korean Regular Aquitaine
9 Fiat Regular Nord-Pas-de-Calais
10 Japanese (except Nissan) or Korean Regular Ile-de-France
density clm_rate
1 76 0
2 76 0
3 3003 0
4 3003 0
5 60 0
6 60 0
7 695 0
8 695 0
9 7887 0
10 27000 0
To build the Poisson model with weights, we will follow the usual steps for building machine learning models under the tidymodels framework:
Define the recipe
Define the model specification
Define the workflow
Fit the model
The only difference is that we need to specify the weights by using the add_case_weights function when assembling the workflow.
gen_recipe <-
recipe(clm_rate ~ power + car_age + driver_age + brand + gas + region + density + exposure
,data = df_rate)
pois_specs <-
poisson_reg() %>%
set_engine("glm")
pois_wf <-
workflow() %>%
add_case_weights(exposure) %>%
add_recipe(gen_recipe) %>%
add_model(pois_specs)
If we call the workflow object, we will see that exposure is captured as case weights in the workflow.
pois_wf
== Workflow ==========================================================
Preprocessor: Recipe
Model: poisson_reg()
-- Preprocessor ------------------------------------------------------
0 Recipe Steps
-- Case Weights ------------------------------------------------------
exposure
-- Model -------------------------------------------------------------
Poisson Regression Model Specification (regression)
Computational engine: glm
Finally, we will fit the model.
pois_fit <-
pois_wf %>%
fit(data = df_rate)
We can see that the estimates and relevant results are the same as for the weighted Poisson model built by using the glm function directly.
tidy(pois_fit)
# A tibble: 31 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.00 0.0584 -34.2 4.79e-256
2 powere 0.0643 0.0280 2.29 2.18e- 2
3 powerf 0.0810 0.0273 2.96 3.04e- 3
4 powerg 0.0489 0.0272 1.80 7.26e- 2
5 powerh 0.0928 0.0389 2.39 1.70e- 2
6 poweri 0.189 0.0430 4.40 1.07e- 5
7 powerj 0.168 0.0438 3.84 1.25e- 4
8 powerk 0.232 0.0561 4.14 3.40e- 5
9 powerl 0.0978 0.0834 1.17 2.41e- 1
10 powerm 0.149 0.119 1.25 2.11e- 1
# ... with 21 more rows
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!
