Poisson Regression

Machine Learning Supervised Learning

A fish regression? Run “poisson” through Google Translate from French to English to check what it means

Jasper Lok https://jasperlok.netlify.app/
11-01-2023

Photo by Anne Nygård on Unsplash

In this post, I will be exploring a special type of regression model, which is Poisson regression.

Before jumping into Poisson regression, let’s look at what count regression is.

What is Count Regression?

As the name suggests, the target variable for this type of regression is a count.

Count data are intrinsically heteroskedastic and right-skewed, with a variance that increases with the mean of the distribution (Hilbe 2011).

Hence, this type of data will not fulfill the usual linear regression assumptions.
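To see why, here is a minimal sketch on simulated data (not the insurance dataset used later in this post): for Poisson counts, the variance tracks the mean, unlike the constant variance that ordinary linear regression assumes.

```r
# For each mean level, simulate Poisson counts and compare mean vs variance.
# The variance grows with the mean instead of staying constant.
set.seed(123)

means <- c(1, 5, 20)
sapply(means, function(m) {
  x <- rpois(10000, lambda = m)
  c(mean = mean(x), variance = var(x))
})
```

In each column of the result, the sample variance is approximately equal to the sample mean, which is the equidispersion property of the Poisson distribution.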

In this post, I will be focusing on Poisson regression.

Poisson Regression

Below are the key assumptions of the Poisson regression model:

- The target variable is a count of events occurring over a fixed exposure period.
- The observations are independent of one another.
- The logarithm of the expected count is a linear combination of the predictors.
- The mean of the distribution is equal to its variance (equidispersion).

I will be using the Poisson log loss metric from the yardstick package to check how well the fitted model captures the underlying pattern.
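As I understand it, the Poisson log loss is the mean negative Poisson log-likelihood of the observed counts under the predicted rates. A minimal base-R sketch of that formula (the function name is mine; yardstick's poisson_log_loss computes the equivalent quantity from a data frame):

```r
# Mean negative Poisson log-likelihood: lower is better.
poisson_log_loss_manual <- function(truth, estimate) {
  mean(-dpois(truth, lambda = estimate, log = TRUE))
}

# Hypothetical counts and predicted rates, for illustration only.
poisson_log_loss_manual(truth = c(0, 1, 2), estimate = c(0.1, 0.8, 1.5))
```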

Demonstration

In this post, I will be using both the base R glm function and the tidymodels package to build Poisson regression models.

pacman::p_load(tidyverse, CASdatasets, tidymodels, poissonreg, janitor, DHARMa)

Import Data

First, I will import the claim frequency data from CASdatasets package.

data(freMTPLfreq)

df <-
  freMTPLfreq %>% 
  clean_names() %>% 
  mutate(policy_id = as.character(policy_id)
         ,clm_rate = claim_nb/exposure)

rm(freMTPLfreq)

Model Building

Great! Now, we will start building the model.

In the model building, I will build models to predict the claim rate of the dataset by two different methods:

- Poisson regression with an offset
- Poisson regression with weights

Poisson Regression - Offset

First, I will build a Poisson regression with offset function.

glm_fit <-
  glm(claim_nb ~ 
      power
    + car_age
    + driver_age
    + brand
    + gas
    + region
    + density
    ,offset = log(exposure)
    ,data = df
    ,family = poisson)

Note that we need to take log transformation on the offset variable since the link function for Poisson distribution is log.
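As an aside, glm also accepts the offset inside the formula itself via offset(), and both forms give the same fit. A small sketch on simulated data (the variables x, expo, and y here are hypothetical, not columns of the claims dataset):

```r
# Simulate counts whose rate depends on x, scaled by an exposure expo.
set.seed(42)
n    <- 1000
x    <- rnorm(n)
expo <- runif(n, 0.1, 1)                      # exposure in years
y    <- rpois(n, lambda = expo * exp(-2 + 0.3 * x))

# Offset supplied as an argument vs inside the formula: identical models.
fit_arg     <- glm(y ~ x, offset = log(expo), family = poisson)
fit_formula <- glm(y ~ x + offset(log(expo)), family = poisson)

all.equal(coef(fit_arg), coef(fit_formula))   # TRUE
```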

We could extract the coefficients of the model by using the tidy function.

I will also take the exponential of the estimate so that it's easier to interpret the meaning of the different estimates.

glm_fit %>% 
  tidy() %>% 
  mutate(exp_estimate = exp(estimate)) %>% 
  relocate(exp_estimate, .before = std.error)
# A tibble: 31 x 6
   term        estimate exp_estimate std.error statistic   p.value
   <chr>          <dbl>        <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)  -2.00          0.136    0.0584    -34.2  4.69e-256
 2 powere        0.0643        1.07     0.0280      2.29 2.18e-  2
 3 powerf        0.0810        1.08     0.0273      2.96 3.04e-  3
 4 powerg        0.0489        1.05     0.0272      1.80 7.25e-  2
 5 powerh        0.0928        1.10     0.0389      2.39 1.70e-  2
 6 poweri        0.189         1.21     0.0430      4.40 1.07e-  5
 7 powerj        0.168         1.18     0.0438      3.84 1.25e-  4
 8 powerk        0.232         1.26     0.0561      4.14 3.40e-  5
 9 powerl        0.0978        1.10     0.0834      1.17 2.41e-  1
10 powerm        0.149         1.16     0.119       1.25 2.11e-  1
# ... with 21 more rows

From the result above, the exponentiated estimates can be read as multiplicative effects on the expected claim frequency. For example, powerk has an exp_estimate of about 1.26, implying roughly 26% more expected claims than the baseline power category, all else being equal.

Assumptions

I will be using DHARMa package to check the assumptions.

Check - QQ plot & residuals
simulated_residuals <- simulateResiduals(glm_fit)

plot(simulated_residuals)

Although the KS test result is significant, it seems like the fitted model is not too bad.

Check - Is the number of zero inflated in the dataset?

In fitting Poisson regression, one of the concerns is whether there is any sign of excessive zeros within the data.

To test this, I will use the testZeroInflation function.

testZeroInflation(glm_fit)


    DHARMa zero-inflation test via comparison to expected zeros
    with simulation under H0 = fitted model

data:  simulationOutput
ratioObsSim = 1.0009, p-value < 2.2e-16
alternative hypothesis: two.sided

As the p-value is less than 0.05, there is statistical evidence that the number of zeros is more than expected. That said, the ratio of observed to simulated zeros is only 1.0009, so the excess is very small in practical terms.

However, as mentioned in the vignette of the DHARMa package, overdispersion could lead to excess zeros. Hence, I will perform a dispersion test next.

Check - Is there any overdispersion?

Next, I will perform a dispersion test by using the testDispersion function.

testDispersion(glm_fit)


    DHARMa nonparametric dispersion test via sd of residuals
    fitted vs. simulated

data:  simulationOutput
dispersion = 1.0568, p-value < 2.2e-16
alternative hypothesis: two.sided

As the p-value is less than 0.05, we reject the null hypothesis; since the dispersion statistic (1.0568) is also greater than 1, there is statistical evidence of overdispersion.
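As a complement to DHARMa's simulation-based test, a quick manual check for overdispersion is the Pearson chi-square statistic divided by the residual degrees of freedom; it should be close to 1 if the Poisson assumption holds, and values well above 1 indicate overdispersion. A sketch on simulated (well-specified) data:

```r
# Fit a correctly specified Poisson glm on simulated data.
set.seed(1)
x   <- rnorm(500)
y   <- rpois(500, lambda = exp(0.5 + 0.2 * x))
fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: sum of squared Pearson residuals / df.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion  # close to 1 for well-specified Poisson data
```

Applying the same ratio to the fitted claims model would give a point estimate of its dispersion without any simulation.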

Model Predictions

I will generate the predictions by using the augment function.

I will also round the predicted claim counts to the nearest whole number.

glm_pred <-
  glm_fit %>% 
  augment(df, type.predict = "response") %>% 
  mutate(pred_round = round(.fitted))

glm_pred
# A tibble: 413,169 x 18
   policy_id claim_nb expos~1 power car_age drive~2 brand gas   region
   <chr>        <int>   <dbl> <fct>   <int>   <int> <fct> <fct> <fct> 
 1 1                0    0.09 g           0      46 Japa~ Dies~ Aquit~
 2 2                0    0.84 g           0      46 Japa~ Dies~ Aquit~
 3 3                0    0.52 f           2      38 Japa~ Regu~ Nord-~
 4 4                0    0.45 f           2      38 Japa~ Regu~ Nord-~
 5 5                0    0.15 g           0      41 Japa~ Dies~ Pays-~
 6 6                0    0.75 g           0      41 Japa~ Dies~ Pays-~
 7 7                0    0.81 d           1      27 Japa~ Regu~ Aquit~
 8 8                0    0.05 d           0      27 Japa~ Regu~ Aquit~
 9 9                0    0.76 d           9      23 Fiat  Regu~ Nord-~
10 10               0    0.34 i           0      44 Japa~ Regu~ Ile-d~
# ... with 413,159 more rows, 9 more variables: density <int>,
#   clm_rate <dbl>, .fitted <dbl>, .resid <dbl>, .std.resid <dbl>,
#   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, pred_round <dbl>, and
#   abbreviated variable names 1: exposure, 2: driver_age

Alternatively, we could extract the fitted.values from the model object if we are interested in the fitted values from the training dataset.

glm_fit$fitted.values %>% 
  as_tibble()
# A tibble: 413,169 x 1
     value
     <dbl>
 1 0.00592
 2 0.0552 
 3 0.0374 
 4 0.0324 
 5 0.00994
 6 0.0497 
 7 0.0555 
 8 0.00347
 9 0.0790 
10 0.0383 
# ... with 413,159 more rows

Model Performance

Anova Test

I will run a chi-square test by using the anova function.

anova(glm_fit, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: claim_nb

Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                      413168     105613              
power      11    66.25    413157     105546 6.262e-10 ***
car_age     1    77.87    413156     105469 < 2.2e-16 ***
driver_age  1   414.08    413155     105054 < 2.2e-16 ***
brand       6    76.06    413149     104978 2.317e-14 ***
gas         1    17.84    413148     104961 2.397e-05 ***
region      9   199.84    413139     104761 < 2.2e-16 ***
density     1    63.93    413138     104697 1.289e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the result, every term significantly reduces the residual deviance (all p-values are well below 0.05), so each variable contributes to the model fit.

Actual vs Predicted

As mentioned in the earlier section, I will use the poisson_log_loss function to check the model fit.

poisson_log_loss(glm_pred
                 ,truth = claim_nb
                 ,estimate = .fitted)
# A tibble: 1 x 3
  .metric          .estimator .estimate
  <chr>            <chr>          <dbl>
1 poisson_log_loss standard       0.165

Next, I will plot a bar chart showing the actual claim count and predicted claim count side by side.

graph_pred_act_df <-
  tibble("claim_count" = c(0, 1, 2, 3, 4)) %>% 
  # frequency count based on actual claim count
  left_join(
    glm_pred %>% 
      group_by(claim_nb) %>% 
      summarise(`actual claim count` = n())
    ,by = c("claim_count" = "claim_nb")
  ) %>% 
  # frequency count based on predicted claim count
  left_join(
    glm_pred %>% 
      group_by(pred_round) %>% 
      summarise(`predicted claim count` = n())
    ,by = c("claim_count" = "pred_round")
  ) %>% 
  # replace any NA with 0
  mutate_at(vars(everything())
            ,function(x) replace_na(x, 0)) %>% 
  pivot_longer(!claim_count
               ,names_to = "claim_type"
               ,values_to = "count")

# visualize the bar chart
graph_pred_act_df %>% 
  ggplot(aes(claim_count, count, fill = claim_type)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Actual vs Predicted Claim Count"
       ,subtitle = "Poisson Regression with Log(Exposure) Offset")  

Based on the result shown, the model does poorly at capturing the policies that did make claims: almost all the predicted counts round down to zero.

If we plot the actual claim rate against the predicted claim rate as a scatter plot, we can see that some of the actual claim rates are much higher than the rest.

ggplot(glm_pred, aes(clm_rate, .fitted/exposure)) +
  geom_point()

If we extract these policies, it seems they each made at least one claim within 3 days after the policy inception.

glm_pred %>% 
  filter(clm_rate > 100) %>% 
  arrange(exposure)
# A tibble: 28 x 18
   policy_id claim_nb expos~1 power car_age drive~2 brand gas   region
   <chr>        <int>   <dbl> <fct>   <int>   <int> <fct> <fct> <fct> 
 1 171344           1 0.00274 d          16      20 Fiat  Regu~ Centre
 2 188359           1 0.00274 d           1      43 Volk~ Regu~ Centre
 3 249311           1 0.00274 d           8      28 Rena~ Regu~ Centre
 4 24492            1 0.00546 f           7      28 Rena~ Dies~ Ile-d~
 5 26305            1 0.00546 d          15      33 Rena~ Regu~ Centre
 6 360082           1 0.00546 d          20      19 Rena~ Regu~ Centre
 7 72149            1 0.00548 e          10      55 Rena~ Regu~ Pays-~
 8 96106            1 0.00548 f           8      51 Rena~ Dies~ Centre
 9 102385           1 0.00548 f          11      34 Opel~ Regu~ Centre
10 155312           1 0.00548 i          10      43 Rena~ Regu~ Centre
# ... with 18 more rows, 9 more variables: density <int>,
#   clm_rate <dbl>, .fitted <dbl>, .resid <dbl>, .std.resid <dbl>,
#   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, pred_round <dbl>, and
#   abbreviated variable names 1: exposure, 2: driver_age

Nevertheless, as we are not trying to build the most accurate model here, I will leave the model as it is in this post.

Poisson with weight

When I was reading different materials about building Poisson regressions, I came across this post on how a model with an offset and a model with weights would give the same result if we pass the relevant parameters to the model.

Hence, to satisfy my own curiosity, I have built another Poisson regression by using weight.

I will use ‘clm_rate’ as the target variable and also indicate exposure as the weight.

glm_fit_rate <-
  glm(clm_rate ~ 
     power
    + car_age
    + driver_age
    + brand
    + gas
    + region
    + density
    ,weight = exposure
    ,data = df
    ,family = poisson)

If we extract the model info, we will realize that the estimates and all other model info are the same as those from the Poisson regression with offset that we built earlier.

glm_fit_rate %>% 
  tidy() %>% 
  mutate(exp_coef = exp(estimate)) %>%
  relocate(exp_coef, .before = std.error)
# A tibble: 31 x 6
   term        estimate exp_coef std.error statistic   p.value
   <chr>          <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)  -2.00      0.136    0.0584    -34.2  4.79e-256
 2 powere        0.0643    1.07     0.0280      2.29 2.18e-  2
 3 powerf        0.0810    1.08     0.0273      2.96 3.04e-  3
 4 powerg        0.0489    1.05     0.0272      1.80 7.26e-  2
 5 powerh        0.0928    1.10     0.0389      2.39 1.70e-  2
 6 poweri        0.189     1.21     0.0430      4.40 1.07e-  5
 7 powerj        0.168     1.18     0.0438      3.84 1.25e-  4
 8 powerk        0.232     1.26     0.0561      4.14 3.40e-  5
 9 powerl        0.0978    1.10     0.0834      1.17 2.41e-  1
10 powerm        0.149     1.16     0.119       1.25 2.11e-  1
# ... with 21 more rows

This shows that we could model the claim rate directly if the claim rate is the interest of the study. However, the weight needs to be specified when we are modeling the claim rate directly.
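The equivalence above can also be verified on simulated data (the variables x, expo, y, and rate below are hypothetical): a Poisson glm on counts with a log(exposure) offset and a Poisson glm on the rate with exposure as prior weights solve the same estimating equations, so they return the same coefficients. Note that glm will warn about non-integer responses in the rate model; the warnings are harmless here.

```r
# Simulate exposure-scaled Poisson counts and the corresponding rate.
set.seed(7)
n    <- 2000
x    <- rnorm(n)
expo <- runif(n, 0.1, 1)
y    <- rpois(n, lambda = expo * exp(-1 + 0.4 * x))
rate <- y / expo

# Counts with offset vs rate with prior weights.
fit_offset <- glm(y ~ x, offset = log(expo), family = poisson)
fit_weight <- glm(rate ~ x, weights = expo, family = poisson)

all.equal(coef(fit_offset), coef(fit_weight))  # TRUE
```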

Model Performance

glm_pred_weight <-
  predict(glm_fit_rate, type = "response") %>% 
  as_tibble() %>% 
  bind_cols(df) %>%
  mutate(pred = value * exposure
         ,pred_round = round(pred, 0))
graph_pred_act_df <-
  tibble("claim_count" = c(0, 1, 2, 3, 4)) %>% 
  left_join(
    glm_pred_weight %>% 
      group_by(claim_nb) %>% 
      summarise(`actual claim count` = n())
    ,by = c("claim_count" = "claim_nb")
  ) %>% 
  left_join(
    glm_pred_weight %>% 
      group_by(pred_round) %>% 
      summarise(`predicted claim count` = n())
    ,by = c("claim_count" = "pred_round")
  ) %>% 
  mutate_at(vars(everything())
            ,function(x) replace_na(x, 0)) %>% 
  pivot_longer(!claim_count
               ,names_to = "claim_type"
               ,values_to = "count")

graph_pred_act_df %>% 
  ggplot(aes(claim_count, count, fill = claim_type)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Actual vs Predicted Claim Count")  

Use case_weights function in Tidymodels

Tidymodels also offers a weighting function that allows us to build a Poisson model with weights.

First, I will need to mutate the exposure column to indicate that this column should be used as the weights in the modeling.

df_rate <-
  df %>% 
  mutate(exposure = importance_weights(exposure))

Note that at the point of writing this post, tidymodels has two types of weight functions:

- frequency_weights, for non-negative integer weights representing how many times each row is observed
- importance_weights, for non-negative numeric weights that influence how much each row affects the fit

As the exposure is not an integer, I have used the importance_weights function here.
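These two constructors come from the hardhat package (loaded with tidymodels), and the difference shows in what each accepts; a small sketch with made-up values:

```r
library(hardhat)

# Integer counts of how often each row occurs.
frequency_weights(c(1L, 2L, 5L))

# Non-negative doubles, e.g. fractional exposures like ours.
importance_weights(c(0.09, 0.84, 1))
```

Passing a non-integer vector to frequency_weights errors, which is exactly why importance_weights is the right choice for exposure.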

Once the column is mutated, we can see that the column type has changed when we print the table.

df_rate %>% 
  slice_head(n = 10)
   policy_id claim_nb exposure power car_age driver_age
1          1        0     0.09     g       0         46
2          2        0     0.84     g       0         46
3          3        0     0.52     f       2         38
4          4        0     0.45     f       2         38
5          5        0     0.15     g       0         41
6          6        0     0.75     g       0         41
7          7        0     0.81     d       1         27
8          8        0     0.05     d       0         27
9          9        0     0.76     d       9         23
10        10        0     0.34     i       0         44
                                brand     gas             region
1  Japanese (except Nissan) or Korean  Diesel          Aquitaine
2  Japanese (except Nissan) or Korean  Diesel          Aquitaine
3  Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais
4  Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais
5  Japanese (except Nissan) or Korean  Diesel   Pays-de-la-Loire
6  Japanese (except Nissan) or Korean  Diesel   Pays-de-la-Loire
7  Japanese (except Nissan) or Korean Regular          Aquitaine
8  Japanese (except Nissan) or Korean Regular          Aquitaine
9                                Fiat Regular Nord-Pas-de-Calais
10 Japanese (except Nissan) or Korean Regular      Ile-de-France
   density clm_rate
1       76        0
2       76        0
3     3003        0
4     3003        0
5       60        0
6       60        0
7      695        0
8      695        0
9     7887        0
10   27000        0

To build the Poisson model with weights, we will follow the usual steps of building machine learning models under the tidymodels framework:

- define a recipe for the preprocessing
- specify the model with poisson_reg and set the engine
- combine the recipe and the model specification into a workflow
- fit the workflow on the data

The only difference is that we need to specify the weight by using the add_case_weights function and add it to the workflow.

gen_recipe <- 
  recipe(clm_rate ~ power + car_age + driver_age + brand + gas + region + density + exposure
         ,data = df_rate)

pois_specs <-
  poisson_reg() %>% 
  set_engine("glm")
  
pois_wf <-
  workflow() %>% 
  add_case_weights(exposure) %>% 
  add_recipe(gen_recipe) %>% 
  add_model(pois_specs)

If we call the workflow, we will see that exposure is captured as a case weight in the workflow.

pois_wf
== Workflow ==========================================================
Preprocessor: Recipe
Model: poisson_reg()

-- Preprocessor ------------------------------------------------------
0 Recipe Steps

-- Case Weights ------------------------------------------------------
exposure

-- Model -------------------------------------------------------------
Poisson Regression Model Specification (regression)

Computational engine: glm 

Finally, we will fit the model.

pois_fit <-
  pois_wf %>% 
  fit(data = df_rate)

We can see that the estimates and relevant results are the same as those of the weighted Poisson model built by using the glm function directly.

tidy(pois_fit)
# A tibble: 31 x 5
   term        estimate std.error statistic   p.value
   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)  -2.00      0.0584    -34.2  4.79e-256
 2 powere        0.0643    0.0280      2.29 2.18e-  2
 3 powerf        0.0810    0.0273      2.96 3.04e-  3
 4 powerg        0.0489    0.0272      1.80 7.26e-  2
 5 powerh        0.0928    0.0389      2.39 1.70e-  2
 6 poweri        0.189     0.0430      4.40 1.07e-  5
 7 powerj        0.168     0.0438      3.84 1.25e-  4
 8 powerk        0.232     0.0561      4.14 3.40e-  5
 9 powerl        0.0978    0.0834      1.17 2.41e-  1
10 powerm        0.149     0.119       1.25 2.11e-  1
# ... with 21 more rows

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Sixteen Miles Out on Unsplash

References

Hilbe, Joseph M. 2011. Negative Binomial Regression. Cambridge University Press.