Zero-Inflated Poisson

Machine Learning Supervised Learning

Too many zeros!

Jasper Lok https://jasperlok.netlify.app/
09-14-2024

In this post, I will be exploring how to check the dispersion and how to build a zero-inflated model.

Photo by Pickled Stardust on Unsplash

What is overdispersion?

Overdispersion describes the observation that variation is higher than would be expected (Dormann).

Often overdispersion can be a result of missing predictors or a misspecified model structure.

This overdispersion scenario is common in the insurance dataset as the policies made a claim is likely to be just a small set of the total portfolio.

Why do we care about overdispersion?

If overdispersion is present in a dataset, the estimated standard errors and test statistics of the overall goodness-of-fit will be distorted and adjustments must be made (PennState Eberly College of Science).

In this post, I will be exploring how to use a zero-inflated model to fix the overdispersion issue in the model.

Zero-inflated model

Zero-inflated model is a type of multi-level model. It comprises two components.

Other important notes

One of the posts I came across is the misuse of Vuong test for non-nested models to test for zero-inflation.

(Wilson) suggested that it’s incorrect to use Vuong test to check whether a zero-inflated model is a better fit for count distribution (e.g., Poisson distribution).

Demonstration

First, I will import the necessary packages into the environment.

pacman::p_load(tidyverse, tidymodels, poissonreg, pscl, DHARMa, AER)

In this demonstration, I will be using this car insurance dataset from CASdatasets package.

data(pg17trainpol)
data(pg17trainclaim)

pol_df <- pg17trainpol
rm(pg17trainpol)

clm_df <- pg17trainclaim
rm(pg17trainclaim)

I will perform some data wrangling before model building.

# claim count
clm_cnt <-
  clm_df %>% 
  group_by(id_client, id_vehicle) %>% 
  summarise(tot_clm_cnt = sum(claim_nb),
            tot_clm_amt = sum(claim_amount))

# join policy and claim data
pol_clm_df <-
  pol_df %>% 
  left_join(clm_cnt
            ,by = c("id_client", "id_vehicle")) %>% 
  mutate_at(c("tot_clm_cnt", "tot_clm_amt")
            ,function(x) replace_na(x, 0))

Below is the claim count distribution:

pol_clm_df %>% 
  ggplot(aes(as.character(tot_clm_cnt))) +
  geom_histogram(stat = "count") +
  scale_y_continuous(labels = scales::comma) +
  xlab("Claim Count") +
  ylab("Frequency") +
  labs(title = "Claim Distribution")

Based on the graph, it looks like the claim count is zero-inflated.

Nevertheless, let’s look at how to check over dispersion.

How to check overdispersion

First, I will build a Poisson model.

glm_model <-
  glm(tot_clm_cnt ~ 
             drv_age1 
           + vh_age 
           + pol_bonus 
           + pol_usage 
           + drv_drv2 
           + pol_duration
           + vh_type
           + vh_speed
      ,offset = log(pol_duration)
      ,family = poisson
      ,data = pol_clm_df)

Method 1: Use dispersiontest function from AER package

Next, I will use dispersiontest function to check there is any sign of over dispersion.

dispersiontest(glm_model)

    Overdispersion test

data:  glm_model
z = 13.711, p-value < 2.2e-16
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.125756 

Method 2: Use the testOverdispersion function from DHARMa package

Another way to check for over dispersion is to use testOverdispersion function.

sim_fmp <- simulateResiduals(glm_model) 
testOverdispersion(sim_fmp)


    DHARMa nonparametric dispersion test via sd of residuals
    fitted vs. simulated

data:  simulationOutput
dispersion = 1.1083, p-value < 2.2e-16
alternative hypothesis: two.sided

If we generate the QQ plot and residuals vs predicted plot, we can see from the results that poisson distribution is not appropriate. There is statistical evidence that the overdispersion exists in the dataset.

plotSimulatedResiduals(sim_fmp)

Note that the testOverdispersion function does not work on quasi distributions at the point of writing.

Zero-inflated model

Good! We have set the context and now we will look at how to fit zero inflated model.

Method 1: Use pscl package

Next, I will be building the zero-inflated model by using zeroinfl function from pscl package.

clm_cnt_pscl <-
  zeroinfl(tot_clm_cnt ~ 
             drv_age1 
           + vh_age 
           + pol_bonus 
           + pol_usage 
           + drv_drv2 
           + pol_duration
           + vh_type
           + vh_speed
           | pol_bonus
      ,data = pol_clm_df)

We can pass the fitted object into summary function to extract the coefficients.

summary(clm_cnt_pscl)

Call:
zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + 
    pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | 
    pol_bonus, data = pol_clm_df)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.6346 -0.4001 -0.3470 -0.2745 14.5165 

Count model coefficients (poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -1.9372599  0.2766965  -7.001 2.53e-12 ***
drv_age1               0.0019583  0.0009365   2.091  0.03652 *  
vh_age                -0.0572089  0.0016941 -33.770  < 2e-16 ***
pol_bonus              0.8457973  0.2799440   3.021  0.00252 ** 
pol_usageProfessional -0.3046810  0.2072691  -1.470  0.14157    
pol_usageRetired      -0.5818882  0.2066632  -2.816  0.00487 ** 
pol_usageWorkPrivate  -0.4979166  0.2054887  -2.423  0.01539 *  
drv_drv2Yes            0.0454272  0.0184933   2.456  0.01403 *  
pol_duration          -0.0027838  0.0011289  -2.466  0.01367 *  
vh_typeTourism        -0.1085948  0.0366440  -2.964  0.00304 ** 
vh_speed               0.0052439  0.0004437  11.819  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.0383     0.4084  -2.542    0.011 *
pol_bonus     0.5855     0.7046   0.831    0.406  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -4.193e+04 on 13 Df

Method 2: Use tidymodels package

I will be exploring how to build zero inflated model by using functions from tidymodels package.

clm_cnt_spec_poisson <- 
  poisson_reg() %>% 
  set_engine("zeroinfl", dist = "poisson")

clm_cnt_fit_possion <-
  clm_cnt_spec_poisson %>% 
  fit(tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | pol_bonus
      ,data = pol_clm_df)

clm_cnt_fit_possion
parsnip model object


Call:
pscl::zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + 
    pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | 
    pol_bonus, data = data, dist = ~"poisson")

Count model coefficients (poisson with log link):
          (Intercept)               drv_age1                 vh_age  
            -1.937260               0.001958              -0.057209  
            pol_bonus  pol_usageProfessional       pol_usageRetired  
             0.845797              -0.304681              -0.581888  
 pol_usageWorkPrivate            drv_drv2Yes           pol_duration  
            -0.497917               0.045427              -0.002784  
       vh_typeTourism               vh_speed  
            -0.108595               0.005244  

Zero-inflation model coefficients (binomial with logit link):
(Intercept)    pol_bonus  
    -1.0383       0.5855  
summary(clm_cnt_fit_possion$fit)

Call:
pscl::zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + 
    pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | 
    pol_bonus, data = data, dist = ~"poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.6346 -0.4001 -0.3470 -0.2745 14.5165 

Count model coefficients (poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -1.9372599  0.2766965  -7.001 2.53e-12 ***
drv_age1               0.0019583  0.0009365   2.091  0.03652 *  
vh_age                -0.0572089  0.0016941 -33.770  < 2e-16 ***
pol_bonus              0.8457973  0.2799440   3.021  0.00252 ** 
pol_usageProfessional -0.3046810  0.2072691  -1.470  0.14157    
pol_usageRetired      -0.5818882  0.2066632  -2.816  0.00487 ** 
pol_usageWorkPrivate  -0.4979166  0.2054887  -2.423  0.01539 *  
drv_drv2Yes            0.0454272  0.0184933   2.456  0.01403 *  
pol_duration          -0.0027838  0.0011289  -2.466  0.01367 *  
vh_typeTourism        -0.1085948  0.0366440  -2.964  0.00304 ** 
vh_speed               0.0052439  0.0004437  11.819  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.0383     0.4084  -2.542    0.011 *
pol_bonus     0.5855     0.7046   0.831    0.406  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -4.193e+04 on 13 Df

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Tim Foster on Unsplash

Dormann, Carsten F. “Introduction: What Is Overdispersion?” https://biometry.github.io/APES//LectureNotes/2016-JAGS/Overdispersion/OverdispersionJAGS.html.
PennState Eberly College of Science. “7.3 - Overdispersion.” https://online.stat.psu.edu/stat504/lesson/7/7.3#:~:text=If%20overdispersion%20is%20present%20in,and%20adjustments%20must%20be%20made.
Wilson, Paul. “The Misuse of the Vuong Test for Non-Nested Models to Test for Zero-Inflation.” https://wlv.openrepository.com/bitstream/handle/2436/622610/Vuong1.pdf;jsessionid=174FF3A866D1D31B0C0129BB1606B68A?sequence=2.

References