In this post, I will be exploring how to check the dispersion and how to build a zero-inflated model.

What is overdispersion?

Overdispersion describes the observation that variation is higher than would be expected (Dormann).

Often overdispersion can be a result of missing predictors or a misspecified model structure.

This overdispersion scenario is common in the insurance dataset as the policies made a claim is likely to be just a small set of the total portfolio.

Why do we care about overdispersion?

If overdispersion is present in a dataset, the estimated standard errors and test statistics of the overall goodness-of-fit will be distorted and adjustments must be made (PennState Eberly College of Science).

In this post, I will be exploring how to use a zero-inflated model to fix the overdispersion issue in the model.

Zero-inflated model

Zero-inflated model is a type of multi-level model. It comprises two components.

One determines whether the outcome is zero or a positive value
Another part is the count process (e.g., Poisson distribution)

Other important notes

One of the posts I came across is the misuse of Vuong test for non-nested models to test for zero-inflation.

(Wilson) suggested that it’s incorrect to use Vuong test to check whether a zero-inflated model is a better fit for count distribution (e.g., Poisson distribution).

Demonstration

First, I will import the necessary packages into the environment.

pacman::p_load(tidyverse, tidymodels, poissonreg, pscl, DHARMa, AER)

In this demonstration, I will be using this car insurance dataset from CASdatasets package.

data(pg17trainpol)
data(pg17trainclaim)

pol_df <- pg17trainpol
rm(pg17trainpol)

clm_df <- pg17trainclaim
rm(pg17trainclaim)

I will perform some data wrangling before model building.

# claim count
clm_cnt <-
  clm_df %>% 
  group_by(id_client, id_vehicle) %>% 
  summarise(tot_clm_cnt = sum(claim_nb),
            tot_clm_amt = sum(claim_amount))

# join policy and claim data
pol_clm_df <-
  pol_df %>% 
  left_join(clm_cnt
            ,by = c("id_client", "id_vehicle")) %>% 
  mutate_at(c("tot_clm_cnt", "tot_clm_amt")
            ,function(x) replace_na(x, 0))

Below is the claim count distribution:

pol_clm_df %>% 
  ggplot(aes(as.character(tot_clm_cnt))) +
  geom_histogram(stat = "count") +
  scale_y_continuous(labels = scales::comma) +
  xlab("Claim Count") +
  ylab("Frequency") +
  labs(title = "Claim Distribution")

Based on the graph, it looks like the claim count is zero-inflated.

Nevertheless, let’s look at how to check over dispersion.

How to check overdispersion

First, I will build a Poisson model.

glm_model <-
  glm(tot_clm_cnt ~ 
             drv_age1 
           + vh_age 
           + pol_bonus 
           + pol_usage 
           + drv_drv2 
           + pol_duration
           + vh_type
           + vh_speed
      ,offset = log(pol_duration)
      ,family = poisson
      ,data = pol_clm_df)

Method 1: Use `dispersiontest` function from `AER` package

Next, I will use dispersiontest function to check there is any sign of over dispersion.

dispersiontest(glm_model)


    Overdispersion test

data:  glm_model
z = 13.711, p-value < 2.2e-16
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.125756

Method 2: Use the `testOverdispersion` function from `DHARMa` package

Another way to check for over dispersion is to use testOverdispersion function.

sim_fmp <- simulateResiduals(glm_model) 
testOverdispersion(sim_fmp)


    DHARMa nonparametric dispersion test via sd of residuals
    fitted vs. simulated

data:  simulationOutput
dispersion = 1.1083, p-value < 2.2e-16
alternative hypothesis: two.sided

If we generate the QQ plot and residuals vs predicted plot, we can see from the results that poisson distribution is not appropriate. There is statistical evidence that the overdispersion exists in the dataset.

plotSimulatedResiduals(sim_fmp)

Note that the testOverdispersion function does not work on quasi distributions at the point of writing.

Zero-inflated model

Good! We have set the context and now we will look at how to fit zero inflated model.

Method 1: Use `pscl` package

Next, I will be building the zero-inflated model by using zeroinfl function from pscl package.

clm_cnt_pscl <-
  zeroinfl(tot_clm_cnt ~ 
             drv_age1 
           + vh_age 
           + pol_bonus 
           + pol_usage 
           + drv_drv2 
           + pol_duration
           + vh_type
           + vh_speed
           | pol_bonus
      ,data = pol_clm_df)

We can pass the fitted object into summary function to extract the coefficients.

summary(clm_cnt_pscl)


Call:
zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + 
    pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | 
    pol_bonus, data = pol_clm_df)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.6346 -0.4001 -0.3470 -0.2745 14.5165 

Count model coefficients (poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -1.9372599  0.2766965  -7.001 2.53e-12 ***
drv_age1               0.0019583  0.0009365   2.091  0.03652 *  
vh_age                -0.0572089  0.0016941 -33.770  < 2e-16 ***
pol_bonus              0.8457973  0.2799440   3.021  0.00252 ** 
pol_usageProfessional -0.3046810  0.2072691  -1.470  0.14157    
pol_usageRetired      -0.5818882  0.2066632  -2.816  0.00487 ** 
pol_usageWorkPrivate  -0.4979166  0.2054887  -2.423  0.01539 *  
drv_drv2Yes            0.0454272  0.0184933   2.456  0.01403 *  
pol_duration          -0.0027838  0.0011289  -2.466  0.01367 *  
vh_typeTourism        -0.1085948  0.0366440  -2.964  0.00304 ** 
vh_speed               0.0052439  0.0004437  11.819  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.0383     0.4084  -2.542    0.011 *
pol_bonus     0.5855     0.7046   0.831    0.406  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -4.193e+04 on 13 Df

Method 2: Use `tidymodels` package

I will be exploring how to build zero inflated model by using functions from tidymodels package.

clm_cnt_spec_poisson <- 
  poisson_reg() %>% 
  set_engine("zeroinfl", dist = "poisson")

clm_cnt_fit_possion <-
  clm_cnt_spec_poisson %>% 
  fit(tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | pol_bonus
      ,data = pol_clm_df)

clm_cnt_fit_possion

parsnip model object


Call:
pscl::zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + 
    pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | 
    pol_bonus, data = data, dist = ~"poisson")

Count model coefficients (poisson with log link):
          (Intercept)               drv_age1                 vh_age  
            -1.937260               0.001958              -0.057209  
            pol_bonus  pol_usageProfessional       pol_usageRetired  
             0.845797              -0.304681              -0.581888  
 pol_usageWorkPrivate            drv_drv2Yes           pol_duration  
            -0.497917               0.045427              -0.002784  
       vh_typeTourism               vh_speed  
            -0.108595               0.005244  

Zero-inflation model coefficients (binomial with logit link):
(Intercept)    pol_bonus  
    -1.0383       0.5855

summary(clm_cnt_fit_possion$fit)


Call:
pscl::zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + 
    pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | 
    pol_bonus, data = data, dist = ~"poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.6346 -0.4001 -0.3470 -0.2745 14.5165 

Count model coefficients (poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -1.9372599  0.2766965  -7.001 2.53e-12 ***
drv_age1               0.0019583  0.0009365   2.091  0.03652 *  
vh_age                -0.0572089  0.0016941 -33.770  < 2e-16 ***
pol_bonus              0.8457973  0.2799440   3.021  0.00252 ** 
pol_usageProfessional -0.3046810  0.2072691  -1.470  0.14157    
pol_usageRetired      -0.5818882  0.2066632  -2.816  0.00487 ** 
pol_usageWorkPrivate  -0.4979166  0.2054887  -2.423  0.01539 *  
drv_drv2Yes            0.0454272  0.0184933   2.456  0.01403 *  
pol_duration          -0.0027838  0.0011289  -2.466  0.01367 *  
vh_typeTourism        -0.1085948  0.0366440  -2.964  0.00304 ** 
vh_speed               0.0052439  0.0004437  11.819  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.0383     0.4084  -2.542    0.011 *
pol_bonus     0.5855     0.7046   0.831    0.406  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -4.193e+04 on 13 Df

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Tim Foster on Unsplash

Dormann, Carsten F. “Introduction: What Is Overdispersion?” https://biometry.github.io/APES//LectureNotes/2016-JAGS/Overdispersion/OverdispersionJAGS.html.

PennState Eberly College of Science. “7.3 - Overdispersion.” https://online.stat.psu.edu/stat504/lesson/7/7.3#:~:text=If%20overdispersion%20is%20present%20in,and%20adjustments%20must%20be%20made.

Wilson, Paul. “The Misuse of the Vuong Test for Non-Nested Models to Test for Zero-Inflation.” https://wlv.openrepository.com/bitstream/handle/2436/622610/Vuong1.pdf;jsessionid=174FF3A866D1D31B0C0129BB1606B68A?sequence=2.

Zero-Inflated Poisson

What is overdispersion?

Why do we care about overdispersion?

Zero-inflated model

Other important notes

Demonstration

How to check overdispersion

Method 1: Use `dispersiontest` function from `AER` package

Method 2: Use the `testOverdispersion` function from `DHARMa` package

Zero-inflated model

Method 1: Use `pscl` package

Method 2: Use `tidymodels` package

Conclusion

References

Zero-Inflated Poisson

What is overdispersion?

Why do we care about overdispersion?

Zero-inflated model

Other important notes

Demonstration

How to check overdispersion

Method 1: Use dispersiontest function from AER package

Method 2: Use the testOverdispersion function from DHARMa package

Zero-inflated model

Method 1: Use pscl package

Method 2: Use tidymodels package

Conclusion

References

Method 1: Use `dispersiontest` function from `AER` package

Method 2: Use the `testOverdispersion` function from `DHARMa` package

Method 1: Use `pscl` package

Method 2: Use `tidymodels` package