Too many zeros!
In this post, I will explore how to check for overdispersion and how to build a zero-inflated model.

Overdispersion describes the observation that variation is higher than would be expected under the assumed distribution (Dormann).
Overdispersion is often a result of missing predictors or a misspecified model structure.
This scenario is common in insurance datasets, as the policies that made a claim are likely to be only a small subset of the total portfolio.
If overdispersion is present in a dataset, the estimated standard errors and the test statistics for the overall goodness of fit will be distorted, and adjustments must be made (PennState Eberly College of Science).
Specifically, I will show how a zero-inflated model can be used to address overdispersion caused by excess zeros.
A zero-inflated model is a mixture model with two components.
One component determines whether the outcome is zero or a positive value
The other component models the count process (e.g., a Poisson distribution)
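As a quick illustration of this mixture (with hypothetical parameter values, not taken from the post), zero-inflated counts can be simulated by mixing a point mass at zero with a Poisson draw:

```r
set.seed(42)
n      <- 10000
pi     <- 0.3   # probability of a structural zero (assumed value)
lambda <- 1.5   # Poisson mean for the count component (assumed value)

# zero component: is this observation a structural zero?
structural_zero <- rbinom(n, size = 1, prob = pi)

# count component: Poisson draw, overridden by the structural zeros
y <- ifelse(structural_zero == 1, 0, rpois(n, lambda))

# the mixture produces far more zeros than a plain Poisson with the same lambda
mean(y == 0)
mean(rpois(n, lambda) == 0)
```

The extra zeros from the binary component are exactly what a plain Poisson fit cannot absorb, which shows up later as overdispersion.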
One issue I came across in other posts is the misuse of the Vuong test for non-nested models to test for zero-inflation.
Wilson suggested that it is incorrect to use the Vuong test to check whether a zero-inflated model is a better fit than a standard count distribution (e.g., a Poisson distribution), because the two models are not strictly non-nested.
First, I will import the necessary packages into the environment.
pacman::p_load(tidyverse, tidymodels, poissonreg, pscl, DHARMa, AER)
In this demonstration, I will be using a car insurance dataset from the CASdatasets package.
I will perform some data wrangling before model building.
# claim count per client-vehicle combination
clm_cnt <-
  clm_df %>%
  group_by(id_client, id_vehicle) %>%
  summarise(tot_clm_cnt = sum(claim_nb),
            tot_clm_amt = sum(claim_amount))

# join policy and claim data; policies without claims get zeros
pol_clm_df <-
  pol_df %>%
  left_join(clm_cnt,
            by = c("id_client", "id_vehicle")) %>%
  mutate_at(c("tot_clm_cnt", "tot_clm_amt"),
            function(x) replace_na(x, 0))
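Before plotting, it can be worth quantifying the zero share directly (a quick sanity check, not shown in the original post):

```r
# proportion of policies with no claim
mean(pol_clm_df$tot_clm_cnt == 0)

# frequency table of claim counts
table(pol_clm_df$tot_clm_cnt)
```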
Below is the claim count distribution:
pol_clm_df %>%
  ggplot(aes(as.character(tot_clm_cnt))) +
  geom_bar() +
  scale_y_continuous(labels = scales::comma) +
  xlab("Claim Count") +
  ylab("Frequency") +
  labs(title = "Claim Distribution")

Based on the graph, the claim count looks zero-inflated.
Before fitting a zero-inflated model, however, let's look at how to check for overdispersion formally.
First, I will build a baseline Poisson model.
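The post does not show the fitting call, but the glm_model object used below could be fitted along these lines (the predictor set is assumed to mirror the zero-inflated model fitted later):

```r
# baseline Poisson GLM; predictor set assumed to match the later zeroinfl fit
glm_model <-
  glm(tot_clm_cnt ~
        drv_age1
      + vh_age
      + pol_bonus
      + pol_usage
      + drv_drv2
      + pol_duration
      + vh_type
      + vh_speed
      ,family = poisson(link = "log")
      ,data = pol_clm_df)
```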
dispersiontest function from AER package
Next, I will use the dispersiontest function to check whether there is any sign of overdispersion.
dispersiontest(glm_model)
Overdispersion test
data: glm_model
z = 13.711, p-value < 2.2e-16
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.125756
testOverdispersion function from DHARMa package
Another way to check for overdispersion is to use the testOverdispersion function.
sim_fmp <- simulateResiduals(glm_model)
testOverdispersion(sim_fmp)

DHARMa nonparametric dispersion test via sd of residuals
fitted vs. simulated
data: simulationOutput
dispersion = 1.1083, p-value < 2.2e-16
alternative hypothesis: two.sided
If we generate the QQ plot and the residuals-vs-predicted plot, we can see that the Poisson distribution is not appropriate. There is statistical evidence that overdispersion exists in the dataset.
plotSimulatedResiduals(sim_fmp)

Note that the testOverdispersion function did not work on quasi distributions at the time of writing.
Good! We have set the context, and now we will look at how to fit a zero-inflated model.
pscl package
Next, I will build the zero-inflated model by using the zeroinfl function from the pscl package.
# formula syntax: count model predictors | zero-inflation model predictors
clm_cnt_pscl <-
  zeroinfl(tot_clm_cnt ~
             drv_age1
           + vh_age
           + pol_bonus
           + pol_usage
           + drv_drv2
           + pol_duration
           + vh_type
           + vh_speed
           | pol_bonus
           ,data = pol_clm_df)
We can pass the fitted object into the summary function to extract the coefficients.
summary(clm_cnt_pscl)
Call:
zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus +
pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed |
pol_bonus, data = pol_clm_df)
Pearson residuals:
Min 1Q Median 3Q Max
-0.6346 -0.4001 -0.3470 -0.2745 14.5165
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.9372599 0.2766965 -7.001 2.53e-12 ***
drv_age1 0.0019583 0.0009365 2.091 0.03652 *
vh_age -0.0572089 0.0016941 -33.770 < 2e-16 ***
pol_bonus 0.8457973 0.2799440 3.021 0.00252 **
pol_usageProfessional -0.3046810 0.2072691 -1.470 0.14157
pol_usageRetired -0.5818882 0.2066632 -2.816 0.00487 **
pol_usageWorkPrivate -0.4979166 0.2054887 -2.423 0.01539 *
drv_drv2Yes 0.0454272 0.0184933 2.456 0.01403 *
pol_duration -0.0027838 0.0011289 -2.466 0.01367 *
vh_typeTourism -0.1085948 0.0366440 -2.964 0.00304 **
vh_speed 0.0052439 0.0004437 11.819 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0383 0.4084 -2.542 0.011 *
pol_bonus 0.5855 0.7046 0.831 0.406
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 43
Log-likelihood: -4.193e+04 on 13 Df
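One way to check whether the zero-inflated specification actually captures the excess zeros (a sketch, not part of the original post) is to compare the observed number of zeros with the number the fitted model expects. For zeroinfl objects, predict with type = "prob" returns per-observation probabilities for each count value, with the first column giving P(Y = 0):

```r
# observed zeros in the data
obs_zeros <- sum(pol_clm_df$tot_clm_cnt == 0)

# expected zeros under the fitted zero-inflated model
exp_zeros <- sum(predict(clm_cnt_pscl, type = "prob")[, 1])

c(observed = obs_zeros, expected = round(exp_zeros))
```

If the two numbers are close, the zero-inflation component is doing its job.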
tidymodels package
Now I will explore how to build the zero-inflated model by using functions from the tidymodels ecosystem (via the poissonreg package).
clm_cnt_spec_poisson <-
poisson_reg() %>%
set_engine("zeroinfl", dist = "poisson")
clm_cnt_fit_possion <-
clm_cnt_spec_poisson %>%
fit(tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus + pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed | pol_bonus
,data = pol_clm_df)
clm_cnt_fit_possion
parsnip model object
Call:
pscl::zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus +
pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed |
pol_bonus, data = data, dist = ~"poisson")
Count model coefficients (poisson with log link):
(Intercept) drv_age1 vh_age
-1.937260 0.001958 -0.057209
pol_bonus pol_usageProfessional pol_usageRetired
0.845797 -0.304681 -0.581888
pol_usageWorkPrivate drv_drv2Yes pol_duration
-0.497917 0.045427 -0.002784
vh_typeTourism vh_speed
-0.108595 0.005244
Zero-inflation model coefficients (binomial with logit link):
(Intercept) pol_bonus
-1.0383 0.5855
summary(clm_cnt_fit_possion$fit)
Call:
pscl::zeroinfl(formula = tot_clm_cnt ~ drv_age1 + vh_age + pol_bonus +
pol_usage + drv_drv2 + pol_duration + vh_type + vh_speed |
pol_bonus, data = data, dist = ~"poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-0.6346 -0.4001 -0.3470 -0.2745 14.5165
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.9372599 0.2766965 -7.001 2.53e-12 ***
drv_age1 0.0019583 0.0009365 2.091 0.03652 *
vh_age -0.0572089 0.0016941 -33.770 < 2e-16 ***
pol_bonus 0.8457973 0.2799440 3.021 0.00252 **
pol_usageProfessional -0.3046810 0.2072691 -1.470 0.14157
pol_usageRetired -0.5818882 0.2066632 -2.816 0.00487 **
pol_usageWorkPrivate -0.4979166 0.2054887 -2.423 0.01539 *
drv_drv2Yes 0.0454272 0.0184933 2.456 0.01403 *
pol_duration -0.0027838 0.0011289 -2.466 0.01367 *
vh_typeTourism -0.1085948 0.0366440 -2.964 0.00304 **
vh_speed 0.0052439 0.0004437 11.819 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0383 0.4084 -2.542 0.011 *
pol_bonus 0.5855 0.7046 0.831 0.406
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 43
Log-likelihood: -4.193e+04 on 13 Df
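Once fitted, the parsnip object can be used for prediction like any other tidymodels fit (a sketch; the .pred column name follows the usual tidymodels convention):

```r
# predicted expected claim counts alongside the observed counts
predict(clm_cnt_fit_possion, new_data = pol_clm_df) %>%
  bind_cols(pol_clm_df %>% select(tot_clm_cnt)) %>%
  head()
```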
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!
