Different methods that give the same results
In this post, I will be exploring how to build logistic regression models on an aggregated dataset.

In this demonstration, I will be using the base glm function to build models with the following three methods:
1st method: Logistic regression on non-aggregated data
2nd method: Logistic regression on aggregated data + use cbind argument
3rd method: Logistic regression on aggregated data + weights
Once the models are built, I will compare the coefficients and predictions across the three models.
First, I will call the necessary packages.
pacman::p_load(tidyverse, tidymodels, janitor)
I will also reuse the employee attrition data from my previous analysis in this demonstration.
df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-03-12-marketbasket/data/general_data.csv") %>%
  # clean up the column naming
  clean_names() %>%
  # convert the attrition column to the correct column type
  mutate(attrition = as.factor(attrition))
The original dataset can be found under this link.
For simplicity, I will use the following variables to build the model.
df_1 <-
  df %>%
  select(c(age
           ,attrition
           ,business_travel
           ,department
           ,job_role
           ,marital_status))
For the first method, I will be building a logistic regression on the non-aggregated data.
logit_fit <-
  glm(attrition ~ .
      ,family = "binomial"
      ,data = df_1)
Next, I will pass the fitted model to the anova function.
anova(logit_fit)
Analysis of Deviance Table
Model: binomial, link: logit
Response: attrition
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 4409 3895.7
age 1 118.523 4408 3777.2
business_travel 2 71.259 4406 3706.0
department 2 26.372 4404 3679.6
job_role 8 29.558 4396 3650.0
marital_status 2 93.401 4394 3556.6
Next, I will build a logistic regression on the aggregated data.
With that, I will first derive the aggregated data.
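The aggregation step could look like the following. This is a sketch: the grouping columns come from the variables selected earlier, and the column names attrition_count, nonattrition_count and count are inferred from how they are used later in the post.

```r
# count attrition and non-attrition cases for each unique profile
df_agg <-
  df_1 %>%
  group_by(age, business_travel, department, job_role, marital_status) %>%
  summarise(attrition_count = sum(attrition == "Yes")
            ,nonattrition_count = sum(attrition == "No")
            ,count = n()
            ,.groups = "drop")
```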
Once the data is derived, I will start building the model.
In the cbind argument, we just need to pass in the number of attrition cases and the number of non-attrition cases for each aggregated profile.
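A sketch of the fitted model, logit_agg, with the predictors spelled out explicitly so that the count columns are not picked up as predictors:

```r
# two-column response: successes (attrition) and failures (non-attrition)
logit_agg <-
  glm(cbind(attrition_count, nonattrition_count) ~
        age + business_travel + department + job_role + marital_status
      ,family = "binomial"
      ,data = df_agg)
```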
As in the first method, I will pass the fitted model into the anova function.
anova(logit_agg)
Analysis of Deviance Table
Model: binomial, link: logit
Response: cbind(attrition_count, nonattrition_count)
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 1098 3091.3
age 1 118.523 1097 2972.8
business_travel 2 71.259 1095 2901.6
department 2 26.372 1093 2875.2
job_role 8 29.558 1085 2845.6
marital_status 2 93.401 1083 2752.2
From the result above, we can see that the deviance reduction for each term is identical to that of the first model. (The residual deviances differ because, on aggregated data, the saturated model is defined on profiles rather than on individual rows.)
We could also model the proportion instead of the absolute counts.
This gives the same result as long as we specify the weights correctly when fitting the model.
First, I will derive the proportion from the data.
df_agg_rate <-
  df_agg %>%
  mutate(attrition_rate = attrition_count/count) %>%
  select(-c(attrition_count
            ,nonattrition_count))
Once that is done, I will fit the model by specifying the proportion as the target.
logit_agg_rate <-
  glm(attrition_rate ~ . - count
      ,weights = count
      ,family = "binomial"
      ,data = df_agg_rate)
If we pass this fitted object into the anova function, we can see that the deviance reductions are the same as for the previous two fits.
anova(logit_agg_rate)
Analysis of Deviance Table
Model: binomial, link: logit
Response: attrition_rate
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 1098 3091.3
age 1 118.523 1097 2972.8
business_travel 2 71.259 1095 2901.6
department 2 26.372 1093 2875.2
job_role 8 29.558 1085 2845.6
marital_status 2 93.401 1083 2752.2
If we put the coefficients of the three fitted models into the same table, we can see that the coefficients are identical across all three models.
# first method
logit_fit %>%
  tidy() %>%
  select(term, estimate) %>%
  rename(logit_fit = estimate) %>%
  # second method
  left_join(
    logit_agg %>%
      tidy() %>%
      select(term, estimate) %>%
      rename(logit_agg = estimate)
  ) %>%
  # third method
  left_join(
    logit_agg_rate %>%
      tidy() %>%
      select(term, estimate) %>%
      rename(logit_agg_rate = estimate)
  )
# A tibble: 16 x 4
term logit_fit logit_agg logit_agg_rate
<chr> <dbl> <dbl> <dbl>
1 (Intercept) -0.479 -0.479 -0.479
2 age -0.0475 -0.0475 -0.0475
3 business_travelTravel_Frequently 1.38 1.38 1.38
4 business_travelTravel_Rarely 0.715 0.715 0.715
5 departmentResearch & Development -0.854 -0.854 -0.854
6 departmentSales -0.964 -0.964 -0.964
7 job_roleHuman Resources -0.140 -0.140 -0.140
8 job_roleLaboratory Technician 0.138 0.138 0.138
9 job_roleManager -0.158 -0.158 -0.158
10 job_roleManufacturing Director -0.388 -0.388 -0.388
11 job_roleResearch Director 0.510 0.510 0.510
12 job_roleResearch Scientist 0.211 0.211 0.211
13 job_roleSales Executive 0.169 0.169 0.169
14 job_roleSales Representative -0.0313 -0.0313 -0.0313
15 marital_statusMarried 0.193 0.193 0.193
16 marital_statusSingle 0.967 0.967 0.967
Lastly, I will extract the predictions from the last two fitted models.
predict(logit_agg, type = "response") %>%
as_tibble()
# A tibble: 1,099 x 1
value
<dbl>
1 0.253
2 0.259
3 0.222
4 0.184
5 0.661
6 0.581
7 0.416
8 0.400
9 0.201
10 0.527
# ... with 1,089 more rows
predict(logit_agg_rate, type = "response") %>%
as_tibble()
# A tibble: 1,099 x 1
value
<dbl>
1 0.253
2 0.259
3 0.222
4 0.184
5 0.661
6 0.581
7 0.416
8 0.400
9 0.201
10 0.527
# ... with 1,089 more rows
From the results, we can see that regardless of whether we model the count or the proportion, both methods give us the same predictions.
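Rather than eyeballing the printed tibbles, we can confirm the equality programmatically with base R:

```r
# TRUE (up to numerical tolerance) if the two prediction vectors match
all.equal(predict(logit_agg, type = "response")
          ,predict(logit_agg_rate, type = "response"))
```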
If we extract the predictions for the same profile from the first model, we can see that the predicted values match the predictions from the two models above.
logit_fit %>%
  augment(type.predict = "response") %>%
  select(1:8) %>%
  rename(logit_fitted = .fitted
         ,logit_fit_resid = .resid) %>%
  filter(age == 18
         ,business_travel == "Non-Travel"
         ,department == "Research & Development"
         ,job_role == "Laboratory Technician")
# A tibble: 3 x 8
attrition age business_t~1 depar~2 job_r~3 marit~4 logit~5 logit~6
<fct> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Yes 18 Non-Travel Resear~ Labora~ Single 0.253 1.66
2 Yes 18 Non-Travel Resear~ Labora~ Single 0.253 1.66
3 Yes 18 Non-Travel Resear~ Labora~ Single 0.253 1.66
# ... with abbreviated variable names 1: business_travel,
# 2: department, 3: job_role, 4: marital_status, 5: logit_fitted,
# 6: logit_fit_resid
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!
