Logistic Regression on Aggregated Data

Categories: Machine Learning, Supervised Learning

Different methods that give the same results

Jasper Lok https://jasperlok.netlify.app/
06-05-2023

In this post, I will explore how to build a logistic regression on an aggregated dataset.


Demonstration

In this demonstration, I will use the base R glm function to build models with the following three methods:

- Logistic regression on the non-aggregated data
- Logistic regression on the aggregated data, using cbind of the event and non-event counts as the response
- Logistic regression on the aggregated proportions, passing the counts as weights

Once the models are built, I will compare the coefficients and predictions from the three models.

First, I will call the necessary packages.

pacman::p_load(tidyverse, tidymodels, janitor)

Import Data

I will also reuse the employee attrition data from my previous analysis in this demonstration.

df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-03-12-marketbasket/data/general_data.csv") %>%
  # clean up the column naming
  clean_names() %>% 
  # convert the attrition column to the correct column types
  mutate(attrition = as.factor(attrition))

The original dataset can be found under this link.

For simplicity, I will use the following variables to build the model.

df_1 <-
  df %>% 
  select(c(age
           ,attrition
           ,business_travel
           ,department
           ,job_role
           ,marital_status))

Model Building

1st method: Logistic regression on non-aggregated data

For the first method, I will be building a logistic regression on the non-aggregated data.

logit_fit <-
  glm(attrition ~ .
      ,family = "binomial" 
      ,data = df_1
      )

Next, I will pass the fitted model to the anova function.

anova(logit_fit)
Analysis of Deviance Table

Model: binomial, link: logit

Response: attrition

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev
NULL                             4409     3895.7
age              1  118.523      4408     3777.2
business_travel  2   71.259      4406     3706.0
department       2   26.372      4404     3679.6
job_role         8   29.558      4396     3650.0
marital_status   2   93.401      4394     3556.6

2nd method: Logistic regression on aggregated data + use cbind

Next, I will build a logistic regression on the aggregated data.

With that, I will first derive the aggregated data.

df_agg <-
  df_1 %>% 
  group_by(across(!attrition)) %>% 
  summarise(count = n()
            ,nonattrition_count = sum(attrition == "No")
            ,attrition_count = sum(attrition == "Yes")) %>% 
  ungroup()

Once the data is derived, I will start building the model.

In the cbind call, we just need to pass in the number of events (attrition_count) and the number of non-events (nonattrition_count) for each covariate pattern; cbind turns these into the two-column matrix response that the binomial family expects.

logit_agg <-
  glm(cbind(attrition_count, nonattrition_count) ~ . - count
      ,family = "binomial"
      ,data = df_agg)

As with the first method, I will pass the fitted model into the anova function.

anova(logit_agg)
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(attrition_count, nonattrition_count)

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev
NULL                             1098     3091.3
age              1  118.523      1097     2972.8
business_travel  2   71.259      1095     2901.6
department       2   26.372      1093     2875.2
job_role         8   29.558      1085     2845.6
marital_status   2   93.401      1083     2752.2

From the result above, we can see that the deviance reduction for each term is identical to the one from the first model. (The residual degrees of freedom and residual deviance differ because each row of the aggregated data represents a covariate pattern rather than an individual observation.)

3rd method: Logistic regression on aggregated data + weights

We could also model the proportion of attrition instead of the absolute counts.

This gives the same result as long as we specify the weights correctly when fitting the model.
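To see why the count and proportion parameterisations agree, note that for a covariate pattern with y events out of n observations, the binomial log-likelihood used by the cbind method and the count-weighted Bernoulli kernel used by the weights method differ only by log C(n, y), a constant that does not involve the coefficients, so both are maximised at the same estimates. A quick pure-Python sketch of this identity, using hypothetical counts (y = 3 leavers out of n = 10 employees in one covariate pattern):

```python
from math import comb, log

def count_loglik(y, n, p):
    # Binomial log-likelihood for y events in n trials (the cbind parameterisation)
    return log(comb(n, y)) + y * log(p) + (n - y) * log(1 - p)

def weighted_prop_loglik(y, n, p):
    # Bernoulli log-likelihood at the observed rate y/n, weighted by n
    # (the weights parameterisation)
    rate = y / n
    return n * (rate * log(p) + (1 - rate) * log(1 - p))

y, n = 3, 10
for p in (0.1, 0.3, 0.6):
    diff = count_loglik(y, n, p) - weighted_prop_loglik(y, n, p)
    print(round(diff, 6))  # always log(comb(10, 3)) ~ 4.787492, whatever p is
```

Because the difference is constant in p, any p that maximises one log-likelihood maximises the other, which is why glm returns identical coefficients for both methods.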

First, I will derive the proportion from the data.

df_agg_rate <-
  df_agg %>% 
  mutate(attrition_rate = attrition_count/count) %>% 
  select(-c(attrition_count
            ,nonattrition_count))

Once that is done, I will fit the model by specifying the proportion as the target.

logit_agg_rate <-
  glm(attrition_rate ~ . - count
      ,weights = count
      ,family = "binomial"
      ,data = df_agg_rate)

If we pass this fitted object into the anova function, we can see that the deviance results are the same as for the previous two fits.

anova(logit_agg_rate)
Analysis of Deviance Table

Model: binomial, link: logit

Response: attrition_rate

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev
NULL                             1098     3091.3
age              1  118.523      1097     2972.8
business_travel  2   71.259      1095     2901.6
department       2   26.372      1093     2875.2
job_role         8   29.558      1085     2845.6
marital_status   2   93.401      1083     2752.2

Compare the coefficients from different models

If we put the coefficients of the three fitted objects into the same table, we can see that all the coefficients from the three models are identical.

# first method
logit_fit %>% 
  tidy() %>% 
  select(term, estimate) %>% 
  rename(logit_fit = estimate) %>% 
  # second method
  left_join(
    logit_agg %>% 
      tidy() %>% 
      select(term, estimate) %>% 
      rename(logit_agg = estimate)
  ) %>% 
  # third method
  left_join(
    logit_agg_rate %>% 
      tidy() %>% 
      select(term, estimate) %>% 
      rename(logit_agg_rate = estimate)
  )
# A tibble: 16 x 4
   term                             logit_fit logit_agg logit_agg_rate
   <chr>                                <dbl>     <dbl>          <dbl>
 1 (Intercept)                        -0.479    -0.479         -0.479 
 2 age                                -0.0475   -0.0475        -0.0475
 3 business_travelTravel_Frequently    1.38      1.38           1.38  
 4 business_travelTravel_Rarely        0.715     0.715          0.715 
 5 departmentResearch & Development   -0.854    -0.854         -0.854 
 6 departmentSales                    -0.964    -0.964         -0.964 
 7 job_roleHuman Resources            -0.140    -0.140         -0.140 
 8 job_roleLaboratory Technician       0.138     0.138          0.138 
 9 job_roleManager                    -0.158    -0.158         -0.158 
10 job_roleManufacturing Director     -0.388    -0.388         -0.388 
11 job_roleResearch Director           0.510     0.510          0.510 
12 job_roleResearch Scientist          0.211     0.211          0.211 
13 job_roleSales Executive             0.169     0.169          0.169 
14 job_roleSales Representative       -0.0313   -0.0313        -0.0313
15 marital_statusMarried               0.193     0.193          0.193 
16 marital_statusSingle                0.967     0.967          0.967 

Compare the predictions

Lastly, I will extract the predictions from the last two fitted models.

predict(logit_agg, type = "response") %>% 
  as_tibble()
# A tibble: 1,099 x 1
   value
   <dbl>
 1 0.253
 2 0.259
 3 0.222
 4 0.184
 5 0.661
 6 0.581
 7 0.416
 8 0.400
 9 0.201
10 0.527
# ... with 1,089 more rows
predict(logit_agg_rate, type = "response") %>% 
  as_tibble()
# A tibble: 1,099 x 1
   value
   <dbl>
 1 0.253
 2 0.259
 3 0.222
 4 0.184
 5 0.661
 6 0.581
 7 0.416
 8 0.400
 9 0.201
10 0.527
# ... with 1,089 more rows

From the results, we can see that regardless of whether we model the counts or the proportion, both methods give the same predictions.

If we extract the predictions for the same profile from the first model, we can see that the predicted values match the predictions from the two aggregated models above.

logit_fit %>% 
  augment(type.predict = "response") %>% 
  select(1:8) %>% 
  rename(logit_fitted = .fitted
         ,logit_fit_resid = .resid) %>% 
  filter(age == 18
         ,business_travel == "Non-Travel"
         ,department == "Research & Development"
         ,job_role == "Laboratory Technician")
# A tibble: 3 x 8
  attrition   age business_t~1 depar~2 job_r~3 marit~4 logit~5 logit~6
  <fct>     <dbl> <chr>        <chr>   <chr>   <chr>     <dbl>   <dbl>
1 Yes          18 Non-Travel   Resear~ Labora~ Single    0.253    1.66
2 Yes          18 Non-Travel   Resear~ Labora~ Single    0.253    1.66
3 Yes          18 Non-Travel   Resear~ Labora~ Single    0.253    1.66
# ... with abbreviated variable names 1: business_travel,
#   2: department, 3: job_role, 4: marital_status, 5: logit_fitted,
#   6: logit_fit_resid

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!
