Are we similar?

Photo by Annie Spratt on Unsplash
This is to continue my journey to learn more about causal inference.
In this post, I will be looking at how to perform matching.
The objective of matching is to simulate the randomized controlled trials by using existing info, so we could remove the confounding effect.
To do so, the algorithm will “find” the best closest match for the treated from the untreated group.
There are many methods to find the “matching” group.
In general, the steps of matching are:
Preprocessing
Estimation
The downside to matching is we will have to throw away a sizable chunk of the data - anything that’s unmatched doesn’t get included in the final matched data (Heiss 2023).
In this post, I will be using both base R glm function and tidymodels package to build Poisson regression.
pacman::p_load(tidyverse, tidymodels, janitor, MatchIt, halfmoon)
First, I will import the data into the environment.
df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-03-12-marketbasket/data/general_data.csv") %>%
# drop the columns we don't need
dplyr::select(-c(EmployeeCount, StandardHours, EmployeeID)) %>%
clean_names() %>%
# impute the missing values with the mean values
mutate(
num_companies_worked = case_when(
is.na(num_companies_worked) ~ mean(num_companies_worked, na.rm = TRUE),
TRUE ~ num_companies_worked),
total_working_years = case_when(
is.na(total_working_years) ~ mean(total_working_years, na.rm = TRUE),
TRUE ~ total_working_years),
ind_promoted_in_last1Yr = if_else(years_since_last_promotion <= 1, "yes", "no"),
ind_promoted_in_last1Yr = as.factor(ind_promoted_in_last1Yr),
attrition = as.factor(attrition),
job_level = as.factor(job_level)
) %>%
droplevels()
Next, we will perform covariate balancing.
m_outcome <-
matchit(ind_promoted_in_last1Yr ~ age + gender + department
,data = df
,replace = TRUE)
The default matching method is nearest neighbor matching.
Refer to the documentation for the different matching methods.
Note that I also allow the data points to be repeatably matched.
m_outcome
A matchit object
- method: 1:1 nearest neighbor matching with replacement
- distance: Propensity score
- estimated with logistic regression
- number of obs.: 4410 (original), 3789 (matched)
- target estimand: ATT
- covariates: age, gender, department
Based on the result, we could see that the default distance calculation is glm with logit function. We could change the distance calculation.
For the estimand, it depends on the questions we would like to answer.
Following are some materials that explain how to choose estimands:
https://www.r-causal.org/chapters/11-estimands.html#choosing-estimands
https://www.andrewheiss.com/blog/2024/03/21/demystifying-ate-att-atu/
summary(m_outcome)
Call:
matchit(formula = ind_promoted_in_last1Yr ~ age + gender + department,
data = df, replace = TRUE)
Summary of Balance for All Data:
Means Treated Means Control
distance 0.6472 0.6221
age 35.8507 38.8158
genderFemale 0.4009 0.3985
genderMale 0.5991 0.6015
departmentHuman Resources 0.0480 0.0338
departmentResearch & Development 0.6471 0.6654
departmentSales 0.3049 0.3008
Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.3311 0.9989 0.0747
age -0.3258 1.0497 0.0690
genderFemale 0.0048 . 0.0024
genderMale -0.0048 . 0.0024
departmentHuman Resources 0.0662 . 0.0141
departmentResearch & Development -0.0383 . 0.0183
departmentSales 0.0090 . 0.0042
eCDF Max
distance 0.1733
age 0.1493
genderFemale 0.0024
genderMale 0.0024
departmentHuman Resources 0.0141
departmentResearch & Development 0.0183
departmentSales 0.0042
Summary of Balance for Matched Data:
Means Treated Means Control
distance 0.6472 0.6472
age 35.8507 35.8156
genderFemale 0.4009 0.4350
genderMale 0.5991 0.5650
departmentHuman Resources 0.0480 0.0416
departmentResearch & Development 0.6471 0.6620
departmentSales 0.3049 0.2964
Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.0002 0.9991 0.0006
age 0.0039 1.0354 0.0052
genderFemale -0.0696 . 0.0341
genderMale 0.0696 . 0.0341
departmentHuman Resources 0.0299 . 0.0064
departmentResearch & Development -0.0312 . 0.0149
departmentSales 0.0185 . 0.0085
eCDF Max Std. Pair Dist.
distance 0.0139 0.0024
age 0.0181 0.0430
genderFemale 0.0341 0.1479
genderMale 0.0341 0.1479
departmentHuman Resources 0.0064 0.1497
departmentResearch & Development 0.0149 0.0625
departmentSales 0.0085 0.0371
Sample Sizes:
Control Treated
All 1596. 2814
Matched (ESS) 466.07 2814
Matched 975. 2814
Unmatched 621. 0
Discarded 0. 0
Below are some of the important definitions of the terms in the result above (Greifer 2023):
Standardized mean difference: Difference in the means of each covariate between treatment groups standardized by a standardization factor so that it is on the same scale for all covariates
Variance ratio: Ratio of the variance of a covariate in one group to that in the other
Empirical CDF statistics: Difference in the empirical cumulative distribution functions (eCDFs) of each covariate between groups allows assessment of imbalance across the entire covariate distribution of that covariate rather than just it is mean or variance.
We can also pass the matchit object into plot function to create the Love plot of the covariates.
The solid line and dotted line represent acceptable and good balance respectively.
After matching, the absolute standardized mean difference for age has reduced below both thresholds.
The absolute standardized mean difference for gender increases after matching, but they are still below the 0.1 threshold. Hence for simplicity, I will leave it as it is.
Alternatively, below is how to visualize the result by using ggplot function.
as.data.frame(summary(m_outcome)$sum.all) %>%
rownames_to_column("variable") %>%
clean_names() %>%
rename_at(vars(everything(), -c(variable)), ~paste0(., "_all")) %>%
left_join(
as.data.frame(summary(m_outcome)$sum.matched) %>%
rownames_to_column("variable") %>%
clean_names() %>%
rename_at(vars(everything(), -c(variable)), ~paste0(., "_matched"))
) %>%
ggplot(aes(abs(std_mean_diff_all), variable)) +
geom_point(aes(color = "red")) +
geom_point(aes(abs(std_mean_diff_matched), variable)) +
geom_vline(xintercept = 0.1) +
geom_vline(xintercept = 0.05, linetype = "dotted")

That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Jess Bailey on Unsplash