What is matching?

The objective of matching is to simulate the randomized controlled trials by using existing info, so we could remove the confounding effect.

To do so, the algorithm will “find” the best closest match for the treated from the untreated group.

There are many methods to find the “matching” group.

In general, the steps of matching are:

Preprocessing
Estimation

Issues with matching

The downside to matching is we will have to throw away a sizable chunk of the data - anything that’s unmatched doesn’t get included in the final matched data (Heiss 2023).

Demonstration

In this post, I will be using both base R glm function and tidymodels package to build Poisson regression.

pacman::p_load(tidyverse, tidymodels, janitor, MatchIt, halfmoon)

Import Data

First, I will import the data into the environment.

df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-03-12-marketbasket/data/general_data.csv") %>%
  # drop the columns we don't need
  dplyr::select(-c(EmployeeCount, StandardHours, EmployeeID)) %>%
  clean_names() %>% 
  # impute the missing values with the mean values
  mutate(
    num_companies_worked = case_when(
      is.na(num_companies_worked) ~ mean(num_companies_worked, na.rm = TRUE),
      TRUE ~ num_companies_worked),
    total_working_years = case_when(
      is.na(total_working_years) ~ mean(total_working_years, na.rm = TRUE),
      TRUE ~ total_working_years),
    ind_promoted_in_last1Yr = if_else(years_since_last_promotion <= 1, "yes", "no"),
    ind_promoted_in_last1Yr = as.factor(ind_promoted_in_last1Yr),
    attrition = as.factor(attrition),
    job_level = as.factor(job_level)
    ) %>%
  droplevels()

Matching

Next, we will perform covariate balancing.

m_outcome <-
  matchit(ind_promoted_in_last1Yr ~ age + gender + department
          ,data = df
          ,replace = TRUE)

The default matching method is nearest neighbor matching.

Refer to the documentation for the different matching methods.

Note that I also allow the data points to be repeatably matched.

m_outcome

A matchit object
 - method: 1:1 nearest neighbor matching with replacement
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 4410 (original), 3789 (matched)
 - target estimand: ATT
 - covariates: age, gender, department

Based on the result, we could see that the default distance calculation is glm with logit function. We could change the distance calculation.

For the estimand, it depends on the questions we would like to answer.

Following are some materials that explain how to choose estimands:

summary(m_outcome)


Call:
matchit(formula = ind_promoted_in_last1Yr ~ age + gender + department, 
    data = df, replace = TRUE)

Summary of Balance for All Data:
                                 Means Treated Means Control
distance                                0.6472        0.6221
age                                    35.8507       38.8158
genderFemale                            0.4009        0.3985
genderMale                              0.5991        0.6015
departmentHuman Resources               0.0480        0.0338
departmentResearch & Development        0.6471        0.6654
departmentSales                         0.3049        0.3008
                                 Std. Mean Diff. Var. Ratio eCDF Mean
distance                                  0.3311     0.9989    0.0747
age                                      -0.3258     1.0497    0.0690
genderFemale                              0.0048          .    0.0024
genderMale                               -0.0048          .    0.0024
departmentHuman Resources                 0.0662          .    0.0141
departmentResearch & Development         -0.0383          .    0.0183
departmentSales                           0.0090          .    0.0042
                                 eCDF Max
distance                           0.1733
age                                0.1493
genderFemale                       0.0024
genderMale                         0.0024
departmentHuman Resources          0.0141
departmentResearch & Development   0.0183
departmentSales                    0.0042

Summary of Balance for Matched Data:
                                 Means Treated Means Control
distance                                0.6472        0.6472
age                                    35.8507       35.8156
genderFemale                            0.4009        0.4350
genderMale                              0.5991        0.5650
departmentHuman Resources               0.0480        0.0416
departmentResearch & Development        0.6471        0.6620
departmentSales                         0.3049        0.2964
                                 Std. Mean Diff. Var. Ratio eCDF Mean
distance                                  0.0002     0.9991    0.0006
age                                       0.0039     1.0354    0.0052
genderFemale                             -0.0696          .    0.0341
genderMale                                0.0696          .    0.0341
departmentHuman Resources                 0.0299          .    0.0064
departmentResearch & Development         -0.0312          .    0.0149
departmentSales                           0.0185          .    0.0085
                                 eCDF Max Std. Pair Dist.
distance                           0.0139          0.0024
age                                0.0181          0.0430
genderFemale                       0.0341          0.1479
genderMale                         0.0341          0.1479
departmentHuman Resources          0.0064          0.1497
departmentResearch & Development   0.0149          0.0625
departmentSales                    0.0085          0.0371

Sample Sizes:
              Control Treated
All           1596.      2814
Matched (ESS)  466.07    2814
Matched        975.      2814
Unmatched      621.         0
Discarded        0.         0

Below are some of the important definitions of the terms in the result above (Greifer 2023):

Standardized mean difference: Difference in the means of each covariate between treatment groups standardized by a standardization factor so that it is on the same scale for all covariates
- Recommended thresholds: 0.1 & 0.05
Variance ratio: Ratio of the variance of a covariate in one group to that in the other
- Close to 1 means good balance, implying the variance of the samples is similar
Empirical CDF statistics: Difference in the empirical cumulative distribution functions (eCDFs) of each covariate between groups allows assessment of imbalance across the entire covariate distribution of that covariate rather than just it is mean or variance.
- Maximum eCDF difference is also known as the “Kolmogorov-Smirnov statistic”

Plotting

Love Plot

We can also pass the matchit object into plot function to create the Love plot of the covariates.

plot(summary(m_outcome))

The solid line and dotted line represent acceptable and good balance respectively.

After matching, the absolute standardized mean difference for age has reduced below both thresholds.

The absolute standardized mean difference for gender increases after matching, but they are still below the 0.1 threshold. Hence for simplicity, I will leave it as it is.

Alternatively, below is how to visualize the result by using ggplot function.

as.data.frame(summary(m_outcome)$sum.all) %>% 
  rownames_to_column("variable") %>% 
  clean_names() %>% 
  rename_at(vars(everything(), -c(variable)), ~paste0(., "_all")) %>% 
  left_join(
    as.data.frame(summary(m_outcome)$sum.matched) %>%
      rownames_to_column("variable") %>%
      clean_names() %>%
      rename_at(vars(everything(), -c(variable)), ~paste0(., "_matched"))
  ) %>% 
  ggplot(aes(abs(std_mean_diff_all), variable)) +
  geom_point(aes(color = "red")) + 
  geom_point(aes(abs(std_mean_diff_matched), variable)) +
  geom_vline(xintercept = 0.1) +
  geom_vline(xintercept = 0.05, linetype = "dotted")