In previous post, I have explored on how to build a Cox proportional hazard model.

In this post, I will be exploring how to check the proportional hazard assumption in the model.

Test Cox proportional hazard

One of the main assumptions is the hazard functions are proportional to one another over the period.

To check whether the proportional hazard is appropriate, we will perform statistical test based on scaled Schoenfeld residuals.

If the p-value is significant, the proportional hazard assumption doesn’t hold (StackExchange 2019).

When proportional hazard assumption is inappropriate

When the proportional hazard assumption does not hold, there are several approaches to overcome this (Sestelo 2017).

Stratify the variable
Partition of the time axis
Check on the nonlinear effect

In this post, I will focus on how to stratify the variable.

Strata

(D 2017) explained that in a Cox model, stratification allows for as many different hazard functions as there are strata.

Beta coefficients (hazard ratios) optimized for all strata are then fitted.

Demonstration

In this demonstration, I will be using this bank dataset from Kaggle.

Setup the environment

First, I will load the necessary packages into the environment.

pacman::p_load(tidyverse, lubridate, janitor, survival, survminer)

Import Data

First I will import the dataset into the environment.

I will also clean the column names, drop the columns I don’t need and transform the columns to be the right format.

df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-09-10-kaplan-meier/data/Churn_Modelling.csv") %>%
  clean_names() %>%
  select(-c(row_number, customer_id, surname)) %>%
  mutate(has_cr_card = factor(has_cr_card),
         is_active_member = factor(is_active_member),
         credit_score = credit_score/100,
         balance = balance/10000,
         estimated_salary = estimated_salary/10000) %>%
  filter(tenure > 0)

Test Cox Proportional assumption

Before we could check the proportional hazard assumption, I will build a simple model.

cox_model <- 
  coxph(Surv(tenure, exited) ~ estimated_salary, 
        data = df, 
        x = TRUE, 
        y = TRUE)

To check the assumption, I will pass the fitted model to cox.zph function as follows:

cox.zph(cox_model)

                 chisq df    p
estimated_salary  1.01  1 0.31
GLOBAL            1.01  1 0.31

From the result, the p-value is greater than 0.05, suggesting the proportional hazard assumption is appropriate for this simple model.

To visualize the result, we could pass the result from cox.zph function to plot function.

plot(cox.zph(cox_model))

Alternatively, we can pass the result from cox.zph function to ggcoxzph function to visualize the result.

According to R documentation, ggcoxzph function is a wrapper around plot.cox.zph function.

ggcoxzph(
  cox.zph(cox_model)
)

Below are some of the interpretations of the result (Dessai and Patil 2019):

Solid line represents the smoothing spline fit for the plot
Dashed lines represent +/- 2 standard error bands around the fit
If the p-value is less than 0.05, then the assumption is violated
- Assumption: the hazard ratio is not constant over time
- Assume a non-significant relationship betweeen residuals and time

Next, I will check the proportional hazard assumption under a multivariate model.

Similarly, I will build a multivariate model first.

cox_model_all <- 
  coxph(Surv(tenure, exited) ~ credit_score + gender + balance + age + num_of_products + has_cr_card + is_active_member + estimated_salary, 
        data = df, 
        x = TRUE, 
        y = TRUE)

Then, I will pass the result to cox.zph function.

cox.zph(cox_model_all)

                    chisq df     p
credit_score      0.20167  1 0.653
gender            0.32393  1 0.569
balance           4.25168  1 0.039
age               1.71430  1 0.190
num_of_products   0.78555  1 0.375
has_cr_card       0.00105  1 0.974
is_active_member  1.89779  1 0.168
estimated_salary  1.30608  1 0.253
GLOBAL           11.01238  8 0.201

From the result, we note that the overall p-value is still greater than 0.05, although the p-value for balance is less than 0.05.

Based on this discussion post, the author suggested since the overall test has a near-perfect multiplicity adjustment, individual tests probably is not so crucial if global p-value is greater than 0.2 unless the individual \(\rho\) value (i.e. correlation between Schoenfeld residual and time) is very high.

Next, I will visualize the result by using ggcoxzph function.

Given the number of variables used to build the model, the graph will be hard to read.

Hence, I will make some modifications to the graphs produced.

ggcoxzph(
    cox.zph(cox_model_all)
    ,font.x = 8
    ,font.y = 7
    ,font.tickslab = 8
    ,font.main = 8)

Stratify the variable

In this demonstration, I will explore using stratification.

To do so, I will re-build the model by stratifying the balance variable.

cox_model_all_balance_strata <- 
  coxph(Surv(tenure, exited) ~ credit_score + gender + age + num_of_products + has_cr_card + is_active_member + estimated_salary + strata(balance), 
        data = df, 
        x = TRUE, 
        y = TRUE)

cox_model_all_balance_strata

Call:
coxph(formula = Surv(tenure, exited) ~ credit_score + gender + 
    age + num_of_products + has_cr_card + is_active_member + 
    estimated_salary + strata(balance), data = df, x = TRUE, 
    y = TRUE)

                       coef exp(coef)  se(coef)       z        p
credit_score      -0.043836  0.957111  0.045962  -0.954    0.340
genderMale        -0.403308  0.668106  0.092768  -4.347 1.38e-05
age                0.054010  1.055495  0.003697  14.609  < 2e-16
num_of_products   -0.963456  0.381572  0.089148 -10.807  < 2e-16
has_cr_card1      -0.157441  0.854327  0.099791  -1.578    0.115
is_active_member1 -0.910944  0.402145  0.100939  -9.025  < 2e-16
estimated_salary   0.010022  1.010073  0.008082   1.240    0.215

Likelihood ratio test=432.2  on 7 df, p=< 2.2e-16
n= 9587, number of events= 1942

After stratifying the variable, the overall p-value increases.

cox.zph(cox_model_all_balance_strata)

                  chisq df    p
credit_score     0.0149  1 0.90
gender           0.0841  1 0.77
age              0.4012  1 0.53
num_of_products  0.5101  1 0.48
has_cr_card      0.0205  1 0.89
is_active_member 0.6870  1 0.41
estimated_salary 0.0767  1 0.78
GLOBAL           1.4938  7 0.98

Similarly, I will visualize the result by using ggcoxzph function.

ggcoxzph(
    cox.zph(cox_model_all_balance_strata)
    ,font.x = 8
    ,font.y = 7
    ,font.tickslab = 8
    ,font.main = 8)

Lastly, I will extract the concordance results from the original & stratified models for comparison.

cox_model_all$concordance["concordance"]

concordance 
  0.7110915

cox_model_all_balance_strata$concordance["concordance"]

concordance 
  0.7478856

The concordance result improves slightly after we stratify balance.

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Abbilyn Rurenko on Unsplash

D, Todd. 2017. “Stratification in Cox Model.” https://stats.stackexchange.com/questions/256148/stratification-in-cox-model.

Dessai, Sampada, and Vijay Patil. 2019. “Testing and Interpreting Assumptions of COX Regression Analysis.” https://journals.lww.com/crst/Fulltext/2019/02010/Testing_and_interpreting_assumptions_of_COX.26.aspx.

Sestelo, Marta. 2017. A Short Course on Survival Analysis. https://bookdown.org/sestelo/sa_financial/non-proportional-hazards-and-now-what.html.

StackExchange. 2019. “How to Correctly Interpret Schoenfeld Residuals p-Value.” https://stats.stackexchange.com/questions/433971/how-to-correctly-interpret-schoenfeld-residuals-p-value#:~:text=Schoenfeld%20is%20like%20a%20Shapiro,the%20individual%20feature%20p%2Dvalues.

Proportion Hazard Assumption under Cox Model