Have you checked the assumption?
In previous post, I have explored on how to build a Cox proportional hazard model.

Photo by Robert Horvick on Unsplash
In this post, I will be exploring how to check the proportional hazard assumption in the model.
One of the main assumptions is the hazard functions are proportional to one another over the period.
To check whether the proportional hazard is appropriate, we will perform statistical test based on scaled Schoenfeld residuals.
If the p-value is significant, the proportional hazard assumption doesn’t hold (StackExchange 2019).
When the proportional hazard assumption does not hold, there are several approaches to overcome this (Sestelo 2017).
Stratify the variable
Partition of the time axis
Check on the nonlinear effect
In this post, I will focus on how to stratify the variable.
(D 2017) explained that in a Cox model, stratification allows for as many different hazard functions as there are strata.
Beta coefficients (hazard ratios) optimized for all strata are then fitted.
In this demonstration, I will be using this bank dataset from Kaggle.
First, I will load the necessary packages into the environment.
pacman::p_load(tidyverse, lubridate, janitor, survival, survminer)
First I will import the dataset into the environment.
I will also clean the column names, drop the columns I don’t need and transform the columns to be the right format.
df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-09-10-kaplan-meier/data/Churn_Modelling.csv") %>%
clean_names() %>%
select(-c(row_number, customer_id, surname)) %>%
mutate(has_cr_card = factor(has_cr_card),
is_active_member = factor(is_active_member),
credit_score = credit_score/100,
balance = balance/10000,
estimated_salary = estimated_salary/10000) %>%
filter(tenure > 0)
Before we could check the proportional hazard assumption, I will build a simple model.
cox_model <-
coxph(Surv(tenure, exited) ~ estimated_salary,
data = df,
x = TRUE,
y = TRUE)
To check the assumption, I will pass the fitted model to cox.zph function as follows:
cox.zph(cox_model)
chisq df p
estimated_salary 1.01 1 0.31
GLOBAL 1.01 1 0.31
From the result, the p-value is greater than 0.05, suggesting the proportional hazard assumption is appropriate for this simple model.
To visualize the result, we could pass the result from cox.zph function to plot function.
plot(cox.zph(cox_model))

Alternatively, we can pass the result from cox.zph function to ggcoxzph function to visualize the result.
According to R documentation, ggcoxzph function is a wrapper around plot.cox.zph function.
ggcoxzph(
cox.zph(cox_model)
)

Below are some of the interpretations of the result (Dessai and Patil 2019):
Solid line represents the smoothing spline fit for the plot
Dashed lines represent +/- 2 standard error bands around the fit
If the p-value is less than 0.05, then the assumption is violated
Assumption: the hazard ratio is not constant over time
Assume a non-significant relationship betweeen residuals and time
Next, I will check the proportional hazard assumption under a multivariate model.
Similarly, I will build a multivariate model first.
cox_model_all <-
coxph(Surv(tenure, exited) ~ credit_score + gender + balance + age + num_of_products + has_cr_card + is_active_member + estimated_salary,
data = df,
x = TRUE,
y = TRUE)
Then, I will pass the result to cox.zph function.
cox.zph(cox_model_all)
chisq df p
credit_score 0.20167 1 0.653
gender 0.32393 1 0.569
balance 4.25168 1 0.039
age 1.71430 1 0.190
num_of_products 0.78555 1 0.375
has_cr_card 0.00105 1 0.974
is_active_member 1.89779 1 0.168
estimated_salary 1.30608 1 0.253
GLOBAL 11.01238 8 0.201
From the result, we note that the overall p-value is still greater than 0.05, although the p-value for balance is less than 0.05.
Based on this discussion post, the author suggested since the overall test has a near-perfect multiplicity adjustment, individual tests probably is not so crucial if global p-value is greater than 0.2 unless the individual \(\rho\) value (i.e. correlation between Schoenfeld residual and time) is very high.
Next, I will visualize the result by using ggcoxzph function.
Given the number of variables used to build the model, the graph will be hard to read.
Hence, I will make some modifications to the graphs produced.
ggcoxzph(
cox.zph(cox_model_all)
,font.x = 8
,font.y = 7
,font.tickslab = 8
,font.main = 8)

Similar results can be observed.
If I were to convert the balance into categorical variable and visualize the survival curve, we will note that one of the survival curve interacts other curves.
df_grp <-
df %>%
mutate(balance_grp = cut(balance, breaks = c(-Inf, 5, 10, 15, 20, 25, 30)))
ggsurvplot(survfit(Surv(tenure, exited) ~ balance_grp, data = df_grp),
data = df_grp)

In this demonstration, I will explore using stratification.
To do so, I will re-build the model by stratifying the balance variable.
cox_model_all_balance_strata <-
coxph(Surv(tenure, exited) ~ credit_score + gender + age + num_of_products + has_cr_card + is_active_member + estimated_salary + strata(balance),
data = df,
x = TRUE,
y = TRUE)
cox_model_all_balance_strata
Call:
coxph(formula = Surv(tenure, exited) ~ credit_score + gender +
age + num_of_products + has_cr_card + is_active_member +
estimated_salary + strata(balance), data = df, x = TRUE,
y = TRUE)
coef exp(coef) se(coef) z p
credit_score -0.043836 0.957111 0.045962 -0.954 0.340
genderMale -0.403308 0.668106 0.092768 -4.347 1.38e-05
age 0.054010 1.055495 0.003697 14.609 < 2e-16
num_of_products -0.963456 0.381572 0.089148 -10.807 < 2e-16
has_cr_card1 -0.157441 0.854327 0.099791 -1.578 0.115
is_active_member1 -0.910944 0.402145 0.100939 -9.025 < 2e-16
estimated_salary 0.010022 1.010073 0.008082 1.240 0.215
Likelihood ratio test=432.2 on 7 df, p=< 2.2e-16
n= 9587, number of events= 1942
After stratifying the variable, the overall p-value increases.
cox.zph(cox_model_all_balance_strata)
chisq df p
credit_score 0.0149 1 0.90
gender 0.0841 1 0.77
age 0.4012 1 0.53
num_of_products 0.5101 1 0.48
has_cr_card 0.0205 1 0.89
is_active_member 0.6870 1 0.41
estimated_salary 0.0767 1 0.78
GLOBAL 1.4938 7 0.98
Similarly, I will visualize the result by using ggcoxzph function.
ggcoxzph(
cox.zph(cox_model_all_balance_strata)
,font.x = 8
,font.y = 7
,font.tickslab = 8
,font.main = 8)

Lastly, I will extract the concordance results from the original & stratified models for comparison.
cox_model_all$concordance["concordance"]
concordance
0.7110915
cox_model_all_balance_strata$concordance["concordance"]
concordance
0.7478856
The concordance result improves slightly after we stratify balance.
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Abbilyn Rurenko on Unsplash