One of the popular methods in survival analysis
In this post, I will be exploring one of the popular methods for survival analysis, Cox proportional hazard.

Photo by Alexander Andrews on Unsplash
Before jumping into cox proportional hazard, let’s take a look what are the differences between parametric, semi-parametric, and non-parametric models.
According to the post by Ravi Charan (Charan 2020), below are the summary of the differences:
Non-parametric model: Make no assumptions about the functional form of \(\lambda\) (i.e. hazard function)
Parametric model: Assume the precise functional form of \(\lambda\) in modeling
Semi-parametric model: Cox proportional hazard does not make any functional assumptions on the shape of the hazard function
So, what is cox proportional hazard?
Cox proportional hazard is a semi-parametric survival model.
The Cox proportional hazard formula can be written as follows (STHDA, n.d.):
\[h(t) = h_0(t) \times exp(b_1x_1 + b_2x_2 + ... + b_px_p) \]
where
t is the survival time
h(t) is the hazard function determined by a set of covariates (\(x_1, x_2, ..., x_p\))
the coefficients (\(b_1, b_2, ...., b_p\)) measure the impact of covariates
\(h_0\) is the baseline hazard
As shown in the formula, Cox proportional hazard model allows the users to assess the effects of multiple risk factors on survival time at the same time.
One of the key outputs from Cox proportional hazard model is the hazard ratio.
Below is how one could interpret the hazard ratio:
Hazard ratio = 1 —> no effect
Hazard ratio > 1 increase relative risk compared to the reference group
Hazard ratio < 1 decrease in relative risk compared to the reference group
The hazard ratio represents the increase/decrease in risk for every unit change
Examples will be provided in the demonstration so that it’s easier to understand the concepts.

One of the disadvantages of the Kaplan-Meier survival curve is unable to model numeric variables. Each unique numeric figure will be treated as a separate group when fitting the survival curve under the Kaplan-Meier model.
Another disadvantage of the Kaplan-Meier model is that model is unable to include many explanatory variables.
These issues could be overcome if Cox proportional hazard model is used.
Before we move on, it’s important to discuss one of the important performance metrics in the survival model.
Concordance index (a.k.a. C-index) is one of the commonly used performance measures for survival models.
This index can be interpreted as the fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered (Raykar et al., n.d.).
Below is a chart on how the concordance index is being calculated:

Taken from “How to Evaluate Survival Analysis Models” post by Nicolo Cosimo Albanese
In the post, the author also summarised the interpretation of the concordance index as follows:
C = 1: perfect concordance between risks and event times.
C = 0: perfect anti-concordance between risks and event times.
C = 0.5: random assignment. The model predicts the relationship between risk and survival time as well as a coin toss.
In this demonstration, I will be using this bank dataset from Kaggle.
In this demonstration, I will be using Cox proportional hazard function from survival package.
I will be using functions from survminer package to visualize the results.
First, I will load the necessary packages into the environment.
pacman::p_load(tidyverse, lubridate, janitor, survival, survminer, scales)
Next, I will import the dataset into the environment.
I will also clean the column names, drop the columns I don’t need, and transform the columns to the right format.
df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-09-10-kaplan-meier/data/Churn_Modelling.csv") %>%
clean_names() %>%
select(-c(row_number, customer_id, surname)) %>%
mutate(has_cr_card = factor(has_cr_card),
is_active_member = factor(is_active_member),
credit_score = credit_score/100,
balance = balance/10000,
estimated_salary = estimated_salary/10000) %>%
filter(tenure > 0)
Note that I have also scaled credit_score, balance, and estimated_salary in respective units.
To build a cox model, I will use coxph function from survival package to build a cox proportional hazard model.
cox_model <- coxph(Surv(tenure, exited) ~ .,
data = df,
x = TRUE,
y = TRUE)
We could call the fitted model to see the model result.
cox_model
Call:
coxph(formula = Surv(tenure, exited) ~ ., data = df, x = TRUE,
y = TRUE)
coef exp(coef) se(coef) z p
credit_score -0.048759 0.952410 0.023060 -2.114 0.0345
geographyGermany 0.488400 1.629706 0.055680 8.772 < 2e-16
geographySpain 0.034473 1.035074 0.062178 0.554 0.5793
genderMale -0.382557 0.682115 0.045836 -8.346 < 2e-16
age 0.048125 1.049302 0.001816 26.506 < 2e-16
balance 0.020393 1.020603 0.004495 4.537 5.7e-06
num_of_products -0.051172 0.950116 0.039461 -1.297 0.1947
has_cr_card1 -0.059315 0.942410 0.049585 -1.196 0.2316
is_active_member1 -0.755391 0.469827 0.048751 -15.495 < 2e-16
estimated_salary 0.001221 1.001222 0.003926 0.311 0.7558
Likelihood ratio test=1124 on 10 df, p=< 2.2e-16
n= 9587, number of events= 1942
It seems like the summary function would provide more info about the fitted model if we pass the fitted model into the function.
summary(cox_model)
Call:
coxph(formula = Surv(tenure, exited) ~ ., data = df, x = TRUE,
y = TRUE)
n= 9587, number of events= 1942
coef exp(coef) se(coef) z Pr(>|z|)
credit_score -0.048759 0.952410 0.023060 -2.114 0.0345 *
geographyGermany 0.488400 1.629706 0.055680 8.772 < 2e-16 ***
geographySpain 0.034473 1.035074 0.062178 0.554 0.5793
genderMale -0.382557 0.682115 0.045836 -8.346 < 2e-16 ***
age 0.048125 1.049302 0.001816 26.506 < 2e-16 ***
balance 0.020393 1.020603 0.004495 4.537 5.7e-06 ***
num_of_products -0.051172 0.950116 0.039461 -1.297 0.1947
has_cr_card1 -0.059315 0.942410 0.049585 -1.196 0.2316
is_active_member1 -0.755391 0.469827 0.048751 -15.495 < 2e-16 ***
estimated_salary 0.001221 1.001222 0.003926 0.311 0.7558
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
credit_score 0.9524 1.0500 0.9103 0.9964
geographyGermany 1.6297 0.6136 1.4612 1.8176
geographySpain 1.0351 0.9661 0.9163 1.1692
genderMale 0.6821 1.4660 0.6235 0.7462
age 1.0493 0.9530 1.0456 1.0530
balance 1.0206 0.9798 1.0117 1.0296
num_of_products 0.9501 1.0525 0.8794 1.0265
has_cr_card1 0.9424 1.0611 0.8551 1.0386
is_active_member1 0.4698 2.1284 0.4270 0.5169
estimated_salary 1.0012 0.9988 0.9935 1.0090
Concordance= 0.72 (se = 0.006 )
Likelihood ratio test= 1124 on 10 df, p=<2e-16
Wald test = 1169 on 10 df, p=<2e-16
Score (logrank) test = 1211 on 10 df, p=<2e-16
Note that the hazard ratio would be exp(coef) in the above.
The three tests at the bottom of the results are the statistical tests for the overall significance of the model, where the null hypothesis of the statistical tests is as follows:
\(H_0:\beta = 0\)
As the p-values for all the statistical tests are smaller than 0.05, we reject the null hypothesis.
For more info on how we could interpret the result, I find this post very helpful.
Alternatively, chapter 5 of the book Applied Survival Analysis Using R by Dirk F. Moore has a quite clear explanation of cox proportional hazards as well.
The concordance results can be extracted by using the following method:
cox_model$concordance["concordance"]
concordance
0.7204698
Alternatively, we could extract the concordance index by passing the fitted model into the concordance function from survival package.
concordance(cox_model)
Call:
concordance.coxph(object = cox_model)
n= 9587
Concordance= 0.7205 se= 0.006461
concordant discordant tied.x tied.y tied.xy
7192433 2790544 0 193333 0
Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function survfit() estimates the survival proportion, by default at the mean values of covariates.
ggsurvplot(survfit(cox_model),
data = df)

To plot the graph, it seems like we need to pass the fit object into survfit function
ggsurvplot also supports multiple datasets.
To do so, we need to combine the different survfit objects into a list.
For demonstration, I will first split the dataset by gender.
Then, I will fit two cox proportional hazard models based on each dataset.
survfit_male <-
coxph(Surv(tenure, exited) ~ .,
data = df_male)
survfit_female <-
coxph(Surv(tenure, exited) ~ .,
data = df_female)
Then, I will create the “list” before passing the list into ggsurvplot function.
surv_fit_list <-
list("male" = survfit(survfit_male),
"female" = survfit(survfit_female))
ggsurvplot(surv_fit_list,
combine = TRUE,
conf.int = TRUE,
ylim = c(0.35, 1))

Next, I will visualize the fitted model results by using ggforest function.
ggforest(cox_model)

This graph contains the different variables used in fitting Cox model and the different categories within the categorical variables.
On the right side of the graph, we have the hazard ratio, the confidence interval for the different hazard ratios, and p-value for each variable.
As discussed earlier, the higher the hazard ratio, the riskier the customer is, compared to the reference group.
For example, on average a male customer is 32% less risky than a female customer.
Meanwhile, for the numerical variable, we could interpret the hazard ratio as the change in relative risk for every unit change.
For example, for every increase in 1 for age, the risk would increase approximately by 5%.
Nevertheless, below are what we could observe from the cox model result:
ggplot(df, aes(x = age, fill = as.factor(exited), alpha = 0.5)) +
geom_histogram()

The likelihood of churning also depends on whether the customer is being tagged as an active member
The customers from Germany are more likely to churn, compared to France and Spain
This is consistent with what we observe in the dataset
It is probably worthwhile to understand further whether there are any reasons why customers from Germany are more likely to churn.
For example, it could be the customers from Germany are mostly from a certain industry, resulting in a higher churn rate.
This is to help us to spot whether any customer characteristics that are not captured in the dataset.
ggplot(df, aes(x = geography, fill = as.factor(exited))) +
geom_bar(position = "fill")

I find that visualizing the results as follows makes it easier to understand.
We could take one step further to perform feature selection on the variables to be used in cox proportional hazard modeling.
Over here, I will be using step function from stats package in performing stepwise feature selection.
Note that this method uses AIC in selecting the model.
I will also indicate that the direction should be “both”, i.e. the model could include or exclude variables at each step.
Start: AIC=31267.66
Surv(tenure, exited) ~ credit_score + geography + gender + age +
balance + num_of_products + has_cr_card + is_active_member +
estimated_salary
Df AIC
- estimated_salary 1 31266
- has_cr_card 1 31267
- num_of_products 1 31267
<none> 31268
- credit_score 1 31270
- balance 1 31286
- gender 1 31336
- geography 2 31348
- age 1 31881
Step: AIC=31265.75
Surv(tenure, exited) ~ credit_score + geography + gender + age +
balance + num_of_products + has_cr_card + is_active_member
Df AIC
- has_cr_card 1 31265
- num_of_products 1 31265
<none> 31266
+ estimated_salary 1 31268
- credit_score 1 31268
- balance 1 31284
- gender 1 31334
- geography 2 31347
- age 1 31879
Step: AIC=31265.16
Surv(tenure, exited) ~ credit_score + geography + gender + age +
balance + num_of_products + is_active_member
Df AIC
- num_of_products 1 31265
<none> 31265
+ has_cr_card 1 31266
+ estimated_salary 1 31267
- credit_score 1 31268
- balance 1 31284
- gender 1 31334
- geography 2 31345
- age 1 31880
Step: AIC=31264.82
Surv(tenure, exited) ~ credit_score + geography + gender + age +
balance + is_active_member
Df AIC
<none> 31265
+ num_of_products 1 31265
+ has_cr_card 1 31265
+ estimated_salary 1 31267
- credit_score 1 31267
- balance 1 31289
- gender 1 31333
- geography 2 31344
- age 1 31883
Call:
coxph(formula = Surv(tenure, exited) ~ credit_score + geography +
gender + age + balance + is_active_member, data = df, x = TRUE,
y = TRUE)
coef exp(coef) se(coef) z p
credit_score -0.048060 0.953076 0.023066 -2.084 0.0372
geographyGermany 0.479384 1.615078 0.055375 8.657 < 2e-16
geographySpain 0.034141 1.034731 0.062170 0.549 0.5829
genderMale -0.383472 0.681491 0.045820 -8.369 < 2e-16
age 0.048282 1.049466 0.001813 26.631 < 2e-16
balance 0.022047 1.022292 0.004329 5.093 3.53e-07
is_active_member1 -0.757448 0.468862 0.048702 -15.553 < 2e-16
Likelihood ratio test=1120 on 7 df, p=< 2.2e-16
n= 9587, number of events= 1942
From the result, we see that the following variables are being dropped in the feature selection:
num_of_products
has_cr_card
estimated_salary
With this, I will rebuild the cox proportional hazard model with lesser variables.
cox_model_sub <-
coxph(Surv(tenure, exited) ~ credit_score + geography + gender + age + balance + is_active_member,
data = df,
x = TRUE,
y = TRUE)
ggforest(cox_model_sub)

Next, I will extract the concordance index of the new model.
cox_model_sub$concordance["concordance"]
concordance
0.7200558
The new model requires fewer variables in predicting the outcome although there is a slight drop in the concordance index in the new model.
The concordance index dropped by 0.06%.
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Helena Lopes