In this post, I will be exploring one of the popular methods for survival analysis, Cox proportional hazard.

Photo by Alexander Andrews on Unsplash

Before jumping into cox proportional hazard, let’s take a look what are the differences between parametric, semi-parametric, and non-parametric models.

Differences Between parametric, semi-parametric, and non-parametric models

According to the post by Ravi Charan (Charan 2020), below are the summary of the differences:

Non-parametric model: Make no assumptions about the functional form of \(\lambda\) (i.e. hazard function)
Parametric model: Assume the precise functional form of \(\lambda\) in modeling
Semi-parametric model: Cox proportional hazard does not make any functional assumptions on the shape of the hazard function
- Functional form assumptions are made about the effects of the covariates alone

Cox Proportional Hazard

What is Cox proportional hazard?

So, what is cox proportional hazard?

Cox proportional hazard is a semi-parametric survival model.

The Cox proportional hazard formula can be written as follows (STHDA, n.d.):

\[h(t) = h_0(t) \times exp(b_1x_1 + b_2x_2 + ... + b_px_p) \]

where

t is the survival time
h(t) is the hazard function determined by a set of covariates (\(x_1, x_2, ..., x_p\))
the coefficients (\(b_1, b_2, ...., b_p\)) measure the impact of covariates
\(h_0\) is the baseline hazard

As shown in the formula, Cox proportional hazard model allows the users to assess the effects of multiple risk factors on survival time at the same time.

How to interpret Cox Proportional Hazard?

One of the key outputs from Cox proportional hazard model is the hazard ratio.

Below is how one could interpret the hazard ratio:

Categorical variable

Hazard ratio = 1 —> no effect
Hazard ratio > 1 increase relative risk compared to the reference group
Hazard ratio < 1 decrease in relative risk compared to the reference group

Numerical variable

The hazard ratio represents the increase/decrease in risk for every unit change

Examples will be provided in the demonstration so that it’s easier to understand the concepts.

Difference between Cox Proportional Hazard and Kaplan-Meier Survival Curve

One of the disadvantages of the Kaplan-Meier survival curve is unable to model numeric variables. Each unique numeric figure will be treated as a separate group when fitting the survival curve under the Kaplan-Meier model.

Another disadvantage of the Kaplan-Meier model is that model is unable to include many explanatory variables.

These issues could be overcome if Cox proportional hazard model is used.

Concordance Index

Before we move on, it’s important to discuss one of the important performance metrics in the survival model.

Concordance index (a.k.a. C-index) is one of the commonly used performance measures for survival models.

This index can be interpreted as the fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered (Raykar et al., n.d.).

Below is a chart on how the concordance index is being calculated:

Taken from “How to Evaluate Survival Analysis Models” post by Nicolo Cosimo Albanese

In the post, the author also summarised the interpretation of the concordance index as follows:

C = 1: perfect concordance between risks and event times.
C = 0: perfect anti-concordance between risks and event times.
C = 0.5: random assignment. The model predicts the relationship between risk and survival time as well as a coin toss.

Demonstration

In this demonstration, I will be using this bank dataset from Kaggle.

Setup the environment

In this demonstration, I will be using Cox proportional hazard function from survival package.

I will be using functions from survminer package to visualize the results.

First, I will load the necessary packages into the environment.

pacman::p_load(tidyverse, lubridate, janitor, survival, survminer, scales)

Import Data

Next, I will import the dataset into the environment.

I will also clean the column names, drop the columns I don’t need, and transform the columns to the right format.

df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-09-10-kaplan-meier/data/Churn_Modelling.csv") %>%
  clean_names() %>%
  select(-c(row_number, customer_id, surname)) %>%
  mutate(has_cr_card = factor(has_cr_card),
         is_active_member = factor(is_active_member),
         credit_score = credit_score/100,
         balance = balance/10000,
         estimated_salary = estimated_salary/10000) %>%
  filter(tenure > 0)

Note that I have also scaled credit_score, balance, and estimated_salary in respective units.

Building Cox model

To build a cox model, I will use coxph function from survival package to build a cox proportional hazard model.

cox_model <- coxph(Surv(tenure, exited) ~ ., 
                   data = df,
                   x = TRUE,
                   y = TRUE)

We could call the fitted model to see the model result.

cox_model

Call:
coxph(formula = Surv(tenure, exited) ~ ., data = df, x = TRUE, 
    y = TRUE)

                       coef exp(coef)  se(coef)       z       p
credit_score      -0.048759  0.952410  0.023060  -2.114  0.0345
geographyGermany   0.488400  1.629706  0.055680   8.772 < 2e-16
geographySpain     0.034473  1.035074  0.062178   0.554  0.5793
genderMale        -0.382557  0.682115  0.045836  -8.346 < 2e-16
age                0.048125  1.049302  0.001816  26.506 < 2e-16
balance            0.020393  1.020603  0.004495   4.537 5.7e-06
num_of_products   -0.051172  0.950116  0.039461  -1.297  0.1947
has_cr_card1      -0.059315  0.942410  0.049585  -1.196  0.2316
is_active_member1 -0.755391  0.469827  0.048751 -15.495 < 2e-16
estimated_salary   0.001221  1.001222  0.003926   0.311  0.7558

Likelihood ratio test=1124  on 10 df, p=< 2.2e-16
n= 9587, number of events= 1942

It seems like the summary function would provide more info about the fitted model if we pass the fitted model into the function.

summary(cox_model)

Call:
coxph(formula = Surv(tenure, exited) ~ ., data = df, x = TRUE, 
    y = TRUE)

  n= 9587, number of events= 1942 

                       coef exp(coef)  se(coef)       z Pr(>|z|)    
credit_score      -0.048759  0.952410  0.023060  -2.114   0.0345 *  
geographyGermany   0.488400  1.629706  0.055680   8.772  < 2e-16 ***
geographySpain     0.034473  1.035074  0.062178   0.554   0.5793    
genderMale        -0.382557  0.682115  0.045836  -8.346  < 2e-16 ***
age                0.048125  1.049302  0.001816  26.506  < 2e-16 ***
balance            0.020393  1.020603  0.004495   4.537  5.7e-06 ***
num_of_products   -0.051172  0.950116  0.039461  -1.297   0.1947    
has_cr_card1      -0.059315  0.942410  0.049585  -1.196   0.2316    
is_active_member1 -0.755391  0.469827  0.048751 -15.495  < 2e-16 ***
estimated_salary   0.001221  1.001222  0.003926   0.311   0.7558    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                  exp(coef) exp(-coef) lower .95 upper .95
credit_score         0.9524     1.0500    0.9103    0.9964
geographyGermany     1.6297     0.6136    1.4612    1.8176
geographySpain       1.0351     0.9661    0.9163    1.1692
genderMale           0.6821     1.4660    0.6235    0.7462
age                  1.0493     0.9530    1.0456    1.0530
balance              1.0206     0.9798    1.0117    1.0296
num_of_products      0.9501     1.0525    0.8794    1.0265
has_cr_card1         0.9424     1.0611    0.8551    1.0386
is_active_member1    0.4698     2.1284    0.4270    0.5169
estimated_salary     1.0012     0.9988    0.9935    1.0090

Concordance= 0.72  (se = 0.006 )
Likelihood ratio test= 1124  on 10 df,   p=<2e-16
Wald test            = 1169  on 10 df,   p=<2e-16
Score (logrank) test = 1211  on 10 df,   p=<2e-16

Note that the hazard ratio would be exp(coef) in the above.

The three tests at the bottom of the results are the statistical tests for the overall significance of the model, where the null hypothesis of the statistical tests is as follows:

\(H_0:\beta = 0\)

As the p-values for all the statistical tests are smaller than 0.05, we reject the null hypothesis.

For more info on how we could interpret the result, I find this post very helpful.

Alternatively, chapter 5 of the book Applied Survival Analysis Using R by Dirk F. Moore has a quite clear explanation of cox proportional hazards as well.

Extract the results

The concordance results can be extracted by using the following method:

cox_model$concordance["concordance"]

concordance 
  0.7204698

Alternatively, we could extract the concordance index by passing the fitted model into the concordance function from survival package.

concordance(cox_model)

Call:
concordance.coxph(object = cox_model)

n= 9587 
Concordance= 0.7205 se= 0.006461
concordant discordant     tied.x     tied.y    tied.xy 
   7192433    2790544          0     193333          0

Visualize the survival curve

Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function survfit() estimates the survival proportion, by default at the mean values of covariates.

ggsurvplot(survfit(cox_model), 
           data = df)

To plot the graph, it seems like we need to pass the fit object into survfit function

ggsurvplot also supports multiple datasets.

To do so, we need to combine the different survfit objects into a list.

For demonstration, I will first split the dataset by gender.

df_male <-
  df %>% 
  filter(gender == "Male") %>%
  select(-gender)

df_female <-
  df %>% 
  filter(gender == "Female") %>%
  select(-gender)

Then, I will fit two cox proportional hazard models based on each dataset.

survfit_male <-
  coxph(Surv(tenure, exited) ~ .,
        data = df_male)

survfit_female <-
  coxph(Surv(tenure, exited) ~ .,
        data = df_female)

Then, I will create the “list” before passing the list into ggsurvplot function.

surv_fit_list <- 
  list("male" = survfit(survfit_male),
       "female" = survfit(survfit_female))

ggsurvplot(surv_fit_list, 
           combine = TRUE,
           conf.int = TRUE,
           ylim = c(0.35, 1))

Visualize the cox proportional hazard results

Next, I will visualize the fitted model results by using ggforest function.

ggforest(cox_model)

This graph contains the different variables used in fitting Cox model and the different categories within the categorical variables.

On the right side of the graph, we have the hazard ratio, the confidence interval for the different hazard ratios, and p-value for each variable.

As discussed earlier, the higher the hazard ratio, the riskier the customer is, compared to the reference group.

For example, on average a male customer is 32% less risky than a female customer.

Meanwhile, for the numerical variable, we could interpret the hazard ratio as the change in relative risk for every unit change.

For example, for every increase in 1 for age, the risk would increase approximately by 5%.

Nevertheless, below are what we could observe from the cox model result:

Interestingly enough that in general, the churn increases when the age increases

ggplot(df, aes(x = age, fill = as.factor(exited), alpha = 0.5)) +
  geom_histogram()

The likelihood of churning also depends on whether the customer is being tagged as an active member
- The company could run some campaigns to engage the non-active customers to reduce the churn rate
The customers from Germany are more likely to churn, compared to France and Spain
- This is consistent with what we observe in the dataset
- It is probably worthwhile to understand further whether there are any reasons why customers from Germany are more likely to churn.
For example, it could be the customers from Germany are mostly from a certain industry, resulting in a higher churn rate.

This is to help us to spot whether any customer characteristics that are not captured in the dataset.

ggplot(df, aes(x = geography, fill = as.factor(exited))) +
  geom_bar(position = "fill")

I find that visualizing the results as follows makes it easier to understand.

Feature Selection

We could take one step further to perform feature selection on the variables to be used in cox proportional hazard modeling.

Over here, I will be using step function from stats package in performing stepwise feature selection.

Note that this method uses AIC in selecting the model.

I will also indicate that the direction should be “both”, i.e. the model could include or exclude variables at each step.

stats::step(
  cox_model,
  scope = list(upper = ~.,
               lower = ~is_active_member),
  direction = "both"
)

Start:  AIC=31267.66
Surv(tenure, exited) ~ credit_score + geography + gender + age + 
    balance + num_of_products + has_cr_card + is_active_member + 
    estimated_salary

                   Df   AIC
- estimated_salary  1 31266
- has_cr_card       1 31267
- num_of_products   1 31267
<none>                31268
- credit_score      1 31270
- balance           1 31286
- gender            1 31336
- geography         2 31348
- age               1 31881

Step:  AIC=31265.75
Surv(tenure, exited) ~ credit_score + geography + gender + age + 
    balance + num_of_products + has_cr_card + is_active_member

                   Df   AIC
- has_cr_card       1 31265
- num_of_products   1 31265
<none>                31266
+ estimated_salary  1 31268
- credit_score      1 31268
- balance           1 31284
- gender            1 31334
- geography         2 31347
- age               1 31879

Step:  AIC=31265.16
Surv(tenure, exited) ~ credit_score + geography + gender + age + 
    balance + num_of_products + is_active_member

                   Df   AIC
- num_of_products   1 31265
<none>                31265
+ has_cr_card       1 31266
+ estimated_salary  1 31267
- credit_score      1 31268
- balance           1 31284
- gender            1 31334
- geography         2 31345
- age               1 31880

Step:  AIC=31264.82
Surv(tenure, exited) ~ credit_score + geography + gender + age + 
    balance + is_active_member

                   Df   AIC
<none>                31265
+ num_of_products   1 31265
+ has_cr_card       1 31265
+ estimated_salary  1 31267
- credit_score      1 31267
- balance           1 31289
- gender            1 31333
- geography         2 31344
- age               1 31883

Call:
coxph(formula = Surv(tenure, exited) ~ credit_score + geography + 
    gender + age + balance + is_active_member, data = df, x = TRUE, 
    y = TRUE)

                       coef exp(coef)  se(coef)       z        p
credit_score      -0.048060  0.953076  0.023066  -2.084   0.0372
geographyGermany   0.479384  1.615078  0.055375   8.657  < 2e-16
geographySpain     0.034141  1.034731  0.062170   0.549   0.5829
genderMale        -0.383472  0.681491  0.045820  -8.369  < 2e-16
age                0.048282  1.049466  0.001813  26.631  < 2e-16
balance            0.022047  1.022292  0.004329   5.093 3.53e-07
is_active_member1 -0.757448  0.468862  0.048702 -15.553  < 2e-16

Likelihood ratio test=1120  on 7 df, p=< 2.2e-16
n= 9587, number of events= 1942

From the result, we see that the following variables are being dropped in the feature selection:

num_of_products
has_cr_card
estimated_salary

With this, I will rebuild the cox proportional hazard model with lesser variables.

cox_model_sub <- 
  coxph(Surv(tenure, exited) ~ credit_score + geography + gender + age + balance + is_active_member, 
        data = df, 
        x = TRUE, 
        y = TRUE)

ggforest(cox_model_sub)

Next, I will extract the concordance index of the new model.

cox_model_sub$concordance["concordance"]

concordance 
  0.7200558

The new model requires fewer variables in predicting the outcome although there is a slight drop in the concordance index in the new model.

The concordance index dropped by 0.06%.

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Helena Lopes

Charan, Ravi. 2020. “The Cox Proportional Hazards Model.” https://towardsdatascience.com/the-cox-proportional-hazards-model-35e60e554d8f.

Raykar, Vikas C., Harald Steck, Balaji Krishnapuram, Cary Dehing-Oberije, and Philippe Lambin. n.d. “On Ranking in Survival Analysis: Bounds on the Concordance Index.” https://proceedings.neurips.cc/paper/2007/file/33e8075e9970de0cfea955afd4644bb2-Paper.pdf.

STHDA. n.d. “Cox Proportional-Hazards Model.” http://www.sthda.com/english/wiki/cox-proportional-hazards-model.

Cox Proportional Hazard

Differences Between parametric, semi-parametric, and non-parametric models

Cox Proportional Hazard

What is Cox proportional hazard?

How to interpret Cox Proportional Hazard?

Categorical variable

Numerical variable

Difference between Cox Proportional Hazard and Kaplan-Meier Survival Curve

Concordance Index

Demonstration

Setup the environment

Import Data

Building Cox model

Extract the results

Visualize the survival curve

Visualize the cox proportional hazard results

Feature Selection

Conclusion

References