In the long run we are all dead - John Maynard Keynes

Photo by Ian Keefe on Unsplash
Recently, I was reading my notes on survival analysis.
Although this was one of the topics I learned during my undergraduate studies, I had forgotten most of the details, so I thought it would be beneficial to read more about survival analysis.
In this post, I will summarize what survival analysis is and some of its important concepts.
I will also use one of the most popular methods, Kaplan-Meier, to estimate the survival curve.
Survival analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur (STHDA).
Survival analysis is also commonly known as ‘time to event’ analysis.
Let’s look at some important concepts of survival models.
The survival function is defined as the probability of surviving beyond a time point t.
It can be written as follows:
\[S(t) = Pr(T > t)\]
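As a quick numerical illustration (a sketch of my own, assuming no censoring; the data and function name are hypothetical), S(t) can be estimated empirically as the fraction of subjects whose event time exceeds t:

```python
# Hypothetical event times for five subjects (no censoring, illustration only)
times = [2, 3, 3, 5, 8]

def survival_prob(t, times):
    """Empirical S(t) = Pr(T > t): fraction of subjects surviving beyond time t."""
    return sum(x > t for x in times) / len(times)

print(survival_prob(3, times))  # 2 of the 5 subjects survive beyond t = 3, so 0.4
```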
The hazard function is defined as the instantaneous rate at which subjects experience the event of interest, given that they have survived up to time t (Bartlett 2014).
The function is also known as intensity function, instantaneous failure rate or force of mortality.
\[h(t)=\lim_{\delta\to0}\frac{Pr(t<T\leq t+\delta \mid T>t)}{\delta}\]
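For completeness (these identities are standard results, not part of my original notes), the hazard function, the density \(f(t)\) of \(T\) and the survival function determine one another:

\[h(t) = \frac{f(t)}{S(t)}, \qquad S(t) = \exp\left(-\int_0^t h(u)\,du\right)\]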
Censoring occurs when we have some information about individual survival time, but we don’t know the time exactly (Sestelo 2017).
The author also summarized the reasons why censoring may occur:
- a person does not experience the event before the study ends
- a person is lost to follow-up during the study period
- a person withdraws from the study because of death (if death is not the event of interest) or some other reason
In general, there are three types of censoring:
| Type | Description |
|---|---|
| Right | Occurs when the true unobserved event is to the right of the censoring time |
| Left | The event has already occurred before the observed time |
| Interval | The event of interest happens within a specific interval of time |
I find that the illustration by Professor Pesta makes it easier to understand the different types of censoring.

Illustration by Michal Pesta
At first glance, it may look like survival analysis and logistic regression perform the same analysis.
However, there is a key difference between the two.
Survival analysis accounts for censored data, whereas logistic regression treats all the individuals who didn’t experience the event as ‘survived’ (Zablotski 2021).
This would overestimate the survival probability.
In this post, I will be exploring Kaplan-Meier (KM) method to estimate the survival curve.
KM is a non-parametric method for estimating the survival curve.
The KM survival curve is defined as the probability of surviving for a given length of time, considering time in many small intervals (Manish Kumar Goel 2010).
The formula can be written as follows:
\[\hat{S}(t) = \prod_{t_i \leq t}\left(1- \frac{d_i}{n_i}\right)\]
where
- \(\hat{S}(t)\) is the estimated survival probability at time t
- \(d_i\) is the number of failure events at time \(t_i\)
- \(n_i\) is the number of people at risk at time \(t_i\)
In short, the function estimates the survival probability at a given time, provided the individual has survived all the previous periods.
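To make the formula concrete, below is a minimal from-scratch sketch of the KM estimator in Python (my own illustration, not the implementation used by the R survival package; the function name and data are hypothetical):

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier estimator.
    times: observed times; events: 1 = event observed, 0 = right-censored.
    Returns a list of (time, estimated survival probability) pairs."""
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    time_counts = Counter(times)
    at_risk = len(times)  # n_i: number of subjects still at risk
    surv = 1.0            # running product of the (1 - d_i / n_i) factors
    curve = []
    for t in sorted(time_counts):
        d = event_counts.get(t, 0)  # d_i: number of events at time t
        if d > 0:
            surv *= 1 - d / at_risk
        curve.append((t, surv))
        at_risk -= time_counts[t]   # subjects observed at t leave the risk set
    return curve

# Five subjects: events at times 2, 3 and 5; censored at times 3 and 8
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

At time 3, one subject fails and one is censored out of the four still at risk, so the estimate drops from 0.8 to 0.8 × (1 − 1/4) = 0.6.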
Below are some of the advantages of using the KM model (Zablotski 2021; Sestelo 2017):
- Commonly used to describe survival
- Commonly used to compare two study populations
- Intuitive graphical presentation
- Does not require many assumptions
The following are the disadvantages of using the KM model:
- Unable to model numeric variables
- Unable to include many explanatory variables
- Does not control for covariates
- Cannot accommodate time-dependent variables
In this demonstration, I will be using this bank dataset from Kaggle.

Photo by Alex Motoc on Unsplash
First, I will load the necessary packages into the environment.
```r
# ggfortify is added here as it is needed for autoplot later in the post
pacman::p_load(tidyverse, lubridate, tidymodels, rsample, survival,
               censored, janitor, survminer, ggfortify)
```
I will be using the survival package to perform the survival analysis.

Taken from survival Github
Next, I will import the dataset into the environment.
I will also clean the column names, drop the columns I don’t need and transform the columns into the right format.
```r
df <- read_csv("data/Churn_Modelling.csv") %>%
  clean_names() %>%
  select(-c(row_number, customer_id)) %>%
  mutate(has_cr_card = factor(has_cr_card),
         is_active_member = factor(is_active_member),
         age_group = case_when(age < 30 ~ "20s",
                               age < 40 ~ "30s",
                               age < 50 ~ "40s",
                               age < 60 ~ "50s",
                               TRUE ~ "60s+"))
```
According to the documentation page, if the event variable (i.e. the exited column in this dataset) is a factor, the survival type will be assumed to be multi-state.
The Surv object would look slightly different from the usual single-state survival analysis, as shown below.
```r
df_1_factor <- df %>%
  mutate(exited = as.factor(exited))

# only show the first 10 records
Surv(df_1_factor$tenure, df_1_factor$exited)[1:10]
```

```
[1] 2:1 1+  8:1 1+  2+  8:1 7+  4:1 4+  2+
```
However, for this dataset, there are only two possible outcomes under the exited variable, i.e. whether the customer has left the bank or remains a customer.
So, I will convert the exited variable to a numeric variable instead.
```r
df_1 <- df %>%
  mutate(exited = as.numeric(exited))
```
Now, if I call the Surv object, we can see that it looks slightly different from the multi-state Surv object.
```r
Surv(df_1$tenure, df_1$exited)[1:10]
```

```
[1] 2  1+ 8  1+ 2+ 8  7+ 4  4+ 2+
```
Note that a tenure with a plus sign indicates the data point is right-censored.
Also note that event = 1 indicates that the event has been observed, while event = 0 indicates that the observation is censored.
Now, I will start building the survival curve using the survfit function.
Note that the call is very similar to how we usually fit a machine learning model, except that the dependent variable is now a Surv object.
```r
surv_fit <- survfit(Surv(tenure, exited) ~ 1, data = df_1)
```
In order to show the overall survival curve, I will pass “1” as the
independent variable into the survfit function.
```r
surv_fit
```

```
Call: survfit(formula = Surv(tenure, exited) ~ 1, data = df_1)

         n events median 0.95LCL 0.95UCL
[1,] 10000   2037     10      10      NA
```
We can also extract the fitted survival curve by passing the created survfit object into the summary function.
```r
summary(surv_fit)
```

```
Call: survfit(formula = Surv(tenure, exited) ~ 1, data = df_1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0  10000      95    0.991 0.00097        0.989        0.992
    1   9587     232    0.967 0.00182        0.963        0.970
    2   8552     201    0.944 0.00238        0.939        0.948
    3   7504     213    0.917 0.00294        0.911        0.923
    4   6495     203    0.888 0.00347        0.882        0.895
    5   5506     209    0.855 0.00404        0.847        0.863
    6   4494     196    0.817 0.00466        0.808        0.827
    7   3527     177    0.776 0.00535        0.766        0.787
    8   2499     197    0.715 0.00647        0.703        0.728
    9   1474     213    0.612 0.00857        0.595        0.629
   10    490     101    0.486 0.01309        0.461        0.512
```
Apart from the cumulative survival probabilities, the result contains the following:
- how many people are at risk at the beginning of each time point (i.e. n.risk)
- how many events have occurred during each time point (i.e. n.event)
- the standard error and confidence interval of the survival estimate
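As a sanity check, the survival column can be reproduced by hand from n.risk and n.event using the KM product formula; here is a quick sketch in Python (my own illustration) using the first three rows of the summary:

```python
# n.risk and n.event for tenure 0, 1 and 2, copied from the survfit summary
n_risk  = [10000, 9587, 8552]
n_event = [95, 232, 201]

surv = 1.0
estimates = []
for n, d in zip(n_risk, n_event):
    surv *= 1 - d / n  # multiply in the KM factor (1 - d_i / n_i)
    estimates.append(round(surv, 4))

print(estimates)  # [0.9905, 0.9665, 0.9438], matching the survival column
```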
There are a few ways to visualize the survival curve.
In this sub-section, I will explore the different approaches.
Method 1: Use the ggsurvplot function from the survminer package
To visualize the survival curve, we could use the ggsurvplot function from the survminer package.
```r
ggsurvplot(surv_fit)
```

The package also offers a list of additional arguments to modify the graph.
For example, we can draw the median survival time in the graph.
```r
ggsurvplot(surv_fit, surv.median.line = "hv")
```

One interesting argument is risk.table: the ggsurvplot function enables users to add a risk table by “turning on” the risk.table argument.
```r
graph <- ggsurvplot(surv_fit,
                    risk.table = TRUE)

# remove the formatting on the risk table
graph$table <- graph$table + theme_cleantable()

graph
```

To add the title, we could pass the title name into the title argument as shown below.
```r
ggsurvplot(surv_fit,
           title = "Overall Survival Curve")
```

Alternatively, we could use the usual ggplot approach to include the modifications we would like to make on the graph.
```r
ggsurvplot(surv_fit) +
  labs(title = "Overall Survival Curve")
```

Note that the graph object from the ggsurvplot function is not a ggplot object. Hence, some of the usual ways of modifying the graph (e.g. layering on additional modifications) may not work.
As such, we will need to use the list of allowed arguments to modify the graph.
For more details, please refer to the documentation page.
Method 2: Use the autoplot function from the ggfortify package
Next, I will use the autoplot function from the ggfortify package to create the graph.
```r
autoplot(surv_fit)
```

The advantage of this method is that the created graph is a ggplot object.
This allows us to make changes to the graph through different ggplot layers.
```r
autoplot(surv_fit) +
  scale_x_continuous(breaks = c(0, 2, 4, 6, 8, 10)) +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
                     labels = c("0%", "25%", "50%", "75%", "100%"),
                     limits = c(0, 1)) +
  xlab("Time") +
  ylab("Survival Probability") +
  theme_minimal()
```

Note that the censoring marks appear at each step of the survival curve. This is because the tenure captured for each customer in the dataset is an integer, i.e. it doesn’t have any decimals.
Method 3: Use functions from the ggplot2 package
Lastly, I will use the good old ggplot2 package to create the survival curve from scratch.
To do so, I will first pass the survfit object into the tidy function to turn the fitted object into a tibble.
```r
surv_fit_tb <- surv_fit %>%
  tidy()

surv_fit_tb
```

```
# A tibble: 11 x 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     0  10000      95      318    0.990  0.000979     0.992    0.989
 2     1   9587     232      803    0.967  0.00188      0.970    0.963
 3     2   8552     201      847    0.944  0.00252      0.948    0.939
 4     3   7504     213      796    0.917  0.00320      0.923    0.911
 5     4   6495     203      786    0.888  0.00390      0.895    0.882
 6     5   5506     209      803    0.855  0.00473      0.863    0.847
 7     6   4494     196      771    0.817  0.00570      0.827    0.808
 8     7   3527     177      851    0.776  0.00689      0.787    0.766
 9     8   2499     197      828    0.715  0.00904      0.728    0.703
10     9   1474     213      771    0.612  0.0140       0.629    0.595
11    10    490     101      389    0.486  0.0269       0.512    0.461
```
Alternatively, the surv_summary function from
survminer package could be used to “tidy” the results into
a data frame.
```r
surv_summary(surv_fit)
```

```
   time n.risk n.event n.censor      surv      std.err     upper     lower
1     0  10000      95      318 0.9905000 0.0009793424 0.9924031 0.9886006
2     1   9587     232      803 0.9665305 0.0018830569 0.9701042 0.9629698
3     2   8552     201      847 0.9438138 0.0025219694 0.9484906 0.9391601
4     3   7504     213      796 0.9170238 0.0032021035 0.9227971 0.9112866
5     4   6495     203      786 0.8883624 0.0039013915 0.8951814 0.8815954
6     5   5506     209      803 0.8546414 0.0047314807 0.8626038 0.8467525
7     6   4494     196      771 0.8173673 0.0057038892 0.8265563 0.8082805
8     7   3527     177      851 0.7763483 0.0068930935 0.7869081 0.7659302
9     8   2499     197      828 0.7151476 0.0090420982 0.7279346 0.7025853
10    9   1474     213      771 0.6118054 0.0140126693 0.6288411 0.5952312
11   10    490     101      389 0.4856986 0.0269487042 0.5120420 0.4607104
```
The output from the surv_summary function is the same as the output from the tidy function, with some differences in column naming.
Then, I will pass the tibble into ggplot function.
Voilà!
```r
surv_fit_tb %>%
  ggplot() +
  # survival curve
  geom_step(aes(x = time, y = estimate)) +
  # shaded area of the confidence interval
  geom_rect(aes(xmin = time,
                xmax = lead(time),
                ymin = conf.low,
                ymax = conf.high),
            fill = "blue",
            alpha = 0.2) +
  scale_y_continuous(labels = scales::percent,
                     limits = c(0, 1)) +
  labs(title = "Overall Survival Curve") +
  xlab("Time") +
  ylab("Survival Probability") +
  theme_minimal()
```

Next, I will fit the survival curve by different groups.
```r
surv_fit_age_group <- survfit(Surv(tenure, exited) ~ age_group, data = df_1)

ggsurvplot(surv_fit_age_group)
```

From the graph above, the younger customers appear to be more “sticky”: they have a higher median survival time than the older customers.
If we input more than one variable, the function will create a separate survival curve for every combination of the categories of the selected variables.
For example, if we pass both age_group and gender into the function, this is how the graph would look:

```r
surv_fit_age_gender <- survfit(Surv(tenure, exited) ~ age_group + gender, data = df_1)

ggsurvplot(surv_fit_age_gender)
```

That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by takahiro taguchi on Unsplash