In the long run we are all dead - John Maynard Keynes

Photo by Ian Keefe on Unsplash
Recently, I was reading my notes on survival analysis.
Although this was one of the topics I learned during my undergraduate studies, I had forgotten most of the details, so I thought it would be beneficial to read more about survival analysis.
In this post, I will summarize what survival analysis is and some of its important concepts.
I will also use one of the most popular methods, Kaplan-Meier, to estimate the survival curve.
Survival analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur (STHDA).
Survival analysis is also commonly known as ‘time to event’ analysis.
Let’s look at some important concepts of survival models.
The survival function is defined as the probability of surviving beyond a time point t.
It can be written as follows:
\[S(t) = Pr(T > t)\]
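As a quick numerical illustration (a sketch of my own, assuming no censoring; the data and function name are hypothetical), S(t) can be estimated empirically as the fraction of subjects whose event time exceeds t:

```python
# Hypothetical event times for five subjects (no censoring, illustration only)
times = [2, 3, 3, 5, 8]

def survival_prob(t, times):
    """Empirical S(t) = Pr(T > t): fraction of subjects surviving beyond time t."""
    return sum(x > t for x in times) / len(times)

print(survival_prob(3, times))  # 2 of the 5 subjects survive beyond t = 3, so 0.4
```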
The hazard function is defined as the instantaneous rate at which subjects experience the event of interest, given that they have survived up to time t (Bartlett 2014).
The function is also known as intensity function, instantaneous failure rate or force of mortality.
\[h(t)=\lim_{\delta\to0}\frac{Pr(t<T\leq t+\delta \mid T>t)}{\delta}\]
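For completeness (these identities are standard results, not part of my original notes), the hazard function, the density \(f(t)\) of \(T\) and the survival function determine one another:

\[h(t) = \frac{f(t)}{S(t)}, \qquad S(t) = \exp\left(-\int_0^t h(u)\,du\right)\]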
Censoring occurs when we have some information about individual survival time, but we don’t know the time exactly (Sestelo 2017).
The author also summarized the reasons why censoring may occur:
- a person does not experience the event before the study ends
- a person is lost to follow-up during the study period
- a person withdraws from the study because of death (if death is not the event of interest) or some other reason
In general, there are three types of censoring:
| Type | Description |
|---|---|
| Right | Occurs when the true unobserved event is to the right of the censoring time |
| Left | The event has already occurred before the observed time |
| Interval | The event of interest happens within a specific interval of time |
I find that the illustration by Professor Pesta makes it easier to understand the different types of censoring.

Illustration by Michal Pesta
At first glance, it may look like survival analysis and logistic regression perform the same analysis.
However, there is a key difference between the two.
Survival analysis accounts for censored data, whereas logistic regression treats all the individuals who didn’t experience the event as ‘survived’ (Zablotski 2021).
This would overestimate the survival probability.
In this post, I will be exploring Kaplan-Meier (KM) method to estimate the survival curve.
KM is a non-parametric method for estimating the survival curve.
The KM survival curve is defined as the probability of surviving for a given length of time, considering time in many small intervals (Manish Kumar Goel 2010).
The formula can be written as follows:
\[\hat{S}(t) = \prod_{t_i \leq t}\left(1- \frac{d_i}{n_i}\right)\]
where
- \(\hat{S}(t)\) is the estimated survival probability at time t
- \(d_i\) is the number of failure events at time \(t_i\)
- \(n_i\) is the number of people at risk at time \(t_i\)
In short, the function estimates the survival probability at a given time, provided the individual has survived all the previous periods.
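To make the formula concrete, below is a minimal from-scratch sketch of the KM estimator in Python (my own illustration, not the implementation used by the R survival package; the function name and data are hypothetical):

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier estimator.
    times: observed times; events: 1 = event observed, 0 = right-censored.
    Returns a list of (time, estimated survival probability) pairs."""
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    time_counts = Counter(times)
    at_risk = len(times)  # n_i: number of subjects still at risk
    surv = 1.0            # running product of the (1 - d_i / n_i) factors
    curve = []
    for t in sorted(time_counts):
        d = event_counts.get(t, 0)  # d_i: number of events at time t
        if d > 0:
            surv *= 1 - d / at_risk
        curve.append((t, surv))
        at_risk -= time_counts[t]   # subjects observed at t leave the risk set
    return curve

# Five subjects: events at times 2, 3 and 5; censored at times 3 and 8
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

At time 3, one subject fails and one is censored out of the four still at risk, so the estimate drops from 0.8 to 0.8 × (1 − 1/4) = 0.6.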
Below are some of the advantages of using the KM model (Zablotski 2021; Sestelo 2017):
- Commonly used to describe survival
- Commonly used to compare two study populations
- Intuitive graphical presentation
- Does not require many assumptions
The following are the disadvantages of using the KM model:
- Unable to model numeric variables
- Unable to include many explanatory variables
- Does not control for covariates
- Cannot accommodate time-dependent variables
In this demonstration, I will be using this bank dataset from Kaggle.

Photo by Alex Motoc on Unsplash
First, I will load the necessary packages into the environment.
```r
# ggfortify is added here as it is needed for autoplot later in the post
pacman::p_load(tidyverse, lubridate, tidymodels, rsample, survival,
               censored, janitor, survminer, ggfortify)
```
I will be using the survival package to perform the survival analysis.

Taken from survival Github
Next, I will import the dataset into the environment.
I will also clean the column names, drop the columns I don’t need and transform the columns into the right format.
```r
df <- read_csv("data/Churn_Modelling.csv") %>%
  clean_names() %>%
  select(-c(row_number, customer_id)) %>%
  mutate(has_cr_card = factor(has_cr_card),
         is_active_member = factor(is_active_member),
         age_group = case_when(age < 30 ~ "20s",
                               age < 40 ~ "30s",
                               age < 50 ~ "40s",
                               age < 60 ~ "50s",
                               TRUE ~ "60s+"))
```
According to the documentation page, if the event variable (i.e. the exited column in this dataset) is a factor, the survival type will be assumed to be multi-state.
The Surv object would look slightly different from the usual single-state survival analysis, as shown below.
```r
df_1_factor <- df %>%
  mutate(exited = as.factor(exited))

# only show the first 10 records
Surv(df_1_factor$tenure, df_1_factor$exited)[1:10]
```

```
[1] 2:1 1+  8:1 1+  2+  8:1 7+  4:1 4+  2+
```
However, for this dataset, there are only two possible outcomes under the exited variable, i.e. whether the customer has left the bank or remains a customer.
So, I will convert the exited variable to a numeric variable instead.
```r
df_1 <- df %>%
  mutate(exited = as.numeric(exited))
```
Now, if I call the Surv object, we can see that it looks slightly different from the multi-state Surv object.
```r
Surv(df_1$tenure, df_1$exited)[1:10]
```

```
[1] 2  1+ 8  1+ 2+ 8  7+ 4  4+ 2+
```
Note that a tenure with a plus sign indicates the data point is right-censored.
Also note that event = 1 indicates that the event has been observed, while event = 0 indicates that the observation is censored.
Now, I will start building the survival curve using the survfit function.
Note that the call is very similar to how we usually fit a machine learning model, except that the dependent variable is now a Surv object.
```r
surv_fit <- survfit(Surv(tenure, exited) ~ 1, data = df_1)
```
In order to show the overall survival curve, I will pass “1” as the
independent variable into the survfit function.
```r
surv_fit
```

```
Call: survfit(formula = Surv(tenure, exited) ~ 1, data = df_1)

         n events median 0.95LCL 0.95UCL
[1,] 10000   2037     10      10      NA
```
We can also extract the fitted survival curve by passing the created survfit object into the summary function.
```r
summary(surv_fit)
```

```
Call: survfit(formula = Surv(tenure, exited) ~ 1, data = df_1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0  10000      95    0.991 0.00097        0.989        0.992
    1   9587     232    0.967 0.00182        0.963        0.970
    2   8552     201    0.944 0.00238        0.939        0.948
    3   7504     213    0.917 0.00294        0.911        0.923
    4   6495     203    0.888 0.00347        0.882        0.895
    5   5506     209    0.855 0.00404        0.847        0.863
    6   4494     196    0.817 0.00466        0.808        0.827
    7   3527     177    0.776 0.00535        0.766        0.787
    8   2499     197    0.715 0.00647        0.703        0.728
    9   1474     213    0.612 0.00857        0.595        0.629
   10    490     101    0.486 0.01309        0.461        0.512
```
Apart from the cumulative survival probabilities, the result contains the following:
- how many people are at risk at the beginning of each time point (i.e. n.risk)
- how many events have occurred during each time point (i.e. n.event)
- the standard error and confidence interval of the survival estimate
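As a sanity check, the survival column can be reproduced by hand from n.risk and n.event using the KM product formula; here is a quick sketch in Python (my own illustration) using the first three rows of the summary:

```python
# n.risk and n.event for tenure 0, 1 and 2, copied from the survfit summary
n_risk  = [10000, 9587, 8552]
n_event = [95, 232, 201]

surv = 1.0
estimates = []
for n, d in zip(n_risk, n_event):
    surv *= 1 - d / n  # multiply in the KM factor (1 - d_i / n_i)
    estimates.append(round(surv, 4))

print(estimates)  # [0.9905, 0.9665, 0.9438], matching the survival column
```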
There are a few ways to visualize the survival curve.
In this sub-section, I will explore the different approaches.
Method 1: Use the ggsurvplot function from the survminer package
To visualize the survival curve, we could use the ggsurvplot function from the survminer package.
```r
ggsurvplot(surv_fit)
```

The package also offers a list of additional arguments to modify the graph.
For example, we can draw the median survival time in the graph.
```r
ggsurvplot(surv_fit, surv.median.line = "hv")
```

One interesting argument is risk.table: the ggsurvplot function enables users to add a risk table by “turning on” the risk.table argument.
```r
graph <- ggsurvplot(surv_fit,
                    risk.table = TRUE)

# remove the formatting on the risk table
graph$table <- graph$table + theme_cleantable()

graph
```

To add the title, we could pass the title name into the title argument as shown below.
```r
ggsurvplot(surv_fit,
           title = "Overall Survival Curve")
```

Alternatively, we could use the usual ggplot approach to include the modifications we would like to make on the graph.
```r
ggsurvplot(surv_fit) +
  labs(title = "Overall Survival Curve")
```

Note that the graph object from the ggsurvplot function is not a ggplot object. Hence, some of the usual ways of modifying the graph (e.g. layering on additional modifications) may not work.
As such, we will need to use the list of allowed arguments to modify the graph.
For more details, please refer to the documentation page.
Method 2: Use the autoplot function from the ggfortify package
Next, I will use the autoplot function from the ggfortify package to create the graph.
```r
autoplot(surv_fit)
```

The advantage of this method is that the created graph is a ggplot object.
This allows us to make changes to the graph through different ggplot layers.
```r
autoplot(surv_fit) +
  scale_x_continuous(breaks = c(0, 2, 4, 6, 8, 10)) +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
                     labels = c("0%", "25%", "50%", "75%", "100%"),
                     limits = c(0, 1)) +
  xlab("Time") +
  ylab("Survival Probability") +
  theme_minimal()
```

Note that the censoring marks appear at each step of the survival curve. This is because the tenure captured for each customer in the dataset is an integer, i.e. it doesn’t have any decimals.
Method 3: Use functions from the ggplot2 package
Lastly, I will use the good old ggplot2 package to create the survival curve from scratch.
To do so, I will first pass the survfit object into the tidy function to turn the fitted object into a tibble.
```r
surv_fit_tb <- surv_fit %>%
  tidy()

surv_fit_tb
```

```
# A tibble: 11 x 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     0  10000      95      318    0.990  0.000979     0.992    0.989
 2     1   9587     232      803    0.967  0.00188      0.970    0.963
 3     2   8552     201      847    0.944  0.00252      0.948    0.939
 4     3   7504     213      796    0.917  0.00320      0.923    0.911
 5     4   6495     203      786    0.888  0.00390      0.895    0.882
 6     5   5506     209      803    0.855  0.00473      0.863    0.847
 7     6   4494     196      771    0.817  0.00570      0.827    0.808
 8     7   3527     177      851    0.776  0.00689      0.787    0.766
 9     8   2499     197      828    0.715  0.00904      0.728    0.703
10     9   1474     213      771    0.612  0.0140       0.629    0.595
11    10    490     101      389    0.486  0.0269       0.512    0.461
```
Alternatively, the surv_summary function from
survminer package could be used to “tidy” the results into
a data frame.
```r
surv_summary(surv_fit)
```

```
   time n.risk n.event n.censor      surv      std.err     upper     lower
1     0  10000      95      318 0.9905000 0.0009793424 0.9924031 0.9886006
2     1   9587     232      803 0.9665305 0.0018830569 0.9701042 0.9629698
3     2   8552     201      847 0.9438138 0.0025219694 0.9484906 0.9391601
4     3   7504     213      796 0.9170238 0.0032021035 0.9227971 0.9112866
5     4   6495     203      786 0.8883624 0.0039013915 0.8951814 0.8815954
6     5   5506     209      803 0.8546414 0.0047314807 0.8626038 0.8467525
7     6   4494     196      771 0.8173673 0.0057038892 0.8265563 0.8082805
8     7   3527     177      851 0.7763483 0.0068930935 0.7869081 0.7659302
9     8   2499     197      828 0.7151476 0.0090420982 0.7279346 0.7025853
10    9   1474     213      771 0.6118054 0.0140126693 0.6288411 0.5952312
11   10    490     101      389 0.4856986 0.0269487042 0.5120420 0.4607104
```
The output from the surv_summary function is the same as the output from the tidy function, with some differences in column naming.
Then, I will pass the tibble into ggplot function.
Voilà!
```r
surv_fit_tb %>%
  ggplot() +
  # survival curve
  geom_step(aes(x = time, y = estimate)) +
  # shaded area of the confidence interval
  geom_rect(aes(xmin = time,
                xmax = lead(time),
                ymin = conf.low,
                ymax = conf.high),
            fill = "blue",
            alpha = 0.2) +
  scale_y_continuous(labels = scales::percent,
                     limits = c(0, 1)) +
  labs(title = "Overall Survival Curve") +
  xlab("Time") +
  ylab("Survival Probability") +
  theme_minimal()
```

Next, I will fit the survival curve by different groups.
```r
surv_fit_age_group <- survfit(Surv(tenure, exited) ~ age_group, data = df_1)

ggsurvplot(surv_fit_age_group)
```

From the graph above, the younger customers appear to be more “sticky”: they have a higher median survival time than the older customers.
If we input more than one variable, the function will create a separate survival curve for every combination of the categories of the selected variables.
For example, if we pass both age_group and gender into the function, this is how the graph would look:

```r
surv_fit_age_gender <- survfit(Surv(tenure, exited) ~ age_group + gender, data = df_1)

ggsurvplot(surv_fit_age_gender)
```

That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by takahiro taguchi on Unsplash