Survival Modelling - Log Rank Test

Machine Learning Survival Model

Are the survival curves the same? Yes or no?

Jasper Lok https://jasperlok.netlify.app/
01-09-2023

Photo by Daniel Reche

In my previous post, I shared how to build a survival curve.

Often, one of the questions raised while building the survival curve is whether the survival curves observed under the different groups are statistically different from one another.

Log-rank test

The log-rank test is a large-sample chi-square test.

It compares the observed and expected number of events in each group, summed over the failure times, to see whether the survival curves are statistically different.
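To make the "observed versus expected" idea concrete, below is a minimal sketch that computes the two-group log-rank statistic by hand and compares it with the value returned by survdiff. It uses the small aml dataset that ships with the survival package rather than the bank data in this post, and the variable names (in_g1, o1, e1, v) are my own choices for illustration.

```r
library(survival)

# aml is a built-in dataset: time, status (1 = event) and a two-level group x
d <- aml
in_g1 <- d$x == levels(as.factor(d$x))[1]

# Distinct times at which at least one event occurred
event_times <- sort(unique(d$time[d$status == 1]))

o1 <- e1 <- v <- numeric(length(event_times))
for (f in seq_along(event_times)) {
  t_f <- event_times[f]
  n1 <- sum(d$time >= t_f & in_g1)                   # at risk in group 1
  n2 <- sum(d$time >= t_f & !in_g1)                  # at risk in group 2
  m1 <- sum(d$time == t_f & d$status == 1 & in_g1)   # events in group 1
  m2 <- sum(d$time == t_f & d$status == 1 & !in_g1)  # events in group 2
  n <- n1 + n2
  m <- m1 + m2
  o1[f] <- m1
  e1[f] <- n1 * m / n                                # expected events under the null
  # Hypergeometric variance; zero when only one subject remains at risk
  v[f] <- if (n > 1) n1 * n2 * m * (n - m) / (n^2 * (n - 1)) else 0
}

chisq_by_hand <- sum(o1 - e1)^2 / sum(v)
chisq_survdiff <- survdiff(Surv(time, status) ~ x, data = d)$chisq

c(by_hand = chisq_by_hand, survdiff = chisq_survdiff)
```

The two numbers should agree, which shows that the statistic is simply the squared total of observed minus expected events, scaled by its variance.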

Below are the null and alternative hypotheses of the log-rank test:

Hypothesis    Remarks
Null          All survival curves are the same
Alternative   At least one of the survival curves is different from the rest

Taken from DATAtab

Different variations of log-rank tests

There are different variations of the log-rank test (Kleinbaum and Klein 2012).

They allow the users to apply different weights at the f-th failure time.

Taken from Survival Analysis - A Self Learning Text book
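For reference, the weighted test statistic and the commonly listed weights can be sketched as below. This is my own transcription in the notation I recall from Kleinbaum and Klein (2012), so treat the symbols as assumptions: m_{if} and e_{if} are the observed and expected events in group i at the f-th failure time, n_f is the number at risk, and the S terms are survival estimates from the pooled data.

```latex
\[
\text{Test statistic} =
\frac{\left(\sum_{f} w(t_f)\,(m_{if} - e_{if})\right)^{2}}
     {\operatorname{Var}\!\left(\sum_{f} w(t_f)\,(m_{if} - e_{if})\right)}
\]

\begin{tabular}{ll}
Test              & Weight $w(t_f)$ \\
Log-rank          & $1$ \\
Wilcoxon          & $n_f$ \\
Tarone-Ware       & $\sqrt{n_f}$ \\
Peto              & $\tilde{S}(t_f)$ \\
Fleming-Harrington & $\hat{S}(t_{f-1})^{p}\,\bigl[1 - \hat{S}(t_{f-1})\bigr]^{q}$ \\
\end{tabular}
```

With a weight of 1 at every failure time, this reduces to the ordinary log-rank test; the other choices put more emphasis on earlier failure times, where more subjects are at risk.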

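One of the arguments in the test is rho, which controls the weights: survdiff weights each event time by S(t)^rho, where S(t) is the Kaplan-Meier estimate of survival. With rho = 0, this is the log-rank test; with rho = 1, it is the Peto & Peto modification of the Gehan-Wilcoxon test (Sestelo 2017).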
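As a quick illustration of how rho changes the result, here is a small sketch that runs survdiff with both settings. It uses the lung dataset bundled with the survival package, not the bank data from the demonstration, so the numbers are only meant to show that the two tests can give different statistics.

```r
library(survival)

# Built-in lung data from the survival package, used only for illustration
fit_logrank <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 0)
fit_peto <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)

# Chi-square statistics under the two weighting schemes
c(log_rank = fit_logrank$chisq, peto_peto = fit_peto$chisq)
```

Because rho = 1 down-weights later event times, the two statistics differ whenever the gap between the curves is not constant over time.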

According to Kleinbaum and Klein, the weighting method used in the log-rank test should be an a priori decision, rather than one chosen by trial and error to get the desired result.

This is to avoid bias in the results (Kleinbaum and Klein 2012).

With that, let’s start the demonstration!

Demonstration

In this demonstration, I will be using this bank dataset from Kaggle.

Set up the environment

First, I will load the necessary packages into the environment.

pacman::p_load(tidyverse, survival, janitor, survminer)

With this, I will use the survival package to perform the log-rank test.

Import Data

Next, I will import the data into the environment.

df <- read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2022-09-10-kaplan-meier/data/Churn_Modelling.csv")

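Then, I will perform the same data wrangling as in my previous post, so that the cleaned column names used below (tenure, exited, gender and num_of_products) are available.

Refer to my previous post for the details.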
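The wrangling code is not repeated here, but a minimal sketch of the renaming step might look like the following. I am assuming that the raw columns use names such as Tenure, Exited, Gender and NumOfProducts, and that clean_names from the janitor package is what produces the snake_case names; the toy values are made up.

```r
library(janitor)

# Toy rows with assumed Kaggle-style column names (values are made up)
raw <- data.frame(
  Tenure = c(2, 1, 8),
  Exited = c(1, 0, 1),
  Gender = c("Female", "Female", "Male"),
  NumOfProducts = c(1, 1, 3)
)

# clean_names() converts the column names to snake_case
df_clean <- clean_names(raw)
names(df_clean)
```

On the real data, the equivalent step would be applying clean_names to df before fitting any of the models below.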

Log-rank Test

Comparing two survival curves

In this demonstration, I will compare the survival curves of the different genders.

Recall that to visualize the survival curves, I will first create the survfit object and then pass the created object into the ggsurvplot function.

surv_fit <- survfit(Surv(tenure, exited) ~ gender, data = df)

ggsurvplot(surv_fit)

From the graph, the survival curves of the two genders look visually different.

To confirm this, I will perform a log-rank test to check whether the survival curves are indeed statistically different.

As such, I will use the survdiff function to perform the test.

survdiff(Surv(tenure, exited) ~ gender, data = df)
Call:
survdiff(formula = Surv(tenure, exited) ~ gender, data = df)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
gender=Female 4543     1139      920      52.2       101
gender=Male   5457      898     1117      43.0       101

 Chisq= 101  on 1 degrees of freedom, p= <2e-16 

As the p-value is less than 0.05, we reject the null hypothesis. There is statistical evidence that the two survival curves are different from one another.
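To see where the numbers in the gender output come from, the short sketch below recomputes the (O-E)^2/E column and the p-value from the printed figures. The values are copied from the survdiff output above, so small differences are expected because the Expected column is printed rounded.

```r
# Values copied from the survdiff output for gender
observed <- c(Female = 1139, Male = 898)
expected <- c(Female = 920, Male = 1117)

# (O-E)^2/E column, approximately 52 and 43
oe2_over_e <- (observed - expected)^2 / expected
round(oe2_over_e, 1)

# p-value for a chi-square statistic of 101 on 1 degree of freedom
p_value <- pchisq(101, df = 1, lower.tail = FALSE)
p_value < 2e-16
```

The p-value is far smaller than 0.05, which is why R prints it as "<2e-16" rather than an exact figure.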

Comparing more than two survival curves

Similarly, the survdiff function can also be used when there are more than two survival curves.

For example, I would like to find out whether the survival curves are different when the number of products held by the customers differs.

survdiff(Surv(tenure, exited) ~ num_of_products, data = df)
Call:
survdiff(formula = Surv(tenure, exited) ~ num_of_products, data = df)

                     N Observed Expected (O-E)^2/E (O-E)^2/V
num_of_products=1 5084     1409   1027.2       142       304
num_of_products=2 4590      348    941.4       374       737
num_of_products=3  266      220     54.7       499       545
num_of_products=4   60       60     13.7       157       169

 Chisq= 1246  on 3 degrees of freedom, p= <2e-16 

As shown in the result above, the p-value is again far below 0.05, so we reject the null hypothesis and conclude that at least one of the survival curves is different from the rest.

However, this test does not tell us which groups differ, or whether the survival curves are similar for some of the groups.

For example, if we plot out the survival curve for customers that held a different number of products, it seems like the survival curves for customers who held 3 products and customers who held 4 products are rather similar.

surv_fit <- survfit(Surv(tenure, exited) ~ num_of_products, data = df)

ggsurvplot(surv_fit)

Pairwise survival curves

To verify the hypothesis above, I will use the pairwise_survdiff function from the survminer package to generate the pairwise results.

pairwise_survdiff(Surv(tenure, exited) ~ num_of_products, data = df)

    Pairwise comparisons using Log-Rank test 

data:  df and num_of_products 

  1      2      3   
2 <2e-16 -      -   
3 <2e-16 <2e-16 -   
4 <2e-16 <2e-16 0.43

P value adjustment method: BH 

From the results above, we fail to reject the null hypothesis when comparing the survival curves for customers who held 3 and 4 products. There is no statistical evidence that the survival curves for customers with 3 and 4 products are different.
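Conceptually, a pairwise comparison runs a two-group log-rank test for every pair of groups and then adjusts the p-values for multiple testing. The sketch below does this by hand with survdiff and p.adjust on the veteran dataset from the survival package, whose celltype variable also has four levels. It is only an approximation of the idea and not necessarily how pairwise_survdiff is implemented internally.

```r
library(survival)

# Built-in veteran data; celltype has four levels, giving six pairs
lv <- levels(veteran$celltype)
level_pairs <- combn(lv, 2, simplify = FALSE)

# Two-group log-rank test for each pair of cell types
p_raw <- sapply(level_pairs, function(pr) {
  two_groups <- droplevels(veteran[veteran$celltype %in% pr, ])
  fit <- survdiff(Surv(time, status) ~ celltype, data = two_groups)
  pchisq(fit$chisq, df = 1, lower.tail = FALSE)
})
names(p_raw) <- sapply(level_pairs, paste, collapse = " vs ")

# Benjamini-Hochberg adjustment, the "BH" method named in the output
p_adj <- p.adjust(p_raw, method = "BH")
round(cbind(raw = p_raw, BH = p_adj), 4)
```

The adjusted p-values are never smaller than the raw ones, which is the price paid for making several comparisons at once.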

By default, the function performs the log-rank test (i.e. rho = 0).

To perform the Peto & Peto test instead, we just need to set rho to 1 as shown below.

pairwise_survdiff(Surv(tenure, exited) ~ num_of_products, data = df, rho = 1)

    Pairwise comparisons using Peto & Peto test 

data:  df and num_of_products 

  1      2      3   
2 <2e-16 -      -   
3 <2e-16 <2e-16 -   
4 <2e-16 <2e-16 0.52

P value adjustment method: BH 

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by George Milton

References

Kleinbaum, David G., and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer.
Sestelo, Marta. 2017. A Short Course on Survival Analysis Applied to the Financial Industry. https://bookdown.org/sestelo/sa_financial/comparing-survival-curves.html.