Discriminant analysis

Machine Learning Supervised Learning

Differentiating the groups!

Jasper Lok https://jasperlok.netlify.app/
06-17-2024

Photo by Anne Nygård on Unsplash

In this post, I will be exploring discriminant analysis.

What is discriminant analysis?

Discriminant analysis is a statistical technique used in research that aims to classify or predict a categorical dependent variable based on one or more continuous or binary independent variables. It is often used when the dependent variable is non-metric (categorical) and the independent variables are metric (continuous or binary) (Hassan 2024).
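To make this concrete (this formula is standard LDA theory, not stated in the sources above): linear discriminant analysis applies Bayes' theorem to class-conditional Gaussian densities with a shared covariance matrix, and assigns an observation \(x\) to the class \(k\) with the largest discriminant score:

```latex
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k
```

where \(\mu_k\) is the mean vector of class \(k\), \(\Sigma\) is the pooled covariance matrix shared by all classes, and \(\pi_k\) is the prior probability of class \(k\).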

Pros and cons of using discriminant analysis

Hassan (2024) lists the pros and cons of discriminant analysis.

Pros

Cons

Why use discriminant analysis when we have logistic regression?

When I first came across this technique, I was rather curious why we would need another method when we already have logistic regression. Below are the explanations from one of the textbooks I was reading (Gareth et al. 2021):

- When the classes are well separated, the parameter estimates from logistic regression can be surprisingly unstable, while discriminant analysis does not suffer from this problem.
- If the sample size is small and the distribution of the predictors is approximately normal in each class, discriminant analysis tends to be more stable than logistic regression.
- Discriminant analysis extends naturally to problems with more than two response classes.

Different types of discriminant analysis

Apart from linear discriminant analysis, below are some other types of discriminant models (Brownlee 2020):

Quadratic Discriminant Analysis: Each class uses its own estimate of variance (or covariance when there are multiple input variables).
Flexible Discriminant Analysis: Non-linear combinations of inputs, such as splines, are used.
Regularized Discriminant Analysis: Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
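As a quick illustration of the quadratic variant, QDA can be fitted with the qda() function from MASS in the same way as lda(). A minimal sketch on the built-in iris dataset (used here purely for illustration; it is not part of this post's claim data):

```r
library(MASS)

# QDA estimates a separate covariance matrix per class,
# so the decision boundaries are quadratic rather than linear
qda_fit <- qda(Species ~ ., data = iris)

# predicted classes and posterior probabilities
qda_pred <- predict(qda_fit, iris)
mean(qda_pred$class == iris$Species)  # in-sample accuracy
```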

Important things to take note of before performing discriminant analysis

Below are some of the important considerations before performing discriminant analysis (Bobbitt 2020). In particular, LDA assumes that the predictors are approximately multivariate normal within each class, that the classes share a common covariance matrix, and that the observations are independent of one another.

Demonstration

In this demonstration, I will be using several methods to fit a discriminant analysis.

pacman::p_load(tidyverse, tidymodels, janitor, MASS, discrim)

# MASS also exports a select() function, which masks dplyr::select()
# once MASS is loaded, so restore the dplyr version here
select <- dplyr::select

Import Data

I will be using a travel insurance claim dataset in this demonstration.

df <- 
  read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2021-08-31-naive-bayes/data/travel%20insurance.csv") %>% 
  clean_names() %>% 
  select(-c(agency, product_name, gender, destination))

Model Building

Method 1: Use lda function from MASS package

First, I will build a discriminant model by using the lda() function from the MASS package.

lda_fit <- 
  lda(claim ~ .
      ,data = df)

lda_fit
Call:
lda(claim ~ ., data = df)

Prior probabilities of groups:
        No        Yes 
0.98536146 0.01463854 

Group means:
    agency_typeTravel Agency distribution_channelOnline  duration
No                 0.7297072                  0.9825318  48.40385
Yes                0.3624595                  0.9816613 110.78857
    net_sales commision_in_value      age
No   39.90466           9.571755 39.98982
Yes  94.37442          25.846419 38.63430

Coefficients of linear discriminants:
                                     LD1
agency_typeTravel Agency   -1.2411359711
distribution_channelOnline -0.7002360259
duration                    0.0007684786
net_sales                   0.0139448575
commision_in_value          0.0041677590
age                        -0.0175789923

This is how to interpret the model output: the prior probabilities show the proportion of each class in the data, the group means show the average of each predictor within each class, and the coefficients of the linear discriminants give the linear combination of predictors used to form the decision rule.

If we pass the data into the fitted model to generate the predictions, below is the output:

prediction_lda <-
  predict(lda_fit) %>% 
  as.data.frame()

head(prediction_lda)
  class posterior.No posterior.Yes       LD1
1    No    0.9988726  0.0011273705 -1.243361
2    No    0.9985697  0.0014302982 -1.067571
3    No    0.9990595  0.0009405083 -1.377185
4    No    0.9989092  0.0010907774 -1.267729
5    No    0.9987795  0.0012204927 -1.184744
6    No    0.9989424  0.0010575667 -1.290563
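The LD1 column above is just a linear combination of the centered predictors and the discriminant coefficients. A minimal sketch of this relationship, using the built-in iris dataset rather than the claim data (the centering at the prior-weighted group means mirrors what predict() for lda objects does internally, to the best of my understanding):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)

# predictor matrix and the discriminant scores returned by predict()
X <- as.matrix(iris[, 1:4])
scores <- predict(fit)$x

# recompute the scores by hand: center the predictors at the
# prior-weighted group means, then multiply by the scaling matrix
ctr <- colSums(fit$prior * fit$means)
scores_manual <- scale(X, center = ctr, scale = FALSE) %*% fit$scaling

max(abs(scores - scores_manual))  # should be numerically zero
```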

Note that the predict() output contains three components: the predicted class, the posterior probability of each class, and the LD1 score of each observation.

To evaluate the model, we just need to merge the predictions with the dataset and pass them to the necessary model evaluation function as shown below.

prediction_lda %>% 
  bind_cols(df) %>% 
  mutate(claim = as.factor(claim)) %>% 
  roc_auc(truth = claim
          ,posterior.No)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.781

Method 2: Fit by using tidymodels approach

Next, I will be exploring how to use the tidymodels packages to build discriminant analysis models.

First, we will split the dataset into training and testing datasets.

df_splits <- initial_split(df, prop = 0.6, strata = claim)
df_train <- training(df_splits)
df_test <- testing(df_splits)
df_fold <- vfold_cv(df_train)
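Stratifying on claim keeps the proportion of claims roughly equal in the training and testing sets, which matters here because the classes are heavily imbalanced. A minimal sketch of what the strata argument does, using the built-in iris dataset instead of the claim data:

```r
library(rsample)

set.seed(123)
splits <- initial_split(iris, prop = 0.6, strata = Species)

# each class keeps roughly its original share (1/3) in the training set
prop.table(table(training(splits)$Species))
```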

Next, I will define the recipe to be used in the model building later.

gen_recipe <-
  recipe(claim ~ .
         ,data = df_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = .5) %>% 
  step_zv(all_predictors())

In the recipe, I use the pre-processing steps suggested in the Tidy Modeling with R book.
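To check what such steps actually do to the data, a recipe can be prepped and baked outside of a workflow. A minimal sketch on the built-in mtcars dataset (mtcars has several highly correlated columns, so step_corr removes some of them; this is purely an illustration, not part of the claim analysis):

```r
library(recipes)

rec <-
  recipe(am ~ ., data = mtcars) %>%
  # drop predictors with absolute pairwise correlation above 0.5
  step_corr(all_numeric_predictors(), threshold = .5) %>%
  # drop predictors with zero variance
  step_zv(all_predictors())

baked <- rec %>% prep() %>% bake(new_data = NULL)

# fewer columns than the original data, because correlated
# predictors have been removed
ncol(baked) < ncol(mtcars)
```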

I will also define all the model specifications.

# linear discriminant analysis
lda_specs <-
  discrim_linear() %>% 
  set_engine("MASS")

# quadratic discriminant analysis
qda_specs <-
  discrim_quad() %>% 
  set_engine("MASS")

# flexible discriminant analysis
fda_specs <- 
  discrim_flexible() %>% 
  set_engine("earth")

# regularized discriminant analysis
rda_specs <-
  discrim_regularized() %>% 
  set_engine("klaR")

After that, I will combine the recipe and model specifications into a workflow.

Instead of building different workflows for different models, I will use the functions from workflowsets package.

all_wf <-
  workflow_set(
    preproc = list(gen_recipe)
    ,models = list(linear = lda_specs
                   ,quad = qda_specs
                   ,flexible = fda_specs
                   ,regularized = rda_specs)
  )

all_wf
# A workflow set/tibble: 4 × 4
  wflow_id           info             option    result    
  <chr>              <list>           <list>    <list>    
1 recipe_linear      <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_quad        <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_flexible    <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_regularized <tibble [1 × 4]> <opts[0]> <list [0]>

Next, I will fit the different workflows through cross-validation.

all_fold <- 
  all_wf %>% 
  option_add(control = control_grid(save_workflow = TRUE)) %>% 
  workflow_map(seed = 1234
               ,resamples = df_fold
               ,grid = 5)

To find out which fitted model has the best performance, we can rank the cross-validation results by ROC AUC.

all_fold %>% 
  rank_results() %>% 
  filter(.metric == "roc_auc")
# A tibble: 4 × 9
  wflow_id      .config .metric  mean std_err     n preprocessor model
  <chr>         <chr>   <chr>   <dbl>   <dbl> <int> <chr>        <chr>
1 recipe_linear Prepro… roc_auc 0.783 0.00833    10 recipe       disc…
2 recipe_flexi… Prepro… roc_auc 0.772 0.00849    10 recipe       disc…
3 recipe_quad   Prepro… roc_auc 0.753 0.0111     10 recipe       disc…
4 recipe_regul… Prepro… roc_auc 0.733 0.00693    10 recipe       disc…
# ℹ 1 more variable: rank <int>

As shown in the results above, the linear discriminant model has the best performance among the fitted models.

all_fold %>% 
  extract_workflow_set_result("recipe_linear")
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits               id     .metrics         .notes          
   <list>               <chr>  <list>           <list>          
 1 <split [34195/3800]> Fold01 <tibble [3 × 4]> <tibble [0 × 3]>
 2 <split [34195/3800]> Fold02 <tibble [3 × 4]> <tibble [0 × 3]>
 3 <split [34195/3800]> Fold03 <tibble [3 × 4]> <tibble [0 × 3]>
 4 <split [34195/3800]> Fold04 <tibble [3 × 4]> <tibble [0 × 3]>
 5 <split [34195/3800]> Fold05 <tibble [3 × 4]> <tibble [0 × 3]>
 6 <split [34196/3799]> Fold06 <tibble [3 × 4]> <tibble [0 × 3]>
 7 <split [34196/3799]> Fold07 <tibble [3 × 4]> <tibble [0 × 3]>
 8 <split [34196/3799]> Fold08 <tibble [3 × 4]> <tibble [0 × 3]>
 9 <split [34196/3799]> Fold09 <tibble [3 × 4]> <tibble [0 × 3]>
10 <split [34196/3799]> Fold10 <tibble [3 × 4]> <tibble [0 × 3]>

After that, I will select the fitted model with the best result and measure its performance on the testing dataset.

all_test <-
  all_fold %>% 
  extract_workflow("recipe_linear") %>%
  finalize_workflow(
    all_fold %>% 
      extract_workflow_set_result("recipe_linear") %>% 
      select_best(metric = "roc_auc")
  ) %>%
  last_fit(split = df_splits)

all_test %>% 
  collect_predictions() %>% 
  roc_auc(`.pred_No`, truth = claim)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.775

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Randy Fath on Unsplash

References

Bobbitt, Zach. 2020. “Introduction to Linear Discriminant Analysis.” https://www.statology.org/linear-discriminant-analysis/.
Brownlee, Jason. 2020. “Linear Discriminant Analysis for Machine Learning.” https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/.
Gareth, James, Witten Daniela, Hastie Trevor, and Tibshirani Robert. 2021. An Introduction to Statistical Learning. Springer.
Hassan, Muhammad. 2024. “Discriminant Analysis – Methods, Types and Examples.” https://researchmethod.net/discriminant-analysis/#google_vignette.