Differentiating the groups!

Photo by Anne Nygård on Unsplash
In this post, I will be exploring discriminant analysis.
Discriminant analysis is a statistical technique used in research that aims to classify or predict a categorical dependent variable based on one or more continuous or binary independent variables. It is often used when the dependent variable is non-metric (categorical) and the independent variables are metric (continuous or binary) (Hassan 2024).
Hassan (2024) listed the pros and cons of discriminant analysis.

Pros:

- Multiclass Classification: Discriminant analysis can handle situations where there are more than two classes in the dependent variable, which is a limitation for some other methods such as logistic regression.
- Understanding Group Differences: Discriminant analysis does not just predict group membership; it also provides information on which variables are important discriminators between groups. This makes it a useful tool for exploratory research to understand the differences between groups.
- Efficient with Many Variables: Discriminant analysis can handle a large number of predictor variables efficiently. It becomes useful when the number of variables is very large, potentially exceeding the number of observations.
- Dimensionality Reduction: Linear Discriminant Analysis (LDA) can be used for dimensionality reduction, i.e. reducing the number of variables in a dataset while preserving as much information as possible.
- Prior Probabilities: Discriminant analysis allows for the inclusion of prior probabilities, meaning that researchers can incorporate prior knowledge about the proportions of observations in each group.
- Model Interpretability: The model produced by discriminant analysis is relatively interpretable compared to some other machine learning models, such as neural networks. The weights of the features in the model can provide an indication of their relative importance.

Cons:

- Assumption of Normality: Discriminant analysis assumes that the predictors are normally distributed. If this assumption is violated, the performance of the model may be affected.
- Assumption of Equal Covariance Matrices: Discriminant analysis, particularly Linear Discriminant Analysis (LDA), assumes that the groups being compared have equal covariance matrices. If this assumption is not met, it may lead to inaccuracies in classification.
- Multicollinearity: Discriminant analysis may not work well if there is high multicollinearity among the predictor variables. This situation can lead to unstable estimates of the coefficients and difficulties in interpreting the results.
- Outliers: Discriminant analysis is sensitive to outliers, which can have a large influence on the classification function.
- Overfitting: Like many statistical techniques, discriminant analysis can result in overfitting if the model is too complex. Overfitting happens when the model fits the training data very well but performs poorly on new, unseen data.
- Limited to Linear Relationships: Linear Discriminant Analysis (LDA) assumes a linear relationship between predictor variables and the log-odds of the dependent variable. This limits its utility in scenarios where relationships are complex or nonlinear. In such cases, Quadratic Discriminant Analysis (QDA) or other non-linear methods might be more appropriate.
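To make the multiclass and dimensionality reduction points concrete, below is a minimal sketch using the built-in iris dataset as a stand-in (the travel insurance data used later in this post is not loaded here). With three classes and four predictors, LDA projects the data onto at most two linear discriminants.

```r
library(MASS)

# LDA handles a three-class response directly
iris_lda <- lda(Species ~ ., data = iris)

# The discriminant scores reduce four predictors to
# (number of classes - 1) = 2 dimensions
scores <- predict(iris_lda)$x
dim(scores)

# Proportion of between-group variance captured by each discriminant
prop_trace <- iris_lda$svd^2 / sum(iris_lda$svd^2)
round(prop_trace, 3)
```

The first discriminant typically captures the bulk of the between-group separation, which is why plotting LD1 against LD2 is a common way to visualise group differences.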
When I found out about this technique, I was rather curious why we would need another method when we already have logistic regression. Below are the explanations from one of the textbooks I was reading (Gareth et al. 2021):
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are unstable.
- This method may be more accurate than logistic regression if the distribution of the predictors X is approximately normal in each of the classes and the sample size is small.
- The methods in this section can be naturally extended to the case of more than two response classes.
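The first point can be illustrated on made-up data: when the two classes are perfectly separated, the logistic regression slope estimate diverges, while lda() stays stable because it only estimates group means and a pooled variance. The data below are simulated purely for illustration.

```r
library(MASS)

set.seed(42)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))  # no overlap between groups
y <- factor(rep(c("No", "Yes"), each = 20))
sep_df <- data.frame(x = x, y = y)

# glm() warns that fitted probabilities are numerically 0 or 1,
# and the slope estimate blows up
glm_fit <- suppressWarnings(glm(y ~ x, data = sep_df, family = binomial))
coef(glm_fit)[["x"]]   # an implausibly large slope

# LDA estimates the group means and a pooled variance, so it remains stable
lda_fit <- lda(y ~ x, data = sep_df)
lda_fit$means
```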
Apart from the linear discriminant analysis, below are the different types of discriminant models (Brownlee 2020):
| Model | Remarks |
|---|---|
| Quadratic Discriminant Analysis | Each class uses its own estimate of variance (or covariance when there are multiple input variables) |
| Flexible Discriminant Analysis | Non-linear combinations of inputs are used, such as splines |
| Regularized Discriminant Analysis | Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA |
Below are some of the important considerations before performing discriminant analysis (Bobbitt 2020):

- The response variable is categorical
- The predictor variables follow a normal distribution
- Each predictor variable has the same variance
- Account for extreme outliers
In this demonstration, I will fit discriminant analysis models using several methods. I will be using the travel insurance claim dataset.
df <-
read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2021-08-31-naive-bayes/data/travel%20insurance.csv") %>%
clean_names() %>%
select(-c(agency, product_name, gender, destination))
lda function from MASS package

First, I will build a discriminant model by using the lda function.
lda_fit <-
lda(claim ~ .
,data = df)
lda_fit
Call:
lda(claim ~ ., data = df)
Prior probabilities of groups:
No Yes
0.98536146 0.01463854
Group means:
agency_typeTravel Agency distribution_channelOnline duration
No 0.7297072 0.9825318 48.40385
Yes 0.3624595 0.9816613 110.78857
net_sales commision_in_value age
No 39.90466 9.571755 39.98982
Yes 94.37442 25.846419 38.63430
Coefficients of linear discriminants:
LD1
agency_typeTravel Agency -1.2411359711
distribution_channelOnline -0.7002360259
duration 0.0007684786
net_sales 0.0139448575
commision_in_value 0.0041677590
age -0.0175789923
According to this article, this is how to interpret the model output:

- Prior probabilities of groups: These represent the proportions of each class in the training set
- Group means: These display the mean value of each predictor variable within each class
- Coefficients of linear discriminants: These display the linear combination of predictor variables that is used to form the decision rule of the LDA model
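To make the coefficients concrete: the LD1 score of an observation is just a linear combination of its centred predictors. In MASS, predict() centres the predictors at the prior-weighted average of the group means and then multiplies by the discriminant coefficients. A sketch reproducing this by hand on the built-in iris data (as a stand-in for the travel insurance data):

```r
library(MASS)

iris_fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# Centre at the prior-weighted average of the group means, then apply
# the coefficients of the linear discriminants
centre <- colSums(iris_fit$prior * iris_fit$means)
manual_scores <- scale(X, center = centre, scale = FALSE) %*% iris_fit$scaling

all.equal(unname(manual_scores), unname(predict(iris_fit)$x))  # should be TRUE
```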
If we pass the data into the fitted model to generate the predictions, below is the output:
prediction_lda <-
predict(lda_fit) %>%
as.data.frame()
head(prediction_lda)
class posterior.No posterior.Yes LD1
1 No 0.9988726 0.0011273705 -1.243361
2 No 0.9985697 0.0014302982 -1.067571
3 No 0.9990595 0.0009405083 -1.377185
4 No 0.9989092 0.0010907774 -1.267729
5 No 0.9987795 0.0012204927 -1.184744
6 No 0.9989424 0.0010575667 -1.290563
Note:

- class is the predicted class
- posterior.No and posterior.Yes are the posterior probabilities that the observation belongs to each class
To evaluate the model, we just need to merge the predictions with the dataset and pass them to the necessary model evaluation function as shown below.
prediction_lda %>%
bind_cols(df) %>%
mutate(claim = as.factor(claim)) %>%
roc_auc(truth = claim
,posterior.No)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.781
tidymodels approach

Next, I will be exploring how to use the tidymodels packages to build discriminant analysis models.
First, we will split the dataset into training and testing datasets.
df_splits <- initial_split(df, prop = 0.6, strata = claim)
df_train <- training(df_splits)
df_test <- testing(df_splits)
df_fold <- vfold_cv(df_train)
Next, I will define the recipe for the model building.
gen_recipe <-
recipe(claim ~ .
,data = df_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = .5) %>%
step_zv(all_predictors())
In the recipe, I use the pre-processing steps suggested in the Tidy Modeling with R book.
I will also define all the model specifications.
# linear discriminant analysis
lda_specs <-
discrim_linear() %>%
set_engine("MASS")
# quadratic discriminant analysis
qda_specs <-
discrim_quad() %>%
set_engine("MASS")
# flexible discriminant analysis
fda_specs <-
discrim_flexible() %>%
set_engine("earth")
# regularized discriminant analysis
rda_specs <-
discrim_regularized() %>%
set_engine("klaR")
After that, I will combine the recipe and model specifications into workflows. Instead of building a separate workflow for each model, I will use the functions from the workflowsets package.
all_wf <-
workflow_set(
preproc = list(gen_recipe)
,models = list(linear = lda_specs
,quad = qda_specs
,flexible = fda_specs
,regularized = rda_specs)
)
all_wf
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 recipe_linear <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_quad <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_flexible <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_regularized <tibble [1 × 4]> <opts[0]> <list [0]>
Next, I will perform cross-validation.
all_fold <-
all_wf %>%
option_add(control = control_grid(save_workflow = TRUE)) %>%
workflow_map(seed = 1234
,resamples = df_fold
,grid = 5)
To find out which fitted model has the best performance, we can look at the ROC AUC metric.
all_fold %>%
rank_results() %>%
filter(.metric == "roc_auc")
# A tibble: 4 × 9
wflow_id .config .metric mean std_err n preprocessor model
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr>
1 recipe_linear Prepro… roc_auc 0.783 0.00833 10 recipe disc…
2 recipe_flexi… Prepro… roc_auc 0.772 0.00849 10 recipe disc…
3 recipe_quad Prepro… roc_auc 0.753 0.0111 10 recipe disc…
4 recipe_regul… Prepro… roc_auc 0.733 0.00693 10 recipe disc…
# ℹ 1 more variable: rank <int>
As shown in the results above, the linear discriminant model has the best result among the fitted models.
all_fold %>%
extract_workflow_set_result("recipe_linear")
# Resampling results
# 10-fold cross-validation
# A tibble: 10 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [34195/3800]> Fold01 <tibble [3 × 4]> <tibble [0 × 3]>
2 <split [34195/3800]> Fold02 <tibble [3 × 4]> <tibble [0 × 3]>
3 <split [34195/3800]> Fold03 <tibble [3 × 4]> <tibble [0 × 3]>
4 <split [34195/3800]> Fold04 <tibble [3 × 4]> <tibble [0 × 3]>
5 <split [34195/3800]> Fold05 <tibble [3 × 4]> <tibble [0 × 3]>
6 <split [34196/3799]> Fold06 <tibble [3 × 4]> <tibble [0 × 3]>
7 <split [34196/3799]> Fold07 <tibble [3 × 4]> <tibble [0 × 3]>
8 <split [34196/3799]> Fold08 <tibble [3 × 4]> <tibble [0 × 3]>
9 <split [34196/3799]> Fold09 <tibble [3 × 4]> <tibble [0 × 3]>
10 <split [34196/3799]> Fold10 <tibble [3 × 4]> <tibble [0 × 3]>
After that, I will select the fitted model with the best result and measure its performance on the testing dataset.
all_test <-
all_fold %>%
extract_workflow("recipe_linear") %>%
finalize_workflow(
all_fold %>%
extract_workflow_set_result("recipe_linear") %>%
select_best(metric = "roc_auc")
) %>%
last_fit(split = df_splits)
all_test %>%
collect_predictions() %>%
roc_auc(`.pred_No`, truth = claim)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.775
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Randy Fath on Unsplash