Differentiating the groups!

Photo by Anne Nygård on Unsplash
In this post, I will be exploring discriminant analysis.
Discriminant analysis is a statistical technique used in research that aims to classify or predict a categorical dependent variable based on one or more continuous or binary independent variables. It is often used when the dependent variable is non-metric (categorical) and the independent variables are metric (continuous or binary) (Hassan 2024).
Hassan (2024) listed the pros and cons of discriminant analysis.

Pros:

- Multiclass Classification: Discriminant analysis can handle situations where there are more than two classes in the dependent variable, which is a limitation for some other methods such as logistic regression.
- Understanding Group Differences: Discriminant analysis does not just predict group membership; it also provides information on which variables are important discriminators between groups. This makes it a useful tool for exploratory research to understand the differences between groups.
- Efficient with Many Variables: Discriminant analysis can handle a large number of predictor variables efficiently. It becomes useful when the number of variables is very large, potentially exceeding the number of observations.
- Dimensionality Reduction: Linear Discriminant Analysis (LDA) can be used for dimensionality reduction, i.e. reducing the number of variables in a dataset while preserving as much information as possible.
- Prior Probabilities: Discriminant analysis allows for the inclusion of prior probabilities, meaning that researchers can incorporate prior knowledge about the proportions of observations in each group.
- Model Interpretability: The model produced by discriminant analysis is relatively interpretable compared to some other machine learning models, such as neural networks. The weights of the features in the model can provide an indication of their relative importance.

Cons:

- Assumption of Normality: Discriminant analysis assumes that the predictors are normally distributed. If this assumption is violated, the performance of the model may be affected.
- Assumption of Equal Covariance Matrices: Discriminant analysis, particularly Linear Discriminant Analysis (LDA), assumes that the groups being compared have equal covariance matrices. If this assumption is not met, it may lead to inaccuracies in classification.
- Multicollinearity: Discriminant analysis may not work well if there is high multicollinearity among the predictor variables. This situation can lead to unstable estimates of the coefficients and difficulties in interpreting the results.
- Outliers: Discriminant analysis is sensitive to outliers, which can have a large influence on the classification function.
- Overfitting: Like many statistical techniques, discriminant analysis can result in overfitting if the model is too complex. Overfitting happens when the model fits the training data very well but performs poorly on new, unseen data.
- Limited to Linear Relationships: Linear Discriminant Analysis (LDA) assumes a linear relationship between predictor variables and the log-odds of the dependent variable. This limits its utility in scenarios where relationships are complex or nonlinear. In such cases, Quadratic Discriminant Analysis (QDA) or other non-linear methods might be more appropriate.
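To make the multiclass and dimensionality reduction points concrete, below is a minimal sketch using the built-in iris dataset as a stand-in (the travel insurance data used later in this post is not loaded here). With three classes and four predictors, LDA projects the data onto at most two linear discriminants.

```r
library(MASS)

# LDA handles a three-class response directly
iris_lda <- lda(Species ~ ., data = iris)

# The discriminant scores reduce four predictors to
# (number of classes - 1) = 2 dimensions
scores <- predict(iris_lda)$x
dim(scores)

# Proportion of between-group variance captured by each discriminant
prop_trace <- iris_lda$svd^2 / sum(iris_lda$svd^2)
round(prop_trace, 3)
```

The first discriminant typically captures the bulk of the between-group separation, which is why plotting LD1 against LD2 is a common way to visualise group differences.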
When I found out about this technique, I was rather curious why we would need another method when we already have logistic regression. Below are the explanations from one of the textbooks I was reading (Gareth et al. 2021):
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are unstable.
- This method may be more accurate than logistic regression if the distribution of the predictors X is approximately normal in each of the classes and the sample size is small.
- The methods in this section can be naturally extended to the case of more than two response classes.
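The first point can be illustrated on made-up data: when the two classes are perfectly separated, the logistic regression slope estimate diverges, while lda() stays stable because it only estimates group means and a pooled variance. The data below are simulated purely for illustration.

```r
library(MASS)

set.seed(42)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))  # no overlap between groups
y <- factor(rep(c("No", "Yes"), each = 20))
sep_df <- data.frame(x = x, y = y)

# glm() warns that fitted probabilities are numerically 0 or 1,
# and the slope estimate blows up
glm_fit <- suppressWarnings(glm(y ~ x, data = sep_df, family = binomial))
coef(glm_fit)[["x"]]   # an implausibly large slope

# LDA estimates the group means and a pooled variance, so it remains stable
lda_fit <- lda(y ~ x, data = sep_df)
lda_fit$means
```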
Apart from the linear discriminant analysis, below are the different types of discriminant models (Brownlee 2020):
| Model | Remarks |
|---|---|
| Quadratic Discriminant Analysis | Each class uses its own estimate of variance (or covariance when there are multiple input variables) |
| Flexible Discriminant Analysis | Non-linear combinations of inputs are used, such as splines |
| Regularized Discriminant Analysis | Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA |
Below are some of the important considerations before performing discriminant analysis (Bobbitt 2020):

- The response variable is categorical
- The predictor variables follow a normal distribution
- Each predictor variable has the same variance
- Account for extreme outliers
In this demonstration, I will fit discriminant analysis models using several methods. I will be using the travel insurance claim dataset.
df <-
read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2021-08-31-naive-bayes/data/travel%20insurance.csv") %>%
clean_names() %>%
select(-c(agency, product_name, gender, destination))
lda function from MASS package

First, I will build a discriminant model by using the lda function.
lda_fit <-
lda(claim ~ .
,data = df)
lda_fit
Call:
lda(claim ~ ., data = df)
Prior probabilities of groups:
No Yes
0.98536146 0.01463854
Group means:
agency_typeTravel Agency distribution_channelOnline duration
No 0.7297072 0.9825318 48.40385
Yes 0.3624595 0.9816613 110.78857
net_sales commision_in_value age
No 39.90466 9.571755 39.98982
Yes 94.37442 25.846419 38.63430
Coefficients of linear discriminants:
LD1
agency_typeTravel Agency -1.2411359711
distribution_channelOnline -0.7002360259
duration 0.0007684786
net_sales 0.0139448575
commision_in_value 0.0041677590
age -0.0175789923
According to this article, this is how to interpret the model output:

- Prior probabilities of groups: These represent the proportions of each class in the training set
- Group means: These display the mean value of each predictor variable within each class
- Coefficients of linear discriminants: These display the linear combination of predictor variables that is used to form the decision rule of the LDA model
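To make the coefficients concrete: the LD1 score of an observation is just a linear combination of its centred predictors. In MASS, predict() centres the predictors at the prior-weighted average of the group means and then multiplies by the discriminant coefficients. A sketch reproducing this by hand on the built-in iris data (as a stand-in for the travel insurance data):

```r
library(MASS)

iris_fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# Centre at the prior-weighted average of the group means, then apply
# the coefficients of the linear discriminants
centre <- colSums(iris_fit$prior * iris_fit$means)
manual_scores <- scale(X, center = centre, scale = FALSE) %*% iris_fit$scaling

all.equal(unname(manual_scores), unname(predict(iris_fit)$x))  # should be TRUE
```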
If we pass the data into the fitted model to generate the predictions, below is the output:
prediction_lda <-
predict(lda_fit) %>%
as.data.frame()
head(prediction_lda)
class posterior.No posterior.Yes LD1
1 No 0.9988726 0.0011273705 -1.243361
2 No 0.9985697 0.0014302982 -1.067571
3 No 0.9990595 0.0009405083 -1.377185
4 No 0.9989092 0.0010907774 -1.267729
5 No 0.9987795 0.0012204927 -1.184744
6 No 0.9989424 0.0010575667 -1.290563
Note:

- class is the predicted class
- posterior.No and posterior.Yes are the posterior probabilities that the observation belongs to each class
To evaluate the model, we just need to merge the predictions with the dataset and pass them to the necessary model evaluation function as shown below.
prediction_lda %>%
bind_cols(df) %>%
mutate(claim = as.factor(claim)) %>%
roc_auc(truth = claim
,posterior.No)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.781
tidymodels approach

Next, I will be exploring how to use the tidymodels packages to build discriminant analysis models.
First, we will split the dataset into training and testing datasets.
df_splits <- initial_split(df, prop = 0.6, strata = claim)
df_train <- training(df_splits)
df_test <- testing(df_splits)
df_fold <- vfold_cv(df_train)
Next, I will define the recipe for the model building.
gen_recipe <-
recipe(claim ~ .
,data = df_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = .5) %>%
step_zv(all_predictors())
In the recipe, I use the pre-processing steps suggested in the Tidy Modeling with R book.
I will also define all the model specifications.
# linear discriminant analysis
lda_specs <-
discrim_linear() %>%
set_engine("MASS")
# quadratic discriminant analysis
qda_specs <-
discrim_quad() %>%
set_engine("MASS")
# flexible discriminant analysis
fda_specs <-
discrim_flexible() %>%
set_engine("earth")
# regularized discriminant analysis
rda_specs <-
discrim_regularized() %>%
set_engine("klaR")
After that, I will combine the recipe and model specifications into workflows. Instead of building a separate workflow for each model, I will use the functions from the workflowsets package.
all_wf <-
workflow_set(
preproc = list(gen_recipe)
,models = list(linear = lda_specs
,quad = qda_specs
,flexible = fda_specs
,regularized = rda_specs)
)
all_wf
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 recipe_linear <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_quad <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_flexible <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_regularized <tibble [1 × 4]> <opts[0]> <list [0]>
Next, I will perform cross-validation.
all_fold <-
all_wf %>%
option_add(control = control_grid(save_workflow = TRUE)) %>%
workflow_map(seed = 1234
,resamples = df_fold
,grid = 5)
To find out which fitted model has the best performance, we can look at the ROC AUC metric.
all_fold %>%
rank_results() %>%
filter(.metric == "roc_auc")
# A tibble: 4 × 9
wflow_id .config .metric mean std_err n preprocessor model
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr>
1 recipe_linear Prepro… roc_auc 0.783 0.00833 10 recipe disc…
2 recipe_flexi… Prepro… roc_auc 0.772 0.00849 10 recipe disc…
3 recipe_quad Prepro… roc_auc 0.753 0.0111 10 recipe disc…
4 recipe_regul… Prepro… roc_auc 0.733 0.00693 10 recipe disc…
# ℹ 1 more variable: rank <int>
As shown in the results above, the linear discriminant model has the best result among the fitted models.
all_fold %>%
extract_workflow_set_result("recipe_linear")
# Resampling results
# 10-fold cross-validation
# A tibble: 10 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [34195/3800]> Fold01 <tibble [3 × 4]> <tibble [0 × 3]>
2 <split [34195/3800]> Fold02 <tibble [3 × 4]> <tibble [0 × 3]>
3 <split [34195/3800]> Fold03 <tibble [3 × 4]> <tibble [0 × 3]>
4 <split [34195/3800]> Fold04 <tibble [3 × 4]> <tibble [0 × 3]>
5 <split [34195/3800]> Fold05 <tibble [3 × 4]> <tibble [0 × 3]>
6 <split [34196/3799]> Fold06 <tibble [3 × 4]> <tibble [0 × 3]>
7 <split [34196/3799]> Fold07 <tibble [3 × 4]> <tibble [0 × 3]>
8 <split [34196/3799]> Fold08 <tibble [3 × 4]> <tibble [0 × 3]>
9 <split [34196/3799]> Fold09 <tibble [3 × 4]> <tibble [0 × 3]>
10 <split [34196/3799]> Fold10 <tibble [3 × 4]> <tibble [0 × 3]>
After that, I will select the fitted model with the best result and measure its performance on the testing dataset.
all_test <-
all_fold %>%
extract_workflow("recipe_linear") %>%
finalize_workflow(
all_fold %>%
extract_workflow_set_result("recipe_linear") %>%
select_best(metric = "roc_auc")
) %>%
last_fit(split = df_splits)
all_test %>%
collect_predictions() %>%
roc_auc(`.pred_No`, truth = claim)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.775
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Randy Fath on Unsplash