Finding the unseen

LCA models work on the assumption that the observed distribution of the variables is the result of a finite mixture of latent (unobserved) distributions (Pratik Sinha 2021).
Although this method sounds similar to traditional cluster analysis, there are some differences between the two (StackExchange 2014):
Latent class analysis is a finite mixture model
It uses a probabilistic model to describe the distribution of the data
Latent class analysis starts by describing the distribution of the data, whereas cluster analysis attempts to find similarities between cases
Because a statistical model underlies the analysis, we can formally assess goodness of fit when selecting a model
The statistical framing also makes the analysis easy to extend (e.g., by including a prior distribution in the analysis)
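Formally, with C latent classes and J categorical indicators, the model is the standard LCA mixture (this is the general formulation implemented by poLCA, not something specific to this dataset):

```latex
P(Y = y) \;=\; \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{k=1}^{K_j} \left( p_{jck} \right)^{\mathbb{1}(y_j = k)}
```

where \(\pi_c\) is the share of class \(c\) and \(p_{jck}\) is the probability that a member of class \(c\) takes level \(k\) of variable \(j\). Fitting the model means estimating these two sets of probabilities, typically via the EM algorithm.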
The authors also list a number of considerations to keep in mind when building an LCA model. Refer to this website for the full list.
In this demonstration, I will be using the poLCA package to perform latent class analysis.
pacman::p_load(tidyverse, tidymodels, janitor, poLCA, foreach, doParallel)
I will be using this breast cancer dataset for the demonstration.
Next, I will scale the numeric variables to have a mean of 0 and a standard deviation of 1 so that differences in scale do not affect the results.
# Define and estimate the normalization recipe
gen_norm <-
  recipe(diagnosis ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Apply the recipe to obtain the normalized data
df_norm <- bake(gen_norm, new_data = df)
As poLCA accepts only categorical variables, I will bin each numeric variable into groups.
# Bin every numeric predictor into 5 equal-width intervals
df_norm_grp <-
  df_norm %>%
  mutate(across(!diagnosis, function(x) cut_interval(x, n = 5)))
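cut_interval() comes from ggplot2 (loaded as part of tidyverse) and splits the range of a variable into n equal-width bins. A small self-contained illustration on a toy vector:

```r
library(ggplot2)

# 100 evenly spaced values between 0 and 10
x <- seq(0, 10, length.out = 100)

# Split the range into five equal-width intervals (each of width 2)
bins <- cut_interval(x, n = 5)

levels(bins)  # "[0,2]", "(2,4]", "(4,6]", "(6,8]", "(8,10]"
table(bins)   # counts per bin
```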
Okay, let’s start our modeling!
I will first define the formula for the latent class model.
formula_lca <- cbind(radius_mean
,texture_mean
,perimeter_mean
,area_mean
,smoothness_mean
,compactness_mean
,concavity_mean
,concave_points_mean
,symmetry_mean
,fractal_dimension_mean
,radius_se
,texture_se
,perimeter_se
,area_se
,smoothness_se
,compactness_se
,concavity_se
,concave_points_se
,symmetry_se
,fractal_dimension_se
,radius_worst
,texture_worst
,perimeter_worst
,area_worst
,smoothness_worst
,compactness_worst
,concavity_worst
,concave_points_worst
,symmetry_worst
,fractal_dimension_worst) ~ 1
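Typing thirty column names by hand is error-prone; an equivalent formula can be built programmatically from the column names (a sketch, assuming the df_norm_grp data frame from above):

```r
# Build the same cbind(var1, var2, ...) ~ 1 formula from the
# non-outcome column names, avoiding typos in the thirty variables
vars <- setdiff(names(df_norm_grp), "diagnosis")
formula_lca <- as.formula(
  paste0("cbind(", paste(vars, collapse = ", "), ") ~ 1")
)
```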
Then, I will build a latent class model.
In this example, I will specify nclass = 2, since the diagnosis variable has two levels (benign and malignant).
lca_fit <- poLCA(formula_lca,
                 maxiter = 1000,  # maximum number of EM iterations
                 nclass = 2,      # number of latent classes to fit
                 nrep = 10,       # random restarts to avoid a local maximum
                 data = df_norm_grp)
Model 1: llik = -13389.02 ... best llik = -13389.02
Model 2: llik = -13268.56 ... best llik = -13268.56
Model 3: llik = -13274.55 ... best llik = -13268.56
Model 4: llik = -13295.33 ... best llik = -13268.56
Model 5: llik = -13352.13 ... best llik = -13268.56
Model 6: llik = -13274.55 ... best llik = -13268.56
Model 7: llik = -13367.22 ... best llik = -13268.56
Model 8: llik = -13283.52 ... best llik = -13268.56
Model 9: llik = -13290.03 ... best llik = -13268.56
Model 10: llik = -13345.64 ... best llik = -13268.56
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$radius_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2535 0.7148 0.0317 0.0000 0.0000
class 2: 0.0330 0.2829 0.4400 0.2112 0.0329
$texture_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2756 0.5457 0.1336 0.0423 0.0028
class 2: 0.0713 0.4942 0.3782 0.0469 0.0094
$perimeter_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2845 0.6867 0.0289 0.0000 0.0000
class 2: 0.0283 0.2688 0.4729 0.1971 0.0329
$area_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8519 0.1481 0.0000 0.0000 0.0000
class 2: 0.1906 0.4480 0.3051 0.0422 0.0141
$smoothness_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2423 0.5752 0.1670 0.0155 0.0000
class 2: 0.0188 0.4546 0.4634 0.0539 0.0094
$compactness_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6815 0.3114 0.0071 0.0000 0.0000
class 2: 0.0333 0.5138 0.3355 0.0939 0.0235
$concavity_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9295 0.0705 0.0000 0.0000 0.0000
class 2: 0.0378 0.5820 0.2769 0.0704 0.0329
$concave_points_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8676 0.1324 0.0000 0.0000 0.0000
class 2: 0.0189 0.4038 0.4178 0.1267 0.0329
$symmetry_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.1042 0.6241 0.2492 0.0197 0.0028
class 2: 0.0141 0.3496 0.5049 0.1127 0.0188
$fractal_dimension_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.3836 0.5129 0.0922 0.0084 0.0028
class 2: 0.2996 0.4409 0.2031 0.0376 0.0188
$radius_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9915 0.0085 0.0000 0 0.0000
class 2: 0.6620 0.3051 0.0235 0 0.0094
$texture_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6141 0.3207 0.0568 0.0056 0.0028
class 2: 0.6150 0.3434 0.0322 0.0094 0.0000
$perimeter_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9944 0.0056 0.0000 0 0.0000
class 2: 0.7278 0.2347 0.0282 0 0.0094
$area_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 1.0000 0.000 0.0000 0 0.0000
class 2: 0.8592 0.122 0.0094 0 0.0094
$smoothness_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6988 0.2583 0.0401 0.0028 0.0000
class 2: 0.6851 0.2925 0.0083 0.0094 0.0047
$compactness_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8379 0.1424 0.0113 0.0084 0.0000
class 2: 0.4158 0.4386 0.1033 0.0376 0.0047
$concavity_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9831 0.0169 0 0.0000 0.0000
class 2: 0.8920 0.0986 0 0.0047 0.0047
$concave_points_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6871 0.2989 0.0112 0.0028 0.0000
class 2: 0.1179 0.7224 0.1362 0.0188 0.0047
$symmetry_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.7066 0.2680 0.0225 0.0028 0.0000
class 2: 0.7050 0.2105 0.0516 0.0282 0.0047
$fractal_dimension_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9409 0.0535 0.0028 0.0028 0.0000
class 2: 0.8638 0.1127 0.0141 0.0047 0.0047
$radius_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.5175 0.4686 0.0139 0.0000 0.0000
class 2: 0.0436 0.3270 0.4463 0.1455 0.0376
$texture_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2227 0.5159 0.2052 0.0563 0.0000
class 2: 0.0796 0.3327 0.4749 0.0893 0.0235
$perimeter_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.5964 0.4036 0.0000 0.0000 0.0000
class 2: 0.0389 0.3978 0.4178 0.1173 0.0282
$area_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9749 0.0251 0.0000 0.0000 0.0000
class 2: 0.3236 0.4698 0.1643 0.0376 0.0047
$smoothness_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.1240 0.5328 0.3102 0.0330 0.0000
class 2: 0.0141 0.2529 0.5204 0.1938 0.0188
$compactness_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8053 0.1947 0.0000 0.0000 0.0000
class 2: 0.1415 0.5769 0.2065 0.0516 0.0235
$concavity_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8248 0.1724 0.0028 0.0000 0.0000
class 2: 0.0668 0.5623 0.3051 0.0516 0.0141
$concave_points_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.3212 0.5804 0.0984 0.0000 0.0000
class 2: 0.0000 0.0516 0.3852 0.4178 0.1455
$symmetry_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.4110 0.5695 0.0166 0.0029 0.0000
class 2: 0.1414 0.6095 0.1882 0.0516 0.0094
$fractal_dimension_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.7948 0.1968 0.0056 0.0028 0.0000
class 2: 0.3984 0.4419 0.1503 0.0047 0.0047
Estimated class population shares
0.6249 0.3751
Predicted class memberships (by modal posterior prob.)
0.625 0.375
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 568
number of estimated parameters: 241
residual degrees of freedom: 327
maximum log-likelihood: -13268.56
AIC(2): 27019.11
BIC(2): 28065.56
G^2(2): 19349.1 (Likelihood ratio/deviance statistic)
X^2(2): 3.849065e+30 (Chi-square goodness of fit)
From the results, we can see the following:
The conditional probability of each category within each class
The estimated class shares and the predicted class memberships
Different fit statistics (e.g., AIC, BIC); note that with 30 five-level variables the underlying contingency table is extremely sparse, which is why the chi-square goodness-of-fit statistic (X^2) takes an astronomical value and should not be trusted here
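The 241 estimated parameters reported in the output can be verified by hand: with C = 2 classes and J = 30 indicators of K = 5 levels each, the model estimates C − 1 free class shares plus C × J × (K − 1) free item-response probabilities:

```latex
(C - 1) + C \cdot J \cdot (K - 1) = 1 + 2 \times 30 \times 4 = 241
```

The residual degrees of freedom then follow as 568 − 241 = 327, matching the output above.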
We can generate the predicted class using broom's augment() function.
lca_pred <-
  augment(lca_fit, df_norm_grp) %>%
  mutate(.class = factor(.class, levels = c("2", "1")))

lca_pred
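augment() is a convenience here; poLCA stores the same information on the fitted object itself (predclass holds the modal class assignment, posterior the class membership probabilities). An equivalent sketch using those fields:

```r
# Attach poLCA's modal class assignment and, for reference,
# the posterior probability of that assignment
lca_pred_manual <-
  df_norm_grp %>%
  mutate(
    .class = factor(lca_fit$predclass, levels = c("2", "1")),
    .max_posterior = apply(lca_fit$posterior, 1, max)
  )
```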
# A tibble: 568 × 33
radius_mean texture_mean perimeter_mean area_mean smoothness_mean
<fct> <fct> <fct> <fct> <fct>
1 (0.368,1.57] [-2.23,-0.8… (0.397,1.59] (-0.116,… (0.461,1.91]
2 (1.57,2.77] (-0.85,0.52… (1.59,2.78] (1.22,2.… (-0.985,0.461]
3 (1.57,2.77] (-0.85,0.52… (0.397,1.59] (1.22,2.… (0.461,1.91]
4 (-0.834,0.36… (-0.85,0.52… (-0.796,0.397] [-1.46,-… (1.91,3.35]
5 (1.57,2.77] [-2.23,-0.8… (1.59,2.78] (1.22,2.… (-0.985,0.461]
6 (-0.834,0.36… (-0.85,0.52… (-0.796,0.397] [-1.46,-… (1.91,3.35]
7 (0.368,1.57] (-0.85,0.52… (0.397,1.59] (-0.116,… (-0.985,0.461]
8 (-0.834,0.36… (-0.85,0.52… (-0.796,0.397] [-1.46,-… (0.461,1.91]
9 (-0.834,0.36… (0.525,1.9] (-0.796,0.397] [-1.46,-… (1.91,3.35]
10 (-0.834,0.36… (0.525,1.9] (-0.796,0.397] [-1.46,-… (0.461,1.91]
# ℹ 558 more rows
# ℹ 28 more variables: compactness_mean <fct>, concavity_mean <fct>,
# concave_points_mean <fct>, symmetry_mean <fct>,
# fractal_dimension_mean <fct>, radius_se <fct>, texture_se <fct>,
# perimeter_se <fct>, area_se <fct>, smoothness_se <fct>,
# compactness_se <fct>, concavity_se <fct>,
# concave_points_se <fct>, symmetry_se <fct>, …
We can also visualize the proportion of categories within each class to understand the characteristics of each class.
To do so, I will first extract the column names.
After that, I will loop through the variables to visualize the results.
For simplicity, I will show only the first three graphs; otherwise, the post would be far too long.
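A minimal sketch of that loop, assuming the lca_pred and df_norm_grp objects from above (the column selection and plot styling are my own choices):

```r
# Loop over the first three indicator variables and plot the
# distribution of their bins within each predicted class
plot_vars <- setdiff(names(df_norm_grp), "diagnosis")[1:3]

for (var in plot_vars) {
  p <- lca_pred %>%
    ggplot(aes(.class, fill = .data[[var]])) +
    geom_bar(position = "fill") +
    labs(title = paste("Graph for", var), y = "proportion")
  print(p)
}
```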
Graph for radius_mean

Graph for texture_mean

Graph for perimeter_mean

Finally, I will use ggplot to plot class against diagnosis.

lca_pred %>%
  ggplot(aes(.class, fill = diagnosis)) +
  geom_bar(position = "fill") +
  coord_flip()

That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!
