Finding the unseen

LCA models work on the assumption that the observed distribution of the variables is the result of a finite mixture of latent (unobserved) distributions (Pratik Sinha 2021).
Although this method sounds similar to traditional cluster analysis, there are some differences between the two (StackExchange 2014):
Latent class analysis is a finite mixture model
It uses a probabilistic model to describe the distribution of the data
Latent class analysis starts by describing the distribution of the data, whereas cluster analysis attempts to find similarities between cases
Because a statistical model underlies the analysis, we can formally assess goodness of fit when selecting a model
The statistical framing also makes the analysis easy to extend (e.g., by including a prior distribution in the analysis)
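Formally, with C latent classes and J categorical indicators, the model is the standard LCA mixture (this is the general formulation implemented by poLCA, not something specific to this dataset):

```latex
P(Y = y) \;=\; \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{k=1}^{K_j} \left( p_{jck} \right)^{\mathbb{1}(y_j = k)}
```

where \(\pi_c\) is the share of class \(c\) and \(p_{jck}\) is the probability that a member of class \(c\) takes level \(k\) of variable \(j\). Fitting the model means estimating these two sets of probabilities, typically via the EM algorithm.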
The authors also list a number of considerations to keep in mind when building an LCA model. Refer to this website for the full list.
In this demonstration, I will be using the poLCA package to perform latent class analysis.
pacman::p_load(tidyverse, tidymodels, janitor, poLCA, foreach, doParallel)
I will be using this breast cancer dataset for the demonstration.
Next, I will scale the numeric variables to have a mean of 0 and a standard deviation of 1 so that differences in scale do not affect the results.
# Define and estimate the normalization recipe
gen_norm <-
  recipe(diagnosis ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Apply the recipe to obtain the normalized data
df_norm <- bake(gen_norm, new_data = df)
As poLCA accepts only categorical variables, I will bin each numeric variable into groups.
# Bin every numeric predictor into 5 equal-width intervals
df_norm_grp <-
  df_norm %>%
  mutate(across(!diagnosis, function(x) cut_interval(x, n = 5)))
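cut_interval() comes from ggplot2 (loaded as part of tidyverse) and splits the range of a variable into n equal-width bins. A small self-contained illustration on a toy vector:

```r
library(ggplot2)

# 100 evenly spaced values between 0 and 10
x <- seq(0, 10, length.out = 100)

# Split the range into five equal-width intervals (each of width 2)
bins <- cut_interval(x, n = 5)

levels(bins)  # "[0,2]", "(2,4]", "(4,6]", "(6,8]", "(8,10]"
table(bins)   # counts per bin
```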
Okay, let’s start our modeling!
I will first define the formula for the latent class model.
formula_lca <- cbind(radius_mean
,texture_mean
,perimeter_mean
,area_mean
,smoothness_mean
,compactness_mean
,concavity_mean
,concave_points_mean
,symmetry_mean
,fractal_dimension_mean
,radius_se
,texture_se
,perimeter_se
,area_se
,smoothness_se
,compactness_se
,concavity_se
,concave_points_se
,symmetry_se
,fractal_dimension_se
,radius_worst
,texture_worst
,perimeter_worst
,area_worst
,smoothness_worst
,compactness_worst
,concavity_worst
,concave_points_worst
,symmetry_worst
,fractal_dimension_worst) ~ 1
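Typing thirty column names by hand is error-prone; an equivalent formula can be built programmatically from the column names (a sketch, assuming the df_norm_grp data frame from above):

```r
# Build the same cbind(var1, var2, ...) ~ 1 formula from the
# non-outcome column names, avoiding typos in the thirty variables
vars <- setdiff(names(df_norm_grp), "diagnosis")
formula_lca <- as.formula(
  paste0("cbind(", paste(vars, collapse = ", "), ") ~ 1")
)
```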
Then, I will build a latent class model.
In this example, I will specify nclass = 2, since the diagnosis variable has two levels (benign and malignant).
lca_fit <- poLCA(formula_lca,
                 maxiter = 1000,  # maximum number of EM iterations
                 nclass = 2,      # number of latent classes to fit
                 nrep = 10,       # random restarts to avoid a local maximum
                 data = df_norm_grp)
Model 1: llik = -13389.02 ... best llik = -13389.02
Model 2: llik = -13268.56 ... best llik = -13268.56
Model 3: llik = -13274.55 ... best llik = -13268.56
Model 4: llik = -13295.33 ... best llik = -13268.56
Model 5: llik = -13352.13 ... best llik = -13268.56
Model 6: llik = -13274.55 ... best llik = -13268.56
Model 7: llik = -13367.22 ... best llik = -13268.56
Model 8: llik = -13283.52 ... best llik = -13268.56
Model 9: llik = -13290.03 ... best llik = -13268.56
Model 10: llik = -13345.64 ... best llik = -13268.56
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$radius_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2535 0.7148 0.0317 0.0000 0.0000
class 2: 0.0330 0.2829 0.4400 0.2112 0.0329
$texture_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2756 0.5457 0.1336 0.0423 0.0028
class 2: 0.0713 0.4942 0.3782 0.0469 0.0094
$perimeter_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2845 0.6867 0.0289 0.0000 0.0000
class 2: 0.0283 0.2688 0.4729 0.1971 0.0329
$area_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8519 0.1481 0.0000 0.0000 0.0000
class 2: 0.1906 0.4480 0.3051 0.0422 0.0141
$smoothness_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2423 0.5752 0.1670 0.0155 0.0000
class 2: 0.0188 0.4546 0.4634 0.0539 0.0094
$compactness_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6815 0.3114 0.0071 0.0000 0.0000
class 2: 0.0333 0.5138 0.3355 0.0939 0.0235
$concavity_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9295 0.0705 0.0000 0.0000 0.0000
class 2: 0.0378 0.5820 0.2769 0.0704 0.0329
$concave_points_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8676 0.1324 0.0000 0.0000 0.0000
class 2: 0.0189 0.4038 0.4178 0.1267 0.0329
$symmetry_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.1042 0.6241 0.2492 0.0197 0.0028
class 2: 0.0141 0.3496 0.5049 0.1127 0.0188
$fractal_dimension_mean
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.3836 0.5129 0.0922 0.0084 0.0028
class 2: 0.2996 0.4409 0.2031 0.0376 0.0188
$radius_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9915 0.0085 0.0000 0 0.0000
class 2: 0.6620 0.3051 0.0235 0 0.0094
$texture_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6141 0.3207 0.0568 0.0056 0.0028
class 2: 0.6150 0.3434 0.0322 0.0094 0.0000
$perimeter_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9944 0.0056 0.0000 0 0.0000
class 2: 0.7278 0.2347 0.0282 0 0.0094
$area_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 1.0000 0.000 0.0000 0 0.0000
class 2: 0.8592 0.122 0.0094 0 0.0094
$smoothness_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6988 0.2583 0.0401 0.0028 0.0000
class 2: 0.6851 0.2925 0.0083 0.0094 0.0047
$compactness_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8379 0.1424 0.0113 0.0084 0.0000
class 2: 0.4158 0.4386 0.1033 0.0376 0.0047
$concavity_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9831 0.0169 0 0.0000 0.0000
class 2: 0.8920 0.0986 0 0.0047 0.0047
$concave_points_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.6871 0.2989 0.0112 0.0028 0.0000
class 2: 0.1179 0.7224 0.1362 0.0188 0.0047
$symmetry_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.7066 0.2680 0.0225 0.0028 0.0000
class 2: 0.7050 0.2105 0.0516 0.0282 0.0047
$fractal_dimension_se
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9409 0.0535 0.0028 0.0028 0.0000
class 2: 0.8638 0.1127 0.0141 0.0047 0.0047
$radius_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.5175 0.4686 0.0139 0.0000 0.0000
class 2: 0.0436 0.3270 0.4463 0.1455 0.0376
$texture_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.2227 0.5159 0.2052 0.0563 0.0000
class 2: 0.0796 0.3327 0.4749 0.0893 0.0235
$perimeter_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.5964 0.4036 0.0000 0.0000 0.0000
class 2: 0.0389 0.3978 0.4178 0.1173 0.0282
$area_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.9749 0.0251 0.0000 0.0000 0.0000
class 2: 0.3236 0.4698 0.1643 0.0376 0.0047
$smoothness_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.1240 0.5328 0.3102 0.0330 0.0000
class 2: 0.0141 0.2529 0.5204 0.1938 0.0188
$compactness_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8053 0.1947 0.0000 0.0000 0.0000
class 2: 0.1415 0.5769 0.2065 0.0516 0.0235
$concavity_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.8248 0.1724 0.0028 0.0000 0.0000
class 2: 0.0668 0.5623 0.3051 0.0516 0.0141
$concave_points_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.3212 0.5804 0.0984 0.0000 0.0000
class 2: 0.0000 0.0516 0.3852 0.4178 0.1455
$symmetry_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.4110 0.5695 0.0166 0.0029 0.0000
class 2: 0.1414 0.6095 0.1882 0.0516 0.0094
$fractal_dimension_worst
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
class 1: 0.7948 0.1968 0.0056 0.0028 0.0000
class 2: 0.3984 0.4419 0.1503 0.0047 0.0047
Estimated class population shares
0.6249 0.3751
Predicted class memberships (by modal posterior prob.)
0.625 0.375
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 568
number of estimated parameters: 241
residual degrees of freedom: 327
maximum log-likelihood: -13268.56
AIC(2): 27019.11
BIC(2): 28065.56
G^2(2): 19349.1 (Likelihood ratio/deviance statistic)
X^2(2): 3.849065e+30 (Chi-square goodness of fit)
From the results, we can see the following:
The conditional probability of each category within each class
The estimated class shares and the predicted class memberships
Different fit statistics (e.g., AIC, BIC); note that with 30 five-level variables the underlying contingency table is extremely sparse, which is why the chi-square goodness-of-fit statistic (X^2) takes an astronomical value and should not be trusted here
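The 241 estimated parameters reported in the output can be verified by hand: with C = 2 classes and J = 30 indicators of K = 5 levels each, the model estimates C − 1 free class shares plus C × J × (K − 1) free item-response probabilities:

```latex
(C - 1) + C \cdot J \cdot (K - 1) = 1 + 2 \times 30 \times 4 = 241
```

The residual degrees of freedom then follow as 568 − 241 = 327, matching the output above.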
We can generate the predicted class using broom's augment() function.
lca_pred <-
  augment(lca_fit, df_norm_grp) %>%
  mutate(.class = factor(.class, levels = c("2", "1")))

lca_pred
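augment() is a convenience here; poLCA stores the same information on the fitted object itself (predclass holds the modal class assignment, posterior the class membership probabilities). An equivalent sketch using those fields:

```r
# Attach poLCA's modal class assignment and, for reference,
# the posterior probability of that assignment
lca_pred_manual <-
  df_norm_grp %>%
  mutate(
    .class = factor(lca_fit$predclass, levels = c("2", "1")),
    .max_posterior = apply(lca_fit$posterior, 1, max)
  )
```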
# A tibble: 568 × 33
radius_mean texture_mean perimeter_mean area_mean smoothness_mean
<fct> <fct> <fct> <fct> <fct>
1 (0.368,1.57] [-2.23,-0.8… (0.397,1.59] (-0.116,… (0.461,1.91]
2 (1.57,2.77] (-0.85,0.52… (1.59,2.78] (1.22,2.… (-0.985,0.461]
3 (1.57,2.77] (-0.85,0.52… (0.397,1.59] (1.22,2.… (0.461,1.91]
4 (-0.834,0.36… (-0.85,0.52… (-0.796,0.397] [-1.46,-… (1.91,3.35]
5 (1.57,2.77] [-2.23,-0.8… (1.59,2.78] (1.22,2.… (-0.985,0.461]
6 (-0.834,0.36… (-0.85,0.52… (-0.796,0.397] [-1.46,-… (1.91,3.35]
7 (0.368,1.57] (-0.85,0.52… (0.397,1.59] (-0.116,… (-0.985,0.461]
8 (-0.834,0.36… (-0.85,0.52… (-0.796,0.397] [-1.46,-… (0.461,1.91]
9 (-0.834,0.36… (0.525,1.9] (-0.796,0.397] [-1.46,-… (1.91,3.35]
10 (-0.834,0.36… (0.525,1.9] (-0.796,0.397] [-1.46,-… (0.461,1.91]
# ℹ 558 more rows
# ℹ 28 more variables: compactness_mean <fct>, concavity_mean <fct>,
# concave_points_mean <fct>, symmetry_mean <fct>,
# fractal_dimension_mean <fct>, radius_se <fct>, texture_se <fct>,
# perimeter_se <fct>, area_se <fct>, smoothness_se <fct>,
# compactness_se <fct>, concavity_se <fct>,
# concave_points_se <fct>, symmetry_se <fct>, …
We can also visualize the proportion of categories within each class to understand the characteristics of each class.
To do so, I will first extract the column names.
After that, I will loop through the variables to visualize the results.
For simplicity, I will show only the first three graphs; otherwise, the post would be far too long.
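A minimal sketch of that loop, assuming the lca_pred and df_norm_grp objects from above (the column selection and plot styling are my own choices):

```r
# Loop over the first three indicator variables and plot the
# distribution of their bins within each predicted class
plot_vars <- setdiff(names(df_norm_grp), "diagnosis")[1:3]

for (var in plot_vars) {
  p <- lca_pred %>%
    ggplot(aes(.class, fill = .data[[var]])) +
    geom_bar(position = "fill") +
    labs(title = paste("Graph for", var), y = "proportion")
  print(p)
}
```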
Graph for radius_mean

Graph for texture_mean

Graph for perimeter_mean

Finally, I will use ggplot to plot class against diagnosis.

lca_pred %>%
  ggplot(aes(.class, fill = diagnosis)) +
  geom_bar(position = "fill") +
  coord_flip()

That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!
