I have many hidden talents… The problem is, I don’t know where they are.

Photo by Sebastian Palomino
In my previous post, I explored how to perform latent class analysis.
However, latent class analysis can only be used when the variables are categorical. What if the variables are numerical?
In this post, I will be looking at how to perform latent profile analysis (LPA).
Latent Profile Analysis is a statistical modeling approach for estimating distinct profiles of variables (Rosenberg 2021).
There are six model types under LPA:
| Model Type | Variance | Covariance | Also known as |
|---|---|---|---|
| Model 1 | Equal | 0 | EEI |
| Model 2 | Varying | 0 | VVI |
| Model 3 | Equal | Equal | EEE |
| Model 4 | Varying | Equal | |
| Model 5 | Equal | Varying | |
| Model 6 | Varying | Varying | VVV |
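For reference, the table above can also be written down as the `variances`/`covariances` argument pairs that tidyLPA's `estimate_profiles` accepts. This is just a small sketch of the mapping; as far as I know, models 4 and 5 are only available through the Mplus backend rather than mclust.

```r
# Mapping of tidyLPA model numbers to the variances / covariances
# arguments of estimate_profiles(), following the table above
model_spec <- list(
  `1` = c(variances = "equal",   covariances = "zero"),
  `2` = c(variances = "varying", covariances = "zero"),
  `3` = c(variances = "equal",   covariances = "equal"),
  `4` = c(variances = "varying", covariances = "equal"),
  `5` = c(variances = "equal",   covariances = "varying"),
  `6` = c(variances = "varying", covariances = "varying")
)

# Model 2, for example, can equivalently be requested with:
# estimate_profiles(..., variances = "varying", covariances = "zero")
unname(model_spec[["2"]])   # "varying" "zero"
```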
Below are the points to note when using LPA:

- LPA only works on numeric variables.
- LPA is sensitive to the scale of the variables.
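Because of that scale sensitivity, it is worth standardizing every column to mean 0 and standard deviation 1 before fitting. A minimal base-R sketch with made-up numbers (the column names here are hypothetical):

```r
# Two toy variables on very different scales; without standardization,
# the larger-scale column would dominate the fitted profiles
x <- data.frame(
  radius_mm = c(10, 12, 25, 27),      # values in the tens
  area_mm2  = c(300, 350, 1200, 1350) # values in the hundreds/thousands
)

x_std <- as.data.frame(scale(x))  # center and scale every column

round(colMeans(x_std), 10)  # all zeros
apply(x_std, 2, sd)         # all ones
```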
In this demonstration, I will be using the tidyLPA package to perform latent profile analysis.
```r
pacman::p_load(tidyverse, tidymodels, tidyLPA, GGally)

df <-
  read_csv("data/data.csv") %>%
  select(-c(`...33`, id))
```
For simplicity, I will skip the check on the correlations of the variables.
First, I will normalize the numeric variables to a mean of zero and a standard deviation of one; otherwise, the model results will be affected by the scales of the columns.
```r
gen_norm <-
  recipe(diagnosis ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

df_norm <-
  bake(gen_norm, new_data = df)
```
By default, estimate_profiles will fit model 1 if none of the variances, covariances, or models arguments are provided.
```r
lpa_fit_model1 <-
  df_norm %>%
  select(-diagnosis) %>%
  estimate_profiles(n_profiles = 2:6)

lpa_fit_model1
```
```
tidyLPA analysis using mclust:

 Model Classes      AIC      BIC Entropy prob_min prob_max n_min n_max BLRT_p
     1       2 41655.40 42050.53    0.98     0.99     1.00  0.32  0.68   0.01
     1       3 39182.54 39712.28    0.98     0.97     1.00  0.16  0.64   0.01
     1       4 37785.01 38449.35    0.98     0.96     1.00  0.11  0.54   0.01
     1       5 36923.06 37722.02    0.96     0.95     1.00  0.04  0.40   0.01
     1       6 36005.39 36938.95    0.96     0.95     1.00  0.03  0.31   0.01
```
From the results, we can note the following:

- Each time we add an additional profile, the BLRT p-value stays below 0.05, indicating that the extra profile still brings additional information.
- Among all the tested models, the 6-class solution looks best, as it has the lowest AIC and BIC.
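The "lowest BIC wins" logic above can be sketched directly, using the BIC values copied from the model 1 output:

```r
# BIC values from the model 1 fits, keyed by the number of classes
bic <- c(`2` = 42050.53, `3` = 39712.28, `4` = 38449.35,
         `5` = 37722.02, `6` = 36938.95)

names(which.min(bic))  # "6": the 6-class solution has the lowest BIC
```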
We can extract the model performance metrics by using the get_fit function.
```r
get_fit(lpa_fit_model1)
```

```
# A tibble: 5 × 18
  Model Classes  LogLik    AIC    AWE    BIC   CAIC    CLC    KIC
  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       2 -20737. 41655. 42899. 42051. 42142. 41475. 41749.
2     1       3 -19469. 39183. 40850. 39712. 39834. 38941. 39308.
3     1       4 -18740. 37785. 39877. 38449. 38602. 37481. 37941.
4     1       5 -18278. 36923. 39439. 37722. 37906. 36557. 37110.
5     1       6 -17788. 36005. 38946. 36939. 37154. 35577. 36223.
# ℹ 9 more variables: SABIC <dbl>, ICL <dbl>, Entropy <dbl>,
#   prob_min <dbl>, prob_max <dbl>, n_min <dbl>, n_max <dbl>,
#   BLRT_val <dbl>, BLRT_p <dbl>
```
The descriptions of the different measurements can be found under this link.
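As an example of one of these metrics: Entropy summarizes how cleanly observations are assigned to profiles (1 means perfectly separated classes, values near 0 mean heavy overlap). Below is a minimal sketch of the usual normalized-entropy formula, computed on a made-up posterior probability matrix; this is my own illustration, not tidyLPA's internal code.

```r
# Toy posterior class-membership probabilities: 3 observations, 2 classes
post <- matrix(c(0.98, 0.02,
                 0.05, 0.95,
                 0.90, 0.10), ncol = 2, byrow = TRUE)

# Normalized entropy: 1 minus average per-observation entropy,
# scaled by the maximum possible entropy log(K)
entropy <- function(p) {
  p <- pmax(p, 1e-12)  # guard against log(0)
  1 - sum(-p * log(p)) / (nrow(p) * log(ncol(p)))
}

round(entropy(post), 2)  # about 0.70: fairly clean separation
```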
We can specify the number of classes to run as follows:
```r
lpa_fit_model1_skip <-
  df_norm %>%
  select(-diagnosis) %>%
  estimate_profiles(n_profiles = c(2, 4, 6))

lpa_fit_model1_skip
```
```
tidyLPA analysis using mclust:

 Model Classes      AIC      BIC Entropy prob_min prob_max n_min n_max BLRT_p
     1       2 41655.40 42050.53    0.98     0.99     1.00  0.32  0.68   0.01
     1       4 37785.01 38449.35    0.98     0.96     1.00  0.11  0.54   0.01
     1       6 36005.39 36938.95    0.96     0.95     1.00  0.03  0.31   0.01
```
We can extract a selected model by using the pluck function.
```r
lpa_fit_model1_skip %>%
  pluck("model_1_class_4")
```

```
tidyProfile estimated using Mclust:

 Model Classes    LogLik      AIC      AWE      BIC     CAIC
     1       4 -18739.51 37785.01 39876.74 38449.35 38602.35
      CLC      KIC    SABIC       ICL Entropy prob_min prob_max
 37480.97 37941.01 37963.65 -38466.51    0.98     0.96     1.00
 n_min n_max BLRT_val BLRT_p
  0.11  0.54  1459.53   0.01
```
We can also change the model we would like to build, either by passing the relevant values to the variances and covariances arguments or by specifying the models argument.
```r
lpa_fit_model2 <-
  df_norm %>%
  select(-diagnosis) %>%
  estimate_profiles(n_profiles = 2:6, models = 2)

lpa_fit_model2
```
```
tidyLPA analysis using mclust:

 Model Classes      AIC      BIC Entropy prob_min prob_max n_min n_max BLRT_p
     2       2 37243.45 37768.85    0.99     1.00     1.00  0.39  0.61   0.01
     2       3 34887.19 35677.46    0.98     0.98     1.00  0.15  0.55   0.01
     2       4 32506.72 33561.85    0.97     0.98     0.99  0.15  0.29   0.01
     2       5 30625.20 31945.20    0.98     0.99     1.00  0.11  0.26   0.01
     2       6 29393.02 30977.89    0.98     0.98     1.00  0.11  0.22   0.01
```
To visualize the results, we can use the plot_profiles function. Note the following:

- To use plot_profiles, we need to convert the data into matrix format first.
- We can use ggplot2 functions to further modify the chart.
```r
df_norm %>%
  select(-diagnosis) %>%
  as.matrix() %>%
  estimate_profiles(3) %>%
  tidyLPA::plot_profiles() +
  coord_flip()
```


This chart becomes quite hard to read when we have too many variables.
Alternatively, we could use a parallel coordinate plot to illustrate the estimated means under each profile. This makes it easier to compare the differences between the groups.
This can be done by using the ggparcoord function from the GGally package.
```r
df_norm %>%
  select(-diagnosis) %>%
  as.matrix() %>%
  estimate_profiles(3) %>%
  get_estimates() %>%
  filter(Category == "Means") %>%
  select(c(Estimate, Parameter, Class)) %>%
  pivot_wider(names_from = Parameter,
              values_from = Estimate) %>%
  mutate(Class = as.factor(Class)) %>%
  ggparcoord(columns = 2:14, groupColumn = 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

It seems like there are some differences in characteristics among the different profiles.
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Pixabay