Why don’t we call this algorithm the k-average algorithm?
Because it’s mean.
In the previous post, I discussed the basic concept of clustering and the considerations when performing any clustering exercise.
In this post, I will use K-means to demonstrate how a clustering algorithm works and how to interpret the clustering results.

Illustration by Allison Horst
K-means clustering is one of the most common basic algorithms one would learn when they embark on their machine learning journey.
In short, K-means clustering attempts to partition a dataset into K distinct, non-overlapping clusters (James et al. 2021).
Allison Horst has a nice set of illustrations on how K-means clustering works in graphic format, making it super easy to understand.
Following are the illustrations of how K-means clustering works:

Illustration by Allison Horst
Google Developers (2020) lists a few pros and cons of the K-means clustering algorithm.
Below are the pros and cons extracted from the website:
Pros
Relatively simple to implement
Scales to large data sets
Guarantees convergence
Can warm-start the positions of centroids
Easily adapts to new examples
Generalizes to clusters of different shapes and sizes, such as elliptical clusters
Cons
Choosing k manually
Being dependent on initial values
Clustering data of varying sizes and density
Clustering outliers
Scaling with number of dimensions
For this demonstration, I will be using the bank marketing data from the UCI Machine Learning Repository.
This dataset contains data points from past direct marketing campaigns of a Portuguese banking institution.

Photo by Monstera from Pexels
First, I will set up the environment by calling all the packages I need for the analysis later.
packages <- c('tidyverse', 'readr', 'skimr', 'tidymodels', 'plotly',
              'corrplot', 'GGally')

for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Next, I will import the data into the environment and perform some data wrangling for the analysis later.
Note that I will perform clustering on a sample of the data points. This approach is commonly used when the dataset is too large for exploratory data analysis.
Once we have finished the clustering analysis, we could re-run it on the full dataset, although the results are unlikely to change significantly when we move from the subset to the full dataset. If the results do change significantly, this could indicate that the subset used for clustering is not representative of the full dataset.
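The import and sampling step itself is not shown, so here is a minimal sketch. The file name, the semicolon separator, the seed, and the sampling proportion are all assumptions; adjust them to match your copy of the UCI data.

```r
# Hypothetical import and sampling step; the file name, separator,
# seed and sampling proportion are assumptions, not from the post.
set.seed(1234)

df <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE)

# Keep a random sample of the rows for the exploratory clustering.
df <- df %>%
  slice_sample(prop = 0.5)
```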
As part of the usual analysis, I will start by checking the data quality.
skim(df)
| Name | df |
| Number of rows | 22606 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| job | 0 | 1 | FALSE | 12 | blu: 4871, man: 4665, tec: 3747, adm: 2626 |
| marital | 0 | 1 | FALSE | 3 | mar: 13633, sin: 6433, div: 2540 |
| education | 0 | 1 | FALSE | 4 | sec: 11611, ter: 6566, pri: 3522, unk: 907 |
| default | 0 | 1 | FALSE | 2 | no: 22193, yes: 413 |
| housing | 0 | 1 | FALSE | 2 | yes: 12550, no: 10056 |
| loan | 0 | 1 | FALSE | 2 | no: 18955, yes: 3651 |
| contact | 0 | 1 | FALSE | 3 | cel: 14648, unk: 6495, tel: 1463 |
| month | 0 | 1 | TRUE | 12 | may: 6893, jul: 3454, aug: 3130, jun: 2648 |
| poutcome | 0 | 1 | FALSE | 4 | unk: 18496, fai: 2500, oth: 870, suc: 740 |
| y | 0 | 1 | FALSE | 2 | no: 19968, yes: 2638 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 41.01 | 10.64 | 18 | 33.00 | 39 | 48 | 94 | ▅▇▃▁▁ |
| balance | 0 | 1 | 1347.74 | 2897.13 | -3372 | 71.00 | 443 | 1430 | 71188 | ▇▁▁▁▁ |
| day | 0 | 1 | 15.83 | 8.32 | 1 | 8.00 | 16 | 21 | 31 | ▇▆▇▆▆ |
| duration | 0 | 1 | 259.57 | 258.82 | 0 | 103.25 | 180 | 321 | 3785 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.78 | 3.16 | 1 | 1.00 | 2 | 3 | 58 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 39.79 | 99.67 | -1 | -1.00 | -1 | -1 | 871 | ▇▁▁▁▁ |
| previous | 0 | 1 | 0.57 | 2.57 | 0 | 0.00 | 0 | 0 | 275 | ▇▁▁▁▁ |
In the data dictionary, we are told that if the customer was not previously contacted, pdays is recorded as -1.
In the Google Machine Learning Course, the author mentions that it is not ideal to mix “magic” values with actual data.
Therefore, I will separate the “magic” values from the actual data by including an indicator for whether pdays was captured in the dataset and changing the magic value to 0.
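This recoding step is not shown in the original code, so below is a minimal sketch. The indicator name pdays_contacted is a hypothetical choice (stored as a factor so it does not enter the numeric-only clustering later), while pdays_recoded matches the column that appears in the cluster summary further down.

```r
# Sketch of the recoding described above; pdays_contacted is a
# hypothetical name for the indicator column.
df_1 <- df %>%
  mutate(pdays_contacted = factor(if_else(pdays == -1, "no", "yes")),
         pdays_recoded   = if_else(pdays == -1, 0, pdays)) %>%
  select(-pdays)
```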
As K-means clustering finds the centroid within each cluster, the algorithm can be sensitive to outliers. To overcome this, we can take either of the following actions:
Remove outliers before running the clustering algorithm
Use other algorithms that are less sensitive to outliers (e.g. k-medians clustering)
In this post, I will remove the outliers from the dataset before performing clustering analysis.
df_1 <- df_1 %>%
  filter(previous < 100)
In the data quality checks, we can see that the values are very skewed.
As K-means clustering is sensitive to the scale of the variables, standardization should be performed to avoid variables with larger variances dominating the clustering results.
df_1_scale <- df_1 %>%
  mutate(across(where(is.numeric), scale))
There are non-numeric variables within the dataset. However, the K-means clustering algorithm cannot handle non-numeric variables.
The common approach to get around this issue is to convert the non-numeric variables into dummy variables through one-hot encoding. However, IBM Support (2020) mentions that clustering results are unlikely to be satisfactory with binary data. As the name of the algorithm suggests, the algorithm attempts to find the clusters by finding the “mean” (a.k.a. the centroid), and taking the average is not meaningful for binary data.
As such, we can either perform clustering only on the numeric variables or use other algorithms (e.g. K-prototypes) that allow us to perform clustering on mixed data types. I will explore this alternative clustering algorithm in a future post.
In this post, I will use the first method, which is to filter out all the non-numeric variables before clustering.
df_1_scale_num <- df_1_scale %>%
  select_if(is.numeric)
I will also check the correlation before proceeding to run K-means clustering. This is because highly correlated variables may distort the clustering results, causing clusters to overlap with one another.
corrplot(cor(df_1_scale_num, use = "pairwise.complete.obs"),
         method = "number",
         type = "upper",
         tl.cex = 0.65,
         number.cex = 0.65,
         diag = FALSE)

From the correlation results, it does not seem like we have any highly correlated variables. This suggests that we could use all the variables for clustering purposes.
Let’s start our analysis in K-means clustering.
As mentioned earlier, we need to pre-define upfront how many clusters we want for the K-means clustering exercise. However, we cannot know the optimal number of clusters for a given dataset in advance. Hence, the common approach is to run the algorithm over a pre-defined range of cluster numbers.
Therefore, to do so, I have referenced the code from Exploratory clustering shared on Tidymodels documentation page.
kclusts_1_scale <-
  tibble(k = 3:10) %>%
  mutate(
    kclust = map(k, ~kmeans(df_1_scale_num, centers = .x, algorithm = "Lloyd")),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, df_1_scale_num)
  )
Next, I will use the unnest function to flatten the selected columns into regular columns.
The beauty of this approach is that the output follows the tidy data concept, which allows us to chain different functions together without needing many transformations.
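The unnesting step itself is not shown, so here is a minimal sketch following the Tidymodels example; the object name clusterings_1_scale is chosen to match the one used in the elbow-curve code.

```r
# Flatten the glance() output so that each row holds the model summary
# statistics (including tot.withinss) for one value of k.
clusterings_1_scale <- kclusts_1_scale %>%
  unnest(cols = c(glanced))
```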
Once the columns are flattened, I will plot the elbow curve to find the optimal number of clusters by using the ggplot function.
In general, the total within-cluster variation (a.k.a. tot.withinss in the algorithm/graph) decreases as the number of clusters increases. This is expected: as the number of clusters increases, there are more centroids within the dataset, which decreases the distance between each data point and its nearest centroid.
Alternatively, consider the extreme case where every data point is a cluster on its own: the total within-cluster variation would be zero. Hence, the total within-cluster variation decreases as we increase the number of clusters.
As such, below is the elbow curve for the given dataset:
ggplot(clusterings_1_scale, aes(k, tot.withinss)) +
  geom_line() +
  geom_point()

Based on the elbow curve above, it seems like 8 clusters is a good cutoff.
Note that it is not ideal to simply choose the number of clusters that gives the lowest total within-cluster variation: in the extreme, putting each data point in its own cluster gives a total within-cluster variation of zero, which defeats the purpose of the clustering exercise.
Therefore, it is worth considering carefully how we select the clustering results.
Do refer to my previous post where I have discussed the different considerations while performing clustering analysis.
With that, I will pull out the clustering results by using the pull function and the pluck function.
kclusts_result_1_scale <- kclusts_1_scale %>%
  pull(kclust) %>%
  pluck(6)  # the 6th element corresponds to k = 8, since k runs from 3 to 10
As discussed in my previous post, it is also quite crucial to check the number of data points within each cluster to ensure there is a sizable count within each cluster. The usual rule of thumb used in practice is to have at least 5 data points within each cluster.
Let’s pull out the number of data points in each cluster.
kclusts_result_1_scale$size
[1] 3419 1428 653 3370 5490 2868 4842 535
From the results shown above, there are at least 5 data points within each cluster.
Observing the average values of each cluster could give us a glimpse of the different characteristics of each cluster, allowing us to formulate different strategies for different groups of customers.
This is because “not all the customers are the same”.
Note that we standardized the dataset before running the clustering algorithm. Hence, instead of using the average values in the clustering results, I will compute the average values of each cluster from the original dataset (i.e. the dataset before standardization).
To do so, I will first attach the cluster assignment of each data point back to the dataset.
df_1_num <- df_1 %>%
  select_if(is.numeric)

df_1_num$cluster <- kclusts_result_1_scale$cluster
Then, I will extract all the column names (excluding the cluster column, which we do not want to average).
var_list <- df_1_num %>%
  select(-cluster) %>%
  names()
Once that is done, I will use a loop to calculate the average values for each cluster.
cluster <- 1:8
result <- as.data.frame(cluster)

for (i in var_list) {
  temp_result <- df_1_num %>%
    group_by(cluster) %>%
    summarise(!!i := mean(.data[[i]])) %>%
    dplyr::select(-cluster)
  result <- cbind(result, temp_result)
}
result
| cluster | age | balance | day | duration | campaign | previous | pdays_recoded |
|---|---|---|---|---|---|---|---|
| 1 | 52.16993 | 1114.4109 | 8.808424 | 212.3963 | 2.255338 | 0.109388710 | 7.1275227 |
| 2 | 40.58403 | 1287.3011 | 16.013305 | 991.6078 | 2.435574 | 0.213585434 | 16.7436975 |
| 3 | 40.58806 | 923.0766 | 22.781011 | 136.0888 | 16.464012 | 0.009188361 | 0.6324655 |
| 4 | 52.70475 | 1247.9864 | 22.788724 | 192.7899 | 2.700000 | 0.153115727 | 10.9842730 |
| 5 | 34.07341 | 904.1894 | 22.950638 | 203.8022 | 2.498361 | 0.079963570 | 6.1910747 |
| 6 | 39.55788 | 1148.5764 | 13.542887 | 239.3720 | 2.120990 | 3.628312413 | 265.2771967 |
| 7 | 33.57951 | 798.6838 | 8.224081 | 217.6338 | 2.214168 | 0.080132177 | 4.4258571 |
| 8 | 43.73271 | 14736.7645 | 15.986916 | 238.4766 | 2.542056 | 0.474766355 | 29.5140187 |
Following are some observations from the results above and possible actions we could take:
Alternatively, a good way to illustrate the clustering results is a parallel coordinate plot. This approach allows us to compare the average values of the clusters visually.
To make the graph less cluttered, I will plot only 1% of the clustering results so that it is easier to make comparisons across the different clusters.
For example, if we filter the interactive graph to focus only on Clusters 5 and 8, we can see that the average age in Cluster 5 is younger than in Cluster 8.
Note that we can select and deselect clusters in the graph by clicking the cluster numbers in the legend.
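The plotting code is not shown in the post; the following is one way such an interactive parallel coordinate plot could be built with GGally and plotly. The seed and the column indices are assumptions.

```r
# Sketch of an interactive parallel coordinate plot; the seed and the
# column indices are assumptions.
set.seed(1234)

p <- df_1_num %>%
  mutate(cluster = factor(cluster)) %>%   # factor so clusters get discrete colors
  slice_sample(prop = 0.01) %>%           # the 1% sample mentioned above
  ggparcoord(columns = 1:7,               # the seven numeric variables
             groupColumn = "cluster",
             scale = "std")               # standardize each axis

ggplotly(p)  # clicking a cluster number in the legend shows/hides it
```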
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Nareeta Martin on Unsplash