Not Boruto, also not burrito
In this post, I will be exploring how to use Boruta to perform feature selection.

Photo by Kamila Bairam
Brownlee (2020) explained that feature selection is the process of reducing the number of input variables when developing a predictive model.
The author further explained the benefits of feature selection, which include:
- Reducing the computational cost of modeling
- Potentially improving the performance of the model
Before jumping into how Boruta works, let’s look at the difference between the all-relevant and minimal-optimal classes of methods.
All-relevant methods (e.g. Boruta) aim to find all features connected with the decision, whereas minimal-optimal methods (e.g. XGBoost) aim to provide a possibly compact set of features that carries enough information for a possibly optimal classification on the reduced set (Kursa 2022).
The author also mentioned that minimal-optimal methods generally cherry-pick features usable for classification, regardless of whether this usability is significant, which is an easy path to overfitting. Boruta is much more robust in this manner.
Below is my simple explanation of how it works, after referencing different materials:
1. Create permuted copies of the existing features (a.k.a. shadow features)
2. Calculate the Z-scores of the feature importances; the maximum Z-score among the shadow features is taken as the cutoff
3. Any existing feature with a Z-score lower than the cutoff is deemed “redundant”, and vice versa
4. Repeat this process X times
5. Finally, a two-sided test is performed to classify the features into “Confirmed”, “Tentative” and “Rejected”
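The steps above can be sketched in a few lines of R. This is a toy illustration only: I use absolute correlation as a stand-in importance measure (real Boruta uses random forest importance Z-scores), and I show a single iteration rather than the repeated runs and the statistical test.

```r
# One toy "Boruta-style" iteration, with |correlation| standing in
# for the random-forest importance (an assumption for brevity)
set.seed(42)
n   <- 500
toy <- data.frame(signal = rnorm(n),  # truly related to the target
                  noise  = rnorm(n))  # unrelated column
toy$target <- 2 * toy$signal + rnorm(n)

# Step 1: shadow features are permuted copies of the real features
shadows <- as.data.frame(lapply(toy[c("signal", "noise")], sample))
names(shadows) <- paste0("shadow_", names(shadows))

# Step 2: importance for real and shadow features;
#         the best shadow importance becomes the cutoff
imp      <- function(x) abs(cor(x, toy$target))
real_imp <- sapply(toy[c("signal", "noise")], imp)
cutoff   <- max(sapply(shadows, imp))

# Step 3: features above the cutoff count as a "hit" this iteration
real_imp > cutoff
```

Repeating this many times and counting the hits per feature is what feeds the two-sided test in step 5.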
For more information, you may refer to page 3 of the original paper.
Alternatively, I happened to come across this YouTube video explaining how Boruta works, and I found it helpful.
In this demonstration, I will be using the Seoul bike sharing dataset from the UCI Machine Learning Repository for the exploration.
First, I will load the necessary packages into the environment.
pacman::p_load(tidyverse, janitor, lubridate, Boruta, ranger)
Next, I will import the data into the environment.
I will also perform some basic data wrangling before using Boruta.
I have also included a “random noise” column in the dataset to see whether the algorithm will pick it up as unimportant.
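As a small illustration of the wrangling idea (the real dataset has many more columns; the column names below are just a toy example), janitor’s clean_names standardizes the headers and the pure-noise column can be added with mutate:

```r
library(janitor)
library(dplyr)

# Toy rows mimicking the raw headers (illustrative, not the real data)
raw <- data.frame(`Rented Bike Count` = c(254, 204),
                  `Temperature(C)`    = c(-5.2, -5.5),
                  check.names = FALSE)

set.seed(1234)
df_toy <- raw %>%
  clean_names() %>%                    # snake_case the column names
  mutate(random_noise = rnorm(n()))    # the pure-noise sanity check
```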
To perform Boruta, I will be using the Boruta function from the Boruta package.
For reproducibility, I will specify the random seed before running Boruta.
set.seed(1234) # illustrative seed value; any fixed seed gives reproducibility
boruta_randForest <- Boruta(rented_bike_count ~ . -date
                            ,data = df
                            ,doTrace = 2)
boruta_randForest
Boruta performed 11 iterations in 13.15153 secs.
15 attributes confirmed important: dew_point_temperature_c,
functioning_day, holiday, hour, humidity_percent and 10 more;
1 attributes confirmed unimportant: random_noise;
By default, the Boruta function uses Random Forest to perform feature selection. At the point of writing, the function uses the random forest implementation from the ranger package.
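As a side note, the importance source is pluggable via the getImp argument. Below is a sketch on toy data using getImpLegacyRfZ, which (as I understand the Boruta documentation) falls back to the older randomForest package instead of ranger; the toy data and column names are my own illustration, not the bike sharing dataset.

```r
library(Boruta)

# Toy data: x1 drives the target, x2 is pure noise
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- toy$x1 + rnorm(200, sd = 0.1)

# Swap the ranger backend for the legacy randomForest one
# (adapter name per my reading of the Boruta docs; treat as an assumption)
res <- Boruta(y ~ ., data = toy, getImp = getImpLegacyRfZ)
res$finalDecision
```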
We could plot out the importance of the features over the different runs.
plotImpHistory(boruta_randForest)

According to the documentation, below are the meanings of the different colors in Boruta plots:
| Color | Meaning |
|---|---|
| Green | Confirmed |
| Yellow | Tentative |
| Red | Rejected |
| Blue | Shadow |
plot function
To visualise the importance of the features, we could pass the object to the plot function.
plot(boruta_randForest)

ggplot function
Personally, I prefer to visualize the results with the ggplot function so that I have more flexibility to control the graph parameters.
attStats(boruta_randForest)
meanImp medianImp minImp maxImp
hour 98.0717172 97.9130486 93.9123303 100.83349
temperature_c 36.8423986 36.9185862 34.9602356 37.80734
humidity_percent 45.3167087 45.1593932 42.1440836 48.57834
wind_speed_m_s 23.6467304 23.6905421 21.9460428 25.25994
visibility_10m 29.5007216 29.2028510 27.8115254 32.50643
dew_point_temperature_c 28.6002636 28.6356678 27.5923997 29.28198
solar_radiation_mj_m2 41.2940363 41.2602276 38.2420791 43.24073
rainfall_mm 39.3291499 39.5970634 36.2640944 40.56487
snowfall_cm 9.9727469 9.9544285 9.5916183 10.84924
seasons 25.6104768 25.4314202 25.0328769 26.81662
holiday 14.5492481 14.6347796 12.9727474 15.72155
functioning_day 77.4296628 77.6273986 73.8235309 80.21149
year 11.7275931 11.6575018 11.1935009 12.69956
month 27.9184309 28.1075324 26.3403225 30.51047
weekday 34.3243762 34.4392173 32.3535620 36.53955
random_noise 0.4028639 0.2574984 -0.5417463 2.12460
normHits decision
hour 1 Confirmed
temperature_c 1 Confirmed
humidity_percent 1 Confirmed
wind_speed_m_s 1 Confirmed
visibility_10m 1 Confirmed
dew_point_temperature_c 1 Confirmed
solar_radiation_mj_m2 1 Confirmed
rainfall_mm 1 Confirmed
snowfall_cm 1 Confirmed
seasons 1 Confirmed
holiday 1 Confirmed
functioning_day 1 Confirmed
year 1 Confirmed
month 1 Confirmed
weekday 1 Confirmed
random_noise 0 Rejected
While the result is in data frame format, the feature names are stored in the row names.
class(attStats(boruta_randForest))
[1] "data.frame"
To extract the feature names into a column, I will use the rownames_to_column function as shown below.
rownames_to_column(attStats(boruta_randForest), var = "variables")
variables meanImp medianImp minImp maxImp
1 hour 98.0717172 97.9130486 93.9123303 100.83349
2 temperature_c 36.8423986 36.9185862 34.9602356 37.80734
3 humidity_percent 45.3167087 45.1593932 42.1440836 48.57834
4 wind_speed_m_s 23.6467304 23.6905421 21.9460428 25.25994
5 visibility_10m 29.5007216 29.2028510 27.8115254 32.50643
6 dew_point_temperature_c 28.6002636 28.6356678 27.5923997 29.28198
7 solar_radiation_mj_m2 41.2940363 41.2602276 38.2420791 43.24073
8 rainfall_mm 39.3291499 39.5970634 36.2640944 40.56487
9 snowfall_cm 9.9727469 9.9544285 9.5916183 10.84924
10 seasons 25.6104768 25.4314202 25.0328769 26.81662
11 holiday 14.5492481 14.6347796 12.9727474 15.72155
12 functioning_day 77.4296628 77.6273986 73.8235309 80.21149
13 year 11.7275931 11.6575018 11.1935009 12.69956
14 month 27.9184309 28.1075324 26.3403225 30.51047
15 weekday 34.3243762 34.4392173 32.3535620 36.53955
16 random_noise 0.4028639 0.2574984 -0.5417463 2.12460
normHits decision
1 1 Confirmed
2 1 Confirmed
3 1 Confirmed
4 1 Confirmed
5 1 Confirmed
6 1 Confirmed
7 1 Confirmed
8 1 Confirmed
9 1 Confirmed
10 1 Confirmed
11 1 Confirmed
12 1 Confirmed
13 1 Confirmed
14 1 Confirmed
15 1 Confirmed
16 0 Rejected
Once the feature names are extracted, we could pass the data into the ggplot function to visualize the results.
rownames_to_column(attStats(boruta_randForest), var = "variables") %>%
  ggplot(aes(fct_reorder(variables, medianImp), medianImp, color = decision)) +
  geom_point() +
  geom_errorbar(aes(ymin = minImp, ymax = maxImp)) +
  coord_flip()

Next, I will visualize the relationship between the target variable and two of the features.
First, I will look at how the hour affects the rented bike count. It seems there are more bike rentals in the evening.
ggplot(df, aes(hour, rented_bike_count)) +
  geom_boxplot() +
  labs(title = "Rental Bike Count vs Hour") +
  theme_minimal()

If we plot the rented bike count against the temperature, we note that the count increases as the temperature increases. The count peaks at around 23 degrees Celsius and starts to drop as the temperature continues to rise.
I guess people wouldn’t want to cycle when the weather is too cold or too hot.
ggplot(df, aes(temperature_c, rented_bike_count)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  labs(title = "Rental Bike Count vs Temperature")

We could extract the selected features as the model formula.
formula_final <- getConfirmedFormula(boruta_randForest)
Another function is getNonRejectedFormula, which also includes the variables falling under the “Tentative” category.
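If a run leaves some attributes as “Tentative”, the Boruta package also ships a helper, TentativeRoughFix, that resolves them by a simpler median-importance comparison. A sketch, reusing the boruta_randForest object fitted above (in this particular run nothing is Tentative, so it should return the object unchanged with a warning):

```r
# Resolve any leftover Tentative attributes (sketch; depends on the
# boruta_randForest object from the earlier chunk)
boruta_fixed <- TentativeRoughFix(boruta_randForest)
getNonRejectedFormula(boruta_fixed)
```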
The extracted formula can then be passed to the model for fitting.
ranger_fit <-
ranger(formula_final
,data = df
,importance = "permutation")
Next, I will extract the variable importance from the fitted model.
rownames_to_column(as.data.frame(importance(ranger_fit)), var = "variables") %>%
  rename(importance = `importance(ranger_fit)`) %>%
  ggplot(aes(importance, fct_reorder(variables, importance))) +
  geom_col() +
  ylab("Features") +
  labs(title = "Variable Importance") +
  theme_minimal()

There are some differences in the importance rankings between Boruta and ranger. This is likely due to differences in how the importance is calculated.
Also, note that the rejected variables are not used in fitting the model.
That’s all for the day!
Thanks for reading the post until the end.
Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.
Refer to this link for the blog disclaimer.
Till next time, happy learning!

Photo by Gonzalo Mendiola