Boruta

Machine Learning Feature Selection

Not Boruto, also not burrito

Jasper Lok https://jasperlok.netlify.app/
09-23-2023

In this post, I will be exploring how to use Boruta to perform feature selection.

Photo by Kamila Bairam

What is feature selection?

(Brownlee 2020) explained that feature selection is the process of reducing the number of input variables when developing a predictive model.

The author further explained the benefits of feature selection, such as reducing the computational cost of modeling and, in some cases, improving the performance of the model.

All Relevant Method vs Minimal Optimal Class

Before jumping into how Boruta works, let’s look at the difference between the all relevant method and the minimal optimal class.

The all relevant method (e.g. Boruta) aims to find all features connected with the decision, whereas the minimal optimal class (e.g. XGBoost) aims to provide a possibly compact set of features which carry enough information for a possibly optimal classification on the reduced set (Kursa 2022).

The author also mentioned that minimal optimal methods generally cherry-pick features usable for classification, regardless of whether this usability is significant, which is an easy way to overfit. Boruta is much more robust in this manner.

So, how does Boruta work?


Below is my simple explanation of how it works, after referencing different materials:

For more information, you can refer to page 3 of the original paper.

Alternatively, I happened to come across this YouTube video explaining how Boruta works, which I found helpful.
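To make the idea concrete, below is a simplified sketch of the core "shadow feature" trick Boruta relies on. This is not the package's actual implementation (the real algorithm iterates many times and applies statistical tests); the toy data frame `df_demo`, the seed, and the variable names are all assumptions for illustration.

```r
# Simplified sketch of the shadow-feature idea behind Boruta.
# The real algorithm repeats this many times and uses statistical tests.
library(ranger)

set.seed(42)
df_demo <- data.frame(
  signal = rnorm(200),   # genuinely related to the target
  noise  = rnorm(200)    # unrelated to the target
)
df_demo$y <- 2 * df_demo$signal + rnorm(200)

# Step 1: create "shadow" copies of each feature by permuting its values,
# which destroys any relationship with the target.
shadows <- as.data.frame(lapply(df_demo[c("signal", "noise")], sample))
names(shadows) <- paste0("shadow_", names(shadows))

# Step 2: fit a random forest on real + shadow features, then check whether
# each real feature's importance beats the best-performing shadow.
fit <- ranger(y ~ ., data = cbind(df_demo, shadows),
              importance = "permutation")
imp <- importance(fit)
best_shadow <- max(imp[grepl("^shadow_", names(imp))])
imp[c("signal", "noise")] > best_shadow
```

A real feature that repeatedly beats the best shadow across iterations gets confirmed; one that repeatedly loses gets rejected.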

Demonstration

In this demonstration, I will be using the Seoul bike sharing dataset from the UCI Machine Learning Repository for the exploration.

Setup the environment

First, I will load the necessary packages into the environment.

pacman::p_load(tidyverse, janitor, lubridate, Boruta, ranger)

Import Data

Next, I will import the data into the environment.

I will also perform some basic data wrangling before using Boruta.

df <- 
  read_csv("data/SeoulBikeData.csv", locale = locale(encoding = "latin1")) %>% 
  clean_names() %>% 
  mutate(date = dmy(date)
         ,year = year(date)
         ,month = month(date)
         ,weekday = wday(date, label = TRUE)) %>% 
  mutate_at(c("hour", "year", "month")
            ,function(x) as.factor(x))

I have also included a “random noise” column in the dataset to see whether the algorithm will pick up that this column is unimportant.

df <-
  df %>% 
  mutate(random_noise = rnorm(nrow(df), 0, 1))

Boruta

To perform Boruta, I will be using the Boruta function from the Boruta package.

For reproducibility, I will specify the random seed before running Boruta.

set.seed(1234) # seed value assumed for reproducibility; original value not shown

boruta_randForest <- Boruta(rented_bike_count ~ . -date
                            ,data = df
                            ,doTrace = 2)

boruta_randForest
Boruta performed 11 iterations in 13.15153 secs.
 15 attributes confirmed important: dew_point_temperature_c,
functioning_day, holiday, hour, humidity_percent and 10 more;
 1 attributes confirmed unimportant: random_noise;

By default, the Boruta function uses a random forest to perform feature selection. At the time of writing, the function uses the random forest implementation from the ranger package.

Importance under Different Runs

We could plot out the importance of the features over the different runs.

plotImpHistory(boruta_randForest)

According to the documentation, below are the meanings of the different colors in Boruta plots:

Color  Meaning
Green  Confirmed
Yellow Tentative
Red    Rejected
Blue   Shadow

Visualizing the Importance

plot function

To visualize the importance of the features, we could pass the object to the plot function.

plot(boruta_randForest)

ggplot function

Personally, I prefer to visualize the results with ggplot so that I have more flexibility over the graph parameters.

attStats(boruta_randForest)
                           meanImp  medianImp     minImp    maxImp
hour                    98.0717172 97.9130486 93.9123303 100.83349
temperature_c           36.8423986 36.9185862 34.9602356  37.80734
humidity_percent        45.3167087 45.1593932 42.1440836  48.57834
wind_speed_m_s          23.6467304 23.6905421 21.9460428  25.25994
visibility_10m          29.5007216 29.2028510 27.8115254  32.50643
dew_point_temperature_c 28.6002636 28.6356678 27.5923997  29.28198
solar_radiation_mj_m2   41.2940363 41.2602276 38.2420791  43.24073
rainfall_mm             39.3291499 39.5970634 36.2640944  40.56487
snowfall_cm              9.9727469  9.9544285  9.5916183  10.84924
seasons                 25.6104768 25.4314202 25.0328769  26.81662
holiday                 14.5492481 14.6347796 12.9727474  15.72155
functioning_day         77.4296628 77.6273986 73.8235309  80.21149
year                    11.7275931 11.6575018 11.1935009  12.69956
month                   27.9184309 28.1075324 26.3403225  30.51047
weekday                 34.3243762 34.4392173 32.3535620  36.53955
random_noise             0.4028639  0.2574984 -0.5417463   2.12460
                        normHits  decision
hour                           1 Confirmed
temperature_c                  1 Confirmed
humidity_percent               1 Confirmed
wind_speed_m_s                 1 Confirmed
visibility_10m                 1 Confirmed
dew_point_temperature_c        1 Confirmed
solar_radiation_mj_m2          1 Confirmed
rainfall_mm                    1 Confirmed
snowfall_cm                    1 Confirmed
seasons                        1 Confirmed
holiday                        1 Confirmed
functioning_day                1 Confirmed
year                           1 Confirmed
month                          1 Confirmed
weekday                        1 Confirmed
random_noise                   0  Rejected

While the result is in data frame format, the feature names are stored in the row index.

class(attStats(boruta_randForest))
[1] "data.frame"

To extract the feature names into a column, I will use the rownames_to_column function as shown below.

rownames_to_column(attStats(boruta_randForest), var = "variables")
                 variables    meanImp  medianImp     minImp    maxImp
1                     hour 98.0717172 97.9130486 93.9123303 100.83349
2            temperature_c 36.8423986 36.9185862 34.9602356  37.80734
3         humidity_percent 45.3167087 45.1593932 42.1440836  48.57834
4           wind_speed_m_s 23.6467304 23.6905421 21.9460428  25.25994
5           visibility_10m 29.5007216 29.2028510 27.8115254  32.50643
6  dew_point_temperature_c 28.6002636 28.6356678 27.5923997  29.28198
7    solar_radiation_mj_m2 41.2940363 41.2602276 38.2420791  43.24073
8              rainfall_mm 39.3291499 39.5970634 36.2640944  40.56487
9              snowfall_cm  9.9727469  9.9544285  9.5916183  10.84924
10                 seasons 25.6104768 25.4314202 25.0328769  26.81662
11                 holiday 14.5492481 14.6347796 12.9727474  15.72155
12         functioning_day 77.4296628 77.6273986 73.8235309  80.21149
13                    year 11.7275931 11.6575018 11.1935009  12.69956
14                   month 27.9184309 28.1075324 26.3403225  30.51047
15                 weekday 34.3243762 34.4392173 32.3535620  36.53955
16            random_noise  0.4028639  0.2574984 -0.5417463   2.12460
   normHits  decision
1         1 Confirmed
2         1 Confirmed
3         1 Confirmed
4         1 Confirmed
5         1 Confirmed
6         1 Confirmed
7         1 Confirmed
8         1 Confirmed
9         1 Confirmed
10        1 Confirmed
11        1 Confirmed
12        1 Confirmed
13        1 Confirmed
14        1 Confirmed
15        1 Confirmed
16        0  Rejected

Once the feature names are extracted, we could pass the data frame to ggplot to visualize the results.

rownames_to_column(attStats(boruta_randForest), var = "variables") %>% 
  ggplot(aes(fct_reorder(variables, medianImp), medianImp, color = decision)) +
  geom_point() +
  geom_errorbar(aes(ymin = minImp, ymax = maxImp, variables)) +
  coord_flip()

Next, I will visualize the relationship between the target variable and two of the features.

First, I will look at how the hour affects the rental bike count. It seems there are more bike rentals in the evening.

ggplot(df, aes(hour, rented_bike_count)) +
  geom_boxplot() +
  labs(title = "Rental Bike Count vs Hour") +
  theme_minimal()

If we plot the rental bike count against the temperature, we note that the rental bike count increases as the temperature increases. The count peaks at around 23 degrees Celsius and starts to drop as the temperature continues to rise.

I guess people wouldn’t want to cycle when the weather is too cold or too hot.

ggplot(df, aes(temperature_c, rented_bike_count)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  labs(title = "Rental Bike Count vs Temperature")

Model Fitting

We could extract the selected features as the model formula.

formula_final <- getConfirmedFormula(boruta_randForest)

Another function is getNonRejectedFormula, which also includes the variables that fall under the “Tentative” category.
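As a side note, if some attributes remain “Tentative” after the run (none did in this example), the Boruta package provides TentativeRoughFix, which, to my understanding, resolves them by comparing each tentative attribute’s median importance against the best shadow. A short sketch, reusing the boruta_randForest object from above:

```r
# Resolve any remaining "Tentative" attributes; if there are none,
# the function warns and returns the original object unchanged.
boruta_fixed <- TentativeRoughFix(boruta_randForest)

# The non-rejected formula can then be built from the resolved object.
formula_nonrejected <- getNonRejectedFormula(boruta_fixed)
```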

The extracted formula can then be passed to the model-fitting function.

ranger_fit <-
  ranger(formula_final
         ,data = df
         ,importance = "permutation")

Next, I will extract the variable importance from the fitted model.

rownames_to_column(as.data.frame(importance(ranger_fit)), var = "variables") %>%
  rename(importance = `importance(ranger_fit)`) %>% 
  ggplot(aes(importance, fct_reorder(variables, importance))) +
  geom_col() +
  ylab("Features") +
  labs(title = "Variable Importance") +
  theme_minimal()

There are some differences in the importance between Boruta and ranger. This is likely due to differences in how the importance scores are calculated.

Also, note that the rejected variables are not used in fitting the model.

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

Photo by Gonzalo Mendiola

References

Brownlee, Jason. 2020. “How to Choose a Feature Selection Method for Machine Learning.” https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/#:~:text=Feature%20selection%20is%20the%20process,the%20performance%20of%20the%20model.
Kursa, Miron B. 2022. “Boruta for Those in a Hurry.” https://cran.r-project.org/web/packages/Boruta/vignettes/inahurry.pdf.