Machine Learning on Time Series

Jasper Lok https://jasperlok.netlify.app/
07-02-2024

Recently, I came across these time series packages, modeltime and workflowsets, and thought of trying them out to see how they work.

Hence, I will jump straight into the demonstration.


Demonstration

Setup the environment

First, I will set up the environment by calling all the relevant packages.

pacman::p_load(tidyverse, tidymodels, statmod, modeltime, janitor, tsibble, timetk, workflowsets)

Import the data

In this demonstration, I will be using the Seoul bike-sharing dataset.

The relevant data can be downloaded from the UCI Machine Learning Repository website.


df <-
  read_csv("https://raw.githubusercontent.com/jasperlok/my-blog/master/_posts/2023-09-23-boruta/data/SeoulBikeData.csv"
           ,locale = locale(encoding = "latin1")) %>% 
  clean_names() %>% 
  mutate(date = dmy(date)) %>% 
  group_by(date) %>%
  summarise(rented_bike_count_tot = sum(rented_bike_count)
            ,temperature_c_avg = mean(temperature_c)
            ,humidity_percent_avg = mean(humidity_percent)
            ,wind_speed_m_s_avg = mean(wind_speed_m_s)
            ,visibility_10m_avg = mean(visibility_10m)
            ,dew_point_temperature_c_avg = mean(dew_point_temperature_c)
            ,solar_radiation_mj_m2_avg = mean(solar_radiation_mj_m2)
            ,rainfall_mm_avg = mean(rainfall_mm)
            ,snowfall_cm_avg = mean(snowfall_cm)
            ,holiday_first = first(holiday))

We can visualize the time series by using the plot_time_series function, which produces an interactive chart.

df %>% 
  plot_time_series(date, rented_bike_count_tot)

Model building

First, I will split the dataset into training and testing datasets.

Note that as this is a time series dataset, we cannot use the usual random splitting; we split by time instead.

df_splits <-
  initial_time_split(df, prop = 0.6)
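As a quick sanity check (assuming the data is already sorted by date, which it is after the group_by/summarise step above), we can confirm that the training window ends before the testing window begins.

```r
# The training period should end strictly before the testing period starts
range(training(df_splits)$date)
range(testing(df_splits)$date)
```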

Fitting single model

Next, I will build an ARIMA model using the functions from the modeltime package.

arima_fit <-
  arima_reg() %>% 
  set_engine(engine = "auto_arima") %>% 
  fit(rented_bike_count_tot ~ 
        date 
      + temperature_c_avg
      + humidity_percent_avg
      + wind_speed_m_s_avg
      + visibility_10m_avg
      + dew_point_temperature_c_avg
      + solar_radiation_mj_m2_avg
      + rainfall_mm_avg
      + snowfall_cm_avg
      + holiday_first
      ,data = training(df_splits))

The cool thing about this package is that its syntax follows the same pattern as the tidymodels packages.
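For comparison, here is what the same spec-engine-fit pattern looks like for an ordinary linear regression in tidymodels (an illustrative sketch, not part of the original analysis).

```r
# A plain linear regression follows the same
# model spec -> set_engine() -> fit() pattern as arima_reg() above
lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(rented_bike_count_tot ~ temperature_c_avg
      ,data = training(df_splits))
```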

Then, I will pass the testing data into the modeltime_calibrate function to evaluate how well this model performs.

calibration_df <-
  arima_fit %>% 
  modeltime_calibrate(new_data = testing(df_splits))
calibration_df %>%
    modeltime_accuracy()
# A tibble: 1 × 9
  .model_id .model_desc      .type   mae  mape  mase smape  rmse   rsq
      <int> <chr>            <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1         1 REGRESSION WITH… Test  6401.   Inf  1.02  35.3 9191. 0.232

Oof, the model doesn’t seem to perform very well.

Calibrate multiple models

Instead of fitting models one by one, we could fit several models at one time.

First, I will create the general formula for all the models I will be building later.

gen_formula <- 
   as.formula("rented_bike_count_tot ~ date + temperature_c_avg + humidity_percent_avg + wind_speed_m_s_avg + visibility_10m_avg + dew_point_temperature_c_avg + solar_radiation_mj_m2_avg + rainfall_mm_avg + snowfall_cm_avg + holiday_first")

Next, I will define the models to be built.

# ARIMA
arima_fit <-
  arima_reg() %>% 
  set_engine(engine = "auto_arima") %>% 
  fit(gen_formula
      ,data = training(df_splits))

# prophet
prophet_fit <- 
  prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(gen_formula
        ,data = training(df_splits))

# boosted tree (xgboost)
prophet_boost_fit <- 
  boost_tree("regression") %>%
    set_engine(engine = "xgboost") %>%
    fit(gen_formula
        ,data = training(df_splits))

# seasonal
seasonal_fit <- 
  seasonal_reg() %>%
    set_engine(engine = "stlm_ets") %>%
    fit(gen_formula
        ,data = training(df_splits))

Then, I will combine the models using the modeltime_table function, which assigns an ID and a description to each model.

model_tbl <-
  modeltime_table(
    arima_fit
    ,prophet_fit
    ,prophet_boost_fit
    ,seasonal_fit
  )

model_tbl
# Modeltime Table
# A tibble: 4 × 3
  .model_id .model   .model_desc                                  
      <int> <list>   <chr>                                        
1         1 <fit[+]> REGRESSION WITH ARIMA(1,1,2)(0,0,1)[7] ERRORS
2         2 <fit[+]> PROPHET W/ REGRESSORS                        
3         3 <fit[+]> XGBOOST                                      
4         4 <fit[+]> SEASONAL DECOMP: ETS(A,N,N)                  

Next, I will measure the performance of each model.

model_calibrate <-
  model_tbl %>% 
  modeltime_calibrate(new_data = testing(df_splits))

model_calibrate %>% 
  modeltime_accuracy()
# A tibble: 4 × 9
  .model_id .model_desc  .type    mae  mape  mase smape   rmse     rsq
      <int> <chr>        <chr>  <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>
1         1 REGRESSION … Test   6401.   Inf  1.02  35.3  9191. 0.232  
2         2 PROPHET W/ … Test  12064.   Inf  1.92  52.1 14101. 0.235  
3         3 XGBOOST      Test   6595.   Inf  1.05  40.7  8910. 0.245  
4         4 SEASONAL DE… Test   7134.   Inf  1.13  38.5 10149. 0.00214

We can even pass the model performance to the table_modeltime_accuracy function to display it in a table format.

model_calibrate %>% 
  modeltime_accuracy() %>% 
  table_modeltime_accuracy(
    .interactive = TRUE
    )

Next, I will extract the predictions from each model and compare them to the actual data.

model_calibrate %>%
    modeltime_forecast(
        new_data    = testing(df_splits),
        actual_data = df
    )
# A tibble: 949 × 7
   .model_id .model_desc .key   .index     .value .conf_lo .conf_hi
       <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl>
 1        NA ACTUAL      actual 2017-12-01   9539       NA       NA
 2        NA ACTUAL      actual 2017-12-02   8523       NA       NA
 3        NA ACTUAL      actual 2017-12-03   7222       NA       NA
 4        NA ACTUAL      actual 2017-12-04   8729       NA       NA
 5        NA ACTUAL      actual 2017-12-05   8307       NA       NA
 6        NA ACTUAL      actual 2017-12-06   6669       NA       NA
 7        NA ACTUAL      actual 2017-12-07   8549       NA       NA
 8        NA ACTUAL      actual 2017-12-08   8032       NA       NA
 9        NA ACTUAL      actual 2017-12-09   7233       NA       NA
10        NA ACTUAL      actual 2017-12-10   3453       NA       NA
# ℹ 939 more rows

I will visualize the results as well.

model_calibrate %>%
    modeltime_forecast(
        new_data    = testing(df_splits),
        actual_data = df
    )  %>%
    plot_modeltime_forecast()

As shown in the chart, the seasonal decomposition model performs worse than the other models.

Another method of fitting multiple models: the workflowsets package

First, I will define the recipe to be used in fitting the model.

gen_recipe <-
  recipe(rented_bike_count_tot ~ date + temperature_c_avg + humidity_percent_avg + wind_speed_m_s_avg + visibility_10m_avg + dew_point_temperature_c_avg + solar_radiation_mj_m2_avg + rainfall_mm_avg + snowfall_cm_avg + holiday_first
         ,data = training(df_splits))

Then, I will define the models to be fitted.

wf_arima <-
  arima_reg() %>%
  set_engine(engine = "auto_arima")

wf_prophet <-
  prophet_reg() %>%
  set_engine(engine = "prophet")

I will pass all this information into the workflow_set function.

wfsets <-
  workflow_set(
  preproc = list(gen_recipe = gen_recipe)
  ,models = list(arima_fit = wf_arima
                 ,prophet_fit = wf_prophet)
  ,cross = TRUE
  )

Lastly, I will use the modeltime_fit_workflowset function to fit all the models within the workflow set.

wfsets_fit <-
  wfsets %>% 
  modeltime_fit_workflowset(training(df_splits))

wfsets_fit
# Modeltime Table
# A tibble: 2 × 3
  .model_id .model     .model_desc           
      <int> <list>     <chr>                 
1         1 <workflow> GEN_RECIPE_ARIMA_FIT  
2         2 <workflow> GEN_RECIPE_PROPHET_FIT

The steps to evaluate the fitted models are the same as in the previous sub-section.

wfsets_calibrate <-
  wfsets_fit %>% 
  modeltime_calibrate(new_data = testing(df_splits))

wfsets_calibrate %>% 
  modeltime_accuracy()
# A tibble: 2 × 9
  .model_id .model_desc    .type    mae  mape  mase smape   rmse   rsq
      <int> <chr>          <chr>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1         1 GEN_RECIPE_AR… Test   6401.   Inf  1.02  35.3  9191. 0.232
2         2 GEN_RECIPE_PR… Test  12064.   Inf  1.92  52.1 14101. 0.235
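Once we are satisfied with a calibrated model, a typical next step in the modeltime workflow would be to refit it on the full dataset and forecast forward. A sketch, where the 30-day horizon is an arbitrary choice for illustration:

```r
# Refit the calibrated models on the full dataset,
# then forecast 30 days beyond the last observed date
wfsets_refit <-
  wfsets_calibrate %>%
  modeltime_refit(data = df)

wfsets_refit %>%
  modeltime_forecast(h = "30 days", actual_data = df) %>%
  plot_modeltime_forecast()
```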

Conclusion

That’s all for the day!

Thanks for reading the post until the end.

Feel free to contact me through email or LinkedIn if you have any suggestions on future topics to share.

Refer to this link for the blog disclaimer.

Till next time, happy learning!

