Nthread in prophet_xgboost and select_best Computation Speed #258

Open

Zhouyi-Joey opened this issue Dec 2, 2024 · 1 comment
Hi Matt,

I’ve been using your fantastic tools for time-series forecasting and encountered a couple of issues that might need clarification or improvement:
1. When setting set_engine("prophet_xgboost", nthread = 12), the specified number of threads (12) does not appear to be used during computation. Could you confirm whether nthread is fully passed through to the XGBoost backend in this setup?
2. Alternatively, using parallel_start(12, .method = "parallel") enables parallelism for most of the computation steps. However, during the final step, where I run

best_result <- select_best(tune_results, metric = "rmse")

only a single core is used, leading to significant delays for large datasets.
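For what it is worth, this is how I tried to check whether nthread actually reaches the fitted booster (a sketch only; workflow_prophet_boost and splits are defined in the example below, and the slot layout inside the modeltime fit is internal, so it may differ across versions):

# Fit the workflow once outside the tuning loop and inspect the result.
# The tune() placeholders must be finalized with some valid values first.
fitted_wf <- workflow_prophet_boost %>%
    finalize_workflow(tibble::tibble(
        tree_depth = 6, min_n = 10, season = "additive",
        changepoint_num = 25, prior_scale_changepoints = 0.05,
        prior_scale_seasonality = 10, prior_scale_holidays = 10
    )) %>%
    fit(training(splits))

# prophet_xgboost bundles a prophet model and an xgb.Booster; str()
# makes it easy to locate where nthread ended up in the parameters.
engine_fit <- extract_fit_parsnip(fitted_wf)$fit
str(engine_fit, max.level = 2)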

Could you provide guidance on:
1. Ensuring that nthread is fully activated for XGBoost within the prophet_xgboost engine, or
2. Improving parallelization of the select_best step to speed up computation?
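In case it helps, the only other backend setup I have tried besides parallel_start() is a plain foreach/doParallel registration (a sketch; as I understand it, tune runs its resampling loop over whatever foreach backend is registered):

# Sketch: register a 12-worker PSOCK cluster before tune_bayes(),
# then shut it down once tuning is done.
library(doParallel)
cl <- parallel::makePSOCKcluster(12)
registerDoParallel(cl)
# ... run tune_bayes() as in the example below ...
parallel::stopCluster(cl)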

Below is a minimal reproducible example showcasing the issue:

# Load necessary libraries
library(modeltime)      # For time-series forecasting models
library(tidymodels)     # For tidymodels framework
library(workflowsets)   # For creating multiple workflows
library(tidyverse)      # For data manipulation
library(timetk)         # For time series toolkits
library(prophet)        # For Prophet time-series forecasting

# Filter US holidays from a pre-generated holiday dataset
us_holidays <- generated_holidays[generated_holidays$country == "US", ]

# If nthread = 12 is removed and replaced with parallel_start(), it leads to issue 2 described above.
# parallel_start(12, .method = "parallel")

# Load and preprocess data
# Filter specific time-series data
m750 <- m4_monthly %>% filter(id == "M750")

# Ensure the date column is in Date format
m750 <- m750 %>% mutate(date = as.Date(date))

# Add annual statistics: mean and variance
m750 <- m750 %>%
    group_by(year = year(date)) %>%
    mutate(
        annual_mean = mean(value, na.rm = TRUE),
        annual_variance = var(value, na.rm = TRUE)
    ) %>%
    ungroup()

# Add quarterly statistics: mean
m750 <- m750 %>%
    group_by(year = year(date), quarter = quarter(date)) %>%
    mutate(quarter_mean = mean(value, na.rm = TRUE)) %>%
    ungroup()

# Split data into training and testing sets (90% training data)
splits <- initial_time_split(m750, prop = 0.9)

# Define the preprocessing recipe
recipe_spec <- recipe(value ~ ., data = training(splits))

# Prepare and inspect the preprocessed data
recipe_spec %>% prep() %>% juice()

# Define the Prophet + XGBoost model with tunable parameters
prophet_boost_tune <- prophet_boost(
    mode = "regression"
) %>%
    set_engine("prophet_xgboost", holidays = us_holidays, nthread = 12) %>%
    set_args(
        changepoint_range = 0.85,
        trees = 100000,
        tree_depth = tune(),
        learn_rate = 0.01,
        stop_iter = 10,
        min_n = tune(),
        season = tune(),
        changepoint_num = tune(),
        prior_scale_changepoints = tune(),
        prior_scale_seasonality = tune(),
        prior_scale_holidays = tune()
    )

# Set Bayesian optimization control parameters
bayes_control <- control_bayes(
    verbose = TRUE,
    uncertain = 10,
    no_improve = 10,
    parallel_over = "everything", # Allow parallel processing
    save_pred = TRUE
)

# Define the workflow by combining the recipe and model
workflow_prophet_boost <- workflow() %>%
    add_model(prophet_boost_tune) %>%
    add_recipe(recipe_spec)

# Set seed for reproducibility
set.seed(200)

# Define cross-validation folds for time-series
cv_folds <- time_series_cv(
    data = m750,
    date_var = date,
    initial = "20 years",
    assess = "6 months"
)

# Perform Bayesian tuning
tune_results <- tune_bayes(
    workflow_prophet_boost,
    resamples = cv_folds,
    initial = 10,      # Number of initial points in the Bayesian search
    iter = 10,         # Number of iterations for optimization
    control = bayes_control,
    metrics = metric_set(rmse) # Root Mean Square Error as the evaluation metric
)

# Select the best result based on RMSE
best_result <- select_best(tune_results, metric = "rmse")
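To isolate where the time goes, I also timed the final step on its own (a rough check; as I understand it, select_best() only summarises the metrics that tune_bayes() has already collected):

# Rough timing sketch: if select_best() is just summarising collected
# metrics, timing it separately shows whether the delay is really in
# this call or in the preceding tuning step.
system.time(
    best_result <- select_best(tune_results, metric = "rmse")
)
collect_metrics(tune_results)  # inspect the per-candidate metrics directly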

Thank you for your time and incredible contributions to the R and time-series forecasting communities. Your tools have been a game-changer in streamlining complex workflows and making advanced modeling accessible. I deeply appreciate your hard work and dedication in maintaining and improving these packages.

Best regards,
Yi Zhou


Zhouyi-Joey commented Dec 15, 2024

I found a fix for part of the code above: when parallel_start() is used, us_holidays must be exported to the parallel workers (otherwise the engine argument holidays = us_holidays cannot be resolved on the workers). The only change from the example above is the parallel_start() line:

# If nthread = 12 is removed from set_engine(), use parallel_start()
# with .export_vars so the workers can see the holidays data:
parallel_start(12, .method = "parallel", .export_vars = "us_holidays")

The rest of the example is unchanged.
