The rsample package is designed to create training and test datasets. Creating a test dataset is important for estimating how a trained model will likely perform on new data. It also guards against overfitting, where a model memorizes patterns that exist only in the training data and performs poorly on new data.
In this exercise, you will create training and test datasets from the home_sales data. This data contains information on homes sold in the Seattle, Washington area between 2015 and 2016.
The outcome variable in this data is selling_price.
The tidymodels package will be pre-loaded in every exercise in the course. The home_sales tibble has also been loaded for you.
home_sales <- readRDS("home_sales.rds")

# Create a data split object
home_split <- initial_split(home_sales,
                            prop = 0.7,
                            strata = selling_price)

# Create the training data
home_training <- home_split %>%
  training()

# Create the test data
home_test <- home_split %>%
  testing()

# Check number of rows in each dataset
nrow(home_training)
[1] 1042
nrow(home_test)
[1] 450
Distribution of outcome variable values
Stratifying by the outcome variable when generating training and test datasets ensures that the outcome variable values have a similar range in both datasets.
Since the original data is split at random, stratification avoids placing all the expensive homes in home_sales into the test dataset, for example. In this case, your model would most likely perform poorly because it was trained on less expensive homes.
In this exercise, you will calculate summary statistics for the selling_price variable in the training and test datasets. The home_training and home_test tibbles have been loaded from the previous exercise.
# Distribution of selling_price in training data
library(knitr)

summary_func <- function(df) {
  df %>%
    summarize(min_sell_price = min(selling_price),
              max_sell_price = max(selling_price),
              mean_sell_price = mean(selling_price),
              sd_sell_price = sd(selling_price)) %>%
    kable()
}

home_training %>%
  summary_func()
min_sell_price   max_sell_price   mean_sell_price   sd_sell_price
        350000           650000          479358.7        81534.16
home_test %>%
  summary_func()
min_sell_price   max_sell_price   mean_sell_price   sd_sell_price
        350000           650000          478449.5        79764.72
Fitting a linear regression model
The parsnip package provides a unified syntax for the model fitting process in R.
With parsnip, it is easy to define models using the various packages, or engines, that exist in the R ecosystem.
In this exercise, you will define a parsnip linear regression object and train your model to predict selling_price using home_age and sqft_living as predictor variables from the home_sales data.
The home_training and home_test tibbles that you created in the previous lesson have been loaded into this session.
# Specify a linear regression model, linear_model
linear_model <- linear_reg() %>%
  # Set the model engine
  set_engine('lm') %>%
  # Set the model mode
  set_mode('regression')

# Train the model with the training data
lm_fit <- linear_model %>%
  fit(selling_price ~ home_age + sqft_living,
      data = home_training)

# Print lm_fit to view model information
tidy(lm_fit) %>%
  kable()
term          estimate      std.error    statistic   p.value
(Intercept)   292350.3425   7625.56910   38.338167   0
home_age      -1616.2020    176.93455    -9.134463   0
sqft_living   103.2897      2.77595      37.208781   0
Predicting home selling prices
After fitting a model using the training data, the next step is to use it to make predictions on the test dataset. The test dataset acts as a new source of data for the model and will allow you to evaluate how well it performs.
Before you can evaluate model performance, you must add your predictions to the test dataset.
In this exercise, you will use your trained model, lm_fit, to predict selling_price in the home_test dataset.
Your trained model, lm_fit, and the test dataset, home_test, have been loaded into your session.
# Predict selling_price
home_predictions <- predict(lm_fit,
                            new_data = home_test)

# View predicted selling prices
# home_predictions

# Combine test data with predictions
home_test_results <- home_test %>%
  select(selling_price, home_age, sqft_living) %>%
  bind_cols(home_predictions)

# View results
home_test_results %>%
  head()
Evaluating model results is an important step in the modeling process. Model evaluation should be done on the test dataset in order to see how well a model will generalize to new datasets.
In the previous exercise, you trained a linear regression model to predict selling_price using home_age and sqft_living as predictor variables. You then created the home_test_results tibble using your trained model on the home_test data.
In this exercise, you will calculate the RMSE and R squared metrics using your results in home_test_results.
The home_test_results tibble has been loaded into your session.
# Calculate the RMSE metric
home_test_results %>%
  rmse(truth = selling_price, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      45930.
# Calculate the R squared metric
home_test_results %>%
  rsq(truth = selling_price, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.670
R squared plot
In the previous exercise, you got an R squared value of 0.670. The R squared metric ranges from 0 to 1, with 0 being the worst and 1 the best.
Calculating the R squared value is only the first step in studying your model’s predictions.
Making an R squared plot is extremely important because it will uncover potential problems with your model, such as non-linear patterns or regions where your model is either over or under-predicting the outcome variable.
In this exercise, you will create an R squared plot of your model’s performance.
The home_test_results tibble has been loaded into your session.
# Create an R squared plot of model performance
ggplot(home_test_results, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred() +
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')
Complete model fitting process with last_fit()
In this exercise, you will train and evaluate the performance of a linear regression model that predicts selling_price using all the predictors available in the home_sales tibble.
This exercise will give you a chance to perform the entire model fitting process with tidymodels, from defining your model object to evaluating its performance on the test data.
Earlier in the chapter, you created an rsample object called home_split by passing the home_sales tibble into initial_split(). The home_split object contains the instructions for randomly splitting home_sales into training and test sets.
The home_sales tibble and home_split object have been loaded into this session.
# Define a linear regression model
linear_model <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression')

# Train linear_model with last_fit()
linear_fit <- linear_model %>%
  last_fit(selling_price ~ ., split = home_split)

# Collect predictions and view results
predictions_df <- linear_fit %>%
  collect_predictions()

predictions_df %>%
  head()
# Make an R squared plot using predictions_df
ggplot(predictions_df, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred() +
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')
Data resampling
The first step in a machine learning project is to create training and test datasets for model fitting and evaluation. The test dataset provides an estimate of how your model will perform on new data and helps to guard against overfitting.
You will be working with the telecom_df dataset which contains information on customers of a telecommunications company. The outcome variable is canceled_service and it records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.
The telecom_df tibble has been loaded into your session.
telecom_df <- readRDS("telecom_df.rds")

# Create data split object
telecom_split <- initial_split(telecom_df,
                               prop = 0.75,
                               strata = canceled_service)

# Create the training data
telecom_training <- telecom_split %>%
  training()

# Create the test data
telecom_test <- telecom_split %>%
  testing()

# Check the number of rows
nrow(telecom_training)
[1] 731
nrow(telecom_test)
[1] 244
Fitting a logistic regression model
In addition to regression models, the parsnip package also provides a general interface to classification models in R.
In this exercise, you will define a parsnip logistic regression object and train your model to predict canceled_service using avg_call_mins, avg_intl_mins, and monthly_charges as predictor variables from the telecom_df data.
The telecom_training and telecom_test tibbles that you created in the previous lesson have been loaded into this session.
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  # Set the engine
  set_engine('glm') %>%
  # Set the mode
  set_mode('classification')

# Fit to training data
logistic_fit <- logistic_model %>%
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
      data = telecom_training)

# Print model fit object
logistic_fit %>%
  tidy()
Evaluating your model’s performance on the test dataset gives insights into how well your model predicts on new data sources. These insights will help you communicate your model’s value in solving problems or improving decision making.
Before you can calculate classification metrics such as sensitivity or specificity, you must create a results tibble with the required columns for yardstick metric functions.
In this exercise, you will use your trained model to predict the outcome variable in the telecom_test dataset and combine it with the true outcome values in the canceled_service column.
Your trained model, logistic_fit, and test dataset, telecom_test, have been loaded from the previous exercise.
# Predict outcome categories
class_preds <- predict(logistic_fit,
                       new_data = telecom_test,
                       type = 'class')

# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit,
                      new_data = telecom_test,
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test %>%
  select(canceled_service) %>%
  bind_cols(class_preds, prob_preds)

# View results tibble
telecom_results %>%
  head()
# A tibble: 6 × 4
canceled_service .pred_class .pred_yes .pred_no
<fct> <fct> <dbl> <dbl>
1 yes no 0.136 0.864
2 yes yes 0.792 0.208
3 no no 0.171 0.829
4 no yes 0.508 0.492
5 yes no 0.371 0.629
6 no no 0.139 0.861
Evaluating performance with yardstick
In the previous exercise, you calculated classification metrics from a sample confusion matrix. The yardstick package was designed to automate this process.
For classification models, yardstick functions require a tibble of model results as the first argument. This should include the actual outcome values, predicted outcome values, and estimated probabilities for each value of the outcome variable.
In this exercise, you will use the results from your logistic regression model, telecom_results, to calculate performance metrics.
The telecom_results tibble has been loaded into your session.
# Calculate the confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class)
          Truth
Prediction yes  no
       yes  31  22
       no   51 140
# Calculate the accuracy
accuracy(telecom_results,
         truth = canceled_service,
         estimate = .pred_class)
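From the confusion matrix above, accuracy is the proportion of correct predictions: (31 + 140) / 244 ≈ 0.70, so roughly 70% of the customers in the test set are classified correctly.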
The yardstick package also provides the ability to create custom sets of model metrics. In cases where the cost of obtaining false negative errors is different from the cost of false positive errors, it may be important to examine a specific set of performance metrics.
Instead of calculating accuracy, sensitivity, and specificity separately, you can create your own metric function that calculates all three at the same time.
In this exercise, you will use the results from your logistic regression model, telecom_results, to calculate a custom set of performance metrics. You will also use a confusion matrix to calculate all available binary classification metrics in tidymodels all at once.
The telecom_results tibble has been loaded into your session.
# Create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)

# Calculate metrics using model results tibble
telecom_metrics(telecom_results,
                truth = canceled_service,
                estimate = .pred_class)
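The code for calculating all available binary classification metrics at once is not shown above. A minimal sketch of that step, using yardstick's summary() method for conf_mat objects, might look like this:

# Pass a confusion matrix to summary() to calculate
# all available binary classification metrics at once
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>%
  summary()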
Calculating performance metrics with the yardstick package provides insight into how well a classification model is performing on the test dataset. Most yardstick functions return a single number that summarizes classification performance.
Many times, it is helpful to create visualizations of the confusion matrix to more easily communicate your results.
In this exercise, you will make a heat map and mosaic plot of the confusion matrix from your logistic regression model on the telecom_df dataset.
Your model results tibble, telecom_results, has been loaded into your session.
# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>%
  # Create a heat map
  autoplot(type = "heatmap")
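The exercise also calls for a mosaic plot of the confusion matrix, which is not shown above. A sketch of that step, using the same conf_mat() output with autoplot(type = "mosaic"), would be:

# Create a mosaic plot of the confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>%
  autoplot(type = "mosaic")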
ROC curves are used to visualize the performance of a classification model across a range of probability thresholds. An ROC curve with the majority of points near the upper left corner of the plot indicates that a classification model is able to correctly predict both the positive and negative outcomes across a wide range of probability thresholds.
The area under this curve provides a letter grade summary of model performance.
In this exercise, you will create an ROC curve from your logistic regression model results and calculate the area under the ROC curve with yardstick.
Your model results tibble, telecom_results, has been loaded into your session.
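The code for this exercise is not shown above. A minimal sketch with yardstick's roc_curve() and roc_auc(), assuming the estimated probability of the positive class is stored in the .pred_yes column (as in the telecom_results output shown earlier), would be:

# Plot the ROC curve
telecom_results %>%
  roc_curve(truth = canceled_service, .pred_yes) %>%
  autoplot()

# Calculate the area under the ROC curve
telecom_results %>%
  roc_auc(truth = canceled_service, .pred_yes)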
Streamlining the modeling process
The last_fit() function is designed to streamline the modeling workflow in tidymodels. Instead of training your model on the training data and building a results tibble using the test data, last_fit() accomplishes this with one function.
In this exercise, you will train the same logistic regression model as you fit in the previous exercises, except with the last_fit() function.
Your data split object, telecom_split, and model specification, logistic_model, have been loaded into your session.
# Train model with last_fit()
telecom_last_fit <- logistic_model %>%
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

# View test set metrics
telecom_last_fit %>%
  collect_metrics()
Collecting predictions and creating custom metrics
Using the last_fit() modeling workflow also saves time in collecting model predictions. Instead of manually creating a tibble of model results, there are helper functions that extract this information automatically.
In this exercise, you will use your trained model, telecom_last_fit, to create a tibble of model results on the test dataset as well as calculate custom performance metrics.
Your trained model, telecom_last_fit, has been loaded into this session.
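The code for this step is not included above. A sketch of one way to do it, using collect_predictions() and a custom metric set (the object names last_fit_results and last_fit_metrics are chosen here for illustration), might look like this:

# Collect test set predictions from the last_fit() object
last_fit_results <- telecom_last_fit %>%
  collect_predictions()

# Create a custom metric set
last_fit_metrics <- metric_set(accuracy, sens, spec, roc_auc)

# Calculate metrics on the test set results
last_fit_metrics(last_fit_results,
                 truth = canceled_service,
                 estimate = .pred_class,
                 .pred_yes)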
In this exercise, you will use the last_fit() function to train a logistic regression model and evaluate its performance on the test data by assessing the ROC curve and the area under the ROC curve.
Similar to previous exercises, you will predict canceled_service in the telecom_df data, but with an additional predictor variable to see if you can improve model performance.
The telecom_df tibble, telecom_split, and logistic_model objects from the previous exercises have been loaded into your workspace. The telecom_split object contains the instructions for randomly splitting the telecom_df tibble into training and test sets. The logistic_model object is a parsnip specification of a logistic regression model.
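No code is shown for this final exercise. A sketch of the workflow follows; the extra predictor is written as a placeholder column, additional_predictor, since the exercise does not name it, and the object names telecom_roc_fit and telecom_roc_results are chosen for illustration:

# Train the model with last_fit(), adding one more predictor
# (additional_predictor is a placeholder; substitute the actual column name)
telecom_roc_fit <- logistic_model %>%
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins +
             monthly_charges + additional_predictor,
           split = telecom_split)

# Collect test set predictions
telecom_roc_results <- telecom_roc_fit %>%
  collect_predictions()

# Plot the ROC curve
telecom_roc_results %>%
  roc_curve(truth = canceled_service, .pred_yes) %>%
  autoplot()

# Calculate the area under the ROC curve
telecom_roc_results %>%
  roc_auc(truth = canceled_service, .pred_yes)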