Analyze the Arena Summary

The Arena page is an exploratory page that does not require any specific action. It provides valuable insight by letting you analyze the evaluation metrics and features gathered across models and targets during the model arena.

The Arena page consists of different views (Accuracy, Impact, Explanation, and CV Strategy). To select a view, click on its button at the top of the page.

For all views available on the Arena page, use Top Models Visible, next to the view buttons, to filter the models shown in the Leaderboard pane. This displays the training results for the selected number of models. Ranked models outside the selected number show (Not Visible) in the Rank.

NOTE: For the Impact and Explanation views, the message "No impact data exists for this selection" appears when there is no impact data available for the project.

Arena Accuracy View

The Accuracy view shows the model metrics, predictions, and prediction intervals (if configured) in different model stages. Any target that is a part of the model build can have its models examined in this view.

This helps to answer questions such as:

  • Which type of model has the best accuracy for a given target?

  • By how much did the model win?

It also helps you understand how closely the forecasted values overlay the actuals in the line chart, which can provide answers to questions such as:

  • Are there spikes that aren’t being caught by the forecasts?

  • If so, could adding any events help catch these spikes?

NOTE: Error metric scores do not dynamically adjust based on the time period specified by the range slider at the bottom of the page.

The table on this page displays:

  • The model algorithms run for a given target (such as XGBoost, CatBoost, or Shift).

  • The type of model algorithm (ML, Statistical, or Baseline).

  • The evaluation metric (such as Mean Squared Error, Mean Absolute Error, or Mean Absolute Percentage Error) and the associated score (see the sketch after this list).

  • The train time (how long it took to train the given model during Pipeline).
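
For reference, these error metrics compare forecasted values to actuals. The following is a minimal sketch of how such scores are typically computed; the example numbers are invented and the functions are not the engine's internal implementations.

```python
import numpy as np

def mse(actuals, preds):
    """Mean Squared Error: penalizes large misses more heavily."""
    actuals, preds = np.asarray(actuals, float), np.asarray(preds, float)
    return np.mean((actuals - preds) ** 2)

def mae(actuals, preds):
    """Mean Absolute Error: the average miss, in the target's own units."""
    actuals, preds = np.asarray(actuals, float), np.asarray(preds, float)
    return np.mean(np.abs(actuals - preds))

def mape(actuals, preds):
    """Mean Absolute Percentage Error: the average miss as a percentage of the actuals."""
    actuals, preds = np.asarray(actuals, float), np.asarray(preds, float)
    return np.mean(np.abs((actuals - preds) / actuals)) * 100

actuals = [120, 135, 150, 110]   # invented actuals for one target
preds = [118, 140, 144, 115]     # invented forecasts from one model
print(mse(actuals, preds), mae(actuals, preds), mape(actuals, preds))
```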

With all the configurations and the newly found important features, the engine runs multiple models per configuration to find the best one. This process involves tuning the hyperparameters of each model on multiple splits of the data and then saving each model's accuracy metrics.
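
The exact tuning logic is internal to the engine, but conceptually it resembles the sketch below: each candidate hyperparameter set is fit on the train portion of each split, scored on the validation portion, and the resulting metrics are saved for comparison. The function names, parameter grid handling, and split structure here are illustrative assumptions, not the engine's API.

```python
from itertools import product
import numpy as np

def evaluate(model_factory, params, splits, metric):
    """Fit one hyperparameter set on each split's train portion and score it
    on that split's validation portion; return the average score."""
    scores = []
    for train, valid in splits:
        model = model_factory(**params)
        model.fit(train["X"], train["y"])
        preds = model.predict(valid["X"])
        scores.append(metric(valid["y"], preds))
    return float(np.mean(scores))

def tune(model_factory, param_grid, splits, metric):
    """Try every hyperparameter combination in the grid and keep each one's score."""
    results = []
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        results.append((params, evaluate(model_factory, params, splits, metric)))
    return results
```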

To analyze the training results, click a target in the Targets pane to see the accuracy metrics of each of its deployed models, as if they had been implemented over the course of its past data. Each model listed shows its name and category, along with the type of evaluation metric used and the evaluation metric score. Select a model name in the models list to view a line chart that shows how close the forecasted values are to the actuals.

The line chart corresponds to the highlighted model in the table. It visualizes both the predictions made for the historical actual test period (blue) and the historical actuals (orange). The time frame in this chart is only a subset of the total time frame for the historical data, as this time frame is for a specific portion of a split.

At a high level, this page shows how the best version of each model (such as XGBoost, CatBoost, ExponentialSmoothing, Shift, and Mean) has performed against unseen historical data for each target, and more specifically, on the test set of the historical data.

The optimal error metric score can be the lowest score, the highest score, or the closest to zero, depending on the type of metric. See Appendix 3: Error Metrics for more information about the error metrics Sensible Machine Learning uses.
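
For example, lower scores are better for Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error, while a metric such as R-squared is better when higher. The sketch below shows how a winning model could be picked once each model's score and the metric's direction are known; the leaderboard rows are invented.

```python
# Hypothetical leaderboard rows: (model name, metric name, score).
leaderboard = [
    ("XGBoost", "MAPE", 7.2),
    ("CatBoost", "MAPE", 6.8),
    ("Shift", "MAPE", 12.5),
]

# Direction of "better" per metric; lower is better for the error metrics above.
LOWER_IS_BETTER = {"MSE", "MAE", "MAPE"}

def best_model(rows):
    metric = rows[0][1]
    reverse = metric not in LOWER_IS_BETTER  # sort descending if higher is better
    return sorted(rows, key=lambda r: r[2], reverse=reverse)[0]

print(best_model(leaderboard))  # ('CatBoost', 'MAPE', 6.8)
```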

Arena Impact View

The Impact view shows the same information as the Accuracy view but also includes the feature impact scores for different models. Any target that is a part of the model build can have its models examined in this view. The feature impact score shows how much influence a feature had on a given model's predictions.

NOTE: Feature impact data is dependent on the type of model. Not all models have feature impact data.
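
The documentation does not state how the impact scores are computed. As a hedged illustration of the general idea, the sketch below uses scikit-learn's permutation importance, a model-agnostic way to estimate how much each feature influences a fitted model, on an invented data set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Invented example data: two informative features and one noise feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, score in zip(["price", "promo", "noise"], result.importances_mean):
    print(f"{name}: {score:.3f}")  # larger score = more influence on predictions
```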

Arena Explanation View

The Explanation view shows the model metrics, predictions, and prediction intervals (if configured) in different model stages similar to the Accuracy view. The Explanation view also includes the prediction explanations of the models. Select a model from the Leaderboard grid to see its prediction explanations in a Tug of War plot on the right side. For each data point, this plot shows the features that had the largest magnitude effect (negative or positive) on the prediction for the displayed date. Any target that is a part of the model build can have its models examined in this view.

TIP: To drill down into a specific date, double-click the date in the Tug of War plot to see a feature-by-feature view of prediction explanations for that date.

NOTE: Only models that use features have feature explanation data.
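
The engine's explanation method is not described here, but the "tug of war" idea is that per-feature contributions push a single prediction above or below a baseline, and the largest-magnitude contributions matter most. The sketch below illustrates that idea for a plain linear model, where each contribution is simply coefficient times feature value; the model and feature names are hypothetical.

```python
import numpy as np

# Hypothetical fitted linear model: prediction = intercept + sum(coef * feature).
feature_names = ["price", "promo_flag", "temperature"]
coefs = np.array([-4.0, 25.0, 1.5])
intercept = 100.0

x = np.array([3.0, 1.0, -8.0])    # feature values for one displayed date
contributions = coefs * x          # per-feature push on the prediction
prediction = intercept + contributions.sum()

# Rank features by the magnitude of their effect, as a tug-of-war plot would.
order = np.argsort(-np.abs(contributions))
for i in order:
    sign = "+" if contributions[i] >= 0 else "-"
    print(f"{feature_names[i]}: {sign}{abs(contributions[i]):.1f}")
print(f"prediction: {prediction:.1f}")
```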

Arena CV Strategy View

The CV Strategy view shows how the splits were used in each model stage and the portions of each split. Select the Model stage using the Cross Validation Usage drop-down; split usage then displays in the Cross Validation Splits Chart.

The number of splits and size of each portion of the splits can be configured when you Set Modeling Options. A description of how each portion of the split is used follows.

Train Set: The split of historical data on which the models initially train and learn patterns, seasonality, and trends.

Validation Set: The split of historical data on which the optimal hyperparameter set is selected for each model, if applicable. A model makes predictions on the validation set time period for each hyperparameter iteration. The hyperparameter set with the best error metric when comparing predictions to the actuals in the validation set time period is selected. This split does not occur when the historical data set does not have enough data points.

Test Set: The split of historical data used to select the best model algorithm compared to the others. For example, an XGBoost model gets ranked higher than a baseline model based on evaluation metric score. This split does not occur when the historical data set does not have enough data points.

Holdout Set: The split of historical data used to simulate live performance for the model algorithms. This is the truest test of model accuracy. This set can also serve as a check for overfit models. This split does not occur when the historical data set does not have enough data points.
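
To make the four portions concrete, the sketch below partitions a daily history chronologically into train, validation, test, and holdout sets. The proportions and dates are arbitrary assumptions, not the engine's defaults.

```python
import pandas as pd

# Hypothetical daily history of 365 points, oldest first.
dates = pd.date_range("2023-01-01", periods=365, freq="D")

def chronological_split(index, train=0.6, validation=0.15, test=0.15):
    """Split a time index into train/validation/test/holdout, in time order."""
    n = len(index)
    t = int(n * train)
    v = t + int(n * validation)
    s = v + int(n * test)
    return {
        "train": index[:t],        # learn patterns, seasonality, and trends
        "validation": index[t:v],  # select the optimal hyperparameter set
        "test": index[v:s],        # rank model algorithms against each other
        "holdout": index[s:],      # simulate live performance
    }

for name, part in chronological_split(dates).items():
    print(f"{name}: {part[0].date()} to {part[-1].date()} ({len(part)} days)")
```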