Impact of Data Points and Data Granularity

The number of data points determines which functionality is available during the model build process. For data sets with coarser granularity, Sensible Machine Learning adjusts so that it does not spend resources on functionality that does not contribute to model performance. For example, Sensible Machine Learning does not run machine learning models on a monthly data set because monthly data contains too few seasonal patterns to learn from. Sensible Machine Learning is optimized to get the best performance out of whatever level of data is provided.
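The relationship between data point count and available functionality can be pictured as a simple lookup. The following is a minimal sketch, not the product's implementation: the `Tier` type, the `tier_for` function, and the boundary handling (lower bound inclusive, upper bound exclusive, nothing defined below 16 points) are assumptions made for illustration. The ranges and flags are taken from the sections that follow.

```python
from typing import NamedTuple, Optional


class Tier(NamedTuple):
    label: str
    min_points: int            # inclusive lower bound (assumption)
    max_points: Optional[int]  # exclusive upper bound, None means unbounded
    features_allowed: bool     # can features be provided or engineered?
    ml_models: bool            # do machine learning models run?


# Ranges follow the section headings in this document.
TIERS = [
    Tier("16 to 36 (quarterly to monthly)", 16, 36, False, False),
    Tier("36 to 80 (monthly to weekly)", 36, 80, False, False),
    Tier("80 to 300 (weekly to daily)", 80, 300, True, True),
    Tier("300+ (daily)", 300, None, True, True),
]


def tier_for(n_points: int) -> Optional[Tier]:
    """Return the tier for a data point count, or None below the smallest range."""
    for tier in TIERS:
        below_upper = tier.max_points is None or n_points < tier.max_points
        if n_points >= tier.min_points and below_upper:
            return tier
    return None


if __name__ == "__main__":
    for n in (24, 60, 150, 730):
        print(n, "->", tier_for(n).label)
```

A lookup table keeps the thresholds in one place, which makes it easy to see that features and machine learning models only become available from 80 data points onward.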

Functional Overview by Data Granularity

XperiFlow Engine Functionality by Data Point Range


16 to 36 Data Points (Quarterly to Monthly)

This is the most restrictive model build pipeline offered. The data is most likely monthly data with only one to three seasonal cycles. All models offered for this data point range are univariate, simple models, meaning they are trend-based models with no features to enhance predictive capability.

Data: Most likely monthly data with little to no seasonality. Typically, only one to three seasonal cycles exist within the number of historical data points. Limited seasonality implies that models trained against this data rely exclusively on learning the underlying trend.
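The "one to three seasonal cycles" figure follows from dividing the number of data points by the number of points in one cycle. A minimal sketch of that arithmetic, where the function name `seasonal_cycles` is hypothetical and a yearly cycle of 12 monthly points is assumed:

```python
def seasonal_cycles(n_points: int, points_per_cycle: int) -> float:
    """Number of complete or partial seasonal cycles covered by the history."""
    if points_per_cycle <= 0:
        raise ValueError("points_per_cycle must be positive")
    return n_points / points_per_cycle


MONTHS_PER_YEAR = 12  # one yearly cycle at monthly granularity

if __name__ == "__main__":
    for n in (16, 24, 36):
        cycles = seasonal_cycles(n, MONTHS_PER_YEAR)
        print(f"{n} monthly points cover {cycles:.2f} yearly cycles")
```

For monthly data, 16 to 36 points cover roughly 1.3 to 3 yearly cycles, which is why models in this range have little repeated seasonality to learn from.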

Functionality: The functionality for this data point range is the most restrictive offered. Functionality is restricted due to low data volumes. Models that can be used with low data volumes do not accept features. Therefore, events, locations, external data, external features, feature engineering, and feature selection cannot be provided or run in Sensible Machine Learning for these data volumes.

In the 16 to 36 data point range, there may be enough data to use a validation set during training to compare model accuracy. This is not ideal, because the same validation set is used for both hyperparameter tuning and selecting the best model during training.

Since there is not enough data that can be held out during training to compare model accuracy, backtest accuracy comparisons are deactivated on the Deploy page. Grouping functionality is also not available since there are no grouping models that can be run.

Features: Features cannot be used for this data point range. This is because there is not enough data to learn any meaningful relationships between any feature and target variable.

Models: Models in this data point range are basic statistical models that extract an underlying trend or basic seasonality. Given the limited amount of data to train on, these models typically forecast conservatively and do not project aggressive increases or decreases in the underlying trend. Because this amount of data typically contains only one to three yearly seasonal cycles, the models struggle to identify seasonality accurately. They typically catch only obvious seasonal patterns that repeat over at least three seasonal cycles.

Models running against this amount of data train much faster than models in larger data point ranges or more complex models.

Model Selection: Model selection can be difficult when there are few data points. No backtest or train-test procedure can validate which models will perform best in production, apart from using the validation set once there are 20 data points.

The best performing model is chosen based on the best fit to the validation data, if available. This implies that models could be overfit to the training data. To deal with this, Sensible Machine Learning selects the models that enter the Model Arena of the model build phase based on the structural characteristics of each target. For example, if the historical data offers little to no repeated seasonality, the seasonal statistical models do not run against that target in the Model Arena.

Selecting the Models to Train: Sensible Machine Learning inspects a target's data patterns to determine which models should be allowed to train. Inspecting the data patterns includes determining whether there is seasonality in the existing underlying data. Sensible Machine Learning only trains models that match the structural data patterns of the data.
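One common way to check whether a series shows repeated seasonality is to remove the trend and measure how strongly the series correlates with itself one seasonal period earlier. The sketch below illustrates that general heuristic only; it is not the inspection Sensible Machine Learning performs. The function names, the use of first differencing, and the 0.5 threshold are all assumptions.

```python
from statistics import mean


def lagged_correlation(values, lag):
    """Pearson correlation between a series and itself shifted by `lag`."""
    a, b = values[:-lag], values[lag:]
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat series carries no seasonal signal
    return cov / (var_a * var_b) ** 0.5


def looks_seasonal(series, period, threshold=0.5):
    """Heuristic: difference out the trend, then test the seasonal lag."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) <= period + 1:
        return False  # not enough history to compare one cycle with the next
    return lagged_correlation(diffs, period) >= threshold


if __name__ == "__main__":
    yearly_pattern = [10, 12, 15, 11, 9, 8, 14, 20, 18, 13, 11, 10]
    seasonal = yearly_pattern * 3                 # 36 monthly points
    trend_only = [2.0 * i for i in range(36)]     # straight line, no seasonality
    print(looks_seasonal(seasonal, period=12))
    print(looks_seasonal(trend_only, period=12))
```

A gate like this would let seasonal models train only on targets where the check passes, while trend-only targets skip them.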

Selecting the Best Trained Model: The best trained model is whichever model best fits the training data, or the validation data if a validation set is available. Because this selection can favor overfit models, the effectiveness of the Selecting the Models to Train process is very important.

Use Cases: This data point range is best associated with long term growth, high-level or long-range planning, or strategic growth use cases, since there are only underlying trends and slight seasonality.

It is recommended to leverage all the models for your targets. This gives downstream users of Sensible Machine Learning the flexibility to choose which model forecasts to use.

36 to 80 Data Points (Monthly to Weekly)

This is the second most restrictive model build pipeline offered. The data is typically monthly or weekly with three or more seasonal cycles. All models offered for this data point range are univariate simple trend and seasonal models. This means these models can detect a single seasonality with an underlying trend. However, there are still not enough underlying data patterns to warrant the use of features.

Data: Most likely monthly or weekly level data with a single seasonality and trend.

Functionality: The same limitations as for the 16 to 36 data point range apply, with a few exceptions.

In the 36 to 80 data point range, there is enough data to use a validation set during training to compare model accuracy. This is not ideal, because the same validation set is used for both hyperparameter tuning and selecting the best model during training. Backtest accuracy comparisons are deactivated on the Deploy page. Grouping functionality is not available since there are no grouping models for this data point range.

Features: Features are not allowed for this data point range. This is because there is not enough data to learn meaningful relationships between any feature and the target variable, since there are minimal data patterns to learn.

Models: Models for this data point range are statistical models that extract underlying trends and some seasonality. With limited data to train on, these models typically forecast conservatively and do not project aggressive increases or decreases in the underlying trend. These models typically only catch seasonal patterns spotted over at least three seasonal cycles.

Models running against this amount of data train very quickly compared to models in larger data point ranges or more complex models.

Model Selection: The difficulties described in the 16 to 36 Data Points (Quarterly to Monthly) section also apply to model selection and Selecting the Best Trained Model in most situations. However, once the number of data points reaches 20 or more, Sensible Machine Learning uses a validation data set for hyperparameter tuning.

Selecting the Models to Train: Sensible Machine Learning inspects the target's data patterns to determine which models should be allowed to train. This includes determining whether there is seasonality or an underlying trend that exists in the data. Sensible Machine Learning only trains models that match the structural data patterns.

Selecting the Best Trained Model: For 20 or fewer data points, the best trained model is whichever model fits the training data most closely, which can mean it is overfit. With more than 20 data points, a validation set is used for hyperparameter tuning, and the model with the lowest error on that validation set is chosen.
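The selection rule described above can be sketched as a time-ordered holdout followed by a lowest-error pick. This is an illustrative sketch only: the function names, the 20 percent validation fraction, and the use of mean absolute error as the error metric are assumptions, not documented behavior. Only the 20 data point threshold comes from the text.

```python
def split_train_validation(series, min_points_for_validation=20,
                           validation_fraction=0.2):
    """Hold out the most recent points for validation when history allows."""
    n = len(series)
    if n <= min_points_for_validation:
        return list(series), []  # too little data: fit on everything
    n_val = max(1, int(n * validation_fraction))
    return list(series[:-n_val]), list(series[-n_val:])


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_best_model(validation, forecasts_by_model):
    """Return the name of the model with the lowest validation error."""
    return min(
        forecasts_by_model,
        key=lambda name: mean_absolute_error(validation, forecasts_by_model[name]),
    )


if __name__ == "__main__":
    history = list(range(40))
    train, validation = split_train_validation(history)
    print(len(train), len(validation))
    # Hypothetical forecasts from two candidate models over the validation span.
    candidates = {"model_a": [v + 1 for v in validation],
                  "model_b": [v + 3 for v in validation]}
    print(select_best_model(validation, candidates))
```

The validation points are always the most recent ones, so no future information leaks into training.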

Use Cases: This data point range is best associated with long term growth, high-level planning, or long-range planning use cases, because the data contains an underlying trend and a single seasonality. This provides some understanding of which months or time periods within a given year spike or dip.

80 to 300 Data Points (Weekly to Daily)

A data set with 80 to 300 data points can leverage almost all capabilities of Sensible Machine Learning.

Data: The data will most likely be daily level data with multiple data patterns that can be learned.

These data patterns consist of:

  • Seasonality: Multiple seasonalities may exist within the data. These may be overlaid weekly, monthly, quarterly, or yearly seasonalities.

  • Trend: An underlying change in the mean value over time.

  • Anomalies: Spikes or dips in the data that can be explained by recurring events and holidays.

These data patterns may be difficult to learn since there are likely not enough data pattern repetitions. For example, a daily level data set with only 300 data points does not have a complete picture of yearly seasonality. Therefore, a model running against this data set most likely cannot assume any yearly seasonality exists, even if it does exist.

Functionality: Almost all the functionality available in Sensible Machine Learning may be leveraged. The cross-validation strategy used improves the closer you get to 300 data points.

Features: Sensible Machine Learning uses all possible feature types to get the best-performing models. However, weekly level data sets may not see a real benefit from event-based features, because events occur at a daily level.

Models: All model types can be leveraged with at least 80 data points in the train set of the largest split. Models that run against these 80 to 300 data points are typically a mix of machine learning and statistical models. This data point range blends models that leverage features and pits them against models that do not use features.

Model Selection: The models that perform best are typically machine learning models or more advanced statistical models, since there is a reasonable number of data patterns to learn from.

Selecting the Models to Train: Sensible Machine Learning gets a list of candidate models to run in the Model Arena. It defaults to running a recommended set of machine learning models. Two of these are XGBoostTimeSeries and CatBoostTimeSeries. After the models have been selected, the Model Arena then trains these models and compares them against each other. Sensible Machine Learning also includes common baseline models (shift and mean models).
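Baseline models give the Model Arena a reference point that more complex models should beat. The source does not define the shift and mean baselines, so the sketch below uses one common interpretation: a shift model repeats the value observed a fixed number of steps earlier, and a mean model forecasts the historical average. The function names and parameters are hypothetical.

```python
def shift_forecast(history, horizon, lag=1):
    """Repeat the value observed `lag` steps earlier (lag=1 repeats the last value)."""
    if lag < 1 or lag > len(history):
        raise ValueError("lag must be between 1 and len(history)")
    extended = list(history)
    for _ in range(horizon):
        extended.append(extended[-lag])  # look back `lag` steps, including forecasts
    return extended[len(history):]


def mean_forecast(history, horizon, window=None):
    """Forecast the average of the last `window` points (all points if None)."""
    recent = history if window is None else history[-window:]
    level = sum(recent) / len(recent)
    return [level] * horizon


if __name__ == "__main__":
    weekly_sales = [100, 120, 90, 110, 105, 125, 95, 115]
    print(shift_forecast(weekly_sales, horizon=4, lag=4))
    print(mean_forecast(weekly_sales, horizon=4))
```

If a trained model cannot outperform these baselines on held-out data, its added complexity is not paying off for that target.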

Selecting the Best Trained Model: In the Model Arena, the cross-validation strategy used improves the closer you get to 300 data points.

Use Cases: This data point range is best associated with an annual demand plan. This is because Sensible Machine Learning can learn a fair number of seasonal data patterns, which supports accurate forecasts at a daily or weekly interval.

Daily data sets within this data point range may struggle to produce accurate forecasts more than six months ahead, because there may not be enough daily history to learn yearly seasonal data patterns.

300+ Data Points (Daily)

All capabilities of Sensible Machine Learning are unlocked for data sets with more than 300 data points.

Data: The data is most likely daily level data with multiple data patterns that can be learned.

These data patterns consist of:

  • Seasonality: Multiple seasonalities may exist within the data. These may be overlaid weekly, monthly, quarterly, or yearly seasonalities.

  • Trend: An underlying change in the mean value over time.

  • Anomalies: Spikes or dips in the data that can be explained by recurring events and holidays.

Functionality: All functionality of Sensible Machine Learning is available for data sets with more than 300 data points.

Features: Sensible Machine Learning leverages all possible feature types to get the best-performing models.

Models: All model types can be leveraged with at least 80 data points in the train set of the largest split. Models that run against data with more than 300 data points take the longest to train. This is because models running against larger data sets have more parameters to tune, more data to consume, and more data patterns to learn. Training time per target is higher than in other data point ranges.

Model Selection: The models that perform best here are typically machine learning models because of the high volume of data patterns to learn from.

Selecting the Models to Train: Before running the Model Arena with 300+ data points, Sensible Machine Learning gets a list of candidate models to run. It defaults to running a recommended set of machine learning models. Two of these models are XGBoostTimeSeries and CatBoostTimeSeries. After the models have been selected, the Model Arena trains these models and compares them against each other. Sensible Machine Learning also includes common baseline models (shift and mean models).

Selecting the Best Trained Model: In the Model Arena, a comprehensive nested time series cross-validation strategy is used for hyperparameter tuning and model validation to determine which models are most likely to perform best in production.
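In nested time series cross-validation, outer folds estimate how a model would perform in production, while inner folds, cut only from each outer training window, are used for hyperparameter tuning. The sketch below shows one generic expanding-window layout. The fold counts, horizon, and function names are assumptions, and the actual split scheme used by the Model Arena is not described in this document.

```python
def expanding_splits(n_points, n_folds, horizon):
    """Yield (train_end, test_end): train is [0, train_end), test is [train_end, test_end)."""
    first_train_end = n_points - n_folds * horizon
    if first_train_end <= 0:
        raise ValueError("not enough data for the requested folds and horizon")
    for k in range(n_folds):
        train_end = first_train_end + k * horizon
        yield train_end, train_end + horizon


def nested_splits(n_points, outer_folds, inner_folds, horizon):
    """Yield outer folds, each with inner folds drawn only from its training window."""
    for train_end, test_end in expanding_splits(n_points, outer_folds, horizon):
        inner = [
            ((0, inner_train_end), (inner_train_end, inner_val_end))
            for inner_train_end, inner_val_end
            in expanding_splits(train_end, inner_folds, horizon)
        ]
        yield {
            "outer_train": (0, train_end),
            "outer_test": (train_end, test_end),
            "inner": inner,  # (train range, validation range) pairs for tuning
        }


if __name__ == "__main__":
    for fold in nested_splits(n_points=365, outer_folds=3, inner_folds=2, horizon=30):
        print(fold["outer_train"], fold["outer_test"], fold["inner"])
```

Every validation and test range lies strictly after its training range, so each model is always evaluated on data that is later than anything it was fit or tuned on.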

Use Cases: This data point range is best associated with annual demand planning or operational level demand planning use cases. This is because Sensible Machine Learning can learn from multiple seasonal data patterns, which supports highly accurate forecasts on any given day. These granular forecasts can be used to drive operational level decisions.