Data Collection Best Practices

In short, the best practice is to avoid all the data quality problems described in Data Quality, Data Collection Process, and Data Set Frequency.

Additional best practices include:

Use the Same Target Collection Lags for all Targets

All targets should have approximately the same target collection lag. If collection lags differ widely across targets, split the data into two or more separate projects, grouping targets with similar collection lags into the same project.
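As a minimal sketch of how targets might be grouped by collection lag before splitting them into projects, the following assumes each target's lag is known in days. The function name, the target names, and the two-day tolerance are hypothetical illustrations and are not part of Sensible Machine Learning.

```python
from typing import Dict, List


def group_targets_by_lag(lags_in_days: Dict[str, int],
                         tolerance_days: int = 2) -> List[List[str]]:
    """Group targets so that lags within a group span at most tolerance_days."""
    groups: List[List[str]] = []
    group_min_lag = None
    # Sort by lag (then name) so targets with similar lags are adjacent.
    for target, lag in sorted(lags_in_days.items(),
                              key=lambda item: (item[1], item[0])):
        if group_min_lag is None or lag - group_min_lag > tolerance_days:
            # Lag is too far from the current group's smallest lag: start a new group.
            groups.append([target])
            group_min_lag = lag
        else:
            groups[-1].append(target)
    return groups


# Hypothetical targets and collection lags, in days.
example_lags = {
    "store_a_sales": 1,
    "store_b_sales": 2,
    "warehouse_returns": 30,
    "regional_orders": 31,
}
projects = group_targets_by_lag(example_lags)
```

Each inner list in `projects` is a candidate for its own project. The right tolerance depends on the data set frequency and is a judgment call.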

Ensure the Target Collection Lag Remains Constant Over Time

The source data should be routinely updated at a consistent interval. This minimizes the need to interpolate the most recent dates. If the actual collection lag changes, then you must reconfigure the models by doing a full model rebuild.
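One way to check that updates arrive at a consistent interval is to compare the gaps between successive update dates. This is a sketch only: the function names and the example dates are hypothetical, and it assumes you keep a record of when the source data was refreshed.

```python
from datetime import date
from typing import List


def update_gaps_in_days(update_dates: List[date]) -> List[int]:
    """Return the number of days between each pair of consecutive updates."""
    ordered = sorted(update_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def has_consistent_interval(update_dates: List[date],
                            tolerance_days: int = 0) -> bool:
    """True if the longest and shortest gaps differ by at most tolerance_days."""
    gaps = update_gaps_in_days(update_dates)
    if not gaps:
        # Fewer than two updates: there is nothing to compare.
        return True
    return max(gaps) - min(gaps) <= tolerance_days


# Hypothetical update histories.
weekly_updates = [date(2024, 1, 1), date(2024, 1, 8),
                  date(2024, 1, 15), date(2024, 1, 22)]
irregular_updates = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 20)]
```

Here `has_consistent_interval(weekly_updates)` is true, while `irregular_updates` fails the check unless a tolerance of at least five days is allowed.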

Provide Complete Data

It’s okay to fill a few missing values or a few partial sections of target data if they can be reasonably interpolated. If more than five percent of the data is missing, performance can become unstable for targets. This is because models learn from interpolated patterns that may not match the real patterns in the data. The more complete the data is, the better the forecast results.
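The five percent guideline can be checked before loading data. The sketch below assumes each target is a list of values in which missing observations are represented as `None`. The function names and sample data are hypothetical.

```python
from typing import Dict, List, Optional

MISSING_THRESHOLD = 0.05  # the five percent guideline described above


def missing_fraction(values: List[Optional[float]]) -> float:
    """Return the share of observations that are missing (None)."""
    if not values:
        # Treat a target with no observations as entirely missing.
        return 1.0
    return sum(value is None for value in values) / len(values)


def targets_over_threshold(data: Dict[str, List[Optional[float]]],
                           threshold: float = MISSING_THRESHOLD) -> List[str]:
    """Return the names of targets whose missing share exceeds the threshold."""
    return sorted(name for name, values in data.items()
                  if missing_fraction(values) > threshold)


# Hypothetical targets: one nearly complete, one with many gaps.
example_data = {
    "mostly_complete": [1.0] * 39 + [None],   # 1 of 40 missing
    "sparse": [1.0] * 17 + [None] * 3,        # 3 of 20 missing
}
flagged = targets_over_threshold(example_data)
```

Targets returned in `flagged` are candidates for collecting more complete data rather than interpolating.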

Do Not Use Fake Data

No reasonable model accuracy can be assumed if source data is manufactured to represent real data.

For example, suppose a target is only available at a yearly frequency, and you guess its monthly allocation and provide it to Sensible Machine Learning as a monthly frequency target. The model would learn from a fake monthly variation that most likely does not match the monthly reality. Therefore, you cannot reasonably assume that the model produces accurate forecasts for that target at the monthly level.

In general, if the data provided to Sensible Machine Learning (or any model) does not match the reality of the historical data patterns, then no reasonable forecasts should be expected from Sensible Machine Learning for those targets. Learning from fake data can lead to inaccurate results.
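One simple symptom of manufactured monthly data is a yearly total spread evenly across twelve months. The heuristic below is a sketch under that assumption only. It does not detect other allocation schemes, and the function name and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def years_with_flat_allocation(monthly: Dict[Tuple[int, int], float]) -> List[int]:
    """Return years whose twelve monthly values are all identical.

    Keys are (year, month) pairs. Identical values across a full year suggest
    that a yearly figure was divided evenly rather than measured monthly.
    """
    by_year = defaultdict(list)
    for (year, _month), value in monthly.items():
        by_year[year].append(value)
    return sorted(year for year, values in by_year.items()
                  if len(values) == 12 and len(set(values)) == 1)


# Hypothetical series: 2022 is a yearly total split evenly, 2023 varies by month.
example_monthly = {}
for month in range(1, 13):
    example_monthly[(2022, month)] = 1200 / 12
    example_monthly[(2023, month)] = 90.0 + month
suspect_years = years_with_flat_allocation(example_monthly)
```

A flagged year is a prompt to confirm how the data was collected, not proof that it was manufactured.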

Ensure Uniform Data Collection Practices

Ensure that all data collection practices remain consistent across the entire history and future of the source data. This ensures that models have consistent data patterns to learn from. A model can mistake a change in the data collection pattern for a new trend or seasonal pattern, which can cause inaccurate results.

Ensure a Constant Frequency Across All Targets

Ensure that all targets included in the source data are of the same frequency. Any targets that are less granular than the data set frequency produce inaccurate results.
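A rough way to spot targets that are less granular than the data set is to compare each target's typical gap between observations with the expected gap. This sketch assumes each target has at least two observation dates. The function names and example targets are hypothetical.

```python
from datetime import date, timedelta
from statistics import median
from typing import Dict, List


def median_gap_days(dates: List[date]) -> float:
    """Return the median number of days between consecutive observations."""
    ordered = sorted(dates)
    if len(ordered) < 2:
        raise ValueError("at least two dates are needed to infer a frequency")
    return median((later - earlier).days
                  for earlier, later in zip(ordered, ordered[1:]))


def less_granular_targets(target_dates: Dict[str, List[date]],
                          dataset_gap_days: int) -> List[str]:
    """Return targets observed less often than the data set frequency implies."""
    return sorted(name for name, dates in target_dates.items()
                  if median_gap_days(dates) > dataset_gap_days)


# Hypothetical weekly data set that contains one monthly target.
example_targets = {
    "weekly_shipments": [date(2024, 1, 1) + timedelta(weeks=i) for i in range(8)],
    "monthly_inventory": [date(2024, month, 1) for month in range(1, 7)],
}
mismatched = less_granular_targets(example_targets, dataset_gap_days=7)
```

The median is used so that an occasional missing observation does not change the inferred frequency.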

Do Not Change Source Data While Sensible Machine Learning is Running

Changing the data source targets while a Sensible Machine Learning job is running may cause Sensible Machine Learning to stop responding. This is because Sensible Machine Learning avoids making a copy of the data source used to power the solution.

Align the Target Units to the Business Problem

It is important to align the target units to the business problem that Sensible Machine Learning is being used to solve.

For example, if the downstream use case is supply chain demand planning, it is best to have the targets’ units be unit sales rather than dollar sales. This is because the raw units sold more closely align to the downstream use case when estimating how much product to move to certain retail locations.

Using dollar sales does not effectively align to this use case and creates these challenges:

  • Models are expected to learn price appreciation or changes in the price per unit. Price-per-unit changes are a form of non-uniform target collection, which should be avoided.

  • The forecast dollar sales must be converted back into unit sales to align to the demand planning use case. This can introduce conversion errors, which compound the inherent error that exists in any forecast.
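The second challenge can be illustrated with a small worked calculation. All numbers below are hypothetical, and the sketch assumes the conversion is a simple division of forecast dollars by an assumed price per unit.

```python
def units_from_dollars(dollar_forecast: float, assumed_price: float) -> float:
    """Convert a dollar sales forecast into units using an assumed unit price."""
    if assumed_price <= 0:
        raise ValueError("assumed_price must be positive")
    return dollar_forecast / assumed_price


def relative_error(estimate: float, actual: float) -> float:
    """Return the absolute error as a fraction of the actual value."""
    return abs(estimate - actual) / actual


# Hypothetical reality: 1,000 units sold at 10.00 per unit.
actual_units = 1000.0

# Forecasting units directly with a forecast that is 5 percent too high.
direct_unit_forecast = 1050.0

# Forecasting dollars 5 percent too high, then converting with a price
# assumption that is 5 percent too low.
converted_units = units_from_dollars(dollar_forecast=10500.0, assumed_price=9.50)

direct_error = relative_error(direct_unit_forecast, actual_units)
converted_error = relative_error(converted_units, actual_units)
```

In this example, the direct unit forecast is off by 5 percent, while the converted forecast is off by roughly 10.5 percent because the forecast error and the price error combine.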