
The 5 essentials of a good forecasting blind test

Lennert Smeets - July 8, 2021

Reading time: 4 min


How do you know if a forecasting solution will perform well? There are many available on the market using a wide range of technologies, but which would be best for a particular industry or company? Well, the proof of the pudding is in the eating. That’s why many companies get solution providers to submit to a blind test. Which is great, but blind tests need to be carefully organized if you want to learn anything from them.

There are five essential requirements for a good blind test:

1. Be sure to make it really blind

The data used for blind testing consists of a collection of time series for a range of product-market combinations (for example, SKUs per country). It includes both the training data to be used in making the forecast and the test data that needs to be predicted.

Let’s assume that you have three participants and that you provide them sales history from January 2016 to December 2020 (the training data), and ask them to forecast January 2021 to June 2021 (the test data). It is essential that no participant be given access to the test data, because that would allow them to tweak the forecast using information that would be unavailable in real circumstances. Test data should not be revealed to anyone prior to a test, even trusted participants. Even the best data scientists could inadvertently introduce some kind of data leakage, leading to unrealistic results.
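As a rough illustration, here is a minimal sketch of how an organizer might split the history into the file shared with participants and the file kept back for evaluation. The file name and columns (sku, country, month, quantity) are hypothetical, not part of any specific solution:

```python
import pandas as pd

# Hypothetical monthly sales history per product-market combination (SKU per country).
sales = pd.read_csv("sales_history.csv", parse_dates=["month"])

# Training data shared with participants: January 2016 through December 2020.
train = sales[(sales["month"] >= "2016-01-01") & (sales["month"] <= "2020-12-01")]

# Test data withheld by the organizer: January 2021 through June 2021.
test = sales[(sales["month"] >= "2021-01-01") & (sales["month"] <= "2021-06-01")]

train.to_csv("blind_test_training_data.csv", index=False)  # goes to participants
test.to_csv("blind_test_actuals.csv", index=False)         # stays with the organizer
```

The point of the split is that nothing overlapping the test period ever leaves the organizer's hands until the forecasts are locked in.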


2. Provide a sufficient number of items

How do you know if someone is an expert poker player? The answer must be based on more than how they play just one or two hands. The same goes for forecasting tests, which would be too influenced by coincidence if they were carried out for only a handful of items.

I always recommend providing a data set of at least several hundred or even thousands of time series. This has the advantage of leaving little or no room for any manual forecast manipulation. You want to test the automatic forecasting capability of a solution before any human looks at it.

 

3. Make the test period long enough

The period to be forecast should be long enough to avoid solutions underperforming or overperforming just by coincidence. Therefore, reusing the above example, the test data should not cover only January 2021 but preferably January 2021 to June 2021. If you asked competitors to forecast just the January 2021 actuals, you wouldn’t get a complete picture of any solution’s performance.

When assessing forecast accuracy on a specific lag, for example ‘month+1’, I recommend organizing multiple rounds. In the case outlined above, this would mean the following:

  • Provide the January 2016 to December 2020 data and ask participants to forecast January 2021;
  • Provide the January 2021 data and ask them to forecast February 2021;
  • Provide the February 2021 data and ask them to forecast March 2021, and so on.

This may look cumbersome, but multiple rounds are indeed essential. If you provided all the data immediately, you wouldn’t have a blind test anymore, leaving room for intentional or unintentional cheating.
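To make the rounds concrete, here is a minimal sketch of a rolling ‘month+1’ schedule, reusing the hypothetical sales_history.csv file from the earlier sketch; the file naming and workflow are illustrative assumptions:

```python
import pandas as pd

sales = pd.read_csv("sales_history.csv", parse_dates=["month"])

# Six monthly rounds: January 2021 through June 2021, each evaluated at lag month+1.
rounds = pd.date_range("2021-01-01", "2021-06-01", freq="MS")

for target_month in rounds:
    cutoff = target_month - pd.DateOffset(months=1)  # last month of history released
    history = sales[sales["month"] <= cutoff]
    history.to_csv(f"round_{target_month:%Y_%m}_training_data.csv", index=False)
    # Share this file, collect and lock in each participant's forecast for
    # target_month, and only then reveal that month's actuals for the next round.
    print(f"Release history up to {cutoff:%Y-%m}, forecast {target_month:%Y-%m}")
```

Each round only releases actuals that would genuinely have been available at that point in time, which is what keeps the test blind.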


4. Share your evaluation metrics upfront

Some test organizers are reluctant to reveal the specific metrics they will be using to evaluate the different forecasts. This may seem fair if we assume that solution providers should just give their ‘best’ forecast, whatever the evaluation method. However, it is common practice in forecasting competitions for the metrics to be shared upfront.

Why? I like to make a comparison with car racing. Racing teams always fine-tune their car’s engine, transmission, suspension, tires, etc., based on each race track’s properties, allowing the car to perform at its best. Forecasting engines are similar to race cars in that they are always tuned to the challenge at hand.

The algorithms are adjusted to optimize a specific KPI such as the MSE, the MAPE, the MAD, or the Bias, or even a weighted combination of KPIs. They are also adjusted to the granularity (months, weeks, days) and to the calculated lag (month+1, month+2, etc.). The precise tuning depends on the stated expectations and thus the metrics used by the customer, so all competitors must be given equal insight into them.
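For illustration, here is a minimal sketch of how those KPIs might be computed for a single time series over the test period. The exact definitions vary (for instance, how zero-demand periods are handled in MAPE), which is exactly why they should be agreed on upfront:

```python
import numpy as np

def forecast_kpis(actuals: np.ndarray, forecast: np.ndarray) -> dict:
    """Illustrative KPI definitions; agree on the exact formulas with all participants."""
    error = forecast - actuals
    nonzero = actuals != 0  # MAPE is undefined for zero actuals; skip those periods here
    return {
        "MSE": float(np.mean(error ** 2)),
        "MAD": float(np.mean(np.abs(error))),
        "Bias": float(np.mean(error)),
        "MAPE": float(np.mean(np.abs(error[nonzero]) / actuals[nonzero]) * 100),
    }

# Hypothetical six-month test period for one item.
actuals = np.array([120.0, 95.0, 0.0, 150.0, 110.0, 130.0])
forecast = np.array([110.0, 100.0, 10.0, 140.0, 120.0, 125.0])
print(forecast_kpis(actuals, forecast))
```

Whether the test is scored on one of these KPIs or on a weighted combination, and at which granularity and lag, should be spelled out for all competitors before any forecast is submitted.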


5. Reveal the actuals along with the test results

When testing is over, best practice is to share the actuals and test results among all the participants, so that they may independently verify the results. This guarantees transparency and objectivity and does not involve any disclosure of intellectual property relating to how the forecasts were generated.

 

If you would like more information on how to organize, interpret, and use forecasting competitions, feel free to reach out.

Lennert Smeets

Senior Product Manager at OMP USA

Biography

Lennert oversees the R&D of OMP for Demand Management. He is mostly driven by looking for innovations that make our customers’ demand planning journey more manageable and, at the same time, more effective.
