Online eats offline evaluation for breakfast

In my academic research with Professor Bart Goethals, we built and evaluated different recommender systems. Unfortunately, we found the real world of business to be quite different from what we had learned about evaluating recommender systems in academia.

I was also lucky to work with researchers from Netflix and Apple on this technology. I saw that they gained an enormous advantage from correctly using and evaluating recommender systems.

There are different approaches to evaluating the effectiveness of your recommender system. Online evaluation, with the system in production, is far better than testing the system on offline datasets. This is very specific to recommender systems.

Here’s why.

In the graph below, there are three people. In a typical experiment, you have each person's purchase history over time, and every blue dot represents one of that person's purchases.

To evaluate two approaches, the typical procedure is to randomly remove one purchase for each person and then try to predict the purchase you left out. This is called leave-one-out cross-validation (LOOCV).
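As a rough sketch of this procedure, assuming each purchase history is a list of (timestamp, item) pairs, the split could look like the following. The data and the function name `leave_one_out_split` are hypothetical and only serve as illustration.

```python
import random

# Hypothetical purchase histories: (timestamp, item) pairs, sorted by time.
histories = {
    "ann": [(1, "shoes"), (2, "socks"), (3, "hat")],
    "bob": [(1, "hat"), (4, "scarf"), (5, "gloves")],
    "cat": [(2, "socks"), (3, "shoes"), (6, "belt")],
}

def leave_one_out_split(histories, seed=0):
    """Hold out one randomly chosen purchase per person."""
    rng = random.Random(seed)
    train, held_out = {}, {}
    for person, purchases in histories.items():
        index = rng.randrange(len(purchases))
        held_out[person] = purchases[index]
        # Everything else stays in the training data, including
        # purchases made AFTER the held-out one.
        train[person] = purchases[:index] + purchases[index + 1:]
    return train, held_out

train, held_out = leave_one_out_split(histories)
```

Note that nothing in this split looks at the timestamps: the training data for a person can contain purchases made later than the one being predicted.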

However, there's a problem with this approach: whenever the removed purchase is not the person's last one, you're using the 'future' to predict the past. The bulk of the scientific work in this domain does precisely this. It looks reasonable at first sight, and I did it myself, but it's simply wrong.

Fortunately, I came across a better method, shown in the graph below. But even with this method, a huge problem remains.

The better way to do this is to draw a line in time, represented by the 'time axis' in the graph above. Use only the data on the left side of the line, leave out the data on the right side, and predict only the next purchase for each person rather than a random purchase in time.
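A minimal sketch of such a time-based split, again assuming (timestamp, item) pairs and using a hypothetical `temporal_split` helper and made-up data, might look like this:

```python
# Hypothetical purchase histories: (timestamp, item) pairs.
purchase_histories = {
    "ann": [(1, "shoes"), (2, "socks"), (3, "hat")],
    "bob": [(1, "hat"), (4, "scarf"), (5, "gloves")],
    "cat": [(4, "socks"), (6, "belt")],
}

def temporal_split(histories, cutoff):
    """Train on purchases before the cutoff, predict the first one after it."""
    train, targets = {}, {}
    for person, purchases in histories.items():
        ordered = sorted(purchases)
        before = [p for p in ordered if p[0] < cutoff]
        after = [p for p in ordered if p[0] >= cutoff]
        # Only people with data on both sides of the line can be evaluated.
        if before and after:
            train[person] = before
            targets[person] = after[0]  # the next purchase only
    return train, targets

train_before, next_purchase = temporal_split(purchase_histories, cutoff=3)
```

In this toy data, "cat" has no purchases before the cutoff and is therefore left out of the evaluation. How to handle such people is a design choice that this sketch does not settle.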

When you do this, you might see two things. First, performance drops dramatically, which is to be expected because you can't use the future anymore. Second, the differences between the approaches also decrease dramatically. But there is still a huge problem. When the dataset was collected from your website, some algorithm was already running, even if that algorithm was heavily shaped by manual curation, meaning a person handpicking specific promotions.

In the scientific literature, one says that the dataset was collected under a certain logging policy. This logging policy biases the dataset and, consequently, the results of the experiment. We tested the importance of this bias by collecting a new dataset under a different logging policy. As you can see in the graph below, the effect is large and even changes which of the two algorithms wins! This means that in an offline evaluation you are not measuring which approach is best, but rather which approach most resembles the logging policy.
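The effect can be illustrated with a deliberately simplified toy simulation. This is not the experiment described above: the data, the two trivial "recommenders", and the names `TRUE_LIKES`, `collect_log` and `offline_hit_rate` are all made up for illustration. The assumption is that users can only buy items the logging policy displayed to them.

```python
# Toy data (hypothetical): items each user would genuinely like.
TRUE_LIKES = {
    "ann": {"shoes", "hat"},
    "bob": {"shoes", "hat", "socks"},
    "cat": {"hat", "socks"},
}

def collect_log(displayed):
    """Simulate logging: users can only buy what the policy displayed."""
    return {user: likes & displayed for user, likes in TRUE_LIKES.items()}

def offline_hit_rate(recommended_item, log):
    """Fraction of users whose logged purchases contain the recommended item."""
    hits = sum(recommended_item in purchases for purchases in log.values())
    return hits / len(log)

logs = {
    "policy_1": collect_log({"shoes", "socks"}),  # the site promoted shoes
    "policy_2": collect_log({"hat", "socks"}),    # the site promoted hats
}

# Score two trivial recommenders on each logged dataset.
scores = {
    policy: {
        "recommend_shoes": offline_hit_rate("shoes", log),
        "recommend_hat": offline_hit_rate("hat", log),
    }
    for policy, log in logs.items()
}
```

The users' true preferences never change, yet the shoe recommender wins on the dataset logged under the first policy and the hat recommender wins on the dataset logged under the second.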

My suggestion is that you do not compare algorithms offline on historical data, but focus on immediate online A/B-testing.

So the question an offline experiment really answers is: which approach most resembles the methods you were using before? In our opinion, then, using offline experiments on historical data to score and compare recommender systems is not the best approach to take. Instead, it is better to put the system in production, A/B test it, and see how people react to it.
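One common way to read the outcome of such an A/B test is a two-proportion z-test on conversion rates. The sketch below is only one possible analysis, not necessarily the one we use, and the traffic and conversion numbers are invented for illustration.

```python
import math

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rate."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: variant A is the current recommender, B the new one.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

A positive `z` with a small `p_value` suggests that variant B converts better than variant A. Which metric to test (clicks, purchases, revenue) and how long to run the experiment are separate decisions that this sketch leaves open.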

Related to this, there is a difference between prediction and recommendation. Testing with these offline methods measures how good you are at predicting behavior somebody has already shown, and at predicting the same behavior for people with similar behavior. If you're good at this, the best you can do is remove friction from something that would have happened anyway. More valuable than prediction is recommendation: helping people discover products they would never have found without the recommendation. This, however, is impossible to test with a historical dataset, because the dataset only includes interactions with products people had already found.


In this lesson, our message is that comparing different recommendation approaches on historical data is flawed. The only way to really compare two approaches is in an online A/B test.

Stay tuned for more lessons about building recommender systems on our social media channels.
