Crafting the Perfect Recommendation System

Recommendation systems have become omnipresent in our daily lives. Where the technology used to be limited to tech giants like Facebook, Netflix and Spotify, we now find it on more and more websites.

Retail companies want to show relevant products when you are browsing their store page, podcast apps suggest new shows to listen to and news websites find content related to your topics of interest.

The research community is very active and proposes many new algorithms, methods and metrics.  

At Froomle, the research team realised that the choice of algorithm matters less than making sure the model is up to date and trained on the right amount of data.
Controlling these two factors has a larger impact on customers' KPIs than trying out many different algorithms.

This blog post will discuss why these two factors are so important, and how the Froomle solution controls them.

Timely Training of recommendation models

A trained recommendation model freezes user interests and item relationships. The world, on the other hand, keeps changing. Items that were related a while ago might no longer be related, old items lose relevance and user interests change. Eventually, as time passes, the model's frozen reality is so different from the environment's reality, that it hurts the quality of the recommendations.

Fig. 1: As models grow old, performance drops sharply for news use cases (Adressa[2] and Globo[3]), while it remains largely stable for the retail use case (Cosmeticsshop[4])

This is especially true for news use cases. Due to the quick rotation of 'relevant' items, most articles lose relevance shortly after a brief period in the spotlight.

To limit the gap between model reality and environment reality, we need to keep the models up to date. However, every model update costs money. To balance performance against cost, companies need a way to schedule the right number of model updates at the right moments.

Typically the starting point is a certain budget available for updates, which gets translated, based on past experience, into an average number of updates per day (or month).

A basic scheduling solution is to update the models on a fixed cadence, for example every 2 hours. However, the research shows that this is not the optimal use of the available model updates. Activity fluctuates: at night, for example, there is usually less traffic on a website. Any update scheduled during a period of low traffic is far less useful than an update during peak hours, when large amounts of new information are collected that should be captured by retraining the model.

Fig. 2: A traditional solution would schedule updates at the black marks, which results in multiple updates during the night, when very few events arrive. A more optimised scheduler would schedule updates at the yellow marks, maximising the value from each update.

Fortunately, it is pretty easy to achieve this behaviour. Rather than specifying a fixed schedule based on time, scheduling can be based on the number of events that occurred since the last update. Once enough events have been collected, the model is retrained. The threshold defining "enough" can be computed from the number of allowed updates and the average number of events collected per day.
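The event-count approach can be sketched in a few lines. This is a minimal illustration, not Froomle's production code; the class and function names are hypothetical:

```python
def event_threshold(avg_daily_events: int, updates_per_day: int) -> int:
    """Events to collect before a retrain, derived from the update budget."""
    return max(1, avg_daily_events // updates_per_day)


class EventCountScheduler:
    """Trigger a retrain once enough events have arrived since the last one."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.events_since_update = 0

    def record_event(self) -> bool:
        """Record one incoming event; return True when a retrain is due."""
        self.events_since_update += 1
        if self.events_since_update >= self.threshold:
            self.events_since_update = 0  # reset the counter after scheduling
            return True
        return False
```

With 240,000 events per day and a budget of 12 updates, the threshold works out to 20,000 events per retrain; during quiet nights the counter simply fills more slowly, so updates shift towards the busy hours automatically.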

This is already a step towards better scheduling, but the method still falls short, especially in settings where models do not degrade regularly, such as webshops and streaming services. In these settings, models do not grow stale at all over long periods of time, so they do not need frequent updates. Nor do they grow stale at a steady pace, as the news models did. Instead, there are a few moments where the environment's reality shifts drastically rather than gradually.

In these settings, the staleness of a model is not influenced by the number of new events collected, but rather by the amount of new information those events carry. Events that confirm knowledge already present in the model are not as useful as those that the model had not expected.

For example, if the model already believes a user is interested in Squid Game, and the user watches an episode of that series, that event carries very little additional information.
But if the model has no idea that the user likes The Office, and they watch the first episode of that series, the event is far more valuable to a future model, because the system can learn something new about the user's preferences.

This realization led to exploring information-based schedules. 

  • The first method, Inverse Predicted Relevance (IPR), assigns each event a weight (information value) based on the model's predicted score for that item, given the user.
    If the model is sure a user will be interested in an item (high model score), the event gets a low information value; if the model thought the interaction unlikely, the information value is high. Low model scores (high information value) can occur when the model does not yet have enough knowledge about the item, or when the user has not expressed interest in similar items before. Thus this method addresses both new items entering the system and users changing their interests.
  • The second method is based on the assumption that two different models will react similarly to changes in the training data. When one model changes its recommendations (so its reality changes), we expect the other to do so as well. We can exploit this by looking at how much a cheap model changes every time we consider retraining the production model. If the cheap model does not change, we can assume the production model will also not change. An update of the production model is scheduled when the cheap model changes enough (where "enough" is an estimated threshold). While it is easy to construct situations where two models are not strongly correlated in their changes, researchers have found that changes in the popularity model are a good indicator of changes in personalisation algorithms.
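The IPR idea translates into a small accumulator: each event contributes an information value derived from the model's score, and a retrain fires once the total crosses a budget-derived threshold. The sketch below assumes scores are normalised to [0, 1]; the names and the simple `1 - score` weighting are illustrative choices, not Froomle's exact formulation:

```python
class IPRScheduler:
    """Inverse Predicted Relevance scheduling sketch.

    Each event adds information value (1 - predicted_score), assuming
    scores lie in [0, 1]. A retrain is triggered once the accumulated
    information exceeds the threshold.
    """

    def __init__(self, info_threshold: float):
        self.info_threshold = info_threshold
        self.accumulated_info = 0.0

    def record_event(self, predicted_score: float) -> bool:
        # Clamp in case the model emits scores slightly outside [0, 1].
        score = min(max(predicted_score, 0.0), 1.0)
        self.accumulated_info += 1.0 - score  # surprising events count more
        if self.accumulated_info >= self.info_threshold:
            self.accumulated_info = 0.0
            return True
        return False
```

Events the model expected (score near 1) barely move the counter, while surprising events push it quickly towards a retrain, which is exactly the behaviour the news-versus-retail comparison calls for.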

After promising offline results, Froomle has already implemented the second method, and is working on implementing the IPR scheduler as well.
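The cheap-model check from the second method can be sketched by comparing consecutive top-k lists of a popularity model. The Jaccard-style distance and the 0.3 threshold below are illustrative assumptions, not the values used in production:

```python
def jaccard_distance(a, b) -> float:
    """1 - |A ∩ B| / |A ∪ B| between two top-k recommendation lists."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


def should_retrain_production(prev_top_k, new_top_k,
                              change_threshold: float = 0.3) -> bool:
    """Retrain the expensive production model only when the cheap
    popularity model's top-k list has shifted by more than the threshold."""
    return jaccard_distance(prev_top_k, new_top_k) > change_threshold
```

Retraining the popularity model is cheap enough to do often, so it acts as a canary: a stable top-k suggests the environment's reality has not shifted, and the expensive personalisation model can safely wait.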

Applying the smarter schedulers in production has led to an increase in CTR of 2% (relative) on news and a 20% (relative) increase in CTR on retail.

Right Training data

The second aspect to optimize is the amount of historic data to use when training models.

Through experimentation, Froomle noticed that even when months or years of data are available, training simple models on just the last few hours of data is a very effective way to give better recommendations.

The changing environment is again at the root of this approach. Old interactions carry information that differs from what is relevant now, so using them to train models that need to perform now can contaminate the recommendations.

Some models can use more data effectively by accounting for the order or age of events, but simpler models are easily drowned out by older events, giving poor recommendations when they receive too much training data.
One could conclude that these simpler models are not a good fit for those use cases; however, researchers found that by using only recent interactions as training data, these algorithms can outperform the more complicated ones.

From this experimentation, the Froomle standard procedure for optimizing AB-tests now also includes finding the right window of data to use when training recommendation models. For example, for news use cases the best results were achieved with training windows between 12 hours and 36 hours for personalisation models, and 1 hour for popularity.
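Selecting a training window amounts to filtering the interaction log by timestamp before training. A minimal sketch, assuming events are `(timestamp, user_id, item_id)` tuples (the function name and tuple layout are hypothetical):

```python
from datetime import datetime, timedelta


def training_window(events, now: datetime, window: timedelta):
    """Keep only interactions that fall inside the most recent window.

    `events` is an iterable of (timestamp, user_id, item_id) tuples;
    everything older than `now - window` is dropped before training.
    """
    cutoff = now - window
    return [event for event in events if event[0] >= cutoff]
```

For a news popularity model one would call this with `window=timedelta(hours=1)`, while a personalisation model would use a window of 12 to 36 hours, per the results above.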

In a series of AB tests on Mediahuis brands, Froomle found an uplift of 8% for the popularity model, training it on 1 hour of data, rather than 3 hours of data.

An interesting side-effect of this data reduction approach is that training the model takes less time and resources than it would without paying attention to training data.
Thus more models can be trained on the same budget, and so the production model is more up-to-date as well.

For more information, the research team at Froomle has published a paper on the methodology and experiments in the Proceedings of the Perspectives on the Evaluation of Recommender Systems Workshop in 2022 [1].


In order to get high-quality recommendations, companies need to look beyond the choice of algorithm, and consider when to train their models, and on which data to train them.

By getting these choices right, companies are able to make more effective recommendations. Ignoring these choices can have dramatic effects, as the model might be hopelessly out of date, or trained on data that is not representative of the environment in which it needs to make predictions.


[1] Robin Verachtert, Lien Michiels, and Bart Goethals. “Are We Forgetting Something? Correctly Evaluate a Recommender System With an Optimal Training Window”. In Proceedings of the Perspectives on the Evaluation of Recommender Systems Workshop, 2022.

[2] Jon Atle Gulla, Lemei Zhang, Peng Liu, Özlem Özgöbek, and Xiaomeng Su. “The Adressa Dataset for News Recommendation”. In Proceedings of the International Conference on Web Intelligence, WI ’17, pages 1042–1048, New York, NY, USA, 2017. Association for Computing Machinery.

[3] Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. “News Session-based Recommendations Using Deep Neural Networks”. In Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems, DLRS 2018, pages 15–23, New York, NY, USA, 2018. ACM.

[4] Michael Kechinov. Cosmeticsshop E-commerce Dataset. Accessed: 2022-07-26.
