I am always very curious about how the big internet companies use AI: which choices do they make, what do they struggle with, which results do they obtain etc. Aren’t you?
Recently, I got excited when a colleague shared a scientific publication from Google scientists (details at the bottom), describing how they do product recommendation (thanks for sharing, Lien Michiels!).
Although I loved reading it and can highly recommend it, I can imagine that reading a 10-page scientific paper full of mathematical details is a no-go for many of you. That’s why I made the effort to summarize my view of the paper. I hope you like it, and I would love to discuss it with some of you.
So what does Google do?
They provide product recommendations for the more than 10,000 retailers that they collaborate with.
Product recommendations everywhere?
It looks like they limit themselves to showing two kinds of recommendations on ‘product detail’ pages:
Substitutes for the product on the detail page; and
Accessories and complementary products for the product on the detail page.
This means the product recommendations are not personalized.
Why would they limit themselves?
The reason for this limitation seems to be two-fold:
Efficiency: this way, they can precompute all recommendations for every product page, so they don’t need to do any real-time computation, saving cost and avoiding the risk of delays;
Complexity: with a team of only two people, they want to limit the complexity.
Are there downsides to putting these limits on themselves?
There are two downsides to this limitation. Firstly, they do not support homepage or overview page recommendations and, secondly, product page recommendations are the same for every customer.
Ok, So what data do they use to pull this off?
Google uses the following data:
Who views each product;
Who searches for each product;
Who adds each product to their cart;
Who purchases each product; and
Product attributes like categories, brand, and price.
They attach different levels of importance to different data. Specifically, the order from more to less important is purchase > cart > search > view.
Two choices surprise me a lot:
They only remember the users’ 25 most recent actions, which is not a lot of data; and
They update their model on a daily basis. This seems quite slow: fast-changing product stock levels are not taken into account, so the recommender might recommend out-of-stock items.
Most importantly: does it work?
Their experiments reveal two important insights:
If a method performs better in offline tests on historical data, it does not mean it will perform better in real life. This is a big problem that is known by many, but whose impact is understood by few. Far too many research results and business decisions are based on offline tests on historical data instead of real-life A/B tests; and
Their approach has an advantage over simpler approaches when recommending substitutes or accessories for unpopular items. For popular items, it does not improve upon older techniques.
Finally, how do they want to improve in the future?
They identify at least two ways:
Taking image quality into account, because people (probably) don’t click on ugly pictures; and
Taking inventory levels into account, because it seems stupid to recommend something you cannot sell.
Want to know more?
Continue to the next section, which dives deeper into the more technical details.
The Technical Deep Dive
Alright, which approach did they implement?
At the heart of Google’s recommendation engine is the well-known BPR (Bayesian Personalized Ranking), a matrix factorization approach by Rendle et al.
What is noticeable about the way they implement a well-known approach?
It is interesting to see that they incorporate different kinds of interactions into BPR through the way BPR selects its training data. BPR is trained by comparing positive signals with negative signals. Google teaches the algorithm that a search is worth more than a view by sampling a view as a negative for a search. Similarly, the algorithm learns that a purchase is worth more than a search by sampling a search as a negative for a purchase, and so on.
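A minimal sketch of this graded negative sampling idea (the function names, data layout, and 50/50 branch probability are my own illustration, not Google’s code):

```python
import random

# Signal strength, from the paper's ordering: purchase > cart > search > view.
SIGNAL_RANK = {"view": 0, "search": 1, "cart": 2, "purchase": 3}

def sample_bpr_pair(user_events, catalog):
    """Pick a (positive, negative) item pair for one BPR update.

    user_events: list of (item_id, signal) tuples for one user.
    catalog: list of all item ids.
    The negative is either an item the user never touched, or an item the
    user interacted with through a *weaker* signal, so the model learns
    e.g. purchase > cart > search > view.
    """
    pos_item, pos_signal = random.choice(user_events)
    weaker = [i for i, s in user_events
              if SIGNAL_RANK[s] < SIGNAL_RANK[pos_signal]]
    if weaker and random.random() < 0.5:
        neg_item = random.choice(weaker)  # weaker-signal item as negative
    else:
        seen = {i for i, _ in user_events}
        neg_item = random.choice([i for i in catalog if i not in seen])
    return pos_item, neg_item
```

Feeding such pairs into the standard BPR pairwise loss then pushes strongly-signaled items above weakly-signaled ones in the ranking.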
Furthermore, for every item, they use two embeddings, depending on whether you want to recommend the item or use the item to produce a recommendation. Consequently, their model is asymmetric, meaning that if viewing product A leads to recommending product B, it is not necessarily true that viewing product B also leads to recommending product A.
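As a toy sketch of this dual-embedding setup (random vectors and hypothetical names, purely to show where the asymmetry comes from):

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, dim = 4, 8

# Two separate embedding tables per item:
# - query_emb[i]:  used when item i is the context (the product page being viewed)
# - target_emb[j]: used when item j is a candidate recommendation
query_emb = rng.normal(size=(n_items, dim))
target_emb = rng.normal(size=(n_items, dim))

def score(context_item, candidate_item):
    """Relevance of recommending `candidate_item` on `context_item`'s page."""
    return float(query_emb[context_item] @ target_emb[candidate_item])

# Because the two tables differ, score(A -> B) need not equal score(B -> A):
# viewing a phone may suggest a case, while viewing a case need not suggest a phone.
```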
Also, user embeddings are not explicitly modeled but computed as a linear combination of the item embeddings of the last 25 items the user interacted with. So, although they do have a way to obtain user embeddings to train their model, it seems that they do not use them when serving recommendations, because that would require too much real-time computation, increasing cost and response times.
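A minimal sketch of deriving a user vector this way, assuming a plain mean as the linear combination (the exact weights are not something the summary above pins down):

```python
import numpy as np

def user_embedding(recent_item_ids, item_query_emb, max_history=25):
    """Average the query embeddings of the user's most recent items.

    recent_item_ids: the user's interaction history, oldest first.
    item_query_emb:  array of shape (n_items, dim).
    Only the last `max_history` (25, per the paper) items count; a plain
    mean is the simplest linear combination and an assumption on my part.
    """
    history = recent_item_ids[-max_history:]
    return item_query_emb[history].mean(axis=0)
```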
Isn’t BPR only for collaborative filtering? How do they use the item attributes?
Google’s approach naturally combines collaborative filtering and content-based recommendations by taking into account item attributes when computing the item embeddings, based on the works of Kanagal et al. and Ahmed et al.
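A simplified sketch of attribute-aware item embeddings; the additive combination below is my own toy simplification of those works, not their actual formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical embedding tables for illustration only.
id_emb = {"sku_123": rng.normal(size=dim)}
attr_emb = {("brand", "acme"): rng.normal(size=dim),
            ("category", "shoes"): rng.normal(size=dim)}

def item_embedding(item_id, attributes):
    """Item vector = id vector + its attribute vectors.

    Items sharing a brand or category end up near each other even when
    they have few interactions, blending collaborative and content signals.
    """
    vec = id_emb[item_id].copy()
    for attr in attributes:
        vec += attr_emb[attr]
    return vec

v = item_embedding("sku_123", [("brand", "acme"), ("category", "shoes")])
```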
And did they encounter challenges?
It was interesting to read that BPR does not “just work”, but requires quite some engineering to get good results, making things a lot more complex. Some examples:
They cannot sample negatives randomly, but need to apply heuristics such as:
The negative sample needs to be far away from the positive sample in the taxonomy tree;
The negative sample cannot be frequently co-bought and -viewed with the positive sample;
The model is extremely sensitive to settings of hyperparameters such as the number of latent factors, learning rate, regularization parameters and the seed for the random number generator. They observed that the worst choice of hyperparameters can generate results that are 100(!) times worse than the best. Consequently,
Every retailer has different hyperparameter settings; and
Grid search on the hyperparameter space is an important part of their model training process. Model quality is assessed using leave-one-out cross validation in combination with an approximation of mean average precision at 10 (MAP@10).
Apparently, it is not possible to simply feed as many features as possible to the model and let the model decide which ones are important: they mention that feature selection is important for performance.
Computing for every product what other products need to be recommended next to it seems to be practically impossible with their approach. That’s why they first do a candidate selection based on the taxonomy, co-views and co-buys, also taking into account whether a product is re-purchasable or not.
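To make the candidate-selection step concrete, here is a sketch of such a pre-filter. The data structures and the exact repurchasability rule are my own guesses based on the paper’s description, not Google’s code:

```python
def select_candidates(product, taxonomy_neighbors, co_viewed, co_bought,
                      repurchasable, max_candidates=100):
    """Cheap pre-filter before the BPR model scores anything.

    taxonomy_neighbors: items near `product` in the taxonomy (substitutes).
    co_viewed / co_bought: items frequently viewed or bought together
    with `product` (accessory candidates).
    repurchasable: set of items customers plausibly buy more than once.
    All inputs and rules here are illustrative assumptions.
    """
    candidates = set(co_viewed) | set(co_bought)
    # Guessed rule: substitutes only make sense if buying another one is
    # plausible; next to a just-bought washing machine, show accessories.
    if product in repurchasable:
        candidates |= set(taxonomy_neighbors)
    candidates.discard(product)
    return sorted(candidates)[:max_candidates]
```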
Anything out of the ordinary?
What was very surprising to me is that a broader candidate selection could lower the accuracy of the approach. This means that the effect of the candidate selection is not just reducing computational complexity, but also filtering out false positives that BPR on its own would miss!
What infrastructure did they deploy?
Google performs model training on preemptible VMs that have a high probability of being taken down. These VMs are up to 70% cheaper than regular VMs, but required Google to build a fault-tolerant recovery mechanism with checkpoints.
Did they distribute computation?
All model computation for one retailer happens on a single VM with at most 128 GB of RAM, using multiple threads that perform Hogwild-style SGD. A VM computes models for only one retailer at a time. They use the MapReduce framework to distribute the retailers over the VMs, but not to distribute the work for a single retailer, since that happens on one VM.
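Hogwild-style SGD means several threads update one shared parameter vector without any locking, trusting that conflicting updates are rare enough not to hurt convergence. A conceptual sketch on a toy problem (in CPython the GIL limits true parallelism, and real implementations live in C++; this only illustrates the lock-free update pattern, not Google’s code):

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 16))   # toy training examples
params = np.zeros(16)                # shared model parameters, no lock

def worker(rows, lr=0.01):
    for r in rows:
        grad = params - data[r]      # gradient of 0.5 * ||params - x||^2
        params[:] -= lr * grad       # in-place, lock-free update

# Four threads, each sweeping its own interleaved slice of the data.
threads = [threading.Thread(target=worker, args=(range(i, len(data), 4),))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# params drifts toward the data mean despite the absence of synchronization.
```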
Do they need to compute their model only once?
Of course not. Google incrementally retrains the models daily, updating the previous model, which improves convergence speed compared to retraining from scratch. However, this gives a bias towards older data. Therefore, they need to retrain from scratch every once in a while.
The grid search is not repeated daily for all 100 hyperparameter sets, but only for the 5 most promising.
So what happens at serving time, when a website requests recommendations?
Serving recommendations appears to be no more than looking up the non-personalized precomputed recommendations in a cache, which maximizes serving speed.
What if I want to know more?
You should definitely read the original paper:
Title: Recommendations for All: Solving thousands of Recommendation Problems Daily
Authors: Bhargav Kanagal and Sandeep Tata, both from Google.
Currently, I am writing down my views on Pinterest’s recommendation approach, based on their WWW papers: “Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time” and “Related Pins at Pinterest: The Evolution of a Real-World Recommender System”.