Questioning the Efficacy of Neural Recommendation Systems

The Data Exchange Podcast: Paolo Cremonesi and Maurizio Ferrari Dacrema on the reproducibility, complexity, and inefficiency of neural methods for recommenders.


This week’s guests are leading researchers in recommendation systems: Paolo Cremonesi is Professor of Computer Science and Maurizio Ferrari Dacrema is a Postdoc at Politecnico di Milano, where they are both part of the RecSys research group. Paolo is also the Reproducibility co-chair for the upcoming 2021 RecSys Conference.


Recommenders are everywhere and have become standard in most websites and mobile applications. The scale of recommendations served up by leading companies like Facebook and Netflix is breathtaking. In recent years, recsys researchers have been exploring neural network models. Paolo, Maurizio, and crew recently published two survey papers on the use of deep learning in recommendation systems:

You can tell from the titles of these survey papers that, at best, they found mixed results. They raise serious issues that researchers in recommenders and the broader machine learning community need to address. One is the ongoing reproducibility crisis, which both papers highlight. We also identified the need for a knowledge base that collects RecSys research findings and, most importantly, a platform where research models can interact with real-world users and applications.

Maurizio

❛ We had trouble reproducing the results as reported in the original papers. We couldn’t find a way to use them effectively for the type of data that we have, considering that we are using very highly structured data. … We tried to do an analysis that was as detailed and as transparent as possible. So we took those papers one by one, we analyzed them, and we tried to see whether we were able to reproduce them and, if not, we provided an explanation of why, for example, “the authors provided the source code, but the source code didn’t work for some reason”. We contacted the authors for assistance, and we tried to engage in a conversation to see whether we were able to address the issue. Unfortunately, the majority of articles did not provide sufficient information. So we got stuck at some point. … I think, in the first paper, five articles out of 12, I believe. So less than 50%, which is somewhat low, but is in line with what was observed in other studies in artificial intelligence.

… We also included some baselines, in some cases, 20-year-old k-nearest neighbor baselines. And we tried to see how well they performed in the evaluation scenario designed by the original authors themselves. And we encountered cases, many cases, where simple, very simple baselines were able to outperform the recommendation quality … of the newly proposed complex model. … We were surprised that such cases could occur in the published literature. Unfortunately, it looks like simple but well-performing models have disappeared in favor of more complex ones, which are more difficult to tune, more difficult to optimize.

… (The deep learning models we tested were) between 10 and 100 times slower, considering that we were using a CPU for the baselines and a Tesla V100 for the deep learning models, so that was substantially slower. But, in that case, you should consider that those are not product-grade implementations, so they are research, somewhat proof of concept. Aside from these perhaps extremely slow implementations, there is indeed a question here: are we really getting much advantage for the huge computational cost we need to sustain? If I recall correctly, in one of the recent articles from Facebook, where they discussed their deep learning models, they said that at some point they only took into account second-order interactions, and going much higher didn’t provide much advantage. So there is a point where we could go to ever more complex models, but maybe partly because of the data and partly because of the type of behavior we want to model, we encounter a limit after which it doesn’t really make sense to go more complex.
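
To give a sense of how simple the k-nearest neighbor baselines mentioned above can be, here is a minimal item-based KNN sketch in Python. It is not the implementation used in the survey papers: the function names (`item_knn_scores`, `recommend`), the cosine similarity, the toy interaction matrix, and the neighborhood size `k` are all assumptions chosen for illustration.

```python
import numpy as np

def item_knn_scores(interactions, k=2):
    """Score every (user, item) pair with an item-based KNN model.

    `interactions` is a users x items matrix of implicit feedback (1 = interacted).
    Hypothetical helper for illustration only.
    """
    X = np.asarray(interactions, dtype=float)
    # Cosine similarity between item columns (small constant avoids division by zero).
    norms = np.linalg.norm(X, axis=0)
    sim = (X.T @ X) / (np.outer(norms, norms) + 1e-9)
    np.fill_diagonal(sim, 0.0)  # an item is not its own neighbor
    # For each target item (column), keep only its k most similar items.
    if k < sim.shape[0]:
        drop = np.argsort(sim, axis=0)[:-k, :]
        np.put_along_axis(sim, drop, 0.0, axis=0)
    # A user's score for an item is the summed similarity to items they already have.
    scores = X @ sim
    scores[X > 0] = -np.inf  # never recommend items the user has already seen
    return scores

def recommend(interactions, user, n=1, k=2):
    """Return the indices of the top-n unseen items for `user`."""
    scores = item_knn_scores(interactions, k=k)
    return np.argsort(-scores[user])[:n].tolist()

# Toy data: 4 users x 4 items, made up for this example.
R = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
top_for_user_0 = recommend(R, user=0, n=1)
```

The whole model is a similarity matrix and one matrix product, which runs comfortably on a CPU. In practice such a baseline would still need its neighborhood size and similarity function tuned on the evaluation data.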

This doesn’t mean that neural models aren’t used in modern recommender systems. What happens inside large commercial companies, where access to real users, massive compute, and big data sets is not an issue, is not something covered in their survey papers. As they noted in the course of our conversation, their studies are based on research submitted to academic journals and conferences. Based on what one can glean from media articles, neural networks (including graph neural networks) and even reinforcement learning are part of recommenders that power some of the most popular services in the world. Given the growing importance and impact of recommendation systems, we need to figure out how to bridge the gap between the types of RecSys models being used in industry and those studied by academics.


