Nearly every website and content platform, Amazon and YouTube included, shows “recommended” content, and most of them rely on a technique called “collaborative filtering” to choose it. The technique is useful but has known weaknesses, which a new recommendation algorithm aims to address.
To determine what products a given customer might like, current recommendation techniques look for other customers who have assigned similar ratings to a similar range of products, and extrapolate from there. The success of this approach, however, depends vitally on the notion of similarity. Most recommendation systems use a measure called cosine similarity, which seems to work well in practice.
Now researchers have published a study reporting a new algorithm that should work better than those in use today, particularly when ratings data is “sparse,” that is, when there is little overlap between the products reviewed and the ratings assigned by different customers.
The algorithm’s basic strategy is simple: When trying to predict a customer’s rating of a product, use not only the ratings from people with similar tastes but also the ratings from people who are similar to those people, and so on.
The idea is intuitive, but in practice, everything again hinges on the specific measure of similarity.
“If we’re really generous, everybody will effectively look like each other,” says Devavrat Shah, a professor of electrical engineering and computer science and senior author on the paper. “On the other hand, if we’re really stringent, we’re back to effectively just looking at nearest neighbors. Or putting it another way, when you move from a friend’s preferences to a friend of a friend’s, what is the noise introduced in the process, and is there a right way to quantify that noise so that we balance the signal we gain with the noise we introduce? Because of our model, we knew exactly what is the right thing to do.”
As it turns out, the right thing to do is to again use cosine similarity. Essentially, cosine similarity represents a customer’s preferences as a line in a very high-dimensional space and quantifies similarity as the angle between two lines.
Suppose, for instance, that you have two points in a Cartesian plane, the two-dimensional coordinate system familiar from high school algebra. If you connect the points to the origin — the point with coordinates (0, 0) — you define an angle, and its cosine can be calculated from the point coordinates themselves.
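As a small illustration of that calculation, here is a Python sketch that computes the cosine directly from two points' coordinates. The function name cosine_of_angle is invented for this example and does not come from the study.

```python
import math

def cosine_of_angle(p, q):
    """Cosine of the angle at the origin between 2-D points p and q."""
    # Dot product of the two points, treated as vectors from the origin.
    dot = p[0] * q[0] + p[1] * q[1]
    # Length of each vector.
    norm_p = math.hypot(p[0], p[1])
    norm_q = math.hypot(q[0], q[1])
    if norm_p == 0 or norm_q == 0:
        raise ValueError("the angle is undefined for a point at the origin")
    return dot / (norm_p * norm_q)

# Points along the two axes are at right angles, so the cosine is 0.
print(cosine_of_angle((1, 0), (0, 1)))
# Points on the same line through the origin give a cosine of 1.
print(cosine_of_angle((3, 4), (6, 8)))
```

The same dot-product-over-lengths formula carries over unchanged to any number of dimensions.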
If a movie-streaming service has, say, 5,000 titles in its database, then the ratings that any given user has assigned to some subset of them define a single point in a 5,000-dimensional space. Cosine similarity measures the angle between any two sets of ratings in that space.
When data is sparse, however, there may be so little overlap between users’ ratings that cosine similarity is essentially meaningless. In that context, aggregating the data of many users becomes necessary.
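The sketch below shows one way the problem can appear. It assumes a common convention that the article does not spell out: similarity is computed only over the titles both users have rated. The user names, film names, and the overlap_cosine helper are all hypothetical.

```python
import math

def overlap_cosine(ratings_a, ratings_b):
    """Cosine similarity over co-rated items only; None if there is no usable overlap."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return None
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)

# Hypothetical ratings on a 1-to-5 scale.
alice = {"film_1": 5, "film_2": 1}
bob = {"film_2": 5, "film_3": 2}
carol = {"film_4": 3}

# Alice and Bob share one title and disagree sharply on it (1 versus 5),
# yet with a single shared dimension the angle between them is zero.
print(overlap_cosine(alice, bob))    # 1.0
# Alice and Carol share nothing, so no similarity can be computed at all.
print(overlap_cosine(alice, carol))  # None
```

With only one shared title, any two positive ratings point in the same direction, so the score says nothing about whether the users actually agree.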
The researchers’ analysis is theoretical, but here’s an example of how their algorithm might work in practice. For any given customer, it would select a small set — say, five — of those customers with the greatest cosine similarity and average their ratings. Then, for each of those customers, it would select five similar customers, average their ratings, and fold that average into the cumulative average. It would continue fanning out in this manner, building up an increasingly complete set of ratings, until it had enough data to make a reasonable estimate about the rating of the product of interest.
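That fanning-out procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' algorithm: it tracks only the product of interest rather than a full set of ratings, it takes a plain unweighted average, and the names predict, top_k_neighbors, min_ratings and max_hops, along with the toy data, are invented here.

```python
import math

def cosine(a, b):
    """Cosine similarity over co-rated items; 0.0 when there is no usable overlap."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_neighbors(user, ratings, k):
    """The k other users most similar to `user`, ignoring those with zero similarity."""
    scored = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    scored = [(s, o) for s, o in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [o for _, o in scored[:k]]

def predict(user, item, ratings, k=5, min_ratings=3, max_hops=3):
    """Fan out from `user` hop by hop, averaging the ratings of `item` found on the way."""
    seen = {user}
    frontier = [user]
    collected = []
    for _ in range(max_hops):
        next_frontier = []
        for current in frontier:
            for neighbor in top_k_neighbors(current, ratings, k):
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                next_frontier.append(neighbor)
                if item in ratings[neighbor]:
                    collected.append(ratings[neighbor][item])
        # Stop once there is enough data or nobody new was reached.
        if len(collected) >= min_ratings or not next_frontier:
            break
        frontier = next_frontier
    return sum(collected) / len(collected) if collected else None

# Hypothetical sparse data: ann overlaps only with ben, who has not rated "x".
ratings = {
    "ann": {"a": 5, "b": 3},
    "ben": {"a": 5, "b": 3, "c": 4},
    "cat": {"c": 4, "d": 2, "x": 4},
    "dan": {"d": 2, "x": 2},
}
print(predict("ann", "x", ratings, k=2, min_ratings=1))  # 4.0, via ben then cat
print(predict("ann", "x", ratings, k=2, min_ratings=2))  # 3.0, after also reaching dan
print(predict("ann", "x", ratings, k=2, max_hops=1))     # None, nearest neighbors only
```

The max_hops and min_ratings parameters stand in for the trade-off Shah describes: fan out too far and everyone looks alike, stop too early and the method is back to nearest neighbors.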