nicolai’s space.:space:

Recommending Products like Amazon 🎁

February 13th, 2020 · 5 min read

Recommendations are not an invention of the Internet - they existed long before. We gladly recommend products, films, TV series, […] of which we are convinced, to our friends and relatives. Therefore, it’s not surprising that we also come across recommendations on the Internet - be it Amazon, who wants to sell us similar products, or Netflix, who shows us how well a TV series matches our profile.

The difference, however, is that e.g. Amazon and Netflix try to determine in advance which product or TV series we might like. One can spot multiple types of suggestions at Amazons website:

  • Often bought together
  • Customers who bought this item also bought […]
  • You probably like […], too

But: How does Amazon manage to create individual recommendations? Although, statistics on sales are available, they are of little help here - since of course one wants to recommend products which are not part of the mainstream, too. It seems obvious, that a different approach is therefore required at this point.

Two approaches have been adopted in the area of recommendations:

  • Content-based Filtering
  • Collaborative Filtering

Content-based Filtering

As the name implies, Content-based Filtering (CBF) measures the similarity of two objects by comparing their properties. Taking a TV series as an example, the following properties might be useful:

  • Year
  • Genre
  • Country
  • Duration
  • Producer

A recommendation is then simple - use all TV series, that a customer bought, watched and rated as source and search for a TV series with the same properties (exact match) or similar properties (best match). If these information are not available, the currently viewed TV series may be used as source. In this context, tracking cookies are also useful, as they follow our route through the web and track, what websites we visited.

Let’s stick to the example: I filled the table below with properties, which I think are the most useful.

ArrowThe FlashBreaking BadThe Simpsons
GenreCrime, Action, DramaCrime, Action, DramaDramaComedy
Duration42 min.42 min.47 min.22 min.
ProducerDC ComicsDC ComicsVince Gilligan20th Century Fox

Imagine, I am a fan of Arrow - what would I probably be watching next? You guessed it: I’m more likely to watch The Flash than Breaking Bad, because the latter has fewer matches with the original title. The Simpsons wouldn’t probably be recommended to me, as the only similarity is the country of production. The exact match in this case would be The Flash, the best match is Breaking Bad. So far, so logical.

💡 The recommendation becomes more individual if a customer can rate which criteria are the most important.

However, one disadvantage of this approach is, that not every product can be divided into categories and properties and thus be compared to each other. Additionally, there’s a risk of getting stuck in the same product category. The consequence might be, that only TV series based on TV series are recommended.

Personally, I like to watch Arrow, but I’m not only interested in similar TV series - I would also like to buy arrows and a bow. The question is: “How can arrow and a bow be recommended to me?“. It’s obvious to simply look at what other customers bought. This process is called Collaborative Filtering.

Collaborative Filtering

Collaborative filtering is a method of making recommendations (filtering) based on the interests of a list of customers (collaborating). Hence, the customers work together involuntary, so to say. The underlying assumption of this approach is that if person A has the same opinion as person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.

Instead of comparing properties of products, Collaborative Filtering tries to find similarities between customers. One important thing to keep in mind is, that the similarity is entirely based on the given rating to a specific product. If two customers gave the same ratings to 5 TV series, one can consider them to be equal - despite the fact, that their age or gender might be different.

Memory-based Filtering

In this context, Memory-based Filtering can be used to determine the n most similar customers. For this purpose, statistical techniques are applied to the entire dataset. To better understand the concept, let’s create a simple dataset first.

The Flash5214
Breaking Bad2524
The Simpsons3431

The dataset of our Content-based Filtering example from the beginning is extended to include ratings of Jake, Lucy, Emily and me. Just like Amazon, a rating with value 5 indicates overall satisfaction, whereas a value of 1 means the opposite. For calculating the similarity (last row), I use the formula for centered cosine.

💡 The formula for centered cosine is the same as that for Pearson correlation coefficient (PCC).

Suppose, I wanted to calculate the measure for Jake and me - it would look like this:

According to the Cauchy–Schwarz inequality, PCC has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

The calculated score 0.93 is close to 1, which means that Jake and I share common interests. That’s not surprising, as we gave similar ratings to the TV series. Hence, if Jake decides to buy an arrow and a bow on Amazon, it’s likely that I get suggested the same products.

But: This approach also has its disadvantages. As the number of customers increases, so does the computing effort - since theoretically every customer has to be compared with every other one. Further, if the company has a large product range (like Amazon), customers may rarely have similarities. While Jake prefers to watch and rate the TV series in English, I’m a fan of the German version. The result is, that we both rated Arrow, but in a different version and therefore not the same product - although we have the same taste.

Model-based Filtering

It seems obvious, that an approach is needed to limit the number of comparisons. Instead of looking at customer-to-customer similarities, Model-based Filtering tries to identify superior properties in the dataset in order to create groups (clusters) out of them. As a result, I don’t have to be compared with Jake anymore, but with the group of e.g. ”Comic Fans“.

Note: It is not necessary, to know the properties of the groups in advance.

k-means Clustering

A simple approach of Model-based Filtering is k-means clustering. The goal of this algorithm is to find a number of k groups (clusters) in the data. It does so by iteratively assigning each data point to the cluster with the nearest centroid. For the first iteration, these centroids are randomly chosen. Next, the centroid of each cluster is recalculated, and each data point is assigned again […] until stopping criteria is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).

Let’s create a sample dataset and try to identify clusters using k-means clustering. I use the Python packages sklearn to randomly create data points and matplotlib as well as seaborn for a scatter plot. You can find the code snippet below.

2# -*- coding: utf-8 -*-
4""" Author: nicolai92 """
6from sklearn.datasets.samples_generator import make_blobs
7import numpy as np
8import matplotlib.pyplot as plt
9import seaborn as sns
11sns.set() # Use seaborn for plot
13X, Y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
14plt.scatter(X[:, 0], X[:, 1], s=50)

Running the code snippet will show a figure like the following. Since we can obviously make out about 4 clusters, we now attempt to create them using k-means. Next, we assign each data point to one out of those 4 clusters.

Sample data.
Random generated data points using sklearn

The package sklearn ships with an implementation of the k-means algorithm. Let’s extend the code snippet to finally create the clusters and assign the data points. Note, that in line 8 we have to explicitly define how many clusters to create — here it’s 4.

2# -*- coding: utf-8 -*-
4""" Author: nicolai92 """
6from sklearn.cluster import KMeans
8kmeans = KMeans(n_clusters=4)
10y_kmeans = kmeans.predict(X)
12plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='Set3')
14centers = kmeans.cluster_centers_
15plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

The output of the extended script is a scatter plot, whereas each color (green, yellow, gray, blue) defines a cluster. Now, the data points are assigned to the cluster with the nearest centroid — highlighted with a black circle.

Assigned data points to cluster.
Colored clusters with highlighted centroids

But: The clusters can depend on the choice of the initially chosen centroids. It’s therefore recommended to randomly select centroids at the beginning and run the k-means algorithm multiple times.

Further, Collaborative Filtering in general also has it’s disadvantages. Problems occur, if a product was released shortly and has no ratings or just a few. Hence, it’s advisable to combine Content-based Filtering and Collaborative Filtering.

And that’s how Amazon works?

What exactly happens behind the scenes at Amazon is kept under lock and key. The last more detailed information came from a 2003 article and a patent, but it’s assumed that the fundamentals of the system have remained the same.

Join the mailing list

High-quality blog posts are like shiny Pokémon - they don't appear often. But when they do, be the first to receive the latest content with the ability to opt-out at anytime.

More articles from Nicolai

Alternative to Docker Desktop 🐳

This post explains, how Podman can be setup as an alternative to Docker Desktop, which became a paid solution for professionals.

February 4th, 2022 · 2 min read

Kubernetes Cheatsheet 📝

This post contains a list of commands and tips, that I use often when working with Kubernetes.

August 18th, 2021 · 3 min read
© 2017–2022 Nicolai
Link to $mailto:nicolai+blog@disroot.orgLink to $ to $ to $ to $ to $
Sometimes, the questions are complicated – and the answers are simple.