Levenshtein's Distance Equation 🔍

Recently, I was working with DGraph to build a knowledge graph. The idea was to create a search engine, that helps to identify which technologies are used across different customers. When reading further through the tutorials, I stumbled upon Levenshtein’s distance equation in the context of fuzzy search.

Fuzzy Search

Providing search capabilities on technologies requires searching for the closest match to a string, if a full match doesn’t exist. This feature helps to get relevant results even if there’s a typo or the user doesn’t search based on the exact name it is stored. This is exactly what fuzzy search does — it compares the string values and returns the nearest matches. Hence, it’s ideal for the use case of implementing a search engine on technologies.

Levenshtein’s Distance Equation

Here, Levenshtein’s distance is often used as a metric to measure the closeness of two strings. The higher the number, the more operations (insert, substitute, delete) it takes to transform one string into the other. This metric can be calculated with Levenshtein’s distance equation, which is defined as:

lev_{a, b} (i, j) = ⎩ ⎨ ⎧ max (i, j) min ⎩ ⎨ ⎧ lev_{a, b} (i - 1, j) + 1 lev_{a, b} (i, j - 1) + 1 lev_{a, b} (i - 1, j - 1) + 1_{(a_{i} \neq = b_{j})} if min (i, j) = 0

Introduction

For example, the Levenshtein distance between kitten and sitting is 3, because it needs at least 3 edits to change one string into the other.

kitten → sitten (substitution of s for k)
sitten → sittin (substitution of i for e)
sittin → sitting (insertion of g at the end).

Before going into detail and understand Levenshtein’s distance equation, let’s have a quick refresher on functions first.

Functions

Functions take a given input, follow a set of instructions, and yield an output. A simple linear function may be defined as

f (x) = 2 x + 8,

which in human words states: Given an input (represented by $x$ ), multiply the input by 2 and add 8 to obtain the output.

input	set of instructions	output
$0$	$2 (0) + 8$	$8$
$1$	$2 (1) + 8$	$10$
$2$	$2 (2) + 8$	$12$

Piecewise Functions

Piecewise functions are more complex functions, as there are multiple sets of instructions. One set is chosen over another based on a certain condition. Consider the following example

f (x) = ⎩ ⎨ ⎧ x^{2} 8 16 - x if x < 4 if x = 4 if 4 < x < 16,

which in human words states: Given an input (represented by $x$ ) …

If the input is less than 4, square the input
If the input is equal to 4, the output is 8
If the input is between 4 and 16, subtract the input from 16

input	set of instructions	output
$2$	$2^{2}$	$4$
$4$	$8$	$8$
$10$	$16 - 10$	$6$

With that in mind, Levenshtein’s distance equation should look a little more readable.

Levenshtein’s Distance Equation in Detail

Up to this point a basic understanding of functions has been gained and a detailed example can now be discussed. In order to do so, let’s have a look at the parameters first:

a — 1st string
b — 2nd string
i — Terminal character position of 1st string
j — Terminal character position of 2nd string

An important thing to note is, that the strings are 1-indexed. That being said, the first line of Levenshtein’s distance equation

max (i, j) if min (i, j) = 0

translates into: If both strings are empty, then the Levenshtein distance equals to 0.

If that’s not the case, then consider the strings pen and pin as example. The numbers in the table below refer to the terminal character position and a and b to the strings to compare.

	1	2	3
a	P	E	N
b	P	I	N

To proceed, let’s check each character individually by looping over the terminal positions. For the given example, this results in:

lev_{a, b} (1, 1) = 0 lev_{a, b} (2, 2) = 1 lev_{a, b} (3, 3) = 0 ⟶ P = P ⟶ PE \neq = PI ⟶ PIN = PIN ⟶ 0 edit required ⟶ 1 edit required ⟶ 0 edit required

Overall, 1 edit is required to change pen into pin or vice versa.

Calculation using a Matrix

Fortunately, the Levenshtein distance can be calculated by solving a matrix. The bottom right corner will then yield the metric. Let’s begin to fill out the matrix by following the piecewise functions.

lev_{a, b} (1, 1) = ⎩ ⎨ ⎧ min ⎩ ⎨ ⎧ lev_{a, b} (0, 1) + 1 lev_{a, b} (1, 0) + 1 lev_{a, b} (0, 0) + 1_{(if a_{1} \neq = b_{1})} = 1 + 1 = 1 + 1 = 0 + 0 = 2 = 2 = 0

Having a closer look at the last instruction reveals, that there’s a conditional statement, $(if a_{i} \neq = b_{j})$ , which translates into: If both characters are not equal, then add 1 — else, do not account as there’s no edit required.

From the resulting vector, the minimum value is taken as result. As a consequence, the Levenshtein distance between the first letters of string a and b is 0 as they are equal.

lev_{a, b} (1, 1) = min ⎩ ⎨ ⎧ 220 = 0

When applying the above logic to the matrix, it’s solved as shown in the table below. The bottom right corner yields the Levenshein distance, which in this case is 1, as exactly 1 edit is required.

	#	P	E	N
#	0	1	2	3
P	1	0	1	2
I	2	1	1	2
N	3	2	2	1

About Vladimir Levenshtein

Vladimir Levenshtein was a Russian scientist, who is best known for the distance equation named after him. He came up with the algorithm in 1965 and it’s worth mentioning, that according to MIT the algorithm hasn’t improved since then. MIT researchers report that, “in all likelihood, […] the algorithm is as good as it gets”.

Levenshtein's Distance Equation 🔍

Fuzzy Search

Levenshtein’s Distance Equation

Introduction

Functions

Piecewise Functions

Levenshtein’s Distance Equation in Detail

Calculation using a Matrix

About Vladimir Levenshtein

Join the mailing list

More articles from Nicolai

Alternative to Docker Desktop 🐳

Kubernetes Cheatsheet 📝