Calculating Edit Distance in Python (Levenshtein)

Edit distance, also known as Levenshtein distance, is a measure of the similarity between two strings. It calculates the minimum number of operations required to transform one string into another, where each operation can be an insertion, deletion, or substitution of a single character. In this article, we’ll explore how to calculate the edit distance between two strings using the Levenshtein algorithm in Python.


What is “Edit Distance”?

Edit distance, also known as Levenshtein distance, is a measure of the similarity between two strings. It quantifies the minimum number of operations required to transform one string into another. These operations include:

  1. insertion,
  2. deletion,
  3. or substitution of a single character.

The concept of edit distance finds applications in various fields such as spell checking, DNA sequence alignment, and natural language processing. By calculating the edit distance between two strings, we can determine how similar or different they are.

For example, consider two strings, “kitten” and “sitting.” The edit distance between these strings is 3. We can transform “kitten” into “sitting” by:

  1. substituting ‘k’ with ‘s’
  2. inserting ‘g’ at the end
  3. inserting ‘s’ at the beginning.

These three operations represent the minimum changes required to convert one string into the other.


Example# 2

Here’s another example involving the strings “intention” and “execution”.

Calculating Edit Distance in Python (Levenshtein)

You can see that a total of 5 operations were required here to convert the string “intention” to “execution”. Thus, the edit distance of these two strings is 5.


Understanding the Edit Distance Algorithm

The Levenshtein algorithm, named after the Russian scientist Vladimir Levenshtein, is a dynamic programming approach used to calculate the edit distance between two strings. It involves constructing a matrix and iteratively filling in its cells to compute the final edit distance.

Here’s a step-by-step explanation of the algorithm:

  1. Initialization: We start by initializing a matrix with dimensions (m+1) x (n+1), where m and n are the lengths of the two strings. The extra “+1” is for accommodating empty substrings.
  2. Setting Initial Values: We set the values in the first row and first column of the matrix to represent the edit distance for substrings of varying lengths. Each cell at position (i, 0) corresponds to the edit distance when the second string is empty, and its value is simply i. Similarly, each cell at position (0, j) represents the edit distance when the first string is empty, and its value is j.
# Initialize the first row and column
for i in range(m + 1):
    matrix[i][0] = i
for j in range(n + 1):
    matrix[0][j] = j
  1. Filling the Matrix: Moving row by row, column by column, we iterate through the remaining cells of the matrix. For each cell at position (i, j), we compare the characters at the corresponding positions in the two strings. If the characters are the same, we copy the value from the diagonal cell (i-1, j-1) because no operation is required. If the characters are different, we consider the three operations:
    • Insertion: We look at the cell to the left (i, j-1).
    • Deletion: We look at the cell above (i-1, j).
    • Substitution: We look at the diagonal cell (i-1, j-1).
    We choose the minimum value among these three operations and add 1 to it because we need to account for the current operation. Finally, we store this minimum value in the current cell (i, j).
  2. Final Result: Once we have filled all the cells in the matrix, the value in the bottom-right cell represents the edit distance between the two strings. This value is the minimum cost required to transform one string into another.

By following these steps, the Levenshtein algorithm efficiently calculates the edit distance between two strings. It considers all possible combinations of operations and chooses the path with the minimum cost, enabling us to measure the similarity between strings.

In the next section, we will implement the Levenshtein algorithm in Python and demonstrate its usage with example code.


Implementing the Levenshtein Algorithm in Python

We can start by defining a function, levenshtein_distance, that takes two strings as input and returns their edit distance using the Levenshtein algorithm:

def levenshtein_distance(str1, str2):
    m, n = len(str1), len(str2)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize the first row and column
    for i in range(m + 1):
        matrix[i][0] = i
    for j in range(n + 1):
        matrix[0][j] = j

    # Fill in the matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                matrix[i][j] = matrix[i - 1][j - 1]
            else:
                insertion = matrix[i][j - 1]
                deletion = matrix[i - 1][j]
                substitution = matrix[i - 1][j - 1]
                matrix[i][j] = min(insertion, deletion, substitution) + 1

    # Return the edit distance
    return matrix[m][n]

Let’s test our function with some sample inputs:

#Test the levenshtein_distance function
str1 = "kitten"
str2 = "sitting"
distance = levenshtein_distance(str1, str2)
print(f"The edit distance between '{str1}' and '{str2}' is {distance}")

When you run the above code, it should output:

The edit distance between 'kitten' and 'sitting' is 3

In this case, it takes three operations (substitution of ‘k’ with ‘s’, insertion of ‘g’ at the end, and insertion of ‘s’ at the beginning) to transform “kitten” into “sitting.”

Here is what the matrix looks like. The value in the bottom-right corner is our desired answer.

[[0, 1, 2, 3, 4, 5, 6, 7],
 [1, 1, 2, 3, 4, 5, 6, 7],
 [2, 2, 1, 2, 3, 4, 5, 6],
 [3, 3, 2, 1, 2, 3, 4, 5],
 [4, 4, 3, 2, 1, 2, 3, 4],
 [5, 5, 4, 3, 2, 2, 3, 4],
 [6, 6, 5, 4, 3, 3, 2, 3]]

This marks the end of the Calculating Edit Distance in Python (Levenshtein) Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.

Subscribe
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments