Edit Distance (Big Data Analytics, Machine Learning)
Edit distance is a way of quantifying how dissimilar two strings (e.g. words) are to one another by counting the minimum number of operations
required to transform one string into the other.
Edit distances are useful in:
1) Natural language processing, where automatic spelling correction can determine candidate
corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question.
2) In bioinformatics, it can be used to quantify the similarity of DNA sequences,
which can be viewed as strings of the letters A, C, G and T.
Different definitions of an edit distance use different sets of string operations.
The operations are the removal, insertion, or substitution of a character in the
string.
Types of Edit distance:
Different types of edit distance allow different sets of string operations. For instance:
a) The Levenshtein distance allows deletion, insertion and substitution.
b) The Longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
Some edit distances are defined as a parameterizable metric calculated with a
specific set of allowed edit operations, and each operation is assigned a cost
(possibly infinite). This is further generalized by DNA sequence alignment
algorithms such as the Smith–Waterman algorithm, which make an operation's
cost depend on where it is applied.
Formal definition and properties:
Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the
set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series
of edit operations that transforms a into b. One of the simplest sets of edit
operations is that defined by Levenshtein:
Insertion of a single symbol. If a = uv, then inserting the symbol x produces
uxv. This can also be denoted ε→x, using ε to denote the empty string.
Deletion of a single symbol changes uxv to uv (x→ε).
Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x→y).
Edit distance can be calculated by 2 methods:
A) Longest Common Sequence (LCS) Method
B) Classical Method
Suppose ‘x’ and ‘y’ are two points represented as strings. Then the edit distance d(x , y) can be calculated as follows:
d(x , y)= length of string ‘x’ + length of string ‘y’ -2*LCS
Where LCS=number of common characters
For Example 1] Consider the following
x=A B C D E and
y=A C F D E G
Find the edit distance.
Solution:
Length of string ‘x’ = 5
Length of string ‘y’ = 6
LCS = 4
d(x,y)= 5 + 6 - (2*4) =11-8 =3
Hence, the edit distance is 3.
Example 2] What would be the edit distance between x and y?
x= MAHARASHTRA and
y= MAHABALESHWAR
Answer:
x= MAHARASHTRA and
y= MAHABALESHWAR
LEN(X)=11
LEN(Y)=13
LCS=9
Edit Distance=11+13-2*9=24-18=6
B) Classical Method:
By using the classical method, we can calculate the edit distance by counting the
minimum number of operations required to transform one string into the other.
The distance between two points is the number of insertions and deletions.
For Example: 1] Find the edit distance using classical method.
x= A B C D E and
y= A C F D E G
Solution:
To find the edit distance between x and y we need to transform x into y or y into x
and calculate the total number of insertions and deletions required.
Here we are transforming y into x:
Step 1:Delete ‘B’ from position 2 of string ‘x’
x= A C D E
1 2 3 4
y= A C F D E G
1 2 3 4 5 6
Step 2: Insert ‘F’ at the position 3 of string ‘x’
x= A C F D E
1 2 3 4 5
y= A C F D E G
1 2 3 4 5 6
Step 3: Insert ‘G’ at position 6 of string ‘x’
x= A C F D E G
1 2 3 4 5 6
y= A C F D E G
1 2 3 4 5 6
Now both strings look the same.
Step 4: The edit distance is
Edit Distance = Number of insertions + Number of deletions
= 2 + 1 = 3
No comments: