Edit Distance (Big Data Analytics, Machine Learning)

Edit distance is a way of quantifying how dissimilar two strings (e.g. words) are to one another by counting the minimum number of operations

required to transform one string into the other. 

Edit distances are useful in:

1) Natural language processing, where automatic spelling correction can determine candidate

corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. 

2) In bioinformatics, it can be used to quantify the similarity of DNA sequences,

which can be viewed as strings of the letters A, C, G and T.

 

Different definitions of an edit distance use different sets of string operations.

The operations are the removal, insertion, or substitution of a character in the

string.


Types of Edit distance:

Different types of edit distance allow different sets of string operations. For instance:

a) The Levenshtein distance allows deletion, insertion and substitution.

b) The Longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.

c) The Hamming distance allows only substitution, hence, it only applies to
strings of the same length.

d) The Damerau–Levenshtein distance allows insertion, deletion, substitution,
and the transposition of two adjacent characters.
e) The Jaro distance allows only transposition.

Some edit distances are defined as a parameterizable metric calculated with a

specific set of allowed edit operations, and each operation is assigned a cost

(possibly infinite). This is further generalized by DNA sequence alignment

algorithms such as the Smith–Waterman algorithm, which make an operation's

cost depend on where it is applied.

 

 

Formal definition and properties:

Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the

set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series

of edit operations that transforms a into b. One of the simplest sets of edit

operations is that defined by Levenshtein:

Insertion of a single symbol. If a = uv, then inserting the symbol x produces

uxv. This can also be denoted ε→x, using ε to denote the empty string.

Deletion of a single symbol changes uxv to uv (x→ε).

Substitution of a single symbol x for a symbol yx changes uxv to uyv (xy).

 

Edit distance can be calculated by 2 methods:


A) Longest Common Sequence (LCS) Method

B) Classical Method


A) Longest Common Sequence (LCS) Method: LCS method performs detection operation on characters in a string.

Suppose ‘x’ and ‘y’ are two points represented as strings. Then the edit distance d(x , y) can be calculated as follows:


d(x , y)= length of string ‘x’ + length of string ‘y’ -2*LCS 


Where LCS=number of common characters


For Example 1] Consider the following 

x=A B C D E   and

y=A C F D E G 

Find the edit distance.

Solution:

Length of string ‘x’ = 5

Length of string ‘y’ = 6

LCS = 4


d(x,y)= 5 + 6 - (2*4) =11-8 =3

Hence, the edit distance is 3.


Example 2] What would be the edit distance between x and y? 

x= MAHARASHTRA and 

y= MAHABALESHWAR


Answer:

  x= MAHARASHTRA  and 

  y= MAHABALESHWAR

  LEN(X)=11

  LEN(Y)=13

  LCS=9

Edit Distance=11+13-2*9=24-18=6



 B) Classical Method:

By using the classical method, we can calculate the edit distance by counting the

minimum number of operations required to transform one string into the other.

The distance between two points is the number of insertions and deletions.


For Example: 1] Find the edit distance using classical method.

x= A B C D E    and

y= A C F D E G


Solution:

To find the edit distance between x and y we need to transform x into y or y into x

and calculate the total number of insertions and deletions required.


Here we are transforming y into x:


Step 1:Delete ‘B’ from position 2 of string ‘x’

x= A C D E

     1 2 3 4

y= A C F D E G

     1 2 3 4 5  6


Step 2: Insert ‘F’ at the position 3 of string ‘x’

x= A C F D E

     1 2 3 4 5

y= A C F D E G

     1 2 3 4 5  6


Step 3: Insert ‘G’ at position 6 of string ‘x’

x= A C F D E G

     1 2 3 4 5  6

y= A C F D E G

     1 2 3 4 5  6

Now both strings look the same.


Step 4: The edit distance is

Edit Distance = Number of insertions + Number of deletions

  = 2 + 1 = 3




No comments:

ads
Powered by Blogger.