I recently had a conversation with a friend about the distance between words. Yes, it sounds absurd but it is a rather interesting concept and ended up being my “CS fact of the day”
Lavenshtein Distance is the most commonly used technique to determine the distance between two words and it basically measures the minimum number of changes required to transform one word into the other.It can be used to check for possible spelling errors etc. by minimizing the distance.
Let’s say word A is ‘meteor’ and word B is ‘enter’. There isn’t any spelling mistake here per se, but we can still measure the distance.
Transforming A into B involves identifying the minimum number of changes. In this case, it is LD(A,B) = 3.
This is because we can make the following changes:
eteor (delete m)
enteor (insert n)
enter (delete o)
There are obviously infinite number of sequences that will go from A to B, but the objective here is to calculate the minimum one (sort of like the linear distance concept because linear distance is the shortest)
Interestingly enough, this LD (or Lavenshtein Distance) happens to form a metric space. That is, if LD(A, B) = x and LD(B, C) = y, then LD(A, C) = z and (x+y) >= z. This is basically the triangle inequality. So, it enables you to build spacial partitioning trees.
Pretty neat, huh?
So, there’s more than a difference between words… there is a distance…