• incanter

1.2.3-SNAPSHOT

# levenshtein-distance

## incanter.stats

• (levenshtein-distance a b)

http://en.wikipedia.org/wiki/Levenshtein_distance

internal representation is a table d with m+1 rows and n+1 columns

where m is the length of a and m is the length of b.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

1. kitten ? sitten (substitution of 's' for 'k')
2. sitten ? sittin (substitution of 'i' for 'e')
3. sittin ? sitting (insert 'g' at the end).

The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:

* It is always at least the difference of the sizes of the two strings.
* It is at most the length of the longer string.
* It is zero if and only if the strings are identical.
* If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.

### Source incanter/stats.clj:3174 top

(defn levenshtein-distance
"
http://en.wikipedia.org/wiki/Levenshtein_distance

internal representation is a table d with m+1 rows and n+1 columns

where m is the length of a and m is the length of b.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.

For example, the Levenshtein distance between \"kitten\" and \"sitting\" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

1. kitten ‚άν sitten (substitution of 's' for 'k')
2. sitten ‚άν sittin (substitution of 'i' for 'e')
3. sittin ‚άν sitting (insert 'g' at the end).

The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:

* It is always at least the difference of the sizes of the two strings.
* It is at most the length of the longer string.
* It is zero if and only if the strings are identical.
* If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.

"
[a b]
(let [m (count a)
n (count b)
init (apply deep-merge-with (fn [a b] b)
(concat
;;deletion
(for [i (range 0 (+ 1 m))]
{i {0 i}})
;;insertion
(for [j (range 0 (+ 1 n))]
{0 {j j}})))
table (reduce
(fn [d [i j]]
(deep-merge-with
(fn [a b] b)
d
{i {j (if (= (nth a (- i 1))
(nth b (- j 1)))
((d (- i 1)) (- j 1))
(min
(+ ((d (- i 1))
j) 1) ;;deletion
(+ ((d i)
(- j 1)) 1) ;;insertion
(+ ((d (- i 1))
(- j 1)) 1))) ;;substitution))
}}))
init
(for [j (range 1 (+ 1 n))
i (range 1 (+ 1 m))] [i j]))]

((table m) n)))
Vars in incanter.stats/levenshtein-distance: defn let
Used in 0 other vars