levenshtein: Levenshtein Distance Metric in Scheme
License: LGPL 3 Web: http://www.neilvandyke.org/levenshtein-scheme/
(require levenshtein) | package: levenshtein |
1 Introduction
This is a Scheme implementation of the Levenshtein Distance algorithm, which is an edit distance metric of string similarity, due to Vladimir Levenshtein. The Levenshtein Distance is a function of two strings that represents a count of single-character insertions, deletions, and substitions that will change the first string to the second. More information is available in NIST DADS and the Michael Gilleland article, “Levenshtein Distance in Three Flavors.”
This implementation is modeled after a space-efficient Perl implementation by Jorge Mas Trullenque. It has been written in R5RS Scheme, and extended to support heterogeneous combinations of Scheme types (strings, lists, vectors), user-supplied predicate functions, and optionally reusable scratch vectors.
2 Basic Comparisons
In the current implementation, all comparisons are done internally via vectors.
procedure
(vector-levenshtein/predicate/get-scratch a b pred get-scratch) → any/c a : any/c b : any/c pred : any/c get-scratch : any/c
Few, if any, programs will use this procedure directly. This is like vector-levenshtein/predicate, but allows get-scratch to be specified. get-scratch is a procedure of one term, n, that yields a vector of length n or greater, which is used for record-keeping during execution of the Levenshtein algorithm. make-vector can be used for get-scratch, although some programs comparing a large size or quantity of vectors may wish to reuse a record-keeping vector, rather than each time allocating a new one that will need to be garbage-collected.
procedure
a : any/c b : any/c pred : any/c
procedure
a : any/c b : any/c
procedure
a : any/c b : any/c
procedure
a : any/c b : any/c
procedure
a : any/c b : any/c
Calculate the Levenshtein Distance of vectors a and b. pred is the predicate procedure for determining if two elements are equal. The /eq, /eqv, and /equal variants correspond to the standard equivalence predicates, eq?, eqv?, and equal?. vector-levenshtein is an alias for vector-levenshtein/equal.
(vector-levenshtein '#(6 6 6) '#(6 35 6 24 6 32)) ==> 3
procedure
a : any/c b : any/c pred : any/c
procedure
a : any/c b : any/c
procedure
a : any/c b : any/c
procedure
a : any/c b : any/c
procedure
a : any/c b : any/c
Calculate the Levenshtein Distance of lists a and b. pred is the predicate procedure for determining if two elements are equal. The /eq, /eqv, and /equal variants correspond to the standard equivalence predicates, eq?, eqv?, and equal?. list-levenshtein is an alias for list-levenshtein/equal. Note that comparison of lists is less efficient than comparison of vectors.
(list-levenshtein/eq '(b c e x f y) '(a b c d e f)) ==> 4
procedure
a : any/c b : any/c
Calculate the Levenshtein Distance of strings a and b.
(string-levenshtein "adresse" "address") ==> 2
3 Type-Coercing Comparisons
Procedures levenshtein and levenshtein/predicate provide a convenient interface for comparing a combination of vectors, lists, and strings, the types of which might not be known until runtime.
procedure
a : any/c b : any/c pred : any/c
Calculates the Levenshtein Distance of two objects a and b, which are vectors, lists, or strings. a and b need not be of the same type. pred is the element equivalence predicate used.
(levenshtein/predicate '#(#\A #\B #\C #\D) "aBXcD" char-ci=?) ==> 1
procedure
a : any/c b : any/c
Calculate the levenshtein distance of a and b, in a similar manner as using levenshtein/predicate with equal? as the predicate.
(define g '#(#\g #\u #\m #\b #\o)) (levenshtein g "gambol") ==> 2 (levenshtein g "dumbo") ==> 1 (levenshtein g "umbrage") ==> 5
4 History
Version 0.6 —
2009-03-14 – PLaneT (1 3) Documentation fixes.
Version 0.5 —
2009-02-24 – PLaneT (1 2) License is now LGPL 3. Tests moved out of main file. Converted to author’s new Scheme administration system.
Version 0.4 —
2005-07-10 – PLaneT (1 1) Added Testeez tests.
Version 0.3 —
2005-07-09 – PLaneT (1 0) PLaneT release, and minor documentation changes.
Version 0.2 —
2004-07-06 Documentation changes.
Version 0.1 —
2004-05-13 First release. Tested only lightly, and today is the 13th, so caveat emptor.
5 Legal
Copyright (c) 2004–2009 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License (LGPL 3), or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.