Author Topic: [Resource] Damerau Levenshtein String Distance  (Read 2648 times)

Code: [Select]
function damerauLevenshteinDistance(%a,%b)
{
%h = strLen(%a)+1;
%w = strLen(%b)+1;

for(%i=0;%i<%h;%i++)
%a[%i] = getSubStr(%a,%m[%i,0] = %i,1);
for(%i=0;%i<%w;%i++)
%b[%i] = getSubStr(%b,%m[0,%i] = %i,1);

for(%y=1;%y<%h;%y++)
{
for(%x=1;%x<%w;%x++)
{
%c = !(%a[%y-1] $= %b[%x-1]);
%i = %m[%y,%x-1]+1;
%d = %m[%y-1,%x]+1;
%s = %m[%y-1,%x-1]+%c;

if(%s < %d && %s < %i)
%l = %s;
else if(%d < %i)
%l = %d;
else
%l = %i;

if(%x > 1 && %y > 1 && %a[%y-1] $= %b[%x-2] && %a[%y-2] $= %b[%x-1] && (%t = %m[%y-2,%x-2]+%c) < %l)
%l = %t;
%m[%y,%x] = %l;
}
}
return %m[%h-1,%w-1];
}

find the distance between two strings as a numerical value
includes insertion, deletion, substitution, and transpositions

respectively each method is shown below, each of these strings are considered a distance of '1' from eachother

damerauLevenshteinDistance("abc","abbc");
damerauLevenshteinDistance("abc","ac");
damerauLevenshteinDistance("abc","axc");
damerauLevenshteinDistance("abc","aabc");

this function clearly becomes slower the longer the strings so make sure to include some sort of limiter and cache results
do not allow a user to enter multiple 255 length strings concatenated together in a serverCmd and then parse them through this function, your server will explode
if you expect this function to be called many times for similar strings consider adding a simple global variable cache array

me: aw yea I can evade the censor list! fkc u dumpazz
damerau-levenshtein distance algorithm: allow us to introduce ourselves

this would actually be a very poor algorithm for censor
demonstrated by the words below which are all a distance of '1' from a word that would be found on a censor list


Code: [Select]
fog
fig
hag
gag
tag
rag
sag
jag
mag
bag
lag
nag
shot
shut
tuck
muck
suck
ruck
buck
cuck
duck
huck
luck
puck
yuck
bigger
digger
nagger
rigger
tigger


this algorithm is more useful when trying to match a string to a specific list ie a user list or a item list brick list etc
further more this algorithm is very computational and is not something you would want to run on every multi word message sent by a user against a censor list

is there no phonetic comparison algorithm out there


this would actually be a very poor algorithm for censor
demonstrated by the words below which are all a distance of '1' from a word that would be found on a censor list


Code: [Select]
duck
nooooo
targeted censorship!

did you know that if duck isn't in the dictionary, using this form of search will return
"suck"
and
"richard"
for me it's normally returned in that order, and i think richard was in there as short for richard?
what a great freshman assignment, it also took into account closeness of keys to type it