Can strings be compared fuzzily?

Mike Wang

It’s Thanksgiving week and you’re putting the finishing touches on your new e-commerce site, just in time for Black Friday. You’ve just gotten the product search functionality working, so you kick back in your chair and relax a little, dreaming of turkey and mashed potatoes.

When suddenly, you remember that turkey makes people sleepy! How’re your customers supposed to search for products on your site if they’re half-awake, groggily misspelling and mis-remembering the names of the things they want?

Fuzzy-compare to the rescue!

String compare functions (SCFs) are the core of any search function, allowing you to find matches to an arbitrary query. SCFs take two strings as arguments:

  • a needle, and
  • a haystack

A normal SCF will return true if and only if the needle is exactly identical to the haystack. This type of SCF is very unforgiving; a single spelling error in the needle will render your code blind to the user’s intended quarries.

In contrast, a fuzzy SCF will return true even if the needle and haystack are not exactly identical; they will return true as long as the two strings are more similar than a certain similarity threshold. This type of SCF will forgive spelling mistakes and typos, allowing even the most drowsy of users to find what they seek. Unfortunately, they also inherently heavier than a normal SCF, and if you are not careful, will slow your site’s search down to a unbearable crawl.

There are many ways to define how “similar” two strings are to each other. Most methods try to calculate the edit distance between them, but these usually carry heavy computation costs and sometimes involve advanced information theory. You need something fast and light. A particular programmer called Vyacheslav Egorov once encountered the same exact problem, and the following algorithm is derived from his genius solution:

// Mike Wang | 2017 | MIT license
function strcmpFuzzy(needle, haystack, threshold) {
  var h_len = haystack.length;
  var n_len = needle.length;
  if (n_len > h_len) {
    return false;
  }
  else if (n_len === h_len) {
    return (needle === haystack);
  }
  else {
    var h = 0;
    outer: for (var n = 0; n < n_len; n++) { // loop through each char in needle
      var n_ch = needle.charCodeAt(n);
      var j = 1;
      while (h < h_len) { // loop through each char in haystack, picking up where last outer loop left off
        if (haystack.charCodeAt(h) === n_ch) {
          continue outer; // break inner loop and start new outer loop
        }
        if (j > threshold) {
          return false; // only occurs if threshold or more haystack characters were skipped at once
        }
        j++; // increments occur only if chars do not match
        h++;
      }
      return false; // only occurs once inner loop has reached end of haystack
    }
    return true; // only occurs if each needle char found a haystack match
  }
}

This function will return true if and only if:

  • each character in the needle also exists in the haystack, AND
  • the characters in the needle exist in the same order in the haystack, AND
  • no more than threshold characters are skipped in the haystack at once

So, for example:

strcmpFuzzy('hard drives', 'hard drives', 3); // true
strcmpFuzzy('had drives' , 'hard drives', 3); // true
strcmpFuzzy('had dives'  , 'hard drives', 3); // true
strcmpFuzzy('had dies'   , 'hard drives', 3); // true
strcmpFuzzy('had des'    , 'hard drives', 3); // false
strcmpFuzzy('had des'    , 'hard drives', 4); // true
strcmpFuzzy('lard hives' , 'hard drives', 3); // false
strcmpFuzzy('dard hrives', 'hard drives', 3); // false

There! With a fuzzy SCF added to your site’s product search, you finally can rest easy that your customers will be able to find the products they desire, even after a third helping of turkey. At least, until they break out the wine…

 

Last updated by on .

What Are Your Thoughts?

Your email address will not be published. Required fields are marked *