# How does the Bitap algorithm work?

5

According to Wikipedia in English:

The bitmap algorithm (also known as o-shift or, offset and or   Baeza-Yates-Gonnet algorithm) is a sequence of correspondence   approximate algorithm. The algorithm indicates whether a given text   contains a subsequence that is "approximately equal" to a given   standard, where the approximate equality is defined in terms of   Levenshtein distance - if the substring and pattern are within a   given k distance from each other, then the algorithm considers them   equal.

I would like to know how this algorithm works exactly because the content I found is very complex.

asked by anonymous 24.02.2015 / 04:38

3

I extract the code contained in this response from this article , and the code analysis is my own.

Below is the code in C to follow the rationale for the analysis. Note: In this code the values of 1 and 0 are semantically opposite.

``````#include <stdlib.h>
#include <string.h>
#include <limits.h>

const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)
{
const char *result = NULL;
int m = strlen(pattern);
unsigned long *R;
int i, d;

if (pattern[0] == '#include <stdlib.h>
#include <string.h>
#include <limits.h>

const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)
{
const char *result = NULL;
int m = strlen(pattern);
unsigned long *R;
int i, d;

if (pattern[0] == '%pre%') return text;
if (m > 31) return "The pattern is too long!";

/* Initialize the bit array R */
R = malloc((k+1) * sizeof *R);
for (i=0; i <= k; ++i)
R[i] = ~1;

/* Initialize the pattern bitmasks */
for (i=0; i <= CHAR_MAX; ++i)
for (i=0; i < m; ++i)

for (i=0; text[i] != '%pre%'; ++i) {
/* Update the bit arrays */
unsigned long old_Rd1 = R[0];

R[0] <<= 1;

for (d=1; d <= k; ++d) {
unsigned long tmp = R[d];
/* Substitution is all we care about */
R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
old_Rd1 = tmp;
}

if (0 == (R[k] & (1UL << m))) {
result = (text+i - m) + 1;
break;
}
}

free(R);
return result;
}
') return text;
if (m > 31) return "The pattern is too long!";

/* Initialize the bit array R */
R = malloc((k+1) * sizeof *R);
for (i=0; i <= k; ++i)
R[i] = ~1;

/* Initialize the pattern bitmasks */
for (i=0; i <= CHAR_MAX; ++i)
for (i=0; i < m; ++i)

for (i=0; text[i] != '%pre%'; ++i) {
/* Update the bit arrays */
unsigned long old_Rd1 = R[0];

R[0] <<= 1;

for (d=1; d <= k; ++d) {
unsigned long tmp = R[d];
/* Substitution is all we care about */
R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
old_Rd1 = tmp;
}

if (0 == (R[k] & (1UL << m))) {
result = (text+i - m) + 1;
break;
}
}

free(R);
return result;
}
``````

# Analysis

Initialization occurs until the third loop `for` . The third loop causes all error parsing positions for each attempt to indicate "failed" for initialization effect.

The last `for` , which has a `for` nested, makes character-to-character comparisons to check for errors. The% internal% limits the scope of checks to the maximum distance established in the algorithm call.

Once the vector R is all filled with `for` , the word result is returned.

24.02.2015 / 13:18