What is the main difference between the Knuth-Morris-Pratt and Boyer-Moore algorithms

4

I know that KMP (Knuth-Morris-Pratt) is used to find a text Y in X, tries to set a pattern in Y, and then saves this pattern in a vector. And I also know that BM (Boyer-Moore) works best for small words.

But what is the main difference in your operation, which one is better?

    
asked by anonymous 22.08.2017 / 15:30

1 answer

2

Functional example in English

In a crude explanation

Boyer-Moore's approach is to try to match the last character of the pattern instead of the first with the assumption that if there is no match at the end, it is not necessary to try to match at the beginning. This allows "big jumps" so BM works best when the pattern and text you are looking for are similar to "natural text" (ie English)

Knuth-Morris-Pratt searches for occurrences of a word "W" within a main "text string", using the observation that when mismatch, the word itself incorporates sufficient information to determine where the next match could begin, thereby ignoring re-examination of previously corresponding characters. (Source: Wiki )

This means that KMP is best suited for small sets such as DNA (ACTG)

According to the creator:

  

The classic Boyer-Moore algorithm suffers from the phenomenon that tends to   function as efficiently in small alphabets as DNA. THE   jump distance tends to stop growing with the length of the   standard because substrings reappear frequently. By remembering more   has been combined, larger jumps can be achieved   text. One can even arrange the "perfect memory" and, therefore,   look at each character to the maximum, while the Boyer-Moore algorithm,   while linear, can inspect a character's text several times.   This idea of remembering more was explored in the literature by   others. This suffers from the need for very large tables or   state machines.

When to use:

In theory, both algorithms will perform "like"; KMP will make about 2n comparisons in the search phase and Boyer-Moore will make about 3n comparisons in the worst case search phase. In neither case do you have to repeat the pre-processing when you receive new text.

But the real answer is that you should not use any of them practice.

The linear auxiliary storage required by both algorithms leads to considerably lower performance in modern architectures because of all the extra memory accesses.

However, the ideas behind Boyer-Moore and KMP support most fast-string matching algorithms. Something like the KMP "fault function" is used by all the practically effective chain matching algorithms I know of; It turns out you can calculate a suboptimal "fault function" for an on-the-fly pattern that still gives you a linear time match while you just need constant additional space. Boyer-Moore is faster than linear in the "middle case" of combining a fixed pattern against random noise, and this happens in many practical situations.

Sources: 1 2

    
04.09.2017 / 12:34