One of the most commonly used functions in word processing (and other text manipulation) is that of searching some body of text (the target) for a particular pattern of letters (the pattern.) For example, the ISO module Strings has
PROCEDURE FindNext(pattern, stringToSearch : ARRAY OF CHAR; startIndex : CARDINAL; VAR patternFound : BOOLEAN; VAR posOfPattern : CARDINAL);
The simplest and most natural algorithm for implementing such a search is to count through the target until the first letter of the pattern is matched, then count on both strings to see if the match continues. If it does for the entire pattern, the whole string matches. If a mismatch is found, one then continues to try to match the first character of the pattern with characters in the target.
Find the pattern "sip" in the target "Mississippi."
Compare "s" to "M" and "i" and, finding no match, advance.
Finding a match at "s" look at "i" which fails to match.
Mississippi sip
The next try letter matches "s," again and then also "i", but not "p."
Mississippi sip
At the next position there is a mismatch, and then there is another match at "s," but the next letter fails. On the next advance, one has
Mississippi sip
and here the desired match is obtained. If m and n denote the lengths of the two strings, this (brute force) algorithm may require up to mn comparisons, that is, it is O(mn). Here is an implementation of this procedure :
PROCEDURE FindNext(pattern, stringToSearch : ARRAY OF CHAR; startIndex : CARDINAL; VAR patternFound : BOOLEAN; VAR posOfPattern : CARDINAL); VAR countSi : CARDINAL; (* initial stringToSearch position count *) countP : CARDINAL; (* pattern count *) countSr : CARDINAL; (* count to the right of stringToSearch start *) endP : CARDINAL; (* last pattern position *) lenS : CARDINAL; (* length of stringToSearch *) ch : CHAR; (* temp holder to reduce referencing *) BEGIN patternFound := FALSE; endP := LENGTH (pattern); IF endP <> 0 (* if is zero, do nothing *) THEN DEC(endP); (* last position used is one less than length *) lenS := LENGTH(stringToSearch); countSi := startIndex; IF endP = 0 THEN (* was one character only *) ch := pattern[0]; (* save it *) WHILE (countSi < lenS) AND (ch <> stringToSearch[countSi]) DO (* and go compare *) INC(countSi); END; (* while count *) IF countSi < lenS THEN (* we exited the loop before end of string *) patternFound := TRUE; (* so we got that one match *) posOfPattern := countSi; RETURN; (* tell the world so *) END; (* if countSi < lenS *) ELSE (* more than one character to be matched *) ch := pattern[endP]; (* start at back end *) WHILE countSi + endP < lenS DO IF ch = stringToSearch[countSi+endP] THEN (* got a first character match *) countP := 0; (* do look for more *) countSr := countSi; LOOP IF pattern[countP] <> stringToSearch[countSr] THEN EXIT; END; (* if pattern[countP] *) INC(countP); INC(countSr); IF countP = endP (* match whole thing *) THEN patternFound := TRUE; posOfPattern := countSi; RETURN; END; (* if countP = endP *) END; (* loop *) END; (* if ch = stringToSearch *) INC(countSi); END; (* while *) END; (* if endP = 0 *) END; (* endP <> 0 *) END FindNext;
However, a close examination of the progress of this algorithm shows us that some comparisons are not necessary. If for instance, one has matched the first two letters "si" and there is then a mismatch, one already knows that the initial "s" cannot match the second letter "i" and so the search can be advanced by two places rather than one, and one comparison can be saved. Sometimes the potential savings are more dramatic:
Find the pattern "gead" in the target "geaageabgeacgead"
geaageabgeacgead gead
The first three letters match, but when the fourth letter fails to match, they are known not to match the first letter either, so the pattern can be moved over to:
geaageabgeacgead gead
then to
geaageabgeacgead gead
and finally to
geaageabgeacgead gead
where a match on all four characters is finally obtained. Notice that this particular search turns out to be O(n), where n is the length of the target string. However, the ability to slide the matching process over by three characters depends on the fact that there are no repetitions in the pattern string. If there are repetitions, the amount that the search can be shifted depends on the place where the repetition is found in the pattern.
Find the pattern "papa" in the target "papuapapyruspapa"
papuapapyruspapa papa
This time, when fails to match, the pattern can be moved over only by two places because of the repetition of the first letter.
papuapapyruspapa papa
Similar considerations apply to a later match of "pap" followed by a mismatch.
In order to implement this strategy, therefore, it is necessary to search the pattern within itself looking for repetitions in order to determine how far the pattern can be moved after a failure at each position. Since this means examining the pattern, and then following up by examining the target, this algorithm is O(m + n). It is called the Knuth-Morris-Pratt algorithm, after the three who originated and refined it.
Knuth-Morris-Pratt search = scan pattern and construct "backup" vector (table) scan the target while characters match, advance in both pattern and target on mismatch, shift by amount in "backup" vector for that position
Find the pattern "cashcar" in the target "xcucatcastcashewcashcucashcatcashcart"
(construct backup vector): (T = target and P = pattern)
Matched | Mismatch@ | backup | meaning | action | |
---|---|---|---|---|---|
pos | letter | [j] | |||
none | 0 | c | -1 | [0] # c | next T pos, resume P at 0 |
c | 1 | a | 0 | [1] # a maybe c | keep T pos, resume P at 0 |
ca | 2 | s | 0 | [2] # s maybe c | keep T pos, resume P at 0 |
cas | 3 | h | 0 | [3] # h maybe c | keep T pos, resume P at 0 |
cash | 4 | c | -1 | [4] # c | next T pos, resume P at 0 |
cashc | 5 | a | 0 | [5] # a maybe c | keep T pos, resume P at 0 |
cashca | 6 | r | 2 | [6] # r, got "ca" | next T pos, resume P at 2 |
The term "backup" refers to the position in the pattern that the search can be backed up to in order to continue examining the target without wasting comparisons. The value -1 will serve to flag that the position in the target is to be incremented for the next comparison. When the scan "backs up" to the position -1, both the pattern and target counters will be incremented; when it backs up to the position 0 (or to any other positive number), only the target position will be incremented before continuing comparisons. Here is the KNP (Knuth-Morris-Pratt) version of the FindNext procedure:
PROCEDURE FindNext (pattern, stringToSearch : ARRAY OF CHAR; startIndex : CARDINAL; VAR patternFound : BOOLEAN; VAR posOfPattern : CARDINAL); (* Knuth Morris Pratt version adapted by R. Sutcliffe last modified 1995 05 04 *) VAR count1, count2, sCount, pCount, patternLen, targetLength: INTEGER; backup : Vector; BEGIN targetLength := LENGTH (stringToSearch); patternLen := LENGTH (pattern); (* construct backup vector *) count1 := 0; count2 := -1; backup [0] := -1; REPEAT (* look for match in pattern with itself *) IF (count2 = -1) (* starting (over) *) OR (pattern [count1] = pattern [count2]) (* (still) matching *) THEN INC(count1); (* keep going *) INC(count2); IF pattern [count1] # pattern [count2] (* no match there *) THEN (* keep same count *) backup [count1] := count2; ELSE (* keep same backup *) backup [count1] := backup [count2]; END; (* if pattern *) ELSE count2 := backup[count2]; END; (* if count2 *) UNTIL count1 >= patternLen; sCount := startIndex; pCount := 0; REPEAT IF (pCount = -1) (* restart flag *) OR (stringToSearch [sCount] = pattern [pCount]) (* matching *) THEN INC(sCount); (* move to next pos in source *) INC(pCount); (* if was -1, now 0 for restart *) ELSE (* take backup no restart flag or match same source pos *) pCount := backup [pCount]; END; (* if pCount *) UNTIL (pCount >= patternLen) OR (sCount >= targetLength); IF pCount >= patternLen THEN (* got it so report back *) posOfPattern := sCount - patternLen; patternFound := TRUE; ELSE (* failed *) posOfPattern := targetLength; patternFound := FALSE; END; (* if pCount *) END FindNext;
Unless the pattern being searched for has considerable repetition, this method may not yield much better results than the brute force method in practice. Yet another method that provides much faster searching capability is the Boyer-Moore algorithm; however, it will not be discussed here as it is rather more complicated.