Skip to content

Latest commit

 

History

History
187 lines (136 loc) · 7.79 KB

README.markdown

File metadata and controls

187 lines (136 loc) · 7.79 KB

Boyer-Moore String Search

Goal: Write a string search algorithm in pure Swift without importing Foundation or using NSString's rangeOfString() method.

In other words, we want to implement an indexOf(pattern: String) extension on String that returns the String.Index of the first occurrence of the search pattern, or nil if the pattern could not be found inside the string.

For example:

// Input: 
let s = "Hello, World"
s.indexOf("World")

// Output:
<String.Index?> 7

// Input:
let animals = "🐶🐔🐷🐮🐱"
animals.indexOf("🐮")

// Output:
<String.Index?> 6

Note: The index of the cow is 6, not 3 as you might expect, because the string uses more storage per character for emoji. The actual value of the String.Index is not so important, just that it points at the right character in the string.

The [brute-force approach](../Brute-Force String Search/) works OK, but it's not very efficient, especially on large chunks of text. As it turns out, you don't need to look at every character from the source string -- you can often skip ahead multiple characters.

The skip-ahead algorithm is called Boyer-Moore and it has been around for a long time. It is considered the benchmark for all string search algorithms.

Here's how you could write it in Swift:

extension String {
  func indexOf(pattern: String) -> String.Index? {
    // Cache the length of the search pattern because we're going to
    // use it a few times and it's expensive to calculate.
    let patternLength = pattern.characters.count
    assert(patternLength > 0)
    assert(patternLength <= self.characters.count)

    // Make the skip table. This table determines how far we skip ahead
    // when a character from the pattern is found.
    var skipTable = [Character: Int]()
    for (i, c) in pattern.characters.enumerate() {
      skipTable[c] = patternLength - i - 1
    }

    // This points at the last character in the pattern.
    let p = pattern.endIndex.predecessor()
    let lastChar = pattern[p]

    // The pattern is scanned right-to-left, so skip ahead in the string by
    // the length of the pattern. (Minus 1 because startIndex already points
    // at the first character in the source string.)
    var i = self.startIndex.advancedBy(patternLength - 1)

    // This is a helper function that steps backwards through both strings 
    // until we find a character that doesn’t match, or until we’ve reached
    // the beginning of the pattern.
    func backwards() -> String.Index? {
      var q = p
      var j = i
      while q > pattern.startIndex {
        j = j.predecessor()
        q = q.predecessor()
        if self[j] != pattern[q] { return nil }
      }
      return j
    }

    // The main loop. Keep going until the end of the string is reached.
    while i < self.endIndex {
      let c = self[i]

      // Does the current character match the last character from the pattern?
      if c == lastChar {

        // There is a possible match. Do a brute-force search backwards.
        if let k = backwards() { return k }

        // If no match, we can only safely skip one character ahead.
        i = i.successor()
      } else {
        // The characters are not equal, so skip ahead. The amount to skip is
        // determined by the skip table. If the character is not present in the
        // pattern, we can skip ahead by the full pattern length. However, if
        // the character *is* present in the pattern, there may be a match up
        // ahead and we can't skip as far.
        i = i.advancedBy(skipTable[c] ?? patternLength)
      }
    }
    return nil
  }
}

The algorithm works as follows. You line up the search pattern with the source string and see what character from the string matches the last character of the search pattern:

source string:  Hello, World
search pattern: World
                    ^

There are three possibilities:

  1. The two characters are equal. You've found a possible match.

  2. The characters are not equal, but the source character does appear in the search pattern elsewhere.

  3. The source character does not appear in the search pattern at all.

In the example, the characters o and d do not match, but o does appear in the search pattern. That means we can skip ahead several positions:

source string:  Hello, World
search pattern:    World
                       ^

Note how the two o characters line up now. Again you compare the last character of the search pattern with the search text: W vs d. These are not equal but the W does appear in the pattern. So skip ahead again to line up those two W characters:

source string:  Hello, World
search pattern:        World
                           ^

This time the two characters are equal and there is a possible match. To verify the match you do a brute-force search, but backwards, from the end of the search pattern to the beginning. And that's all there is to it.

The amount to skip ahead at any given time is determined by the "skip table", which is a dictionary of all the characters in the search pattern and the amount to skip by. The skip table in the example looks like:

W: 4
o: 3
r: 2
l: 1
d: 0

The closer a character is to the end of the pattern, the smaller the skip amount. If a character appears more than once in the pattern, the one nearest to the end of the pattern determines the skip value for that character.

Note: If the search pattern consists of only a few characters, it's faster to do a brute-force search. There's a trade-off between the time it takes to build the skip table and doing brute-force for short patterns.

Credits: This code is based on the article "Faster String Searches" by Costas Menico from Dr Dobb's magazine, July 1989 -- Yes, 1989! Sometimes it's useful to keep those old magazines around.

See also: a detailed analysis of the algorithm.

Boyer-Moore-Horspool algorithm

A variation on the above algorithm is the Boyer-Moore-Horspool algorithm.

Like the regular Boyer-Moore algorithm, it uses the skipTable to skip ahead a number of characters. The difference is in how we check partial matches. In the above version, it a partial match is found but it's not a complete match, we skip ahead by just one character. In this revised version, we also use the skip table in that situation.

Here's an implementation of the Boyer-Moore-Horspool algorithm:

extension String {
  public func indexOf(pattern: String) -> String.Index? {
    let patternLength = pattern.characters.count
    assert(patternLength > 0)
    assert(patternLength <= self.characters.count)

    var skipTable = [Character: Int]()
    for (i, c) in pattern.characters.dropLast().enumerate() {
      skipTable[c] = patternLength - i - 1
    }

    var index = self.startIndex.advancedBy(patternLength - 1)

    while index < self.endIndex {
      var i = index
      var p = pattern.endIndex.predecessor()

      while self[i] == pattern[p] {
        if p == pattern.startIndex { return i }
        i = i.predecessor()
        p = p.predecessor()
      }

      let advance = skipTable[self[index]] ?? patternLength
      index = index.advancedBy(advance)
    }

    return nil
  }
}

In practice, the Horspool version of the algorithm tends to perform a little better than the original. However, it depends on the tradeoffs you're willing to make.

Credits: This code is based on the paper: R. N. Horspool (1980). "Practical fast searching in strings". Software - Practice & Experience 10 (6): 501–506.

Written for Swift Algorithm Club by Matthijs Hollemans