This library is a Ruby extension: a wrapper around the C implementation of Aho-Corasick found in the Strmat package.
The source code (ac.c and ac.h) was “adapted” from Strmat. In fact, I’ve changed only 3-4 lines of the original implementation to fit my needs: the search had to return the current position in the searched string.
Given a dictionary of known sentences (note: not words!), this kick-ass algorithm can find individual patterns in an incoming stream of data. Kinda fast: the search runs in time linear in the length of the input plus the number of matches.
The algorithm has 2 stages: in the first, an internal tree is built from the given dictionary; in the second, that tree is used to search the input.
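To make the two stages concrete, here is a minimal pure-Ruby sketch of Aho-Corasick. It is not the Strmat C code this gem wraps, only an illustration. The names TinyKeywordTree and Node are made up for this example, and the result hashes only imitate the :value, :starts_at and :ends_at keys (there is no :id here).

```ruby
# Illustrative sketch only: NOT the gem's implementation.
# TinyKeywordTree and Node are hypothetical names.
class TinyKeywordTree
  class Node
    attr_accessor :children, :fail, :patterns, :outputs

    def initialize
      @children = {}   # char => Node
      @fail     = nil  # failure link, set in stage 1b
      @patterns = []   # patterns ending exactly at this node
      @outputs  = []   # patterns reported when this node is reached
    end
  end

  def initialize
    @root  = Node.new
    @built = false
  end

  # Stage 1a: insert a dictionary entry into the trie.
  def add_string(pattern)
    raise ArgumentError, "empty pattern" if pattern.empty?
    node = @root
    pattern.each_char { |ch| node = (node.children[ch] ||= Node.new) }
    node.patterns << pattern
    @built = false
    self
  end

  # Stage 2: scan the text once, following failure links on mismatch.
  def find_all(text)
    build_failure_links unless @built
    results = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node.fail while !node.equal?(@root) && !node.children.key?(ch)
      node = node.children[ch] || @root
      node.outputs.each do |pattern|
        results << { value: pattern,
                     starts_at: i + 1 - pattern.length,
                     ends_at: i + 1 } # exclusive end
      end
    end
    results
  end

  private

  # Stage 1b: breadth-first pass that sets failure links and outputs.
  def build_failure_links
    queue = []
    @root.children.each_value do |child|
      child.fail    = @root
      child.outputs = child.patterns.dup
      queue << child
    end
    until queue.empty?
      current = queue.shift
      current.children.each do |ch, child|
        fallback = current.fail
        fallback = fallback.fail while !fallback.equal?(@root) && !fallback.children.key?(ch)
        child.fail    = fallback.children[ch] || @root
        child.outputs = child.patterns + child.fail.outputs
        queue << child
      end
    end
    @built = true
  end
end

tree = TinyKeywordTree.new
tree.add_string("he").add_string("she").add_string("hers")
tree.find_all("ushers").each { |r| puts r.inspect }
```

All the tree and failure-link work happens before the first search, so the search itself is a single left-to-right pass over the text. That is what makes it suitable for streams of incoming data.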
Well, you can do some crazy things with it: you can look up DNA patterns, analyze network sequences (read: strange and maybe proprietary network protocols), or do domestic stuff like building contextual links in your blog posts to enrich your users’ experience.
gem install aurelian-ruby-ahocorasick --source=http://gems.github.com
$ git clone git://github.com/aurelian/ruby-ahocorasick.git
$ cd ruby-ahocorasick
To build and install the gem on your machine (run with sudo if needed):
$ rake install
$ rake -T
will list other cool tasks.
Get version 0.4.5 (released on 19 November 2008) from RubyForge:
$ gem install ruby-ahocorasick
It’s known to work / compile / install on Ubuntu 8.04 and Mac OS 10.4.*. It should work out of the box if you have gcc.
Unfortunately, I don’t have a Windows PC around, nor the required knowledge of Microsoft compilers.
require 'ahocorasick'

keyword_tree = AhoCorasick::KeywordTree.new    # creates a new tree
keyword_tree.add_string( "foo-- Z@!bar" )      # adds a keyword to the tree
keyword_tree.add_string( "cervantes" )         # even more

results = keyword_tree.find_all( "1011000129 foo-- Z@!bar761 ! 001211 6xU" ).each do | result |
  result[:value]      # => "foo-- Z@!bar"
  result[:starts_at]  # => 11
  result[:ends_at]    # => 23
  result[:id]         # => 1
end
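As a sketch of the "contextual links" idea, the snippet below turns a list of results into HTML links. It does not call the gem: the results array is hand-written to mirror the hash shape shown above, with :ends_at treated as exclusive (as the 11 and 23 in the example suggest). The helper link_matches and the urls lookup table are hypothetical, and offsets are assumed to be character offsets, which only holds safely for ASCII text.

```ruby
require 'cgi'

# Hypothetical helper, not part of the gem.
# results: array of hashes with :value, :starts_at, :ends_at (exclusive).
# urls:    made-up lookup table mapping a keyword to its target URL.
def link_matches(text, results, urls)
  out    = +""
  cursor = 0
  results.sort_by { |r| r[:starts_at] }.each do |r|
    next if r[:starts_at] < cursor   # skip matches overlapping an earlier link
    url = urls[r[:value]]
    next unless url                  # no link target known for this keyword
    out << CGI.escapeHTML(text[cursor...r[:starts_at]])
    out << %(<a href="#{CGI.escapeHTML(url)}">#{CGI.escapeHTML(r[:value])}</a>)
    cursor = r[:ends_at]
  end
  out << CGI.escapeHTML(text[cursor..-1])
end

# Hand-written stand-in for what find_all might return.
sample_results = [{ value: "cervantes", starts_at: 5, ends_at: 14, id: 2 }]
sample_urls    = { "cervantes" => "https://example.com/cervantes" }

puts link_matches("read cervantes today", sample_results, sample_urls)
```

Sorting by :starts_at and skipping overlaps keeps the output well-formed when several keywords match the same stretch of text.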
You can get some API reference on the wiki.
For now, just use the email address.
Other suffix-tree implementations:
© 2008 – Aurelian Oancea, < oancea at gmail dot com >
released under MIT-LICENCE