Fixed twokenizer errors
Fixed error where ........ would tokenize to ... ... ...
and the error where ~......... would hang

Hopefully fixes github issue #14

(from tobiowo@471d223
but with CRLF->LF)

Change analysis:
on a sample of 1,355,000 tweets, the tokenization changes for 30,000 of them (about 2.2%)
the changes all look like improvements, mostly in ellipsis handling
brendano committed Oct 23, 2012
1 parent 09c8354 commit 97920c2
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions src/cmu/arktweetnlp/Twokenize.java
@@ -98,12 +98,13 @@ public static String OR(String... parts) {
 // @aliciakeys Put it in a love song :-))
 // @hellocalyclops =))=))=)) Oh well

-static String bfLeft = "(♥|0|o|°|v|\\$|t|x|\\.|;|\\u0CA0|@|ʘ|•|・|◕|\\^|¬|\\*)";
+static String bfLeft = "(♥|0|o|°|v|\\$|t|x|;|\\u0CA0|@|ʘ|•|・|◕|\\^|¬|\\*)";
 static String bfCenter = "(?:[\\.]|[_-]+)";
 static String bfRight = "\\2";
 static String s3 = "(?:--['\"])";
 static String s4 = "(?:<|&lt;|>|&gt;)[\\._-]+(?:<|&lt;|>|&gt;)";
-static String basicface = "(?:(?i)" +bfLeft+bfCenter+bfRight+ ")|" +s3+ "|" + s4;
+static String s5 = "(?:[.][_]+[.])";
+static String basicface = "(?:(?i)" +bfLeft+bfCenter+bfRight+ ")|" +s3+ "|" +s4+ "|" + s5;

 static String eeLeft = "[\\\\\ƪԄ\\((<>;ヽ\\-=~\\*]+";
 static String eeRight= "[\\-=\\);'\\u0022<>ʃ)//ノノ丿╯σっµ~\\*]+";
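
The effect of the bfLeft and s5 changes in the diff above can be checked in isolation. The following is a minimal sketch, not code from Twokenize.java: the class name BasicFaceSketch and the helper countMatches are hypothetical, bfLeft is trimmed to an ASCII-only subset, s3 and s4 are left out, and the backreference is written as \\1 instead of \\2 because the pattern is compiled standalone here (in the real file it presumably sits behind another capturing group). With "." allowed as the left character, a run of dots is consumed three at a time as a "face"; without it, only the explicit "._." form from s5 still matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not part of Twokenize.java.
final class BasicFaceSketch {
    // Assumption: ASCII-only subset of bfLeft, before and after the commit.
    static final String OLD_LEFT = "(0|o|v|\\$|t|x|\\.|;|@|\\^|\\*)";
    static final String NEW_LEFT = "(0|o|v|\\$|t|x|;|@|\\^|\\*)";
    static final String CENTER = "(?:[\\.]|[_-]+)";
    // Assumption: group 1 here, because the pattern is compiled on its own.
    static final String RIGHT = "\\1";
    static final String S5 = "(?:[.][_]+[.])";

    static final Pattern OLD_FACE =
        Pattern.compile("(?:(?i)" + OLD_LEFT + CENTER + RIGHT + ")");
    static final Pattern NEW_FACE =
        Pattern.compile("(?:(?i)" + NEW_LEFT + CENTER + RIGHT + ")|" + S5);

    // Counts non-overlapping matches of p in text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String dots = "........";
        System.out.println("old pattern, faces in dots: " + countMatches(OLD_FACE, dots));
        System.out.println("new pattern, faces in dots: " + countMatches(NEW_FACE, dots));
        System.out.println("new pattern matches ._. : " + NEW_FACE.matcher("._.").matches());
        System.out.println("new pattern matches o_o : " + NEW_FACE.matcher("o_o").matches());
    }
}
```

Dropping "\\." from bfLeft alone would also stop "._." from being recognized, which is presumably why s5 is added as a separate alternative.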