- model as jar resource
- auto-detect input format
- greedy decoder by default
- standalone twokenize.sh script
- conll conversion script
- fake bill nye example
- other stuff
Showing 16 changed files with 189 additions and 92 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,28 @@ | ||
CMU ARK Twitter Part-of-Speech Tagger v0.3-pre | ||
CMU ARK Twitter Part-of-Speech Tagger v0.3 | ||
http://www.ark.cs.cmu.edu/TweetNLP/ | ||
|
||
Basic usage | ||
=========== | ||
|
||
Requires Java 6. To run the tagger from unix shell: | ||
|
||
./runTagger.sh example_tweets.txt modelfile > tagged_tweets.txt | ||
./runTagger.sh examples/example_tweets.txt | ||
|
||
Another example: | ||
The tagger outputs tokens, predicted part-of-speech tags, and confidences. | ||
For more information: | ||
|
||
./runTagger.sh --input-format json barackobama.jsonlines.txt -output tagged_barackobama.txt | ||
./runTagger.sh --help | ||
|
||
The outputs should match tagged_tweets_expected.txt and barackobamaexpected.txt respectively. | ||
We also include a script that invokes just the tokenizer: | ||
|
||
./twokenize.sh examples/example_tweets.txt | ||
|
||
Information | ||
=========== | ||
|
||
Advanced usage | ||
-------------- | ||
|
||
We include a pre-compiled .jar of the tagger so you hopefully don't need to | ||
compile it. But if you need to recompile, do: | ||
mvn install | ||
NOTE: requires Maven 3.0.3+ | ||
|
||
To train and evalute the tagger, see: | ||
ark-tweet-nlp/src/main/java/edu/cmu/cs/lti/ark/ssl/pos/SemiSupervisedPOSTagger.java | ||
scripts/train.sh and scripts/test.sh | ||
|
||
Contents | ||
-------- | ||
* runTagger.sh is the script you probably want | ||
* lib/ dependencies | ||
* ark-tweet-nlp/src the project code itself (all java) | ||
Version 0.3 of the tagger is 40 times faster and more accurate. Please see the tech report on the website for details. | ||
|
||
Information | ||
----------- | ||
This tagger is described in the following paper. Please cite it if you write a | ||
research paper using this software. | ||
This tagger is described in the following two papers, available at the website. Please cite this if you write a research paper using this software. | ||
|
||
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments | ||
Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, | ||
|
@@ -46,23 +31,7 @@ research paper using this software. | |
Linguistics, companion volume, Portland, OR, June 2011. | ||
http://www.ark.cs.cmu.edu/TweetNLP/gimpel+etal.acl11.pdf | ||
|
||
The software is licensed under Apache 2.0 (see LICENSE file). | ||
|
||
Version 0.2 of the tagger differs from version 0.1 in the following ways: | ||
|
||
* The tokenizer has been improved and integrated with the tagger in a single Java program. | ||
|
||
* The new tokenizer was run on the 1,827 tweets used for the annotation effort and the | ||
annotations were adapted for tweets with differing tokenizations. The revised annotations | ||
are contained in a companion v0.2 release of the data (twpos-data-v0.2). | ||
|
||
* The tagging model is trained on ALL of the available annotated data in twpos-data-v0.2. | ||
The model in v0.1 was only trained on the training set. | ||
|
||
* The tokenizer/tagger is integrated with Twitter's text commons annotations API. | ||
|
||
Contact | ||
------- | ||
Please contact Brendan O'Connor ([email protected]) and Kevin Gimpel ([email protected]) | ||
if you encounter any problems. | ||
======= | ||
|
||
Please contact Brendan O'Connor ([email protected]) and Kevin Gimpel ([email protected]) if you encounter any problems. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
247120554400821248 2012-09-15T23:51:46 Bill_Nye_tho all out of wood facts | ||
247120392324542464 2012-09-15T23:51:07 Bill_Nye_tho u can build a house w/ it | ||
247119965784784896 2012-09-15T23:49:25 Bill_Nye_tho more wood facts still to come | ||
247119210113802240 2012-09-15T23:46:25 Bill_Nye_tho its biodegradable #woodfacts | ||
247118527113355264 2012-09-15T23:43:42 Bill_Nye_tho u could burn it #woodfacts | ||
247117483482431488 2012-09-15T23:39:33 Bill_Nye_tho SHOUT OUT TO MY NIGGAS EATIN HUMUS | ||
247114115762499584 2012-09-15T23:26:11 Bill_Nye_tho if u want me to give a lecture at ur school contact ur student board or w/e. or contact me and i'll just come i don't give a fuck lol | ||
247113014011109378 2012-09-15T23:21:48 Bill_Nye_tho u ever been havin the illest dream ever n u wake up right as its gettin real good and ur like damn i wasnt done smashin Jane Goodall's shit | ||
247089985625395202 2012-09-15T21:50:17 Bill_Nye_tho sometimes i'll freeze water then melt it then freeze it again an i just keep doing that until it stops being awesome but it never does | ||
246819478107746304 2012-09-15T03:55:23 Bill_Nye_tho YO I CANT THINK OF ANYTHING THAT GETS ME MORE HEATED THAN ARTIFICIAL PLANTS | ||
246815764902981632 2012-09-15T03:40:38 Bill_Nye_tho look at u lookn all cute over there girl come here a min lemme holla atcha whats ur bigges fantasy mine is to visit Triton,Neptunes 7th moon | ||
246811004770590721 2012-09-15T03:21:43 Bill_Nye_tho its cool that chameleons can blend in with their environment but at a certain points it's like just do u homie!!! | ||
246806113645907969 2012-09-15T03:02:17 Bill_Nye_tho @Wendys what up Wendy's on average one fully grown bovine can produce about 2400 hamburger patties. and u can fact check that shit homeboyyy | ||
246590149234925569 2012-09-14T12:44:07 Bill_Nye_tho SHOUTS OUT TO PEOPLE WILLINGLY LIVIN IN TOWNS RIGHT NEXT TO ACTIVE VOLCANOES LIKE "NAH WE'RE GOOD" | ||
246589808217051138 2012-09-14T12:42:46 Bill_Nye_tho a lotta people refer to this as a novelty account i dont see whats so novel about science but whatever p.s. lava can flow up to 10km perhour | ||
246348019048542208 2012-09-13T20:41:59 Bill_Nye_tho Jane Goodall is a bad bitch |
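The added example file appears to hold one tweet per line with four fields: tweet ID, timestamp, screen name, and text. A minimal Python sketch for reading such a line follows. It assumes the fields are tab-separated (the rendering above does not show the delimiter), and the names `ExampleTweet` and `parse_example_line` are hypothetical, not part of the repository.

```python
from datetime import datetime
from typing import NamedTuple


class ExampleTweet(NamedTuple):
    tweet_id: str
    created_at: datetime
    user: str
    text: str


def parse_example_line(line: str) -> ExampleTweet:
    """Parse one line, assuming ID, timestamp, user, and text separated by tabs."""
    # maxsplit=3 keeps any tab characters inside the tweet text intact.
    tweet_id, stamp, user, text = line.rstrip("\n").split("\t", 3)
    # The timestamp layout matches the one visible in the example file.
    created = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    return ExampleTweet(tweet_id, created, user, text)


if __name__ == "__main__":
    sample = "247120554400821248\t2012-09-15T23:51:46\tBill_Nye_tho\tall out of wood facts"
    print(parse_example_line(sample))
```

Keeping the ID as a string avoids any precision concerns if the value is later passed through tools that treat large integers as floats.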
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,5 @@ | ||
#!/bin/bash | ||
set -eu | ||
|
||
# For development | ||
# java -Xmx1g -jar $(dirname $0)/ark-tweet-nlp/target/bin/ark-tweet-nlp-0.3-SNAPSHOT.jar "$@" | ||
|
||
# For release | ||
java -Xmx1g -jar $(dirname $0)/ark-tweet-nlp-0.3.jar "$@" | ||
# Run the tagger (and tokenizer). | ||
java -Xmx500m -jar $(dirname $0)/ark-tweet-nlp-0.3.jar "$@" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
#!/bin/bash | ||
|
||
VERSION=0.3 | ||
DIR=ark-tweet-nlp-$VERSION | ||
|
||
set -eux | ||
|
||
rm -rf $DIR | ||
mkdir $DIR | ||
|
||
# mvn clean | ||
# mvn package | ||
cp ark-tweet-nlp/target/bin/ark-tweet-nlp-${VERSION}.jar $DIR | ||
|
||
cp -r examples $DIR | ||
cp -r scripts $DIR | ||
rm $DIR/scripts/prepare_release.sh | ||
rm $DIR/scripts/java.sh | ||
cp *.sh $DIR | ||
cp *.txt $DIR | ||
|
||
# these dont work, need to fix | ||
rm $DIR/examples/barackobama* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
#!/usr/bin/env python | ||
# Take the pretsv format and make it CoNLL-like ("supertsv", having tweet metadata headers) | ||
import sys,json | ||
from datetime import datetime | ||
|
||
for line in sys.stdin: | ||
    parts = line.split('\t') | ||
    tokens = parts[0].split() | ||
    tags = parts[1].split() | ||
    try: | ||
        d = json.loads(parts[-1]) | ||
        print "TWEET\t{}\t{}".format(d['id'], datetime.strptime(d['created_at'], '%a %b %d %H:%M:%S +0000 %Y').strftime("%Y-%m-%dT%H:%M:%S")) | ||
        print "TOKENS" | ||
    except: | ||
        pass | ||
|
||
    for tok,tag in zip(tokens,tags): | ||
        print "{}\t{}".format(tag,tok) | ||
    print "" | ||
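The script above is written for Python 2 (`print` statements). As a hedged illustration of the same conversion, here is a Python 3 sketch that turns one "pretsv" line into the CoNLL-like block. The function name `pretsv_to_conll` and the sample line are assumptions made for this example; the field layout (tokens, tags, and an optional trailing JSON object) is taken from the script itself.

```python
import json
import sys
from datetime import datetime

# Timestamp layout used in the script above for the tweet's created_at field.
TWITTER_TIME = "%a %b %d %H:%M:%S +0000 %Y"


def pretsv_to_conll(line: str) -> str:
    """Convert one pretsv line to a CoNLL-like block (one tag/token pair per row)."""
    parts = line.rstrip("\n").split("\t")
    tokens = parts[0].split()
    tags = parts[1].split()
    out = []
    try:
        # The last field may hold the tweet's JSON metadata; if it does not,
        # the header is skipped, as in the original bare except.
        d = json.loads(parts[-1])
        created = datetime.strptime(d["created_at"], TWITTER_TIME)
        out.append("TWEET\t{}\t{}".format(d["id"], created.strftime("%Y-%m-%dT%H:%M:%S")))
        out.append("TOKENS")
    except (ValueError, KeyError, TypeError):
        pass
    out.extend("{}\t{}".format(tag, tok) for tok, tag in zip(tokens, tags))
    out.append("")  # blank line terminates the block
    return "\n".join(out)


if __name__ == "__main__":
    for raw in sys.stdin:
        print(pretsv_to_conll(raw))
```

Catching only `ValueError`, `KeyError`, and `TypeError` keeps the fallback behaviour for lines without metadata while letting unrelated errors surface, which the bare `except` in the original would hide.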
|