jasonbaldridge edited this page Apr 7, 2013 · 3 revisions

Course project, Phase 4

DUE: April 10, 2013, 1pm CST

Preamble

This project phase involves building on the tshrdlu bot developed in phase three. This phase is even more underspecified than phase three---basically, this is the launch point for you to start doing your own thing for the project. Even more than before, please discuss code and ideas on the class mailing list. Talk, help each other, etc. Collaborate as a class to do something better than you could on your own.

As before, you are welcome to work in groups of two or three if you like. If you are working in the same group as you did for phase three, you don't need to let me know; otherwise, please email me with who the group members are.

Submitting your solutions

Write up what you did as a file <lastname>_<firstname>_p4.pdf. Submit this on Blackboard.

Your code submission will be contained in your fork of the tshrdlu repository. You should tag the submission version of your repository as "PP4".


1. Add a behavior to tshrdlu and submit a pull request

For project phase three, everyone implemented different, interesting response behaviors. Take at least one of those behaviors and integrate it into the new Akka-based setup for tshrdlu, and then submit a pull request for it.

The code you add will most likely extend BaseReplier, but if you find it isn't convenient to do it that way, feel free to create a replier that stands on its own. Most importantly, it needs to handle a ReplyToStatus message appropriately, and you must add it as one of the repliers to the ReplyManager.

Some things to keep in mind:

  • I may make some changes to tshrdlu in the meantime. Make sure you are up-to-date before you begin coding so that you can be using the latest.
  • Don't wait until the due date to submit your pull request.
  • Your submission for this part of phase four is your pull request. Make sure to describe what you did adequately. Also, please note the others in your group who are part of the submission.
  • Note that the person from whose account the pull request originates will get the "credit" in the main tshrdlu repository for having done that effort. If you have multiple behaviors that you are submitting, consider using different accounts to spread the credit.
  • If you break tshrdlu, tshrdlu will come back to haunt you.

2. Do your own thing

Time to get your own ideas going. Basically, you should build on either the use of Twitter data for interesting NLP tasks or the enterprise of creating bots that use natural language processing techniques to inform their interactions. Below are some suggestions (I'm recycling some of the suggestions from project phase three that no one implemented). I'm of course happy to discuss any of these ideas or your own further, either in person or over email.

Conversational behaviors

Improve the ability to have a conversation.

  • Keep state about interactions, e.g. so that if someone tweets to your bot, it might actually proactively tweet to them (for that session) based on the initial reactions.
  • Use information about the person tweeting to your bot to customize the response, e.g. by using their name (or at least filtering out names of other people) or by gathering their previous tweets to get more relevant responses.
  • Use an n-gram language model based on all the tweets you got from searching to generate new text for a response tweet.
  • Check for times when the bot's handle is mentioned (and not necessarily addressed) and try to come up with replies, consider retweeting if it is positive, etc.
  • Use POS-tags
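
As a starting point for the language model suggestion above, here is a minimal sketch of a bigram generator. It is not part of tshrdlu: the object name BigramGenerator, the whitespace tokenizer, and the maxTokens cutoff are all assumptions made for illustration, and a real bot would want a Twitter-aware tokenizer, smoothing, and a check against the tweet length limit.

```scala
import scala.util.Random

// Hypothetical helper (not part of tshrdlu): a tiny bigram model over tweets.
object BigramGenerator {
  val Start = "<s>"
  val End = "</s>"

  // Lowercase and split on whitespace. This is an assumption for brevity;
  // tweets deserve a tokenizer that understands handles, hashtags, and URLs.
  def tokenize(text: String): Vector[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector

  // Map each token to every token observed right after it. Repeats are kept,
  // so sampling uniformly from the list follows the observed bigram counts.
  def train(tweets: Seq[String]): Map[String, Vector[String]] = {
    val bigrams = tweets.flatMap { tweet =>
      val tokens = (Start +: tokenize(tweet)) :+ End
      tokens.sliding(2).map(pair => (pair(0), pair(1)))
    }
    bigrams.groupBy(_._1).map { case (prev, pairs) => prev -> pairs.map(_._2).toVector }
  }

  // Walk the model from the start symbol until the end symbol is sampled
  // or maxTokens tokens have been produced.
  def generate(model: Map[String, Vector[String]], rng: Random, maxTokens: Int = 20): String = {
    @annotation.tailrec
    def loop(prev: String, acc: Vector[String]): Vector[String] =
      model.get(prev) match {
        case Some(nexts) if acc.length < maxTokens =>
          val next = nexts(rng.nextInt(nexts.length))
          if (next == End) acc else loop(next, acc :+ next)
        case _ => acc
      }
    loop(Start, Vector.empty).mkString(" ")
  }
}
```

You would train this on the tweets returned by a search and call generate with a seeded Random while debugging, so that runs are repeatable.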

Other behaviors

Until now, the tshrdlu behavior has been dominated by simple responses to messages that address the bot directly. It would be interesting to have the bot do other things, e.g.:

  • Create new tweets via some strategy, e.g.
    • using a language model
    • identifying an interesting link on the web and pulling some text from the page and posting that text and the link
  • Do useful and/or interesting retweeting. For example, given some starting information (possibly on the command line, or via a tweet), create a classifier that will evaluate tweets that match a particular hash tag or keyword and retweet them when they look good (according to the classifier as informed by the initial training data given to it). As an example, one could create a #scala programming language retweeter that seeds the positive class with tweets by people who tweet regularly about Scala (e.g. @odersky and many others), while the negative class comes from randomly selected tweets. A key thing would be enabling followers of your bot to retrain it by replying with a message like "Wrong" when it tweets something irrelevant. Ultimately, the bot should be posting messages about Scala the programming language and not about the theater, the nightclub, or other things "scala" might refer to.
  • Another strategy for retweeting would be to take a given user, collect the most recent tweets of a number of their followers, and produce a linguistic profile (could be as simple as a bag of words) that can be used to check whether the bot should or should not retweet various messages produced by the accounts that the bot follows. For example, @tshrdlu follows a bunch of musicians as well as NLP types; by considering the followers of some NLP and/or machine learning accounts, can we have @tshrdlu retweet tweets by such individuals while ignoring those by @justinbieber, @kanyewest, etc?
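
To make the bag-of-words profile idea concrete, here is one possible sketch: build a count vector from the collected tweets and compare a candidate tweet to it with cosine similarity. Everything here is an assumption for illustration (the object name BagOfWordsProfile, the regex tokenizer, raw counts rather than tf-idf, and a fixed threshold); none of it comes from tshrdlu.

```scala
// Hypothetical helper (not part of tshrdlu): bag-of-words profiles compared
// with cosine similarity to decide whether a tweet fits a profile.
object BagOfWordsProfile {
  type Bag = Map[String, Double]

  // Lowercase, then split on anything that is not a letter, digit, #, @ or '.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9#@']+").filter(_.nonEmpty).toSeq

  // Raw token counts summed over all the given tweets.
  def profile(tweets: Seq[String]): Bag =
    tweets.flatMap(tokenize).groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }

  private def norm(bag: Bag): Double =
    math.sqrt(bag.values.map(v => v * v).sum)

  // Cosine of the angle between two count vectors; 0.0 if either is empty.
  def cosine(a: Bag, b: Bag): Double = {
    val dot = a.iterator.map { case (w, x) => x * b.getOrElse(w, 0.0) }.sum
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot / denom
  }

  // The threshold is a free parameter you would tune on held-out tweets.
  def shouldRetweet(userProfile: Bag, tweet: String, threshold: Double): Boolean =
    cosine(userProfile, profile(Seq(tweet))) >= threshold
}
```

Raw counts are dominated by frequent words, so removing stopwords or switching to tf-idf weights is a natural next step.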

Data analysis and visualization

Rather than doing the bot thing, use Twitter as an interesting data set to play with.

  • Expand the word cloud code from the tutorial so that you cluster the followers of a given user. (Note: because of rate limits, you'll want to do this in two steps---the first to obtain the descriptions and save them to disk, and then another to actually work with them.)
  • Gauge the relative influence of a group of Twitter accounts by extracting a mention graph from the stream (or possibly using searches). Use PageRank as we discussed in class, and consider alternatives such as TunkRank too.
  • Visualize some Twitter data in an interesting way, e.g. geographic data, retweet events, similarities of users based on the text they produce, and more. Consider using some of the visualization tools mentioned in this post on 20 tools, especially d3.js (also, check out this video about d3.js by Mike Dewar of bit.ly).
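
For the mention graph suggestion, a minimal sketch of the two steps (extracting edges, then running PageRank by power iteration) might look like the following. It assumes tweets arrive as (author, text) pairs and that handles match `@\w+`; the object name MentionPageRank, the damping factor of 0.85, and the fixed iteration count are illustrative choices, not anything prescribed by tshrdlu or the class.

```scala
// Hypothetical helper (not part of tshrdlu): PageRank over a mention graph.
object MentionPageRank {
  private val Mention = "@(\\w+)".r

  // One directed edge author -> mentioned account per mention; self-mentions dropped.
  def mentionEdges(tweets: Seq[(String, String)]): Seq[(String, String)] =
    tweets.flatMap { case (author, text) =>
      val from = author.toLowerCase
      Mention.findAllMatchIn(text)
        .map(_.group(1).toLowerCase)
        .filter(_ != from)
        .map(to => (from, to))
        .toSeq
    }

  // Power iteration. Duplicate edges are collapsed, and the rank held by
  // accounts that mention no one is spread evenly over all accounts.
  def pageRank(edges: Seq[(String, String)],
               damping: Double = 0.85,
               iterations: Int = 50): Map[String, Double] = {
    val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
    val n = nodes.size
    val outLinks: Map[String, Seq[String]] =
      edges.distinct.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    val initial = nodes.map(node => node -> 1.0 / n).toMap

    (1 to iterations).foldLeft(initial) { (ranks, _) =>
      val dangling = nodes.filterNot(outLinks.contains).map(ranks).sum
      val received: Map[String, Double] =
        outLinks.toSeq
          .flatMap { case (src, targets) => targets.map(t => t -> ranks(src) / targets.size) }
          .groupBy(_._1)
          .map { case (t, shares) => t -> shares.map(_._2).sum }
      nodes.map { node =>
        node -> ((1 - damping) / n + damping * (received.getOrElse(node, 0.0) + dangling / n))
      }.toMap
    }
  }
}
```

Sorting the resulting map by value gives an influence ranking. Weighting edges by how often one account mentions another is an obvious variation to try.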

There are of course any number of other things you could do!

Something not-Twitter

If there is something you'd like to work on---including something that has nothing to do with social networks---that's absolutely an option. Discuss it with me!

Rubric

Your submission will be scored using the following rubric. Qualities of full-point submissions are given below each area.

  • Tshrdlu pull request: 10
    • The code integrates with the Akka version of tshrdlu.
    • Please indicate who the individuals involved in creating the pull request are. (Otherwise, only the person whose account is associated with the pull request will get credit.)
  • Coding: 20
    • The code involves non-trivial implementation.
    • The code demonstrates thought about program dependencies and flow.
    • The code is organized and documented.
  • Writing: 30
    • The write-up clearly explains what was done.
    • The write-up has examples, including relevant output.
    • The write-up provides analysis of output, as appropriate.
    • The write-up has references to any papers, blog posts, or other resources that were used to complete the work.
    • The write-up is professionally done (organized, free of spelling and grammar errors).
  • Creativity: 20
    • The work shows original thought in selecting the task, choosing algorithms for solving it, solving coding challenges, and analyzing the output.
    • The work combines different ideas from the class in new ways.
  • Overall quality: 20
    • The work as a whole is high quality.

Make sure not to skimp on the write-up! You are welcome to use the style files for conferences like ACL and SIGIR if you are so inspired.