motivation: i have lots of notes and wanted input to be easy; the consequence is that they're not categorized. can i categorize them automatically?
let's take a shot. approach:
- drop everything in a database (tumblr api call, sqlite3)
- clean up words
- tf-idf [explain how this works]
  a. build index
  b. build reverse index
  c. write functions
- create index mapping words to documents
- serialize (like pickle) that object
- quick-and-dirty results builder
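- discuss results. "female", "math" are good examples. "economics" not as good. remember how tf-idf works: if term frequency is normalized by document length, a single match in a short doc scores higher than the same match in a long one (short docs favored).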
- extensions: live search; wikipedia vector semantic relatedness clusters; live suggestions on new text (say, as i'm writing something).
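
a minimal sketch of the approach above, in python (an assumption, going by the mentions of pickle and sqlite3). the table name, sample notes, and function names are all made up for illustration, and the tumblr api call is skipped: notes are inserted by hand into an in-memory database. tf here is word count divided by doc length and idf is log(N / df), which is one common variant and not necessarily the one the final write-up will use.

```python
import math
import pickle
import re
import sqlite3
from collections import Counter, defaultdict

# hypothetical sample notes standing in for the tumblr api results
SAMPLE = [
    (1, "Math is fun. Math is everywhere."),
    (2, "Economics uses math, but this much longer note mostly rambles "
        "about markets and prices."),
    (3, "Notes on markets."),
]

# drop everything in a database (in-memory here; table name is an assumption)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO notes VALUES (?, ?)", SAMPLE)
notes = dict(db.execute("SELECT id, body FROM notes"))

def clean(text):
    """lowercase and keep alphabetic runs only (a crude 'clean up words')."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(notes):
    """index: doc id -> Counter of word counts in that doc."""
    return {doc_id: Counter(clean(body)) for doc_id, body in notes.items()}

def build_reverse_index(index):
    """reverse index: word -> set of doc ids that contain it."""
    reverse = defaultdict(set)
    for doc_id, counts in index.items():
        for word in counts:
            reverse[word].add(doc_id)
    return dict(reverse)

def tf_idf(word, doc_id, index, reverse):
    """tf = count / doc length, idf = log(number of docs / docs with word)."""
    counts = index[doc_id]
    total = sum(counts.values())
    if total == 0 or word not in reverse:
        return 0.0
    tf = counts[word] / total
    idf = math.log(len(index) / len(reverse[word]))
    return tf * idf

def search(word, index, reverse):
    """quick-and-dirty results: (doc id, score) pairs, best score first."""
    docs = reverse.get(word, ())
    scored = [(d, tf_idf(word, d, index, reverse)) for d in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

index = build_index(notes)
reverse = build_reverse_index(index)

# serialize both indexes so they don't need rebuilding on every run
blob = pickle.dumps((index, reverse))
index, reverse = pickle.loads(blob)

for query in ("math", "markets"):
    print(query, search(query, index, reverse))
```

with these sample notes, "markets" ranks the three-word note above the long one even though each mentions the word once, which is the short-docs-favored effect in miniature.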