Stuff to do now:
☐ do CV test with syllables
☐ add language names to stopwords list
☐ add code to submission
add readme to submission (python notebook)
☐ run through naive bayes
☐ run through k-nearest neighbors
☐ add some neural network stuff to this list
features:
✔ bag of words @done (16-11-21 13:25)
✔ characters present @done (16-11-23 12:30)
☐ syllables
☐ bigrams (over characters and/or syllables)
✔ attested year (or none) @done (16-11-28 16:49)
☐ normalize over centuries, or continuous value
Machine learning categorization:
✔ vectorize definitions (features above) @done (16-11-23 12:30)
✔ SVM @today @done (16-11-28 14:49)
✔ test different kernels @done (16-11-28 19:44)
✔ Make a confusion matrix visualization @done (16-11-29 08:46)
☐ run through naive bayes
☐ run through k-nearest neighbors
☐ add some neural network stuff to this list
After project is done:
☐ talk to Mark Davies
☐ make webapp
other:
☐ find some sources to cite
preprocessing:
initial categorization:
___________________
Archive:
✔ purge html tags @done (16-11-7 16:04) @project(preprocessing)
✘ include a method for going backwards for a mistake @cancelled (16-11-15 16:14) @project(initial categorization)
✔ write in doc @done (16-11-15 16:14) @project(initial categorization)
✔ do it @done (16-11-15 16:14) @project(initial categorization)
some words don't have language names in their definitions, but do have references to other forms of themselves. This causes a loop.
some words, like "buzz", don't have an origin; these need to be encoded as "other" manually.
✔ check the tags on those samples against the definitions, record the number of errors @today @done (16-11-15 16:14) @project(initial categorization)
✔ take user input (correction or none) and store stats (store result in new file) @done (16-11-15 14:08) @project(initial categorization)
✔ take a random sample of about 10% of the data @today @done (16-11-15 14:08) @project(initial categorization)
about 4500 items O.o
✔ print word category and definition @done (16-11-15 14:08) @project(initial categorization)
✔ write in doc @today @done (16-11-14 17:29) @project(preprocessing)
✔ make the hierarchy @today @done (16-11-14 17:29) @project(initial categorization)
✔ make regexes to find names @today @done (16-11-14 17:29) @project(initial categorization)
✔ purge html objects @done (16-11-11 16:04) @project(preprocessing)
✔ fix chains/links @done (16-11-11 16:04) @project(preprocessing)
✔ arrange in tsv @done (16-11-1 16:04) @project(preprocessing)
✔ download database @done (16-11-1 16:03) @project(preprocessing)