2002-02-13 Andrew McCallum <[email protected]>
* opts.c (parse_bow_opt): Make it still work if $HOME isn't
defined (as in a CGI).
(Patch from John McCall <[email protected]>.)
* svm_base.c (sqrtf): #define as sqrt if we don't HAVE_SQRTF.
(Suggestion from Alberto Lavelli <[email protected]>.)
* lex-html.c: Added the ability to deal with html entities of the
form &ENTITY; or &#DIGITS;. Patch from Arturo Perez
* primelist.c: Fixed off-by-one error in list of primes. This was
occasionally causing an infinite loop. (Reported by Mikhail
Sogrine <[email protected]> and Arturo Perez
* int4str.c: Removed old, irrelevant comments.
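The parse_bow_opt entry above concerns running without $HOME, as happens under a CGI. A minimal sketch of that kind of guard follows; the helper name example_home_path and the choice of "." as the fallback directory are assumptions made for illustration, not taken from opts.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from opts.c): write "<home>/<name>" into BUF.
   When HOME is unset or empty, fall back to the current directory; the
   fallback choice is an assumption.  Returns what snprintf returns. */
static int
example_home_path (char *buf, size_t buflen, const char *name)
{
  const char *home = getenv ("HOME");

  assert (buf != NULL && name != NULL);
  if (home == NULL || home[0] == '\0')
    home = ".";
  return snprintf (buf, buflen, "%s/%s", home, name);
}
```

The point is only that the result of getenv() is checked before it is used; passing a NULL pointer on to string functions is what fails in an environment without HOME.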
2001-09-26 Andrew McCallum <[email protected]>
* wi2pv.c (bow_wi2pv_flush): Assert that there are no bytes in the
PVM.
* archer.c (archer_index_filename): Properly print the filename as
part of the mmap error. Include the vocabulary size in the
progress message.
2001-03-17 Andrew McCallum <[email protected]>
* wi2pv.c (bow_wi2pv_new): Write something so that no PV can have
a start offset of 0, so that 0 can be reserved in pv.c as a
special value.
* bow/libbow.h (_FILE_OFFSET_BITS): Define to be 64 here. This
fixed problem with bow_fwrite_off_t().
* strtrie.c (bow_strtrie_present): If STR is not all lowercase,
don't exit, just say it isn't in the trie.
* pv.c: Use 0 instead of -1 to represent an unset seek start,
because I'm not sure whether offset_t is signed.
* int4str.c (_str2id): Make sure we do unsigned arithmetic.
(_bow_str_hash_lookup): Comment out bad fix to looping hash lookup,
and document the persistent problem in a comment below.
* archer.c (archer_index_filename): Use mmap() to get the text
files.
2001-02-04 Andrew McCallum <[email protected]>
* pv.c: Use off_t instead of int in appropriate places.
(_FILE_OFFSET_BITS): New macro, defined to be 64.
* archer.c, wi2pv.c (_FILE_OFFSET_BITS): New macro, defined to be
64.
* bow/archer.h: Use off_t instead of int in appropriate places.
* bow/libbow.h (bow_fwrite_off_t): New function.
(bow_fread_off_t): New function.
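The entries above switch offsets to off_t and define _FILE_OFFSET_BITS as 64. A sketch of what an off_t writer and reader could look like follows; the names example_fwrite_off_t and example_fread_off_t and the 8-byte big-endian layout are assumptions, and the real bow_fwrite_off_t() may store the value differently.

```c
#define _FILE_OFFSET_BITS 64    /* must come before any system header */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical writer: store OFFSET as 8 bytes, most significant first.
   Returns 8 on success, -1 on a short write. */
static int
example_fwrite_off_t (off_t offset, FILE *fp)
{
  unsigned long long v = (unsigned long long) offset;
  unsigned char bytes[8];
  int i;

  assert (fp != NULL);
  for (i = 7; i >= 0; i--)
    {
      bytes[i] = (unsigned char) (v & 0xff);
      v >>= 8;
    }
  return fwrite (bytes, 1, sizeof bytes, fp) == sizeof bytes ? 8 : -1;
}

/* Hypothetical reader: inverse of the writer above. */
static int
example_fread_off_t (off_t *offset, FILE *fp)
{
  unsigned char bytes[8];
  unsigned long long v = 0;
  int i;

  assert (offset != NULL && fp != NULL);
  if (fread (bytes, 1, sizeof bytes, fp) != sizeof bytes)
    return -1;
  for (i = 0; i < 8; i++)
    v = (v << 8) | bytes[i];
  *offset = (off_t) v;
  return 8;
}
```

Writing a fixed number of bytes keeps the file format independent of sizeof (off_t), which is the kind of mismatch the later libbow.h entry reports fixing by defining the macro in one shared header.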
2001-01-21 Andrew McCallum <[email protected]>
* stoplist.c (stophash_init): New function.
(bow_stoplist_present_hash): New function.
* pv.c (bow_pv_write): Faster, architecture-dependent
implementation based on single call to fwrite().
(bow_pv_read): Likewise for reading.
* int4word.c (bow_words_set_map): Allow MAP to be NULL, in which
case this function simply initializes the WORD_MAP.
(bow_words_write): If !ARCHIVE_COUNTS, don't write out the
WORD_MAP_COUNTS. Currently it is archiving.
(bow_words_read_from_fp): Likewise for reading.
* int4str.c (_str2id): No need to initialize H twice.
(_bow_str_hash_lookup): Make this a public interface.
(_bow_str_hash_lookup2): New function.
(_bow_str2int): New function that takes a pre-computed hash. Use
above function.
(bow_str2int): New shell around above function.
* archer.c (archer_index_filename): Add local lexing
implementation, used only if USE_FAST_LEXER is #define'd to be 1
(which is the default). Print document count once every 200
documents, not every 100.
(archer_sort_hites): New function.
(archer_query): Use it.
* bow/libbow.h: Declare new functions.
2000-12-19 Andrew McCallum <[email protected]>
* lex-simple.c: Cast (char) to (unsigned char) before passing to
isalpha() and other functions that take an int. On Solaris, the
high bit in these chars gets changed to a sign bit in the int, and
isalpha() and islower(), etc don't work.
* lex-simple.c (bow_lexer_simple_open_text_fp): Increase default
document size. Only scan FP for start string if it's non-empty.
If there is no end pattern, read the contents more efficiently.
(bow_lexer_simple_get_raw_word): New temporary version that is more
efficient, but ignores many of the command-line arguments.
(bow_lexer_simple_postprocess_word): Likewise.
2000-12-18 Andrew McCallum <[email protected]>
* bow/libbow.h: Declare strtrie functions.
* int4str.c: Many efficiency cleanups, including better string
hash function and more efficient code paths for str_hash_lookup().
* stoplist.c: Use strtrie instead of hashtable.
* archer.c, pv.c, wi2pv.c, bow/archer.h: Trimmed back to version
as of 1999/05/20, and augmented with faster indexing.
* Makefile.in (STANDARD_LIBBOW_C_FILES): Added primelist.c and
strtrie.c.
(ARCHER_H_FILES): Remove all bow/archer_query* files.
(ARCHER_LEX_FILES, ARCHER_Y_FILES, ARCHER_GENERATED_C_FILES): Removed.
(ARCHER_DIST_FILES): Trimmed.
* Makefile.in (ARCHER_H_FILES): Remove the redundant one of these
definitions.
* opts.c (_help_filter): Make sure BOW_METHODS is non-NULL before
trying to use it.
* primes.c: Formatting change.
* Makefile.local: RAINBOW_METHOD_C_FILES: Added dirk.c.
* Makefile.in (STANDARD_RAINBOW_METHOD_C_FILES): Removed dirk.c.
2000-12-07 Kamal Nigam <[email protected]>
* bow/libbow.h (bow_wi2dvf_hide_all_wi): Added.
(bow_smoothing): Added dirichlet smoothing.
(bow_smoothing_dirichlet_filename): Added.
(bow_smoothing_dirichlet_weight): Added.
* bow/naivebayes.h (bow_naivebayes_dirichlet_alphas): Added to
enable dirichlet smoothing.
(bow_naivebayes_dirichlet_total): Likewise.
(bow_naivebayes_load_dirichlet_alphas): Likewise.
(bow_naivebayes_initialize_dirichlet_smoothing): Likewise.
* bow/em.h (bow_em_calculating_perplexity): Added.
* cotrain.c (cotrain_selection_type): Add random selection type.
(cotrain_select_docs): Changed prototype.
(cotrain_print_dependency_matrix): Variable for new option.
(cotrain_vocab_split_file): Likewise.
(cotrain_co_gem): Likewise.
(cotrain_options): Added new options
--cotrain-print-dependency-matrix,
--cotrain-split-vocab-from-file, and --cotrain-co-gem.
(cotrain_parse_opt): Likewise.
(cotrain_calculate_perplexity): New function.
(cotrain_split_vocabulary_from_file): New function.
(cotrain_do_vocab_split): New code for splitting from file.
(cotrain_generic_select): Changed prototype for changed data
structure.
(cotrain_select_by_confidence_weighting): Likewise.
(cotrain_select_by_confidence): Likewise.
(cotrain_select_by_density_weighting): Likewise.
(cotrain_select_by_density): Likewise.
(cotrain_select_by_random_weighting): New function for random
selection type.
(cotrain_select_by_random): Likewise.
(cotrain_new_vpc_with_weights): New code for printing the dependency
matrix. New code for co-GEM. New code for changed data
structure.
* dirichlet.c (main): Print progress information to stderr, not
stdout. Increase print precision on alphas.
* em.c (bow_em_calculating_perplexity): Made non-static to allow
access from cotraining.
(bow_em_new_vpc_with_weights): Initialize dirichlet smoothing if using
it. Add word counts even when class probs are 0. This will fill
out the class word matrix for perplexity calculations.
(em_calculate_perplexity): Fix class correspondence bug. hit index is
not class index when adding up log_prob_of_data. Also correctly
calculate num_data_words based on actual words occurring in the
model.
(bow_em_pr_wi_ci): Add in code for dirichlet smoothing.
(bow_em_set_weights): Likewise.
(bow_method_em): Change word vector normalization to
set_weights_to_count. This causes document-then-word test
documents to have their probabilities not be scaled to the
document length. This can be interpreted as more correct than the
other way.
* naivebayes.c (bow_naivebayes_dirichlet_alphas): New global
variable for Dirichlet smoothing.
(bow_naivebayes_dirichlet_total): Likewise.
(bow_naivebayes_load_dirichlet_alphas): New function.
(bow_naivebayes_initialize_dirichlet_smoothing): New function.
(bow_naivebayes_pr_wi_ci): Added dirichlet smoothing option.
(bow_naivebayes_print_word_probabilities_for_class): Changed output
format to include word count as well.
(bow_naivebayes_set_weights): Initialize dirichlet smoothing if it's
being used.
* opts.c (bow_smoothing_dirichlet_filename): variable for
Dirichlet smoothing.
(bow_smoothing_dirichlet_weight): Likewise.
(bow_options): Added new options --smoothing-dirichlet-filename and
--smoothing-dirichlet-weight.
(parse_bow_opt): Likewise.
* wi2dvf.c (bow_wi2dvf_hide_all_wi): New function.
* Makefile.local (RAINBOW_METHOD_C_FILES): Remove dirk.c because
it's already in Makefile.in
* primes.c (_bow_nextprime): Bugfix from Andrew for mixing alloca
and realloc in weird ways.
* em.c (em_labeled_for_start_only): New option variable.
(em_set_vocab_from_unlabeled): New option variable.
(em_options): Changes for new options.
(em_parse_opt): Likewise.
(bow_em_new_vpc_with_weights): New option --em-labeled-for-start-only
uses the labeled data just to set the starting point of EM, and
does not use it during iterations. Option --em-set-vocab-from-unlabeled
sets the vocabulary to only words occurring in the unlabeled data.
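The entries above add Dirichlet smoothing with per-word alphas, an alpha total, and a weight. The ChangeLog does not give the formula, so the following is only a sketch of one common form of Dirichlet-smoothed word probability; the function name, its parameters, and the way the weight enters are all assumptions.

```c
#include <assert.h>

/* Assumed form, not taken from naivebayes.c:
     P(w|c) = (N(w,c) + weight * alpha_w) / (N(c) + weight * alpha_total)
   where alpha_total is the sum of alpha_w over the vocabulary. */
static double
example_dirichlet_pr_w_c (double word_count, double class_total,
                          double alpha_w, double alpha_total, double weight)
{
  double denom = class_total + weight * alpha_total;

  assert (denom > 0.0);
  return (word_count + weight * alpha_w) / denom;
}
```

Under this form the smoothed probabilities over the vocabulary still sum to one, and a word never seen in a class gets probability proportional to its alpha rather than zero.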
2000-09-08 Kamal Nigam <[email protected]>
* primes.c (_bow_nextprime): Fix very peculiar memset bug.
Somehow it doesn't seem to matter...
* barrel.c (bow_barrel_new_from_printed_barrel_file): Fixed for
documents with no features. Now they won't get a " " feature.
(bow_barrel_printf_selected): Added support for 'l' (print the word as
many times as it occurred) and for 'P' (print docs in IPL format).
* rainbow.c (rainbow_parse_opt): If using vocabulary from file, do
not add to this vocab later.
(main): Allow user to set vocab from file at indexing time.
* bow/libbow.h (word_map): made global
* int4word.c (word_map): make global
(bow_word2int_add_occurrence): grow word_map multiple times, if
necessary
* maxent.c (bow_maxent_new_vpc_with_weights_doc_then_word):
properly ignore all documents that have no features. These
documents violate the constant document length assumption made by
doc_then_word.
2000-05-21 Andrew McCallum <[email protected]>
* dirk.c (bow_dirk_score): Initialize MAX_SCORE_DI to avoid gcc
warning.
* bpe.c: Don't include <huge_val.h>, it is no longer necessary in
RedHat 6.1.
* bow/crossbow.h (crossbow_classify_doc_new_wa): Declare new
function.
* bow/libbow.h: Declare new functions and variables.
* lex-simple.c (bow_lexer_max_num_words_per_document): New
variable.
(bow_lexer_simple_open_text_fp): Initialize it.
(bow_lexer_simple_open_str): Likewise.
(bow_lexer_simple_postprocess_word): Use it. Also, handle
BOW_XXX_WORDS_ONLY.
(bow_lexer_infix_length): New variable, but unused.
* crossbow.c: Added code for query serving on a socket.
(crossbow_new_root_from_dir): When recursing directories, skip over
directories named "unlabeled". Yipes, this is scary, arbitrary
behavior.
(crossbow_index_filename): If filename path includes the directory
"unlabeled", remove that directory from the file path. Again,
scary arbitrary behavior!
(crossbow_index_filename): Verbosify the file path and class.
(crossbow_index_multiclass_list): Fix call of strtok. Use strtok
instead of strsep, for the sake of Solaris.
(crossbow_classify_doc_new_wa): New function.
(crossbow_classify_doc) [DOC_LENGTH_SCORE_TRANSFORM]: Rescale the
score in a document-length specific way, as an aid to improved
estimation of confidence, for the confidence-based selection of
which unlabeled documents to label.
(crossbow_socket_init): New function.
(crossbow_serve): New function.
(crossbow_query_serving): New function.
(crossbow_options): New command-line option "query-server".
* rainbow.c: Include <strings.h> for bzero on Solaris.
(rainbow_options): New command-line arguments "forking-query-server"
and "use-saved-classifier".
(rainbow_parse_opt): Handle them.
(struct rainbow_arg_state): New member FORKING_SERVER.
(rainbow_query): Handle UNIX signal for broken pipe. Code added by
Dan Rapp <[email protected]>. Remove words from QUERY_WV that
are not in the class barrel! This fixes normalization by document
length. Comment out a bunch of code that would re-set various
parameters specified on the command-line (such as the
classification method); this makes --query-server work much
better; this will break old behavior, but I don't think it is
ever used. Always set the weights and normalize the QUERY_WV
using the class barrel; previously there was a preference for
using the document barrel.
(SigPipeHandler): New function.
(rainbow_serve): Implement a forking server.
(rainbow_test): Remove from QUERY_WV words not in the class barrel.
(rainbow_test_files): Likewise. If the test file can't be opened,
don't crash, just report so on stderr.
(main): Handle query forking server. When testing saved model and
looping once for each test document, remove from QUERY_WV words
not in the class barrel.
* opts.c (bow_xxx_words_only): New variable.
(bow_options): New command line options "xxx-words-only" and
"max-num-words-per-document".
(parse_bow_opt): Handle them.
* wv.c (bow_wv_prune_words_not_in_wi2dvf): New function.
(bow_wv_fprintf): Print all on one line, just like --print-matrix.
(bow_wv_printf): New function.
* treenode.c (bow_treenode_descendant_matching_name) [WHIZBANG]:
Rely on tree being rooted at directory named "./data" and tree
depth being 2. Without this code, we don't reliably find the
right descendant if there are several treenodes with the same
name.
* split.c (bow_set_docs_to_type): When duplicate tags are
requested for a document, just print a warning instead of exiting
with an error.
* random.c (bow_random_double): Use RAND_MAX if available.
* random.c (bow_random_double): Handle case in which RAND_MAX is
not defined, assuming its value is 2147483647, if necessary.
* maxent.c (bow_maxent_score): From query_wv remove words not in
the class_barrel; (I think this affects normalization). Also, set
and normalize the weights---note that this might now be done
twice.
* lex-gram.c (bow_lexer_gram_get_word): Don't distinguish between
bi-grams in the middle of a document and a tri-gram that only got
two words before the end of the document. Do this by removing
trailing `;'s
* hem.c (crossbow_hem_incremental_labeling): New variable.
(crossbow_hem_fixed_shrinkage): New variable, but unused.
(crossbow_hem_options): New command-line argument
"hem-incremental-labeling".
(crossbow_hem_label_most_confident): New function.
(crossbow_hem_full_em): Call it.
(crossbow_hem_em_one_iteration): Handle incremental labeling.
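The random.c entries above describe using RAND_MAX when available and assuming 2147483647 otherwise. A sketch of that fallback follows; the function name, its (low, high) signature, and the use of rand() are assumptions, since the entries do not show the body of bow_random_double.

```c
#include <assert.h>
#include <stdlib.h>

/* ISO C requires <stdlib.h> to define RAND_MAX; the fallback mirrors
   the value the entry says is assumed on systems that lack it. */
#ifndef RAND_MAX
#define RAND_MAX 2147483647
#endif

/* Hypothetical stand-in for bow_random_double: a value in [low, high),
   or exactly LOW when the interval is empty. */
static double
example_random_double (double low, double high)
{
  assert (low <= high);
  return low + (high - low) * ((double) rand () / ((double) RAND_MAX + 1.0));
}
```

Dividing by RAND_MAX + 1.0 in double arithmetic keeps the scaled value strictly below 1 and avoids integer overflow when RAND_MAX equals INT_MAX.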
2000-05-19 Andrew McCallum <[email protected]>
* wi2dvf.c (bow_wi2dvf_add_di_text_str): Assert LEX.
* rainbow.c (rainbow_options): New command-line option
"index-lines".
(rainbow_index_lines): New function.
(main): Call it.
* opts.c (bow_options): New command-line option
"use-unknown-word".
* int4word.c (bow_word2int_use_unknown_word): New global variable.
Use it in various functions.
* bow/libbow.h: Define new global variable.
* naivebayes.c (bow_naivebayes_score): Use a temporary annealing
weight that depends on the number of words in the query_wv.
* crossbow.c (crossbow_index_multiclass_list): Even though
strtok() is deprecated, switch to it from strsep() since the latter
doesn't exist in Solaris.
* configure.in: Check for nsl library, needed by Solaris.
* cdmemr.c: New learning method "emr". Change annealing
temperature from 1000 to 100. Several other changes.
(cdmemr_calculate_perplexity): New function.
* barrel.c: Removed #include <nan.h>. It's not necessary and
doesn't exist anymore in RedHat.
* archer-server.c: Remove #include signum.h. It isn't necessary
and doesn't exist in Solaris.
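The crossbow_index_multiclass_list entry above replaces strsep() with strtok() for portability. A minimal sketch of tokenizing a class list with strtok() follows; the helper name and the separator set are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Split LIST in place on any character in SEPS, storing up to MAX token
   pointers in OUT.  Returns the number of tokens stored.  strtok() is
   ISO C, which is why it is available where strsep() is not. */
static int
example_split_classes (char *list, const char *seps, char **out, int max)
{
  int n = 0;
  char *tok;

  assert (list != NULL && seps != NULL && out != NULL);
  for (tok = strtok (list, seps); tok != NULL && n < max;
       tok = strtok (NULL, seps))
    out[n++] = tok;
  return n;
}
```

Two behavioral differences are worth keeping in mind when making this swap: strtok() skips empty fields where strsep() returns them, and strtok() keeps hidden static state, so calls on two strings cannot be interleaved.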
2000-04-28 Andrew McCallum <[email protected]>
* cdm.c (bow_cdm_score): Normalize scores by document length.
2000-02-28 Andrew McCallum <[email protected]>
* barrel.c (bow_barrel_prune_words_in_map): Be sure not to add
words from MAP into the vocabulary.
2000-02-23 Andrew McCallum <[email protected]>
* primes.c (_bow_nextprime): Allocate space for the sieve in the
heap, not on the stack. This helps when we need really, really
big prime numbers.
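The _bow_nextprime entry above moves the sieve from the stack to the heap. A sketch of a heap-allocated sieve that finds the next prime follows; it is not the code in primes.c, and the function name, the search window [n, 2n], and the size cap are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Return the smallest prime >= N, or 0 if the sieve cannot be allocated.
   The window [n, 2n] always contains a prime (Bertrand's postulate). */
static unsigned
example_next_prime (unsigned n)
{
  unsigned limit, i, j, result = 0;
  unsigned char *composite;

  if (n < 2)
    return 2;
  assert (n <= (1u << 30));             /* keeps 2*n and j+i from wrapping */
  limit = 2 * n;
  /* Heap allocation: a sieve this large would overflow a typical stack. */
  composite = calloc ((size_t) limit + 1, 1);
  if (composite == NULL)
    return 0;
  for (i = 2; i * i <= limit; i++)
    if (!composite[i])
      for (j = i * i; j <= limit; j += i)
        composite[j] = 1;
  for (i = n; i <= limit; i++)
    if (!composite[i])
      {
        result = i;
        break;
      }
  free (composite);
  return result;
}
```

Unlike alloca(), calloc() reports failure through a NULL return, so a request for a very large prime can fail cleanly instead of crashing.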
2000-01-03 Andrew McCallum <[email protected]>
* split.c (bow_split_options): Changed documentation of "test-set"
default from 0.3 to 0.0. Reported by Jason Rennie.
2000-02-25 Gregory C Schohn <[email protected]>
Final release of SVM code as part of rainbow - it will become its
own library now (a rainbow interface will always remain up to
date though).
* svm_trans.c (transduce_svm): fixed a bug in the temporary
updating code. Changed some printing (for the simpler).
* svm_al.c (al_test_data): added fields for train and query
errors.
(al_svm_guts): now using hyp_vect and cur_hyp_yvect for ALL
labels.
Made the transduction method skip training if the labels
queried == labels transduced and they weren't bound.
Finished error reporting so that it doesn't always email me.
(al_svm_test_wrapper): added (& rearranged) appropriate code to
print out additional statistics (query & train accuracy).
* svm_base.c: changed some of the things that get printed, a
couple of constants and other very minor things.
(svm_score): fixed an uninitialized memory read.
(tlf_svm): fixed a bug with the random seed (it was sometimes
using uninitialized memory) for the splits.
2000-02-04 Kamal Nigam <[email protected]>
* vpc.c (bow_barrel_new_vpc_using_class_probs): New function.
* Makefile.local (RAINBOW_METHOD_C_FILES): Removed emda.c as a
file because it does not exist in the repository.
* genem.c: New file for genem method. Requires a secondary method
that utilizes class_probs for train and unlabeled docs (emsimple
with rounds=1 will do, e.g.).
* gaussian.c: New file for gaussian method. It's still very basic
and preliminary.
* Makefile.local (RAINBOW_METHOD_C_FILES): Added gaussian.c,
genem.c, cotrain.c
* cotrain.c: New file for cotrain method. This version was used
for ICML-2000 submission.
* bow/libbow.h (bow_doc_type): Added types bow_doc_pool and
bow_doc_waiting for co-training.
(bow_str2type): Extended for new doc types.
(bow_type2str): New macro.
* wi2dvf.c (bow_wi2dvf_unhide_wi): New function.
(bow_wi2dvf_hide_words_with_prefix): New function.
(bow_wi2dvf_hide_words_without_prefix): New function.
* vpc.c (bow_barrel_set_vpc_priors_using_class_probs): New
function.
* stoplist.c (bow_stoplist_present): If an infix separator is
defined use only the part of the string after it for stopword
identification.
* stem.c (bow_stem_porter): If an infix key is defined, take only
the string after the infix key for stemming purposes.
* split.c (bow_tag_change_tags): Changed prototype. Now returns
the number of docs changed instead of returning void.
* rainbow.c (rainbow_test): set the priors when building the class
barrel. Is it really possible this bug has existed forever?
* opts.c (bow_options): Added code for --lex-infix-string
(parse_bow_opt): Likewise.
* next.c (bow_cdoc_is_pool): New function.
(bow_cdoc_is_waiting): New function.
* nbsimple.c
(bow_nbsimple_set_cdoc_word_count_from_wi2dvf_weights): store
number of terms per class in the normalizer, as well as the
word_count. This way, we have access to the un-rounded number if
we prefer it.
(bow_nbsimple_score): Remove the "feature" that normalizes scores by
doc length for the document-then-word event model. This will now
have longer documents have more extreme probabilities than shorter
documents. Use the normalizer for the total number of words per
class instead of the word_count. This should be slightly more
accurate, as it's not rounded.
* lex-simple.c (bow_lexer_infix_separator): new variable for word
infix recognition.
(bow_lexer_infix_length): Likewise.
* emsimple.c (bow_emsimple_num_em_runs): Changed default to 10.
* active.c (active_cdoc_is_used_for_density): New function.
(active_doc_barrel_set_entropy): Use train, unlabeled, and pool docs
for density-setting. Used by cotrain.c.
(active_doc_barrel_set_density): Likewise. Also, don't print density
of each document.
2000-01-12 Gregory C Schohn <[email protected]>
* svm_smo.c (smo): Added smart re-computation for *W. If *W is
null & there are non-zero weights, then it is recomputed (since it
is necessary for the error evaluations). This saves a lot of work &
cache-thrashing if the tvals vector is already up-to-date (it's not
much harder to keep W up to date too).
(svm_smo_yflip_tvals): Killed. See svm_base.c log for details.
* bow/svm.h (svm_yflip_tvals): killed prototypes for this & the
smo/loqo functions. See the svm_smo.c log for details.
* svm_loqo.c (svm_loqo_yflip_tvals): killed (see the log entry for
svm_trans).
* svm_trans.c (transduce_svm): fixed most of the inefficiencies
(all of the big ones). When the smart_vals variable is set, no
extra recomputation is done, each svm sub-problem's output is used
as input along for the next sub-problem (very similar to the
active learning code, but here a lot more recomputation needs to be
done since labels and bounds are changing). The hyperplane
Null/non-null std is enforced, where the plane is set to zero
after it is freed, so that the solvers know not to look at it.
Fixed a bug where all unlabeled documents have the same hyp. label
(only relevant when no-bias is also being used).
There is also support for hyperplane stability management (see
svm_base & the refresh option). A lot of debugging code is around
for future use.
Killed the yflip functions. That code just happens inside of the
loop since the hyperplane needs to also be updated, but only for
smo (so clean parameter passing wasn't going to happen).
TODO: get the tval-to-err functions working (though this is a very
petty thing, especially if hyperplanes are being used to do the
error evaluations).
* svm_base.c (svm_options[]): added options for
TRANS_HYP_REFRESH_ARG (the number of iterations to go in the
transduction loop before recalculating the hyperplane from scratch
(to undo precision problems)). Probably never of any use, just a
way for the user to check his/her sanity.
(tlf_svm): added line to also print the running time to stdout.
(tlf_svm): Added initialization of *W to NULL (since smo now uses the
data in the array if the array is non-null).
(svm_vpc_merge): fixed a bug where documents were being re-loaded from
the barrel (when weights per barrel & pairwise voting was used).
The unlabeled docs weren't coming back, but now they are.
* svm_al.c (al_svm_guts): Made the loop a little bit
smarter/efficient when transduction is used. If the queried
labels are the same as those hypothesized & the weights are not
bound for those vectors, the next problem isn't solved (since the
solution will be exactly the same). So far this doesn't seem to
help too often (since running time increases as step size
increases, making this less & less probable). This will likely
help on very big datasets where transduction is very helpful.
1999-12-30 Andrew McCallum <[email protected]>
* bow/libbow.h: Declare new functions.
* wa.c (bow_wa_empty): New function.
* rainbow.c (rainbow_options): New command-line option
"print-doc-length".
(bow_print_log_odds_ratio): Don't tread on the IDF any more.
(rainbow_test): If requested, print the length of the document after
each classification.
* naivebayes.c: Add capability to return simply P(d|c) and the
ability to anneal the P(d|c) portion of the P(c|d).
(naivebayes_return_log_pr): New static variable.
(bow_naivebayes_anneal_temperature): New global variable.
(bow_naivebayes_score): Use the new variables.
* bow/naivebayes.h: Declare annealing global variable.
* info_gain.c (bow_word_count_wa): New function.
* em.c (bow_em_set_priors_using_class_probs): Don't set PRIOR_SUM
to MAX_CI. This was a very odd bug.
* dirk.c (bow_dirk_score): Comment out printing of diagnostics.
(bow_dirk_new_vpc): Add code that uses the CDM. I'm not sure if this
is working yet.
* cdmemr.c (use_cdm): New static variable, attend to it.
(bow_cdmemr_new_vpc_with_weights): Set the CDM anneal temperature and
the NAIVEBAYES anneal temperature to 1000. If we aren't very
confident about the most confident classifications this round,
then don't label any more unlabeled documents.
* cdmemi.c: Comments added.
(bow_cdmemi_new_vpc_with_weights): Bug fix. When
BOW_CDMEMI_BINARY_SCORING add to the WA the di from the 0th not
the 1st HITS.
* cdmem.c (bow_cdmem_new_vpc_with_weights): Only do one cdm round
instead of 5. Fix bug by pre-decrementing the NUM_CDM_ROUNDS
instead of post-decrementing.
* cdm.c (bow_cdm_anneal_temperature): New global variable.
(bow_cdm_word_probs_using_ct_alphas): Get the number of classes from
the CLASS_COUNT_BARREL->CDOCS->LENGTH instead of from
bow_barrel_num_classes (class_count_barrel). This way we can use
this code for a version of KNN with a CDM distance metric.
(bow_cdm_score): Calculate the number of words in the query; this was
previously used as the annealing temperature, but is no longer.
Divide the log-prob scores by the annealing temperature.
* archer.c (archer_index_lines): Try to make this work again after
the changes to archer_index() for incremental additions. Still
not working. For the canopies experiments, I just checked out an
old version of archer.
* Makefile.local (RAINBOW_METHOD_C_FILES): Added emda.c, cdmemi.c,
cdmemr.c.
1999-11-22 Andrew McCallum <[email protected]>
* Makefile.in (STANDARD_RAINBOW_METHOD_C_FILES): Added dirk.c.
* dirk.c (log_gamma): Cache 100 integer x's.
(bow_dirk_log_kernel): Take vocab size as argument instead of barrel.
(bow_dirk_score): Add exponentiated log-densities, instead of log
densities. Do this by finding the max and subtracting.
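The max-subtraction described for bow_dirk_score above is the usual log-sum-exp device. As a sketch, writing \ell_i for the per-item log-densities and m for their maximum (both symbols are introduced here and do not appear in the source):

```latex
\[
  \log \sum_{i} e^{\ell_i} \;=\; m + \log \sum_{i} e^{\ell_i - m},
  \qquad m = \max_{i} \ell_i .
\]
```

Every exponent on the right is at most zero, so no term overflows, and the largest term is exactly 1, so the sum cannot underflow to zero. The entry does not say whether dirk.c takes the final logarithm or works with the shifted sum directly.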
1999-11-16 Andrew McCallum <[email protected]>
* cdm.c (cdm_options): New command-line options
"cdm-print-smallest-alphas" and "cdm-print-largest-alphas".
(cdm_parse_opt): Handle them.
(bow_cdm_initialize_ct): New code allows this to be called more than
once. This way you can add new documents (and hence words) and
re-calculate the infogain.
(bow_cdm_ct_set_alphas): Added structure ALPHA_RECORD for printing
largest and smallest alphas. Added, but commented out, code for
smoothing the counts before fitting the Dirichlet, using
log(alpha) in place of alpha, smoothing the alphas. Print the
largest and smallest alphas.
(CDM_SCORE_ANNEAL_TEMPERATURE): New macro, currently defined not to be
used.
(bow_cdm_score): Handle it.
1999-12-20 Kamal Nigam <[email protected]>
* emsimple.c: added option --emsimple-no-init
1999-12-18 Gregory C Schohn <[email protected]>
* svm_al.c: Updated to work with the new model (ie. this can
be called only by svm_tlf (top-level-fn) & calls the trans fn.
or the setup & solve fn).
So far the usage of transduction has no extra heuristics set up,
but the active learning module can be used to get stats about
incrementally randomly selected labels.
Rewrote the code to work with transduce_svm with as little hassle
as possible. The code that handled the labeled & unlabeled arrays
significantly changed. Now there is only 1 array (no more sub_*
vectors with copies of data).
* svm_base.c (svm_options[]): Alphabetized to improve readability.
(svm_parse_opt): Re-ordered to mostly alphabetical to improve
readability.
(get_top_n): fixed a bug that popped up in obscure places & switched
to a more intelligent algorithm (don't know why it was dumb in the
first place).
(svm_remove_bound_examples): changed the removal code around (again)
as part of the new svm model. The fn now removes either bound, or
misclassified documents & is called by solve_svm (the most inner
svm fn. that calls a solver).
(svm_trans_or_chunk): removed chunk_svm for this. Calls either
transduce_svm or solve_svm depending on the parameters/data.
(svm_tlf): Top-Level-Fn. Permutes data & outputs a hyperplane in
bow_wv if possible. This fn also chooses/sets up the proper fn
(al, trans, removal, etc) to call.
* svm_loqo.c: Updated to work with cvect instead of svm_C. Now
all upper bounds come from the cvect parameter which MUST be
properly initialized. (this is necessary for transduction &
possibly other things).
* svm_smo.c (opt_pair): fixed a blatant bug in the solver (the
examples were added to I0 set in cases where they shouldn't have
been [see keerthi, et al for exactly where the examples should be
added if they weren't already present]).
Now the upper bounds come from the cvect instead of svm_C. The
algorithm is almost identical. The only difference is a little
bit more notice to the exact upper bounds on each of the boxes.
* svm_trans.c: stable version. Has new interface with the svm
model. No known bugs. The code does have some gross
inefficiencies (always zero-ing out temporary values & weights,
causing the solvers to restart each time), but all of the output
examined has been correct.
* bow/svm.h: Updated for a new svm interface. The relationship
between the different solvers is much cleaner now that redundant
code has been mostly eliminated.
Note - the prototypes for most functions have changed, as the
structure of most of the higher-level svm code has changed.
1999-12-01 Kamal Nigam <[email protected]>
* .cvsignore: added rainbow-be to ignore list
* .cvsignore: added rainbow-rank to ignore list.
* em.c (bow_cdoc_is_train_or_unlabeled): moved to split.c
(bow_em_new_vpc_with_weights): removed usage of halt_using_perplexity.
This option is broken, and its code was hurting performance.
* bow/libbow.h (bow_cdoc_is_train_or_unlabeled): New prototype.
* split.c (bow_files_source_type): added code for the
bow_files_source_fraction_train and bow_files_source_number_train.
This is indicated by a trailing 't', which converts some number of
training documents. For example, --unlabeled-set=500t takes 500
training docs and converts them to unlabeled docs.
(bow_split_options): Likewise.
(bow_split_parse_opt): Likewise.
(bow_set_doc_types_randomly_by_fraction_remaining): Likewise.
(bow_set_doc_types): Likewise.
(bow_set_doc_types_randomly_by_count_per_class): Added argument
source_tag. To get previous behavior, call this with source_tag
equal to bow_doc_untagged. Used for the new options
bow_files_source_number_train and bow_files_source_fraction_train.
(bow_set_doc_types_randomly_by_count): Likewise.
(bow_set_doc_types_randomly_by_fraction): Likewise.
(bow_cdoc_is_train_or_unlabeled): New function.
* maxent.c (maxent_options): Added code for new options
--maxent-iteration-docs and --maxent-constraint-docs.
(maxent_parse_opt): Likewise.
(bow_maxent_new_vpc_with_weights_doc_then_word): Likewise.
(bow_maxent_new_vpc_with_weights): Likewise.
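Note on the split.c 't' suffix above: the following is a minimal
sketch of how a value such as "500t" could be separated into a count
and a "take from training docs" flag. The function name and return
convention are hypothetical and are not taken from split.c; the real
option parser also handles fractions and other suffixes, which this
sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not the split.c implementation.
   Parses an integer count with an optional trailing 't'.
   Returns 0 on success, -1 if SPEC is not of that form. */
int parse_count_with_train_suffix(const char *spec, long *count,
                                  int *from_train)
{
  char *end;

  assert(spec != NULL && count != NULL && from_train != NULL);

  *count = strtol(spec, &end, 10);
  if (end == spec || *count < 0)
    return -1;                  /* no digits at all, or a negative count */

  if (end[0] == 't' && end[1] == '\0')
    *from_train = 1;            /* "500t": convert training docs */
  else if (end[0] == '\0')
    *from_train = 0;            /* "500": plain count */
  else
    return -1;                  /* unrecognized trailing characters */

  return 0;
}
```

With this sketch, "500t" yields a count of 500 with the flag set, and
"500" yields the same count with the flag clear.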
1999-11-10 Andrew McCallum <[email protected]>
* svm_base.c (sqrtf): New macro, necessary on some non-Linux
machines. Bug reported by Chuck Rosenberg.
1999-11-08 Andrew McCallum <[email protected]>
* readme.texi: Add simple usage examples for arrow.
* arrow.c (arrow_serve2): Implement the 'query' command. Change
XML labels from "archer" to "arrow".
(main): Change default number of hits on a query from 1 to 10.
* libbow-desc.texi: Update descriptions.
* svm_base.c: Make many printf's conditional on the
bow_verbosity_level, so that by default rainbow-stats will still
work.
* array.c (cdocs_iterator_count_for_doc): Replace NAN macro with
arithmetic equivalent.
* barrel.c (barrel_iterator_count_for_doc): Likewise.
* wv.c (bow_wv_weight_sum): New function.
* bow/libbow.h: Declare new function.
* train_dirichlet.c (moment_match_mccallum): Separate
implementation of moment matching that determines the variance by
averaging the variance of all dimensions.
(train_dirichlet_mom_sparse): New function.
* bow/train_dirichlet.h: Declare new function.
* tfidf.c (TFIDF_METHOD): Use
bow_wv_set_weights_to_count_times_idf() instead of
bow_wv_set_weights_to_count(), as is correct for TFIDF. This was
previously corrected in the scoring function.
(bow_tfidf_params_tfidf): Change parameter settings for "tfidf"
method. Previously it was identical to the "tfidf_log_words"
method, now it is identical to the "tfidf_log_occur" method. In
other words, previously it calculated IDF using the number of
times the word occurred in the training data; now it uses the
number of training documents in which the word occurs.
* split.c (bow_split_options): Remove documentation for 'r'
suffix. It's confusing and shouldn't be used unawares.
(bow_split_parse_opt): Add a 'pcr' suffix, but it's not implemented
yet.
(bow_set_doc_types_randomly_by_count_per_class): Count the number of
untagged documents in each class, and if this function is trying
to tag more than are available, simply have this function tag
fewer.
* rainbow.c (bow_print_log_odds_ratio): Handle words that are not
in the vocabulary.
* ddf.c: Implement ddfmm classification method. This method fits
the Dirichlet by moment matching only.
* arrow.c (arrow_serve2): New function. Now call this instead of
arrow_serve. It provides output in XML, like archer does. Only
the rank command is implemented.
1999-11-02 Andrew McCallum <[email protected]>
* int4str.c (bow_int2str): Assert that INDEX argument is
non-negative.
1999-10-28 Gregory C Schohn <[email protected]>
* svm_base.c (svm_vpc_merge): fixed bug for svml-basename - all
the docs still need to be output, so that the other data (like
word weights) can be properly extracted.
1999-10-28 Andrew McCallum <[email protected]>
* cdmem.c (cdmem_options): New command-line option
"cdmem-dist-data".
(cdmem_parse_opt): Handle it.
(bow_cdmem_new_vpc_with_weights): Let the command-line option
determine what documents are used to learn the distance metric.
* README-SVM (Outputing data): Added new section describing how to
produce files ready for input into SVM^light.
1999-10-27 Gregory C Schohn <[email protected]>
* svm_base.c (svm_vpc_merge): fixed svml bugs
* svm_base.c: fixed outdated documentation for parse info.
* svm_smo.c (smo): fixed a parse error
1999-10-26 Gregory C Schohn <[email protected]>
* rainbow.c (rainbow_test): added a line for svms. When svmlight
output is being generated, rainbow_test prints the label (only
works for binary barrels) so that svm_score can append the data
for that example.
* svm_base.c (svm_options[]): removed some of the single character
switches. Added arguments for tsvms & added svml-basename arg.
(svm_permute_data, svm_unpermute_data): added.
(infogain): should have made infogain compatible with sets with
unlabeled data (it ignores those docs with y = 0).
(svm_vpc_merge): added support for using unlabeled docs for
transduction. Also added code to spit out svmlight friendly
files.
(svm_score): added code to write svmlight files.
* svm_trans.c: initial version - pretty much empty now.
* bow/svm.h: added svm_*permute_data declarations & the
transduce_svm declaration.
* svm_al.c (al_svm_test_wrapper): replaced permutation code with
calls to svm_permute_data & svm_unpermute_data.
* svm_smo.c (smo): removed srandom(1) - was only there for
debugging.
* README-SVM (Bugs): removed section about smo being broken (was
fixed).
* Makefile.in: added svm_trans.c (transductive svms) to the
svm_files.
1999-10-25 Andrew McCallum <[email protected]>
* .cvsignore: Add automatically-generated archer files, and a few
others.
1999-10-21 Andrew McCallum <[email protected]>
* barrel.c (bow_barrel_keep_top_words_by_infogain): Don't set the
NUM_WORDS_TO_KEEP to be the WI2IG_SIZE (which is the total number
of words). Set it to the MIN of this and the original
NUM_WORDS_TO_KEEP. Before this fix, no words were ever getting
removed. What a bug! I wonder how long this has been in there?
Reported by Carsten Lanquillon <[email protected]>.
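Note on the barrel.c fix above: the clamping it describes can be
sketched as follows. The function name is hypothetical and the
surrounding infogain code is omitted; only the MIN step named in the
entry is shown.

```c
#include <assert.h>

/* Hypothetical stand-in for the clamping step: keep at most REQUESTED
   words, but never more than the vocabulary actually contains.  The
   bug described above amounted to always returning TOTAL_WORDS, so
   no words were ever removed. */
int clamp_num_words_to_keep(int requested, int total_words)
{
  assert(requested >= 0 && total_words >= 0);
  return requested < total_words ? requested : total_words;
}
```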
1999-10-20 Andrew McCallum <[email protected]>
* ddf.c (bow_ddf_dirichlet_from_doc_word_counts): Only print the
diagnostics for 10 sampled words, not 50.
* bpe.c (bow_bpe_set_cdoc_word_count_from_wi2dvf_weights): Print
the alphas for only 10 sampled words instead of 20.
1999-10-19 Andrew McCallum <[email protected]>
* svm_base.c: Check verbosity level before printing to stdout.
Only print if above bow_progress.
1999-10-19 Gregory C Schohn <[email protected]>
* svm_base.c (svm_score): removed cnt variable (useless) & fixed a
typo-bug (sub_model[i] -> barrel).
* svm_smo.c (smo): changed the printf for information of where
opt_pair failed to an fprintf.
1999-10-19 Gregory C Schohn <[email protected]>
* Makefile.local (DIST_ALL_FILES): added -DGCSJPRC (turns on local
pedantic debugging) to DEFS.
* Makefile.in (ALL_CPPFLAGS): added -Ibow (so that pr_loqo.h is
found by pr_loqo.c even though they aren't in the same directory
[since we can't change pr_loqo.*]).
(DEFS): Changed from _DEFS & now using += instead of the temporary.
* svm_base.c: the epsilon_crit is now /2 for SMO (since the actual
eps is 2x the variable). fixed some printfs.
* svm_loqo.c (build_svm_guts): added code to remember previous KKT
epsilon (even though nobody sets the initial value to anything
different than the macro).
(build_svm_guts): added local define (GCSJPRC) for debugging stuff
which includes stopping the proc & sending mail.
* svm_smo.c: commented #DEBUG. added kcache_ages to appropriate
spots across the file. removed some print statements that weren't
too useful anymore.
(opt_pair): changed an optimality check - used to use (a2+ao2)*eps
to determine if something moved far enough, now just using eps_a
(may not be right, but it's more correct than before) - we need it
to prevent inf. looping.
(opt_pair): Removed some unreachable code in if statements.
(opt_pair): Fixed calculations of bup & blow - they were backwards.
(smo): the threshold, b is now (bup+blow)/2 instead of blow (which
is at most epsilon_crit different).
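Note on the threshold change above: a minimal sketch of the midpoint
rule, assuming bup and blow are the two bounds that the SMO loop
maintains. The function name is hypothetical; the actual code in
svm_smo.c computes this inline.

```c
#include <assert.h>

/* Hypothetical helper: take the threshold b as the midpoint of the
   two bounds rather than using blow alone. */
double smo_threshold(double bup, double blow)
{
  assert(bup == bup && blow == blow);   /* reject NaN inputs */
  return (bup + blow) / 2.0;
}
```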
1999-10-16 Gregory C Schohn <[email protected]>
* svm_base.c: Added #ifdef HAVE_LOQO around calls to build_svm_guts
* svm_al.c: Added #ifdef HAVE_LOQO around calls to build_svm_guts
* Makefile.in: Re-enabled svm code. Made the pr_loqo checks look for
./bow/pr_loqo.h.
1999-10-16 Andrew McCallum <[email protected]>
* README-SVM (Obtaining sources): File renamed from README_SVM.
Clarify directions for where to put pr_loqo.h.
1999-10-15 Andrew McCallum <[email protected]>
* Version (BOW_MINOR_VERSION): Changed from 9 to 95.
* bow/libbow.h (BOW_MINOR_VERSION): Changed from 9 to 95.
Bug fixes for distribution.
* .cvsignore: Added rainbow-rank and rainbow-ts.
* Makefile.in: Temporarily disable SVM from rainbow.
(ARCHER_GENERATED_C_FILES): New variable. Remove these files from
those distributed, because they should be generated.
(ARCHER_DIST_FILES): Added archer.c and archer_query.c
* Makefile.in (DEMO_EXECUTABLES): New variable.
(ARCHER_DIST_FILES): Added dirichlet.c.
(DIST_FILES): Added archer.el
* multiclass.c: Comment out unused variables.
Odd assortment of clean-ups.
* bow/libbow.h (bow_random_reset_seed): Declare function.
* train_dirichlet.c (MOMENT_MATCH_ONLY): New macro.
(SPARSE): Change macro value from 0 to 1. This only affects running
train_dirichlet's main() directly.
(main): comment out the printing of the gammaln() tests. New local
variable COUNTS_SIZE, increased from 100 to 10000. Print more
diagnostics at the end.
* readme.texi: Update for new front-ends and fix command-line
options so they work.
* rainbow.c (rainbow_options): Clean up wording in several places.
(rainbow_query): Change behavior of repeated queries.
(bow_print_log_odds_ratio): Add a new FILE* argument. All callers
changed.
* nbshrinkage.c: Allow different lambda hierarchical mixture
weights for different classes.
* mix.c (mix_options): New command-line option for setting the
number of EM iterations.
(mix_new_vpc): Don't allow initial random class_probs to be zero.
* libbow-desc.texi: Update for new front-ends and MSWin.
* lex-gram.c (bow_lexer_gram_open_text_fp): Properly save the
return value of bow_realloc(). This fixes a nasty crash.
* emsimple.c (bow_emsimple_new_vpc_with_weights): Print
diagnostics using odds_ratio.
* dirichlet.c (main): New command-line argument -I. Handle it.
* dice.c (print_usage): Expand help statement.
* ddf.c (ddf_force_large_alphas): New variable.
(bow_ddf_dirichlet_from_doc_word_counts): Handle it.
(ddfla): New method.
* cdmm.c (CDMM_PRINT_ALPHAS_KEY): Change value to not conflict
with the cdm method.
* bpe.c (bpe_prior_alpha): Change default prior "ghost count" from
1 to 0.