LIBFFM is a library for field-aware factorization machines. For the formulation it solves, please check:
http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
Table of Contents
=================
- Overfitting and Early Stopping
- Specifying the importance weights
- Installation
- Data Format
- Command Line Usage
- Examples
- Library Usage
- OpenMP
- Building macOS Binaries
- Building Windows Binaries
- Contributors
Overfitting and Early Stopping
==============================
FFM is prone to overfitting, and the remedy we have so far is early stopping. The following run shows how FFM behaves
on a sample data set:
> ffm-train -p va.ffm -l 0.00002 tr.ffm
iter   tr_logloss   va_logloss
   1      0.49738      0.48776
   2      0.47383      0.47995
   3      0.46366      0.47480
   4      0.45561      0.47231
   5      0.44810      0.47034
   6      0.44037      0.47003
   7      0.43239      0.46952
   8      0.42362      0.46999
   9      0.41394      0.47088
  10      0.40326      0.47228
  11      0.39156      0.47435
  12      0.37886      0.47683
  13      0.36522      0.47975
  14      0.35079      0.48321
  15      0.33578      0.48703
We see the best validation loss is achieved at the 7th iteration. If we keep training, overfitting begins. It is worth
noting that increasing the regularization parameter does not help:
> ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm
iter   tr_logloss   va_logloss
   1      0.50532      0.49905
   2      0.48782      0.49242
   3      0.48136      0.48748
...
  29      0.42183      0.47014
...
  48      0.37071      0.47333
  49      0.36767      0.47374
  50      0.36472      0.47404
To avoid overfitting, we recommend always providing a validation set with the option `-p.' You can use the option
`--auto-stop' to stop at the iteration that reaches the best validation loss:
> ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm
iter   tr_logloss   va_logloss
   1      0.49738      0.48776
   2      0.47383      0.47995
   3      0.46366      0.47480
   4      0.45561      0.47231
   5      0.44810      0.47034
   6      0.44037      0.47003
   7      0.43239      0.46952
   8      0.42362      0.46999
Auto-stop. Use model at 7th iteration.
Specifying the importance weights
=================================
Usage:
Use `-W weight_file' to assign an importance weight to each training instance.
Use `-WV weight_file' to assign an importance weight to each validation instance.
Please make sure all importance weights are non-negative.
Example:
> ffm-train -p va.ffm -W weights.txt -l 0.00002 tr.ffm
> ffm-train -p va.ffm -W weights.txt -WV va_weights.txt -l 0.00002 tr.ffm
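For illustration, assuming the weight file simply lists one non-negative weight per line, in the same order as the
instances of the corresponding data file, a `weights.txt' for a three-instance training set might look like:

    1.0
    0.5
    2.0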
Installation
============
Requirement: LIBFFM is written in C++. It requires C++11 and OpenMP support. If OpenMP is not available on your
platform, please refer to section `OpenMP.'
- Unix-like systems:
To compile on Unix-like systems, type `make' in the command line.
- OS X:
The built-in compiler should be able to compile LIBFFM. However, OpenMP may
not be supported. In this case you have to compile without OpenMP. See
section `OpenMP' or `Building macOS Binaries' for details.
- Windows:
See `Building Windows Binaries' to compile.
Data Format
===========
The data format of LIBFFM is:
<label> <field1>:<index1>:<value1> <field2>:<index2>:<value2> ...
.
.
.
`field' and `index' should be non-negative integers. See the example file
`bigdata.tr.txt.'
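For illustration, two hypothetical instances with three fields each could look like:

    1 0:0:1 1:3:1 2:10:1
    0 0:1:1 1:4:0.5 2:11:1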
Command Line Usage
==================
- `ffm-train'
usage: ffm-train [options] training_set_file [model_file]
options:
-l <lambda>: set regularization parameter (default 0.00002)
-k <factor>: set number of latent factors (default 4)
-t <iteration>: set number of iterations (default 15)
-r <eta>: set learning rate (default 0.2)
-s <nr_threads>: set number of threads (default 1)
-p <path>: set path to the validation set
-f <path>: set path for production model file
-m <prefix>: set key prefix for production model
-W <path>: set path of importance weights file for training set
-WV <path>: set path of importance weights file for validation set
--quiet: quiet mode (no output)
--no-norm: disable instance-wise normalization
--no-rand: disable random update
--on-disk: perform on-disk training (a temporary file <training_set_file>.bin will be generated)
--json-meta <path>: generate a meta file in JSON format at the given path
--auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)
--auto-stop-threshold: set how many consecutive iterations without improvement of the validation loss are tolerated before auto-stop triggers (must be used with --auto-stop)
--nds-rate: set the negative down sampling rate for the training dataset
By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1. You can use
`--no-norm' to disable this behavior.
By default, our algorithm randomly selects an instance to update in each inner iteration. On some datasets you may
want to do updates in the original order. You can do so by using `--no-rand' together with `-s 1.'
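For example:

> ffm-train --no-rand -s 1 tr.ffm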
Because FFM usually needs early stopping for better test performance, we provide an option `--auto-stop' to stop at
the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p' when
you use this option.
- `ffm-predict'
usage: ffm-predict test_file model_file output_file [options]
options:
--nds-rate: set the negative down sampling rate that was used for the training dataset
Examples
========
> ffm-train bigdata.tr.txt model
train a model using the default parameters
> ffm-train -l 0.001 -k 16 -t 30 -r 0.05 -s 4 bigdata.tr.txt model
train a model using the following parameters:
regularization cost = 0.001
latent factors = 16
iterations = 30
learning rate = 0.05
threads = 4
> ffm-train -p bigdata.te.txt bigdata.tr.txt model
use bigdata.te.txt as validation set
> ffm-train --quiet bigdata.tr.txt
do not print messages to the screen
> ffm-predict bigdata.te.txt model output
do prediction
> ffm-train -p bigdata.te.txt -t 100 --auto-stop bigdata.tr.txt
use auto-stop to stop at the best iteration according to validation loss
Library Usage
=============
These structures and functions are declared in the header file `ffm.h.' You need to #include `ffm.h' in your C/C++
source files and link your program with `ffm.cpp.' You can see `ffm-train.cpp' and `ffm-predict.cpp' for examples
showing how to use them.
There are four public data structures in LIBFFM.
- struct ffm_node
{
ffm_int f; // field index
ffm_int j; // column index
ffm_float v; // value
};
Each `ffm_node' represents a non-zero element in a sparse matrix.
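For example, the token `2:10:1' from the data format above corresponds to an `ffm_node' with f = 2, j = 10, and v = 1.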
- struct ffm_problem
{
ffm_int n; // number of features
ffm_int l; // number of instances
ffm_int m; // number of fields
ffm_node *X; // non-zero elements
ffm_long *P; // row pointers
ffm_float *Y; // labels
};
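`ffm_problem' holds all instances in one flat array using a CSR-like layout. Below is a minimal sketch (an
illustration, not part of the library), assuming `P' holds l+1 row pointers so that the nodes of instance i are
X[P[i]] .. X[P[i+1]-1], and that labels follow the +1/-1 convention:

    #include <vector>
    #include "ffm.h"

    // Fill `prob' with two toy instances; the vectors must outlive `prob'
    // because ffm_problem only stores raw pointers.
    void fill_toy_problem(ffm_problem &prob,
                          std::vector<ffm_node> &X,
                          std::vector<ffm_long> &P,
                          std::vector<ffm_float> &Y)
    {
        X = {{0, 0, 1.0f}, {1, 3, 1.0f},   // instance 0: {field, index, value}
             {0, 1, 1.0f}, {1, 4, 0.5f}};  // instance 1
        P = {0, 2, 4};                     // row pointers (l + 1 entries)
        Y = {1.0f, -1.0f};                 // labels (+1 positive, -1 negative)

        prob.l = 2;          // number of instances
        prob.n = 5;          // number of features (max index + 1)
        prob.m = 2;          // number of fields (max field + 1)
        prob.X = X.data();
        prob.P = P.data();
        prob.Y = Y.data();
    }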
- struct ffm_parameter
{
ffm_float eta;
ffm_float lambda;
ffm_int nr_iters;
ffm_int k;
ffm_int nr_threads;
ffm_float nds_rate;
bool quiet;
bool normalization;
bool random;
bool auto_stop;
};
`ffm_parameter' represents the parameters used for training. The meaning of
each variable is:
variable       meaning                            default
==========================================================
eta            learning rate                      0.1
lambda         regularization cost                0
nr_iters       number of iterations               15
k              number of latent factors           4
nr_threads     number of threads used             1
quiet          no outputs to stdout               false
normalization  instance-wise normalization        false
random         randomly select instances in SG    true
auto_stop      auto stop at the best iteration    false
nds_rate       negative down sampling rate        1.0
To obtain a parameter object with default values, use the function
`ffm_get_default_param.'
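For example, a small sketch of taking the defaults and overriding a few fields (the override values here simply
mirror the CLI defaults listed under `Command Line Usage'):

    #include "ffm.h"

    // Start from the library defaults, then override selected fields.
    ffm_parameter make_param()
    {
        ffm_parameter param = ffm_get_default_param();
        param.eta = 0.2;            // CLI default learning rate
        param.lambda = 0.00002;     // CLI default regularization cost
        param.normalization = true; // the CLI normalizes instances by default
        param.nr_threads = 4;       // equivalent to `-s 4'
        return param;
    }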
- struct ffm_model
{
ffm_int n; // number of features
ffm_int m; // number of fields
ffm_int k; // number of latent factors
ffm_float *W; // store model values
bool normalization; // do instance-wise normalization
};
Functions available in LIBFFM include:
- ffm_parameter ffm_get_default_param();
Get default parameters.
- ffm_int ffm_save_model(struct ffm_model const *model, char const *path);
Save a model. It returns 0 on success and 1 on failure.
- struct ffm_model* ffm_load_model(char const *path);
Load a model. If the model could not be loaded, a nullptr is returned.
- void ffm_destroy_model(struct ffm_model **model);
Destroy a model.
- struct ffm_model* ffm_train(struct ffm_problem const *prob, ffm_parameter param);
Train a model.
- struct ffm_model* ffm_train_with_validation(struct ffm_problem const *Tr, struct ffm_problem const *Va, ffm_parameter param);
Train a model with training set `Tr' and validation set `Va.' The logloss of the validation set is printed at each
iteration.
- ffm_float ffm_predict(ffm_node *begin, ffm_node *end, ffm_model *model);
Do prediction. `begin' and `end' are pointers specifying the first and one-past-the-last `ffm_node' of the instance
to be predicted.
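Putting the pieces together, here is a minimal end-to-end sketch; it reuses the hypothetical `fill_toy_problem'
helper from the `ffm_problem' example above and keeps error handling minimal:

    #include <cstdio>
    #include <vector>
    #include "ffm.h"

    // From the ffm_problem sketch above.
    void fill_toy_problem(ffm_problem &prob, std::vector<ffm_node> &X,
                          std::vector<ffm_long> &P, std::vector<ffm_float> &Y);

    int main()
    {
        std::vector<ffm_node> X;
        std::vector<ffm_long> P;
        std::vector<ffm_float> Y;
        ffm_problem prob;
        fill_toy_problem(prob, X, P, Y);

        ffm_parameter param = ffm_get_default_param();
        param.nr_iters = 5;

        ffm_model *model = ffm_train(&prob, param);
        if (ffm_save_model(model, "toy.model") != 0)
            fprintf(stderr, "failed to save model\n");

        // Predict instance 0: its nodes are X[P[0]] .. X[P[1]-1].
        ffm_float pred = ffm_predict(prob.X + prob.P[0], prob.X + prob.P[1], model);
        printf("prediction = %f\n", pred);

        ffm_destroy_model(&model);
        return 0;
    }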
OpenMP
======
We use OpenMP for parallelization. If OpenMP is not available on your
platform, please comment out the following lines in the Makefile.
DFLAG += -DUSEOMP
CXXFLAGS += -fopenmp
Note: Please always run `make clean all' if these flags are changed.
Building macOS Binaries
=======================
Using Apple clang (with libomp):
brew install libomp
make OMP_CXXFLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" OMP_LDFLAGS="-L$(brew --prefix libomp)/lib -lomp"
Using gcc (installed via Homebrew):
brew install gcc
make CXX="g++-8"
Note: replace "8" with the version of gcc installed on your machine.
Building Windows Binaries
=========================
To build the binaries with the command-line tools of Visual C++, use the following steps:
1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and
go to the LIBFFM directory. If the VC++ environment variables have not been set,
type
"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"
You may have to modify the above command according to which version of VC++ you have
and where it is installed.
2. Type
nmake -f Makefile.win clean all
Contributors
============
Yu-Chin Juan, Wei-Sheng Chin, and Yong Zhuang
For questions, comments, feature requests, or bug reports, please send your email to
Yu-Chin ([email protected])