
Training gets stuck on a specific dataset #225

Closed
fumoboy007 opened this issue Dec 4, 2024 · 14 comments · May be fixed by #228 or #229
fumoboy007 commented Dec 4, 2024

I am trying to train a C-SVC on a specific dataset. The training process gets stuck, never finishing.

Reproduction Steps

  1. Unarchive libsvm-stuck-training-issue.zip, which includes reproduction code and data. (The data has already been scaled.)
  2. Copy svm.h and svm.cpp from LIBSVM v335 (or any older version back to and including v300) to the same directory.
  3. clang++ -std=c++11 -o reproduce_issue reproduce_issue.cpp svm.cpp
  4. ./reproduce_issue

Preliminary Investigation

  • The issue is not reproducible after reverting 1c80a42, though I’m not sure why. Similarly, changing typedef float Qfloat to typedef double Qfloat causes the issue to go away. This could be a red herring.
  • After adding some logging to Solver::select_working_set, I can see that the training process eventually gets stuck on the working set {1525, 2023}, repeatedly selecting these two indices for i and j but with alternating order.

My hunch is that there is a bug in the working set selection or the stopping criteria. I will leave it to the experts to investigate further!

cjlin1 (Owner) commented Dec 4, 2024

Your data is not a normal one as it has only 1 feature.
Using a high degree with a large regularization parameter easily
causes numerical difficulties.

I can reproduce your results:

$ make; ./svm-train -t 1 -e 0.01 -d 4 -c 100 -g 0.9597420397825849 -w0 1.04884106 -w1 0.95550528 ../test3/data1.txt
make: Nothing to be done for 'all'.
........................................................
WARNING: using -h 0 may be faster
*...............
WARNING: using -h 0 may be faster
*.................
WARNING: using -h 0 may be faster
*..........................................................................
WARNING: using -h 0 may be faster
*
WARNING: reaching max number of iterations

optimization finished, #iter = 10000000
obj = -244346.197253, rho = -1.203479
nSV = 2448, nBSV = 2440
Total nSV = 2448

However, if I change

typedef float Qfloat;

to

typedef double Qfloat;

then the problem disappears

$ make; ./svm-train -t 1 -e 0.01 -d 4 -c 100 -g 0.9597420397825849 -w0 1.04884106 -w1 0.95550528 ../test3/data1.txt
make: Nothing to be done for 'all'.
............................
WARNING: using -h 0 may be faster
*............................................
WARNING: using -h 0 may be faster
*.
WARNING: using -h 0 may be faster
*..................
WARNING: using -h 0 may be faster
*...........................
WARNING: using -h 0 may be faster
*.........
WARNING: using -h 0 may be faster
*............
WARNING: using -h 0 may be faster
*.............
WARNING: using -h 0 may be faster
*..............
WARNING: using -h 0 may be faster
*.....
WARNING: using -h 0 may be faster
*.................
WARNING: using -h 0 may be faster
*...........
WARNING: using -h 0 may be faster
*
optimization finished, #iter = 196403
obj = -245104.184532, rho = -1.003508
nSV = 2455, nBSV = 2448
Total nSV = 2455

If you diff 2.91 and 3.0, the main change is

$ diff ./svm.cpp ../libsvm-3.0/svm.cpp
195c195
<     virtual Qfloat *get_QD() const = 0;
---
>     virtual double *get_QD() const = 0;
208c208
<     virtual Qfloat *get_QD() const = 0;
---
>     virtual double *get_QD() const = 0;
415c415
<     const Qfloat *QD;
---
>     const double *QD;
601c601
<         double quad_coef = Q_i[i]+Q_j[j]+2*Q_i[j];
---
>         double quad_coef = QD[i]+QD[j]+2*Q_i[j];
644c644
<         double quad_coef = Q_i[i]+Q_j[j]-2*Q_i[j];
---
>         double quad_coef = QD[i]+QD[j]-2*Q_i[j];
822c822
<             double quad_coef=Q_i[i]+QD[j]-2.0*y[i]*Q_i[j];
---
>             double quad_coef = QD[i]+QD[j]-2.0*y[i]*Q_i[j];
846c846
<             double quad_coef=Q_i[i]+QD[j]+2.0*y[i]*Q_i[j];
---
>             double quad_coef = QD[i]+QD[j]+2.0*y[i]*Q_i[j];
1074c1074
<             double quad_coef = Q_ip[ip]+QD[j]-2*Q_ip[j];
---
>             double quad_coef = QD[ip]+QD[j]-2*Q_ip[j];
1098c1098
<             double quad_coef = Q_in[in]+QD[j]-2*Q_in[j];
---
>             double quad_coef = QD[in]+QD[j]-2*Q_in[j];
1259c1259
<     QD = new Qfloat[prob.l];
---
>     QD = new double[prob.l];
1261c1261
<         QD[i]= (Qfloat)(this->*kernel_function)(i,i);
---
>         QD[i] = (this->*kernel_function)(i,i);
1276c1276
<     Qfloat *get_QD() const
---
>     double *get_QD() const
1298c1298
<     Qfloat *QD;
---
>     double *QD;
1308c1308
<     QD = new Qfloat[prob.l];
---
>     QD = new double[prob.l];
1310c1310
<         QD[i]= (Qfloat)(this->*kernel_function)(i,i);
---
>         QD[i] = (this->*kernel_function)(i,i);
1325c1325
<     Qfloat *get_QD() const
---
>     double *get_QD() const
1344c1344
<     Qfloat *QD;
---
>     double *QD;
1355c1355
<     QD = new Qfloat[2*l];
---
>     QD = new double[2*l];
1364,1365c1364,1365
<         QD[k]= (Qfloat)(this->*kernel_function)(k,k);
<         QD[k+l]=QD[k];
---
>         QD[k] = (this->*kernel_function)(k,k);
>         QD[k+l] = QD[k];
1398c1398
<     Qfloat *get_QD() const
---
>     double *get_QD() const
1419c1419
<     Qfloat *QD;
---
>     double *QD;
1439c1439

I believe for the case you tried, certain numerical issues occur.
According to our log, in 2.91 we made this change:

QD is double now. All Q_i[i] are replaced by QD[i]
to avoid numerical problems.

In summary, numerical issues do occur in libsvm under different circumstances.
Yours happens to be a case where the old setting works, but we made that
change because of situations where the old one does not.

cjlin1 closed this as completed Dec 4, 2024
fumoboy007 (Author) commented Dec 4, 2024

Hi @cjlin1, thanks for taking a look!

Is there a way that LIBSVM can detect numerical difficulties and respond accordingly? I mentioned in #225 (comment) that the working set selection keeps choosing the same two indices without making progress. This seems like a situation that can be detected and mitigated.

Your data is not a normal one as it has only 1 feature.

Using a high degree with a large regularization parameter easily causes numerical difficulties.

I can understand that the data I provided may not seem normal but perhaps it will help if I give some context.

I encountered this issue while developing a sort of “auto ML” platform that automatically selects features and tunes hyperparameters. The feature selection works by starting with one “best” feature, selecting the next best feature, and so on. Hyperparameter tuning is also happening during this process. I ran into the issue during the start of the feature selection process when it was training with only one feature.

If I were doing all of this manually, then yes, maybe it would be reasonable for me to just work around convergence issues. But for an automated platform, I want the underlying learning algorithm implementation (LIBSVM in this case) to be robust enough that it can handle every case, including numerical edge cases.

I hope that makes sense!

cjlin1 (Owner) commented Dec 7, 2024

We have a maximal number of iterations to handle such situations.
Checking for such numerical errors would, first, make the code more
complicated and, second, slow down the program in the majority of situations.
Indeed, we did not even have a maximum number of iterations before;
it was added because of this.
The numerical issues are very rare, so we do not think extra checks are needed.

fumoboy007 (Author)

We have a maximal number of iterations to handle such situations.

Yup, I saw that. However, it’s not great because the training process takes forever to complete. In my case, a regular training process takes less than one second to complete while this problematic one takes hours (I haven’t checked exactly how long).

The numerical issues are very rare, so we do not think extra checks are needed.

Perhaps it’s too early to say whether an extra check is needed? I am trying to investigate the issue more deeply first. Admittedly, it’s difficult for me because I am not an expert here. Would you mind if I ask you some questions about the implementation and my findings?

cjlin1 (Owner) commented Dec 12, 2024 via email

fumoboy007 (Author) commented Dec 13, 2024

Thanks! Here are my findings so far. I added some extra logging and focused on the iterations where the training process starts oscillating between i=1525, j=2023 and i=2023, j=1525.

Findings

Iteration 111360

  1. Select i=1525 and j=2023 with obj_diff_min=-0.01164691862471009.
  2. Calculate quad_coef=0.00000136871642553 and delta=169.13929523467024296.
  3. Update alpha and G as follows:

     Variable    | Previous Value                     | New Value
     ----------- | ---------------------------------- | ----------------------------------
     alpha[1525] | 95.55052799999999991 (upper bound) | 34.06510009316370713
     alpha[2023] | 34.06510009316369292               | 95.55052799999999991 (upper bound)
     G[1525]     | 1.28270253761288888                | 3.23039156859958565
     G[2023]     | 1.28247103388129835                | 3.23109825804289930

Iteration 111361

  1. Select i=2023 and j=1525 with obj_diff_min=-0.04086391544522031.
  2. Calculate quad_coef=0.00000136871642553 and delta=516.31545448807764842.
  3. Update alpha and G as follows:

     Variable    | Previous Value                     | New Value
     ----------- | ---------------------------------- | ----------------------------------
     alpha[1525] | 34.06510009316370713               | 95.55052799999999991 (upper bound)
     alpha[2023] | 95.55052799999999991 (upper bound) | 34.06510009316365029
     G[1525]     | 3.23039156859958565                | 1.28270253760459374
     G[2023]     | 3.23109825804289930                | 1.28247103387388273

Subsequent Iterations

The values in subsequent iterations are similar to the above two iterations, though the last few digits of obj_diff_min, delta, alpha, and G change slightly.

Questions

  • In Iteration 111360, how did both G[1525] and G[2023] increase from about 1.28 to about 3.23? As far as I understand, the gradient for one or both of i and j should get closer to the optimal value of 0 after one iteration.
  • I mentioned that in subsequent iterations, the last few digits of G change slightly. Those small changes are increases. Again, shouldn’t G get closer to 0, even if very slowly?
  • Does the upper bound clipping play any role in the oscillation?
  • Why exactly is this oscillation happening? If it is numerical instability, where is it occurring?

cjlin1 (Owner) commented Dec 15, 2024

At optimum, G may not be zero as SVM dual is a constrained
optimization problem. You can see the optimality condition shown
on page 3 of
https://www.csie.ntu.edu.tw/~cjlin/papers/quadworkset.pdf

In theory, things should converge, but apparently numerical
issues occur here. To precisely check what happens, the
best way is to compare with the version that works.
You mentioned that an earlier version stops without issues.
You can print out information in every iteration and do a
comparison.

fumoboy007 (Author)

At optimum, G may not be zero as SVM dual is a constrained optimization problem.

Ohhh I see, that’s why m(α) ≤ M(α) is the stopping condition. Makes sense, thanks!

Will investigate further~~

fumoboy007 added a commit to fumoboy007/libsvm that referenced this issue Dec 21, 2024
…llate.

Fixes cjlin1#225.

The optimization algorithm has three main calculations:
1. Select the working set `{i, j}` that minimizes the decrease in the objective function.
2. Change `alpha[i]` and `alpha[j]` to minimize the decrease in the objective function while respecting constraints.
3. Update the gradient of the objective function according to the changes to `alpha[i]` and `alpha[j]`.

All three calculations make use of the matrix `Q`, which is represented by the `QMatrix` class. The `QMatrix` class has two main methods:
- `get_Q`, which returns an array of values for a single column of the matrix; and
- `get_QD`, which returns an array of diagonal values.

`Q` values are of type `Qfloat` while `QD` values are of type `double`. `Qfloat` is currently defined as `float`, so there can be inconsistency in the diagonal values returned by `get_Q` and `get_QD`. For example, in cjlin1#225, one of the diagonal values is `181.05748749793070829` as `double` and `180.99411909539512067` as `float`.

The first two calculations of the optimization algorithm access the diagonal values via `get_QD`. However, the third calculation accesses the diagonal values via `get_Q`. This inconsistency between the minimization calculations and the gradient update can cause the optimization algorithm to oscillate, as demonstrated by cjlin1#225.

We change `get_Q` to return a new class called `QColumn` instead of a plain array of values. The `QColumn` class overloads the subscript operator, so accessing individual elements is the same as before. Internally though, the `QColumn` class will return the `QD` value when the diagonal element is accessed. This guarantees that all calculations are using the same values for the diagonal elements, eliminating the inconsistency.

Alternatively, we could change `Qfloat` to be defined as `double`. This would also eliminate the inconsistency; however, it would reduce the cache capacity by half.
fumoboy007 added a commit to fumoboy007/libsvm that referenced this issue Dec 21, 2024
…llate.

Fixes cjlin1#225.
fumoboy007 added a commit to fumoboy007/libsvm that referenced this issue Dec 21, 2024
…llate.

Fixes cjlin1#225.

# Future Changes

The Java code will be updated similarly in a separate commit.
fumoboy007 (Author)

After a long and painful investigation, I have finally found the root cause of the issue! 🎉

I sent pull request #228 for your consideration. Below is the root cause analysis copied from the pull request description.

Background

The optimization algorithm has three main calculations:

  1. Select the working set {i, j} that minimizes the decrease in the objective function.
  2. Change alpha[i] and alpha[j] to minimize the decrease in the objective function while respecting constraints.
  3. Update the gradient of the objective function according to the changes to alpha[i] and alpha[j].

All three calculations make use of the matrix Q, which is represented by the QMatrix class. The QMatrix class has two main methods:

  • get_Q, which returns an array of values for a single column of the matrix; and
  • get_QD, which returns an array of diagonal values.

Problem

Q values are of type Qfloat while QD values are of type double. Qfloat is currently defined as float, so there can be inconsistency in the diagonal values returned by get_Q and get_QD. For example, in #225, one of the diagonal values is 181.05748749793070829 as double and 180.99411909539512067 as float.

The first two calculations of the optimization algorithm access the diagonal values via get_QD. However, the third calculation accesses the diagonal values via get_Q. This inconsistency between the minimization calculations and the gradient update can cause the optimization algorithm to oscillate, as demonstrated by #225.

Solution

We change get_Q to return a new class called QColumn instead of a plain array of values. The QColumn class overloads the subscript operator, so accessing individual elements is the same as before. Internally though, the QColumn class will return the QD value when the diagonal element is accessed. This guarantees that all calculations are using the same values for the diagonal elements, eliminating the inconsistency.

Alternatives Considered

Alternatively, we could change Qfloat to be defined as double. This would also eliminate the inconsistency; however, it would reduce the cache capacity by half.

cjlin1 (Owner) commented Dec 24, 2024

Thanks for the PR. Your change may cause lots of if checks.
I worry the code may become slower in some situations (e.g.,
all kernel elements have been cached and working set
selection is the main task). Redoing the experiments in the JMLR paper
on quadratic working set selection could be a huge task.

Why don't we modify the get_Q function?
After the loop

		for(j=start;j<len;j++)
			data[j] = (Qfloat)(y[i]*y[j]*(this->*kernel_function)(i,j));

we can have a simple assignment to use QD[i]

fumoboy007 (Author) commented Dec 25, 2024

Why don't we modify the get_Q function? After the loop

		for(j=start;j<len;j++)
			data[j] = (Qfloat)(y[i]*y[j]*(this->*kernel_function)(i,j));

we can have a simple assignment to use QD[i]

Unfortunately, this does not change anything. data is an array of Qfloat, so assigning data[i] = QD[i] would cause a cast from double to Qfloat. Thus, the diagonal values according to get_Q would still be different from the diagonal values according to get_QD.

Your change may cause lots of if checks. I worry the code may become slower in some situations (e.g., all kernel elements have been cached and working set selection is the main task). Redoing the experiments in the JMLR paper on quadratic working set selection could be a huge task.

Yup, the additional checks are unfortunate. Alternatively, we could change QD back to an array of Qfloat, which would also solve the inconsistency. However, I am not sure whether we would reintroduce the other issue that the original change 1c80a42 was trying to solve.

The other alternative I mentioned would be changing Qfloat to double. But not sure how you feel about reducing the cache capacity by half.

cjlin1 (Owner) commented Dec 25, 2024 via email

fumoboy007 (Author)

  • The example you gave is a special one with only one feature. Poly kernel is less used and usually C doesn't need to be that large

Hmm doesn’t it also affect the linear kernel? Also, what is the effect of C on this issue?

  • I recall that the example causing us to introduce double QD is a more typical scenario than your example

Is the past issue documented somewhere? I’d like to learn about it.

The problem with my issue is that it was really hard to understand and debug. None of the values were hitting numerical limits, so it wasn’t obvious that it was caused by something related to numerical imprecision. I would classify this issue more as an unexpected bug than a numerical issue: one calculation was using one number but another calculation was using a different number. That seems like something we should prioritize over numerical issues, which are more expected.

  • People can try to change Qfloat to double if they face numerical issues. We have an FAQ for this

Well, changing the code and recompiling involves a lot of friction, especially for users who are using a prebuilt executable or library. But that thought inspired me to think of another solution:

  1. Change the type of QD values back to Qfloat to guarantee consistency between calculations. This avoids unexpected issues like this one.
  2. Allow Qfloat to be changed at runtime. This allows users who face numerical issues to easily try a different precision level.

Both changes should be straightforward and won’t have obvious downsides like the other proposed solutions. I can send a pull request if you’re interested in this approach.

fumoboy007 added a commit to fumoboy007/libsvm that referenced this issue Dec 27, 2024
…llate.

Fixes cjlin1#225.

# Background

The optimization algorithm has three main calculations:
1. Select the working set `{i, j}` that [minimizes](https://github.com/cjlin1/libsvm/blob/35e55962f7f03ce425bada0e6b9db79193e947f8/svm.cpp#L829-L879) the decrease in the objective function.
2. Change `alpha[i]` and `alpha[j]` to [minimize](https://github.com/cjlin1/libsvm/blob/35e55962f7f03ce425bada0e6b9db79193e947f8/svm.cpp#L606-L691) the decrease in the objective function while respecting constraints.
3. [Update](https://github.com/cjlin1/libsvm/blob/35e55962f7f03ce425bada0e6b9db79193e947f8/svm.cpp#L698-L701) the gradient of the objective function according to the changes to `alpha[i]` and `alpha[j]`.

All three calculations make use of the matrix `Q`, which is represented by the `QMatrix` [class](https://github.com/cjlin1/libsvm/blob/35e55962f7f03ce425bada0e6b9db79193e947f8/svm.cpp#L198). The `QMatrix` class has two main methods:
- `get_Q`, which returns an array of values for a single column of the matrix; and
- `get_QD`, which returns an array of diagonal values.

# Problem

`Q` values are of type `Qfloat` while `QD` values are of type `double`. `Qfloat` is currently [defined](https://github.com/cjlin1/libsvm/blob/35e55962f7f03ce425bada0e6b9db79193e947f8/svm.cpp#L16) as `float`, so there can be inconsistency in the diagonal values returned by `get_Q` and `get_QD`. For example, in cjlin1#225, one of the diagonal values is `181.05748749793070829` as `double` and `180.99411909539512067` as `float`.

The first two calculations of the optimization algorithm access the diagonal values via `get_QD`. However, the third calculation accesses the diagonal values via `get_Q`. This inconsistency between the minimization calculations and the gradient update can cause the optimization algorithm to oscillate, as demonstrated by cjlin1#225.

# Solution

We change the type of `QD` values from `double` to `Qfloat`. This guarantees that all calculations are using the same values for the diagonal elements, eliminating the inconsistency.

Note that this reverts the past commit 1c80a42. That commit changed the type of `QD` values from `Qfloat` to `double` to address a numerical issue. In a follow-up commit, we will allow `Qfloat` to be defined as `double` at runtime as a more general fix for numerical issues.

# Future Changes

The Java code will be updated similarly in a separate commit.
fumoboy007 (Author)

Both changes should be straightforward and won’t have obvious downsides like the other proposed solutions. I can send a pull request if you’re interested in this approach.

See #229.
