Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Highly scale-dependent efficiency of ZHEEVR and DSYEVR #995

Open
1 of 2 tasks
amilsted opened this issue Mar 6, 2024 · 9 comments
Open
1 of 2 tasks

Highly scale-dependent efficiency of ZHEEVR and DSYEVR #995

amilsted opened this issue Mar 6, 2024 · 9 comments

Comments

@amilsted
Copy link

amilsted commented Mar 6, 2024

Description

Requesting all eigenvalues and eigenvectors of a large matrix M = A * c with ZHEEVR can take an amount of time that strongly depends the overall real scale factor c.

For example, a gaussian random matrix of size 4096 x 4096 with values close to 1.0 takes ~7.5s on my system with ZHEEVR. Scaling the matrix by a factor of 1e13 makes the process take ~14s, an increase of almost 100%. There is a similar slowdown with DSYEVR.

In a more extreme case, with a different example matrix of the same size, the process takes 180s with ZHEEVR when the scale of eigenvalues is ~1e12 vs. 8s when the scale is ~1. This is a ~20x slowdown. In this case, the matrix is real, but in complex representation. I am happy to upload the matrix if you can tell me where you want it. It is sparse and may have degeneracies (computed eigenvalues are close up to about 1e-9 relative). Random unitary transformations of the same matrix show the same slowdown.

With the same matrix, using DSYEVR stills show a slowdown with the "bad" scaling, but not by as much (~20s vs 6s).

The slowdown only occurs when eigenvectors are requested. The ABSTOL parameter is set to 0 (I am calling via Julia and scipy - Julia definitely sets ABSTOL to 0. Probably scipy does too?).

The accuracy of the eigendecomposition does not seem to be strongly related to the scale. I get sufficiently accurate results in all cases.

The extra time is taken in a single-threaded part of the algorithm (I have seen the problem with both MKL and OpenBLAS).

HEEV does not seem to exhibit the problem and solves fast.

I was surprised that an overall scaling factor makes such a big difference to solution speed. I would have expected the scaling would not matter.

This seems like a problem of significance, since both Julia and Scipy now choose syevr/heevr as the default Hermitian/Symmetric eigensolve algorithms.

Checklist

  • I've included a minimal example to reproduce the issue
  • I'd be willing to make a PR to solve this issue
@thijssteel
Copy link
Collaborator

That is a very interesting observation.

I don't know MRRR that well (but I don't think any of the currently active contributors knows it well).

My initial hunch is that it is just the selection of the algorithm. The documentation of zstemr says this:

If TRYRAC = .TRUE., indicates that the code should check whether
the tridiagonal matrix defines its eigenvalues to high relative
accuracy. If so, the code uses relative-accuracy preserving
algorithms that might be (a bit) slower depending on the matrix.
If the matrix does not define its eigenvalues to high relative
accuracy, the code can uses possibly faster algorithms.

Perhaps it is using the more expensive algorithm for the scaled matrix?

Definitely worth investigating.

@RalphAS
Copy link

RalphAS commented Mar 10, 2024

When asked for all vectors, DSYEVR tries the MRRR scheme, and if it fails it runs the more expensive inverse-iteration approach. It appears that the jump in runtime for at least some cases like the ones reported above is due to this fallback. (Inverse iteration is especially slow for clustered eigenvalues.)

In a few cases I tried, the internal MRRR routine DLARRV (which is only used when vectors are wanted) fails for a magnified copy of a matrix for which it gave good results.

A quick fix would be to recalibrate the existing scaling logic in DSTEMR.

@RalphAS
Copy link

RalphAS commented Mar 17, 2024

The cause of the failure reported by DSTEMR for certain matrices of large norm is an inconsistency in scaling used in a test in DLARRF:

lapack/SRC/dlarrf.f

Lines 467 to 477 in 5426147

* None of the representations investigated satisfied our
* criteria. Take the best one we found.
IF((SMLGROWTH.LT.FAIL).OR.NOFAIL) THEN
LSIGMA = BESTSHIFT
RSIGMA = BESTSHIFT
FORCER = .TRUE.
GOTO 5
ELSE
INFO = 1
RETURN
ENDIF

SMLGROWTH scales with the norm of the input data, but FAIL is a relative measure:

lapack/SRC/dlarrf.f

Lines 289 to 292 in 5426147

S = DLAMCH( 'S' )
SMLGROWTH = ONE / S
FAIL = DBLE(N-1)*MINGAP/(SPDIAM*EPS)
FAIL2 = DBLE(N-1)*MINGAP/(SPDIAM*SQRT(EPS))

Presumably the test at line 491 should use FAIL * SPDIAM.

This may account for some of the failures noted by @oamarques in #100.

@oamarques
Copy link
Collaborator

The slowdown reported by @amilsted is indeed due to ZSTEMR failing and then ZHEEVR resorting to plan B (inverse-iteration), as noted by @RalphAS. The failure stems from DLARRF (in the ZSTEMR chain), as also noted. (On a linux box, I have not observed a ~20x slowdown in ZHEEVR for N=4000, rather ~1.3x, since the reduction to tridiagonal form typically takes most of the time. That is a detail though.)

I could reproduce the failure for N=500, generating a matrix A with ZLATMR (generates random matrices of various types for testing LAPACK programs). Given A*c, c = 1.0e-13, 1.0e+00, 1.0e+13, ZSTEMR works for the first two values of c and fails for the third. This suggests an issue with scaling; we will continue looking to see what the fix could be.

BTW, we tested these algorithms with all kinds of matrices we could think of (see https://github.com/oamarques/STCollection) but we evidently missed something 😊

@amilsted
Copy link
Author

amilsted commented Jan 2, 2025

Good to know this is reproducible. The 20x slowdown does not appear with random matrices - I've only seen things get that bad on certain non-random matrices that we encounter fairly frequently. I am thinking that you probably don't need an example of that, as fixing the fallback behavior that causes the 1.3x slowdown likely fixes thte 20x slowdown too!

@martin-frbg
Copy link
Collaborator

it occurs to me that you could try the quick fix for dlarrf.f suggested in #995 (comment) (if you have the time) if you still have the kind of matrices occurring frequently enough in your work ? I'm not a mathematician though, and wouldn't know if there is something fundamentally unsatisfactory or unelegant in that approach)

@amilsted
Copy link
Author

amilsted commented Jan 2, 2025

Well, there's an easier workaround, which is just to scale the matrix to ~1 before solving. I am using LAPACK rather indirectly (via Julia or numpy and OpenBLAS) so trying the fix is not so easy.

@martin-frbg
Copy link
Collaborator

I see... guess I could sneak that tentative fix into the next OpenBLAS, but not sure if I'd want to diverge from Reference-LAPACK while the matter is still undecided. Just thought you are the one with the definite data for reproducing the issue

@amilsted
Copy link
Author

amilsted commented Jan 3, 2025

If the change is in an OpenBLAS branch I could probably give it a try. I could also probably provide an example matrix that exhibits the large slowdown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants