I am trying to run on a large data set, ~200,000 eBOSS spectra, and stumbled upon a memory issue.
What would be the best strategy to deal with it?
Is there a float32 option, or should I split the spectra I want to compute
in half along lambdaRF and stitch the results back together as best I can afterwards?
INFO: Starting EMPCA
iter R2 rchi2
Traceback (most recent call last):
File "<HOME>/redvsblue/bin//redvsblue_compute_PCA.py", line 205, in <module>
model = empca.empca(pcaflux, weights=pcaivar, niter=args.niter, nvec=args.nvec)
File "<HOME>/Programs/sbailey/empca/empca.py", line 307, in empca
model.solve_eigenvectors(smooth=smooth)
File "<HOME>/Programs/sbailey/empca/empca.py", line 142, in solve_eigenvectors
data -= np.outer(self.coeff[:,k], self.eigvec[k])
File "<HOME>/.local/lib/python3.6/site-packages/numpy/core/numeric.py", line 1203, in outer
return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
MemoryError
The code would need to be updated in several places to have the calculation stay in float32 if the inputs are float32, e.g. line 204:
mx=np.zeros(self.data.shape)
to
mx=np.zeros_like(self.data)
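For what it's worth, a minimal sketch of the dtype behavior that change relies on: np.zeros always allocates float64 by default, whereas np.zeros_like inherits the dtype of its argument, so float32 inputs stay float32 and the work array takes half the memory (the array shape here is arbitrary):

```python
import numpy as np

# Stand-in for self.data, already cast to single precision.
data = np.random.rand(1000, 500).astype(np.float32)

mx64 = np.zeros(data.shape)   # defaults to float64: 8 bytes/element
mx32 = np.zeros_like(data)    # inherits float32:    4 bytes/element

print(mx64.dtype, mx64.nbytes)  # float64 4000000
print(mx32.dtype, mx32.nbytes)  # float32 2000000
```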
A PR like that would be welcome, though in general I'm suspicious about the stability of single precision floating point calculations. Alternatives to consider:
- Run on NERSC Cori with 128 GB/node (I know @londumas has access to that machine).
- Run on a subset of the input data and cross-check the fits on the remainder (see the first sketch after this list). I'm not sure that going from 100k to 200k input quasars will really give you that much more information, and reserving 100k of them can be a useful cross check on overfitting anyway.
- Run on a subset of the data to develop an initial model, and then iteratively add data that are poorly fit by that model, i.e. bringing in data that cover phase space not covered by the original subset while not wasting memory on data that are already well described (see the second sketch after this list). Beware of any interpretation of the relative eigenvectors in that case, since your training set isn't representative of your full inputs; that may be fine for your use case.
I think any of those would be better than stitching together different redshift ranges or different wavelengths.
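As a first sketch of the subset + cross-check option: fit on a random half of the spectra and project the held-out half onto the trained eigenvectors. The placeholder data stand in for the pcaflux/pcaivar arrays from redvsblue_compute_PCA.py, and this assumes empca.Model(eigvec, data, weights) and Model.rchi2() behave as in the current sbailey/empca code:

```python
import numpy as np
import empca

# Placeholder arrays standing in for the prepared eBOSS spectra
# (shapes are made up; in the real run these come from the PCA script).
nspec, nwave = 2000, 500
rng = np.random.default_rng(0)
pcaflux = rng.normal(size=(nspec, nwave))
pcaivar = np.ones_like(pcaflux)

# Random 50/50 split: fit on one half, reserve the other for validation.
perm = rng.permutation(nspec)
train, holdout = perm[:nspec // 2], perm[nspec // 2:]

# Fitting on half the spectra roughly halves the per-iteration work
# arrays that triggered the MemoryError.
model = empca.empca(pcaflux[train], weights=pcaivar[train], niter=10, nvec=5)

# Cross-check: Model(eigvec, data, weights) solves coefficients for the
# held-out spectra with the eigenvectors held fixed.
check = empca.Model(model.eigvec, pcaflux[holdout], pcaivar[holdout])
print('train   rchi2:', model.rchi2())
print('holdout rchi2:', check.rchi2())
```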
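And a second sketch of the iterative option: start from a small subset, score the unused spectra against the current eigenvectors, and pull in the worst-fit ones before refitting. The subset size, number of rounds, and batch of 200 worst-fit spectra are all arbitrary placeholders:

```python
import numpy as np
import empca

# Placeholder data as in the previous sketch.
nspec, nwave = 2000, 500
rng = np.random.default_rng(0)
pcaflux = rng.normal(size=(nspec, nwave))
pcaivar = np.ones_like(pcaflux)

# Initial model from a small random subset.
subset = rng.permutation(nspec)[:500]
model = empca.empca(pcaflux[subset], weights=pcaivar[subset], niter=10, nvec=5)

for _ in range(3):  # a few enlargement rounds
    rest = np.setdiff1d(np.arange(nspec), subset)
    # Solve coefficients for the unused spectra with fixed eigenvectors.
    fit = empca.Model(model.eigvec, pcaflux[rest], pcaivar[rest])
    # Per-spectrum reduced chi2 of the current model on the unused data.
    resid = pcaflux[rest] - fit.coeff.dot(fit.eigvec)
    chi2 = np.sum(pcaivar[rest] * resid**2, axis=1) / nwave
    # Bring in the 200 worst-fit spectra and refit.
    subset = np.concatenate([subset, rest[np.argsort(chi2)[-200:]]])
    model = empca.empca(pcaflux[subset], weights=pcaivar[subset],
                        niter=10, nvec=5)
```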