Dealing with large data set #5

Open
londumas opened this issue Mar 11, 2019 · 1 comment

@londumas

I am trying to run with a large data set, ~200,000 eBOSS spectra, and stumbled upon a memory issue.
What would be the best strategy to deal with that?
Is there a float32 option, or should I split the spectra I am looking at in half according to lambdaRF and stitch the results back together as best I can afterwards?

INFO: Starting EMPCA
       iter        R2             rchi2
Traceback (most recent call last):
  File "<HOME>/redvsblue/bin//redvsblue_compute_PCA.py", line 205, in <module>
    model = empca.empca(pcaflux, weights=pcaivar, niter=args.niter, nvec=args.nvec)
  File "<HOME>/Programs/sbailey/empca/empca.py", line 307, in empca
    model.solve_eigenvectors(smooth=smooth)
  File "<HOME>/Programs/sbailey/empca/empca.py", line 142, in solve_eigenvectors
    data -= np.outer(self.coeff[:,k], self.eigvec[k])    
  File "<HOME>/.local/lib/python3.6/site-packages/numpy/core/numeric.py", line 1203, in outer
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
MemoryError
@sbailey
Owner

sbailey commented Mar 14, 2019

The code would need to be updated in several places so that the calculation stays in float32 when the inputs are float32, e.g. line 204:

            mx = np.zeros(self.data.shape)

to

            mx = np.zeros_like(self.data)
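
For context, a minimal sketch of the dtype difference (the shape below is only illustrative): np.zeros always allocates float64 by default, while np.zeros_like inherits the input array's dtype, which is where the memory saving would come from.

    import numpy as np

    # Illustrative shape only: ~200,000 spectra x 1,000 rest-frame wavelength bins
    data = np.zeros((200000, 1000), dtype=np.float32)

    mx_old = np.zeros(data.shape)     # always float64: ~1.6 GB for this shape
    mx_new = np.zeros_like(data)      # keeps the input dtype (float32): ~0.8 GB

    print(mx_old.dtype, mx_new.dtype)  # float64 float32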

A PR like that would be welcome, though in general I'm suspicious about the stability of single precision floating point calculations. Alternatives to consider:

  • Run on NERSC Cori with 128 GB/node (I know @londumas has access to that machine)
  • Run on a subset of the input data and cross-check the fits on the remainder. I'm not sure that going from 100k to 200k input quasars will really give you that much more information, and holding out 100k of them can be a useful cross-check on overfitting anyway.
  • Run on a subset of the data to develop an initial model, and then iteratively add data that are poorly fit by that model, i.e. bringing in data that cover phase space not covered by the original subset while not wasting memory on data that are already well described (see the sketch below). Beware of any interpretation of the relative eigenvectors in that case, since your training set isn't representative of your full inputs; that may be fine for your use case.

I think any of those would be better than stitching together different redshift ranges or different wavelengths.
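
As a rough sketch of that third option (untested; it assumes the pcaflux / pcaivar arrays from your traceback, that empca.empca is called as in your script, and that the fitted model exposes its eigenvectors as model.eigvec, as the traceback suggests; nsub and nadd are hypothetical tuning numbers):

    import numpy as np
    import empca

    nvec, niter = 5, 10
    nsub = 50000   # hypothetical size of the initial training subset
    nadd = 10000   # hypothetical number of poorly fit spectra to add back

    # 1. Fit an initial model on a random subset of the spectra
    idx = np.random.choice(len(pcaflux), size=nsub, replace=False)
    model = empca.empca(pcaflux[idx], weights=pcaivar[idx], niter=niter, nvec=nvec)

    # 2. For each held-out spectrum, solve for coefficients by weighted least
    #    squares against the fixed eigenvectors and record a reduced chi^2
    rest = np.setdiff1d(np.arange(len(pcaflux)), idx)
    A = model.eigvec.T                     # shape (nwave, nvec)
    rchi2 = np.empty(len(rest))
    for i, j in enumerate(rest):
        w = pcaivar[j]
        Aw = A * w[:, None]
        coeff = np.linalg.solve(Aw.T @ A, Aw.T @ pcaflux[j])
        resid = pcaflux[j] - A @ coeff
        rchi2[i] = np.sum(w * resid**2) / max(np.count_nonzero(w) - nvec, 1)

    # 3. Add the worst-fit held-out spectra to the training set and refit
    worst = rest[np.argsort(rchi2)[-nadd:]]
    idx = np.concatenate([idx, worst])
    model = empca.empca(pcaflux[idx], weights=pcaivar[idx], niter=niter, nvec=nvec)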
