Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statistics operation load the entire raster in memory #187

Open
aaime opened this issue Apr 17, 2018 · 4 comments
Open

Statistics operation load the entire raster in memory #187

aaime opened this issue Apr 17, 2018 · 4 comments

Comments

@aaime
Copy link
Member

aaime commented Apr 17, 2018

Both simple and zonal states compute the statistics by aggregating them on tiles computation, and the property calculation calls getTiles() to make that happen.
It's a concise way to compute all tiles, however, it also means all rastes are computed and retained as getTiles() returns a Raster[], in other words, the entire raster gets loaded into memory.

@aaime
Copy link
Member Author

aaime commented Apr 17, 2018

The original stats ops from JAI do compute statistics on a tile by tile basis, in a loop, so are not affected by this issue. A drawback of the fix will be that the stats computation is going to switch from parallel (using all tile scheduler threads) to linear. On a busy server doing multiple requests that's a good thing, a batch job having all the machine to use might suffer slowdowns though.

aaime added a commit that referenced this issue Apr 17, 2018
Statistics operation load the entire raster in memory #187
@aaime aaime closed this as completed Apr 17, 2018
@simboss
Copy link
Member

simboss commented Apr 17, 2018

Should we add a parameter to control this behavior like, lazy VS eager computation (default to lazy)?

@simboss simboss reopened this Apr 17, 2018
@aaime
Copy link
Member Author

aaime commented Apr 18, 2018

It's not lazy vs eager, it's parallel vs sequential, but still done all in one shot in the getProperty call.
I did try to keep the parallel behavior, but due to implementation details it seems one has to also accept to keep all the tiles in memory. In particular, the tile scheduler field from the base class is not accessible, and even if we did, there is no way to see how many threads it can use (knowing it's important, otherwise we end up calling a method that builds again all tiles before returning, making the thing memory bound again).

Long story short, it seems the operation would need its own local, temporary thread pool to have some control on the computation. So the new parameter could indeed be the number of threads used to compute the stats, defaulting to one.

@simboss
Copy link
Member

simboss commented Apr 18, 2018

let's leave this open and get back to it in time.

At least we patched this for the time being.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants