Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Tiling for Multiprocessing #652

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

vschaffn
Copy link
Contributor

@vschaffn vschaffn commented Feb 13, 2025

Resolves #649

Summary

The tiling grid generation enables splitting large raster datasets into smaller tiles, with an optional overlap parameter, to facilitate multiprocessing and improve performance when handling large

Features

  • Tiling Grid Generation (generate_tiling_grid): Generate a grid of positions by splitting [row_min, row_max] x [col_min, col_max] into tiles of size row_split x col_split with optional overlap.

  • Raster Method compute_tiling: Compute the raster tiling grid with generate_tiling_grid, specifying tile_size and overlap. Ensure both raster (self and raster_ref) are compatible with the same tiling.

  • Raster Method plot_tiling : Plot the raster with its tiling grid.

Tests

  • A test has been implemented on mock and real data, with and without overlapping to ensure the good behavior of generate_tiling_grid (expected shape, expected first and last tile, overlap consistency).

Example (on xDEM)

  • 200x200 tiling without overlapping :
    image

  • 200x200 tiling with 10 pixels overlapping :
    image

@vschaffn vschaffn force-pushed the 649-tiling_for_multiprocessing branch 2 times, most recently from 04008e3 to 5555068 Compare February 13, 2025 17:42
@vschaffn vschaffn force-pushed the 649-tiling_for_multiprocessing branch from 5555068 to b787b68 Compare February 13, 2025 17:44
@rhugonnet
Copy link
Member

Great! 😄

In terms of locations in the package, as these are methods that only array-related, I suggest we move them to either raster/array.py or to a new raster/tiling.py file (where we could also move the existing subdivide_array:

def subdivide_array(shape: tuple[int, ...], count: int) -> NDArrayNum:

Preference for the second option.

I don't picture users manipulating these lower-level functions themselves, so we could keep them outside of the Raster class as non-public methods that simply work on Raster.shape.

Also, I remember that the "stitching" part of the splitting with overlap was quite delicate sometimes (cropping the overlapping chunks with relative indexes). If it can be useful to you, here's how we had it coded in one package ("inv" saves the indexes to stitch the overlapping chunks later on): https://github.com/iamdonovan/pyddem/blob/2fbd54049cf90f77076ad62ba5e1b7cf179cf52e/pyddem/fit_tools.py#L1390

[
[[0, 50, 0, 50], [0, 50, 45, 95], [0, 50, 90, 100]],
[[45, 95, 0, 50], [45, 95, 45, 95], [45, 95, 90, 100]],
[[90, 100, 0, 50], [90, 100, 45, 95], [90, 100, 90, 100]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected the tiling here to yield [0, 55, 0, 55], [0, 55, 45, 100], [45, 100, 0, 55], [45, 100, 45, 100].

The list comprehensions here should normally work, might save you time re-coding this 😉 : https://github.com/iamdonovan/pyddem/blob/main/pyddem/fit_tools.py#L1398

Copy link

@adebardo adebardo Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be careful using pyddem, it's under GPL-3.0 License (same as RichDem), It is therefore not advisable to import it. However, I will look into it regarding "inspiration."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to Sébastien Dinot (https://github.com/sdinot) for proposing legally viable solutions.

  • Even projects under the GNU GPL v3.0 license are rarely written ex nihilo. This function may come from another project released under a permissive license. We need to search it. Without a specialized tool, it's mission impossible, but I have a little idea: it's time to test Software Heritage.

  • If that leads nowhere, someone needs to write an equivalent function without being influenced by the existing code (in other words, it would be best to find someone who has never seen this code). When rewritten in a black-box approach, there's little chance the final result will be identical. This implementation can potentially draw inspiration from an algorithm found in a book or article, in which case the latter should be explicitly cited as the source of inspiration to avoid any ambiguity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm not sure where this function came from originally.
From Git blame, I'm the one who added it to Pyddem 5 years ago, but it might have been inspired from a Stackoverflow response or similar.

@vschaffn
Copy link
Contributor Author

@rhugonnet I created a new file raster/tiling.py and moved generate_tiling_grid and subdivide_array as you suggested.
For tiling, if I have understood correctly, the overlap shouldn't be taken into account in the size of the tiles, and should be all around the tiles. That is what I did in the last commit, hope it would be right 😃

@rhugonnet
Copy link
Member

rhugonnet commented Feb 14, 2025

Exactly yes, it's all good now I think! 🙂

On the organization again:
As compute_tiling(Raster.shape) is mostly meant to be called internally in our future delayed_multiprocessing functionalities (and we get the same output by calling Raster.tiling() or tiling(Raster.shape)), I'm not sure that we need to expose it in the Raster class at all, same for plot_tiling(). If we do, at least not publicly (i.e. Raster._tiling()).

For next steps: We might have to think about how the user will define the tiling argument into Raster for delayed_multiprocessing before calling reproject(). It might be redundant to have it in every function call, maybe we'll have to define a new attribute Raster.chunksizes? (mirroring Dask naming)

@vschaffn
Copy link
Contributor Author

Exactly yes, it's all good now I think! 🙂

On the organization again: As compute_tiling(Raster.shape) is mostly meant to be called internally in our future delayed_multiprocessing functionalities (and we get the same output by calling Raster.tiling() or tiling(Raster.shape)), I'm not sure that we need to expose it in the Raster class at all, same for plot_tiling(). If we do, at least not publicly (i.e. Raster._tiling()).

For next steps: We might have to think about how the user will define the tiling argument into Raster for delayed_multiprocessing before calling reproject(). It might be redundant to have it in every function call, maybe we'll have to define a new attribute Raster.chunksizes? (mirroring Dask naming)

@rhugonnet I agree that it is maybe better to avoid overloading Raster class with tiling. Then I moved these functions in tiling.py.

For next steps, you are probably right, but as tiling is no longer defined in Raster, I find that it makes less sense to define an attribute Raster.chunksize there. I think we will see what is best to do about this in future tickets 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[POC] Tiling for multiprocessing
3 participants