Parallel Compression improvements #1302
Conversation
char global_no_coll_cause_string[512];

if (H5D__mpio_get_no_coll_cause_strings(local_no_coll_cause_string, 512,
                                        global_no_coll_cause_string, 512) < 0)
Moved most of this code into a new function that builds strings describing the reasons collective I/O was broken, so the logic can be reused.
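A minimal usage sketch based only on the call visible in the diff above; the error macro arguments and the final print statement are illustrative assumptions, not the actual HDF5 source:

char local_no_coll_cause_string[512];
char global_no_coll_cause_string[512];

/* Ask the helper for the local and global reasons that collective I/O
 * was broken (buffer sizes match the call shown in the diff) */
if (H5D__mpio_get_no_coll_cause_strings(local_no_coll_cause_string, 512,
                                        global_no_coll_cause_string, 512) < 0)
    HGOTO_ERROR(H5E_DATASET, H5E_CANTGET, FAIL,
                "can't get reasons for breaking collective I/O")

/* The strings can then be included in a warning or debug message */
fprintf(stderr, "local reasons: %s\nglobal reasons: %s\n",
        local_no_coll_cause_string, global_no_coll_cause_string);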
 *-------------------------------------------------------------------------
 */
herr_t
H5D_select_io_mem(void *dst_buf, const H5S_t *dst_space, const void *src_buf, const H5S_t *src_space,
A new routine that is very similar to H5D__select_io(), but rather than copying between application memory and the file, it copies between two memory buffers according to the selections in the destination and source dataspaces.
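A hedged sketch of how such a routine might be called; only the first four parameters are visible above, so the trailing element-size/element-count arguments and the error macro are assumptions:

/* Hypothetical call: copy the elements selected in src_space out of
 * src_buf into the positions selected in dst_space within dst_buf */
if (H5D_select_io_mem(dst_buf, dst_space, src_buf, src_space,
                      elmt_size, nelmts) < 0)
    HGOTO_ERROR(H5E_DATASET, H5E_CANTCOPY, FAIL,
                "couldn't copy selected elements between memory buffers")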
 *---------------------------------------------------------------------------
 */
static const char *
H5FD__mem_t_to_str(H5FD_mem_t mem_type)
The changes in this file make it possible to see what type of I/O the MPI I/O file driver is doing. Previously, only the offset and length of each I/O operation were shown; now the output also indicates whether the access targets a superblock area, raw data, an object header, etc.
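An illustrative version of such a mapping using the public H5FD_mem_t enumerators; the function name and the strings returned by the real H5FD__mem_t_to_str() are not shown in this excerpt, so the ones below are placeholders:

#include "hdf5.h"

/* Map an H5FD_mem_t value to a printable description (illustrative) */
static const char *
mem_type_to_str(H5FD_mem_t mem_type)
{
    switch (mem_type) {
        case H5FD_MEM_SUPER:
            return "H5FD_MEM_SUPER (superblock data)";
        case H5FD_MEM_BTREE:
            return "H5FD_MEM_BTREE (v1 B-tree data)";
        case H5FD_MEM_DRAW:
            return "H5FD_MEM_DRAW (raw data)";
        case H5FD_MEM_GHEAP:
            return "H5FD_MEM_GHEAP (global heap data)";
        case H5FD_MEM_LHEAP:
            return "H5FD_MEM_LHEAP (local heap data)";
        case H5FD_MEM_OHDR:
            return "H5FD_MEM_OHDR (object header data)";
        default:
            return "(unknown memory type)";
    }
}

With this in place, a debug line can print the memory type alongside the offset and length of each I/O call.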
 *-------------------------------------------------------------------------
 */
herr_t
H5_mpio_gatherv_alloc(void *send_buf, int send_count, MPI_Datatype send_type, const int recv_counts[],
The two new functions here are simply wrappers around MPI_(All)gatherv that hide a bit of boilerplate code. Both allocate the receive buffer for the caller. The only difference between the two is that the "simple" function calculates the recv_counts and displacements arrays for the caller before making the MPI_(All)gatherv call.
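A self-contained sketch of what the "simple" variant is described as doing; the function name and the lack of error handling are my own simplifications, not the actual HDF5 code:

#include <mpi.h>
#include <stdlib.h>

/* Gather each rank's send_count, build the counts/displacements arrays,
 * allocate the receive buffer on behalf of the caller, then gather. */
static int
gatherv_alloc_simple(void *send_buf, int send_count, MPI_Datatype send_type,
                     MPI_Comm comm, void **out_buf, int *out_count)
{
    int   mpi_size, type_size, total = 0, i;
    int  *counts = NULL, *displs = NULL;
    void *recv_buf = NULL;

    MPI_Comm_size(comm, &mpi_size);
    MPI_Type_size(send_type, &type_size);

    counts = malloc((size_t)mpi_size * sizeof(int));
    displs = malloc((size_t)mpi_size * sizeof(int));

    /* Each rank contributes its own send_count */
    MPI_Allgather(&send_count, 1, MPI_INT, counts, 1, MPI_INT, comm);

    for (i = 0; i < mpi_size; i++) {
        displs[i] = total;
        total += counts[i];
    }

    /* Allocate the receive buffer for the caller */
    recv_buf = malloc((size_t)total * (size_t)type_size);

    MPI_Allgatherv(send_buf, send_count, send_type,
                   recv_buf, counts, displs, send_type, comm);

    *out_buf   = recv_buf;
    *out_count = total;

    free(counts);
    free(displs);
    return 0;
}

The non-"simple" wrapper would instead take the recv_counts (and displacements) directly from the caller, as suggested by the recv_counts[] parameter in the signature above.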
Force-pushed from 51af53b to 5124042
@@ -273,6 +373,185 @@ static int H5D__cmp_filtered_collective_io_info_entry_owner(const void *filtered
/* Local Variables */
/*******************/

#ifdef H5Dmpio_DEBUG
The code below adds debugging output to H5Dmpio similar to what already exists in the MPI I/O file driver.
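As a rough illustration of the kind of macro this enables, guarded by the H5Dmpio_DEBUG symbol from the hunk above; the real macro names and output format in H5Dmpio.c may differ:

#include <stdio.h>

#ifdef H5Dmpio_DEBUG
/* Hypothetical rank-prefixed debug output, compiled away otherwise */
#define H5D_MPIO_DEBUG_MSG(mpi_rank, msg)                                  \
    do {                                                                   \
        fprintf(stderr, "## [RANK %d] %s\n", (mpi_rank), (msg));           \
    } while (0)
#else
#define H5D_MPIO_DEBUG_MSG(mpi_rank, msg)
#endif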
 *-------------------------------------------------------------------------
 */
static herr_t
H5D__mpio_array_gatherv(void *local_array, size_t local_array_num_entries, size_t array_entry_size,
This whole routine was rewritten and moved to H5mpi.c.
if ((mpi_rank = H5F_mpi_get_rank(io_info->dset->oloc.file)) < 0)
    HGOTO_ERROR(H5E_IO, H5E_MPI, FAIL, "unable to obtain MPI rank")
if ((mpi_size = H5F_mpi_get_size(io_info->dset->oloc.file)) < 0)
    HGOTO_ERROR(H5E_IO, H5E_MPI, FAIL, "unable to obtain MPI size")
Rather than retrieving the MPI rank and size multiple times in this file, do it once in H5D__chunk_collective_io, which tends to be the main entry point for this file, and then hand those values down to other functions as needed.
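The same pattern in a bare MPI program, purely for illustration (the HDF5 code queries the file's communicator through H5F_mpi_get_rank/H5F_mpi_get_size rather than MPI_COMM_WORLD):

#include <mpi.h>
#include <stdio.h>

/* Helpers receive the cached rank/size instead of querying MPI again */
static void
do_chunk_io(int mpi_rank, int mpi_size)
{
    printf("rank %d of %d performing chunk I/O\n", mpi_rank, mpi_size);
}

int
main(int argc, char **argv)
{
    int mpi_rank, mpi_size;

    MPI_Init(&argc, &argv);

    /* Query the communicator once at the entry point ... */
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    /* ... and hand the values down to the routines that need them */
    do_chunk_io(mpi_rank, mpi_size);

    MPI_Finalize();
    return 0;
}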
 */
if (H5D__mpio_array_gatherv(chunk_list, chunk_list_num_entries,
Rather than gathering everybody's list of chunks into a collective array, the feature has been revised in most places to construct different MPI derived types to only send as much data as needed, greatly reducing the feature's memory usage.
 *-------------------------------------------------------------------------
 */
static herr_t
H5D__filtered_collective_chunk_entry_io(H5D_filtered_collective_io_info_t *chunk_entry,
This routine used to handle either reading an individual chunk (for dataset reads) or reading and writing an individual chunk (for dataset writes). However, any chunk reads here were performed independently, which is a scalability problem for the feature. The new H5D__mpio_collective_filtered_chunk_read, H5D__mpio_collective_filtered_chunk_update and H5D__mpio_collective_filtered_chunk_common_io routines now perform the duties of this routine in a manner that allows chunk reads to be done collectively. This should generally scale much better while still giving the user the option of requesting independent chunk reads when desired.
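For reference, this is how an application requests collective transfers while asking for the low-level I/O to be performed independently; whether this exact property pair is what the new filtered-chunk read path honors is my assumption based on the comment above:

hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);

/* Use the collective code path for the transfer as a whole ... */
H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);

/* ... but request that the low-level chunk I/O be done independently */
H5Pset_dxpl_mpio_collective_opt(dxpl_id, H5FD_MPIO_INDIVIDUAL_IO);

/* H5Dread(dset_id, mem_type_id, mem_space_id, file_space_id, dxpl_id, buf); */

H5Pclose(dxpl_id);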
} /* end H5D__mpio_collective_filtered_chunk_reinsert() */


/*-------------------------------------------------------------------------
 * Function:    H5D__mpio_get_chunk_redistribute_info_types
The three functions below create different MPI derived datatypes to extract certain portions of information from the overall per-chunk H5D_filtered_collective_io_info_t structure. A particular operation (shared-chunk redistribution, chunk reallocation, chunk reinsertion) usually needs only a few fields from that structure, and this information is gathered to all ranks, so sending just the necessary fields can drastically reduce memory usage at the expense of a bit of MPI overhead.
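A small standalone sketch of the general technique; the structure and field names below are invented for illustration and are not the actual H5D_filtered_collective_io_info_t layout:

#include <mpi.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk record */
typedef struct {
    uint64_t chunk_idx;   /* needed by the collective operation  */
    int      orig_owner;  /* needed by the collective operation  */
    int      new_owner;   /* skipped by the derived type         */
    size_t   buf_size;    /* skipped by the derived type         */
    void    *buf;         /* skipped by the derived type         */
} chunk_info_t;

/* Build a derived type that transmits only chunk_idx and orig_owner */
static void
create_partial_chunk_type(MPI_Datatype *new_type)
{
    int          block_lengths[2] = {1, 1};
    MPI_Aint     displs[2]        = {(MPI_Aint)offsetof(chunk_info_t, chunk_idx),
                                     (MPI_Aint)offsetof(chunk_info_t, orig_owner)};
    MPI_Datatype field_types[2]   = {MPI_UINT64_T, MPI_INT};
    MPI_Datatype struct_type;

    MPI_Type_create_struct(2, block_lengths, displs, field_types, &struct_type);

    /* Resize so the type strides correctly over an array of records */
    MPI_Type_create_resized(struct_type, 0, (MPI_Aint)sizeof(chunk_info_t), new_type);
    MPI_Type_commit(new_type);
    MPI_Type_free(&struct_type);
}

Gathering an array of chunk_info_t with such a type sends only the needed fields of each record instead of the whole structure.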
 *-------------------------------------------------------------------------
 */
static herr_t
H5D__mpio_collective_filtered_io_type(H5D_filtered_collective_io_info_t *chunk_list, size_t num_entries,
This routine was revised slightly to create more efficient MPI derived types for performing I/O on filtered chunks.
/* Participate in the collective re-insertion of all chunks modified
 * in this iteration into the chunk index
 */
for (j = 0; j < collective_chunk_list_num_entries; j++) {
Chunk index reinsertion logic here moved into H5D__mpio_collective_filtered_chunk_reinsert, which more efficiently handles memory usage as well as chunk reinsertion itself.
 */
for (i = 0; i < collective_chunk_list_num_entries; i++) {
Chunk index reinsertion logic here moved into H5D__mpio_collective_filtered_chunk_reinsert, which more efficiently handles memory usage as well as chunk reinsertion itself.
 */
for (j = 0; j < collective_chunk_list_num_entries; j++) {
Chunk file space reallocation logic moved into H5D__mpio_collective_filtered_chunk_reallocate, which more efficiently handles memory usage.
HGOTO_ERROR(H5E_DATASET, H5E_CANTGATHER, FAIL, "couldn't gather new chunk sizes")

/* Collectively re-allocate the modified chunks (from each process) in the file */
for (i = 0; i < collective_chunk_list_num_entries; i++) {
Chunk file space reallocation logic moved into H5D__mpio_collective_filtered_chunk_reallocate, which more efficiently handles memory usage.

if (have_chunk_to_process)
    if (H5D__filtered_collective_chunk_entry_io(&chunk_list[i], io_info, type_info, fm) < 0)
Duties now performed by H5D__mpio_collective_filtered_chunk_update instead.
 */
for (i = 0; i < chunk_list_num_entries; i++)
    if (mpi_rank == chunk_list[i].owners.new_owner)
        if (H5D__filtered_collective_chunk_entry_io(&chunk_list[i], io_info, type_info, fm) < 0)
Duties now performed by H5D__mpio_collective_filtered_chunk_update instead.
Add support for chunk fill values to parallel compression feature
Add partial support for incremental file space allocation to parallel compression feature
Refactor chunk reallocation and reinsertion code to use less MPI communication during linked-chunk I/O
H5D__get_num_chunks can be used to correctly determine space allocation status for filtered and unfiltered chunked datasets
Avoid doing I/O when a rank has no selection and the MPI communicator size is 1, or when the I/O has been requested as independent at the low level
Avoid 0-byte collective read of incrementally allocated filtered dataset when the dataset hasn't been written to yet
Force-pushed from a85daed to c00813c
* Fix the function cast error in H5Dchunk.c and activate `-Werror=cast-function-type`. Again. (#1170)
* Parallel Compression improvements (#1302)
* Fix for parallel compression examples on Windows (#1459)
* Parallel compression adjustments for HDF5 1.12
* Committing clang-format changes

Co-authored-by: David Young <[email protected]>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Fix the function cast error in H5Dchunk.c and activate `-Werror=cast-function-type`. Again. (HDFGroup#1170)
* Parallel Compression improvements (HDFGroup#1302)
* Fix for parallel compression examples on Windows (HDFGroup#1459)

Co-authored-by: David Young <[email protected]>