Various fixes to the DAOS tensorflow-io plugin. (#2)
General:

Asserts were added and enabled after each DAOS event-related call in order to track down internal race conditions in the DAOS client code, see DAOS-10601.

The DAOS_FILE structure documented behavior for the 'offset' field, but most of that behavior did not need to be implemented. The 'offset' field was removed while fixing the Append() and Tell() functions, leaving only the single field 'file' in DAOS_FILE, so the DAOS_FILE struct was removed as well.

Typos in error messages were corrected.

File dfs_utils.cc:

In DFS::Setup(): Code after the Connect() call replaced the detailed error status set in Connect() with a generic TF_NOT_FOUND error carrying no message. This cost me several days of debugging to realize that the problem was not some object not being found, but rather a connection failure to an unhealthy container. The TF_NOT_FOUND has been removed, allowing the more detailed error messages from Connect() to be reported.

In ReadBuffer::ReadBuffer(): Setting buffer_offset to ULONG_MAX means an uninitialized buffer will never be matched by CacheHit(), removing the need for a separate 'initialized' variable. The 'valid' variable is no longer needed either; more on that below.

In ReadBuffer::~ReadBuffer(): daos_event_fini() cannot be called on an event that is still in flight; it fails without doing anything. daos_event_test() must first wait for any prior event to complete, otherwise the event delete that follows daos_event_fini() could corrupt the event queue. The reworked WaitEvent() (see below) is now called first to ensure that daos_event_fini() cleans up the event before it is deleted.

In ReadBuffer::FinalizeEvent(): The same problem exists here as in ~ReadBuffer(): daos_event_fini() cannot be called on an event that is still in flight. However, FinalizeEvent() is not actually needed; a dfs_file->buffers.clear() call in Cleanup() accomplishes the same thing via the ~ReadBuffer() code, so FinalizeEvent() was removed.

In ReadBuffer::WaitEvent(): A WaitEvent() function is needed in several places to wait for any outstanding event to complete, but this routine manipulated 'valid', so it could not be used anywhere else. The 'valid' code was removed so that the routine becomes a void and can be called in multiple places.

In ReadBuffer::AbortEvent(): daos_event_abort() does not actually contain any logic to ask the server to abort an in-flight dfs_read() request. It is also buggy: internal DAOS asserts were hit due to daos_event_abort() calls during I/O testing. The code was changed to instead use WaitEvent() to wait for a prior read to complete before issuing a new one, and AbortEvent() was removed.

In ReadBuffer::ReadAsync(): Both daos_event_fini() and daos_event_init() must be called on a daos_event_t structure before the event can be reused for another dfs_read() call; these calls have been added (see the sketch below). The AbortEvent() call was replaced with a call to WaitEvent(). The code now saves the errno from a failed dfs_read() call in the event's ev_error field so that the error is detected and a user cannot accidentally read trash data after a failed dfs_read().

In ReadBuffer::ReadSync(): This function is no longer used; see below.

In ReadBuffer::CopyData(): The WaitEvent() call ensures that the thread blocks until any in-flight read request is done. The event->ev_error field is used to detect I/O failure either at the time the dfs_read() is issued or in the reply, so the 'valid' flag is no longer needed.
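
For illustration, a minimal sketch of the event reuse rule described for ~ReadBuffer() and ReadAsync() above. The helper name and its parameters are hypothetical, not the plugin's actual code; only the DAOS calls themselves come from the DAOS API.

    #include <cassert>

    #include <daos.h>
    #include <daos_fs.h>

    // Hypothetical helper: wait for any prior read, recycle the event, then
    // issue a new dfs_read() on it.
    static int ReissueRead(dfs_t* dfs, dfs_obj_t* obj, daos_handle_t eq,
                           daos_event_t* ev, d_sg_list_t* sgl, daos_off_t off,
                           daos_size_t* read_size) {
      bool completed;
      // Block until any in-flight read on this event has completed (WaitEvent()).
      int rc = daos_event_test(ev, DAOS_EQ_WAIT, &completed);
      assert(rc == 0);

      // The event must be finalized and re-initialized before it can carry
      // another dfs_read().
      rc = daos_event_fini(ev);
      assert(rc == 0);
      rc = daos_event_init(ev, eq, nullptr);
      assert(rc == 0);

      rc = dfs_read(dfs, obj, sgl, off, read_size, ev);
      if (rc != 0) {
        // Record the errno in the event so a later completion check sees the
        // failure instead of returning stale buffer contents.
        ev->ev_error = rc;
      }
      return rc;
    }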
In ReadBuffer::CopyFromCache(): The TF_RandomAccessFile read() function allows int64_t-sized reads, so the return value here was changed to int64_t. If an I/O error occurred, -1 is returned so that the caller Read() can easily tell when there has been an I/O error, and a detailed error message is provided so that the user can tell what caused the error.

File dfs_filesystem.cc:

In the DFSRandomAccessFile constructor: Added an assert() on the event queue creation.

In Cleanup(): Replaced the FinalizeEvent() code with a dfs_file->buffers.clear() call, and added asserts on dfs function calls.

In df_dfs_filesystem::Read(): The loop "for (auto& read_buf : dfs_file->buffers)" was missing a break statement, so CacheHit() was called 256 times for each curr_offset value. A break was added. Support was added for detecting a read error and returning -1. Since Read() is now a while loop, there is no reason to use ReadSync() specially for the first buffer; the code was changed to use ReadAsync() for all readahead, and CopyFromCache() blocks until the first buffer's I/O is complete. ReadSync() is now unused and was removed. I could not determine a reason for the WaitEvent loop guarded by "if (curr_offset >= dfs_file->file_size)", because I/O requests will never be started beyond EOF. The loop was removed.

In DFSWritableFile: The Append() function had to make a dfs_get_size() call for each append to a file, adding a second round trip to the server per append, which is very expensive. Member functions were added to cache the file size and update it locally as Append() operations are done (sketched below). Since the tensorflow API allows only one writer, local caching is allowable. Should there be an I/O error, the actual size of the file becomes unknown; the new member functions take that into account and call dfs_get_size() in those situations to reestablish the correct size of the file.

In Append(): The dfs_file->daos_file.offset field was not updated after an Append() operation completed successfully, so a subsequent Tell() call would return the size of the file before the last Append(), not after; the reported size was incorrect. The code was changed to update the cached file size after each successful Append() operation.

In RenameFile(): As in the Setup() case, the detailed error statuses from Connect() were being hidden by a generic TF_NOT_FOUND error. The generic error was removed.

Signed-off-by: Kevan Rehm <[email protected]>
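
For illustration, a minimal sketch of the cached-size approach added to DFSWritableFile above. The class and member names here are hypothetical, not the plugin's actual code; only dfs_get_size() is the real DFS call.

    #include <daos.h>
    #include <daos_fs.h>

    // Hypothetical holder for the cached size of a file open for append.
    class CachedFileSize {
     public:
      CachedFileSize(dfs_t* dfs, dfs_obj_t* obj) : dfs_(dfs), obj_(obj) {}

      // Return the file size, querying the server only when the cached value
      // has been invalidated by a failed append.
      int GetFileSize(daos_size_t* size) {
        if (!size_known_) {
          int rc = dfs_get_size(dfs_, obj_, &file_size_);
          if (rc != 0) return rc;
          size_known_ = true;
        }
        *size = file_size_;
        return 0;
      }

      // After a successful Append(): advance the cached size locally, avoiding
      // a dfs_get_size() round trip per append.
      void SetFileSize(daos_size_t size) {
        file_size_ = size;
        size_known_ = true;
      }

      // After a failed Append(): the true size is unknown until re-queried.
      void UnsetFileSize() { size_known_ = false; }

     private:
      dfs_t* dfs_;
      dfs_obj_t* obj_;
      daos_size_t file_size_ = 0;
      bool size_known_ = false;
    };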