-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-38069: File-based result delivery in Qserv #781
Conversation
8c24717
to
13c2288
Compare
13c2288
to
23fa3b3
Compare
a9ef00f
to
430ffc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I think readHttpFileAndMerge may need some work, but otherwise just minor comments.
3c88184
to
323c056
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good. Just some minor things to consider.
bool ChannelShared::buildAndTransmitError(util::MultiError& multiErr, Task::Ptr const& task, bool cancelled) { | ||
auto qId = task->getQueryId(); | ||
bool scanInteractive = true; | ||
waitTransmitLock(scanInteractive, qId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor detail, waitTransmitLock()
doesn't really do anything good for FileChannelShared()
, but it probably doesn't really do any damage either as waitTransmitLock()
is only called by FileChannelShared()
when there is an error. It could possibly be an issue if there are a lot of errors. That being said, there are case where there can be a lot of erros very quickly in qserv.
The guts of this call could be moved to a protected function so FileChannelShared()
can avoid calling waitTransmitLock()
but SendChannelShared
would waitTransmitLock()
before calling the protected function. Or, FIleChannelShared
could disable transmitLock
in it's instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will see what I can do to get this implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have evaluated the code of waitTransmitLock
. I propose eliminating this method in a separate "code cleanup" JIRA ticket after merging the current PR and using the new code in production. After that, we would be able to eliminate the SSI-based result delivery path. This will eliminate Any need in having the "transmit manager" which was introduced as a flow-control mechanism to limit the memory usage & XROOTD/SSI load.
If we do this step now for the FileChannelShared
then we will end up with a more complicated code. As you mentioned in your comment, calling this method from FileChannelShared
won't do any damage.
lock_guard<mutex> const streamMutexLock(_streamMutex); | ||
++_lastCount; | ||
bool lastTaskDone = _lastCount >= _taskCount; | ||
return lastTaskDone; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have it's own mutex to avoid unrelated mutex contention issues. _lastCount
has nothing to do with _streamMutex, so this is just asking for trouble. Each mutex should cover as small a number of cases/variables as possible to reduce contention andthe likelyhood of accidentaly trying to lock the mutex twice. (I think in the original code, _streamMutex was already locked for some other reason, so adding another mutex wouldn't help anything.)
However you may be able to get rid of the mutex entirely.
std::atomic _lastCount{0}; // in the class header
auto count = ++_lastCount; // safe since _lastCount is atomic.
bool lastTaskDone = count >= _taskCount;
return lastTaskDone;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frankly, I'm a bit hesitant to do this now as this change means a possible shift in the locking behavior of these streaming classes compared with the current implementation. My intent here was not to change anything in the behavior of the SSI branch of the resulting streaming. I propose that we do this change (as per your suggestion above) as well as others that may modify the behavior of the streaming classes in a separate ticket. In this case, we would always see an effect of the changes.
7305ebb
to
ca70181
Compare
Added the corresponding configuration option to the entry-point CLI for ingecting a value of the result folder into the config file of of the worker CMSD. The configuration file has been extended as well to allow serving files from workers via the XROOTD file-based protocol. The SSI provider class has been extened to recognise result files at the local filesystems of workers as valid XROOTD resources.
The service is now available to the client code via shared pointer that can be stored and used by classes as needed. The change reduces the number of parameters which are sent around the code. In the new version of the code only the shared pointer to the service is put to where the configuration parameters are consumed. Also added new parameters to support the file-based result delivery, including: a location of the results folder, the number of BOOST ASIO threads to run the QHTTP server at worker, and a selector for the desired results delivery protocol.
Also elminated deprecated protobuf schema and large result attributes
…sses Split wbase::SendChannelShared into the base class and an implementation Did the minor refactoring in naming and using XROOTD/SSI streaming channels These change were required to prepare ground for introducing other "ChannelShared"-alike implementations of the result processing and delivery mechanisms. The refactoring aims at correcting dependencies in a code that creates and sets up worker tasks, so that task initialization was happening completelly within the implementation file of wbase/Task.cc rather than being spread between wbase/Task.cc and wcontrol/Foreman.cc. Altogether this is meant to improve the observability of the code and make it easier to maintain. Also, some minor code cleanup and removal of unused header includes and forward declarations was made. Thould reduce the compilation time of the relevant modules. Got rid of unused members and methods Classes wbase::ChannelShared (used to be wbase::SendChannelShared) and wbase::TransmitData had an obsolete mechanism for setting up result schema in the response messages to be sent to Czar. The current implementation no longer uses that API. The obsilete API was removed to avoid confusions as it wasn't obvious where the actual schema settings were made. Got rid of std:: in the iomplementation files with "using std". Guaranteed and enforced synchronization in private methods that require a lock to be held before calling the methods. In the new code, a cons reference to a lock is passed around contexts (methods) where such lock is required. This compiler-enforced technique has proven to work for large code bases where it would be hard to track a state of the lock by other methods. The only other alternative method would be to do the run-time inspection of the lock at the entrance of a method.
The new result writer redirects result sets into files at workers. Only the last "summary" message with no rows is sent to Czar for each worker query (regardless if a number of tasks that were required to process the query). Also did the minor refactoring in the error messages Added HTTP server to Qserv worker for serving result files Also made minor fixes and improvements to basic tests of QHTTP made in the Replication/Ingest system code base. Using a configuration option to select the desired result delivery protocol Added support (is configured) for deleting unclaimed result files at the startup time of the worker service and after restarting the Czar.
It was the trivial refactpring of the code meant to simplify the code and prepare it for the functional extention.
The new version of the result merger at Czar would dynamically determine which protocol should be used for pulling results from workers: SSI stream, a file read via the XROOTD file protocol, or a file read via the HTTP protocol. In case of the last two options, the merger would also also utomatically tell a worker to delete the result file upon completion of the merge (including the unsuccessful ones). A choice of the delivery method is based on the optional fileds in the Protobuf messages received from workers. It's up to the workers to decide on what method to select.
Also, extended the Web Dashboard to display the filesystem status and usage statistics.
ca70181
to
77ffa3e
Compare
No description provided.