Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-38069: File-based result delivery in Qserv #781

Merged
merged 8 commits into from
Jul 20, 2023
Merged

Conversation

iagaponenko
Copy link
Contributor

No description provided.

Copy link
Contributor

@jgates108 jgates108 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think readHttpFileAndMerge may need some work, but otherwise just minor comments.

src/ccontrol/MergingHandler.cc Show resolved Hide resolved
src/ccontrol/MergingHandler.cc Show resolved Hide resolved
src/ccontrol/MergingHandler.cc Show resolved Hide resolved
src/ccontrol/MergingHandler.cc Show resolved Hide resolved
src/wbase/ChannelShared.cc Show resolved Hide resolved
src/wconfig/WorkerConfig.cc Outdated Show resolved Hide resolved
src/wconfig/WorkerConfig.h Outdated Show resolved Hide resolved
src/wcontrol/Foreman.cc Outdated Show resolved Hide resolved
src/wdb/QueryRunner.cc Outdated Show resolved Hide resolved
src/xrdsvc/SsiService.h Outdated Show resolved Hide resolved
@iagaponenko iagaponenko force-pushed the tickets/DM-38069 branch 3 times, most recently from 3c88184 to 323c056 Compare July 12, 2023 01:44
Copy link
Contributor

@jgates108 jgates108 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. Just some minor things to consider.

bool ChannelShared::buildAndTransmitError(util::MultiError& multiErr, Task::Ptr const& task, bool cancelled) {
auto qId = task->getQueryId();
bool scanInteractive = true;
waitTransmitLock(scanInteractive, qId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail, waitTransmitLock() doesn't really do anything good for FileChannelShared(), but it probably doesn't really do any damage either as waitTransmitLock() is only called by FileChannelShared() when there is an error. It could possibly be an issue if there are a lot of errors. That being said, there are case where there can be a lot of erros very quickly in qserv.

The guts of this call could be moved to a protected function so FileChannelShared() can avoid calling waitTransmitLock() but SendChannelShared would waitTransmitLock() before calling the protected function. Or, FIleChannelShared could disable transmitLock in it's instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will see what I can do to get this implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have evaluated the code of waitTransmitLock. I propose eliminating this method in a separate "code cleanup" JIRA ticket after merging the current PR and using the new code in production. After that, we would be able to eliminate the SSI-based result delivery path. This will eliminate Any need in having the "transmit manager" which was introduced as a flow-control mechanism to limit the memory usage & XROOTD/SSI load.

If we do this step now for the FileChannelShared then we will end up with a more complicated code. As you mentioned in your comment, calling this method from FileChannelShared won't do any damage.

src/wbase/FileChannelShared.cc Show resolved Hide resolved
src/wbase/FileChannelShared.cc Outdated Show resolved Hide resolved
lock_guard<mutex> const streamMutexLock(_streamMutex);
++_lastCount;
bool lastTaskDone = _lastCount >= _taskCount;
return lastTaskDone;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have it's own mutex to avoid unrelated mutex contention issues. _lastCount has nothing to do with _streamMutex, so this is just asking for trouble. Each mutex should cover as small a number of cases/variables as possible to reduce contention andthe likelyhood of accidentaly trying to lock the mutex twice. (I think in the original code, _streamMutex was already locked for some other reason, so adding another mutex wouldn't help anything.)

However you may be able to get rid of the mutex entirely.

std::atomic _lastCount{0}; // in the class header

auto count = ++_lastCount; // safe since _lastCount is atomic.
bool lastTaskDone = count >= _taskCount;
return lastTaskDone;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, I'm a bit hesitant to do this now as this change means a possible shift in the locking behavior of these streaming classes compared with the current implementation. My intent here was not to change anything in the behavior of the SSI branch of the resulting streaming. I propose that we do this change (as per your suggestion above) as well as others that may modify the behavior of the streaming classes in a separate ticket. In this case, we would always see an effect of the changes.

src/wbase/FileChannelShared.cc Show resolved Hide resolved
src/wbase/SendChannelShared.cc Show resolved Hide resolved
@iagaponenko iagaponenko force-pushed the tickets/DM-38069 branch 3 times, most recently from 7305ebb to ca70181 Compare July 15, 2023 01:49
Added the corresponding configuration option to the entry-point CLI
for ingecting a value of the result folder into the config file of
of the worker CMSD. The configuration file has been extended as well
to allow serving files from workers via the XROOTD file-based protocol.
The SSI provider class has been extened to recognise result files
at the local filesystems of workers as valid XROOTD resources.
The service is now available to the client code via shared pointer
that can be stored and used by classes as needed. The change reduces
the number of parameters which are sent around the code. In the new
version of the code only the shared pointer to the service is put
to where the configuration parameters are consumed.

Also added new parameters to support the file-based result delivery,
including: a location of the results folder, the number of BOOST ASIO
threads to run the QHTTP server at worker, and a selector for the desired
results delivery protocol.
Also elminated deprecated protobuf schema and large result attributes
…sses

Split wbase::SendChannelShared into the base class and an implementation

Did the minor refactoring in naming and using XROOTD/SSI streaming channels
These change were required to prepare ground for introducing other
"ChannelShared"-alike implementations of the result processing and delivery
mechanisms.

The refactoring aims at correcting dependencies in a code that creates
and sets up worker tasks, so that task initialization was happening
completelly within the implementation file of wbase/Task.cc rather
than being spread between wbase/Task.cc and wcontrol/Foreman.cc.
Altogether this is meant to improve the observability of the code and
make it easier to maintain.

Also, some minor code cleanup and removal of unused header includes
and forward declarations was made. Thould reduce the compilation
time of the relevant modules.

Got rid of unused members and methods

Classes wbase::ChannelShared (used to be wbase::SendChannelShared)
and wbase::TransmitData had an obsolete mechanism for setting up
result schema in the response messages to be sent to Czar. The current
implementation no longer uses that API. The obsilete API was removed
to avoid confusions as it wasn't obvious where the actual schema
settings were made.

Got rid of std:: in the iomplementation files with "using std".

Guaranteed and enforced synchronization in private methods that require
a lock to be held before calling the methods. In the new code, a cons
reference to a lock is passed around contexts (methods) where such lock
is required. This compiler-enforced technique has proven to work for large
code bases where it would be hard to track a state of the lock by other methods.
The only other alternative method would be to do the run-time inspection of
the lock at the entrance of a method.
The new result writer redirects result sets into files at workers.
Only the last "summary" message with no rows is sent to Czar
for each worker query (regardless if a number of tasks that were
required to process the query).
Also did the minor refactoring in the error messages

Added HTTP server to Qserv worker for serving result files

Also made minor fixes and improvements to basic tests of QHTTP
made in the Replication/Ingest system code base.

Using a configuration option to select the desired result delivery protocol

Added support (is configured) for deleting unclaimed result files
at the startup time of the worker service and after restarting the Czar.
It was the trivial refactpring of the code meant to simplify the code
and prepare it for the functional extention.
The new version of the result merger at Czar would dynamically determine
which protocol should be used for pulling results from workers: SSI stream,
a file read via the XROOTD file protocol, or a file read via the HTTP
protocol. In case of the last two options, the merger would also also
utomatically tell a worker to delete the result file upon completion of
the merge (including the unsuccessful ones).

A choice of the delivery method is based on the optional
fileds in the Protobuf messages received from workers.
It's up to the workers to decide on what method to select.
Also, extended the Web Dashboard to display the filesystem status
and usage statistics.
@iagaponenko iagaponenko merged commit fe21aa3 into main Jul 20, 2023
@iagaponenko iagaponenko deleted the tickets/DM-38069 branch July 20, 2023 22:27
@iagaponenko iagaponenko changed the title Tickets/dm 38069 DM-38069: File-based result delivery in Qserv Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants