[FEA] Distributed processing of Event Logs #1249
Comments
I am not sure I understand the problem. Is it about processing apps at runtime, or about the tool's resource requirements? Processing event logs requires large resources; for instance, the Spark History Server is known to need a lot of memory and compute to process event logs.
Previously, the python CLI had an option to submit the Tools jar as a Spark job. This was mainly a way to work with large event logs, since the CLI could spin up distributed Spark jobs.
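As a rough illustration of that distributed approach (not the actual CLI code), the sketch below parallelizes a list of event-log paths across a Spark cluster and runs a per-log handler on the executors. `analyze_event_log`, the app paths, and the app name are all hypothetical placeholders; in practice the per-log work would be whatever the Tools jar does for a single event log.

```python
from pyspark.sql import SparkSession


def analyze_event_log(path: str) -> tuple:
    """Hypothetical per-log handler; a real implementation would invoke the
    Tools parsing/analysis logic against the single event log at `path`."""
    return (path, "processed")


if __name__ == "__main__":
    spark = SparkSession.builder.appName("distributed-eventlog-processing").getOrCreate()
    sc = spark.sparkContext

    # Event-log locations; in a real run these could come from listing an HDFS/S3 directory.
    event_logs = [
        "hdfs:///spark-history/app-0001",
        "hdfs:///spark-history/app-0002",
    ]

    # Each executor processes a subset of the logs, so the work is no longer
    # bounded by the memory and compute of a single host.
    results = (
        sc.parallelize(event_logs, numSlices=len(event_logs))
          .map(analyze_event_log)
          .collect()
    )

    for path, status in results:
        print(path, status)

    spark.stop()
```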
Note that scaling can also be achieved by making a single machine run more efficiently, for example by storing the data in an on-disk database such as RocksDB instead of in memory. This issue should likely be split into multiple issues for the various improvements being made.
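A minimal sketch of that on-disk alternative, assuming the third-party `python-rocksdb` binding is available (the actual Tools core is JVM-based, so this is only illustrative): parsed events are written to a RocksDB store keyed by app id and event index instead of being held in an in-memory structure.

```python
import json

import rocksdb  # third-party binding; assumed installed via `pip install python-rocksdb`

# Open (or create) an on-disk store so parsed events do not have to stay in memory.
db = rocksdb.DB("parsed_events.db", rocksdb.Options(create_if_missing=True))

# Write a parsed event keyed by (app id, event index); the value is serialized JSON.
event = {"eventType": "SparkListenerTaskEnd", "stageId": 3, "durationMs": 1200}
db.put(b"app-0001:000042", json.dumps(event).encode("utf-8"))

# Later lookups read back from disk instead of an in-memory map.
raw = db.get(b"app-0001:000042")
if raw is not None:
    print(json.loads(raw))
```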
@tgravescs, yes, I agree. We had a previous issue #815 to track that.
Please note there are 2 other issues to improve processing of event logs on a single machine: |
Currently, we run the Tool (python+jar) on a single machine, which is limited by the memory and compute of the host machine. However, the Tools should have the capability to process event logs at large scale.
Although we do support running the Tools as a Spark Listener, that is not useful for apps that have already been processed.
Some of the ideas are:
- … `rapids_4_spark_qualification_output` directories.

cc: @viadea @kuhushukla