As a user, I want to hang onto the config file I used, in its entirety, for each run so that I have easy reference and copy-paste access to the exact settings I used.
As a user, I want to output results in the same directory both from multiple suites and from multiple runs of spark-bench.
As a user, I find it overwhelming to see such wide rows in my results with so much repeated information.
FYI, speaking for myself here, I think outputting results on a suite-by-suite basis is a little bit of an odd choice. It was my choice, but I still think it's weird. I definitely think this could be improved.
At the moment, the trend seems to be toward wrapping every element of the configuration and environment into the output as columns in the results table. As we try to add OS and hardware information, the number of columns will increase. Having one table may be convenient for certain kinds of analysis if you're using SQL, but it is not very user-friendly when it comes to casually going over output, particularly on the console. There is a lot of duplicated information: the OS is the same for all workloads on one machine, for example, and Spark options are the same for each spark-submit. Since the testing structure is hierarchical (submits -> suites -> workloads), it may make more sense to have similarly hierarchical output. In the discussion that spawned this issue, the idea was mentioned of having a directory for each submit and placing the configuration file for that submit in there along with the results output CSV, or something like that. I think it would be worth looking into extending this idea to a multi-level directory hierarchy, something like this:
- Base directory for this run
  - Original config file that created this run
  - Information common to all submits -- environment variables, parallelism, etc.
  - Directory for submit 1
    - If debugging is enabled, maybe the temporary config file created by spark-launch? Or we could just put the temp files there by default rather than putting them in /tmp and trying to delete them, which often fails
    - CSV or other table format with the hardware and OS information, Spark version and parameters, and anything else common to the entire submit
    - Directory for suite 1
      - spark-bench configuration information for suite 1 and anything else that can change between suites
      - At this point, we could go another step further and create a directory for each workload, containing its configuration and its output as a CSV or similar. I think this might be going a bit far. Instead, we could put the existing results tables in this directory, only without the information that has already been included at a higher level. Only the workload parameters and results would need to be included in this table.
    - Directory for suite 2...
  - Directory for submit 2...
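To make the proposed hierarchy concrete, here is a minimal Python sketch of how such a tree could be created. It is only an illustration, not spark-bench code: the function names and every file and directory name (`common.csv`, `submit-N`, `submit-info.csv`, `suite-N`, `suite-info.csv`) are assumptions chosen for the example.

```python
import csv
import shutil
from pathlib import Path


def write_table(path, row):
    """Write a one-row CSV with a header; `row` maps column name -> value."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)


def create_run_layout(base_dir, config_path, common_info, submits):
    """Create a run -> submit -> suite directory tree.

    All file and directory names here are hypothetical.
    `submits` is a list of dicts, each with "info" (values shared by the
    whole submit) and "suites" (a list of dicts of suite-level settings).
    """
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    # Keep the original config file verbatim for easy reference.
    shutil.copy(config_path, base / Path(config_path).name)
    # Information common to all submits lives once, at the top level.
    write_table(base / "common.csv", common_info)

    for i, submit in enumerate(submits, start=1):
        submit_dir = base / f"submit-{i}"
        submit_dir.mkdir(exist_ok=True)
        write_table(submit_dir / "submit-info.csv", submit["info"])
        for j, suite in enumerate(submit["suites"], start=1):
            suite_dir = submit_dir / f"suite-{j}"
            suite_dir.mkdir(exist_ok=True)
            write_table(suite_dir / "suite-info.csv", suite)
            # The suite's results table (workload parameters and results
            # only) would be written into suite_dir by the benchmark run.
    return base
```

Each piece of information is written once, at the highest level where it is constant, which is what removes the repeated columns from the per-suite results tables.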
I think something like this would go a long way toward making the output more user-friendly, and it also handles the "easy reference" issue mentioned above.
I'm not sure how far this goes toward a solution, but these are my initial thoughts on the matter.
It's important to consider scriptability of the results in the new format. Craig points out that while the hierarchical format is convenient for human readers, it may make it difficult to automatically extract results for reporting. We need to be careful that the new format is machine-friendly as well as human-friendly.
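One way to keep a hierarchical layout scriptable is to make it mechanically flattenable back into the single wide table. The Python sketch below illustrates the idea under assumptions of my own: the file and directory names (`common.csv`, `submit-*`, `submit-info.csv`, `suite-*`, `suite-info.csv`, `results.csv`) are hypothetical, and each info file is assumed to hold a header plus a single row.

```python
import csv
from pathlib import Path


def read_single_row(path):
    """Return the first data row of a CSV as a dict, or {} if absent or empty."""
    if not path.exists():
        return {}
    with open(path, newline="") as f:
        return next(csv.DictReader(f), {})


def flatten_results(base_dir):
    """Yield one wide row per workload result, merging every level of the tree.

    All file and directory names are hypothetical. Directories are visited
    in lexicographic order, so "submit-10" sorts before "submit-2".
    """
    base = Path(base_dir)
    common = read_single_row(base / "common.csv")
    for submit_dir in sorted(p for p in base.glob("submit-*") if p.is_dir()):
        submit_info = read_single_row(submit_dir / "submit-info.csv")
        for suite_dir in sorted(p for p in submit_dir.glob("suite-*") if p.is_dir()):
            suite_info = read_single_row(suite_dir / "suite-info.csv")
            results_path = suite_dir / "results.csv"
            if not results_path.exists():
                continue  # suite produced no results table
            with open(results_path, newline="") as f:
                for row in csv.DictReader(f):
                    # Lower levels override higher ones on a column clash.
                    yield {
                        "submit": submit_dir.name,
                        "suite": suite_dir.name,
                        **common,
                        **submit_info,
                        **suite_info,
                        **row,
                    }
```

With a helper like this, reporting scripts could keep consuming one table while people browse the directories, so the two formats would not have to compete.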
This is an issue that was originally opened on ecurtin/spark-bench; the discussion above is copy-pasted from there. The user stories are based on feedback from @brad-kaiser. The directory hierarchy proposal is from @showermat's comment of Jul 28, and the note on scriptability is from @showermat's comment of Aug 18.