Add tests for benchmarks #1341

Merged: 27 commits merged into develop on Jul 20, 2023

Conversation

@upsj (Member) commented May 21, 2023

Pulled out of #1323, this is everything from that PR without the nlohmann-json and flag changes.

  • Add a test framework that compares the benchmark output against a reference (a minimal sketch follows below)
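
A minimal sketch of the comparison idea in Python (the function name, arguments, and file layout here are illustrative, not the actual framework in this PR):

import difflib
import subprocess
import sys

def compare_against_reference(command, reference_path):
    # Run the benchmark binary and capture its standard output.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    with open(reference_path) as ref:
        expected = ref.read()
    # A unified diff makes accidental output changes easy to spot in CI logs.
    diff = list(difflib.unified_diff(expected.splitlines(), result.stdout.splitlines(),
                                     "reference", "actual", lineterm=""))
    if diff:
        print("\n".join(diff))
        sys.exit(1)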

@upsj self-assigned this May 21, 2023
@ginkgo-bot added labels reg:build, reg:testing, mod:core, reg:benchmarking, type:solver, type:preconditioner — May 21, 2023
@MarcelKoch self-requested a review May 22, 2023 12:11
@upsj changed the base branch from develop to benchmark_profiler_input_flags May 25, 2023 07:11
@upsj force-pushed the benchmark_profiler_input_flags branch from 12a4218 to a86eef7 June 1, 2023 10:15
Base automatically changed from benchmark_profiler_input_flags to develop June 4, 2023 13:53
@sonarcloud (bot) commented Jun 6, 2023

SonarCloud Quality Gate failed.

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 11 (C)
Coverage: 75.0%
Duplication: 0.0%

@MarcelKoch (Member)

Have you looked into https://json-schema.org/? This is a way to validate json objects, so it could be used instead of the manual checking currently done.

@upsj (Member Author) commented Jun 7, 2023

We don't have a fixed schema, because the input object can contain information from different benchmarks (matrix_statistics, spmv, solver, ...) and we only check for what is necessary.

However, I am not doing any validation in here (only sanitization to make the output deterministic); are you talking about the other PR?

@MarcelKoch (Member)

You are validating the benchmark output by comparing it to some predefined output. I think this is exactly the use case for this schema validation. Of course we would have different schemas for the different benchmark types. Here is an example for the blas schema (it might not be complete since I'm missing the other operations):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "n": {"type": "integer"},
      "blas": {
        "type": "object",
        "patternProperties": {
          "^copy|axpy|scal": {
            "type": "object",
            "properties": {
              "time": {"type": "number"},
              "flops": {"type": "number"},
              "bandwidth": {"type": "number"},
              "repetitions": {"type": "integer"},
              "completed": {"type": "boolean"}
            },
            "required": [
              "time",
              "flops",
              "bandwidth",
              "repetitions",
              "completed"
            ]
          }
        }
      }
    },
    "required": [
      "n",
      "blas"
    ]
  }
}

@upsj (Member Author) commented Jun 7, 2023

@MarcelKoch Schemas might be more resilient to change, but they require significantly more work to adapt to changes (we can't just regenerate the output), and they don't allow checking whether we left other parts of the input unchanged. On top of that, we have multiple types of output sections (text, DEBUG logger output, JSON arrays, JSON objects) that would need to be parsed.

@upsj added the 1:ST:ready-for-review label Jun 12, 2023
@upsj requested a review from a team June 12, 2023 07:41
@MarcelKoch (Member)

> @MarcelKoch Schemas might be more resilient to change, but they require significantly more work to adapt to changes (we can't just regenerate the output)

Not generating the expected output can also be a good thing, because otherwise we would keep errors from incorrect benchmarks. IMO, changing the schema would not be too much work in most cases. And I think the schema would be a more common approach to this kind of test.

> and doesn't allow checking whether we left other parts of the input unchanged. On top of that, we have multiple types of output sections (text, DEBUG logger output, JSON arrays, JSON objects) that would need to be parsed.

The schema would only be for the JSON arrays and objects. The other parts would be tested as they currently are. Although, tbh, I'm not sure what the value is of testing the debug logger output. But I need to think about that a bit more.

@upsj (Member Author) commented Jun 12, 2023

Having the reference output be part of the repository and checking stderr and stdout is a fundamental goal of this approach, because it lets us notice when something in the output changes, either because we changed something on purpose or because we caused an accidental change. Either way the test fails, and we need to explicitly change the reference outputs. The change also gets preserved in git, so you can see in the code review which parts of the benchmark (and, in the DEBUG case, its internals) changed and how.
Also, JSON schemata can't check questions like: does the benchmark preserve parts of the input that were ignored? Does the order of the output entries stay consistent?
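
As an illustration of that workflow, a test could compare both streams against committed files and only rewrite them when explicitly asked to (the update flag and file layout here are hypothetical, not necessarily what this PR implements):

import subprocess
import sys
from pathlib import Path

def check_or_update(command, reference_dir, update=False):
    # Run the benchmark and compare stdout/stderr against the committed reference files.
    result = subprocess.run(command, capture_output=True, text=True)
    for name, actual in (("stdout", result.stdout), ("stderr", result.stderr)):
        reference = Path(reference_dir) / f"{Path(command[0]).name}.{name}"
        if update:
            # Intentional change: overwrite the reference so the new output gets committed and reviewed.
            reference.write_text(actual)
        elif reference.read_text() != actual:
            print(f"{name} differs from {reference}", file=sys.stderr)
            sys.exit(1)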

@MarcelKoch (Member)

> Having the reference output be part of the repository and checking stderr and stdout is a fundamental goal of this approach, because it lets us notice when something in the output changes, either because we changed something on purpose or because we caused an accidental change. Either way the test fails, and we need to explicitly change the reference outputs. The change also gets preserved in git, so you can see in the code review which parts of the benchmark (and, in the DEBUG case, its internals) changed and how.

All of that would also be valid with the schema approach.

> Also, JSON schemata can't check questions like: does the benchmark preserve parts of the input that were ignored? Does the order of the output entries stay consistent?

I'm not sure what the first part means, but on the second part: do we really care about the output order? This would also depend on the JSON implementation and how it handles object properties. At least, I don't think there is a predefined order on the properties. (I think it would be possible to check the order of array entries.)

@upsj (Member Author) commented Jun 12, 2023

nlohmann-json and RapidJSON both support preserving the order of objects in the output, and a change in the order is usually the side-effect of a change in the code. Whether the side-effect was intended or accidental, it's still useful to be notified of it.
Preserving the input as much as possible can matter if we have a workflow running through multiple benchmarks operating on the same JSON array, e.g. [{"filename":"file.mtx"}] -> [{"filename":"file.mtx", "problem": {"rows":1, "columns":1, ...}}] -> [{"filename":"file.mtx", "problem": ..., "optimal": {"spmv": "csr"}}].
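
A small Python illustration of that pipeline (the stage functions and key names are made up for the example; the real benchmarks are separate programs operating on the same JSON):

import json

def run_matrix_statistics(case):
    # Hypothetical stage: adds its own section, leaves every other key untouched.
    case.setdefault("problem", {"rows": 1, "columns": 1})
    return case

def run_spmv(case):
    case.setdefault("optimal", {})["spmv"] = "csr"
    return case

cases = json.loads('[{"filename": "file.mtx"}]')
for stage in (run_matrix_statistics, run_spmv):
    cases = [stage(case) for case in cases]
# "filename" from the original input is still present after both stages.
print(json.dumps(cases, indent=2))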

@MarcelKoch (Member)

From our offline discussion: the schema-based approach will be put on the backlog, and, for now, the purely text-based approach is sufficient. Perhaps after #1323 would be a good point to reevaluate the usage of schemas.

@tcojean (Member) commented Jun 12, 2023

Could that be summarized somehow, or is it only the previous points? Either here or in Notion. I've also recently seen fruitful usages of JSON schema; the more info on the topic, the better.

@upsj (Member Author) commented Jun 12, 2023

The important bullet points for me were:

  • JSON schemas can do almost all we need to do here, including handling the different ways typenames get mangled between different compilers.
  • To me, the other outputs (banner, status messages, DEBUG logger output, ...) are at least as important as the JSON output, so I didn't want to put too much effort into this yet.
  • One thing JSON schemas can't do is check the order in which entries were added to the JSON object. This is an important descriptor of the execution control flow though, since it tells us when the order in which the different benchmark operations were run changes (see the sketch below).
  • The tooling around text diffs is really convenient; is there something similar for JSON schema violations?
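
Regarding the order point, a short Python example of why ordering only shows up in a textual comparison (purely illustrative):

import json

a = {"time": 1.0, "flops": 2.0}
b = {"flops": 2.0, "time": 1.0}
assert a == b                          # schema/dict-level comparison ignores insertion order
assert json.dumps(a) != json.dumps(b)  # a text diff of the serialized output does not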

@greole (Collaborator) left a comment

Not a full review, just some thoughts so far. In general, I think that if test_framework.py grows in size, using pytest is the better alternative. And I would assume that the whole sanitizing could be simplified by splitting the traversal of the dictionary/lists from the actual sanitization. Also, some docstrings would improve readability.

Collaborator:

I am a bit unsure which route to go here. In general, I would recommend using something like pytest (https://docs.pytest.org/en/7.3.x/) because it ships all kinds of useful functionality, but if the scope of these Python tests stays limited and doesn't increase over time, it might be OK.

@upsj (Member Author) commented Jun 20, 2023

Our containers currently don't have pytest installed, but long-term this might be useful. Does pytest provide any utilities for long text diffs? I don't see any right now.

Collaborator:

Good question, I don't know about long text diffs, but comparing dictionaries is quite good. And IMO that should be the preferred route anyway: read JSON into Python dicts (reference and test output) -> sanitize -> compare both dictionaries.
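
A sketch of that route with pytest (the file names and the sanitize helper are placeholders, not the actual framework):

import json
from pathlib import Path

def sanitize(value):
    # Placeholder for the sanitization step discussed in this PR.
    return value

def test_blas_benchmark_output():
    reference = sanitize(json.loads(Path("blas.reference.json").read_text()))
    actual = sanitize(json.loads(Path("blas.output.json").read_text()))
    # On failure, pytest reports which keys and values differ between the two dicts.
    assert actual == reference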

@upsj (Member Author) commented Jun 21, 2023

The thing I am interested in here is the text output; the JSON objects are far from the only important part. The only reason to parse the JSON input is that it is easier to sanitize the results in parsed form. Text diffs on pretty-printed JSON are really convenient; I'm not sure there is much to be improved by moving to dicts?

benchmark/test/test_framework.py.in (several resolved review threads)

return sanitize_json(value, sanitize_all)


def sanitize_json(parsed_input, sanitize_all=False):

@greole (Collaborator) commented Jun 12, 2023

The function name is a bit misleading, since the parsed input has already been parsed into a Python object. I would probably write a function like this:

import collections.abc

def recursive_apply(parsed_input: dict, sanitizer: collections.abc.Callable) -> dict:
    """Recurse into parsed_input and apply sanitizer to values that are not dictionary or list types."""
    ...

Then the sanitizer should also know whether sanitize_all was set. This way you could untangle the calls from sanitize_json to sanitize_json_single and potentially back to sanitize_json.

@upsj (Member Author):

This might be really helpful in case the sanitization needs to become more advanced, but I believe that with the complexity involved in such a stateful sanitizer object, the current implementation is still easier to maintain. I am also not sure we will need more complex sanitization in the foreseeable future, since we basically only need to get rid of a handful of individual floats, arrays of floats, and implementation-dependent strings.
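
For illustration, such a key-based masking step could look roughly like this (a sketch with made-up key names, not the implementation in this PR):

def mask_nondeterministic(value, keys=("time", "bandwidth", "flops", "residual_norm")):
    # Replace nondeterministic leaves with a stable placeholder so the output can be diffed.
    if isinstance(value, dict):
        return {k: "<masked>" if k in keys else mask_nondeterministic(v, keys)
                for k, v in value.items()}
    if isinstance(value, list):
        return [mask_nondeterministic(v, keys) for v in value]
    return value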

@greole (Collaborator) commented Jun 13, 2023

Regarding the json-schema discussion: https://pypi.org/project/jsonschema/ might be an option.
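
For reference, validating a parsed result against a schema with that package looks roughly like this (toy schema and data):

import jsonschema

schema = {
    "type": "array",
    "items": {"type": "object",
              "properties": {"n": {"type": "integer"}},
              "required": ["n"]},
}

jsonschema.validate(instance=[{"n": 100}], schema=schema)      # passes silently
# jsonschema.validate(instance=[{"n": "100"}], schema=schema)  # would raise ValidationError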

@upsj added the 1:ST:no-changelog-entry label Jul 19, 2023
@sonarcloud (bot) commented Jul 20, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)
Coverage: 80.0%
Duplication: 0.0%

@codecov (bot) commented Jul 20, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.91% 🎉

Comparison is base (187c25a) 90.35% compared to head (7214d88) 91.26%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1341      +/-   ##
===========================================
+ Coverage    90.35%   91.26%   +0.91%     
===========================================
  Files          598      598              
  Lines        50781    50706      -75     
===========================================
+ Hits         45881    46276     +395     
+ Misses        4900     4430     -470     
Impacted Files Coverage Δ
test/mpi/preconditioner/schwarz.cpp 100.00% <100.00%> (ø)

... and 25 files with indirect coverage changes


@upsj merged commit 1a9877b into develop Jul 20, 2023
14 checks passed
@upsj deleted the benchmark_tests branch July 20, 2023 07:27