Add tests for benchmarks #1341

Merged: 27 commits merged into develop on Jul 20, 2023

Conversation

@upsj (Member) commented May 21, 2023

Pulled out of #1323, this is everything from that PR without the nlohmann-json and flag changes.

  • Add a test framework that compares the benchmark output against a reference (a minimal sketch follows below)
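
A minimal sketch of the comparison idea in Python (the function name, arguments, and file layout here are illustrative, not the actual framework in this PR):

import difflib
import subprocess
import sys

def compare_against_reference(command, reference_path):
    # Run the benchmark binary and capture its standard output.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    with open(reference_path) as ref:
        expected = ref.read()
    # A unified diff makes accidental output changes easy to spot in CI logs.
    diff = list(difflib.unified_diff(expected.splitlines(), result.stdout.splitlines(),
                                     "reference", "actual", lineterm=""))
    if diff:
        print("\n".join(diff))
        sys.exit(1)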

@upsj self-assigned this May 21, 2023
@ginkgo-bot added labels reg:build, reg:testing, mod:core, reg:benchmarking, type:solver, type:preconditioner — May 21, 2023
@MarcelKoch self-requested a review May 22, 2023 12:11
@upsj changed the base branch from develop to benchmark_profiler_input_flags May 25, 2023 07:11
@upsj force-pushed the benchmark_profiler_input_flags branch from 12a4218 to a86eef7 June 1, 2023 10:15
Base automatically changed from benchmark_profiler_input_flags to develop June 4, 2023 13:53
@sonarcloud (bot) commented Jun 6, 2023

SonarCloud Quality Gate failed.

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 11 (C)
Coverage: 75.0%
Duplication: 0.0%

@MarcelKoch (Member)

Have you looked into https://json-schema.org/? This is a way to validate json objects, so it could be used instead of the manual checking currently done.

@upsj (Member Author) commented Jun 7, 2023

We don't have a fixed schema, because the input object can contain information from different benchmarks (matrix_statistics, spmv, solver, ...) and we only check for what is necessary.

However, I am not doing any validation in here (only sanitization to make the output deterministic); are you talking about the other PR?

@MarcelKoch (Member)

You are validating the benchmark output by comparing it to some predefined output. I think this is exactly the use case for this schema validation. Of course we would have different schemas for the different benchmark types. Here is an example for the blas schema (it might not be complete since I'm missing the other operations):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "n": {"type": "integer"},
      "blas": {
        "type": "object",
        "patternProperties": {
          "^copy|axpy|scal": {
            "type": "object",
            "properties": {
              "time": {"type": "number"},
              "flops": {"type": "number"},
              "bandwidth": {"type": "number"},
              "repetitions": {"type": "integer"},
              "completed": {"type": "boolean"}
            },
            "required": [
              "time",
              "flops",
              "bandwidth",
              "repetitions",
              "completed"
            ]
          }
        }
      }
    },
    "required": [
      "n",
      "blas"
    ]
  }
}

@upsj (Member Author) commented Jun 7, 2023

@MarcelKoch Schemas might be more resilient to change, but they require significantly more work to adapt to changes (we can't just regenerate the output), and they don't allow checking whether we left other parts of the input unchanged. On top of that, we have multiple types of output sections (text, DEBUG logger output, JSON arrays, JSON objects) that would need to be parsed.

@upsj added the 1:ST:ready-for-review label Jun 12, 2023
@upsj requested a review from a team June 12, 2023 07:41
@MarcelKoch (Member)

> @MarcelKoch Schemas might be more resilient to change, but they require significantly more work to adapt to changes (we can't just regenerate the output)

Not generating the expected output can also be a good thing, because otherwise we would keep errors from incorrect benchmarks. IMO, changing the schema would not be too much work in most cases. And I think the schema would be a more common approach to this kind of test.

> and doesn't allow checking whether we left other parts of the input unchanged. On top of that, we have multiple types of output sections (text, DEBUG logger output, JSON arrays, JSON objects) that would need to be parsed.

The schema would only be for the JSON arrays and objects. The other parts would be tested as they currently are. Although, tbh, I'm not sure what the value is of testing the debug logger output. But I need to think about that a bit more.

@upsj (Member Author) commented Jun 12, 2023

Having the reference output be part of the repository and checking stderr and stdout is a fundamental goal of this approach, because it lets us notice when something in the output changes, either because we changed something on purpose or because we caused an accidental change. Either way the test fails, and we need to explicitly change the reference outputs. The change also gets preserved in git, so you can see in the code review which parts of the benchmark (and, in the DEBUG case, its internals) changed and how.
Also, JSON schemata can't check questions like: does the benchmark preserve parts of the input that were ignored? Does the order of the output entries stay consistent?
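
As an illustration of that workflow, a test could compare both streams against committed files and only rewrite them when explicitly asked to (the update flag and file layout here are hypothetical, not necessarily what this PR implements):

import subprocess
import sys
from pathlib import Path

def check_or_update(command, reference_dir, update=False):
    # Run the benchmark and compare stdout/stderr against the committed reference files.
    result = subprocess.run(command, capture_output=True, text=True)
    for name, actual in (("stdout", result.stdout), ("stderr", result.stderr)):
        reference = Path(reference_dir) / f"{Path(command[0]).name}.{name}"
        if update:
            # Intentional change: overwrite the reference so the new output gets committed and reviewed.
            reference.write_text(actual)
        elif reference.read_text() != actual:
            print(f"{name} differs from {reference}", file=sys.stderr)
            sys.exit(1)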

@MarcelKoch (Member)

> Having the reference output be part of the repository and checking stderr and stdout is a fundamental goal of this approach, because it lets us notice when something in the output changes, either because we changed something on purpose or because we caused an accidental change. Either way the test fails, and we need to explicitly change the reference outputs. The change also gets preserved in git, so you can see in the code review which parts of the benchmark (and, in the DEBUG case, its internals) changed and how.

All of that would also be valid with the schema approach.

> Also, JSON schemata can't check questions like: does the benchmark preserve parts of the input that were ignored? Does the order of the output entries stay consistent?

I'm not sure what the first part means, but on the second part: do we really care about the output order? This would also depend on the JSON implementation and how it handles object properties. At least, I don't think there is a predefined order on the properties. (I think it would be possible to check the order of array entries.)

@upsj (Member Author) commented Jun 12, 2023

nlohmann-json and RapidJSON both support preserving the order of objects in the output, and a change in the order is usually the side-effect of a change in the code. Whether the side-effect was intended or accidental, it's still useful to be notified of it.
Preserving the input as much as possible can matter if we have a workflow running through multiple benchmarks operating on the same JSON array, e.g. [{"filename":"file.mtx"}] -> [{"filename":"file.mtx", "problem": {"rows":1, "columns":1, ...}}] -> [{"filename":"file.mtx", "problem": ..., "optimal": {"spmv": "csr"}}].
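
A small Python illustration of that pipeline (the stage functions and key names are made up for the example; the real benchmarks are separate programs operating on the same JSON):

import json

def run_matrix_statistics(case):
    # Hypothetical stage: adds its own section, leaves every other key untouched.
    case.setdefault("problem", {"rows": 1, "columns": 1})
    return case

def run_spmv(case):
    case.setdefault("optimal", {})["spmv"] = "csr"
    return case

cases = json.loads('[{"filename": "file.mtx"}]')
for stage in (run_matrix_statistics, run_spmv):
    cases = [stage(case) for case in cases]
# "filename" from the original input is still present after both stages.
print(json.dumps(cases, indent=2))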

@MarcelKoch (Member)

From our offline discussion: the schema-based approach will be put on the backlog, and, for now, the purely text-based approach is sufficient. Perhaps after #1323 would be a good point to reevaluate the usage of schemas.

@tcojean (Member) commented Jun 12, 2023

Could that be summarized somehow, or is it only the previous points? Either here or in Notion. I've also recently seen fruitful usages of JSON schema; the more info on the topic, the better.

@upsj (Member Author) commented Jun 12, 2023

The important bullet points for me were:

  • JSON schemas can do almost all we need to do here, including handling the different ways typenames get mangled between different compilers.
  • To me, the other outputs (banner, status messages, DEBUG logger output, ...) are at least as important as the JSON output, so I didn't want to put too much effort into this yet.
  • One thing JSON schemas can't do is check the order in which entries were added to the JSON object. This is an important descriptor of the execution control flow though, since it tells us when the order in which the different benchmark operations were run changes (see the sketch below).
  • The tooling around text diffs is really convenient; is there something similar for JSON schema violations?
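
Regarding the order point, a short Python example of why ordering only shows up in a textual comparison (purely illustrative):

import json

a = {"time": 1.0, "flops": 2.0}
b = {"flops": 2.0, "time": 1.0}
assert a == b                          # schema/dict-level comparison ignores insertion order
assert json.dumps(a) != json.dumps(b)  # a text diff of the serialized output does not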

@greole (Collaborator) left a comment

Not a full review, just some thoughts so far. In general, I think that if test_framework.py grows in size, using pytest is the better alternative. And I would assume that the whole sanitizing could be simplified by splitting the traversal of the dictionary/lists from the actual sanitization. Also, some docstrings would improve readability.

Collaborator:

I am a bit unsure which route to go here. In general, I would recommend using something like pytest (https://docs.pytest.org/en/7.3.x/) because it ships all kinds of useful functionality, but if the scope of these Python tests stays limited and doesn't increase over time, it might be OK.

@upsj (Member Author) commented Jun 20, 2023

Our containers currently don't have pytest installed, but long-term this might be useful. Does pytest provide any utilities for long text diffs? I don't see any right now.

Collaborator:

Good question, I don't know about long text diffs, but comparing dictionaries is quite good. And IMO that should be the preferred route anyway: read JSON into Python dicts (reference and test output) -> sanitize -> compare both dictionaries.
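
A sketch of that route with pytest (the file names and the sanitize helper are placeholders, not the actual framework):

import json
from pathlib import Path

def sanitize(value):
    # Placeholder for the sanitization step discussed in this PR.
    return value

def test_blas_benchmark_output():
    reference = sanitize(json.loads(Path("blas.reference.json").read_text()))
    actual = sanitize(json.loads(Path("blas.output.json").read_text()))
    # On failure, pytest reports which keys and values differ between the two dicts.
    assert actual == reference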

@upsj (Member Author) commented Jun 21, 2023

The thing I am interested in here is the text output; the JSON objects are far from the only important part. The only reason to parse the JSON input is that it is easier to sanitize the results in parsed form. Text diffs on pretty-printed JSON are really convenient; I'm not sure there is much to be improved by moving to dicts?

benchmark/test/test_framework.py.in (several resolved review threads)

return sanitize_json(value, sanitize_all)


def sanitize_json(parsed_input, sanitize_all=False):

@greole (Collaborator) commented Jun 12, 2023

The function name is a bit misleading, since the parsed input has already been parsed into a Python object. I would probably write a function like this:

import collections.abc

def recursive_apply(parsed_input: dict, sanitizer: collections.abc.Callable) -> dict:
    """Recurse into parsed_input and apply sanitizer to values that are not dictionary or list types."""
    ...

Then the sanitizer should also know whether sanitize_all was set. This way you could untangle the calls from sanitize_json to sanitize_json_single and potentially back to sanitize_json.

@upsj (Member Author):

This might be really helpful in case the sanitization needs to become more advanced, but I believe that with the complexity involved in such a stateful sanitizer object, the current implementation is still easier to maintain. I am also not sure we will need more complex sanitization in the foreseeable future, since we basically only need to get rid of a handful of individual floats, arrays of floats, and implementation-dependent strings.
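
For illustration, such a key-based masking step could look roughly like this (a sketch with made-up key names, not the implementation in this PR):

def mask_nondeterministic(value, keys=("time", "bandwidth", "flops", "residual_norm")):
    # Replace nondeterministic leaves with a stable placeholder so the output can be diffed.
    if isinstance(value, dict):
        return {k: "<masked>" if k in keys else mask_nondeterministic(v, keys)
                for k, v in value.items()}
    if isinstance(value, list):
        return [mask_nondeterministic(v, keys) for v in value]
    return value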

@greole (Collaborator) commented Jun 13, 2023

Regarding the json-schema discussion: https://pypi.org/project/jsonschema/ might be an option.
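
For reference, validating a parsed result against a schema with that package looks roughly like this (toy schema and data):

import jsonschema

schema = {
    "type": "array",
    "items": {"type": "object",
              "properties": {"n": {"type": "integer"}},
              "required": ["n"]},
}

jsonschema.validate(instance=[{"n": 100}], schema=schema)      # passes silently
# jsonschema.validate(instance=[{"n": "100"}], schema=schema)  # would raise ValidationError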

@upsj added the 1:ST:no-changelog-entry label Jul 19, 2023
@sonarcloud (bot) commented Jul 20, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)
Coverage: 80.0%
Duplication: 0.0%

@codecov (bot) commented Jul 20, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.91% 🎉

Comparison is base (187c25a) 90.35% compared to head (7214d88) 91.26%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1341      +/-   ##
===========================================
+ Coverage    90.35%   91.26%   +0.91%     
===========================================
  Files          598      598              
  Lines        50781    50706      -75     
===========================================
+ Hits         45881    46276     +395     
+ Misses        4900     4430     -470     
Impacted Files Coverage Δ
test/mpi/preconditioner/schwarz.cpp 100.00% <100.00%> (ø)

... and 25 files with indirect coverage changes


@upsj merged commit 1a9877b into develop Jul 20, 2023
14 checks passed
@upsj deleted the benchmark_tests branch July 20, 2023 07:27