
Define a verificationExperiment annotation for experiments meant to compare results for regression testing and cross-tool comparisons #3473

Open · casella opened this issue Feb 2, 2024 · 67 comments

Comments

@casella (Collaborator) commented Feb 2, 2024

Modelica models include an experiment annotation that defines the time span, tolerance and communication interval for a default simulation of the system. Usually, these parameters are set in order to get meaningful results from the point of view of the modeller. Since in many cases the models are affected by significant parametric uncertainty and modelling assumptions/approximations, it typically makes little sense to seek very high precision, say rtol = 1e-8, resulting in longer simulation times, when the results are affected by maybe 1% or more error.

We are currently using these values also to generate reference results and to run simulations whose results are compared to them, both for regression testing and for cross-tool testing. This is unfortunately not a good idea, mainly for two reasons:

  1. in some cases, the numerical errors exceed the tolerance of the CSV-compare tool, so we get a lot of false negatives, because different tools, different versions of the same tool, or even the same tool on different hw/sw platforms (see, e.g., OpenModelica/OpenModelica#11935, "The simulations of some PowerGrids models differ based on OS and CPU") lead to different numerical errors/approximations;
  2. some other cases feature chaotic motion (e.g. the Furuta pendulum or three-body problems) or large numbers of closely-spaced state events (e.g. all kinds of switched circuit models), whose triggering times inevitably tend to drift apart due to accumulating errors in determining the exact switching times.

For testing, what we need is to select simulation parameters which somehow guarantee that the numerical solution obtained is so close to the exact one that numerical errors cannot lead to false negatives, so that a verification failure really means something has changed with the model or the way it was handled to generate simulation code.

In both cases, what we need is to clearly differentiate between the role of the experiment annotation, which is to produce meaningful results for the end-user of the model, and that of some new annotation, which is meant to produce near-exact results for comparisons.

For case 1., what one typically needs is a tighter tolerance, and possibly a shorter communication interval. How much tighter and shorter depends on the specific case, and possibly also on the set of tools involved in the comparison - there is no one-size-fits-all number.

For case 2., it is necessary to choose a much shorter simulation interval (smaller StopTime), so that the effects of chaotic motion or the accumulated drift of event times don't have enough time to unfold significantly. Again, how much shorter depends on the participants in the game, and may require some adaptation.

For this purpose, I would propose to introduce a verificationExperiment annotation, with exactly the same arguments as the experiment annotation, to be used for generating results for verification. Of course, if some arguments (e.g. StartTime) are not supplied, or if the annotation is missing altogether, the corresponding values of the experiment annotation (or their defaults) will be used instead.
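To make the proposal concrete, here is a minimal sketch of how it could look on a model (the verificationExperiment name and all numbers are purely illustrative, not agreed syntax):

model SwitchedCircuitDemo "Hypothetical example with separate verification settings"
  // ... model content ...
  annotation(
    // settings meant for the end user: meaningful results, no excessive precision
    experiment(StartTime = 0, StopTime = 10, Tolerance = 1e-4, Interval = 1e-2),
    // settings meant only for reference generation and cross-tool comparison:
    // shorter run and tighter tolerance; omitted arguments fall back to experiment
    verificationExperiment(StopTime = 0.5, Tolerance = 1e-7, Interval = 1e-4));
end SwitchedCircuitDemo;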

@casella added the enhancement and discussion labels and the ModelicaSpec3.7 milestone on Feb 2, 2024
@HansOlsson (Collaborator)

I agree that such information is useful, and ideally it should only be needed for a fraction of the models.

I believe that what we have used for Dymola is, instead of modifying the package, to keep that information externally - and also to loosen the tolerance so that we don't get false positives; but I will check.

As I see it, that has a number of advantages:

  • No need to push it up-stream to library developers.
  • No need to decide which tool has the correct idea about such tolerances, since the information is kept externally, outside the library.

@henrikt-ma (Collaborator)

I'm not 100% happy with this direction, as it means that the default will be to use the settings from the experiment-annotation, meaning that we don't automatically detect that decreasing tolerance by an order of magnitude has significant impact on the result.

I see good reasons to generate reference results according to a procedure where tolerance is automatically set an order of magnitude lower than that in the experiment, so that CI correctness tests running with the experiment settings can detect when the experiment is inappropriate.

I can imagine there could be rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this. Similar to what @HansOlsson says regarding Dymola, we at Wolfram also keep information for overriding the experiment in files that are separate from the Modelica code, with the ability to control both reference result generation and settings for running tests (used in rare situations as an alternative to using very sloppy comparison tolerances when the experiment settings are considered really poor).

A related topic is the use of different StopTime for regression testing and for model demonstration, but this might be better handled with a TestStopTime inside the normal experiment.

@christoff-buerger (Member) commented Feb 2, 2024

I honestly would not introduce any new annotations etc. to the language. In my opinion, Modelica already has all the means one needs to achieve this, e.g., inheritance and modification. What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.

For example, you can have basic user experiments in your library (i.e., examples) and, in a separate regression test library, refine these experiments with different settings (change tolerances, stop times, whatever). In this setup, the regression testing setup is not part of the final library distribution, which is likely how it should be (because you might add a bunch of scripts and other software for continuous integration support, resulting in huge dependencies, to make your automatic tests run). You just don't want to ship all of that. And as @henrikt-ma mentions, good practice at the tool vendors is already to have separate settings for this.
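A minimal sketch of this extends-based refinement, assuming a hypothetical MyLibrary with an example model and a separate regression test library (all names are illustrative):

package MyLibraryTests "Separate regression test library, not shipped with MyLibrary"
  model MyExampleRegression
    extends MyLibrary.Examples.MyExample;  // reuse the user-facing example unchanged
    annotation(
      // override only the settings needed for reliable regression testing
      experiment(StopTime = 1.0, Tolerance = 1e-7, Interval = 1e-3));
  end MyExampleRegression;
end MyLibraryTests;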

I think testing is a matter of project policies and templates. MSL can come up with a good scheme for its use case and improve its infrastructure without any need for language changes.

@casella (Collaborator, Author) commented Feb 3, 2024

I respectfully disagree with @henrikt-ma and @christoff-buerger:

  • the choice of specific simulation parameters for regression testing is ultimately the responsibility of the library developer, not of the tool vendors. I understand this is only necessary for a minority of cases (maybe 2-5%, most cases work fine with default settings), but if one aims at seeing 100% verification, this will be necessary
  • tightening the tolerance is not enough for examples that show chaotic motion or have a large number of events. In those cases, the only hope to match the results across tools is to keep the simulation short. On the other hand, if one wants to demonstrate chaotic motion, of course the experiment annotation must be long enough to actually show that. So, there are clearly conflicting requirements for the experiment annotation and for the verificationExperiment annotation, hence the need for two separate annotations
  • regarding the MSL, it has been common practice for 25 years to use all runnable examples of the Modelica library for regression testing. If we had infinite resources, we could stop doing that and develop another library with 500+ test cases that are explicitly set up for verification and regression testing only. Unfortunately, we don't have those resources, since MAP-Lib is not a for-profit tool vendor. I believe adding a verificationExperiment annotation to the few dozen test cases that really need it would be a much more practical way to achieve the same goal
  • if we want (as we plan to) to extend verification and qualification services by the MA to all open-source libraries, not only the MSL, the lack of resources would become an even more important issue
  • all this of course does not prevent tool vendors from performing whatever kind of regression testing they want with their own tools. Tool vendors are still free to check what happens if the experiment annotation tolerance is reduced by a factor of 10. The aim here is for library developers to have a means to declare what is a meaningful experiment for cross-tool verification, in a way that is as much as possible tool-independent. Once again, this not only concerns the tolerance, but all the simulation parameters: StartTime, StopTime, Interval, and Tolerance. And possibly more that we still don't have, e.g. suggesting specific solvers that have certain stability regions, which is necessary in some advanced applications

Regarding this comment

What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.

it is true that each MA project has its own responsibility, but the whole point of the MA is to provide coordinated standards and libraries. In this case, I clearly see the need for a small language change to support the work by MAP-Lib. BTW, we are talking about introducing one new annotation, which is no big deal. We could actually introduce a vendor annotation __MAP_Lib_verificationExperiment, but this really seems overkill to me; can't we just have it as a standard Modelica annotation?

@GallLeo, @beutlich, @dietmarw, @AHaumer, @hubertus65, I'd like to hear your opinion.

@GallLeo (Collaborator) commented Feb 4, 2024

First of all: We need something for MSL, but the standardized result should be easy to adopt by all library developers.
With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the first time. As soon as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool testing.
For MSL testing, many implicit rules are used. They will not be adopted by independent library developers.

I had a look at the "heuristics" proposed 10 years ago, when I started creating the first set of reference results:
https://raw.githubusercontent.com/wiki/modelica/ModelicaStandardLibrary/MSLRegressionTesting.pdf#page=11
Takeaway: These rules worked quite fine for most test cases (at least much better than if we had used tolerance 1e-4 everywhere). Back then, I was not able to use a tight tolerance everywhere, because some models would only simulate with their default tolerance. Are these models to be considered "bad models"? Or should we be able to test any working model?

After the 4.0.0 release, @beutlich documented his adaption of tolerances in the MSL wiki:
https://github.com/modelica/ModelicaStandardLibrary/wiki/Regression-testing-of-Modelica-and-ModelicaTest
Takeaway: Finding the "right" strict tolerance for MSL examples still needs manual work.

So, in order to make it transparent, I'm in favor of explicitly specifying verification settings.
But then we have to limit the manual work for setting these up as much as possible:
If an example/test case is changed, should the verification settings be updated automatically?
Example: If you change the stop time of an example from 1000 s to 2000 s, interesting dynamics could be missed if you keep your verification stop time at 1000 s.

Where to specify the explicit verification settings?

@christoff-buerger proposed using extends.
We could do that in ModelicaTest, without manually writing a new library:
ModelicaTest.Modelica...MyExample extends Modelica...MyExample annotation(experiment(...));
Benefit: All MSL-test cases would reside in ModelicaTest (no need to run two libraries)
Drawback: CI or library developer has to generate the extends-test-case in ModelicaTest.
If the test case for a new example is not generated, it will be missed in regression runs.
If the experiment setup of an example is changed, it will not be automatically updated in the test case.

@HansOlsson proposed storing the verification setup externally.
This might work, as long as you test in one Modelica tool and this Modelica tool cares about the verification setup.
I'm unsure how to handle it in multi-tool environments.
We already have the burden of updating Resources/Reference/Modelica/.../MyExample/comparisonSignals.txt

Separate files are very hard to keep up to date; therefore, storing verification settings in the example/test case seems to be the right place to me. Especially if we think about using signals from the figures annotation as important comparison signals.

I would add arguments to experiment instead of a new verificationExperiment (a sketch of this alternative follows below).
Reasons:

  • It's easy to keep the default experiment settings and the verification experiment settings in sync.
  • There are tools adding vendor-specific parts to the experiment annotation in order to "nail down" the settings for a specific tool (e.g. __Dymola_Algorithm="Cvode"). The used integration algorithm is very important as soon as you are not only testing for correctness ("exact solution" of the DAE). Most people also test for the run time of their test cases.
    Drawback: Complicated experiment annotations in the MSL release could distract new users.
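A rough sketch of what such additional arguments inside experiment might look like (the verification* names are purely illustrative, not existing Modelica syntax; __Dymola_Algorithm is the existing vendor-specific example mentioned above):

annotation(
  experiment(
    StopTime = 1000, Tolerance = 1e-4, Interval = 1,           // user-facing settings
    verificationStopTime = 100, verificationTolerance = 1e-7,  // hypothetical verification settings
    verificationInterval = 0.1,
    __Dymola_Algorithm = "Cvode"));                             // vendor-specific solver pinning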

@beutlich (Member) commented Feb 4, 2024

I am pretty much aligned with @GallLeo here.

Example models

Using example models' simulation results as reference results for cross-tool, cross-solver or cross-MSL-version regression testing can be considered a misuse. In strict consequence, there should be no directory in Modelica/Resources/Reference/ at all.

Reference models

Reference models should be taken from ModelicaTest. If example models from Modelica shall also be considered for regression testing, it might be an option to extend from them and adapt the experiment settings. This would also simplify the job of the regression tester, since only one library needs to be taken into account.

I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.

  1. Specify reference signals outside the model in ModelicaTest/Resources/Reference/ModelicaTest/ and take solver settings from the experiment annotation. This is status-quo (for models in ModelicaTest). It is error-prone in the sense that signal files can be missing or wrong (duplicated or invalid signal names).
  2. Specify both reference signals and solver settings in the model. Yes, that could distract MSL users, even if only in ModelicaTest.
  3. Specify both reference signals and solver settings outside the model. The only requirement we have is that it should be a machine- and human-readable text-based file format (say TXT (as now for the comparisonSignals.txt) or some more structured YAML with comments and syntax highlighting). The main advantage I see is that I do not need a Modelica parser or editor to obtain and modify these settings; it's all there in a specified file/directory location which can be controlled from the test runner engine and easily format-tested by linters/checkers. Of course, it again is error-prone in the sense of missing files or wrong signals. (You might compare it to the Modelica localization feature, where we also keep the localization outside the library itself in a dedicated directory location.)

Summary

I am in favour of keeping reference models outside Modelica and only using ModelicaTest for them. I am also in favour of not having reference signals and experiment settings distributed across various locations, i.e., options 2 or 3 are my preferences. I see even more advantage in option 3. In that case it is left to specify (if at all)

  • an optional annotation where to find these regression settings files
  • how to name them
  • the file format and structure

I still need to be convinced that we need to have this in the specification, or whether it is simply up to the library developers of ModelicaTest. (Remember, it is now all in ModelicaTest and not any more in the MSL.)

@dietmarw (Member) commented Feb 5, 2024

I would also be in favour of having the verification settings separate from the simulation model but referenced from within the simulation model. So basically @beutlich's option 3.

@HansOlsson (Collaborator)

First of all: We need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the first time. As soon as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool testing. For MSL testing, many implicit rules are used. They will not be adopted by independent library developers.

I agree that many of those rules will not be adopted by all libraries.

Additionally, to me, use-cases such as this one are among the reasons we want the possibility to have the information completely outside of the library, so that the user of the library can add the things needed for testing without changing the upstream library.

@AHaumer commented Feb 5, 2024

IMHO @casella is right with his analysis and his idea. I'm not fond of storing the comparisonSignals.txt and separate settings for comparison / regression tests in addition to the model. It is much better to store this information (stop time, interval length, tolerance, what else?) within the model, either in a second annotation or as part of the experiment annotation. If not present, the settings of the experiment annotation are valid (in most cases). Additionally, we could specify the tolerance for comparison. The model developer decides whether changes for comparison / regression tests are necessary or not, maybe after receiving a message that comparison / regression tests are problematic (could be from tool vendors). As this annotation would be supported by most tools, this could be adopted by all third-party libraries.
Regarding test examples:
I think it's a good idea to use all "normal" examples that are intended to demonstrate "normal" usage, and additionally "weird" examples from Modelica.Test.

@maltelenz (Collaborator)

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.

Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice, if the tests run with some special configuration.

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

I also want to make people aware of the TestCase annotation we already have, which is a candidate for a place to add test related things.

@HansOlsson (Collaborator)

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.

I agree with testing what you ship (as part of testing; you can add more), and even this case involves two different things:

  • Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash
  • Comparing it with the reference (will be outside the bounds if using original end-point)

Note that there can be other reasons than chaos for separating testing from running, e.g.:

Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:

Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.

(Note that the experiment annotation doesn't have a number-of-intervals argument, but Interval=6e-5 corresponds to the above (3 s / 50000 output intervals = 6e-5 s), and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much, leading to weird effects.)
I don't know why the StopTime is 1.01 s instead of 1 s.

@maltelenz (Collaborator)

I agree with testing what you ship (as part of testing; you can add more), and even this case involves two different things:

* Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash

* Comparing it with the reference (will be outside the bounds if using original end-point)

Thank you for clarifying this. I agree the test (in a perfect world) in this case should include both these steps.

Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:

Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.

(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.)

As a user of the model, I would not want to see the weird under-sampling of the engineTorque either. How am I supposed to know if it is the actual behavior?

The file size issue for testing can be dealt with by the tool only storing the variables you need for the comparison. For the user, if we introduce figures in the model, tools could be smarter there as well and only store (by default) what you need for plotting (and animation).

@henrikt-ma (Collaborator)

I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.

At Wolfram, where we have somewhat extensive use of figures in the models, we have the default policy to simply use all variables appearing in figures as comparison signals (@GallLeo already briefly mentioned this idea above). It has several advantages:

  • No separate comparisonSignals.txt files to keep track of.
  • What is visible in the figures is much more likely to have been verified by the engineer behind the model.
  • What is visible in the figures has little risk of being ill-defined internal helper variables that could be problematic for correctness testing.
  • Variables that are interesting for comparison, but which don't fit into any of the key figures of a model can be placed in figures under a suitably named Figure.group (Regression Testing, for instance).

One small limitation of the approach is that there is no way to say that a variable in a figure should be excluded from correctness testing. I'm not aware of any model where we have needed this, but still… As @maltelenz suggested, the TestCase-annotation could be the perfect place for this sort of information. For example:

annotation(TestCase(excludeVariables = {my.first.var, my.arr[2].y}));

@henrikt-ma (Collaborator) commented Feb 8, 2024

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

I think it is well understood that there isn't one Tolerance that will work for all models. What I'm afraid is less well understood is that it is very hard to detect this when reference results are generated with the same tolerance which is used when running a test. To avoid this, I'd like something like the following:

annotation(TestCase(experiment(toleranceFactor = 0.01)));

The default for toleranceFactor should be something like 0.1, and I would expect that there would rarely be a reason to override the default; when there is a correctness error due to tolerance settings, one would fix this by adjusting the Tolerance, not by using a toleranceFactor closer to 1.

@mwetter commented Jun 7, 2024

I am also in favor of adding a new annotation, or an entry to the experiment annotation, that allows specifying data needed to produce result files for cross comparison. Our current setup is to simply increase the tolerance by a factor of 10, but such a global policy turns out to be unsatisfactory. In my opinion, it is the modeler's responsibility to add this information. Most of our contributors are not "professional" developers but mainly researchers or students who use the library and occasionally develop contributions, so asking them to put all information inside the .mo file would be preferred.

Therefore, having model-specific specifications for the CI tests would be valuable. These specifications should include:

  • what tolerance to use for normal CI testing, as a user would run it
  • what tolerance to use for tool cross-checking
  • what trajectories to test in normal CI testing
  • what trajectories to exclude from the results of tool cross-checking if they are very noisy
  • what change in StopTime to use for tool cross-checking if a model is chaotic
  • this could be augmented with (vendor-specific?) entries for specific tools, for example to not test a certain model in tool X because the tool does not yet work for that model, or to increase the tolerance only for tool X. This is information that is needed in our tests.

Currently we have (some of) this information spread across different files for about 1700 test models. This is due to historical reasons. It would be good to refactor this, as the current situation makes it hard for new developers to figure out what is to be specified where, and it makes it harder to explain how to port changes across maintenance branches. Having this inside the .mo annotation is my preferred way; a rough sketch follows below.
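To make this concrete, a hedged sketch of how such model-embedded CI information could look if gathered in one annotation (all field names and values are illustrative assumptions, not an agreed or existing design):

annotation(
  experiment(StopTime = 86400, Tolerance = 1e-6),
  TestCase(
    verificationTolerance = 1e-8,             // hypothetical: tolerance for cross-tool checking
    verificationStopTime = 3600,              // hypothetical: shorter run for a chaotic/noisy model
    compareVariables = {TRoom.T, QHea_flow},  // hypothetical: trajectories checked in normal CI
    excludeVariables = {noisySensor.y},       // hypothetical: excluded from cross-tool comparison
    __ToolX_skip = true));                    // hypothetical vendor-specific escape hatch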

@HansOlsson (Collaborator)

This seems quite important; we can mention it at the next phone meeting - but we will work a lot on it outside of those meetings as well.

@henrikt-ma (Collaborator)

This seems quite important; we can mention it at the next phone meeting - but we will work a lot on it outside of those meetings as well.

Please just make sure to keep the language group in the loop. Is it the MAP-Lib monthly meeting one should participate in to engage in the work, or some other forum?

@casella (Collaborator, Author) commented Sep 17, 2024

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

Conceptually speaking, it's both things, as also mentioned by @mwetter in his comment below. Whether we want to put the information about how to create the reference data in the same annotation or in a different annotation is just a matter of taste, but it does not change the fundamental requirement (for me), i.e., that all the relevant information for testing a model should be embedded in the model through annotations, not spread over other files. We already have this concept for documentation; I don't really see why testing should be different.

Regarding storing the names of the variables to be used for automated comparisons (possibly using regexps to keep them compact), I also think they should have a place in an annotation, rather than in a separate file, for the same reason the documentation is embedded in the model and not stored elsewhere. I agree with @henrikt-ma that we should make good use of the information provided by the figures annotation by default, because in many cases variables that you want the user to see are also good for checking whether the simulation results are correct. But these two requirements are not necessarily identical (see my next comment), so there should be means to declare that some plotted variables should not be used for testing, or that some more variables should be tested that are not plotted.

I can imagine there could be rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this

@henrikt-ma with cross-tool checking of the Buildings library, this happens in a significant number of cases, and it's really a nuisance if you want to make the library truly tool-independent with a high degree of dependability. @mwetter can confirm. The solution, so far, was to tighten the tolerance in the experiment annotation, e.g. to 1e-7, for all those cases. This is really not a clean solution, because those simulations become unnecessarily slower and, most importantly, the difference in the obtained results is small, much smaller than the modelling errors inherent in the simulation model, which of course has lots of modelling approximations. As I'll argue once more in the next comment, the requirements for human inspection are completely different from the requirements for automatic cross-tool checking.

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

@maltelenz, I don't see the point of using a tighter tolerance for the reference creation, and then using a sloppy one to generate results that will fail the verification test because of numerical errors. If you compare results obtained with different tools (and, mind you, the reference result will be one of them!), of course the tolerance should be the same.

@casella (Collaborator, Author) commented Sep 17, 2024

I re-read all the comments, and I noticed that there are two rather different opinions: according to the first one, the information to run cross-tool verifications should belong to completely separate entities, e.g. separate libraries such as ModelicaTest; according to the other, the information should be embedded in the models themselves. Let me add some extra considerations about that, which I hope can further clarify the issue.

  1. The goal of libraries such as ModelicaTest is not (necessarily) to run cross-tool comparisons. It is (ideally) to test each reusable component in the library at least once, and each modelling option of such components at least once, using test models that are as simple as possible and whose outcome is somewhat predictable (ideally, a closed-form analytic solution should be available). In other words, the goal of ModelicaTest is to demonstrate that the component models are correct and do what they are expected to do. From this point of view, the requirement that we have for the MSL is that if you develop a new component, you should also add tests in ModelicaTest that demonstrate its correct implementation. Conversely, the goal of the Examples sub-package is to demonstrate the use of the library to model real systems. They are usually simple ones, but they are complete systems, not test benches for individual components.
  2. From this point of view, I respectfully but completely disagree with @beutlich's statement "Using example models' simulation results as reference results for cross-tool, cross-solver or cross-MSL-version regression testing can be considered a misuse". Examples of real systems are the actual use cases where it is of the utmost importance that different tools produce the same result. Even more than the output of the component tests.
  3. IMHO we have two completely orthogonal partitions of test cases. On one axis, we have models that test individual components (contained in ModelicaTest) vs. models demonstrating complete systems (contained in Modelica.XXX.Examples). On the other axis we have simulations meant to be inspected by humans to understand the behaviour of a component or system vs. simulations meant to automatically compare the output of different tools to catch regressions or tool implementation bugs. The first axis is handled by different libraries, the second can be handled by different experiment annotations. Unless we want to have four libraries, one for each combination of these aspects, which doesn't really seem a reasonable proposition to me.
  4. As I already mentioned in my initial post, the requirements for system simulations to be inspected by humans, e.g. for design purposes, are fundamentally different from the requirements for system simulations meant for automatic cross-check and regression testing. On one hand, the former need not be super-accurate, because all models are approximated, so there is no point running very slow simulations with Tolerance = 1e-8 if the model has a 10% uncertainty; additionally, they may legitimately show behaviour, such as chaotic motion (the double pendulum) or extremely large number of oscillations (the EngineV6 model) or event-triggered commutations (AC/DC converter models), that are interesting per se but hard to reproduce exactly across tools. On the other hand, if you want to do regression or cross-tool checking, you need to run simulations which are as accurate as humanly (machinely?) possible, so that failures in verification can be attributed with certainty to tool issues and not to numerical errors; this requirement is completely different from the requirement of getting accurate enough simulations given the approximations of the model. Notice that this conceptually holds both for system examples (as in Modelica.XXX.Examples) and component tests (as in ModelicaTest).

Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice, if the tests run with some special configuration.

I agree with that, with a few important remarks

  • the extremely rare cases are rare, but not extremely so. Based on my many years of experience trying to match OpenModelica's results with Dymola's, it's probably something like 2-5% of the test cases. If we want to adopt a tool-neutral approach to reference result generation and regression testing for the MSL, while eventually achieving a 100% success rate with all tools, they need to be handled properly.
  • this argument is in fact in favour of having a specific verificationExperiment annotation, only for those few cases that need it. Currently, we always generate the reference trajectory and do regression testing with a 10x smaller tolerance by default, which is not enough in some cases, but in most other cases actually does what @maltelenz argues against, i.e., it doesn't test what the user experiences. If we had such a verificationExperiment annotation, we could use it for the (relatively few) examples where different settings are really needed in order to do reliable automatic testing, and use the default experiment annotation in all other cases
  • BTW, nothing would prevent tool vendors from doing regression and cross-tool checking both with the regular experiment annotation and with the verificationExperiment annotation, if both are present. It goes without saying that numerically tricky models may legitimately fail the former and pass the latter, for the reasons argued above.

I also want to make people aware of the TestCase annotation we already have, which is a candidate for a place to add test related things.

Good point.

@bilderbuchi commented Sep 17, 2024

I noticed that there are two rather different opinions: according to the first one, the information to run cross-tool verifications should belong to completely separate entities, e.g. separate libraries such as ModelicaTest; according to the other, the information should be embedded in the models themselves.

Just a quick remark: this division is not terribly surprising to me, because it can also be observed in general in programming languages' testing tools -- in my personal experience there seem to be (at least) two major "schools":

  • Tests-with-implementation puts the code/information needed for testing close to the tested implementation, i.e. typically in the same file. Examples are GoogleTest, Catch2 or doctest.
  • Tests-separate collects the testing code in a place separate from the implementation being tested, e.g. in a separate module or folder. Examples are Python's pytest or unittest, or Julia's built-in unit testing.

I'll refrain from discussing respective pros/cons, since much of this comes down to personal preference, and to avoid bloating the discussion here. I just want to point out that the two approaches have widespread use/merit, and which one is "better" depends on a load of factors, probably.

As far as my experience is concerned, what sets Modelica somewhat apart from the ones above is the noticeable dominance of "reference testing"/golden master/regression testing. The approach of asserting smaller facts about models' behaviour, e.g. like XogenyTest proposed, seems to have very little to no use from my vantage point.
I guess much of this is owed to the specific problem domain (in essence, time-dependent ODE simulation), but I'm wondering if an assert-oriented testing approach (i.e. testing specific facts about a component/model, not the whole set of time traces) could alleviate some of the problems encountered?
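For illustration, an assert-oriented check could look roughly like this in plain Modelica (the model, variable and numbers are hypothetical, and this is not tied to XogenyTest or any particular framework):

model MyExampleCheck "Assert a specific fact about the solution instead of comparing full time traces"
  extends MyLibrary.Examples.MyExample;  // hypothetical example model
equation
  when terminal() then
    // fail the run if the terminal value drifts away from the expected one
    assert(abs(pendulum.phi - 0.42) < 1e-2,
      "pendulum.phi deviates from the expected terminal value");
  end when;
end MyExampleCheck;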

@maltelenz (Collaborator)

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

@maltelenz, I don't see the point of using a tighter tolerance for the reference creation, and then using a sloppy one to generate results that will fail the verification test because of numerical errors. If you compare results obtained with different tools (and, mind you, the reference result will be one of them!), of course the tolerance should be the same.

I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements. The idea is to have this perfect solution in the reference result. The numerical variations from different tools and the sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the testing tolerance considered acceptable.

If we instead generate the reference result with a sloppy tolerance from the experiment, it could already have results that are on the "edge" of the "band" around the perfect solution that we consider acceptable. If a different tool then gets results that are the same distance away from the perfect solution, but on the other edge of the band, it will fail the test.

@casella (Collaborator, Author) commented Sep 17, 2024

As a last remark (I apologize for flooding this ticket), it is my goal as the leader of MAP-Lib to eventually provide an MA-sponsored infrastructure that can test open-source libraries (starting from the MSL and ModelicaTest, of course) by running all models that can be simulated with all Modelica tools, automatically comparing their results and visualizing the outcome of the comparisons in a web interface. This would allow the library developer(s) to easily inspect the results, and eventually pick the results of any one tool as reference results, based on his/her expert judgement, so that he/she is not limited in the development to the availability of tool(s) for which he or she has a license. It will also allow the library users to pick the best tool(s) to run a certain library. I believe this is the only way we can truly foster the concept of tool-independent language and libraries, moving it from theory to practice.

In order for this thing to work, we have to lower the entry barrier as much as we can, so that we get as many libraries as possible in. As I envision it, one could start by asking for his/her library to be tested. He/she will then be able to compare the results with different tools, which may also indirectly point out models that are numerically fragile, and eventually select reference trajectories among the ones that were computed, not necessarily with the tool(s) that he or she has installed on his computer. In most cases, assuming the tools are not buggy and that the model is kosher with respect to the language spec (a mandatory requirement IMHO), selecting any specific tool result obtained with the default experiment annotation as a reference will cause all other tool results to fall within the CSV-compare tubes, so everybody's happy.

For a few corner cases (2-5%?) it will be clear that the model is numerically trickier, so the library developers may introduce a verificationExperiment annotation to determine tight enough conditions that allow to reduce the numerical errors below the CSV-compare tool tolerance, so that all tool results are close enough. In some cases, it will be clear that the results of some tools are plain wrong, and this will provide useful information to tool developers for bug fixing. Some other times, every tool could give a different result, which may be a strong indication of a numerically fragile model, which is useful information for the library developer. This is also an essential feature for MSL development: developing the "Standard" library with a single tool is not really a viable proposition, the developer needs as much feedback as possible from all MLS compliant tools to develop something truly standard.

Once this process has converged, the library will be usable with all tools and will have all the information embedded within it to allow continuous testing of this property, which will be publicly demonstrated on the MA servers. I believe this would be a very strong demonstration of the power of the Modelica ecosystem.

Now, the crucial point to make this dream come true is that the additional effort for the library developer to enter this game will need to be as low as possible, otherwise this is simply not going to happen.

With my proposal, and with some reasonable heuristics to pick meaningful comparison variables absent their indication (e.g. only the state variables for dynamic models) one can run cross-tool comparisons of an open-source library with zero additional effort by the library developer. The idea is that the majority of tested models will be fine, and it will only be necessary to provide a few custom annotations for the critical models, e.g. selecting specific variables for testing or changing the verificationExperiment annotation.

BTW, this task is conceptually under the responsibility of the open-source library developer(s), who however may not have enough time or motivation to take care of it. What is nice is that other parties (i.e., the Modelica community) could help in this process by means of pull requests to the library code base that introduce such annotations where needed. These pull requests with a few added annotations can preliminarily be fed to the MA testing infrastructure, so that the library developer can see the result and accept the PRs with just one click of the mouse, if the results are fine. Which means he or she can easily put in the judgement, without much effort. This is how we'll get things done. We could get students involved in this work, which would also promote Modelica with the younger generations.

IMHO this kind of process has a much higher chance of practical success than a process where we expect that Library Officers or open-source Library Developers (who are usually volunteers doing this in their spare time) have unlimited time and resources to set up elaborate testing libraries with ad-hoc developed test models, elaborate rules, multiple files, scripts and whatnot. Ideally, this may be the best option, but in practice, it is never going to happen. We need a viable path to achieve the dream I envisioned, and I believe that the verificationExperiment annotation, alongside annotations for selecting the reference variables, is a key feature for that.

My 2 cts as MAP-Lib leader. 😃

@casella (Collaborator, Author) commented Sep 17, 2024

I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements.

I agree, except for the case of chaotic systems, for which there are theoretical reasons why even a very, very tight tolerance doesn't work in the long term, due to exponential divergence of very close trajectories.

The idea is to have this perfect solution in the reference result. The numerical variations from different tools and the sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the testing tolerance considered acceptable.

"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?

Anyway, the problem here is that a solution with a 2% error, obtained from a model that has 10% uncertainty on key parameters (which is pretty normal for thermal systems) may be perfectly fine for all practical purposes, except automated verification, for which we have (rightfully) set a 0.2% relative tolerance. One thing is to get a result which is good enough for some application, another thing is to verify that two different tools give the same result if the numerical approximations are good enough.

Why should we unnecessarily use a tighter tolerance in the experiment annotation, hence longer simulation times, to stay within the CSV bounds, which have no consideration for the model uncertainty?

The fundamental issue addressed by this ticket is that the requirements for simulations are completely different whether you want to use the results to make decisions about a system being modelled, or you want to use the simulation result for cross-tool and regression checking. Different requirements lead to different simulation setups. Hence, different annotations to specify them.

@HansOlsson (Collaborator)

I would partially agree with @beutlich's comment.

To me, ModelicaTest is not only models testing each component once, but is intended as unit tests of components; so if there's a weird flag, there should be a test model for that. As always, coverage and testing go hand in hand.

It might be good if we also added some "integration tests" (in the testing meaning) - and to me that would fit naturally in ModelicaTest, but perhaps separated in some way. However, I understand that we might not have resources for that.

In contrast, the Examples models in MSL are primarily constructed as Examples demonstrating how to use (and sometimes not use) the library. Thus in one sense using them for (cross-tool-)testing is misusing them, but on the other hand we have them and they should work reliably (at least in general: see #2340 for an exception, and as discussed we also have chaotic systems as another exception) - so why not test them as well?

However, we should still keep the mind-set that Example-models are examples - not testing-models.

After thinking through this I think that one conclusion is that since the Examples aren't primarily for testing and they shouldn't be the only tests for the library, it should be ok to reduce the simulation time considerably for specific Example-models (due to chaotic behavior or too large results).

Whether such information is stored in the library or separately is a somewhat different issue. Both should work in principle, but I could see problems with getting agreement on the exact levels for different tools - and thus the need to add something outside of the library for specific cases and tools - which means that if we start inside the library we would have both.

@henrikt-ma (Collaborator)

However, I don't see that such a discussion is relevant here.

My point is that we must have realistic expectations on how far we can get with cross-tool verification when we haven't even agreed upon what Tolerance really applies to.

@casella (Collaborator, Author) commented Sep 18, 2024

My point is that we must have realistic expectations on how far we can get with cross-tool verification when we haven't even agreed upon what Tolerance really applies to.

@henrikt-ma my expectations are based on several years of experience trying to match Dymola-generated reference values with OpenModelica simulations on several libraries, most notably the MSL and Buildings, which contain a wide range of diverse models (mechanical, thermal, electrical, thermo-fluid, etc.). The outcome of that experience is that the current set-up works nicely in 95% of the cases, but then we always get corner cases that need to be handled. If we don't, we end up with an improper dependence of the success of the verification process on the tool that actually generated the reference data, which is something we must absolutely get rid of, if we want the core statement "Modelica is a tool-independent modelling language" to actually mean something.

I am convinced that my latest proposal will enable to handle all these corner cases nicely. Of course this is only based on good judgement, I have no formal proof of that, so I may be wrong, but I guess we should try to make a step forward, as the current situation is not really tenable for MAP-Lib. If this doesn't work, we can always change it, the MLS is not etched in stone.

As to the meaning of Tolerance, the MLS Sect. 18.4 defines it as:

the default relative integration tolerance (Tolerance) for simulation experiments to be carried out
which is a bit (I guess deliberately) vague, but widely understood as the relative tolerance on the local estimation error, as estimated by the used variable-step-size solver. In practice, I understand that parameter is just passed on to the relative tolerance parameter of integration routines.

The point of this parameter is not to be used quantitatively, but just to have a knob that you can turn to get more or less precise time integration of the differential equations. In most practical cases, experience showed that 1e-4 was definitely too sloppy for cross-tool verification, 1e-6 gives satisfactory results in most cases, but in some cases you need to further tighten that by 1 to 3 orders of magnitude to avoid numerical errors playing too big a role. That's it 😅.

@casella (Collaborator, Author) commented Sep 18, 2024

  • StopTimeVerification = StopTime to be used both for reference generation and verification simulations

If placed inside the TestCase annotation, it could be called just StopTime.

Sounds good. KISS 😃

  • ToleranceReference = 0.1*Tolerance to be used for reference generation

The point of having a (Reference)ToleranceFactor = 0.1 instead is that if you override the default and then, much later, decide to use another Tolerance, you avoid the risk of suddenly having the same tolerance for reference generation as for verification, hence preserving the sanity check that Tolerance is reasonably set.

I understand the requirement, but I'm not sure this is a good solution in all cases. If the default Tolerance = 1e-6, it makes perfect sense to generate reference results with Tolerance = 1e-7. But if we need to tighten it significantly, e.g. Tolerance = 1e-8, it may well be that Tolerance = 1e-9 is too tight and the simulation fails. Happened to me many times with numerically challenging thermo-fluid models. Per se, I don't think that the reference should necessarily be generated with a tighter tolerance than the verification simulation, if the latter is tight enough. Do I miss something?

That was the reason for specifying the tolerance for reference generation directly, instead of doing it with some factor. At the end of the day, I guess it doesn't matter much and it's mostly a matter of taste; in fact, GUIs could take care of this aspect, like letting the user give the number of intervals, which is then translated into the Interval annotation by computing (StopTime - StartTime)/numberOfIntervals.

  • ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens up for having inadequate Tolerance set in models.

This was @maltelenz's requirement. He wants to run one verification simulation with the Tolerance set in the experiment annotation. If the results fall within the 0.2% CSV-compare bounds (which is 95% of the time), the verification is passed. Otherwise, we need to override the default and try a stricter one. Do I miss something?
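To summarize where this exchange stands, a sketch of the names floated so far (nothing here is agreed, all field names are illustrative, and the factor-vs-absolute question for the reference tolerance is still open):

annotation(
  experiment(StopTime = 10, Tolerance = 1e-6, Interval = 0.01),
  TestCase(
    StopTime = 2,                      // shorter run used for reference generation and verification
    ToleranceReference = 1e-7,         // absolute variant: tolerance used to generate the reference
    // ReferenceToleranceFactor = 0.1, // factor-based alternative discussed above
    ToleranceVerification = 1e-6));    // tolerance for the verification run (contested, see below)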

@henrikt-ma (Collaborator)

I understand the requirement, but I'm not sure this is a good solution in all cases. If the default Tolerance = 1e-6, it makes perfect sense to generate reference results with Tolerance = 1e-7. But if we need to tighten it significantly, e.g. Tolerance = 1e-8, it may well be that Tolerance = 1e-9 is too tight and the simulation fails. Happened to me many times with numerically challenging thermo-fluid models. Per se, I don't think that the reference should necessarily be generated with a tighter tolerance than the verification simulation, if the latter is tight enough. Do I miss something?

Yes; if the library developer can't even make the simulation results match a reference generated with more strict tolerance, I would argue that the reference shouldn't be trusted. In my experience, it doesn't need to be 100 times smaller than Tolerance to serve as a sanity check, 10 times seems to be enough difference. I can also imagine that a factor of 0.01 might too often give solvers problems dealing with too tight tolerance, which is why I'm suggesting 0.1 to be the default. Before a default is decided upon, however, I suggest we try some different numbers to see in how many cases the default would need to be overridden; if we can manage that number with a default of 0.01, I'd be in favor of that since it gives us an even stronger indication of the quality of the reference results in the default cases.

I'd be very sceptical about using a ReferenceToleranceFactor above 0.1 for any test, as it would show that the test is overly sensitive to the Tolerance setting. For instance, if another tool uses a different solver, one couldn't expect that the Tolerance in the model would work.

@henrikt-ma

henrikt-ma commented Sep 19, 2024

  • ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens the door to having an inadequate Tolerance set in models.

This was @maltelenz's requirement. He wants to run one verification simulation with the Tolerance set in the experiment annotation. If the results fall within the 0.2% CSV-compare bounds (which happens 95% of the time), the verification is passed. Otherwise, we need to override the default and try a stricter one. Am I missing something?

That it only looks like a poor workaround for not being able to specify more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.

@casella

casella commented Sep 19, 2024

For the record, here are some required tolerance changes for Buildings in order to avoid cross-tool verification issues: PR lbl-srg/modelica-buildings#3867

@casella

casella commented Sep 19, 2024

That it only looks like a poor workaround for not being able to specify more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.

The problem is, we don't only want to see that. We also want to make sure that if you tighten the tolerance, you get closer to the "right" result. Regardless of what the user experience is.

@henrikt-ma

That it only looks like a poor workaround for not being able to specify more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.

The problem is, we don't only want to see that. We also want to make sure that if you tighten the tolerance, you get closer to the "right" result. Regardless of what the user experience is.

I'd say it makes sense to perform both tests, but not to let a test with some ToleranceVerification completely replace the real deal with Tolerance. What I hear here is also that you want tighter comparison requirements when using the ToleranceVerification – this is something to keep in mind when designing how comparison tolerances should be given.

Let's assume there is a Comparison annotation for specifying the comparison tolerances. Where should it be placed? How many could there be in a single model?

We should remember the possibility of using extends to define additional models running with a tighter Tolerance. With this, it would suffice to just have at most one TestCase per model, and at most one Comparison within the TestCase:

package Examples
  model DoublePendulum
    …
    annotation(
      experiment(StopTime = 5.0)
      // No need for TestCase since defaults are OK.
    );
  end DoublePendulum;
end Examples;

package TightToleranceTests
  model DoublePendulum
    extends Examples.DoublePendulum;
    annotation(
      experiment(Tolerance = 1e-6),
      TestCase(Comparison(valueRel = 1e-5))
    );
  end DoublePendulum;
end TightToleranceTests;

@casella

casella commented Sep 26, 2024

I'd say it makes sense to perform both tests,

Agreed 100%, we're now on the same page with @maltelenz here.

but not to let a test with some ToleranceVerification completely replace the real deal with Tolerance.

Ditto.

What I hear here is also that you want tighter comparison requirements when using the ToleranceVerification – this is something to keep in mind when designing how comparison tolerances should be given.

We need to experiment a bit here. Historically, we figured out that a 0.002 relative tube tolerance in CSV-compare was good enough to avoid spurious regressions in 95% of the cases, and it was tight enough to guarantee that the solutions look the same to the naked eye. The remaining 5% of spurious regressions were simply not handled properly for cross-tool verification 😃

Once we have a proper two-tier verification procedure in place, we might as well require an even tighter tolerance for the base case, e.g. 1e-4 instead of 2e-3. We could try that out with MSL, and if that doesn't increase the number of "special cases" too much, we could change the default tolerance for CSV-compare accordingly.

But I'm not hearing cries out there to get this. What I hear is people crying because of endless fighting with spurious false negatives.

Let's assume there is a Comparison annotation for specifying the comparison tolerances. Where should it be placed?

I guess in the TestCase annotation 😃

How many could there be in a single model?

I'd say at most two. One tight (currently 0.002, maybe less, see above), one more relaxed for user experience (5-10%).

Currently these are hard-wired in the CSV-compare tool. The question is whether we want to make them explicit and possibly define default values in the MLS. I'm not sure.

We could also say that we always run two tests, one tight and one relaxed. Given the current experience, I'd say that would definitely be overkill in most cases. I'm in favour of running a moderately tight verification with the default experiment annotation and, only if it fails while passing a sloppier verification, re-running the same moderately tight comparison with the tighter simulation tolerances specified in the TestCase annotation. But I am ready to change my mind here if there are convincing arguments to do so.
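
To illustrate the two tiers (with the caveat that the structure and nesting of Comparison are still completely open), the following toy annotation re-uses the Comparison(valueRel = ...) form from the earlier example, while RelaxedComparison is a placeholder name invented here purely for the sketch:

model TwoTierComparisonSketch "Toy model; only the TestCase annotation is of interest"
  Real x(start = 1, fixed = true);
equation
  der(x) = -x;
  annotation(
    experiment(StopTime = 5.0, Tolerance = 1e-6),
    TestCase(
      Comparison(valueRel = 0.002),        // tight tube used for the default verification run
      RelaxedComparison(valueRel = 0.05)   // sloppier tube deciding whether a re-run with tighter Tolerance is worth trying
    ));
end TwoTierComparisonSketch;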

We should remember the possibility of using extends to define additional models running with a tighter Tolerance. With this, it would suffice to just have at most one TestCase per model, and at most one Comparison within the TestCase:

Sure. But this requires a significant amount of extra library management work, which, as I argued here, will likely prevent a wider use of systematic cross-tool verification in practice.

I understand that in general adding hard-wired stuff to the specification should be avoided in favour of doing things with the existing language, because then you do what you want, you can change it if you change your mind, and you don't need to ask for permission or get an agreement in a committee. However, adding a few annotations here and there to handle a few corner cases is sooo much more convenient in this case. See it as a form of syntactic sugar 😉

@henrikt-ma

We should remember the possibility of using extends to define additional models running with a tighter Tolerance. With this, it would suffice to just have at most one TestCase per model, and at most one Comparison within the TestCase:

Sure. But this requires a significant amount of extra library management work, which, as I argued here, will likely prevent a wider use of systematic cross-tool verification in practice.

I understand that in general adding hard-wired stuff to the specification should be avoided in favour of doing things with the existing language, because then you do what you want, you can change it if you change your mind, and you don't need to ask for permission or get an agreement in a committee. However, adding a few annotations here and there to handle a few corner cases is sooo much more convenient in this case. See it as a form of syntactic sugar 😉

But having just a single TestCase per model, and a single Comparison per test case also makes it much easier to present and edit this information in GUIs. Additionally, in the supposedly rare cases when the standard example's tolerance isn't sufficient for high precision testing, using extends has the advantage that it allows the user to also simulate the example with the tighter tolerance using existing user interfaces in tools.

Multiple experiment setups within the same model would be an interesting feature, but considering the impact on user interfaces related to experiments, I think it is a big task belonging to a separate discussion that shouldn't block what we need for regression testing and cross-tool comparisons. Once we have that, we can migrate the extends-based solution, but it won't really hurt to rely on extends in the meantime; tests running with tighter tolerance can be placed in testing libraries such as ModelicaTest, so that they don't clutter the selection of examples in the main package.

@HansOlsson

Language group:

  • ModelicaTest as component test (but also integration tests)
  • Modelica.*.*Examples for demonstrations

Testing both models and tools.
Two parts of standardization: how to test (generating reference results etc), and what to test - the proposal is for "what to test".

Possible to override experiment settings with TestCase-sub-annotation:

  • different stopTime for chaotic/quickly rotating/quickly switching
  • Something for tolerance? (The main idea is to have tight tolerance for reference, and use two simulations with tight and default tolerance to see whether tolerance is the culprit.)
    • Malte - fine with increasing tolerance for comparison, but not with tightening tolerance for test-running integration. comparisonTolerance - default 0.002 (right?)
    • Also likely to have more fine-grained result than pass/fail

@HansOlsson

Note: I realized a minor item: any model with a TestCase annotation is not seen as part of the interface and can be updated freely, without this being considered a breaking change.

That may be:

  • Desired; as these models should be demonstration Examples and various tests, and thus not used for production.
  • Something that should be restricted to models having TestCase.shouldPass-annotation.
  • A reason to use another annotation.

(I thought that we maybe already had the first rule for Examples in MSL - but I could not find it written down anywhere.)

@casella

casella commented Oct 7, 2024

I have another interesting use case to motivate why we may need an extra set of simulation parameters for verification. Consider this test of the Buildings library, Buildings.Fluid.HydronicConfigurations.ActiveNetworks.Examples.InjectionTwoWayVariableReturn.diff.con.T2Sup.T:

[figure: plot of con.T2Sup.T, showing the reference trajectory with its piecewise-linear tolerance tube and the compared simulation points]

The test case in question is a day-long simulation (86400 s) with the default Dymola choice of the number of intervals, i.e. 500. This means Interval = 180, i.e. three minutes. For all practical purposes, a sampling time of three minutes is good enough for a day-long simulation, since the main interest here is in slow thermal behaviour, and it helps avoid overly large simulation result files. However, as is clear from the above image, this sampling interval is definitely not short enough to represent the signal correctly if the data points are interpolated by piecewise linear interpolation, which is standard practice for verification comparisons. So, the tolerance tube gets an edgy shape (which has nothing to do with the actual system behaviour) and, as a consequence, the simulation points can fall slightly outside it.

Of course, one could have much worse cases of aliasing if the frequency of the oscillations is above the Shannon limit.

I think this can motivate the need for a shorter Interval for generating reference data than the one set in the experiment annotation. This could also be set (if necessary) in the TestCase annotation.
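
As a purely illustrative sketch of that last sentence: a hypothetical ReferenceInterval field inside TestCase (the name is invented here, not something that has been proposed or agreed on) would leave the user-facing experiment untouched while sampling the reference trajectory densely enough to avoid the edgy tube:

model DayLongThermalSketch "Toy stand-in for a day-long thermal example; only the annotations matter"
  Real T(start = 293.15, fixed = true);
equation
  der(T) = (313.15 - T)/3600;
  annotation(
    experiment(StopTime = 86400, Interval = 180),  // what the end user sees: a 3 min output grid
    TestCase(ReferenceInterval = 10));             // hypothetical: denser grid used only when generating reference data
end DayLongThermalSketch;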

@casella

casella commented Oct 7, 2024

In fact, the more I think of it, the more I believe that this could explain a lot of false negatives that take place around peaks of some variables that are not sampled fast enough. It is obvious that under-sampling can lead to significant under-estimation of the actual peak value, so that the tubes built around a severely under-sampled simulation will be incorrectly too narrow.

As far as I understand, this issue cannot be handled in a satisfactory way by relaxing the tube tolerance. It can be observed by human inspection, and fixed once and for all by declaring an appropriate Interval for the reference result.

@HansOlsson

HansOlsson commented Oct 7, 2024

In fact, the more I think of it, the more I believe that this could explain a lot of false negatives that take place around peaks of some variables that are not sampled fast enough. It is obvious that under-sampling can lead to significant under-estimation of the actual peak value, so that the tubes built around a severely under-sampled simulation will be incorrectly too narrow.

As far as I understand, this issue cannot be handled in a satisfactory way by relaxing the tube tolerance. It can be observed by human inspection, and fixed once and for all by declaring an appropriate Interval for the reference result.

I don't fully agree with this.
I can understand that we cannot handle it by relaxing the tube tolerance with the current criteria.

However, to me this indicates that the tube tolerance criterion may not be appropriate if the model has somewhat periodic behavior (as is common) - if, instead of the tube tolerance, we had used some Lp-norm of the deviations, it seems it would just have worked, without having to modify the Interval for specific models. Obviously we still need a bound on the norm, and the tube tolerance may give useful insights for choosing it.

It's not necessarily that those two alternative criteria should be used - it's just that, instead of starting to add extra intervals to a lot of models to fit a specific criterion, we should also re-evaluate that criterion.

(But obviously not block MSL 4.1.0 release for this.)
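
For the record, here is a minimal sketch of what such a deviation norm could look like, written as a plain Modelica function over trajectories that have already been resampled to a common time grid; the function name, the relative scaling, and the nominal floor are assumptions made here for illustration, not an agreed criterion:

function trajectoryDeviationNorm
  "Discrete relative Lp-norm of the deviation between a result and a reference trajectory on the same time grid"
  input Real result[:];
  input Real reference[size(result, 1)];
  input Real p = 2 "Norm exponent";
  input Real nominal = 1e-6 "Floor avoiding division by very small reference values";
  output Real norm "Dimensionless deviation measure";
algorithm
  norm := 0;
  for i in 1:size(result, 1) loop
    norm := norm + (abs(result[i] - reference[i])/max(abs(reference[i]), nominal))^p;
  end for;
  norm := (norm/size(result, 1))^(1/p);
end trajectoryDeviationNorm;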

@maltelenz

For all practical purposes, a sampling time of three minutes is good enough for a day-long simulation, since the main interest here is on slow thermal behaviour

If the thing of interest is slow behavior, why then test a variable with fast behavior?

@HansOlsson

For all practical purposes, a sampling time of three minutes is good enough for a day-long simulation, since the main interest here is on slow thermal behaviour

If the thing of interest is slow behavior, why then test a variable with fast behavior?

I can think of several answers:

  • It is only oscillating during a short part of the simulation, it mostly has slow behavior. (Click on other link.)
  • There might not exist a corresponding low-pass filtered variable.
  • To me the problem also occurs every time there's a min or max - even if not very fast behavior.

@casella

casella commented Oct 8, 2024

If the thing of interest is slow behavior, why then test a variable with fast behavior?

Two reasons:

  • It is only oscillating or showing other high-frequency behaviour (e.g. sharp corners due to min-max) during a short part of the simulation, while being slow most of the time (as Hans noted already)
  • In case the slow variable verification also fails, this flow rate is the main possible root cause, so it is good to see it in order to understand what is going wrong

@casella

casella commented Oct 8, 2024

The point of my post was simply that we should also take into account Shannon's theorem and aliasing issues when defining the sampling interval of reference trajectories, not just the tolerance of the underlying ODE solver.

Building tubes, using Lp-norms, or performing any other kind of analysis on badly undersampled variables is never a good idea, simply because too much undersampling may lose crucial information. E.g., a periodic signal sampled at an integer multiple of its period will look like a constant, as is well known.

IMHO this may need specific provisions different from the experiment annotation when generating reference trajectories.
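
As a minimal toy illustration of the aliasing point above (the model and numbers are invented here purely for demonstration): a 1 Hz sine stored with an Interval equal to its period shows up as a constant in the result file, even though the underlying solution is perfectly fine:

model AliasingSketch "A 1 Hz sine sampled at exactly one-period intervals looks constant in the stored result"
  Real y = sin(2*Modelica.Constants.pi*time);
  annotation(experiment(StartTime = 0, StopTime = 10, Interval = 1.0));
end AliasingSketch;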

@maltelenz

Thanks for answering my maybe a bit terse and confrontational question :)

These quick oscillations, or sharp transients, are exactly the kind of signals we have the most trouble with when trying to compare our results against the MSL references, so we agree on where the problems are.

@AHaumer

AHaumer commented Oct 15, 2024

For some ideas about periodic and chaotic models see #4477.
Additionally, I am discussing a regression issue with @GallLeo where an example with the standard tolerance simulates fine, but with a tighter tolerance the simulation time "explodes". So I would like to point out that a tighter tolerance and/or a shorter interval is not always a good idea.

@casella

casella commented Oct 16, 2024

Additionally, I am discussing a regression issue with @GallLeo where an example with the standard tolerance simulates fine, but with a tighter tolerance the simulation time "explodes".
So I would like to point out that a tighter tolerance and/or a shorter interval is not always a good idea.

Good that you point this out. This is the reason why I added the possibility of specifying the tolerance factor for generating reference data in my first proposal.

@AHaumer please provide a precise link to the model in question, so that everyone involved in this discussion can play around with it. I think we have had a lot of discussion about grand principles here, but not enough about actual use cases. We should build a small library of publicly available problematic models that we can actually run with multiple tools, and then make sure that whatever we propose can deal with all of them nicely.

@henrikt-ma

In fact, the more I think of it, the more I believe that this could explain a lot of false negatives that take place around peaks of some variables that are not sampled fast enough. It is obvious that under-sampling can lead to significant under-estimation of the actual peak value, so that the tubes built around a severely under-sampled simulation will be incorrectly too narrow.
As far as I understand, this issue cannot be handled in a satisfactory way by relaxing the tube tolerance. It can be observed by human inspection, and fixed once and for all by declaring an appropriate Interval for the reference result.

I don't fully agree with this. I can understand that we cannot handle it by relaxing the tube tolerance with the current criteria.

However, to me this indicates that the tube tolerance criterion may not be appropriate if the model has somewhat periodic behavior (as is common) - if, instead of the tube tolerance, we had used some Lp-norm of the deviations, it seems it would just have worked, without having to modify the Interval for specific models. Obviously we still need a bound on the norm, and the tube tolerance may give useful insights for choosing it.

It's not necessarily that those two alternative criteria should be used - it's just that, instead of starting to add extra intervals to a lot of models to fit a specific criterion, we should also re-evaluate that criterion.

I agree that this is not an obvious reason to shorten the Interval. A simple alternative to the use of an Lp-norm is to use a more relaxed comparison tolerance, at least for variables with high frequency content in relation to the Interval.

@henrikt-ma

Additionally, I am discussing a regression issue with @GallLeo where an example with the standard tolerance simulates fine, but with a tighter tolerance the simulation time "explodes".
So I would like to point out that a tighter tolerance and/or a shorter interval is not always a good idea.

Good that you point this out. This is the reason why I added the possibility of specifying the tolerance factor for generating reference data in my first proposal.

My take on this is that if the simulation time "explodes" when the tolerance is tightened by an order of magnitude, then the experiment.Tolerance is likely to be too close to a setting which is too tight for the model dynamics. As a first remedy, I think one should try to relax the Tolerance and see if there is a setting sufficiently far from the problematic region that both Tolerance and 0.1 * Tolerance give acceptable simulation performance, while at the same time producing results that are similar enough. Moderate differences between the Tolerance and 0.1 * Tolerance simulation results could be acceptable as long as one knows what one is doing when relaxing comparison tolerances to allow tests to pass.
