
Direct memory access in fmi3SetX / fmi3GetX functions #515

Open
t-sommer opened this issue Dec 18, 2018 · 21 comments · May be fixed by #1891

@t-sommer
Collaborator

Currently all requested values are copied twice for every call of fmi3GetX / fmi3SetX, which is inefficient in terms of bandwidth and memory usage. This can be avoided by re-using the memory for subsequent calls of the getters and setters.
A possible implementation is outlined below. (It is the result of a discussion with @pmai)

Instantiation

The FMU allocates the memory for all variables during fmi3Instantiate using the callbacks provided by the environment.
For FMUs that don't support variable array sizes this memory may be static. The FMU may even use this memory directly for its calculations.
The memory layout is defined by the model description: it is the same as for a get / set call where all variables of a respective type are set / retrieved in the same order in which they appear in the model description.

Get variables

fmi3Status fmi3GetX(const fmi3ValueReference vr[], size_t nvr, fmi3X* values[], size_t *nValues);
  • vr: all variables the caller is interested in (NULL = all)
  • nvr: size of vr
  • values: the memory that contains the variables
  • nValues: size of values

Set variables

fmi3Status fmi3SetX(const fmi3ValueReference vr[], size_t nvr, fmi3X values[], size_t nValues);
  • vr: all variables that have changed since the last call to fmi3SetX (NULL = all)
  • nvr: size of vr
  • values: memory retrieved from fmi3GetX
  • nValues: size of values

The value references are provided to allow optimizations if only a small number of variables are set / retrieved.

Structural changes

The FMU re-allocates the memory using the callbacks provided by the environment. The memory layout changes according to the structural parameters.

Termination

In fmi3Terminate() the FMU frees the memory using fmi3Free().
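To illustrate the intended lifecycle, here is a minimal importer-side sketch; it follows the signatures above (which, as written, omit the instance argument) and uses fmi3Float64 as a stand-in for fmi3X, so all names are illustrative rather than final API:

/* Get: NULL value references = "all variables"; the FMU hands back its
   internally allocated buffer instead of copying into caller memory. */
fmi3Float64 *values = NULL;
size_t nValues = 0;
fmi3GetFloat64(NULL, 0, &values, &nValues);

/* The importer reads and modifies the values in place ... */
values[0] = 1.5;

/* Set: pass the same memory back; NULL = "all variables have changed". */
fmi3SetFloat64(NULL, 0, values, nValues);

/* Termination: the FMU itself frees the buffer in fmi3Terminate(). */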

Comments welcome!

@t-sommer t-sommer added this to the FMI3.0 milestone Dec 18, 2018
@APillekeit
Collaborator

It is not completely clear how this is intended to work and how it will provide an improvement. Could you please provide an example in pseudocode?

@KarlWernersson
Collaborator

I am all for the initiative; however, assuming a memory layout based on the order in which variables appear in modelDescription.xml is not feasible. In addition, there might be internal variables not exposed in the modelDescription that share memory with variables that are exposed.
For this to work we need a solution that is flexible enough to be used in a somewhat efficient way with tools that organize their internal variables in different ways. Otherwise we are forcing tools to internally conform to a certain memory layout and/or will get a lot of FMUs with inefficient internal mapping.

@KarlWernersson
Collaborator

Also, if you share memory between the FMU and the master, aren't get/set redundant? You would just need an update/compute function that could replace all the get/set functions.

@pmai
Collaborator

pmai commented Jan 21, 2019

I agree with @KarlWernersson, this needs decoupling from the order of variables. However, this could be achieved through a proper setup call, and in this way it could also be made optional (if people want this): the importing implementation would call a setup function (fmi3SetupSharedMemory or something of this nature), passing in an array of value references and a pointer to memory, informing the FMU of the memory area used for data exchange and the layout of variables in there. If we allow a mixture of types, this needs an alignment standard; otherwise we might specify memory areas separately by basic type.

This way all data allocation is handled by the importer: This would solve e.g. the real-time static allocation issues, and allow direct sharing of this memory between FMUs, resulting in zero-copy overhead if done right.

Sync for CS is automatic through fmi3DoStep, for ME this might need separate functions.

If setup shared memory function is not called, then the importer must use the classic fmi3Get/SetXXX approach.

fmi3SetupSharedMemory(
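(The call above is cut off; purely as a guess at what such a setup entry point could look like, with every name and parameter hypothetical:)

/* Hypothetical sketch only, not part of any FMI release: the importer owns
   'memory' and tells the FMU which value references live there, and in
   which order. */
fmi3Status fmi3SetupSharedMemory(fmi3Instance instance,
                                 const fmi3ValueReference valueReferences[],
                                 size_t nValueReferences,
                                 void* memory,
                                 size_t memorySize);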

@t-sommer
Collaborator Author

The initial memory layout / value references could still be specified in the XML. This way an importer can generate code that directly accesses the variables w/o looking up the value references dynamically.

@pmai
Collaborator

pmai commented Jan 21, 2019

Since the importer would be controlling the memory layout, I don't see how this kind of initial memory layout default would be of benefit to the importer? I could see how this makes hard-coding the exported FMU code easier/faster (at the expense of not allowing a changed memory layout, which kind of negates the benefits of shared memory for coupled FMUs), but not how this would be of benefit to the importer...

@andreas-junghanns andreas-junghanns changed the title Re-use memory in fmi3SetX / fmi3GetX functions Shared memory in fmi3SetX / fmi3GetX functions Jan 29, 2019
@pmai pmai modified the milestones: FMI3.0, FMI 3.1 Jan 29, 2019
@pmai
Collaborator

pmai commented Jan 29, 2019

Retarget to 3.1 as discussed in WebMeeting on 2019-01-29.

@pmai
Collaborator

pmai commented Jan 29, 2019

Core ideas for future (3.1) optional support (triggered by WebMeeting on 2019-01-29, just my current feeling about these):

  • Provide a shared memory data-exchange API as an optional mode that implementations that care about performance can switch on and use instead of the normal getters/setters.
  • Shared memory is provided (allocated) by the importer to the FMU.
  • The importer also controls the data layout inside that memory (because it has the most knowledge of what data is exchanged between multiple FMUs, so it can design the layout in a more optimal fashion).
  • Data is exchanged by direct access, with clearly defined synchronization points (defined by function calls), when data is valid for reading/writing by Importer and by FMU.
  • Sync points are probably easy to define for Co-Simulation, maybe harder to define for Model-Exchange.
  • Sync points should take into account concurrent Co-Sim of multiple FMUs, but this needs more thinking about concurrency issues.
  • Shared memory API could also enable selective partial updating of large arrays easily.
  • Support for proper sparse arrays (i.e. arrays where not all elements are backed by memory) would probably need more complex API, will have to be taken into account in sparse array support for 3.1 (if we want that).

It should be possible to provide this in a completely backward-compatible fashion for a minor (e.g. 3.1) release.

@t-sommer
Collaborator Author

FWIW, the dynamic memory model of S-functions has been criticized as a limitation for embedded use-cases in the comparison with FMI on Wikipedia (which also applies to FMI).

@chrbertsch
Collaborator

Regular Design Meeting:
Discuss what has to be done in FMI 3.0 to make this possible in FMI 3.1 (so that we do not have to wait until FMI 4.0).

@chrbertsch
Collaborator

At the coming Berlin Design Meeting, due to the focus on layered standards, we will not have time to discuss this in detail, but perhaps we could form a working group to further discuss this topic (and perhaps other efficiency-related topics such as extended lifetime for binary variables).

@chrbertsch chrbertsch changed the title Shared memory in fmi3SetX / fmi3GetX functions Avoid copying in fmi3SetX / fmi3GetX functions Oct 26, 2022
@chrbertsch chrbertsch changed the title Avoid copying in fmi3SetX / fmi3GetX functions Shared memory access in fmi3SetX / fmi3GetX functions Oct 26, 2022
@chrbertsch
Collaborator

chrbertsch commented Oct 26, 2022

F2F Design meeting Berlin:

Intention: avoid copying
It is not just an annotation, but an API change.
Torsten B.: in some modes (initialization, event mode, continuous time mode), get functions trigger computation.
We could lose caching functionality.
Klaus: the Reference FMUs do it in some other way.
Pierre: this is a specific proposal for shared memory.
Andreas: Shall we go into this topic?
Klaus: we need prototype implementations.
Christian: is this related to extended lifetime?
Pierre: this is a different approach, somehow related.
Torsten S: Benefit only for certain FMUs with a few arrays or binary variables, not many scalar variables.
Torsten B.: For some use cases we can use virtual getFunctions to get the benefit of caching.
Klaus: a student of mine measured some test cases.

Working Group to discuss this: Andreas, Pierre, Torsten B, Timo, Torsten S., Klaus, dSPACE
Pierre will invite to a meeting

@timrulebosch

timrulebosch commented May 25, 2023

@t-sommer Regarding your original suggestion, I would offer a suggestion as follows:

fmi3Status fmi3GetValueAddr(const fmi3ValueReference vr[], size_t nvr, void* addr[]);

The FMU operates as before, allocating its memory etc. Then, for each requested VR, the FMU can decide to return the address of the value or NULL (supported/not supported). The Importer takes care of the response.

There is one edge condition for binary variables where it is more useful to return a pointer to a variable which holds the allocated address of the buffer (i.e. void**). In that case either the Importer or the FMU can realloc() the buffer as required, reflecting the new size in the associated variable, and the "other side" will see the same effect.
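A rough importer-side sketch of that fallback behaviour (the value reference is made up, and the instance argument is omitted to match the signature above):

/* Sketch: ask for the address of one Float64 variable and fall back to the
   classic getter if the FMU answers NULL (direct access not supported). */
fmi3ValueReference vr = 1001;   /* hypothetical value reference */
void* addr = NULL;
fmi3GetValueAddr(&vr, 1, &addr);

fmi3Float64 value;
if (addr != NULL) {
    value = *(fmi3Float64*)addr;        /* direct read, no copy by the FMU */
} else {
    fmi3GetFloat64(&vr, 1, &value, 1);  /* fall back to the classic getter */
}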

@timrulebosch

We did a lot of work on this topic over the last few days. As a result, it seems we would need the following API:

fmi3Status fmi3GetValueAddr(const fmi3ValueReference vr[], size_t nvr, void* addr[], void* additional[]);
fmi3Status fmi3SetValueAddr(const fmi3ValueReference vr[], size_t nvr, void* addr[], void* additional[]);

In the case of complex types, the 'additional' parameter would hold an additional reference - e.g. the address of the size variable for an fmi3Binary.

In the case that shared access is not supported for a variable, the function should return NULL for that variable. The importer falls back to the preexisting methods.

No additional changes are required. Everything will work. FMUs supporting this will behave correctly.

There are numerous use cases for doing this. However, many of them may require additional specification, so here one should only consider making the capability available. That is all that is required.
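A hedged sketch of how an importer could use the 'additional' pointer for an fmi3Binary variable (the value reference is a placeholder and the instance argument is omitted, as in the signatures above):

/* Sketch: direct access to an fmi3Binary variable. 'addr' receives the
   address of the FMU's buffer pointer, 'additional' the address of its
   size variable, so both sides observe resizes. */
fmi3ValueReference vr = 2001;   /* hypothetical value reference */
void* addr = NULL;
void* additional = NULL;
fmi3GetValueAddr(&vr, 1, &addr, &additional);

if (addr != NULL) {
    fmi3Binary* buffer = (fmi3Binary*)addr;  /* pointer to the buffer pointer */
    size_t* size = (size_t*)additional;      /* pointer to the size variable */
    /* read *size bytes starting at *buffer, or reallocate the buffer and
       update *size; the FMU sees the same memory */
} else {
    /* fall back to fmi3GetBinary / fmi3SetBinary */
}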

@t-sommer
Collaborator Author

@timrulebosch, could you implement a prototype based on the Reference FMUs?

@timrulebosch

This is about it ... more or less:

// FMU
int32_t default_value = 42;  // VR = 24
int32_t *value = &default_value;

fmi3Status fmi3GetValueAddr(const fmi3ValueReference vr[], size_t nvr, void* addr[], void* additional[])
{
    for (size_t i = 0; i < nvr; i++) {
        // Expose the address of the variable with VR 24; NULL means "not supported".
        addr[i] = (vr[i] == 24) ? (void*)value : NULL;
        additional[i] = NULL;
    }
    return fmi3OK;
}

fmi3Status fmi3SetValueAddr(const fmi3ValueReference vr[], size_t nvr, void* addr[], void* additional[])
{
    for (size_t i = 0; i < nvr; i++) {
        // Let the importer substitute its own storage for VR 24.
        if (vr[i] == 24) value = (int32_t*)addr[i];
    }
    return fmi3OK;
}

fmi3Status fmi3DoStep(.....)
{
    ....
    *value += 1;  // the FMU works directly on the mapped storage
    ....
}

The interface is trivial. Someone doing this would likely have a very specific use case, needing support in the Importer, and of course the FMU (and possibly a layered standard ... as well). You can certainly achieve specific things, for example a reduction in memory churn when working with fmi3Binary variables (avoiding unnecessary malloc, memcpy & free calls).

And perhaps other interesting things too.

@t-sommer
Collaborator Author

I've created a prototype of a shared memory API based on the LinearTransform model.

@t-sommer
Collaborator Author

t-sommer commented Jun 21, 2023

TODOs from the FMI Design Meeting 6/20/2023:

  • define lifetime of memory passed from and to the importer
  • define XML / annotations to declare variables that support direct memory access
  • rename "shared" to "direct" to avoid ambiguity (see shared memory)
  • document dual use of fmi3Set{VariableType}Pointer() for both registering memory and retrieving the actual values (a speculative sketch follows below)
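A purely speculative illustration of that dual use; the function name, signature, and argument names below are assumptions for illustration, not the API of the linked pull request:

/* Speculative sketch only. First call: register importer-owned memory for
   the listed value references. */
fmi3Float64 u = 0.0, y = 0.0;
fmi3Float64* pointers[2] = { &u, &y };
fmi3SetFloat64Pointer(instance, valueReferences, 2, pointers, 2);

/* From then on the same memory carries the actual values: writing u before
   fmi3DoStep and reading y afterwards replaces further
   fmi3SetFloat64 / fmi3GetFloat64 calls. */
u = 1.0;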

@t-sommer t-sommer changed the title Shared memory access in fmi3SetX / fmi3GetX functions Direct memory access in fmi3SetX / fmi3GetX functions Jul 8, 2023
@t-sommer t-sommer linked a pull request Jul 13, 2023 that will close this issue
@timrulebosch

timrulebosch commented May 14, 2024

A pattern we use in our simulation system (non-FMI) is to effectively map variable vector/arrays from the Model into the Importer. Because the model is aware of its variable vector layout, it can directly access variables without overhead. Conversely, the importer learns the variable layout of the Model, and is able to map those variables to the signal exchange mechanism of the simulation.

This mechanism removes the need for our models to do any kind of data marshaling (Vref searches, assignment, memcpy). Semantics of operation and access are inferred from the inherent state of the model (so no need for get/set functions).

Such a pattern/mechanism could be achieved in FMI where an FMU maps its variable vector/arrays into the Importer. Code for such an approach is as follows:

typedef enum {
    fmi3ValueNone = 0,
    fmi3ValueFloat32,
    fmi3ValueFloat64,
    fmi3ValueInt8,
    fmi3ValueUint8,
    fmi3ValueInt16,
    fmi3ValueUint16,
    fmi3ValueInt32,
    fmi3ValueUint32,
    fmi3ValueInt64,
    fmi3ValueUint64,
    fmi3ValueString,
    fmi3ValueBinary,
    fmi3ValueCount,
} fmi3ValueType;


typedef fmi3Status fmi3GetVariableMap(
    fmi3Instance instance, fmi3ValueType type,
    // Return VR table from internal FMU map.
    const fmi3ValueReference valueReferences[], size_t nValueReferences,
    // Return pointer to internal FMU array/vector
    void** values,
    // Return pointer to internal size for binary data, otherwise NULL.
    size_t** valueSizes, // Size of content in binary buffer to be consumed
    size_t** valueLength // Length of allocated binary buffer (for buffer management i.e. realloc)
);
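A hedged usage sketch of the mapping call above, assuming the typedef is exported as a concrete function and reading the value references as importer-supplied (the comment above could also be read as the FMU returning them); 'instance' and the VRs are placeholders:

/* Sketch: the importer asks for a map of the FMU's internal Float64 vector
   and then works on it in place, without per-step get/set calls. */
const fmi3ValueReference vrs[3] = { 1, 2, 3 };   /* hypothetical VRs */
void* values = NULL;
size_t* valueSizes = NULL;
size_t* valueLength = NULL;

fmi3GetVariableMap(instance, fmi3ValueFloat64,
                   vrs, 3, &values, &valueSizes, &valueLength);

fmi3Float64* float64s = (fmi3Float64*)values;
float64s[0] = 3.14;   /* corresponds to vrs[0] */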

@chrbertsch
Collaborator

chrbertsch commented May 15, 2024

FMI Design Meeting Munich:
Matthias: Irina did a first try with a single FMU with 10,000 inputs and an internal multiplication of the inputs with a parameter: speedup of about 20% (set/get replaced by a pointer approach: the importer provides a pointer at initialization, bypassing the set/get operations).
This saves a copy operation and does not spare a lot of get/set operations.
Andreas: you could speed up the get/set operations if the value reference vector is always the same.
Klaus: If the importer guarantees the order of the value references
Torsten: You get rid of the check operations.
Andreas: and you could also get rid of the copy with the pointers. 20% of what? communication or computation?
Matthias: Currently we look at large systems with about 100 models, where this could get more important for real-time systems.
Andreas: what do we need to decide to go into this direction? What kind of experiment do we need to perform to decide this?

Torsten: for parallelization one must make sure that the writing / reading of variables is done correctly.
Matthias: Integrity of the data has to be ensured by the importer.
Torsten: One could define a pointer to an internal struct with a separate xml describing the memory layout.

Andreas: we have to define the problem.
Matthias: the importer shall do optimizations in the case of hundreds of models for real-time use, with thousands of inputs.
Tim: Vehicle model generated with FMI-Kit. 100 --> 6000 parameters: we see an influence.
Pierre: a pointer solution could even slow this down
Andreas: this could be optimized today by the importer. In Modelica tools you have a "lazy" computation possibility. At the standard level we have to focus on the things that have to happen.

Pierre: for vECUs, using binary variables with contiguous binary arrays, I would expect an acceleration.
Andreas: and we have clocks so that you know when data has changed.

Andreas: all the solutions boil down to direct memory access and describing the memory layout.
Karl, Andreas: one could provide arrays/structs
Pierre: one would have to share a memory layout description before creation of the FMUs

Andreas: if the importer knows better when to communicate what. Even today the importer could leave out "set" calls.
Matthias: define memory layout, directly write to memory.
Andreas: I am questioning whether this is faster.
Pierre/Andreas: in the case of two FMUs with identical interfaces: one could list all the VRs in an order and then exchange an array or pointer ....
Matthias: we suggest to pass pointers to the FMU
Andreas: the dereferencing of thousands of different pointers in combination with copy operations can be very costly. I did experiments with VTune by Intel (a profiler that takes cache effects into account).
Pierre: This creates unpredictable memory access. Chasing pointers is inefficient for CPUs. Copying in the L2 cache can be very efficient ... one could make communication faster, but the calculation can get slower.

Pierre: One optimization could be to have the set/doStep/get operations in one call. (cache effects) This makes sense for point-to-point connections.

Andreas: Calling the get/set functions with the same "valid" VRs could bring a performance benefit.

Pierre: the FMU could give a natural order and a kind of "get/set-it-all". The intelligence could be in the importer (e.g. just-in-time compilation).
Andreas: the memory resides in the FMU, passing the pointers outside.
For our use cases this is not a real optimization.

Pierre: We need more realistic code for benchmarking.
Andreas: for the dSPACE use case I need to understand what the problem is.
Matthias: we will try to describe our kind of systems
Pierre: also what kind of code runs in the computation (we need to see the complexity of the computation).
Matthias: what about the OSI-use cases?
Pierre: there we compared FlatBuffers and Protobuf; one can save in some cases a factor of two in the communication, but this is not so important compared to other computations (e.g. serialization / deserialization).

Pierre: it might be beneficial to put set-doStep-get more closely together.

Next steps:
Matthias: we could provide use case descriptions and examples.
Pierre, Andreas: we would be willing to implement this also.
Torsten: I can evaluate a concept with the Reference FMUs.
Christian: we should start an FCP document for this topic, starting with describing the use cases.
Matthias: I will discuss with Irina and prepare a presentation in one of the next meetings.
