Rethink consistency checker in control board NWS/YARP #2939

PeterBowman · 2023-02-01T12:32:43Z

Patch #2841 introduced a check in YARP 3.7.0 that, under certain circumstances, prevents controlBoard_nws_yarp from broadcasting state (ports /state:o and /stateExt:o) to clients (namely its NWC counterpart, i.e. remote_controlboard). This encompasses encoder data as well as torques, currents, control modes, etc. The check itself consists of a comparison of timestamps as returned by IEncodersTimed::getEncodersTimed. If any of these timestamps (each corresponding to one wrapped joint) is more than one second apart from the others, the check claims this situation is inconsistent and prevents from dispatching the port write further down in the code:

yarp/src/devices/controlBoard_nws_yarp/ControlBoard_nws_yarp.cpp

Lines 511 to 520 in 920ecde

    
           // Check if the encoders timestamps are consistent. 
        
           for (double tt : times) 
        
           { 
        
               if (std::abs(times[0] - tt) > 1.0) 
        
               { 
        
                   yCError(CONTROLBOARD, "Encoder Timestamps are not consistent! Data will not be published."); 
        
                   yarp::sig::Vector _data(subdevice_joints); 
        
                   return; 
        
               } 
        
           }

As I observed in #2862 (comment), this consistency checking can break some workflows:

In our hardware setup, there is one physical motion controller per joint, connected via CAN bus to a CPU where our centralized control board implementation resides. Often, we observe communication issues on specific (not all) controllers, but we are prepared to deal with that: relying on a heartbeat signal, the conflicting device is restarted when a reasonable timeout elapses. We are able to command working joints while some others are still trying to restart, i.e. in our case robot joints are somewhat independent from each other.

The first point I'd like to bring up in this issue is: what was the rationale behind the introduction of this consistency check and what problems does it aim to solve?

In my opinion, disabling state broadcasting is a bit draconian (clients can no longer attend to robot state just because one single joint is off sync) and inconsistency could be handled differently, i.e. through return codes. These are already broadcast through the same port along with proper data (see Boolean fields here).

The times vector as it appears in the previous snippet is only used as the envelope of both state ports after averaging:

yarp/src/devices/controlBoard_nws_yarp/ControlBoard_nws_yarp.cpp

Lines 522 to 527 in 920ecde

    
           // Update the port envelope time by averaging all timestamps 
        
           time.update(std::accumulate(times.begin(), times.end(), 0.0) / subdevice_joints); 
        
           yarp::os::Stamp averageTime = time; 
        
           extendedOutputStatePort.setEnvelope(averageTime); 
        
           extendedOutputState_buffer.write();

This envelope is read by the NWC counterpart, RemoteControlBoard, and stored in a class variable named lastStamp here. Clients that want to retrieve certain data (broadcast by said state ports, such as current torques) will trigger a local save of the timestamp enveloping the last state message: ref.

Although this operation (retrieve time envelope) is repeated in several places throughout the NWC's code, lastStamp is only consumed (read from) in two situations:

Interface calls to IEncodersTimed::{getEncoderTimed,getEncodersTimed} and IMotorEncoders::{getMotorEncoderTimed,getMotorEncodersTimed}. Please note the timestamps we get here are the average of all timestamps, contrary to the interface contract which says "stamps: pointer to the array that will contain individual timestamps".
An interface call to IPreciselyTimed::getLastInputStamp here. The contract specifies that this is the "time stamp relative to the last acquisition", which seems correct, but...

...I believe lastStamp is still being misused in both situations. In the former, since we don't expect an average stamp according to the documentation. In the latter, because lastStamp is also set whenever the client performs an RPC call that contains timestamp information, see getTimestamp. Example: if a client calls getRefTorque(), the NWC will internally update lastStamp (for further use as an average stamp of encoder data and last acquisition stamp; see the two points explained earlier) despite not having received any data through the state ports for a while. As far as I can guess the purpose behind the consistency checker, I believe this behavior invalidates it. Interestingly, the stamp returned by RPC calls, generated on the NWS side, is simply the current timestamp at the time said RPC call was performed: ref.

Proposed course of action (depending on the first point: what was the purpose of the consistency checker)

Revert CB_NWS encoder check #2841 per the above reasons (draconian solution, we lose information about current state, a single faulty joint invalidates the whole robot part). Detect inconsistency through return codes instead, e.g. let consumers deduce it from the boolean returned by getEncoder and similar. This might require action on the implementor's side: if my joint is not responding, return false and let it flow through the NWS/NWC pair down to remote callers.
Make /state:o and /stateExt:o use the current stamp as the port envelope instead of averaging anything, consistently with the behavior of current RPC calls.
(probably a protocol breaker, then to be considered for YARP 4) Include proper per-joint timestamps in the joint data message, i.e. add a VectorOfDouble stamps field in the jointData struct and populate it with the return values of the second out array parameter in getEncodersTimed.
Regarding IPreciselyTimed::getLastInputStamp: in order to be consistent with the contract, use the most recent stamp obtained in the previous point (assuming "time stamp relative to the last acquisition" means "the last time we got a response from the real hardware", not "the last time we called the raw subdevice").

cc @randaz81

The text was updated successfully, but these errors were encountered:

PeterBowman · 2023-02-01T13:33:58Z

Perhaps the consistency check makes more sense for the /state:o port, which is not used by the NWC (it reads only from /stateExt:o)? The only usage I know is yarp read ... /prefix/state:o. Thus it also might not need any envelope at all.

PeterBowman · 2023-08-12T09:22:37Z

@randaz81 would it be possible to consider this for the next release (3.9.x)?

pattacini · 2024-02-03T16:02:04Z

An example of the consistency checker causing NWS to stop streaming:

https://github.com/orgs/robotology/discussions/685

randaz81 · 2024-02-05T11:02:23Z

I thought about this issue for quite some time, thinking about possible solutions (one at the end of this comment). Of course, code can be always improved and new checks can be introduced to make it more resilient to unexpected situations. However, let's analyze the issue from its fundamentals.

Often, we observe communication issues on specific (not all) controllers, but we are prepared to deal with that: relying on a heartbeat signal, the conflicting device is restarted when a reasonable timeout elapses.

No. This is something that my old master discouraged me from doing always. Under no situation you should allow a robot to operate if its status is (partially) undefined because of hardware failures in one of its parts. Under no situation, you should fix in software a problem of a hardware nature, hiding the problem with a watchdog and possibly causing a malfunctioning of a higher layer of the stack. Consider a coordinated movement of multiple joints, a differential system, or a cartesian controller. What will be the final trajectory of the end effector, if one controller receives a position feedback older than 1 second?
This check is doing what is supposed to be. Stop the robot and print an error to the user. In a very strict way, I understand, but I would make it even more strict, putting the system in "hardware fault" if I had the possibility. This check prevents a malfunctioning manipulator from damaging itself, the environment or injuring people, so the user has the chance to fix the hardware before continuing.

Now, considering possible improvements to the system, I'm currently working to improve return codes in Yarp interfaces and this is very related to your point 1:

Detect inconsistency through return codes instead, e.g. let consumers deduce it from the boolean returned by getEncoder and similar. This might require action on the implementor's side: if my joint is not responding, return false and let it flow through the NWS/NWC pair down to remote callers.

A draft of PR is already open #3051, It was originally scheduled for yarp 3.10 but unfortunately, it is a pretty large change, and it will require some extra development time to be propagated in all affected devices (especially the motorcontroller which implements a huge amount of methods). I'll keep you updated on the advancements on this.

This was referenced Feb 1, 2023

Fix handling of IRemoteVariables messages via RPC #2935

Merged

Relax requirement on timestamp consistency checker #2931

Closed

PeterBowman mentioned this issue Feb 1, 2023

Throttle error message on inconsistent encoder timestamps #2869

Merged

randaz81 added the Component: Devices label Aug 16, 2023

randaz81 self-assigned this Aug 16, 2023

randaz81 added this to the YARP 3.9.0 milestone Aug 16, 2023

traversaro mentioned this issue Aug 5, 2024

ergoCub 1.1 S/N:001 – Left arm suddently stopped streaming data robotology/icub-tech-support#1865

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethink consistency checker in control board NWS/YARP #2939

Rethink consistency checker in control board NWS/YARP #2939

PeterBowman commented Feb 1, 2023 •

edited

Loading

PeterBowman commented Feb 1, 2023

PeterBowman commented Aug 12, 2023

pattacini commented Feb 3, 2024

randaz81 commented Feb 5, 2024

Rethink consistency checker in control board NWS/YARP #2939

Rethink consistency checker in control board NWS/YARP #2939

Comments

PeterBowman commented Feb 1, 2023 • edited Loading

PeterBowman commented Feb 1, 2023

PeterBowman commented Aug 12, 2023

pattacini commented Feb 3, 2024

randaz81 commented Feb 5, 2024

PeterBowman commented Feb 1, 2023 •

edited

Loading