
Point streaming queue has reached max allowed elements #420

Closed
rr-tom-noble opened this issue Aug 20, 2021 · 28 comments

@rr-tom-noble

Hi,

I'm currently attempting to use the point streamer through the joint command topic to publish commands to the driver, however, I am getting the following error messages:

[ERROR] [1629453571.361495217] [/yaskawa/motion_streaming_interface]: Point streaming queue has reached max allowed elements
[ERROR] [1629453571.385527740] [/yaskawa/motion_streaming_interface]: Point streaming queue has reached max allowed elements
[ERROR] [1629453571.452335748] [/yaskawa/motion_streaming_interface]: Aborting point stream operation.  Failed to send point (#9): Invalid message (3) : Trajectory start position doesn't match current robot position (3011)
[ERROR] [1629453571.518157040] [/yaskawa/motion_streaming_interface]: Aborting point stream operation.  Failed to send point (#9): Invalid message (3) : Trajectory start position doesn't match current robot position (3011)
[ERROR] [1629453571.585306160] [/yaskawa/motion_streaming_interface]: Aborting point stream operation.  Failed to send point (#9): Invalid message (3) : Trajectory start position doesn't match current robot position (3011)

The second error only ever appears after the first, so I think the max queue error is the root of the problem. I'm publishing commands at a rate of 100 Hz, so I don't think I should be pushing the driver to any kind of limit. I've tried searching through the code but have been unable to find where the error is raised.

Any help would be much appreciated!

@gavanderhoorn
Member

gavanderhoorn commented Aug 20, 2021

I would suggest configuring your ROSCONSOLE_FORMAT env var to include file and line nr information. That would make it a lot easier to find where this is being printed.

See Console Output Formatting on the ROS wiki.
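For example (a sketch only: `${severity}`, `${time}`, `${node}`, `${file}`, `${line}`, and `${message}` are standard rosconsole substitution tokens; the exact layout is a matter of taste), the variable could be set in the environment before the driver node is launched:

```python
import os

# Sketch: a ROSCONSOLE_FORMAT value that adds ${file}:${line} to every log
# line, making it easy to locate where a message is printed.
# This must be set in the environment before the node starts.
os.environ["ROSCONSOLE_FORMAT"] = (
    "[${severity}] [${time}] [${node}] [${file}:${line}]: ${message}"
)
print(os.environ["ROSCONSOLE_FORMAT"])
```

In practice you would export the same string from your shell before running `roslaunch`.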

Some expectation management: the code you're using was never merged into the mainline driver and has therefore not seen the same level of use and debugging. It's possible there are issues which have not been addressed.

As to the error: the queue on the MotoROS side is 'emptied' based on whether segments have been executed. If your dX is larger than can be covered in the fixed dT (given current state of the robot), I could imagine overrunning the queue at some point. But this is just a guess.


edit:

Some expectation management: the code you're using was never merged into the mainline driver and has therefore not seen the same level of use and debugging. It's possible there are issues which have not been addressed.

and if you've merged it with current HEAD, there are even more possibilities for unknown issues, as motoman_driver (and MotoROS) have changed quite a bit since 2015/16, when that code was written.

@rr-tom-noble
Author

Thanks @gavanderhoorn. Do you know if there's equivalent point streaming functionality in the mainline driver?

@tdl-ua I believe you wrote this point streaming code? Any ideas on whether this issue was present in the branch originally?

@ted-miller @EricMarcil Do you know of any significant changes to the motoman driver which may not play nicely with this point streaming code?

@gavanderhoorn
Member

gavanderhoorn commented Aug 23, 2021

Do you know if there's equivalent point streaming functionality in the mainline driver?

no, there isn't.

@tdl-ua I believe you wrote this point streaming code?

no, that was @boneil (#37). @tdl-ua updated it in #215.


Edit: with an updated ROSCONSOLE_FORMAT: where is that error message printed?


Edit 2: seems to be here: https://github.com/ros-industrial/motoman/pull/215/files#diff-91823845753e4d8172e7046fdee7dda955aab677ff8965435d4820933a0f960eR269.

@tdl-ua
Contributor

tdl-ua commented Aug 23, 2021

@aa-tom, @gavanderhoorn is correct that I only updated the point streaming interface from two other previous commits that did the heavy lifting.

But I'm fairly certain (it's been over 3 years) that you have to send the trajectory points at 40Hz exactly because that is the rate at which they are processed and executed by the controller. I vaguely remember having a similar issue when using the wrong publishing rate.

See here for a working usage example of #215.

@ted-miller
Contributor

Do you know of any significant changes to the motoman driver which may not play nicely with this point streaming code?

I did a brief comparison of the MotoPlus code. Here are the top things that jumped out to me:

  1. 4 hour trajectory limit. After 4 hours of continuous streaming, you will need to stop and restart the trajectory stream. (Due to precision of timestamp data type.)
  2. This should not be used with an HC (collaborative) style robot.
  3. There have been significant improvements to the 'in-motion' flag, which is often used for determining the start of the next trajectory. This can lead to erroneous "Trajectory start position doesn't match current robot position" errors.

@rr-tom-noble
Author

Edit 2: seems to be here: https://github.com/ros-industrial/motoman/pull/215/files#diff-91823845753e4d8172e7046fdee7dda955aab677ff8965435d4820933a0f960eR269.

Yeah, that's the place

But I'm fairly certain (it's been over 3 years) that you have to send the trajectory points at 40Hz exactly because that is the rate at which they are processed and executed by the controller. I vaguely remember having a similar issue when using the wrong publishing rate.

Thanks for the advice. I'll try setting the publishing rate to 40Hz and see if that fixes the issue, although I'd expect that the queue would fill up almost immediately if that were the case? What we're seeing is that the point streamer works fine at a rate of 150Hz for a few minutes before suddenly hitting the issue.

I've also added some extra info statements including queue size and thread timing info to support debugging, so I'll get back with those results when I get a chance to test.

@rr-david-alatorre

To add to what Tom said about the publishing rate:
We have changed the controller status update period to 4 ms (from 25 ms, i.e. 40 Hz) to allow us to go faster.

We then tested the point streamer at 200 Hz (to match the 4 ms period) but reached max queue elements within a minute.
At 150 Hz it took several minutes but eventually filled the queue, so we lowered it all the way to 100 Hz and are still having the problem at random intervals: sometimes after a few minutes, sometimes after a long time streaming.

If this were caused by mismatched rates, I would expect the queue to fill almost immediately, as a mismatch of 1 Hz would fill the 20-element queue in 20 seconds; if the internal queue is the problem, I would at least expect consistent times before overflows. Very confusing!
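That rate-mismatch arithmetic can be sanity-checked with a tiny model (a sketch only: the real MotoROS queue drains as segments are executed, not at a fixed consumption rate, so this is the idealized case):

```python
def time_to_overflow(queue_capacity, publish_hz, consume_hz):
    """Seconds until a FIFO of the given capacity overflows when points
    arrive at publish_hz and drain at consume_hz. Returns None if the
    consumer keeps up and the queue never fills."""
    surplus = publish_hz - consume_hz  # net points added per second
    if surplus <= 0:
        return None
    return queue_capacity / surplus

# A 1 Hz mismatch fills the 20-element queue in 20 seconds, as noted above.
print(time_to_overflow(20, 41, 40))    # 20.0
# In this idealized model, 200 Hz against a 4 ms (250 Hz) consumer
# never overflows, which is why the observed behaviour is puzzling.
print(time_to_overflow(20, 200, 250))  # None
```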

@gavanderhoorn
Member

gavanderhoorn commented Aug 29, 2021

To help in diagnosing this, please make a Wireshark capture of the traffic between the controller and motoman_driver, then share it here.

Since the streaming extension is using regular simple messages, we can try to see whether something is not working correctly on that level.

@gavanderhoorn
Member

@aa-tom @DavidAlatorreAA: any luck with the wireshark trace?

@riv-robot

@gavanderhoorn Normally we would have sent this Wireshark capture immediately, but a few unrelated issues cropped up, so apologies for the delay. The capture is attached; some notes:

  1. This was run on our highest-performance PC, overclocked but with real-time scheduling off.
  2. We set the rate to 200 Hz.
  3. The queue size error occurred at UTC 11:00:17.973.
  4. In Wireshark, select View > Time Display Format > UTC Time of Day.

motoman_pointstreamer_200hz_queue_110017973.zip

@gavanderhoorn
Member

gavanderhoorn commented Sep 2, 2021

Thanks for the capture.

It seems to start somewhere in the middle of an ongoing session though.

Could you capture from the beginning (i.e. include drives enablement, etc.)?

The queue size error occurred at UTC 11:00:17.973

A packet nr reference is easier to find in a capture :)

@riv-robot

Here's the capture from the beginning. Packet nr 791435; true, that is a lot easier than the UTC goose chase.
motoman_pointstreamer_200hz_init.zip
Note only the first 250,000 packets are included as the full file was too big. I can send all in separate files if needed.

@gavanderhoorn
Member

Even from this capture it's clear the ROS side is stopping the motion, not MotoROS.

Specifically, this line triggers motoman_driver to send a Stop Motion command:

ROS-Industrial SimpleMessage, Motoman Motion Ctrl (0x7d1), 68 bytes, little-endian
    Prefix
        Packet Length: 64
    Header
        Message Type: Motoman Motion Ctrl (0x000007d1)
        Communications Type: Service Request (2)
        Reply Code: Unused / Invalid (0)
    Body
        Robot ID: 0
        Sequence Number: 0
        Command: Stop Motion (200111)
        Data (reserved)
            J0:    0,000000000
            J1:    0,000000000
            J2:    0,000000000
            J3:    0,000000000
            J4:    0,000000000
            J5:    0,000000000
            J6:    0,000000000
            J7:    0,000000000
            J8:    0,000000000
            J9:    0,000000000

which the controller then acknowledges.

I'd suggest adding some debug messages which show how/when the ptstreaming_queue_ is popped, which sequence nrs get transmitted and which get acked.

That could perhaps give you some insight into the behaviour of the queue and when its size starts to approach max_ptstreaming_queue_elements.
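As an illustration of the bookkeeping that suggestion implies (the driver itself is C++; this Python sketch, with made-up names such as `SeqTracker`, only shows the idea of tracking sent vs acknowledged points):

```python
class SeqTracker:
    """Hypothetical bookkeeping for streamed trajectory points: record each
    sequence number when it is sent and when its ack arrives, so the set of
    in-flight (unacked) points can be watched over time."""

    def __init__(self):
        self.sent = set()
        self.acked = set()

    def on_send(self, seq):
        self.sent.add(seq)

    def on_ack(self, seq):
        self.acked.add(seq)

    def in_flight(self):
        # Points transmitted but not yet acknowledged by the controller.
        return sorted(self.sent - self.acked)

tracker = SeqTracker()
for seq in range(5):
    tracker.on_send(seq)
for seq in (0, 1, 3):
    tracker.on_ack(seq)
print(tracker.in_flight())  # [2, 4]
```

Logging `in_flight()` alongside the queue size each cycle would show whether the queue growth correlates with missing acks.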

The rest of the SimpleMessage traffic looks pretty normal to me (with the exception of course that the trajectory never ends).

@gavanderhoorn
Member

The rest of the SimpleMessage traffic looks pretty normal to me (with the exception of course that the trajectory never ends).

That's not entirely true: your capture shows some missing TCP segments. I'm not sure this is not a red herring, but for instance packets 7390 and 7391 (in your full capture) imply something went wrong during transmission of a packet. That would indicate TCP packet loss.

It happens a few more times.

There are Dup ACKs which follow those missing segments, so in the end the bytes probably were received by the destination host, but it's not something I see often in captures of Simple Message traffic.

@gavanderhoorn
Member

@ted-miller: I just noticed that MotoROS does not seem to copy over the Sequence field from an incoming Joint Trajectory Point Full to the subsequent Motoman Motion Reply packet.

It's set to -1 (at least in these captures):

ROS-Industrial SimpleMessage, Motoman Motion Reply (0x7d2), 76 bytes, little-endian
    Prefix
        Packet Length: 72
    Header
        Message Type: Motoman Motion Reply (0x000007d2)
        Communications Type: Service Reply (3)
        Reply Code: Success (1)
    Body
        Robot ID: -1
        Sequence Number: -1
        Command: Joint_Traj_Pt_Full (14)
        Result: Success/True (0)
        Subcode: Unknown (0)
        Data (reserved)
            J0:    0,000000000
            J1:    0,000000000
            J2:    0,000000000
            J3:    0,000000000
            J4:    0,000000000
            J5:    0,000000000
            J6:    0,000000000
            J7:    0,000000000
            J8:    0,000000000
            J9:    0,000000000

If the sequence number were copied into the outgoing ack, we could track packet transmission and acknowledgement more easily.

@riv-robot

riv-robot commented Sep 3, 2021

@gavanderhoorn We are trying to print INFO statements around the ptstreaming_queue_ pop, as implemented here. However, these statements are not appearing on the console. Do we need to set a particular logger to a different level than default?

That's not entirely true: your capture shows some missing TCP segments. I'm not sure this is not a red herring, but fi packet 7390 and 7391 (in your full capture) imply something went wrong during transmission of a packet. That would indicate TCP packet loss.

It happens a few more times.

There are Dup ACKs which follow those missing segments, so in the end the bytes probably were received by the destination host, but it's not something I see often in captures of Simple Message traffic.

I ran a few tests at 120 Hz (stable) and 200 Hz (unstable) and we get the same behaviour, albeit at a lower rate. The queue we are exceeding has a size of 20, and we receive more than 20 TCP missing-segment messages without the queue size error occurring. Perhaps this is a result of running the controller at 4 Hz instead of the default 10 Hz, or some of the other sleeps we've reduced in our branch of the MotoPlus code. "Red herring" may well be the correct terminology, but I expect if we can get the INFO statements to print we'll have a better idea.

@gavanderhoorn
Member

gavanderhoorn commented Sep 3, 2021

We are trying to print INFO statements around the ptstreaming_queue_ pop, as implemented here. However, these statements are not appearing on the console. Do we need to set a particular logger to a different level than default?

I would expect you'd have to configure your .launch file to redirect output to screen.

Or just add --screen to your roslaunch invocation.

I would also suggest using named loggers, instead of the anonymous versions. It would save you a few all-capital sentences, as you could just filter for that specific logger.
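For illustration, the same idea in plain Python's `logging` module (the driver's logging is C++ rosconsole, and the logger name `ptstream_queue` here is made up): a named logger can have its own level and be filtered independently of the rest of the node's output.

```python
import logging

# Sketch: a dedicated, named logger for the queue debugging output. Its
# level can be raised or lowered without touching any other logger.
logging.basicConfig(format="[%(levelname)s] [%(name)s]: %(message)s")
queue_log = logging.getLogger("ptstream_queue")
queue_log.setLevel(logging.DEBUG)

# Filtering is then a matter of the logger name, not of grepping for
# ALL-CAPS markers embedded in the message text.
queue_log.debug("popped point #%d, queue size now %d", 9, 14)
```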

Perhaps this is a result of running the controller at 4 Hz instead of the default 10 Hz

I'm not sure which controller you are referring to here. I'm not aware of anything running "at 10 Hz" by default.


Edit: I'm not sure I understand the changes you linked to. How would you track the nr of elements in the queue over time?

@riv-robot

I'm not sure which controller you are referring to here. I'm not aware of anything running "at 10 Hz" by default.

Typo: I meant 4 ms and 10 ms. And I'm talking about the Yaskawa Motoman YRC1000micro controller. We have the MotoPlus SDK and modified the default CONTROLLER_STATUS_UPDATE_PERIOD from 10 ms to 4 ms.

Thanks for the other .launch screen tip, will test this morning.

@gavanderhoorn
Member

That would be the period then.

@riv-robot

Still having some problems outputting those extra INFO statements. Modified the launch file as per below, but only get two statements on the rosconsole:

[ INFO] [1630665756.395857964] [/yaskawa/motion_streaming_interface]: Starting joint point received. Starting on-the-fly streaming.
[ INFO] [1630665957.986673026] [/yaskawa/motion_streaming_interface]: Point streaming complete,  setting state to IDLE

<node unless="$(arg use_bswap)" name="motion_streaming_interface"
        pkg="motoman_driver" type="motion_streaming_interface" output="screen"/>
</launch>

@gavanderhoorn
Member

I can only guess you're either:

  • not editing the file you believe you're editing, or
  • not rebuilding the driver (correctly) after making those changes

@gavanderhoorn
Member

@robertjbush @DavidAlatorreAA: what's the status here? Have you been able to make any progress?

@riv-robot

riv-robot commented Sep 16, 2021

@gavanderhoorn You beat me to it! I was just in the process of responding today.

We did some refactoring of our code base and we've been running the ROS master on our fastest PC at 150 Hz consistently without problems. At 200 Hz, although we do sometimes get the max queue error, the robot is able to continue moving. Setting real-time scheduling flags on the motion_streaming_interface process does help.
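For reference, the kind of flag we mean can be set from a small wrapper (a Linux-only sketch; priority 50 is an arbitrary choice, and applying SCHED_FIFO requires root or CAP_SYS_NICE, so the function falls back gracefully when unprivileged):

```python
import os

def try_sched_fifo(priority=50):
    """Attempt to switch the current process to the SCHED_FIFO real-time
    scheduling policy. Returns True on success, False if the platform or
    privileges don't allow it."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except (AttributeError, PermissionError, OSError):
        # Non-Linux platform, or insufficient privileges.
        return False

print("SCHED_FIFO applied:", try_sched_fifo())
```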

I was not able to get the extra INFO statements to display. Any help from the community would be appreciated here.

@gavanderhoorn
Member

we've been running the ROS master on our fastest PC

I'm a bit confused: the roscore is not involved in node communication at all.

Setting real-time flags to the motion_streaming_interface does help.

You mean starting the process with SCHED_FIFO or something similar?

@riv-robot

riv-robot commented Sep 16, 2021

I'm a bit confused: the roscore is not involved in node communication at all.

We are also confused as to why this made a difference; but it did (note the motion_streaming_interface node runs on the same PC as the master in our configuration).

You mean starting the process with SCHED_FIFO or something similar?

Correct.

@gavanderhoorn
Member

For now I'll assume that's correlation and not causation then.

Unless you're re-creating your publishers every loop or something similar. Then I could see this having an influence.

@gavanderhoorn
Member

I'm going to close this.

We don't officially support point streaming in the current version of motoman_driver, and from the lack of activity here I assume the initially reported issue has been resolved.

@gavanderhoorn
Member

@rr-robert and @rr-david-alatorre:

Unless you're re-creating your publishers every loop or something similar. Then I could see this having an influence.

could you confirm you're not doing this?

I wouldn't understand what effect it would have on the queue overflowing, but it would give some clue as to why roscore could be involved.
