-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Logic for checking for empty or preemptable tx mailbox condition seems incorrect #57
Comments
Good catch! I agree that the id check should be reversed. bxcan/testsuite/tests/integration.rs Line 359 in 3fc7a0e
|
Ok, thanks for confirming! I'll draft up a PR to get this fixed, and add a test to ensure that code path is covered, unless you're already looking at it. |
Ok, looking through the transmit implementation as a whole, there are a few other things I noticed that look like issues:
Before I go "fixing" what I think are issues, I'd like someone to confirm how this was intended to work. |
Your concerns regarding the current implementation are valid, another pair of eyes is always welcome. As a reference: the current implementation takes inspiration from https://github.com/OpenCyphal-Garage/platform_specific_components/tree/master/stm32/libcanard/bxcan. I have not looked into the code again but it would be interesting to understand if e.g. the linked code or the zephyr driver have the same shortcomings. |
Thanks for the reply. I did some reading on the subject from the Cyphal document you linked, a conversation on the matter in a Zephyr issue and the PR to fix the priority inversion issue in Zephyr, and a couple of different papers, including the implementation of CANOpen in VHDL by the European Space Agency. For the sake of the rest of this discussion, I'm going to reference information on the bxCAN peripheral specifically, by page number within the STM32L432KC reference manual rev4, although all SKUs that use the same version of the bxCAN peripheral should be the same. From what I've gathered this is not a simple situation to work around generically; it largely depends on bus load, application needs, and the implementation of the hardware peripheral. In cases where the CAN peripheral does not implement priority based transmission, but has multiple TX slots (like the MCP2515 that uses assignable priorities to each slot instead of using the frame ID to determine priority) priority inversion avoidance from the perspective of a single node must be managed entirely by the application and driver code, however that isn't the case with ST's bxCAN peripheral. For the remainder of this discussion, I will be referring to only the bxCAN peripheral, it's documented behaviour and how it applies to the concept of priority inversion. For the case of ST's bxCAN peripheral, in the default reset state where the TXFP bit is not set in the MCR (page 1497), the hardware will avoid priority inversion within it's own TX mailboxes by always transmitting in order of ID priority, as opposed to FIFO order when the TXFP is set (Behaviour described on pages 1482-1483). So, in the simplest case, as long as there is a mailbox available, and the peripheral is in ID priority mode (the default), queuing a frame of lower priority into an empty mailbox when there are higher priority frames already enqueued in other mailboxes is safe from priority inversion. In fact, I'd argue that the current implementation of TX queuing in this library is actually very detrimental to bus throughput and timing, as well as creates the possibility of priority inversion from a bus-wide perspective, instead of just from the perspective of a single node. For example, if we have a bus with multiple nodes, In the case where node "A" (running an application using this driver) attempts to transmit two frames of IDs 0xA and 0xB, however both frames are constructed of asynchronously received data from other devices/buses on connected to the processor. In this case it is indeterminate which of the two frames will be prepared by the application first. In the case where the higher priority frame (0xA) is first, it will be placed into a TX mailbox and await transmission on the bus, however if that frame has not been transmitted yet, or worse, is in the process of being transmitted (as per the state chart on page 1483, there is no way to distinguish between a mailbox that is pending, scheduled [highest priority mailbox in the peripheral], or currently being transmitted) while the application attempts to enqueue the lower priority frame (0xB), this driver will reject the queueing of the message, forcing the application to holdback the message and attempt retransmission once the peripheral has signalled that it has successfully transmitted the higher priority frame (0xA), even if there are two empty TX mailboxes. The issue here is that by the time the application re-attempts queuing the frame, accounting for the latency of either the irq firing or the runloop reaching the task that triggered the send again, the driver rechecking the mailboxes and re-enqueuing the lower priority frame, it's likely that the three inter-frame spacing bits have elapsed on the bus and the lower priority frame will have missed a potential opportunity for transmission. In this scenario we may have also created a bus-wide priority inversion, as a lower priority frame, say ID 0xCC, from another node may have been transmitted in the frame slot, instead of the higher priority frame from the node which had it's frame rejected from queuing by this driver's code. In the more complex scenario where the three TX mailboxes are already holding pending frames, let's say IDs 0xA, 0xB and 0xC, from previous application events, but due to bus loading are currently awaiting transmission, and the application attempts to schedule a fourth frame with Id 0x9. In this case, if the driver simply rejected the frame because the mailboxes were full, a priority inversion would be created, however, the bxCAN peripheral provides nice facilities for handling this too. In this case, to prevent priority inversion, the driver should use the CODE field from the CAN_TSR register (page 1498) to determine the lowest priority message currently enqueued (in this example ID 0xC), dequeue the message (read + abort) and replace it with the new higher priority frame (0x9), returning the dequeued frame to the application for appropriate handling. This scenario however, depicts the only opportunity I can think of for a priority inversion within the node. In a more generalized scenario, it would occur where the hardware selects and begins to arbitrate, or has already begun arbitration and/or transmission frame 0xA while the driver is attempting to dequeue frame 0xC and queue 0x9. From my perspective this outcome is both acceptable and not avoidable without creating a bottleneck within the node that has the potential to reduce bus and node throughput. Due the the asynchronous nature of CAN, along with the uncertainty of execution timing of the application and bus loading, it's not possible to guarantee in the driver-space that when the TX mailboxes are full there is no chance of a request to queue a higher priority message from the application, so even in the current implementation where the driver refuses to queue a frame with a lower priority ID than any already in TX mailboxes, it's possible for the same situation to arise if the application queues frames with IDs 0xC, 0xB, 0xA, then 0x9. The only way I know of to avoid this would be to only use a single TX mailbox and maintain a software queue of messages, ordered by priority, loading the next frame into the TX mailbox after the hardware has indicated it has successfully transmitted the last loaded frame. However, this solution comes with a potential penalty on bus throughput, if the driver/application aren't able to load the next frame into the TX mailbox fast enough for it to be scheduled by the peripheral to be arbitrated in the next frame slot, which is likely, as the TX mailbox in use is immutable while the frame is being transmitted, and the irq for transmission doesn't fire until the frame has completed transmission and has been acknowledged on the bus, leaving the 3 inter-frame spacing bits for the new frame to be queued by the driver/application and prepared/scheduled by the peripheral, which at the modest bit rate of 250Kbps amounts to 12µs. To summarize, my suggestions for adjusting the current implementation are as follows:
|
Thank you for going into depth and documenting everything so cleanly! Summarizing I understand that the bxCAN peripheral can be used in two modes:
I highlighted the last entry because I think it is the biggest shortcoming of the priority mode: Reordering can occur for messages with the same ID. Depending on the higher level protocol that might be OK or not. Nevertheless I think your proposal sounds good and it would simplify the implementation with the benefit that the user has more control by choosing the desired mode. |
I really appreciate the time you're taking to talk this through with me! I honestly hadn't consulted the embedded traits to see what its requirements are. Reading the non-blocking trait documentation, it seems to vaguely suggest that the driver should always be in priority mode, and only be applying FIFO ordering when there are messages with the same priority, although it isn't explicitly stated. I see in some of the original PRs to the embedded-hal project, you either authored them or were involved in some way. Can you clarify what the intent with the current trait API surface is in that crate? I definitely like the idea of keeping it simple, and using FIFO for the embedded-can traits and enabling the choice between FIFO and priority based within the driver APIs. Although, now that I've had some time to digest this all, I have a couple of additional thoughts:
Result: Driver dequeues MB 2 and inserts new frame with ID 0x8
Result: Driver returns WouldBlock, as no MB higher than 1 is free or is pre-temptable
Result: Driver returns WouldBlock, as no MB higher than 2 is free or is pre-temptable Thoughts? |
The idea was to minimize priority inversion while allowing users of the API to do segmented transfers.
Could you open a issue on the
To me that looks like a good solution as compromise for the (current definition of) embedded-can traits. As I type this response another idea came to me (probaly even more complex, I‘m not convinced its a good idea):
|
I've opened an issue on the embedded-hal repo to see if there is any appetite to adjust the traits defined there. I do feel like I should clarify one point on the compromise I listed in my last comment, as it may not immediately be clear; the reasoning for only queuing frames while in FIFO mode with the same priority as an already queued frame iff there is a free/preempt-able mailbox with a higher number is we have no way of telling if the hardware is already trying to transmit the previous frame with the same ID. Even in the case where the frame we are concerned with isn't the lowest priority frame queued, it may have been when the peripheral was selecting the next frame to arbitrate.
I can see where this could be the right solution in some applications. However, I suspect the right solution is to have adjustments made to the embedded-can trait such that there can be a distinction between FIFO and ID priority modes. Otherwise it appears there will always be a compromise on bus throughput and/or usability when implementing these traits, and across different hardware peripherals the trade-offs needed to implement the behaviour described are certainly not the same. |
Another thing that might be worth looking into is the blocking Also, the current API is really weird. With it, the current function may end up holding a copy of a packet some unrelated part of the system has enqueued and then having to decide what to do with it. Sometimes it would be better to be able to only transmit if there is currently space in an output mailbox. |
The condition here:
bxcan/src/lib.rs
Line 848 in 3fc7a0e
should return an Ok if the mailbox is empty or if it is not transmitted and has a lower priority, however the condition seems to return WouldBlock in the case that the message is still pending, but the Id is greater than or equal (same or lower priority).
Should this be reversed to be
&& id > IdReg::from_register(tir.bits())
so it only returns WouldBlock in the case that the requested id is higher (lower priority) than the existing pending frame?The text was updated successfully, but these errors were encountered: