RFC: uspace: Let the user choose the CPU affinity per hal thread #2514

Open · wants to merge 1 commit into base: master
Conversation

@mbuesch (Contributor) commented Jun 2, 2023

This is the patch I currently use to set the CPU affinity per hal thread.

I want to select different CPUs for the different threads.

I'm not convinced that mapping via priority is the right approach, because it is not very intuitive, but it was the easiest to implement for now. That's why this is an RFC: I'd like to hear your general opinion on the topic before implementing it another way.

The way this works is that threads are assigned descending priorities, and for each priority a different CPU can be selected (if desired) by specifying the CPU number in the corresponding environment variable (e.g. RTAPI_CPU_NUMBER_PRIO98=1 runs the thread with priority 98 on CPU 1).
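For illustration, here is a minimal sketch of how such a lookup could work; the environment variable name follows this PR, but the helper function and its surroundings are hypothetical, not the actual patch:

```c
/* Hypothetical sketch only -- not the code from this PR. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_thread_for_prio(int prio)
{
    char name[64];
    const char *val;
    cpu_set_t mask;
    int cpu;

    /* Build the variable name for this thread's priority,
     * e.g. RTAPI_CPU_NUMBER_PRIO98 for priority 98. */
    snprintf(name, sizeof(name), "RTAPI_CPU_NUMBER_PRIO%d", prio);

    val = getenv(name);
    if (!val)
        return; /* No affinity requested for this priority. */

    cpu = atoi(val);
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    /* Pin the calling thread to the requested CPU. */
    if (pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask))
        perror("pthread_setaffinity_np");
}
```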

@petterreinholdtsen (Collaborator)

Can you provide some insight into why this is useful? I suspect knowing this might make it easier to comment on the approach used and other alternatives.

@mbuesch (Contributor, Author) commented Jun 2, 2023

This is especially useful when a single CPU is not fast enough to handle all RT tasks.
In my case I want to direct one specific task (the base thread) to an isolated CPU, while all the other threads run elsewhere.

@mbuesch changed the title from "RFC: uspace: Let the user choose the CPU affinity per process" to "RFC: uspace: Let the user choose the CPU affinity per hal thread" on Jun 3, 2023
@rodw-au (Contributor) commented Jun 5, 2023

I think this should be extended to setting the affinity for the NIC used to connect to any Ethernet hardware. It has been reported (and also recommended by the RT kernel developers) that the NIC should share the core that is isolated for PREEMPT_RT.

@mbuesch (Contributor, Author) commented Jun 5, 2023

Can you provide a link to more information on the NIC affinity topic? I don't know what this means.

Please also note that this PR does not add the ability to configure thread CPU affinity; that already exists. It merely extends the existing mechanism to support a different CPU affinity per thread, instead of only a single selectable CPU to which all threads are directed.

@rodw-au (Contributor) commented Jun 5, 2023

NIC = network interface card.
Where Ethernet-connected hardware (like a Mesa card) is used, some NIC drivers in recent kernels can cause excessive network latency. This is quite separate from the normal latency measured with latency-test. If the network latency persistently exceeds the servo thread period, the Mesa hardware will issue an "error finishing read" and disable further communication with LinuxCNC.

We have had some discussions with the RT kernel team, and it has been suggested that the NIC interrupt should be moved to the isolated core. @pcw-mesa will know more, as he stated: "My test systems were all Intel CPUs with 4 cores, isolcpus=3 and the Ethernet IRQ pinned to CPU3."

It would be good if this could be done without the user having to get involved in low-level Linux configuration.

@mbuesch (Contributor, Author) commented Jun 5, 2023

Ok. I don't think LinuxCNC is the right place to modify the Ethernet IRQ.
There is the proc interface for modifying IRQ affinity, and irqbalance as a higher-level IRQ manager.
How would LinuxCNC even know where to pin which interrupt, unless you configured it in some LinuxCNC configuration? The user can just configure it via the proc interface or irqbalance instead.
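For reference, a minimal sketch of what the proc interface amounts to (the IRQ number is a placeholder; the NIC's actual IRQ must be looked up in /proc/interrupts, and writing requires root):

```c
#include <stdio.h>

/* Pin an IRQ to a single CPU via /proc/irq/<irq>/smp_affinity_list. */
static int pin_irq_to_cpu(unsigned int irq, unsigned int cpu)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity_list", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%u\n", cpu);  /* e.g. "3" pins the IRQ to CPU 3 */
    return fclose(f);
}
```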

I don't see why LinuxCNC should be involved here, except for maybe some configuration hint in the documentation of such card drivers.

And this has basically nothing to do with this PR. :)

@andypugh (Collaborator) commented Jun 6, 2023

> And this has basically nothing to do with this PR. :)

I suspect that you might be missing the point that we run an Ethernet driver in the realtime thread if we are using a Mesa Ethernet card. I think that controlling the Ethernet IRQ affinity is very much part of the bigger picture in this case.

@mbuesch (Contributor, Author) commented Jun 6, 2023

Can we please not mix up IRQ and thread CPU affinity? I do not see how that has anything to do with the change proposed in this PR.
I do agree that there must be an IRQ affinity configuration somewhere. But that is a topic that can be discussed and implemented separately, if needed.

@rodw-au (Contributor) commented Jun 6, 2023

There are command-line tools that allow you to query the interrupts in use and set their affinity, but it's not a user-friendly process.

Perhaps the hm2_eth driver could be modified to query the core its thread is running on and adjust the affinity of the NIC it uses to match the thread defined here. This PR could help by recording, somewhere in the LinuxCNC environment, which CPU core each thread it creates runs on.
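As a purely hypothetical sketch of that idea (not part of this PR), the driver could ask which core its realtime thread ended up on and then move the NIC IRQ to the same core, e.g. via the proc interface sketched earlier:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: return the core the calling (already pinned)
 * realtime thread is running on, so the driver could move the NIC IRQ
 * to that same core. */
static int current_rt_cpu(void)
{
    return sched_getcpu();
}
```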

In the latest kernels (5.10 and later), network latency is a major issue on many computers. This was never an issue with Debian Buster (4.19 kernel). Debian Bookworm will be released in 3 days with the 6.1 kernel, and linuxcnc-uspace is in its repositories, so everything that can be done to reduce network latency will help reduce user support issues.

So while it may or may not be related to this issue, IRQ affinity of the NIC is a very real issue facing this project. Some of us have been battling to understand it for over 12 months!

@pcw-mesa (Collaborator) commented Jun 6, 2023

This is not just an issue with Hostmot2 but with any Ethernet-connected device that uses the standard network stack (like ColorCNC, Remora/NVEM, etc.). That said, it may be better addressed by a setup script that takes an Ethernet device name and does both the network IP address setup and the IRQ affinity configuration.

@SebKuzminsky (Collaborator)

This PR breaks the rate-monotonic scheduling (RMS) promise we make here: http://linuxcnc.org/docs/2.9/html/man/man3/hal_create_thread.3hal.html#DESCRIPTION

Maybe that's ok, but we should talk about it before changing the behavior of LinuxCNC in this way.

RMS means that higher-priority threads may interrupt lower-priority threads, but lower-priority threads may not interrupt higher-priority threads. In LinuxCNC, this means the base thread may interrupt the servo thread, but the servo thread may not interrupt the base thread.

This is probably mostly important for LinuxCNC's split-thread components, i.e. components that need functions in both the base thread and the servo thread, such as stepgen, pwmgen and encoder. The base-thread functions of these components can currently depend on information from the servo thread not changing while they're running. What are the effects if we remove that guarantee, like this PR does? I'm not sure, but we should figure that out before making this change.
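To make the concern concrete, here is a simplified sketch of the split-thread pattern (the names are illustrative, not the actual stepgen code): the servo-thread function writes a pair of related values, and the base-thread function reads them, implicitly relying on RMS for consistency:

```c
/* Illustrative sketch only -- not the actual stepgen/pwmgen code. */
struct shared_state {
    double target_pos;  /* written by the servo-thread function */
    double max_vel;     /* written by the servo-thread function */
};

static struct shared_state state;

/* Servo thread (slow, lower priority under RMS): may be interrupted by
 * the base thread between these two writes. */
void servo_update(double pos, double vel)
{
    state.target_pos = pos;
    state.max_vel = vel;
}

/* Base thread (fast, higher priority under RMS): with all threads on one
 * CPU it can never be preempted by servo_update(), so the pair it reads
 * cannot change while this function runs. With threads on separate CPUs
 * that assumption no longer holds. */
void base_step(void)
{
    double pos = state.target_pos;
    double vel = state.max_vel;
    /* ... generate steps toward pos, limited by vel ... */
    (void)pos;
    (void)vel;
}
```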

(I have independently been working on a different change to uspace scheduling that also moves different threads to different CPUs, see https://github.com/LinuxCNC/linuxcnc/tree/busywait8. I've been holding it back partially for this RMS reason, so I am also interested in finding the answer to this question.)

@mbuesch (Contributor, Author) commented Jun 7, 2023

Thank you for your comment @SebKuzminsky .
That's exactly the kind of problem I wanted to learn about.

I was not aware that we made such a guarantee in LinuxCNC.
I agree that this needs at least a careful audit of the multi-thread drivers.
And I think we will need some kind of multiprocessor memory-barrier mechanism as well.

Many years ago I played around with Machinekit (disclaimer: my knowledge of it is therefore not up to date).
They had a(nother) way to create threads on different CPUs.
But they had also completely changed the way HAL communication works, adding SMP synchronization. That feature completely killed performance for me, which is where I had to drop my Machinekit experiments. Maybe they are doing better now? I don't know.

But maybe we can find a lightweight synchronization mechanism that doesn't kill performance (e.g. one barrier per task entry/exit, plus additional barriers where needed, but not on every HAL signal access).
Of course this needs a careful audit of all drivers.
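As a rough illustration of the "barrier per task entry/exit" idea (a sketch only, not a proposal for the actual rtapi code):

```c
#include <stdatomic.h>

/* Sketch: one fence when a thread's function list starts executing and
 * one when it finishes, instead of a fence on every HAL signal access.
 * The exit fence orders the functions' stores before whatever the thread
 * publishes afterwards; the entry fence orders subsequent loads after it. */
static inline void hal_funct_entry_barrier(void)
{
    atomic_thread_fence(memory_order_acquire);
}

static inline void hal_funct_exit_barrier(void)
{
    atomic_thread_fence(memory_order_release);
}
```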

@SebKuzminsky (Collaborator)

The current situation, with all realtime threads sharing one CPU using rate-monotonic scheduling, is this:

  1. A fast thread and a slow thread share a bunch of variables.
  2. The slow thread will never interrupt the fast thread.
  3. The fast thread can interrupt the slow thread, potentially several times during a single invocation of the slow thread.

Therefore:

  1. The fast thread will never see the shared variables change while it's running (because the slow thread can't interrupt it and change the variables).
  2. The fast thread may see a snapshot of the variables where some but not all of them have been updated (because the fast thread interrupted the slow thread while it was in the middle of updating the variables).
  3. The slow thread may see the variables change (perhaps several times) during its invocation, as the fast thread interrupts the slow thread and writes to the variables.

Is that correct? Does it adequately describe the current situation?

This... doesn't seem ideal. But apparently it works well enough that our users don't report bugs about it.

I think an ideal solution would have the following properties:

  1. A set of variables needs to be shared between different threads.
  2. Updates (writes) to these variables should be atomic, even when multiple variables are updated, so the reader either sees all the changes or none of the changes, they never see a "partial write" where some but not all variables have been updated.
  3. Realtime threads may block, but not for any significant amount of time.
  4. The solution must work both for the current setup, where all realtime threads share a single CPU using RMS, and for the alternate setup proposed by this PR (and by the busywait branch), where realtime threads run concurrently on separate CPUs.

This problem statement is drawn both from the multi-thread components like stepgen/pwmgen/encoder mentioned above, and from #2386.

Do we agree on that problem statement?
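One lightweight mechanism that might satisfy properties 2-4 is a single-writer sequence counter (seqlock) around each group of shared variables; a minimal, simplified sketch (a real HAL implementation would need more care, e.g. around the non-atomic data fields):

```c
#include <stdatomic.h>

/* The writer makes the counter odd while updating and even when done;
 * readers retry until they see an even, unchanged counter, so they never
 * act on a half-updated group of variables. The writer never blocks. */
struct seq_shared {
    atomic_uint seq;
    double target_pos;
    double max_vel;
};

static void seq_write(struct seq_shared *s, double pos, double vel)
{
    unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed); /* odd: update in progress */
    atomic_thread_fence(memory_order_release);
    s->target_pos = pos;
    s->max_vel = vel;
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release); /* even: update complete */
}

static void seq_read(struct seq_shared *s, double *pos, double *vel)
{
    unsigned int begin, end;

    do {
        begin = atomic_load_explicit(&s->seq, memory_order_acquire);
        *pos = s->target_pos;
        *vel = s->max_vel;
        atomic_thread_fence(memory_order_acquire);
        end = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((begin & 1u) || begin != end);
}
```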

@gmoccapy (Collaborator) commented Oct 7, 2023

The Stuttgart meeting would like to know how this relates to @SebKuzminsky's busywait branch.
