Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add synchronization for multicore #782

Closed
wants to merge 2 commits into from
Closed

Conversation

soyersoyer
Copy link
Contributor

There is currently no memory synchronization between the processor cores, which could cause problems in theory. I added memory barriers and atomic access to the cores' status.

I also suspend the cores when waiting. I thought this would make the PI a little cooler, but I haven't been able to measure it. Can someone measure it with a USB power meter?

Maybe not all of these changes are necessary. I'll get to know the barriers better.

Copy link

github-actions bot commented Jan 3, 2025

Build for testing:
MiniDexed_2025-01-03-0b26d6c
Use at your own risk.

@probonopd
Copy link
Owner

Thanks @soyersoyer.
@rsta2, maybe you could have a quick glance at these changes to advise us whether we are on the right track here? Thank you very much!

Copy link

github-actions bot commented Jan 3, 2025

Build for testing:
MiniDexed_2025-01-03-40e62e2
Use at your own risk.

Copy link

github-actions bot commented Jan 6, 2025

Build for testing:
MiniDexed_2025-01-06-9e9d74e
Use at your own risk.

@soyersoyer
Copy link
Contributor Author

Barriers are not needed because they are already in the Acquire () / Release () of the spinlock of the CDexedAdapter getSamples () function.

Volatile variables are written and read with STR/LDR ARM instructions, which are atomic, and it would be enough if there were no other m_nFramesToProcess variable, for which there is no guarantee that its new value would be available to the other cores sooner than the new value of m_CoreStatus. If I understand correctly?
For std::atomic, the STLR/LDAR instruction is used for release/acquire and seq_cst (default) modes, which already ensures that the new value of m_nFrameToProcess is also available.

If volatiles remain, I think that another solution could be to omit the nFrames variable in ProcessSound (use m_nFramesToProcess instead)?
Or pass n_samples as a pointer to CDexedAdapter::getSamples and dereference it after m_SpinLock.Acquire () (DataMemoryBarrier)?

The other thing I don't know is whether Core1 can wait indefinitely if m_CoreStatus[nCore] != CoreStatusIdle is read, then it starts waiting for the interrupt, but between the read and the wait for the interrupt, m_CoreStatus changes (by Core2) and it has already received its Interrupt. Or, how can this be ensured so that it doesn't happen.

I haven't been able to measure yet whether it really consumes less this way and whether it's really worth it, I'll let you know when the meter arrives.

@diyelectromusic
Copy link
Collaborator

I wasn't aware of any issues in this area - do you have something specific you were attempting to fix?

I seem to recall that m_CoreStatus is only ever written by one core and read by another, so just waiting on it ought to be fine. In theory it might be possible for the compiler to attempt to optimise out a read if it isn't obvious it is being changed, or for there to be some kind of cache coherency issue between cores. But it is already declared volatile for exactly this reason, and most multi-core CPUs are fully aware of the problems of cache coherency, so I'm not aware of any issues there either.

As you say, areas where synchronisation could be a problem are already protected.

As a general point, I'm getting slightly nervous that there seem to be a lot of changes happening here and there with no specific issue to solve and as the original developers have since moved on from the project I can't help feeling that we are introducing additional risk as a consequence. I know code bases ought to evolve, but there is also danger of negating the significant testing that has already happened with the codebase to date if we introduce too many other changes without a significant reason.

Kevin

@soyersoyer
Copy link
Contributor Author

Having a core continuously read a variable is not the most sophisticated solution. It generates unnecessary heat, can drain a battery faster if the Pi is running on it, and the DAC output can be noisier due to the extra heat. It would be nice to find a stable solution so that it can go to suspend while it waits.

And then I saw that volatile is used for synchronization, which isn't exactly what it's for, and I thought I'd get around it better. This is a draft, by the way.

m_CoreStatus is good anyway because of its volatile type, because ARM guarantees that its value will be the same on all cores at the same time. But since processors can rearrange memory writes, there is no guarantee (with volatile) that TG3 or TG6 will not use the previous value of m_nFramesToProcess. (Or at least I don't see it)
But the chance of this is very small, for example, there are several function calls until then, so the cores must have time to synchronize the variable. And I couldn't get the m_nFramesToProcess value to be anything other than 128 on pi4,pi3+. (except for 256 at the beginning).

I also tried running TG3 and TG6 with smaller chunk sizes while the others worked with the default, but I couldn't hear the difference.

So in real life, I don't think it will cause any audible problems with volatile either.

Although if there is a performance or stress test that can make a 3b+ sweat, I would be happy to try it.

@soyersoyer soyersoyer closed this Jan 10, 2025
@diyelectromusic
Copy link
Collaborator

diyelectromusic commented Jan 10, 2025

It would be nice to find a stable solution so that it can go to suspend while it waits.

Maybe. And this might be something that is more an issue with a Pi 5. Certainly for Pi 1,2,3 I think the processing time was pretty maxed out processing frames anyway, so there probably wouldn't be much waiting time for suspending.

With a Pi 4 and 16 TGs that might also be the case, but I don't believe we have any metrics to know if there is any idle time in the system.

The latest performance with a Pi 5 could be more significant, but again, I don't know. Maybe we need some measures of idle times on the different Pis to see if there is anything better than essentially just "spinning".

m_CoreStatus is good anyway because of its volatile type, because ARM guarantees that its value will be the same on all cores at the same time. But since processors can rearrange memory writes, there is no guarantee (with volatile) that TG3 or TG6 will not use the previous value of m_nFramesToProcess. (Or at least I don't see it) But the chance of this is very small, for example, there are several function calls until then, so the cores must have time to synchronize the variable. And I couldn't get the m_nFramesToProcess value to be anything other than 128 on pi4,pi3+. (except for 256 at the beginning).

Yes, I don't know. I can only imagine that is assumed because all cores are kicked off processing frames at the same time, so m_CoreStatus ought to be idle when the number of frames is set? And it is probably quite likely that the number of frames might always be the same - I haven't looked into the sound driver to see how many requested frames come back from GetQueueFramesAvail().

These are questions for Rene really, which is why I'm not a fan of changing these things unless we know there is a problem that needs fixing.

Kevin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants