Large-scale PowerPC recompiler rework #641

Open
wants to merge 64 commits into
base: main
from


Exzap
Member

@Exzap Exzap commented Jan 30, 2023

Disclaimer: This is work-in-progress. I'm opening this draft PR for visibility, so others can track progress and know not to alter recompiler code. Work started on this in November and the ETA for completion is somewhere in the span of the next few months, depending on my motivation.

Goals

I originally started work on the recompiler in 2014 and since then I have learned a lot more about state-of-the-art compiler and IR design. While I'm generally happy with the quality of our code translation, some of the design choices I made along the way make it hard to introduce further optimizations or fixes. A lot of the complexity currently sits in the x86-64 backend, which means that all of it would have to be reimplemented when targeting another architecture.

Overall, the idea is to make both the front-end (PPC to IR) and the back-end (IR to x86-64) as "dumb" as possible so that all the complex logic can be shifted to operate on platform-independent IR, lowering the burden on platform-specific code.
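As a rough illustration of that split, here is a minimal sketch. It is not Cemu's actual IML: every type and function name below (`IrInst`, `translateLi`, `foldConstants`, `emitPseudoAsm`) is hypothetical, and the "backend" only prints pseudo-assembly. The point is that the front-end and back-end are simple 1:1 mappings, while the only non-trivial logic (here, constant folding) works purely on the IR.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, heavily simplified IR. Not Cemu's real IML.
enum class IrOp { LoadImm, Add };

struct IrInst {
    IrOp op;
    int dst;      // destination register, 0..kNumRegs-1
    int srcA;     // first source register (Add only)
    int srcB;     // second source register (Add only)
    int32_t imm;  // immediate value (LoadImm only)
};

constexpr int kNumRegs = 32;

// "Dumb" front-end: each guest instruction maps to exactly one IR instruction.
IrInst translateLi(int rD, int32_t imm) { return {IrOp::LoadImm, rD, 0, 0, imm}; }
IrInst translateAdd(int rD, int rA, int rB) { return {IrOp::Add, rD, rA, rB, 0}; }

// Platform-independent pass: fold an Add of two known constants into a LoadImm.
void foldConstants(std::vector<IrInst>& code) {
    std::array<bool, kNumRegs> known{};     // is the register a known constant?
    std::array<int32_t, kNumRegs> value{};  // its value, if known
    for (IrInst& inst : code) {
        if (inst.op == IrOp::Add && known[inst.srcA] && known[inst.srcB]) {
            // Add in unsigned arithmetic so that overflow wraps around.
            const uint32_t sum = static_cast<uint32_t>(value[inst.srcA]) +
                                 static_cast<uint32_t>(value[inst.srcB]);
            inst = {IrOp::LoadImm, inst.dst, 0, 0, static_cast<int32_t>(sum)};
        }
        known[inst.dst] = (inst.op == IrOp::LoadImm);
        if (known[inst.dst]) value[inst.dst] = inst.imm;
    }
}

// "Dumb" back-end stand-in: one IR instruction becomes one line of pseudo-assembly.
std::string emitPseudoAsm(const IrInst& inst) {
    switch (inst.op) {
    case IrOp::LoadImm:
        return "mov r" + std::to_string(inst.dst) + ", " + std::to_string(inst.imm);
    case IrOp::Add:
        return "add r" + std::to_string(inst.dst) + ", r" + std::to_string(inst.srcA) +
               ", r" + std::to_string(inst.srcB);
    }
    return {};
}
```

Because the folding pass never touches guest or host details, a second backend would only need its own version of the last function.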

State

Please do not report bugs yet. In fact, I don't recommend trying this out; it's an active construction site.

  • Reorganized file and folder structure to be more modular
  • Modernize C-style code to use C++ features where it makes sense
  • Fundamentally rework PPC basic block handler to be more flexible. Support non-continuous functions and potentially allow for complex inlining
  • Support for bool-based jumps and bool registers instead of having PowerPC CR logic embedded into the IR
  • Allow PowerPC CR bits, SPRs and XER carry bit to reside in registers and participate in register allocation
  • Avoid complex instructions in the IR when they could be implemented using basic operations only. The motivation for this is that it actually simplifies optimizations and allows for emitting more efficient code than having a ton of highly-specialized instructions. Ideally 1 IR instruction = 1 host instruction
    • LSWI / STSWI
    • SRAW / SRAWI
    • BDNZ
    • LWARX / STWCX
    • ADDC and other arithmetic instructions with carry
    • DCBZ
    • MFCR / MTCRF
    • RLWIMI
  • Support typed registers. For now everything is either a 32-bit integer or a 2x64-bit paired single register
  • Switch floating-point logic over to the newer register allocator that is currently only used for integer registers
  • Improve register allocator to support target instructions with hardcoded registers (e.g. x86's SHL reg/mem, CL). This is currently handled suboptimally by the final emitter moving registers around whenever such an instruction is encountered
  • Support for calls to native code in arbitrary locations of the IR program. Currently calling external code is done hackily via macro instructions which need per-backend implementation
  • Rework floating-point register handling. This is a big chapter on its own and I'll expand on this once I get to it
  • Optimize! This includes bringing back optimizations lost with the restructuring as well as adding some new ones
    • Added a new dead code elimination pass
    • x86 specific: Conditional jumps will use eflags instead of emulating PPC CRx bits where possible
    • Fix loop detection and move register loads/stores out of loops where possible
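To make the "basic operations only" item more concrete, the following sketch expresses SRAWI (shift right algebraic word immediate) as a handful of simple steps: a shift, an AND, two compares and a boolean AND. This is plain C++ used to illustrate the semantics under my own naming (`SrawiResult`, `srawiFromBasicOps` are hypothetical); the actual IML sequence used by this PR is not shown in this thread.

```cpp
#include <cstdint>

struct SrawiResult {
    uint32_t value;  // result written to the destination register
    bool carry;      // XER carry bit (CA)
};

// sh must be in the range 0..31, as in the SRAWI encoding.
SrawiResult srawiFromBasicOps(uint32_t rs, unsigned sh) {
    const bool isNegative = (rs >> 31) != 0;            // compare: rs < 0 (signed)
    const uint32_t lostMask = (uint32_t{1} << sh) - 1;  // bits shifted out at the low end
    const bool lostOnes = (rs & lostMask) != 0;         // and + compare
    // Arithmetic shift right, written portably as logical shift plus sign fill.
    const uint32_t signFill = isNegative ? ~(0xFFFFFFFFu >> sh) : 0u;
    const uint32_t value = (rs >> sh) | signFill;
    // CA is set only if the source is negative and any 1-bits were shifted out.
    return {value, isNegative && lostOnes};
}
```

Once the carry is just a boolean value produced by ordinary operations, it can live in a register and take part in register allocation and dead code elimination like any other value.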

I know a lot of these are pretty abstract, so in the future I might add a few before-vs-after code examples to this text.
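As one example of what a pass such as dead code elimination does, here is a minimal sketch for a single straight-line block. It assumes a hypothetical instruction record (`Inst`) and a known set of registers that are live at the end of the block; the real pass in this PR works on IML segments and is certainly more involved.

```cpp
#include <set>
#include <vector>

// Hypothetical instruction record, not Cemu's IMLInstruction.
struct Inst {
    int dst;                // register written, or -1 if none
    std::vector<int> srcs;  // registers read
    bool hasSideEffect;     // e.g. memory store or call; never removed
};

// Walks the block backwards, tracking which registers are still needed.
// `live` starts as the set of registers that are live after the block.
std::vector<Inst> eliminateDeadCode(const std::vector<Inst>& block, std::set<int> live) {
    std::vector<Inst> keptReversed;
    for (auto it = block.rbegin(); it != block.rend(); ++it) {
        const bool resultUsed = it->dst >= 0 && live.count(it->dst) > 0;
        if (!it->hasSideEffect && !resultUsed)
            continue;  // nothing observes this instruction: drop it
        if (it->dst >= 0)
            live.erase(it->dst);  // this write satisfies later reads
        live.insert(it->srcs.begin(), it->srcs.end());  // its inputs are now needed
        keptReversed.push_back(*it);
    }
    return std::vector<Inst>(keptReversed.rbegin(), keptReversed.rend());
}
```

This ties in with the CR and carry changes above: a flag that is computed into a register but never read simply disappears, along with the instructions that produced it.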

Q&A

Will this PR add ARM support?

No. But it will make adding a new target architecture a lot easier and if I am motivated enough I'll look into adding an aarch64 backend after this is done.

Will this make Cemu faster?

Maybe? After everything is done the recompiler should output faster code, but CPU execution speed generally isn't a bottleneck in Cemu so it's hard to predict whether there will be an actual difference.

What about the proposed plan to use LLVM?

I did quite a bit of research on that. The biggest downside is that LLVM is still quite JIT-unfriendly and comes with significant bloat. I'm not saying that it wouldn't work, but the cons outweigh the pros in my opinion. Plus we already have a pretty sophisticated recompiler and it would be a waste to throw it away.
On a personal note, I enjoy working on custom solutions more than plugging in libraries, so it's easier for me to stay motivated and make progress. In terms of total effort, both solutions are about the same.

@Wunkolo

Wunkolo commented Jan 30, 2023

What would be the scope of changing the x64 emitter over to something like xbyak?

With the current x64 emitter, adding a new instruction or class of instructions means implementing the encoding for those instructions (REX, VEX, EVEX, ModR/M, SIB, etc.) from scratch, then implementing the particular instruction, and then detecting the relevant CPUID flags. This redundant work can probably just be pushed onto a proven library.
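For a sense of what "implementing the encoding from scratch" involves, here is a minimal sketch that hand-encodes a single instruction form, `add r64, r64`. The helper name `encodeAddReg64` is hypothetical; it is neither Cemu's emitter nor part of Xbyak's API. Every further addressing mode, operand size or prefix family needs similar bit-level care.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes `add dst, src` for two 64-bit general-purpose registers.
// Register numbers use the hardware encoding: rax=0, rcx=1, rdx=2, rbx=3, ..., r8=8, ..., r15=15.
std::vector<uint8_t> encodeAddReg64(unsigned dst, unsigned src) {
    // REX prefix has the bit layout 0100WRXB. W=1 selects 64-bit operand size,
    // R extends ModR/M.reg (src here) and B extends ModR/M.rm (dst here).
    const uint8_t rex =
        static_cast<uint8_t>(0x48 | (((src >> 3) & 1) << 2) | ((dst >> 3) & 1));
    // Opcode 0x01 is ADD r/m64, r64 when REX.W is set.
    const uint8_t opcode = 0x01;
    // ModR/M: mod=11 (register-direct), reg = low 3 bits of src, rm = low 3 bits of dst.
    const uint8_t modrm = static_cast<uint8_t>(0xC0 | ((src & 7) << 3) | (dst & 7));
    return {rex, opcode, modrm};
}
```

For example, `encodeAddReg64(0, 1)` yields the bytes `48 01 C8`, which is `add rax, rcx`.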

@Exzap
Member Author

Exzap commented Jan 30, 2023

Thanks for pointing out Xbyak, I wasn't aware of it. The assemblers I looked at were always a bit overkill for our purposes, usually focusing on a human-friendly API rather than a simple interface for machine-generated code. We only need a very thin emitter, and Xbyak seems to be exactly that.

As part of this rework I also started a new "cleaner" x86-64 high-performance emitter which I auto-generate from encoding tables. The effort for this is relatively minimal, but using a premade emitter would certainly cut down the effort even further. I'll think about it.

@amayra

amayra commented May 16, 2023

Did you drop this project?

@Exzap
Member Author

Exzap commented May 17, 2023

Nah just busy with other stuff. I'll get back to this eventually

@jcrm1 jcrm1 mentioned this pull request Sep 26, 2023
@iMonZ

iMonZ commented Sep 26, 2023

Nah just busy with other stuff. I'll get back to this eventually

Thanks! ARM64 Support would make the CEMU emulator finally done and future proof!

@Wunkolo

Wunkolo commented Sep 26, 2023

On ARM64: I've been using oaknut on other projects. It is structured very similarly to xbyak.

@Gabezin64

This will finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?

@Exzap
Member Author

Exzap commented Oct 13, 2023

This will finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?

That's a graphical issue. It's unaffected by this CPU rework.

Exzap added 16 commits August 30, 2024 00:47
Intermediate commit while I'm still fixing things but I didn't want to pile on too many changes in a single commit.
New:
Reworked PPC->IML converter to first create a graph of basic blocks and then turn those into IML segment(s). This was mainly done to decouple IML design from having PPC-specific knowledge like branch target addresses. The previous design also couldn't preserve cycle counting properly in all cases since it was based on IML instruction counting.
The new solution supports functions with a non-continuous body. A pretty common example for this is when functions end with a trailing B instruction to some other place.
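The idea of building a block graph first, including for functions whose body is not contiguous, can be sketched as follows. This is an illustration under simplifying assumptions, not the converter from this PR: `GuestInst`, `BasicBlock` and `buildBlockGraph` are hypothetical names, instructions are reduced to four kinds, and every instruction is 4 bytes. Branch targets outside the known code are recorded as successors but do not get a block of their own.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

enum class Kind { Other, Branch, CondBranch, Return };

struct GuestInst {
    Kind kind;
    uint32_t target;  // destination for Branch / CondBranch, otherwise unused
};

struct BasicBlock {
    uint32_t start;                    // address of the first instruction
    uint32_t end;                      // address of the last instruction (inclusive)
    std::vector<uint32_t> successors;  // start addresses of possible next blocks
};

std::map<uint32_t, BasicBlock> buildBlockGraph(const std::map<uint32_t, GuestInst>& code,
                                               uint32_t entry) {
    // Pass 1: follow control flow from the entry point and record block leaders.
    std::set<uint32_t> visited, leaders{entry};
    std::vector<uint32_t> worklist{entry};
    while (!worklist.empty()) {
        uint32_t addr = worklist.back();
        worklist.pop_back();
        while (code.count(addr) && visited.insert(addr).second) {
            const GuestInst& inst = code.at(addr);
            if (inst.kind == Kind::Branch || inst.kind == Kind::CondBranch) {
                leaders.insert(inst.target);  // may be far away from the entry
                worklist.push_back(inst.target);
            }
            if (inst.kind == Kind::CondBranch)
                leaders.insert(addr + 4);  // the fall-through path starts a new block
            if (inst.kind == Kind::Branch || inst.kind == Kind::Return)
                break;  // no fall-through
            addr += 4;
        }
    }
    // Pass 2: cut the visited instructions into blocks at leaders and terminators.
    std::map<uint32_t, BasicBlock> blocks;
    for (uint32_t start : leaders) {
        if (!visited.count(start))
            continue;  // target lies outside the known code
        BasicBlock bb{start, start, {}};
        for (uint32_t addr = start;; addr += 4) {
            const GuestInst& inst = code.at(addr);
            bb.end = addr;
            if (inst.kind == Kind::Branch || inst.kind == Kind::CondBranch)
                bb.successors.push_back(inst.target);
            if (inst.kind == Kind::Branch || inst.kind == Kind::Return)
                break;
            const uint32_t next = addr + 4;
            if (inst.kind == Kind::CondBranch || leaders.count(next)) {
                if (visited.count(next))
                    bb.successors.push_back(next);
                break;
            }
            if (!visited.count(next))
                break;  // ran off the end of the known code
        }
        blocks.emplace(start, bb);
    }
    return blocks;
}
```

Since blocks are keyed by address and linked only through `successors`, a function that ends with a trailing B to a distant address simply gains another block; nothing requires the blocks to be adjacent in memory.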

Current limitations:
- BL inlining not implemented
- MFTB not implemented
- BCCTR and BCLR are only partially implemented

Undo vcpkg change
Instead of having fixed macros for BCCTR/BCCTRL/BCLR/BCLRL we now have only one single macro instruction that takes the jump destination as a register parameter.
This also allows us to reuse an already loaded LR register (by something like MTLR) instead of loading it again from memory.

As a necessary requirement for this: The register allocator now has support for read operations in suffix instructions
Also removed associatedPPCAddress field from IMLInstruction as it's no longer used
@Exzap Exzap force-pushed the jit-work branch 2 times, most recently from de1a45e to a52e39d Compare October 27, 2024 13:42
@Exzap
Member Author

Exzap commented Oct 29, 2024

I consider this PR complete. There is more work that can be done but it's at a good point to merge so let's do that.
But first it would be nice to get some feedback. Anyone interested in testing this, please grab the executable from GitHub Actions and let me know about any issues.

Here is a benchmark. Previous PPC JIT in Cemu 2.2:
image

The reworked PPC JIT from this PR:
image
(Lower numbers are better.) While some tasks in the benchmark are much faster, real-world performance will probably remain largely the same since CPU emulation never really was a bottleneck for us.

There have also been some general accuracy improvements and the top post has all the under-the-hood changes that were made.

@Exzap Exzap marked this pull request as ready for review October 29, 2024 01:32
@boggydigital

boggydigital commented Oct 29, 2024

@Exzap I've tried the macOS build and it crashed upon loading pipelines in every single game I've tried.

If that helps - I've confirmed that this doesn't happen on 2.2.

@Exzap
Member Author

Exzap commented Oct 29, 2024

@boggydigital Can you post the log for other games as well?

@goeiecool9999
Collaborator

It's the same story on Linux. The point where the access violation happens varies but most of the time it happens at the jump instruction in this screenshot (same behaviour in different games too). Hopefully this gives you a decent clue.
image

@Exzap
Member Author

Exzap commented Oct 29, 2024

I was able to get it to crash by turning off the BMI2 extension. Unsure if it's directly related to your crashes, but we will see. Working on a fix.

@Exzap
Member Author

Exzap commented Oct 29, 2024

Can you grab the latest build and check again, @goeiecool9999 @boggydigital?

@goeiecool9999
Collaborator

No change. It crashes in the same spot.

@boggydigital

boggydigital commented Oct 29, 2024

Tried the latest build. It crashes for me as well.

@goeiecool9999
Collaborator

That fixes it 🥳

@boggydigital

Likewise, I can't repro the crash in any of the ~10 titles I've tried. Thank you @Exzap!

@Ammar-Sadaoui

what is a real bottleneck here if CPU emulation was not the problem ?

@Exzap
Member Author

Exzap commented Oct 31, 2024

what is a real bottleneck here if CPU emulation was not the problem ?

It differs by game, but for the more graphically complex games it's usually the GPU command processor.
Discussing this at full length is outside the scope of this PR, but if you are curious about Cemu's architecture, the best place to learn more is our Discord, where we have discussions about these things and anyone can ask questions.

@squidbus
Contributor

squidbus commented Nov 3, 2024

Tested on macOS with most first party titles and didn't encounter any issues compared to main.

@mkrcos

mkrcos commented Nov 15, 2024

Tested on Linux and most games run great. The only issue that I've found is in Mario Kart 8: it gets stuck on the loading screen after finishing the first race.

@SirHrVedel

SirHrVedel commented Nov 15, 2024

Runs great on Windows with most titles I've tried. The only issues I've noticed are that loading gets stuck in Mario Kart 8 when finishing the first race in a cup, and that Wii Party U crashes when attempting to go into any minigame. (There is no stack trace in the log, but the crash also happens on the stable 2.X builds.)
