SIMDe #12

Open
AndreyTykhonov opened this issue Dec 15, 2020 · 42 comments

@AndreyTykhonov

https://github.com/simd-everywhere/simde

This project looks promising. I tried to add mm_dp_ps support (to fix SSE 4.1 in Cyberpunk), but it failed to compile after that. Maybe you will be interested.

@mirh
Contributor

mirh commented Dec 15, 2020

DPPS should already be implemented, maybe you are hitting simd-everywhere/simde#648
As for popcnt specifically, they chose not to include it in the project (since I guess it's not technically a SIMD instruction?)

Anyway, duh, shit. I can't believe this flew under my radar this spring when I was looking for alternatives to SSEPlus.

@mr-c

mr-c commented Dec 15, 2020

@AndreyTykhonov I am interested in hearing more about your SIMDe failure to compile; can you share more details?

@AndreyTykhonov
Author

@AndreyTykhonov I am interested in hearing more about your SIMDe failure to compile; can you share more details?

I was at the very beginning. I primarily use C#, so C++ is something crazy for me. I compiled the sample Pin 3.14 project and added an instruction watch like in this code, but when I try to include the SSE 4.1 header I get a crazy number of errors related to Pin modules (even without any SIMDe calls or variables, just from the include!). I tested SIMDe in a console app and it works perfectly, but with Pin something crazy is going on. So I decided that I shouldn't waste my time learning C++ to fix Cyberpunk instead of the developers, and removed the solution along with the game, lol. But I thought it could be handy for someone, so I posted the information about SIMDe here.

Before this I compiled a project with simde mm_dp_ps, looked at the asm code in the debugger, and injected into the game a jmp to new memory holding the asm code from the compiled exe, lol. I even got it to the Cyberpunk logos, but it was too crazy, so I stopped this research.

@mirh
Contributor

mirh commented Dec 15, 2020

Pin can only include 3 very specific headers.
... and it already emulates everything up to even AVX512 I think.

If you are trying to extend the new icudt.dll that's a completely different approach.

@ogurets
Owner

ogurets commented Dec 16, 2020

@AndreyTykhonov thanks! Looks very promising! Actually I've been looking for implementations since I crashed into those AVX instructions after the prologue. Pintool does implement everything, but it doesn't allow using its implementations freely.

If you wish to add support for it, you will have to include SIMDe headers, add another HOTFIX macro like this:

#define HOTFIX_DPPS(offset, a, b, imm8, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		(a) = simde_mm_dp_ps((a), (b), (imm8)); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

Refactor the calls like:

HOTFIX_POP(0x045AD8D, ctx->Rax, ctx->Rcx, 5);
HOTFIX_DPPS(0xsomething, a, b, imm8, size);

And the trickiest part - find all those usages of dpps in the game and describe them with HOTFIX_DPPS. You can actually use hotpatch.log, which is being written by this library after every "unknown instruction" crash, but depending on the number of calls to these instructions it can be too tedious.

I used a modified version of instruction_hook tool to automate that for popcnt, will share it later.

@AndreyTykhonov
Author

AndreyTykhonov commented Dec 16, 2020

@AndreyTykhonov thanks! Looks very promising! Actually I've been looking for implementations since I crashed into those AVX instructions after the prologue. Pintool does implement everything, but it doesn't allow using its implementations freely.

If you wish to add support for it, you will have to include SIMDe headers, add another HOTFIX macro like this:

#define HOTFIX_DPPS(offset, a, b, imm8, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		(a) = simde_mm_dp_ps((a), (b), (imm8)); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

Refactor the calls like:

HOTFIX_POP(0x045AD8D, ctx->Rax, ctx->Rcx, 5);
HOTFIX_DPPS(0xsomething, a, b, imm8, size);

And the trickiest part - find all those usages of dpps in the game and describe them with HOTFIX_DPPS. You can actually use hotpatch.log, which is being written by this library after every "unknown instruction" crash, but depending on the number of calls to these instructions it can be too tedious.

I used a modified version of instruction_hook tool to automate that for popcnt, will share it later.

Wow! Thanks! Did you fix project compilation with the SIMDe headers? I'm not too good at C++; after including the header I got hundreds of errors :D If you can attach a project with the SIMDe header hooked up that compiles, I would be grateful!

And about offsets: I already got all SSE 4.1 & SSE 4.2 instruction offsets for version 1.04. Here are my results, maybe you'll find a use for them (beware: some offsets start from symbols like Cyberpunk2077.AK::WriteBytesMem::Count, but I can recreate the file with only Cyberpunk2077 as the reference point)
sseInstructions.json.txt

Actually I tried to fix some instructions with assembler so this is what I fixed:

  • No fixes: Game crashed at start
  • dpps: Game crash at first logo
  • pminuw: Game crash at second logo
  • pmaxuw: Still at second logo crash
  • ptest: Game actually got to intro video, but after that part it stuck with high CPU usage and watchdog timeout

My asm code contained errors, so some of the methods returned wrong values; I think that's the cause of the freeze and the crash. But it looks like these are all the instructions needed to get to the menu. Anyway, here is the full list of SSE instructions that the game uses (if I haven't forgotten something):

  • dpps
  • popcnt
  • pmulld
  • pminsd
  • pmaxuw
  • pmuldq
  • blendvps
  • pcmpistri
  • blendps
  • pminuw
  • pmovsxwd
  • packusdw
  • pabsd
  • roundps
  • ptest
  • pinsrb

@EvgeniySpinov

EvgeniySpinov commented Jan 25, 2021

So glad to see this thread, because I was doing exactly what is described here: I've found all SSE 4.1 and SSE 4.2 instructions and was emulating them with SIMDe. I'm stuck earlier on the path though: I'm experimenting with DPPS emulation, and after the first emulated DPPS call I get "Access violation exception when trying to access 0x00000000000". It seems that DPPS returns wrong values to the registers.

My best guess is that after calling ExceptionHandler values of the registers are restored to the stacked ones and I was trying to resolve that. However according to comments here - I might be wrong.

SIMDe was successfully included into the project without any errors (few warnings), however I was using only popcnt_hotpatch project.

@AndreyTykhonov if this is not the case for you - let's investigate. For me it was very easy - I've extracted source of SIMDe into a subfolder near the popcnt_hotpatch project and used relative paths to make inclusions.


#define HOTFIX_DPPS(offset, a, b, imm8, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		(a) = simde_mm_dp_ps((a), (b), (imm8)); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

As for this code - the issue is that variables a and b should be XMM registers. They come within the exception context as _M128A structure, so appropriate casting needs to be made. Unless I'm missing anything.

I've ended up with something like:

DPPS(0x03A2C33, ctx->Xmm0, ctx->Xmm4, 0x7F, 6);

#define DPPS(offset, dest, src, mask, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		__m128 register1 = _mm_load_ps((float*) &dest); \
		__m128 register2 = _mm_load_ps((float*) &src); \
		dest = simde_mm_dp_ps(register1, register2, mask); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

And after the first call I'm getting "Access violation exception".

With SDE Cyberpunk works. I'm working with 1.06 binary.

Would be glad to get deeper into this. Suggestions?

UPD. Actual code for DPPS emulation is a bit different than above:

#define DPPS(offset, dest, src, mask, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		/* dest and src are the 16-byte XMM fields of the exception context, e.g. ctx->Xmm0 */ \
		simde__m128 register1 = simde_mm_load_ps((simde_float32 *) &(dest)); \
		simde__m128 register2 = simde_mm_load_ps((simde_float32 *) &(src)); \
		simde__m128 register_dest = simde_mm_dp_ps(register1, register2, (mask)); \
		/* write the result back into the context so it is restored on resume */ \
		simde_mm_store_ps((simde_float32 *) &(dest), register_dest); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}
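
For checking what the emulated call writes back into the register, a scalar model of DPPS can help. The following is only a sketch of the documented instruction semantics, not SIMDe's implementation; `Vec4` and `dpps_reference` are names made up for this example. The hardware may add the products in a different order, so the last bits of the result can differ for inexact inputs.

```cpp
#include <array>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Scalar model of DPPS (hypothetical helper, not part of SIMDe):
// bits 4..7 of imm8 select which lane products enter the sum,
// bits 0..3 select which destination lanes receive the sum;
// unselected destination lanes are set to 0.
Vec4 dpps_reference(const Vec4& a, const Vec4& b, std::uint8_t imm8) {
    float sum = 0.0f;
    for (int lane = 0; lane < 4; ++lane) {
        if (imm8 & (0x10u << lane)) {
            sum += a[lane] * b[lane];
        }
    }
    Vec4 result{};  // value-initialized: all lanes start at 0
    for (int lane = 0; lane < 4; ++lane) {
        if (imm8 & (0x01u << lane)) {
            result[lane] = sum;
        }
    }
    return result;
}
```

With the mask 0x7F used in the call above, the high nibble 0x7 sums lanes 0..2 (a 3-component dot product) and the low nibble 0xF writes that sum to all four lanes.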

@EvgeniySpinov

EvgeniySpinov commented Jan 26, 2021

I've made some progress with SIMDe, no more "Access violation issue".

SIMDe seems pretty effective and generates like 2-7 ASM lines instead of single DPPS call.

Now I need to build sseInstructions.json.txt, which is kind of tricky. I have the full list of instructions and their offsets, but I do not have the instruction lengths. I've used IDA Pro to get those, and I can get the lengths one by one. But there are 1739 matches for SSE 4.1 and SSE 4.2 instructions, so working through hotpatch.log manually is not an option, and neither is going through the IDA search results by hand.

The list of instructions provided by @AndreyTykhonov is different for Cyberpunk 1.06. It also has:

  • lea
  • pminsd
  • pmaxsd
  • vpmovsxwd
  • vpmulld
  • vpblendw
  • vroundps

But it does not have:

  • pminsd
  • pcmpistri
  • pabsd
  • pinsrb

Can someone help me with either parsing IDA's search occurrences window or with offsets, instructions + their length?

@AndreyTykhonov
Author

@EvgeniySpinov glad to see progress on this!

My list didn't contain AVX instructions since I'm not parsing them.
1.1 contains AVX instructions too, even though the developers said they removed them (maybe unused code?)
Here are the instructions from version 1.1 (AVX not parsed), hope it will help you.

1851 SSE 4.1 / 4.2 instructions:

  • dpps (1498)
  • popcnt (123)
  • pmulld (76)
  • pminsd (30)
  • pmaxsd (30)
  • pmaxuw (14)
  • pmuldq (12)
  • blendvps (11)
  • pcmpistri (10)
  • blendps (8)
  • pminuw (8)
  • pmovsxwd (8)
  • packusdw (7)
  • pabsd (7)
  • roundps (6)
  • ptest (1)
  • pinsrb (1)
  • pshufb (1)

11_sseInstructions.json.txt

@EvgeniySpinov

EvgeniySpinov commented Jan 27, 2021

Looks like exactly what is needed! Thank you for sharing. I do not have 1.1 though, but will get an update.

A question meanwhile: could you please share how you generate this? I'm really curious about the approach and would like to use it for other projects as well.

Also a question: in the call dpps xmm1,[rsp+20],7F, is that "20" a hexadecimal or a decimal value? My assumption is that it is hexadecimal.

And one more question about file contents. Some of the offsets are calculated from functions like "Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF". Is there a way to get absolute offset for all instructions?

@AndreyTykhonov
Author

@EvgeniySpinov I can regenerate the list without "Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF" if you need, with just offsets from the exe base address. All values are hexadecimal.

My steps to generate list:

  1. Creating suspended game with PHacker
  2. Using Cheat Engine (CE next) looking for memory regions with executable flag
  3. In CE disassembler saving asm output as txt file, setting length based on memory regions
  4. Using my own parser to translate CE txt output to json
  5. Parsing json to remove all non-SSE instructions

@EvgeniySpinov

Right, that doesn't look like something I'll be able to quickly reproduce :)

I have some progress with IDA script, but I propose to unite our effort. Could you please regenerate file with absolute offset positions?

Meanwhile I'll try to write a wrapper for JSON to translate those instructions into HOTFIX calls in C++ and implementing them with SIMDe. If that would work - then we can look into details of getting list of calls+offsets in more automated way.
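
A rough sketch of what such a translation step could look like. It assumes each entry carries a module-relative offset string of the form `Module+HEX`, and that the generated line targets a `DPPS(offset, dest, src, mask, instr_size)` macro taking fields of the exception context; `parse_module_offset` and `format_dpps_call` are hypothetical helpers, and real JSON parsing plus operand decoding are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: extracts the hex offset after the last '+' in a
// string such as "Module+HEX". Returns false on malformed input.
bool parse_module_offset(const std::string& text, std::uint64_t& offset) {
    const std::size_t plus = text.rfind('+');
    if (plus == std::string::npos) return false;
    const std::string hex = text.substr(plus + 1);
    if (hex.empty() || hex.size() > 16 ||
        hex.find_first_not_of("0123456789abcdefABCDEF") != std::string::npos) {
        return false;
    }
    offset = std::stoull(hex, nullptr, 16);
    return true;
}

// Hypothetical helper: formats one register-to-register DPPS entry as a
// macro invocation line to paste into the handler source.
std::string format_dpps_call(std::uint64_t offset, int dst_xmm, int src_xmm,
                             unsigned imm8, unsigned length) {
    char line[96];
    std::snprintf(line, sizeof line,
                  "DPPS(0x%07llX, ctx->Xmm%d, ctx->Xmm%d, 0x%02X, %u);",
                  static_cast<unsigned long long>(offset), dst_xmm, src_xmm,
                  imm8, length);
    return line;
}
```

Note that an offset given relative to a symbol (such as `Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF`) would parse, but the number is only meaningful once the list is regenerated relative to the module base.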

@mirh
Contributor

mirh commented Jan 27, 2021

Not that a general fix for AVX would hurt, but anyway my dudes wasn't that already fixed in patch 1.05 for cyberpunk?

@AndreyTykhonov
Author

@EvgeniySpinov
sseInstructions.json.txt

@EvgeniySpinov

EvgeniySpinov commented Jan 27, 2021

Not that a general fix for AVX would hurt, but anyway my dudes wasn't that already fixed in patch 1.05 for cyberpunk?

My understanding is that it was - I was able to play on my Phenom II X6 1090T, which doesn't have AVX, with only the SSE 4.x patches. AVX was removed after the shitstorm on the CDPR forums from people with server Xeons, which do not have AVX either.

@EvgeniySpinov

EvgeniySpinov commented Jan 31, 2021

Spent some time today moving forward on this one.

Some of the instructions are represented in a weird way. For example:
{"Offset":"Cyberpunk2077.exe+2BF36D3","Asm":"dpps xmm4,[7FF624208440],7F","Length":10}

Is that the address from which the dpps float operand should be read?

IDA reports on this address:
dpps xmm4, cs:xmmword_142EF8440, 7F

Also this one:
pmaxuw xmm2,[r8+rax*2]

In IDA:
pmaxuw xmm2, xmmword ptr [r8+rax*2]

Does anyone know how to fetch the value of the second operand in C++ code from the exception?

(without these instruction calls - I can get through few logos, apparently while game is loading the rest of the stuff. Emulated only DPPS for now)
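
One way to think about the memory-operand forms above: the second operand is the 16 bytes at an effective address computed from register values that are available in the exception context. A minimal sketch of that arithmetic, assuming the operand has already been decoded into its parts; `MemOperand`, `effective_address` and `rip_relative_address` are hypothetical names, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of a decoded memory operand. In the exception
// handler the register values would come from the context record
// (e.g. ctx->R8 and ctx->Rax for [r8+rax*2]).
struct MemOperand {
    std::uint64_t base;   // value of the base register, 0 if none
    std::uint64_t index;  // value of the index register, 0 if none
    std::uint8_t scale;   // 1, 2, 4 or 8
    std::int32_t disp;    // signed displacement from the encoding
};

// Effective address = base + index * scale + sign-extended displacement.
std::uint64_t effective_address(const MemOperand& op) {
    const std::uint64_t disp =
        static_cast<std::uint64_t>(static_cast<std::int64_t>(op.disp));
    return op.base + op.index * op.scale + disp;
}

// RIP-relative operands are encoded as a signed 32-bit displacement from
// the address of the NEXT instruction (instruction address + its length).
std::uint64_t rip_relative_address(std::uint64_t instr_addr,
                                   std::size_t instr_size,
                                   std::int32_t disp) {
    return instr_addr + instr_size +
           static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}
```

For `pmaxuw xmm2,[r8+rax*2]` the operand would be `{ctx->R8, ctx->Rax, 2, 0}`; the 16 bytes at the resulting address could then be read with an unaligned load such as `simde_mm_loadu_si128`. The absolute address that Cheat Engine prints for the dpps example is the already-resolved form of a RIP-relative operand, so it is only valid for that particular run's image base.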

@EvgeniySpinov

EvgeniySpinov commented Feb 2, 2021

Ok, I've progressed through:

  • Emulated all SSE4.1, 4.2 calls with SIMDe (some of the calls in the file are SSE3 actually, like pabsd or pshufb, I've skipped them)
  • Created a parser for the JSON provided by @AndreyTykhonov. As a result I get a list of calls to be put into the source code
  • The instructions mentioned above are jumped over - I haven't found a way to process them yet. Please share if you have any ideas
  • When I start Cyberpunk2077 in MSVS debug mode, the game launches and I can get to the menu (there is some issue with the fonts though). Due to debug mode I get super low performance (0-1 fps with 20-25% CPU load), so I didn't even try to load a save game.
  • When I start the game normally, it just silently crashes. No error messages, nothing, just an instant crash without even appearing in Task Manager.

After experimenting with commenting out instruction calls, the game starts and crashes as it should (with an illegal instruction).

My guess is that it is somehow due to a number of instruction calls, heap size, etc, cause commenting different sets of calls allows to launch the game, so the problem is not with the calls themselves. IDA can also start the game in debug mode.

Resulting dll with all the calls is 1.5M. When I comment like 10% of instruction calls (even the same call, like DPPS for instance) dll might reduce in size to 300-400Kb and then game launches.

So currently observation is: big dll - process crashes instantly. small dll - process starts.

@mirh @AndreyTykhonov Have you seen such a behavior before? Do you know which direction should I dig into?

@EvgeniySpinov

Ok, guys, I've made it. Everything works. 1727 lines with various instruction calls.

The problem is ... I get 3 fps. Same as with Intel SDE. Completely unplayable as you may guess.

How the hell does this guy do it: https://cs.rin.ru/forum/viewtopic.php?f=10&t=71329

Look for "SSE 4.x". His patch works perfectly - I get 30-40 fps hitting my GPU.

@mirh
Contributor

mirh commented Feb 5, 2021

Yeah, luther_d is one sick fella.
Is your fix still just expanding on popcnt_emulator though?
While the current version is supposedly better than the old one, it reverts back to some form of trap-and-emulate.
State of the art sounds way neater.

@EvgeniySpinov

I've based updates on popcnt_hotfix.

You think a better idea is to use PIN to intercept the instructions before they execute and emulate them with SIMDe, instead of exception handling?

@mirh
Contributor

mirh commented Feb 5, 2021

I'm not really the sharpest tool in the shed, to be honest.
Still, I know that "handling broken eggs after they happened" is orders of magnitude slower.

My uneducated guess without any kind of actual profiling is indeed that exception handling is the biggest performance offender.

@EvgeniySpinov

Great article, which means that SDE is already using PIN tool JIT compilation in order to intercept instructions before any exception. And its performance is equal to our solution, which surprises me, tbh: I would expect SDE to be faster, since we're working with exceptions.

As a POPCNT emulator - idea of this tool is great - emulating only 1 instruction instead of whole CPU architecture allows to launch the game and have minimal impact. However whole SSE 4.x stack is heavy. BTW, Intel SDE is developing, so probably there would be a way to emulate selected set of instructions only. Haven't checked for popcnt, but probably there is a switch by now. There is definitely for SSE 4.1, 4.2, 4.3, etc.

Need to get in touch with luther_d and understand how this could be tackled. My best guess is that luther_d is not emulating all of the instruction calls required. Likely he operates on a subfunction level, jumping over functions which contain SSE 4.x where possible and emulating their output when not.

@mr-c

mr-c commented Feb 5, 2021

Hello from the SIMDe project! When using SIMDe to cope with SSE4.1 instructions not available on the running processor, do you compile using the highest SIMD level available (like SSE3, SSE2, etc..) or are you using the unoptimized fallback implementations?

@EvgeniySpinov

Hey @mr-c, thank you for coming to our bonfire :) You've got a great project and great fellows who help people like me to use it.

If you mean compiler and linker options, then highest supported SIMD level is SSE2 for Phenom X6 1090T, which is default for MSVC 2019. I didn't change anything there. SSE3 is partially supported as figured later on: had to emulate pabsd and pshufb calls, cause they were causing invalid code exceptions.

If you mean using SSE1,2 within SIMDe calls, then in here: simd-everywhere/simde#694 I was told that I should not mix them, i.e. either all native or SIMDe. So did I.

@mr-c

mr-c commented Feb 5, 2021

:-) I'm the SIMDe cheerleader, all the credit goes to our amazing contributors!

Yep, I meant compiler options. The MSVC equivalent of gcc's '-msse2', which seems to be /arch:SSE2 and the default

According to https://www.cpu-world.com/CPUs/K10/AMD-Phenom%20II%20X6%201090T%20Black%20Edition%20-%20HDT90ZFBK6DGR%20(HDT90ZFBGRBOX).html I see that SSE3 is supported, but I don't see a MSVC command line option for that. Does MSVC automatically set __SSE3__ ? If not, you may benefit from defining SIMDE_ARCH_X86_SSE3 to 1.

@EvgeniySpinov

Phenom X6 1090T seems to have incomplete SSE3 support. It supports IA SSE3, but not IA Supplemental SSE3. I do not know what the difference is, though, but the instructions mentioned above are SSE3 instructions and I still had to emulate them.

But I've added SIMDE_ARCH_X86_SSE3 1 - that didn't trigger full rebuild. Seems like functions I'm using from SIMDe mostly using SSE2 functions.
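
For reference, pabsd and pshufb belong to SSSE3 (Supplemental SSE3), which the CPU reports through a separate CPUID feature bit from SSE3. A small sketch of decoding the relevant bits from the ECX value of CPUID leaf 1; `SimdFeatures` and `decode_cpuid1_ecx` are hypothetical names, and obtaining the raw value (for example with MSVC's `__cpuid` intrinsic, where ECX is `cpuInfo[2]`) is left to the caller.

```cpp
#include <cstdint>

// Feature flags relevant to this thread, as reported by CPUID leaf 1 in ECX.
struct SimdFeatures {
    bool sse3;
    bool ssse3;
    bool sse41;
    bool sse42;
    bool popcnt;
    bool avx;  // CPU capability only; using AVX also needs OS support
};

// Hypothetical helper: decodes the ECX output of CPUID with EAX = 1.
SimdFeatures decode_cpuid1_ecx(std::uint32_t ecx) {
    auto bit = [ecx](unsigned n) { return ((ecx >> n) & 1u) != 0; };
    return SimdFeatures{
        bit(0),   // SSE3
        bit(9),   // SSSE3
        bit(19),  // SSE4.1
        bit(20),  // SSE4.2
        bit(23),  // POPCNT
        bit(28),  // AVX
    };
}
```

A CPU that sets bit 0 but not bit 9 has SSE3 without SSSE3, which would match the behaviour described above.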

@mr-c

mr-c commented Feb 5, 2021

@EvgeniySpinov Interesting! I wasn't aware of the SSE3 sub-levels.

Can you remind me (maybe with a link) how SIMDe is being compiled/used?

@mirh
Contributor

mirh commented Feb 5, 2021

SSE3 is SSE3, SSSE3 is SSSE3. MSVC has way less automatic granularity than, say, gcc but still I think intrinsics should do it.

Great article, which means that SDE already using PIN tool JIT compilation in order to intercept instruction calls before any exception.

No, because like ogurets said, he's using probe and not JIT.

@mr-c

mr-c commented Feb 5, 2021

But I've added SIMDE_ARCH_X86_SSE3 1 - that didn't trigger full rebuild. Seems like functions I'm using from SIMDe mostly using SSE2 functions.

Oops, I was wrong, you should use SIMDE_X86_SSE3_NATIVE instead

@AndreyTykhonov
Author

How the hell, this guy makes it: https://cs.rin.ru/forum/viewtopic.php?f=10&t=71329

Luther_d's solution is not based on exception handling. It requests new memory at game start and writes ASM code into it that will be executed instead of the unsupported instructions (for example, dpps xmm0, xmm1, 7F and dpps xmm1, xmm2, 7F ARE DIFFERENT CODE)

After the new memory is created, he injects jmps to it: dpps xmm0, xmm1, 7F becomes jmp OFFSET_IN_MEMORY plus nops, so this solution is very fast
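
To illustrate the jmp-and-nops replacement, here is a sketch that only builds the replacement bytes for one instruction. It says nothing about how luther_d's fix is actually implemented; `make_jmp_patch` is a hypothetical helper, and writing the bytes into the process (changing page protection, making the new code jump back after the patched instruction) is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: builds the bytes that would overwrite an instruction
// of `instr_size` bytes at `instr_addr` with `jmp rel32` to `target`,
// padded with NOPs up to the original instruction length.
// Returns an empty vector if the instruction is shorter than the 5-byte jmp
// or the target is outside the +/-2 GiB range a rel32 jump can reach.
std::vector<std::uint8_t> make_jmp_patch(std::uint64_t instr_addr,
                                         std::uint64_t target,
                                         std::size_t instr_size) {
    constexpr std::size_t kJmpSize = 5;  // opcode E9 + 32-bit displacement
    if (instr_size < kJmpSize) return {};

    // The displacement is relative to the address right after the jmp.
    const std::int64_t disp =
        static_cast<std::int64_t>(target - (instr_addr + kJmpSize));
    if (disp < std::numeric_limits<std::int32_t>::min() ||
        disp > std::numeric_limits<std::int32_t>::max()) {
        return {};
    }

    std::vector<std::uint8_t> patch(instr_size, 0x90);  // 0x90 = NOP
    patch[0] = 0xE9;
    const std::uint32_t bits = static_cast<std::uint32_t>(disp);
    for (std::size_t i = 0; i < 4; ++i) {
        patch[1 + i] = static_cast<std::uint8_t>(bits >> (8 * i));  // little-endian
    }
    return patch;
}
```

A register-to-register dpps is 6 bytes long, so it has room for the 5-byte jump plus one NOP. The range check matters in practice: memory allocated far away from the game image cannot be reached with a rel32 jump.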

As I understand, luther_d's solution is not automated, since he releases fixes with new overloads. I think he restarts the game and fixes crashes until it works, which is the reason why it is not going to be updated

We can possibly port the fix to new game versions in a few steps:

  1. Get unsupported instructions at 105 version
  2. Install fix, save all ASM code after jmp
  3. Get unsupported instructions at 12 version
  4. Try to replace same instructions with same ASM code

But there can be new instructions, so without good ASM knowledge we can't do much. I tried to use the same method at the beginning

@EvgeniySpinov

@mr-c I've added SIMDE_X86_SSE3_NATIVE - it didn't trigger rebuild as well. I think SSE3 is rarely used. In SIMDe as well.

Regarding SIMDe compilation: didn't get your question. You mean which command line is used to compile DLL or how SIMDe is used in source code?

@AndreyTykhonov That is possible. Is that an assumption, or have you spoken to him? Asking because there are a few questions:

  • How can you jump before instruction execution within the external sub-function into your DLL sub-function? How to intercept such a call without modifying EXE?
  • If there is a way to do so: this could be achieved with SIMDe: instead of exception handling, we can just direct code of instruction call like DPPS to our function. I think our performance is poor not only due to an exception, but also due to a major mapping search effort (like 1727 checks of instructions at maximum on EACH exception)
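
On the mapping search cost mentioned in the second point: a chain of up to 1727 `if (rip == ...)` checks can be replaced by a single hash lookup keyed on the offset. A minimal sketch, assuming the table is filled once at DLL load and only read inside the handler; `Patch` and `PatchTable` are hypothetical names, and `handler_id` stands in for whatever identifies the emulation routine.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical record describing one patched instruction.
struct Patch {
    std::uint8_t instr_size;  // bytes to skip after emulation
    int handler_id;           // which emulation routine to run
};

// Hypothetical lookup table keyed by offset from the image base.
class PatchTable {
public:
    void add(std::uint64_t offset, Patch patch) { table_[offset] = patch; }

    // Returns the patch registered for the faulting address, or nullptr.
    // Average cost is constant, independent of the number of entries.
    const Patch* find(std::uint64_t rip, std::uint64_t image_base) const {
        const auto it = table_.find(rip - image_base);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::uint64_t, Patch> table_;
};
```

This only removes the search cost; the cost of raising and dispatching the exception itself stays the same.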

Regarding ASM code generation - I think it's possible to automate with SIMDe: create all calls combinations that is met within the EXE including:

  • Function
  • SRC
  • DST
  • Mask (where applicable)
  • Length

Get their code in ASM disabling all (or almost all) compiler optimizations. Then do like you've said: before DPPS call for instance - jump to new memory region with needed stuff + NOOP.

Question is: how to force application to behave differently on particular instruction call from DLL?

@AndreyTykhonov
Author

AndreyTykhonov commented Feb 6, 2021

Is that an assumption or you've spoken to him?

I compared RDR2 & Cyberpunk with and without fix, it replaces unsupported instructions with jumps to new memory

How can you jump before instruction execution within the external sub-function into your DLL sub-function? How to intercept such a call without modifying EXE?

As I understand, he is modifying executable memory after game started, there is no hook on exceptions, fix probably contains offsets that should be replaced, not automated at all, like in JSON that I created

If there is a way to do so: this could be achieved with SIMDe

I think theoretically it's possible, but I see a problem with registers. I don't know how you can manage something like "dpps xmm01, [rax+r1*4], 7F" inside SIMDe, as well simple registers like xmm0, xmm1

Question is: how to force application to behave differently on particular instruction call from DLL?

That's why I said that every instruction is different in memory. He basically created different methods for each instruction type: dpps xmm0, xmm1, 7F and dpps xmm1, xmm2, 7F is in different memory regions, so it can be easily jumpable without additional parameters

@EvgeniySpinov

EvgeniySpinov commented Feb 8, 2021

Thanks for sharing your observations.

As I understand, he is modifying executable memory after game started, there is no hook on exceptions, fix probably contains offsets that should be replaced, not automated at all, like in JSON that I created

That's why I said that every instruction is different in memory. He basically created different methods for each instruction type: dpps xmm0, xmm1, 7F and dpps xmm1, xmm2, 7F is in different memory regions, so it can be easily jumpable without additional parameters

Do you or anyone in this thread know how to perform such jumps in executable memory from a DLL, i.e. knowing the offset, jump to another region before the instruction is executed? You said that you were experimenting with ASM calls previously. Did you try the same approach?

I think theoretically it's possible, but I see a problem with registers. I don't know how you can manage something like "dpps xmm01, [rax+r1*4], 7F" inside SIMDe, as well simple registers like xmm0, xmm1

It seems that not all of the 1727 instruction calls play critical role in here. I had around 12-15 jump over calls (just jump to next instruction) which were hit by the program (I've checked) and game still worked fine. While if to jump over all instructions - that leads to a crash before the 1st logo.

One more way to move forward without getting too deep into ASM: get list of instructions involved in rendering (have some ideas how to do this) and move them to the head of the list, so they would be found soonest while exception is processed. I've noticed that until loading a save game - my fix was behaving the same as with luther fix, i.e. 100% CPU load during startup, smooth video feed, smooth menu, etc. While during loading a saved game it clearly started to lag. Probably game utilizes instructions closer to the end of the list during loading and rendering and this causes additional delays.

@mr-c

mr-c commented Feb 8, 2021

Regarding SIMDe compilation: didn't get your question. You mean which command line is used to compile DLL or how SIMDe is used in source code?

How do you go from SIMDe source code to object code? What transformations, compilation options, and defines are set?

@EvgeniySpinov

How do you go from SIMDe source code to object code? What transformations, compilation options, and defines are set?

Compiler:

/Yu"stdafx.h" /ifcOutput "D:\Games\Cyberpunk 2077_UPD\bin\x64\" /GS /Qpar /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Fd"x64\Release\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "POPCNT_HOTPATCH_EXPORTS" /D "_WINDLL" /errorReport:prompt /GT /WX- /Zc:forScope /Gd /Oi /MD /std:c++17 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\icudt.pch" /diagnostics:column

Linker:

/OUT:"C:\Users\XXX\VisualStudio Projects\popcnt_emulator\popcnt_hotpatch\x64\Release\icudt.dll" /MANIFEST /LTCG:incremental /NXCOMPAT /PDB:"C:\Users\XXX\VisualStudio Projects\popcnt_emulator\popcnt_hotpatch\x64\Release\icudt.pdb" /DYNAMICBASE "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /STACK:"128000000"",125000000" /IMPLIB:"C:\Users\XXX\VisualStudio Projects\popcnt_emulator\popcnt_hotpatch\x64\Release\icudt.lib" /DLL /MACHINE:X64 /OPT:REF /INCREMENTAL:NO /PGD:"C:\Users\XXX\VisualStudio Projects\popcnt_emulator\popcnt_hotpatch\x64\Release\icudt.pgd" /SUBSYSTEM:WINDOWS /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /ManifestFile:"x64\Release\icudt.dll.intermediate.manifest" /OPT:ICF /ERRORREPORT:PROMPT /NOLOGO /VERBOSE:CLR /ASSEMBLYDEBUG:DISABLE /TLBID:1

While we're searching for other options, I've attached the most recent source code, just to give more information on how it is done currently.

popcnt_hotpatch.cpp.zip

@mr-c

mr-c commented Feb 11, 2021

Thanks @EvgeniySpinov ; I'm not seeing where you set SIMDE_X86_SSE3_NATIVE?

@EvgeniySpinov

Right, I did this before inclusion of sse4.2.h like so:
#define SIMDE_X86_SSE3_NATIVE 1

However, since it had no effect on the resulting DLL, I removed it. I've put it back in my sources just to make sure it's always there.

@EvgeniySpinov

Ok, I've built a quick profiler which outputs the "hottest" instruction calls. There were not too many of those, around 200, so I've put them at the beginning of the list (they were closer to the end before). The performance gain was around 2 times, so instead of 3 fps I get 5-6. Already better than SDE, but still too slow.
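
A profiler of this kind can be as simple as a hit counter per offset plus a ranking step. The sketch below is not the code used here, just one possible shape; `HitCounts`, `record_hit` and `hottest` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Maps an instruction offset to the number of times it was emulated.
using HitCounts = std::unordered_map<std::uint64_t, std::uint64_t>;

// Would be called from the handler each time an offset is emulated.
void record_hit(HitCounts& hits, std::uint64_t offset) { ++hits[offset]; }

// Returns up to `limit` (offset, count) pairs, most frequently hit first;
// ties are ordered by ascending offset so the output is deterministic.
std::vector<std::pair<std::uint64_t, std::uint64_t>> hottest(
    const HitCounts& hits, std::size_t limit) {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> ranked(hits.begin(),
                                                                hits.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& lhs, const auto& rhs) {
                  if (lhs.second != rhs.second) return lhs.second > rhs.second;
                  return lhs.first < rhs.first;
              });
    if (ranked.size() > limit) ranked.resize(limit);
    return ranked;
}
```

The ranked offsets can then be emitted first when generating the list of macro calls.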

Looks like the biggest penalty is exception handling. Need to find a way to modify executable space in order to jump before instructions and not enter exception handling. No other way around.

@EvgeniySpinov

Dug around the PIN tool a bit with regard to using JIT. In theory it looks good, but the documentation warns about the performance impact, and according to the README.md of this repo @ogurets already tried this approach, using the PIN tool just for the popcnt instruction, and reported a visible performance hit.

I have 1 more idea how to tackle the problem without PINs and SDEs:

  • Find all SSE 4.x instruction calls within binary
  • For each combination of registers generate and export function from DLL implementing call with SIMDe. Probably approach with single function per call + passed registers as arguments would work here.
  • Patch EXE file and replace instruction calls with calls to exported functions from DLL. Potentially this process could be automated.

Not so neat as just adding DLLs, but after looking into way of modifying executable code from DLL - that doesn't seem an easy way to do, or I'm missing something.

What do you think?

@AndreyTykhonov
Author

AndreyTykhonov commented Mar 5, 2021

@EvgeniySpinov probably the best way is to start game suspended, inject DLL and unfreeze game.
DLL should read instructions offsets (maybe from file in the same directory) and replace it with jmps to self compiled methods (in DLL memory space).

I don't know how to force C++ to mark input variable as specific register, like:
void dpps_xmm4_xmm5_7f(float a, float b)
to use actual xmm4 register instead of variable a, as well to output result to specific register. I don't know is that even possible. This is why Luther_d uses assembler memory injections.

As I mentioned before, we can actually update Luther_d's fix to the latest version, and Luther_d already updated the fix to version 1.1, but this will work only with Cyberpunk, not for future games.

Based on the Steam Hardware Survey, only 1.5% of PCs don't have SSE4.1. Even if we can play Cyberpunk 2077 at 40 fps, future games will be slower. In several years new games will give 20 fps, and I'm not even talking about Phenom single-thread performance. Let's face it: it's time to upgrade the PC; even a budget i3 will give TWICE as much FPS.

I think it's not worth to continue research and we should close this issue. Do you agree?

@EvgeniySpinov

@EvgeniySpinov probably the best way is to start game suspended, inject DLL and unfreeze game.
DLL should read instructions offsets (maybe from file in the same directory) and replace it with jmps to self compiled methods (in DLL memory space).

The only issue I'm seeing is that Windows DEP randomizes memory space for DLLs each time application launches. This is done to prevent DLLs from doing exactly this thing: modifying executable memory and jump to addresses populated by DLL. Cause viruses doing this as well.

I don't know how to force C++ to mark input variable as specific register, like:
void dpps_xmm4_xmm5_7f(float a, float b)
to use actual xmm4 register instead of variable a, as well to output result to specific register. I don't know is that even possible. This is why Luther_d uses assembler memory injections.

This is a very valid point. I'm also struggling to find a way to use exact registers, for reading and writing. Seems like that kind of access is at the ASM level :(

As I mentioned before, we can actually update Luther_d fix to last version, and Luther_d already updated fix to 1.1 version, but this will works only with CyberPunk, not for next games.

Yep, I know that luther_d has done fix update and I'm happily using it. The thing is yes - those fixes, specially from luther_d are not available for all games with SSE 4.x, so that was kind of a vector for me.

Based on Steam Hardware Survey, there is only 1.5% of PC that doesn't have SSE4.1. Even if we can play CyberPunk 2077 on 40 fps, next games will be slower. Several years and new games will give 20 fps and I'm not talking about Phenom single thread performance. Let's face it: it's time to upgrade PC, even budget i3 will give TWICE much FPS.

I think it's not worth to continue research and we should close this issue. Do you agree?

You're absolutely right, and this is a valid point which I was also thinking of. And I'm glad to see that you have the same rationale and shared it. However, I'm doing this not for actual gaming on a Phenom, but rather to understand C++, DLLs and instruction hacking better. And I was really excited to see and use the SIMDe project for that and to collaborate with you on this one.

People with Phenom (like me) do not even need to upgrade to play games - now there are variety of clouds where you can game without a hassle. I've tried with Cyberpunk as well before luther_d first fix - Full HD, 60 fps rock solid on Nvidia GFN.

Not sure I want to go to ASM space - that seems too much, but currently it also seems that only way to get level of fixes luther_d provides. So I'm kind of puzzled.

@mirh
Contributor

mirh commented Mar 12, 2021

I'm pretty sure DEP can be disabled
Also friendly reminder that popcnt_emulator was about lack of SSE4.2 if really any, and an old Xeon is still leaps and bounds faster than a ps4.
Which will still be supported in a few years time.
