Excellent. An attempt for a console tool. #8
Comments
Thanks for the thorough analysis! gcc does seem to have a problem generating tight code for LZSSE, although I'm surprised Intel is that far ahead! I'll have to have a more thorough look at the assembly you generated on the weekend to see what is causing the slowdown (although I suspect you are right and the Intel code is doing a much better job with the registers). I'd say that the slow result with the small file alice29 could be due to timer resolution or it could be due to some fixed overhead/system overhead. One thing I've found is that if the OS hasn't allocated the memory pages for the buffer yet (just mapped them), then the initial page fault to allocate memory can be quite a large amount of overhead on those small files.
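To illustrate the page-fault point, here is a minimal sketch of touching a buffer before the timed region begins. The helper name `alloc_prefaulted` is hypothetical and is not part of LZSSE or of either benchmark discussed here. It assumes that writing to every byte is enough to make the OS back the pages with physical memory, and it uses a non-zero fill because a compiler may turn `malloc` followed by a zero `memset` into `calloc`, which can leave the pages untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from LZSSE): allocate a buffer and write to
   every byte so that the page faults happen here, outside the timed
   region of a benchmark. Returns NULL if the allocation fails. */
static unsigned char *alloc_prefaulted(size_t size)
{
    assert(size > 0);
    unsigned char *buf = malloc(size);
    if (buf != NULL) {
        /* Non-zero fill: a zero fill could be merged with malloc into
           calloc by the compiler, which may not touch the pages. */
        memset(buf, 0xA5, size);
    }
    return buf;
}
```

Calling this for both the source and the target buffer before starting the clock should keep the first-touch cost out of the measurement; the caller still releases the buffer with `free`.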
>Thanks for the thorough analysis!

My pleasure.

>gcc does seem to have a problem generating tight code for LZSSE, although I'm surprised Intel is that far ahead!

GCC is good, but in my experience Intel distributes registers more effectively. When even a single register is not in use, the stack accesses hurt speed, you know; since you utilize the whole squad of XMMs to the max, it is even more so: the cumulative penalties matter! I think your etude is not to be presented with just any compiler/options, its essence is lost ... in the translation.

>I'll have to have a more thorough look at the assembly you generated on the weekend to see what is causing the slowdown (although I suspect you are right and the Intel code is doing a much better job with the registers).

My wish is for textual fans to have one fully operational console tool (even in its simplest form, i.e. file-to-file), just to have one PARAGON performer and feel what the speed religion is all about.

>I'd say that the slow result with the small file alice29 could be due to timer resolution or it could be due to some fixed overhead/system overhead. One thing I've found is that if the OS hasn't allocated the memory pages for the buffer yet (just mapped them), then the initial page fault to allocate memory can be quite a large amount of overhead on those small files.

Here I admit that, despite the "triviality" of such measuring, I cannot sense what is going on; maybe I will ask Hamid or some guys at Intel's forum for help. Anyone?!
Hi again. Also, I believe you were right about the 'malloc' that was not actually executed in full: I moved it well before the benchmarking and, lo, it reports okay. Another dumb mistake of mine was the statement about a 100x faster compression rate; in fact it is 1000x, if not bigger. On my laptop Core 2 Q9550s @2.83GHz, alice29.txt is decompressed much faster, and I believe the report is okay (given that I had many tasks in the tray):
Also I added a few more strong compressors to the 'bundle', the full log for 'alice29.txt':
Had time only to run 4 testfiles on i5-2430M @3ghz, DDR3 @666MHz:
Comparing with TurboBench I see no discrepancy:
Except for the 'small file':
Food for thought: 640MB/s vs 18560 MB/s (clean Windows 7 with no tasks in the tray).
Hi,
Benchmarking 'TDELCC' a.k.a. The-Definitive-English-Language-Compression-Corpus, a smashdown, https://github.com/Sanmayce/Nakamichi
Another iteration of Sanmayce's decompression showdown 'FULG', revision 4, all performers are included in the package:
128t_opaque_GS.png:
Fulg-Textual_[De]Compression_Showdown_v4.tar.gz:
Satanichi_smashdown.pdf:
It is always good to get the picture of how the latest compressors fare in the TEXTUAL realm. Included compressors:
Compression command lines:
Decompression command lines:
Testmachinette: Laptop 'Dzvertcheto' Thinkpad L490, Intel i7-8565U (8) @ 4.600GHz, 64GB, Linux Fedora 39
Note01a: Nakamichi thrashes the virtual RAM (since it needs ~(61-(Source-Buffer + Target-Buffer = 2 + 3)-67)=-11 gigabytes more than 64GB), seen by the 6h systemtime.
Note01: The Walltime includes LOAD-DECOMPRESS-DUMP times, that is, external-RAM -> internal-RAM -> external-RAM.
Note08: Joergen's BriefLZ was compiled with these lines:
Bottomlines:
Obviously, WhiskeyLake rocks, being only 25W. Oh, wanted to include the Fabrice Bellard's superthrasher NNCP... somenight. 2023-Dec-30
Hi Conor,
many thanks for the undreamt performance of your LZSSE2, simply the FASTEST decompressor!
Not an issue but feedback; I wish this site had a [Feedback] section as well.
My wish was to have LZSSE2 in the form of a console tool as well, so I attempted to make one, though not as it ought to be: your etude, your tool, that's the right combination. However, I wanted level 17 in my textual comparisons, so I just embedded LZSSE2 into my fastest (old, 1MB sliding window) Nakamichi; the result:
LZSSE2 excels at:
Overall, significantly better everywhere, LZSSE2 is superior to Tengu, hands down.
In my benchmarks with Hamid's TurboBench from (Feb 21), LZSSE2 level 16 decompresses 2x faster than Nakamichi 'Goldenboy'! However with Haswell and above I expect 3x, even 4x.
For more tests (console dumps), you may see my compression logs/notes (far from finished) at:
www.sanmayce.com/Downloads/The-Last-Stand_booklet.pdf
Also, in the www.sanmayce.com/Downloads/TEXTUAL_MADNESS.zip package I made one .bat file running 12 compressors for a given file, thus giving quick look where one is ranked:
Performers:
Level17 gives excellent tightness and incredible decompression speed (i5-2430M @3ghz, DDR3 @666MHz):
The SSE4.1 and AVX .cod files are included (Assembly, that is), do you see register utilization/distribution as you intended?
In AVX code I see 4466 lines for LZSSE2_Decompress procedure, while the SSE4.1 amounts to 4819, how does this translate into speed, say, on Haswell?
On i5-2430M @3ghz, DDR3 @666MHz I see no speed difference, at all:
And to mix Level 17 with the TurboBench' results:
Very strange, decompression speed differs a lot between Hamid's bench and mine; my trials are 64, and with 'dickens' Intel 15.0 is 2x faster than GCC 5.3.0, or am I wrong?!
Also, I have no clue why with 'alice29' my bench gives the miserable 53 MB/s whereas TurboBench reports 1810.58MB/s?! That's why I told you that my knowledge is inferior: I failed to offer a reliable bench. Maybe I will change clock() with:
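For reference, and not as the snippet from the original post, one common way to sidestep coarse `clock()` granularity on small inputs is to read a monotonic high-resolution timer and repeat the work until a minimum wall time has passed. The sketch below assumes a POSIX system with `clock_gettime(CLOCK_MONOTONIC, ...)`; on Windows, `QueryPerformanceCounter` would play the same role. The names `measure_mb_per_s`, `work_fn`, `copy_job` and `copy_job_run` are hypothetical, and the `memcpy` job is only a stand-in for a real decompressor call.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback type: one unit of work, e.g. one decompression. */
typedef void (*work_fn)(void *ctx);

/* Stand-in workload (assumption): copies n bytes from src to dst. */
struct copy_job { const unsigned char *src; unsigned char *dst; size_t n; };

static void copy_job_run(void *ctx)
{
    struct copy_job *job = ctx;
    memcpy(job->dst, job->src, job->n);
}

/* Current monotonic time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Call work(ctx) repeatedly until at least min_seconds have elapsed,
   then return the throughput in MB/s (1 MB = 1,000,000 bytes here). */
static double measure_mb_per_s(work_fn work, void *ctx,
                               size_t bytes_per_call, double min_seconds)
{
    assert(work != NULL && min_seconds > 0.0);
    size_t calls = 0;
    double start = now_seconds();
    double elapsed;
    do {
        work(ctx);
        calls++;
        elapsed = now_seconds() - start;
    } while (elapsed < min_seconds);
    return (double)bytes_per_call * (double)calls / elapsed / 1e6;
}
```

Because the loop runs for a fixed minimum duration, a file the size of alice29 gets many iterations, so a single timer tick no longer dominates the result. The timer is read once per iteration, which adds a small overhead for very short workloads.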
I still don't understand, even partially, the decompression code, yet, at first glance the code generated by Intel 15.0 is tight and makes full use of registers, no?!
Cannot say keep up the fantastic work since you made good already.
Best,
Sanmayce