
Implement brotli's version of lz_extend. #364

Open

LucasSloan wants to merge 1 commit into master

Conversation

LucasSloan

This provides a 2-3% speedup for compression, with no change to the bytestream. See here for the brotli version.

Before:

```
Compression             Total  Compressed        | Compression                   |
Method                   Size        Size  Ratio | Iters       Time         Rate |
----------------------------------------------------------------------------------
libdeflate-gzip:1   270062086    85630362  3.154 |    20   26.493 s  194.43 MB/s |
libdeflate-gzip:2   270062086    84037129  3.214 |    20   35.747 s  144.10 MB/s |
libdeflate-gzip:3   270062086    82391861  3.278 |    20   39.707 s  129.73 MB/s |
libdeflate-gzip:4   270062086    81420541  3.317 |    20   43.029 s  119.71 MB/s |
libdeflate-gzip:5   270062086    78832080  3.426 |    20   50.630 s  101.74 MB/s |
libdeflate-gzip:6   270062086    78021372  3.461 |    20   63.719 s   80.84 MB/s |
libdeflate-gzip:7   270062086    77594012  3.480 |    20   87.918 s   58.59 MB/s |
libdeflate-gzip:8   270062086    77190199  3.499 |    20  147.452 s   34.93 MB/s |
libdeflate-gzip:9   270062086    77156775  3.500 |    20  191.025 s   26.97 MB/s |
```

After:

```
Compression             Total  Compressed        | Compression                   |
Method                   Size        Size  Ratio | Iters       Time         Rate |
----------------------------------------------------------------------------------
libdeflate-gzip:1   270062086    85630362  3.154 |    20   26.228 s  196.39 MB/s |
libdeflate-gzip:2   270062086    84037129  3.214 |    20   34.950 s  147.38 MB/s |
libdeflate-gzip:3   270062086    82391861  3.278 |    20   39.140 s  131.61 MB/s |
libdeflate-gzip:4   270062086    81420541  3.317 |    20   41.927 s  122.86 MB/s |
libdeflate-gzip:5   270062086    78832080  3.426 |    20   50.023 s  102.97 MB/s |
libdeflate-gzip:6   270062086    78021372  3.461 |    20   61.799 s   83.35 MB/s |
libdeflate-gzip:7   270062086    77594012  3.480 |    20   85.528 s   60.23 MB/s |
libdeflate-gzip:8   270062086    77190199  3.499 |    20  145.867 s   35.31 MB/s |
libdeflate-gzip:9   270062086    77156775  3.500 |    20  189.208 s   27.22 MB/s |
```
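
For reference, the brotli-style shape being ported here is a re-rolled word-at-a-time loop that increments the pointers instead of indexing into the buffers. A rough sketch using libdeflate's existing helpers (illustrative only, not the exact diff; `lz_extend_sketch` is just a placeholder name):

```
/*
 * Sketch of a brotli-style lz_extend: no unrolling, and the pointers
 * are advanced directly instead of indexing with `len`.
 */
static forceinline unsigned
lz_extend_sketch(const u8 *strptr, const u8 *matchptr,
		 const unsigned start_len, const unsigned max_len)
{
	const u8 * const strptr_start = strptr;
	const u8 * const strptr_end = strptr + max_len;
	machine_word_t v_word;

	strptr += start_len;
	matchptr += start_len;

	if (UNALIGNED_ACCESS_IS_FAST) {
		/* Compare one machine word per iteration. */
		while (strptr + WORDBYTES <= strptr_end) {
			v_word = load_word_unaligned(matchptr) ^
				 load_word_unaligned(strptr);
			if (v_word != 0) {
				/* Locate the first mismatching byte in the word. */
				if (CPU_IS_LITTLE_ENDIAN())
					strptr += bsfw(v_word) >> 3;
				else
					strptr += (WORDBITS - 1 - bsrw(v_word)) >> 3;
				return strptr - strptr_start;
			}
			strptr += WORDBYTES;
			matchptr += WORDBYTES;
		}
	}

	/* Byte-at-a-time tail. */
	while (strptr < strptr_end && *matchptr == *strptr) {
		strptr++;
		matchptr++;
	}
	return strptr - strptr_start;
}
```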

@ebiggers
Owner

Thanks for looking into this! This is a part of the code where it feels like there should be something better, but it's been hard to find something that can actually be shown to be better.

When saying that there's a certain percent performance improvement, especially for small differences, it would be helpful to mention which CPU and compiler are used. In a couple of quick tests, I don't see the 2-3% improvement on the CPU and compiler combinations I currently happen to do the most testing on -- AMD Zen 1 with the latest versions of gcc and clang, and an arm64 processor with clang. It's either a negligible improvement or even a slight regression.

I think you're probably onto something, though; loops should only be unrolled when it provides a clear benefit, and writing the code to increment the pointers is probably a better choice than using indexing, especially since many non-x86 CPUs do not support indexed addressing, so the compiler has to convert the code to pointer increments there anyway.

Before making a change here, it would be helpful to really understand how the two separate changes (re-rolling the loop, and not using indexed addressing) each affect the generated assembly with gcc and clang, and what result that has on different CPUs. That should allow making a more informed choice, and maybe point to something even better.
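
For concreteness, the two inner-loop shapes in question, stripped down to something small enough to compare in compiler output (hypothetical reductions, not code from either tree; the function names are made up):

```
/* (a) Indexed form, as in master: both loads address base[len]. */
static unsigned
matching_words_indexed(const u8 *strptr, const u8 *matchptr,
		       unsigned len, unsigned max_len)
{
	while (len + WORDBYTES <= max_len &&
	       load_word_unaligned(&matchptr[len]) ==
	       load_word_unaligned(&strptr[len]))
		len += WORDBYTES;
	return len;
}

/* (b) Pointer-incrementing form, brotli-style: both pointers advance,
 * trading the index arithmetic for two pointer updates per iteration. */
static unsigned
matching_words_incremented(const u8 *strptr, const u8 *matchptr,
			   unsigned len, unsigned max_len)
{
	const u8 *s = strptr + len;
	const u8 *m = matchptr + len;
	const u8 * const end = strptr + max_len;

	while (s + WORDBYTES <= end &&
	       load_word_unaligned(m) == load_word_unaligned(s)) {
		s += WORDBYTES;
		m += WORDBYTES;
	}
	return s - strptr;
}
```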

@LucasSloan
Author

I'm testing on EPYC Rome (Zen 2) with clang built from HEAD.

It could also be the dataset I’m testing on - a pile of minified JavaScript and CSS.

What is your preferred dataset?

@ebiggers
Owner

For quick testing I just use an amalgamation of a bunch of different files. For more detailed testing I do look at different types of files specifically, including types that users of libdeflate have mentioned are important to them, for example bioinformatics data. I should probably add more JavaScript and CSS to my testing.

Testing on a file containing DNA sequencing data, I happen to see a ~2% improvement with clang, which is similar to what you're reporting. I'm guessing that the significant factor here is the distribution of match lengths in the data.

At the same time, I don't see that improvement with gcc on that file, and just building with gcc instead of clang results in a 10% improvement there in the first place.

I'm thinking your change is probably a benefit overall, but I'd like to really understand what's going on here as there may be a way to do even better.

@LucasSloan
Author

A quick test suggests that just removing the loop unrolling and the gotos is better than this change:

(quality 6)

HEAD: 81.82 MB/s
this PR: 82.55 MB/s
no unrolling: 84.47 MB/s

```
static forceinline unsigned
lz_extend(const u8 * const strptr, const u8 * const matchptr,
	  const unsigned start_len, const unsigned max_len)
{
	unsigned len = start_len;
	machine_word_t v_word;

	if (UNALIGNED_ACCESS_IS_FAST) {
		while (len + WORDBYTES <= max_len) {
			v_word = load_word_unaligned(&matchptr[len]) ^
				 load_word_unaligned(&strptr[len]);
			if (v_word != 0) {
				if (CPU_IS_LITTLE_ENDIAN())
					len += (bsfw(v_word) >> 3);
				else
					len += (WORDBITS - 1 - bsrw(v_word)) >> 3;
				return len;
			}
			len += WORDBYTES;
		}
	}

	while (len < max_len && matchptr[len] == strptr[len])
		len++;
	return len;
}
```

@LucasSloan
Author

Just switching to pointer arithmetic also seems better than HEAD, but admittedly the variability of my measurements is pretty high:

(quality 6)

HEAD: 80.82 MB/s
pointer arithmetic: 81.40 MB/s

```
static forceinline unsigned
lz_extend(const u8 *strptr, const u8 *matchptr,
	  const unsigned start_len, unsigned max_len)
{
	const u8 * const strptr_orig = strptr;
	machine_word_t v_word;
	unsigned matching_bytes;

	max_len -= start_len;
	strptr += start_len;
	matchptr += start_len;

	if (UNALIGNED_ACCESS_IS_FAST) {

		if (likely(max_len >= 4 * WORDBYTES)) {

		#define COMPARE_WORD_STEP					\
			v_word = load_word_unaligned(matchptr) ^	\
				 load_word_unaligned(strptr);		\
			matchptr += WORDBYTES;				\
			max_len -= WORDBYTES;				\
			if (v_word != 0)				\
				goto word_differs;			\
			strptr += WORDBYTES;

			COMPARE_WORD_STEP
			COMPARE_WORD_STEP
			COMPARE_WORD_STEP
			COMPARE_WORD_STEP
		#undef COMPARE_WORD_STEP
		}

		while (max_len >= WORDBYTES) {
			v_word = load_word_unaligned(matchptr) ^
				 load_word_unaligned(strptr);
			matchptr += WORDBYTES;
			max_len -= WORDBYTES;
			if (v_word != 0)
				goto word_differs;
			strptr += WORDBYTES;
		}
	}

	while (max_len && *matchptr == *strptr) {
		max_len--;
		matchptr++;
		strptr++;
	}
	return strptr - strptr_orig;

word_differs:
	if (CPU_IS_LITTLE_ENDIAN())
		matching_bytes = (bsfw(v_word) >> 3);
	else
		matching_bytes = (WORDBITS - 1 - bsrw(v_word)) >> 3;
	return strptr - strptr_orig + matching_bytes;
}
```
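
If it helps while iterating on variants, here is a throwaway cross-check (hypothetical, not part of libdeflate; the memcpy/__builtin_ctzll stand-ins for load_word_unaligned/bsfw and the little-endian assumption are mine) that compares a word-at-a-time extender against a byte-at-a-time reference on randomized buffers:

```
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef uint8_t u8;
#define WORDBYTES 8

/* Reference: extend the match one byte at a time. */
static unsigned
lz_extend_ref(const u8 *strptr, const u8 *matchptr,
	      unsigned start_len, unsigned max_len)
{
	unsigned len = start_len;

	while (len < max_len && matchptr[len] == strptr[len])
		len++;
	return len;
}

/* Swap this body out for whichever variant is being benchmarked. */
static unsigned
lz_extend_test(const u8 *strptr, const u8 *matchptr,
	       unsigned start_len, unsigned max_len)
{
	unsigned len = start_len;
	uint64_t a, b;

	while (len + WORDBYTES <= max_len) {
		memcpy(&a, &strptr[len], WORDBYTES);
		memcpy(&b, &matchptr[len], WORDBYTES);
		if (a != b)
			return len + (unsigned)(__builtin_ctzll(a ^ b) >> 3);
		len += WORDBYTES;
	}
	while (len < max_len && matchptr[len] == strptr[len])
		len++;
	return len;
}

int
main(void)
{
	enum { BUFSIZE = 4096 };
	static u8 buf1[BUFSIZE], buf2[BUFSIZE];

	srand(1);
	for (int iter = 0; iter < 100000; iter++) {
		for (int i = 0; i < BUFSIZE; i++)
			buf1[i] = (u8)rand();
		memcpy(buf2, buf1, BUFSIZE);
		/* Usually plant one mismatch so long matches get exercised. */
		if (rand() % 4 != 0)
			buf2[rand() % BUFSIZE] ^= (u8)(1 + rand() % 255);

		unsigned max_len = 1 + rand() % BUFSIZE;
		unsigned start_len = rand() % (max_len + 1);
		unsigned ref = lz_extend_ref(buf1, buf2, start_len, max_len);
		unsigned got = lz_extend_test(buf1, buf2, start_len, max_len);

		if (ref != got) {
			printf("mismatch: start_len=%u max_len=%u ref=%u got=%u\n",
			       start_len, max_len, ref, got);
			return 1;
		}
	}
	printf("OK\n");
	return 0;
}
```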
