
Optimize String with Small String Optimization (SSO) #11241

Open
YYF233333 opened this issue Nov 27, 2024 · 17 comments
@YYF233333

YYF233333 commented Nov 27, 2024

Describe the project you are working on

Godot Editor

Describe the problem or limitation you are having in your project

String usually spends a lot of time doing heap allocation (in resize). This is easy to detect by profiling editor workloads (adding/removing nodes, undo/redo, etc.), since the editor uses String heavily.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

As discussed in #77158, Small String Optimization (SSO) improves string performance by eliminating heap allocation for small/short strings.

To prove we do have many small strings, I hooked String::resize to get the distribution of string sizes; the results are below:

Collected from opening the project manager and opening an empty project. Note that the last column represents all sizes beyond 64.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Store strings shorter than a certain size (e.g. 32 bytes) directly in the String struct, and only allocate once a string grows large enough. Care must be taken to preserve the COW semantics.
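A minimal sketch of the layout (a hypothetical `SsoString` type, not Godot's actual String/CowData implementation; the COW machinery is omitted):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch: strings up to SSO_CAPACITY bytes live in an inline
// buffer inside the struct; longer strings fall back to the heap. Godot's
// real String would additionally need COW reference counting.
struct SsoString {
	static constexpr size_t SSO_CAPACITY = 31;

	size_t length = 0;
	char *heap = nullptr; // non-null only when the string outgrew the buffer
	char inline_buf[SSO_CAPACITY + 1] = {};

	bool is_inline() const { return heap == nullptr; }
	const char *data() const { return is_inline() ? inline_buf : heap; }

	explicit SsoString(const char *s) {
		length = std::strlen(s);
		if (length <= SSO_CAPACITY) {
			std::memcpy(inline_buf, s, length + 1); // no allocation at all
		} else {
			heap = new char[length + 1];
			std::memcpy(heap, s, length + 1);
		}
	}
	~SsoString() {
		delete[] heap; // delete[] on nullptr is a no-op
	}
	SsoString(const SsoString &) = delete; // copying/COW omitted in this sketch
	SsoString &operator=(const SsoString &) = delete;
};
```

A short string never touches the allocator; only strings past the threshold pay for a heap block.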

If this enhancement will not be used often, can it be worked around with a few lines of script?

No, it is core.

Is there a reason why this should be core and not an add-on in the asset library?

It is core.

@Calinou
Member

Calinou commented Nov 27, 2024

Remember that Strings in Godot have internally used UTF-32 (for its fixed width) since 4.0, so the threshold for small string optimization would likely be 8 characters for 32 bytes.

I'm surprised to see such a spike of string resize values at 24 and 25 too. Are these values in bytes or character counts?
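For reference, the byte arithmetic works out as follows (a trivial check, using `char32_t` as the UTF-32 code unit):

```cpp
#include <cstddef>

// UTF-32 code units are 4 bytes each, so 32 bytes of inline storage hold
// 8 fixed-width characters (before accounting for a length field etc.).
static_assert(sizeof(char32_t) == 4, "UTF-32 code units are 4 bytes");
constexpr size_t SSO_BYTES = 32;
constexpr size_t SSO_CHARS = SSO_BYTES / sizeof(char32_t); // 8 characters
```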

@YYF233333
Author

YYF233333 commented Nov 27, 2024

It is the value passed to String::resize, so character counts. The conclusion should be the same though.

Printing is so slow that I haven't tested real workloads; I will update the results if I find a way.

@Ivorforce

Ivorforce commented Nov 27, 2024

I was interested in how string concatenation can lead to such performance problems (as discussed in godotengine/godot#77158), so I had a look at the relevant source code myself.
I like all of what @komugi1211s, @SlugFiller, and @YYF233333 had to say on the matter. There are a lot of good thoughts here. I, too, stumbled upon some very curious uses of clock cycles in the code.

One example is the use of strlen in a lot of places, especially in both StringBuilder and StringBuffer. Do we interact with 20-year-old C code so much that we don't have size knowledge? I tracked down the uses of the code. Here's one innocuous one: shader_gles3.cpp. It's passing in a constant string and immediately computing its size in the append implementation. Yikes!
But what's that one line below? It's a reference to p_vertex_code, which is passed down into the function from elsewhere. Where's that string coming from? Why, it's from canvas.glsl.gen.h! The strings in this file make up about 40k characters, which are parsed for length not just once, but twice (_setup and _add_stage)! And then anew for each shader that's created!

There are a few other strange choices, for example the use of 3 different (COW, no less) Vectors in StringBuilder instead of a single one, leading to (almost) three times as many RAM allocations and opening the code up to cache contention. Not to mention re-allocating these every few calls, leading to even worse caching problems, even though most often the exact strings to be concatenated are well known at compile time.

No wonder string handling in Godot is so slow! I feel like if we want to make actual performance improvements to string related code, we should:

  • disallow the use of char* (and strlen especially...) throughout the codebase, replacing it with a little struct UnownedString { const char *string; size_t length; } and passing those around.
    • Notably, this happens for all static-string constructor calls to String and StringName too.
  • Add a template <typename... Args> concatenate_strings(Args... args) { ... } function that uses the stack to join input strings, avoiding unnecessary allocation.
  • Start using std::variant to linearize memory use in StringBuilder, avoiding problems with the rest of cases where the number of strings is not known at compile time, to reduce cache contentions and RAM waste.

I think SSO is a great idea too! It may help with cache contentions and unnecessary allocations. But it may be worth it to check first where all these small strings are allocating from, to see if there aren't more fundamental problems that should be resolved first.
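The suggested `concatenate_strings()` could look roughly like this (a sketch using `std::string`/`std::string_view` as stand-ins for Godot's own String types; the real version would build on Godot's classes):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Sketch of the suggested concatenate_strings(): measure every input first,
// reserve the final buffer once, then append. The whole result is built
// with a single allocation instead of one reallocation per append.
template <typename... Args>
std::string concatenate_strings(const Args &...p_args) {
	const size_t total = (std::string_view(p_args).size() + ... + size_t(0));
	std::string out;
	out.reserve(total); // one allocation up front
	(out.append(p_args), ...);
	return out;
}
```

Mixed input types (string literals whose lengths are known at compile time, plus runtime strings) work because each argument keeps its own type through the parameter pack.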

@SlugFiller

I agree with most of what @Ivorforce says, with one exception:

  • Start using std::variant in StringBuilder to avoid problems with the rest of cases where the number of strings is not known at compile time, to optimize the cache contentions and RAM waste.

This would reduce the number of Vectors, but wouldn't change the fact that you're using a Vector, with reallocations, in the first place. In that case it's better to have a single reallocating buffer instead, for the simple reason that if the number of allocations during append is the same, the reallocating buffer comes out one allocation ahead overall, because it removes the final allocation for the complete buffer.

Also, a small issue with not using const char* is that C/C++ does not offer a string constant with precalculated length. I'm not sure if it's possible to force the length to be calculated at compile time.

@clayjohn
Member

I think this is a good idea. In my profiling I found that StringBuffer was still slightly faster than using String directly, even for larger strings like those used in the shader compiler. I suspect it is due to StringBuffer explicitly Po2-resizing the String ahead of time. When using String directly we do end up doing Po2 allocations, since it uses Vector internally, but there is a lot of bookkeeping around it. By manually doing a Po2 resize, we skip that bookkeeping.
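The pre-sizing described above can be sketched with a hypothetical helper (not Godot's actual bookkeeping):

```cpp
#include <cstddef>

// Round a requested size up to the next power of two. Growing a buffer
// this way means n appends trigger only O(log n) reallocations, and doing
// the Po2 resize once up front (as StringBuffer does) also skips the
// per-append bookkeeping.
inline size_t next_po2(size_t p_n) {
	size_t p = 1;
	while (p < p_n) {
		p <<= 1;
	}
	return p;
}
```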

@Ivorforce

This would reduce the number of Vectors, but wouldn't change the fact that you're using a Vector, with reallocations, in the first place. In that case it's better to have a single reallocating buffer instead, for the simple reason that if the number of allocations during append is the same, the reallocating buffer comes out one allocation ahead overall, because it removes the final allocation for the complete buffer.

I saw your tests, and was impressed by how much of a difference it made! I think it will be less pronounced with the giant ~20k char strings we're dealing with in the shader code though, since every reallocation thereafter would lead to a ~20 KB RAM alloc and dealloc. But I'm open to testing!

Also, a small issue with not using const char* is that C/C++ does not offer a string constant with precalculated length. I'm not sure if it's possible to force the length to be calculated at compile time.

Luckily it is possible! Here's some code for that:

	// A constructor can't return a value; build the members directly instead.
	// Note that len includes the trailing '\0', so the character count is len - 1.
	template <size_t len>
	UnownedString(const char (&p_cstring)[len]) :
			string(p_cstring), length(len - 1) {}

@SlugFiller

every reallocation thereafter would lead to a ~20 KB RAM alloc and dealloc

I've explained this previously in RocketChat, but it only applies to WebAssembly, since that uses a custom memory allocator (provided by Emscripten). Every platform that uses a platform-provided reallocator, or otherwise has access to memory mapping, will not actually deallocate the memory before reallocating.

Instead, it will simply remap all the pages other than the first and last ones into new addresses, to create the illusion of a continuous span of memory addresses, even though it's only continuous in virtual address space.

If the page size is 4 KB (the default on Linux), then a ~20 KB realloc is actually just 5 page moves, each hardly more expensive than a function call.
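This behavior can be observed directly via Linux's mremap() (a Linux-specific sketch; `grow_mapping` is a hypothetical helper, and glibc's realloc can do this internally for large blocks):

```cpp
#include <cassert>
#include <cstring>
#include <sys/mman.h> // Linux-specific; mremap() is a GNU extension

// Grow an anonymous mapping in place (or move it in virtual address
// space, MREMAP_MAYMOVE). The kernel moves page-table entries rather
// than copying page contents, so growing a ~20 KB block is a handful of
// page moves, not a 20 KB memcpy.
void *grow_mapping(void *p_ptr, size_t p_old_size, size_t p_new_size) {
	return mremap(p_ptr, p_old_size, p_new_size, MREMAP_MAYMOVE);
}
```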

@Ivorforce

Very interesting! I definitely want to test this. But yes, I agree that means there's all the more reason to question the current setup of the code.

@YYF233333
Author

  • Add a template <typename... Args> concatenate_strings(Args... args) { ... } function that uses the stack to join input strings, avoiding unnecessary allocation.

Interesting idea. I think that should be paired with some stack-based string type (e.g. built on std::array)?

Actually, I have thought about moving all short strings to the stack (and facing the risk of stack overflow :) ).

But it may be worth it to check first where all these small strings are allocating from, to see if there aren't more fundamental problems that should be resolved first.

There are tons of them, actually. For example, just take a look at vformat and String::sprintf. They lie too deep; changing them, you end up with changes everywhere. I don't dare to touch that, but yes, I'm willing to see refactors in these places.

@Ivorforce

Interesting idea, I think that should be paired with some stack-based string (e.g. std::array)?

typename... Args is needed to support mixed-type inputs (String and char* / UnownedString). std::array can only handle same-type elements.

There are, actually tons of them. For example just take a look at vformat and String::sprintf. They just lay too deep, changing them you end up with changes everywhere. I do not dare to touch that, but yes I'm willing to see refactors in these place.

Yeah, I can see how those uses would be intimidating to adjust 'just' for a bit of String optimization. I do think it would be doable, but it would involve quite a bit of work.

@YYF233333
Author

By the way, should we change the title? I think the discussion has been far beyond SSO.

@Ivorforce

We diverged a bit because it is an interesting topic, but I think it would be best to keep this proposal for SSO and branch off to other proposals for other changes. I have opened #11249 for the strlen calls, and we have godotengine/godot#77158 for StringBuilder vs StringBuffer (although it may be worth it to add a proposal for this one as well, as its merge seems to have been stalled for quite a while due to its implications, and may need more data / a clear write-up to convince maintainers for a merge).

@Ivorforce

Ivorforce commented Dec 1, 2024

I just realized - the best spot to implement this addition would probably be CowData. String is basically a semantic wrapper over CowData which handles all the allocation tasks.
This would be an especially interesting consideration because it would mean one implementation could be used for all Vector users (by statically specifying the local buffer size). In particular, it could be used to optimize all small arrays - including Array and Packed arrays.

@YYF233333 Do you think it would be possible to repeat your experiment hooking into Vector.resize instead of String.resize, grouping the results by the typename T?
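To illustrate the CowData idea, a sketch of an inline buffer generalized over the element type (a hypothetical `SmallVec`, with the COW reference counting a real CowData version would need left out):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: up to INLINE_CAPACITY elements live in a buffer
// inside the struct; the first heap allocation happens only on spill.
// The threshold can be specified statically per element type.
template <typename T, size_t INLINE_CAPACITY>
struct SmallVec {
	size_t size = 0;
	T inline_buf[INLINE_CAPACITY] = {};
	T *heap = nullptr;
	size_t heap_capacity = 0;

	bool is_inline() const { return heap == nullptr; }
	T *data() { return is_inline() ? inline_buf : heap; }

	void push_back(const T &p_value) {
		if (is_inline() && size == INLINE_CAPACITY) {
			// Spill to the heap: this is the first allocation.
			heap_capacity = INLINE_CAPACITY * 2;
			heap = new T[heap_capacity];
			for (size_t i = 0; i < size; i++) {
				heap[i] = inline_buf[i];
			}
		} else if (!is_inline() && size == heap_capacity) {
			heap_capacity *= 2;
			T *bigger = new T[heap_capacity];
			for (size_t i = 0; i < size; i++) {
				bigger[i] = heap[i];
			}
			delete[] heap;
			heap = bigger;
		}
		data()[size++] = p_value;
	}

	~SmallVec() { delete[] heap; }
	SmallVec() = default;
	SmallVec(const SmallVec &) = delete; // copying/COW omitted in this sketch
};
```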

@YYF233333
Author

YYF233333 commented Dec 2, 2024

I just realized - the best spot to implement this addition would probably be CowData. String is basically a semantic wrapper over CowData which handles all the allocation tasks. This would be an especially interesting consideration because it would mean one implementation could be used for all Vector users (by statically specifying the local buffer size). In particular, it could be used to optimize all small arrays - including Array and Packed arrays.

@YYF233333 Do you think it would be possible to repeat your experiment hooking into Vector.resize instead of String.resize, grouping the results by the typename T?

Makes sense to me, but for Array it highly depends on the workload, since it is exposed to users. I'll try Vector first.

Edit: this time I counted sizes in bytes, truncated at 128 bytes (the last column represents all larger allocations).

Open Project Manager

Image

Open an empty project

Image

@Ivorforce

Thank you!
Are you normalizing over all the data instead of by each group? In your graph it's a bit difficult to see, for each type, what ratio of resizes is small and what ratio is larger.

@YYF233333
Author

Are you normalizing over all the data instead of by each group?

Yes. If you prefer it normalized by type, here it is:

project manager

Image

empty project

Image

Sum up all types:

project manager

Image

empty project

Image

@Ivorforce

Yes. If you prefer normalized by type here it is:


Awesome. Looks like String is the best contender for small array optimization (except StringName which probably isn't widely used), but there may be value in optimizing small arrays for other types too.
