Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace List with LocalVector if used as simple buffer #99936

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

YYF233333
Copy link
Contributor

A byproduct when I go through List. These list are used as simple buffer to store value temporarily, which means:

  • They are local variables.
  • They get value by push_back.
  • Values are consumed by iterating the List.

So I change them to LocalVector, as they do not depend on container implementation.

There are other kinds of usages which can be replaced, but need some prepare first:

  • Same as above, but iterate from tail to head. Need a reverse iterator for LocalVector.
  • Used as stack. Need a pop_back() method for LocalVector.
  • Used as queue. Shall we implement a dedicated Queue?

@AThousandShips
Copy link
Member

AThousandShips commented Dec 3, 2024

This needs some checking to ensure these are safe uses, the swap is only valid if the container size is not changed while iterating, the code in Node for example looks unsafe

@AThousandShips
Copy link
Member

Also important to evaluate is the tradeoff between locality and reduced overhead in memory on one hand, and the cost of reallocation and moving of data on the other

One of the big benefits of List is that changing the size reallocates no memory, so it would depend on the data type stored and the cost of repeatedly reallocating data depending on how the size of the container changes and the overhead there

@YYF233333
Copy link
Contributor Author

YYF233333 commented Dec 3, 2024

This needs some checking to ensure these are safe uses, the swap is only valid if the container size is not changed while iterating, the code in Node for example looks unsafe

My bad, that's a queue. Double checked, no other resize during iterating.

One of the big benefits of List is that changing the size reallocates no memory, so it would depend on the data type stored and the cost of repeatedly reallocating data depending on how the size of the container changes and the overhead there

All these containers gather value by push_back, for List it is an alloc per call. I think alloc times matters more than alloc size.

If performance turns out to be problem, we can call reserve().

@clayjohn
Copy link
Member

clayjohn commented Dec 3, 2024

We should be using reserve everywhere that we can. I.e. for places where we know the total number of elements, reserve exactly enough for the total number of elements, and where we only know a possible range, we should allocate enough to cover the range.

@YYF233333
Copy link
Contributor Author

We should be using reserve everywhere that we can. I.e. for places where we know the total number of elements, reserve exactly enough for the total number of elements, and where we only know a possible range, we should allocate enough to cover the range.

That should be another PR I guess, we should check all LocalVector usage, not just these. Will implement that if have time. Or you just prefer refactor these first?

@stuartcarnie
Copy link
Contributor

I imagine lawnjelly's FixedArray would be beneficial in some cases, if the size is known:

https://github.com/godotengine/godot/blob/71bbe3b00c40986859045e0397f6a3afe311c3e2/core/templates/fixed_array.h

It allocates on the stack, avoiding a heap allocation. It would be useful if VectorView accepted it as it would mean there are numerous cases where it could be used to create things like uniform sets without allocating on the heap.

@YYF233333
Copy link
Contributor Author

YYF233333 commented Dec 5, 2024

Just see #99328 and I think I have to clarify something with this performance tag.

The first priority of this PR is not to improve performance (If these List cause problem we should have found them long ago).

My major concern here is List is generally not considered to play well with modern processors (cache stuff). If you use it, you probably have some special needs (e.g. insert in the middle). If not, like in this PR, the use of List may confuse someone when they touch the code.

So this should be considered a codestyle stuff first, and I did some test to show it is not going to cause systematic regression.

MicroBenchmark

push_back one by one from a std::vector and iterating through to do some nonsense stuff. Also test if you sort before iterate because many code do this.

| relative |               ns/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |              121.56 |        8,226,243.55 |    0.1% |      0.01 | `List        itemsize: 4 length: 1 sort: 0`
|   207.7% |               58.53 |       17,085,847.77 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 1 sort: 0`
|   100.0% |              484.70 |        2,063,120.98 |    0.2% |      0.01 | `List        itemsize: 4 length: 7 sort: 0`
|   195.7% |              247.67 |        4,037,581.08 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 7 sort: 0`
|   100.0% |            3,865.53 |          258,696.92 |    0.2% |      0.05 | `List        itemsize: 4 length: 63 sort: 0`
|   812.5% |              475.78 |        2,101,827.68 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 63 sort: 0`
|   100.0% |           34,327.83 |           29,130.88 |    1.1% |      0.41 | `List        itemsize: 4 length: 511 sort: 0`
| 3,671.7% |              934.93 |        1,069,603.20 |    0.7% |      0.01 | `LocalVector itemsize: 4 length: 511 sort: 0`
|   100.0% |          254,799.90 |            3,924.65 |    0.2% |      3.05 | `List        itemsize: 4 length: 4095 sort: 0`
| 8,035.9% |            3,170.79 |          315,378.95 |    1.0% |      0.04 | `LocalVector itemsize: 4 length: 4095 sort: 0`
|   100.0% |              128.63 |        7,774,033.51 |    0.2% |      0.01 | `List        itemsize: 24 length: 1 sort: 0`
|   174.8% |               73.58 |       13,590,218.24 |    0.0% |      0.01 | `LocalVector itemsize: 24 length: 1 sort: 0`
|   100.0% |              508.46 |        1,966,727.22 |    0.2% |      0.01 | `List        itemsize: 24 length: 7 sort: 0`
|   163.0% |              311.90 |        3,206,153.31 |    0.1% |      0.01 | `LocalVector itemsize: 24 length: 7 sort: 0`
|   100.0% |            4,053.12 |          246,723.21 |    0.3% |      0.05 | `List        itemsize: 24 length: 63 sort: 0`
|   368.0% |            1,101.36 |          907,966.79 |    0.2% |      0.01 | `LocalVector itemsize: 24 length: 63 sort: 0`
|   100.0% |           32,916.59 |           30,379.81 |    0.2% |      0.39 | `List        itemsize: 24 length: 511 sort: 0`
|   543.4% |            6,057.23 |          165,092.06 |    1.3% |      0.07 | `LocalVector itemsize: 24 length: 511 sort: 0`
|   100.0% |          284,490.54 |            3,515.06 |    0.2% |      3.40 | `List        itemsize: 24 length: 4095 sort: 0`
|   680.3% |           41,816.05 |           23,914.26 |    0.5% |      0.50 | `LocalVector itemsize: 24 length: 4095 sort: 0`
|   100.0% |              121.97 |        8,198,987.54 |    0.1% |      0.01 | `List        itemsize: 4 length: 1 sort: 1`
|   197.9% |               61.64 |       16,223,083.55 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 1 sort: 1`
|   100.0% |              561.59 |        1,780,657.10 |    0.6% |      0.01 | `List        itemsize: 4 length: 7 sort: 1`
|   219.0% |              256.46 |        3,899,191.63 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 7 sort: 1`
|   100.0% |            4,212.65 |          237,380.50 |    0.4% |      0.05 | `List        itemsize: 4 length: 63 sort: 1`
|   678.8% |              620.59 |        1,611,370.31 |    0.2% |      0.01 | `LocalVector itemsize: 4 length: 63 sort: 1`
|   100.0% |           38,429.21 |           26,021.87 |    0.2% |      0.46 | `List        itemsize: 4 length: 511 sort: 1`
| 1,569.3% |            2,448.83 |          408,357.83 |    0.5% |      0.03 | `LocalVector itemsize: 4 length: 511 sort: 1`
|   100.0% |          309,527.59 |            3,230.73 |    0.2% |      3.70 | `List        itemsize: 4 length: 4095 sort: 1`
| 1,573.5% |           19,671.17 |           50,835.82 |    0.3% |      0.23 | `LocalVector itemsize: 4 length: 4095 sort: 1`
|   100.0% |              128.69 |        7,770,760.97 |    1.3% |      0.01 | `List        itemsize: 24 length: 1 sort: 1`
|   154.5% |               83.30 |       12,004,334.72 |    4.2% |      0.01 | `LocalVector itemsize: 24 length: 1 sort: 1`
|   100.0% |              738.00 |        1,355,013.55 |    1.8% |      0.01 | `List        itemsize: 24 length: 7 sort: 1`
|   138.9% |              531.46 |        1,881,610.87 |    0.2% |      0.01 | `LocalVector itemsize: 24 length: 7 sort: 1`
|   100.0% |            6,569.49 |          152,218.86 |    0.1% |      0.08 | `List        itemsize: 24 length: 63 sort: 1`
|   150.1% |            4,377.90 |          228,420.02 |    0.2% |      0.05 | `LocalVector itemsize: 24 length: 63 sort: 1`
|   100.0% |           69,331.06 |           14,423.55 |    0.2% |      0.83 | `List        itemsize: 24 length: 511 sort: 1`
|   132.0% |           52,521.50 |           19,039.82 |    0.2% |      0.63 | `LocalVector itemsize: 24 length: 511 sort: 1`
|   100.0% |          724,062.43 |            1,381.10 |    0.1% |      8.65 | `List        itemsize: 24 length: 4095 sort: 1`
|   123.5% |          586,161.22 |            1,706.02 |    0.1% |      7.01 | `LocalVector itemsize: 24 length: 4095 sort: 1`
Source Code

Just try to run MyTest() in any godot environment. I hack test_main.cpp, thanks @Ivorforce for the idea.

Using this nice small lib nanobench.h

#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"
#include "core/templates/list.h"
#include "core/templates/local_vector.h"
#include <ostream>
#include <sstream>
#include <vector>

template<typename T>
class Test
{
    T placehold;
    List<T> list;
    LocalVector<T> vector;

public:
    void bench_list(const std::vector<T> &source, bool sort)
    {
        for (auto &item : source)
        {
            list.push_back(item);
        }
        if (sort)
        {
            list.sort();
        }
        for (auto &item : list)
        {
            placehold = item;
        }
    }

    void bench_local_vector(const std::vector<T> &source, bool sort)
    {
        for (auto &item : source)
        {
            vector.push_back(item);
        }
        if (sort)
        {
            vector.sort();
        }
        for (auto &item : vector)
        {
            placehold = item;
        }
    }

    static void bench(bool sort)
    {
        for (int i = 1; i < 4096 + 1; i *= 8)
        {
            int n = i == 1 ? 1 : i-1;
            std::ostringstream name;
            name << "itemsize: " << sizeof(T) << " length: " << n << " sort: " << sort;
            std::vector<T> source(n);
            auto bench = ankerl::nanobench::Bench()
                .minEpochIterations(1000)
                .relative(true);
            bench.run("List        " + name.str(), [&]{
                Test test;
                test.bench_list(source, sort);
            });
            bench.run("LocalVector " + name.str(), [&]{
                Test test;
                test.bench_local_vector(source, sort);
            });
        }
    }
};

void mytest(void)
{
    Test<int>::bench(false);
    Test<Variant>::bench(false);
    Test<int>::bench(true);
    Test<Variant>::bench(true);
}

@YYF233333
Copy link
Contributor Author

Add reserve to be aligned with #100251.

Copy link
Contributor

@Ivorforce Ivorforce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is, generally, a great idea, and pretty unlikely to cause performance regressions. Linked lists are rarely useful. In fact, I would advocate to rename List to LinkedList to avoid accidental use (e.g. by Java folks) because it is rarely more efficient than a linear LocalVector. LocalVector should be the go-to default, and the use of LinkedList should be well-proven to be faster than LocalVector in each specific case, documented next to the variable to justify its use.

That being said — this is a pretty big change, and I gotta say if I were the code maintainer I would be uncomfortable to merge it because of potential regressions. I would therefore advocate to split the PR and address each use separately, to make them small, comfortable and somewhat tested chunks. It's annoying, but it's safe.

@@ -1582,13 +1583,6 @@ void AnimationNodeStateMachine::remove_transition_by_index(const int p_transitio
Transition tr = transitions[p_transition];
transitions.write[p_transition].transition->disconnect("advance_condition_changed", callable_mp(this, &AnimationNodeStateMachine::_tree_changed));
transitions.remove_at(p_transition);

Vector<String> path_from = String(tr.from).split("/");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are removing this because it is unused? Shouldn't that be a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, especially when I see how fast codestyle PRs are merged :)

@YYF233333
Copy link
Contributor Author

That being said — this is a pretty big change, and I gotta say if I were the code maintainer I would be uncomfortable to merge it because of potential regressions. I would therefore advocate to split the PR and address each use separately, to make them small, comfortable and somewhat tested chunks. It's annoying, but it's safe.

Yes, but it is already split from a bigger one :) Despite being somewhat scattered, these changes are essentially the same pattern.

Safety should probably be ok, I double checked they follow the restriction on top, but I cannot prove that to you (unless you read the code).

I'm waiting to see if #100251 can be merged efficiently. If that can not, I'd better give up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants