Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADBDEV-6156 Count startup memory of each process when using resource groups #1023

Open
wants to merge 11 commits into
base: adb-6.x-dev
Choose a base branch
from

Conversation

dnskvlnk
Copy link
Collaborator

@dnskvlnk dnskvlnk commented Aug 23, 2024

Count startup memory of each process when using resource groups

The goal of the patch is to make resource groups manager to track startup
memory of each process so that the redzone handler would estimate memory more
accurately, but the context needs to be brought.

Startup memory is considered correctly by the VMEM tracker, but it's not
accounted by the resource groups manager because on a process' startup this
memory is not added to self->memUsage. This happened because the startup memory
is considered on VMEM tracker initialization, but the ways VMEM tracker and
resource groups count memory are not consistent with each other.

This patch adds startup memory consumption to self->memUsage to make resource
groups consider this memory.

However, one should note that once a process is detached from a resource group
(it finished query execution), resource groups memory manager doesn't see this
memory again. This happens due to resource groups' design. We could've moved
inactive processes' memory to freeChunks, but that would drain freeChunks pretty
fast and cause new processes to fail because any process starts without a
resource group and acquires it later, thus it must add it's startup memory to
freeChunks at first. If freeChunks is already exhausted, a new process fails to
start. Also, there are other complexities in making resource groups manager to
fully consider startup memory. Thankfully, we have the VMEM tracker which
calculates it correctly, so if the resource groups manager fails to track it
correctly, we can still rely on it. It's impossible to solve this problem in
this patch because it's a design flaw rather than a bug.

@BenderArenadata
Copy link

Failed job Deploy multiarch Dockerimages: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807317

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/78383

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807327

@BenderArenadata
Copy link

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807328

@dnskvlnk dnskvlnk marked this pull request as ready for review August 26, 2024 16:22
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/78465

@BenderArenadata
Copy link

Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817041

@BenderArenadata
Copy link

Failed job Regression tests with Postgres on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817039

@BenderArenadata
Copy link

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817048

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818658

@BenderArenadata
Copy link

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818659

@RekGRpth
Copy link
Member

Can you write some tests to check?

@BenderArenadata
Copy link

DROP

-- start_ignore
! gpstop -rai;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be add

! gpconfig -r gp_resource_manager;

before this line?

@RekGRpth
Copy link
Member

RekGRpth commented Sep 3, 2024

resgroup/enable_resgroup test hangs after applying patch.

@BenderArenadata
Copy link

@BenderArenadata
Copy link

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/79897

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1891649

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80245

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1911167

@BenderArenadata
Copy link

Failed job Build ubuntu22 for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1924163

@BenderArenadata
Copy link

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80289

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1926814

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80301

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80692

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1979177

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83439

The goal of the patch is to make resource groups manager to track startup
memory of each process so that the redzone handler would estimate memory more
accurately, but the context needs to be brought.

Startup memory is considered correctly by the VMEM tracker, but it's not
accounted by the resource groups manager because on a process' startup this
memory is not added to self->memUsage. This happened because the startup memory
is considered on VMEM tracker initialization, but the ways VMEM tracker and
resource groups count memory are not consistent with each other.

This patch adds startup memory consumption to self->memUsage to make resource
groups consider this memory.

However, one should note that once a process is detached from a resource group
(it finished query execution), resource groups memory manager doesn't see this
memory again. This happens due to resource groups' design. We could've moved
inactive processes' memory to freeChunks, but that would drain freeChunks pretty
fast and cause new processes to fail because any process starts without a
resource group and acquires it later, thus it must add it's startup memory to
freeChunks at first. If freeChunks is already exhausted, a new process fails to
start. Also, there are other complexities in making resource groups manager to
fully consider startup memory. Thankfully, we have the VMEM tracker which
calculates it correctly, so if the resource groups manager fails to track it
correctly, we can still rely on it. It's impossible to solve this problem in
this patch because it's a design flaw rather than a bug.
@dnskvlnk dnskvlnk changed the title ADBDEV-6156 Track startup memory for resource group ADBDEV-6156 Count startup memory of each process when using resource groups Oct 20, 2024
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83499

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83582

@dnskvlnk dnskvlnk marked this pull request as ready for review October 21, 2024 17:02
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83693

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83850

src/test/regress/regress_gp.c Outdated Show resolved Hide resolved
src/test/regress/regress_gp.c Outdated Show resolved Hide resolved
-- Start a session which will be detached from a group when the query is done
-- resource groups can't see startup chunks occupied by a detached session but
-- the vmem tracker can
0: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid != (SELECT pg_backend_pid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this query required? What does exactly this test check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test checks that the vmem tracker can see startup chunks of those processes which are not attached to any resource group. It starts the 0 session which terminates all other processes which can interfere this test's results and then creates a table just to make sure that this session is spawned on all segments. This session is then detached from a resource group and is tracked by vmem only.
Other sessions start transactions so that they can be attached to their resource groups. They allocate all available memory to a resource group. The last allocation should fail because of the vmem tracker, so we could make sure that the vmem tracker actually tracks the memory of those sessions which are not attached to a resource group

Copy link
Member

@bandetto bandetto Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the expected behavior of this test before the patch?

@bandetto
Copy link
Member

Maybe add sql/resgroup/resgroup_startup_memory.sql and expected/resgroup/resgroup_startup_memory.out to src/test/isolation2/.gitignore? For some reason, not all the generated sql and expected files from input/resgroup/ are ignored, but all tests inside input/ folder are.

@dnskvlnk
Copy link
Collaborator Author

@bandetto

Maybe add sql/resgroup/resgroup_startup_memory.sql and expected/resgroup/resgroup_startup_memory.out to src/test/isolation2/.gitignore? For some reason, not all the generated sql and expected files from input/resgroup/ are ignored, but all tests inside input/ folder are.

Yes, this is pretty strange, so I would agree with you. However, I'm not sure about ignoring .sql since they are actually tracked, but some are not, as far as I remember

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83966

@bandetto
Copy link
Member

However, I'm not sure about ignoring .sql since they are actually tracked, but some are not, as far as I remember

Tests which have input files in input/ instead of sql/ get their sql/..sql files generated out of files from the input/ directory. It means that any sql/..sql file from a test that uses input/ directory can safely be ignored. Seems that this ignoring thing is inconsistent, so it's is up to you to fix it in this patch :^)

@dnskvlnk
Copy link
Collaborator Author

@bandetto I know that, but there are some tests which are generated from .source file but the respective .sql is still tracked. Though I agree with you that if a test is generated then .sql should be ignored

In order to run tests, we need the old version of resGroupPalloc, however, to
keep old tests intact, we need to add it under a new name, because the old
logic of resGroupPalloc isn't compatible with the new behaviour
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/84568

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/84658

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants