ADBDEV-6156 Count startup memory of each process when using resource groups #1023

Open
wants to merge 26 commits into
base: adb-6.x-dev

Collaborator

@dnskvlnk dnskvlnk commented Aug 23, 2024

Count the startup memory of each active process when using resource groups

Make the resource manager track the startup memory of each active backend so
that the runaway detector can estimate memory usage more accurately.

The startup memory is the memory that a backend consumes between startup and
the initialization of the memory managers (Vmem tracker and resource groups).
The Vmem tracker counts this memory as consumed by the segment, but after the
backend was assigned a resource group, this memory was not counted as consumed
by the group.

This patch adds the startup memory consumption to self->memUsage so that
resource groups account for this memory.
Additionally, this patch slightly modifies the resGroupPalloc function so that
it takes startup memory into account. This is necessary to avoid changing or
complicating the logic of existing tests.

Note that this patch fixes memory accounting only for active backends (those
executing a query). Accounting for the memory occupied by idle backends is a
more complex task that should be done separately.
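
A minimal sketch of the accounting idea described above, not the patch itself.
The ToyGroup and ToyBackend types and both functions are hypothetical stand-ins.
Only the name memUsage comes from the description, and whether the real code
also updates a per-group counter in this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a resource group; the real structure differs. */
typedef struct ToyGroup
{
    int32_t memUsage;        /* chunks charged to the group */
} ToyGroup;

/* Hypothetical stand-in for the per-backend resource group state. */
typedef struct ToyBackend
{
    int32_t startupChunks;   /* consumed before the memory managers started */
    int32_t memUsage;        /* chunks charged to this backend */
    ToyGroup *group;         /* NULL while no group is assigned */
} ToyBackend;

/* Assign a group and charge the startup chunks to both counters. */
static void
toy_assign_group(ToyBackend *self, ToyGroup *group)
{
    self->group = group;
    self->memUsage += self->startupChunks;
    group->memUsage += self->startupChunks;
}

/* Release the group and undo the startup charge. */
static void
toy_unassign_group(ToyBackend *self)
{
    self->group->memUsage -= self->startupChunks;
    self->memUsage -= self->startupChunks;
    self->group = NULL;
}
```

In this model a backend with 12 startup chunks raises the group usage by 12 as
soon as it is assigned, which is the amount the description says was previously
missing from the group's view.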

@BenderArenadata

Failed job Deploy multiarch Dockerimages: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807317

@BenderArenadata

Allure report https://allure.adsw.io/launch/78383

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807327

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807328

@dnskvlnk dnskvlnk marked this pull request as ready for review August 26, 2024 16:22
@BenderArenadata

Allure report https://allure.adsw.io/launch/78465

@BenderArenadata

Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817041

@BenderArenadata

Failed job Regression tests with Postgres on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817039

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817048

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818658

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818659

@RekGRpth
Member

Can you write some tests to check?

DROP

-- start_ignore
! gpstop -rai;
Member

Maybe add

! gpconfig -r gp_resource_manager;

before this line?

Collaborator Author

@dnskvlnk dnskvlnk Nov 14, 2024

This is done in the disable_resgroup test.

@RekGRpth
Member

RekGRpth commented Sep 3, 2024

The resgroup/enable_resgroup test hangs after applying the patch.

@BenderArenadata

Allure report https://allure.adsw.io/launch/79897

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1891649

@BenderArenadata

Allure report https://allure.adsw.io/launch/80245

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1911167

@BenderArenadata

Failed job Build ubuntu22 for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1924163

@BenderArenadata

Allure report https://allure.adsw.io/launch/80289

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1926814

@BenderArenadata

Allure report https://allure.adsw.io/launch/80301

@BenderArenadata

Allure report https://allure.adsw.io/launch/86597

@BenderArenadata

Allure report https://allure.adsw.io/launch/86650

bandetto previously approved these changes Nov 25, 2024
@BenderArenadata

Allure report https://allure.adsw.io/launch/90016

@BenderArenadata

Allure report https://allure.adsw.io/launch/90168

-- The runaway detector test. A query with a large number of slices should
-- be terminated due to high memory consumption.
select count(*) from t1 a1 join t1 a2 using(a) join t1 a3 using(a) join t1 a4 using(a) join t1 a5 using(a) join t1 a6 using(a) join t1 a7 using(a) join t1 a8 using(a) join t1 a9 using(a) join t1 a10 using(a);
ERROR: Canceling query because of high VMEM usage. current group id is 712716, group memory usage 133 MB, group shared memory quota is 102 MB, slot memory quota is 0 MB, global freechunks memory is 277 MB, global safe memory threshold is 277 MB (runaway_cleaner.c:197) (seg1 slice10 172.18.0.3:6003 pid=88018) (runaway_cleaner.c:197)
Member

Please rewrite the test so that the memory consumption error occurs for different reasons before and after the patch.

@BenderArenadata

Allure report https://allure.adsw.io/launch/90809

@BenderArenadata

Allure report https://allure.adsw.io/launch/90926

Comment on lines 658 to 662
if (startUpMbRemains > 0)
{
    size = Max(0, size - startUpMbRemains);
    startUpMbRemains = Max(0, startUpMbRemains - size);
}
Member

The current code calls MemoryContextAlloc when size is 0.
With the input startUpMbRemains = 12 and size = 4, the output is size = Max(0, 4 - 12) = 0 and startUpMbRemains = Max(0, 12 - 0) = 12, but startUpMbRemains should be 8 afterwards.
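
The arithmetic in this comment can be checked with a small standalone sketch.
The two helper functions and the local Max macro are hypothetical and exist
only for illustration. The first helper reproduces the update order of the
commented lines, the second uses the branch structure proposed in this comment,
with a plain return value standing in for PG_RETURN_INT32.

```c
#include <assert.h>

/* Local copy of a Max macro so the sketch compiles on its own. */
#define Max(a, b) ((a) > (b) ? (a) : (b))

/*
 * Hypothetical helper: the update order of the commented lines.
 * Returns the number of MB that still has to be allocated.
 */
static int
charge_original(int *startUpMbRemains, int size)
{
    if (*startUpMbRemains > 0)
    {
        size = Max(0, size - *startUpMbRemains);
        /* size was already reduced above, so too little is subtracted here */
        *startUpMbRemains = Max(0, *startUpMbRemains - size);
    }
    return size;
}

/*
 * Hypothetical helper: consume the startup reserve first and report
 * only the remainder as memory that still has to be allocated.
 */
static int
charge_suggested(int *startUpMbRemains, int size)
{
    if (*startUpMbRemains >= size)
    {
        *startUpMbRemains -= size;
        return 0;               /* fully covered, nothing to allocate */
    }
    size -= *startUpMbRemains;
    *startUpMbRemains = 0;
    return size;
}
```

With a reserve of 12 and a request of 4, charge_original leaves the reserve at
12, while charge_suggested reduces it to 8. Both return 0 for that input.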

Suggested change
-if (startUpMbRemains > 0)
-{
-    size = Max(0, size - startUpMbRemains);
-    startUpMbRemains = Max(0, startUpMbRemains - size);
-}
+if (startUpMbRemains >= size)
+{
+    startUpMbRemains -= size;
+    PG_RETURN_INT32(0);
+}
+size -= startUpMbRemains;
+startUpMbRemains = 0;

done

@andr-sokolov
Member

"but after the backend is assigned a resource group, this memory is not counted as consumed by the group." - it has been fixed, so let's write in past tense

* startup memory consumption, but let it be just for symmetry.
*/
void
ResGroupProcSubStartupChunks(int32 chunks)
Member

@andr-sokolov andr-sokolov Jan 15, 2025

What about removing this function and calling ResGroupProcAddStartupChunks(-startupChunks) instead of ResGroupProcSubStartupChunks(startupChunks) or removing the chunks argument and using VmemTracker_GetStartupChunks()?
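
A rough sketch of the first option, with one signed entry point in place of a
separate add and subtract pair. The counter and the function name are
hypothetical and do not reflect the real ResGroupProcAddStartupChunks
implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-process counter standing in for the real accounting state. */
static int32_t toyStartupChunks = 0;

/* Single entry point: a positive delta charges chunks, a negative one releases them. */
static void
toy_add_startup_chunks(int32_t delta)
{
    toyStartupChunks += delta;
    /* The counter must never go negative. */
    assert(toyStartupChunks >= 0);
}
```

A caller would release the charge with toy_add_startup_chunks(-chunks), so no
separate subtract function is needed.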

removed.

@@ -11,6 +11,10 @@
--
-- end_matchsubs

-- start_ignore
! gpstop -rai;
Member

Why was this line added?

reverted

@@ -135,7 +139,7 @@ SELECT num_running FROM gp_toolkit.gp_resgroup_status WHERE rsgname='rg_move_que
1&: SELECT pg_sleep(3);
2: SET ROLE role_move_query_mem_small;
2: BEGIN;
-2: SELECT hold_memory_by_percent_on_qe(1,0.1);
+2: SELECT hold_memory_by_percent_on_qe(1,0.2);
Member

Why is 0.1 replaced with 0.2 here and below?

reverted

ALTER RESOURCE GROUP admin_group SET memory_shared_quota 0;
ALTER RESOURCE GROUP default_group SET memory_shared_quota 0;

create resource group rg1 with (cpu_rate_limit=20, memory_limit=15, memory_shared_quota=100, memory_spill_ratio=0);
Member

Do we really need memory_spill_ratio?

removed

Comment on lines 28 to 29
ALTER RESOURCE GROUP admin_group SET memory_shared_quota 0;
ALTER RESOURCE GROUP default_group SET memory_shared_quota 0;
Member

Why do we need to change memory_shared_quota?

removed

@BenderArenadata

Allure report https://allure.adsw.io/launch/91028

@BenderArenadata

Allure report https://allure.adsw.io/launch/91031

@BenderArenadata

Allure report https://allure.adsw.io/launch/91084

6 participants