ADBDEV-6156 Count startup memory of each process when using resource groups #1023

dnskvlnk · 2024-08-23T13:14:43Z

Count the startup memory of each active process when using resource groups

Make the resource manager track the startup memory of each active backend so
that the runaway detector estimates memory usage more accurately.

The startup memory is the memory that the backend consumes after startup before
the memory managers (Vmem tracker and resource groups) are initialized. The Vmem
tracker counts this memory as consumed by the segment, but after the backend was
assigned a resource group, this memory was not counted as consumed by the group.

This patch adds startup memory consumption to self->memUsage to make resource
groups consider this memory.
Additionally, this patch slightly modifies the resGroupPalloc function so that
it takes startup memory into account. This is necessary to avoid changing or
complicating the logic of existing tests.

It is worth noting that this patch fixes memory accounting only for active
backends (those executing a query). Accounting for the memory occupied by idle
backends is a more complex task that should be done separately.
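
The accounting idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the patch itself: the `ProcUsage` and `GroupUsage` structs and the `attach_to_group` / `detach_from_group` helpers are hypothetical stand-ins for the real resource group structures, and charging the group-level counter is an assumption. Only the core idea, adding the memory a backend consumed before the memory managers were initialized to its `memUsage`, comes from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-backend accounting record (stand-in for "self"). */
typedef struct ProcUsage
{
	int32_t		startupChunks;	/* consumed before the memory managers started */
	int32_t		memUsage;		/* chunks charged to the backend's group */
} ProcUsage;

/* Hypothetical per-group accounting record. */
typedef struct GroupUsage
{
	int32_t		memUsage;		/* chunks charged to the whole group */
} GroupUsage;

/* Charge the backend's startup memory when it becomes active in a group. */
void
attach_to_group(ProcUsage *self, GroupUsage *group)
{
	assert(self->startupChunks >= 0);
	self->memUsage += self->startupChunks;
	group->memUsage += self->startupChunks;	/* assumption: group is charged too */
}

/* Undo the charge when the backend leaves the group. */
void
detach_from_group(ProcUsage *self, GroupUsage *group)
{
	self->memUsage -= self->startupChunks;
	group->memUsage -= self->startupChunks;
}
```

With this kind of bookkeeping, a group's usage reflects the startup memory of every backend currently attached to it, which is the quantity the runaway detector compares against the group's limits.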

BenderArenadata · 2024-08-23T13:47:57Z

Failed job Deploy multiarch Dockerimages: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807317

BenderArenadata · 2024-08-26T06:22:36Z

Allure report https://allure.adsw.io/launch/78383

BenderArenadata · 2024-08-26T06:39:58Z

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807327

BenderArenadata · 2024-08-26T06:44:02Z

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807328

BenderArenadata · 2024-08-26T16:59:36Z

Allure report https://allure.adsw.io/launch/78465

BenderArenadata · 2024-08-26T17:04:35Z

Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817041

BenderArenadata · 2024-08-26T17:04:49Z

Failed job Regression tests with Postgres on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817039

BenderArenadata · 2024-08-26T17:50:58Z

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817048

BenderArenadata · 2024-08-27T07:11:23Z

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818658

BenderArenadata · 2024-08-27T07:13:36Z

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818659

RekGRpth · 2024-08-28T03:15:39Z

Can you write some tests to check?

src/include/utils/resgroup.h

BenderArenadata · 2024-09-03T05:47:14Z

Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1850837

src/test/isolation2/isolation2_resgroup_schedule

RekGRpth · 2024-09-03T08:27:00Z

src/test/isolation2/output/resgroup/resgroup_startup_memory.source

+DROP
+
+-- start_ignore
+! gpstop -rai;


Maybe add

! gpconfig -r gp_resource_manager;

before this line?

This is done in the disable_resgroup test.

RekGRpth · 2024-09-03T08:28:54Z

The resgroup/enable_resgroup test hangs after applying the patch.

BenderArenadata · 2024-09-08T11:11:14Z

Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1880757

BenderArenadata · 2024-09-10T11:58:36Z

Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1890216

BenderArenadata · 2024-09-10T17:22:18Z

Allure report https://allure.adsw.io/launch/79897

BenderArenadata · 2024-09-10T17:40:59Z

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1891649

BenderArenadata · 2024-09-13T16:55:57Z

Allure report https://allure.adsw.io/launch/80245

BenderArenadata · 2024-09-13T17:14:15Z

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1911167

src/test/regress/regress_gp.c

BenderArenadata · 2024-09-14T15:01:26Z

Failed job Build ubuntu22 for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1924163

BenderArenadata · 2024-09-14T15:06:19Z

Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1924162

BenderArenadata · 2024-09-14T19:04:56Z

Allure report https://allure.adsw.io/launch/80289

BenderArenadata · 2024-09-14T19:22:58Z

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1926814

BenderArenadata · 2024-09-15T17:07:27Z

Allure report https://allure.adsw.io/launch/80301

BenderArenadata · 2024-11-22T09:34:55Z

Allure report https://allure.adsw.io/launch/86597

BenderArenadata · 2024-11-22T13:56:36Z

Allure report https://allure.adsw.io/launch/86650

BenderArenadata · 2024-12-26T06:25:21Z

Allure report https://allure.adsw.io/launch/90016

BenderArenadata · 2024-12-27T04:40:50Z

Allure report https://allure.adsw.io/launch/90168

andr-sokolov · 2024-12-27T09:31:26Z

src/test/isolation2/expected/resgroup/resgroup_startup_memory.out

+-- The runaway detector test. A query with a large number of slices should
+-- be terminated due to high memory consumption.
+select count(*) from t1 a1 join t1 a2 using(a) join t1 a3 using(a) join t1 a4 using(a) join t1 a5 using(a) join t1 a6 using(a) join t1 a7 using(a) join t1 a8 using(a) join t1 a9 using(a) join t1 a10 using(a);
+ERROR:  Canceling query because of high VMEM usage. current group id is 712716, group memory usage 133 MB, group shared memory quota is 102 MB, slot memory quota is 0 MB, global freechunks memory is 277 MB, global safe memory threshold is 277 MB (runaway_cleaner.c:197)  (seg1 slice10 172.18.0.3:6003 pid=88018) (runaway_cleaner.c:197)


Please rewrite the test so that the memory consumption error occurs for different reasons before and after the patch.

BenderArenadata · 2025-01-14T09:16:18Z

Allure report https://allure.adsw.io/launch/90809

src/test/regress/regress_gp.c

BenderArenadata · 2025-01-15T12:31:49Z

Allure report https://allure.adsw.io/launch/90926

andr-sokolov · 2025-01-15T13:03:12Z

src/test/regress/regress_gp.c

+	if (startUpMbRemains > 0)
+	{
+		size = Max(0, size - startUpMbRemains);
+		startUpMbRemains = Max(0, startUpMbRemains - size);
+	}


The current code calls MemoryContextAlloc when size is 0.
For example, with startUpMbRemains = 12 and size = 4, the result is size = Max(0, 4 - 12) = 0 and then startUpMbRemains = Max(0, 12 - 0) = 12. But startUpMbRemains should be 8.

Suggested change

-	if (startUpMbRemains > 0)
-	{
-		size = Max(0, size - startUpMbRemains);
-		startUpMbRemains = Max(0, startUpMbRemains - size);
-	}
+	if (startUpMbRemains >= size)
+	{
+		startUpMbRemains -= size;
+		PG_RETURN_INT32(0);
+	}
+	size -= startUpMbRemains;
+	startUpMbRemains = 0;
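
The arithmetic in this comment can be checked with a small standalone sketch. The function names `charge_original` and `charge_suggested` are made up for illustration, the local `Max` macro stands in for the PostgreSQL one, and a plain return value replaces `PG_RETURN_INT32`. Each function returns the size that would still have to be allocated after the startup memory is taken into account.

```c
#include <assert.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))

/*
 * Logic under review: size is clamped first, so the second statement
 * subtracts the already reduced size from the remaining startup memory.
 */
int
charge_original(int *startUpMbRemains, int size)
{
	if (*startUpMbRemains > 0)
	{
		size = Max(0, size - *startUpMbRemains);
		*startUpMbRemains = Max(0, *startUpMbRemains - size);
	}
	return size;
}

/*
 * Suggested logic: consume the remaining startup memory first and only
 * report what is left over as the size to allocate.
 */
int
charge_suggested(int *startUpMbRemains, int size)
{
	assert(size >= 0);
	if (*startUpMbRemains >= size)
	{
		*startUpMbRemains -= size;
		return 0;				/* fully covered, nothing to allocate */
	}
	size -= *startUpMbRemains;
	*startUpMbRemains = 0;
	return size;
}
```

Starting from startUpMbRemains = 12 and size = 4, the original logic leaves startUpMbRemains at 12, while the suggested logic reduces it to 8 and returns 0 without reaching the allocation.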

andr-sokolov · 2025-01-15T13:50:07Z

"but after the backend is assigned a resource group, this memory is not counted as consumed by the group." - it has been fixed, so let's write in past tense

andr-sokolov · 2025-01-15T13:54:54Z

src/backend/utils/resgroup/resgroup.c

+ * startup memory consumption, but let it be just for symmetry.
+ */
+void
+ResGroupProcSubStartupChunks(int32 chunks)


What about removing this function and calling ResGroupProcAddStartupChunks(-startupChunks) instead of ResGroupProcSubStartupChunks(startupChunks)? Alternatively, remove the chunks argument and use VmemTracker_GetStartupChunks().
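
A rough sketch of the first option, under stated assumptions: the counter `groupStartupChunks` and the function `add_startup_chunks` are hypothetical and only illustrate how a single signed entry point can cover both directions. They are not the real resource group code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter standing in for the group's startup memory total. */
static int32_t groupStartupChunks = 0;

/* One signed entry point: positive values add, negative values subtract. */
void
add_startup_chunks(int32_t chunks)
{
	groupStartupChunks += chunks;
	/* The total must never drop below zero. */
	assert(groupStartupChunks >= 0);
}
```

A caller would pass a positive count when a backend joins the group and the negated count when it leaves, so a separate subtract function is not needed.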

andr-sokolov · 2025-01-16T05:37:15Z

src/test/isolation2/input/resgroup/resgroup_move_query.source

@@ -11,6 +11,10 @@
 --
 -- end_matchsubs

+-- start_ignore
+! gpstop -rai;


Why is this line added?

andr-sokolov · 2025-01-16T05:38:14Z

src/test/isolation2/input/resgroup/resgroup_move_query.source

@@ -135,7 +139,7 @@ SELECT num_running FROM gp_toolkit.gp_resgroup_status WHERE rsgname='rg_move_que
 1&: SELECT pg_sleep(3);
 2: SET ROLE role_move_query_mem_small;
 2: BEGIN;
-2: SELECT hold_memory_by_percent_on_qe(1,0.1);
+2: SELECT hold_memory_by_percent_on_qe(1,0.2);


Why is 0.1 replaced with 0.2 here and below?

andr-sokolov · 2025-01-16T07:12:16Z

src/test/isolation2/input/resgroup/resgroup_startup_memory.source

+ALTER RESOURCE GROUP admin_group SET memory_shared_quota 0;
+ALTER RESOURCE GROUP default_group SET memory_shared_quota 0;
+
+create resource group rg1 with (cpu_rate_limit=20, memory_limit=15, memory_shared_quota=100, memory_spill_ratio=0);


Do we really need memory_spill_ratio?

andr-sokolov · 2025-01-16T09:33:50Z

src/test/isolation2/input/resgroup/resgroup_startup_memory.source

+ALTER RESOURCE GROUP admin_group SET memory_shared_quota 0;
+ALTER RESOURCE GROUP default_group SET memory_shared_quota 0;


Why do we need to change memory_shared_quota?

BenderArenadata · 2025-01-16T11:47:50Z

Allure report https://allure.adsw.io/launch/91028

BenderArenadata · 2025-01-16T12:18:20Z

Allure report https://allure.adsw.io/launch/91031

BenderArenadata · 2025-01-17T04:47:20Z

Allure report https://allure.adsw.io/launch/91084

dnskvlnk force-pushed the ADBDEV-6156 branch from bb36a62 to 398662f · August 26, 2024 16:22

dnskvlnk marked this pull request as ready for review · August 26, 2024 16:22

RekGRpth reviewed Aug 28, 2024 · src/include/utils/resgroup.h (outdated)

RekGRpth reviewed Sep 3, 2024 · src/test/isolation2/isolation2_resgroup_schedule

RekGRpth reviewed Sep 3, 2024

RekGRpth reviewed Sep 14, 2024 · src/test/regress/regress_gp.c (outdated)

Fix library path for test output · 475f3cb

Merge branch 'adb-6.x-dev' into ADBDEV-6156 · ef3a7e5

bandetto previously approved these changes Nov 25, 2024

KnightMurloc dismissed bandetto’s stale review via 4a4a3c2 · December 26, 2024 05:50

new test · 524afa5

KnightMurloc force-pushed the ADBDEV-6156 branch from 4a4a3c2 to 524afa5 · December 26, 2024 05:52

fix test · 021c829

andr-sokolov reviewed Dec 27, 2024

rework test · bd7ab6e

andr-sokolov reviewed Jan 15, 2025 · src/test/regress/regress_gp.c (outdated)

rework resGroupPalloc · c765b67

andr-sokolov reviewed Jan 15, 2025

andr-sokolov reviewed Jan 16, 2025

KnightMurloc added 2 commits January 16, 2025 17:46

improve the test · 0c1831d

improve the test · caa4e57

Merge branch 'adb-6.x-dev' into ADBDEV-6156 · 492e4bd

reduce diff · c4f373c
