[native] Add LinuxMemoryChecker check/warning to ensure system-mem-limit-gb is reasonably set #24149

Open
minhancao wants to merge 1 commit into master from linuxmemorychecker_mem_limit_check

Contributor

@minhancao minhancao commented Nov 26, 2024

Description

Add checks and warnings to ensure that
system-memory-gb <= system-mem-limit-gb <= available machine memory of the deployment.

For cgroup v1:
Set the available machine memory of the deployment to the smaller of
MemTotal in /proc/meminfo and the value in memory.limit_in_bytes.

For cgroup v2:
Set the available machine memory of the deployment to the smaller of
MemTotal in /proc/meminfo and the value in memory.max.
If memory.max contains the string "max" (no limit set), use MemTotal
from /proc/meminfo alone.
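
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `parseMemTotalBytes`, `parseCgroupLimitBytes`, and `availableMemoryOfDeployment` are hypothetical, and file contents are passed in as strings so the logic can be exercised without reading /proc or /sys/fs/cgroup. It assumes MemTotal is reported in kB, as /proc/meminfo does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: find the MemTotal line in /proc/meminfo-style text and
// convert its value from kB to bytes. Returns nullopt if the field is missing.
std::optional<uint64_t> parseMemTotalBytes(const std::string& meminfo) {
  static const std::string kKey = "MemTotal:";
  std::istringstream in(meminfo);
  std::string line;
  while (std::getline(in, line)) {
    if (line.compare(0, kKey.size(), kKey) == 0) {
      std::istringstream fields(line.substr(kKey.size()));
      uint64_t kb = 0;
      if (fields >> kb) {
        return kb * 1024;
      }
    }
  }
  return std::nullopt;
}

// Hypothetical helper: parse the content of memory.limit_in_bytes (cgroup v1)
// or memory.max (cgroup v2). The literal "max" means no cgroup limit is set.
std::optional<uint64_t> parseCgroupLimitBytes(const std::string& content) {
  std::istringstream in(content);
  std::string token;
  if (!(in >> token) || token == "max") {
    return std::nullopt;
  }
  try {
    return std::stoull(token);
  } catch (const std::exception&) {
    return std::nullopt;  // Not a number, or too large for 64 bits.
  }
}

// Smaller of MemTotal and the cgroup limit; MemTotal alone when the cgroup
// file reports no usable limit.
uint64_t availableMemoryOfDeployment(
    uint64_t memTotalBytes,
    const std::string& cgroupLimitContent) {
  const auto limit = parseCgroupLimitBytes(cgroupLimitContent);
  return limit.has_value() ? std::min(memTotalBytes, *limit) : memTotalBytes;
}
```

With this shape, a very large cgroup v1 limit (the kind of "gigantic" value an unlimited cgroup reports) simply loses the `std::min` to MemTotal, so no special case is needed for it.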

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@minhancao minhancao self-assigned this Nov 26, 2024
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Nov 26, 2024
@prestodb-ci prestodb-ci requested review from a team, psnv03 and pramodsatya and removed request for a team November 26, 2024 02:04
@minhancao minhancao marked this pull request as ready for review November 26, 2024 02:07
@minhancao minhancao requested a review from a team as a code owner November 26, 2024 02:07
@minhancao minhancao changed the title [native] Add LinuxMemoryChecker warnings to ensure system-memory-gb < system-mem-limit-gb < actual total memory capacity [native] Add LinuxMemoryChecker warnings to ensure system-mem-limit-gb is reasonably set Nov 26, 2024
@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from 4478ae1 to 15f55bb Compare November 26, 2024 02:29
@minhancao minhancao changed the title [native] Add LinuxMemoryChecker warnings to ensure system-mem-limit-gb is reasonably set [native] Add LinuxMemoryChecker check/warning to ensure system-mem-limit-gb is reasonably set Nov 26, 2024
@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from 15f55bb to 7646600 Compare November 26, 2024 06:06
Contributor

@czentgr czentgr left a comment

Can we add a test with fake files again just like we did with the original tests for this class?
That way we can try the "max" value for cgroup v2, as well as gigantic and reasonable values, covering the various situations we saw when investigating this.

@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch 2 times, most recently from 8da401b to 4ae2cee Compare December 3, 2024 20:08
Contributor

@pramodsatya pramodsatya left a comment

Thanks @minhancao, could you please squash the commits?

@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from 85b3b9d to dab2335 Compare December 13, 2024 00:00
Collaborator

@majetideepak majetideepak left a comment

Ideally, we would avoid checking in data files for testing.
We only need a few fields from the file for testing.
Can we write these required fields to a temporary file as part of the testing?
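
One way to follow this suggestion is to generate the few needed fields at test time. The sketch below uses C++17 `std::filesystem`; the helper names `writeTempFile` and `readFirstToken` are hypothetical and are not taken from this PR's tests, which may rely on a different test utility.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical test helper: write `content` to `dir/name`, creating `dir`
// if needed, and return the full path of the file.
std::filesystem::path writeTempFile(
    const std::filesystem::path& dir,
    const std::string& name,
    const std::string& content) {
  std::filesystem::create_directories(dir);
  const std::filesystem::path path = dir / name;
  std::ofstream out(path, std::ios::trunc);
  out << content;
  return path;  // `out` is flushed and closed when it goes out of scope.
}

// Hypothetical test helper: read the first whitespace-delimited token of a
// file, which is all that memory.max or memory.limit_in_bytes contains.
std::string readFirstToken(const std::filesystem::path& path) {
  std::ifstream in(path);
  std::string token;
  in >> token;
  return token;
}
```

A test can then write only what it needs, for example `"max\n"` for an unlimited cgroup v2 `memory.max` or a single `MemTotal:` line for a fake meminfo file, and remove the directory with `std::filesystem::remove_all` afterwards.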

@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch 2 times, most recently from 71c9fa9 to 89a50a8 Compare January 16, 2025 23:53
@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from 89a50a8 to 163880b Compare January 22, 2025 18:16
@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from 163880b to 29dc3b5 Compare January 28, 2025 00:40
@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch 2 times, most recently from 9a7cbf7 to 80ca9b5 Compare February 6, 2025 22:25
@minhancao
Contributor Author

@czentgr @majetideepak @pramodsatya
I have addressed all the PR comments. Please review this PR when you can, thank you!

@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch 4 times, most recently from f25e5b1 to b52384a Compare February 8, 2025 00:23
@minhancao
Contributor Author

@majetideepak I have updated the PR with some new changes, please review when you can, thank you!

@majetideepak majetideepak self-requested a review February 10, 2025 20:23
@majetideepak majetideepak dismissed their stale review February 10, 2025 20:24

Issue addressed

@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from b52384a to 9be902c Compare February 10, 2025 23:40
@minhancao minhancao requested review from steveburnett, elharo and a team as code owners February 10, 2025 23:40
@minhancao minhancao requested a review from presto-oss February 10, 2025 23:40
@minhancao
Contributor Author

@czentgr @majetideepak Just updated the PR, please review when you can!

steveburnett previously approved these changes Feb 11, 2025
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pulled the branch, did a new doc build, looks good. Thanks!

@minhancao
Contributor Author

minhancao commented Feb 11, 2025

Confirmed this works on a cgroup v1 machine:

I0211 09:50:27.443691 490102 LinuxMemoryChecker.cpp:35] [PRESTO_STARTUP] Using cgroup v1.
I0211 09:50:27.443755 490102 LinuxMemoryChecker.cpp:55] [PRESTO_STARTUP] Using memory stat file: /sys/fs/cgroup/memory/memory.stat
I0211 09:50:27.443924 490102 LinuxMemoryChecker.cpp:58] [PRESTO_STARTUP] Using memory max file /sys/fs/cgroup/memory/memory.limit_in_bytes
I0211 09:50:27.444255 490102 LinuxMemoryChecker.cpp:90] [PRESTO_STARTUP] System memory in bytes: 2147483648
I0211 09:50:27.444301 490102 LinuxMemoryChecker.cpp:93] [PRESTO_STARTUP] System memory limit in bytes: 2147483648
I0211 09:50:27.444613 490102 LinuxMemoryChecker.cpp:97] [PRESTO_STARTUP] Available machine memory of deployment in bytes: 8331362304
I0211 09:50:27.444728 490102 PeriodicMemoryChecker.cpp:48] [PRESTO_STARTUP] Creating server memory pushback checker, memory check interval 1000ms, system memory limit: 2.00GB, memory shrink size: 1.00GB
I0211 09:50:27.444770 490102 PeriodicMemoryChecker.cpp:57] [PRESTO_STARTUP] Malloc memory heap dumper is not enabled

Error when system-mem-limit-gb is higher than the available machine memory of the deployment:

I0211 09:34:08.711416 489817 LinuxMemoryChecker.cpp:35] [PRESTO_STARTUP] Using cgroup v1.
I0211 09:34:08.711459 489817 LinuxMemoryChecker.cpp:55] [PRESTO_STARTUP] Using memory stat file: /sys/fs/cgroup/memory/memory.stat
I0211 09:34:08.711473 489817 LinuxMemoryChecker.cpp:58] [PRESTO_STARTUP] Using memory max file /sys/fs/cgroup/memory/memory.limit_in_bytes
I0211 09:34:08.711670 489817 LinuxMemoryChecker.cpp:90] [PRESTO_STARTUP] System memory in bytes: 2147483648
I0211 09:34:08.711715 489817 LinuxMemoryChecker.cpp:93] [PRESTO_STARTUP] System memory limit in bytes: 64424509440
I0211 09:34:08.711890 489817 LinuxMemoryChecker.cpp:97] [PRESTO_STARTUP] Available machine memory of deployment in bytes: 8331362304
E0211 09:34:08.711936 489817 Exceptions.h:66] Line: /root/presto/presto-native-execution/presto_cpp/main/LinuxMemoryChecker.cpp:101, Function:start, Expression: config_.systemMemLimitBytes <= availableMemoryOfDeployment (64424509440 vs. 8331362304) system memory limit = 64424509440 bytes is higher than the available machine memory of deployment = 8331362304 bytes., Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (64424509440 vs. 8331362304) system memory limit = 64424509440 bytes is higher than the available machine memory of deployment = 8331362304 bytes.
Retriable: False
Expression: config_.systemMemLimitBytes <= availableMemoryOfDeployment
Function: start
File: /root/presto/presto-native-execution/presto_cpp/main/LinuxMemoryChecker.cpp
Line: 101
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_
# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_
# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 5  _ZN8facebook6presto18LinuxMemoryChecker5startEv
# 6  _ZN8facebook6presto12PrestoServer28addMemoryCheckerPeriodicTaskEv
# 7  _ZN8facebook6presto12PrestoServer3runEv
# 8  main
# 9  __libc_start_main
# 10 _start

*** Aborted at 1739295248 (Unix time, try 'date -d @1739295248') ***
*** Signal 6 (SIGABRT) (0x77959) received by PID 489817 (pthread TID 0x7fda627b8e80) (linux TID 489817) (maybe from PID 489817, UID 0) (code: -6), stack trace: ***
    @ 000000000aca5db1 _ZN5folly10symbolizer12_GLOBAL__N_113signalHandlerEiP9siginfo_tPv
                       /root/presto/presto-native-execution/dependencies/deps-download/folly/folly/experimental/symbolizer/SignalHandler.cpp:453
    @ 000000000001441f (unknown)
    @ 000000000004300b gsignal
    @ 0000000000022858 abort
    @ 00000000000a4ee5 (unknown)
    @ 00000000000b6f8b (unknown)
    @ 00000000000b6ff6 _ZSt9terminatev
    @ 00000000000b7257 __cxa_throw
    @ 000000000ac4809a __cxa_throw
                       /root/presto/presto-native-execution/dependencies/deps-download/folly/folly/debugging/exception_tracer/ExceptionTracerLib.cpp:159
    @ 000000000ab4659d _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
                       /root/presto/presto-native-execution/velox/velox/common/base/Exceptions.h:74
                       -> /root/presto/presto-native-execution/velox/velox/common/base/Exceptions.cpp
    @ 000000000099ec70 _ZN8facebook6presto18LinuxMemoryChecker5startEv
                       /root/presto/presto-native-execution/presto_cpp/main/LinuxMemoryChecker.cpp:101
    @ 0000000000ccb692 _ZN8facebook6presto12PrestoServer28addMemoryCheckerPeriodicTaskEv
                       /root/presto/presto-native-execution/presto_cpp/main/PrestoServer.cpp:1044
    @ 0000000000cc6fd2 _ZN8facebook6presto12PrestoServer3runEv
                       /root/presto/presto-native-execution/presto_cpp/main/PrestoServer.cpp:552
    @ 00000000009e137c main
                       /root/presto/presto-native-execution/presto_cpp/main/PrestoMain.cpp:30
    @ 0000000000024082 __libc_start_main
    @ 00000000007c909d _start
Fatal signal handler. ThreadDebugInfo object not found.
Aborted (core dumped)
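
The failed expression in the log, `config_.systemMemLimitBytes <= availableMemoryOfDeployment`, can be reproduced in isolation. The following is a rough stand-in, not the code in LinuxMemoryChecker.cpp: `checkSystemMemLimit` and `kGiB` are hypothetical names, and it throws `std::runtime_error` rather than a Velox error. It also shows where the byte counts in the logs come from, assuming the GB settings are converted with 1 GB = 2^30 bytes (for example, 60 GB gives 64424509440 bytes).

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Bytes per GB as used in this sketch (2^30).
constexpr uint64_t kGiB = 1ULL << 30;

// Hypothetical stand-in for the startup check: reject a system memory limit
// that exceeds the memory actually available to the deployment.
void checkSystemMemLimit(
    uint64_t systemMemLimitBytes,
    uint64_t availableMemoryOfDeployment) {
  if (systemMemLimitBytes > availableMemoryOfDeployment) {
    throw std::runtime_error(
        "system memory limit = " + std::to_string(systemMemLimitBytes) +
        " bytes is higher than the available machine memory of deployment = " +
        std::to_string(availableMemoryOfDeployment) + " bytes.");
  }
}
```

Called with `2 * kGiB` and 8331362304 (the first run above) the check passes; called with `60 * kGiB` and 8331362304 (the second run) it throws.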

@minhancao
Contributor Author

minhancao commented Feb 11, 2025

Confirmed this works on a cgroup v2 machine:

I0211 09:40:50.499085  4661 LinuxMemoryChecker.cpp:46] [PRESTO_STARTUP] Using cgroup v2.
I0211 09:40:50.499154  4661 LinuxMemoryChecker.cpp:55] [PRESTO_STARTUP] Using memory stat file: /sys/fs/cgroup/memory.stat
I0211 09:40:50.499213  4661 LinuxMemoryChecker.cpp:58] [PRESTO_STARTUP] Using memory max file /proc/meminfo
I0211 09:40:50.499545  4661 LinuxMemoryChecker.cpp:89] [PRESTO_STARTUP] System memory in bytes: 2147483648
I0211 09:40:50.499578  4661 LinuxMemoryChecker.cpp:92] [PRESTO_STARTUP] System memory limit in bytes: 4294967296
I0211 09:40:50.499729  4661 LinuxMemoryChecker.cpp:96] [PRESTO_STARTUP] Available machine memory of deployment in bytes: 67421741056
I0211 09:40:50.499758  4661 PeriodicMemoryChecker.cpp:48] [PRESTO_STARTUP] Creating server memory pushback checker, memory check interval 1000ms, system memory limit: 4.00GB, memory shrink size: 20.00GB
I0211 09:40:50.499864  4661 PeriodicMemoryChecker.cpp:57] [PRESTO_STARTUP] Malloc memory heap dumper is not enabled

Error when system-mem-limit-gb is higher than the available machine memory of the deployment:

I0211 09:44:01.242293  4985 LinuxMemoryChecker.cpp:46] [PRESTO_STARTUP] Using cgroup v2.
I0211 09:44:01.242357  4985 LinuxMemoryChecker.cpp:55] [PRESTO_STARTUP] Using memory stat file: /sys/fs/cgroup/memory.stat
I0211 09:44:01.242378  4985 LinuxMemoryChecker.cpp:58] [PRESTO_STARTUP] Using memory max file /proc/meminfo
I0211 09:44:01.242748  4985 LinuxMemoryChecker.cpp:89] [PRESTO_STARTUP] System memory in bytes: 2147483648
I0211 09:44:01.242784  4985 LinuxMemoryChecker.cpp:92] [PRESTO_STARTUP] System memory limit in bytes: 107374182400
I0211 09:44:01.242952  4985 LinuxMemoryChecker.cpp:96] [PRESTO_STARTUP] Available machine memory of deployment in bytes: 67421741056
E0211 09:44:01.242988  4985 Exceptions.h:66] Line: /root/presto/presto-native-execution/presto_cpp/main/LinuxMemoryChecker.cpp:99, Function:start, Expression: config_.systemMemLimitBytes <= availableMemoryOfDeployment (107374182400 vs. 67421741056) system memory limit = 107374182400 bytes is higher than the available machine memory of deployment = 67421741056 bytes., Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (107374182400 vs. 67421741056) system memory limit = 107374182400 bytes is higher than the available machine memory of deployment = 67421741056 bytes.
Retriable: False
Expression: config_.systemMemLimitBytes <= availableMemoryOfDeployment
Function: start
File: /root/presto/presto-native-execution/presto_cpp/main/LinuxMemoryChecker.cpp
Line: 99
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_
# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_
# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 5  _ZN8facebook6presto18LinuxMemoryChecker5startEv
# 6  _ZN8facebook6presto12PrestoServer28addMemoryCheckerPeriodicTaskEv
# 7  _ZN8facebook6presto12PrestoServer3runEv
# 8  main
# 9  0x0000000000029d8f
# 10 __libc_start_main
# 11 _start

*** Aborted at 1739295959 (Unix time, try 'date -d @1739295959') ***
*** Signal 6 (SIGABRT) (0x1372) received by PID 4985 (pthread TID 0x7ffff726b5c0) (linux TID 4985) (maybe from PID 4978, UID 0) (code: 0), stack trace: ***
I0211 09:45:59.955230  4992 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
    @ 000000000a0017c7 _ZN5folly10symbolizer12_GLOBAL__N_113signalHandlerEiP9siginfo_tPv
                       /root/presto_oss_dependencies/folly/folly/experimental/symbolizer/SignalHandler.cpp:453
    @ 000000000004251f (unknown)
    @ 00000000000969fc pthread_kill
    @ 0000000000042475 raise
    @ 00000000000287f2 abort
    @ 00000000000a2b9d (unknown)
    @ 00000000000ae20b (unknown)
    @ 00000000000ae276 _ZSt9terminatev
    @ 00000000000ae4d7 __cxa_throw
    @ 0000000009fefb9e __cxa_throw
                       /root/presto_oss_dependencies/folly/folly/experimental/exception_tracer/ExceptionTracerLib.cpp:159
    @ 0000000009ea03b2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
                       /root/presto/presto-native-execution/velox/velox/common/base/Exceptions.h:82
                       -> /root/presto/presto-native-execution/velox/velox/common/base/Exceptions.cpp
    @ 00000000008bc897 _ZN8facebook6presto18LinuxMemoryChecker5startEv
                       /root/presto/presto-native-execution/presto_cpp/main/LinuxMemoryChecker.cpp:99
    @ 0000000000be378e _ZN8facebook6presto12PrestoServer28addMemoryCheckerPeriodicTaskEv
                       /root/presto/presto-native-execution/presto_cpp/main/PrestoServer.cpp:1044
    @ 0000000000bdf34e _ZN8facebook6presto12PrestoServer3runEv
                       /root/presto/presto-native-execution/presto_cpp/main/PrestoServer.cpp:552
    @ 00000000008fe625 main
                       /root/presto/presto-native-execution/presto_cpp/main/PrestoMain.cpp:30
    @ 0000000000029d8f (unknown)
    @ 0000000000029e3f __libc_start_main
    @ 00000000006e4ec4 _start
Fatal signal handler. ThreadDebugInfo object not found.

@minhancao minhancao force-pushed the linuxmemorychecker_mem_limit_check branch from 988ccf0 to 822c4e1 Compare February 11, 2025 19:28
Collaborator

@majetideepak majetideepak left a comment

Thanks, @minhancao

Labels
from:IBM PR from IBM
6 participants