Troubleshooting
Please follow the steps below before submitting a bug.
It is important to determine what is hanging, i.e. whether it is CFEngine itself
or something that CFEngine is interacting with. Run the program with the -v and -d
flags to see whether it has entered an infinite loop or is waiting for something.
Common causes of hanging processes:
- DBM database corruption: try deleting the *.lmdb files in /var/cfengine.
- Command processes that do not properly close their own file descriptors or those of their child processes. Try running the commands with a shell enabled (use_shell => "yes") and use </dev/null >/dev/null to close the descriptors.
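The second cause can be reproduced outside CFEngine. The sketch below is plain POSIX shell, not CFEngine policy. The function names leaky_command, detached_command and seconds_blocked are made up for illustration, and sleep stands in for a long-lived child process. A command substitution plays the role of a parent that reads the command's output: it cannot finish until every process holding the write end of the pipe has closed it. This is presumably the same mechanism that keeps a commands promise waiting.

```shell
# Hypothetical stand-in for a command that leaves a long-lived child behind.
# The child inherits stdout, so it keeps the caller's pipe open.
leaky_command() {
    sleep 3 &
}

# Detached variant: the child's stdin and stdout point at /dev/null,
# so it does not hold the caller's pipe open.
detached_command() {
    sleep 3 </dev/null >/dev/null &
}

# Print how many whole seconds a command substitution of "$1" blocks.
seconds_blocked() {
    start=$(date +%s)
    output=$("$1")
    end=$(date +%s)
    echo $((end - start))
}
```

Calling seconds_blocked leaky_command should report roughly the full sleep time, while seconds_blocked detached_command should return almost immediately.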
Segfaults in CFEngine may be caused by an incorrect build environment or by bugs in CFEngine or the libraries it uses.
- Install GDB.
- Run the program in GDB from the command line:
  - use the --args option for GDB to pass options to the component
  - for all components, add the --verbose option
  - for daemons, add the --no-fork option
- At the gdb prompt, enter run.
- When the program stops or segfaults, enter backtrace.
- Copy the output from gdb (at least the last ~100 lines) into a ticket.
For example:
% gdb --args ./cf-agent/.libs/lt-cf-agent -KI -f POLICYFILE
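The interactive steps above can also be run unattended, which is convenient when the output has to be attached to a ticket. This is a minimal sketch, assuming a gdb that supports the -batch and -ex options. capture_backtrace and the GDB override variable are hypothetical names, not part of CFEngine.

```shell
# Run a component under gdb in batch mode and save the whole session.
# Usage: capture_backtrace OUTPUT_FILE COMPONENT [COMPONENT_OPTIONS...]
# Example (paths are illustrative):
#   capture_backtrace /tmp/cf-agent-gdb.txt /var/cfengine/bin/cf-agent -K --verbose
capture_backtrace() {
    out=$1
    shift
    # GDB may be set to use a gdb binary that is not on the PATH.
    # "run" starts the program; "backtrace" runs once it has stopped.
    "${GDB:-gdb}" -batch -ex run -ex backtrace --args "$@" >"$out" 2>&1
}
```

If the program exits normally instead of crashing, gdb has no stack to print, so the saved file will not contain a usable backtrace.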
- Install GDB.
- Run the component under gdb from the command line:
  - use the --args option for GDB to pass options to the component
  - for all components, add the --verbose option
  - for daemons, add the --no-fork option
- At the gdb prompt:
  - Add a breakpoint for programming errors: br __ProgrammingError
  - Start the program by typing run
- When it stops at the gdb prompt, enter backtrace.
- Copy the output from gdb (at least the last ~100 lines) into a ticket.
For example:
[root ~]# gdb --args /var/cfengine/bin/cf-agent -Kd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /var/cfengine/bin/cf-agent...done.
(gdb) br __ProgrammingError
Breakpoint 1 at 0x40ab38
(gdb) run
...
Breakpoint 1, __ProgrammingError (file=0x7ffff7b93939 "rlist.c", lineno=87,
format=0x7ffff7b93cf8 "Rlist value contains type %c instead of expected scalar") at misc_lib.c:53
53 misc_lib.c: No such file or directory.
(gdb) bt
#0 __ProgrammingError (file=0x7ffff7b93939 "rlist.c", lineno=87,
format=0x7ffff7b93cf8 "Rlist value contains type %c instead of expected scalar") at misc_lib.c:53
#1 0x00007ffff7b40ece in RlistScalarValue (rlist=<optimized out>) at rlist.c:87
#2 0x00007ffff7b31958 in PromiseRuntimeHash (pp=0xae6870, salt=<optimized out>, digest=0x7fffffff7740 "", type=<optimized out>)
at locks.c:601
#3 0x00007ffff7b3262d in AcquireLock (ctx=0x665680, operand=0x7fffffff8160 "proc-*-norestart", host=0x7ffff7dc8f00 <VUQNAME> "prihub",
now=1488296640, tc=..., pp=0xae6870, ignoreProcesses=false) at locks.c:672
#4 0x000000000042008f in VerifyProcesses (pp=0xae6870, a=..., ctx=0x665680) at verify_processes.c:120
#5 VerifyProcessesPromise (ctx=0x665680, pp=0xae6870) at verify_processes.c:55
#6 0x000000000040ceab in KeepAgentPromise (ctx=0x665680, pp=0xae6870, param=<optimized out>) at cf-agent.c:1582
#7 0x00007ffff7b24669 in ExpandPromiseAndDo (param=0x0, act_on_promise=0x40cba0 <KeepAgentPromise>, iterctx=0xb0f2c0, ctx=0x665680)
at expand.c:215
#8 ExpandPromise (ctx=0x665680, pp=<optimized out>, act_on_promise=0x40cba0 <KeepAgentPromise>, param=0x0) at expand.c:283
#9 0x000000000040c8b3 in ScheduleAgentOperations (ctx=0x665680, bp=0xa10b60) at cf-agent.c:1329
#10 0x000000000041d26a in VerifyMethod (ctx=0x665680, call=..., a=..., pp=0xba4880) at verify_methods.c:173
#11 0x000000000041d75a in VerifyMethodsPromise (ctx=0x665680, pp=0xba4880) at verify_methods.c:75
#12 0x000000000040d1c7 in KeepAgentPromise (ctx=0x665680, pp=0xba4880, param=<optimized out>) at cf-agent.c:1639
#13 0x00007ffff7b24669 in ExpandPromiseAndDo (param=0x0, act_on_promise=0x40cba0 <KeepAgentPromise>, iterctx=0xb89570, ctx=0x665680)
at expand.c:215
#14 ExpandPromise (ctx=0x665680, pp=<optimized out>, act_on_promise=0x40cba0 <KeepAgentPromise>, param=0x0) at expand.c:283
#15 0x000000000040c8b3 in ScheduleAgentOperations (ctx=0x665680, bp=0xa08400) at cf-agent.c:1329
#16 0x000000000040eeea in KeepPromiseBundles (config=0x665500, policy=0xa77500, ctx=0x665680) at cf-agent.c:1243
#17 KeepPromises (config=<optimized out>, policy=0xa77500, ctx=0x665680) at cf-agent.c:724
#18 main (argc=<optimized out>, argv=<optimized out>) at cf-agent.c:252
(gdb)
The symptom here is that CFEngine clients (e.g. cf-agent) do not get a timely response from a remote cf-serverd, e.g. when asking for new policy. You would see messages similar to the following on the CFEngine clients:
Failed to establish TLS connection: underlying network error (Connection reset by peer)
No suitable server responded to hail
If this happens on many of the clients, it is likely due to cf-serverd not being able to handle incoming connections fast enough so they start to pile up.
Sometimes cf-serverd is seen to use a lot of CPU time or memory, but it might also be using close to zero CPU. In these cases it is important to understand why cf-serverd is not able to handle the connections fast enough.
To see what the threads of cf-serverd are doing at a given time, use the following commands:
gdb -batch -p $(pgrep cf-serverd) -ex 'info threads' > info_threads.txt
gdb -batch -p $(pgrep cf-serverd) -ex 'thread apply all bt' > backtrace.txt
gdb -batch -p $(pgrep cf-serverd) -ex 'thread apply all bt full' > backtrace-full.txt
With this debugging information you can see whether the process is spending its time waiting for DBM files or executing a hot part of the code.
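With many threads, backtrace.txt gets long. A rough way to see where the threads are sitting is to count the innermost frame (#0) of each thread. The sketch below assumes gdb's usual frame layout, either "#0  ADDRESS in FUNCTION (...)" or "#0  FUNCTION (...)". top_frames is a hypothetical helper name.

```shell
# Count how many threads have each function as their innermost frame.
# Usage: top_frames backtrace.txt
top_frames() {
    awk '$1 == "#0" {
             # With an address the function is field 4, otherwise field 2.
             fn = ($3 == "in") ? $4 : $2
             count[fn]++
         }
         END { for (fn in count) print count[fn], fn }' "$1" | sort -rn
}
```

Many threads parked in the same lock or database function would suggest waiting, while a spread of computation functions would point at a hot code path.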
- Install Valgrind.
- Run the leaking CFEngine component inside valgrind:
  - for all components, add the --verbose option
  - for daemons, add the --no-fork option
For example:
valgrind --leak-check=full \
/var/cfengine/bin/cf-serverd --no-fork 2>/root/valgrind-cf-serverd.txt &
- If you are debugging a daemon, let it run long enough that you are confident the memory consumption is a bug (remember that valgrind itself also consumes memory).
- Send SIGINT to the valgrind process (its process name starts with memcheck):
# ps -e|grep mem
2194 pts/0 00:00:24 memcheck-x86-li
# kill -SIGINT 2194
After a successful memory trace has been obtained in /root/valgrind-*.txt,
check the end of the trace to verify that at least 10-20 MB are lost. Otherwise,
rerun the tracing for a longer period of time to gather enough data.
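The check at the end of the trace can be scripted. This is a sketch under two assumptions: that the figure of interest is the "definitely lost" line of memcheck's LEAK SUMMARY, and that a 10 MB default threshold is appropriate. definitely_lost_bytes and enough_lost are hypothetical helper names.

```shell
# Print the "definitely lost" byte count from a Valgrind memcheck log.
# Prints 0 when the log has no such summary line.
definitely_lost_bytes() {
    awk '/definitely lost:/ { gsub(",", "", $4); bytes = $4 }
         END { print (bytes == "" ? 0 : bytes) }' "$1"
}

# Succeed if at least MIN_MB megabytes (default 10) were definitely lost.
# Usage: enough_lost LOGFILE [MIN_MB]
enough_lost() {
    lost=$(definitely_lost_bytes "$1")
    [ "$lost" -ge $(( ${2:-10} * 1024 * 1024 )) ]
}
```

If enough_lost fails for the log, the run was probably too short and the tracing should be repeated for longer.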
CFEngine is probably getting stuck on a long-running task. Kill the existing
processes and run cf-agent -v to see what is going on.
This is not a bug, but a feature. Have a look at the Locks section in the Reference Manual.