Ole Herman Schumacher Elgesem edited this page Oct 9, 2018 · 9 revisions

Troubleshoot CFEngine

Please follow the steps below before submitting a bug.

cf-agent appears to hang

It is important to determine what is hanging: CFEngine itself or something that CFEngine is interacting with. Run the program with the -v and -d flags to see whether it has gone into an infinite loop or is waiting for something.
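To find out where a run stalls, it can help to put a time limit on it and keep the output. The following is a minimal sketch, not part of CFEngine: run_limited is a hypothetical helper name, it assumes GNU coreutils timeout is installed, and the limit and log path in the usage comment are arbitrary choices.

```shell
# Hypothetical helper: run a command under a time limit and save all of its output.
# Usage: run_limited SECONDS LOGFILE COMMAND [ARGS...]
run_limited() {
    limit=$1
    log=$2
    shift 2
    status=0
    timeout "$limit" "$@" >"$log" 2>&1 || status=$?
    # GNU timeout exits with status 124 when it had to stop the command.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s; last lines of $log:"
        tail -n 20 "$log"
    fi
    return "$status"
}

# Example (limit and log path are assumptions):
# run_limited 300 /tmp/cf-agent-verbose.log /var/cfengine/bin/cf-agent -K -v
```

If the run times out, the last lines of the log usually show what the agent was doing when it stopped making progress.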

Common causes of hanging processes:

  • DBM database corruption: try deleting the *.lmdb files in /var/cfengine
  • Commands that do not properly close their own file descriptors or those of their child processes. Try running the commands with a shell enabled (use_shell => "yes") and use </dev/null >/dev/null to close the descriptors.
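The descriptor problem can be reproduced without CFEngine. The sketch below uses a shell command substitution as a stand-in for a caller that waits for a command's output; elapsed is a hypothetical helper, the 3-second sleep is arbitrary, and the extra 2>&1 is an assumption added here to detach stderr as well.

```shell
# Print how many whole seconds the caller waits for the output of a command line.
elapsed() {
    start=$(date +%s)
    out=$(sh -c "$1")
    end=$(date +%s)
    echo $((end - start))
}

# The background sleep inherits the pipe that $(...) reads from, so the caller
# waits for it even though the script itself exits immediately.
held=$(elapsed 'sleep 3 & echo started')

# With its descriptors pointed at /dev/null the child no longer holds the pipe open.
released=$(elapsed 'sleep 3 </dev/null >/dev/null 2>&1 & echo started')

echo "inherited descriptors: ${held}s, redirected descriptors: ${released}s"
```

The first call waits for the background process to finish, while the second returns as soon as the script exits.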

CFEngine generates a segmentation fault

Segfaults in CFEngine may be caused by an incorrect build environment or by bugs in CFEngine or the libraries it uses.

  • Install GDB
  • Run the program in GDB from the command line
    • use --args option for GDB to pass options to the component
    • for all components, add --verbose option
    • for daemons, add --no-fork option
  • At the gdb prompt, enter run
  • When the program stops/segfaults, enter backtrace
  • Copy the gdb output (at least the last ~100 lines) into a ticket

For example:

% gdb --args ./cf-agent/.libs/lt-cf-agent -KI -f POLICYFILE
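When an interactive session is inconvenient, the same steps can be run in gdb's batch mode. A minimal sketch: gdb_backtrace is a hypothetical wrapper name, the log path is arbitrary, and the GDB variable exists only so that the command line can be previewed (for example with GDB=echo) without starting gdb.

```shell
# Hypothetical wrapper: run a component under gdb without a prompt, then print
# a backtrace once the program stops, saving everything gdb prints.
# Usage: gdb_backtrace LOGFILE COMPONENT [ARGS...]
gdb_backtrace() {
    log=$1
    shift
    "${GDB:-gdb}" -batch -ex run -ex backtrace --args "$@" >"$log" 2>&1
}

# Example (POLICYFILE is the same placeholder as in the command above):
# gdb_backtrace /tmp/gdb-cf-agent.txt /var/cfengine/bin/cf-agent -KI --verbose -f POLICYFILE
# tail -n 100 /tmp/gdb-cf-agent.txt
```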

CFEngine generates a programming error

  • Install GDB
  • Run the component under gdb from the command line:
    • use --args option for GDB to pass options to the component
    • for all components, add --verbose option
    • for daemons, add --no-fork option
  • At the gdb prompt:
    • Add a breakpoint for programming errors: br __ProgrammingError
    • Start the program by entering run
  • When it stops and returns to the gdb prompt, enter backtrace
  • Copy the gdb output (at least the last ~100 lines) into a ticket

For example:

[root ~]# gdb --args /var/cfengine/bin/cf-agent -Kd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /var/cfengine/bin/cf-agent...done.
(gdb) br __ProgrammingError
Breakpoint 1 at 0x40ab38
(gdb) run

...

Breakpoint 1, __ProgrammingError (file=0x7ffff7b93939 "rlist.c", lineno=87, 
    format=0x7ffff7b93cf8 "Rlist value contains type %c instead of expected scalar") at misc_lib.c:53
53	misc_lib.c: No such file or directory.

(gdb) bt
#0  __ProgrammingError (file=0x7ffff7b93939 "rlist.c", lineno=87, 
    format=0x7ffff7b93cf8 "Rlist value contains type %c instead of expected scalar") at misc_lib.c:53
#1  0x00007ffff7b40ece in RlistScalarValue (rlist=<optimized out>) at rlist.c:87
#2  0x00007ffff7b31958 in PromiseRuntimeHash (pp=0xae6870, salt=<optimized out>, digest=0x7fffffff7740 "", type=<optimized out>)
    at locks.c:601
#3  0x00007ffff7b3262d in AcquireLock (ctx=0x665680, operand=0x7fffffff8160 "proc-*-norestart", host=0x7ffff7dc8f00 <VUQNAME> "prihub", 
    now=1488296640, tc=..., pp=0xae6870, ignoreProcesses=false) at locks.c:672
#4  0x000000000042008f in VerifyProcesses (pp=0xae6870, a=..., ctx=0x665680) at verify_processes.c:120
#5  VerifyProcessesPromise (ctx=0x665680, pp=0xae6870) at verify_processes.c:55
#6  0x000000000040ceab in KeepAgentPromise (ctx=0x665680, pp=0xae6870, param=<optimized out>) at cf-agent.c:1582
#7  0x00007ffff7b24669 in ExpandPromiseAndDo (param=0x0, act_on_promise=0x40cba0 <KeepAgentPromise>, iterctx=0xb0f2c0, ctx=0x665680)
    at expand.c:215
#8  ExpandPromise (ctx=0x665680, pp=<optimized out>, act_on_promise=0x40cba0 <KeepAgentPromise>, param=0x0) at expand.c:283
#9  0x000000000040c8b3 in ScheduleAgentOperations (ctx=0x665680, bp=0xa10b60) at cf-agent.c:1329
#10 0x000000000041d26a in VerifyMethod (ctx=0x665680, call=..., a=..., pp=0xba4880) at verify_methods.c:173
#11 0x000000000041d75a in VerifyMethodsPromise (ctx=0x665680, pp=0xba4880) at verify_methods.c:75
#12 0x000000000040d1c7 in KeepAgentPromise (ctx=0x665680, pp=0xba4880, param=<optimized out>) at cf-agent.c:1639
#13 0x00007ffff7b24669 in ExpandPromiseAndDo (param=0x0, act_on_promise=0x40cba0 <KeepAgentPromise>, iterctx=0xb89570, ctx=0x665680)
    at expand.c:215
#14 ExpandPromise (ctx=0x665680, pp=<optimized out>, act_on_promise=0x40cba0 <KeepAgentPromise>, param=0x0) at expand.c:283
#15 0x000000000040c8b3 in ScheduleAgentOperations (ctx=0x665680, bp=0xa08400) at cf-agent.c:1329
#16 0x000000000040eeea in KeepPromiseBundles (config=0x665500, policy=0xa77500, ctx=0x665680) at cf-agent.c:1243
#17 KeepPromises (config=<optimized out>, policy=0xa77500, ctx=0x665680) at cf-agent.c:724
#18 main (argc=<optimized out>, argv=<optimized out>) at cf-agent.c:252
(gdb) 
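The session above can be scripted, and the file and line that raised the error can be pulled out of the saved log to help search for existing tickets. This is a sketch under assumptions: gdb_programming_error and error_origin are hypothetical helper names, the log path is arbitrary, and the sed pattern assumes frame #0 looks like the one in the transcript above.

```shell
# Hypothetical wrapper: set the breakpoint, run the component, and print a
# backtrace when the breakpoint is hit. GDB can be overridden for a dry run.
# Usage: gdb_programming_error LOGFILE COMPONENT [ARGS...]
gdb_programming_error() {
    log=$1
    shift
    "${GDB:-gdb}" -batch -ex 'br __ProgrammingError' -ex run -ex bt --args "$@" >"$log" 2>&1
}

# Print "file:line" of the code that raised the programming error, taken from
# frame #0 of a saved gdb log. Prints nothing if no such frame is found.
error_origin() {
    sed -n 's/^#0  *__ProgrammingError (file=[^"]*"\([^"]*\)", lineno=\([0-9]*\).*/\1:\2/p' "$1" |
        head -n 1
}

# Example:
# gdb_programming_error /tmp/gdb-cf-agent.txt /var/cfengine/bin/cf-agent -Kd
# error_origin /tmp/gdb-cf-agent.txt
```

Applied to the backtrace shown above, error_origin would report rlist.c:87.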

Clients time out as remote cf-serverd is overloaded or waits for resources

The symptom here is that CFEngine clients (e.g. cf-agent) do not get a timely response from a remote cf-serverd, for example when asking for new policy. You would see messages similar to the following on the clients:

Failed to establish TLS connection: underlying network error (Connection reset by peer)
No suitable server responded to hail

If this happens on many of the clients, it is likely due to cf-serverd not being able to handle incoming connections fast enough so they start to pile up.

Sometimes cf-serverd is seen to use a lot of CPU time or memory, but it might also be using close to zero CPU. In these cases it is important to understand why cf-serverd is not able to handle the connections fast enough.

To see what the threads of cf-serverd are doing at a given time, use the following commands:

gdb -batch -p $(pgrep cf-serverd) -ex 'info threads' > info_threads.txt
gdb -batch -p $(pgrep cf-serverd) -ex 'thread apply all bt' > backtrace.txt
gdb -batch -p $(pgrep cf-serverd) -ex 'thread apply all bt full' > backtrace-full.txt

With this debugging information you can see whether the process is spending time waiting for DBM files or executing a hot part of the code.
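A single backtrace only shows one instant. Taking several samples a few seconds apart and counting the innermost frames gives a rough picture of where the threads spend their time. This is a sketch under assumptions: sample_backtraces and top_frames are hypothetical helper names, the sample count, interval and output directory are arbitrary, and the sed expression assumes gdb's usual "#0  0x... in function (...)" frame layout.

```shell
# Take COUNT thread backtraces of process PID, INTERVAL seconds apart,
# writing them to OUTDIR/backtrace-N.txt. GDB can be overridden for a dry run.
sample_backtraces() {
    pid=$1
    count=$2
    interval=$3
    outdir=$4
    mkdir -p "$outdir"
    i=1
    while [ "$i" -le "$count" ]; do
        "${GDB:-gdb}" -batch -p "$pid" -ex 'thread apply all bt' >"$outdir/backtrace-$i.txt" 2>&1
        i=$((i + 1))
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Count how often each innermost frame (#0) occurs across all samples in OUTDIR,
# most frequent first. Addresses and arguments are stripped from the frame line.
top_frames() {
    grep -h '^#0 ' "$1"/backtrace-*.txt |
        sed 's/^#0  *\(0x[0-9a-f]* in \)\{0,1\}//; s/ (.*//' |
        sort | uniq -c | sort -rn
}

# Example (pgrep -o, which picks the oldest matching process, is an assumption
# to avoid passing several PIDs to gdb):
# sample_backtraces "$(pgrep -o cf-serverd)" 5 2 /tmp/cf-serverd-samples
# top_frames /tmp/cf-serverd-samples
```

Threads that show the same wait or lock function in every sample are likely blocked, while a frame that keeps changing within the same few functions suggests a hot code path.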

Memory leak

  • Install Valgrind.
  • Run the leaking CFEngine component inside valgrind
    • for all components, add --verbose option
    • for daemons, add --no-fork option

For example:

valgrind --leak-check=full \
  /var/cfengine/bin/cf-serverd --no-fork 2>/root/valgrind-cf-serverd &
  • If you are debugging a daemon, let it run long enough that you are confident the memory growth is a leak and not normal usage (remember that valgrind itself also consumes memory).
  • Send SIGINT to the valgrind process (its name starts with memcheck)
# ps -e|grep mem
 2194 pts/0    00:00:24 memcheck-x86-li
# kill -SIGINT 2194

After a memory trace has been obtained (in the example above, /root/valgrind-cf-serverd), check the end of the trace to verify that at least 10-20 MB are lost. Otherwise, rerun the trace for a longer period of time to gather enough data.
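The check at the end of the trace can be scripted. A minimal sketch, assuming the trace ends with valgrind's usual LEAK SUMMARY block and that the "definitely lost" figure is the one of interest; definitely_lost_bytes and enough_leak_data are hypothetical helper names, and the threshold uses the lower bound of 10 MB mentioned above.

```shell
# Print the number of bytes reported as "definitely lost" in a valgrind trace.
# Prints nothing if the trace has no leak summary.
definitely_lost_bytes() {
    grep 'definitely lost:' "$1" | tail -n 1 |
        sed 's/.*definitely lost: \([0-9,]*\) bytes.*/\1/' | tr -d ','
}

# Succeed if the trace shows at least 10 MB definitely lost.
enough_leak_data() {
    lost=$(definitely_lost_bytes "$1")
    [ -n "$lost" ] && [ "$lost" -ge $((10 * 1024 * 1024)) ]
}

# Example:
# if enough_leak_data /root/valgrind-cf-serverd; then
#     echo "trace is large enough to report"
# else
#     echo "rerun the trace for longer"
# fi
```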

Lots of cf-agents are piling up in the process table

CFEngine is probably getting stuck on a long-running task. Kill the existing processes and run cf-agent -v to see what is going on.
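Before killing anything, it may be useful to see which agents have actually been running for a long time. A sketch under assumptions: stale_agents is a hypothetical filter, the one-hour threshold is arbitrary, and the ps invocation in the usage comment assumes the procps-ng ps found on most Linux systems (for its etimes column).

```shell
# Read "PID ELAPSED_SECONDS COMMAND" lines on stdin and print the PIDs of
# cf-agent processes that have been running longer than the given limit.
# Usage: ... | stale_agents SECONDS
stale_agents() {
    awk -v limit="$1" '$3 == "cf-agent" && ($2 + 0) > (limit + 0) { print $1 }'
}

# Example: list cf-agent processes older than one hour.
# ps -eo pid=,etimes=,comm= | stale_agents 3600
```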

Promises are not evaluated on second run

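This is not a bug, but a feature. Have a look at the Locks section in the Reference Manual. When testing, run cf-agent with -K (--no-lock), as in the examples above, to ignore the locks.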
