Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OS config agent hung - reporting "running" but OS unknown in patch consul / not patching / logging #726

Open
bwaters opened this issue Nov 11, 2024 · 3 comments

Comments

@bwaters
Copy link

bwaters commented Nov 11, 2024

On a few of our hosts in google console is not reporting their os other info from the agent, agent is not running patching jobs or logging even though it is still reporting as running.

Screenshot 2024-11-11 at 4 45 48 PM

Logging onto the hosts I see that the agent is running. But hasn't written a new line to the log file in 15 days

systemctl status google-osconfig-agent
● google-osconfig-agent.service - Google OSConfig Agent
     Loaded: loaded (/lib/systemd/system/google-osconfig-agent.service; enabled; preset: enabled)
     Active: active (running) since Thu 2024-10-24 04:19:55 UTC; 2 weeks 4 days ago
   Main PID: 660 (google_osconfig)
      Tasks: 22 (limit: 4677)
     Memory: 105.2M
        CPU: 1h 8min 26.925s
     CGroup: /system.slice/google-osconfig-agent.service
             ├─    660 /usr/bin/google_osconfig_agent
             ├─1320421 /usr/bin/apt-get update
             ├─1320433 /usr/lib/apt/methods/https
             ├─1320434 /usr/lib/apt/methods/http
             ├─1320435 /usr/lib/apt/methods/ar+https
             ├─1320441 /usr/lib/apt/methods/http
             └─1320443 /usr/lib/apt/methods/gpgv

Oct 26 23:20:11 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-26T23:20:11.0569Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 26 23:30:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-26T23:30:10.9839Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 26 23:40:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-26T23:40:10.6683Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 26 23:50:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-26T23:50:10.8159Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 27 00:00:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-27T00:00:10.7159Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 27 00:10:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-27T00:10:10.8140Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 27 00:20:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-27T00:20:10.8854Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 27 00:30:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-27T00:30:10.9676Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 27 00:40:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-27T00:40:10.9233Z OSConfigAgent Info: Writing inventory to guest attributes
Oct 27 00:50:10 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-27T00:50:10.5406Z OSConfigAgent Info: Writing inventory to guest attributes

Last logs that were not Writing inventory to guest attributes: looked like this.

Oct 24 04:20:56 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-24T04:20:56.0314Z OSConfigAgent Info: Successfully completed ApplyConfigTask
Oct 24 06:00:06 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-24T06:00:06.5103Z OSConfigAgent Error inventory.go:55: packages.GetPackageUpdates() error: error getting apt updates: exit status 100
Oct 24 12:19:01 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-24T12:19:01.1567Z OSConfigAgent Warning: Error waiting for task (attempt 1 of 10): rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "server_shutting_down"
Oct 25 06:00:06 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-25T06:00:06.5184Z OSConfigAgent Error inventory.go:55: packages.GetPackageUpdates() error: error getting apt updates: exit status 100
Oct 26 06:00:06 lightstep-satellite-f8fh OSConfigAgent[660]: 2024-10-26T06:00:06.5876Z OSConfigAgent Error inventory.go:55: packages.GetPackageUpdates() error: error getting apt updates: exit status 100
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root         660       1  0 311612 24516  1 Oct24 ?        00:02:13 /usr/bin/google_osconfig_agent

Doing a manual restart -- systemctl restart google-osconfig-agent returns the agent to a working state (at least temporarily) and google console shows proper operating system.

strace reveals most threads are blocked on futex call.

strace -p 660
strace: Process 660 attached
futex(0x1613f80, FUTEX_WAIT_PRIVATE, 0, NULL

 ps -efL | grep google_osconfig_agent
root         660       1     660  0   11 Oct24 ?        00:02:13 /usr/bin/google_osconfig_agent
root         660       1     683  0   11 Oct24 ?        00:00:35 /usr/bin/google_osconfig_agent
root         660       1     684  0   11 Oct24 ?        00:00:03 /usr/bin/google_osconfig_agent
root         660       1     685  0   11 Oct24 ?        00:00:12 /usr/bin/google_osconfig_agent
root         660       1     686  0   11 Oct24 ?        00:00:12 /usr/bin/google_osconfig_agent
root         660       1     723  0   11 Oct24 ?        00:00:18 /usr/bin/google_osconfig_agent
root         660       1     725  0   11 Oct24 ?        00:00:00 /usr/bin/google_osconfig_agent
root         660       1     726  0   11 Oct24 ?        00:00:00 /usr/bin/google_osconfig_agent
root         660       1     727  0   11 Oct24 ?        00:00:19 /usr/bin/google_osconfig_agent
root         660       1    3016  0   11 Oct24 ?        00:00:17 /usr/bin/google_osconfig_agent
root         660       1  896728  0   11 Oct26 ?        00:00:16 /usr/bin/google_osconfig_agent

# strace -p 896728
strace: Process 896728 attached
futex(0xc00035a148, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 896728 detached
 <detached ...>

# strace -p 3016
strace: Process 3016 attached
futex(0xc0003fd148, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 3016 detached
 <detached ...>

# strace -p 727
strace: Process 727 attached
futex(0xc0003fc948, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 727 detached
 <detached ...>

# strace -p 726
strace: Process 726 attached
futex(0x1675278, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 726 detached
 <detached ...>

# strace -p 725
strace: Process 725 attached
futex(0x1675400, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 725 detached
 <detached ...>

# strace -p 723
strace: Process 723 attached
futex(0xc000090948, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 723 detached
 <detached ...>

# strace -p 686
strace: Process 686 attached
futex(0xc000090148, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 686 detached
 <detached ...>

# strace -p 685
strace: Process 685 attached
futex(0xc00006b148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(6, "\0", 1)                       = 1
futex(0xc00006b148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 0)     = 0
epoll_pwait(4, [], 128, 3825, NULL, 0)  = 0
epoll_pwait(4, [], 128, 0, NULL, 0)     = 0
futex(0xc0003fc948, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_pwait(4, [], 128, 0, NULL, 0)     = 0
futex(0xc00006b148, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 685 detached
 <detached ...>

# strace -p 684
strace: Process 684 attached
waitid(P_PID, 1320421, ^Cstrace: Process 684 detached
 <detached ...>

# strace -p 683
strace: Process 683 attached
restart_syscall(<... resuming interrupted read ...>) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=997095762}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=76850051}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=921581186}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=40000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=80000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=160000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=320000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=640000}, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=1280000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=74422472}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=2560000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=3, tv_nsec=918451268}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=5120000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=994605225}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=68601132}) = -1 ETIMEDOUT (Connection timed out)
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
futex(0x16159a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=910377910}^Cstrace: Process 683 detached
 <detached ...>
@bwaters bwaters changed the title OS config agent running but OS reporting unkown in patch consul OS config agent "running" but OS reporting unknown in patch consul / not patching Nov 11, 2024
@bwaters bwaters changed the title OS config agent "running" but OS reporting unknown in patch consul / not patching OS config agent hung - reporting "running" but OS unknown in patch consul / not patching / logging Nov 11, 2024
@burov
Copy link
Member

burov commented Nov 14, 2024

Hello @bwaters , Could you provide a bit of details:

  • What OS VMs are running, is it isolated to single OS or multiple ?
  • What is the Agent version ?

@bwaters
Copy link
Author

bwaters commented Dec 16, 2024

Sorry for the delay:

debian 12 bookworm

apt list --installed | grep google

apt-transport-artifact-registry/apt-transport-artifact-registry-stable,google-compute-engine-bookworm-stable,now 1:20241121.00-g1 amd64 [installed]
google-cloud-cli-anthoscli/cloud-sdk-bookworm,now 502.0.0-0 amd64 [installed,upgradable to: 503.0.0-0]
google-cloud-cli/cloud-sdk-bookworm,now 502.0.0-0 amd64 [installed,upgradable to: 503.0.0-0]
google-cloud-packages-archive-keyring/now 1.2-629101324 all [installed,local]
google-compute-engine-oslogin/google-compute-engine-bookworm-stable,now 1:20240701.00-g1+deb12 amd64 [installed]
google-compute-engine/now 1:20240607.00-g1 all [installed,local]
google-guest-agent/google-compute-engine-bookworm-stable,now 1:20241011.01-g1 amd64 [installed]
google-osconfig-agent/google-compute-engine-bookworm-stable,now 1:20240926.03-g1 amd64 [installed]

@burov
Copy link
Member

burov commented Jan 14, 2025

Hello @bwaters, sorry for long reply, i suggest to firstly update agent to the latest version 20240926.03-g1 and confirm that behavior is still exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants