[BUG] OTLP Ingestion by the Datadog Agent doesn't start since v7.61.0 #32947

Open
keisku opened this issue Jan 14, 2025 · 2 comments

@keisku
Contributor

keisku commented Jan 14, 2025

Agent Environment

uname -a
Linux ip-10-0-133-150 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP Tue Dec 10 16:50:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

docker --version
Docker version 27.4.1, build b9d17ea

docker exec datadog-agent agent version
Agent 7.61.0 - Commit: 202f54bcf5 - Serialization version: v5.0.137 - Go version: go1.22.8

Describe what happened:

OTLP Ingestion by the Datadog Agent doesn't start.

2025-01-14 09:44:20 UTC | CORE | ERROR | (comp/otelcol/collector/impl-pipeline/pipeline.go:111 in func1) | Error running the OTLP ingest pipeline: failed to register process metrics: process does not exist

Describe what you expected:

The Agent should listen on *:4318 and *:4317, as in this output:

docker exec -it datadog-agent ss -tlpn
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       4096    127.0.0.1:6162      0.0.0.0:*          users:(("process-agent",pid=400,fd=18))
LISTEN  0       4096    127.0.0.1:6062      0.0.0.0:*          users:(("process-agent",pid=400,fd=12))
LISTEN  0       4096    127.0.0.1:5012      0.0.0.0:*          users:(("trace-agent",pid=397,fd=15))
LISTEN  0       4096    127.0.0.1:5001      0.0.0.0:*          users:(("agent",pid=402,fd=23))
LISTEN  0       4096    127.0.0.1:5000      0.0.0.0:*          users:(("agent",pid=402,fd=26))
LISTEN  0       4096    127.0.0.11:43519    0.0.0.0:*
LISTEN  0       4096    *:8126              *:*                users:(("trace-agent",pid=397,fd=11))
LISTEN  0       4096    *:4318              *:*                users:(("agent",pid=402,fd=22))
LISTEN  0       4096    *:4317              *:*                users:(("agent",pid=402,fd=21))
LISTEN  0       4096    *:5003              *:*                users:(("trace-agent",pid=397,fd=13))

Steps to reproduce the issue:

services:
  agent:
    container_name: datadog-agent
    image: datadog/agent:7.61.0
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /proc/:/host/proc/:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    env_file:
      - ~/sandbox.docker.env # DD_API_KEY is here.
docker compose up -d

docker logs datadog-agent | grep ': process does not exist'
2025-01-14 09:44:20 UTC | CORE | ERROR | (comp/otelcol/collector/impl-pipeline/pipeline.go:111 in func1) | Error running the OTLP ingest pipeline: failed to register process metrics: process does not exist

# The Agent doesn't listen on `*:4318` or `*:4317`:
docker exec -it datadog-agent ss -tlpn
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       4096    127.0.0.11:35859    0.0.0.0:*
LISTEN  0       4096    127.0.0.1:6162      0.0.0.0:*          users:(("process-agent",pid=401,fd=18))
LISTEN  0       4096    127.0.0.1:5012      0.0.0.0:*          users:(("trace-agent",pid=397,fd=15))
LISTEN  0       4096    127.0.0.1:5001      0.0.0.0:*          users:(("agent",pid=402,fd=20))
LISTEN  0       4096    127.0.0.1:5000      0.0.0.0:*          users:(("agent",pid=402,fd=24))
LISTEN  0       4096    127.0.0.1:6062      0.0.0.0:*          users:(("process-agent",pid=401,fd=11))
LISTEN  0       4096    *:8126              *:*                users:(("trace-agent",pid=397,fd=11))
LISTEN  0       4096    *:5003              *:*                users:(("trace-agent",pid=397,fd=13))
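
The listening-port check can be scripted. This is a minimal sketch, assuming `ss -tln`-style output on stdin with Local Address:Port in the fourth column. The helper name `missing_otlp_ports` is hypothetical and not part of the Agent or of `ss`.

```shell
# Hypothetical helper: reads `ss -tln` (or `ss -tlpn`) output on stdin and
# prints each OTLP port (4317 gRPC, 4318 HTTP) that has no LISTEN socket.
missing_otlp_ports() {
  awk '
    $1 == "LISTEN" {
      # Column 4 is Local Address:Port; the port is the last ":"-separated part.
      n = split($4, parts, ":")
      seen[parts[n]] = 1
    }
    END {
      if (!("4317" in seen)) print "4317"
      if (!("4318" in seen)) print "4318"
    }
  '
}
```

For example, `docker exec datadog-agent ss -tln | missing_otlp_ports` prints nothing when both receivers are up, and should print 4317 and 4318 for the failing container shown above.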

Additional environment details (Operating System, Cloud provider, etc):

I found three workarounds.

1. Use 7.60.1

Downgrading to Agent 7.60.1 works around the issue.

2. Avoid using /proc from host

Set HOST_PROC=/proc

services:
  agent:
    container_name: datadog-agent
    image: datadog/agent:7.61.0
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /proc/:/host/proc/:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - HOST_PROC=/proc # WORKAROUND
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    env_file:
      - ~/sandbox.docker.env # DD_API_KEY is here.

Alternatively, remove the /proc/:/host/proc/:ro mount from volumes:

services:
  agent:
    container_name: datadog-agent
    image: datadog/agent:7.61.0
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      # - /proc/:/host/proc/:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/cloud/data/instance-id:/var/lib/cloud/data/instance-id:ro
    environment:
      # - HOST_PROC=/proc
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    env_file:
      - ~/sandbox.docker.env # DD_API_KEY is here.
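
Both variants of workaround 2 plausibly work for the same reason: the process lookup ends up reading the container's own /proc instead of the host's. The sketch below shows that path resolution, assuming the lookup follows gopsutil's convention of taking the procfs root from the HOST_PROC environment variable and defaulting to /proc. `proc_dir_for_pid` is a hypothetical name, and how the Agent image sets HOST_PROC when /host/proc is mounted is not shown in this issue.

```shell
# Hypothetical helper: prints the procfs directory that a HOST_PROC-aware
# lookup would inspect for the given PID. Falls back to /proc when HOST_PROC
# is unset or empty (assumed to mirror gopsutil's convention).
proc_dir_for_pid() {
  pid="$1"
  printf '%s/%s\n' "${HOST_PROC:-/proc}" "$pid"
}
```

With HOST_PROC=/host/proc the lookup for PID 405 would target /host/proc/405. With HOST_PROC=/proc, or with HOST_PROC unset, it would target /proc/405, which does exist inside the container.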

3. Set pid: host

services:
  agent:
    container_name: datadog-agent
    image: datadog/agent:7.61.0
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      # - /proc/:/host/proc/:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/cloud/data/instance-id:/var/lib/cloud/data/instance-id:ro
    pid: host # WORKAROUND
    environment:
      # - HOST_PROC=/proc
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    env_file:
      - ~/sandbox.docker.env # DD_API_KEY is here.

Related Information/Code

Investigation

services:
  agent:
    container_name: datadog-agent
    image: datadog/agent:7.61.0
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /proc/:/host/proc/:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    env_file:
      - ~/sandbox.docker.env # DD_API_KEY is here.
docker exec -it datadog-agent ps -ef | grep agent
root         399       1  0 12:49 ?        00:00:00 s6-supervise agent
root         400     393  0 12:49 ?        00:00:01 trace-agent --config=/etc/da
root         404     397  0 12:49 ?        00:00:01 process-agent --cfgpath=/etc
root         405     399  1 12:49 ?        00:00:11 agent run

docker exec -it datadog-agent stat /proc/405
  File: /proc/405
  Size: 0         	Blocks: 0          IO Block: 1024   directory
Device: 0,66	Inode: 156493      Links: 9
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-01-14 12:49:33.906554629 +0000
Modify: 2025-01-14 12:49:33.906554629 +0000
Change: 2025-01-14 12:49:33.906554629 +0000
 Birth: -

# This is the cause of the issue: the Agent's PID is missing from /host/proc.
docker exec -it datadog-agent stat /host/proc/405
stat: cannot statx '/host/proc/405': No such file or directory
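
The two `stat` calls can be folded into one check. This is a sketch with a hypothetical `pid_visibility` helper that reports, for each procfs root given, whether a directory for the PID exists there.

```shell
# Hypothetical helper: pid_visibility <pid> <procfs-root>...
# Prints "<root>: present" or "<root>: missing" for each root, depending on
# whether <root>/<pid> is a directory.
pid_visibility() {
  pid="$1"
  shift
  for root in "$@"; do
    if [ -d "$root/$pid" ]; then
      printf '%s: present\n' "$root"
    else
      printf '%s: missing\n' "$root"
    fi
  done
}
```

Run inside the container as `pid_visibility 405 /proc /host/proc`, this would be expected to report /proc as present and /host/proc as missing in the failing setup. One caveat: a "present" result under /host/proc only means that some host process has that PID, not that it is the Agent.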

Thoughts

Updating go.opentelemetry.io/collector/otelcol from v0.116.0 to v0.117.0 could be a solution.

According to the release notes of shirou/gopsutil v4.24.12, shirou/gopsutil#1716 seems to fix a related issue, shirou/gopsutil#1709.

go.opentelemetry.io/collector/otelcol v0.117.0 uses shirou/gopsutil v4.24.12. See https://github.com/open-telemetry/opentelemetry-collector/blob/v0.117.0/otelcol/go.mod#L67
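
Which gopsutil version a given collector module pins can be read from its go.mod. This is a sketch that filters go.mod text supplied on stdin. `gopsutil_requirement` is a hypothetical helper, and fetching the file is left to the reader.

```shell
# Hypothetical helper: reads go.mod content on stdin and prints the
# "module version" pair of any shirou/gopsutil requirement, whether it is
# written as a single-line "require" or inside a require ( ... ) block.
gopsutil_requirement() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i ~ /^github\.com\/shirou\/gopsutil(\/v[0-9]+)?$/) print $i, $(i + 1)
      }
    }
  '
}
```

Piping the go.mod of each collector tag under discussion through this function gives a quick side-by-side of the gopsutil versions involved.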

@songy23
Member

songy23 commented Jan 14, 2025

Thanks for flagging and looking into this, @keisku. One minor correction: Agent 7.61.0 uses OTel Collector v0.114.0 rather than v0.116.0, so the issue should already be present in github.com/shirou/gopsutil/v4 v4.24.10. See https://github.com/open-telemetry/opentelemetry-collector/blob/v0.114.0/service/go.mod#L10

That also means we will need to backport the version upgrade to both 7.61.x and 7.62.x.

@keisku
Contributor Author

keisku commented Jan 14, 2025

@songy23 thank you for triaging this issue!

Updating the OTel Collector dependencies to v0.117.0 doesn't solve it.

services:
  agent:
    container_name: datadog-agent
    # Built this image by https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/765983422
    image: datadog/agent-dev:update-otel-collector-dependencies-0-117-0-py3
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /proc/:/host/proc/:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    env_file:
      - ~/sandbox.docker.env # DD_API_KEY is here.
docker logs datadog-agent | grep ': process does not exist'
2025-01-14 23:42:26 UTC | CORE | ERROR | (comp/otelcol/collector/impl-pipeline/pipeline.go:112 in func1) | Error running the OTLP ingest pipeline: failed to register process metrics: process does not exist

It looks like the Agent fails to find its own PID (404 in this case) under /host/proc.
PID 404 exists only in the PID namespace of the Agent container, so it does not exist in the host's procfs.

# /proc and /host/proc are mounted in this container.
docker exec -it datadog-agent mount | grep '/proc type'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /host/proc type proc (ro,nosuid,nodev,noexec,relatime)

# Find agent PID
docker exec -it datadog-agent ps -ef | grep 'agent run'
root         404     398  1 23:42 ?        00:00:04 agent run

# Process is found in /proc
docker exec -it datadog-agent bash -c 'find /proc -name 404 -type d 2>/dev/null'
/proc/404
/proc/404/task/404

# Not found in /host/proc
docker exec -it datadog-agent bash -c 'find /host/proc -name 404 -type d 2>/dev/null'
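
If the host procfs is mounted, the Agent's process should still appear in it, just under its host-side PID. On Linux 4.1 and later, the `NSpid` line in `/proc/<pid>/status` lists a process's PID in each nested PID namespace, outermost first, so an in-container PID can be mapped back to a host PID. This is a sketch with a hypothetical `host_pids_for` helper. It can print several candidates if other containers on the host also run a process with the same in-namespace PID.

```shell
# Hypothetical helper: host_pids_for <in-container-pid> [procfs-root]
# Scans <procfs-root>/<pid>/status files (default root: /host/proc) and prints
# the outermost PID of every process whose innermost NSpid entry matches.
host_pids_for() {
  ns_pid="$1"
  root="${2:-/host/proc}"
  for status in "$root"/[0-9]*/status; do
    # Skip unreadable entries and the unexpanded glob when nothing matches.
    [ -r "$status" ] || continue
    awk -v want="$ns_pid" '
      # NSpid: <pid in outermost ns> ... <pid in innermost ns>
      # NF > 2 keeps only processes that live in a nested PID namespace.
      $1 == "NSpid:" && NF > 2 && $NF == want { print $2 }
    ' "$status" 2>/dev/null
  done
}
```

Inside the Agent container, `host_pids_for 404` would list the host PIDs that correspond to in-container PID 404. That shows the process is present in /host/proc, only under a different number than the one being looked up.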
