java flamegraph not complete #2101

datavisorhenryzhao · 2025-01-28T12:59:20Z

Describe the bug
I deployed the pixie (self host), find the flame-graph is not complete

To Reproduce
Steps to reproduce the behavior:

Deploy pixie for k8s
os: Ubuntu 22.04.3 LTS
kernel: 6.2.0-1015-aws
vizier-pem: gcr.io/pixie-oss/pixie-prod/vizier-pem_image:0.14.9
java version

openjdk version "1.8.0-builds.shipilev.net-openjdk-shenandoah-jdk8-b798-20210222"
OpenJDK Runtime Environment (build 1.8.0-builds.shipilev.net-openjdk-shenandoah-jdk8-b798-20210222-testing-b00)
OpenJDK 64-Bit Server VM (build 25.71-b00, mixed mode)

Write java code

public class HelloWorld {
    public static void main(String[] args) {
        while (true) { 
                int i = 0;
                while (i < 1000000) {
                        f1();
                        i++;
                }
                try {
                // 暂停 5 毫秒
                        Thread.sleep(5);
                    } catch (InterruptedException e) {
                        System.out.println("Sleep was interrupted.");
                    }
        }
    }
    public static void f1() {
                try {
                // 暂停 5 毫秒
                        Thread.sleep(1);
                    } catch (InterruptedException e) {
                        System.out.println("Sleep was interrupted.");
                    }
        f2();
    }
    public static void f2() {
                try {
                // 暂停 5 毫秒
                        Thread.sleep(1);
                    } catch (InterruptedException e) {
                        System.out.println("Sleep was interrupted.");
                    }
        f3();
    }
    public static void f3() {
                try {
                // 暂停 5 毫秒
                        Thread.sleep(1);
                    } catch (InterruptedException e) {
                        System.out.println("Sleep was interrupted.");
                    }

    }

}

set entry point of the container image; and run in pod
java -XX:+PreserveFramePointer HelloWorld
select the pod and lookup Flamegraph

but I can not see f1/f2/f3 symbol

App information (please complete the following information):

Pixie version: gcr.io/pixie-oss/pixie-prod/vizier-pem_image:0.14.9
K8s : 1.24
Node Kernel version: 6.2.0-1015-aws
Browser version: Chrome 131.0.6778.266

The text was updated successfully, but these errors were encountered:

datavisorhenryzhao · 2025-01-28T16:41:02Z

I find only when i execute javac -g HelloWorld1.java, i can see the function f1(), by both the following cmd. So I think PreserveFramePointer have no effect

java -XX:+PreserveFramePointer HelloWorld1

java HelloWorld1

ddelnano · 2025-01-30T00:39:59Z

Hi @datavisorhenryzhao, thanks for the bug report.

Can you collect the PEM logs via px collect-logs while that workload is running? It would also be helpful (but not required) if you could add the following cli flag to the PEM DaemonSet before collecting the logs: --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1

I will try to reproduce the issue on my own sometime this week as well, but having the logs would be very helpful.

datavisorhenryzhao · 2025-01-30T11:10:25Z

when compiled without '-g'

2.when compiled with '-g', can find f1 symbol, but f2/f3 not found

info of pem-pod(where pod henry runs with same node)

By the way how to add the flag for pem ds? Do you know the full args?
--vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1

pixie_logs_20250130110549.zip

I also tried parca project (based on ebpf), it shows the full java stack (f1/f2/f3) only when compiled with '-g'

datavisorhenryzhao · 2025-02-04T17:34:31Z

@ddelnano hi, can you help me lookup ? It is weird, because parca works well compared with pixie

ddelnano · 2025-02-05T02:45:06Z

@datavisorhenryzhao sorry for the slow response. I have time set aside to investigate this tomorrow and reproduce it myself.

In order to enable that verbose logging, you need to add the following to the DaemonSet pem's container block (example location). Note: there doesn't need to be any additional args provided.

args:
- --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1`

I recommend using kubectl -n pl edit daemonset/vizier-pem to edit it interactively. If you have the operator running, you will need to remove its namespace prior or it might redeploy the DaemonSet and undo the manifest change.

As for the logs you've collected already, I noticed that there are lots of lost stack frames. See this log from your zip file:

I20250130 10:25:54.157191 2541577 perf_profile_connector.cc:433] PerfProfileConnector statistics: kBPFMapSwitchoverEvent=21 kCumulativeSumOfAllStackTraces=464498 **kLossHistoEvent=9554**

For every kLossHistoEvent recorded, that means stack trace data was lost. This happens when Pixie isn't able to process the perf buffer used by our BPF program quickly enough. This likely means that Pixie is under provisioned for your given instance type size and the features you have enabled.

$ grep -r kLossHistoEvent ~/Downloads/pixie_logs_20250130110549 | grep -v 'kLossHistoEvent=0' | cut -d: -f1 | uniq -c
   5 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-v24kb__pem.log
   4 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-6dfzq__pem.log
   5 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-gw4zk__pem.log

$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-v24kb__pem.log
I20250130 10:14:07.961859 2771161 system_info.cc:44] Number of CPUs: 8
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-6dfzq__pem.log
I20250130 10:14:31.381803 2538836 system_info.cc:44] Number of CPUs: 8
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-gw4zk__pem.log
I20250130 10:14:08.298007 1112290 system_info.cc:44] Number of CPUs: 8

Since your cluster has a variety of instance types and the kLossHistoEvents are happening on the instances with 8 CPUs, it's possible these instances are under provisioned for Pixie's socket tracer and perf profiler.

In summary, I would try both of these things below:

Pin the Java workload to one of the nodes that isn't experiencing the kLossHistoEvents. This will allow us to rule out the lost data as the source of the invalid flame graph
Your vizier has PL_TABLE_STORE_DATA_LIMIT_MB: 2000 specified. I would remove this and/or increase the PEM's memory limit until the kLossHistoEvents are not present

If we determine that the kLossHistoEvents are the cause, we can discuss how to tune the PEMs memory usage in more detail.

datavisorhenryzhao · 2025-02-05T09:43:34Z

hi, @ddelnano

1.the pem-pod restarted with args

args:
- --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1

2.the pod henry and pem-pod (vizier-pem-7gzw5) run on the same node.

check the pod(vizier-pem-7gzw5) logs, within 6mins, I did not find the kLossHistoEvent=0

still missing f1/f2/f3 stacks, no matter if using javac -g

ddelnano · 2025-02-05T14:29:37Z

@datavisorhenryzhao did you capture the logs with px collect-logs again with that verbose logging enabled?

I did not find the kLossHistoEvent=0

Did you mean kLossHistoEvent=0? That would actually be a good sign as it would indicate no data loss. The case I pointed out showed significant kLossHistoEvents.

ddelnano · 2025-02-05T16:45:18Z

I'm able to reproduce the issue with the following Dockerfile and pod manifest:

Dockerfile and manifest

# Dockerfile
FROM debian:bookworm-slim AS build

# Install dependencies
RUN apt-get update && apt-get install -y \
    curl \
    xz-utils \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Download and extract the specified OpenJDK version
RUN curl -L -o openjdk.tar.xz https://builds.shipilev.net/openjdk-jdk8/openjdk-jdk8-linux-x86_64-server.tar.xz \
    && mkdir -p /opt/openjdk \
    && tar -xJf openjdk.tar.xz -C /opt/openjdk --strip-components=1 \
    && rm openjdk.tar.xz

# Set JAVA_HOME and update PATH
ENV JAVA_HOME=/opt/openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"

# Copy application source code
COPY HelloWorld.java .

# Compile Java application
RUN javac HelloWorld.java

# Runtime image
FROM debian:bookworm-slim

# Install dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy the compiled Java application and JDK from build stage
COPY --from=build /opt/openjdk /opt/openjdk
COPY --from=build /app/HelloWorld.class .

# Set JAVA_HOME and update PATH
ENV JAVA_HOME=/opt/openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"

# Command to run the application
CMD ["sh", "-c", "java -version && java -XX:+PreserveFramePointer HelloWorld"]

# k8s manifest
apiVersion: v1
kind: Pod
metadata:
  name: java-helloworld
  labels:
    app: java-helloworld
spec:
  containers:
    - args: ["sh", "-c", "java -version && java -XX:+PreserveFramePointer HelloWorld"]
      name: java-helloworld
      image: ddelnano/java-helloworld:latest
      imagePullPolicy: Always
      resources:
        limits:
          cpu: "500m"
          memory: "512Mi"
        requests:
          cpu: "250m"
          memory: "256Mi"
      ports:
        - containerPort: 8080

I've collected logs and don't see anything immediately obvious, but I'm continuing to investigate.

datavisorhenryzhao · 2025-02-05T18:38:13Z

Thanks for your effort

ddelnano added triage/accepted Indicates an issue or PR is ready to be actively worked on. area/datacollector Issues related to Stirling (datacollector) labels Feb 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

java flamegraph not complete #2101

java flamegraph not complete #2101

datavisorhenryzhao commented Jan 28, 2025

datavisorhenryzhao commented Jan 28, 2025

ddelnano commented Jan 30, 2025 •

edited

Loading

datavisorhenryzhao commented Jan 30, 2025

datavisorhenryzhao commented Feb 4, 2025

ddelnano commented Feb 5, 2025

datavisorhenryzhao commented Feb 5, 2025 •

edited

Loading

ddelnano commented Feb 5, 2025 •

edited

Loading

ddelnano commented Feb 5, 2025

datavisorhenryzhao commented Feb 5, 2025

java flamegraph not complete #2101

java flamegraph not complete #2101

Comments

datavisorhenryzhao commented Jan 28, 2025

datavisorhenryzhao commented Jan 28, 2025

ddelnano commented Jan 30, 2025 • edited Loading

datavisorhenryzhao commented Jan 30, 2025

datavisorhenryzhao commented Feb 4, 2025

ddelnano commented Feb 5, 2025

datavisorhenryzhao commented Feb 5, 2025 • edited Loading

ddelnano commented Feb 5, 2025 • edited Loading

ddelnano commented Feb 5, 2025

datavisorhenryzhao commented Feb 5, 2025

ddelnano commented Jan 30, 2025 •

edited

Loading

datavisorhenryzhao commented Feb 5, 2025 •

edited

Loading

ddelnano commented Feb 5, 2025 •

edited

Loading