From 50978887475e7eaf692d53e3f90b0d855164aefa Mon Sep 17 00:00:00 2001
From: Robusta Runner
Date: Thu, 24 Oct 2024 18:09:49 +0300
Subject: [PATCH 1/3] test new prompts

---
 .../prompts/_general_instructions.jinja2 |  4 ++--
 .../prompts/generic_investigation.jinja2 | 17 ++++++++++++++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/holmes/plugins/prompts/_general_instructions.jinja2 b/holmes/plugins/prompts/_general_instructions.jinja2
index cec29363..f8dd4d3c 100644
--- a/holmes/plugins/prompts/_general_instructions.jinja2
+++ b/holmes/plugins/prompts/_general_instructions.jinja2
@@ -18,13 +18,13 @@ If investigating Kubernetes problems:
 * run as many kubectl commands as you need to gather more information, then respond.
 * if possible, do so repeatedly on different Kubernetes objects.
 * for example, for deployments first run kubectl on the deployment then a replicaset inside it, then a pod inside that.
-* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash.
+* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
 * do not give an answer like "The pod is pending" as that doesn't state why the pod is pending and how to fix it.
 * do not give an answer like "Pod's node affinity/selector doesn't match any available nodes" because that doesn't include data on WHICH label doesn't match
 * if investigating an issue on many pods, there is no need to check more than 3 individual pods in the same deployment. pick up to a representative 3 from each deployment if relevant
 * if the user says something isn't working, ALWAYS:
 ** use kubectl_describe on the owner workload + individual pods and look for any transient issues they might have been referring to
-** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools
+** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
 ** look for misconfigured ingresses/services etc
 
 Special cases and how to reply:
diff --git a/holmes/plugins/prompts/generic_investigation.jinja2 b/holmes/plugins/prompts/generic_investigation.jinja2
index 599a91d5..8693bc58 100644
--- a/holmes/plugins/prompts/generic_investigation.jinja2
+++ b/holmes/plugins/prompts/generic_investigation.jinja2
@@ -19,13 +19,24 @@ Style Guide:
 * But only quote relevant numbers or metrics that are available. Do not guess.
 * Remove unnecessary words
 
-Give your answer in the following format (there is no need for a section listing all tools that were called but you can mention them in other sections if relevant)
+Give your answer in the following format (do NOT add a "Tools" section to the output)
 
 # Alert Explanation
-<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah" because that is what the user actually cares about>
+<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah". In other words, don't say "The alert was triggered because XYZ" rather say "XYZ">
 
 # Investigation
-<what you checked and found>
+<
+what you checked and found
+each point should start with
+🟢 if the check was successful
+🟡 if the check showed a potential problem or minor issue
+🔴 if there was a definite major issue.
+🔒 if you couldn't run the check itself (e.g. due to lack of permissions or lack of integration)
+
+A check should be in the format 'EMOJI *Check name*: details'
+If there is both a logs and previous_logs tool usage (regardless of whether previous logs failed or succeeded), merge them together into one item named Logs.
+Never mention that you were unable to retrieve previous logs! That error should be ignored and not shown to the user.
+>
 
 # Conclusions and Possible Root causes

From 50210ccfc4a1949932e31bffa12fb908ba49ef6a Mon Sep 17 00:00:00 2001
From: Robusta Runner
Date: Tue, 29 Oct 2024 20:52:37 +0200
Subject: [PATCH 2/3] initial commit

---
 Dockerfile                              | 1 +
 holmes/plugins/toolsets/kubernetes.yaml | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/Dockerfile b/Dockerfile
index e9a3baf6..1ba945fe 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -72,6 +72,7 @@ COPY . /app
 RUN apt-get update \
     && apt-get install -y \
     git \
+    jq \
     apt-transport-https \
     gnupg2 \
     && apt-get purge -y --auto-remove \
diff --git a/holmes/plugins/toolsets/kubernetes.yaml b/holmes/plugins/toolsets/kubernetes.yaml
index 222b57b0..61616bb3 100644
--- a/holmes/plugins/toolsets/kubernetes.yaml
+++ b/holmes/plugins/toolsets/kubernetes.yaml
@@ -54,6 +54,9 @@ toolsets:
       #- name: "healthcheck_plugin"
       #  description: "Check why a kubernetes health probe is failing. First call get_healthcheck_details"
       #  command: "kubectl exec -n {{namespace}} {{ pod_name }} -- wget {{ url }}:{{port}}"
+      - name: "kubernetes_jq_query"
+        description: Use kubectl to get json for all resources of a specific kind and pipe the results to jq to filter them. Do not worry about escaping the jq_expr; it will be done by the system on an unescaped expression that you give. e.g. give an expression like .items[] | .spec.containers[].image | select(test("^gcr.io/") | not)
+        command: kubectl get {{ kind }} --all-namespaces -o json | jq -r {{ jq_expr }}
 
       # try adding your own tools here!
       # e.g. to query company-specific data or run your own commands

From 43175348ea0d4fd8e1b9eccd9d0f0507490d3e45 Mon Sep 17 00:00:00 2001
From: Robusta Runner
Date: Tue, 29 Oct 2024 20:53:22 +0200
Subject: [PATCH 3/3] Revert "test new prompts"

This reverts commit 50978887475e7eaf692d53e3f90b0d855164aefa.
---
 .../prompts/_general_instructions.jinja2 |  4 ++--
 .../prompts/generic_investigation.jinja2 | 17 +++--------------
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/holmes/plugins/prompts/_general_instructions.jinja2 b/holmes/plugins/prompts/_general_instructions.jinja2
index f8dd4d3c..cec29363 100644
--- a/holmes/plugins/prompts/_general_instructions.jinja2
+++ b/holmes/plugins/prompts/_general_instructions.jinja2
@@ -18,13 +18,13 @@ If investigating Kubernetes problems:
 * run as many kubectl commands as you need to gather more information, then respond.
 * if possible, do so repeatedly on different Kubernetes objects.
 * for example, for deployments first run kubectl on the deployment then a replicaset inside it, then a pod inside that.
-* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
+* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash.
 * do not give an answer like "The pod is pending" as that doesn't state why the pod is pending and how to fix it.
 * do not give an answer like "Pod's node affinity/selector doesn't match any available nodes" because that doesn't include data on WHICH label doesn't match
 * if investigating an issue on many pods, there is no need to check more than 3 individual pods in the same deployment. pick up to a representative 3 from each deployment if relevant
 * if the user says something isn't working, ALWAYS:
 ** use kubectl_describe on the owner workload + individual pods and look for any transient issues they might have been referring to
-** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
+** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools
 ** look for misconfigured ingresses/services etc
 
 Special cases and how to reply:
diff --git a/holmes/plugins/prompts/generic_investigation.jinja2 b/holmes/plugins/prompts/generic_investigation.jinja2
index 8693bc58..599a91d5 100644
--- a/holmes/plugins/prompts/generic_investigation.jinja2
+++ b/holmes/plugins/prompts/generic_investigation.jinja2
@@ -19,24 +19,13 @@ Style Guide:
 * But only quote relevant numbers or metrics that are available. Do not guess.
 * Remove unnecessary words
 
-Give your answer in the following format (do NOT add a "Tools" section to the output)
+Give your answer in the following format (there is no need for a section listing all tools that were called but you can mention them in other sections if relevant)
 
 # Alert Explanation
-<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah". In other words, don't say "The alert was triggered because XYZ" rather say "XYZ">
+<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah" because that is what the user actually cares about>
 
 # Investigation
-<
-what you checked and found
-each point should start with
-🟢 if the check was successful
-🟡 if the check showed a potential problem or minor issue
-🔴 if there was a definite major issue.
-🔒 if you couldn't run the check itself (e.g. due to lack of permissions or lack of integration)
-
-A check should be in the format 'EMOJI *Check name*: details'
-If there is both a logs and previous_logs tool usage (regardless of whether previous logs failed or succeeded), merge them together into one item named Logs.
-Never mention that you were unable to retrieve previous logs! That error should be ignored and not shown to the user.
->
+<what you checked and found>
 
 # Conclusions and Possible Root causes
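For reference, the example jq expression in the new `kubernetes_jq_query` tool — `.items[] | .spec.containers[].image | select(test("^gcr.io/") | not)` — lists every container image that is not hosted on `gcr.io`. A minimal sketch of the same filter in Python, run against a hand-written sample of the JSON shape that `kubectl get <kind> --all-namespaces -o json` returns (the pod specs and image names below are invented for illustration, not taken from any real cluster):

```python
import json
import re

# Invented sample of the JSON shape returned by `kubectl get <kind> -o json`:
# a top-level "items" list, each item holding .spec.containers[].image.
kubectl_output = json.dumps({
    "items": [
        {"spec": {"containers": [{"image": "gcr.io/my-project/api:v1"}]}},
        {"spec": {"containers": [{"image": "docker.io/library/nginx:1.25"},
                                 {"image": "gcr.io/my-project/sidecar:v2"}]}},
    ]
})

# Equivalent of the jq filter:
#   .items[] | .spec.containers[].image | select(test("^gcr.io/") | not)
# jq's test() is an unanchored regex search, so the explicit ^ anchor matters.
images = [
    container["image"]
    for item in json.loads(kubectl_output)["items"]
    for container in item["spec"]["containers"]
    if not re.search(r"^gcr.io/", container["image"])
]
print(images)  # -> ['docker.io/library/nginx:1.25']
```

In the actual tool the filtering happens inside jq on the kubectl output; this sketch only illustrates what that expression computes, which may help when writing other `jq_expr` values for the tool.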