From 50978887475e7eaf692d53e3f90b0d855164aefa Mon Sep 17 00:00:00 2001
From: Robusta Runner
Date: Thu, 24 Oct 2024 18:09:49 +0300
Subject: [PATCH 1/3] test new prompts

---
 .../prompts/_general_instructions.jinja2 |  4 ++--
 .../prompts/generic_investigation.jinja2 | 17 ++++++++++++++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/holmes/plugins/prompts/_general_instructions.jinja2 b/holmes/plugins/prompts/_general_instructions.jinja2
index cec29363..f8dd4d3c 100644
--- a/holmes/plugins/prompts/_general_instructions.jinja2
+++ b/holmes/plugins/prompts/_general_instructions.jinja2
@@ -18,13 +18,13 @@ If investigating Kubernetes problems:
 * run as many kubectl commands as you need to gather more information, then respond.
 * if possible, do so repeatedly on different Kubernetes objects.
 * for example, for deployments first run kubectl on the deployment then a replicaset inside it, then a pod inside that.
-* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash.
+* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
 * do not give an answer like "The pod is pending" as that doesn't state why the pod is pending and how to fix it.
 * do not give an answer like "Pod's node affinity/selector doesn't match any available nodes" because that doesn't include data on WHICH label doesn't match
 * if investigating an issue on many pods, there is no need to check more than 3 individual pods in the same deployment. pick up to a representative 3 from each deployment if relevant
 * if the user says something isn't working, ALWAYS:
 ** use kubectl_describe on the owner workload + individual pods and look for any transient issues they might have been referring to
-** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools
+** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
 ** look for misconfigured ingresses/services etc
 
 Special cases and how to reply:
diff --git a/holmes/plugins/prompts/generic_investigation.jinja2 b/holmes/plugins/prompts/generic_investigation.jinja2
index 599a91d5..8693bc58 100644
--- a/holmes/plugins/prompts/generic_investigation.jinja2
+++ b/holmes/plugins/prompts/generic_investigation.jinja2
@@ -19,13 +19,24 @@ Style Guide:
 * But only quote relevant numbers or metrics that are available. Do not guess.
 * Remove unnecessary words
 
-Give your answer in the following format (there is no need for a section listing all tools that were called but you can mention them in other sections if relevant)
+Give your answer in the following format (do NOT add a "Tools" section to the output)
 
 # Alert Explanation
-<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah" because that is what the user actually cares about>
+<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah". In other words, don't say "The alert was triggered because XYZ" rather say "XYZ">
 
 # Investigation
-<what you checked and found>
+<
+what you checked and found
+each point should start with
+🟢 if the check was successful
+🟡 if the check showed a potential problem or minor issue
+🔴 if there was a definite major issue.
+🔒 if you couldn't run the check itself (e.g. due to lack of permissions or lack of integration)
+
+A check should be in the format 'EMOJI *Check name*: details'
+If there is both a logs and previous_logs tool usage (regardless of whether previous logs failed or succeeded), merge them together into one item named Logs.
+Never mention that you were unable to retrieve previous logs! That error should be ignored and not shown to the user.
+>
 
 # Conclusions and Possible Root causes

From 50210ccfc4a1949932e31bffa12fb908ba49ef6a Mon Sep 17 00:00:00 2001
From: Robusta Runner
Date: Tue, 29 Oct 2024 20:52:37 +0200
Subject: [PATCH 2/3] initial commit

---
 Dockerfile                              | 1 +
 holmes/plugins/toolsets/kubernetes.yaml | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/Dockerfile b/Dockerfile
index e9a3baf6..1ba945fe 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -72,6 +72,7 @@ COPY . /app
 RUN apt-get update \
     && apt-get install -y \
     git \
+    jq \
     apt-transport-https \
     gnupg2 \
     && apt-get purge -y --auto-remove \
diff --git a/holmes/plugins/toolsets/kubernetes.yaml b/holmes/plugins/toolsets/kubernetes.yaml
index 222b57b0..61616bb3 100644
--- a/holmes/plugins/toolsets/kubernetes.yaml
+++ b/holmes/plugins/toolsets/kubernetes.yaml
@@ -54,6 +54,9 @@ toolsets:
       #- name: "healthcheck_plugin"
       #  description: "Check why a kubernetes health probe is failing. First call get_healthcheck_details"
       #  command: "kubectl exec -n {{namespace}} {{ pod_name }} -- wget {{ url }}:{{port}}"
+      - name: "kubernetes_jq_query"
+        description: Use kubectl to get json for all resources of a specific kind and pipe the results to jq to filter them. Do not worry about escaping the jq_expr; it will be done by the system on an unescaped expression that you give. e.g. give an expression like .items[] | .spec.containers[].image | select(test("^gcr.io/") | not)
+        command: kubectl get {{ kind }} --all-namespaces -o json | jq -r {{ jq_expr }}
 
       # try adding your own tools here!
       # e.g. to query company-specific data or run your own commands

From 43175348ea0d4fd8e1b9eccd9d0f0507490d3e45 Mon Sep 17 00:00:00 2001
From: Robusta Runner
Date: Tue, 29 Oct 2024 20:53:22 +0200
Subject: [PATCH 3/3] Revert "test new prompts"

This reverts commit 50978887475e7eaf692d53e3f90b0d855164aefa.
---
 .../prompts/_general_instructions.jinja2 |  4 ++--
 .../prompts/generic_investigation.jinja2 | 17 +++--------------
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/holmes/plugins/prompts/_general_instructions.jinja2 b/holmes/plugins/prompts/_general_instructions.jinja2
index f8dd4d3c..cec29363 100644
--- a/holmes/plugins/prompts/_general_instructions.jinja2
+++ b/holmes/plugins/prompts/_general_instructions.jinja2
@@ -18,13 +18,13 @@ If investigating Kubernetes problems:
 * run as many kubectl commands as you need to gather more information, then respond.
 * if possible, do so repeatedly on different Kubernetes objects.
 * for example, for deployments first run kubectl on the deployment then a replicaset inside it, then a pod inside that.
-* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
+* when investigating a pod that crashed or application errors, always run kubectl_describe and fetch logs with both kubectl_previous_logs and kubectl_logs so that you see current logs and any logs from before a crash.
 * do not give an answer like "The pod is pending" as that doesn't state why the pod is pending and how to fix it.
 * do not give an answer like "Pod's node affinity/selector doesn't match any available nodes" because that doesn't include data on WHICH label doesn't match
 * if investigating an issue on many pods, there is no need to check more than 3 individual pods in the same deployment. pick up to a representative 3 from each deployment if relevant
 * if the user says something isn't working, ALWAYS:
 ** use kubectl_describe on the owner workload + individual pods and look for any transient issues they might have been referring to
-** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools (even though kubectl_logs and kubectl_previous_logs are separate tools, in your output treat them as one Logs check that you did)
+** check the application aspects with kubectl_logs + kubectl_previous_logs and other relevant tools
 ** look for misconfigured ingresses/services etc
 
 Special cases and how to reply:
diff --git a/holmes/plugins/prompts/generic_investigation.jinja2 b/holmes/plugins/prompts/generic_investigation.jinja2
index 8693bc58..599a91d5 100644
--- a/holmes/plugins/prompts/generic_investigation.jinja2
+++ b/holmes/plugins/prompts/generic_investigation.jinja2
@@ -19,24 +19,13 @@ Style Guide:
 * But only quote relevant numbers or metrics that are available. Do not guess.
 * Remove unnecessary words
 
-Give your answer in the following format (do NOT add a "Tools" section to the output)
+Give your answer in the following format (there is no need for a section listing all tools that were called but you can mention them in other sections if relevant)
 
 # Alert Explanation
-<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah". In other words, don't say "The alert was triggered because XYZ" rather say "XYZ">
+<1-2 sentences explaining the alert itself - note don't say "The alert indicates a warning event related to a Kubernetes pod doing blah" rather just say "The pod XYZ did blah" because that is what the user actually cares about>
 
 # Investigation
-<
-what you checked and found
-each point should start with
-🟢 if the check was successful
-🟡 if the check showed a potential problem or minor issue
-🔴 if there was a definite major issue.
-🔒 if you couldn't run the check itself (e.g. due to lack of permissions or lack of integration)
-
-A check should be in the format 'EMOJI *Check name*: details'
-If there is both a logs and previous_logs tool usage (regardless of whether previous logs failed or succeeded), merge them together into one item named Logs.
-Never mention that you were unable to retrieve previous logs! That error should be ignored and not shown to the user.
->
+<what you checked and found>
 
 # Conclusions and Possible Root causes
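For reference, the example jq expression in the new `kubernetes_jq_query` tool — `.items[] | .spec.containers[].image | select(test("^gcr.io/") | not)` — lists every container image that is not hosted on `gcr.io`. A minimal sketch of the same filter in Python, run against a hand-written sample of the JSON shape that `kubectl get <kind> --all-namespaces -o json` returns (the pod specs and image names below are invented for illustration, not taken from any real cluster):

```python
import json
import re

# Invented sample of the JSON shape returned by `kubectl get <kind> -o json`:
# a top-level "items" list, each item holding .spec.containers[].image.
kubectl_output = json.dumps({
    "items": [
        {"spec": {"containers": [{"image": "gcr.io/my-project/api:v1"}]}},
        {"spec": {"containers": [{"image": "docker.io/library/nginx:1.25"},
                                 {"image": "gcr.io/my-project/sidecar:v2"}]}},
    ]
})

# Equivalent of the jq filter:
#   .items[] | .spec.containers[].image | select(test("^gcr.io/") | not)
# jq's test() is an unanchored regex search, so the explicit ^ anchor matters.
images = [
    container["image"]
    for item in json.loads(kubectl_output)["items"]
    for container in item["spec"]["containers"]
    if not re.search(r"^gcr.io/", container["image"])
]
print(images)  # -> ['docker.io/library/nginx:1.25']
```

In the actual tool the filtering happens inside jq on the kubectl output; this sketch only illustrates what that expression computes, which may help when writing other `jq_expr` values for the tool.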