Retina Windows Crashing Due to $env Being Set to C:\hpc\config on Helm Chart Redeploy #1138

Open
rayaisaiah opened this issue Dec 12, 2024 · 0 comments

Comments

@rayaisaiah
Contributor

rayaisaiah commented Dec 12, 2024

Describe the bug
In the Retina Windows daemonset, the environment variable is not set correctly in the powershell.exe command when the Helm chart is first installed with an incorrect command, then uninstalled and reinstalled with a correct command.

As a result, the retina-win pods enter CrashLoopBackOff with the error: Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.

To Reproduce
Steps to reproduce the behavior:

  1. Edit the Retina Win daemonset command with the following
       - powershell.exe
       - -command
  2. make helm-install-without-tls to install retina with hubble
  3. Observe the retina-win pods crashing
  4. make helm-uninstall to remove retina pods
  5. Edit the Retina Win daemonset command to set the $env
       - powershell.exe
       - -command
       - $env:CONTAINER_SANDBOX_MOUNT_POINT/controller.exe --config ./retina/config.yaml
  6. make helm-install-without-tls to install retina with hubble
  7. Check retina-win pod logs for the error: Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.

Expected behavior
After the redeploy, KUBECONFIG should be set correctly and the retina-win pods should start. Instead, the $env variable for KUBECONFIG is incorrectly set to C:\hpc\config, and the retina-win pods enter CrashLoopBackOff with the error: Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.


Platform (please complete the following information):

  • OS: Windows
  • Kubernetes Version: 1.28 and 1.30
  • Host: AKS
  • Retina Version: v0.0.20 and v0.0.17

Additional context
Discovered after testing #1118 and setting the incorrect PowerShell command for the Retina Windows Helm chart.

Mitigation
Configure the Retina Windows daemonset with the latest Helm chart on main and create new Windows nodes. The retina-win pods that come up will have the working PowerShell commands and set KUBECONFIG correctly. Alternatively, create a fresh cluster and helm install the Windows daemonset with the correct commands.

@rayaisaiah rayaisaiah changed the title Retina Windows Crashing Due to $env Not Saving on Helm Chart Redeploy Retina Windows Crashing Due to $env Being Set to C:\hpc\config on Helm Chart Redeploy Dec 12, 2024
github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
…ues (#1128)

# Description

This PR aims to fix the stability of the retina windows agent. There
were 4 causes identified and each commit resolves one respectively.

1. Invalid rendering of the namespace helm value (1st commit)
```
matmerr@matmerr-cloud-dev: ~/go/src/github.com/Azure/telescope
[06:56:29 PM][matmerr-aks-pktmon-11][matmerr/enable-ama]$ k logs -f retina-agent-win-7f7kb
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.17
2024/12/02 18:56:22 metricsInterval is deprecated, please use metricsIntervalDuration instead
init client-go
KUBECONFIG set, using kubeconfig:  C:\hpc\kubeconfig
Error: starting daemon: creating controller-runtime manager: error loading config file "C:\hpc\kubeconfig": yaml: invalid map key: map[interface {}]interface {}{".Values.namespace":interface {}(nil)}
```

2. Default operator value is enabled and will cause RBAC issues for the
windows agents (2nd commit)

```
ts=2024-12-10T16:58:48.634Z level=info caller=hnsstats/hnsstats_windows.go:212 msg="Start hnsstats plugin..."
W1210 16:58:49.990792    7108 reflector.go:547] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1alpha1.MetricsConfiguration: metricsconfigurations.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "metricsconfigurations" in API group "retina.sh" at the cluster scope
```

3. Telemetry enabled also causes the agent to panic if application
insights is not defined. User can change the config map as desired but
default values should not cause the agent to crash (3rd commit)

4. `kubeconfig` file cannot be found for the legacy chart values.
Executing the `setkubeconfigpath.ps1` was required for the container
setup (4th commit).
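
The `invalid map key` error in the first log above is consistent with a Helm template action such as `{{ .Values.namespace }}` reaching the kubeconfig unrendered, since a YAML parser reads the double braces as nested flow mappings. As a rough sketch (not the fix applied in this PR), a rendered file could be scanned for leftover template actions before it is shipped. The helper name `unrenderedActions` and the sample fragment are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// templateAction matches a Go-template action such as {{ .Values.namespace }}.
var templateAction = regexp.MustCompile(`\{\{[^}]*\}\}`)

// unrenderedActions returns every template action still present in text.
// Hypothetical helper for illustration only; it is not part of Retina or Helm.
func unrenderedActions(text string) []string {
	return templateAction.FindAllString(text, -1)
}

func main() {
	// Hypothetical kubeconfig fragment with an action that was never rendered.
	sample := "contexts:\n- context:\n    namespace: {{ .Values.namespace }}\n"
	for _, action := range unrenderedActions(sample) {
		fmt.Println("unrendered template action:", action)
	}
}
```

A check like this only flags the symptom; the underlying cause still has to be fixed in the chart template that produces the file.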

Update:
It was later found that the missing `kubeconfig` error only occurs on
redeploy if the initial Retina deployment predates this change
(#1118). A GitHub issue was later
created: #1138

```
beegii@bignamboi:~/src/retina$ k logs retina-agent-win-4tl7m -n kube-system
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.17
2024/12/11 18:40:15 metricsInterval is deprecated, please use metricsIntervalDuration instead
init client-go
KUBECONFIG set, using kubeconfig:  C:\hpc\kubeconfig
Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.
```

## Related Issue

#1122

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

The image corresponding to each commit was built and tested on the
cluster to confirm that each fix works.


![image](https://github.com/user-attachments/assets/dde7fe23-22ff-49bf-8c96-2c1a42c96f9d)

## Additional Notes

The first three problems were experienced when deploying Retina using the
Hubble path, and the last issue was experienced when deploying Retina
using the legacy path.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.