
Known Issues

Laveesh Rohra edited this page Apr 16, 2021 · 25 revisions

List of known issues:

2.2.19 - [ProtocolError] /var/lib/waagent/GoalState.1.xml is missing

WireServer may reset the incarnation to an arbitrary value. The guest agent periodically purges old cached files, and it decides which files are the latest by the integer value in each file name. It does not take the current incarnation into account when choosing the files to purge, although it must do so once the incarnation has been reset. This is the bug that causes this behavior.

The agent continually executes the following cycle.

Determine the latest incarnation from WireServer, and save it. Purge the last 50 old files based on the highest integer value associated with the file name. (For example, GoalState.11.xml has integer value 11.) If the value is reset to 1, and the latest integer value is 51, then GoalState.1.xml will be deleted by the purge process.

This cycle repeats every three seconds.
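The cycle above can be reproduced in a scratch directory. This is a simplified model, not the agent's actual code: the function name simulate_purge and the rule "keep the 50 highest-numbered files" are assumptions chosen to match the example of a reset to 1 while the latest value is 51.

```shell
# Simplified model of the purge step; NOT the agent's implementation.
# Assumption: the purge keeps the `keep` highest-numbered GoalState files.
simulate_purge() {
    local dir="$1" keep="$2" n
    # Extract the integer from each GoalState.N.xml, sort highest first,
    # and delete every file after the first `keep` entries.
    ls "$dir" | sed -n 's/^GoalState\.\([0-9][0-9]*\)\.xml$/\1/p' \
        | sort -rn | tail -n +"$((keep + 1))" \
        | while read -r n; do rm -f "$dir/GoalState.$n.xml"; done
}

demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT

# Cached goal states from incarnations 2..51 (50 files).
for n in $(seq 2 51); do : > "$demo_dir/GoalState.$n.xml"; done

# The incarnation is reset to 1 and the new goal state is saved.
: > "$demo_dir/GoalState.1.xml"

simulate_purge "$demo_dir" 50

if [ -e "$demo_dir/GoalState.1.xml" ]; then
    echo "GoalState.1.xml kept"
else
    echo "GoalState.1.xml purged"   # this branch runs: 1 is the lowest value
fi
```

In this model the file that was just written is the one that gets removed, which is consistent with the "GoalState.1.xml is missing" error.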

To stop this behavior, it is sufficient to delete all of the files that would be purged. The files on disk are simply a cache that is useful for debugging; the agent does not need them. The files to be deleted under /var/lib/waagent match the following glob:

*.[0-9]*.xml

To mitigate this issue, run the following command:

sudo rm -f /var/lib/waagent/*.[0-9]*.xml
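Before deleting anything, it may help to see which files the glob actually matches. The following is a minimal sketch; the function name list_purge_candidates is hypothetical and only prints matches without removing them.

```shell
# Dry run: print the files the mitigation glob would match, delete nothing.
# The directory defaults to /var/lib/waagent and can be overridden.
list_purge_candidates() {
    local dir="${1:-/var/lib/waagent}" f
    for f in "$dir"/*.[0-9]*.xml; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Running list_purge_candidates with no arguments (as root, if the directory is not readable otherwise) lists the candidates under /var/lib/waagent; an empty result means the rm command above has nothing to remove.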

To run this mitigation via an extension:

az vm extension set --resource-group rgName --vm-name vmName --name CustomScript --publisher Microsoft.Azure.Extensions --version 2.0 --settings '{"commandToExecute": "rm -f /var/lib/waagent/*.[0-9]*.xml"}'

To run via PowerShell:

Set-AzureRmVMExtension -ResourceGroupName <resourceGroupName> -VMName <vmName> -Location <location> -Publisher Microsoft.Azure.Extensions -ExtensionType CustomScript -Name CustomScript -TypeHandlerVersion 2.0 -SettingString '{"commandToExecute": "rm -f /var/lib/waagent/*.[0-9]*.xml"}'

You can also execute one of the above from Cloud Shell.

Then restart the agent. The command is usually similar to sudo service walinuxagent restart, but it depends on the Linux distribution. Also note that this issue may self-mitigate when a new goal state is sent to the VM, which can be triggered by installing or updating an extension, adding or removing data disks, and so on.
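Because the service name varies by distribution, one option is to look it up before restarting. The sketch below assumes the unit is called either walinuxagent.service or waagent.service; the helper name pick_agent_service is hypothetical, and your distribution may use a different unit name.

```shell
# Read unit names on stdin (one per line, extra columns allowed) and print
# the agent's service name without the ".service" suffix.
# Assumption: the unit is named walinuxagent.service or waagent.service.
pick_agent_service() {
    grep -E -o -m 1 '^(walinuxagent|waagent)\.service' | sed 's/\.service$//'
}

# Print (not run) the restart command for the detected service, if any.
if command -v systemctl >/dev/null 2>&1; then
    svc="$(systemctl list-unit-files --type=service --no-legend 2>/dev/null | pick_agent_service)"
    if [ -n "$svc" ]; then
        echo "sudo systemctl restart $svc"
    fi
fi
```

The pattern is anchored so that helper units such as waagent-network-setup.service are not mistaken for the agent itself.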

These self-mitigations should be considered temporary. Until the hotfix rolls out to the VM region, the condition may recur when the agent restarts, requiring the same mitigation.

[2.2.21, .22, .23] - [ExtensionError] Downgrade not allowed (or Downgrade is not allowed. Skipping install and enable.)

WALinuxAgent 2.2.21, 2.2.22, and 2.2.23 had a cache invalidation bug that caused the Linux guest agent to incorrectly report a downgrade and prevent the execution of extensions. A downgrade is when the agent believes it is being asked to install an extension with a version lower than the currently installed extension version. For example, CustomScript 2.0.6 is installed, but the agent thinks CustomScript 2.0.3 is about to be installed. The agent does not support downgrades.

A fix has been shipped in WALinuxAgent 2.2.25.

If you are currently experiencing this issue, you can mitigate it yourself by running the commands below, which cover the most commonly affected extensions. You may not have all of these extensions installed on your VM; in that case the corresponding command has no effect and is safe to run. If your extension is not listed below, delete its cached manifest manually using the same pattern.

rm -f /var/lib/waagent/Prod.*.agentsManifest 
rm -f /var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.OSTCExtensions.LinuxDiagnostic.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.OSTCExtensions.VMAccessForLinux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.Azure.Security.AzureDiskEncryptionForLinux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.Azure.NetworkWatcher.NetworkWatcherAgentLinux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.OSTCExtensions.CustomScriptForLinux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.Azure.RecoveryServices.SiteRecovery.Linux.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.Azure.Diagnostics.LinuxDiagnostic.[0-9]*.manifest.xml
rm -f /var/lib/waagent/Microsoft.Azure.RecoveryServices.SiteRecovery.LinuxRHEL7.[0-9]*.manifest.xml
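For an extension that is not in the list, the same pattern can be wrapped in a small helper. This is a sketch only; the function name clear_extension_manifests is hypothetical, and the caller supplies the full extension name (publisher plus type).

```shell
# Remove cached manifests for a single extension, using the same
# <name>.[0-9]*.manifest.xml pattern as the commands above.
# $1: full extension name; $2: agent directory (defaults to /var/lib/waagent).
clear_extension_manifests() {
    local ext="$1" dir="${2:-/var/lib/waagent}"
    # rm -f stays silent when no manifest exists for this extension.
    rm -f "$dir/$ext".[0-9]*.manifest.xml
}
```

For example, clear_extension_manifests Microsoft.Azure.Extensions.CustomScript is equivalent to the CustomScript line in the list above.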

[2.2.54.2] - waagent-network-setup.service: main process exited, code=exited, status=2/INVALIDARGUMENT

The waagent-network-setup.service is a helper service installed by the GuestAgent. Its only purpose is to set up the firewall rules on Linux VMs at system boot, before network.service is established on the VM. When a custom image is created from another VM, the new VM inherits the unit file of this helper service. That unit file points to an agent egg file that does not exist on the new VM, so the service fails with this error. Here is a snippet of the error:

Logs from the waagent-network-setup.service since system boot:
 -- Logs begin at Thu 2021-04-15 20:33:35 UTC, end at Thu 2021-04-15 20:34:40 UTC. --
Apr 15 20:33:59 p0077c1d14pm01c000002 python[1079]: /usr/bin/python: can't open file '/var/lib/waagent/WALinuxAgent-2.2.54.2/bin/WALinuxAgent-2.2.54.2-py2.7.egg': [Errno 2] No such file or directory
Apr 15 20:34:01 p0077c1d14pm01c000002 systemd[1]: waagent-network-setup.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 15 20:34:01 p0077c1d14pm01c000002 systemd[1]: Failed to start Setup network rules for WALinuxAgent.
Apr 15 20:34:01 p0077c1d14pm01c000002 systemd[1]: Unit waagent-network-setup.service entered failed state.
Apr 15 20:34:01 p0077c1d14pm01c000002 systemd[1]: waagent-network-setup.service failed.

This is a benign issue that does not affect the working of the GuestAgent. The error only means that the helper service was unable to set up the firewall rules for the VM on boot, which is completely independent of the GuestAgent's own operation. In addition, the error goes away the next time the system is rebooted, because on startup the agent replaces the missing egg file with the most recent one for that specific VM.

The root cause of this issue is that the unit file is not cleaned up during deprovision of the original VM. This happens when the daemon version in that VM is lower than 2.2.54.2. Once the daemon on the original VM is updated to 2.2.54.2, the unit file is cleaned up properly on deprovision and the error no longer appears.
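To confirm that a VM is hitting this case, you can check whether the egg file named in the inherited unit file exists. The sketch below assumes the unit file contains the absolute path of the egg on one of its lines, as the log above suggests; the exact layout of the unit file is not shown on this page, and the function name missing_eggs is hypothetical.

```shell
# Print every .egg path referenced in a unit file that is missing on disk.
# Assumption: egg paths appear as absolute paths without spaces.
missing_eggs() {
    local unit_file="$1" egg
    grep -o '/[^ ]*\.egg' "$unit_file" | sort -u | while read -r egg; do
        [ -e "$egg" ] || printf 'missing: %s\n' "$egg"
    done
}
```

Called with the path of the inherited unit file, the function prints nothing when every referenced egg is present and prints a "missing:" line for each one that is absent.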

As mentioned above, this error message is benign and does not mean anything is wrong with WALinuxAgent. If it is a concern, the following temporary mitigations can be applied:

  • Reboot the VM
  • If restarting the VM is not an option, restart the service: systemctl restart waagent-network-setup.service or systemctl restart walinuxagent-network-setup.service, depending on the distro.
  • (Preventive measure) During deprovision of your VM, manually delete the unit files from the VM by executing these commands:
rm -rf /usr/lib/systemd/system/wa*agent-network-setup.service*
rm -rf /lib/systemd/system/wa*agent-network-setup.service*
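To see which unit files these two commands would remove, the same globs can be listed first. This is a sketch; the function name find_network_setup_units and its optional root-prefix argument are hypothetical, the latter added only so the function can be tried against a scratch directory.

```shell
# List unit files matched by the two cleanup globs, without deleting them.
# $1: optional root prefix (empty means the real filesystem root).
find_network_setup_units() {
    local root="${1:-}" f
    for f in "$root"/usr/lib/systemd/system/wa*agent-network-setup.service* \
             "$root"/lib/systemd/system/wa*agent-network-setup.service*; do
        # Unmatched patterns stay literal; skip them.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

On distributions where /lib is a symlink to /usr/lib, the same file may be listed twice, once per path.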