Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expression of trigger: Backup {#JOBTYPENAME} {#JOBNAME} in progress > {$VEEAM_JOB_HOURS_WARN:"{#JOBID}"}h #19

Open
thmcon opened this issue Jul 2, 2022 · 11 comments

Comments

@thmcon
Copy link

thmcon commented Jul 2, 2022

Hi,
I receive strange problems regarding the trigger: Backup {#JOBTYPENAME} {#JOBNAME} in progress > {$VEEAM_JOB_HOURS_WARN:"{#JOBID}"}h
Jobs with duration < 5hrs get problems triggered that the job lasts longer than 8 hrs.

I investigated the trigger which is defined as:
last(/Veeam Backup And Replication/veeam.Jobs.Result['{#JOBTYPEID}','{#JOBID}'])<0 and (now()-avg(/Veeam Backup And Replication/veeam.Jobs.LastStart['{#JOBTYPEID}','{#JOBID}'],5m))>{$VEEAM_JOB_HOURS_WARN:"{#JOBID}"}*3600

I am pretty sure, the avg function is wrong here.

My thoughts about the current trigger logic:

  • I have a job the backups VMs once a day
  • my copy job gets execetued directly after the backup job ends
  • this leads to one start time per day => an average of start times does calculate somewhere in the middle of the 5 days, which will always be more than 8 hrs. => will always tigger a problem.

For my understanding the avg-function should be removed completely and the trigger should work as expected like that:

last(/Veeam Backup And Replication/veeam.Jobs.Result['{#JOBTYPEID}','{#JOBID}'])<0 and (now()-last(/Veeam Backup And Replication/veeam.Jobs.LastStart['{#JOBTYPEID}','{#JOBID}']))>({$VEEAM_JOB_HOURS_WARN:"{#JOBID}"}*3600)

@aholiveira
Copy link
Contributor

aholiveira commented Jul 2, 2022

Hello @thmcon,
Which version of the template are you using?
I've had a few issues with that trigger:

When using last(LastStart) I was having the trigger active right when the backup was beginning ("early trigger"), which cleared after a few seconds.

Using avg() resolved that, and I was not having any more "early triggers", but then only when using a time argument for the avg() function which was less than the data sampling rate. I have never had, however, a situation where it would trigger an alert several at 5 hours when the threshold was defined at 8 hours. It would either trigger right at the start of the job (and clear 5-10 seconds later), or if the job was actually running for more than 8 hours (one of the full backups, which runs on a weekly basis, takes around 12-14 hours, so I'd have the trigger active between the 8th hour and until the job completed).

However, on the infrastructures that I'm managing, all the backups run on a daily basis (that backup job I mentioned is doing incrementals daily and a full on Saturdays). The normal "Backup jobs" run daily and the corresponding "Backup copy" run after them. You mentioned that the trigger is calculating "in the middle of the 5 days", so I would guess you have a bigger interval between backups which might affect the trigger behavior.

Since I don't have a test infrastructure for this, please feel free to experiment with the trigger definition and see what works with you. Then please let us know what works and we could incorporate that information into the trigger.

Before the testing, please use the template from this pull request here to make sure you are using the last version. @romainsi has yet to merge this (if he approves it) but I've tweaked that trigger on that version to try to avoid those "false triggers".

Hope this helps.

Regards.

@thmcon
Copy link
Author

thmcon commented Jul 2, 2022

@aholiveira Thanks for your reply! I was using the main branch, which I get when I just click on the "code" tab in github (https://github.com/romainsi/zabbix-VB-R-SQL).

I will use the one you have linked in your reply and then will investigate more in depth. Probably it will take some time until I can give a more detailed reply on that.

@thmcon
Copy link
Author

thmcon commented Jul 3, 2022

Hello @aholiveira ,

my current observations and changes:

let's assume that Veeam software might change the status flag sooner than the start time field and it takes some time until both values match the real situation
=> when we check last and last before status values to be -1 we will cover a time period of 5 minutes from the real start of the veeam jobresult
=> we compare the last start time value
=> veeam as a time period of almost 5 minutes (= frequency of data item "veeam.Jobs.Duration") to change the start time field of the current job processing ater it changed the status value

I also oberved that in my scenario the value veeam.Jobs.LastStart for the copy job changes according to the backup job, but the the value veeam.Jobs.Result for the copy job always remains -1 (running) - even when the veeam gui says that it's idle. I also checked this in the Veeam database.
In the trigger I added an additional check against the progress value not to be 100 in the last 2 time periods (5 Minutes).

That made my problems disappear. I will observe that further.

Currently my trigger looks like this:

last(/Veeam Backup And Replication/veeam.Jobs.Result['{#JOBTYPEID}','{#JOBID}'],#2) = last(/Veeam Backup And Replication/veeam.Jobs.Result['{#JOBTYPEID}','{#JOBID}'],#1) and
last(/Veeam Backup And Replication/veeam.Jobs.Result['{#JOBTYPEID}','{#JOBID}'],#1)<0 and 
last(/Veeam Backup And Replication/veeam.Jobs.Progress['{#JOBTYPEID}','{#JOBID}'],#1)<>100
and last(/Veeam Backup And Replication/veeam.Jobs.Progress['{#JOBTYPEID}','{#JOBID}'],#1) = last(/Veeam Backup And Replication/veeam.Jobs.Progress['{#JOBTYPEID}','{#JOBID}'],#2) and 
(now()-last(/Veeam Backup And Replication/veeam.Jobs.LastStart['{#JOBTYPEID}','{#JOBID}']))>({$VEEAM_JOB_HOURS_WARN:"{#JOBID}"}*3600)

@aholiveira
Copy link
Contributor

Hello @thmcon,
Thank you for your input. Let's keep this under observation.
One question: What is the {#JOBTYPEID} for the job you are having issues?
You can check this on the item definition, it will be the first value between the square brackets:
veeam.Jobs.Progress['63','5bae71da-a1bd-486a-80ab-badcb60b29fc']
In this case, my JOBTYPEID is 63.
This corresponds to column "type" on table [BJobs] of the Veeam database, for the corresponding job.

Regards.

@aholiveira
Copy link
Contributor

aholiveira commented Jul 3, 2022

Hello again,
This part of your condition will prevent the trigger to activate if the last two values of the job Progress are different (so it will almost never trigger). It will only trigger if the job is "stuck" on the same progress percent for more than 8 hours (which is not the trigger's intention).
last(/Veeam Backup And Replication/veeam.Jobs.Progress['{#JOBTYPEID}','{#JOBID}'],#1) = last(/Veeam Backup And Replication/veeam.Jobs.Progress['{#JOBTYPEID}','{#JOBID}'],#2)

However, based on your input I've built a new expression that might trigger only in the correct conditions (the job is actually running for more than 8 hours). It should not trigger also at the start of the job.
I haven't tested this yet (I will see how it works later tonight once the backups jobs actually run).
@thmcon, can you also test this on your infrastructure and see if it works?
I would also suggest to reduce the value of {$VEEAM_JOB_HOURS_WARN} to something like 3 or 5 to see if this is triggering correctly only when the job is running for more than that.
The test expression would be:
last(/Veeam Backup And Replication/veeam.Jobs.Progress['{#JOBTYPEID}','{#JOBID}'])<100 and last(/Veeam Backup And Replication/veeam.Jobs.LastStart['{#JOBTYPEID}','{#JOBID}'],#3)=last(/Veeam Backup And Replication/veeam.Jobs.LastStart['{#JOBTYPEID}','{#JOBID}']) and (now()-last(/Veeam Backup And Replication/veeam.Jobs.LastStart['{#JOBTYPEID}','{#JOBID}']))>{$VEEAM_JOB_HOURS_WARN:"{#JOBID}"}*3600

Expression rationale:

  • The job percent is below 100 (it is still not completed)
  • The LastStart from three readings ago (15m for a 5m sampling rate) equals the last reading (to avoid "early triggers" at the job start)
  • Elapsed time (now-LastStart) is greater than the threshold

I've removed the use of avg() and am now only using last()

Let me know how it goes.

Regards.

@thmcon
Copy link
Author

thmcon commented Jul 3, 2022

Hallo @aholiveira,

thanks a lot!

What is the {#JOBTYPEID} for the job you are having issues?
The Job type ID of my copy jobs is 51

Here you can see my copy jobs in veeam gui:
image

this is my current version of veeam:
image

I've updated my script and template to the pull request #17 you have linked above.
Then I changed the trigger expression to the code you posted above.

I can let you know about the results tomorow after the backup jobs have been running tonight.

Thanks & Regards,
Thomas

@thmcon
Copy link
Author

thmcon commented Jul 4, 2022

Good Morning,

from the results of last night's backup I would say the trigger works correct now!

There is still another issue with a job showing the wrong progress. I will create another issue for that one.
Thanks!

@aholiveira
Copy link
Contributor

Hello @thmcon,
On the installations I'm managing that expression gave "early triggers", activating the trigger at the start of the job for a few seconds. Did you not have any of these?
I really have to try and investigate why this is happening, because it is really strange.

Regards.

@aholiveira
Copy link
Contributor

aholiveira commented Jul 5, 2022

Hello,
I've let this run again tonight without changing the trigger expression, and again, all backup jobs the trigger was active for a few seconds right at the start of each job.
I still could not investigate further on why this is happening with the trigger expression I've posted above. Using avg() (has I have on version 2.8) does seem to prevent this, but I'd like to fully understand what's happening here.
@thmcon, any news on your side?

Regards.

@thmcon
Copy link
Author

thmcon commented Jul 5, 2022

Hello @aholiveira,

sorry for the delayed response.

On the installations I'm managing that expression gave "early triggers", activating the trigger at the start of the job for a few seconds. Did you not have any of these?

No, I did never get "early triggers" in none of the different trigger settings (the first version in the template, my version and your latest version).

I had that early trigger problem in my mind when I added the "last, #1" = "last, #2" expression for the job status item in my version of the trigger. My idea behind that was: the job status will be collected every 5 minutes and the trigger should only fire when the job status is below 0 (means job is running) two times in a row and the progress is not changing within 5 Minutes and the progress is not 100% and the duration is "too long" according to the macro value.

My trigger expression was not perfect, at least regarding the assumption to have a job progress of at least 1% in 5 Minutes.

Since I am using your last version of the expression, I only got this problem where my job in the database is 0% completed even when the veeam gui is saying it's idle.

It is just this one job that is idle with 0 % progress.
The job type ID is still 51, so no changes at all in this complete setup of veeam plus zabbix.

I think I have to apply the latest updates to veeam, but I can't do that on short notice.

@thmcon
Copy link
Author

thmcon commented Jul 6, 2022

Also after upgrading veeam to the latest version I still do not get false "early triggers"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants