
[RHINENG-7833] - Send notification newly-stale hosts #2123

Merged

9 commits merged into RedHatInsights:master from add-notif-stale-host on Feb 6, 2025

Conversation

@jpramos123 (Contributor) commented Dec 9, 2024

Overview

This PR addresses RHINENG-7833.
It adds the ability to send notifications for hosts that went stale in the last X seconds, where X is the time window between job runs. For example, if the last run was 1 hour ago, the next job run fetches the hosts that went stale within the past hour. If the job fails to run for some reason, this gap grows accordingly. By default, the job runs every hour.
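As a rough illustration of the window logic described above — a sketch only, assuming a SQLAlchemy session and a Host model exposing a stale_timestamp column; the names are illustrative, not the PR's actual code:

from datetime import datetime, timezone

def find_newly_stale_hosts(session, last_successful_run: datetime):
    """Fetch hosts that went stale since the last successful job run."""
    now = datetime.now(timezone.utc)
    # If a previous run failed, last_successful_run is older, so the
    # window grows automatically and no newly-stale host is skipped.
    return (
        session.query(Host)  # Host is an assumed model, not defined here
        .filter(Host.stale_timestamp > last_successful_run, Host.stale_timestamp <= now)
        .all()
    )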

To save the last run, a table called hbi_metadata was created, with 3 attributes:

  • name
  • type
  • last_succeed

The last_succeed field is the one used to compute the time window between successful job runs.
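A minimal sketch of what such a table could look like in SQLAlchemy terms — the column types and the composite primary key are assumptions drawn from this description and the review discussion below, not the merged model:

class HostInventoryMetadata(db.Model):  # db is the Flask-SQLAlchemy handle used elsewhere in the project
    __tablename__ = "hbi_metadata"

    # One row per (name, type) pair, e.g. ("host_stale", "job")
    name = db.Column(db.String, primary_key=True)
    type = db.Column(db.String, primary_key=True)
    # Timestamp of the last successful run; defines the query window
    last_succeed = db.Column(db.DateTime(timezone=True))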

It also refactors the host reaper, since that is likewise a job that runs on a schedule. The code shared by both jobs was moved to the jobs/common.py file.

PR Checklist

  • Keep PR title short, ideally under 72 characters
  • Descriptive comments provided in complex code blocks
  • Include raw query examples in the PR description, if adding/modifying SQL query
  • Tests: validate optimal/expected output
  • Tests: validate exceptions and failure scenarios
  • Tests: edge cases
  • Recovers or fails gracefully during potential resource outages (e.g. DB, Kafka)
  • Uses type hinting, if convenient
  • Documentation, if this PR changes the way other services interact with host inventory
  • Links to related PRs

Secure Coding Practices Documentation Reference

You can find documentation on this checklist here.

Secure Coding Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

@jpramos123 marked this pull request as ready for review December 12, 2024 21:36
@jpramos123 requested a review from a team as a code owner December 12, 2024 21:36
@jpramos123 force-pushed the add-notif-stale-host branch 5 times, most recently from 746edf5 to 8061985 on December 17, 2024 13:28
@jpramos123 (Contributor Author)

/retest

@jpramos123 force-pushed the add-notif-stale-host branch 6 times, most recently from 6820653 to c9a1324 on December 17, 2024 20:37
@FabriciaDinizRH (Contributor) left a comment

I think this needs more tests, but I'll not block it

@msager27 commented Jan 2, 2025

/retest

@jpramos123 (Contributor Author)

/retest

@kruai (Collaborator) left a comment

I wanna give this a more thorough review tomorrow, but a couple of quick things:

  • staleness_to_conditions got moved, so this PR is re-adding it in a new location
  • I think it would be better if we didn't pass the config through all those functions that don't need it just to fetch the verification window. If we do need it, we could grab the value straight from inventory_config, for instance. But I also don't know whether _stale_in should be built into the Conditions class; "stale_in" isn't a valid staleness state, and I think it makes staleness_to_conditions much less straightforward. I'd lean towards making a new filtering function, or potentially a new function to pass into timestamp_filter_func (haven't thought that one through yet, though; see the sketch below)
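A rough sketch of the standalone filter suggested in the last bullet — the function name, its signature, and the predicate contract are all assumptions, not code from the PR:

from datetime import datetime, timedelta, timezone

def stale_in_last_window(window: timedelta, now: datetime | None = None):
    """Build a predicate matching timestamps that fell stale within `window`."""
    now = now or datetime.now(timezone.utc)
    window_start = now - window

    def _filter(stale_timestamp: datetime) -> bool:
        # Stale strictly after the window opened, and no later than now
        return window_start < stale_timestamp <= now

    return _filter

Such a helper could be invoked as stale_in_last_window(timedelta(hours=1)), keeping the "stale in the last window" logic out of the Conditions class entirely.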

@jpramos123 force-pushed the add-notif-stale-host branch from 0e11cec to bddad3b on January 23, 2025 17:16
@jpramos123 added and removed the "ready for review" label Jan 23, 2025
@jpramos123 force-pushed the add-notif-stale-host branch from bddad3b to e82ad83 on January 23, 2025 17:21
@jpramos123 requested a review from kruai January 24, 2025 14:07
@jpramos123 force-pushed the add-notif-stale-host branch from e82ad83 to 249b132 on January 24, 2025 14:40
@jpramos123 force-pushed the add-notif-stale-host branch 2 times, most recently from e2146d7 to 08f32e7 on January 28, 2025 19:45
@jpramos123 marked this pull request as ready for review January 28, 2025 19:50
@kruai (Collaborator) left a comment

I'll need to make a second pass at this tomorrow, but here are my thoughts so far, in addition to the inline comments:

  • Thanks for refactoring the common job code! Great DRYing
  • I don't see where the HostInventoryMetadata record is updated after the job completes. I see where it's created if it doesn't exist, but nowhere else
  • The HostStale class feels kinda redundant, and its naming does not make its purpose very clear. I'd need to play around with the code a bit to see if there is a simpler and clearer way to achieve this; I encourage other reviewers to try as well

@thearifismail (Contributor) left a comment

Still reviewing

PROMETHEUS_JOB = "inventory-stale-host-notification"
COLLECTED_METRICS = (
    stale_host_notification_count,
    stale_host_notification_processing_time,
Contributor:

The word processing_time suggests that it is the time delta between the start and end of the notification-processing event, but I don't see such a thing in the code. It looks like it is simply a timestamp recorded when the notification failure happens. If that is the case, then stale_host_notification_time sounds more appropriate.

Contributor Author:

That is exactly what it is doing; from the method's description:

Time a block of code or function, and observe the duration in seconds.
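For reference, a minimal, self-contained illustration of prometheus_client's Summary.time(), whose docstring is quoted above — the metric name, description, and timed function are placeholders, not the PR's actual values:

import time

from prometheus_client import Summary

stale_host_notification_processing_time = Summary(
    "stale_host_notification_processing_seconds",
    "Time spent notifying newly-stale hosts",
)

def notify_stale_hosts():  # stand-in for the real notification work
    time.sleep(0.1)

with stale_host_notification_processing_time.time():
    notify_stale_hosts()  # observes the block's duration in seconds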

LOGGER_NAME = "stale_host_notification"
PROMETHEUS_JOB = "inventory-stale-host-notification"
COLLECTED_METRICS = (
    stale_host_notification_count,
Contributor:

Since there is a failure count, this metric should be renamed to indicate success: stale_host_notification_success_count.

Contributor Author:

Done!

dev.yml Outdated
@@ -107,6 +107,12 @@ services:
      db:
        condition: service_healthy

  prometheus-gateway:
    container_name: prometheus_gateway
    image: prom/pushgateway
Contributor:

Suggested change:
-    image: prom/pushgateway
+    image: quay.io/prometheus/pushgateway

Contributor:

For images, avoid pulling from Docker Hub. Here is what I found on quay.io, and it seems to work.

Contributor Author:

Done!

@@ -0,0 +1,106 @@
from functools import partial
Contributor:

The module name common should be renamed to communicate its purpose or the reason for its existence.

Contributor Author:

IMO, the structure is fine.

We have a folder called jobs, which indicates that it contains job-related code. Inside it, the common.py file indicates that it holds code shared by the jobs in that folder.

jobs/common.py Outdated
    return query_filters


def main(logger, collected_metrics, prometheus_job_name):
Contributor:

+1 on @kruai's comment. I got here searching for the definition of notification_event_producer, found the call to the main function very confusing, and wondered about the purpose of main before seeing Asa's comment.

Since this common module is really job configuration, could the setup part be moved to app/config.py and the various helper functions be moved to a new file/module in the lib folder? Or move the common.main function to stale_host_notification.py, with the configuration part added to app.config and the supporting functions put in a new file in the lib directory.

app/models.py Outdated
class HostInventoryMetadata(db.Model):  # type: ignore [name-defined]
    __tablename__ = "hbi_metadata"
    __table_args__ = (
        UniqueConstraint("name", "type", name="hbi_metadata_unique_constraint"),
Contributor:

What constraint? Why does it have to be unique?

Contributor Author:

Removed this constraint in favor of a PK constraint using two columns: name and type.

    )


def _create_host_stale_metadata_in_db(session: Session) -> HostInventoryMetadata:
Contributor:

Is it creating or checking staleness? A host's staleness is set when the host is created or updated; the name is confusing. Maybe adding a comment before the function signature would be helpful.

@jpramos123 (Contributor Author) commented Feb 3, 2025:

The function name indicates that it is creating host-stale metadata in the database; as we can see, it starts with create.

@thearifismail (Contributor) left a comment

@jpramos123 Based on the code review so far, three issues have come to the fore.

  1. Module arrangement. Though this is a project-wide issue, to get started let's limit ourselves for now to stale_host_notification and jobs/common.py. I suggest starting by combining the stale_host_notification.run() and jobs.common.main() functions; the rest of the common code should go to the lib folder. IMO, the jobs folder is not needed.
  2. Confusing names of variables, functions, and modules. Look at each variable: what is it for, and does its name communicate its purpose? If it does not, rename it so that it does; don't worry if the names get long to start with. Please look at my comments to see which names I found confusing or in need of clarification.
  3. The HostInventoryMetadata table. First of all, the name is confusing. Is this table really needed? I gather it exists to save the "last update time" between two notification runs. In its current form, the table is updated once all notifications have been sent. But what happens when 1000 hosts have gone stale, notifications for 999 were sent successfully, and the last one failed due to some problem? The notification run did not complete successfully, so the last update time cannot be advanced. It would therefore seem more appropriate to record the last update time per host, but the problem with that approach is far too much traffic updating this table. If the database should not hold host-inventory metadata, what other options are available to persist the last update time between notification runs/episodes?

lib/metrics.py Outdated
@@ -52,3 +52,14 @@
"inventory_new_export_seconds", "Time spent to create a new host export report"
)
create_export_count = Counter("inventory_create_export", "The total amount of host exports created")

# Stale Host Notification
stale_host_notification_count = Counter(
Contributor:

Change to stale_host_notification_success_count, if the metric name is changed in stale_host_notification.py.

Contributor Author:

Done. Addressed by your previous comment.

RUNTIME_ENVIRONMENT = RuntimeEnvironment.JOB


class HostStale:
Contributor:

HostStale sounds confusing and may be misread as "host tail". How about StaleHost? Less favorable would be host_stale.

try:
    stale_host_timestamp = (
        session.query(HostInventoryMetadata)
        .where(HostInventoryMetadata.name == "host_stale", HostInventoryMetadata.type == "job")
Contributor:

What are the other possible values for name and type in HostInventoryMetadata?

Contributor Author:

Possibilities are:

name: hbi_api
type: api

or

name: host_reaper
type: job

This is what comes to my mind right now

@@ -661,6 +661,26 @@ def update(self, input_acc):
    modified_on = db.Column(db.DateTime(timezone=True), default=_time_now, onupdate=_time_now)


class HostInventoryMetadata(db.Model): # type: ignore [name-defined]
Contributor:

Is a separate table really needed, if the calls creating HostInventoryMetadata originate from _query_or_create_host_stale(), which exists to calculate the staleness time? Since the name host_stale and the type job do not look unique, what happens to this data once a staleness-notification job has completed? The information looks very transitory, which makes one wonder whether a separate table that holds it permanently is needed.

In other words, could the staleness time be kept in an object instead of persisted in a DB? Once the stale-host notifications have been sent, the object is removed; the next run can create its own staleness timestamp.

@jpramos123 (Contributor Author) commented Feb 3, 2025:

Yes, this table is needed because we need a way to persist the last successful run of this job. We use that value to calculate the time window, and we query the hosts that went stale within that same window. We took this approach because, if the job fails to start/run for some reason, the window is computed dynamically, so no stale hosts are left unnotified by the job. See this Slack thread where we discussed the approach: https://redhat-internal.slack.com/archives/CQFKM031T/p1737492464040599

Sorry if that wasn't clear: name and type are unique together in the DB, so host_stale and job identify a single row in the table.

@stale_host_notification_fail_count.count_exceptions()
def run(logger, session, notification_event_producer, shutdown_handler, application):
    with application.app.app_context(), stale_host_notification_processing_time.time():
        stale_host_timestamp = _query_or_create_host_stale(session)
Contributor:

Is the purpose of calling _query_or_create_host_stale(session) to determine stale_timestamp? If it is, isn't there already a function in the project for determining stale_timestamp? If one is not found, it leads to adding a record to the host_inventory_metadata table.

Contributor Author:

No, it is querying the HostInventoryMetadata class, filtering it by: name == host_stale and type == job. Do not confuse stale_timestamp with stale_host_timestamp.

except NoResultFound:
    stale_host_timestamp = _create_host_stale_metadata_in_db(session)

return stale_host_timestamp
Contributor:

Oh, this is creating the timestamp for staleness. The word "host" makes it more confusing than not having it.

Contributor Author:

The feature itself is about notifying newly-stale hosts, so I believe we need the word host to indicate the feature's aim... Please leave a suggestion if you have one!
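Piecing together the excerpts in this thread, the query-or-create flow reads roughly as follows — a sketch; the use of .one() and the overall function body are inferred from the diff fragments, not copied from the PR:

from sqlalchemy.exc import NoResultFound

def _query_or_create_host_stale(session):
    try:
        stale_host_timestamp = (
            session.query(HostInventoryMetadata)
            .where(HostInventoryMetadata.name == "host_stale", HostInventoryMetadata.type == "job")
            .one()  # raises NoResultFound when the singleton row is missing
        )
    except NoResultFound:
        # First run: create the metadata row instead
        stale_host_timestamp = _create_host_stale_metadata_in_db(session)

    return stale_host_timestamp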

@kruai (Collaborator) left a comment

I think there are further things that can be refactored, but first, I wanna propose that we remove the HostStale class

@jpramos123 force-pushed the add-notif-stale-host branch from 07ab03e to c70c2cd on February 5, 2025 18:21
@jpramos123 requested a review from kruai February 5, 2025 18:22
@jpramos123 added the "ready for review" label Feb 5, 2025
@jpramos123 force-pushed the add-notif-stale-host branch from c70c2cd to 945e93d on February 5, 2025 18:50
@jpramos123 force-pushed the add-notif-stale-host branch from 945e93d to 525614e on February 6, 2025 11:05
@strider (Contributor) left a comment

LGTM

@thearifismail (Contributor) left a comment

LGTM

@kruai merged commit 1dd8e01 into RedHatInsights:master Feb 6, 2025
3 of 10 checks passed
Labels: ready for review

6 participants