
[RHINENG-7833] - Send notification newly-stale hosts #2123

Merged

9 commits merged into RedHatInsights:master from add-notif-stale-host on Feb 6, 2025

Conversation

@jpramos123 (Contributor) commented Dec 9, 2024

Overview

This PR addresses RHINENG-7833.
It adds the ability to send notifications for hosts that went stale in the last X seconds, where X is the time window between job runs. For example, if the last run was 1 hour ago, the next job run fetches the hosts that went stale within the past hour. If the job fails to run for some reason, this gap grows accordingly. By default, the job runs every hour.
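As a rough illustration of the window logic described above — a sketch only, assuming a SQLAlchemy session and a Host model exposing a stale_timestamp column; the names are illustrative, not the PR's actual code:

from datetime import datetime, timezone

def find_newly_stale_hosts(session, last_successful_run: datetime):
    """Fetch hosts that went stale since the last successful job run."""
    now = datetime.now(timezone.utc)
    # If a previous run failed, last_successful_run is older, so the
    # window grows automatically and no newly-stale host is skipped.
    return (
        session.query(Host)  # Host is an assumed model, not defined here
        .filter(Host.stale_timestamp > last_successful_run, Host.stale_timestamp <= now)
        .all()
    )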

To save the last run, a table called hbi_metadata was created, with 3 attributes:

  • name
  • type
  • last_succeed

The last_succeed field is the one used to compute the time window between successful job runs.
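A minimal sketch of what such a table could look like in SQLAlchemy terms — the column types and the composite primary key are assumptions drawn from this description and the review discussion below, not the merged model:

class HostInventoryMetadata(db.Model):  # db is the Flask-SQLAlchemy handle used elsewhere in the project
    __tablename__ = "hbi_metadata"

    # One row per (name, type) pair, e.g. ("host_stale", "job")
    name = db.Column(db.String, primary_key=True)
    type = db.Column(db.String, primary_key=True)
    # Timestamp of the last successful run; defines the query window
    last_succeed = db.Column(db.DateTime(timezone=True))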

It also refactors the host reaper, since that is likewise a job that runs on a schedule. The code shared by both jobs was moved to the jobs/common.py file.

PR Checklist

  • Keep PR title short, ideally under 72 characters
  • Descriptive comments provided in complex code blocks
  • Include raw query examples in the PR description, if adding/modifying SQL query
  • Tests: validate optimal/expected output
  • Tests: validate exceptions and failure scenarios
  • Tests: edge cases
  • Recovers or fails gracefully during potential resource outages (e.g. DB, Kafka)
  • Uses type hinting, if convenient
  • Documentation, if this PR changes the way other services interact with host inventory
  • Links to related PRs

Secure Coding Practices Documentation Reference

You can find documentation on this checklist here.

Secure Coding Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

@jpramos123 marked this pull request as ready for review December 12, 2024 21:36
@jpramos123 requested a review from a team as a code owner December 12, 2024 21:36
@jpramos123 force-pushed the add-notif-stale-host branch 5 times, most recently from 746edf5 to 8061985 on December 17, 2024 13:28
@jpramos123 (Contributor Author)

/retest

@jpramos123 force-pushed the add-notif-stale-host branch 6 times, most recently from 6820653 to c9a1324 on December 17, 2024 20:37
@FabriciaDinizRH (Contributor) left a comment

I think this needs more tests, but I'll not block it

@msager27 commented Jan 2, 2025

/retest

@jpramos123 (Contributor Author)

/retest

@kruai (Collaborator) left a comment

I wanna give this a more thorough review tomorrow, but a couple of quick things:

  • staleness_to_conditions got moved, so this PR is re-adding it in a new location
  • I think it would be better if we didn't pass the config through all those functions that don't need it just to fetch the verification window. If we do need it, we could grab the value straight from inventory_config, for instance. But I also don't know whether _stale_in should be built into the Conditions class; "stale_in" isn't a valid staleness state, and I think it makes staleness_to_conditions much less straightforward. I'd lean towards making a new filtering function, or potentially a new function to pass into timestamp_filter_func (haven't thought that one through yet, though; see the sketch below)
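A rough sketch of the standalone filter suggested in the last bullet — the function name, its signature, and the predicate contract are all assumptions, not code from the PR:

from datetime import datetime, timedelta, timezone

def stale_in_last_window(window: timedelta, now: datetime | None = None):
    """Build a predicate matching timestamps that fell stale within `window`."""
    now = now or datetime.now(timezone.utc)
    window_start = now - window

    def _filter(stale_timestamp: datetime) -> bool:
        # Stale strictly after the window opened, and no later than now
        return window_start < stale_timestamp <= now

    return _filter

Such a helper could be invoked as stale_in_last_window(timedelta(hours=1)), keeping the "stale in the last window" logic out of the Conditions class entirely.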

@jpramos123 force-pushed the add-notif-stale-host branch from 0e11cec to bddad3b on January 23, 2025 17:16
@jpramos123 added and removed the "ready for review" label Jan 23, 2025
@jpramos123 force-pushed the add-notif-stale-host branch from bddad3b to e82ad83 on January 23, 2025 17:21
@jpramos123 requested a review from kruai January 24, 2025 14:07
@jpramos123 force-pushed the add-notif-stale-host branch from e82ad83 to 249b132 on January 24, 2025 14:40
@jpramos123 force-pushed the add-notif-stale-host branch 2 times, most recently from e2146d7 to 08f32e7 on January 28, 2025 19:45
@jpramos123 marked this pull request as ready for review January 28, 2025 19:50
@kruai (Collaborator) left a comment

I'll need to make a second pass at this tomorrow, but here are my thoughts so far, in addition to the inline comments:

  • Thanks for refactoring the common job code! Great DRYing
  • I don't see where the HostInventoryMetadata record is updated after the job completes. I see where it's created if it doesn't exist, but nowhere else
  • The HostStale class feels kinda redundant, and its naming does not make its purpose very clear. I'd need to play around with the code a bit to see if there is a simpler and clearer way to achieve this; I encourage other reviewers to try as well

@thearifismail (Contributor) left a comment

Still reviewing

PROMETHEUS_JOB = "inventory-stale-host-notification"
COLLECTED_METRICS = (
    stale_host_notification_count,
    stale_host_notification_processing_time,
Contributor:

The word processing_time suggests that it is the time delta between the start and end of the notification-processing event, but I don't see such a thing in the code. It looks like it is simply a timestamp recorded when the notification failure happens. If that is the case, then stale_host_notification_time sounds more appropriate.

Contributor Author:

That is exactly what it is doing; from the method's description:

Time a block of code or function, and observe the duration in seconds.
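For reference, a minimal, self-contained illustration of prometheus_client's Summary.time(), whose docstring is quoted above — the metric name, description, and timed function are placeholders, not the PR's actual values:

import time

from prometheus_client import Summary

stale_host_notification_processing_time = Summary(
    "stale_host_notification_processing_seconds",
    "Time spent notifying newly-stale hosts",
)

def notify_stale_hosts():  # stand-in for the real notification work
    time.sleep(0.1)

with stale_host_notification_processing_time.time():
    notify_stale_hosts()  # observes the block's duration in seconds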

LOGGER_NAME = "stale_host_notification"
PROMETHEUS_JOB = "inventory-stale-host-notification"
COLLECTED_METRICS = (
    stale_host_notification_count,
Contributor:

Since there is a failure count, this metric should be renamed to indicate success: stale_host_notification_success_count.

Contributor Author:

Done!

dev.yml Outdated
@@ -107,6 +107,12 @@ services:
      db:
        condition: service_healthy

  prometheus-gateway:
    container_name: prometheus_gateway
    image: prom/pushgateway
Contributor:

Suggested change:
-    image: prom/pushgateway
+    image: quay.io/prometheus/pushgateway

Contributor:

For images, avoid pulling from Docker Hub. Here is what I found on quay.io, and it seems to work.

Contributor Author:

Done!

@@ -0,0 +1,106 @@
from functools import partial
Contributor:

The module name common should be renamed to communicate its purpose or the reason for its existence.

Contributor Author:

IMO, the structure is fine.

We have a folder called jobs, which indicates that it contains job-related code. Inside it, the common.py file indicates that it holds code shared by the jobs in that folder.

jobs/common.py Outdated
    return query_filters


def main(logger, collected_metrics, prometheus_job_name):
Contributor:

+1 on @kruai's comment. I got here searching for the definition of notification_event_producer, found the call to the main function very confusing, and wondered about the purpose of main before seeing Asa's comment.

Since this common module is really job configuration, could the setup part be moved to app/config.py and the various helper functions be moved to a new file/module in the lib folder? Or move the common.main function to stale_host_notification.py, with the configuration part added to app.config and the supporting functions put in a new file in the lib directory.

app/models.py Outdated
class HostInventoryMetadata(db.Model):  # type: ignore [name-defined]
    __tablename__ = "hbi_metadata"
    __table_args__ = (
        UniqueConstraint("name", "type", name="hbi_metadata_unique_constraint"),
Contributor:

What constraint? Why does it have to be unique?

Contributor Author:

Removed this constraint in favor of a PK constraint using two columns: name and type.

    )


def _create_host_stale_metadata_in_db(session: Session) -> HostInventoryMetadata:
Contributor:

Is it creating or checking staleness? A host's staleness is set when the host is created or updated; the name is confusing. Maybe adding a comment before the function signature would be helpful.

@jpramos123 (Contributor Author) commented Feb 3, 2025:

The function name indicates that it is creating host-stale metadata in the database; as we can see, it starts with create.

@thearifismail (Contributor) left a comment

@jpramos123 Based on the code review so far, three issues have come to the fore.

  1. Module arrangement. Though this is a project-wide issue, to get started let's limit ourselves for now to stale_host_notification and jobs/common.py. I suggest starting by combining the stale_host_notification.run() and jobs.common.main() functions; the rest of the common code should go to the lib folder. IMO, the jobs folder is not needed.
  2. Confusing names of variables, functions, and modules. Look at each variable: what is it for, and does its name communicate its purpose? If it does not, rename it so that it does; don't worry if the names get long to start with. Please look at my comments to see which names I found confusing or in need of clarification.
  3. The HostInventoryMetadata table. First of all, the name is confusing. Is this table really needed? I gather it exists to save the "last update time" between two notification runs. In its current form, the table is updated once all notifications have been sent. But what happens when 1000 hosts have gone stale, notifications for 999 were sent successfully, and the last one failed due to some problem? The notification run did not complete successfully, so the last update time cannot be advanced. It would therefore seem more appropriate to record the last update time per host, but the problem with that approach is far too much traffic updating this table. If the database should not hold host-inventory metadata, what other options are available to persist the last update time between notification runs/episodes?

lib/metrics.py Outdated
@@ -52,3 +52,14 @@
"inventory_new_export_seconds", "Time spent to create a new host export report"
)
create_export_count = Counter("inventory_create_export", "The total amount of host exports created")

# Stale Host Notification
stale_host_notification_count = Counter(
Contributor:

Change to stale_host_notification_success_count, if the metric name is changed in stale_host_notification.py.

Contributor Author:

Done. Addressed by your previous comment.

RUNTIME_ENVIRONMENT = RuntimeEnvironment.JOB


class HostStale:
Contributor:

HostStale sounds confusing and may be misread as "host tail". How about StaleHost? Less favorable would be host_stale.

try:
    stale_host_timestamp = (
        session.query(HostInventoryMetadata)
        .where(HostInventoryMetadata.name == "host_stale", HostInventoryMetadata.type == "job")
Contributor:

What are the other possible values for name and type in HostInventoryMetadata?

Contributor Author:

Possibilities are:

name: hbi_api
type: api

or

name: host_reaper
type: job

This is what comes to my mind right now

@@ -661,6 +661,26 @@ def update(self, input_acc):
    modified_on = db.Column(db.DateTime(timezone=True), default=_time_now, onupdate=_time_now)


class HostInventoryMetadata(db.Model): # type: ignore [name-defined]
Contributor:

Is a separate table really needed, if the calls creating HostInventoryMetadata originate from _query_or_create_host_stale(), which exists to calculate the staleness time? Since the name host_stale and the type job do not look unique, what happens to this data once a staleness-notification job has completed? The information looks very transitory, which makes one wonder whether a separate table that holds it permanently is needed.

In other words, could the staleness time be kept in an object instead of persisted in a DB? Once the stale-host notifications have been sent, the object is removed; the next run can create its own staleness timestamp.

@jpramos123 (Contributor Author) commented Feb 3, 2025:

Yes, this table is needed because we need a way to persist the last successful run of this job. We use that value to calculate the time window, and we query the hosts that went stale within that same window. We took this approach because, if the job fails to start/run for some reason, the window is computed dynamically, so no stale hosts are left unnotified by the job. See this Slack thread where we discussed the approach: https://redhat-internal.slack.com/archives/CQFKM031T/p1737492464040599

Sorry if that wasn't clear: name and type are unique together in the DB, so host_stale and job identify a single row in the table.

@stale_host_notification_fail_count.count_exceptions()
def run(logger, session, notification_event_producer, shutdown_handler, application):
    with application.app.app_context(), stale_host_notification_processing_time.time():
        stale_host_timestamp = _query_or_create_host_stale(session)
Contributor:

Is the purpose of calling _query_or_create_host_stale(session) to determine stale_timestamp? If it is, isn't there already a function in the project for determining stale_timestamp? If one is not found, it leads to adding a record to the host_inventory_metadata table.

Contributor Author:

No, it is querying the HostInventoryMetadata class, filtering it by: name == host_stale and type == job. Do not confuse stale_timestamp with stale_host_timestamp.

except NoResultFound:
    stale_host_timestamp = _create_host_stale_metadata_in_db(session)

return stale_host_timestamp
Contributor:

Oh, this is creating the timestamp for staleness. The word "host" makes it more confusing than not having it.

Contributor Author:

The feature itself is about notifying newly-stale hosts, so I believe we need the word host to indicate the feature's aim... Please leave a suggestion if you have one!
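Piecing together the excerpts in this thread, the query-or-create flow reads roughly as follows — a sketch; the use of .one() and the overall function body are inferred from the diff fragments, not copied from the PR:

from sqlalchemy.exc import NoResultFound

def _query_or_create_host_stale(session):
    try:
        stale_host_timestamp = (
            session.query(HostInventoryMetadata)
            .where(HostInventoryMetadata.name == "host_stale", HostInventoryMetadata.type == "job")
            .one()  # raises NoResultFound when the singleton row is missing
        )
    except NoResultFound:
        # First run: create the metadata row instead
        stale_host_timestamp = _create_host_stale_metadata_in_db(session)

    return stale_host_timestamp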

@kruai (Collaborator) left a comment

I think there are further things that can be refactored, but first, I wanna propose that we remove the HostStale class

@jpramos123 force-pushed the add-notif-stale-host branch from 07ab03e to c70c2cd on February 5, 2025 18:21
@jpramos123 requested a review from kruai February 5, 2025 18:22
@jpramos123 added the "ready for review" label Feb 5, 2025
@jpramos123 force-pushed the add-notif-stale-host branch from c70c2cd to 945e93d on February 5, 2025 18:50
@jpramos123 force-pushed the add-notif-stale-host branch from 945e93d to 525614e on February 6, 2025 11:05
@strider (Contributor) left a comment

LGTM

@thearifismail (Contributor) left a comment

LGTM

@kruai merged commit 1dd8e01 into RedHatInsights:master Feb 6, 2025
3 of 10 checks passed
Labels: ready for review

6 participants