[RHINENG-7833] - Send notification newly-stale hosts #2123
Conversation
(force-pushed ee80365 to 952eaa5)
(force-pushed 746edf5 to 8061985)
/retest
(force-pushed 6820653 to c9a1324)
I think this needs more tests, but I won't block it.
/retest
(force-pushed 355f9cf to 0e11cec)
/retest
I want to give this a more thorough review tomorrow, but a couple of quick things:
`staleness_to_conditions` got moved, so this PR is re-adding it in a new location. I think it would be better if we didn't pass the config through all those functions that don't need it just to fetch the verification window. If we do need it, we could grab the value straight from `inventory_config`, for instance. But also, I don't know whether `_stale_in` should be built into the Conditions class; "stale_in" isn't a valid staleness state, and I think that makes `staleness_to_conditions` much less straightforward. I'd lean towards making a new filtering function, or potentially a new function to pass into `timestamp_filter_func` (haven't thought that one through yet, though).
(force-pushed 0e11cec to bddad3b)
(force-pushed bddad3b to e82ad83)
(force-pushed e82ad83 to 249b132)
(force-pushed e2146d7 to 08f32e7)
I'll need to make a second pass at this tomorrow, but here are my thoughts so far, in addition to the inline comments:
- Thanks for refactoring the common job code! Great DRYing
- I don't see where the HostInventoryMetadata record is updated after the job completes. I see where it's created if it doesn't exist, but nowhere else
- The `HostStale` class feels kind of redundant, and its naming does not make its purpose very clear. I'd need to play around with the code a bit to see if there is a simpler and clearer way to achieve this; I encourage other reviewers to try as well
migrations/versions/4f0b36e6cf28_create_metadata_table_to_store_host_.py (Outdated)
Still reviewing
stale_host_notification.py (Outdated)

PROMETHEUS_JOB = "inventory-stale-host-notification"
COLLECTED_METRICS = (
    stale_host_notification_count,
    stale_host_notification_processing_time,
The word `processing_time` suggests that it is the time delta between the start and end of the notification-processing event, but I don't see such a thing in the code. It looks like it is simply a timestamp recorded when the notification failure occurs. If that is the case, then `stale_host_notification_time` sounds more appropriate.
This is exactly what it is doing. From the method's description: "Time a block of code or function, and observe the duration in seconds."
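For reference, here is a minimal sketch of how the `prometheus_client` `.time()` helper behaves; the metric name below is hypothetical (the real ones live in `lib/metrics.py`), but it shows that the metric records a duration, not a single timestamp:

```python
from prometheus_client import REGISTRY, Summary

# Hypothetical metric mirroring the one discussed in this thread.
processing_time = Summary(
    "demo_stale_host_notification_processing_time",
    "Time spent processing stale-host notifications",
)

def send_notifications():
    # The context manager observes the elapsed wall-clock time on exit,
    # so each call contributes a duration sample to the Summary.
    with processing_time.time():
        pass  # notification-sending work would happen here

send_notifications()

# Each timed call increments the _count sample and adds the duration to _sum.
count = REGISTRY.get_sample_value("demo_stale_host_notification_processing_time_count")
print(count)  # 1.0
```

The same `.time()` API also works as a decorator, which matches how the job wraps its whole run.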
stale_host_notification.py (Outdated)

LOGGER_NAME = "stale_host_notification"
PROMETHEUS_JOB = "inventory-stale-host-notification"
COLLECTED_METRICS = (
    stale_host_notification_count,
Since there is a failure count, this metric should be renamed to indicate success; so the metric should be renamed to `stale_host_notification_success_count`.
Done!
dev.yml (Outdated)

@@ -107,6 +107,12 @@ services:
      db:
        condition: service_healthy

  prometheus-gateway:
    container_name: prometheus_gateway
    image: prom/pushgateway
Suggested change:
- image: prom/pushgateway
+ image: quay.io/prometheus/pushgateway
For images, avoid pulling them from Docker Hub. Here is what I found on quay.io, and it seems to work.
Done!
@@ -0,0 +1,106 @@
from functools import partial
The module name `common` should be renamed to communicate its purpose or the reason for its existence.
IMO, the structure is fine.
We have a folder called `jobs`, which indicates that it is going to contain job-related code. Inside it, we have the `common.py` file, which indicates that it holds code common to the jobs.
jobs/common.py (Outdated)

    return query_filters


def main(logger, collected_metrics, prometheus_job_name):
+1 on @kruai's comment. I got here searching for the definition of `notification_event_producer`, found the call to the `main` function very confusing, and wondered about the purpose of `main` before seeing Asa's comment.
Since this `common` is really a job config, could the setup part be moved to `app.config.py` and the various helper functions moved to the `lib` folder in a new file/module? Or move the `common.main` function to `stale_host_notification.py`, where the configuration part should be added to `app.config` and the supporting functions to a new file in the `lib` directory.
app/models.py (Outdated)

class HostInventoryMetadata(db.Model):  # type: ignore [name-defined]
    __tablename__ = "hbi_metadata"
    __table_args__ = (
        UniqueConstraint("name", "type", name="hbi_metadata_unique_constraint"),
What is this constraint for? Why does it have to be unique?
Removed this constraint in order to set the PK constraint using two columns: `name` and `type`.
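For illustration, a minimal sketch of what a composite primary key on (`name`, `type`) can look like in SQLAlchemy. The table and column names follow the PR discussion; everything else (column types, the default, the in-memory engine) is assumed:

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class HostInventoryMetadata(Base):
    """Sketch: the (name, type) pair is the primary key, replacing the
    earlier separate UniqueConstraint on the same two columns."""
    __tablename__ = "hbi_metadata"

    name = Column(String, primary_key=True)
    type = Column(String, primary_key=True)
    last_update = Column(
        DateTime(timezone=True),
        default=lambda: datetime.now(timezone.utc),
    )

engine = create_engine("sqlite://")  # throwaway in-memory DB for the demo
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(HostInventoryMetadata(name="host_stale", type="job"))
    session.commit()
    # A composite PK means lookups use the full (name, type) tuple.
    row = session.get(HostInventoryMetadata, ("host_stale", "job"))
    print(row.name, row.type)  # host_stale job
```

A composite primary key also guarantees the one-row-per-(name, type) behavior discussed later in this thread, without a second constraint object.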
stale_host_notification.py (Outdated)

)


def _create_host_stale_metadata_in_db(session: Session) -> HostInventoryMetadata:
Is it creating or checking staleness? A host's staleness is set when the host is created or updated, so the name is confusing. Maybe adding a comment before the function signature would be helpful.
The function name indicates it is creating host-stale metadata in the database; as we can see, it starts with "create".
@jpramos123 Based on the code review so far, three issues have come to the fore.
- Module arrangement. Though this is a project-wide issue, to get started let's limit ourselves to stale_host_inventory and jobs/common.py. I suggest starting by combining the "stale_host_notification.run()" and "jobs.common.main()" functions; the rest of the common code should go to the "lib" folder. IMO, the jobs folder is not needed.
- Confusing names of variables, functions, and modules. Look at each variable and ask what it is for and whether its name communicates its purpose. If it does not, rename it so that it does; don't worry for now if the names get long. Please look at my comments to see which names I found confusing or in need of clarification.
- The HostInventoryMetadata table. First of all, the name is confusing. Is this table really needed? I guess its purpose is to save the "last update time" between two different notification events. In its current form, the table is updated once all notifications have been sent. The question is: what happens when 1000 hosts have gone stale, notifications for 999 have been sent successfully, but the last one did not go out due to some problem? The notification event did not complete successfully, so last_update_time cannot be updated. It therefore seems more appropriate to record "last_update_time" per host; the problem with that approach is that there would be far too much traffic updating this table. If the database should not hold "host-inventory-metadata", what other options are available to persist the last update time between notification events/episodes?
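As one hypothetical way to reason about the partial-failure concern in the last point (this is illustrative only, not the PR's code): the persisted watermark could be advanced after each successfully notified host or batch, so a mid-run failure only causes retries from the failure point rather than re-sending everything or losing the run:

```python
from datetime import datetime, timedelta, timezone

def notify_stale_hosts(hosts, send, save_watermark, start):
    """Advance the watermark per successfully notified host.

    hosts: iterable of (stale_timestamp, host_id), sorted by stale_timestamp.
    send: callable that delivers one notification (may raise).
    save_watermark: callable that persists the watermark (e.g. to hbi_metadata).
    """
    watermark = start
    for stale_at, host_id in hosts:
        try:
            send(host_id)
        except Exception:
            break  # stop; hosts from `watermark` onward are retried next run
        watermark = stale_at
        save_watermark(watermark)
    return watermark

# Toy run: the third notification fails, so the watermark stops at host 2.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
hosts = [(t0 + timedelta(minutes=i), f"host-{i}") for i in range(1, 4)]

def flaky_send(host_id):
    if host_id == "host-3":
        raise RuntimeError("notification failed")

saved = []
result = notify_stale_hosts(hosts, flaky_send, saved.append, t0)
print(result == t0 + timedelta(minutes=2))  # True
```

The trade-off the comment raises still applies: per-host persistence multiplies writes, while per-batch checkpoints are a middle ground between one write per run and one per host.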
lib/metrics.py (Outdated)

@@ -52,3 +52,14 @@
    "inventory_new_export_seconds", "Time spent to create a new host export report"
)
create_export_count = Counter("inventory_create_export", "The total amount of host exports created")

# Stale Host Notification
stale_host_notification_count = Counter(
Change to `stale_host_notification_success_count`, if the metric name is changed in stale_host_notification.py.
Done. Addressed by your previous comment.
app/host_stale.py (Outdated)

RUNTIME_ENVIRONMENT = RuntimeEnvironment.JOB


class HostStale:
`HostStale` sounds confusing and can be misread as "host tail". How about `StaleHost`? Less favorable would be `host_stale`.
stale_host_notification.py (Outdated)

try:
    stale_host_timestamp = (
        session.query(HostInventoryMetadata)
        .where(HostInventoryMetadata.name == "host_stale", HostInventoryMetadata.type == "job")
What are the other possible values for `name` and `type` in `HostInventoryMetadata`?
Possibilities are:
- name: `hbi_api`, type: `api`
- name: `host_reaper`, type: `job`
This is what comes to my mind right now.
@@ -661,6 +661,26 @@ def update(self, input_acc):
    modified_on = db.Column(db.DateTime(timezone=True), default=_time_now, onupdate=_time_now)


class HostInventoryMetadata(db.Model):  # type: ignore [name-defined]
Is a separate table really needed, if the calls creating HostInventoryMetadata originate from `_query_or_create_host_stale()`, which was created to calculate the staleness time? Since the name `host_stale` and type `job` do not look unique, what happens to this data when a staleness-notification job has completed? This information looks very transitory, and it makes one wonder whether a separate table that holds it permanently is needed.
In other words, could the staleness time be kept in an object instead of being persisted in a DB? Once the stale-host notifications have been sent, the object is removed; the next session can create its own staleness timestamp.
Yes, this table is needed, as we need a way of persisting the last successful run of this job. We use this value to calculate the time window for querying the hosts that went stale within that same window. We used this approach because, if the job fails to start or run for some reason, a dynamic window is calculated and no stale hosts are left without being notified by the job. See this Slack thread where we discussed this approach: https://redhat-internal.slack.com/archives/CQFKM031T/p1737492464040599
Sorry if it was not clear, but `name` and `type` are unique in the DB. So `host_stale` and `job` are unique (one single row) in the table.
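To illustrate the dynamic-window idea described in that reply (all names, timestamps, and data below are made up for the sketch), the window simply stretches from the last successful run to now, so a missed run widens it instead of dropping hosts:

```python
from datetime import datetime, timedelta, timezone

def stale_window(last_successful_run, now):
    """Hosts that went stale between the last successful run and now.
    If a run is skipped, the window widens, so no host is missed."""
    return (last_successful_run, now)

def hosts_in_window(hosts, window):
    # hosts: mapping of host_id -> timestamp at which it went stale
    start, end = window
    return [h for h, stale_at in hosts.items() if start < stale_at <= end]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
hosts = {
    "a": now - timedelta(minutes=30),  # went stale since the last run
    "b": now - timedelta(hours=3),     # handled by an earlier run
}

# Normal case: the job last succeeded an hour ago.
print(hosts_in_window(hosts, stale_window(now - timedelta(hours=1), now)))  # ['a']

# A run was missed: the window stretches back 2 hours and still catches 'a'.
print(hosts_in_window(hosts, stale_window(now - timedelta(hours=2), now)))  # ['a']
```

In the PR, the window's start would come from the `last_update` value persisted in `hbi_metadata` rather than a hard-coded timestamp.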
stale_host_notification.py (Outdated)

@stale_host_notification_fail_count.count_exceptions()
def run(logger, session, notification_event_producer, shutdown_handler, application):
    with application.app.app_context(), stale_host_notification_processing_time.time():
        stale_host_timestamp = _query_or_create_host_stale(session)
Is the purpose of calling `_query_or_create_host_stale(session)` to determine `stale_timestamp`? If it is, isn't there already a function in the project for determining `stale_timestamp`? If not found, it leads to adding a record to the `host_inventory_metadata` table.
No, it is querying the `HostInventoryMetadata` class, filtering it by `name == "host_stale"` and `type == "job"`. Do not confuse `stale_timestamp` with `stale_host_timestamp`.
stale_host_notification.py (Outdated)

except NoResultFound:
    stale_host_timestamp = _create_host_stale_metadata_in_db(session)

return stale_host_timestamp
Oh, this is creating a timestamp for staleness. The word "host" makes it more confusing than not having it.
The feature itself is about newly-stale hosts being notified; I believe we need the word `host` to specify the feature's aim. Please leave a suggestion if you have one!
(force-pushed 08f32e7 to 5e090fa)
I think there are further things that can be refactored, but first I want to propose that we remove the HostStale class.
(force-pushed 07ab03e to c70c2cd)
(force-pushed c70c2cd to 945e93d)
(force-pushed 945e93d to 525614e)
LGTM
LGTM
Overview
This PR addresses RHINENG-7833.
It adds the functionality of sending notifications for hosts that went stale in the last X seconds. To define X, the time window between job runs is used. Say the last run was 1 hour ago: the next job run will fetch the hosts that went stale within that hour. If the job fails to run for some reason, this gap increases. By default, the job runs every hour.
To save the last run, a table called `hbi_metadata` was created, which has 3 attributes. The `last_update` field is the one used to build the time window between successful job runs.
It also refactors the host reaper, as this new feature is also a job that runs at a set interval. The common code used by both jobs was added to the `jobs/common.py` file.

PR Checklist
Secure Coding Practices Documentation Reference
You can find documentation on this checklist here.
Secure Coding Checklist