Unified Queue: fix failing tests and address TODOs #26067

Merged
mna merged 15 commits into feat-upcoming-activites-queue from mna-23916-fix-failing-tests-12
Feb 5, 2025

Member

@mna mna commented Feb 5, 2025

#23916

Checklist for submitter

  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements)
  • Added/updated automated tests

@@ -695,7 +695,7 @@ func (ds *Datastore) applyHostLabelFilters(ctx context.Context, filter fleet.Tea
return "", nil, ctxerr.Wrap(ctx, err, "get software installer metadata by team and title id")

default:
- // TODO(uniq): prior code was joining on installer id but based on how list options are parsed [1] it seems like this should be the title id
+ // TODO(Sarah): prior code was joining on installer id but based on how list options are parsed [1] it seems like this should be the title id
Member Author

@gillespi314 I changed the TODO to you since it's not related to the unified queue, let me know if that's ok with you (or if you want me to just remove it)?

Contributor

Let's change it to a FIXME and I'll flag it for the g-software folks to take a closer look.

Member Author

@mna mna Feb 5, 2025

On second thought, is it really "to fix" or is your change the actual fix? I.e. is it just to raise awareness that this was changed, or does that current implementation need to be changed? I'd remove the comment altogether if there's not really anything to fix.

Contributor

@mna, you're right, it was more so to raise awareness of the change in case I might have missed something.

@jahzielv, do you mind giving this a quick sanity check? There's a similar note regarding VPP to check here.

Contributor

@gillespi314 so sorry, I totally missed this notification! I can take a look

@@ -1089,7 +1089,7 @@ WHERE
func (ds *Datastore) GetSummaryHostSoftwareInstalls(ctx context.Context, installerID uint) (*fleet.SoftwareInstallerStatusSummary, error) {
var dest fleet.SoftwareInstallerStatusSummary

- // TODO(uniq): AFAICT we don't have uniqueness for host_id + title_id in upcoming or
+ // TODO(Sarah): AFAICT we don't have uniqueness for host_id + title_id in upcoming or
Member Author

Same here, and for a couple more TODOs below: @gillespi314 I changed the TODO to you since it's not related to the unified queue, let me know if that's ok with you (or if you want me to just remove it)? The load tests should (hopefully) tell us if we need to improve some queries.

Contributor

This one seems a bit more related to UniQ insofar as we probably want the ordering of activities to be consistent with other queries. What if we change these to use the same group-wise max approach we're using elsewhere?

Member Author

Ah I see. Yeah, I think once the row is in host_software_installs or host_script_results, it doesn't matter much (the MAX(id) is as good as before) because the upcoming activities will create those rows in the correct order; it matters more when querying upcoming_activities. Also, while looking at those uniqueness/groupwise max comments, I think I may have found an issue in the software status queries?

When the filter is for non-pending, we do this:

// for non-pending statuses, we'll join through host_software_installs filtered by the status
statusFilter := "hsi.status = :status"
if status == fleet.SoftwareFailed {
    // failed is a special case, we must include both install and uninstall failures
    statusFilter = "hsi.status IN (:installFailed, :uninstallFailed)"
}
// TODO(Sarah): AFAICT we don't have uniqueness for host_id + title_id in upcoming or
// past activities. In the past the max(id) approach was "good enough" as a proxy for the most
// recent activity since we didn't really need to worry about the order of activities.
// Moving to a time-based approach would be more accurate but would require a more complex and
// potentially slower query.
stmt := fmt.Sprintf(`JOIN (
SELECT
    host_id
FROM
    host_software_installs hsi
WHERE
    software_title_id = :title_id
    AND hsi.id IN(
        SELECT
            max(id) -- ensure we use only the most recent install attempt for each host
        FROM host_software_installs
        WHERE
            host_id = hsi.host_id
            AND software_title_id = :title_id
            AND removed = 0
        GROUP BY
            host_id, software_title_id)
    AND %s) hss ON hss.host_id = h.id
`, statusFilter)

But I think it should not consider any of those rows if there's also a pending request in upcoming_activities, right? I.e. if a software has an installed and a failed status in host_software_installs and a pending one in upcoming_activities, the only status that matters for the filter is the latest, so it is pending?

Member Author

I think I'll address both points in a subsequent PR (use the left join approach for the groupwise max, and fix the status filter). Let me know if I'm mistaken and its current behavior is correct, though!

Contributor

> But I think it should not consider any of those rows if there's also a pending request in upcoming_activities, right?

Doesn't the early return at line 1271 cover it?

Or are you saying that when filtering on a terminal status (e.g., installed), if a host has a matching record for a prior install, the host should always be excluded from the results if there's an entry in upcoming activities? TBH I'm not very clear on what is expected in that case. What you're suggesting makes sense if pending is always supposed to take precedence over prior activity. Does it depend on the specific use case (i.e. for "last install" we always give preference to upcoming, but maybe something else when filtering for a specific status)?

Member Author

Yeah, the latter: if I filter on installed but the software has a pending entry in addition to a previous install, I think its latest status is the one we want to filter on, so it should not be returned.

That's how it worked before, it always considers the latest install attempt, regardless of status:
https://github.com/fleetdm/fleet/blob/main/server/datastore/mysql/software_installers.go#L997-L1014

Member Author

Oh wait, no I misread that... I think it does return the host if it has the requested status, it's just that we take the latest entry with that status.

Member Author

Ah no, sorry again :D It's as I said before, it only considers the latest attempt, and then it returns the host if that attempt's status is the one requested by the filter. A bit hairy but yeah, that's the behavior to reproduce so I'll update that query accordingly.
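
The behavior described in this last comment (take only the latest attempt per host, then keep the host if that attempt's status matches the filter) can be sketched in plain Go as follows. This is an in-memory illustration only: the `installAttempt` type, the `hostsWithLatestStatus` helper, and the status strings are hypothetical stand-ins, and the highest ID is assumed to be the most recent attempt, as in the max(id) approach discussed above.

```go
package main

import "fmt"

// installAttempt is a hypothetical in-memory stand-in for one install
// attempt row; the real tables have more columns.
type installAttempt struct {
	ID     uint
	HostID uint
	Status string
}

// hostsWithLatestStatus returns the IDs of hosts whose most recent attempt
// (highest ID) has the wanted status. Older attempts are ignored, even if
// their status matches.
func hostsWithLatestStatus(attempts []installAttempt, want string) []uint {
	latest := make(map[uint]installAttempt)
	var order []uint // hosts in first-seen order, for stable output
	for _, a := range attempts {
		cur, seen := latest[a.HostID]
		if !seen {
			order = append(order, a.HostID)
		}
		if !seen || a.ID > cur.ID {
			latest[a.HostID] = a
		}
	}

	var hosts []uint
	for _, h := range order {
		if latest[h].Status == want {
			hosts = append(hosts, h)
		}
	}
	return hosts
}

func main() {
	attempts := []installAttempt{
		{ID: 1, HostID: 10, Status: "installed"},
		{ID: 2, HostID: 10, Status: "pending"}, // newer attempt wins for host 10
		{ID: 3, HostID: 20, Status: "failed"},
		{ID: 4, HostID: 20, Status: "installed"},
	}
	fmt.Println(hostsWithLatestStatus(attempts, "installed")) // prints [20]
	fmt.Println(hostsWithLatestStatus(attempts, "pending"))   // prints [10]
}
```

Host 10 has an earlier installed attempt, but it is not returned for the installed filter because its latest attempt is pending.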

ua.activity_type = 'script' AND
(
ua.payload->'$.sync_request' = 0 OR
ua.created_at >= DATE_SUB(NOW(), INTERVAL ? SECOND)
Member Author

Sync scripts are treated as any other script in the upcoming queue: #22866 (comment)

@@ -117,9 +117,9 @@ func (ds *Datastore) GetSummaryHostVPPAppInstalls(ctx context.Context, teamID *u
// Currently there is no host_deleted_at in host_vpp_software_installs, so
// not handling it as part of the unified queue work.

- // TODO(uniq): refactor vppHostStatusNamedQuery to use the same logic as below
+ // TODO(sarah): refactor vppHostStatusNamedQuery to use the same logic as below
Member Author

@gillespi314 I'm not sure what you mean here, the vppAppHostStatusNamedQuery helper computes the status the same way AFAICT and it is still used in the ListHostSoftware mega-query.

Contributor

I was mainly concerned with ensuring consistency with whatever approach we ended up with regarding the group-wise max, and whether we're using timestamp comparisons or staying with a purely id-based approach, so feel free to remove this if you think we're good there.

@mna mna changed the title from "Mna 23916 fix failing tests 12" to "Unified Queue: fix failing tests and address TODOs" Feb 5, 2025

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 93.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 63.85%. Comparing base (18b7492) to head (c85816b).
Report is 1 commits behind head on feat-upcoming-activites-queue.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/datastore/mysql/activities.go | 90.00% | 0 Missing and 1 partial ⚠️ |
| server/datastore/mysql/scripts.go | 92.85% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@                        Coverage Diff                        @@
##           feat-upcoming-activites-queue   #26067      +/-   ##
=================================================================
+ Coverage                          63.52%   63.85%   +0.33%     
=================================================================
  Files                               1631     1632       +1     
  Lines                             155652   157625    +1973     
  Branches                            4061     4061              
=================================================================
+ Hits                               98878   100655    +1777     
- Misses                             48951    49082     +131     
- Partials                            7823     7888      +65     

| Flag | Coverage Δ |
|---|---|
| backend | 64.70% <93.33%> (+0.35%) ⬆️ |


@mna mna marked this pull request as ready for review February 5, 2025 14:14
@mna mna requested a review from a team as a code owner February 5, 2025 14:14
Contributor

@gillespi314 gillespi314 left a comment

LGTM! Feel free to address any of my comments below in follow-up PRs :)

@@ -695,7 +695,7 @@ func (ds *Datastore) applyHostLabelFilters(ctx context.Context, filter fleet.Tea
return "", nil, ctxerr.Wrap(ctx, err, "get software installer metadata by team and title id")

default:
- // TODO(uniq): prior code was joining on installer id but based on how list options are parsed [1] it seems like this should be the title id
+ // TODO(Sarah): prior code was joining on installer id but based on how list options are parsed [1] it seems like this should be the title id
Contributor

Let's change it to a FIXME and I'll flag it for the g-software folks to take a closer look.

@@ -1089,7 +1089,7 @@ WHERE
func (ds *Datastore) GetSummaryHostSoftwareInstalls(ctx context.Context, installerID uint) (*fleet.SoftwareInstallerStatusSummary, error) {
var dest fleet.SoftwareInstallerStatusSummary

- // TODO(uniq): AFAICT we don't have uniqueness for host_id + title_id in upcoming or
+ // TODO(Sarah): AFAICT we don't have uniqueness for host_id + title_id in upcoming or
Contributor

This one seems a bit more related to UniQ insofar as we probably want the ordering of activities to be consistent with other queries. What if we change these to use the same group-wise max approach we're using elsewhere?

@@ -117,9 +117,9 @@ func (ds *Datastore) GetSummaryHostVPPAppInstalls(ctx context.Context, teamID *u
// Currently there is no host_deleted_at in host_vpp_software_installs, so
// not handling it as part of the unified queue work.

- // TODO(uniq): refactor vppHostStatusNamedQuery to use the same logic as below
+ // TODO(sarah): refactor vppHostStatusNamedQuery to use the same logic as below
Contributor

I was mainly concerned with ensuring consistency with whatever approach we ended up with regarding the group-wise max, and whether we're using timestamp comparisons or staying with a purely id-based approach, so feel free to remove this if you think we're good there.

@mna mna merged commit 0710c7f into feat-upcoming-activites-queue Feb 5, 2025
31 of 32 checks passed
@mna mna deleted the mna-23916-fix-failing-tests-12 branch February 5, 2025 19:20
3 participants