Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch aggregation #4303

Merged
merged 34 commits into from
Jun 13, 2024
Merged

Batch aggregation #4303

merged 34 commits into from
Jun 13, 2024

Conversation

zwolf
Copy link
Member

@zwolf zwolf commented Mar 6, 2024

Adds the necessary pieces to implement batch aggregation. In total, they should include:

Aggregation table migration
Client to connect to aggregation service
Refactors of: model, factory, controller, schemas, policy, serializer
Specs of all the above

This is ready for review.

Review checklist

  • First, the most important one: is this PR small enough that you can actually review it? Feel free to just reject a branch if the changes are hard to review due to the length of the diff.
  • If there are any migrations, will they the previous version of the app work correctly after they've been run (e.g. the don't remove columns still known about by ActiveRecord).
  • If anything changed with regards to the public API, are those changes also documented in the apiary.apib file?
  • Are all the changes covered by tests? Think about any possible edge cases that might be left untested.

@zwolf zwolf marked this pull request as draft March 6, 2024 22:18
@zwolf zwolf marked this pull request as ready for review March 28, 2024 15:01
@zwolf zwolf changed the title [WIP] Batch aggregation Batch aggregation Mar 28, 2024
@zwolf zwolf requested a review from yuenmichelle1 March 28, 2024 16:10
Copy link
Collaborator

@yuenmichelle1 yuenmichelle1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. I had a couple non blocking comments and questions.
Since this feature has pretty much been siloed to just you, Cliff, (maybe Haley and Coleman?).
I think its worth adding more docs or comments on the PR so any one of us can go back and save some time on getting context.

Some things:

  • I think its worth documenting the flow of this to aggregations and back.
  • Semi related to ^, I'm currently not entirely sure of what the difference is between a task id vs the uuid vs the aggregations id. Maybe adding a description on the schema or more description on PR would be helpful?
  • Other thing/s I am wondering about: From what I understand of the new aggregations model, it is 1 aggregation per user per workflow.
    - What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on panoptes get "cleaned" up after a certain point? Do we manually have to remove the aggregation record from the db so that re-run can be done? Is it even a thing to "re-run"?)

app/models/aggregation.rb Outdated Show resolved Hide resolved
spec/policies/aggregation_policy_spec.rb Outdated Show resolved Hide resolved
spec/policies/aggregation_policy_spec.rb Outdated Show resolved Hide resolved
spec/policies/aggregation_policy_spec.rb Outdated Show resolved Hide resolved
app/schemas/aggregation_create_schema.rb Show resolved Hide resolved
@zwolf
Copy link
Member Author

zwolf commented May 28, 2024

I had some trouble with Doorkeeper scopes when I was testing this API outside of the specs (not Pundit scopes, but Doorkeeper's own). Nothing else in the app scopes solely by workflow and just adding to the optional scopes wasn't enough. Rather than keep fighting it, I switched the base scope for the Aggregation resource back to Project and added that column to the table--that's all it's used for so far, but the relation is logical and Doorkeeper is happy.

I think its worth documenting the flow of this to aggregations and back.

Added a page to the docs for the new model. @lcjohnso I think you wanted a ping to have a look.

Semi related to ^, I'm currently not entirely sure of what the difference is between a task id vs the uuid vs the aggregations id. Maybe adding a description on the schema or more description on PR would be helpful?

Included this info in the new docs. The task_id comes from Celery via Aggregation and represents the running background task. This id can be used to check on the status of a running job via the aggregation API and to see if it's been completed. This could be implemented by the front end as a way to check the status of pending, long-running jobs. The UUID is generated by the aggregation run and used in filenames and folders on blob storage. It'll be used to construct a URL that the files will be downloaded directly from.

Other thing/s I am wondering about: From what I understand of the new aggregations model, it is 1 aggregation per user per workflow. What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on panoptes get "cleaned" up after a certain point? Do we manually have to remove the aggregation record from the db so that re-run can be done? Is it even a thing to "re-run"?)

Yes, aggregations are unique per user & workflow. And true, this was missing a way to re-run the aggregation--what makes the most sense? The FE checks for an existing aggregation resource via GET api/aggregations?workflow_id=[wf_id]&user_id=[user_id]. I have now implemented DELETE on the controller and added a couple specs to more clearly indicate what the intended behavior is. And though I was trying to avoid making the FE worry about it, the standard approach might be easiest: FE modal checks for existing agg, and if it's completed/failed, a click of "rerun" DELETEs by id, then POSTs to create new one, triggering a new aggregation run.

Talk actually does clean up data requests after a day and blocks early rerun attempts, I don't love that either. For now, I think this functionality is enough to work with, but maybe I could add some kind of shortcut (via route, POST body short circuit, other?) to reduce the required FE calls. Thoughts?

end
end
context 'when there is an existing aggregation for that workflow' do
let!(:existing_agg) { create(:aggregation, workflow: workflow) }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.

describe '#index' do
let(:other_workflow) { create(:workflow) }
let!(:aggregations) { create(:aggregation, workflow: workflow) }
let!(:private_resource) { create(:aggregation, workflow: other_workflow) }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.


describe '#index' do
let(:other_workflow) { create(:workflow) }
let!(:aggregations) { create(:aggregation, workflow: workflow) }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.

@lcjohnso
Copy link
Member

lcjohnso commented Jun 5, 2024

Docs look good -- thanks!

One question regarding the existing aggregations table: I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

@zwolf
Copy link
Member Author

zwolf commented Jun 5, 2024

I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

The plan looks like this:

  • This PR merges and cleans up the references to the aggregations attribute on the workflows table.
  • A new PR runs a migration that actually deletes it
  • The data in the aggregations table is actually deleted. This could happen in a rake task but will likely just be me running a truncate on Postgres.
  • This PR merges and modifies the aggregations table along with everything else.

I added the extra tasks to the board, plan to get them done this week.


# THIS FUNCTIONALITY IS BEING DEPRECATED EFFECTIVE IMMEDIATELY
# A REPLACEMENT IS FORTHCOMING.
class Api::V1::AggregationsController < Api::ApiController
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/ClassAndModuleChildren: Use nested module/class definitions instead of compact style.

uuid: '557ebcfa3c29496787336bfbd7c4d856',
status: 'completed'
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/MultilineHashBraceLayout: Closing hash brace must be on the same line as the last hash element when opening brace is on the same line as the first hash element.

@zwolf
Copy link
Member Author

zwolf commented Jun 13, 2024

This is ready to merge. I truncated the aggregations table on staging in preparation:

irb(main):005:0> Aggregation.count
[primary]   (55.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 1
irb(main):007:0> Aggregation.connection.truncate(Aggregation.table_name)
[primary]   (777.3ms)  TRUNCATE TABLE "aggregations"
=> #<PG::Result:0x00007fad5f56b968 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=0>
irb(main):008:0> Aggregation.count
[primary]   (65.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 0

I plan on doing the same thing to production before this gets deployed. In the end, this feels like a data operation rather than a schema change and not something that should exist as a migration or a repeatable rake task. Truncate is the fastest way to remove the data, which will be important on prod where there are ~1.2M row (newest from 2017).

@zwolf zwolf merged commit 509c773 into master Jun 13, 2024
8 checks passed
@zwolf zwolf deleted the batch-aggregation branch June 13, 2024 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants