Batch aggregation #4303
Conversation
Looks fine. I had a couple of non-blocking comments and questions.
Since this feature has pretty much been siloed to just you, Cliff (and maybe Haley and Coleman?), I think it's worth adding more docs or comments on the PR so any one of us can go back and get context without losing time.
Some things:
- I think it's worth documenting the flow of this to aggregations and back.
- Semi-related to the above: I'm currently not entirely sure what the difference is between a task id vs. the uuid vs. the aggregation's id. Maybe adding a description on the schema, or more description on the PR, would be helpful?
- Other things I'm wondering about: from what I understand of the new aggregations model, it is 1 aggregation per user per workflow.
- What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on Panoptes get cleaned up after a certain point? Do we manually have to remove the aggregation record from the db so that a re-run can be done? Is it even a thing to "re-run"?)
I had some trouble with Doorkeeper scopes when I was testing this API outside of the specs (not Pundit scopes, but Doorkeeper's own). Nothing else in the app scopes solely by workflow, and just adding to the optional scopes wasn't enough. Rather than keep fighting it, I switched the base scope for the Aggregation resource back to Project and added that column to the table. That's all the column is used for so far, but the relation is logical and Doorkeeper is happy.
Added a page to the docs for the new model. @lcjohnso I think you wanted a ping to have a look.
Included this info in the new docs. The task_id comes from Celery via Aggregation and represents the running background task. This id can be used to check on the status of a running job via the aggregation API and to see if it's been completed. This could be implemented by the front end as a way to check the status of pending, long-running jobs. The UUID is generated by the aggregation run and used in filenames and folders on blob storage. It'll be used to construct a URL that the files will be downloaded directly from.
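To make the two identifiers above concrete, here is a minimal, hypothetical client-side sketch. The field names (`task_id`, `uuid`, `status`) come from this thread, but the class itself is illustrative and not part of the codebase:

```ruby
# Hypothetical helper showing how a front end could interpret an
# aggregation resource. task_id identifies the Celery background job
# (poll the aggregation API with it); uuid names the files/folders on
# blob storage and so forms part of the eventual download URL.
class AggregationStatus
  TERMINAL_STATES = %w[completed failed].freeze

  def initialize(resource)
    @resource = resource
  end

  # Celery task id for the running background job.
  def task_id
    @resource[:task_id]
  end

  # Storage prefix derived from the run's UUID.
  def storage_prefix
    @resource[:uuid]
  end

  # True once the job has reached a terminal state.
  def finished?
    TERMINAL_STATES.include?(@resource[:status])
  end
end
```

A front end could poll the resource until `finished?` is true, then build the download URL from the UUID.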
Yes, aggregations are unique per user & workflow. And true, this was missing a way to re-run the aggregation; what makes the most sense? The FE checks for an existing aggregation resource via the API. Talk actually does clean up data requests after a day and blocks early re-run attempts, but I don't love that either. For now, I think this functionality is enough to work with, but maybe I could add some kind of shortcut (via a route, a POST body short circuit, or something else?) to reduce the required FE calls. Thoughts?
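One possible shape for that re-run guard, mirroring the Talk behaviour mentioned above (requests cleaned up after a day, early re-runs blocked). The class name, method name, and record shape are all hypothetical:

```ruby
# Hypothetical guard deciding whether an aggregation may be re-run:
# a fresh existing record blocks the re-run; a stale one (or none)
# allows it, at which point the old record could be replaced.
class RerunPolicy
  CUTOFF_SECONDS = 24 * 60 * 60 # one day, matching Talk's cleanup window

  # existing is nil or a hash with :created_at (a Time)
  def self.rerunnable?(existing, now: Time.now)
    return true if existing.nil?                 # nothing to replace
    now - existing[:created_at] > CUTOFF_SECONDS # old enough to redo
  end
end
```

Whether the cutoff lives in the policy, the controller, or a background cleanup job is exactly the design question raised above.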
```ruby
    end
  end

  context 'when there is an existing aggregation for that workflow' do
    let!(:existing_agg) { create(:aggregation, workflow: workflow) }
```
RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.
```ruby
  describe '#index' do
    let(:other_workflow) { create(:workflow) }
    let!(:aggregations) { create(:aggregation, workflow: workflow) }
    let!(:private_resource) { create(:aggregation, workflow: other_workflow) }
```
RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.
```ruby
  describe '#index' do
    let(:other_workflow) { create(:workflow) }
    let!(:aggregations) { create(:aggregation, workflow: workflow) }
```
RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.
Docs look good -- thanks! One question regarding the existing aggregations table: I see you're removing the main
The plan looks like this:
I added the extra tasks to the board and plan to get them done this week.
```ruby
# THIS FUNCTIONALITY IS BEING DEPRECATED EFFECTIVE IMMEDIATELY
# A REPLACEMENT IS FORTHCOMING.
class Api::V1::AggregationsController < Api::ApiController
```
Style/ClassAndModuleChildren: Use nested module/class definitions instead of compact style.
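For reference, the nested form this cop asks for looks like the following (with a stub base class so the snippet stands alone; in the app the superclass is the real `Api::ApiController`):

```ruby
# Stub so the sketch runs standalone; not part of the codebase.
module Api
  class ApiController; end
end

# Nested module/class definitions, as Style/ClassAndModuleChildren
# prefers over the compact `class Api::V1::AggregationsController` form:
module Api
  module V1
    class AggregationsController < ApiController
    end
  end
end
```

The nested style makes constant lookup lexical (here `ApiController` resolves through the enclosing `Api` module), which is the cop's main rationale.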
```ruby
      uuid: '557ebcfa3c29496787336bfbd7c4d856',
      status: 'completed'
    }
  }
```
Layout/MultilineHashBraceLayout: Closing hash brace must be on the same line as the last hash element when opening brace is on the same line as the first hash element.
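One compliant layout for a hash like the one flagged above (the variable name is illustrative; values are taken from the snippet): when the opening brace shares a line with the first element, the closing brace goes on the last element's line.

```ruby
# Laid out the way Layout/MultilineHashBraceLayout expects:
response_body = { uuid: '557ebcfa3c29496787336bfbd7c4d856',
                  status: 'completed' }
```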
This is ready to merge. I truncated the aggregations table:

```
irb(main):005:0> Aggregation.count
  [primary] (55.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 1
irb(main):007:0> Aggregation.connection.truncate(Aggregation.table_name)
  [primary] (777.3ms)  TRUNCATE TABLE "aggregations"
=> #<PG::Result:0x00007fad5f56b968 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=0>
irb(main):008:0> Aggregation.count
  [primary] (65.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 0
```

I plan on doing the same thing to production before this gets deployed. In the end, this feels like a data operation rather than a schema change, and not something that should exist as a migration or a repeatable rake task. Truncate is the fastest way to remove the data, which will be important on prod, where there are ~1.2M rows (the newest from 2017).
Adds the necessary pieces to implement batch aggregation. In total, they should include:
- Aggregation table migration
- Client to connect to the aggregation service
- Refactors of: model, factory, controller, schemas, policy, serializer
- Specs of all the above
This is ready for review.
Review checklist
apiary.apib file?