Processing Datasets that Span Multiple Data Collection Runs #88
Could you elaborate a bit more on what the race condition is here?
Building accumulated data file in S3 from contents in multiple bundles. For instance, creating a project metadata TSV:
This is also O(N^2)
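One way to read the O(N^2) concern, as a minimal sketch: assume each of the N bundles in a project arrives with its own notification, and each notification triggers a full rebuild of the accumulated file that re-reads every bundle seen so far. The function names below are hypothetical and are not part of the RFC or any data store API; they only count bundle reads.

```python
def total_reads_full_rebuild(n_bundles: int) -> int:
    """Count bundle reads when every arrival triggers a full rebuild.

    Assumption: the k-th arrival re-reads all k bundles seen so far.
    """
    reads = 0
    for bundles_seen in range(1, n_bundles + 1):
        reads += bundles_seen
    return reads  # equals n * (n + 1) / 2, i.e. O(N^2)


def total_reads_incremental(n_bundles: int) -> int:
    """Count bundle reads when each arrival only appends its own rows.

    Assumption: the accumulated file can be extended without re-reading
    earlier bundles, so each bundle is read exactly once.
    """
    return n_bundles  # O(N)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, total_reads_full_rebuild(n), total_reads_incremental(n))
```

Under these assumptions the quadratic cost comes from repeating the rebuild per bundle, not from the size of any single bundle.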
I think this would be more accurate if it said
the pipeline service needs to subscribe to the project data group...
The difference is that if every pipeline subscribed to "project data group notifications", the same notification would be sent to each pipeline whenever a project is submitted. Each pipeline would then have to inspect the contents to decide whether it should run, which duplicates a lot of work, so a single subscription to data group notifications would be more efficient. However, if there's a way to query the project data group bundles without requesting all of the metadata files from the data store, then one subscription per pipeline would be fine.
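A rough sketch of the duplicated work, for illustration only. The notification shape, the `assay_type` field, and the pipeline names are all hypothetical stand-ins; the only thing being compared is how many times the notification contents get inspected under each subscription layout.

```python
from typing import Dict, List, Set, Tuple

Notification = Dict[str, Dict[str, str]]


def inspect(notification: Notification, counter: Dict[str, int]) -> str:
    """Stand-in for fetching and reading metadata from the data store."""
    counter["inspections"] += 1
    return notification["metadata"]["assay_type"]  # hypothetical field


def one_subscription_per_pipeline(
    notification: Notification, pipelines: Dict[str, Set[str]]
) -> Tuple[List[str], int]:
    """Every pipeline receives the notification and inspects it itself."""
    counter = {"inspections": 0}
    started = [
        name
        for name, accepted in pipelines.items()
        if inspect(notification, counter) in accepted
    ]
    return started, counter["inspections"]


def single_subscription(
    notification: Notification, pipelines: Dict[str, Set[str]]
) -> Tuple[List[str], int]:
    """One subscriber inspects once, then dispatches to matching pipelines."""
    counter = {"inspections": 0}
    assay_type = inspect(notification, counter)
    started = [name for name, accepted in pipelines.items() if assay_type in accepted]
    return started, counter["inspections"]


if __name__ == "__main__":
    pipelines = {"pipeline_a": {"type_1"}, "pipeline_b": {"type_2"}, "pipeline_c": {"type_3"}}
    note = {"metadata": {"assay_type": "type_2"}}
    print(one_subscription_per_pipeline(note, pipelines))
    print(single_subscription(note, pipelines))
```

Both layouts start the same pipeline; the per-pipeline layout performs one inspection per subscribed pipeline, while the single subscription performs one in total.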
Thanks, @samanehsan. The text has been clarified.
I think we can get up to here currently but going beyond requires resolution of the analysis updates RFC.
I think it is very important to think about how this will stretch beyond PROJECT_SUBMISSION. It doesn't need to be high detail but I would need a warm feeling that we are not heading down a development path that can't cater to later requirements.
As below, on reconsideration I don't think this is too important to answer now, as I suspect PROJECT_SUBMISSION is needed in pretty much all circumstances.
Could you consider whether much of this can be implemented by subscriptions to the query service? This would still need to be supplemented by ingest signals of submission but it may allow more flexibility for future requirements, where those can be catered for by subscription requirements rather than new data model entities in ingest, with all the work and coordination that involves.
I suspect it could be implemented by query service events, although it is imagined that the QS will use project data groups to populate the database.
Without understanding more about how QS events will integrate into the system, using the DSS event mechanism seems like a conservative step.
@MDunitz @kislyuk, any thoughts?
I think this is still an important question, but even with this, some submission-finished event (here corresponding to PROJECT_SUBMISSION) would still be necessary. Hence, I don't think it needs to be answered here.