Implement fully parallel upload processing #658

Merged 1 commit into main from swatinem/fully-parallel on Oct 4, 2024

Contributor

@Swatinem Swatinem commented Aug 29, 2024

This adds another variant to the PARALLEL_PROCESSING feature/rollout flag which prefers the parallel upload processing pipeline over running it as an experiment.

Upload Processing can run in essentially 4 modes:

  • Completely serial processing
  • Serial processing, but running "experiment" code (EXPERIMENT_SERIAL):
    • In this mode, the final (is_final) UploadProcessor task saves a copy
      of the final report for later verification.
  • Parallel processing, but running "experiment" code (EXPERIMENT_PARALLEL):
    • In this mode, another parallel set of UploadProcessor tasks runs after
      the main set of tasks.
    • These tasks do not persist any of their results in the database;
      instead, the final UploadFinisher task launches the ParallelVerification task.
  • Fully parallel processing (PARALLEL):
    • In this mode, the final UploadFinisher task is responsible for merging
      the final report and persisting it.
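
The modes above can be modeled with two small enums. This is only a sketch: `ParallelFeature` and its `EXPERIMENT`/`SERIAL` members appear in the review snippets further down, but the `PARALLEL` member, the `ProcessingMode` class name, the string values, and both helper properties are assumptions made for illustration.

```python
from enum import Enum


class ParallelFeature(Enum):
    """Tri-state rollout flag value (the PARALLEL member and all values are assumed)."""

    SERIAL = "serial"
    EXPERIMENT = "experiment"
    PARALLEL = "parallel"


class ProcessingMode(Enum):
    """The four modes a task can run in (hypothetical class name)."""

    SERIAL = "serial"
    EXPERIMENT_SERIAL = "experiment_serial"
    EXPERIMENT_PARALLEL = "experiment_parallel"
    PARALLEL = "parallel"

    @property
    def persists_results(self) -> bool:
        # Per the description, only the experimental parallel tasks
        # skip writing their results to the database.
        return self is not ProcessingMode.EXPERIMENT_PARALLEL

    @property
    def finisher_merges_report(self) -> bool:
        # Only in fully parallel mode does UploadFinisher merge and persist.
        return self is ProcessingMode.PARALLEL


def initial_mode(feature: ParallelFeature) -> ProcessingMode:
    """Mode of the first set of UploadProcessor tasks for a given flag value."""
    return {
        ParallelFeature.SERIAL: ProcessingMode.SERIAL,
        # The experiment starts with the normal serial run, tagged for verification.
        ParallelFeature.EXPERIMENT: ProcessingMode.EXPERIMENT_SERIAL,
        ParallelFeature.PARALLEL: ProcessingMode.PARALLEL,
    }[feature]
```

Keeping the flag value and the per-task mode as separate types means a task only ever has to branch on its own mode, not on the rollout flag.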

An example Task chain might look like this, in "experiment" mode:

  • Upload
    • UploadProcessor
      • UploadProcessor
        • UploadProcessor (EXPERIMENT_SERIAL (the final one))
          • UploadFinisher
            • UploadProcessor (EXPERIMENT_PARALLEL)
            • UploadProcessor (EXPERIMENT_PARALLEL)
            • UploadProcessor (EXPERIMENT_PARALLEL)
              • UploadFinisher (EXPERIMENT_PARALLEL)
                • ParallelVerification
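
The last step of that chain compares the report saved by the serial run with the one produced by the experimental parallel run. A rough sketch of such a comparison follows; the report shape (a dict mapping file path to a list of per-line hit counts) and the `diff_reports` name are assumptions, and the real `ParallelVerification` task may compare quite different things.

```python
def diff_reports(serial: dict[str, list[int]], parallel: dict[str, list[int]]) -> list[str]:
    """Return a list of human-readable mismatches between two reports.

    Both arguments are assumed to map a file path to its per-line hit counts.
    An empty result means the two reports agree.
    """
    mismatches = []
    for path in sorted(serial.keys() | parallel.keys()):
        if path not in parallel:
            mismatches.append(f"{path}: missing from parallel report")
        elif path not in serial:
            mismatches.append(f"{path}: missing from serial report")
        elif serial[path] != parallel[path]:
            mismatches.append(f"{path}: line coverage differs")
    return mismatches
```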

The PARALLEL mode looks like this:

  • Upload
    • UploadProcessor (PARALLEL)
    • UploadProcessor (PARALLEL)
    • UploadProcessor (PARALLEL)
      • UploadFinisher (PARALLEL)
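
The shape of this chain is a fan-out over independent processors followed by a single merge step. The sketch below imitates that with a thread pool rather than the project's real task queue. All function names and the report shape (file path mapped to covered line numbers) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_upload(upload: dict[str, list[int]]) -> dict[str, set[int]]:
    """Stand-in for one UploadProcessor: builds an intermediate report, persists nothing."""
    return {path: set(lines) for path, lines in upload.items()}


def finish(intermediate: list[dict[str, set[int]]]) -> dict[str, list[int]]:
    """Stand-in for UploadFinisher: merges all intermediate reports into the final one."""
    merged: dict[str, set[int]] = {}
    for report in intermediate:
        for path, lines in report.items():
            merged.setdefault(path, set()).update(lines)
    return {path: sorted(lines) for path, lines in sorted(merged.items())}


def run_parallel(uploads: list[dict[str, list[int]]]) -> dict[str, list[int]]:
    # Fan out: each upload is processed independently of the others.
    with ThreadPoolExecutor() as pool:
        intermediate = list(pool.map(process_upload, uploads))
    # Fan in: a single finisher merges and would persist the final report.
    return finish(intermediate)
```

Because the merge happens in exactly one place, the processors never need to coordinate with each other, which is the main difference from the serial chain.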

@Swatinem Swatinem self-assigned this Aug 29, 2024
@codecov-notifications

codecov-notifications bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 99.21260% with 1 line in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/upload.py 94.73% 1 Missing ⚠️

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #658   +/-   ##
=======================================
  Coverage   98.02%   98.02%           
=======================================
  Files         437      438    +1     
  Lines       36313    36389   +76     
=======================================
+ Hits        35597    35672   +75     
- Misses        716      717    +1     
Flag Coverage Δ
integration 98.02% <99.21%> (+<0.01%) ⬆️
unit 98.02% <99.21%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 95.89% <99.18%> (+0.01%) ⬆️
OutsideTasks 98.03% <100.00%> (+<0.01%) ⬆️
Files with missing lines Coverage Δ
helpers/parallel.py 100.00% <100.00%> (ø)
services/report/__init__.py 97.20% <100.00%> (+0.02%) ⬆️
tasks/tests/integration/test_upload_e2e.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_processing_task.py 100.00% <100.00%> (ø)
tasks/upload_finisher.py 96.00% <100.00%> (+0.40%) ⬆️
tasks/upload_processor.py 99.38% <100.00%> (ø)
tasks/upload.py 96.10% <94.73%> (-0.18%) ⬇️

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 99.21260% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.02%. Comparing base (c97132d) to head (3ef30d7).
Report is 1 commits behind head on main.

✅ All tests successful. No failed tests found.


@Swatinem Swatinem force-pushed the swatinem/fully-parallel branch 3 times, most recently from 7635506 to b9f675a Compare September 10, 2024 11:10
@Swatinem Swatinem marked this pull request as ready for review September 10, 2024 11:10
@Swatinem Swatinem requested a review from a team September 10, 2024 11:10
@Swatinem Swatinem force-pushed the swatinem/fully-parallel branch from b9f675a to f1ab443 Compare September 10, 2024 11:37
Contributor

@michelletran-codecov michelletran-codecov left a comment

Just a few comments

tasks/upload_processor.py Outdated Show resolved Hide resolved
tasks/upload_processor.py Outdated Show resolved Hide resolved
tasks/upload_processor.py Outdated Show resolved Hide resolved
helpers/parallel.py Outdated Show resolved Hide resolved
tasks/upload.py Outdated Show resolved Hide resolved
helpers/parallel.py Outdated Show resolved Hide resolved
Contributor

@matt-codecov matt-codecov left a comment

i think this is somewhat cleaner than the existing harness but to be honest i still feel a little uncomfortable with all the if/else branches peppered around to copy something here and skip writing something there. it feels too easy to accidentally break real processing or leave side-effects that real users will be able to see

the approach i imagine would be simpler would be a separate task that either runs nightly and chooses a batch of N commits, or is scheduled as a followup after X% of finisher tasks. this task would fetch completed report JSONs and use the sessions list from them to reconstruct UploadTask arguments but with dummy commits/repos owned by Codecov plugged in. one dummy repo would be overridden into the expt and the other overridden out of it. we run the identical task arguments for each repo and compare the results

with that approach, any and all copying/staging we need to do for verification can happen in one place, and there's little to no risk of our test procedure accidentally breaking things for production users or accidentally leaving side-effects that they can see. there's nothing to clean up when transitioning from validation to running the actual experiment, it's just a Feature with a test and control group. it doesn't faithfully reproduce carryforward inheritance, but CFF is all settled before anything changes for parallel processing anyway. i think the main downside is having to suppress GitHub API errors because our dummy repos probably won't have unique authentic commits/PRs for each batch of tasks we want to test

out of steam for the day but will see your thoughts tomorrow

helpers/parallel.py Outdated Show resolved Hide resolved
services/report/__init__.py Outdated Show resolved Hide resolved
Comment on lines -1052 to -799
# this should be enabled for the actual rollout of parallel upload processing.
# if PARALLEL_UPLOAD_PROCESSING_BY_REPO.check_value(
#     "this should be the repo id"
# ):
#     upload_obj.state_id = UploadState.PARALLEL_PROCESSED.db_id
#     upload_obj.state = "parallel_processed"
# else:
Contributor

if you haven't found it, this enum value is what this commented out block is about

a state of PROCESSED implies the upload's data will be found if you call get_existing_report_for_commit(). a state of PARALLEL_PROCESSED indicates UploadProcessorTask has finished but UploadFinisherTask has not gotten to it yet. don't remember if the distinction mattered

fully forgot about this bit

@Swatinem
Contributor Author

“to be honest i still feel a little uncomfortable with all the if/else branches peppered around to copy something here and skip writing something there.”

I totally agree with this. I’m tempted to just create a new task for parallel processing which removes all the code related to handling multiple uploads in one chunk, and have ideas for further simplification ahead of time.

@Swatinem Swatinem marked this pull request as draft September 12, 2024 08:32
@matt-codecov
Contributor

i think some of the brittleness is inherent to the "kick off parallel tasks but copy all the inputs and then skip saving the outputs" approach to verification, but i'd be happy to be proved wrong haha. my suggested alternative requires us to handle any GH request failure non-fatally which may be easier said than done

there's a lot in upload_processor.py that could be reused, but you're right that we'll have to clean up the multi-upload batch stuff sooner or later and it's easier to reason about the parallel implementation if we do it sooner

i should have said this in my initial comment but: i can't see any specific problems in the PR apart from the edge case with IDs which only matters for comparison with serial results, and that was already there. i think this is all logically correct, and less fragile than it was before. i am excited to see this PR and for this project to get some momentum

@Swatinem Swatinem force-pushed the swatinem/fully-parallel branch 2 times, most recently from 739b6e5 to ad04519 Compare October 1, 2024 12:13
@Swatinem
Contributor Author

Swatinem commented Oct 1, 2024

I updated this PR yet again, with the following changes:

  • Switched the PARALLEL_UPLOAD_PROCESSING_BY_REPO option to a tri-state flag
  • Introduced two enums, one wrapping that feature flag, the other managing the 4 states that the various tasks can be in
  • Otherwise I kept the logic mostly as is, which also means that there are still tons of ifs scattered all around.

To be quite honest, I think just keeping the various ifs littered around is preferable to duplicating all this logic.
Once parallel is fully enabled, there should be a lot of stuff ready to be cleaned up.

One thing that I would still have to take care of is the migration path. Rolling out the feature flag currently has a direct effect on already scheduled tasks, which should be avoided.
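
One possible way to avoid that, sketched here and not necessarily what this PR ends up doing, is to read the flag exactly once when the tasks are scheduled and pass the result along as a task argument. The function names and the dict-based task representation are hypothetical.

```python
from typing import Callable


def schedule_upload_tasks(upload_ids: list[int], read_flag: Callable[[], str]) -> list[dict]:
    """Build the task list for one batch, pinning the mode at scheduling time."""
    mode = read_flag()  # the only place the rollout flag is consulted
    tasks = [
        {"task": "UploadProcessor", "upload_id": upload_id, "mode": mode}
        for upload_id in upload_ids
    ]
    tasks.append({"task": "UploadFinisher", "mode": mode})
    return tasks


def run_task(task: dict) -> str:
    """Tasks trust the mode they were scheduled with and never re-read the flag."""
    return f"{task['task']} running in {task['mode']} mode"
```

With this shape, flipping the flag only affects batches scheduled afterwards; tasks already in the queue finish in the mode they started with.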

@Swatinem Swatinem marked this pull request as ready for review October 1, 2024 12:31
Contributor

@michelletran-codecov michelletran-codecov left a comment

Generally LGTM. I think I would also wait for @matt-codecov 's review/approval as he has more context into this code.

tasks/upload.py Outdated Show resolved Hide resolved
@Swatinem Swatinem force-pushed the swatinem/fully-parallel branch from ad04519 to 7660ffc Compare October 2, 2024 14:17
Comment on lines +610 to +613
if parallel_feature is ParallelFeature.EXPERIMENT and delete_archive_setting(
    commit_yaml
):
    parallel_feature = ParallelFeature.SERIAL
Contributor

do you happen to know why this setting should disable the experiment? is it a problem for the fully parallel mode?

Contributor Author

I changed the relevant code to avoid creating a copy of the upload, in favor of just using the upload as it exists, provided it does exist and is not being deleted :-)

Parallel processing does not have that problem, as it only has a single task processing (and deleting) a raw upload.
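
To make the reasoning concrete: the experiment processes each raw upload twice, so it cannot run when the first pass deletes the raw upload, while fully parallel mode processes it only once. Below is a self-contained sketch of the check quoted above. The `delete_archive_setting` body and the `delete_archive` key are stand-ins (the real helper's lookup is not shown in this thread), and the `PARALLEL` member and `effective_feature` name are assumed.

```python
from enum import Enum


class ParallelFeature(Enum):
    SERIAL = "serial"
    EXPERIMENT = "experiment"
    PARALLEL = "parallel"  # assumed member name for the fully parallel variant


def delete_archive_setting(commit_yaml: dict) -> bool:
    # Hypothetical stand-in: the real helper reads the repo's YAML config,
    # but the exact key it looks up is not shown in this discussion.
    return bool(commit_yaml.get("delete_archive", False))


def effective_feature(feature: ParallelFeature, commit_yaml: dict) -> ParallelFeature:
    """Fall back to serial when the experiment could not re-read the raw upload."""
    if feature is ParallelFeature.EXPERIMENT and delete_archive_setting(commit_yaml):
        return ParallelFeature.SERIAL
    # Fully parallel mode is unaffected: a single task processes (and deletes) the upload.
    return feature
```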

Comment on lines 609 to 610
# When we are fully parallel, we need to update the `Upload` in the database
# with the final session_id (aka `order_number`).
Contributor

is this taking the place of the PARALLEL_PROCESSED upload state daniel had?

Contributor Author

not quite. I will discuss these various states a bit more and figure out a good way to go there.

Contributor Author

But good that you called this out, I found another bug related to the new code from #745 not being ported to this PR yet, which I now did.

@Swatinem Swatinem force-pushed the swatinem/fully-parallel branch from 7660ffc to 9169799 Compare October 3, 2024 07:50
@Swatinem Swatinem added this pull request to the merge queue Oct 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 3, 2024
@Swatinem Swatinem force-pushed the swatinem/fully-parallel branch from 9169799 to 3ef30d7 Compare October 4, 2024 07:30
@Swatinem Swatinem added this pull request to the merge queue Oct 4, 2024
Merged via the queue into main with commit 98629d2 Oct 4, 2024
25 of 27 checks passed
@Swatinem Swatinem deleted the swatinem/fully-parallel branch October 4, 2024 07:50