[flyteadmin] Refactor panic recovery into middleware #5546

Merged
merged 3 commits into from
Aug 1, 2024

@Sovietaced Sovietaced commented Jul 9, 2024

What changes were proposed in this pull request?

Previously, every gRPC handler recovered from panics on its own. This added repetitive boilerplate to each RPC handler and was fragile to maintain. This pull request introduces recovery middleware that recovers from panics for all RPCs mounted on the RPC server.

This pull request also proposes a change to the panic recovery logic.

I changed the recovery logic so that it logs the panic at the error level instead of the fatal level. Logging at the fatal level would call os.Exit(1), which terminates the program immediately and ungracefully. My suspicion is that this made the existing Prometheus panic metrics effectively useless: Prometheus metrics are polled on an interval, and the server was likely killed before the next poll could happen. (Arguably the panic metrics could be removed now.)

IMO, it's better for high availability to have an RPC server that is alive and sending errors back (and reporting error metrics) than one that gets killed and is unresponsive until Kubernetes decides to boot up another healthy pod. As such, I have changed the behavior to return gRPC INTERNAL status codes instead of terminating the server. I'm open to debating this change so feel free to share your opinion.

How was this patch tested?

Unit tests

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@Sovietaced Sovietaced changed the title from "Recover" to "Add recovery middleware" Jul 9, 2024
@Sovietaced Sovietaced marked this pull request as ready for review July 9, 2024 06:29
@@ -0,0 +1,38 @@
package middleware
Sovietaced (author) commented:

I'm open to changing where this lives but it feels like there should be a middleware package, and any interceptors should move here imo.

katrogan replied:

Sounds good!

codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 54.05405% with 17 lines in your changes missing coverage. Please review.

Project coverage is 36.17%. Comparing base (025296a) to head (ec7ba89).
Report is 129 commits behind head on master.

Files with missing lines Patch % Lines
flyteadmin/pkg/server/service.go 0.00% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5546      +/-   ##
==========================================
+ Coverage   35.89%   36.17%   +0.27%     
==========================================
  Files        1301     1302       +1     
  Lines      109419   109388      -31     
==========================================
+ Hits        39281    39570     +289     
+ Misses      66041    65683     -358     
- Partials     4097     4135      +38     
Flag Coverage Δ
unittests-datacatalog 51.37% <ø> (ø)
unittests-flyteadmin 55.30% <54.05%> (+1.60%) ⬆️
unittests-flytecopilot 12.17% <ø> (ø)
unittests-flytectl 62.28% <ø> (ø)
unittests-flyteidl 7.09% <ø> (ø)
unittests-flyteplugins 53.31% <ø> (ø)
unittests-flytepropeller 41.75% <ø> (ø)
unittests-flytestdlib 55.27% <ø> (ø)


@Sovietaced Sovietaced changed the title from "Add recovery middleware" to "[flyteadmin] Refactor panic recovery into middleware" Jul 9, 2024
@Sovietaced Sovietaced force-pushed the recover branch 5 times, most recently from c81893e to 6c27b60, on July 10, 2024 21:06
Signed-off-by: Jason Parraga <[email protected]>
@Sovietaced Sovietaced requested a review from katrogan July 30, 2024 01:22
@katrogan katrogan left a comment

looks good, thank you for refactoring and the detailed PR explanation!

katrogan commented Aug 1, 2024

cc @eapolinario who is looking into test failures

@eapolinario eapolinario enabled auto-merge (squash) August 1, 2024 22:12
@eapolinario eapolinario merged commit 45e287a into flyteorg:master Aug 1, 2024
51 of 52 checks passed
bgedik pushed a commit to bgedik/flyte that referenced this pull request Aug 15, 2024
* Refactor panic handling to middleware

Signed-off-by: Jason Parraga <[email protected]>

* Remove registration of old panicCounter

Signed-off-by: Jason Parraga <[email protected]>

* Add test coverage

Signed-off-by: Jason Parraga <[email protected]>

---------

Signed-off-by: Jason Parraga <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>
vlibov pushed a commit to vlibov/flyte that referenced this pull request Aug 16, 2024
* Refactor panic handling to middleware

Signed-off-by: Jason Parraga <[email protected]>

* Remove registration of old panicCounter

Signed-off-by: Jason Parraga <[email protected]>

* Add test coverage

Signed-off-by: Jason Parraga <[email protected]>

---------

Signed-off-by: Jason Parraga <[email protected]>
Signed-off-by: Vladyslav Libov <[email protected]>
eapolinario added a commit that referenced this pull request Aug 20, 2024
…ame (#5616)

* Add environment variable for pod name

Signed-off-by: Bugra Gedik <[email protected]>

* [flyteadmin] Refactor panic recovery into middleware (#5546)

* Refactor panic handling to middleware

Signed-off-by: Jason Parraga <[email protected]>

* Remove registration of old panicCounter

Signed-off-by: Jason Parraga <[email protected]>

* Add test coverage

Signed-off-by: Jason Parraga <[email protected]>

---------

Signed-off-by: Jason Parraga <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Snowflake agent Doc (#5620)

* TEST build

Signed-off-by: Future-Outlier <[email protected]>

* remove emphasize-lines

Signed-off-by: Future-Outlier <[email protected]>

* test build

Signed-off-by: Future-Outlier <[email protected]>

* revert

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* [flytepropeller][compiler] Error Handling when Type is not found (#5612)

* FlytePropeller Compiler Avoid Crash when Type not found

Signed-off-by: Future-Outlier <[email protected]>

* Update pingsu's error message advices

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: pingsutw  <[email protected]>

* fix lint

Signed-off-by: Future-Outlier <[email protected]>

* Trigger CI

Signed-off-by: Future-Outlier <[email protected]>

* Trigger CI

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: pingsutw <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Fix nil pointer when task plugin load returns error (#5622)

Signed-off-by: Bugra Gedik <[email protected]>

* Log stack trace when refresh cache sync recovers from panic (#5623)

Signed-off-by: Bugra Gedik <[email protected]>

* use private-key (#5626)

Signed-off-by: Bugra Gedik <[email protected]>

* Explain how Agent Secret Works (#5625)

* first version

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Fix typo in execution manager (#5619)

Signed-off-by: ddl-rliu <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Amend Admin to use grpc message size (#5628)

* add send arg

Signed-off-by: Yee Hing Tong <[email protected]>

* Add acction to remove cache in gh runner

Signed-off-by: Eduardo Apolinario <[email protected]>

* Use correct checked out path

Signed-off-by: Eduardo Apolinario <[email protected]>

* Path in strings

Signed-off-by: Eduardo Apolinario <[email protected]>

* Checkout repo in root

Signed-off-by: Eduardo Apolinario <[email protected]>

* Use the correct path to new action

Signed-off-by: Eduardo Apolinario <[email protected]>

* Do not use gh var in path to clear-action-cache

Signed-off-by: Eduardo Apolinario <[email protected]>

* Remove wrong invocation of clear-action-cache

Signed-off-by: Eduardo Apolinario <[email protected]>

* GITHUB_WORKSPACE is implicit in the checkout action

Signed-off-by: Eduardo Apolinario <[email protected]>

* Refer to local `flyte` directory

Signed-off-by: Eduardo Apolinario <[email protected]>

---------

Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* document the process of setting ttl for a ray cluster (#5636)

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Add CustomHeaderMatcher to pass additional headers (#5563)

Signed-off-by: Andrew Dye <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Turn flyteidl and flytectl releases into manual gh workflows (#5635)

* Make flyteidl releases go through a manual gh workflow

Signed-off-by: Eduardo Apolinario <[email protected]>

* Make flytectl releases go through a manual gh workflow

Signed-off-by: Eduardo Apolinario <[email protected]>

* Rewrite the documentation for `version` and clarify wording in RELEASE.md

Signed-off-by: Eduardo Apolinario <[email protected]>

---------

Signed-off-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* docs: fix typo (#5643)

* fix CHANGELOG-v0.2.0.md

Signed-off-by: Christina <[email protected]>

* fix CHANGELOG-v1.0.2-b1.md

Signed-off-by: Christina <[email protected]>

* fix CHANGELOG-v1.1.0.md

Signed-off-by: Christina <[email protected]>

* fix CHANGELOG-v1.3.0.md

Signed-off-by: Christina <[email protected]>

---------

Signed-off-by: Christina <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Use enable_deck=True in docs (#5645)

Signed-off-by: Bugra Gedik <[email protected]>

* Fix flyteidl release  checkout all tags (#5646)

* Fetch all tags in flyteidl-release.yml

Signed-off-by: Eduardo Apolinario <[email protected]>

* Fix sed expression for npm job

Signed-off-by: Eduardo Apolinario <[email protected]>

---------

Signed-off-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Install pyarrow in sandbox functional tests (#5647)

Signed-off-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* docs: add documentation for configuring notifications in GCP (#5545)

* update

Signed-off-by: Desi Hsu <[email protected]>

* dco

Signed-off-by: Desi Hsu <[email protected]>

* dco

Signed-off-by: Desi Hsu <[email protected]>

* typo

Signed-off-by: Desi Hsu <[email protected]>

---------

Signed-off-by: Desi Hsu <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Correct "sucessfile" to "successfile" (#5652)

Signed-off-by: Bugra Gedik <[email protected]>

* Fix ordering for custom template values in cluster resource controller (#5648)

Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* Don't error when attempting to trigger schedules for inactive projects (#5649)

* Don't error when attempting to trigger schedules for inactive projects

Signed-off-by: Katrina Rogan <[email protected]>

* regen

Signed-off-by: Katrina Rogan <[email protected]>

---------

Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>

* fix tests

Signed-off-by: Bugra Gedik <[email protected]>

* change to shorter names

Signed-off-by: Bugra Gedik <[email protected]>

* change to shorter names

Signed-off-by: Bugra Gedik <[email protected]>

* change to shorter names

Signed-off-by: Bugra Gedik <[email protected]>

* change to shorter names

Signed-off-by: Bugra Gedik <[email protected]>

* change to shorter names

Signed-off-by: Bugra Gedik <[email protected]>

* Fix comment symbol

Signed-off-by: Eduardo Apolinario <[email protected]>

* fix one more test

Signed-off-by: Bugra Gedik <[email protected]>

---------

Signed-off-by: Bugra Gedik <[email protected]>
Signed-off-by: Jason Parraga <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: ddl-rliu <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
Signed-off-by: Christina <[email protected]>
Signed-off-by: Desi Hsu <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Co-authored-by: Jason Parraga <[email protected]>
Co-authored-by: Future-Outlier <[email protected]>
Co-authored-by: pingsutw <[email protected]>
Co-authored-by: ddl-rliu <[email protected]>
Co-authored-by: Yee Hing Tong <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Andrew Dye <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Christina <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: desihsu <[email protected]>
Co-authored-by: ShengYu <[email protected]>
Co-authored-by: Katrina Rogan <[email protected]>
pmahindrakar-oss pushed a commit that referenced this pull request Sep 9, 2024
…ame (#5616)

Signed-off-by: pmahindrakar-oss <[email protected]>