Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[434] Allow to filter jobs in ZyteJobsComparisonMonitor by close_reason #440

Merged
merged 13 commits into from
Jul 15, 2024

Conversation

shafiq-muhammad
Copy link
Contributor

Adds limit for nested dict stats computation

Added a new setting SPIDERMON_JOBS_COMPARISON_CLOSE_REASONS to allow ZyteJobsComparisonMonitor to filter by the ScrapyCloud jobs with their close_reason stat.

Copy link

codecov bot commented Apr 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.54%. Comparing base (6fe8928) to head (7260d08).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #440      +/-   ##
==========================================
+ Coverage   79.51%   79.54%   +0.03%     
==========================================
  Files          76       76              
  Lines        3237     3242       +5     
  Branches      537      539       +2     
==========================================
+ Hits         2574     2579       +5     
  Misses        593      593              
  Partials       70       70              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@curita curita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me!

We should add some unittest to cover the new addition, and it would be nice to run some test job in ScrapyCloud to verify these changes (Installing Spidermon in the requierements.txt from git+https://github.com/shafiq-muhammad/spidermon@shafiq-434 should get you these changes). After that's addressed we should be good to go 👍

spidermon/contrib/scrapy/monitors/monitors.py Outdated Show resolved Hide resolved
spidermon/contrib/scrapy/monitors/monitors.py Outdated Show resolved Hide resolved
@VMRuiz VMRuiz closed this Apr 17, 2024
@VMRuiz VMRuiz reopened this Apr 17, 2024
@shafiq-muhammad shafiq-muhammad requested a review from curita May 2, 2024 12:08
@aaneto
Copy link
Contributor

aaneto commented May 3, 2024

Just a quick FYI @shafiq-muhammad @curita, I believe the pipeline is failing due to a regression from pytest 8.1.1 to 8.2.0.

My validation was:

  • Creating an empty .env for the project and running pip install tox and tox -e base
  • Noticing the errors, and setting pytest==8.1.1 on tox.ini
  • Running tox -e base again, with success this time.

Hope this helps (and is correct regardless of environment).

@VMRuiz
Copy link
Collaborator

VMRuiz commented May 3, 2024

Yes, it looks like Pytest 8.2.0 release has caused a lot of troubles according to their issues page.
@shafiq-muhammad , Could you please pin pytest to 8.1.1 until they release a new version?

Copy link
Member

@curita curita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, though we might want to change how we get the jobs depending on whether we want to get fewer API calls. I'm inclined to get fewer API calls, but let's wait for Victor's opinion.

Other than that, I suggested a small documentation addition, which should be quick to include.

spidermon/contrib/scrapy/monitors/monitors.py Outdated Show resolved Hide resolved

if len(current_jobs) < MAX_API_COUNT or len(total_jobs) == number_of_jobs:
for job in current_jobs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: This makes sense since it seems we can't filter by close_reason in client.spider.jobs.list().

I wonder if it makes sense to retrieve MAX_API_COUNT count jobs each time from the API and append to total_jobs the jobs that meet the criteria until we reach number_of_jobs. This should result in fewer API calls, but those calls will be larger. What do you think, @VMRuiz?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the default close reason is finished I would keep it as it is. As the most common scenario will be that you fetch the last few jobs and those will have the finished status. Even in the case where a second call is needed - due to a few unexpected failed jobs in between - it could still be faster for lower volumes than making 1 very big call.

On the contrary, if it's expected that uncommon close reasons like timeout are used, then it makes sense to fetch as many jobs as possible, as they will be few or even none.

Copy link
Member

@curita curita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ready to me!

@VMRuiz VMRuiz merged commit d7d553a into scrapinghub:master Jul 15, 2024
8 checks passed
@shafiq-muhammad shafiq-muhammad deleted the shafiq-434 branch August 22, 2024 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants