Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increasing timeout ratios for CVE API requests #1393

Merged
merged 9 commits into from
Dec 6, 2024

Conversation

cmm-lyft
Copy link
Contributor

@cmm-lyft cmm-lyft commented Nov 29, 2024

Summary

Based on tests conducted semi-automated, recently https://services.nvd.nist.gov/rest/json/cves/2.0/ response time has increased up to 100,000 ms, and the latest executions have raised 503 error, responding after 20 to 30 retries with 3 seconds intervals:

Failed to get CVE data from NIST NVD API 503 : <html><body><h1>503 Service Unavailable</h1> No server is available to handle this request. </body></html>

Increasing the CONNECT_AND_READ_TIMEOUT to 120 seconds, the MAX_RETRIES to 30 and including sleep time on Exception to mitigate the insidense of 503 error.

The max execution time of the request logic (worst case scenario) would be of 7 minutes (120 of request connection, 120 of request timeout, 30 retries with 6 sleep time as max).

Checklist

Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:

  • Update/add unit or integration tests.
  • Include a screenshot showing what the graph looked like before and after your changes.
  • Include console log trace showing what happened before and after your changes.

If you are changing a node or relationship:

If you are implementing a new intel module:

@cmm-lyft
Copy link
Contributor Author

👀 @achantavy @heryxpc @serge-wq @khanhldt

Comment on lines 25 to 27
MAX_RETRIES = 30
# Connect and read timeouts of 120 seconds each; see https://requests.readthedocs.io/en/master/user/advanced/#timeouts
CONNECT_AND_READ_TIMEOUT = (120, 120)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

503 is not related to (connect, read) timeout config, no? Any reason to do those changes?

ideally, i would suggest to use http adapter with backoff. If that's complex to change, at least try to do similar thing in code (doubling the sleep time every next attempt).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same, but executing tests I observed that is the current behavior.
You can manually try with this link: https://services.nvd.nist.gov/rest/json/cves/2.0/
execute the request several times from your browser (even without passing credentials / token), and you will reproduce that 503 error.
the response is super slow, and in certain cases (from my tests) it keeps returning the error even after +20 retries:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can implement a backoff strategy for this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It looks like a timeout that occurred on their reverse proxy or load balancer. The configuration we have here is for the client side to cut off the connection, I suspect it make a difference.

I tried the curl command and it failed with 503 fairly fast, then 1 time it successfully waited for 10 seconds before returning the whole data. Usually connect timeout doesn't need to be very long, while read timeout can be longer to wait for the data to come back. Still maybe you don't really need to do those changes. If you really wish so, I would suggest something like (30, 120) instead.

And yes, backoff retry is better because if everyone keeps bombarding their service won't be able to recover.

@cmm-lyft cmm-lyft requested a review from kledo-lyft December 4, 2024 20:00
Comment on lines +101 to +103
# Exponential backoff
sleep_time *= 2
time.sleep(sleep_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a total of 3 * 2^5 or 6 * 2^5 seconds when API key is provided?

That seems low. Do we need higher number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 2^5?
condition at L99 is >= , so it will actually iterate over 6.
3 * 2^6 = 192
3 * 2^6 = 384
I think those times are high enough. 👀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, let's try to enumerate it.
it's incremented by 1 before the check and stopped when the value reaches 6

retries variable value -> sleep time
1 -> 3
2 -> 6
3 -> 12
4 -> 24
5 -> 48
6 -> stopped (hit >= MAX_RETRIES)

@cmm-lyft cmm-lyft requested a review from kledo-lyft December 4, 2024 21:36
@cmm-lyft
Copy link
Contributor Author

cmm-lyft commented Dec 6, 2024

👀 @achantavy

cmm-lyft and others added 8 commits December 6, 2024 11:48
Signed-off-by: cmm-lyft <[email protected]>
…ion (cartography-cncf#1392)

### Summary
Correcting error:
```
File "/code/venvs/venv/lib/python3.10/site-packages/cartography/intel/aws/ec2/network_acls.py", line 72, in transform_network_acl_data
    'CidrBlock': rule['CidrBlock'],
KeyError: 'CidrBlock'
```

### Checklist

Provide proof that this works (this makes reviews move faster). Please
perform one or more of the following:
- [ ] Update/add unit or integration tests.
- [ ] Include a screenshot showing what the graph looked like before and
after your changes.
- [ ] Include console log trace showing what happened before and after
your changes.

If you are changing a node or relationship:
- [ ] Update the
[schema](https://github.com/lyft/cartography/tree/master/docs/root/modules)
and
[readme](https://github.com/lyft/cartography/blob/master/docs/schema/README.md).

If you are implementing a new intel module:
- [ ] Use the NodeSchema [data
model](https://cartography-cncf.github.io/cartography/dev/writing-intel-modules.html#defining-a-node).

Signed-off-by: cmm-lyft <[email protected]>
Signed-off-by: cmm-lyft <[email protected]>
Signed-off-by: cmm-lyft <[email protected]>
Signed-off-by: cmm-lyft <[email protected]>
### Summary

Migrating the ASG sync to use the data model. Nothing else was changed
in the sync aside from creating the model and using it during the sync.

### Checklist

Provide proof that this works (this makes reviews move faster). Please
perform one or more of the following:
- [x] Update/add unit or integration tests.
- [ ] Include a screenshot showing what the graph looked like before and
after your changes.
- [ ] Include console log trace showing what happened before and after
your changes.

If you are changing a node or relationship:
- [ ] Update the
[schema](https://github.com/lyft/cartography/tree/master/docs/root/modules)
and
[readme](https://github.com/lyft/cartography/blob/master/docs/schema/README.md).

If you are implementing a new intel module:
- [ ] Use the NodeSchema [data
model](https://cartography-cncf.github.io/cartography/dev/writing-intel-modules.html#defining-a-node).

---------

Signed-off-by: Sergio Franco <[email protected]>
Signed-off-by: cmm-lyft <[email protected]>
@cmm-lyft cmm-lyft force-pushed the time_sleep_nvd_retry branch from 31aa342 to 5309729 Compare December 6, 2024 17:48
@achantavy achantavy merged commit f842546 into cartography-cncf:master Dec 6, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants