Figure out better back-pressure monitoring for in flight URLs #104

Open
kkrugler opened this issue Mar 15, 2018 · 0 comments

Currently UrlDBFunction tries to avoid a circular back-pressure lockup by monitoring the number of "in flight" URLs. The assumption is that the maximum number of outlinks extracted from any one page, multiplied by the maximum in-flight limit, is less than the size of the network buffer we're using.
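As a rough sketch of that assumption (not the actual UrlDBFunction code), the in-flight limit can be derived from the other two quantities. The class name, the method names, and the idea of measuring buffer capacity in URL records are all hypothetical, chosen only for illustration.

```java
// Hypothetical helper: illustrates the invariant
//   maxOutlinksPerPage * inFlight < bufferCapacityInUrls
// Buffer capacity expressed as a count of URL records is an assumption.
final class InFlightLimit {
    private InFlightLimit() {}

    /** Largest in-flight count that still satisfies the invariant. */
    static long maxInFlight(long bufferCapacityInUrls, long maxOutlinksPerPage) {
        if (bufferCapacityInUrls <= 0 || maxOutlinksPerPage <= 0) {
            throw new IllegalArgumentException("capacity and outlink count must be positive");
        }
        // Strict inequality, so leave at least one free slot in the buffer.
        return (bufferCapacityInUrls - 1) / maxOutlinksPerPage;
    }

    /** True if the worst-case outlink expansion still fits in the buffer. */
    static boolean isSafe(long inFlight, long maxOutlinksPerPage, long bufferCapacityInUrls) {
        return inFlight * maxOutlinksPerPage < bufferCapacityInUrls;
    }
}
```

With made-up numbers, a buffer that holds 10,000 URLs and a cap of 100 outlinks per page would allow at most 99 URLs in flight.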

This has a number of issues:

  1. With parallelism > 1, the new URLs generated by extracting outlinks (keyed by the PLD of each URL) might all go to a different UrlDBFunction instance, thus bypassing the throttling that we're trying to do.
  2. Not every in-flight URL actually results in a fetched and parsed page, and not every page has the maximum number of outlinks. So our max in-flight limit is very conservative, which results in inefficiencies.
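To make the first issue concrete, here is a simplified sketch of keyed routing. It assumes a plain hash-modulo partitioner, which is a stand-in and not how Flink actually assigns keys to subtasks; the class and method names are hypothetical.

```java
import java.util.List;

// Simplified stand-in for keyed partitioning: hash(PLD) mod parallelism.
// This is an illustrative assumption, not Flink's real key-group assignment.
final class PldRouting {
    private PldRouting() {}

    /** Index of the (hypothetical) UrlDBFunction subtask that receives this PLD. */
    static int subtaskFor(String pld, int parallelism) {
        return Math.floorMod(pld.hashCode(), parallelism);
    }

    /** Number of outlinks that land on a subtask other than the one that emitted the page's URL. */
    static int countBypassing(int sourceSubtask, List<String> outlinkPlds, int parallelism) {
        int bypassing = 0;
        for (String pld : outlinkPlds) {
            if (subtaskFor(pld, parallelism) != sourceSubtask) {
                bypassing++;
            }
        }
        return bypassing;
    }
}
```

Any outlink counted by `countBypassing` arrives at a subtask whose local in-flight counter never saw the original URL, so that subtask's limit does nothing to hold it back.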

Since the real problem is caused by outlink expansion, we could have a metrics gauge that tracks in-flight URLs, where we increment this both when emitting URLs from the UrlDBFunction, and when collecting outlinks. Before emitting outlinks, we could check the current gauge and wait (random backoff) if we're too close to our limit.
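A minimal sketch of that check, assuming the summed gauge value is available through some supplier (for example one backed by the REST call). `OutlinkThrottle`, its constructor parameters, and the backoff bounds are hypothetical names and values, not existing code.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongSupplier;

// Hypothetical throttle placed in front of outlink emission.
final class OutlinkThrottle {
    private final LongSupplier inFlightGauge; // assumed to return the summed in-flight count
    private final long limit;                 // assumed global in-flight limit
    private final long maxBackoffMillis;      // upper bound for each random wait

    OutlinkThrottle(LongSupplier inFlightGauge, long limit, long maxBackoffMillis) {
        if (limit <= 0 || maxBackoffMillis < 1) {
            throw new IllegalArgumentException("limit and maxBackoffMillis must be positive");
        }
        this.inFlightGauge = inFlightGauge;
        this.limit = limit;
        this.maxBackoffMillis = maxBackoffMillis;
    }

    /**
     * Blocks until emitting outlinkCount URLs would keep the gauge under the limit.
     * Returns how many times the caller had to back off.
     */
    int awaitCapacity(int outlinkCount) throws InterruptedException {
        if (outlinkCount >= limit) {
            // Would never fit, so fail fast instead of waiting forever.
            throw new IllegalArgumentException("outlinkCount exceeds the in-flight limit");
        }
        int waits = 0;
        while (inFlightGauge.getAsLong() + outlinkCount >= limit) {
            // Random backoff so parallel subtasks do not all retry at the same moment.
            long sleepMillis = ThreadLocalRandom.current().nextLong(1, maxBackoffMillis + 1);
            Thread.sleep(sleepMillis);
            waits++;
        }
        return waits;
    }
}
```

Each loop iteration re-reads the gauge, so in this sketch the cost per parsed page is at least one gauge read, plus one more for every backoff.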

This would require a REST call to get the (summed) gauge value for every parsed page. It's unclear whether this would wind up being a performance hit, as fetching and parsing pages can also take a significant amount of time.

@kkrugler kkrugler self-assigned this Mar 15, 2018
@kkrugler kkrugler added this to the Future milestone Apr 25, 2018
@kkrugler kkrugler removed their assignment Apr 25, 2018