Currently, UrlDBFunction tries to avoid a circular back-pressure lockup by capping the number of "in flight" URLs. The assumption is that the maximum number of outlinks extracted from any one page, times the maximum in-flight limit, is less than the size of the network buffer we're using.
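For reference, a minimal sketch of how that cap works. The class, constants, and limits here (`MAX_OUTLINKS_PER_PAGE`, `NETWORK_BUFFER_URLS`) are illustrative, not the actual code:

```java
// Sketch of the current in-flight throttle. The invariant: worst-case
// outlink expansion of everything in flight must still fit in the
// network buffer, i.e. MAX_IN_FLIGHT * MAX_OUTLINKS_PER_PAGE < buffer size.
public class InFlightThrottle {
    private static final int MAX_OUTLINKS_PER_PAGE = 100;   // assumed worst case
    private static final int NETWORK_BUFFER_URLS = 10_000;  // assumed buffer capacity
    private static final int MAX_IN_FLIGHT = NETWORK_BUFFER_URLS / MAX_OUTLINKS_PER_PAGE;

    private int inFlight = 0;

    /** Returns true if a URL may be emitted for fetching. */
    public synchronized boolean tryAcquire() {
        if (inFlight >= MAX_IN_FLIGHT) {
            return false;  // leave the URL in the URL DB; retry on a later tick
        }
        inFlight++;
        return true;
    }

    /** Called when a URL's result loops back to the UrlDBFunction. */
    public synchronized void release() {
        inFlight--;
    }
}
```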
This has a number of issues:
- With parallelism > 1, the new URLs generated by extracting outlinks (keyed by PLD for each URL) might all go to a different UrlDBFunction instance, bypassing the throttling we're trying to do.
- Not every in-flight URL actually results in a fetched/parsed page, and not every page has the maximum number of outlinks, so our max in-flight limit is very conservative and thus inefficient.
Since the real problem is caused by outlink expansion, we could have a metrics gauge that tracks in-flight URLs, incrementing it both when emitting URLs from the UrlDBFunction and when collecting outlinks. Before emitting outlinks, we could check the current gauge value and wait (with random backoff) if we're too close to our limit.
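A minimal sketch of that throttle, with hypothetical names and made-up limits (`MAX_IN_FLIGHT`, `HEADROOM`, `MAX_BACKOFF_MS`); the cluster-wide total would come from the REST lookup discussed below:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class OutlinkThrottle {
    private static final long MAX_IN_FLIGHT = 10_000;  // assumed global limit
    private static final double HEADROOM = 0.9;        // start backing off at 90%
    private static final long MAX_BACKOFF_MS = 500;    // assumed backoff ceiling

    // Local piece of the in-flight count; this is what we'd register with
    // Flink as a gauge so per-subtask values can be summed via REST.
    private final AtomicLong localInFlight = new AtomicLong();

    // Supplies the cluster-wide (summed) gauge value.
    private final LongSupplier clusterInFlight;

    public OutlinkThrottle(LongSupplier clusterInFlight) {
        this.clusterInFlight = clusterInFlight;
    }

    /** Incremented when UrlDBFunction emits a URL for fetching. */
    public void onUrlEmitted() { localInFlight.incrementAndGet(); }

    /** Decremented when a URL's result loops back into the URL DB. */
    public void onUrlCompleted() { localInFlight.decrementAndGet(); }

    /** Value exposed via the Flink metrics gauge. */
    public long gaugeValue() { return localInFlight.get(); }

    /** Before emitting outlinks: wait (random backoff) while too close to the limit. */
    public void throttleOutlinks(int numOutlinks) throws InterruptedException {
        while (clusterInFlight.getAsLong() + numOutlinks >= MAX_IN_FLIGHT * HEADROOM) {
            // Random backoff so parallel subtasks don't all re-check in lockstep.
            Thread.sleep(ThreadLocalRandom.current().nextLong(1, MAX_BACKOFF_MS));
        }
        localInFlight.addAndGet(numOutlinks);  // count the new outlinks as in flight
    }
}
```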
This would require a REST call to get the (summed) gauge value for every parsed page. It's unclear whether this would wind up being a performance hit, as parsing (and fetching pages) can also take a significant amount of time.
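A sketch of that lookup, assuming Flink's aggregated subtask metrics REST endpoint (`/jobs/<jobid>/vertices/<vertexid>/subtasks/metrics`) and the hypothetical `inFlightUrls` gauge name from above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InFlightMetricClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String url;

    // jobId/vertexId identify the UrlDBFunction operator; agg=sum asks the
    // JobManager to sum the gauge across all subtasks.
    public InFlightMetricClient(String restBase, String jobId, String vertexId) {
        this.url = restBase + "/jobs/" + jobId + "/vertices/" + vertexId
                + "/subtasks/metrics?get=inFlightUrls&agg=sum";
    }

    /** Fetch the summed gauge value across all subtasks. */
    public long fetchClusterInFlight() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // Response looks roughly like [{"id":"inFlightUrls","sum":1234.0}];
        // a real client would use a JSON parser instead of this crude extraction.
        Matcher m = Pattern.compile("\"sum\"\\s*:\\s*\"?([0-9.]+)").matcher(body);
        return m.find() ? (long) Double.parseDouble(m.group(1)) : 0L;
    }
}
```

If the per-page round trip does turn out to be a hit, one mitigation would be to poll this value on a fixed interval and cache it, so the per-page cost becomes a field read rather than an HTTP request (at the cost of acting on a slightly stale count).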