Currently, UrlDBFunction tries to avoid a circular back-pressure lockup by capping the number of "in flight" URLs. The assumption is that the maximum number of outlinks extracted from any one page, times the maximum in-flight limit, is less than the size of the network buffer we're using.
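For reference, a minimal sketch of how that cap works. The class, constants, and limits here (`MAX_OUTLINKS_PER_PAGE`, `NETWORK_BUFFER_URLS`) are illustrative, not the actual code:

```java
// Sketch of the current in-flight throttle. The invariant: worst-case
// outlink expansion of everything in flight must still fit in the
// network buffer, i.e. MAX_IN_FLIGHT * MAX_OUTLINKS_PER_PAGE < buffer size.
public class InFlightThrottle {
    private static final int MAX_OUTLINKS_PER_PAGE = 100;   // assumed worst case
    private static final int NETWORK_BUFFER_URLS = 10_000;  // assumed buffer capacity
    private static final int MAX_IN_FLIGHT = NETWORK_BUFFER_URLS / MAX_OUTLINKS_PER_PAGE;

    private int inFlight = 0;

    /** Returns true if a URL may be emitted for fetching. */
    public synchronized boolean tryAcquire() {
        if (inFlight >= MAX_IN_FLIGHT) {
            return false;  // leave the URL in the URL DB; retry on a later tick
        }
        inFlight++;
        return true;
    }

    /** Called when a URL's result loops back to the UrlDBFunction. */
    public synchronized void release() {
        inFlight--;
    }
}
```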
This has a number of issues:
- With parallelism > 1, the new URLs generated by extracting outlinks (keyed by PLD for each URL) might all go to a different UrlDBFunction instance, bypassing the throttling we're trying to do.
- Not every in-flight URL actually results in a fetched/parsed page, and not every page has the maximum number of outlinks, so our max in-flight limit is very conservative and thus inefficient.
Since the real problem is caused by outlink expansion, we could have a metrics gauge that tracks in-flight URLs, incrementing it both when emitting URLs from the UrlDBFunction and when collecting outlinks. Before emitting outlinks, we could check the current gauge value and wait (with random backoff) if we're too close to our limit.
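A minimal sketch of that throttle, with hypothetical names and made-up limits (`MAX_IN_FLIGHT`, `HEADROOM`, `MAX_BACKOFF_MS`); the cluster-wide total would come from the REST lookup discussed below:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class OutlinkThrottle {
    private static final long MAX_IN_FLIGHT = 10_000;  // assumed global limit
    private static final double HEADROOM = 0.9;        // start backing off at 90%
    private static final long MAX_BACKOFF_MS = 500;    // assumed backoff ceiling

    // Local piece of the in-flight count; this is what we'd register with
    // Flink as a gauge so per-subtask values can be summed via REST.
    private final AtomicLong localInFlight = new AtomicLong();

    // Supplies the cluster-wide (summed) gauge value.
    private final LongSupplier clusterInFlight;

    public OutlinkThrottle(LongSupplier clusterInFlight) {
        this.clusterInFlight = clusterInFlight;
    }

    /** Incremented when UrlDBFunction emits a URL for fetching. */
    public void onUrlEmitted() { localInFlight.incrementAndGet(); }

    /** Decremented when a URL's result loops back into the URL DB. */
    public void onUrlCompleted() { localInFlight.decrementAndGet(); }

    /** Value exposed via the Flink metrics gauge. */
    public long gaugeValue() { return localInFlight.get(); }

    /** Before emitting outlinks: wait (random backoff) while too close to the limit. */
    public void throttleOutlinks(int numOutlinks) throws InterruptedException {
        while (clusterInFlight.getAsLong() + numOutlinks >= MAX_IN_FLIGHT * HEADROOM) {
            // Random backoff so parallel subtasks don't all re-check in lockstep.
            Thread.sleep(ThreadLocalRandom.current().nextLong(1, MAX_BACKOFF_MS));
        }
        localInFlight.addAndGet(numOutlinks);  // count the new outlinks as in flight
    }
}
```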
This would require a REST call to get the (summed) gauge value for every parsed page. It's unclear whether this would wind up being a performance hit, as parsing (and fetching pages) can also take a significant amount of time.
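A sketch of that lookup, assuming Flink's aggregated subtask metrics REST endpoint (`/jobs/<jobid>/vertices/<vertexid>/subtasks/metrics`) and the hypothetical `inFlightUrls` gauge name from above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InFlightMetricClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String url;

    // jobId/vertexId identify the UrlDBFunction operator; agg=sum asks the
    // JobManager to sum the gauge across all subtasks.
    public InFlightMetricClient(String restBase, String jobId, String vertexId) {
        this.url = restBase + "/jobs/" + jobId + "/vertices/" + vertexId
                + "/subtasks/metrics?get=inFlightUrls&agg=sum";
    }

    /** Fetch the summed gauge value across all subtasks. */
    public long fetchClusterInFlight() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // Response looks roughly like [{"id":"inFlightUrls","sum":1234.0}];
        // a real client would use a JSON parser instead of this crude extraction.
        Matcher m = Pattern.compile("\"sum\"\\s*:\\s*\"?([0-9.]+)").matcher(body);
        return m.find() ? (long) Double.parseDouble(m.group(1)) : 0L;
    }
}
```

If the per-page round trip does turn out to be a hit, one mitigation would be to poll this value on a fixed interval and cache it, so the per-page cost becomes a field read rather than an HTTP request (at the cost of acting on a slightly stale count).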