
Retrieve and parse web page (remote resource in general) incrementally #19

Open
llucax opened this issue Jan 4, 2021 · 0 comments

llucax commented Jan 4, 2021

We only want to extract some information about the URL, and we accept that this information won't be perfect, since we need to make assumptions and use heuristics to figure out where to get it from.

Because of this, and to avoid retrieving and parsing huge documents, we should ideally retrieve and parse the remote resource incrementally, stopping as soon as we have enough information about it. For example, generated links always have a maximum length, so if we are asked to generate a link for a resource storing the complete works of Shakespeare, we only need to get the first 4K at most and then we are done. A lot of CPU power and network traffic can be saved this way.
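
As a rough sketch of the idea (assuming a Python implementation, which the issue does not specify): stream the response in small chunks, feed each chunk to an incremental HTML parser, and stop as soon as the information we need (here, the page title) is found or the 4K cap is reached.

```python
import urllib.request
from html.parser import HTMLParser

MAX_BYTES = 4 * 1024  # the "first 4K at most" heuristic from above


class TitleParser(HTMLParser):
    """Incremental parser that only looks for the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


def fetch_title(url):
    parser = TitleParser()
    read = 0
    with urllib.request.urlopen(url) as resp:
        while read < MAX_BYTES and parser.title is None:
            chunk = resp.read(1024)  # retrieve 1 KiB at a time
            if not chunk:
                break
            read += len(chunk)
            # errors="replace" tolerates a multi-byte character split
            # across chunk boundaries; this is a sketch, not production code.
            parser.feed(chunk.decode("utf-8", errors="replace"))
    return parser.title
```

With this, fetching the title of a page holding the complete works of Shakespeare downloads at most 4 KiB instead of the whole document.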

llucax added the optimization label on Jan 4, 2021
llucax added a commit that referenced this issue Jan 7, 2021
So far the web page from which to generate the expanded URL is retrieved
synchronously, but this is problematic because the worker thread will be
stuck waiting for data to arrive from the network.

We convert the web page retrieval to use the async framework, so the
worker thread can be used to process other requests while we wait for
data to come from the network.

Part of #19.
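
The referenced commit itself is not shown in this thread, but the conversion it describes might look roughly like the following, assuming Python with aiohttp as the async framework (the issue does not name the actual language or framework). While each chunk of the body is awaited, the worker is free to serve other requests, and the incremental 4K cap from the issue still applies:

```python
import asyncio

import aiohttp

MAX_BYTES = 4 * 1024


async def fetch_head(url):
    """Asynchronously retrieve at most MAX_BYTES of the resource."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = b""
            # Stream the body in 1 KiB chunks; the event loop can run
            # other coroutines while each chunk is awaited.
            async for chunk in resp.content.iter_chunked(1024):
                data += chunk
                if len(data) >= MAX_BYTES:
                    break  # we have enough; stop downloading
            return data[:MAX_BYTES]


# Example usage: asyncio.run(fetch_head("https://example.com"))
```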