
Retrieve and parse web page (remote resource in general) incrementally #19

Open
llucax opened this issue Jan 4, 2021 · 0 comments

llucax commented Jan 4, 2021

We only want to extract some information about the URL, and we accept that this information won't be perfect, since we need to make assumptions and use heuristics to figure out where to get it from.

Because of this, and to avoid retrieving and parsing huge documents, we should ideally retrieve and parse the remote resource incrementally, stopping as soon as we have enough information about it. For example, generated links always have a maximum length, so if we are asked to generate a link for a resource storing the complete works of Shakespeare, we only need to get the first 4K at most and then we are done. A lot of CPU power and network traffic can be saved this way.
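
As a rough sketch of the idea (assuming a Python implementation, which the issue does not specify): stream the response in small chunks, feed each chunk to an incremental HTML parser, and stop as soon as the information we need (here, the page title) is found or the 4K cap is reached.

```python
import urllib.request
from html.parser import HTMLParser

MAX_BYTES = 4 * 1024  # the "first 4K at most" heuristic from above


class TitleParser(HTMLParser):
    """Incremental parser that only looks for the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


def fetch_title(url):
    parser = TitleParser()
    read = 0
    with urllib.request.urlopen(url) as resp:
        while read < MAX_BYTES and parser.title is None:
            chunk = resp.read(1024)  # retrieve 1 KiB at a time
            if not chunk:
                break
            read += len(chunk)
            # errors="replace" tolerates a multi-byte character split
            # across chunk boundaries; this is a sketch, not production code.
            parser.feed(chunk.decode("utf-8", errors="replace"))
    return parser.title
```

With this, fetching the title of a page holding the complete works of Shakespeare downloads at most 4 KiB instead of the whole document.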

llucax added the optimization label on Jan 4, 2021
llucax added a commit that referenced this issue Jan 7, 2021
So far the web page from which to generate the expanded URL is retrieved
synchronously, but this is problematic because the worker thread will be
stuck waiting for data to arrive from the network.

We convert the web page retrieval to use the async framework, so the
worker thread can be used to process other requests while we wait for
data to come from the network.

Part of #19.
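
The referenced commit itself is not shown in this thread, but the conversion it describes might look roughly like the following, assuming Python with aiohttp as the async framework (the issue does not name the actual language or framework). While each chunk of the body is awaited, the worker is free to serve other requests, and the incremental 4K cap from the issue still applies:

```python
import asyncio

import aiohttp

MAX_BYTES = 4 * 1024


async def fetch_head(url):
    """Asynchronously retrieve at most MAX_BYTES of the resource."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = b""
            # Stream the body in 1 KiB chunks; the event loop can run
            # other coroutines while each chunk is awaited.
            async for chunk in resp.content.iter_chunked(1024):
                data += chunk
                if len(data) >= MAX_BYTES:
                    break  # we have enough; stop downloading
            return data[:MAX_BYTES]


# Example usage: asyncio.run(fetch_head("https://example.com"))
```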