
Respect robots.txt #37

Open
3 tasks

jogli5er opened this issue Nov 21, 2019 · 0 comments
jogli5er commented Nov 21, 2019

For every newly discovered host, check for a robots.txt.
Then, for every URL, we need to check whether the robots.txt allows us to access it.
This check can be done either at insertion time or at dispatch time. For single runs, both options are pretty much identical; for longer, multi-scrape runs, however, the choice can make a difference, as the robots.txt might change between scrapes.

  • Multiple subdomains can each have their own robots.txt. As we store the subdomain as part of the path, the lookup to find out whether we already have a robots.txt path for a given subdomain is more costly.
  • Check whether we are allowed to access a given URL. The check needs to take the subdomain into account (see the first point).
  • Download the robots.txt lazily: when no path with a robots.txt exists yet for a given subdomain, we should add it and fetch it eagerly before continuing the download. Make sure no timeouts occur, since we now perform two downloads instead of one and therefore occupy a network slot up to twice as long.
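The lazy, per-subdomain lookup described above could be sketched roughly as follows, using Python's stdlib `urllib.robotparser`. The class and method names (`RobotsCache`, `is_allowed`) are illustrative, not part of this project, and a real implementation would need the timeout handling mentioned above:

```python
# Sketch: lazy per-subdomain robots.txt cache. Hypothetical names;
# not the project's actual implementation.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    def __init__(self, user_agent="crawler"):
        self.user_agent = user_agent
        # Keyed by (scheme, netloc) so every subdomain gets its own entry.
        self._parsers = {}

    def _parser_for(self, url):
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        if key not in self._parsers:
            # Lazy: fetch robots.txt only the first time we see this subdomain.
            # This is the second download per URL, so apply a timeout here.
            rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
            rp.read()
            self._parsers[key] = rp
        return self._parsers[key]

    def is_allowed(self, url):
        # Subdomain-aware check: the parser is selected by the full netloc.
        return self._parser_for(url).can_fetch(self.user_agent, url)
```

The cache makes the dispatch-time check cheap after the first hit per subdomain; for long multi-scrape runs, entries would additionally need an expiry so a changed robots.txt is eventually picked up.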
@jogli5er jogli5er added the todo This should be implemented, is planned and a necessity, therefor not an enhancement. label Nov 21, 2019
@jogli5er jogli5er self-assigned this Nov 21, 2019