Max requests limit edited #89

Open
wants to merge 1 commit into
base: master

cortezroberto
Contributor

Description

What did I do?

This PR is related to issue #88. I researched the follow-redirects module used in this project to solve the following bug:
image
"When running the scraper locally, one of the newer sources triggered an Error [ERR_FR_TOO_MANY_REDIRECTS]: Maximum number of redirects exceeded."

How to solve it:

After reading the follow-redirects documentation and testing the project locally, I found that the module provides default options for different scenarios when making requests. In this case, the two options involved in the bug are:

  • maxRedirects – sets the maximum number of allowed redirects; if exceeded, an error will be emitted.
  • maxBodyLength – sets the maximum size of the request body; if exceeded, an error will be emitted.

Default values:
maxRedirects = 21.
maxBodyLength = 10MB (10 * 1024 * 1024).
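
To illustrate how the maxRedirects limit behaves, here is a minimal sketch that does not use follow-redirects itself. `resolveRedirects` and `redirectMap` are hypothetical stand-ins for the network; only the error code and the default of 21 come from the description above.

```javascript
// Hypothetical stand-in for real HTTP traffic: redirectMap maps a URL to the
// URL it redirects to. A URL with no entry counts as a final response.
function resolveRedirects(startUrl, redirectMap, maxRedirects = 21) {
  let url = startUrl;
  let redirects = 0;

  while (redirectMap.has(url)) {
    redirects += 1;
    // Same kind of check as the one behind the reported error: fail once the
    // number of hops goes past the configured limit.
    if (redirects > maxRedirects) {
      const err = new Error('Maximum number of redirects exceeded');
      err.code = 'ERR_FR_TOO_MANY_REDIRECTS';
      throw err;
    }
    url = redirectMap.get(url);
  }

  return { url, redirects };
}
```

A source that redirects in a cycle would hit any finite limit, so raising the value only helps when the redirect chain is long but finite.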

To solve the bug, it's necessary to edit the default values of these options in ./node_modules/follow-redirects/index.js.
First I added the source that triggered the bug:
Screen Shot 2022-07-04 at 19 00 13
Then I changed the values of the options to allow more redirects and avoid this kind of bug while scraping sites.
Screen Shot 2022-07-04 at 19 11 30

Changed values:
maxRedirects = 84.
maxBodyLength = 20MB (20 * 1024 * 1024).
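
For reference, follow-redirects also accepts these two options per request, so the same values could live in the project's own code. The sketch below assumes the scraper calls the follow-redirects `https` wrapper directly, which this PR does not confirm; `buildRequestOptions` is a hypothetical helper name.

```javascript
// Builds a request-options object carrying the raised limits from this PR.
// The result is intended for the https.get / https.request wrappers exported
// by follow-redirects, which read maxRedirects and maxBodyLength from it.
function buildRequestOptions(
  sourceUrl,
  { maxRedirects = 84, maxBodyLength = 20 * 1024 * 1024 } = {}
) {
  const { protocol, hostname, port, pathname, search } = new URL(sourceUrl);
  return {
    protocol,
    hostname,
    port: port || undefined, // empty string means "use the protocol default"
    path: pathname + search,
    maxRedirects,
    maxBodyLength,
  };
}

// Usage (assumes follow-redirects is installed in the project):
// const { https } = require('follow-redirects');
// https.get(buildRequestOptions('https://example.com/'), (res) => res.resume());
```

If the scraper goes through another HTTP client, that client's own configuration would be the place for these values instead.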

@MizouziE
Collaborator

MizouziE commented Jul 9, 2022

Hi @cortezroberto
My apologies for taking so long to get back to you! I'm glad that you found a solution that worked, but I cannot seem to replicate it, and I think there are a few possible reasons.

  • I noticed the branch you were working from was not up to date with the current project, so it is possible that another package we've added is causing a conflict. Can you use "Fetch upstream" on your repo to get all the latest changes and see if it still works?
    image

  • The only change in this PR is the addition of the source for Washington Post, so I suspect the changes you made were inside node_modules/follow-redirects/index.js, which is excluded from git by the .gitignore file. The documentation mentions how to override these defaults globally, but I honestly couldn't get it to work when I tried, so I think I was missing something.
    image

  • Finally, when the redirects problem did seem to be solved, a new type of error came back that still prevented any articles from being retrieved from Washington Post. I am looking through the site and trying to understand the major differences in structure, but cannot figure it out.
    image
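
For the global override mentioned in the second point, the follow-redirects README describes setting `maxRedirects` and `maxBodyLength` as properties on the module's export. A minimal sketch, assuming the scraper loads the same follow-redirects instance and does not pass its own per-request limits (which would take precedence); `raiseRedirectLimits` is a hypothetical helper name:

```javascript
// Sets module-wide limits on an already-loaded follow-redirects export.
// `fr` is expected to be the object returned by require('follow-redirects');
// it is taken as a parameter so the helper can be tried with a plain object.
function raiseRedirectLimits(fr, { maxRedirects = 84, maxBodyLengthMB = 20 } = {}) {
  fr.maxRedirects = maxRedirects;
  fr.maxBodyLength = maxBodyLengthMB * 1024 * 1024; // megabytes to bytes
  return fr;
}

// Usage near the scraper's entry point, before any request is made
// (assumes follow-redirects is installed in the project):
// raiseRedirectLimits(require('follow-redirects'));
```

Because this lives in the project's source rather than in node_modules, it would be tracked by git and show up in the PR diff.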

I think this source might just be too problematic, so if neither of us can find a solution, we can simply remove it from the list. There are plenty of others (that just need a slight tweak), so enough results are being delivered in my opinion. I'd just like to understand the reason for this error.

Thanks again, and let me know any thoughts on all the above or if you want to discuss any of it.


2 participants