
Parser / Downloader stuck in an infinite loop if it did not reach max_num #133

Open
zMynxx opened this issue Dec 8, 2024 · 3 comments

Comments


zMynxx commented Dec 8, 2024

While using the crawler I encountered an issue regarding max_num.
In cases where fewer images are found than the provided "max_num", an infinite loop begins, which prevents any of the previously collected results from actually being reported.
The expected behaviour is an immediate stop once there is nothing left to download, followed by a clean exit.

The following example uses the greedy crawler with a URL pointing at the Flickr search engine, like so:

  • search_phrase set to "gripper"
  • max_num set to 30
from icrawler.builtin import GreedyImageCrawler

root_dir = "images"  # output directory for downloaded files (any path will do)

def test_flicker(search_phrase: str, max_num: int) -> None:
    print("start testing FlickerImageCrawler")
    greedy_crawler = GreedyImageCrawler(parser_threads=4, storage={"root_dir": root_dir})
    greedy_crawler.crawl(f"https://www.flickr.com/search/?q={search_phrase}", max_num=max_num, min_size=(100, 100))

Result (downloaded image #27 and then the infinite loop):

INFO - downloader - image #27\thttps://combo.staticflickr.com/pw/images/favicons/f>
INFO - parser - parser-001 is waiting for new page urls
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-001 is waiting for new page urls
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-001 is waiting for new page urls
INFO - downloader - downloader-001 is waiting for new download tasks
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
INFO - parser - parser-001 is waiting for new page urls
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
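
For context, here is a minimal sketch of the kind of worker loop that produces this log pattern; it is not icrawler's actual code, and names such as parser_loop, task_queue, and exit_signal are purely illustrative. A thread that only blocks on its input queue with a timeout and then loops again will keep reporting that it is waiting for new page URLs forever once the feeder has nothing left, unless it also checks an explicit exit condition:

import queue
import threading


def parser_loop(task_queue: queue.Queue, exit_signal: threading.Event) -> None:
    """Illustrative worker loop, not icrawler's actual implementation."""
    while True:
        try:
            url = task_queue.get(timeout=1)
        except queue.Empty:
            # Without an exit check the thread spins here forever once the
            # feeder has nothing left to enqueue, matching the repeated
            # "waiting for new page urls" lines in the log above.
            if exit_signal.is_set():
                break
            print("parser is waiting for new page urls")
            continue
        print(f"parsing {url}")  # stand-in for the real parsing work


# Usage sketch: with an empty queue, setting the event lets the worker exit.
tasks = queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=parser_loop, args=(tasks, stop))
worker.start()
stop.set()      # signal that no more work will arrive
worker.join()   # the worker exits instead of looping forever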
@ZhiyuanChen
Collaborator

Thank you for raising this issue; it seems very interesting. I'll look into it.


zMynxx commented Dec 22, 2024

Any updates?
If not, could you please introduce a timeout mechanism in the meantime, say 30 s or so? That way the crawler would still be functional. At the moment it is so unstable because of this that I cannot use it :(
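
Until there is an upstream fix, a possible user-side workaround is to run the crawl in a child process and enforce a hard deadline from the parent. This is only a sketch under the assumption that abruptly terminating a stuck crawl is acceptable; run_crawl and the 30-second deadline are illustrative, and files that were already downloaded remain on disk:

import multiprocessing

from icrawler.builtin import GreedyImageCrawler


def run_crawl(search_phrase: str, max_num: int, root_dir: str) -> None:
    crawler = GreedyImageCrawler(parser_threads=4, storage={"root_dir": root_dir})
    crawler.crawl(
        f"https://www.flickr.com/search/?q={search_phrase}",
        max_num=max_num,
        min_size=(100, 100),
    )


if __name__ == "__main__":
    # Run the crawl in a child process so it can be force-stopped if it hangs.
    proc = multiprocessing.Process(target=run_crawl, args=("gripper", 30, "images"))
    proc.start()
    proc.join(timeout=30)  # hard deadline in seconds
    if proc.is_alive():
        proc.terminate()   # give up on the stuck crawl; downloaded files stay on disk
        proc.join()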

@ZhiyuanChen
Collaborator

Sorry for the late reply.

Could you check whether the latest commit fixes your issue?
