This crawler scans a list of URLs and publishes a series of events on PubSub for every page it crawls. It is designed to handle large websites with millions of pages, but it can use a significant amount of memory because the set of crawled pages is held in memory. This also means that if the process crashes or shuts down, the crawl has to restart from scratch: progress is not persisted or recovered in the current implementation.
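The sketch below illustrates the flow described above (in-memory visited set, one event published per crawled page); it is a minimal illustration rather than the actual implementation, and it assumes Go with Google Cloud Pub/Sub. The project ID, topic name, `PageEvent` type, and `fetchLinks` helper are all placeholders.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// PageEvent is a hypothetical payload for a crawled-page event.
type PageEvent struct {
	URL string `json:"url"`
}

// fetchLinks is a placeholder; a real implementation would fetch the page
// and extract its outgoing links.
func fetchLinks(url string) []string { return nil }

func main() {
	ctx := context.Background()

	// Assumed project and topic names; adjust for your environment.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	topic := client.Topic("crawled-pages")

	// In-memory set of visited URLs: lost on crash or shutdown, so the
	// crawl restarts from scratch, as noted above.
	visited := map[string]bool{}
	queue := []string{"https://example.com/"}

	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		if visited[url] {
			continue
		}
		visited[url] = true

		// Publish an event for the crawled page.
		data, _ := json.Marshal(PageEvent{URL: url})
		res := topic.Publish(ctx, &pubsub.Message{Data: data})
		if _, err := res.Get(ctx); err != nil {
			log.Printf("publish %s: %v", url, err)
		}

		// Enqueue the links found on the page.
		queue = append(queue, fetchLinks(url)...)
	}
}
```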
Its data model is tightly coupled to my application, but the internals of the crawler are well abstracted and tested, so it should be simple to reuse.
If you need this crawler as a library, I'm happy to extract it and open source it.