This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Domain name normalization #87

Open
2 tasks
yuliya-ivaniukovich opened this issue Oct 25, 2017 · 0 comments

Comments

@yuliya-ivaniukovich
Contributor

At the moment, strings like www.domain.com, domain.com, and subdomain.domain.com are treated as separate jobs with independent results. This can lead to data collisions, since we use the document URL as the key of the document table.
We should:

  • investigate whether www.domain.com and domain.com are treated as the same address in Heritrix
  • for subdomains, we can keep independent Heritrix jobs, but we should link them to existing documents that were already downloaded and checked within another job.
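
To make the two tasks above more concrete, here is a minimal Python sketch of one possible approach. It is an illustration only, not the project's code: `normalize_domain`, `document_key`, and `DocumentIndex` are hypothetical names. It assumes that `www.domain.com` should collapse to `domain.com` (which is exactly what the first task still has to confirm for Heritrix) and that the scheme can be dropped from the document key.

```python
from urllib.parse import urlsplit


def _split(raw: str):
    """Parse a bare domain or a full URL; urlsplit needs '//' to find the host."""
    value = raw.strip()
    if "://" not in value:
        value = "//" + value
    return urlsplit(value)


def normalize_domain(raw: str) -> str:
    """Return a canonical host for a user-supplied domain or URL (hypothetical helper)."""
    host = (_split(raw).hostname or "").rstrip(".")  # hostname is already lowercased
    # Assumption: a leading "www." is an alias of the bare domain.
    # The guard avoids turning "www.com" into "com".
    if host.startswith("www.") and "." in host[len("www."):]:
        host = host[len("www."):]
    return host


def document_key(url: str) -> str:
    """Build a document-table key that ignores scheme, port, and the www alias."""
    parts = _split(url)
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return normalize_domain(url) + path + query


class DocumentIndex:
    """In-memory stand-in for the document table, keyed by normalized URL."""

    def __init__(self):
        self._jobs_by_key = {}  # document key -> set of job ids referencing it

    def link(self, job_id: str, url: str) -> bool:
        """Attach the document to job_id; return True if it was already known."""
        key = document_key(url)
        already_known = key in self._jobs_by_key
        self._jobs_by_key.setdefault(key, set()).add(job_id)
        return already_known

    def jobs_for(self, url: str) -> set:
        """Return the ids of all jobs linked to this document."""
        return set(self._jobs_by_key.get(document_key(url), set()))


index = DocumentIndex()
# A job for domain.com downloads a document hosted on the subdomain.
index.link("job-domain", "http://subdomain.domain.com/x.pdf")
# A later, independent subdomain job meets the same document and is linked to it.
reused = index.link("job-subdomain", "http://subdomain.domain.com/x.pdf")
```

With this scheme the subdomain still gets its own job, but the second `link` call reports that the document already exists instead of creating a colliding row. If the investigation shows that Heritrix treats `www.domain.com` and `domain.com` as different hosts, the `www.` stripping step would have to be removed or made configurable.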