Prevent duplicate files #108

Merged
2 commits merged into master from fix_duplicates
Aug 14, 2020

@satyamtg satyamtg commented Aug 11, 2020

This fixes #104 by downloading every asset to instance_assets instead of to several different locations.
It also prevents the creation of dummy subtitle files when the content is {}, which previously produced many duplicates in the zimcheck report.
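
As an illustration of the URL-keyed idea (a minimal sketch, not the code from this PR): each asset gets exactly one path inside a single directory, derived from its URL, so a URL that was already seen is never stored a second time. The class name `AssetStore`, its methods, and the SHA-1-of-URL naming scheme are all assumptions made for this example.

```python
import hashlib
import pathlib
from urllib.parse import urlparse


class AssetStore:
    """Hypothetical helper: one file per unique URL, all in one directory."""

    def __init__(self, root: pathlib.Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.seen = {}  # url -> path already assigned to that URL

    def path_for(self, url: str) -> pathlib.Path:
        """Return the single on-disk path for this URL (same URL, same path)."""
        if url not in self.seen:
            # Keep the original extension so the file type stays recognizable.
            suffix = pathlib.PurePosixPath(urlparse(url).path).suffix
            digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
            self.seen[url] = self.root / f"{digest}{suffix}"
        return self.seen[url]

    def needs_download(self, url: str) -> bool:
        """True if nothing has been saved for this URL yet."""
        return not self.path_for(url).exists()
```

Because the path depends only on the URL, two pages referencing the same asset resolve to the same file, whichever is processed first.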

Please note that this prevents duplicate files only on the basis of unique URLs. A small number of files have different URLs but identical content, so zimcheck still reports them. There is no way to identify them without calculating a checksum. I think we should look at openzim/python-scraperlib#33 and use that for this small subset of files (essentially a few images under instance_assets).
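
To illustrate the checksum point, here is a minimal sketch of content-based duplicate detection using only the standard library. It is not the approach from openzim/python-scraperlib#33, whose design is not described here; the function names `file_digest` and `find_duplicates` and the choice of SHA-256 are assumptions for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import List


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(folder: Path) -> List[List[Path]]:
    """Group files under `folder` that have byte-identical content."""
    by_digest = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only groups with more than one file are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Running something like `find_duplicates(Path("instance_assets"))` after a scrape would list the files that differ in URL but not in content.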

@satyamtg satyamtg self-assigned this Aug 11, 2020
@satyamtg satyamtg force-pushed the fix_duplicates branch 2 times, most recently from b043f94 to 762ed33 Compare August 12, 2020 13:28
@satyamtg satyamtg requested a review from rgaudin August 12, 2020 13:34
@satyamtg satyamtg marked this pull request as ready for review August 12, 2020 13:34
@rgaudin rgaudin left a comment

Should we add a function that returns the "../" * x? It might be a bit more readable.

@satyamtg satyamtg requested a review from rgaudin August 13, 2020 14:00
@satyamtg

> Should we add a function that returns the "../" * x? It might be a bit more readable.

I have added a function get_back_jumps(nb_jumps) in utils.py.
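
For readers without the diff at hand, a helper of this kind could look roughly like the following. Only the name and parameter come from the comment above; the body is an assumption based on the reviewer's `"../" * x` suggestion, not the merged code.

```python
def get_back_jumps(nb_jumps: int) -> str:
    """Return a relative-path prefix that climbs `nb_jumps` directory levels."""
    # Assumed body: repeat "../" once per level to climb.
    return "../" * nb_jumps
```

For example, `get_back_jumps(3) + "instance_assets/logo.png"` gives `"../../../instance_assets/logo.png"`, which reads more clearly at the call site than an inline string multiplication.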

@rgaudin rgaudin merged commit 0699dff into master Aug 14, 2020
@rgaudin rgaudin deleted the fix_duplicates branch August 14, 2020 10:47
Successfully merging this pull request may close these issues.

Too many duplicate data in zim