Add support for ISCC content hash for images (ie. phash variation) #212
Comments
I think something similar was discussed in #110. I would be happy to include this, but I would suggest making it a preprocessor helper function, documenting how to use it in the README, and recommending it there. Regarding the steps: Then I think we could add an iscc_content_hash function which combines the blocks above.
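A minimal sketch of what such a preprocessor helper could look like, assuming only grayscale conversion followed by a BICUBIC resize to 32x32 (the size that phash's default settings, hash_size=8 and highfreq_factor=4, resize to). The name `iscc_style_preprocess` is hypothetical and is not part of Imagehash or the ISCC reference code; border cropping and any other ISCC steps are left out.

```python
from PIL import Image


def iscc_style_preprocess(img, size=32):
    """Hypothetical helper: grayscale, then BICUBIC resize to size x size."""
    # Convert first so the resampling runs on a single luminance channel.
    gray = img.convert("L")
    # ISCC is described as using BICUBIC where Imagehash uses LANCZOS.
    return gray.resize((size, size), Image.BICUBIC)
```

Since the result already has the size phash resizes to, passing it to `imagehash.phash()` should leave the BICUBIC pixels unchanged, although that is worth verifying against the reference implementation before documenting it.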
When default settings are used, the target should be to follow the ISCC specification so that the result is identical to the reference implementation. Beyond that, it is a good idea to let users supply their own settings via parameters. "Crop empty borders of image": what is "empty"? Is it the background color? What happens if the image is all white? I am not sure every user will want that. It uses pixel (0, 0) as the background color, uses it as a mask, and then uses Pillow's getbbox() to get the bounding box inside the mask. getbbox() returns None if it finds nothing and then ... I also think that it is hard to implement well, but if it is in the spec, then follow the reference implementation and make it optional/adjustable via parameters.
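The border cropping described here can be sketched as follows. This is an illustration of the idea (pixel (0, 0) as background, difference mask, bounding box), not a copy of the reference implementation; the function name `trim_border` is hypothetical, and the sketch assumes an image in mode "L" or "RGB".

```python
from PIL import Image, ImageChops


def trim_border(img):
    """Hypothetical helper: crop borders that match the top-left pixel."""
    # Treat the colour of pixel (0, 0) as the background colour.
    background = Image.new(img.mode, img.size, img.getpixel((0, 0)))
    # Pixels equal to the background become 0 in the difference image.
    diff = ImageChops.difference(img, background)
    bbox = diff.getbbox()
    if bbox is None:
        # Uniform image (e.g. all white): nothing to crop, keep it as-is.
        return img
    return img.crop(bbox)
```

Returning the input unchanged for a uniform image is one possible answer to the "all white" question; making the step optional via a parameter would be another.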
About resampling methods: If the ISCC spec/reference implementation uses Pillow's BICUBIC, then the resampling function should be the same. Using a different one adds extra "noise" to the results, which should be avoided if possible. In some use cases, getting exact hashes matters. For example, I am currently using MariaDB as a database with combined dhash+phash hashes, where at least one of the two needs to be an exact match. With this, I can query them against database indexes very fast, and using dual hashes significantly reduces false negatives. Because of that, I am investigating how to implement Pillow's LANCZOS (or BICUBIC in this case) in Java to get rid of the extra noise when phash is calculated using Java code.
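A rough sketch of the dual-hash exact-match lookup, using Python's built-in sqlite3 as a stand-in for MariaDB. The table name, column names and helper functions are all hypothetical, and hashes are stored as hex strings purely to keep the example simple.

```python
import sqlite3


def create_hash_table(conn):
    # One index per hash column so either exact match can use an index.
    conn.execute("CREATE TABLE image_hash (name TEXT, dhash TEXT, phash TEXT)")
    conn.execute("CREATE INDEX idx_dhash ON image_hash (dhash)")
    conn.execute("CREATE INDEX idx_phash ON image_hash (phash)")


def add_image(conn, name, dhash_hex, phash_hex):
    conn.execute(
        "INSERT INTO image_hash VALUES (?, ?, ?)", (name, dhash_hex, phash_hex)
    )


def find_matches(conn, dhash_hex, phash_hex):
    # A row is a candidate when either hash matches exactly.
    rows = conn.execute(
        "SELECT name FROM image_hash WHERE dhash = ? OR phash = ? ORDER BY name",
        (dhash_hex, phash_hex),
    )
    return [row[0] for row in rows]
```

This only works if both sides produce bit-identical hashes for the same image, which is why resampling differences between implementations matter here.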
OK cool, yes, a pull request is welcome.
This feature request is about being able to calculate the ISCC content hash for images with the Imagehash library so that the hashes can be compared easily.
International Standard Content Code (ISCC) is a proposed standard content identifier for text, images, audio and video. Its codes contain three independent 64-bit similarity hash blocks (Metadata, Content, Binary) and a file checksum. All four can be compared independently. The ISO 24138:2024 proposal was published in 2024-05. The reference code and SDK are under the Apache licence.
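To illustrate what comparing the blocks independently could mean in practice, here is a minimal sketch that computes a Hamming distance per 64-bit similarity block. The dict layout and the names `hamming_distance` and `compare_blocks` are assumptions made for this example, not the ISCC SDK's API; the checksum would be compared for plain equality instead.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer hash blocks."""
    return bin(a ^ b).count("1")


def compare_blocks(code_a, code_b):
    """Compare each similarity block present in both codes.

    code_a and code_b are hypothetical dicts mapping a block name
    (e.g. "content") to that block's 64-bit value as an int.
    """
    return {
        name: hamming_distance(code_a[name], code_b[name])
        for name in code_a
        if name in code_b
    }
```

A distance of 0 means the block matches exactly; small distances suggest similar content for that block only.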
Home pages
Imagehashing part
ISCC content hashing for images is similar to the Imagehash library's phash but has preprocessing steps to ensure that the data is in a uniform format. It also doesn't use NumPy in its core, but uses Pillow for preprocessing the image (scaling, cropping, grayscaling, etc.), and ISCC uses BICUBIC for scaling, compared to the LANCZOS used by the Imagehash library.
I made a quick test: when steps 1-3 were disabled and resizing was done using Pillow's LANCZOS instead of BICUBIC, ISCC returned values identical to the Imagehash library's phash.
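To get a feel for how much the resampling filter alone changes the input to the hash, a small sketch like the following counts how many of the 32x32 grayscale pixels differ between LANCZOS and BICUBIC. The function name and the synthetic noise image are assumptions for illustration; a real check would use actual photos.

```python
import random

from PIL import Image


def count_resample_differences(img, size=32):
    """Count pixels that differ between LANCZOS and BICUBIC downscaling."""
    gray = img.convert("L")
    lanczos = gray.resize((size, size), Image.LANCZOS)
    bicubic = gray.resize((size, size), Image.BICUBIC)
    # Mode "L" stores one byte per pixel, so bytes can be compared directly.
    return sum(1 for a, b in zip(lanczos.tobytes(), bicubic.tobytes()) if a != b)


# Deterministic noise image as a stand-in for a real photo.
rng = random.Random(0)
noise = Image.frombytes(
    "L", (256, 256), bytes(rng.randrange(256) for _ in range(256 * 256))
)
print(count_resample_differences(noise))
```

Any differing pixel can flip a bit in the final hash, which is why exact-match use cases need the same filter on both sides.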
Steps for calculating ISCC content hash for images
Source code
Related investigation ticket in Wikimedia's phabricator