Watermark detection (but not removal) #29
Comments
Another half-baked thought: image-ify all pages, discard whitespace, and XOR against the front page or another reference standard to identify pixels that do not vary across pages: these are either margin decoration or a common watermark.
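A minimal sketch of the XOR idea, assuming each page has already been rasterized to an 8-bit grayscale image of identical size and is represented as a list of rows of integers. The function name `invariant_mask` and the background value of 255 are illustrative assumptions, not part of pdfparanoia.

```python
WHITE = 255  # assumed background value for 8-bit grayscale pages


def invariant_mask(pages, reference_index=0, background=WHITE):
    """Return a 2D list of booleans marking watermark candidates.

    A pixel is True when it is not background on the reference page
    and its XOR against the reference is zero on every page.
    """
    reference = pages[reference_index]
    height, width = len(reference), len(reference[0])
    # Start from the non-whitespace pixels of the reference page.
    mask = [[reference[y][x] != background for x in range(width)]
            for y in range(height)]
    for page in pages:
        for y in range(height):
            for x in range(width):
                # Any non-zero XOR means the pixel varies across pages.
                if page[y][x] ^ reference[y][x]:
                    mask[y][x] = False
    return mask


# Two tiny 2x3 "pages": the middle column is identical ink on both.
page_a = [[255, 0, 10], [255, 40, 255]]
page_b = [[255, 0, 99], [255, 40, 255]]
print(invariant_mask([page_a, page_b]))
```

The resulting mask only says where a repeated element sits on the page. It cannot tell a watermark from a margin decoration or a running journal title.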
Cool, but how do you get rid of those elements? You would have to randomly delete PDF elements until the resulting PNGs no longer contained those images. Might work. Also, this technique would accidentally remove journal titles in margins, which is bad, but okay if there is JSON metadata attached to the PDF somehow.
Unless you brute-force attempted to delete each individual element, I figure it's just a rapid filter to help detect watermarks. Of course, brute-force deletion might assist in creating a pdfparanoia profile for a new publisher, so perhaps the one-off inefficiency would prove worthwhile. A straight XOR would only work if each watermark instance was binary-identical to the next. With any image compression this would likely fail, so perhaps a less stringent comparison, seeking bytes/pixels that vary less than a certain threshold, discarding X outliers based on page count..?
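A rough sketch of the less stringent comparison, under the same assumptions as before (same-size 8-bit grayscale pages as lists of rows). The name `stable_mask`, the tolerance of 8 gray levels, and the default allowance of one outlier page per ten pages are all hypothetical choices for illustration, not tuned values.

```python
from statistics import median


def stable_mask(pages, tolerance=8, max_outliers=None, background=255):
    """Mark pixels whose value stays near the per-pixel median.

    Up to max_outliers pages may deviate by more than tolerance
    (for example a title page without the watermark).
    """
    if max_outliers is None:
        # Assumed default: allow one outlier page per ten pages.
        max_outliers = len(pages) // 10
    height, width = len(pages[0]), len(pages[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [page[y][x] for page in pages]
            mid = median(values)
            # Count pages that differ from the median by more than tolerance.
            outliers = sum(1 for v in values if abs(v - mid) > tolerance)
            # Ignore pixels that are effectively whitespace.
            is_ink = abs(mid - background) > tolerance
            row.append(is_ink and outliers <= max_outliers)
        mask.append(row)
    return mask


# Ten 1x3 pages: noisy watermark pixel, blank pixel, body-text pixel.
noisy = [100, 102, 98, 101, 99, 103, 97, 100, 101, 255]
pages = [[[noisy[i], 255, 0 if i % 2 else 255]] for i in range(10)]
print(stable_mask(pages))
```

Comparing against the per-pixel median instead of a single reference page means one atypical page (such as a cover) cannot skew the comparison.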
Make a way to detect whether or not a document is likely to have a watermark. There are a few different ways of detection that I can imagine:
Knowing that a watermark is present is really helpful, because it lets you track what percentage of your collection is watermarked. Other tools can then make informed decisions about what to do with a paper that has a known watermark.
Unknown watermarks are the worst, but there's no way to detect an unknown unknown.