MD5 File Identification Conflict with Aldus FreeHand Drawing Signature #1004

Open
JohannesKarlsen99 opened this issue Sep 27, 2023 · 5 comments

Comments

@JohannesKarlsen99

We are having trouble identifying MD5 files. In some cases, the content of an MD5 file matches the signature of an Aldus FreeHand Drawing file.

We tried to work around this by using the checkForExtensionMismatches method within the binarySignatureIdentifier class. However, this version of Aldus FreeHand has no file extension defined, so no mismatch is detected. From inspecting the code, this behavior appears to be deliberate.

This raises a concern: if a file format is defined without a file extension in its specification, then a match against that format's signature should be deemed inaccurate whenever the file does have an extension. The current implementation seems counterintuitive, as it doesn't follow the criteria specified for the file format.

If avoiding mismatches for formats without extensions was an intentional design decision, please clarify the reasoning behind it. In the scenario described, it appears to introduce errors in file identification.

@sparkhi
Collaborator

sparkhi commented Sep 27, 2023

Could you please share an example file here so we can explore further? Also, can you please confirm whether you are using the GUI or the CLI?

@JohannesKarlsen99
Author

To clarify, we are using neither directly: we have developed code that integrates DROID's file identification capabilities into our own application.

I've attached an example file: example.md5.zip

@Dclipsham

Interesting scenario, not covered in the original FF ID documentation (https://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf, see section 3, starting at page 13).

As the signature for FreeHand 1 uses only ASCII characters from the 0-f range, there's a small chance of false positives for formats such as MD5, SHA-1 and SHA-256, which are internally just a stream of hex represented as ASCII (albeit usually with a filepath appended if generated with utilities such as md5sum or sha256sum). For a 4-byte signature that chance is 1 in 65,536 (16^4), rather than the 1 in ~4 billion (256^4) you'd normally expect for an entirely random clash.
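
The odds quoted here can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 4-byte signature and that each byte of a checksum file is drawn uniformly from the 16 ASCII hex digits. The helper names `clash_odds` and `empirical_rate` are made up for illustration, and any prefix passed to `empirical_rate` is a stand-in, not the actual FreeHand 1 signature bytes.

```python
import hashlib

HEX_DIGITS = 16        # distinct byte values in an ASCII hex digest (0-9, a-f)
BYTE_VALUES = 256      # distinct values of an arbitrary byte
SIGNATURE_LENGTH = 4   # assumed signature length in bytes


def clash_odds(alphabet_size, length=SIGNATURE_LENGTH):
    """Return N such that a random clash has a 1-in-N chance."""
    return alphabet_size ** length


def empirical_rate(prefix, samples):
    """Fraction of MD5 hex digests (of the strings "0", "1", ...) that
    start with the given ASCII prefix. The prefix is a hypothetical
    stand-in for a signature made only of hex characters."""
    hits = 0
    for i in range(samples):
        digest = hashlib.md5(str(i).encode()).hexdigest().encode("ascii")
        if digest.startswith(prefix):
            hits += 1
    return hits / samples


print(clash_odds(HEX_DIGITS))    # 65536
print(clash_odds(BYTE_VALUES))   # 4294967296
```

At 1 in 65,536, a collection with a few hundred thousand checksum files would be expected to contain a handful of false positives, which is consistent with clashes being seen in practice.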

It may be possible for the FreeHand signature to be strengthened - I'd probably be looking towards @thorsted for advice there.

Regarding the behaviour where a format has an empty external signature, I can see arguments in both directions. Particularly in the Macintosh world (as in this specific case), extensions were sometimes used by convention rather than specification, so the lack of an official extension doesn't preclude people from using one. Normally, however, when a PRONOM external signature field has been left empty, it's because of a lack of a clear convention. So I don't think it would be a bad thing to have a mismatch flag where a format entry lacks an associated extension but a file instance has one.
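
The two behaviours under discussion can be put side by side in a short sketch. This is not DROID's code: `has_extension_mismatch` and its `flag_when_format_has_none` switch are hypothetical names, and treating a file with no extension as "no mismatch" is an assumption made here for simplicity.

```python
import os


def has_extension_mismatch(filename, format_extensions,
                           flag_when_format_has_none=False):
    """Compare a file's extension with the extensions listed for the
    format that its binary signature matched.

    flag_when_format_has_none=False mirrors the behaviour reported in
    this thread (no extensions on the format means no mismatch);
    True is the alternative being proposed.
    """
    _, ext = os.path.splitext(filename)
    ext = ext.lstrip(".").lower()
    if not ext:
        # Assumption: a file without an extension is never a mismatch.
        return False
    known = {e.lower() for e in format_extensions}
    if not known:
        return flag_when_format_has_none
    return ext not in known


# A format entry with no extensions, matched by a file named example.md5:
print(has_extension_mismatch("example.md5", []))        # False
print(has_extension_mismatch("example.md5", [],
                             flag_when_format_has_none=True))  # True
```

Making the stricter rule a switch rather than the default would leave existing results unchanged for users who rely on the current behaviour.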

@thorsted

thorsted commented Sep 27, 2023

Interesting issue. More of a PRONOM issue than DROID.

David is correct regarding file extensions. There are many Macintosh-only formats which were never assigned an extension, as extensions were unnecessary in Mac OS. FreeHand versions 1 and 2 are examples of this. Based on later versions, released when the software was cross-platform, one could assign .FH1 and .FH2, but I don't agree this is necessary, as the original files will never have these extensions in the real world.

Identification of MD5 files is the more difficult issue. A defined binary signature will always have priority over an extension-only signature.

That being said, I can look at strengthening the FreeHand 1 signature to include more bytes, but who is to say once that is released your MD5 files won't clash with another signature?
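
The priority rule mentioned above (a binary signature match outranks an extension-only match) explains why the MD5 identification is lost. A minimal sketch, assuming identification results arrive as two plain lists; the function `identify` and the format labels are illustrative and do not reflect DROID's internal API:

```python
def identify(binary_hits, extension_hits):
    """Return (formats, method), preferring binary signature matches.

    Extension-only matches are consulted only when no binary
    signature matched at all.
    """
    if binary_hits:
        return list(binary_hits), "signature"
    if extension_hits:
        return list(extension_hits), "extension"
    return [], "none"


# An MD5 file whose first bytes happen to match a binary signature:
print(identify(["Aldus FreeHand Drawing 1"], ["MD5 File"]))
# (['Aldus FreeHand Drawing 1'], 'signature')

# An MD5 file with no binary clash falls back to its extension:
print(identify([], ["MD5 File"]))
# ([], ...) is not returned here; the result is (['MD5 File'], 'extension')
```

Under this rule, lengthening one signature only removes one source of clashes; any other short signature made of hex characters could still take precedence over the extension match.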

@JohannesKarlsen99
Author

I acknowledge the potential value in strengthening the FreeHand 1 signature, as this might help reduce the number of false positives. That said, my primary concern is how mismatch detection behaves when the matched file format has no defined extension. In our integration of DROID, several MD5 files were initially misidentified. Although the checkForExtensionMismatches method resolved many of these, I'm inclined to think that when a file has an extension and the matched format defines none, it should trigger a mismatch.

4 participants