Add WACZ to 'Analyse contents of archive files' #1087

Open
Dclipsham opened this issue Apr 11, 2024 · 6 comments

@Dclipsham

Dclipsham commented Apr 11, 2024

Just to formalise a comment from #887,

fmt/1840 is emerging as a leading format for web archiving. Structurally it is a zip file containing a JSON manifest file and other structural elements along with payload data (see e.g. https://loc.gov/preservation/digital/formats/fdd/fdd000586.shtml and https://specs.webrecorder.net/wacz/1.1.1/)

Currently DROID identifies it by Container Signature. It would be extremely useful to be able to recursively probe the contents of WACZ in the same manner it is already possible to probe the other web archive container formats, WARC and ARC.
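
To illustrate the structure described above, here is a minimal Python sketch (not DROID code) that opens a WACZ as an ordinary ZIP and reads its JSON manifest. It assumes the manifest sits at the root as `datapackage.json` with a `resources` list, as laid out in the WACZ 1.1.1 spec linked above; `read_wacz_manifest` and `build_demo_wacz` are hypothetical helper names, and the demo archive is made-up test data rather than a valid web capture.

```python
import io
import json
import zipfile


def read_wacz_manifest(wacz_file):
    """Return the parsed datapackage.json from a WACZ, read as a plain ZIP."""
    with zipfile.ZipFile(wacz_file) as zf:
        with zf.open("datapackage.json") as fh:
            return json.load(fh)


def build_demo_wacz():
    """Build a tiny, made-up WACZ-like ZIP in memory (illustrative only)."""
    manifest = {
        "profile": "data-package",
        "wacz_version": "1.1.1",
        "resources": [
            {"name": "data.warc", "path": "archive/data.warc", "bytes": 12},
        ],
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(manifest))
        # Placeholder payload, not a real WARC record.
        zf.writestr("archive/data.warc", b"WARC/1.1\r\n\r\n")
    buf.seek(0)
    return buf
```

Calling `read_wacz_manifest(build_demo_wacz())` returns the manifest as a dict, whose `resources` entries point at the payload files inside the ZIP.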

@kathaurielle

Belated thanks David! What can I advise the user about this change in DROID's behaviour after the addition of fmt/1840?

Prior to the addition of the wacz sig 1840 in v.110, wacz files were IDed as zip files, and the DROID report listed all the files inside them.

Now, with the addition of fmt/1840, DROID doesn't scan inside the files; it just IDs them as wacz and outputs a one-line DROID report.

Is a fix on the radar, where it scans inside wacz files? Thank you! KP.

@Dclipsham
Author

Well, the reason is that a container signature identification is considered more definitive than a zip identification (because many container-based formats use zip as their wrapper), so when DROID finds the WACZ elements it is looking for, it doesn't currently go further.

The answer is, as with the other web archive formats ARC/WARC, to program DROID to scan the contents of WACZ also. It's really more a new feature request than a 'fix', and timing is a question for @steve-daly and @sparkhi as it requires development effort.

@iholliew

@Dclipsham @kathaurielle I’m now all clear on why DROID no longer reports on the contents of WACZ files. Hopefully, there will be a fix for this in time. I can only say that from my perspective and the way that we use DROID in my workplace to analyse digital collections, it is more important for us to be able to inspect the contents of a WACZ file than for the WACZ file to have its own unique identifier. This is why we’re opting to use an earlier signature file that supports this functionality (specifically, ‘V109’), rather than later releases.

@steve-daly

@Dclipsham @iholliew what would you like the behaviour to be if we added WACZ to the Web Archive archive formats? The reason it's more complicated than plain ZIP etc. is that the WACZ spec gives more meaning to some of the contained files and structure in the WACZ file, so parsing a WACZ file is more than just showing the contents arbitrarily. We could just treat WACZ as ZIP when decoding this way, but the need to understand the JSON manifest (for example) caused us to pause on this.

@Dclipsham
Author

I personally would just like to decode the zip so it can be explored in the CSV output. I don't have a requirement for DROID to do anything clever with the JSON manifest.
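
As a rough illustration of the "just decode the zip" behaviour requested here, the following Python sketch lists every entry in a WACZ and writes the listing as CSV, ignoring the manifest's meaning entirely. It is an assumption-laden sketch and not DROID's implementation: the function names and the `PATH`/`SIZE` columns are hypothetical and do not match DROID's real CSV export, and it does not recurse into the WARC files inside the archive.

```python
import csv
import io
import zipfile


def list_wacz_entries(wacz_file):
    """Yield (path, size_in_bytes) for each file in a WACZ, read as a plain ZIP."""
    with zipfile.ZipFile(wacz_file) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename, info.file_size


def entries_to_csv(wacz_file):
    """Return the entry listing as CSV text (hypothetical column names)."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["PATH", "SIZE"])
    writer.writerows(list_wacz_entries(wacz_file))
    return out.getvalue()


def build_demo_wacz():
    """Build a tiny, made-up WACZ-like ZIP in memory (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", "{}")
        # Placeholder payload, not a real WARC record.
        zf.writestr("archive/data.warc", b"WARC/1.1\r\n\r\n")
    buf.seek(0)
    return buf
```

For example, `print(entries_to_csv(build_demo_wacz()))` prints a header row followed by one row per file in the archive. Treating the manifest as just another entry is the design choice being asked for in this thread.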

@iholliew

Thanks @steve-daly. I second @Dclipsham's comments. I'm only interested in using DROID to report on the objects contained within the wacz file, like in 'v109'.
