Hello!
Apologies if this is a silly question, but can cdxj-indexer generate a list of pages (and potentially their titles) from a WARC file? I am thinking of something analogous to the 'Pages' tab in replayweb.page, which lists the captured pages and their titles rather than all of the many digital objects that make them up. I wondered whether there is an HTTP header that cdxj-indexer could use for this, but I don't see anything obvious.
The use case is that it would be great to provide a human-friendly list of pages on our archive catalogue entries. At the moment I've been generating CDXJ files with the default settings, but I can see researchers finding these confusing.
Many thanks,
Jake
I don't think this is something cdxj-indexer should support, since it does one thing (generate CDXJ files) and does it pretty well. You might be better off writing a custom utility program that uses warcio directly. Getting a page title will require parsing the responses that are HTML with something like BeautifulSoup. This seems to work ok?
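For illustration, here is a minimal sketch of such a utility. It is not the snippet from the original reply, and it makes a few assumptions: a "page" is taken to be any `response` record with status 200 and an HTML `Content-Type` (which will also pick up iframes and similar embedded documents), and the standard library's `html.parser` stands in for BeautifulSoup to avoid an extra dependency. The names `TitleParser`, `extract_title`, and `list_pages` are hypothetical helpers, not part of warcio or cdxj-indexer.

```python
import sys
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_bytes):
    """Return the whitespace-normalised page title, or None if there is none."""
    parser = TitleParser()
    # Assumption: decode as UTF-8 and replace undecodable bytes.
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def list_pages(warc_path):
    """Yield (url, title) for each HTML response record in a WARC file."""
    # Imported here so extract_title stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            headers = record.http_headers
            content_type = headers.get_header("Content-Type") or ""
            if headers.get_statuscode() != "200":
                continue
            if "html" not in content_type.lower():
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            # content_stream() undoes any Content-Encoding / chunking.
            yield url, extract_title(record.content_stream().read())


if __name__ == "__main__" and len(sys.argv) > 1:
    for url, title in list_pages(sys.argv[1]):
        print(f"{title or '(untitled)'}\t{url}")
```

Run against a WARC file path, this prints one tab-separated title and URL per line. Depending on how the crawl was made, the filter may need tightening to get closer to a true list of pages.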