harden StoryArchiveReader??? #220
Labels: archiving, documentation, enhancement
I originally wrote StoryArchiveReader (in the now misnamed story_archive_writer.py, to keep it near the 'Writer) as a quick and dirty one-off for testing. It is not a general-purpose WARC reader: it assumes that the contents of the WARC, after the initial "warcinfo" record, are alternating "response" records (with "200 OK" status) and "metadata" records with content-type application/x.mediacloud-indexer+json.
At the very least, it could check the "warcinfo" record for
software: mediacloud story-indexer ArchiveWriter
and give (at least) a warning???
Thoughts:
The WARC specification isn't particularly strict or prescriptive, so handling any legal WARC file is likely an open-ended task.
StoryArchiveReader is simple on purpose, and enhancing it for other uses might be a bad idea.
If it's necessary to be able to read some other WARC files (of a particular format), it might be best to write a new class that accepts those WARC records, and to wrap the instantiation of a Reader into a function that looks at the "warcinfo" record and picks the right reader.
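A minimal sketch of that idea, assuming the body of the "warcinfo" record has already been read into bytes (for example with whatever WARC library the reader already uses). The names parse_warcinfo_fields, pick_reader and READERS are hypothetical, the StoryArchiveReader class here is only a stand-in so the example runs, and matching the software value by prefix is an assumption, not something the existing code does:

```python
import warnings
from typing import Dict, List, Tuple, Type

# Value quoted in this issue; matching it by prefix is an assumption.
EXPECTED_SOFTWARE = "mediacloud story-indexer ArchiveWriter"


class StoryArchiveReader:
    """Stand-in for the existing reader, present only so the sketch runs."""

    def __init__(self, path: str):
        self.path = path


def parse_warcinfo_fields(payload: bytes) -> Dict[str, str]:
    """Parse the "name: value" lines of a warcinfo record body."""
    fields: Dict[str, str] = {}
    for line in payload.decode("utf-8", errors="replace").splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip blank or malformed lines
            fields[name.strip().lower()] = value.strip()
    return fields


# Hypothetical registry: (software prefix, reader class) pairs.
# A new reader for another WARC flavor would be added here.
READERS: List[Tuple[str, Type]] = [
    (EXPECTED_SOFTWARE, StoryArchiveReader),
]


def pick_reader(warcinfo_payload: bytes) -> Type:
    """Choose a reader class from the warcinfo "software" field.

    Warns and falls back to StoryArchiveReader when nothing matches.
    """
    software = parse_warcinfo_fields(warcinfo_payload).get("software", "")
    for prefix, reader in READERS:
        if software.startswith(prefix):
            return reader
    warnings.warn(
        f"unrecognized warcinfo software {software!r}; "
        "StoryArchiveReader may not be able to read this file"
    )
    return StoryArchiveReader
```

With this shape, StoryArchiveReader itself stays simple: the warning and the choice of reader live in the wrapper function, and supporting another WARC format means adding a class and one registry entry.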