harden StoryArchiveReader??? #220

Open
philbudne opened this issue Jan 27, 2024 · 0 comments
Labels
archiving related to archive worker documentation Improvements or additions to documentation enhancement New feature or request
Comments

@philbudne
Contributor

I originally wrote StoryArchiveReader (in the now-misnamed story_archive_writer.py, to keep it near the Writer) as a quick-and-dirty one-off for testing. It is not a general-purpose WARC reader: it assumes that the contents of the WARC (after the initial "warcinfo" record) are alternating "response" records (with "200 OK" status) and "metadata" records with content-type application/x.mediacloud-indexer+json.
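
To make the expected layout concrete, here is a minimal sketch of a check on the sequence of WARC record types. It is not the existing reader code: the function name `check_record_sequence` is hypothetical, it takes plain record-type strings rather than real WARC records, and it assumes each "response" record comes before its "metadata" record.

```python
from typing import Iterable


def check_record_sequence(rec_types: Iterable[str]) -> None:
    """Raise ValueError unless rec_types is one 'warcinfo' record
    followed by alternating 'response' and 'metadata' records.

    Hypothetical helper: it only looks at record types, not at the
    HTTP status or the metadata content-type.
    """
    it = iter(rec_types)
    first = next(it, None)
    if first != "warcinfo":
        raise ValueError(f"expected initial 'warcinfo' record, got {first!r}")

    expected = "response"  # assumption: response precedes its metadata
    for n, rec_type in enumerate(it, start=2):
        if rec_type != expected:
            raise ValueError(
                f"record {n}: expected {expected!r}, got {rec_type!r}"
            )
        expected = "metadata" if expected == "response" else "response"

    # If we are still waiting for a metadata record, the last pair is incomplete.
    if expected == "metadata":
        raise ValueError("last 'response' record has no 'metadata' record")
```

A failure here would mean the file is probably not something StoryArchiveReader was written to read.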

At the very least, it could:

  1. document that it was a NON-goal to accept arbitrary WARC files.
  2. document what it expects
  3. check that the warcinfo record contains software: mediacloud story-indexer ArchiveWriter and give (at least) a warning???
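
Item 3 could look roughly like the sketch below. It assumes the "warcinfo" payload is a block of `name: value` lines and that the caller has already read it as bytes; `parse_warc_fields` and `check_software` are hypothetical names, not existing story-indexer functions. The match is a substring test, on the assumption that the writer may append other text (such as a version) to the software string.

```python
import warnings
from typing import Dict

# String quoted in this issue; anything after it in the field is ignored.
EXPECTED_SOFTWARE = "mediacloud story-indexer ArchiveWriter"


def parse_warc_fields(payload: bytes) -> Dict[str, str]:
    """Parse 'name: value' lines from a warcinfo payload (names lowercased)."""
    fields: Dict[str, str] = {}
    for line in payload.decode("utf-8", errors="replace").splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip blank lines and lines without a colon
            fields[name.strip().lower()] = value.strip()
    return fields


def check_software(payload: bytes) -> bool:
    """Return True if the payload names the expected writer; warn otherwise."""
    software = parse_warc_fields(payload).get("software", "")
    if EXPECTED_SOFTWARE not in software:
        warnings.warn(
            f"WARC written by {software!r}, not {EXPECTED_SOFTWARE!r}; "
            "StoryArchiveReader may not be able to read it"
        )
        return False
    return True
```

Warning rather than raising keeps the reader usable on files that happen to have the right layout but a different software string.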

Thoughts:
The WARC specification isn't particularly strict or prescriptive, so accepting any legal WARC file is likely an open-ended task.
StoryArchiveReader is simple on purpose, and enhancing it for other uses might be a bad idea.
If it's necessary to be able to read some other WARC files (of a particular format), it might be best to write a new class that accepts those WARC records, and to wrap the instantiation of a Reader into a function that looks at the "warcinfo" record and picks the right reader.
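
A rough sketch of that wrapper idea, under stated assumptions: the reader classes here are stand-ins (`StoryArchiveReaderStub` is not the real StoryArchiveReader, and `OtherFormatReader` and the "example-crawler" marker are invented for illustration), and the software string is passed in rather than read from the "warcinfo" record.

```python
from typing import List, Tuple, Type


class BaseReader:
    """Stand-in base class; a real reader would open and iterate the file."""

    def __init__(self, path: str) -> None:
        self.path = path


class StoryArchiveReaderStub(BaseReader):
    """Placeholder for the existing StoryArchiveReader."""


class OtherFormatReader(BaseReader):
    """Placeholder for a new class that accepts some other WARC layout."""


# (software marker, reader class) pairs, checked in order.
READERS: List[Tuple[str, Type[BaseReader]]] = [
    ("mediacloud story-indexer ArchiveWriter", StoryArchiveReaderStub),
    ("example-crawler", OtherFormatReader),  # hypothetical second format
]


def pick_reader(path: str, software: str) -> BaseReader:
    """Instantiate the reader whose marker appears in the software string."""
    for marker, reader_class in READERS:
        if marker in software:
            return reader_class(path)
    raise ValueError(f"no reader for WARC files written by {software!r}")
```

This keeps each reader simple: supporting another format means adding a class and one entry in `READERS`, not loosening StoryArchiveReader.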

@philbudne philbudne added documentation Improvements or additions to documentation enhancement New feature or request archiving related to archive worker labels Jan 27, 2024
@philbudne philbudne added this to the long-term milestone Jan 27, 2024