Refactor to improve usability of the corpus #9

Open
anjackson opened this issue Sep 10, 2014 · 2 comments

Comments

@anjackson
Member

Based on this feedback, we should consider refactoring the corpus.

Certainly the tools should be moved out or kept in a separate top-level folder. I think they have already been copied into the 'fidget' codebase, so I can check that and then remove them from here.

While I appreciate that the metadata files clutter things up as they are, I still like the idea of keeping the metadata close to the files. This is because it helps track who contributed what, and makes updating the metadata easier. Rather than completely separating them, how about a compromise?

Instead of putting metadata alongside each individual file, we collect it at the top level of each collection, and we make the top level of each collection consistent. Using the variations collection as an example, we switch to a standard layout like this:

ebooks/README.md    - Contains human-readable textual information
ebooks/metadata.md  - Contains metadata about the items in this collection
ebooks/data/        - Contains the actual sample files

So, you can reliably get to the test files by looking at */data/ from whatever the parent directory is.
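As a rough illustration of that lookup, here is a minimal Python sketch. It assumes the proposed layout, where every collection is a direct child of some parent directory and keeps its samples under `data/`. The function name `find_sample_files` is made up for this example and is not part of any existing tooling.

```python
from pathlib import Path


def find_sample_files(root):
    """Return every regular file below any <collection>/data/ directory under root."""
    root = Path(root)
    samples = []
    # Each immediate child of root is treated as a collection (e.g. ebooks/).
    for data_dir in sorted(root.glob("*/data")):
        if data_dir.is_dir():
            # Recurse, since a collection may organise its samples in subfolders.
            samples.extend(sorted(p for p in data_dir.rglob("*") if p.is_file()))
    return samples


if __name__ == "__main__":
    # Hypothetical usage: run from the parent directory of the collections.
    for path in find_sample_files("."):
        print(path)
```

Because README.md and metadata.md sit outside `data/`, they never show up in the results, so a tool run over the output only sees sample files.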

I'd still like the option to include tool output, as we can't assume that we will be able to reliably re-run tools in the future. Following Ross's suggestion, we could arrange the top-level like this:

/corpora/       - Parent folder for corpora, e.g. /corpora/ebooks/
/scripts/       - Scripts that run tools and other processes.
/tool-results/  - Contains sample tool output.
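If we go this way, a small check could help keep collections consistent after the refactoring. The following is only a sketch: it assumes each collection under /corpora/ follows the README.md / metadata.md / data/ layout proposed above, and the names `check_collection` and `check_corpora` are hypothetical.

```python
from pathlib import Path

# Entries the proposed layout expects at the top level of every collection.
REQUIRED_FILES = ("README.md", "metadata.md")
REQUIRED_DIRS = ("data",)


def check_collection(collection):
    """Return a list of layout problems found in one collection directory."""
    collection = Path(collection)
    problems = []
    for name in REQUIRED_FILES:
        if not (collection / name).is_file():
            problems.append(f"{collection.name}: missing file {name}")
    for name in REQUIRED_DIRS:
        if not (collection / name).is_dir():
            problems.append(f"{collection.name}: missing directory {name}/")
    return problems


def check_corpora(corpora_root):
    """Check every collection found directly under the corpora/ folder."""
    problems = []
    for collection in sorted(Path(corpora_root).iterdir()):
        if collection.is_dir():
            problems.extend(check_collection(collection))
    return problems


if __name__ == "__main__":
    # Hypothetical usage: run from the repository root.
    for problem in check_corpora("corpora"):
        print(problem)
```

Something like this could live under /scripts/ and be run before merging changes.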

There are some other points I'd like to revisit.

Firstly, I'd like to be able to include third-party corpora, e.g. as git submodules. For example, there are lots of interesting files in the test corpus that someone set up for the fine-free-file command: https://git.fedorahosted.org/cgit/file-tests.git/tree/db (see also https://fedorahosted.org/file-tests/)
We can't necessarily distribute these corpora, but it would be nice to be able to make them easy to plug in.

Secondly, the idea was always that the metadata would be used to generate static web pages that let you explore the content (e.g. hosted via GitHub Pages). The longer-term idea was to add a continuous integration hook (e.g. Travis-CI) that runs tools and tests over the corpus and adds the results to the generated pages. I'd be interested in knowing if anyone else is interested in that approach.
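To make the page-generation idea a bit more concrete, here is a very rough Python sketch. It does not parse metadata.md at all, since that format is not settled. It only walks the proposed corpora/&lt;collection&gt;/data/ layout and produces a Markdown index that a static site generator could render. The function name `build_index` and the page layout are assumptions made for illustration.

```python
from pathlib import Path


def build_index(corpora_root):
    """Return a Markdown index listing each collection and its sample files."""
    corpora_root = Path(corpora_root)
    lines = ["# Corpus index", ""]
    for collection in sorted(p for p in corpora_root.iterdir() if p.is_dir()):
        lines.append(f"## {collection.name}")
        lines.append("")
        data_dir = collection / "data"
        files = []
        if data_dir.is_dir():
            files = sorted(p for p in data_dir.rglob("*") if p.is_file())
        for path in files:
            # Link relative to the corpora root so the page can be moved around.
            rel = path.relative_to(corpora_root).as_posix()
            lines.append(f"- [{path.name}]({rel}) ({path.stat().st_size} bytes)")
        if not files:
            lines.append("(no sample files)")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical usage: write the index next to the collections.
    Path("corpora/index.md").write_text(build_index("corpora"), encoding="utf-8")
```

A CI job could regenerate this page on every push and later extend each entry with tool results.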

@anjackson
Member Author

Note that any refactoring should be done on a fork first. Apart from anything else, SCAPE deliverables may be referencing individual resources in this data set, and so changing the structure would break those links.
