Notes on Canonical Schema #24

Open
bnewbold opened this issue Sep 18, 2019 · 0 comments
Labels
enhancement New feature or request

Hello wonderful arXiv maintainers!

This repository was linked from the arxiv-api mailing list. I see there are some JSON metadata schemas, which may not yet be set in stone, and I have some notes that I hope might be constructive. If there is a better place to leave this kind of feedback (the mailing list?) please let me know.

Listing.json

Refers to a ListingEvent type that doesn't seem to exist (https://github.com/arXiv/arxiv-canonical/blob/develop/schema/resources/Listing.json#L16)

If the date field is a full timestamp, it should probably be called datetime or timestamp and the timezone specified. Otherwise the type should be just date (I may be confused about JSON schema types).
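
To illustrate the distinction, here is a minimal Python sketch. It is not taken from the arXiv schema: the key names `date` and `timestamp` and the sample values are assumptions chosen for illustration. It shows a date-only value next to a timestamp that carries an explicit UTC offset, both serialized as ISO 8601 strings. In JSON Schema these would typically be strings with the `date` and `date-time` formats respectively.

```python
from datetime import date, datetime, timezone

# Date-only value: no time of day, so no timezone is needed.
announce_date = date(2019, 9, 18)

# Full timestamp: carries an explicit UTC offset so consumers need not guess.
announce_timestamp = datetime(2019, 9, 18, 20, 0, 0, tzinfo=timezone.utc)

# Hypothetical record layout; the key names are illustrative only.
record = {
    "date": announce_date.isoformat(),
    "timestamp": announce_timestamp.isoformat(),
}

print(record)
```

A consumer can parse the second value back with `datetime.fromisoformat` and get a timezone-aware object, which is not possible if the offset is left out.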

EPrint

I assume that the semantics are that each e-print version gets one of these full records, not just the most recent version. A little unclear because of the history field.

The size_kilobytes field seems like it should be attached to the file records, not the EPrint records. Also, storing the size as a count of bytes seems preferable.

Is there a controlled vocabulary for reason_for_withdrawal, or is that a free-form text field? I'd be interested in reusing such a controlled vocabulary if one existed.

What is the proxy field? Is it for the case where a party has submitted an e-print on behalf of the author(s)?

Many e-prints (an increasing number?) are not in English; it would be helpful to have the language of the work in its own field instead of trying to parse it out of the comments. The "number of pages" is also frequently mentioned and could be its own field, but that seems less important than the language.

File

The "checksum" field has little context. CRC32, MD5, SHA-1, SHA-256? Will it be prefixed by the hash type? An array would allow checksums of multiple types. It is very helpful for external integrations and auditing if one can assume there is at least one checksum type annotated for all files in the repository, even if the specific checksums used change over time.
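
One possible shape for this, sketched in Python under stated assumptions: each checksum is stored as a string prefixed by its hash type, and a file carries a list of them. The `<algorithm>:<hex digest>` format and the helper names `file_checksums` and `parse_checksum` are hypothetical, not something the current schema defines.

```python
import hashlib
from typing import List, Sequence, Tuple


def file_checksums(data: bytes, algorithms: Sequence[str] = ("sha1", "sha256")) -> List[str]:
    """Return one '<algorithm>:<hex digest>' string per requested algorithm."""
    return [f"{name}:{hashlib.new(name, data).hexdigest()}" for name in algorithms]


def parse_checksum(value: str) -> Tuple[str, str]:
    """Split a prefixed checksum back into (algorithm, hex digest)."""
    algorithm, _, digest = value.partition(":")
    return algorithm, digest


# Hypothetical file content standing in for a PDF or source tarball.
checksums = file_checksums(b"example file content")
for entry in checksums:
    print(parse_checksum(entry))
```

With the prefix in place, the set of algorithms can change over time without breaking consumers: an auditor picks whichever entry it supports and ignores the rest.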

If files are updated/modified (e.g., the PDF is re-generated from source), does that count as an "updated" for the related EPrint record? If not, is there any mechanism for noticing PDF changes across the repository as a whole?

Other

Separate from the schema, I'm curious whether there will be some public mechanism to consume events from the Kinesis feed, for example a public REST API with monotonically increasing integer event numbers. The similar Kafka log system, for instance, has a REST proxy with an endpoint that allows fetching events one-by-one.
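
The consumption pattern in question can be sketched with an in-memory stand-in. Everything here is hypothetical: the `EventLog` class, its `fetch(after, limit)` method, and the placeholder payloads are assumptions for illustration and do not reflect the Kinesis or Kafka APIs. The point is only that integer sequence numbers let a client resume from the last event it saw.

```python
from typing import Dict, List


class EventLog:
    """Hypothetical append-only log; sequence numbers start at 1 and never repeat."""

    def __init__(self) -> None:
        self._events: List[Dict] = []

    def append(self, payload: Dict) -> int:
        seq = len(self._events) + 1
        self._events.append({"seq": seq, "payload": payload})
        return seq

    def fetch(self, after: int = 0, limit: int = 100) -> List[Dict]:
        """Return up to `limit` events with seq > after, in order."""
        # Event with seq n lives at index n - 1, so seq > after starts at index `after`.
        return self._events[after:after + limit]


log = EventLog()
for name in ("example-a", "example-b", "example-c"):  # placeholder payloads
    log.append({"eprint": name})

# Client side: poll in small batches, remembering the last sequence number seen.
last_seen = 0
seen: List[str] = []
while True:
    batch = log.fetch(after=last_seen, limit=2)
    if not batch:
        break
    seen.extend(event["payload"]["eprint"] for event in batch)
    last_seen = batch[-1]["seq"]

print(last_seen, seen)
```

A client that stores `last_seen` between runs can pick up exactly where it stopped, without missing or re-processing events.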

bnewbold added the enhancement label Sep 18, 2019