Notes on Canonical Schema #24

bnewbold · 2019-09-18T22:47:27Z

Hello wonderful arXiv maintainers!

This repository was linked from the arxiv-api mailing list. I see there are some JSON metadata schemas, which may not yet be set in stone, and I have some notes that I hope might be constructive. If there is a better place to leave this kind of feedback (the mailing list?) please let me know.

Listing.json

Refers to a ListingEvent type that doesn't seem to exist (https://github.com/arXiv/arxiv-canonical/blob/develop/schema/resources/Listing.json#L16)

If the date field is a full timestamp, it should probably be called datetime or timestamp, with the timezone specified. Otherwise it should be a plain date; in JSON Schema terms that would be a string with format date rather than date-time (I may be confused about JSON Schema types).
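To make the date versus timestamp distinction concrete, here is a minimal sketch in Python. The two schema fragments and the `parse_listing_date` helper are hypothetical illustrations, not taken from Listing.json; they only show how a consumer might tell a plain date from a timezone-aware timestamp.

```python
from datetime import date, datetime

# Hypothetical JSON Schema fragments (not copied from Listing.json).
DATE_ONLY = {"type": "string", "format": "date"}       # e.g. "2019-09-18"
TIMESTAMP = {"type": "string", "format": "date-time"}  # e.g. "2019-09-18T22:47:27+00:00"


def parse_listing_date(value: str):
    """Parse a plain date, or a full timestamp that carries a UTC offset.

    Returns a `date` for "YYYY-MM-DD" input and a timezone-aware
    `datetime` for timestamp input. Timestamps without an offset are
    rejected, since their timezone would be ambiguous.
    """
    if "T" not in value:
        return date.fromisoformat(value)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry an explicit UTC offset")
    return parsed
```

Rejecting offset-free timestamps is one possible policy; a schema could instead document that all values are UTC.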

EPrint

I assume that the semantics are that each e-print version gets one of these full records, not just the most recent version. This is a little unclear because of the history field.

The size_kilobytes field seems like it should be attached to the file records, not the EPrint records. Also, storing the size as a count of bytes seems preferable.

Is there a controlled vocabulary for reason_for_withdrawal, or is that a free-form text field? I'd be interested in reusing such a controlled vocabulary if one existed.

What is the proxy field? Is it for the case where a party has submitted an e-print on behalf of the author(s)?

Many e-prints (an increasing number?) are not in English; it would be helpful to have the language of the work in its own field instead of trying to parse it out of the comments. The "number of pages" is also frequently mentioned and could be its own field, but that seems less important than the language.

File

The "checksum" field has little context. CRC32, MD5, SHA-1, SHA-256? Will it be prefixed by the hash type? An array would allow checksums of multiple types. It is very helpful for external integrations and auditing if one can assume there is at least one checksum type annotated for all files in the repository, even if the specific checksums used change over time.
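As a sketch of what type-annotated checksums could look like, the snippet below prefixes each digest with its algorithm name and returns a list so several hash types can coexist. The `algorithm:hexdigest` string format and both function names are assumptions made for illustration, not something the schema defines.

```python
import hashlib


def tagged_checksums(data: bytes, algorithms=("sha1", "sha256")):
    """Return one 'algorithm:hexdigest' string per requested algorithm."""
    return [f"{name}:{hashlib.new(name, data).hexdigest()}" for name in algorithms]


def verify_checksum(data: bytes, tagged: str) -> bool:
    """Check data against a single 'algorithm:hexdigest' string."""
    name, _, expected = tagged.partition(":")
    return hashlib.new(name, data).hexdigest() == expected.lower()
```

With this shape, an external auditor can verify a file using whichever listed algorithm it supports, and new algorithms can be added over time without changing the field's type.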

If files are updated/modified (e.g., a PDF is re-generated from source), does that count as an "updated" for the related EPrint record? If not, is there any mechanism for noticing PDF changes across the repository as a whole?

Other

Separately from the schema, I'm curious whether there will be some public mechanism to consume events from the Kinesis feed, for example a public REST API with monotonically increasing integer event numbers. As a comparison, the similar Kafka log system has a REST proxy with an endpoint that allows fetching events one-by-one.
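To illustrate the kind of interface I have in mind, here is a small in-memory sketch of a log with monotonically increasing integer event numbers. The `EventLog` class and its methods are hypothetical; this is not the Kinesis API or anything arXiv has published, just the consumption pattern a REST endpoint could expose.

```python
class EventLog:
    """Append-only log; each event gets the next integer number, starting at 0."""

    def __init__(self):
        self._events = []

    def append(self, payload) -> int:
        """Store an event and return its event number."""
        self._events.append(payload)
        return len(self._events) - 1

    def fetch(self, after: int = -1, limit: int = 10):
        """Return up to `limit` (number, payload) pairs with number > after."""
        start = after + 1
        stop = min(start + limit, len(self._events))
        return [(n, self._events[n]) for n in range(start, stop)]
```

A consumer only needs to remember the last event number it processed and pass it as `after` on the next request, which makes resuming after a crash straightforward.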


bnewbold added the enhancement New feature or request label Sep 18, 2019
