Hello wonderful arXiv maintainers!
This repository was linked from the arxiv-api mailing list. I see there are some JSON metadata schemas, which may not yet be set in stone, and I have some notes that I hope might be constructive. If there is a better place to leave this kind of feedback (the mailing list?) please let me know.
Listing.json
Refers to a ListingEvent type that doesn't seem to exist (https://github.com/arXiv/arxiv-canonical/blob/develop/schema/resources/Listing.json#L16).
If the date field is a full timestamp, it should probably be called datetime or timestamp, and the timezone specified. Otherwise the type should be just date (I may be confused about JSON schema types).
EPrint
I assume that the semantics are that each e-print version gets one of these full records, not just the most recent version. A little unclear because of the history field.
The size_kilobytes field seems like it should be attached to the file records, not the EPrint records. Also, storing the size as a count of bytes seems preferable.
Is there a controlled vocabulary for reason_for_withdrawal, or is that a free-form text field? I'd be interested in reusing such a controlled vocabulary if one exists.
What is the proxy field? Is it for the case where another party has submitted an e-print on behalf of the author(s)?
Many e-prints (an increasing number?) are not in English; it would be helpful to have the language of the work in its own field instead of trying to parse it out of the comments. The "number of pages" is also frequently mentioned and could be its own field, but that seems less important than the language.
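To make the size and language suggestions concrete, here is a rough sketch of the record shape they imply. The class names, every field other than the idea of a byte-count size, and the sample identifier are hypothetical and not taken from the current schema; the language tag format (something like BCP 47) is also just an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileRecord:
    # Hypothetical per-file record: the size lives here, as an exact byte count.
    filename: str
    size_bytes: int


@dataclass
class EPrintRecord:
    # Hypothetical per-version e-print record.
    identifier: str
    version: int
    language: Optional[str] = None    # assumed to be a tag such as "en" or "fr"
    page_count: Optional[int] = None  # optional, since not every e-print states it
    files: List[FileRecord] = field(default_factory=list)

    def total_size_bytes(self) -> int:
        # A per-e-print total can still be derived from the files when needed.
        return sum(f.size_bytes for f in self.files)


# Made-up sample values, for illustration only.
record = EPrintRecord(
    identifier="1234.56789",
    version=2,
    language="fr",
    page_count=12,
    files=[FileRecord("paper.pdf", 2048), FileRecord("source.tar.gz", 1024)],
)
print(record.total_size_bytes())
```

Keeping the size on each file means a consumer never has to guess which file a single e-print-level number refers to, and a byte count avoids rounding.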
File
The "checksum" field has little context. CRC32, MD5, SHA-1, SHA-256? Will it be prefixed by the hash type? An array would allow checksums of multiple types. It is very helpful for external integrations and auditing if one can assume there is at least one checksum type annotated for all files in the repository, even if the specific checksums used change over time.
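As an illustration of the array idea, here is a minimal sketch in which each checksum entry names its own algorithm. The entry layout (`algorithm` and `value` keys) and the function names are assumptions for the sake of the example, not something in the current schema; only the standard-library `hashlib` calls are real.

```python
import hashlib
from typing import Dict, Iterable, List


def compute_checksums(
    data: bytes, algorithms: Iterable[str] = ("sha1", "sha256")
) -> List[Dict[str, str]]:
    """Return one {"algorithm", "value"} entry per requested hash type."""
    entries = []
    for name in algorithms:
        digest = hashlib.new(name, data).hexdigest()
        entries.append({"algorithm": name, "value": digest})
    return entries


def verify(data: bytes, entries: List[Dict[str, str]]) -> bool:
    """Check data against every entry whose algorithm this client supports.

    Unknown algorithms are skipped, so the repository can add or retire hash
    types over time. At least one entry must actually be checked.
    """
    checked = 0
    for entry in entries:
        if entry["algorithm"] not in hashlib.algorithms_available:
            continue
        if hashlib.new(entry["algorithm"], data).hexdigest() != entry["value"]:
            return False
        checked += 1
    return checked > 0


# Made-up file content, for illustration only.
pdf_bytes = b"%PDF-1.5 example bytes"
checksums = compute_checksums(pdf_bytes)
print(verify(pdf_bytes, checksums))
```

An external auditor can then verify with whichever listed algorithm it supports, which is the property that matters for integrations.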
If files are updated/modified (e.g., a PDF is re-generated from source), does that count as an "updated" for the related EPrint record? If not, is there any mechanism for noticing PDF changes across the repository as a whole?
Other
Separate from the schema, I'm curious whether there will be some public mechanism to consume events from the Kinesis feed, for example a public REST API with monotonically increasing integer event numbers. The similar Kafka log system has a REST proxy with an endpoint that allows fetching events one-by-one.
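To show why monotonically increasing event numbers are convenient for consumers, here is a small sketch of a client that resumes from the last number it saw. The `EventLog` class is an in-memory stand-in for a hypothetical public endpoint; its `fetch(after, limit)` shape and all names are assumptions, not an existing arXiv, Kinesis, or Kafka API.

```python
from typing import Dict, List, Tuple


class EventLog:
    """In-memory stand-in for a hypothetical public event endpoint."""

    def __init__(self) -> None:
        self._events: List[Dict] = []

    def append(self, payload: str) -> int:
        # Sequence numbers start at 1 and increase by exactly one per event.
        seq = len(self._events) + 1
        self._events.append({"seq": seq, "payload": payload})
        return seq

    def fetch(self, after: int = 0, limit: int = 100) -> List[Dict]:
        # Return up to `limit` events with seq > after, in order.
        return self._events[after:after + limit]


def consume(
    log: EventLog, last_seen: int = 0, batch: int = 2
) -> Tuple[List[str], int]:
    """Read everything newer than `last_seen`; return payloads and new cursor."""
    payloads: List[str] = []
    while True:
        events = log.fetch(after=last_seen, limit=batch)
        if not events:
            break
        for event in events:
            payloads.append(event["payload"])
            last_seen = event["seq"]
    return payloads, last_seen


# Made-up event payloads, for illustration only.
log = EventLog()
for name in ("eprint-created", "file-updated", "eprint-withdrawn"):
    log.append(name)

seen, cursor = consume(log)
log.append("eprint-created")
new_events, cursor = consume(log, last_seen=cursor)
print(seen, new_events, cursor)
```

The only state a consumer has to persist is a single integer, and a gap in the numbers it receives tells it that it missed something.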