-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sam: Clarify the usage of UTF-8 characters in header #719
Comments
The problem here is simply that the regular expression has not been updated after the addition of the loosening to allow UTF-8 in certain field values. These test files are not made invalid, and the ”UTF-8 encoding may be used [in these field values]” text should be genially interpreted as augmenting and superseding the ASCII restriction in that part of the regular expression. That said, there is indeed a minor inconsistency here and we should improve the What it is really trying to say is that field values are strings of one or more non-TAB characters. In expanding this to allow UTF-8 characters, we should consider whether there are any other UTF-8 whitespace characters that we might want to disallow. It is indeed true that only a few particularly and individually specified fields allow UTF-8 characters beyond ASCII. IMHO the spec doesn't particularly need to explain why the particular choices have been made. OTOH it is useful for the lore to be known amongst the spec's maintainers and contributors. The general rule of thumb is that anything that would be processed or interpreted by a program is restricted to ASCII. Free text fields that are ignored by programs or are simply displayed to the user (but not interpreted by program code) are candidates for specifying as allowing UTF-8. The justification for this is that doing things like testing whether two strings represent the same text is non-trivial in UTF-8, as noted by @daviesrob in ga4gh/refget#2 (comment) and probably in other similar discussions too. Thus for example allowing UTF-8 in RNAME fields would make it difficult to determine whether two SAM records referred to the same chromosome while not adding any particular expressive power. |
We clearly need a footnote somewhere to explain the regexp may need some nuanced interpretation for UTF-8 capable fields. Attempting to include UTF-8 in the regexp will just break things horribly and add more confusion than it solves. I think maybe one way of handling this is to use POSIX character classes as they document the meaning of the terms.
Edit: actually if locale is set correctly, it may be that e.g. POSIX |
Use `[:print:]` in the header regex and note that for ASCII it is equivalent to `[ -~]` and that the aim is to forbid control characters. Fixes samtools#719.
There are lots of Unicode whitespace characters, but fortunately none of them have tab-like properties. There are however additional line separator characters, so I have added text to PR #670 (which is no longer draft) to emphasise that SAM file line endings must be ASCII — in particular LF or CR-LF.
Upon further reflection, I don't think we should do this. Disallowing them would invite people to think that such characters would act as field or line separators, but we want parsing those delimiters to remain doable by looking only for ASCII. This means that people could put U+2028 (line separator) characters in their
That's a pretty good idea. (And What the existing |
Use `[:print:]` in the header regex and note that for ASCII it is equivalent to `[ -~]` and that the aim is to forbid control characters. Fixes samtools#719.
Use `[:print:]` in the header regex and note that for ASCII it is equivalent to `[ -~]` and that the aim is to forbid control characters. Fixes samtools#719.
Use `[:print:]` in the header regex and note that for ASCII it is equivalent to `[ -~]` and that the aim is to forbid control characters. Fixes samtools#719.
This is in regard to Sequence Alignment/Map Format Specification (2022-08-22).
§ 1.3 "The header section" defines patterns for header lines:
This invalidates the following test examples:
test/sam/passed/hdr.PG3.sam
test/sam/passed/hdr.PG5.sam
test/sam/passed/hdr.SQ6.sam
The text "UTF-8 encoding may be used" for the
CL
andDS
fields does not remove the character set constraint. It also remains arbitrary as to why only some fields have this definition.The text was updated successfully, but these errors were encountered: