-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spec] Whitespace Handling #46
Comments
just a short note regarding this topic: when standardizing this, please keep in mind that one of the reasons for the success of the UltraStar txt file system was that people are able to easily open it with a random text editor and adjust it. This should stay possible in the future, despite random text editors probably not knowing the usdx format specification requirements ;-) |
I wonder if it wouldn't be easier for implementations to adhere to the spec if 'whitespace' included any Unicode whitespace. Modern programming languages typically supprt trimming and splitting on Unicode whitespace out of the box. Of course, this most likely won't matter in practice. Everything else sounds good! 👍 |
I'd be fine with that as well and I agree that it would probably simplify implementations (except if people use regex, where often This would technically break compliance of USDX (and probably other implementations) with the 0.x versions of the spec (because USDX only recognizes space and tab as whitespace). Alternatively this could be a 2.0.0 breaking change but that would make a compliant implementation pretty awkward as it would technically have to guess valid whitespace when checking the I'd definitively go the route of retroactively breaking compliance for existing implementations because – as you said – in practice this likely won't matter at all (especially since existing songs don't use other whitespaces than space and maybe tab). There's also an argument to be made to recognize all Unicode line terminators as valid line terminators although I have never really seen them used. This also could cause issues with certain |
The value of But I see, it's complicated. Maybe it'd be best to keep it simple and live with implementations taking some liberties here. |
The I think at some point we will have to accept that implementations will in practice probably just use the tools of the respective programming languages and not adhere to the spec 100% (which is fine by me). |
I think it makes sense to discuss the set of whitespace characters in this issue and open new issues for the question of handling empty lines and significant whitespace. A quick search leads me to the conclusion that there are basically two approaches to trimming whitespace in programming languages:
A third option for the spec could be to restrict the set of allowed white space characters, e.g. to just space and tab. Some programming languages seem to do this for example. Existing implementations behave differently:
I think that the most “compatible” set of whitespace characters would be just an ascii space and nothing else. In practice this hardly matters but if we want to make the format editable by humans I think allowing all Unicode spaces would be the way to go. If a file contains a Unicode whitespace this could be very hard to identify for non technical users. In the interest of compatibility we should maybe recommend that implementations only use the ascii space when generating song txts. This seems to be common practice already. Maybe there are some more comments on this, otherwise we could maybe start voting in a week or so. |
Summary / Vote nowAre we ready to vote on this? As a quick reminder: This is about the question what constitutes a "whitespace" in the context of the file format. I think the following options have been mentioned:
Please vote via emoticon reaction. |
thanks for summary @codello . yes let's vote now till 31th march. |
Sorry for the delay. The decision seems to be unanimous. I have created PR #53 to add this result to the spec. |
I approved and merged the pr #53 . |
Suggestion
Hello. I have been looking for an official spec of the UltraStar file format for years and I'm very happy that there is an effort now to standardize the format.
I think the spec should specify how whitespace is handled in UltraStar files. This includes:
What characters are recognized as whitespace?
USDX recognizes
U+0009
(Horizontal Tab) andU+0020
(Space) as whitespace characters (except forP 1
,P 2
markers, where only a Space is recognized). I suggest that the spec codifies this and specifies that each a single tab and a single space are recognized as whitespace. Specifically no other unicode characters from the "Whitespace" category (such asU+00A0
non-breaking space orU+2002
en space) are recognized as whitespace by the specification.Given that I have personally not seen many songs use tabs instead of spaces a simpler approach could also be to only allow the Space character.
What characters are recognized as line breaks?
This question is also discussed in #43.
I think USDX recognizes
U+000A
(Line Feed,\n
),U+000D
(Carriage Return,\r
) and Windows-Style line breaks (\r\n
). Given the various platforms and programming languages that are used to process songfiles I suggest the spec should specify all three types of line breaks as valid (single\n
, single\r
, and\r\n
).I also suggest for the the spec to recommend
\n
as the preferred line break character that programs should use when writing songfiles.How are empty lines handled?
The only place where USDX allows empty lines is between the header and the first note (and after the
E
nd). In fact USDX seems to allow any text between the header and the first note (or technically the first line starting with a note character). I suggest the spec clarifies where empty lines are allowed and where not. I suggest that the spec disallows empty lines for versions <2.0.0 and allows empty lines for version >=2.0.0. I think allowing empty lines in a textfile can improve readability but allowing empty lines would obviously be a breaking change.Alternatively the spec could mandate that textfiles must not include empty lines but could allow parsers to ignore empty lines if present. This is inspired by the JSON spec doing something similar for the BOM. Only allowing empty lines between the header and the first note seems kind of arbitrary to me.
Note: Maybe the phrasing of handling the byte order mark might be interesting for this spec as well:
In what places of the file is whitespace significant?
The spec should clarify where whitespace is ignored and where it is preserved. I can think of the following relevant places:
USDX ignores whitespace at the beginning and end of header keys and values. The following are equivalent:
I suggest that the spec codifies this. One implication of this decision is that audio/video/image files cannot be referenced if their name starts or ends with a space, but I think this is an acceptable limitation (and is the current behavior of USDX anyways).
Similarly the spec should clarify how whitespace in comma-separated lists (#30) are to be handled. I suggest that spaces around these commas should be ignored and that the following are equivalent:
USDX supports a single space after a
P
marker. So the following are equivalentI suggest that for version >=2.0.0 only
P1
is valid.USDX parses a note line as follows:
The last "exactly one space" is obviously important for notes whose text starts with a space. I suggest that for consistency the spec only allows a single space in the other places as well. The same would hold for line breaks (
- 12
and- 12 12
in relative mode).Use case
When parsing or generating an UltraStar file it is important to have a standardized approach to whitespace handling. This is even more important since there are some places in a textfile where whitespace is significant. A clear specification helps developers write correct parsers and helps prevent different programs from generating incompatible files.
I realize that in practice most textfiles probably don't differ as much as the proposal makes it out to be. I still think it's important that the spec is precise in this regard.
Extra info/examples/attachments
I have only really checked the USDX implementation of the textfile parser and don't currently know how other implementations approach these specifics.
I realize that this is quite a long issue. Please tell me if I should split this into multiple issues to be discussed separately.
The text was updated successfully, but these errors were encountered: