Clarify SYS_GET_CMDLINE return string format #276

Open
StevenvdSchoot opened this issue Aug 7, 2024 · 4 comments

Comments

@StevenvdSchoot

The documentation for the SYS_GET_CMDLINE semihosting operation mentions that the operation "Returns the command line that is used for the call to the executable, that is, argc and argv".
The return fields are then defined as:

field 1
A pointer to a null-terminated string of the command line.
field 2
The length of the string in bytes.

It seems to me there are three interpretations of this definition:

  1. field 1 is supposed to contain the command string before argument splitting. A POSIX command string is parsed in several steps, and field splitting (converting the single command string into a command-name string and a list of command argument strings) happens somewhere in the middle. It is unclear whether the string in field 1 should be the raw, unprocessed command string, or should already have gone through the processing steps that precede field splitting (or anything in between).
    For example: for the command ./app.elf "hello $(echo world)", the unprocessed command string would be ./app.elf "hello $(echo world)"\0, whereas the command string processed up to field splitting would be ./app.elf "hello world"\0.
    Regardless of the level of processing, this is different from argv: field splitting and quote removal need to happen on the returned string before it can be used as argv.

  2. field 1 is supposed to contain a list of null-terminated strings, concatenated together. Although technically a null-terminated string is not forbidden to contain null characters, this feels like stretching the definition of field 1.
    For example: For the command ./app.elf "hello $(echo world)" field 1 would contain ./app.elf\0hello world\0

  3. field 1 is supposed to contain a list of strings, separated by spaces. This seems to be qemu's interpretation.
    For example: For the command ./app.elf "hello $(echo world)" field 1 would contain ./app.elf hello world\0
    This form yields a null-terminated string without null characters. However, splitting it up back into the original arguments is ambiguous. This can be seen from the example, where argv = {"./app.elf", "hello", "world"} or argv = {"./app.elf", "hello world"} or even argv = {"./app.elf hello world"} could be correct argument vectors that would all yield the given argument string.
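
To make the difference between interpretations 2 and 3 concrete, here is a minimal Python sketch. Python is used only to model the bytes; the helper names are hypothetical and not part of the semihosting specification. It builds the field 1 contents under each interpretation and shows that interpretation 3 cannot distinguish the three argument vectors from the example:

```python
def nul_separated(argv):
    """Interpretation 2: each argument followed by its own NUL byte."""
    return b"".join(arg.encode() + b"\0" for arg in argv)


def space_joined(argv):
    """Interpretation 3: arguments joined by spaces, one trailing NUL."""
    return " ".join(argv).encode() + b"\0"


# Three different argument vectors taken from the example in item 3.
candidates = [
    ["./app.elf", "hello", "world"],
    ["./app.elf", "hello world"],
    ["./app.elf hello world"],
]

# Interpretation 3 maps all three vectors to the same bytes...
space_forms = {space_joined(argv) for argv in candidates}
# ...while interpretation 2 keeps them distinct.
nul_forms = {nul_separated(argv) for argv in candidates}

print(len(space_forms), len(nul_forms))  # 1 3
```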

The examples assume POSIX commands, but I think it's trivial to see how Windows cmd, powershell, or any other command line spec yields similar situations.

I think more clarity is needed about the format of the string returned by SYS_GET_CMDLINE. The uncertainty about the format, in my view, defeats the purpose of standardizing the command in the first place, since the string can only be parsed by making assumptions about its provider (the host machine).

Personally, I think interpretation 2 (a list of strings separated by null characters) is the simplest and most useful one. Since command names and argument strings cannot contain null characters, parsing such a string back into a list of strings is trivial.
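
A minimal sketch of that parsing step, assuming the buffer has the layout from interpretation 2 and that the length covers every NUL byte (the second point is an assumption made for this illustration; the current documentation does not say). The function name is hypothetical:

```python
def parse_nul_separated(buf):
    """Split a buffer of NUL-terminated strings into a list of arguments."""
    if not buf:
        return []
    if not buf.endswith(b"\0"):
        raise ValueError("buffer must end with a NUL terminator")
    # Drop the final terminator, then split on the remaining NULs.
    return [part.decode() for part in buf[:-1].split(b"\0")]


buf = b"./app.elf\0hello world\0"
print(parse_nul_separated(buf))  # ['./app.elf', 'hello world']
```

Because NUL cannot occur inside an argument, no quoting or escaping rules are needed, and empty arguments survive the round trip.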

@statham-arm
Contributor

Good catch! I agree that the text "that is, argc and argv" is confusing and misleading.

The intention, and every implementation I've seen, is that SYS_GET_CMDLINE returns a single long string containing the whole command line. If an embedded program wants to contain the standard C main(int argc, char **argv) then it's responsible for splitting that single command line into argv words according to whatever rule seems sensible.

SYS_GET_CMDLINE does not return an integer usable as argc, or a list of strings ready for use as the elements of argv. That text should be changed to avoid making it look as if it does.

@StevenvdSchoot
Author

StevenvdSchoot commented Aug 8, 2024

Thanks for the quick response!

> The intention, and every implementation I've seen, is that SYS_GET_CMDLINE returns a single long string containing the whole command line.

This is actually what my question is about. The documentation does not define what "the whole command line" or "the command line" means. My original comment lists some interpretations I could come up with for what that term could mean. Each interpretation has different consequences for what an embedded program might be able to do with the received string.

For example, qemu currently passes just the command name and arguments separated by spaces as "the command line" string, which is different from what POSIX would define as the command line string.

@statham-arm
Contributor

POSIX doesn't have any definition of a single-string command line at all. In POSIX, the command line is communicated across each exec system call as a list of separate NUL-terminated strings, so that the declaration of main() as taking an argv array reflects what's truly going on at the OS level. In a POSIX context, the only time you see a program's command line in the form of a single string is if it's input to a shell, which has to split it into argv words before it can set up the exec that runs the actual subprogram. (But the rules it uses for doing that vary between shells, and also, are interleaved with lots of other processing.)

On the other hand, on Windows, the command line is communicated across a CreateProcess Win32 API call as a single string, which the subprocess can retrieve still in its single-string form via GetCommandLine. So a single string is the native form of the command line. In a console-subsystem executable, the typical crt0 code will retrieve that command line and split it into argv words, so as to comply with the C standard which requires the arguments to main() to be in that form. But it's not necessary to split the arguments at all: an application is also welcome to keep the command line unsplit, or to split it according to conventions unlike the default crt0 ones. And some do, because the splitting done by crt0 loses information, and not every application is happy to lose that information.
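
The information loss is easy to demonstrate. In this sketch Python's `shlex.split` stands in for a quote-aware startup-code splitter; it follows POSIX-shell-like rules, not the actual Windows crt0 rules, so treat it only as an illustration of the principle. The program name is hypothetical:

```python
import shlex

# Two distinct single-string command lines.
raw_a = 'tool.exe "a b"   c'
raw_b = "tool.exe 'a b' c"

argv_a = shlex.split(raw_a)
argv_b = shlex.split(raw_b)

# Both produce the same argv, so the original quoting style and spacing
# cannot be recovered from argv alone.
print(argv_a == argv_b, argv_a)  # True ['tool.exe', 'a b', 'c']
```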

The semihosting API follows Windows's convention in this respect. The command line passed across semihosting is a single string, with a single NUL terminator at the end. Questions of its semantics are left to each application to define.

If you have a tool like qemu that accepts a POSIX argument list for the semihosted program and needs to translate it into a single string, then the semihosting specification takes no position on how that should be done. I think a command-line interface for a tool like that ought to provide some way to specify the whole command line as a single string, because that's the most precise form you can specify it in. But if it also chooses to accept multiple POSIX argv words and glue them together in some way, how it does it is outside this specification.
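
As an illustration of how much latitude that leaves a host tool, the sketch below shows two ways of gluing the same argv into one string. Neither is mandated by the specification. `glue_plain` (a hypothetical helper) mirrors the space-joined behaviour described for qemu earlier in this thread, and `shlex.join` is just one possible quoting scheme, which only helps if the target's startup code understands POSIX-shell-style quotes:

```python
import shlex


def glue_plain(argv):
    """Join argv words with single spaces and no quoting."""
    return " ".join(argv)


argv = ["./app.elf", "hello world"]

plain = glue_plain(argv)
quoted = shlex.join(argv)

print(plain)   # ./app.elf hello world
print(quoted)  # ./app.elf 'hello world'

# Only the quoted form survives a round trip through a matching splitter.
print(shlex.split(plain) == argv)   # False
print(shlex.split(quoted) == argv)  # True
```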

A tool like that on Windows would surely do better to get the semihosting command line directly from the whole-string command line passed to the Windows tool – trying to recombine words from the argv generated by its crt0 would lose a lot of precision that it could have avoided losing.

@statham-arm
Contributor

Thinking about it a bit more, it sounds as if what you're really after is a specification of the convention used for breaking up the SYS_GET_CMDLINE string into argv?

If every libc's startup code did that in the same way, then tools like qemu would be able to take account of it when constructing a command string out of their argv words, and quote the string in such a way that the argv received by the semihosted program's main() contained the same words as the ones qemu had received on its command line, without any corruption in between, even if the argument words contained difficult characters like spaces or quotes.

Unfortunately, libc implementations don't agree on a standard convention for this. For example, picolibc simply breaks up the command line at spaces, with no quoting system at all, so that you just can't get a space to appear in the middle of an argv word. On the other hand, Arm Compiler 6's C library implements a quoting system using single and double quotes and backslashes, similar to POSIX in general but differing in details, and also optionally processes I/O redirection specifications by deleting them from the command line and reinitializing the stdio streams.

So there's no convention qemu could follow that makes the same effect happen in applications using both of those startup routines.
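
A small sketch of the mismatch: the same SYS_GET_CMDLINE string split two ways. `split_on_spaces` (a hypothetical helper) models the quoting-free splitting described for picolibc; `shlex.split` is only a rough stand-in for a quote-aware splitter and does not reproduce Arm Compiler 6's actual rules (it has no redirection handling, for instance):

```python
import shlex


def split_on_spaces(cmdline):
    """Break the command line at whitespace, with no quoting support."""
    return cmdline.split()


cmdline = './app.elf "hello world"'

print(split_on_spaces(cmdline))  # ['./app.elf', '"hello', 'world"']
print(shlex.split(cmdline))      # ['./app.elf', 'hello world']
```

A host tool that quotes its arguments gets the intended argv from the second splitter but stray quote characters from the first; a tool that does not quote loses the embedded space in both.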
