Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils/astring: add expected control characters #134

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

smitterl
Copy link

@smitterl smitterl commented Aug 7, 2024

Fixes: #133

In recent kernel boots, stripping raised error because of missing control characters.

Add them:
c reset screen
!p soft reset terminal
]104 - execute OS command '104'

These control characters are listed in the xterm control sequence cheat
sheet at
https://gist.github.com/justinmk/a5102f9a0c1810437885a04a07ef0a91

@smitterl
Copy link
Author

smitterl commented Aug 7, 2024

With !p and ]104 issue #133 doesn't happen and my test passes.
However I have not found any reference to explain these characters. @clebergnu can you help?

@smitterl
Copy link
Author

It looks to me the sequences are:

ESC c: terminal reset
ESC !p: terminal reset
ESC ]: introduces an OS command

I don't have details about the OS command 104.

Ref. https://gist.github.com/justinmk/a5102f9a0c1810437885a04a07ef0a91

@smitterl smitterl marked this pull request as ready for review August 16, 2024 11:12
@@ -37,6 +37,7 @@ def strip_console_codes(output, custom_codes=None):
output = f"\x1b[m{output}"
console_codes = "%[G@8]|\\[[@A-HJ-MPXa-hl-nqrsu\\`]"
console_codes += "|\\[[\\d;]+[HJKgqnrm]|#8|\\([B0UK]|\\)|\\[\\?2004[lh]"
console_codes += "|[c]|[!p]|[\\]104]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the problem with "c" (and actually with all the existing and added codes) is that we only assume they are matched from the beginning, but we don't force that. This means something like \x1bthis-is-a-broken-text-c would now be accepted and result in output his-is-a-broken-text-c.

Note this is not introduced by your change, it just makes it more visible (as matching a single "c" is quite probable).

The fix for this should be IMO to match from the beginning (especially since we erase characters from the beginning). So your fix should add |^c|^!p|^]104 to ensure only \x1bc/\x1b!p and \x1b]104 is matched.

As for the brackets, it's a regexp so [c] matches any character out of the brackets. At this point we don't expect any other standalone characters so I'd suggest just the c without the brackets.

Similarly the [!p] which actually matches only ! out of the !pfoo string as it matches one of the characters inside the brackets. So that one really has to be without the brackets.

The last one is actually generic and vendors are free to specify any ]\d+ strings so here I'd probably consider ^]\d+ rather than one of the variants, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems even worse then I though. Astring blindly ignores many issues, I have fixed some of them and added a selftest here: #135 but IMO it'd be even better to simply replace it with a regexp to substitute all console codes with "" instead of this custom handling.

Anyway I understand that this code is here and the behaviour was odd for a long time. So the question is whether we really want to do something about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case my comments were not clear, I'm ready to ack this patch once your part of the regexp works properly, therefore something like console_codes += "|^c|^!p|^]104 or console_codes += "|^c|^!p|^]\d+. @clebergnu you had some concerns, would this work for you?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try this, thanks for the review, @ldoktor AFAICR @clebergnu 's concern was that we first confirm those are actually console codes. I did it in all cases IMO, just not sure what the OS operation 104 is supposed to mean. As you might have seen I have reached out to many people in the OS, kernel space and no answer so far. But if the test passes, it passes, and the control characters all start with \x1b so I think it's safe to assume we're not breaking 'normal strings'.

Copy link
Author

@smitterl smitterl Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case my comments were not clear, I'm ready to ack this patch once your part of the regexp works properly, therefore something like console_codes += "|^c|^!p|^]104 or console_codes += "|^c|^!p|^]\d+.

Either I'm misunderstanding something or your proposal doesn't work. With the indicated regex the algorithm raises at "!p".

The problematic line that I reported starts with "ESC c" but no line starts with "ESC !p". The control character doesn't imply that the lines start with these characters. We see the soft reset command isn't the first, so for the same right "c" doesn't need to be the first. Also, in this xterm reference there's no indicated it must be the first https://gist.github.com/justinmk/a5102f9a0c1810437885a04a07ef0a91 Maybe the algorithm works a bit different from what we read in the code? I read out the tmp_word. Its value changes as listed below.

However, my original regex works and it doesn't filter out 'c' if there's no ESC in front of it as we can see in terms such as "Connected" or "avocado" right in the first line.

I think we should merge this as-is and not accept #135 for the outlined reasons.

Please let me know if you disagree. Otherwise I hope we can merge this because there's no workaround besides this patch for my tests.

tmp_word: [mConnected to domain 'avocado-vt-vm1'
Escape character is ^] (Ctrl + ])
LOADPARM=[        ]
Using virtio-blk.
Using SCSI scheme.
s390-ccw Enumerated Boot Menu.

 [0] default
...
[    0.558508] Checked W+X mappings: passed, no W+X pages found
[    0.558514] Run /init as init process
[    0.563984] systemd[1]: Successfully made /usr/ read-only.

2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: c
2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: [!p
2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: ]104^G
2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: [?7h[    0.564148] systemd[1]: systemd 256-13.el10 running in system mode (+PAM +AUDIT +SELINU
X -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIB
FDISK +PCRE2 +PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT +LIBARCHIVE)
[    0.564153] systemd[1]: Detected virtualization kvm.
[    0.564157] systemd[1]: Detected architecture s390x.
[    0.564159] systemd[1]: Running in initrd.

Welcome to 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clebergnu IIUC it should be OSC command: https://en.wikipedia.org/wiki/ANSI_escape_code#OSC which are not strictly defined (which is why I suggested the \d to catch them all).

As for this version I mentioned the brackets are unnecessary as you only define one or a few characters (it's regexp so the [] brackets means one of the characters.

And the c itself sounds a bit too vague without the ^ because it simply matches 1 character whenever c is present after the initial text. Take this as an example: aexpect.utils.astring.strip_console_codes("0\x1b[0m1\x1b2345\x1b67890123456789") fails but aexpect.utils.astring.strip_console_codes("0\x1b[0m1\x1b23c45\x1b67890123456789") succeeds and reports the c as part of the output, stripping the 2 simply because it cuts of the first character.

So yes, the #135 should rapidly change the situation but even before that one (and I think your commit should go first) shouldn't introduce more of the odd behaviours.

As for the !p, thanks for sharing the snippet (would be nice to get the full original output). From there it looks like your getting \x1b[!p and not just \x1b!p, which is why my regexp does not work. In that case it should be |^c|^\\[!p|^]104 or (IMO better) |^c|^\\[!p|^]\d+.

Copy link
Contributor

@clebergnu clebergnu Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I understand @ldoktor concern. @smitterl, would you be able to add #135 to your test with your reproducer?

Copy link
Author

@smitterl smitterl Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clebergnu with #135 as-is the error just reproduces. But I ran another test that uses the console and that one passes. I'm trying to figure out now what the right expression should be according to #135, |^c|^\\[!p|^]\d+ doesn't work. Python itself complains about the invalid escape sequence \d but |^c|^\\[!p|^]\\d+ doesn't work either.
I am still not sure I understand why the "^", I have not understood the algorithm yet. I hope to have a moment for this soon.

Fixes: avocado-framework#133

In recent kernel boots, stripping raised error because of missing
control characters.

Add them:
`c` reset screen
`!p` soft reset terminal
`]104` - execute OS command '104'

These control characters are listed in the xterm control sequence cheat
sheet at
https://gist.github.com/justinmk/a5102f9a0c1810437885a04a07ef0a91

Signed-off-by: Sebastian Mitterle <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

astring.strip_console_codes raises error on recent kernel messages
3 participants