Possible fix for extended ASCII issue (#490) #666
This is done on purpose: the VBA source code is converted to Unicode on Python 3 and to UTF-8 on Python 2, so that we always get the VBA code as a native str whatever the Python version. For Python 2 I chose UTF-8 so that any application calling olevba through its API always gets a byte string with a known encoding. If we kept the raw bytes instead, they could be encoded with any code page, which is hard to handle properly from calling applications.
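To make the "native str" idea concrete, here is a minimal sketch of that kind of conversion. It is not olevba's actual code: the helper name `to_native_str`, the default code page and the `"replace"` error policy are all assumptions chosen for illustration.

```python
import sys


def to_native_str(raw, codepage="cp1252"):
    """Hypothetical helper: turn decompressed VBA bytes into a native str.

    The "replace" error policy is an assumption for this sketch, not
    necessarily what olevba does with bytes the codec cannot decode.
    """
    text = raw.decode(codepage, "replace")
    if sys.version_info[0] >= 3:
        # Python 3: the native str type is Unicode text.
        return text
    # Python 2: the native str type is a byte string, so use a known encoding.
    return text.encode("utf-8")


vba_source = to_native_str(b'MsgBox "caf\xe9"')
print(type(vba_source).__name__)
```

Either way the caller receives a `str`, which is the property described above. The cost is that the original byte values are no longer directly available.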
Here are some example in-the-wild (ITW) maldocs from the last 7 days or so: https://bazaar.abuse.ch/sample/b7153cc8f00e1f39c16da557acb8a43f57eed55a371674e995a8ac808e047ab4/ The VBA macros build an encoded payload string through a series of single-character string concatenations and then decode it into the payload shell command. Several of the characters in the encoded payload string are "special" VBA extended ASCII characters. The decode loop works in ViperMonkey if olevba provides the raw byte values from the decompressed VBA. I tried to implement a mapping in ViperMonkey from the Unicode translation of these extended ASCII characters back to the original extended ASCII values, but Python does not map these raw byte values to Unicode in a predictable way. A flag to the VBA_Parser() constructor telling olevba to return raw code strings would work just fine for ViperMonkey.
Indeed, it's a tricky issue, because MS Office and Python do not handle some extended ASCII codes the same way for code page 1252 (and potentially other code pages): Python's cp1252 codec treats them as undefined and cannot convert them to Unicode, while MS Office (actually Windows) treats them as special control codes. I am still investigating the issue to find a proper solution; all my findings are documented in issue #490. I will see how to improve the API so that you can get the raw bytes directly. This should work fine for western code pages such as 1252, but you may run into other issues with more exotic code pages... This is why I converted everything to Unicode in the first place. :-)
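The cp1252 gap described above can be checked with the standard library alone. The sketch below lists the byte values that Python's `cp1252` codec refuses to decode and shows that a `"replace"` error policy (an assumption for illustration, not necessarily what olevba uses) collapses them all into the same replacement character.

```python
def undefined_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec cannot decode."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


undefined = undefined_cp1252_bytes()
print([hex(v) for v in undefined])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# With errors="replace", every undefined byte becomes U+FFFD, so the
# original byte value cannot be recovered from the Unicode text.
replaced = bytes(undefined).decode("cp1252", "replace")
print(set(replaced) == {"\ufffd"})
```

Once two different bytes map to the same character, no reverse mapping on the Unicode side can tell them apart.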
This is a potential fix for #490. Based on the behavior of oledump and of some VBA payload decoders that decode extended ASCII strings, it looks like the VBA code bytes should be left unmodified after they have been decompressed, i.e. no Unicode string conversion. The Unicode conversion attempts to do something smart and useful with the "special" Office extended VBA characters, but as a result these single-byte extended ASCII characters are converted to multi-byte Unicode characters, which breaks the VBA payload decoders.
Additionally, if you loop through a string containing extended ASCII characters in VBA, pull out each character with Mid(), and print it with Debug.Print, the original single-byte extended ASCII character is printed (i.e. what you get from the initial decompressed VBA stream prior to Unicode conversion).
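The breakage can be reproduced in a few lines. This is a sketch under stated assumptions: the single-byte XOR decoder and the key `0xAA` are hypothetical stand-ins for a macro's decode loop, not taken from the sample above.

```python
KEY = 0xAA  # hypothetical key, chosen so the encoded bytes fall above 0x7F


def xor_decode(data, key=KEY):
    """Hypothetical stand-in for a VBA decode loop that works byte by byte."""
    return bytes(b ^ key for b in data)


# Encoded payload as it would sit in the decompressed VBA stream.
raw = bytes(b ^ KEY for b in b"cmd")  # b'\xc9\xc7\xce'

# What a caller sees after a cp1252 -> Unicode -> UTF-8 conversion.
converted = raw.decode("cp1252").encode("utf-8")

print(len(raw), len(converted))        # each high byte became two bytes
print(xor_decode(raw))                 # decodes correctly from the raw bytes
print(xor_decode(converted) == b"cmd")  # False: the converted bytes do not decode
```

The same mismatch appears at the character level: byte `0x80` decodes through cp1252 to U+20AC, so a decoder that expects the value 128 for that character receives 8364 instead.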