Possible fix for extended ASCII issue (#490) #666

Open
wants to merge 4 commits into base: master


kirk-sayre-work
Contributor

This is a potential fix for #490. Based on the behavior of oledump and of some VBA payload decoders that decode extended ASCII strings, it looks like VBA code bytes should be left unmodified after they have been decompressed, with no Unicode string conversion. The Unicode conversion tries to do something smart and useful with the "special" Office extended VBA characters, but the result is that these single-byte extended ASCII characters are converted to multi-byte Unicode characters, which breaks the VBA payload decoders.

Additionally, if you loop through a string containing extended ASCII characters in VBA, pull out each character with Mid(), and print it with Debug.Print, the original single-byte extended ASCII character is printed (i.e. what you get from the decompressed VBA stream before Unicode conversion).
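To illustrate the effect described above, here is a minimal Python sketch. It is not taken from the patch or from olevba; the byte values are made up for illustration, and cp1252 is assumed as the document code page.

```python
# Illustrative byte values, as they might appear in a decompressed VBA stream.
# These are NOT taken from a real sample.
raw = bytes([0x41, 0x80, 0x99, 0xE9])

# Decoding as cp1252 changes the numeric value of bytes in the 0x80-0x9F range:
# 0x80 becomes U+20AC and 0x99 becomes U+2122, while 0xE9 stays U+00E9.
text = raw.decode("cp1252")
code_points = [ord(ch) for ch in text]

# Re-encoding as UTF-8 turns each extended character into several bytes.
utf8 = text.encode("utf-8")

print([hex(b) for b in raw])           # ['0x41', '0x80', '0x99', '0xe9']
print([hex(cp) for cp in code_points]) # ['0x41', '0x20ac', '0x2122', '0xe9']
print(len(raw), len(utf8))             # 4 9
```

A decoder that does arithmetic on per-character values (for example, subtracting a key from each one) sees 0x20AC instead of 0x80 after the conversion, so its output no longer matches what VBA itself would compute.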

@decalage2
Owner

This is done on purpose: the VBA source code is converted to unicode for Python 3, and to UTF-8 for Python 2, so that we always get the VBA code as a native str whatever the Python version. For Python 2 I also chose UTF-8 so that any application calling olevba through its API always gets a byte string with a known encoding. If we keep raw bytes instead, they can be encoded with any code page, and that is hard to handle properly from calling applications.
So, do you have specific samples I could try, to understand what the issue is with the unicode conversion? (Maybe it's due to a code page that is not properly handled by Python, or my code has a bug.) I see there is one sample mentioned in #490; do you have others?
Or, if you have a specific need to get the raw bytes instead of the Unicode/UTF-8 version, I can look at the API to make that easier to get.

@decalage2 decalage2 self-requested a review March 10, 2021 20:40
@decalage2 decalage2 self-assigned this Mar 10, 2021
@kirk-sayre-work
Contributor Author

kirk-sayre-work commented Mar 10, 2021

Here are some example in-the-wild (ITW) maldocs from the last 7 days or so:

https://bazaar.abuse.ch/sample/b7153cc8f00e1f39c16da557acb8a43f57eed55a371674e995a8ac808e047ab4/
https://bazaar.abuse.ch/sample/c84b1478fdf53dd00791f5ceb8e7744493964c363a8c02ea4f24600dab28fb83/
https://bazaar.abuse.ch/sample/f451591b470a934d1ef08937d9009f19e1d426651d87603d3cded34b54c53b6c/

The VBA macros build an encoded payload string through a series of single-character string concatenations and then decode it into the payload shell command. Several of the characters in the encoded payload string are "special" VBA extended ASCII characters. The decode loop works in ViperMonkey if olevba provides the raw byte values from the decompressed VBA. I tried to implement a mapping in ViperMonkey from the unicode translation of these extended ASCII characters back to the original extended ASCII values, but Python does not map these raw byte values to unicode in a nice, predictable way.

A flag to the VBA_Parser() constructor telling olevba to return raw code strings would work just fine for ViperMonkey usage.
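As a rough sketch of what "raw code strings" could look like on the consuming side: the helper below is hypothetical (it is not part of olevba or ViperMonkey) and assumes the caller already has the decompressed bytes. Decoding as latin-1 maps every byte to the code point with the same value, so per-character values survive unchanged and the conversion is reversible.

```python
def raw_code_to_str(data: bytes) -> str:
    """Hypothetical helper: carry raw byte values in a str, one character per byte."""
    # latin-1 maps byte N to code point N for all 256 values, so nothing is lost.
    return data.decode("latin-1")


# Check the property over every possible byte value.
all_bytes = bytes(range(256))
as_text = raw_code_to_str(all_bytes)

same_values = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))
round_trip = as_text.encode("latin-1") == all_bytes

print(same_values, round_trip)  # True True
```

The trade-off is that such a string no longer shows the characters the way Office would display them; it only preserves the numeric values a decode loop operates on.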

@decalage2
Owner

decalage2 commented Mar 11, 2021

Indeed, it's a tricky issue, because MS Office and Python do not handle some extended ASCII codes the same way for code page 1252 (and potentially other code pages): Python's cp1252 codec treats them as undefined and cannot convert them to unicode, while MS Office (actually Windows) treats them as special control codes. I am still investigating the issue to find a proper solution; all my findings are documented in issue #490.

I will see how to improve the API so that you can get the raw bytes directly. This should work fine for western code pages such as 1252, but you may run into other issues with more exotic code pages... This is why I converted everything to unicode in the first place. :-)
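The cp1252 gap mentioned above can be checked with a short Python snippet. This is only a sketch using the standard library codec; it does not reproduce olevba's own decoding code, and the use of `errors="replace"` is an assumption chosen to show how information is lost.

```python
# Find the byte values in 0x80-0x9F that Python's cp1252 codec cannot decode.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# With errors="replace" (an assumption for this sketch), every undefined byte
# collapses to the same replacement character, so the original value is lost.
replaced = {bytes([v]).decode("cp1252", errors="replace") for v in undefined}
print(replaced == {"\ufffd"})  # True
```

Because several distinct bytes end up as the same character, no mapping from the converted text back to the original byte values can be exact for these codes.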
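One example of the kind of issue raw bytes can cause with other code pages, sketched under the assumption of a document using code page 932 (Japanese); the string is made up for illustration. In a multi-byte code page, one character can span two bytes, so walking the raw bytes one at a time no longer matches walking the characters.

```python
# A two-character string encoded with a multi-byte code page (cp932 assumed).
text = "あA"
raw = text.encode("cp932")

print(len(text), len(raw))  # 2 3
print(raw)                  # b'\x82\xa0A'

# Viewing the raw bytes one per character splits the first character in two.
bytewise = [hex(b) for b in raw]
print(bytewise)             # ['0x82', '0xa0', '0x41']
```

A caller that receives raw bytes would need to know the code page to tell whether 0x82 is a character on its own or the first half of one.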

@decalage2 decalage2 added this to the Next Release milestone Jun 10, 2024