We are examining malware on a VM that is cut off from the network and available drives. I would like to be able to copy-paste some information (hashes/checksums) and some Korean filenames into Google Translate.
I am using Python 3.6 on my local Ubuntu 16.04 machine. I set up a virtual environment and installed the Python packages described in requirements.txt:
virtualenv -p /home/cas/miniconda/bin/python --no-site-packages ocr
source ocr/bin/activate
pip install -r requirements.txt
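The exact packages in requirements.txt aren't listed in this README; assuming pytesseract (the Python wrapper for tesseract-ocr) and Pillow for image handling, it would look something like:

```
# Assumed requirements.txt contents (not the actual file)
pytesseract
Pillow
```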
The main OCR engine used is tesseract-ocr, which is installed with apt:
sudo apt-get install tesseract-ocr
On the virtual machine we have Python 2.7 on Windows. I had nothing to do with that config, but I'm really happy we have a Python interpreter!
1
- On the virtual machine I have mounted a USB image with OSFmount.
- The filenames are in Korean characters, which I want to investigate with Google Translate!
2
- On the VM, copy the file name into a file called "unicode_raw.txt".
- Run get_utf_codes.py to save out the UTF codes for OCR recognition (a sketch of what this script likely does follows this step).
- Note this is done on the VM; in this case we are doing OCR on output_ocr.txt.
- This is a good place to make the text as clear as possible by switching to a large font in Notepad.
- A high-quality image will give the OCR algorithm the best chance at accuracy.
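I don't reproduce get_utf_codes.py here; a minimal sketch of what it presumably does, assuming it reads unicode_raw.txt and uses Python's unicode_escape codec to write plain-ASCII \uXXXX codes into output_ocr.txt (kept Python 2.7 compatible, since that is what the VM has):

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of get_utf_codes.py -- runs on the VM (Python 2.7).
# Reads the raw Korean filename(s) and writes their \uXXXX escape codes,
# which are plain ASCII and therefore easy for tesseract to recognize.
import io

with io.open("unicode_raw.txt", "r", encoding="utf-8") as f:
    raw = f.read().strip()

# unicode_escape converts each non-ASCII character into a \uXXXX sequence
codes = raw.encode("unicode_escape")

with open("output_ocr.txt", "wb") as f:
    f.write(codes)

print(codes)
```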
3
- On the local machine, take a screenshot of the unicode representation for OCR analysis.
- Run snip_to_text.py on the local machine; the -i flag is the input PNG image and -o is the output file (a sketch of this script follows this step).
(ocr) cas@ubuntu:~/working_dir/python_ocr$ python snip_to_text.py -i mp3_chars.png -o mp3_chars_out.txt
- You should now have text in mp3_chars_out.txt (the -o param); look it over and make any corrections (for example, I removed some random spaces).
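The actual snip_to_text.py isn't reproduced here; a minimal sketch of how such a script could work, assuming requirements.txt provides pytesseract and Pillow on top of the tesseract-ocr install:

```python
# Hypothetical sketch of snip_to_text.py -- runs on the local machine (Python 3.6).
import argparse

import pytesseract
from PIL import Image

parser = argparse.ArgumentParser(description="OCR a screenshot into a text file")
parser.add_argument("-i", dest="image", required=True, help="input PNG screenshot")
parser.add_argument("-o", dest="output", required=True, help="output text file")
args = parser.parse_args()

# The screenshot only contains ASCII \uXXXX escape codes, so the default
# English tesseract model is enough; no Korean language pack is needed.
text = pytesseract.image_to_string(Image.open(args.image))

with open(args.output, "w") as f:
    f.write(text)

print(text)
```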
4
- Now we need to get back to UTF-8 characters.
- A hacky-AF solution is to simply print them to the console and a file as strings in Python (so you don't have to deal with the \ escape chars).
- This is done manually in print_utf.py (a sketch follows the command below).
- This also stores the output in a file called unicode_out.txt.
(ocr) cas@ubuntu:~/working_dir/python_ocr$ python print_utf.py
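I don't have the real print_utf.py here; a minimal sketch, assuming the corrected escape codes from mp3_chars_out.txt are pasted in by hand as a normal string literal so Python decodes the \uXXXX escapes back into real characters (the filename below is a made-up placeholder):

```python
# Hypothetical sketch of print_utf.py -- runs on the local machine (Python 3.6).
# Paste the corrected codes from mp3_chars_out.txt here as a string literal;
# Python decodes the \uXXXX escapes into the actual Korean characters.
chars = "\uc548\ub155\ud558\uc138\uc694.mp3"  # placeholder, not the real filename

print(chars)  # copy-paste this straight into Google Translate

with open("unicode_out.txt", "w", encoding="utf-8") as f:
    f.write(chars + "\n")
```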