how to read images from the Compound File? #13

Open
jilieryuyi opened this issue Jun 13, 2023 · 6 comments

@jilieryuyi

How do I read images from a Compound File?

@richardlehane
Owner

Here's an example tool I wrote that uses this library: https://github.com/richardlehane/comdump
This will dump out the contents of a compound file to disk and is perhaps what you want?

@jilieryuyi
Author

Thank you, I'll give it a try

@jilieryuyi
Author

Unfortunately, the image cannot be exported correctly

@richardlehane
Owner

The Microsoft Compound File Binary File format is a container format with a file-system-like structure. All this library does is let you traverse that structure and access the contents of the files contained within. That is, it will export the content of all contained files correctly, but it won't interpret those files for you. How to interpret them depends on the type of file you are dealing with (many different applications have used MS-CFB as a container format, e.g. the old MS Office family). If you can provide more details about the types of files you are trying to access, or provide a sample, I may be able to help further.
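
To illustrate the "container" point, here is a minimal sketch (in Python, not using this library) that only checks whether a file starts with the 8-byte signature the MS-CFB specification defines for compound files. The function name `looks_like_compound_file` is made up for this example. Passing the check says nothing about what the contained streams hold.

```python
# Header signature defined by the MS-CFB specification.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_compound_file(header: bytes) -> bool:
    """Return True if the given bytes begin with the CFB signature.

    Hypothetical helper for illustration: it identifies the container
    only and does not interpret any stream inside it.
    """
    return header[:len(CFB_SIGNATURE)] == CFB_SIGNATURE


# Synthetic inputs for demonstration (not real documents).
fake_cfb_header = CFB_SIGNATURE + b"\x00" * 504
fake_zip_header = b"PK\x03\x04" + b"\x00" * 508
```

With a real file you would read the first 8 bytes (for example `open(path, "rb").read(8)`) and pass them to the function.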

@jilieryuyi
Author

1.zip

@richardlehane
Owner

So it looks like the file you are working with is an MS Word document. I used comdump to unpack the document and saw these streams:

[Screenshot: streams listed after unpacking the document with comdump]

It does appear there is an image inside the "Data" stream (I've highlighted the start of a JPG header in that stream), but simply cutting the first 478 bytes off the Data file didn't result in a working JPG file.
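
For anyone who wants to experiment further, here is a rough Python sketch of a slightly different heuristic: instead of cutting a fixed number of bytes, search the dumped stream for the JPEG start-of-image marker (`FF D8 FF`) and keep everything up to the last end-of-image marker (`FF D9`). The function name `carve_jpeg` and the sample bytes are made up for this example. This is naive carving, not a parser for Word's picture structures, so it may fail for the same reason the manual truncation did, and it will give a wrong result if the stream holds more than one image.

```python
from typing import Optional

JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the lead byte of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker


def carve_jpeg(data: bytes) -> Optional[bytes]:
    """Return the bytes from the first SOI marker to the last EOI marker.

    Hypothetical helper: a heuristic only. Returns None when no
    plausible JPEG span is found.
    """
    start = data.find(JPEG_SOI)
    if start == -1:
        return None
    # Use the last EOI because embedded thumbnails carry their own EOI.
    end = data.rfind(JPEG_EOI)
    if end == -1 or end < start:
        return None
    return data[start:end + len(JPEG_EOI)]


# Synthetic stream for demonstration: 478 filler bytes, a fake JPEG, trailing bytes.
sample_stream = b"\x00" * 478 + b"\xff\xd8\xff\xe0" + b"payload" + b"\xff\xd9" + b"trailing"
carved = carve_jpeg(sample_stream)
```

On a real dump you would read the "Data" file in binary mode, pass its bytes to `carve_jpeg`, and write any result to a `.jpg` file to see whether a viewer accepts it.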

My library just interprets the underlying container format (MS-CFB); it isn't capable of parsing the structures within the Word streams (or those of any of the other file formats that are based on MS-CFB). To write a program that parses these structures, you could refer to this MS documentation: https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22 Or you could use a library that does parse Word, like Aspose: https://products.aspose.app/words/parser

If it is just a one-off and you have access to MS Office, you could also follow this guide to extracting embedded images, which is to save the file as an HTML page: https://support.microsoft.com/en-au/topic/wd-how-to-extract-embedded-images-from-a-word-document-f478bf7f-3bba-6afb-6ddc-3eeb284af36b
