
Support for post-unpack plugins #80

Open · Caesurus opened this issue Jul 8, 2021 · 5 comments

@Caesurus (Contributor) commented Jul 8, 2021

In a bunch of cases I'm finding that unpack plugins need access to multiple files at the same time in order to complete a full unpack. Here is an example: I ported the following Go code to Python:
packsparseimg.go

The idea is that a rawprogram*.xml file holds information about how to unpack the .img files that accompany it. For instance, there can be several userdata_x.img files, and the XML file contains the offsets at which the individual files must be written in order to reassemble the outer image.
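
To make the mechanics concrete, here is an illustrative sketch of that reassembly step, assuming the common Qualcomm rawprogram layout (the tag and attribute names are my assumptions, not taken from the ported code):

    import xml.etree.ElementTree as ET
    from pathlib import Path

    # Illustrative sketch: each <program> entry in rawprogram*.xml names a
    # source file and a start sector; write each file at its byte offset
    # into a single output image. Assumes a plain numeric start_sector.
    def reassemble(xml_path, src_dir, out_path, sector_size=512):
        with open(out_path, 'wb') as out:
            for prog in ET.parse(xml_path).getroot().iter('program'):
                name = prog.get('filename')
                if not name:  # entries without a filename carry no data
                    continue
                offset = int(prog.get('start_sector')) * sector_size
                out.seek(offset)
                out.write((Path(src_dir) / name).read_bytes())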

This isn't a problem if the outer container format is known. For instance, if it's a tar file, I could add functionality to the patool plugin to check for rawprogram*.xml files and process them. But if the outer container is a 7z file, I then have to duplicate that functionality in the 7z plugin. I could put this functionality in a helper class that's available to several plugins, but that doesn't feel very extensible.

What I'd like to see is a different type of plugin that registers a file pattern to look for in the extracted files. If a match is found, the plugin is called with the directory that contains the extracted files, so that it can re-assemble the .img files into something another plugin can then unpack in isolation.

In my opinion this makes things more modular and extensible, but I'd like your opinion. Are you open to this plan? If I implement it in a fork, will you consider a PR with this functionality? If it doesn't sound appealing or you have objections, I'd love to hear them and I'll adjust accordingly.

@dorpvom (Collaborator) commented Jul 9, 2021

Hi,
our current architecture makes this complicated to achieve, but I understand the appeal. I initially thought we might be able to solve this with some kind of callback, but that would have to happen in FACT_core, since this extractor only ever looks at a single file.
A post-unpack plugin that looks at the extracted content and searches for such an occurrence would probably work, but I'm not sure there is a way to build it that does not completely go against our current architecture.
Even looking at FACT_core, the unpack scheduler does not care which container a processed file comes from, so it has no easy way to check for dependencies between multiple files from the same source.
So: while I haven't encountered such a case, I'd be interested in supporting it. I think the next step should be a more specific sketch of where the new functionality would be added and how the interfacing components are affected. Then we could discuss it, and maybe we can even support you in implementing it.

@Caesurus (Contributor, Author) commented Jul 9, 2021

My rough idea was to add a method to UnpackBase, register_post_plugin(search_pattern, unpacker_name_and_function), that would store entries just like register_plugin does, but keyed by a glob/rglob search pattern, e.g. rawprogram*.xml.

The anatomy of the post extraction unpack plugin would be the same as any of the other plugins (name/version etc).
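
As a minimal sketch of what I mean (the registry attribute, entry layout, and callback signature below are assumptions for illustration, not existing fact_extractor API):

    # Hypothetical sketch -- none of these names exist in fact_extractor today.
    class UnpackBase:
        def __init__(self):
            # each entry: (glob pattern, plugin name, plugin version, callback)
            self.post_unpack_registry = []

        def register_post_plugin(self, search_pattern, name, version, callback):
            # callback(extracted_files, meta_data, tmp_dir) -> (extracted_files, meta_data)
            self.post_unpack_registry.append((search_pattern, name, version, callback))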

Then modify the Unpacker.unpack() function to do something along these lines:


    extracted_files, meta_data = self.extract_files_from_file(file_path, tmp_dir.name)
    extracted_files, meta_data = self._do_fallback_if_necessary(extracted_files, meta_data, tmp_dir.name, file_path)

+   extracted_files, meta_data = self.post_extraction_plugins(extracted_files, meta_data, tmp_dir.name, file_path)

    extracted_files = self.move_extracted_files(extracted_files, Path(tmp_dir.name))

The post_extraction_plugins() function would basically just iterate through the registered patterns, try to match each against the list of extracted_files, and call the corresponding function when a match is found.
What I don't love about this is that it could add a nontrivial amount of time to the process if a lot of these patterns are registered.
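
Roughly, under the same assumed registry as above, the dispatch might look like this (fnmatch-based name matching and the metadata bookkeeping are illustrative choices):

    from fnmatch import fnmatch
    from pathlib import Path

    # Hypothetical dispatch step -- assumes the post_unpack_registry sketched above.
    def post_extraction_plugins(self, extracted_files, meta_data, tmp_dir, file_path):
        for pattern, name, version, callback in self.post_unpack_registry:
            if any(fnmatch(Path(f).name, pattern) for f in extracted_files):
                # the plugin may reassemble files in tmp_dir and extend the list
                extracted_files, meta_data = callback(extracted_files, meta_data, tmp_dir)
                meta_data.setdefault('post_unpack_plugins', []).append(f'{name} {version}')
        return extracted_files, meta_data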

I think that I may still like the idea of adding a filter to the mix as well.
So, for example, only scan for a specific pattern if the top-level file's mime-type is within a given list...
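
That filter could be as simple as an optional allow list on each registry entry (again, every name here is hypothetical):

    # Hypothetical mime-type filter: a registered pattern is only searched for
    # when the top-level file's mime type appears in the plugin's allow list.
    def pattern_applies(mime_filter, top_level_mime):
        return mime_filter is None or top_level_mime in mime_filter

    # e.g. the register_post_plugin('rawprogram*.xml', 'qfil', '0.1', callback) sketch
    # above could grow a mime_filter=['application/zip', 'application/x-7z-compressed'] argument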

This seems like a fairly minimal set of changes to implement, and it would handle these specific use cases much more cleanly...

Here are some use cases I have encountered that rely on multiple files (usually there is a manifest file).
Once these files are processed, genericFS can then extract files from the resulting images.

  1. Block-based OTA Android images:
    https://invisiblek.github.io/lineage_wiki/extracting_blobs_from_zips.html
    These have a *transfer.list file that is used by sdat2img to unpack an image into a usable format.

  2. The Qualcomm QFIL utility splitting files and maintaining offsets in the rawprogram*.xml/patch.xml files.

  3. Custom packed formats. .rfw is one example; it also has a manifest file that's used to re-assemble files.

@Caesurus (Contributor, Author)

For the record: "fact_extractor is so neat, I use it every day" <-- Me... I just said that, and it's true.

@dorpvom (Collaborator) commented Sep 2, 2021

What is the status of this, by the way? Have you worked on a solution yet? I don't think we have on our side, though I see a possibility of trying out some ideas in the fall.

@Caesurus (Contributor, Author) commented Sep 2, 2021

So... the short version... I haven't started implementing it.

The longer version: I have a system that uses fact_extractor to extract files from other files, and I recently added support for extraction deduplication. The assertion is:

Given a specific file hash, and assuming the unpack plugins haven't changed, fact_extractor will extract exactly the same files each time.

So if an archive occurs in multiple firmware images, we shouldn't spend resources extracting it every time we see it. There are some inherent complexities that come with this, because you need to store a bunch of information about which plugins were used, plus the complexity of handling archives within archives, etc... anyway, I'm going off on a tangent.

All this to say that I rely on the plugin name and plugin version to determine if a file needs to be re-processed because of a change in the plugins.
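
Purely as an illustration of that dependency, a deduplication cache key might look like this (all names are hypothetical, not part of fact_extractor):

    import hashlib
    from pathlib import Path

    # Illustrative sketch: a file only needs re-extraction when its content
    # hash or the unpacking plugin's (name, version) pair changes.
    def extraction_cache_key(file_path, plugin_name, plugin_version):
        sha256 = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
        return (sha256, plugin_name, plugin_version)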

Now, if a post-unpack plugin can potentially run after any other plugin, it introduces more complexity to that logic, and that needs some serious consideration to get right. So I've delayed working on this until it bothers me enough to implement it.

Of course, none of this is anything you, as maintainers, need to worry about. I'm encountering more instances where this will be useful, though. It looks like Qualcomm really likes this paradigm of having manifest files alongside binary files.

I will definitely ping you and provide an update if/when I implement something like this.
