More deterministic renames across different versions of the same code #97
Comments
Thank you, Glenn, for taking the time to answer me in detail. I wrote a message before this one, but it got lost. I am going through the links you just shared and will get back to you with some ideas. I think I already have some that are worth discussing, but my knowledge in this area is still very limited, so I first want to make sure they are valid and viable. |
Just to clarify that I'm on the same page here, is the issue that:
This is an interesting problem. I'd love to research some ways to implement this. Especially AST fingerprinting seems promising, thank you @0xdevalias for your links. |
One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source. In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs. |
I would like to share an idea I’ve been considering, even though I’m still researching this topic. I hope it proves useful! My suggestion is to break the code down into smaller, modular functions, which your script may already be doing. One way to build on this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) in order of appearance. (I honestly don't know how this can be done, but maybe via RegEx or by just asking an LLM to do it.) This would give a standardized, minified version of the code. After creating this stripped-down, abstracted version, we could calculate a hash of the code as a string. This hash would serve as a unique identifier to track changed portions of the code across different versions of the project and to prevent duplicate entries, as well as a reference for where to store the generated variable names. The resulting data could be stored in an appropriate format, such as CSV, NoSQL, or JSON, depending on your requirements for speed, scalability, and ease of access. Next, we could analyze this stored data from a designated project location, or maybe a specified subfolder (into .humanifjs). Here, we could use language models (LLMs) to generate meaningful variable names based on the context of the functions. This would create a "reference" that can assist in future analyses of the code. When new versions of the obfuscated code are generated (which will have different variable names), we can apply the same process to compare them with previously processed versions. By using diff techniques, we can identify changes and maintain a collection of these sub-chunks of code, which would help reduce discrepancies. In most cases, we should see a high degree of similarity unless a particular function’s logic has changed.
We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks, either by feeding them to the LLM as choices or by assigning them directly programmatically, which avoids consuming more tokens for the same chunks. Additionally, we could explore various optimizations in how the LLM generates and assigns these variable names, as well as in how we store and retrieve the chunks. I look forward to your thoughts on this approach and any suggestions you may have for improving it further! What would make this work better is taking advantage of diff (compare) techniques to build some sort of sub-chunks and keep them available to reduce discrepancies, and maybe also optimizing the generation... I hope this makes sense. And as you stated here
This would be optimal indeed, as it would let us leverage the collective work to get the best results. PS: I don't have a good machine right now to do any testing myself, nor an API key that would let me do it properly. |
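The normalise-then-hash idea described in this comment can be sketched roughly as follows. This is a minimal Python illustration, not humanify's implementation: the `normalise` and `fingerprint` helpers, the keyword list, and the `v0001` placeholder format are all assumptions made for the example. The regex tokenizer ignores comments, template literals, regex literals, and scoping, which is one reason the replies in this thread suggest working on an AST instead.

```python
import hashlib
import re

# Deliberately incomplete set of JavaScript keywords that must not be renamed.
JS_KEYWORDS = {
    "function", "return", "var", "let", "const", "if", "else", "for",
    "while", "new", "this", "typeof", "throw", "null", "true", "false",
}

# One token per match: string literal, identifier, number, or any other symbol.
TOKEN_RE = re.compile(
    r"""("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|([A-Za-z_$][\w$]*)|(\d+(?:\.\d+)?)|(\S)"""
)


def normalise(source: str) -> str:
    """Replace every identifier with a placeholder numbered by first appearance.

    Naive on purpose: property names are renamed too, and whitespace is
    collapsed by re-joining the tokens with single spaces.
    """
    placeholders: dict[str, str] = {}
    out: list[str] = []
    for match in TOKEN_RE.finditer(source):
        string, ident, number, other = match.groups()
        if ident is not None and ident not in JS_KEYWORDS:
            if ident not in placeholders:
                placeholders[ident] = f"v{len(placeholders) + 1:04d}"
            out.append(placeholders[ident])
        else:
            out.append(string or ident or number or other)
    return " ".join(out)


def fingerprint(source: str) -> str:
    """SHA-256 of the normalised source, usable as a lookup key."""
    return hashlib.sha256(normalise(source).encode("utf-8")).hexdigest()


# Two minified builds of the same function, with different identifier names.
build_1 = "function a(b,c){return b+c}"
build_2 = "function x(y, z) {\n  return y + z\n}"
print(fingerprint(build_1) == fingerprint(build_2))
```

An exact hash like this only matches chunks whose structure is identical; any change in logic produces an unrelated key, so it works as a cache key rather than as a similarity measure.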
@jehna Agreed. This was one of the ideas that first led me down the 'fingerprinting' path. Though instead of 'deterministically reversing the code to the original source' in its entirety (which may also be useful), my plan was first to be able to detect dependencies and mark them as such (as most of the time I don't care to look too deeply at them), and then secondly to just be able to extract the 'canonical variable/function names' from that original source and be able to apply them to my unminified version (similar to how
While it's a very minimal/naive attempt, and definitely not the most robust way to approach things, a while back I implemented a really basic 'file fingerprint' method, mostly to assist in figuring out when a chunk had been renamed (but was otherwise largely the same chunk as before), that I just pushed to
When I was implementing it, I was thinking about embeddings, but didn't want to have to send large files to the OpenAI embeddings API; and wanted a quick/simple local approximation of it. Expanding on this concept to the more general code fingerprinting problem; I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed. I would also probably be normalising the code to remove any function/variable identifiers first; and to remove the impact of whitespace differences/etc. While it's not applied to generating a fingerprint, you can see how I've used some of these techniques in my approach to creating a 'diff minimiser' for identifying newly changed code between builds, while ignoring the 'minification noise / churn':
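To make the "quick local approximation" idea a bit more concrete, here is a small Python sketch of one possible approach: erase identifiers, split the token stream into overlapping k-token "shingles", and compare two chunks by Jaccard similarity. This is not the file fingerprint or the diff minimiser referred to above; the `shingles` and `similarity` helpers, the keyword list, and the choice of `k = 4` are assumptions made purely for illustration.

```python
import re

# Deliberately incomplete keyword list; everything else that looks like a
# name is treated as a (renameable) identifier.
KEYWORDS = {
    "function", "return", "var", "let", "const", "if", "else", "for",
    "while", "new", "this", "typeof", "throw", "null", "true", "false",
}

# Crude tokenizer: identifiers, integers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_$][\w$]*|\d+|\S")
NAME_START_RE = re.compile(r"[A-Za-z_$]")


def shingles(source: str, k: int = 4) -> set[tuple[str, ...]]:
    """Return the set of k-token windows after erasing identifier names."""
    tokens = [
        tok if tok in KEYWORDS or not NAME_START_RE.match(tok) else "ID"
        for tok in TOKEN_RE.findall(source)
    ]
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


old = "function a(b){return b*2}"
renamed = "function q(r){return r*2}"
changed = "function q(r){if(r){return r*3}return 0}"

print(similarity(old, renamed))  # renamed only: the shingle sets are identical
print(similarity(old, changed))  # logic changed: only part of the sets overlap
```

Unlike an exact hash, a score like this degrades gradually, so a chunk that was renamed and lightly edited can still be paired with its previous version by picking the best-scoring candidate.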
@jehna Oh true.. yeah, that definitely makes sense. Kind of like a local cache.
@neoOpus This would be handled by parsing the code into an AST, and then manipulating that AST to rename the variables. You can see various hacky PoC versions of this with various parsers in my
You can see some of the early hacky mapping attempts I was making in these files:
That was the point where I realised I really needed something more robust (such as a proper fingerprint that would survive code minification) to use as the key.
@neoOpus Re-applying the old variable names to the new code wouldn't need an LLM at all, as that part is handled in the AST processing code within
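As a rough illustration of why no LLM is needed for this step, the Python sketch below re-applies a stored list of names to a new build purely mechanically. It is a toy under stated assumptions: `identifiers_in_order` and `reapply_names` are hypothetical helpers, the names are assumed to be stored in order of first appearance, and the regex substitution is not scope-aware (it would also rewrite matching words inside strings). A real implementation would do the rename on the AST, as described above.

```python
import re

# Deliberately incomplete keyword list; these are never renamed.
KEYWORDS = {
    "function", "return", "var", "let", "const", "if", "else", "for",
    "while", "new", "this", "typeof", "throw", "null", "true", "false",
}

IDENT_RE = re.compile(r"[A-Za-z_$][\w$]*")


def identifiers_in_order(source: str) -> list[str]:
    """List each non-keyword identifier once, in order of first appearance."""
    seen: list[str] = []
    for name in IDENT_RE.findall(source):
        if name not in KEYWORDS and name not in seen:
            seen.append(name)
    return seen


def reapply_names(new_source: str, stored_names: list[str]) -> str:
    """Rename the identifiers of new_source using previously stored names."""
    current = identifiers_in_order(new_source)
    if len(current) != len(stored_names):
        # The chunk no longer has the same shape; fall back to another strategy.
        raise ValueError("identifier count differs from the stored names")
    mapping = dict(zip(current, stored_names))
    return IDENT_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), new_source)


# Names generated once for an earlier build of this chunk (hypothetical data).
stored = ["add", "left", "right"]
print(reapply_names("function x(y,z){return y+z}", stored))
```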
@neoOpus At a high level, it seems that the thinking/aspects you've outlined here are more or less in line with what I've discussed previously in the resources I linked to in my first comment above.
@neoOpus IMO, the bulk of the 'harder parts' of implementing this aren't really LLM related, and shouldn't require a powerful machine. The areas I would suggest most looking into around this are how AST parsing/manipulation works; and then how to create a robust/stable fingerprinting method. IMO, figuring out the ideal method of fingerprinting is probably the largest / potentially hardest 'unknown' in all of this currently (at least to me, since while I started to gather resources for it, I haven't had the time to deep dive into reading/analysing them all):
Off the top of my head, I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed; and then generate fingerprints for them. I would also potentially consider looking at the module/function 'entry/exit' points (eg. imports/exports); or maybe even the entire 'shape' of the module import graph itself. I would also probably be normalising the code to remove any function/variable identifiers and to remove the impact of whitespace differences/etc; before generating any fingerprints on it. Another potential method I considered for the fingerprints is identifying the types of elements that tend to remain stable even when minified, and using those as part of the fingerprint. As that is one of the manual methods I used to be able to identify a number of the modules listed here:
(Edit: I have captured my notes from this comment on the following gist for posterity: https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code) |
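One way to picture the "elements that tend to remain stable even when minified" idea is a fingerprint built only from string literals, which minifiers generally leave intact. The Python sketch below is an assumption-laden illustration rather than the method used in the comment above: `stable_literals` and `literal_fingerprint` are hypothetical helpers, and the regex does not handle template literals, comments, or strings that a minifier has merged or split.

```python
import hashlib
import re

# Matches double- or single-quoted string literals and captures their contents.
STRING_RE = re.compile(
    r"""(?:"((?:\\.|[^"\\])*)"|'((?:\\.|[^'\\])*)')"""
)


def stable_literals(source: str) -> list[str]:
    """Return the sorted, de-duplicated string literal contents of a chunk."""
    literals = set()
    for double_quoted, single_quoted in STRING_RE.findall(source):
        literals.add(double_quoted or single_quoted)
    return sorted(literals)


def literal_fingerprint(source: str) -> str:
    """Hash the literal set; independent of identifier names and quote style."""
    joined = "\x00".join(stable_literals(source))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


build_1 = 'function a(b){if(!b)throw new Error("missing user id");return fetch("/api/users/"+b)}'
build_2 = "function q(r){if(!r){throw new Error('missing user id')}return fetch('/api/users/'+r)}"

print(stable_literals(build_1))
print(literal_fingerprint(build_1) == literal_fingerprint(build_2))
```

A literal-only key is coarse: modules with no strings all collide, so in practice it would probably be combined with other signals such as the import/export shape mentioned above.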
Hi,
I have an idea that I hope will be helpful and prompt some discussion.
Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc.
My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives, updated continuously. This would serve as a lookup table for the LLM, helping maintain consistency and reducing variations in the final output. Admittedly, I haven't fully explored the feasibility of this concept, but I believe it would strengthen reverse-engineering processes.