Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get types for typevars in each function (like v_7693)? #7

Open
am009 opened this issue Nov 29, 2023 · 10 comments
Open

How to get types for typevars in each function (like v_7693)? #7

am009 opened this issue Nov 29, 2023 · 10 comments

Comments

@am009
Copy link

am009 commented Nov 29, 2023

I'm trying to write a decompiler using retypd. If I understand those python source code correctly, existing frontends (retypd-ghidra-plugin and gtirb-ddisasm-retypd) only generate function types (arguments and return value) for each function.

I also generated type constrains for local variables and registers. Is it possible to get the types of these variables (these v_xxx in the following example contrains)?

"memset": [
  "memset.in_0 <= v_1131",
  "v_5 <= memset.in_2",
  "bool <= v_1132",
  "memset.in_0 <= memset.out",
  "v_1135 <= v_1134.store.σ1@0*[nobound]",
  "v_1135 <= memset.in_0.store.σ1@0",
  "v_236 <= memset.in_2",
  "bool <= v_1136",
  "v_1135 <= v_1134.store.σ1@0*[nullterm]",
  "v_1135 <= memset.in_0.store.σ1@1",
  "v_1135 <= v_1134.store.σ1@-3",
  "v_1135 <= memset.in_0.store.σ1@2",
  "v_416 <= memset.in_2",
  "bool <= v_1137",
  "v_1135 <= v_1134.store.σ1@-4",
  "v_1135 <= memset.in_0.store.σ1@3",
  "v_161 <= memset.in_2",
  "bool <= v_1138",

Thanks in advance. Retypd is really powerful in type recovery. I'm quite grateful that it is open source.

@aeflores
Copy link
Contributor

aeflores commented Dec 1, 2023

Hi @am009, getting types for local variables is possible but it would require some changes in Retypd. This is actually something that we have considered implementing, but haven't had time to do it yet.

It could be done as a post-processing step once the inter-procedural propagation has finished. Once we have types for function arguments and return values, you could propagate internally and get types for local variables. However, we probably don't want to do that for all variables in the program eagerly, but rather allow querying specific procedures (otherwise it would be potentially very costly).

If this is something you'd like to contribute, I'd be happy to provide more guidance. Otherwise, we will probably end up doing it ourselves, but I am not sure when.

@am009
Copy link
Author

am009 commented Dec 5, 2023

Hi @aeflores, I urgently need this feature and I am really happy to contribute to retypd. I have been reading the code these days. The following are my understandings.

I think there are two important entries of the algorithm, one is the __call__ of Solver, and the other is the __call__ of CTypeGenerator. Solver solves constraints and the result is saved in Sketch. CTypeGenerator is responsible for recursively converting each Sketch into a primitive type, array type, struct type, or pointer type to them (in CTypeGenerator.c_type_from_nodeset).

I guess, after the __call__ of Solver is invoked, the Solver should provide a new API to allow the query of the Sketches for all variables (including local variables)?

To solve the infinite nodes problem caused by recursive types, retypd uses a method related to push-down systems. Is this part of knowledge related to papers like "Synchronized Pushdown Systems for Pointer and Data-Flow Analysis"? I wonder how the author came up with the idea and probably I need to learn more to understand the paper further.

By the way, because retypd may spend a lot of time on large programs, can we speed it up by using a faster interpreter other than CPython, like PyPy?

@aeflores
Copy link
Contributor

aeflores commented Dec 5, 2023

Hi @am009

I guess, after the call of Solver is invoked, the Solver should provide a new API to allow the query of the Sketches for all variables (including local variables)?

That's right.

To solve the infinite nodes problem caused by recursive types, retypd uses a method related to push-down systems. Is this part of knowledge related to papers like "Synchronized Pushdown Systems for Pointer and Data-Flow Analysis"? I wonder how the author came up with the idea and probably I need to learn more to understand the paper further.

I think you don't have to get into those details to get this working.
A good place to look at is https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L808-L847

The basic steps for "solving" a procedure (or a scc with mutually recursive procedures) are the following:

  1. You get the constraints for the procedure https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L809 (these are the ones that you posted above)
  2. You get constraints from the procedure prototype (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L817) and from the prototypes of the callees (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L817)
  3. Then you call infer_shapes to generate the "skeleton" of the sketches for whatever variables you care about https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L830
  4. Then you call generate_primitive_constraints (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L836) and add_constraints (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L845) to populate the sketches nodes with primitive type information.

So let's say you want to get a sketch for a set of variables vars in a procedure proc, you would do something like this:

    def get_type_of_variables(self,
                            sketches_map: Dict[DerivedTypeVariable, Sketch],
                            type_schemes: Dict[DerivedTypeVariable, ConstraintSet],
                            proc:DerivedTypeVariable, 
                            vars:Set[DerivedTypeVariable]):
        constraints = self.program.proc_constraints.get(
                        proc, ConstraintSet()
                    )
        callees = set(networkx.DiGraph(self.program.callgraph).successors(proc))
        fresh_var_factory = FreshVarFactory()
        constraints |= Solver.instantiate_calls(
            callees,
            constraints,
            sketches_map,
            type_schemes,
            fresh_var_factory,
        )
        constraints |= sketches_map[proc].instantiate_sketch(
            proc, fresh_var_factory
        )
    
        var_sketches = self.infer_shapes(
            vars,
            self.program.types,
            constraints,
        )
        for var in vars:
            primitive_constraints = self._generate_primitive_constraints(
                constraints,
                frozenset({var}),
                self.program.types.internal_types)
    
            var_sketches[var].add_constraints(
                primitive_constraints
            )
        return var_sketches

NOTE I haven't actually run the code, it probably won't run as is, but hopefully it gives you an idea of what needs to happen.

In your example, you would call it with e.g. proc = "memset" and (for example) vars = {"v_1135","v_1132"} (it would actually be vars= { DerivedTypeVariable(var) for var in ["v_1135","v_1132"]} and same for proc). sketches_map and type_schemes would be the ones returned by the call method. Once you have sketches, you can obtain the corresponding C types. Hopefully this is enough to get you started.

This code recomputes a lot of stuff, doing that more efficiently would probably require more extensive changes.

By the way, because retypd may spend a lot of time on large programs, can we speed it up by using a faster interpreter other than CPython, like PyPy?

I think we tried running it with PyPy at some point, and it worked... or it was close to working, so yes that is definitely an option. There are also plenty of optimization opportunities in this code (cacheing some of the graphs, better data structures, etc).

@aeflores
Copy link
Contributor

aeflores commented Dec 5, 2023

@am009 please note that we require signing a "Contributor License Agreement" https://github.com/GrammaTech/retypd/blob/master/GrammaTech-CLA-retypd.pdf before reviewing and accepting contributions into the official repo (the CLA should be sent to [email protected]).

@am009
Copy link
Author

am009 commented Jan 14, 2024

Sorry for the late reply. Busy with other random stuff these days. Now I finally settled down and started to learn something.

Really grateful for the detailed response, and it works. I created a PR for it: #8 ( I sent a mail with a signed CLA on December 6, 2023.)

Actually, I'm very interested in binary type recovery and trying to do some research in this area, so I would like to learn more about the internals of retypd. I still find the retypd paper difficult to read. I guess I need to learn more about pushdown systems.

I still would like to know, what kind of pushdown systems is it related to? Do you have any recommended learning resources? (or any getting started introduction or tutorial about push-down systems) There is a reference to "Saturation algorithms for model checking pushdown systems." in the retypd paper. Probably it is related? Thanks in advance.

@aeflores
Copy link
Contributor

Hi @am009 , thanks for the PR, I'll try to review it soon.

I still find the retypd paper difficult to read

Me too! @peteraldous and I were planning to write another document/paper revisiting the algorithm, but we never finished it.

I still would like to know, what kind of pushdown systems is it related to?

Despite the name, pushdown systems are not directly used in the implementation explicitly. If you are not doing that yet, I'd recommend focusing on the Arxiv version https://arxiv.org/pdf/1603.05495. Most of the implementation details are in the appendices.

In very rough terms, we have a type system, defined in Figure 3 of the paper, and we need to compute all the "implications" of our initial set of constraints (all the implied type constraints). We can do that by building a transducer that captures all possible type derivations. Appendix D describes how to build that transducer from an initial set of constraints. The construction of the transducer has several steps, and the pushdown system is basically an intermediate step. Again in very rough terms, Appendix D does the following steps:

Constraint set -> Pushdown system -> Transducer

In order to go from pushdown system to transducer, we:

  1. create a graph based on the pushdown system
  2. do saturation (Section D.3)
  3. and shadowing (D4) on the graph.
Constraint set -> Pushdown system ---graph+saturation+shadowing--> Tranducer

This implementation just goes from the constraint set to the graph directly. Saturation and shadowing in the implementation can be found here respectively:

@aeflores
Copy link
Contributor

aeflores commented Jan 31, 2024

I sent a mail with a signed CLA on December 6, 2023.

Hi @am009 , It looks like we have not received your email. Would you mind sending it again?

@am009
Copy link
Author

am009 commented Feb 1, 2024

Hi @am009 , It looks like we have not received your email. Would you mind sending it again?

OK, I have sent it again.
image

@am009
Copy link
Author

am009 commented Jun 4, 2024

Finally, I can understand most of the algorithm now, after reading "Saturation Algorithms for Model-checking Pushdown Systems" and "Advanced Topics in Types and Programming Languages" Chapter 10. Thanks for your previous explanations. The code is also very helpful. Now I am trying to reimplement the algorithm in Rust to have a better understanding of the algorithm.

@am009
Copy link
Author

am009 commented Jun 4, 2024

I found a potential issue related to the algorithm, but it can also be a misunderstanding:

retypd/src/solver.py

Lines 707 to 711 in 8f7f72b

scc_sketches_map = self.infer_shapes(
scc | involved_globals,
self.program.types,
scc_initial_constraints,
)

There is an InferShapes function call during the simplification of constraints (InferProcTypes). I think the inferring of the sketches should be done in the next stage in the function InferTypes(while getting the real types), not here (during simplifying constraints).

Also, there are two arguments for InferShapes, the second argument is the main output map that maps variables to sketches. But here, it passes an empty set to it, which means the result is not used at all? So, I think the InferShapes call in the InferProcTypes is purely to apply the additive constraints. If there are no additive constraints, then the InferShapes call can be omitted. So the above call to InferShapes is not necessary?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants