Skip to content

Decompilation

Alexey Andreev edited this page Feb 5, 2014 · 12 revisions

The idea of decompiler is based on the fact that for each reducible graph we can always build an appropriate source code, as JavaScript has an ability to break any labeled block. Even if there wasn't such ability, we could emulate it with such construction: do { ... } while (false). So we just need to build a proper block hierarchy. Initially each block should start at the jump to label and end at the label itself. But due to some blocks may overlap, we should move the beginning of block to make them nest each other.

Another fact we should notice is that we can't express back jumps this way. Instead we should compute a loop forest of our CFG. But as we also can't jump inside loop, we should reorder CFG that all nodes, belonging to one loop come one after another. We can't let any non-loop nodes be inserted between loop nodes. So we end up with the following algorithm:

  1. Build a CFG loop forest.
  2. Perform a depth-first search over the CFG. When we are on some node, we should first visit its successors that exit the most outer loop, then we can visit the remaining ones (i.e. successors which stay on the same loop). Now the list has a topoligically sorted CFG, where no jumps into loops exist.
  3. Add all the loops into the set of blocks.
  4. For each jump instruction (i.e. GOTO, IFEQ, IFNE, etc) which does not follow a loop back edge add a block beginning at the instruction itself and ending at the jump target label.
  5. If two blocks overlap, move the later block's beginning before the beginnig of the earlier block.

Consider the following example

arrayCopy(arr, len):
  $0:
    copy = new array(len);
    copyLen = len;
    origLen = length(arr);
    if copyLen > origLen then goto $1 else goto $2
  $1:
    copyLen = origLen
    goto $2
  $2:
    i = 0;
    goto $3
  $3:
    if i < copyLen then goto $4 else goto $5
  $4:
    elt = arr[i];
    copy[i] = elt;
    i = i + 1
    goto $3
  $5:
    return copy;

with the following CFG:

CFG

Now we are ready to produce the abstract syntax tree. Notice, that we should follow the nesting block hierarchy we just have built. Each jump instruction, unless it is the instruction which follows a back edge, is represented as break label or if (condition) break label;.

After building the AST this way we apply some optimizations over the AST, in order to make the resulting code more natural and neat. The two main optimizations are these:

The first optimization. When we have the following labelled block:

label: {
    A
    if (X) {
        B
        break label;
    }
    C
}

we can eliminate the break statement the following way:

label: {
   A
   if (X) {
      B
   } else {
      C
   }
}

of course, if all break statements have been eliminated eliminated, we can eliminate the whole block. For example, if there are no references to label1 in A, B and C, we could rewrite our example the following way:

A
if (X) {
   B
} else {
   C
}

The second optimization. For each usage of variable we can replace it by its declaration right-hand side, if this variable is used and declared once and the variable's usage follows immediately after its declaration.

Consider the following code:

a = foo()
b = bar()
c = a - b

We apply subtraction in left-to-right order. Thus, we should eliminate temporal declaration in backwards order, i.e. from right to left. Let us apply optimization:

a = foo()
c = a - bar();

and after the second application we have the following:

c = foo() - bar();

Notice, that we must always apply this optimization in the backwards computation order, as otherwise we may alter the method call order. And if these methods both have side effects, we loose the original meaning. Moreover, we can change the computation order in general, that leads to altering of liveness information. But as our register allocation and phi elimination relies in the original liveness information, we change our code meaning.

After AST optimizations TeaVM renames all variables to corresponding registers that are computed during the previous phase. If only we renamed variables at the previous phase, we loose information abount the number of usages.

Clone this wiki locally