Line map between output and input text #29

gratianlup · 2024-09-29T07:36:08Z

Hi,

Would it be possible to also compute the mapping between the LLM output and the input from the Ghidra decompiler as a line map? Something like LLM_OUT_LINES[line_number] = {one or more line numbers from the Ghidra input}.

In your Colab example, the output line:
if (fabs(a[i] - a[j]) < eps)

would be mapped to the 3 input lines:

if ((float)(DAT_001020d0 &
                 (uint)(*(float *)(param_2 + (long)local_10 * 4) -
                       *(float *)(param_2 + (long)local_c * 4))) < param_1) {

I'm not sure whether something like this can be done with LLMs at all. If it is doable, this project would be really useful for tools like profilers, where one could mark the source lines where most time is spent by mapping assembly instructions to lines with the help of debug info.
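
To make the profiler use case concrete, here is a minimal Python sketch of how such a line map could be consumed. Everything in it is an assumption for illustration: the line numbers are made up, `invert_line_map` and `attribute_samples` are hypothetical helpers, and splitting a sample count evenly across several output lines is just one possible policy.

```python
from collections import defaultdict

# Hypothetical line map: LLM output line -> Ghidra input lines.
# The line numbers are invented for illustration only.
LLM_OUT_LINES = {
    5: [12, 13, 14],  # e.g. "if (fabs(a[i] - a[j]) < eps)" spans 3 input lines
    6: [15],
}


def invert_line_map(out_to_in):
    """Build the reverse map: Ghidra input line -> set of LLM output lines."""
    in_to_out = defaultdict(set)
    for out_line, in_lines in out_to_in.items():
        for in_line in in_lines:
            in_to_out[in_line].add(out_line)
    return in_to_out


def attribute_samples(samples_per_input_line, out_to_in):
    """Move profiler sample counts from input lines onto output lines.

    Samples on an input line that maps to several output lines are split
    evenly; samples on unmapped input lines are dropped.
    """
    in_to_out = invert_line_map(out_to_in)
    totals = defaultdict(float)
    for in_line, count in samples_per_input_line.items():
        targets = in_to_out.get(in_line)
        if not targets:
            continue
        share = count / len(targets)
        for out_line in targets:
            totals[out_line] += share
    return dict(totals)
```

A profiler would first resolve instruction addresses to input lines using debug info, then call `attribute_samples` to see which output lines are hot.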

albertan017 · 2024-09-29T07:58:41Z

Aligning the input and output of a large language model isn't achievable unless we tailor the training process (similar to how objdump -d -S pairs one line of source code with a few lines of assembly). We plan to explore this line-by-line training approach (asm-src, not Ghidra) in future updates for a more versatile chat model. It might take a few months to develop, but we hope it will be beneficial.
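
For readers unfamiliar with the objdump -d -S style of pairing, the rough Python sketch below groups an interleaved listing into (source lines, assembly lines) pairs. It is not the project's training pipeline. The `SAMPLE` listing is hand-written and simplified (real objdump output also contains section and symbol headers, which this sketch does not handle), and `pair_source_with_asm` is a hypothetical helper that assumes instruction lines start with a hex address followed by a colon and a tab.

```python
import re

# Assumption: an instruction line looks like "  401106:<TAB>55<TAB>push %rbp".
INSN_RE = re.compile(r"^\s*[0-9a-f]+:\t")

# Hand-written, simplified listing in the style of `objdump -d -S`.
SAMPLE = (
    "int add(int a, int b) {\n"
    "  401106:\t55\tpush   %rbp\n"
    "  401107:\t48 89 e5\tmov    %rsp,%rbp\n"
    "  return a + b;\n"
    "  40110a:\t8d 04 37\tlea    (%rdi,%rsi,1),%eax\n"
    "}\n"
    "  40110d:\t5d\tpop    %rbp\n"
    "  40110e:\tc3\tret\n"
)


def pair_source_with_asm(listing):
    """Group an interleaved listing into (source_lines, asm_lines) pairs."""
    pairs = []
    src, asm = [], []
    for line in listing.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if INSN_RE.match(line):
            asm.append(line.strip())
        else:
            if asm:
                # A source line after instructions starts a new group.
                pairs.append((src, asm))
                src, asm = [], []
            src.append(line.strip())
    if src or asm:
        pairs.append((src, asm))
    return pairs


for source_lines, asm_lines in pair_source_with_asm(SAMPLE):
    print(source_lines, "->", len(asm_lines), "instruction(s)")
```

Each pair is the kind of one-source-line-to-few-instructions alignment that a line-by-line training setup would need as supervision.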

We've also observed that a group of smart researchers have done some work which may help your situation; you might want to explore their models.

https://arxiv.org/pdf/2406.17233
