[Bug] joern-export tool export incorrect formated .dot files. #5158

Open
yikesoftware opened this issue Dec 4, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@yikesoftware

When I export a CPG in dot format with the joern-export tool, the resulting dot file is malformed, so libraries such as graphviz and pydot cannot parse it. In contrast, the dot output for the other representations (CFG, PDG, AST, CPG14, etc.) is well-formed.

For example: joern-export 1.cpg.bin -o 1.cpg.gv --repr cpg --format dot

One of the files in the output looks like this:

digraph {
  120259084301[label="METHOD_PARAMETER_OUT CODE="p1" EVALUATION_STRATEGY="BY_VALUE" INDEX="1" IS_VARIADIC="false" NAME="p1" ORDER="1" TYPE_FULL_NAME="ANY"]
  128849018890[label=METHOD_RETURN CODE="RET" EVALUATION_STRATEGY="BY_VALUE" ORDER="2" TYPE_FULL_NAME="ANY"]
  25769803790[label=BLOCK ARGUMENT_INDEX="1" CODE="<empty>" ORDER="1" TYPE_FULL_NAME="ANY"]
  111669149706[label=METHOD AST_PARENT_FULL_NAME="<global>" AST_PARENT_TYPE="NAMESPACE_BLOCK" CODE="<empty>" FILENAME="<empty>" FULL_NAME="<operator>.addressOf" IS_EXTERNAL="true" NAME="<operator>.addressOf" ORDER="0" SIGNATURE=""]
  115964117005[label=METHOD_PARAMETER_IN CODE="p1" EVALUATION_STRATEGY="BY_VALUE" INDEX="1" IS_VARIADIC="false" NAME="p1" ORDER="1" TYPE_FULL_NAME="ANY"]
  111669149706 -> 120259084301 [label=AST ]
  115964117005 -> 120259084301 [label=PARAMETER_LINK ]
  111669149706 -> 128849018890 [label=AST ]
  111669149706 -> 115964117005 [label=REACHING_DEF property=""]
  115964117005 -> 128849018890 [label=REACHING_DEF property="p1"]
  111669149706 -> 115964117005 [label=AST ]
  128849018890 -> 111669149706 [label=POST_DOMINATE ]
  115964117005 -> 120259084301 [label=REACHING_DEF property="p1"]
  111669149706 -> 25769803790 [label=AST ]
  111669149706 -> 128849018890 [label=CFG ]
  111669149706 -> 25769803790 [label=CONTAINS ]
  111669149706 -> 128849018890 [label=DOMINATE ]
}

There are 4 main problems with this dot file:

  1. An attribute value that contains special characters such as spaces, <, >, or = must be enclosed in double quotation marks. Otherwise it causes a syntax error.
  2. Node IDs are bare numerals (such as 120259084301). The DOT grammar accepts numerals as IDs, so this is not an error by itself, but quoting them is harmless and keeps the output uniform.
  3. Attribute values contain embedded double quotes (such as the quotes in CODE="p1"), which must be escaped as \" inside a quoted value.
  4. Statements are not terminated with semicolons. Semicolons are optional in DOT, but ending each node, edge, and attribute statement with one makes the file easier to read and to split into statements.

The corrected file is as follows:

digraph {
  "120259084301" [label="METHOD_PARAMETER_OUT CODE=\"p1\" EVALUATION_STRATEGY=\"BY_VALUE\" INDEX=\"1\" IS_VARIADIC=\"false\" NAME=\"p1\" ORDER=\"1\" TYPE_FULL_NAME=\"ANY\""];
  "128849018890" [label="METHOD_RETURN CODE=\"RET\" EVALUATION_STRATEGY=\"BY_VALUE\" ORDER=\"2\" TYPE_FULL_NAME=\"ANY\""];
  "25769803790" [label="BLOCK ARGUMENT_INDEX=\"1\" CODE=\"<empty>\" ORDER=\"1\" TYPE_FULL_NAME=\"ANY\""];
  "111669149706" [label="METHOD AST_PARENT_FULL_NAME=\"<global>\" AST_PARENT_TYPE=\"NAMESPACE_BLOCK\" CODE=\"<empty>\" FILENAME=\"<empty>\" FULL_NAME=\"<operator>.addressOf\" IS_EXTERNAL=\"true\" NAME=\"<operator>.addressOf\" ORDER=\"0\" SIGNATURE=\"\""];
  "115964117005" [label="METHOD_PARAMETER_IN CODE=\"p1\" EVALUATION_STRATEGY=\"BY_VALUE\" INDEX=\"1\" IS_VARIADIC=\"false\" NAME=\"p1\" ORDER=\"1\" TYPE_FULL_NAME=\"ANY\""];
  
  "111669149706" -> "120259084301" [label="AST"];
  "115964117005" -> "120259084301" [label="PARAMETER_LINK"];
  "111669149706" -> "128849018890" [label="AST"];
  "111669149706" -> "115964117005" [label="REACHING_DEF" property=""];
  "115964117005" -> "128849018890" [label="REACHING_DEF" property="p1"];
  "111669149706" -> "115964117005" [label="AST"];
  "128849018890" -> "111669149706" [label="POST_DOMINATE"];
  "115964117005" -> "120259084301" [label="REACHING_DEF" property="p1"];
  "111669149706" -> "25769803790" [label="AST"];
  "111669149706" -> "128849018890" [label="CFG"];
  "111669149706" -> "25769803790" [label="CONTAINS"];
  "111669149706" -> "128849018890" [label="DOMINATE"];
}
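
As a rough illustration of the quoting rules listed above, here is a minimal Python sketch that emits node and edge statements in the shape of the corrected file. The helper names (dot_quote, node_line, edge_line) are made up for this example and are not part of Joern or flatgraph. Doubling backslashes is an assumption on my side: it keeps a trailing backslash from escaping the closing quote.

```python
from typing import Dict, Optional


def dot_quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    # Assumption: backslashes are doubled first so the quote escaping stays unambiguous.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def node_line(node_id: int, label: str, props: Dict[str, str]) -> str:
    """Build one node statement with all properties folded into the label."""
    # KEY="value" pairs are appended to the node label, as in the corrected file.
    text = " ".join([label] + [f'{key}="{value}"' for key, value in props.items()])
    return f"  {dot_quote(str(node_id))} [label={dot_quote(text)}];"


def edge_line(src: int, dst: int, label: str, prop: Optional[str] = None) -> str:
    """Build one edge statement; `prop` becomes the optional `property` attribute."""
    attrs = f"label={dot_quote(label)}"
    if prop is not None:
        attrs += f" property={dot_quote(prop)}"
    return f"  {dot_quote(str(src))} -> {dot_quote(str(dst))} [{attrs}];"
```

Calling node_line with 128849018890, "METHOD_RETURN" and the four properties of that node reproduces the METHOD_RETURN line of the corrected file, and edge_line(111669149706, 115964117005, "REACHING_DEF", "") reproduces the first REACHING_DEF edge.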

Perhaps the dot format was simplified for some reason when exporting the CPG, but I think the output should follow the standard format. The fix only involves quoting and escaping, so the workload should not be significant.


Joern version: 4.0.116
OS: Ubuntu 24.04
Java version: openjdk 21.0.5 2024-10-15

yikesoftware added the bug (Something isn't working) label Dec 4, 2024
@max-leuthaeuser
Contributor

max-leuthaeuser commented Dec 4, 2024

@mpollmeier

I guess the issue is in io.joern.joerncli.JoernExport.
For Representation.All or Representation.Cpg we use the flatgraph.formats.Exporter (which generates the broken dot output shown here) while for anything else we use io.shiftleft.semanticcpg.dotgenerator.DotSerializer which works correctly.

I am not sure why we made this distinction there and re-implemented the dot exporting logic again within flatgraph if we already had a working one in joern.

Anyway: there is a lot of escaping (using StringEscapeUtils.escapeHtml4) and quoting (using \") in io.shiftleft.semanticcpg.dotgenerator.DotSerializer which is missing in flatgraph.formats.Exporter.

@yikesoftware
Author

@mpollmeier

Do you have any temporary solutions?

@mpollmeier
Contributor

Thanks for bringing this up, and yes there's a bit of historical mess in there. I'll take a look today.

@mpollmeier
Contributor

didn't get around to looking into this yet

@mpollmeier
Contributor

I just pushed michael/dot-export, @yikesoftware can you give it a try and see if this is closer to what you need? Check out the branch, run sbt stage and then something like ./joern-export cpg.bin -o dot-export-cpg-with-joern --repr cpg --format dot

@max-leuthaeuser @yikesoftware there's two differences between the joern DotSerializer and flatgraph's DotExporter:

  1. escaping, as discussed above
  2. flatgraph's exporter is generic and exports all nodes with all properties, while joern's DotSerializer is specific to the cpg schema and only considers a subset of the nodes and their properties.

We never really made a plan for what we really want in the dot export, and there's many different use cases. I guess I'll change the 'all' representation back to the flatgraph exporter, after I added escaping.

@yikesoftware
Author

Thanks! I tested the branch and it does export a correctly formatted dot file. However, compared to the original output, each node now only has a short label, and the rich property information is gone. Is it possible to keep these high-value properties while still producing valid dot?

Before:

  120259084301[label="METHOD_PARAMETER_OUT CODE="p1" EVALUATION_STRATEGY="BY_VALUE" INDEX="1" IS_VARIADIC="false" NAME="p1" ORDER="1" TYPE_FULL_NAME="ANY"]
  128849018890[label=METHOD_RETURN CODE="RET" EVALUATION_STRATEGY="BY_VALUE" ORDER="2" TYPE_FULL_NAME="ANY"]
  25769803790[label=BLOCK ARGUMENT_INDEX="1" CODE="<empty>" ORDER="1" TYPE_FULL_NAME="ANY"]
  111669149706[label=METHOD AST_PARENT_FULL_NAME="<global>" AST_PARENT_TYPE="NAMESPACE_BLOCK" CODE="<empty>" FILENAME="<empty>" FULL_NAME="<operator>.addressOf" IS_EXTERNAL="true" NAME="<operator>.addressOf" ORDER="0" SIGNATURE=""]
  115964117005[label=METHOD_PARAMETER_IN CODE="p1" EVALUATION_STRATEGY="BY_VALUE" INDEX="1" IS_VARIADIC="false" NAME="p1" ORDER="1" TYPE_FULL_NAME="ANY"]

After:

"120259084293" [label = <METHOD_PARAMETER_OUT> ]
"25769803785" [label = <(BLOCK,&lt;empty&gt;,&lt;empty&gt;)> ]
"128849018885" [label = <(METHOD_RETURN,ANY)> ]
"115964116997" [label = <(PARAM,p1)> ]
"111669149701" [label = <(METHOD,printf)> ]

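The "After" lines use HTML-like labels (label = <...>), where <, > and & in the text must be written as entities. A minimal Python sketch of that escaping follows, with Python's html.escape standing in for the StringEscapeUtils.escapeHtml4 call mentioned earlier in the thread (the two are similar but not identical). The optional props argument, which adds one <BR/>-separated row per property, is purely an assumption about how properties could be kept. It is not what the michael/dot-export branch does, and html_label_node is a made-up name.

```python
import html
from typing import Dict, Optional


def html_label_node(node_id: int, summary: str,
                    props: Optional[Dict[str, str]] = None) -> str:
    """Build a node statement with an HTML-like label."""
    # html.escape rewrites &, <, > and quotes as entities, so "<empty>" is safe inside <...>.
    rows = [html.escape(summary)]
    for key, value in (props or {}).items():
        # Hypothetical layout: one "KEY: value" row per property.
        rows.append(f"{html.escape(key)}: {html.escape(str(value))}")
    body = "<BR/>".join(rows)
    return f'"{node_id}" [label = <{body}> ]'
```

Without props, html_label_node(25769803785, "(BLOCK,<empty>,<empty>)") yields the BLOCK line shown under "After".
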
@mpollmeier
Contributor

yes that's expected as per my message above. specifically:

I guess I'll change the 'all' representation back to the flatgraph exporter, after I added escaping.

That's the part you need. Won't get to that this week though.

@mpollmeier
Contributor

@yikesoftware can you please share how you use or render the dot file?
If I use dot to render a png (e.g. dot -Tpng foo.dot -o foo.png), it renders an image without complaining about the rules you mention above. That being said, it also disregards all properties, so you're likely using a different tool...

@yikesoftware
Author

With the previous version, exporting the dot file and rendering it to a png fails:

➜  ida_scripts dot -Tpng /home/eqqie/work/ida_scripts/1.cpg.gv/1.c/func1.dot -o test.png
Error: /home/eqqie/work/ida_scripts/1.cpg.gv/1.c/func1.dot: syntax error in line 2 near 'from'
➜  ida_scripts

dot file:

digraph {
  25769803777[label=BLOCK ARGUMENT_INDEX="-1" CODE="{
    printf("Hello from func1: %d\n", arg);
    return arg;
}" COLUMN_NUMBER="19" LINE_NUMBER="3" ORDER="2" TYPE_FULL_NAME="void"]
  146028888064[label=RETURN ARGUMENT_INDEX="-1" CODE="return arg;" COLUMN_NUMBER="5" LINE_NUMBER="5" ORDER="2"]
  90194313216[label=LITERAL ARGUMENT_INDEX="1" CODE=""Hello from func1: %d\n"" COLUMN_NUMBER="12" LINE_NUMBER="4" ORDER="1" TYPE_FULL_NAME="char*"]
  68719476736[label=IDENTIFIER ARGUMENT_INDEX="2" CODE="arg" COLUMN_NUMBER="38" LINE_NUMBER="4" NAME="arg" ORDER="2" TYPE_FULL_NAME="int"]
  68719476737[label=IDENTIFIER ARGUMENT_INDEX="-1" CODE="arg" COLUMN_NUMBER="12" LINE_NUMBER="5" NAME="arg" ORDER="1" TYPE_FULL_NAME="int"]
  111669149697[label=METHOD AST_PARENT_FULL_NAME="1.c:<global>" AST_PARENT_TYPE="TYPE_DECL" CODE="int func1(int arg){
    printf("Hello from func1: %d\n", arg);
    return arg;
}" COLUMN_NUMBER="1" COLUMN_NUMBER_END="1" FILENAME="1.c" FULL_NAME="func1" IS_EXTERNAL="false" LINE_NUMBER="3" LINE_NUMBER_END="6" NAME="func1" ORDER="1" SIGNATURE="int(int)"]
  115964116992[label=METHOD_PARAMETER_IN CODE="int arg" COLUMN_NUMBER="11" EVALUATION_STRATEGY="BY_VALUE" INDEX="1" IS_VARIADIC="false" LINE_NUMBER="3" NAME="arg" ORDER="1" TYPE_FULL_NAME="int"]
  128849018880[label=METHOD_RETURN CODE="RET" COLUMN_NUMBER="1" EVALUATION_STRATEGY="BY_VALUE" LINE_NUMBER="3" ORDER="3" TYPE_FULL_NAME="int"]
  120259084288[label=METHOD_PARAMETER_OUT CODE="int arg" COLUMN_NUMBER="11" EVALUATION_STRATEGY="BY_VALUE" INDEX="1" IS_VARIADIC="false" LINE_NUMBER="3" NAME="arg" ORDER="1" TYPE_FULL_NAME="int"]
  30064771072[label=CALL ARGUMENT_INDEX="-1" CODE="printf("Hello from func1: %d\n", arg)" COLUMN_NUMBER="5" DISPATCH_TYPE="STATIC_DISPATCH" LINE_NUMBER="4" METHOD_FULL_NAME="printf" NAME="printf" ORDER="1" SIGNATURE="" TYPE_FULL_NAME="ANY"]
  25769803777 -> 146028888064 [label=AST ]
  30064771072 -> 128849018880 [label=REACHING_DEF property="printf(\"Hello from func1: %d\\n\", arg)"]
  111669149697 -> 90194313216 [label=CFG ]
  68719476736 -> 120259084288 [label=REACHING_DEF property="arg"]
  111669149697 -> 90194313216 [label=REACHING_DEF property=""]
  68719476736 -> 68719476737 [label=REACHING_DEF property="arg"]
  68719476737 -> 115964116992 [label=REF ]
  68719476736 -> 30064771072 [label=REACHING_DEF property="arg"]
  111669149697 -> 68719476736 [label=CONTAINS ]
  68719476737 -> 146028888064 [label=DOMINATE ]
  90194313216 -> 68719476736 [label=CFG ]
  90194313216 -> 111669149697 [label=POST_DOMINATE ]
  111669149697 -> 25769803777 [label=CONTAINS ]
  146028888064 -> 128849018880 [label=DOMINATE ]
  111669149697 -> 115964116992 [label=AST ]
  68719476736 -> 30064771072 [label=CFG ]
  68719476736 -> 128849018880 [label=REACHING_DEF property="arg"]
  111669149697 -> 90194313216 [label=CONTAINS ]
  146028888064 -> 68719476737 [label=POST_DOMINATE ]
  30064771072 -> 90194313216 [label=ARGUMENT ]
  90194313216 -> 68719476736 [label=DOMINATE ]
  115964116992 -> 120259084288 [label=PARAMETER_LINK ]
  68719476736 -> 90194313216 [label=POST_DOMINATE ]
  30064771072 -> 68719476736 [label=ARGUMENT ]
  111669149697 -> 25769803777 [label=AST ]
  68719476737 -> 146028888064 [label=CFG ]
  68719476737 -> 146028888064 [label=REACHING_DEF property="arg"]
  115964116992 -> 68719476736 [label=REACHING_DEF property="arg"]
  111669149697 -> 68719476736 [label=REACHING_DEF property=""]
  30064771072 -> 90194313216 [label=AST ]
  111669149697 -> 68719476737 [label=REACHING_DEF property=""]
  128849018880 -> 146028888064 [label=POST_DOMINATE ]
  111669149697 -> 68719476737 [label=CONTAINS ]
  146028888064 -> 128849018880 [label=CFG ]
  25769803777 -> 30064771072 [label=AST ]
  111669149697 -> 90194313216 [label=DOMINATE ]
  68719476737 -> 30064771072 [label=POST_DOMINATE ]
  30064771072 -> 68719476737 [label=CFG ]
  115964116992 -> 120259084288 [label=REACHING_DEF property="arg"]
  111669149697 -> 30064771072 [label=CONTAINS ]
  111669149697 -> 115964116992 [label=REACHING_DEF property=""]
  146028888064 -> 68719476737 [label=ARGUMENT ]
  90194313216 -> 30064771072 [label=REACHING_DEF property="\"Hello from func1: %d\\n\""]
  146028888064 -> 128849018880 [label=REACHING_DEF property="<RET>"]
  68719476736 -> 30064771072 [label=DOMINATE ]
  30064771072 -> 68719476736 [label=AST ]
  111669149697 -> 146028888064 [label=CONTAINS ]
  30064771072 -> 68719476736 [label=POST_DOMINATE ]
  90194313216 -> 68719476736 [label=REACHING_DEF property="\"Hello from func1: %d\\n\""]
  146028888064 -> 68719476737 [label=AST ]
  68719476736 -> 115964116992 [label=REF ]
  111669149697 -> 120259084288 [label=AST ]
  111669149697 -> 128849018880 [label=AST ]
  30064771072 -> 68719476737 [label=DOMINATE ]
}

With the michael/dot-export version, exporting the dot file and rendering it to a png works:

➜  ida_scripts dot -Tpng /home/eqqie/work/ida_scripts/1.cpg.gv/1.c/func1.dot -o test.png
➜  ida_scripts

[Image: QQ_1734532881593]

It was successful, but it lacked a lot of key attribute information.

@yikesoftware
Author

yikesoftware commented Dec 18, 2024

@mpollmeier Perhaps the key issue lies only in the handling of quotation marks, as all internal quotation marks need to be manually escaped.
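
To illustrate why unescaped inner quotes break parsing, here is a small Python sketch of how a quoted string is delimited. It is a simplified scanner written for this example (end_of_quoted is a made-up name), not Graphviz's actual lexer. On the CODE value from the failing file, the string already ends at the quote before Hello, so the next bare words are Hello and from, which seems consistent with the "near 'from'" error.

```python
def end_of_quoted(text: str, start: int) -> int:
    """Return the index just past the closing quote of the string opening at `start`."""
    if text[start] != '"':
        raise ValueError("expected an opening double quote")
    i = start + 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2  # a backslash escapes the next character, e.g. \" does not terminate
        elif text[i] == '"':
            return i + 1  # first unescaped quote closes the string
        else:
            i += 1
    raise ValueError("unterminated quoted string")


# The start of the BLOCK node's CODE value, exactly as exported (inner quotes unescaped).
broken = 'CODE="{\n    printf("Hello from func1: %d\\n", arg);'
start = broken.index('"')
end = end_of_quoted(broken, start)
print(repr(broken[start:end]))  # the quoted string as the scanner sees it
print(broken[end:].split()[:2])  # the two bare words that follow it
```

With the inner quotes escaped as \", the same scanner runs through to the real closing quote.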

@mpollmeier
Contributor

what's ida_scripts?

@yikesoftware
Author

yikesoftware commented Dec 18, 2024

Path name. Just ignore it. I originally planned to use IDA to export some pseudocode and input it into Joern, but I discovered this issue during the testing phase.

@mpollmeier
Contributor

ah, sorry i got confused by your shell prefix 😆

@yikesoftware
Author

@mpollmeier

A belated Merry Christmas~

Any update?
