C parser and abstract syntax tree for Ruby.
require 'cast'
source = File.read('file.c')
ast = C.parse(source)
ast.entities.each do |declaration|
declaration.declarator.each do |declarator|
puts "#{declarator.name}: declarator.type"
end
end
Or in irb:
irb> ast = C.parse('int main(void) { return 0; }')
=> TranslationUnit
entities:
- FunctionDef
type: Function
type: Int
params: []
name: "main"
def: Block
stmts:
- Return
expr: IntLiteral
val: 0
irb> puts ast
int main(void) {
return 0;
}
=> nil
C.parse
returns a tree of Node
objects. Here's the class hierarchy:
- Node
- TranslationUnit
- Comment
- Declaration
- Declarator
- FunctionDef
- Parameter
- Enumerator
- MemberInit
- Member
- Statement
- Block
- If
- Switch
- While
- For
- Goto
- Continue
- Break
- Return
- ExpressionStatement
- Label
- PlainLabel
- Default
- Case
- Type
- IndirectType
- Pointer
- Array
- Function
- DirectType
- Struct
- Union
- Enum
- CustomType
- PrimitiveType
- Void
- Int
- Float
- Char
- Bool
- Complex
- Imaginary
- IndirectType
- Expression
- Comma
- Conditional
- Variable
- UnaryExpression
- PostfixExpression
- Index
- Call
- Dot
- Arrow
- PostInc
- PostDec
- PrefixExpression
- Cast
- Address
- Dereference
- Sizeof
- Plus
- Minus
- PreInc
- PreDec
- BitNot
- Not
- PostfixExpression
- BinaryExpression
- Add
- Subtract
- Multiply
- Divide
- Mod
- Equal
- NotEqual
- Less
- More
- LessOrEqual
- MoreOrEqual
- BitAnd
- BitOr
- BitXor
- ShiftLeft
- ShiftRight
- And
- Or
- AssignmentExpression
- Assign
- MultiplyAssign
- DivideAssign
- ModAssign
- AddAssign
- SubtractAssign
- ShiftLeftAssign
- ShiftRightAssign
- BitAndAssign
- BitXorAssign
- BitOrAssign
- Literal
- StringLiteral
- CharLiteral
- CompoundLiteral
- IntLiteral
- FloatLiteral
- NodeList
- NodeArray
- NodeChain
The bold ones are abstract.
The last 2 (NodeList
s) represent lists of Node
s. They quack like
standard ruby Arrays
. NodeChain
is a doubly linked list;
NodeArray
is an array.
-
parent
: return the parent in the tree (aNode
or nil). -
pos
,pos=
: the position in the source file (aNode::Pos
). -
to_s
: return the code for the tree (aString
). -
inspect
: return a pretty string for inspection, makes irb fun. -
match?(str)
,=~(str)
: return true iff str parses as aNode
equal to this one. -
detach
: remove this node from the tree (parent becomes nil) and return it. -
detached?
,attached?
: return true if parent is nil or non-nil respectively. -
replace_with(node)
: replace this node with node in the tree. -
swap_with(node)
: exchange this node with node in their trees. -
insert_prev(*nodes)
,insert_next(*nodes)
: insert nodes before this node in the parent list. Parent must be aNodeList
! Useful for adding statements before a node in a block, for example. -
Foo?
: (whereFoo
is a module name) returnself.is_a?(Foo)
. This is a convienience for a common need. Example:\# print all global variables ast.entities.each do |node| node.Declaration? or next node.declarators.each do |decl| unless decl.type.Function? puts "#{decl.name}: #{decl.type}" end end end
The =~
method lets you do:
if declarator.type =~ 'const int *'
puts "Ooh, a const int pointer!"
end
This is not the same as declarator.type.to_s == 'const int *'
;
that'd require you to guess how to_s
formats its strings (most
notably, the whitespace).
The big table down below lists the fields of each Node
. A field is
an attribute which:
- is used in equality checks (
==
andeql?
). - are copied recursively by
dup
andclone
.
Fields listed as children form the tree structure. They only have a
Node
or nil
value, and are yielded/returned/affected by the
traversal methods:
next
,prev
: return the next/prev sibling.list_next
,list_prev
: likenext
/prev
, but also requires the parent to beNodeList
. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.each
,reverse_each
: Yield all (non-nil) children.Node
includesEnumerable
, so, you know.depth_first
,reverse_depth_first
: Walk the tree in that order, yielding two args (event, node) at each node. event is:down
on the way down,:up
on the way up. If the block throws:prune
, it won't descend any further.preorder
,reverse_preorder
,postorder
,reverse_postorder
: Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws:prune
, it won't descend any further.node_after(child)
,node_before(child)
: return the node before/after child (same aschild.next
).remove_node(child)
: remove child from this node (same aschild.detach
).replace_node(child, new_child)
: replace child with yeah you guessed it (same aschild.replace_with(newchild)
).
Note: don't modify the tree during traversal!
Other notes about the table:
- Field names that end in '?' are always true-or-false.
- If no default is listed:
- it is false if the field name ends in a '?'
- it is a
NodeArray
if it is aNodeList
. - it is
nil
otherwise.
table.node_desc tr.first_field table td {
border: none;
}
table.node_desc td {
padding: 3px;
vertical-align: top;
{
table.node_desc table td {
padding: 0px;
{
Class | Field | Type / values | Default | Comments | ||||
---|---|---|---|---|---|---|---|---|
TranslationUnit | entities * | NodeList | NodeChain[] | The root of a parsed file. | ||||
Declaration | storage | :typedef, :extern, :static, :auto, :register |
Also:
|
|||||
type * | DirectType | |||||||
declarators * | NodeList | NodeArray[] | ||||||
inline? | true, false | |||||||
Declarator | indirect_type * | IndirectType |
What's a "declarator?" Consider "int i, *ip;". This is
a Declaration with two Declarators:
Declaration type: Int declarators: - Declarator name: "i" - Declarator indirect_type: Pointer name: "ip"The indirect_type of the ip Declarator is a Pointer to nil. To get the complete type of the variable use:
Pointer type: Int |
|||||
name | String | |||||||
init * | Expression | |||||||
num_bits * | Integer | |||||||
FunctionDef | storage | :extern, :static |
Also:
|
|||||
inline? | true, false | |||||||
type * | Type | |||||||
name | String | |||||||
def * | Block | Block.new | ||||||
no_prototype? | true, false | |||||||
Parameter | register? | true, false | Used in Functions. | |||||
type * | Type | |||||||
name | String | |||||||
Enumerator | name | String | Used in Enums. | |||||
val * | Expression | |||||||
MemberInit | member * | NodeList of (Member or Expression) | Used in CompoundLiterals. | |||||
init * | Expression | |||||||
Member | name | String | Used in MemberInits. | |||||
Block | labels * | NodeList of Label | NodeArray[] | |||||
stmts * | NodeList of (Statement or Declaration or Comment) | NodeArray[] | ||||||
If | labels * | NodeList of Label | NodeArray[] | |||||
cond * | Expression | |||||||
then * | Statement | |||||||
else * | Statement | |||||||
Switch | labels * | NodeList of Label | NodeArray[] | |||||
cond * | Expression | |||||||
stmt * | Statement | |||||||
While | labels * | NodeList of Label | NodeArray[] | do? means it's a do-while loop. | ||||
do? | true, false | |||||||
cond * | Expression | |||||||
stmt * | Statement | |||||||
For | labels * | NodeList of Label | NodeArray[] | |||||
init * | Expression or Declaration | |||||||
cond * | Expression | |||||||
iter * | Expression | |||||||
stmt * | Statement | |||||||
Goto | labels * | NodeList of Label | NodeArray[] | |||||
target | String | |||||||
Continue | labels * | NodeList of Label | NodeArray[] | |||||
Break | labels * | NodeList of Label | NodeArray[] | |||||
Return | labels * | NodeList of Label | NodeArray[] | |||||
expr * | Expression | |||||||
ExpressionStatement | labels * | NodeList of Label | NodeArray[] | |||||
expr * | Expression | |||||||
PlainLabel | name | String | ||||||
Default | ||||||||
Case | expr * | Expression | ||||||
Comma | exprs * | NodeList of Expression | ||||||
Conditional | cond * | Expression | ||||||
then * | Expression | |||||||
else * | Expression | |||||||
Variable | name | String | ||||||
Index | expr * | Expression | ||||||
index * | Expression | |||||||
Call | expr * | Expression | ||||||
args * | NodeList of (Expression or Type) | |||||||
Dot | expr * | Expression | ||||||
member * | String | |||||||
Arrow | expr * | Expression | ||||||
member * | String | |||||||
PostInc | expr * | Expression | ||||||
PostDec | expr * | Expression | ||||||
Cast | type * | Type | ||||||
expr * | Expression | |||||||
Address | expr * | Expression | ||||||
Dereference | expr * | Expression | ||||||
Sizeof | expr * | Type or Expression | ||||||
Positive | expr * | Expression | ||||||
Negative | expr * | Expression | ||||||
PreInc | expr * | Expression | ||||||
PreDec | expr * | Expression | ||||||
BitNot | expr * | Expression | ||||||
Not | expr * | Expression | ||||||
Add | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Subtract | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Multiply | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Divide | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Mod | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Equal | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
NotEqual | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Less | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
More | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
LessOrEqual | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
MoreOrEqual | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
BitAnd | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
BitOr | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
BitXor | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
ShiftLeft | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
ShiftRight | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
And | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Or | expr1 * | Expression | ||||||
expr2 * | Expression | |||||||
Assign | lval * | Expression | ||||||
rval * | Expression | |||||||
MultiplyAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
DivideAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
ModAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
AddAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
SubtractAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
ShiftLeftAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
ShiftRightAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
BitAndAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
BitXorAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
BitOrAssign | lval * | Expression | ||||||
rval * | Expression | |||||||
StringLiteral | val | String | The String in val is the literal string entered. "\n" isn't converted to a newline, for instance. | |||||
CharLiteral | val | String | The String in val is the literal string entered. '\n' isn't converted to a newline, for instance. | |||||
CompoundLiteral | type * | Type |
Here's an example: (struct S){1, .x = 2, .y [3] .z = 4} parses as: CompoundLiteral type: Struct name: "S" member_inits: - MemberInit init: IntLiteral val: 1 - MemberInit member: - Member name: "x" init: IntLiteral val: 2 - MemberInit member: - Member name: "y" - IntLiteral val: 3 - Member name: "z" init: IntLiteral val: 4 |
|||||
member_inits * | NodeList of MemberInit | NodeArray[] | ||||||
IntLiteral | format | :dec, :hex, :oct | :dec |
Also:
|
||||
val | Integer | |||||||
suffix | String | |||||||
FloatLiteral | format | :dec, :hex | :dec | |||||
val | Float | |||||||
exponent | Integer | |||||||
suffix | String | |||||||
Pointer | const? | true, false | ||||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
type * | Type | |||||||
Array | const? | true, false | ||||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
type * | Type | |||||||
length * | Expression | |||||||
Function | const? | true, false | ||||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
type * | Type | |||||||
params * | NodeList of Parameter | NodeArray[] | ||||||
var_args? | true, false | |||||||
Struct | const? | true, false | ||||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
name | String | |||||||
members * | NodeList of Member | NodeArray[] | ||||||
Union | const? | true, false | ||||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
name | String | |||||||
members * | NodeList of Member | NodeArray[] | ||||||
Enum | const? | true, false | ||||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
name | String | |||||||
members * | NodeList of Enumerator | |||||||
CustomType | const? | true, false | For typedef'd names. | |||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
name | String | |||||||
Void | const? | true, false | const is for things like const void *. | |||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
Int | const? | true, false |
Also:
|
|||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
longness | -1, 0, 1, 2 | 0 | ||||||
unsigned? | true, false | |||||||
Float | const? | true, false |
Also:
|
|||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
longness | 0, 1, 2 | 0 | ||||||
Char | const? | true, false |
Also:
|
|||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
signed | true, false, nil | |||||||
Bool | const? | true, false | This is the rarely seen _Bool type. | |||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
Complex | const? | true, false |
This is the rarely seen _Complex type.
|
|||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
longness | 0, 1, 2 | 0 | ||||||
Imaginary | const? | true, false |
This is the rarely seen _Imaginary type.
|
|||||
restrict? | true, false | |||||||
volatile? | true, false | |||||||
longness | 0, 1, 2 | 0 | ||||||
BlockExpression | block * | Block | Block.new | Only if the block_expressions extension is enabled. See "Extensions" section below. |
C.parse
will use the default parser (C.default_parser
), but you
can also manage your own parser(s) if you need finer control over
state. Parser state consists of:
type_names
: a Set of Strings. As a parser eatstypedef
s, this grows.pos
: theNode::Pos
this parser will start parsing at.
A Node::Pos
has three read-write attributes: filename
, line_num
,
col_num
. Default is nil, 1, 0.
Note that the type names the parser has seen affects the parser! For example, consider:
a * b;
- If only
a
is a type, this is a declaration. - If neither
a
norb
are types, this is a multiplication statement. - Otherwise, it's a syntax error.
You may append type names implicitly, by parsing typedef
s, or
explicitly like this:
parser.type_names << 'Thing' << 'OtherThing'
C.parse
will parse the toplevel C construct, a C::TranslationUnit
,
but you can also parse other snippets of C:
C::Statement.parse('while (not_looking) { paint_car(); }')
C::Type.parse('void *(*)(int *(*)[][2], ...)')
This works for both concrete and abstract Node
subclasses. A
Parser
may be given as an optional second argument.
Type
s are allowed as function arguments. This is needed to parse C99 macros likeva_arg()
.Block
s in parentheses are allowed as expressions (a gcc extension). You need to callparser.enable_block_expressions
first. They appear asBlockExpression
nodes.
This can be tricky for a number of reasons. Here are the issues you'll likely encounter.
Directives that start with #
are not handled by the Parser
, as
they're external to the C grammar. CAST ships with a Preprocessor
,
which wraps the preprocessor used to build your Ruby interpreter.
cpp = C::Preprocessor.new
cpp.include_path << '/usr/include' << /usr/local/include'
cpp.macros['DEBUG'] = '1'
cpp.macros['max(a, b)'] = '((a) > (b) ? (a) : (b))'
cpp.preprocess(code)
Note however, that preprocessors tend to leave vendor-specific
extensions in their output. GNU cpp
, for example, leaves
"linemarkers" (lines that begin with #
) in the output which you'll
need to filter out manually before feeding it to a Parser
.
Mac OS 10.5's system cpp
for instance assumes the compiler will
recognize types such as __darwin_va_list
.
Some code may take advantage of compiler-specific extensions to the
syntax. For example, gcc
supports inline assembly via directives
like:
asm("movl %1, %%eax;
"movl %%eax, %0;"
:"=r"(y)
:"r"(x)
:"%eax");
Such code is fairly rare, so there is no direct support in CAST for
this. You'll need to manually massage such constructs out of the
Parser
input. Or send me patches. Delicious patches.
- Bug reports
- Source
- Patches: Fork on Github, send pull request.
- Include tests where practical.
- Leave the version alone, or bump it in a separate commit.
Copyright (c) George Ogata. See LICENSE for details.