Warning
I'm new to Python internals, so the tutorial may contain mistakes and clunky solutions.
The goal is to add an operator to perform bash-like chain calls to CPython:
>>> [1,2] |> map(lambda x:x*2) |> list()
[2, 4]
According to the CPython devguide, the compilation of source code involves several steps:
Tokenize the source code
Parse the stream of tokens into an Abstract Syntax Tree.
Transform AST into an instruction sequence
Construct a Control Flow Graph and apply optimizations to it.
Emit bytecode based on the Control Flow Graph.
To implement |>
, we are going to change the first three steps: modify the parsing and compilation processes.
Clone the cpython repo and checkout to a new branch.
$ git clone [email protected]:python/cpython.git && cd cpython
$ git checkout tags/v3.12.1 -b pipes
Let's start with tokenization. Tokenization is a process of splitting an input string into a sequence of tokens.
Add a new PIPE
token to the Grammar/Tokens
ELLIPSIS '...'
COLONEQUAL ':='
EXCLAMATION '!'
+PIPE '|>'
Next, run make regen-token
to regenerate pycore_token.h
, Parser/token.c
, Lib/token.py
.
For example, look at the changes in Parser/token.c
. regen-token
added a switch case to the parser code.
case '|':
switch (c2) {
case '=': return VBAREQUAL;
+ case '>': return PIPE;
}
break;
}
Next, move to Parser/Python.asdl
. ASDL is a blueprint for Python AST nodes. Each AST node is defined by the ASDL.
Let's add a new node – PipeOP expression. It contains two children – right and left expressions.
expr = BoolOp(boolop op, expr* values)
| NamedExpr(expr target, expr value)
| BinOp(expr left, operator op, expr right)
+ | PipeOp(expr left, expr right)
| UnaryOp(unaryop op, expr operand)
| Lambda(arguments args, expr body)
| IfExp(expr test, expr body, expr orelse)
Then run make regen-ast
to regenerate pycore_ast.h
and Python/Python-ast.c
.
Look at the changes in Python/Python-ast.c
. regen-ast
generated the constructor function for the pipe operator expression.
expr_ty
_PyAST_PipeOp(expr_ty left, expr_ty right, int lineno, int col_offset, int
end_lineno, int end_col_offset, PyArena *arena)
{
expr_ty p;
if (!left) {
PyErr_SetString(PyExc_ValueError,
"field 'left' is required for PipeOp");
return NULL;
}
if (!right) {
PyErr_SetString(PyExc_ValueError,
"field 'right' is required for PipeOp");
return NULL;
}
p = (expr_ty)_PyArena_Malloc(arena, sizeof(*p));
if (!p)
return NULL;
p->kind = PipeOp_kind;
p->v.PipeOp.left = left;
p->v.PipeOp.right = right;
p->lineno = lineno;
p->col_offset = col_offset;
p->end_lineno = end_lineno;
p->end_col_offset = end_col_offset;
return p;
}
Now let's move to the Parser/python.gram
.
This file contains the whole python grammar. In short: it describes how to construct an abstract syntax tree using the grammar rules.
Add a pipe_op
expression definition.
power[expr_ty]:
| a=await_primary '**' b=factor { _PyAST_BinOp(a, Pow, b, EXTRA) }
+ | pipe_op
+
+pipe_op[expr_ty]:
+ | a=pipe_op '|>' b=await_primary { _PyAST_PipeOp(a, b, EXTRA) }
| await_primary
It means "When matching a await_primary
after |>
token, construct a AST node using _PyAST_PipeOp
function"
Then run make regen-pegen
to regenerate Parser/parser.c
.
The whole AST parser is already generated by regen-pegen
.
We only need to update AST validation. Add a case for PipeOp_king
to Python/ast.c/validate_expr
.
+ case PipeOp_kind:
+ if (exp->v.PipeOp.right->kind != Call_kind) {
+ PyErr_SetString(PyExc_TypeError,
+ "Pipe op arg must be a function call");
+ return 0;
+ }
+ ret = validate_expr(state, exp->v.PipeOp.left, Load) &&
+ validate_expr(state, exp->v.PipeOp.right, Load);
+ break;
Recompile CPython with make -j4
and test the parser with ast.parse
module:
>>> import ast
>>> tree = ast.parse('1 |> f()')
>>> ast.dump(tree)
"Module(body=[Expr(value=PipeOp(left=Constant(value=1), right=Call(func=Name(id='f', ctx=Load()), args=[], keywords=[])))], type_ignores=[])"
Parsing is working, so we can move to the compilation part.
The next stage is compilation. Now we need to translate an AST to a sequence of commands for the Python VM.
a |> f(b)
is another way of saying f(a, b)
Let's examine how f(a, b)
is compiled into bytecode instructions using the dis
module:
>>> import dis
>>> dis.dis("f(a, b)")
0 0 RESUME 0
1 2 PUSH_NULL
4 LOAD_NAME 0 (f)
6 LOAD_NAME 1 (a)
8 LOAD_NAME 2 (b)
10 PRECALL 2
14 CALL 2
24 RETURN_VALUE
These instructions are telling an interpreter to:
- Load a function to a stack using
LOAD_NAME
- Load value of
a
to the stack usingLOAD_NAME
- Load value of
b
to the stack usingLOAD_NAME
- Call the function using
CALL
with2
arguments.
So, to implement the pipe, we need to add an additional argument to the stack, between a function load and a function call, during compilation.
The starting point of compilation is the Python/compile.c
file.
Let's look to the Python/compile.c/compiler_visit_expr1
.
The function describes compilation of an expr_ty
AST node to with a simple switch. For example, to compile a binary operation like a + b
complier recurcively visitis left and right nodes, and adds binary operation to controll sequence.
static int
compiler_visit_expr1(struct compiler *c, expr_ty e)
{
location loc = LOC(e);
switch (e->kind) {
case NamedExpr_kind:
VISIT(c, expr, e->v.NamedExpr.value);
ADDOP_I(c, loc, COPY, 1);
VISIT(c, expr, e->v.NamedExpr.target);
break;
case BoolOp_kind:
return compiler_boolop(c, e);
case BinOp_kind:
VISIT(c, expr, e->v.BinOp.left);
VISIT(c, expr, e->v.BinOp.right);
ADDOP_BINARY(c, loc, e->v.BinOp.op);
break;
...
Add a new case with for PipeOp_kind
. Let's start with a copy of regular function call.
+ case PipeOp_kind:
+ return compiler_call(c, e->>v.PipeOp.right);
case Lambda_kind:
return compiler_lambda(c, e);
Recompile Cpython with make -j4
and try new operator:
>>> 1 |> f()
Assertion failed: (scope || PyUnicode_READ_CHAR(name, 0) == '_'), function compiler_nameop, file compile.c, line 4209
Seems like compiler can't resolve a f
symbol. Let's fix that.
The purpose of a symbol table is to resolving scopes.
Look at Python/symtable.c/symtable_visit_expr
. The function recursively visits the AST and builds symbol tables.
static int
symtable_visit_expr(struct symtable *st, expr_ty e)
{
...
switch (e->kind) {
case NamedExpr_kind:
if (!symtable_raise_if_annotation_block(st, "named expression", e)) {
VISIT_QUIT(st, 0);
}
if(!symtable_handle_namedexpr(st, e))
VISIT_QUIT(st, 0);
break;
case BoolOp_kind:
VISIT_SEQ(st, expr, e->v.BoolOp.values);
break;
....
Add a case for a PipeOP_kind
:
case BinOp_kind:
VISIT(st, expr, e->v.BinOp.left);
VISIT(st, expr, e->v.BinOp.right);
+ break;
+ case PipeOp_kind:
+ VISIT(st, expr, e->v.PipeOp.left);
+ VISIT(st, expr, e->v.PipeOp.right);
break;
case UnaryOp_kind:
VISIT(st, expr, e->v.UnaryOp.operand);
Recompile CPython with make -j4
and test symbols with symtables
module:
>>> import symtable
>>> st = symtable.symtable('1 |> f()', "example.py", "exec")
>>> [ (symbol.get_name(), symbol.is_global() ) for symbol in st.get_symbols() ]
[('f', True)]
Move back to the Python/compile.c/compiler_visit_expr1
and redefine case for PipeOp_kind
with compiler_pipe_call
:
case PipeOp_kind:
- return compiler_call(c, e->>v.PipeOp.right);
+ return compiler_pipe_call(c, e);
case Lambda_kind:
return compiler_lambda(c, e);
And define a new compiler_pipe_call
function. Start with a modified copy of compiler_call
, which calls the right expression:
static int compiler_pipe_call(struct compiler *c, expr_ty e) {
expr_ty func_e = e->v.PipeOp.right;
expr_ty arg_e = e->v.PipeOp.left;
RETURN_IF_ERROR(validate_keywords(c, func_e->v.Call.keywords));
int ret = maybe_optimize_method_call(c, func_e);
if (ret < 0) {
return ERROR;
}
if (ret == 1) {
return SUCCESS;
}
RETURN_IF_ERROR(check_caller(c, func_e->v.Call.func));
VISIT(c, expr, func_e->v.Call.func);
location loc = LOC(func_e->v.Call.func);
ADDOP(c, loc, PUSH_NULL);
loc = LOC(func_e);
return compiler_call_helper(c, loc, 0,
func_e->v.Call.args,
func_e->v.Call.keywords);
}
Recompile CPython and check that nothing is broken:
>>> f = lambda x:x
>>> 1 |> f()
TypeError: <lambda>() missing 1 required positional argument: 'x'
Fine. Finaly, we need to add e->v.PipeOp.left
to v.Call.args
Let's create new argument sequence using _Py_asdl_expr_seq_new
and fill it with orignal args and left pipe expression:
static int compiler_pipe_call(struct compiler *c, expr_ty e) {
expr_ty func_e = e->v.PipeOp.right;
expr_ty arg_e = e->v.PipeOp.left;
RETURN_IF_ERROR(validate_keywords(c, func_e->v.Call.keywords));
int ret = maybe_optimize_method_call(c, func_e);
if (ret < 0) {
return ERROR;
}
if (ret == 1) {
return SUCCESS;
}
RETURN_IF_ERROR(check_caller(c, func_e->v.Call.func));
VISIT(c, expr, func_e->v.Call.func);
location loc = LOC(func_e->v.Call.func);
ADDOP(c, loc, PUSH_NULL);
loc = LOC(func_e);
+ Py_ssize_t original_len = asdl_seq_LEN(func_e->v.Call.args);
+ asdl_expr_seq *new_args = _Py_asdl_expr_seq_new(
+ original_len+1, c->c_arena);
+ for (Py_ssize_t i = 0; i < original_len; i++) {
+ asdl_seq_SET(new_args, i, asdl_seq_GET(func_e->v.Call.args, i));
+ }
+ asdl_seq_SET(new_args, original_len, arg_e);
return compiler_call_helper(c, loc, 0,
- func_e->v.Call.args,
+ new_args,
func_e->v.Call.keywords);
}
Thats it!
Recompile Cpython with make -j4
and try new operator:
>>> [1,2] |> map(lambda x:x*2) |> list()
[2, 4]
Yaaaaay!
>>> import dis
>>> dis.dis("a |> f(b)")
0 RESUME 0
1 LOAD_NAME 0 (f)
PUSH_NULL
LOAD_NAME 1 (b)
LOAD_NAME 2 (a)
CALL 2
RETURN_VALUE
The bytecode is the same to a f(a,b)
.