Skip to content

Latest commit

 

History

History
417 lines (319 loc) · 11.6 KB

pipe.md

File metadata and controls

417 lines (319 loc) · 11.6 KB

Warning

I'm new to Python internals, so the tutorial may contain mistakes and clunky solutions.

Pipe operator

The goal is to add an operator to perform bash-like chain calls to CPython:

>>> [1,2] |> map(lambda x:x*2) |> list()

[2, 4]

Plan

According to the CPython devguide, the compilation of source code involves several steps:

  1. Tokenize the source code

  2. Parse the stream of tokens into an Abstract Syntax Tree.

  3. Transform AST into an instruction sequence

  4. Construct a Control Flow Graph and apply optimizations to it.

  5. Emit bytecode based on the Control Flow Graph.

To implement |>, we are going to change the first three steps: modify the parsing and compilation processes.

Prereq­ui­sites

Clone the cpython repo and checkout to a new branch.

$ git clone [email protected]:python/cpython.git && cd cpython
$ git checkout tags/v3.12.1 -b pipes

Parsing

Tokenezation

Let's start with tokenization. Tokenization is a process of splitting an input string into a sequence of tokens.

Add a new PIPE token to the Grammar/Tokens

 ELLIPSIS                '...'
 COLONEQUAL              ':='
 EXCLAMATION             '!'
+PIPE                    '|>'

Next, run make regen-token to regenerate pycore_token.h, Parser/token.c, Lib/token.py.

For example, look at the changes in Parser/token.c. regen-token added a switch case to the parser code.

     case '|':
         switch (c2) {
         case '=': return VBAREQUAL;
+        case '>': return PIPE;
         }
         break;
     }

ASDL

Next, move to Parser/Python.asdl. ASDL is a blueprint for Python AST nodes. Each AST node is defined by the ASDL.

Let's add a new node – PipeOP expression. It contains two children – right and left expressions.

     expr = BoolOp(boolop op, expr* values)
          | NamedExpr(expr target, expr value)
          | BinOp(expr left, operator op, expr right)
+         | PipeOp(expr left, expr right)
          | UnaryOp(unaryop op, expr operand)
          | Lambda(arguments args, expr body)
          | IfExp(expr test, expr body, expr orelse)

Then run make regen-ast to regenerate pycore_ast.h and Python/Python-ast.c.

Look at the changes in Python/Python-ast.c. regen-ast generated the constructor function for the pipe operator expression.

expr_ty
_PyAST_PipeOp(expr_ty left, expr_ty right, int lineno, int col_offset, int
              end_lineno, int end_col_offset, PyArena *arena)
{
    expr_ty p;
    if (!left) {
        PyErr_SetString(PyExc_ValueError,
                        "field 'left' is required for PipeOp");
        return NULL;
    }
    if (!right) {
        PyErr_SetString(PyExc_ValueError,
                        "field 'right' is required for PipeOp");
        return NULL;
    }
    p = (expr_ty)_PyArena_Malloc(arena, sizeof(*p));
    if (!p)
        return NULL;
    p->kind = PipeOp_kind;
    p->v.PipeOp.left = left;
    p->v.PipeOp.right = right;
    p->lineno = lineno;
    p->col_offset = col_offset;
    p->end_lineno = end_lineno;
    p->end_col_offset = end_col_offset;
    return p;
}

Grammar

Now let's move to the Parser/python.gram.

This file contains the whole python grammar. In short: it describes how to construct an abstract syntax tree using the grammar rules.

Add a pipe_op expression definition.

 power[expr_ty]:
     | a=await_primary '**' b=factor { _PyAST_BinOp(a, Pow, b, EXTRA) }
+    | pipe_op
+
+pipe_op[expr_ty]:
+    | a=pipe_op '|>' b=await_primary { _PyAST_PipeOp(a, b, EXTRA) } 
     | await_primary

It means "When matching a await_primary after |> token, construct a AST node using _PyAST_PipeOp function"

Then run make regen-pegen to regenerate Parser/parser.c.

Parsing from tokens to AST

The whole AST parser is already generated by regen-pegen.

We only need to update AST validation. Add a case for PipeOp_king to Python/ast.c/validate_expr.

+    case PipeOp_kind:
+        if (exp->v.PipeOp.right->kind != Call_kind) {
+            PyErr_SetString(PyExc_TypeError,
+                            "Pipe op arg must be a function call");
+            return 0;
+        }
+        ret = validate_expr(state, exp->v.PipeOp.left, Load) &&
+        validate_expr(state, exp->v.PipeOp.right, Load);
+        break;

Test parsing to AST

Recompile CPython with make -j4 and test the parser with ast.parse module:

>>> import ast
>>> tree = ast.parse('1 |> f()')
>>> ast.dump(tree)

"Module(body=[Expr(value=PipeOp(left=Constant(value=1), right=Call(func=Name(id='f', ctx=Load()), args=[], keywords=[])))], type_ignores=[])"

Parsing is working, so we can move to the compilation part.

Compilation from AST to bytecode

The next stage is compilation. Now we need to translate an AST to a sequence of commands for the Python VM.

Target bytecode

a |> f(b) is another way of saying f(a, b)

Let's examine how f(a, b) is compiled into bytecode instructions using the dis module:

>>> import dis
>>> dis.dis("f(a, b)")
  0           0 RESUME                   0

  1           2 PUSH_NULL
              4 LOAD_NAME                0 (f)
              6 LOAD_NAME                1 (a)
              8 LOAD_NAME                2 (b)
             10 PRECALL                  2
             14 CALL                     2
             24 RETURN_VALUE

These instructions are telling an interpreter to:

  1. Load a function to a stack using LOAD_NAME
  2. Load value of a to the stack using LOAD_NAME
  3. Load value of b to the stack using LOAD_NAME
  4. Call the function using CALL with 2 arguments.

So, to implement the pipe, we need to add an additional argument to the stack, between a function load and a function call, during compilation.

Compile.c

The starting point of compilation is the Python/compile.c file.

Let's look to the Python/compile.c/compiler_visit_expr1 .

The function describes compilation of an expr_ty AST node to with a simple switch. For example, to compile a binary operation like a + b complier recurcively visitis left and right nodes, and adds binary operation to controll sequence.

static int
compiler_visit_expr1(struct compiler *c, expr_ty e)
{
    location loc = LOC(e);
    switch (e->kind) {
    case NamedExpr_kind:
        VISIT(c, expr, e->v.NamedExpr.value);
        ADDOP_I(c, loc, COPY, 1);
        VISIT(c, expr, e->v.NamedExpr.target);
        break;
    case BoolOp_kind:
        return compiler_boolop(c, e);
    case BinOp_kind:
        VISIT(c, expr, e->v.BinOp.left);
        VISIT(c, expr, e->v.BinOp.right);
        ADDOP_BINARY(c, loc, e->v.BinOp.op);
        break;
...

Add a new case with for PipeOp_kind. Let's start with a copy of regular function call.

+    case PipeOp_kind:
+        return compiler_call(c, e->>v.PipeOp.right);
     case Lambda_kind:
         return compiler_lambda(c, e);

Recompile Cpython with make -j4 and try new operator:

>>> 1 |> f()

Assertion failed: (scope || PyUnicode_READ_CHAR(name, 0) == '_'), function compiler_nameop, file compile.c, line 4209

Seems like compiler can't resolve a f symbol. Let's fix that.

Symbables

The purpose of a symbol table is to resolving scopes.

Look at Python/symtable.c/symtable_visit_expr. The function recursively visits the AST and builds symbol tables.

static int
symtable_visit_expr(struct symtable *st, expr_ty e)
{
    ...
    switch (e->kind) {
    case NamedExpr_kind:
        if (!symtable_raise_if_annotation_block(st, "named expression", e)) {
            VISIT_QUIT(st, 0);
        }
        if(!symtable_handle_namedexpr(st, e))
            VISIT_QUIT(st, 0);
        break;
    case BoolOp_kind:
        VISIT_SEQ(st, expr, e->v.BoolOp.values);
        break;
    ....

Add a case for a PipeOP_kind:

     case BinOp_kind:
         VISIT(st, expr, e->v.BinOp.left);
         VISIT(st, expr, e->v.BinOp.right);
+        break;
+    case PipeOp_kind:
+        VISIT(st, expr, e->v.PipeOp.left);
+        VISIT(st, expr, e->v.PipeOp.right);
         break;
     case UnaryOp_kind:
         VISIT(st, expr, e->v.UnaryOp.operand);

Recompile CPython with make -j4 and test symbols with symtables module:

>>> import symtable
>>> st = symtable.symtable('1 |> f()',  "example.py", "exec")
>>> [ (symbol.get_name(), symbol.is_global() ) for symbol in st.get_symbols() ]

[('f', True)]

Compilaton

Move back to the Python/compile.c/compiler_visit_expr1 and redefine case for PipeOp_kind with compiler_pipe_call:

    case PipeOp_kind:
-        return compiler_call(c, e->>v.PipeOp.right);
+        return compiler_pipe_call(c, e);
     case Lambda_kind:
         return compiler_lambda(c, e);

And define a new compiler_pipe_call function. Start with a modified copy of compiler_call, which calls the right expression:

static int compiler_pipe_call(struct compiler *c, expr_ty e) {
    expr_ty func_e = e->v.PipeOp.right;
    expr_ty arg_e = e->v.PipeOp.left;

    RETURN_IF_ERROR(validate_keywords(c, func_e->v.Call.keywords));
    int ret = maybe_optimize_method_call(c, func_e);
    if (ret < 0) {
        return ERROR;
    }
    if (ret == 1) {
        return SUCCESS;
    }
    
    RETURN_IF_ERROR(check_caller(c, func_e->v.Call.func));
    VISIT(c, expr, func_e->v.Call.func);
    location loc = LOC(func_e->v.Call.func);
    ADDOP(c, loc, PUSH_NULL);
    loc = LOC(func_e);

    return compiler_call_helper(c, loc, 0,
                                func_e->v.Call.args,
                                func_e->v.Call.keywords);
}

Recompile CPython and check that nothing is broken:

>>> f = lambda x:x
>>> 1 |> f()

TypeError: <lambda>() missing 1 required positional argument: 'x'

Fine. Finaly, we need to add e->v.PipeOp.left to v.Call.args

Let's create new argument sequence using _Py_asdl_expr_seq_new and fill it with orignal args and left pipe expression:

static int compiler_pipe_call(struct compiler *c, expr_ty e) {
    expr_ty func_e = e->v.PipeOp.right;
    expr_ty arg_e = e->v.PipeOp.left;

    RETURN_IF_ERROR(validate_keywords(c, func_e->v.Call.keywords));
    int ret = maybe_optimize_method_call(c, func_e);
    if (ret < 0) {
        return ERROR;
    }
    if (ret == 1) {
        return SUCCESS;
    }
    
    RETURN_IF_ERROR(check_caller(c, func_e->v.Call.func));
    VISIT(c, expr, func_e->v.Call.func);
    location loc = LOC(func_e->v.Call.func);
    ADDOP(c, loc, PUSH_NULL);
    loc = LOC(func_e);

+    Py_ssize_t original_len = asdl_seq_LEN(func_e->v.Call.args);
+    asdl_expr_seq *new_args = _Py_asdl_expr_seq_new(
+            original_len+1, c->c_arena);
    
+    for (Py_ssize_t i = 0; i < original_len; i++) {
+            asdl_seq_SET(new_args, i, asdl_seq_GET(func_e->v.Call.args, i));
+    }

+    asdl_seq_SET(new_args, original_len, arg_e);

    return compiler_call_helper(c, loc, 0,
-                                func_e->v.Call.args,
+                                new_args,
                                func_e->v.Call.keywords);
}

Thats it!

Recompile Cpython with make -j4 and try new operator:

>>> [1,2] |> map(lambda x:x*2) |> list()

[2, 4]

Yaaaaay!

Final dis

>>> import dis
>>> dis.dis("a |> f(b)")

  0           RESUME                   0

  1           LOAD_NAME                0 (f)
              PUSH_NULL
              LOAD_NAME                1 (b)
              LOAD_NAME                2 (a)
              CALL                     2
              RETURN_VALUE

The bytecode is the same to a f(a,b).