Skip to content

Latest commit

 

History

History
540 lines (396 loc) · 25.8 KB

Custom-Pylint-checks.md

File metadata and controls

540 lines (396 loc) · 25.8 KB

Table of contents

Introduction

We use Pylint to lint our Python code. Before writing a custom rule, you should check whether you can achieve your goals with Pylint's built-in rules and configurations. If those are insufficient, you can follow this guide to write a custom Pylint checker.

Limits of linters

Before we get too far into custom lint rules, let's make sure we understand what linters like ESLint are actually capable of doing.

Consider two types things you might want to prohibit with a lint rule:

  • A property of the code's text. For example, you might want to ensure that all lines are at most 80 characters long.
  • A property of the code's behavior. For example, you might want to ensure that the code, when executed, does not print a particular string.

Linters operate on the code as text, so you can write a lint rule that bans a textual property with perfect accuracy. However, any lint rule that tries to prohibit a behavior must be imperfect. To see this intuitively, consider the problem we'll approach below: writing a linter to prohibit programs that print Hello, world!. If our linter searched for the string Hello, world!, it would miss a program that printed 'Hello, ' + 'world!'.

If you are interested in computability theory, the technical description for this problem is that predicting a program's behavior is an undecidable problem. Look up the halting problem if you want to learn more.

Whenever you are creating a new lint rule, consider whether you are trying to prohibit a textual or a behavioral property. If you want to prohibit a textual property, then your task is easy--just write a lint rule that perfectly blocks the property. On the other hand, if you want to prohibit a behavior, then you'll need to first decide on textual properties that you can prohibit to approximately prohibit the behavior. For example, if you want to stop programs from printing Hello, world!, it's approximately equivalent to prohibit literal Hello, world! strings. Then, write your lint rule to perfectly prohibit those textual properties. Note that since your textual properties will be an imperfect translation of the behavioral properties you want to prohibit, your lint rule will miss some code that performs the banned behavior, will flag some code that doesn't actually perform the banned behavior, or both. That's expected given the limits of linters.

Write checkers

Types of checkers

Pylint supports three kinds of checkers:

  • AST checkers operate on an abstract syntax tree (AST) of the code that Pylint generates. The checker provides functions to be called as Pylint traverses the AST, and these functions can raise lint errors.

  • Token checkers operate on a list of the tokens that represents the code being linted. The tokens are of the same form as the tokens generated by tokenize.tokenize().

  • Raw checkers operate on the raw code without any pre-processing by Pylint.

You should avoid writing custom raw checkers since they are less efficient than token or AST checkers. You should use an AST checker whenever the syntax information of an AST is useful. For example, if you want to count the number of variables in each function, the AST can tell you where the variables are and where the functions start and end. If you don't need syntax information, you can just use a token checker. Since most lint rules are easier to write given syntax information, you'll probably write mostly AST checkers. However, token checkers can be useful for simple rules, particularly those where you only need to consider one token at a time. For example, a token checker would be a great way to search for any string that uses double-quotes ("").

As we work through the example below, we'll cover each of the three kinds of checkers.

Hello world

To get a better idea for how lint checks work, let's try writing a simple Pylint rule that forbids the string Hello, world!. (Maybe you don't want people accidentally opening PRs that include code that prints Hello, world! from when they were learning Python.) First consider what kinds of code should violate the rule. Here are some examples:

s = "Hello, world!";

print("Hello, world!");

Below, we'll walk through how to implement this rule using each of the three kinds of checkers in Pylint: AST checkers, token checkers, and raw checkers.

Write the AST checker

Examine the AST

To develop an intuition for how ASTs work, let's see how astroid represents some example Hello, world! code. Open a Python shell in your Oppia directory and import a file from Oppia like the pre-commit linter:

>>> from scripts.linters import run_lint_checks

This import will modify your PYTHONPATH to include Oppia's Python dependencies, so you can now import astroid, which Pylint uses to generate the AST:

>>> import astroid

Next, use astroid to parse the code:

>>> code = """
... s = "Hello, world!"
... print("Hello, world!")
... """
>>> print(astroid.parse(code).repr_tree())
Module(
   name='',
   doc=None,
   file='<?>',
   path=['<?>'],
   package=False,
   pure_python=True,
   future_imports=set(),
   body=[Assign(
         targets=[AssignName(name='s')],
         value=Const(value='Hello, world!')),
      Expr(value=Call(
            func=Name(name='print'),
            args=[Const(value='Hello, world!')],
            keywords=[]))])

Notice that every time a "Hello, world!" literal appears in the code, the AST has a Const node. You can look up all the available node types in the astroid documentation.

Design the AST checker

Suppose you decide to write a checker that raises an error whenever it finds a node that:

  • is of type "Const"
  • has value "Hello, world!"

After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). For our rule:

  • False positives

    • Code that includes the string Hello, world! but doesn't print it will still raise errors. There probably aren't any good reasons to do this though.
  • False negatives

    • Code that constructs the string Hello, world! instead of including it as a literal will pass the lint check. This is an acceptable imperfection though since developers are almost always going to use the string literal when writing a hello world program.
Write the AST checker code

Start by creating a checker class with some metadata in scripts/linters/pylint_extensions.py:

class HelloWorldASTChecker(BaseChecker):
    __implements__ = IAstroidChecker

    name = 'hello-world-ast'
    priority = -1
    msgs = {
        'C9001': (
            'Uses a "Hello, world!" string.',
            'hello-world-ast',
            'No code should use "Hello, world!" statements.',
        ),
    }

Here's what we've defined so far:

  • __implements__: Specifies what kind of checker we are creating. In this case, we are creating an AST checker.

  • name: Short name of the checker that will be used in the configuration file.

  • priority: A negative integer indicating when the check should run. Checkers run from the most negative priority to most positive.

  • msgs: A dictionary specifying what messages this checker can raise. The dictionary has the following form:

    {
        'message id': ('displayed message', 'message symbol', 'message help')
    }
    • Message ID: A unique identifier for the message consisting of a letter followed by 4 digits. The letter is the first letter of the message category, and the available categories are Convention, Warning, Error, Fatal, and Refactoring. The first two digits must be the same for all of the checker's messages.

      Be sure to check that your message ID isn't already used by one of Pylint's existing rules or one of our custom rules.

    • Displayed message: The error message displayed to the user.

    • Message symbol: An alias of the message ID that can be used wherever the ID can be used. For example, to disable a message with a code comment, you can specify either the ID or the symbol.

    • Message help: Displayed when the user runs pylint --help-msg to learn more about your message.

You can also set an options variable to specify options that can be set by the configuration file and read by your rule. See the Pylint documentation for details.

Next, you need to write the checker code. Pylint lets you define methods that get called as it traverses the AST. Methods of the form visit_nodename() are called upon entering the subtree rooted at a node of type nodename, and methods of the form leave_nodename() are called upon leaving the subtree. Each method is called with a single argument--the node of type nodename.

For this case, you can define a method visit_const() that will be called whenever Pylint reaches a Const node. Then you can check the node's value to see if it equals "Hello, world!". The method looks like this:

def visit_const(self, node):
    if node.value == "Hello, world!":
        self.add_message(
            'hello-world', node=node)

That's it! After working through the two other kinds of linters, we'll discuss how to run and test your checker.

If you want an alternative tutorial for writing an AST checker, take a look at the one Pylint provides.

Write the token checker

Examine the tokens

Fill a file test.py with the following code:

s = "Hello, world!"
print("Hello, world!")

Then convert the file code into tokens:

$ python -m tokenize test.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,1:            NAME           's'
1,2-1,3:            OP             '='
1,4-1,19:           STRING         '"Hello, world!"'
1,19-1,20:          NEWLINE        '\n'
2,0-2,5:            NAME           'print'
2,5-2,6:            OP             '('
2,6-2,21:           STRING         '"Hello, world!"'
2,21-2,22:          OP             ')'
2,22-2,23:          NEWLINE        '\n'
3,0-3,0:            ENDMARKER      ''

Notice that our Hello, world! strings are each their own tokens of type STRING.

Design the token checker

We could write a checker that raises an error whenever it finds a token that:

  • is of type "STRING"
  • has value "Hello, world!", ignoring quotation marks.

After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). Our rule suffers from the same problems as the AST checker.

Write the token checker code

Start by creating a checker class with some metadata in scripts/linters/pylint_extensions.py:

class HelloWorldTokenChecker(checkers.BaseChecker):
    __implements__ = interfaces.ITokenChecker

    name = 'hello-world-token'
    priority = -1
    msgs = {
        'C9002': (
            'Uses a "Hello, world!" string.',
            'hello-world-token',
            'No code should use "Hello, world!" statements.',
        ),
    }

To learn about these metadata options, see the section above on the AST checker code. Notice that unlike the AST checker, we set __implements__ equal to interfaces.ITokenChecker, and we used a new message ID, name, and symbol.

For the token checker, you only need to implement a process_tokens() method that accepts a list of tokens. Each token is described by a named 5-tuple as documented in the tokenize library. The tuple takes the form (type, string, start, end, line):

  • type: The token type, for example tokenize.STRING.
  • string: A string with the code that the token represents.
  • start: A 2-tuple (row, col) of the row and column where the token starts in the code.
  • end: A 2-tuple (row, col) of the row and column where the token ends in the code.
  • line: A string with the line of code where the token was found.

You can use tokenize.tokenize() to see what these tuples look like for our test.py code:

>>> import pprint, tokenize
>>> f = open('test.py', 'rb')  # Note that tokenize expects the file to be opened in binary mode
>>> tokens = list(tokenize.tokenize(f.readline))
>>> pprint.pprint(tokens)
[TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=1 (NAME), string='s', start=(1, 0), end=(1, 1), line='s = "Hello, world!"\n'),
 TokenInfo(type=53 (OP), string='=', start=(1, 2), end=(1, 3), line='s = "Hello, world!"\n'),
 TokenInfo(type=3 (STRING), string='"Hello, world!"', start=(1, 4), end=(1, 19), line='s = "Hello, world!"\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 19), end=(1, 20), line='s = "Hello, world!"\n'),
 TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print("Hello, world!")\n'),
 TokenInfo(type=53 (OP), string='(', start=(2, 5), end=(2, 6), line='print("Hello, world!")\n'),
 TokenInfo(type=3 (STRING), string='"Hello, world!"', start=(2, 6), end=(2, 21), line='print("Hello, world!")\n'),
 TokenInfo(type=53 (OP), string=')', start=(2, 21), end=(2, 22), line='print("Hello, world!")\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 22), end=(2, 23), line='print("Hello, world!")\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')]

Based on this output, we can see that we need to check that the token type is tokenize.STRING and that the token string is Hello, world! after we strip any quotes. Here's a process_tokens() method that implements this check:

def process_tokens(self, tokens):
    for token in tokens:
        if token.type == tokenize.STRING:
            quotes_stripped = token.string.strip('"').strip('\'')
            if quotes_stripped == 'Hello, world!':
                self.add_message('hello-world-token', line=token.start[0])

Write the raw checker

When writing a raw checker, we don't have to worry about any pre-processing by Pylint. However, that also makes these checkers slower than AST or token checkers. Therefore, you should avoid using raw checkers wherever possible. If you do need to use a raw checker, you should get approval from the linter team first.

Pylint provides the code file as a node, which you can read using our read_from_node() function. This function returns an iterable of lines from the file.

Design the raw checker

We could write a checker that raises an error whenever it finds a line that contains one of the following substrings:

  • "Hello, world!"
  • 'Hello, world!'

After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). Our rule suffers from the same problems as the AST checker.

Write the raw checker code

Start by creating a checker class with some metadata in scripts/linters/pylint_extensions.py:

class HelloWorldRawChecker(checkers.BaseChecker):
    __implements__ = interfaces.IRawChecker

    name = 'hello-world-raw'
    priority = -1
    msgs = {
        'C9003': (
            'Uses a "Hello, world!" string.',
            'hello-world-raw',
            'No code should use "Hello, world!" statements.',
        ),
    }

To learn about these metadata options, see the section above on the AST checker code. Notice that unlike the AST checker, we set __implements__ equal to interfaces.IRawChecker, and we used a new message ID, name, and symbol.

Now we need to implement the process_module() method, which accepts the node with the file contents as an argument. Then we iterate over the lines returned by read_from_node() and look for the substrings we identified in our design:

def process_module(self, node):
    file_content = read_from_node(node)
    for (line_num, line) in enumerate(file_content):
        if '"Hello, world!"' in line or '\'Hello, world!\'' in line:
            self.add_message('hello-world-raw', line=line_num)

Register the checker

At the bottom of pylint_extensions.py, you need to register your new checker by calling linter.register_checker(). For example, here's how to register the three checkers we wrote for hello world strings:

def register(linter):
    ...
    linter.register_checker(HelloWorldASTChecker(linter))
    linter.register_checker(HelloWorldTokenChecker(linter))
    linter.register_checker(HelloWorldRawChecker(linter))

Beyond hello world

The example of looking for Hello, world! strings was fairly simple because we only needed information from one small part of the code at a time to identify lint violations. For more complicated lint rules, you'll need to think about how to keep track of information from many different places in the code. For example, the Pylint documentation describes how to write an AST checker to ensure that all the return statements in a function that return a constant return different constants. Here is some pseudocode for how you can do this:

class UniqueReturnChecker(BaseChecker):
    ...
    def __init__(self, linter=None):
        create a stack to store function nodes

    def visit_functiondef(self, node):
        push node onto the stack

    def leave_functiondef(self, _):
        pop a node off the stack

    def visit_return(self, node):
        if node is not of type Const: return

        for other_return in function_stack[-1]:
            if node.value == other_return's node's value:
                report a lint violation

Note that we have to use a stack to track function nodes to handle nested functions! For more examples of trickier lint checkers, take a look at our existing linters in pylint_extensions.py.

Test custom Pylint checks

We test our checkers with [[backend tests|Writing-backend-tests]] in pylint_extensions_test.py.

Design tests

You should write test cases to cover the following situations:

  • True positives: Bad code that you expect to result in lint errors.
  • True negatives: Good code that you don't expect to result in lint errors.
  • False positives: Good code that you expect to result in lint errors because you can't write a perfect linter.
  • False negatives: Bad code that you don't expect to result in lint errors because the linter is imperfect.

Pylint testing utilities

While tests for the checkers are just normal backend tests, Pylint provides some helpful testing utilities that you can use, particularly pylint.testutils.CheckerTestCase. Pylint's documentation shows how to create a test class that inherits from the CheckerTestCase class, but at Oppia we prefer to inherit from unittest.TestCase. We store the CheckerTestCase object as an instance variable instead, so your setUp() methods will look like this for your checker MyChecker:

def setUp(self):
    super().setUp()
    self.checker_test_object = testutils.CheckerTestCase()
    self.checker_test_object.CHECKER_CLASS = pylint_extensions.MyChecker
    self.checker_test_object.setup_method()

Use testing utilities for hello world checkers

We'll walk through how to use the checker_test_object for each kind of checker below. Note that these examples don't show comprehensive test cases--see above for suggestions for how you can develop test cases.

Tests for hello world AST checker

To test our AST checker, we begin by using astroid.extract_node() to get some nodes that we can pass to our checker. We pass a string with code to astroid.extract_node(), annotating our code with #@ to indicate which lines we want to get nodes for:

def test_finds_hello_world(self):
    assignment_expr, func_call_expr = astroid.extract_node(
        """
    s = "Hello, world!" #@
    print("Hello, world!") #@
    """)

    assignment_node = assignment_expr.value
    func_call_node = func_call_expr.args[0]

    with self.checker_test_object.assertAddsMessages(
        testutils.Message(
            msg_id='hello-world-ast',
            node=assignment_node,
        )
    ):
        self.checker_test_object.checker.visit_const(
            assignment_node)

    with self.checker_test_object.assertAddsMessages(
        testutils.Message(
            msg_id='hello-world-ast',
            node=func_call_node,
        )
    ):
        self.checker_test_object.checker.visit_const(
            func_call_node)

Notice that because astroid finds the node representing the line we annotated with #@, we have to retrieve the nodes that represent the Hello, world! literals.

Tests for hello world token checker

To test our token checker, we write the code we want to lint to a temporary file. Then we can convert that file into tokens that our checker can understand:

def test_finds_hello_world_assignment(self):
    node = astroid.scoped_nodes.Module(
        name='test',
        doc='Custom test')
    temp_file = tempfile.NamedTemporaryFile()
    filename = temp_file.name
    with open(filename, 'w') as tmp:
        tmp.write('s = "Hello, world!"')
    node.file = filename
    node.path = filename

    self.checker_test_object.checker.process_tokens(
        utils.tokenize_module(node))

    message = testutils.Message(
        msg_id='hello-world-token', line=1)

    with self.checker_test_object.assertAddsMessages(message):
        temp_file.close()

def test_finds_hello_world_func_call(self):
    node = astroid.scoped_nodes.Module(
        name='test',
        doc='Custom test')
    temp_file = tempfile.NamedTemporaryFile()
    filename = temp_file.name
    with open(filename, 'w') as tmp:
        tmp.write('print("Hello, world!")')
    node.file = filename
    node.path = filename

    self.checker_test_object.checker.process_tokens(
        utils.tokenize_module(node))

    message = testutils.Message(
        msg_id='hello-world-token', line=1)

    with self.checker_test_object.assertAddsMessages(message):
        temp_file.close()

Tests for hello world raw checker

Testing raw checkers is a lot like testing token checkers. The main difference is that we don't need to convert our temporary file into tokens. In the example below, we also show how you can test multiple lines at once:

def test_finds_hello_world_assignment(self):
    node = astroid.scoped_nodes.Module(
        name='test',
        doc='Custom test')
    temp_file = tempfile.NamedTemporaryFile()
    filename = temp_file.name

    with open(filename, 'w') as tmp:
        tmp.write("""
        s = "Hello, world!"
        print("Hello, world!")
        """)

    node.file = filename
    node.path = filename

    self.checker_test_object.checker.process_module(node)

    with self.checker_test_object.assertAddsMessages(
        testutils.Message(
            msg_id='hello-world-raw',
            line=1
        ),
        testutils.Message(
            msg_id='hello-world-raw',
            line=2
        ),
    ):
        temp_file.close()

Run tests

Once you've added test cases, you can run them like this:

Python:

python -m scripts.run_backend_tests --test_targets=scripts.linters.pylint_extensions_test --verbose

Docker:

make run_tests.backend PYTHON_ARGS="--test_targets=scripts.linters.pylint_extensions_test --verbose"

The --verbose option will help with debugging by printing out more logging information.