- Introduction
- Write checkers
- Test custom Pylint checks
We use Pylint to lint our Python code. Before writing a custom rule, you should check whether you can achieve your goals with Pylint's built-in rules and configurations. If those are insufficient, you can follow this guide to write a custom Pylint checker.
Before we get too far into custom lint rules, let's make sure we understand what linters like ESLint are actually capable of doing.
Consider two types things you might want to prohibit with a lint rule:
- A property of the code's text. For example, you might want to ensure that all lines are at most 80 characters long.
- A property of the code's behavior. For example, you might want to ensure that the code, when executed, does not print a particular string.
Linters operate on the code as text, so you can write a lint rule that bans a textual property with perfect accuracy. However, any lint rule that tries to prohibit a behavior must be imperfect. To see this intuitively, consider the problem we'll approach below: writing a linter to prohibit programs that print Hello, world!
. If our linter searched for the string Hello, world!
, it would miss a program that printed 'Hello, ' + 'world!'
.
If you are interested in computability theory, the technical description for this problem is that predicting a program's behavior is an undecidable problem. Look up the halting problem if you want to learn more.
Whenever you are creating a new lint rule, consider whether you are trying to prohibit a textual or a behavioral property. If you want to prohibit a textual property, then your task is easy--just write a lint rule that perfectly blocks the property. On the other hand, if you want to prohibit a behavior, then you'll need to first decide on textual properties that you can prohibit to approximately prohibit the behavior. For example, if you want to stop programs from printing Hello, world!
, it's approximately equivalent to prohibit literal Hello, world!
strings. Then, write your lint rule to perfectly prohibit those textual properties. Note that since your textual properties will be an imperfect translation of the behavioral properties you want to prohibit, your lint rule will miss some code that performs the banned behavior, will flag some code that doesn't actually perform the banned behavior, or both. That's expected given the limits of linters.
Pylint supports three kinds of checkers:
-
AST checkers operate on an abstract syntax tree (AST) of the code that Pylint generates. The checker provides functions to be called as Pylint traverses the AST, and these functions can raise lint errors.
-
Token checkers operate on a list of the tokens that represents the code being linted. The tokens are of the same form as the tokens generated by
tokenize.tokenize()
. -
Raw checkers operate on the raw code without any pre-processing by Pylint.
You should avoid writing custom raw checkers since they are less efficient than token or AST checkers. You should use an AST checker whenever the syntax information of an AST is useful. For example, if you want to count the number of variables in each function, the AST can tell you where the variables are and where the functions start and end. If you don't need syntax information, you can just use a token checker. Since most lint rules are easier to write given syntax information, you'll probably write mostly AST checkers. However, token checkers can be useful for simple rules, particularly those where you only need to consider one token at a time. For example, a token checker would be a great way to search for any string that uses double-quotes (""
).
As we work through the example below, we'll cover each of the three kinds of checkers.
To get a better idea for how lint checks work, let's try writing a simple Pylint rule that forbids the string Hello, world!
. (Maybe you don't want people accidentally opening PRs that include code that prints Hello, world!
from when they were learning Python.) First consider what kinds of code should violate the rule. Here are some examples:
s = "Hello, world!";
print("Hello, world!");
Below, we'll walk through how to implement this rule using each of the three kinds of checkers in Pylint: AST checkers, token checkers, and raw checkers.
To develop an intuition for how ASTs work, let's see how astroid represents some example Hello, world!
code. Open a Python shell in your Oppia directory and import a file from Oppia like the pre-commit linter:
>>> from scripts.linters import run_lint_checks
This import will modify your PYTHONPATH to include Oppia's Python dependencies, so you can now import astroid, which Pylint uses to generate the AST:
>>> import astroid
Next, use astroid to parse the code:
>>> code = """
... s = "Hello, world!"
... print("Hello, world!")
... """
>>> print(astroid.parse(code).repr_tree())
Module(
name='',
doc=None,
file='<?>',
path=['<?>'],
package=False,
pure_python=True,
future_imports=set(),
body=[Assign(
targets=[AssignName(name='s')],
value=Const(value='Hello, world!')),
Expr(value=Call(
func=Name(name='print'),
args=[Const(value='Hello, world!')],
keywords=[]))])
Notice that every time a "Hello, world!"
literal appears in the code, the AST has a Const
node. You can look up all the available node types in the astroid documentation.
Suppose you decide to write a checker that raises an error whenever it finds a node that:
- is of type "Const"
- has value "Hello, world!"
After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). For our rule:
-
False positives
- Code that includes the string
Hello, world!
but doesn't print it will still raise errors. There probably aren't any good reasons to do this though.
- Code that includes the string
-
False negatives
- Code that constructs the string
Hello, world!
instead of including it as a literal will pass the lint check. This is an acceptable imperfection though since developers are almost always going to use the string literal when writing a hello world program.
- Code that constructs the string
Start by creating a checker class with some metadata in scripts/linters/pylint_extensions.py
:
class HelloWorldASTChecker(BaseChecker):
__implements__ = IAstroidChecker
name = 'hello-world-ast'
priority = -1
msgs = {
'C9001': (
'Uses a "Hello, world!" string.',
'hello-world-ast',
'No code should use "Hello, world!" statements.',
),
}
Here's what we've defined so far:
-
__implements__
: Specifies what kind of checker we are creating. In this case, we are creating an AST checker. -
name
: Short name of the checker that will be used in the configuration file. -
priority
: A negative integer indicating when the check should run. Checkers run from the most negative priority to most positive. -
msgs
: A dictionary specifying what messages this checker can raise. The dictionary has the following form:{ 'message id': ('displayed message', 'message symbol', 'message help') }
-
Message ID: A unique identifier for the message consisting of a letter followed by 4 digits. The letter is the first letter of the message category, and the available categories are Convention, Warning, Error, Fatal, and Refactoring. The first two digits must be the same for all of the checker's messages.
Be sure to check that your message ID isn't already used by one of Pylint's existing rules or one of our custom rules.
-
Displayed message: The error message displayed to the user.
-
Message symbol: An alias of the message ID that can be used wherever the ID can be used. For example, to disable a message with a code comment, you can specify either the ID or the symbol.
-
Message help: Displayed when the user runs
pylint --help-msg
to learn more about your message.
-
You can also set an options
variable to specify options that can be set by the configuration file and read by your rule. See the Pylint documentation for details.
Next, you need to write the checker code. Pylint lets you define methods that get called as it traverses the AST. Methods of the form visit_nodename()
are called upon entering the subtree rooted at a node of type nodename
, and methods of the form leave_nodename()
are called upon leaving the subtree. Each method is called with a single argument--the node of type nodename
.
For this case, you can define a method visit_const()
that will be called whenever Pylint reaches a Const
node. Then you can check the node's value to see if it equals "Hello, world!"
. The method looks like this:
def visit_const(self, node):
if node.value == "Hello, world!":
self.add_message(
'hello-world', node=node)
That's it! After working through the two other kinds of linters, we'll discuss how to run and test your checker.
If you want an alternative tutorial for writing an AST checker, take a look at the one Pylint provides.
Fill a file test.py
with the following code:
s = "Hello, world!"
print("Hello, world!")
Then convert the file code into tokens:
$ python -m tokenize test.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,1: NAME 's'
1,2-1,3: OP '='
1,4-1,19: STRING '"Hello, world!"'
1,19-1,20: NEWLINE '\n'
2,0-2,5: NAME 'print'
2,5-2,6: OP '('
2,6-2,21: STRING '"Hello, world!"'
2,21-2,22: OP ')'
2,22-2,23: NEWLINE '\n'
3,0-3,0: ENDMARKER ''
Notice that our Hello, world!
strings are each their own tokens of type STRING
.
We could write a checker that raises an error whenever it finds a token that:
- is of type "STRING"
- has value "Hello, world!", ignoring quotation marks.
After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). Our rule suffers from the same problems as the AST checker.
Start by creating a checker class with some metadata in scripts/linters/pylint_extensions.py
:
class HelloWorldTokenChecker(checkers.BaseChecker):
__implements__ = interfaces.ITokenChecker
name = 'hello-world-token'
priority = -1
msgs = {
'C9002': (
'Uses a "Hello, world!" string.',
'hello-world-token',
'No code should use "Hello, world!" statements.',
),
}
To learn about these metadata options, see the section above on the AST checker code. Notice that unlike the AST checker, we set __implements__
equal to interfaces.ITokenChecker
, and we used a new message ID, name, and symbol.
For the token checker, you only need to implement a process_tokens()
method that accepts a list of tokens. Each token is described by a named 5-tuple as documented in the tokenize library. The tuple takes the form (type, string, start, end, line)
:
type
: The token type, for exampletokenize.STRING
.string
: A string with the code that the token represents.start
: A 2-tuple (row, col) of the row and column where the token starts in the code.end
: A 2-tuple (row, col) of the row and column where the token ends in the code.line
: A string with the line of code where the token was found.
You can use tokenize.tokenize()
to see what these tuples look like for our test.py
code:
>>> import pprint, tokenize
>>> f = open('test.py', 'rb') # Note that tokenize expects the file to be opened in binary mode
>>> tokens = list(tokenize.tokenize(f.readline))
>>> pprint.pprint(tokens)
[TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
TokenInfo(type=1 (NAME), string='s', start=(1, 0), end=(1, 1), line='s = "Hello, world!"\n'),
TokenInfo(type=53 (OP), string='=', start=(1, 2), end=(1, 3), line='s = "Hello, world!"\n'),
TokenInfo(type=3 (STRING), string='"Hello, world!"', start=(1, 4), end=(1, 19), line='s = "Hello, world!"\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 19), end=(1, 20), line='s = "Hello, world!"\n'),
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print("Hello, world!")\n'),
TokenInfo(type=53 (OP), string='(', start=(2, 5), end=(2, 6), line='print("Hello, world!")\n'),
TokenInfo(type=3 (STRING), string='"Hello, world!"', start=(2, 6), end=(2, 21), line='print("Hello, world!")\n'),
TokenInfo(type=53 (OP), string=')', start=(2, 21), end=(2, 22), line='print("Hello, world!")\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 22), end=(2, 23), line='print("Hello, world!")\n'),
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')]
Based on this output, we can see that we need to check that the token type is tokenize.STRING
and that the token string is Hello, world!
after we strip any quotes. Here's a process_tokens()
method that implements this check:
def process_tokens(self, tokens):
for token in tokens:
if token.type == tokenize.STRING:
quotes_stripped = token.string.strip('"').strip('\'')
if quotes_stripped == 'Hello, world!':
self.add_message('hello-world-token', line=token.start[0])
When writing a raw checker, we don't have to worry about any pre-processing by Pylint. However, that also makes these checkers slower than AST or token checkers. Therefore, you should avoid using raw checkers wherever possible. If you do need to use a raw checker, you should get approval from the linter team first.
Pylint provides the code file as a node, which you can read using our read_from_node()
function. This function returns an iterable of lines from the file.
We could write a checker that raises an error whenever it finds a line that contains one of the following substrings:
"Hello, world!"
'Hello, world!'
After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). Our rule suffers from the same problems as the AST checker.
Start by creating a checker class with some metadata in scripts/linters/pylint_extensions.py
:
class HelloWorldRawChecker(checkers.BaseChecker):
__implements__ = interfaces.IRawChecker
name = 'hello-world-raw'
priority = -1
msgs = {
'C9003': (
'Uses a "Hello, world!" string.',
'hello-world-raw',
'No code should use "Hello, world!" statements.',
),
}
To learn about these metadata options, see the section above on the AST checker code. Notice that unlike the AST checker, we set __implements__
equal to interfaces.IRawChecker
, and we used a new message ID, name, and symbol.
Now we need to implement the process_module()
method, which accepts the node with the file contents as an argument. Then we iterate over the lines returned by read_from_node()
and look for the substrings we identified in our design:
def process_module(self, node):
file_content = read_from_node(node)
for (line_num, line) in enumerate(file_content):
if '"Hello, world!"' in line or '\'Hello, world!\'' in line:
self.add_message('hello-world-raw', line=line_num)
At the bottom of pylint_extensions.py
, you need to register your new checker by calling linter.register_checker()
. For example, here's how to register the three checkers we wrote for hello world strings:
def register(linter):
...
linter.register_checker(HelloWorldASTChecker(linter))
linter.register_checker(HelloWorldTokenChecker(linter))
linter.register_checker(HelloWorldRawChecker(linter))
The example of looking for Hello, world!
strings was fairly simple because we only needed information from one small part of the code at a time to identify lint violations. For more complicated lint rules, you'll need to think about how to keep track of information from many different places in the code. For example, the Pylint documentation describes how to write an AST checker to ensure that all the return statements in a function that return a constant return different constants. Here is some pseudocode for how you can do this:
class UniqueReturnChecker(BaseChecker):
...
def __init__(self, linter=None):
create a stack to store function nodes
def visit_functiondef(self, node):
push node onto the stack
def leave_functiondef(self, _):
pop a node off the stack
def visit_return(self, node):
if node is not of type Const: return
for other_return in function_stack[-1]:
if node.value == other_return's node's value:
report a lint violation
Note that we have to use a stack to track function nodes to handle nested functions! For more examples of trickier lint checkers, take a look at our existing linters in pylint_extensions.py
.
We test our checkers with [[backend tests|Writing-backend-tests]] in pylint_extensions_test.py
.
You should write test cases to cover the following situations:
- True positives: Bad code that you expect to result in lint errors.
- True negatives: Good code that you don't expect to result in lint errors.
- False positives: Good code that you expect to result in lint errors because you can't write a perfect linter.
- False negatives: Bad code that you don't expect to result in lint errors because the linter is imperfect.
While tests for the checkers are just normal backend tests, Pylint provides some helpful testing utilities that you can use, particularly pylint.testutils.CheckerTestCase
. Pylint's documentation shows how to create a test class that inherits from the CheckerTestCase
class, but at Oppia we prefer to inherit from unittest.TestCase
. We store the CheckerTestCase
object as an instance variable instead, so your setUp()
methods will look like this for your checker MyChecker
:
def setUp(self):
super().setUp()
self.checker_test_object = testutils.CheckerTestCase()
self.checker_test_object.CHECKER_CLASS = pylint_extensions.MyChecker
self.checker_test_object.setup_method()
We'll walk through how to use the checker_test_object
for each kind of checker below. Note that these examples don't show comprehensive test cases--see above for suggestions for how you can develop test cases.
To test our AST checker, we begin by using astroid.extract_node()
to get some nodes that we can pass to our checker. We pass a string with code to astroid.extract_node()
, annotating our code with #@
to indicate which lines we want to get nodes for:
def test_finds_hello_world(self):
assignment_expr, func_call_expr = astroid.extract_node(
"""
s = "Hello, world!" #@
print("Hello, world!") #@
""")
assignment_node = assignment_expr.value
func_call_node = func_call_expr.args[0]
with self.checker_test_object.assertAddsMessages(
testutils.Message(
msg_id='hello-world-ast',
node=assignment_node,
)
):
self.checker_test_object.checker.visit_const(
assignment_node)
with self.checker_test_object.assertAddsMessages(
testutils.Message(
msg_id='hello-world-ast',
node=func_call_node,
)
):
self.checker_test_object.checker.visit_const(
func_call_node)
Notice that because astroid finds the node representing the line we annotated with #@
, we have to retrieve the nodes that represent the Hello, world!
literals.
To test our token checker, we write the code we want to lint to a temporary file. Then we can convert that file into tokens that our checker can understand:
def test_finds_hello_world_assignment(self):
node = astroid.scoped_nodes.Module(
name='test',
doc='Custom test')
temp_file = tempfile.NamedTemporaryFile()
filename = temp_file.name
with open(filename, 'w') as tmp:
tmp.write('s = "Hello, world!"')
node.file = filename
node.path = filename
self.checker_test_object.checker.process_tokens(
utils.tokenize_module(node))
message = testutils.Message(
msg_id='hello-world-token', line=1)
with self.checker_test_object.assertAddsMessages(message):
temp_file.close()
def test_finds_hello_world_func_call(self):
node = astroid.scoped_nodes.Module(
name='test',
doc='Custom test')
temp_file = tempfile.NamedTemporaryFile()
filename = temp_file.name
with open(filename, 'w') as tmp:
tmp.write('print("Hello, world!")')
node.file = filename
node.path = filename
self.checker_test_object.checker.process_tokens(
utils.tokenize_module(node))
message = testutils.Message(
msg_id='hello-world-token', line=1)
with self.checker_test_object.assertAddsMessages(message):
temp_file.close()
Testing raw checkers is a lot like testing token checkers. The main difference is that we don't need to convert our temporary file into tokens. In the example below, we also show how you can test multiple lines at once:
def test_finds_hello_world_assignment(self):
node = astroid.scoped_nodes.Module(
name='test',
doc='Custom test')
temp_file = tempfile.NamedTemporaryFile()
filename = temp_file.name
with open(filename, 'w') as tmp:
tmp.write("""
s = "Hello, world!"
print("Hello, world!")
""")
node.file = filename
node.path = filename
self.checker_test_object.checker.process_module(node)
with self.checker_test_object.assertAddsMessages(
testutils.Message(
msg_id='hello-world-raw',
line=1
),
testutils.Message(
msg_id='hello-world-raw',
line=2
),
):
temp_file.close()
Once you've added test cases, you can run them like this:
Python:
python -m scripts.run_backend_tests --test_targets=scripts.linters.pylint_extensions_test --verbose
Docker:
make run_tests.backend PYTHON_ARGS="--test_targets=scripts.linters.pylint_extensions_test --verbose"
The --verbose
option will help with debugging by printing out more logging information.