New approach to function call matching #307

nene · 2022-07-12T19:37:21Z

Previously the approach to detecting function calls was as follows:

Inside Tokenizer: save preceding whitespace of each token to whitespaceBefore field
Inside Parser: check if RESERVED_KEYWORD or IDENTIFIER is followed by ( and there's no whitespace between them.
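
To make the old heuristic concrete, here is a minimal sketch of the check. The `OldToken` shape and the `isFunctionCallOld` helper are hypothetical simplifications for illustration, not the project's actual types; only the `whitespaceBefore` field name and the token type names come from the description above.

```typescript
// Hypothetical, simplified token shape (the real Token type has more fields).
interface OldToken {
  type: 'RESERVED_KEYWORD' | 'IDENTIFIER' | 'OPEN_PAREN' | 'OTHER';
  text: string;
  whitespaceBefore: string; // whitespace found in the source before this token
}

// Old heuristic: a word directly followed by "(" with no whitespace
// in between is treated as a function call.
function isFunctionCallOld(tokens: OldToken[], i: number): boolean {
  const token = tokens[i];
  const next = tokens[i + 1];
  if (!token || !next) {
    return false;
  }
  const isWord = token.type === 'RESERVED_KEYWORD' || token.type === 'IDENTIFIER';
  return isWord && next.type === 'OPEN_PAREN' && next.whitespaceBefore === '';
}

// "foo(" in the source
const tight: OldToken[] = [
  { type: 'IDENTIFIER', text: 'foo', whitespaceBefore: '' },
  { type: 'OPEN_PAREN', text: '(', whitespaceBefore: '' },
];

// "foo (" in the source: same tokens, different decision
const spaced: OldToken[] = [
  { type: 'IDENTIFIER', text: 'foo', whitespaceBefore: '' },
  { type: 'OPEN_PAREN', text: '(', whitespaceBefore: ' ' },
];
```

The two inputs differ only in source whitespace, yet the check gives different answers for them.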

This has been a great hack to format function calls without really knowing which words are function names. But this approach has a fundamental downside: it relies on the whitespace of the original code to make formatting decisions, which leads to surprising behavior (see #140 and #243) where re-formatting already formatted SQL produces a different result.

The new approach introduced in this PR goes as follows:

Inside Tokenizer: label all function names as type: RESERVED_FUNCTION_NAME tokens.
Inside Parser: check if RESERVED_FUNCTION_NAME is followed by (.
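
A rough sketch of the new scheme, under simplifying assumptions: `tokenizeSketch`, `isFunctionCallNew`, and the three-entry `functionNamesSketch` list are made up for illustration, and the regex is far cruder than the real tokenizer. Only the `RESERVED_FUNCTION_NAME` type name comes from this PR.

```typescript
type NewTokenType = 'RESERVED_FUNCTION_NAME' | 'IDENTIFIER' | 'OPEN_PAREN' | 'OTHER';

interface NewToken {
  type: NewTokenType;
  text: string;
}

// Illustrative list only; each dialect would supply its own full list.
const functionNamesSketch: string[] = ['SQRT', 'TRIM', 'COALESCE'];

// Tokenizer side: label known function names; whitespace is simply skipped.
function tokenizeSketch(sql: string): NewToken[] {
  const parts: string[] = sql.match(/\w+|\(|[^\s\w(]+/g) || [];
  return parts.map((text): NewToken => {
    if (text === '(') {
      return { type: 'OPEN_PAREN', text };
    }
    if (functionNamesSketch.indexOf(text.toUpperCase()) !== -1) {
      return { type: 'RESERVED_FUNCTION_NAME', text };
    }
    if (/^\w+$/.test(text)) {
      return { type: 'IDENTIFIER', text };
    }
    return { type: 'OTHER', text };
  });
}

// Parser side: only the token types matter, never the source whitespace.
function isFunctionCallNew(tokens: NewToken[], i: number): boolean {
  const token = tokens[i];
  const next = tokens[i + 1];
  return (
    !!token &&
    !!next &&
    token.type === 'RESERVED_FUNCTION_NAME' &&
    next.type === 'OPEN_PAREN'
  );
}
```

With this sketch, `sqrt(2)` and `sqrt (2)` produce the same token sequence, so the parser reaches the same decision for both.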

This is mostly how one would expect this to work. However, there are various caveats:

A common convention is to format data types similarly to function calls, e.g. DECIMAL(5, 2). For now this is solved by also including data types that can take parameters in the functions list. In the future, data types should be completely separated from functions.
CAST is used in various forms, but as its main use is CAST(expr AS type), it is for now listed alongside functions.
ANY and SOME can be used as aggregate functions, in which case they should be formatted as HAVING ANY(condition). But a far more common usage is as an operator: WHERE x = ANY (1, 2, 3). So I've opted to remove these from the list of functions.
NULLIF and COALESCE have sometimes been listed as keywords, at other times as function names. Changed them to be function names in every dialect. That really hints at a general problem: these keyword/function-name lists need a proper review.
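
As an illustration of these caveats, a function-names list for a dialect could be assembled roughly like this. The variable names, the `dedupeSketch` helper, and the short lists are assumptions made for the example; the actual per-dialect lists in the source are much longer.

```typescript
// Regular functions. CAST, NULLIF and COALESCE are listed here;
// ANY and SOME are deliberately left out (treated as operators).
const plainFunctionsSketch: string[] = ['CAST', 'COALESCE', 'NULLIF', 'SQRT', 'TRIM'];

// Data types that can take parameters, e.g. DECIMAL(5, 2).
// For now they are formatted like function calls.
const parameterizedTypesSketch: string[] = ['DECIMAL', 'VARCHAR', 'CHAR'];

// A name may show up in more than one source list, so remove duplicates.
function dedupeSketch(names: string[]): string[] {
  return names.filter((name, index) => names.indexOf(name) === index);
}

const reservedFunctionNamesSketch: string[] = dedupeSketch([
  ...plainFunctionsSketch,
  ...parameterizedTypesSketch,
  'TRIM', // duplicate on purpose, to show the dedupe step
]);
```

Keeping data types in a separate array, even while they are merged into one list, would make the planned split between types and functions easier later on.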

The major downside of this new approach is that:

calls of a custom function like foo(1, 2) are now formatted as foo (1, 2).
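
The tradeoff can be seen in a toy formatter. This is only a sketch under loud assumptions: `formatCallSketch` and `knownFunctionsSketch` are invented for the example and handle nothing beyond words, parentheses, and commas.

```typescript
// Illustrative list of known function names (not the real per-dialect list).
const knownFunctionsSketch: string[] = ['SQRT', 'TRIM', 'COALESCE'];

// Toy formatter: spacing depends only on the tokens, never on input whitespace.
function formatCallSketch(sql: string): string {
  const parts: string[] = sql.match(/\w+|[^\s\w]/g) || [];
  let out = '';
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    const prev = i > 0 ? parts[i - 1] : undefined;
    let glue = ' ';
    if (prev === undefined || prev === '(' || part === ')' || part === ',') {
      // No space at the start, after "(", or before ")" and ","
      glue = '';
    } else if (part === '(' && knownFunctionsSketch.indexOf(prev.toUpperCase()) !== -1) {
      // Known function name: keep "(" attached to it
      glue = '';
    }
    out += glue + part;
  }
  return out;
}

console.log(formatCallSketch('sqrt (2)'));  // sqrt(2)
console.log(formatCallSketch('foo(1, 2)')); // foo (1, 2)
```

The unknown `foo` gets a space before its parenthesis, but the output is stable: formatting the result a second time gives the same string, which the whitespace-based approach could not guarantee.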

But there's also a major benefit on the parsing side:

this change should greatly simplify switching to a parser generator, as the parser won't have to look at whitespace. Nearley or Jison can't really distinguish between a token with and without preceding whitespace, so unless we introduce separate whitespace tokens, this is a major blocker for adopting a parser generator.

So, in the end it's about tradeoffs. IMHO the sacrifices are worth the gains.

nene · 2022-07-12T19:40:34Z

test/features/case.ts

@@ -20,12 +20,12 @@ export default function supportsCase(format: FormatFn) {

  it('formats CASE ... WHEN with an expression', () => {
    const result = format(
-      "CASE toString(getNumber()) WHEN 'one' THEN 1 WHEN 'two' THEN 2 WHEN 'three' THEN 3 ELSE 4 END;"
+      "CASE trim(sqrt(2)) WHEN 'one' THEN 1 WHEN 'two' THEN 2 WHEN 'three' THEN 3 ELSE 4 END;"


Here and in several other tests I've switched over to using some widely adopted function names, so the tests work across all dialects.

nene · 2022-07-12T19:41:34Z

test/behavesLikeSqlFormatter.ts

-      SELECT IF(dq.id_discounter_shopping = 2, dq.value, dq.value / 100),
-      IF (dq.id_discounter_shopping = 2, 'amount', 'percentage') FROM foo);
+      SELECT COALESCE(dq.id_discounter_shopping = 2, dq.value, dq.value / 100),
+      COALESCE (dq.id_discounter_shopping = 2, 'amount', 'percentage') FROM foo);


Switched from IF to COALESCE, which is supported in all dialects.

interesting that IF isn't as universal as COALESCE

nene · 2022-07-12T19:43:42Z

test/plsql.test.ts

@@ -164,7 +165,7 @@ describe('PlSqlFormatter', () => {
    `);
    expect(result).toBe(dedent`
      WITH
-        t1(id, parent_id) AS (
+        t1 (id, parent_id) AS (


Not really sure which way is better here. It's not a function call.

However it should be possible to improve this with special parsing of WITH clause in the future.

nene · 2022-07-12T19:49:43Z

test/bigquery.test.ts

@@ -114,10 +114,11 @@ describe('BigQueryFormatter', () => {
    `);
  });

+  // TODO: Possibly incorrect formatting of STRUCT<>() and ARRAY<>()


This is currently an unfortunate change in formatting.

This whole typed arrays/structs parsing needs a complete rewrite though. The current solution is a big hack. It should really be parsed inside Parser, not hacked together inside postProcess() of the tokenizer.

nene · 2022-07-12T19:50:49Z

src/languages/sqlite.formatter.ts

+    'JSON_GROUP_OBJECT',
+    'JSON_EACH',
+    'JSON_TREE',
+  ],


Found lots of SQLite functions that we were missing.

nene · 2022-07-12T19:53:20Z

This change also paves the way for implementing a separate uppercasing option for function names: #237

inferrinizzard

I think this overall makes sense but agree that the keyword lists need to be reviewed 👍

nene added 18 commits July 12, 2022 14:34

Separate token for function names

5a3d720

Define data types as functions

b1cb912

Don't treat ANY/EVERY/SOME as functions

bd49c2c

BigQuery functions

e494705

DB2 functions

5d8ec7b

Hive functions

81410ff

MariaDB functions

25e60c4

MySQL functions

98f41ff

N1QL functions

e364303

PL/SQL functions

0fdef4d

PostgreSQL functions

87487a9

Redshift functions

6d461b3

Spark functions

143b211

SQLite functions

6c39543

TransactSQL functions

b3ef52a

Drop Token.whitespaceBefore field

40ff7b8

Change getWhitespace() to skipWhitespace()

1a8656e

which no more needs to return a value

Remove unused capture group from WHITESPACE_REGEX

1ba99b1

Fix OFFSET() formatting for BigQuery

759bf70

nene requested a review from inferrinizzard July 12, 2022 19:53

Consistently always dedupe function names list

722326e

inferrinizzard approved these changes Jul 13, 2022

nene added 2 commits July 14, 2022 12:06

Make reservedFunctionNames config mandatory

66f7ebd

Merge branch 'master' into function-names

0fe33a2

nene added 2 commits July 14, 2022 12:24

Add missing window functions for Hive

75441c5

Use ROW_NUMBER() in WINDOW clause test

81416cb

This one is part of standard, unlike LAG()

nene merged commit afbadd1 into master Jul 14, 2022

nene deleted the function-names branch July 14, 2022 09:37
