Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New approach to function call matching #307

Merged
merged 24 commits into from
Jul 14, 2022
Merged

New approach to function call matching #307

merged 24 commits into from
Jul 14, 2022

Conversation

nene
Copy link
Collaborator

@nene nene commented Jul 12, 2022

Previously the approach to detecting function calls was as follows:

  • Inside Tokenizer: save preceding whitespace of each token to whitespaceBefore field
  • Inside Parser: check if RESERVED_KEYWORD or IDENTIFIER is followed by ( and there's no whitespace between them.

This has been a great hack to format function calls without really having a knowledge of which words are function names. But this approach has the fundamental downside of it relying on the whitespace of original code to make formatting decisions, which leads to surprising behavior (see #140 and #243) where re-formatting an already formatted SQL produces a different result.

The new approach introduced in this PR goes as follows:

  • Inside Tokenizer: label all function names as type: RESERVED_FUNCTION_NAME tokens.
  • Inside Parser: check if RESERVED_FUNCTION_NAME is followed by (.

This is mostly how one would expect this to work. However there are various caveats:

  • A common convention is to format data types similarly to function calls e.g. DECIMAL(5, 2). For now this is solved by also including data types (that can take parameters) also to the functions list. In the future should completely separate data types from functions.
  • CAST is used in various forms, but as its main use is in CAST(expr AS type), it's for now listed alongside functions.
  • ANY and SOME can be used as aggregate functions, in which case they should be formatted as HAVING ANY(condition). But a way more common usage is as an operator: WHERE x = ANY (1, 2, 3). So I've opted to remove these from list of functions.
  • NULLIF and COALESCE have sometimes been listed as keywords, at other times as function names. Changed them to be function names in every dialect. That really hints of a general problem that these keyword/function-name lists need a proper review.

The major downside of this new approach is, that:

  • calls of a custom function like foo(1, 2) are now formatted as foo (1, 2).

But there's also a major benefit from parsing side:

  • this change should greatly simplify switching to a parser-generator, as the parser won't have to look at whitespace (e.g. Nearley or Jison can't really distinguish between a token with and without preceding whitespace. Unless we go and introduce separate whitespace tokens, it's a major blocker for adopting a parser generator).

So, in the end it's about tradeoffs. IMHO the sacrifices are worth the gains.

@@ -20,12 +20,12 @@ export default function supportsCase(format: FormatFn) {

it('formats CASE ... WHEN with an expression', () => {
const result = format(
"CASE toString(getNumber()) WHEN 'one' THEN 1 WHEN 'two' THEN 2 WHEN 'three' THEN 3 ELSE 4 END;"
"CASE trim(sqrt(2)) WHEN 'one' THEN 1 WHEN 'two' THEN 2 WHEN 'three' THEN 3 ELSE 4 END;"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in several other tests I've switched over to using the some widely adopted function names, so the tests would work across all dialects.

SELECT IF(dq.id_discounter_shopping = 2, dq.value, dq.value / 100),
IF (dq.id_discounter_shopping = 2, 'amount', 'percentage') FROM foo);
SELECT COALESCE(dq.id_discounter_shopping = 2, dq.value, dq.value / 100),
COALESCE (dq.id_discounter_shopping = 2, 'amount', 'percentage') FROM foo);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched from IF to COALESCE which is supported in all dialects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting that IF isn't as universal as COALESCE

@@ -164,7 +165,7 @@ describe('PlSqlFormatter', () => {
`);
expect(result).toBe(dedent`
WITH
t1(id, parent_id) AS (
t1 (id, parent_id) AS (
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure which way is better here. It's not a function call.

However it should be possible to improve this with special parsing of WITH clause in the future.

@@ -114,10 +114,11 @@ describe('BigQueryFormatter', () => {
`);
});

// TODO: Possibly incorrect formatting of STRUCT<>() and ARRAY<>()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently an unfortunate change in formatting.

This whole typed arrays/structs parsing needs a complete rewrite though. The current solution is a big hack. It should really be parsed inside Parser not hacked together inside postProcess() of tokenizer.

'JSON_GROUP_OBJECT',
'JSON_EACH',
'JSON_TREE',
],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found lots of SQLite functions that we were missing.

@nene
Copy link
Collaborator Author

nene commented Jul 12, 2022

This change also paves the way for implementing separate uppercasing option for function names: #237

@nene nene requested a review from inferrinizzard July 12, 2022 19:53
Copy link
Collaborator

@inferrinizzard inferrinizzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this overall makes sense but agree that the keyword lists need to be reviewed 👍

@nene nene merged commit afbadd1 into master Jul 14, 2022
@nene nene deleted the function-names branch July 14, 2022 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants