Skip to content

Conversation

@anlowee
Copy link

@anlowee anlowee commented Sep 12, 2025

Description

The corresponding Presto PR: y-scope/presto#67

This PR addressed #13 .

This PR copied the original type parser so that we could do some customization for CLP connector without affecting other connectors. The current customization we have made is adding special chars including -, @, #, $ and \.

Then we can use this new parser in the protocol instead of using original parser which cannot recogonize those speical chars.

The protocol PR: y-scope/presto#67

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Added unit tests.
Passed the CI.
Passed the end-to-end testing. (t."$date" is querable).

Summary by CodeRabbit

  • New Features

    • Added a case-insensitive CLP type parser to convert textual type definitions (row, array, map, function, decimal and custom types) into Velox types, with alias handling, optional field names, and support for special/unquoted/quoted identifiers and clearer parse errors.
  • Build

    • Optional integration of the CLP parser into the build when the CLP connector flag is enabled; generated parser/scanner sources produced as part of the build.
  • Tests

    • Added unit tests validating complex row parsing, special-character and quoted field names, and overall parser correctness.

@coderabbitai
Copy link

coderabbitai bot commented Sep 12, 2025

Walkthrough

Adds a Flex/Bison CLP type parser under velox/type/fbclp: build integration (CMake), generated lexer/parser, scanner and utility code, public parse API, and unit tests, all gated by VELOX_ENABLE_CLP_CONNECTOR.

Changes

Cohort / File(s) Summary
Top-level build
velox/type/CMakeLists.txt
Conditionally includes the fbclp subdirectory when VELOX_ENABLE_CLP_CONNECTOR is enabled.
fbclp build config
velox/type/fbclp/CMakeLists.txt
New CMake rules to run Flex/Bison, build velox_type_fbclp_parser, optionally create a gen-src target for mono-library builds, and optionally build tests; gated by VELOX_ENABLE_CLP_CONNECTOR.
Parser utilities
velox/type/fbclp/ClpParserUtil.h, velox/type/fbclp/ClpParserUtil.cpp
Adds typeFromString, customTypeWithChildren, and inferTypeWithSpaces for normalizing type names, constructing types with children, and handling multi-word type tokens.
Scanner and API headers
velox/type/fbclp/ClpScanner.h, velox/type/fbclp/ClpTypeParser.h
Adds ClpScanner (Flex lexer wrapper) and declares public API parseClpType(const std::string&).
Lexer / Parser sources
velox/type/fbclp/ClpTypeParser.ll, velox/type/fbclp/ClpTypeParser.yy
Implements Flex lexer and Bison grammar to parse Presto-style type strings (primitive, array, map, row, function, decimal, custom types), integrates with scanner utilities, and reports errors via VELOX_UNSUPPORTED.
Tests
velox/type/fbclp/tests/CMakeLists.txt, velox/type/fbclp/tests/ClpTypeParserTest.cpp
Adds a GoogleTest-based test target and a test validating parsing of a row type with special-character field names.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as parseClpType
  participant Scanner as ClpScanner (Flex)
  participant Parser as ClpParser (Bison)
  participant Util as ClpParserUtil
  participant Registry as Type registry

  Client->>API: typeText
  API->>Scanner: init streams, output TypePtr&, input view
  API->>Parser: parse(Scanner*)
  loop tokenization/parsing
    Scanner-->>Parser: token + yylval
    Parser->>Util: typeFromString / customTypeWithChildren / inferTypeWithSpaces
    Util->>Registry: getType(name, params)
    Registry-->>Util: TypePtr
    Util-->>Parser: TypePtr
  end
  Parser-->>API: parsed TypePtr
  API-->>Client: TypePtr or error

  rect rgba(220,240,255,0.35)
  note right of Parser: On parse failure, ClpParser::error reports via VELOX_UNSUPPORTED with input context.
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately captures the primary change: adding a CLP type parser that accepts additional special characters in field names and references issue #13. It is concise, specific, and directly matches the PR changes (new parser files, build integration, and tests) without extraneous detail. A reviewer scanning history will understand the main intent at a glance.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@anlowee anlowee marked this pull request as ready for review September 15, 2025 16:02
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 20

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae5119 and 028ab93.

📒 Files selected for processing (10)
  • velox/type/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/ClpParserUtil.cpp (1 hunks)
  • velox/type/fbclp/ClpParserUtil.h (1 hunks)
  • velox/type/fbclp/ClpScanner.h (1 hunks)
  • velox/type/fbclp/ClpTypeParser.h (1 hunks)
  • velox/type/fbclp/ClpTypeParser.ll (1 hunks)
  • velox/type/fbclp/ClpTypeParser.yy (1 hunks)
  • velox/type/fbclp/tests/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/tests/ClpTypeParserTest.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
velox/type/fbclp/ClpTypeParser.h (2)
velox/type/fbclp/ClpParserUtil.h (1)
  • facebook (22-43)
velox/type/fbclp/ClpScanner.h (1)
  • facebook (27-55)
velox/type/fbclp/ClpParserUtil.h (3)
velox/type/fbclp/ClpScanner.h (1)
  • facebook (27-55)
velox/type/fbclp/ClpTypeParser.h (1)
  • facebook (22-37)
velox/type/fbclp/ClpParserUtil.cpp (6)
  • typeFromString (22-37)
  • typeFromString (22-24)
  • customTypeWithChildren (39-51)
  • customTypeWithChildren (39-41)
  • inferTypeWithSpaces (53-66)
  • inferTypeWithSpaces (53-55)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (1)
velox/python/type/type.pyi (2)
  • BIGINT (21-21)
  • VARCHAR (29-29)
velox/type/fbclp/ClpScanner.h (2)
velox/type/fbclp/ClpParserUtil.h (1)
  • facebook (22-43)
velox/type/fbclp/ClpTypeParser.h (1)
  • facebook (22-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (5)
velox/type/fbclp/tests/CMakeLists.txt (1)

15-25: LGTM.

Linking against velox_type_fbclp_parser and GTest targets looks correct. No issues.

velox/type/fbclp/ClpTypeParser.ll (1)

60-62: Keep the yyFlexLexer::yylex guard; OK as-is.

Throwing here is standard to prevent accidental use of the base yylex.

velox/type/fbclp/ClpTypeParser.yy (2)

154-156: Good error propagation.

Raising VELOX_UNSUPPORTED with the original input is clear and consistent with Velox errors.


149-150: Confirmed — lexer returns quoted lexeme; parser stripping is correct.

QUOTED_ID is defined in velox/type/fbclp/ClpTypeParser.ll (line 37) and the lexer action returns YYText() unchanged (line 54), so the parser's $1.erase(0,1); $1.pop_back(); in ClpTypeParser.yy is required and will not double-strip. The QUOTED_ID regex only allows alnum/space/underscore inside quotes, so there are no escape sequences to unescape; if you need support for escaped quotes/backslashes, extend the lexer regex and add unescaping in the parser.

velox/type/fbclp/CMakeLists.txt (1)

26-31: Confirm Flex/Bison symbol prefixing matches ClpScanner.

Ensure velox/type/fbclp/ClpScanner.h (or the generated scanner header) contains #define yyFlexLexer veloxtpclpFlexLexer before including FlexLexer.h so the veloxtpclp-prefixed scanner matches the base class. Repository search found no such define — verify and add if missing.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 22

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae5119 and 028ab93.

📒 Files selected for processing (10)
  • velox/type/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/ClpParserUtil.cpp (1 hunks)
  • velox/type/fbclp/ClpParserUtil.h (1 hunks)
  • velox/type/fbclp/ClpScanner.h (1 hunks)
  • velox/type/fbclp/ClpTypeParser.h (1 hunks)
  • velox/type/fbclp/ClpTypeParser.ll (1 hunks)
  • velox/type/fbclp/ClpTypeParser.yy (1 hunks)
  • velox/type/fbclp/tests/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/tests/ClpTypeParserTest.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
velox/type/fbclp/ClpTypeParser.h (1)
velox/type/fbclp/ClpParserUtil.h (1)
  • facebook (22-43)
velox/type/fbclp/ClpParserUtil.h (3)
velox/type/fbclp/ClpScanner.h (1)
  • facebook (27-55)
velox/type/fbclp/ClpTypeParser.h (1)
  • facebook (22-37)
velox/type/fbclp/ClpParserUtil.cpp (6)
  • typeFromString (22-37)
  • typeFromString (22-24)
  • customTypeWithChildren (39-51)
  • customTypeWithChildren (39-41)
  • inferTypeWithSpaces (53-66)
  • inferTypeWithSpaces (53-55)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (1)
velox/python/type/type.pyi (2)
  • BIGINT (21-21)
  • VARCHAR (29-29)
velox/type/fbclp/ClpScanner.h (2)
velox/type/fbclp/ClpParserUtil.h (1)
  • facebook (22-43)
velox/type/fbclp/ClpTypeParser.h (1)
  • facebook (22-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: conventional-commits
  • GitHub Check: pre-commit
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (7)
velox/type/fbclp/tests/CMakeLists.txt (1)

15-25: Confirm tests are only added when VELOX_BUILD_TESTING and CLP are enabled.

Looks good, but ensure the parent CMakeLists.txt adds this tests directory only when VELOX_BUILD_TESTING is ON and VELOX_ENABLE_CLP_CONNECTOR is enabled; otherwise CI systems that disable tests or CLP could see spurious targets.

velox/type/fbclp/ClpTypeParser.ll (3)

39-40: Order is fine; keep VARIABLE before WORD.

Token priority is correct; no change needed.


60-63: yylex override is fine.

Throwing on default yylex is acceptable to catch miswiring.


66-76: parseClpType implementation looks good; add defensive error context.

Consider enriching the VELOX_CHECK failure with parser position when available, but current message is acceptable.

velox/type/fbclp/ClpTypeParser.yy (3)

97-99: varchar(var) length ignored by design — confirm this matches Velox semantics

You parse (NUMBER) but discard it. That aligns with existing Velox parsers; just confirm tests cover both varchar and varchar(n).


83-91: Field names with special chars depend on the scanner — ensure unquoted -, @, #, $, \ are emitted as WORD/VARIABLE when allowed

Grammar is ready (field_name accepts WORD), but support hinges on ClpTypeParser.ll. Please confirm tokens for unquoted identifiers with these chars are produced only where intended (e.g., not conflicting with minus).

Also applies to: 146-150


154-156: Good error surfacing

Clear context in errors via scanner->input(). LGTM.

@anlowee anlowee requested a review from wraymo September 15, 2025 20:10
@anlowee
Copy link
Author

anlowee commented Sep 15, 2025

Going to fix the coderabbitai comments first.

@anlowee
Copy link
Author

anlowee commented Sep 16, 2025

Most of comments from coderabbitai are related to the code I coped from velox/type/parser, so I just leave them as is. The only difference we made here is we added sum special chars at velox/type/fbclp/ClpTypeParser.ll:L36.

@anlowee
Copy link
Author

anlowee commented Sep 16, 2025

Fixed some of suggestions in CMake and comments.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (1)

26-39: Add negative and nested coverage (quoted/illegal-char/nested ROW/ARRAY/MAP).

Strengthen regression guarantees with explicit failing and nested cases as previously suggested.

I can draft targeted TEST_F cases for:

  • Quoted identifiers with spaces and special chars.
  • An illegal unquoted char to assert a clean failure.
  • Nested types with special-char field names.
    Want me to push these?
velox/type/fbclp/ClpTypeParser.h (2)

32-36: Clarify quoted identifier support in docs.

State that quoted names accept any character except the quote delimiter.

Apply:

-/// Quoted field names are supported.
+/// Quoted field names are supported; inside quotes any character is allowed
+/// except the terminating quote delimiter.

19-21: Change parseClpType to take std::string_view; update generated parser

Use std::string_view to avoid copies and match ClpScanner. In velox/type/fbclp/ClpTypeParser.h add #include <string_view> and change
TypePtr parseClpType(const std::string& typeText);
to
TypePtr parseClpType(std::string_view typeText);
Also update velox/type/fbclp/ClpTypeParser.ll (or regenerate from ClpTypeParser.yy) so the definition matches, and ensure the implementation does not store/assume ownership of the incoming characters (lifetime safety). Call sites found: velox/type/fbclp/tests/ClpTypeParserTest.cpp (no change required for string literals).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 028ab93 and e9d3511.

📒 Files selected for processing (3)
  • velox/type/CMakeLists.txt (1 hunks)
  • velox/type/fbclp/ClpTypeParser.h (1 hunks)
  • velox/type/fbclp/tests/ClpTypeParserTest.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
velox/type/fbclp/ClpTypeParser.h (2)
velox/type/fbclp/ClpParserUtil.h (1)
  • facebook (22-43)
velox/type/fbclp/ClpScanner.h (1)
  • facebook (27-55)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (1)
velox/python/type/type.pyi (2)
  • BIGINT (21-21)
  • VARCHAR (29-29)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (1)
velox/type/CMakeLists.txt (1)

21-23: Approve — VELOX_ENABLE_CLP_CONNECTOR is defined and fbclp gating is correct.
Option defined at CMakeLists.txt:130 (default ON); gating present in velox/type/CMakeLists.txt:21 — prevents accidental toolchain requirements and keeps non‑CLP builds clean.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (1)

26-39: Add negative, quoted-with-spaces, and nested coverage.

These cement lexer/parser behaviour and guard against regressions.

Apply:

@@
 TEST_F(TypeParserTest, rowTypeWithSpecialChars) {
@@
 }
 
+TEST_F(TypeParserTest, quotedIdentifiersAndWhitespace) {
+  ASSERT_EQ(
+      *parseClpType(R"(row("hy-phen" varchar,"sp ace" varchar,"$$" varchar))"),
+      *ROW({"hy-phen", "sp ace", "$$"}, {VARCHAR(), VARCHAR(), VARCHAR()}));
+}
+
+TEST_F(TypeParserTest, nestedTypesWithSpecialCharNames) {
+  ASSERT_EQ(
+      *parseClpType(
+          R"(row("$root$" row("-arr-" array(varchar), "#m#" map(varchar, bigint))))"),
+      *ROW(
+          {"$root$"},
+          {ROW({"-arr-", "#m#"}, {ARRAY(VARCHAR()), MAP(VARCHAR(), BIGINT())})}));
+}
+
+TEST_F(TypeParserTest, rejectsIllegalIdentifierChar) {
+  // '!' is still illegal; ensure fast failure.
+  VELOX_ASSERT_THROW(parseClpType(R"(row(inva!id bigint))"), "invalid");
+}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9d3511 and 1725ea8.

📒 Files selected for processing (1)
  • velox/type/fbclp/tests/ClpTypeParserTest.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (1)
velox/python/type/type.pyi (2)
  • BIGINT (21-21)
  • VARCHAR (29-29)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (2)
velox/type/fbclp/tests/ClpTypeParserTest.cpp (2)

26-33: Good unquoted coverage for -, @, #, $, and .

This locks in the lexer change for unquoted identifiers and matches expected field names/types.


26-39: Confirm additional parser tests exist (quoted / whitespace / nested / negative cases)

Automated search found no matches under velox/type/fbclp/tests — confirm these tests were added in another commit/location or add them under velox/type/fbclp/tests (see ClpTypeParserTest.cpp lines 26–39).

@anlowee
Copy link
Author

anlowee commented Sep 16, 2025

Ready fto review.

@kirkrodrigues
Copy link
Member

@anlowee anlowee marked this pull request as draft September 24, 2025 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants