-
Notifications
You must be signed in to change notification settings - Fork 2
feat: Add CLP UDFs with query rewriting support #31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-0.293-clp-connector
Are you sure you want to change the base?
feat: Add CLP UDFs with query rewriting support #31
Conversation
WalkthroughThis change introduces new CLP-specific scalar functions and extends the Presto CLP connector to support path-based and wildcard column functions for semi-structured log querying. It enhances the optimizer and filter-to-KQL converter to handle these UDFs, updates associated tests and documentation, and adds new dependency declarations. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant PrestoSQL
participant ClpPlugin
participant ClpFunctions
participant ClpPlanOptimizer
participant ClpFilterToKqlConverter
User->>PrestoSQL: Submit query with CLP_GET_* or CLP_WILDCARD_* functions
PrestoSQL->>ClpPlugin: Load connector and functions
PrestoSQL->>ClpPlanOptimizer: Optimize query plan
ClpPlanOptimizer->>ClpFilterToKqlConverter: Rewrite and pushdown filters/projections
ClpFilterToKqlConverter->>ClpFunctions: Map UDFs to variables/columns or KQL symbols
ClpPlanOptimizer-->>PrestoSQL: Return optimized plan
PrestoSQL-->>User: Execute and return results
Assessment against linked issues
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
🔇 Additional comments (3)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
🔭 Outside diff range comments (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (1)
870-901
: Consider optimizing string building for non-pushdownable expressions.The logic correctly handles partial pushdown, but unnecessarily builds query strings even when
allPushedDown
becomes false. Consider checking pushdownability first before building strings.private ClpExpression handleOr(SpecialFormExpression node, Set<VariableReferenceExpression> context) { + // First pass: check if all children can be pushed down + List<ClpExpression> childExpressions = new ArrayList<>(); + for (RowExpression argument : node.getArguments()) { + ClpExpression expression = argument.accept(this, context); + if (expression.getRemainingExpression().isPresent() || !expression.getPushDownExpression().isPresent()) { + return new ClpExpression(node); + } + childExpressions.add(expression); + } + + // Second pass: build the query StringBuilder metadataQueryBuilder = new StringBuilder(); metadataQueryBuilder.append("("); StringBuilder queryBuilder = new StringBuilder(); queryBuilder.append("("); - boolean allPushedDown = true; boolean hasAllMetadataSql = true; - for (RowExpression argument : node.getArguments()) { - ClpExpression expression = argument.accept(this, context); - if (expression.getRemainingExpression().isPresent() || !expression.getPushDownExpression().isPresent()) { - allPushedDown = false; - continue; - } + for (ClpExpression expression : childExpressions) { queryBuilder.append(expression.getPushDownExpression().get()); queryBuilder.append(" OR "); if (hasAllMetadataSql && expression.getMetadataSqlQuery().isPresent()) { metadataQueryBuilder.append(expression.getMetadataSqlQuery().get()); metadataQueryBuilder.append(" OR "); } else { hasAllMetadataSql = false; } } - if (allPushedDown) { - // Remove the last " OR " from the query - return new ClpExpression( - queryBuilder.substring(0, queryBuilder.length() - 4) + ")", - hasAllMetadataSql ? metadataQueryBuilder.substring(0, metadataQueryBuilder.length() - 4) + ")" : null); - } - return new ClpExpression(node); + // Remove the last " OR " from the query + return new ClpExpression( + queryBuilder.substring(0, queryBuilder.length() - 4) + ")", + hasAllMetadataSql ? metadataQueryBuilder.substring(0, metadataQueryBuilder.length() - 4) + ")" : null); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (11)
presto-clp/pom.xml
(1 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
(25 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java
(1 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java
(1 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java
(2 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java
(1 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java
(1 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java
(4 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java
(1 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java
(3 hunks)presto-docs/src/main/sphinx/connector/clp.rst
(1 hunks)
🧰 Additional context used
🧠 Learnings (5)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (1)
Learnt from: anlowee
PR: y-scope/presto#0
File: :0-0
Timestamp: 2025-06-19T08:41:46.288Z
Learning: In the y-scope/presto repository, prefer using Guava's Immutable collections over regular collections when possible.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: test (17.0.13, :presto-main-base)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-distributed-non-hash-gen)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-plan-determinism)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-tpch-distributed-queries)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-distributed-queries)
- GitHub Check: test (17.0.13, :presto-tests -P presto-tests-general)
- GitHub Check: test (8.0.442, :presto-main-base)
- GitHub Check: test (17.0.13, :presto-tests -P presto-tests-execution-memory)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-aggregation-queries)
- GitHub Check: test (8.0.442, :presto-main)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-resource-manager)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-plan-determinism)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-local-queries)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-queries)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-tpch-distributed-queries)
- GitHub Check: test (8.0.442, :presto-tests -P presto-tests-execution-memory)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-non-hash-gen)
- GitHub Check: test (:presto-docs)
- GitHub Check: maven-checks (17.0.13)
- GitHub Check: maven-checks (8.0.442)
🔇 Additional comments (10)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java (1)
230-233
: LGTM! Clean getter implementation.The getter method is well-implemented and follows standard Java conventions. This addition properly supports the enhanced testing infrastructure for the new CLP UDF functionality.
presto-clp/pom.xml (1)
143-159
: LGTM! Dependencies appropriately support the new CLP UDF functionality.The added dependencies are well-chosen and properly scoped:
presto-tests
andpresto-main-base
(test-jar) are correctly test-scopedpresto-expressions
is needed for the new expression rewriting capabilitiesThese additions align perfectly with the PR objectives to add CLP UDFs with query rewriting support.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java (1)
26-30
: LGTM! Proper function registration implementation.The
getFunctions()
method correctly implements the standard Presto plugin pattern for registering scalar functions. The use ofImmutableSet
and proper return type ensures the CLP UDFs are properly exposed to the Presto runtime.presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java (1)
87-91
: LGTM! Excellent defensive programming.The guard clause properly handles null/empty configuration paths by returning an empty filter map early. This prevents unnecessary file I/O operations and provides graceful fallback behaviour when metadata filter configuration is not provided.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (2)
67-69
: LGTM! Proper function registration for testing.The static block correctly registers the CLP UDFs using
FunctionExtractor.extractFunctions()
, ensuring these functions are available during test execution. This is the standard approach for testing custom functions in Presto.
84-87
: LGTM! Table structure updates align with schema enhancements.The reordering and simplification of the nested
RowType
fields appears to align with the enhanced plan optimization and function pushdown logic introduced in this PR. The structure remains logically consistent.presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java (1)
1-309
: Well-structured test implementationThe test class provides comprehensive coverage for the CLP plan optimizer with proper setup/teardown and clear test scenarios covering projection and filter pushdown cases.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java (1)
238-240
: Use parameterized logging for better securityDirect string concatenation in log statements could be a security concern if the data contains sensitive information.
- metadataSqlQuery = Optional.of(metadataFilterProvider.remapFilterSql(scope, metadataSqlQuery.get())); - log.debug("Metadata SQL query: %s", metadataSqlQuery); + String remappedQuery = metadataFilterProvider.remapFilterSql(scope, metadataSqlQuery.get()); + metadataSqlQuery = Optional.of(remappedQuery); + log.debug("Metadata SQL query: %s", remappedQuery);⛔ Skipped due to learnings
Learnt from: anlowee PR: y-scope/presto#0 File: :0-0 Timestamp: 2025-06-19T08:41:46.288Z Learning: In the y-scope/presto repository, prefer static imports for commonly used methods like String.format(). Instead of writing `String.format("%s", value)`, use static import `import static java.lang.String.format;` and call `format("%s", value)` directly.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java (1)
272-308
: Comprehensive test coverage for CLP UDFsExcellent test coverage for both CLP_GET_* and CLP_WILDCARD_* functions, including edge cases and combinations with other predicates.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (1)
739-796
: Correct implementation of CLP wildcard interpretation.The method properly handles CLP_WILDCARD UDFs with appropriate validation and error handling. Good use of the Optional pattern and descriptive error messages.
|
||
.. code-block:: sql | ||
SELECT CLP_GET_STRING(msg.author) AS author |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add quotes to the JSON path argument
According to line 335, JSON paths must be constant string literals. The example should use quotes.
- SELECT CLP_GET_STRING(msg.author) AS author
+ SELECT CLP_GET_STRING('msg.author') AS author
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
SELECT CLP_GET_STRING(msg.author) AS author | |
SELECT CLP_GET_STRING('msg.author') AS author |
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 343, the JSON path
argument in the function call lacks quotes, which is required as per line 335
stating JSON paths must be constant string literals. Fix this by adding quotes
around the JSON path argument in the SELECT statement to ensure it is a valid
string literal.
SELECT CLP_GET_ARRAY(msg.tags) AS tags | ||
FROM clp.default.table_2 | ||
WHERE CLP_GET_BOOL('msg.is_active') = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix function name in the example
The function CLP_GET_ARRAY
doesn't exist. It should be CLP_GET_STRING_ARRAY
.
- SELECT CLP_GET_ARRAY(msg.tags) AS tags
+ SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
SELECT CLP_GET_ARRAY(msg.tags) AS tags | |
FROM clp.default.table_2 | |
WHERE CLP_GET_BOOL('msg.is_active') = true; | |
SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags | |
FROM clp.default.table_2 | |
WHERE CLP_GET_BOOL('msg.is_active') = true; |
🤖 Prompt for AI Agents
In the presto-docs/src/main/sphinx/connector/clp.rst file, around lines 347 to
349, the example query uses the incorrect function name `CLP_GET_ARRAY`. This
function does not exist. Instead, the correct function to use is
`CLP_GET_STRING_ARRAY` to retrieve an array of strings from the `msg.tags`
field. Update the query to use the correct `CLP_GET_STRING_ARRAY` function.
Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``. | ||
|
||
.. function:: CLP_GET_FLOAT(varchar) -> double | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the function description for CLP_GET_FLOAT
The description states "Returns the boolean value" but the function returns a DOUBLE
, not a boolean.
- Returns the boolean value of the given JSON path, where the column type is ``Float``. Returns a Presto ``DOUBLE``.
+ Returns the float value of the given JSON path, where the column type is ``Float``. Returns a Presto ``DOUBLE``.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Returns the float value of the given JSON path, where the column type is ``Float``. Returns a Presto ``DOUBLE``. |
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 320, the function
description for CLP_GET_FLOAT incorrectly states that it returns a boolean
value. Update the description to correctly state that the function returns a
DOUBLE value instead of a boolean.
|
||
.. function:: CLP_GET_INT(varchar) -> bigint | ||
|
||
Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the function description for CLP_GET_INT
The description states "Returns the string value" but the function returns a BIGINT
, not a string.
- Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``.
+ Returns the integer value of the given JSON path, where the column type is ``Integer``. Returns a Presto ``BIGINT``.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``. | |
Returns the integer value of the given JSON path, where the column type is ``Integer``. Returns a Presto ``BIGINT``. |
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 317, the function
description for CLP_GET_INT incorrectly states it returns a string value. Update
the description to correctly state that the function returns a Presto BIGINT
value instead of a string.
|
||
.. function:: CLP_GET_BOOL(varchar) -> boolean | ||
|
||
Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the function description for CLP_GET_BOOL
The description states "Returns the double value" but the function returns a BOOLEAN
, not a double.
- Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
+ Returns the boolean value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto | |
Returns the boolean value of the given JSON path, where the column type is ``Boolean``. Returns a Presto |
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 325, the function
description incorrectly states that CLP_GET_BOOL returns a double value, but it
actually returns a BOOLEAN. Update the description to accurately reflect that
the function returns a BOOLEAN value instead of a double.
@ScalarFunction(value = "CLP_GET_JSON_STRING", deterministic = false) | ||
@Description("Converts an entire log record into a JSON string.") | ||
@SqlType(StandardTypes.VARCHAR) | ||
public static Slice clpGetJSONString() | ||
{ | ||
return Slices.EMPTY_SLICE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Consider adding a parameter to CLP_GET_JSON_STRING
Unlike other CLP_GET functions, this one takes no parameters. Should it accept a JSON path parameter for consistency, or is it intended to return the entire record?
🌐 Web query:
What is the intended behavior of CLP_GET_JSON_STRING function in Presto CLP connector - should it take a JSON path parameter or return the entire log record?
💡 Result:
The intended behavior of the CLP_GET_JSON_STRING function in the Presto CLP connector is not documented in detail within the provided search results. However, based on best practices for similar JSON extraction functions in SQL engines and the available context around JSON path extraction and CLP encoding, the following evidence-supported conclusions can be drawn:
- The typical approach for JSON extraction functions (such as
JSONPATH
in Apache Pinot or field mapping in the Presto Kafka connector) is to allow the user to specify a JSON path or field mapping[1][6][8]. The path points to the element or elements you want to extract from the JSON record rather than always returning the entire record. - For instance, the Presto Kafka connector's JSON decoder lets users map columns to “slash-separated list of field names to select a field from the JSON object”, which serves as a path to the desired property[1].
- Similarly, other data platforms (like Pinot) provide functions such as
JSONPATH(jsonField, 'jsonPath')
where the second argument is a path expression that targets a specific field in the JSON structure[6].
Extracts the object value from jsonField based on 'jsonPath', the result type is inferred based on JSON value[6].
- The CLP-specific stream ingestion feature in Pinot also makes it clear that fields are individually targeted and encoded/extracted based on configuration, not by returning the whole JSON record by default[8].
Given these conventions, the intended behavior of a function named CLP_GET_JSON_STRING in Presto’s CLP connector is most likely to accept a JSON path parameter and extract (or return) the value at that path from the log record—not just return the entire JSON log record. Requiring a path parameter also aligns with SQL and analytic platform norms for working with nested or semi-structured data.
If CLP_GET_JSON_STRING were designed to return the whole log record, there would be little benefit in calling it specifically, as fetching the entire record typically does not require a specialized function or parameterization.
In summary:
CLP_GET_JSON_STRING should be expected to take a JSON path parameter and return only the requested field or value, not the entire log record, following standard design patterns for JSON extraction in Presto and related systems[1][6][8]. If official documentation becomes available, it should be referenced for precise usage.
Citations:
- 1: https://teradata.github.io/presto/docs/0.167-t/connector/kafka.html
- 2: https://support.smartbear.com/readyapi/docs/en/test-apis-with-readyapi/jsonpath-reference.html
- 3: https://dataplatform.cloud.ibm.com/docs/content/wsj/manage-data/conn-presto.html?context=wx
- 4: https://support.workiva.com/hc/en-us/articles/19514320255252-CLP-JSON-Nested-Object
- 5: Automatically update the configuration or load a new connector in Presto prestodb/presto#14964
- 6: https://docs.pinot.apache.org/functions-1/jsonpath
- 7: https://docs.teradata.com/r/Enterprise_IntelliFlex_Lake_VMware/QueryGridTM-Components-and-Connectors-Compatibility-Matrix/Teradata-QueryGrid-Compatibility-Matrix-2.x/Teradata-QueryGrid-2.x-Components
- 8: https://docs.pinot.apache.org/manage-data/data-import/pinot-stream-ingestion/clp
- 9: https://www.cytivalifesciences.com/en/us/shop/liquid-preparation-and-management/single-use-connectors/kleenpak-presto-sterile-connectors-p-36871
- 10: [Design] Connector Participating Query Optimization prestodb/presto#13102
Add a JSON path parameter to CLP_GET_JSON_STRING
for consistency
Based on common SQL connector patterns and Presto’s own JSON extraction functions, CLP_GET_JSON_STRING
should accept a JSON-path argument rather than always returning the full record.
Please update accordingly:
• File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java
Lines: 106–112
• Change the signature from:
public static Slice clpGetJSONString()
to:
public static Slice clpGetJSONString(
@SqlType(StandardTypes.VARCHAR) Slice jsonPath)
• Adjust the annotations:
@ScalarFunction(value = "CLP_GET_JSON_STRING", deterministic = false)
@Description("Extracts the JSON string at the given path from a log record.")
@SqlType(StandardTypes.VARCHAR)
• Implement lookup of the specified path inside the record (or throw a clear error if the path is invalid).
This aligns with Presto’s Kafka connector (JSONPATH(jsonField, 'path')
) and other CLP-based extractors.
🤖 Prompt for AI Agents
In presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java
around lines 106 to 112, update the clpGetJSONString method to accept a Slice
parameter annotated with @SqlType(StandardTypes.VARCHAR) representing the JSON
path. Modify the annotations to reflect that the function extracts the JSON
string at the given path from a log record. Implement logic inside the method to
parse the log record and extract the JSON value at the specified path, throwing
a clear error if the path is invalid or not found.
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, | ||
"Unsupported plan shape for CLP pushdown: " + childNode.getClass().getSimpleName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Improve error message for unsupported plan shapes
The error message could be more descriptive about which plan shapes are supported.
- throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
- "Unsupported plan shape for CLP pushdown: " + childNode.getClass().getSimpleName());
+ throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
+ String.format("Unsupported plan shape for CLP pushdown: %s. Only Project->TableScan and Project->Filter->TableScan are supported.",
+ childNode.getClass().getSimpleName()));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, | |
"Unsupported plan shape for CLP pushdown: " + childNode.getClass().getSimpleName()); | |
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, | |
String.format("Unsupported plan shape for CLP pushdown: %s. Only Project->TableScan and Project->Filter->TableScan are supported.", | |
childNode.getClass().getSimpleName())); |
🤖 Prompt for AI Agents
In presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java
at lines 156-157, the error message for unsupported plan shapes is vague.
Enhance the message by listing or describing the supported plan shapes to
provide clearer guidance on what is expected, improving the clarity of the
exception thrown.
this.standardFunctionResolution = requireNonNull(standardFunctionResolution, "standardFunctionResolution is null"); | ||
this.functionMetadataManager = requireNonNull(functionMetadataManager, "function metadata manager is null"); | ||
this.assignments = requireNonNull(assignments, "assignments is null"); | ||
this.assignments = new HashMap<>(requireNonNull(assignments, "assignments is null")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider using ImmutableMap.Builder for better immutability management.
Creating a mutable HashMap copy allows modifications but doesn't align with the repository's preference for Guava's Immutable collections. Since modifications are needed for CLP UDF processing, consider using ImmutableMap.Builder
to build the final map after all transformations.
- this.assignments = new HashMap<>(requireNonNull(assignments, "assignments is null"));
+ // Use a builder pattern to maintain immutability preference while allowing modifications
+ ImmutableMap.Builder<VariableReferenceExpression, ColumnHandle> builder = ImmutableMap.builder();
+ builder.putAll(requireNonNull(assignments, "assignments is null"));
+ // Store builder for use during processing, then build final immutable map when needed
+ this.assignmentsBuilder = builder;
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
at line 118, replace the mutable HashMap copy of assignments with an
ImmutableMap.Builder to construct the map immutably. Initialize an
ImmutableMap.Builder, add all entries from the input assignments, perform any
necessary modifications on the builder, and then build the final ImmutableMap to
assign to this.assignments, ensuring immutability while allowing controlled
modifications.
Object handle = assignments.get(variable); | ||
if (!(handle instanceof ClpColumnHandle)) { | ||
return null; | ||
} | ||
return ((ClpColumnHandle) handle).getOriginalColumnName(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid returning null to prevent potential NullPointerException.
The method returns null
when the handle is not a ClpColumnHandle
, which could lead to NPEs or unexpected "null" string concatenation in callers. Consider returning Optional<String>
or throwing an exception for invalid handles.
- private String getVariableName(VariableReferenceExpression variable)
+ private Optional<String> getVariableName(VariableReferenceExpression variable)
{
Object handle = assignments.get(variable);
if (!(handle instanceof ClpColumnHandle)) {
- return null;
+ return Optional.empty();
}
- return ((ClpColumnHandle) handle).getOriginalColumnName();
+ return Optional.of(((ClpColumnHandle) handle).getOriginalColumnName());
}
Then update callers to handle the Optional appropriately:
- return new ClpExpression(getVariableName(node));
+ return getVariableName(node)
+ .map(ClpExpression::new)
+ .orElse(new ClpExpression(node));
🤖 Prompt for AI Agents
In the file
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
around lines 210 to 216, the getOriginalColumnName() method returns null when
the handle is not a ClpColumnHandle. This could lead to potential
NullPointerExceptions or unexpected "null" string concatenation in the calling
code. Refactor the method to return an Optional<String> instead of null, and
update the callers to handle the Optional appropriately. This will provide
better error handling and prevent null-related issues.
/** | ||
* Rewrites CLP UDFs (e.g., CLP_GET_*) in a RowExpression tree into VariableReferenceExpressions, | ||
* enabling them to be used as pushdown filters. | ||
* <p></p> | ||
* Traverses the expression tree recursively, replacing supported CLP UDFs with uniquely named | ||
* variables and tracking these variables in the assignments and context. If the CLP UDF takes | ||
* a constant string argument, that string is used as the new variable name. Unsupported | ||
* argument types (non-constants) or invalid expressions will throw an exception. | ||
* <p></p> | ||
* Examples: | ||
* <ul> | ||
* <li><code>CLP_GET_STRING('field')</code> → <code>field</code> (as a VariableReferenceExpression)</li> | ||
* <li><code>CLP_GET_INT('field')</code> → <code>field</code> (not mapped to a KQL column)</li> | ||
* </ul> | ||
* | ||
* @param rowExpression the input expression to analyze and possibly rewrite | ||
* @param context a set of VariableReferenceExpressions used for pushdown; newly created ones | ||
* will be added here | ||
* @return a possibly rewritten RowExpression with CLP UDFs replaced | ||
*/ | ||
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context) | ||
{ | ||
requireNonNull(context); | ||
if (!(rowExpression instanceof CallExpression)) { | ||
return rowExpression; | ||
} | ||
|
||
CallExpression callExpression = (CallExpression) rowExpression; | ||
|
||
// Recursively process the arguments of this CallExpression | ||
List<RowExpression> newArgs = callExpression.getArguments().stream() | ||
.map(childArg -> maybeReplaceClpUdfArgument(childArg, context)) | ||
.collect(ImmutableList.toImmutableList()); | ||
|
||
FunctionMetadata metadata = functionMetadataManager.getFunctionMetadata(callExpression.getFunctionHandle()); | ||
String functionName = metadata.getName().getObjectName().toUpperCase(); | ||
|
||
if (functionName.startsWith("CLP_GET")) { | ||
// Replace CLP UDF with VariableReferenceExpression | ||
int numArguments = callExpression.getArguments().size(); | ||
if (numArguments == 1) { | ||
RowExpression argument = callExpression.getArguments().get(0); | ||
if (!(argument instanceof ConstantExpression)) { | ||
throw new PrestoException( | ||
CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, | ||
"The argument of " + functionName + " must be a ConstantExpression"); | ||
} | ||
Optional<String> definition = argument.accept(this, context).getPushDownExpression(); | ||
if (definition.isPresent()) { | ||
VariableReferenceExpression newVar = new VariableReferenceExpression( | ||
Optional.empty(), | ||
definition.get(), | ||
callExpression.getType()); | ||
context.add(newVar); | ||
assignments.put(newVar, new ClpColumnHandle(definition.get(), callExpression.getType(), true)); | ||
return newVar; | ||
} | ||
else { | ||
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, "Unrecognized parameter in " + functionName); | ||
} | ||
} | ||
} | ||
|
||
return new CallExpression(callExpression.getDisplayName(), callExpression.getFunctionHandle(), callExpression.getType(), newArgs); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Well-implemented CLP UDF rewriting logic.
The method correctly handles recursive rewriting of CLP UDFs with proper error handling and validation. Good use of ImmutableList
for collecting transformed arguments.
Consider extracting the magic string "CLP_GET" as a constant for better maintainability:
+ private static final String CLP_GET_PREFIX = "CLP_GET";
+
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context)
{
// ... existing code ...
- if (functionName.startsWith("CLP_GET")) {
+ if (functionName.startsWith(CLP_GET_PREFIX)) {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/** | |
* Rewrites CLP UDFs (e.g., CLP_GET_*) in a RowExpression tree into VariableReferenceExpressions, | |
* enabling them to be used as pushdown filters. | |
* <p></p> | |
* Traverses the expression tree recursively, replacing supported CLP UDFs with uniquely named | |
* variables and tracking these variables in the assignments and context. If the CLP UDF takes | |
* a constant string argument, that string is used as the new variable name. Unsupported | |
* argument types (non-constants) or invalid expressions will throw an exception. | |
* <p></p> | |
* Examples: | |
* <ul> | |
* <li><code>CLP_GET_STRING('field')</code> → <code>field</code> (as a VariableReferenceExpression)</li> | |
* <li><code>CLP_GET_INT('field')</code> → <code>field</code> (not mapped to a KQL column)</li> | |
* </ul> | |
* | |
* @param rowExpression the input expression to analyze and possibly rewrite | |
* @param context a set of VariableReferenceExpressions used for pushdown; newly created ones | |
* will be added here | |
* @return a possibly rewritten RowExpression with CLP UDFs replaced | |
*/ | |
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context) | |
{ | |
requireNonNull(context); | |
if (!(rowExpression instanceof CallExpression)) { | |
return rowExpression; | |
} | |
CallExpression callExpression = (CallExpression) rowExpression; | |
// Recursively process the arguments of this CallExpression | |
List<RowExpression> newArgs = callExpression.getArguments().stream() | |
.map(childArg -> maybeReplaceClpUdfArgument(childArg, context)) | |
.collect(ImmutableList.toImmutableList()); | |
FunctionMetadata metadata = functionMetadataManager.getFunctionMetadata(callExpression.getFunctionHandle()); | |
String functionName = metadata.getName().getObjectName().toUpperCase(); | |
if (functionName.startsWith("CLP_GET")) { | |
// Replace CLP UDF with VariableReferenceExpression | |
int numArguments = callExpression.getArguments().size(); | |
if (numArguments == 1) { | |
RowExpression argument = callExpression.getArguments().get(0); | |
if (!(argument instanceof ConstantExpression)) { | |
throw new PrestoException( | |
CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, | |
"The argument of " + functionName + " must be a ConstantExpression"); | |
} | |
Optional<String> definition = argument.accept(this, context).getPushDownExpression(); | |
if (definition.isPresent()) { | |
VariableReferenceExpression newVar = new VariableReferenceExpression( | |
Optional.empty(), | |
definition.get(), | |
callExpression.getType()); | |
context.add(newVar); | |
assignments.put(newVar, new ClpColumnHandle(definition.get(), callExpression.getType(), true)); | |
return newVar; | |
} | |
else { | |
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, "Unrecognized parameter in " + functionName); | |
} | |
} | |
} | |
return new CallExpression(callExpression.getDisplayName(), callExpression.getFunctionHandle(), callExpression.getType(), newArgs); | |
} | |
// extract magic prefix into a constant | |
+ private static final String CLP_GET_PREFIX = "CLP_GET"; | |
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context) | |
{ | |
requireNonNull(context); | |
if (!(rowExpression instanceof CallExpression)) { | |
return rowExpression; | |
} | |
CallExpression callExpression = (CallExpression) rowExpression; | |
// Recursively process the arguments of this CallExpression | |
List<RowExpression> newArgs = callExpression.getArguments().stream() | |
.map(childArg -> maybeReplaceClpUdfArgument(childArg, context)) | |
.collect(ImmutableList.toImmutableList()); | |
FunctionMetadata metadata = functionMetadataManager.getFunctionMetadata(callExpression.getFunctionHandle()); | |
String functionName = metadata.getName().getObjectName().toUpperCase(); | |
- if (functionName.startsWith("CLP_GET")) { | |
+ if (functionName.startsWith(CLP_GET_PREFIX)) { | |
// Replace CLP UDF with VariableReferenceExpression | |
int numArguments = callExpression.getArguments().size(); | |
if (numArguments == 1) { | |
RowExpression argument = callExpression.getArguments().get(0); |
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
between lines 217 and 281, the string "CLP_GET" is used directly in the code to
identify CLP UDF functions. Extract this magic string into a private static
final constant at the class level, then replace all occurrences of the literal
"CLP_GET" with this constant to improve maintainability and avoid duplication.
Description
This PR introduces new user-defined functions (UDFs) for the CLP connector to improve querying of semi-structured logs:
ClpString
,Integer
,Float
, etc.). These UDFs enable direct access to dynamic fields and return corresponding Presto native types.Both sets of UDFs are integrated with the query rewriting layer. During query optimization, calls to these functions are rewritten into normal column references or KQL query column symbols, ensuring efficient execution with no additional parsing overhead.
The PR also updates the documentation to clearly describe the semantics and usage of these UDFs, along with examples demonstrating their use.
This PR fixes #29.
Checklist
breaking change.
Validation performed
All unit tests passed. End-to-end testing worked.
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Chores