wc -l for tengo SDK #1090
Conversation
🦋 Changeset detected. Latest commit: c3689db. The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages.
Summary of Changes
Hello @dbolotin, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new countLines function to the Tengo SDK's txt library, providing a high-performance solution for counting lines in text files. This feature is powered by a new Python backend utility that utilizes the Polars library for efficient processing and includes support for optional regex filtering to exclude specific lines. The changes enhance the SDK's text processing capabilities, offering a robust and performant way to analyze file content within Platforma workflows.
Highlights
- **New `countLines` Function**: A new `countLines` function has been introduced to the Tengo SDK's `txt` library, allowing users to efficiently count lines in text files within their workflows.
- **High-Performance Polars Backend**: The core line-counting logic is implemented in a new Python backend script (`wc-l.py`) that leverages the Polars library for optimized performance, capable of handling large files rapidly.
- **Regex Filtering Capability**: The `countLines` function supports an optional `ignorePattern` argument, enabling users to filter out lines matching a specified regular expression from the total count.
- **Updated Package Configuration and Dependencies**: The `lib/ptexter` package's `package.json` has been updated to include the new `wc-l` binary, along with dedicated requirements files (`requirements-head.txt`, `requirements-wc-l.txt`) for better dependency management.
- **New Test Coverage**: Comprehensive test cases, including a new Tengo test template and TypeScript tests, have been added to validate the functionality and correctness of the `countLines` function.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces `wc -l`-like functionality to the Tengo SDK, backed by a Python script using Polars for high performance. The changes are well-structured, separating concerns into a backend Python script and a frontend Tengo function. The PR also includes new tests and documentation.
My review focuses on several areas for improvement:
- A critical performance issue in the Python script where large files are read into memory entirely instead of being processed lazily.
- A bug in the Tengo function's argument handling.
- Mismatches between function documentation and implementation.
- Weaknesses in the test suite that fail to properly validate the new functionality.
- Some code complexity in tests that can be simplified.
```python
lines_df = df.collect()

# Get the column (should be the first/only column)
col_name = lines_df.columns[0]

# Filter out lines matching the ignore pattern
filtered_df = lines_df.filter(
    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
)

return len(filtered_df)
```
This implementation reads the entire file into memory with df.collect() before filtering. For large files, this will be very memory-intensive and can lead to out-of-memory errors, defeating the purpose of using a lazy reader like scan_csv. The filtering should be performed on the lazy DataFrame to ensure memory efficiency.
Suggested change:

```diff
-lines_df = df.collect()
-# Get the column (should be the first/only column)
-col_name = lines_df.columns[0]
-# Filter out lines matching the ignore pattern
-filtered_df = lines_df.filter(
-    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
-)
-return len(filtered_df)
+# Perform filtering lazily on the scanned dataframe
+# to avoid loading the entire file into memory.
+col_name = df.columns[0]
+filtered_lazy_df = df.filter(
+    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
+)
+# Collect only the final count, which is very memory-efficient.
+return filtered_lazy_df.select(pl.len()).collect().item()
```
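The memory contrast holds even outside Polars: a streaming counter touches one line at a time, so peak memory stays constant regardless of file size. A stdlib-only sketch of the same idea (the function name is hypothetical, and it assumes a Python `re` pattern rather than Polars' Rust regex engine):

```python
import re

def count_lines_streaming(path, ignore_pattern=None):
    """Stream a file line by line so peak memory is O(1) in file size.
    Stdlib illustration of the lazy approach; not the PR's Polars code."""
    matcher = re.compile(ignore_pattern) if ignore_pattern else None
    count = 0
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            # Skip lines matching the ignore pattern, if one was given.
            if matcher is not None and matcher.search(line):
                continue
            count += 1
    return count
```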
```
if len(opts) == 0 {
    opts = {}
} else if len(opts) == 1 {
    opts = opts[0]
} else {
    ll.panic("too many arguments")
}

if !is_map(opts) {
    ll.panic("opts must be a map or undefined. Got: %T", opts)
}
```
The current logic for handling the optional opts parameter is buggy. If the function is called with undefined as the second argument (e.g., countLines(file, undefined)), it will panic because the check at line 98 is flawed. The error message is also misleading. This logic should be refactored to robustly handle cases where opts is not provided, is a map, or is undefined.
Suggested change:

```
if len(opts) == 0 {
    opts = {}
} else if len(opts) == 1 {
    opts = opts[0]
    if is_undefined(opts) {
        opts = {}
    }
} else {
    ll.panic("too many arguments")
}

if !is_map(opts) {
    ll.panic("opts must be a map. Got: %T", opts)
}
```
```typescript
{
  name: 'count-lines-with-comment-filter',
  fileName: 'maybe_the_number_of_lines_is_the_answer.txt',
  countOptions: { ignorePattern: '^#' }, // Ignore lines starting with #
  expectedCount: 42, // Assuming no comment lines in this file, same count
  handleProvider: async (driverKit) => {
    return await driverKit.lsDriver.getLocalFileHandle(
      path.resolve('../../assets/maybe_the_number_of_lines_is_the_answer.txt'),
    );
  },
},
```
This test case for ignorePattern is not effective. It uses a pattern (^#) on a file that contains no matching lines, so the expectedCount is the same as the total line count. This doesn't actually verify that the filtering logic works correctly. A more robust test would use a dedicated test file containing lines that should be filtered out, and assert that the resulting count is lower than the total.
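The shape of an effective test can be sketched in plain Python (the fixture data and helper are made up for illustration; the actual suite is in TypeScript): the fixture must contain lines the pattern actually matches, and the filtered count must be strictly lower than the total.

```python
import re

def count_lines(text, ignore_pattern=None):
    # Minimal stand-in for countLines semantics (hypothetical helper).
    lines = text.splitlines()
    if ignore_pattern is None:
        return len(lines)
    matcher = re.compile(ignore_pattern)
    return sum(1 for line in lines if not matcher.search(line))

# Fixture deliberately contains comment lines so the filter has work to do.
fixture = "# header comment\ndata1\n# another comment\ndata2\ndata3\n"

total = count_lines(fixture)
filtered = count_lines(fixture, ignore_pattern="^#")

assert total == 5
assert filtered == 3
assert filtered < total  # the filter demonstrably removed lines
```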
```
# Requirements for wc-l endpoint - high performance line counting
polars-lts-cpu==1.30.0
```
```python
except re.error as e:
    raise ValueError(f"Invalid regex pattern '{ignore_pattern}': {e}") from e
```
This except re.error block is unreachable. When an invalid regex pattern is passed to polars.str.contains, it raises a polars.exceptions.ComputeError, not a re.error. The regex validation in the main function correctly prevents this from happening, but this exception handler in wc_lines is misleading as it provides a false sense of security. It should be removed or changed to catch polars.exceptions.ComputeError if wc_lines is intended to be robust on its own.
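One way to keep that validation honest is to compile the pattern up front with the `re` module and fail fast, rather than wrapping a call site in a handler for an exception type that is never raised there. A sketch under the assumption that `re` syntax is close enough to the backend's regex dialect to catch malformed patterns (function name hypothetical, not PR code):

```python
import re

def validate_ignore_pattern(ignore_pattern):
    """Fail fast on an invalid regex before handing it to the
    engine that will actually apply it. Sketch, not PR code;
    assumes `re` syntax approximates the backend's dialect."""
    try:
        re.compile(ignore_pattern)
    except re.error as e:
        raise ValueError(f"Invalid regex pattern '{ignore_pattern}': {e}") from e
    return ignore_pattern
```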
```
saveFileContent("output.txt")

result := cmdBuilder.run()
return result.getFileContent("output.txt")
```
The function's JSDoc specifies a return type of {number}, but it currently returns a string representation of the number. To align with the documentation and provide a more convenient API for consumers, the result should be converted to an integer before being returned.
Suggested change:

```
return int(result.getFileContent("output.txt"))
```
```
countResult := undefined
if inputs.countOptions == false {
    countResult = txt.countLines(importResult.file)
} else {
    countResult = txt.countLines(importResult.file, inputs.countOptions)
}
```
This logic is overly complex and relies on the test sending a false boolean for missing options. After fixing the argument handling in the countLines function, this can be greatly simplified. The test should pass undefined for missing options, and this template can just make a single, unconditional call.
```
// With improved argument handling in `countLines`, this logic can be simplified.
// The function will correctly handle `undefined` for the options map,
// defaulting to an empty map internally.
// This makes the template cleaner and less reliant on test-side workarounds.
countResult := txt.countLines(importResult.file, inputs.countOptions)
```
```typescript
countOptions: countOptions
  ? tx.createValue(
      Pl.JsonObject,
      JSON.stringify(countOptions),
    )
  : tx.createValue(Pl.JsonObject, 'false'),
```
This logic for passing countOptions is unnecessarily complex. Sending a JSON-encoded string 'false' when options are undefined forces the Tengo template to have special conditional logic. A cleaner approach is to pass undefined directly, which can be handled gracefully by the countLines function after the recommended fixes are applied.
Suggested change:

```diff
 countOptions: countOptions
   ? tx.createValue(
       Pl.JsonObject,
       JSON.stringify(countOptions),
-    )
-  : tx.createValue(Pl.JsonObject, 'false'),
+    ) // Pass options map if it exists
+  : tx.createValue(Pl.Undefined), // Pass undefined if it does not
```
```diff
@@ -0,0 +1,2 @@
+# Requirements for wc-l endpoint - high performance line counting
```
Do we really need them to be different for wc-l and head?
```json
},
"wc-l": {
  "binary": {
    "artifact": {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of creating a single package with a single requirements.txt, but different entrypoints for it?
```json
{
  "block-software": {
    "artifacts": {
      "py": {
        "type": "python",
        "registry": "platforma-open",
        "environment": "@platforma-open/milaboratories.runenv-python-3:3.12.6",
        "dependencies": {
          "toolset": "pip",
          "requirements": "requirements-head.txt"
        },
        "root": "./src"
      }
    },
    "entrypoints": {
      "phead-lines": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/phead-lines.py"
          ]
        }
      },
      "wc-l": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/wc-l.py"
          ]
        }
      }
    }
  }
}
```
No description provided.