wc -l for tengo SDK #1090
Conversation
🦋 Changeset detected. Latest commit: c3689db. The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages.
Summary of Changes
Hello @dbolotin, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new countLines function to the Tengo SDK's txt library, providing a high-performance solution for counting lines in text files. This feature is powered by a new Python backend utility that utilizes the Polars library for efficient processing and includes support for optional regex filtering to exclude specific lines. The changes enhance the SDK's text processing capabilities, offering a robust and performant way to analyze file content within Platforma workflows.
Highlights
- **New `countLines` Function**: A new `countLines` function has been introduced to the Tengo SDK's `txt` library, allowing users to efficiently count lines in text files within their workflows.
- **High-Performance Polars Backend**: The core line-counting logic is implemented in a new Python backend script (`wc-l.py`) that leverages the Polars library for optimized performance, capable of handling large files rapidly.
- **Regex Filtering Capability**: The `countLines` function supports an optional `ignorePattern` argument, enabling users to filter out lines matching a specified regular expression from the total count.
- **Updated Package Configuration and Dependencies**: The `lib/ptexter` package's `package.json` has been updated to include the new `wc-l` binary, along with dedicated requirements files (`requirements-head.txt`, `requirements-wc-l.txt`) for better dependency management.
- **New Test Coverage**: Comprehensive test cases, including a new Tengo test template and TypeScript tests, have been added to validate the functionality and correctness of the `countLines` function.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces `wc -l`-like functionality to the Tengo SDK, backed by a Python script using Polars for high performance. The changes are well-structured, separating concerns into a backend Python script and a frontend Tengo function. The PR also includes new tests and documentation.
My review focuses on several areas for improvement:
- A critical performance issue in the Python script where large files are read into memory entirely instead of being processed lazily.
- A bug in the Tengo function's argument handling.
- Mismatches between function documentation and implementation.
- Weaknesses in the test suite that fail to properly validate the new functionality.
- Some code complexity in tests that can be simplified.
```python
lines_df = df.collect()

# Get the column (should be the first/only column)
col_name = lines_df.columns[0]

# Filter out lines matching the ignore pattern
filtered_df = lines_df.filter(
    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
)

return len(filtered_df)
```
This implementation reads the entire file into memory with df.collect() before filtering. For large files, this will be very memory-intensive and can lead to out-of-memory errors, defeating the purpose of using a lazy reader like scan_csv. The filtering should be performed on the lazy DataFrame to ensure memory efficiency.
Suggested change:

```diff
-lines_df = df.collect()
-# Get the column (should be the first/only column)
-col_name = lines_df.columns[0]
-# Filter out lines matching the ignore pattern
-filtered_df = lines_df.filter(
-    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
-)
-return len(filtered_df)
+# Perform filtering lazily on the scanned dataframe
+# to avoid loading the entire file into memory.
+col_name = df.columns[0]
+filtered_lazy_df = df.filter(
+    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
+)
+# Collect only the final count, which is very memory-efficient.
+return filtered_lazy_df.select(pl.len()).collect().item()
```
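The memory contrast holds even outside Polars: a streaming counter touches one line at a time, so peak memory stays constant regardless of file size. A stdlib-only sketch of the same idea (the function name is hypothetical, and it assumes a Python `re` pattern rather than Polars' Rust regex engine):

```python
import re

def count_lines_streaming(path, ignore_pattern=None):
    """Stream a file line by line so peak memory is O(1) in file size.
    Stdlib illustration of the lazy approach; not the PR's Polars code."""
    matcher = re.compile(ignore_pattern) if ignore_pattern else None
    count = 0
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            # Skip lines matching the ignore pattern, if one was given.
            if matcher is not None and matcher.search(line):
                continue
            count += 1
    return count
```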
```
if len(opts) == 0 {
    opts = {}
} else if len(opts) == 1 {
    opts = opts[0]
} else {
    ll.panic("too many arguments")
}

if !is_map(opts) {
    ll.panic("opts must be a map or undefined. Got: %T", opts)
}
```
The current logic for handling the optional opts parameter is buggy. If the function is called with undefined as the second argument (e.g., countLines(file, undefined)), it will panic because the check at line 98 is flawed. The error message is also misleading. This logic should be refactored to robustly handle cases where opts is not provided, is a map, or is undefined.
Suggested change:

```
if len(opts) == 0 {
    opts = {}
} else if len(opts) == 1 {
    opts = opts[0]
    if is_undefined(opts) {
        opts = {}
    }
} else {
    ll.panic("too many arguments")
}

if !is_map(opts) {
    ll.panic("opts must be a map. Got: %T", opts)
}
```
```typescript
{
  name: 'count-lines-with-comment-filter',
  fileName: 'maybe_the_number_of_lines_is_the_answer.txt',
  countOptions: { ignorePattern: '^#' }, // Ignore lines starting with #
  expectedCount: 42, // Assuming no comment lines in this file, same count
  handleProvider: async (driverKit) => {
    return await driverKit.lsDriver.getLocalFileHandle(
      path.resolve('../../assets/maybe_the_number_of_lines_is_the_answer.txt'),
    );
  },
},
```
This test case for ignorePattern is not effective. It uses a pattern (^#) on a file that contains no matching lines, so the expectedCount is the same as the total line count. This doesn't actually verify that the filtering logic works correctly. A more robust test would use a dedicated test file containing lines that should be filtered out, and assert that the resulting count is lower than the total.
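The shape of an effective test can be sketched in plain Python (the fixture data and helper are made up for illustration; the actual suite is in TypeScript): the fixture must contain lines the pattern actually matches, and the filtered count must be strictly lower than the total.

```python
import re

def count_lines(text, ignore_pattern=None):
    # Minimal stand-in for countLines semantics (hypothetical helper).
    lines = text.splitlines()
    if ignore_pattern is None:
        return len(lines)
    matcher = re.compile(ignore_pattern)
    return sum(1 for line in lines if not matcher.search(line))

# Fixture deliberately contains comment lines so the filter has work to do.
fixture = "# header comment\ndata1\n# another comment\ndata2\ndata3\n"

total = count_lines(fixture)
filtered = count_lines(fixture, ignore_pattern="^#")

assert total == 5
assert filtered == 3
assert filtered < total  # the filter demonstrably removed lines
```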
```
# Requirements for wc-l endpoint - high performance line counting
polars-lts-cpu==1.30.0
```
```python
except re.error as e:
    raise ValueError(f"Invalid regex pattern '{ignore_pattern}': {e}") from e
```
This except re.error block is unreachable. When an invalid regex pattern is passed to polars.str.contains, it raises a polars.exceptions.ComputeError, not a re.error. The regex validation in the main function correctly prevents this from happening, but this exception handler in wc_lines is misleading as it provides a false sense of security. It should be removed or changed to catch polars.exceptions.ComputeError if wc_lines is intended to be robust on its own.
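One way to keep that validation honest is to compile the pattern up front with the `re` module and fail fast, rather than wrapping a call site in a handler for an exception type that is never raised there. A sketch under the assumption that `re` syntax is close enough to the backend's regex dialect to catch malformed patterns (function name hypothetical, not PR code):

```python
import re

def validate_ignore_pattern(ignore_pattern):
    """Fail fast on an invalid regex before handing it to the
    engine that will actually apply it. Sketch, not PR code;
    assumes `re` syntax approximates the backend's dialect."""
    try:
        re.compile(ignore_pattern)
    except re.error as e:
        raise ValueError(f"Invalid regex pattern '{ignore_pattern}': {e}") from e
    return ignore_pattern
```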
```
saveFileContent("output.txt")

result := cmdBuilder.run()
return result.getFileContent("output.txt")
```
The function's JSDoc specifies a return type of {number}, but it currently returns a string representation of the number. To align with the documentation and provide a more convenient API for consumers, the result should be converted to an integer before being returned.
Suggested change:

```
return int(result.getFileContent("output.txt"))
```
```
countResult := undefined
if inputs.countOptions == false {
    countResult = txt.countLines(importResult.file)
} else {
    countResult = txt.countLines(importResult.file, inputs.countOptions)
}
```
This logic is overly complex and relies on the test sending a false boolean for missing options. After fixing the argument handling in the countLines function, this can be greatly simplified. The test should pass undefined for missing options, and this template can just make a single, unconditional call.
```
// With improved argument handling in `countLines`, this logic can be simplified.
// The function will correctly handle `undefined` for the options map,
// defaulting to an empty map internally.
// This makes the template cleaner and less reliant on test-side workarounds.
countResult := txt.countLines(importResult.file, inputs.countOptions)
```
```typescript
countOptions: countOptions
  ? tx.createValue(
      Pl.JsonObject,
      JSON.stringify(countOptions),
    )
  : tx.createValue(Pl.JsonObject, 'false'),
```
This logic for passing countOptions is unnecessarily complex. Sending a JSON-encoded string 'false' when options are undefined forces the Tengo template to have special conditional logic. A cleaner approach is to pass undefined directly, which can be handled gracefully by the countLines function after the recommended fixes are applied.
Suggested change:

```diff
 countOptions: countOptions
   ? tx.createValue(
       Pl.JsonObject,
       JSON.stringify(countOptions),
-    )
-  : tx.createValue(Pl.JsonObject, 'false'),
+    ) // Pass options map if it exists
+  : tx.createValue(Pl.Undefined), // Pass undefined if it does not
```
```diff
@@ -0,0 +1,2 @@
+# Requirements for wc-l endpoint - high performance line counting
```
Do we really need them to be different for wc-l and head?
```json
},
"wc-l": {
  "binary": {
    "artifact": {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of creating a single package with a single requirements.txt, but different entrypoints for it?
```json
{
  "block-software": {
    "artifacts": {
      "py": {
        "type": "python",
        "registry": "platforma-open",
        "environment": "@platforma-open/milaboratories.runenv-python-3:3.12.6",
        "dependencies": {
          "toolset": "pip",
          "requirements": "requirements-head.txt"
        },
        "root": "./src"
      }
    },
    "entrypoints": {
      "phead-lines": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/phead-lines.py"
          ]
        }
      },
      "wc-l": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/wc-l.py"
          ]
        }
      }
    }
  }
}
```
No description provided.