Conversation

@dbolotin (Member)

No description provided.


changeset-bot bot commented Aug 20, 2025

🦋 Changeset detected

Latest commit: c3689db

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages:

| Name | Type |
| --- | --- |
| @platforma-open/milaboratories.software-ptexter | Minor |
| @platforma-sdk/workflow-tengo | Minor |
| @milaboratories/pl-middle-layer | Patch |
| @milaboratories/milaboratories.monetization-test.workflow | Patch |
| @milaboratories/milaboratories.ui-examples.workflow | Patch |
| @milaboratories/milaboratories.pool-explorer.workflow | Patch |
| @platforma-sdk/test | Patch |
| @milaboratories/milaboratories.monetization-test | Patch |
| @milaboratories/milaboratories.ui-examples | Patch |
| @milaboratories/milaboratories.pool-explorer | Patch |


@gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @dbolotin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a new countLines function to the Tengo SDK's txt library, providing a high-performance solution for counting lines in text files. This feature is powered by a new Python backend utility that utilizes the Polars library for efficient processing and includes support for optional regex filtering to exclude specific lines. The changes enhance the SDK's text processing capabilities, offering a robust and performant way to analyze file content within Platforma workflows.

Highlights

  • New countLines Function: A new countLines function has been introduced to the Tengo SDK's txt library, allowing users to efficiently count lines in text files within their workflows.
  • High-Performance Polars Backend: The core line counting logic is implemented in a new Python backend script (wc-l.py) that leverages the Polars library for optimized performance, capable of handling large files rapidly.
  • Regex Filtering Capability: The countLines function supports an optional ignorePattern argument, enabling users to filter out lines matching a specified regular expression from the total count.
  • Updated Package Configuration and Dependencies: The lib/ptexter package's package.json has been updated to include the new wc-l binary, along with dedicated requirements files (requirements-head.txt, requirements-wc-l.txt) for better dependency management.
  • New Test Coverage: Comprehensive test cases, including a new Tengo test template and TypeScript tests, have been added to validate the functionality and correctness of the countLines function.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces `wc -l`-style functionality to the Tengo SDK, backed by a Python script that uses Polars for high performance. The changes are well structured, separating concerns into a backend Python script and a frontend Tengo function. The PR also includes new tests and documentation.

My review focuses on several areas for improvement:

  • A critical performance issue in the Python script where large files are read into memory entirely instead of being processed lazily.
  • A bug in the Tengo function's argument handling.
  • Mismatches between function documentation and implementation.
  • Weaknesses in the test suite that fail to properly validate the new functionality.
  • Some code complexity in tests that can be simplified.

Comment on lines +42 to +52:

```python
lines_df = df.collect()

# Get the column (should be the first/only column)
col_name = lines_df.columns[0]

# Filter out lines matching the ignore pattern
filtered_df = lines_df.filter(
    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
)

return len(filtered_df)
```


Severity: high

This implementation reads the entire file into memory with df.collect() before filtering. For large files, this will be very memory-intensive and can lead to out-of-memory errors, defeating the purpose of using a lazy reader like scan_csv. The filtering should be performed on the lazy DataFrame to ensure memory efficiency.

Suggested change:

```diff
-lines_df = df.collect()
-# Get the column (should be the first/only column)
-col_name = lines_df.columns[0]
-# Filter out lines matching the ignore pattern
-filtered_df = lines_df.filter(
-    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
-)
-return len(filtered_df)
+# Perform filtering lazily on the scanned dataframe
+# to avoid loading the entire file into memory.
+col_name = df.columns[0]
+filtered_lazy_df = df.filter(
+    ~pl.col(col_name).str.contains(ignore_pattern, literal=False)
+)
+# Collect only the final count, which is very memory-efficient.
+return filtered_lazy_df.select(pl.len()).collect().item()
```
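The lazy-versus-eager distinction can also be illustrated without Polars. Below is a minimal stdlib sketch (the `count_lines` helper and the sample file contents are invented for illustration, not the project's actual code): the file is consumed line by line, so memory use stays roughly constant regardless of input size.

```python
import re
import tempfile

def count_lines(path, ignore_pattern=None):
    """Count lines in a text file, optionally skipping lines that match a regex."""
    matcher = re.compile(ignore_pattern) if ignore_pattern else None
    count = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # streams the file; never reads it fully into memory
            if matcher is not None and matcher.search(line):
                continue
            count += 1
    return count

# Tiny demonstration file: one comment line, two data lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# comment\ndata 1\ndata 2\n")
    path = tmp.name

print(count_lines(path))        # 3
print(count_lines(path, "^#"))  # 2
```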

Comment on lines +90 to +100:

```tengo
if len(opts) == 0 {
    opts = {}
} else if len(opts) == 1 {
    opts = opts[0]
} else {
    ll.panic("too many arguments")
}

if !is_map(opts) {
    ll.panic("opts must be a map or undefined. Got: %T", opts)
}
```


Severity: high

The current logic for handling the optional opts parameter is buggy. If the function is called with undefined as the second argument (e.g., countLines(file, undefined)), it will panic because the check at line 98 is flawed. The error message is also misleading. This logic should be refactored to robustly handle cases where opts is not provided, is a map, or is undefined.

```tengo
if len(opts) == 0 {
    opts = {}
} else if len(opts) == 1 {
    opts = opts[0]
    if is_undefined(opts) {
        opts = {}
    }
} else {
    ll.panic("too many arguments")
}

if !is_map(opts) {
    ll.panic("opts must be a map. Got: %T", opts)
}
```

Comment on lines +194 to +204:

```typescript
{
  name: 'count-lines-with-comment-filter',
  fileName: 'maybe_the_number_of_lines_is_the_answer.txt',
  countOptions: { ignorePattern: '^#' }, // Ignore lines starting with #
  expectedCount: 42, // Assuming no comment lines in this file, same count
  handleProvider: async (driverKit) => {
    return await driverKit.lsDriver.getLocalFileHandle(
      path.resolve('../../assets/maybe_the_number_of_lines_is_the_answer.txt'),
    );
  },
},
```


Severity: high

This test case for ignorePattern is not effective. It uses a pattern (^#) on a file that contains no matching lines, so the expectedCount is the same as the total line count. This doesn't actually verify that the filtering logic works correctly. A more robust test would use a dedicated test file containing lines that should be filtered out, and assert that the resulting count is lower than the total.
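A stdlib-only sketch of what a discriminating fixture could look like (file contents are invented for illustration; this is not the project's test harness). The key property is that the filtered count is strictly lower than the total, so a broken filter cannot pass by accident.

```python
import re
import tempfile

# Build a fixture whose filtered count provably differs from the total.
lines = ["# header", "alpha", "# another comment", "beta", "gamma"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
    fixture = tmp.name

pattern = re.compile(r"^#")
total = kept = 0
with open(fixture, encoding="utf-8") as fh:
    for line in fh:
        total += 1
        if not pattern.search(line):
            kept += 1

print(total, kept)  # 5 3
assert kept < total  # the filter demonstrably removed something
```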

Comment on lines +1 to +2:

```
# Requirements for wc-l endpoint - high performance line counting
polars-lts-cpu==1.30.0
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: medium

The modification of this file seems unnecessary as it is no longer referenced in lib/ptexter/package.json. The entrypoints now use requirements-head.txt and requirements-wc-l.txt. Keeping this unreferenced file can be confusing and add maintenance overhead in the future.

Comment on lines +90 to +91:

```python
except re.error as e:
    raise ValueError(f"Invalid regex pattern '{ignore_pattern}': {e}") from e
```


Severity: medium

This except re.error block is unreachable. When an invalid regex pattern is passed to polars.str.contains, it raises a polars.exceptions.ComputeError, not a re.error. The regex validation in the main function correctly prevents this from happening, but this exception handler in wc_lines is misleading as it provides a false sense of security. It should be removed or changed to catch polars.exceptions.ComputeError if wc_lines is intended to be robust on its own.
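Validating the pattern up front with `re.compile` is the standard way to surface a bad regex early. A small illustrative sketch (the `validate_pattern` helper name is invented):

```python
import re

def validate_pattern(pattern):
    """Compile the pattern eagerly so an invalid regex fails fast with a clear error."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"Invalid regex pattern '{pattern}': {exc}") from exc

print(validate_pattern(r"^#").pattern)  # ^#

try:
    validate_pattern("[unclosed")  # malformed character class
except ValueError as err:
    print("rejected:", type(err).__name__)  # rejected: ValueError
```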

```tengo
saveFileContent("output.txt")

result := cmdBuilder.run()
return result.getFileContent("output.txt")
```


Severity: medium

The function's JSDoc specifies a return type of {number}, but it currently returns a string representation of the number. To align with the documentation and provide a more convenient API for consumers, the result should be converted to an integer before being returned.

```tengo
return int(result.getFileContent("output.txt"))
```

Comment on lines +13 to +18:

```tengo
countResult := undefined
if inputs.countOptions == false {
    countResult = txt.countLines(importResult.file)
} else {
    countResult = txt.countLines(importResult.file, inputs.countOptions)
}
```


Severity: medium

This logic is overly complex and relies on the test sending a false boolean for missing options. After fixing the argument handling in the countLines function, this can be greatly simplified. The test should pass undefined for missing options, and this template can just make a single, unconditional call.

```tengo
// With improved argument handling in `countLines`, this logic can be simplified.
// The function will correctly handle `undefined` for the options map,
// defaulting to an empty map internally.
// This makes the template cleaner and less reliant on test-side workarounds.
countResult := txt.countLines(importResult.file, inputs.countOptions)
```

Comment on lines +220 to +225:

```typescript
countOptions: countOptions
  ? tx.createValue(
      Pl.JsonObject,
      JSON.stringify(countOptions),
    )
  : tx.createValue(Pl.JsonObject, 'false'),
```


Severity: medium

This logic for passing countOptions is unnecessarily complex. Sending a JSON-encoded string 'false' when options are undefined forces the Tengo template to have special conditional logic. A cleaner approach is to pass undefined directly, which can be handled gracefully by the countLines function after the recommended fixes are applied.

Suggested change:

```diff
-countOptions: countOptions
-  ? tx.createValue(
-      Pl.JsonObject,
-      JSON.stringify(countOptions),
-    )
-  : tx.createValue(Pl.JsonObject, 'false'),
+countOptions: countOptions
+  ? tx.createValue(
+      Pl.JsonObject,
+      JSON.stringify(countOptions),
+    ) // Pass options map if it exists
+  : tx.createValue(Pl.Undefined), // Pass undefined if it does not
```

```
@@ -0,0 +1,2 @@
# Requirements for wc-l endpoint - high performance line counting
```
@DenKoren (Member) commented Aug 21, 2025


Do we really need them to be different for wc-l and head?

```
},
"wc-l": {
  "binary": {
    "artifact": {
```
@DenKoren (Member) commented Aug 21, 2025


What do you think of creating single package with single requirements.txt, but different entrypoints for it?

```json
{
  "block-software": {
    "artifacts": {
      "py": {
        "type": "python",
        "registry": "platforma-open",
        "environment": "@platforma-open/milaboratories.runenv-python-3:3.12.6",
        "dependencies": {
          "toolset": "pip",
          "requirements": "requirements-head.txt"
        },
        "root": "./src"
      }
    },
    "entrypoints": {
      "phead-lines": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/phead-lines.py"
          ]
        }
      },
      "wc-l": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/wc-l.py"
          ]
        }
      }
    }
  }
}
```
