Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .changeset/add-wc-l-countlines.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
---
"@platforma-open/milaboratories.software-ptexter": minor
"@platforma-sdk/workflow-tengo": minor
---

Add countLines function with high-performance Polars line counting and regex filtering support
29 changes: 29 additions & 0 deletions lib/ptexter/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# @platforma-open/milaboratories.software-ptexter

Text processing utilities backend for Platforma workflows.

## Overview

This package provides Python-based text processing tools that serve as the backend implementation for the Platforma `txt` library. The utilities in this package are designed to be called from Tengo workflows through the corresponding frontend library located at `sdk/workflow-tengo/src/txt/`.

## Architecture

- **Backend (this package)**: Python scripts that perform the actual text processing operations
- **Frontend**: Tengo library (`txt`) that provides a convenient workflow API and calls these backend utilities

## Usage

This package is typically not used directly. Instead, use the `txt` library in your Tengo workflows:

```tengo
txt := import(":txt")

// The txt library will automatically call the appropriate ptexter backend utilities
result := txt.head(inputs.myFile, {lines: 10})
```

The backend utilities are packaged as Platforma software artifacts and automatically managed by the platform's execution environment.

## Development

This package follows the standard Platforma software packaging conventions and is built using the `@platforma-sdk/package-builder` toolchain.
20 changes: 19 additions & 1 deletion lib/ptexter/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
"environment": "@platforma-open/milaboratories.runenv-python-3:3.12.6",
"dependencies": {
"toolset": "pip",
"requirements": "requirements.txt"
"requirements": "requirements-head.txt"
},
"root": "./src"
},
Expand All @@ -34,6 +34,24 @@
"{pkg}/phead-lines.py"
]
}
},
"wc-l": {
"binary": {
"artifact": {
Copy link
Member

@DenKoren DenKoren Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of creating single package with single requirements.txt, but different entrypoints for it?

{
  "block-software": {
    "artifacts": {
      "py": {
        "type": "python",
        "registry": "platforma-open",
        "environment": "@platforma-open/milaboratories.runenv-python-3:3.12.6",
        "dependencies": {
          "toolset": "pip",
          "requirements": "requirements-head.txt"
        },
        "root": "./src"
      }
    },
    "entrypoints": {
      "phead-lines": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/phead-lines.py"
          ]
        }
      },
      "wc-l": {
        "binary": {
          "artifact": "py",
          "cmd": [
            "python",
            "{pkg}/wc-l.py"
          ]
        }
      }
    }
  }
}

"type": "python",
"registry": "platforma-open",
"environment": "@platforma-open/milaboratories.runenv-python-3:3.12.6",
"dependencies": {
"toolset": "pip",
"requirements": "requirements-wc-l.txt"
},
"root": "./src"
},
"cmd": [
"python",
"{pkg}/wc-l.py"
]
}
}
}
}
Expand Down
1 change: 1 addition & 0 deletions lib/ptexter/src/requirements-head.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# No external dependencies for head endpoint - uses only Python standard library
2 changes: 2 additions & 0 deletions lib/ptexter/src/requirements-wc-l.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# Requirements for wc-l endpoint - high performance line counting
Copy link
Member

@DenKoren DenKoren Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need them to be different for wc-l and head?

polars-lts-cpu==1.30.0
3 changes: 2 additions & 1 deletion lib/ptexter/src/requirements.txt
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
# No external dependencies - uses only Python standard library
# Requirements for wc-l endpoint - high performance line counting
polars-lts-cpu==1.30.0
Comment on lines +1 to +2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The modification of this file seems unnecessary as it is no longer referenced in lib/ptexter/package.json. The entrypoints now use requirements-head.txt and requirements-wc-l.txt. Keeping this unreferenced file can be confusing and add maintenance overhead in the future.

154 changes: 154 additions & 0 deletions lib/ptexter/src/wc-l.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
#!/usr/bin/env python3
"""
wc-l.py - Count lines in a text file with optional regex filtering
High-performance line counter using Polars with optional regex pattern to ignore certain lines.
Outputs just the count number (no trailing newline) to the specified output file.
"""

import argparse
import sys
import re
from pathlib import Path
import polars as pl


def count_lines_optimized(input_file: str, ignore_pattern: str = None) -> int:
"""
Count lines using optimized Polars approach (best from benchmarks: 31,000+ MB/s).
Args:
input_file: Path to input file
ignore_pattern: Optional regex pattern - lines matching this will be ignored
Returns:
Number of lines (excluding ignored lines)
"""
# Use the optimized single-column approach from our benchmarks
df = pl.scan_csv(
input_file,
has_header=False,
separator='\x00', # Null separator to read as single column
infer_schema_length=0,
ignore_errors=True,
low_memory=True,
)

if ignore_pattern is None:
# Fast path - just count all lines
return df.select(pl.len()).collect().item()
else:
# Need to filter lines - read the column and apply regex filter
lines_df = df.collect()

# Get the column (should be the first/only column)
col_name = lines_df.columns[0]

# Filter out lines matching the ignore pattern
filtered_df = lines_df.filter(
~pl.col(col_name).str.contains(ignore_pattern, literal=False)
)

return len(filtered_df)
Comment on lines +42 to +52
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This implementation reads the entire file into memory with df.collect() before filtering. For large files, this will be very memory-intensive and can lead to out-of-memory errors, defeating the purpose of using a lazy reader like scan_csv. The filtering should be performed on the lazy DataFrame to ensure memory efficiency.

Suggested change
lines_df = df.collect()
# Get the column (should be the first/only column)
col_name = lines_df.columns[0]
# Filter out lines matching the ignore pattern
filtered_df = lines_df.filter(
~pl.col(col_name).str.contains(ignore_pattern, literal=False)
)
return len(filtered_df)
# Perform filtering lazily on the scanned dataframe
# to avoid loading the entire file into memory.
col_name = df.columns[0]
filtered_lazy_df = df.filter(
~pl.col(col_name).str.contains(ignore_pattern, literal=False)
)
# Collect only the final count, which is very memory-efficient.
return filtered_lazy_df.select(pl.len()).collect().item()



def wc_lines(input_file: str, output_file: str, ignore_pattern: str = None):
"""
Count lines in input_file and write count to output_file.
Args:
input_file: Path to input file
output_file: Path to output file (will contain just the count)
ignore_pattern: Optional regex pattern - lines matching this will be ignored
"""
try:
input_path = Path(input_file)
output_path = Path(output_file)

if not input_path.exists():
raise FileNotFoundError(f"Input file not found: {input_file}")

if not input_path.is_file():
raise ValueError(f"Input path is not a file: {input_file}")

# Create output directory if it doesn't exist
output_path.parent.mkdir(parents=True, exist_ok=True)

# Count lines using optimized Polars approach
line_count = count_lines_optimized(input_file, ignore_pattern)

# Write count to output file (no trailing newline as requested)
with open(output_path, 'w', encoding='utf-8') as outfile:
outfile.write(str(line_count))

return line_count

except UnicodeDecodeError as e:
raise ValueError(f"Failed to decode input file as UTF-8: {e}") from e
except IOError as e:
raise IOError(f"File I/O error: {e}") from e
except re.error as e:
raise ValueError(f"Invalid regex pattern '{ignore_pattern}': {e}") from e
Comment on lines +90 to +91
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This except re.error block is unreachable. When an invalid regex pattern is passed to polars.str.contains, it raises a polars.exceptions.ComputeError, not a re.error. The regex validation in the main function correctly prevents this from happening, but this exception handler in wc_lines is misleading as it provides a false sense of security. It should be removed or changed to catch polars.exceptions.ComputeError if wc_lines is intended to be robust on its own.



def main():
parser = argparse.ArgumentParser(
description='Count lines in a text file with optional regex filtering',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python wc-l.py input.txt output.txt
python wc-l.py --ignore-pattern '^#' input.txt output.txt # Skip comment lines
python wc-l.py --ignore-pattern '^\\s*$' input.txt output.txt # Skip empty lines
"""
)

parser.add_argument(
'--ignore-pattern',
type=str,
help='Optional regex pattern - lines matching this pattern will be ignored'
)

parser.add_argument(
'input_file',
help='Input text file path'
)

parser.add_argument(
'output_file',
help='Output file path (will contain just the line count)'
)

args = parser.parse_args()

# Validate regex pattern if provided
if args.ignore_pattern:
try:
re.compile(args.ignore_pattern)
except re.error as e:
print(f"Error: Invalid regex pattern '{args.ignore_pattern}': {e}", file=sys.stderr)
sys.exit(1)

try:
line_count = wc_lines(
args.input_file,
args.output_file,
args.ignore_pattern
)

ignored_msg = f" (excluding lines matching '{args.ignore_pattern}')" if args.ignore_pattern else ""
print(f"Successfully counted {line_count} lines{ignored_msg} and wrote to {args.output_file}")

except (FileNotFoundError, ValueError, IOError) as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
except KeyboardInterrupt:
print("\nOperation cancelled by user", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Unexpected error: {e}", file=sys.stderr)
sys.exit(1)


if __name__ == '__main__':
main()
61 changes: 60 additions & 1 deletion sdk/workflow-tengo/src/txt/index.lib.tengo
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,65 @@ head := func(fileRef, opts) {
return result.getFileContent("output.txt")
}

/**
* Counts lines in a text file with optional regex filtering.
*
* @param fileRef {resource} - Resource reference to the input text file
* @param ...opts {map} (optional) - Options map with optional fields:
* - ignorePattern {string} (optional): Regex pattern - lines matching this will be ignored
* @returns {number} - Number of lines in the file (excluding ignored lines)
* @example
* // Count all lines
* lineCount := txt.countLines(inputs.myFile)
*
* // Count lines ignoring comments
* lineCount := txt.countLines(inputs.myFile, {ignorePattern: "^#"})
*
* // Count non-empty lines
* lineCount := txt.countLines(inputs.myFile, {ignorePattern: "^\\s*$"})
*/
countLines := func(fileRef, ...opts) {
if !smart.isReference(fileRef) {
ll.panic("fileRef must be a valid resource reference. Got: %T", fileRef)
}

if len(opts) == 0 {
opts = {}
} else if len(opts) == 1 {
opts = opts[0]
} else {
ll.panic("too many arguments")
}

if !is_map(opts) {
ll.panic("opts must be a map or undefined. Got: %T", opts)
}
Comment on lines +90 to +100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current logic for handling the optional opts parameter is buggy. If the function is called with undefined as the second argument (e.g., countLines(file, undefined)), it will panic because the check at line 98 is flawed. The error message is also misleading. This logic should be refactored to robustly handle cases where opts is not provided, is a map, or is undefined.

    if len(opts) == 0 {
        opts = {}
    } else if len(opts) == 1 {
        opts = opts[0]
        if is_undefined(opts) {
            opts = {}
        }
    } else {
        ll.panic("too many arguments")
    }
    if !is_map(opts) {
        ll.panic("opts must be a map. Got: %T", opts)
    }


wcSw := assets.importSoftware("@platforma-open/milaboratories.software-ptexter:wc-l")

cmdBuilder := exec.builder().
software(wcSw)

if !is_undefined(opts.ignorePattern) {
if !is_string(opts.ignorePattern) {
ll.panic("opts.ignorePattern must be a string. Got: %T", opts.ignorePattern)
}
cmdBuilder = cmdBuilder.
arg("--ignore-pattern").
arg(opts.ignorePattern)
}

cmdBuilder = cmdBuilder.
arg("input.txt").
arg("output.txt").
addFile("input.txt", fileRef).
saveFileContent("output.txt")

result := cmdBuilder.run()
return result.getFileContent("output.txt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The function's JSDoc specifies a return type of {number}, but it currently returns a string representation of the number. To align with the documentation and provide a more convenient API for consumers, the result should be converted to an integer before being returned.

    return int(result.getFileContent("output.txt"))

}

export ll.toStrict({
head: head
head: head,
countLines: countLines
})
24 changes: 24 additions & 0 deletions tests/workflow-tengo/src/test/txt/countLines.tpl.tengo
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
// txt countLines function test template

self := import("@platforma-sdk/workflow-tengo:tpl")
file := import("@platforma-sdk/workflow-tengo:file")
txt := import("@platforma-sdk/workflow-tengo:txt")

self.defineOutputs(["result", "progress"])

self.body(func(inputs) {
importResult := file.importFile(inputs.importHandle)

// Apply txt.countLines function with the specified options
countResult := undefined
if inputs.countOptions == false {
countResult = txt.countLines(importResult.file)
} else {
countResult = txt.countLines(importResult.file, inputs.countOptions)
}
Comment on lines +13 to +18
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic is overly complex and relies on the test sending a false boolean for missing options. After fixing the argument handling in the countLines function, this can be greatly simplified. The test should pass undefined for missing options, and this template can just make a single, unconditional call.

    // With improved argument handling in `countLines`, this logic can be simplified.
    // The function will correctly handle `undefined` for the options map,
    // defaulting to an empty map internally.
    // This makes the template cleaner and less reliant on test-side workarounds.
    countResult := txt.countLines(importResult.file, inputs.countOptions)


return {
result: countResult,
progress: importResult.handle
}
})
Loading