Improved Slicing to support negative indices #277

Open
wants to merge 14 commits into
base: main

vineetg3
Contributor

Resolves #260.
Please read the issue for more details.

The SUBSTR conversion logic in convert_slice is being expanded to handle negative indices.
Here is an outline of the logic used to support Python-based slicing semantics:

  • Case 1: (None, None)
    • Returns the string as is.
  • Case 2: (start, None)
    • Positive start: Convert to 1-based indexing and slice from start.
    • Negative start: Compute LENGTH(string) + start + 1; clamp to 1 if less than 1.
  • Case 3: (None, stop)
    • Positive stop: Slice from position 1 to stop.
    • Negative stop: Compute LENGTH(string) + stop; clamp to 0 if less than 0 (empty slice).
  • Case 4: (start, stop)
      1. Both start & stop >= 0:
      • Convert start to 1-based.
      • Set length = stop - start.
      2. start < 0, stop >= 0:
      • Convert start to positive. If < 1, set to 1.
      • Compute length = stop - start (clamp to 0 if negative).
      3. start >= 0, stop < 0:
      • Convert stop & start to positive.
      • If stop < 1, slice is empty (length = 0).
      • Else, length = stop - start.
      4. start < 0, stop < 0:
      • Convert start & stop to positive. If start < 1, set to 1.
      • If stop < 1, slice is empty (length = 0).
      • Else, length = stop - start.
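
The case analysis above can be sanity-checked in plain Python. The sketch below is not the PR's `convert_slice`, which emits SQL expressions. It is a hypothetical helper, `substr_args`, that assumes the string length is already known and returns the 1-based start and the length that a `SUBSTR(string, start, length)` call would need. `emulate_substr` is likewise an assumed stand-in for a SQL engine's `SUBSTR`.

```python
from typing import Optional, Tuple


def substr_args(
    n: int, start: Optional[int], stop: Optional[int]
) -> Tuple[int, Optional[int]]:
    """Return (1-based start, length) for a string of length n.

    A length of None means "through the end of the string".
    Hypothetical helper for illustration only.
    """
    # Resolve the start to a 1-based position, clamping to 1.
    if start is None:
        pos = 1
    elif start >= 0:
        pos = start + 1
    else:
        pos = max(n + start + 1, 1)

    if stop is None:
        return pos, None

    # Resolve the stop to a non-negative, 0-based exclusive bound.
    end = stop if stop >= 0 else max(n + stop, 0)

    # Length is the distance from the 0-based start to the stop, never negative.
    return pos, max(end - (pos - 1), 0)


def emulate_substr(s: str, pos: int, length: Optional[int]) -> str:
    """Assumed model of SUBSTR with a 1-based start position >= 1."""
    if length is None:
        return s[pos - 1:]
    return s[pos - 1 : pos - 1 + length]


print(substr_args(5, -2, None))  # (4, None)
print(substr_args(5, 1, -1))     # (2, 3)
```

Comparing `emulate_substr(s, *substr_args(len(s), a, b))` against `s[a:b]` over a range of `a` and `b` is a quick way to confirm that the clamping rules match Python's behaviour.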

@vineetg3 vineetg3 requested review from knassre-bodo and hadia206 and removed request for knassre-bodo February 21, 2025 21:23

@hadia206 hadia206 left a comment

Looks good to me.
I have minor comments.

- Convert `start` to positive. If < 1, set to 1.
- Compute `length = stop - start` (clamp to 0 if negative).
- 3. `start >= 0`, `stop < 0`:
- Convert `stop` & `start` to positive.

In case 3, start is already non-negative, so only stop needs to be converted.

Suggested change
- Convert `stop` & `start` to positive.
- Convert `stop` to positive.

@@ -59,7 +59,7 @@ These are created with a prefix operator syntax instead of called as a function.

These are other PyDough operators that are not necessarily used as functions:

- `SLICE`: operator used for string slicing, with the same semantics as Python string slicing. If `s[a:b:c]` is done, that is translated to `SLICE(s,a,b,c)` in PyDough, and any of `a`/`b`/`c` could be absent. Currently, PyDough does not support negative indices or providing step values other than 1.
- `SLICE`: operator used for string slicing, with the same semantics as Python string slicing. If `s[a:b:c]` is done, that is translated to `SLICE(s,a,b,c)` in PyDough, and any of `a`/`b`/`c` could be absent. Negative slicing is supported, however currently, PyDough does not support providing step values other than 1.

My understanding is that the restriction "PyDough does not support providing step values other than 1" applies to both positive and negative indices.
The current wording reads as though only negative slicing lacks support for step values other than 1.
A suggestion:

Suggested change
- `SLICE`: operator used for string slicing, with the same semantics as Python string slicing. If `s[a:b:c]` is done, that is translated to `SLICE(s,a,b,c)` in PyDough, and any of `a`/`b`/`c` could be absent. Negative slicing is supported, however currently, PyDough does not support providing step values other than 1.
- `SLICE`: operator used for string slicing, with the same semantics as Python string slicing. If `s[a:b:c]` is done, that is translated to `SLICE(s,a,b,c)` in PyDough, and any of `a`/`b`/`c` could be absent. Negative slicing is supported. Currently PyDough does not support providing step values other than 1.
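
For readers unfamiliar with how `s[a:b:c]` reaches an operator such as `SLICE`, here is a minimal sketch of the underlying Python mechanism. `SliceRecorder` is a hypothetical class written for this illustration and is not part of PyDough; it only shows that Python hands the three parts to `__getitem__` as a `slice` object, with any absent part set to `None`.

```python
class SliceRecorder:
    """Hypothetical stand-in that records how Python passes slice parts."""

    def __getitem__(self, key: slice):
        # Python delivers s[a:b:c] as slice(a, b, c); omitted parts are None.
        return (key.start, key.stop, key.step)


s = SliceRecorder()
print(s[1:-1])  # (1, -1, None)
print(s[:-5])   # (None, -5, None)
print(s[::2])   # (None, None, 2)
```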

if isinstance(step, sqlglot_expressions.Literal):
if int(step.this) != 1:
try:
step_int = int(step.this)

Shouldn't this handle the None case the same way the others do? For example:
start_idx = int(start.this) if not isinstance(start, sqlglot_expressions.Null) else None

try:
step_int = int(step.this)
except ValueError:
raise ValueError(

It would be better to also state which argument caused the error,
e.g. "SLICE function step argument must be an integer literal or absent".

)
except ValueError:
raise ValueError(
"SLICE function currently only supports the slicing arguments being integer literals or absent"

Same. Specify argument

)
except ValueError:
raise ValueError(
"SLICE function currently only supports the slicing arguments being integer literals or absent"

Same. Specify argument

start_idx: int | None = None
stop_idx: int | None = None
try:
start_idx = (

Why is the value wrapped in parentheses?

Contributor

@knassre-bodo knassre-bodo left a comment

Looks fantastic @vineetg3!!! Just a few comments before merging.

Comment on lines +166 to +168
name_without_first_char = name[1:],
last_digit = phone[-1:],
name_without_start_and_end_char = name[1:-1]

Let's also add:

  • phone_without_last_5_chars = phone[:-5]
  • name_second_to_last_char = name[-2:-1]
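
To make the expected results of these cases concrete, the sketch below evaluates the three existing slices and the two proposed ones with ordinary Python slicing. The sample `name` and `phone` values are assumptions chosen for illustration, not rows taken from the test data.

```python
# Sample values are assumptions for illustration, not rows from the test data.
name = "Jane Smith"
phone = "555-123-4567"

examples = {
    "name_without_first_char": name[1:],
    "last_digit": phone[-1:],
    "name_without_start_and_end_char": name[1:-1],
    "phone_without_last_5_chars": phone[:-5],
    "name_second_to_last_char": name[-2:-1],
}

for label, value in examples.items():
    print(f"{label} = {value!r}")
```

With these inputs, `phone[:-5]` gives `'555-123'` and `name[-2:-1]` gives `'t'`, which are the values the translated SQL would need to reproduce.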

@@ -504,12 +504,68 @@ def convert_concat(
return Concat(expressions=inputs)


def positive_index(
column: SQLGlotExpression, neg_index: int, is_0_based: bool = False

NIT: is_zero_indexed instead of is_0_based

based on the length of the column.

Args:
`column`: The column expression to reference

NIT: this isn't necessarily a column (could be a function call, or a literal). Let's just call this string_expr and say it refers to the string that the negative index is referencing a point within.

"""
assert len(sql_glot_args) == 4
_, start, stop, step = sql_glot_args

NIT: can unpack string_expr as the first arg & use that instead of sql_glot_args[0] in the rest of the function.

Comment on lines 583 to 584
for arg in sql_glot_args[1:]:
if not isinstance(arg, (sqlglot_expressions.Literal, sqlglot_expressions.Null)):

NIT: can replace the loop guard with for arg in (start, stop, step)

Comment on lines 590 to 591
try:
step_int = int(step.this)

NIT: can push this check into the earlier for loop by checking if isinstance(arg.this, int) when it is a literal.

Comment on lines 603 to 604
try:
start_idx = (

Same here: no need to check with try/except blocks. Just push the check into the earlier loop and do start_idx = int(start.this) if start isn't null.

Contributor Author

sqlglot presents the integer literal as a string; for example, 2 is presented as '2'. So we would need to check whether it can be converted to an int by using try/except. This probably isn't the cleanest way, and I'm open to other suggestions.
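
One way to reconcile the try/except conversion with the earlier request to name the offending argument is a small shared helper. The sketch below is an assumption, not code from the PR: `parse_slice_arg` is a hypothetical name, and it takes the literal's text directly (what the comment above describes `.this` holding) so that it runs without sqlglot.

```python
from typing import Optional


def parse_slice_arg(raw: Optional[str], arg_name: str) -> Optional[int]:
    """Convert the text of a slice argument to an int, or None if absent.

    Hypothetical helper: `raw` stands in for the literal's text, and
    `arg_name` is one of "start", "stop" or "step".
    """
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Name the argument so the user knows which part of the slice is wrong.
        raise ValueError(
            f"SLICE function {arg_name} argument must be an integer literal or absent"
        ) from None


print(parse_slice_arg("2", "start"))   # 2
print(parse_slice_arg("-1", "stop"))   # -1
print(parse_slice_arg(None, "step"))   # None
```

Calling the same helper for `start`, `stop` and `step` keeps a single try/except in one place while still producing a distinct message for each argument.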

if start_idx < 0 and stop_idx < 0:
# Early return if start index is greater than stop index
# e.g., "hello"[-2:-4] should return empty string
if start_idx > stop_idx:

Should be >= since "hello"[-2:-2] returns an empty string (the upper bound is exclusive).
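
The boundary case can be checked directly in Python. The sketch below uses a hypothetical predicate, `both_negative_is_empty`, to model the early-return condition with `>=`; it is not the PR's code, only an illustration of why equal negative bounds must also produce an empty slice.

```python
def both_negative_is_empty(start_idx: int, stop_idx: int) -> bool:
    """True when a slice with two negative bounds is empty for any string.

    Hypothetical predicate modelling the early-return check.
    """
    assert start_idx < 0 and stop_idx < 0
    # The stop bound is exclusive, so equal bounds select nothing.
    return start_idx >= stop_idx


print(repr("hello"[-2:-2]))            # ''
print(repr("hello"[-2:-4]))            # ''
print(both_negative_is_empty(-2, -2))  # True
print(both_negative_is_empty(-4, -2))  # False
```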

Comment on lines 76 to 78
def bad_slice_4():
# Unsupported slicing: reversed
return Customers(name[::-1])
# Unsupported slicing: non-integer start
return Customers(name[datetime.datetime.now() : -1 :])

What about testing with non-literals?



def step_slicing():
return Customers.WHERE(name == "Jane Smith")(

Since the test generates the answers, why not just test on the entire table (it's only 20 rows)?

@vineetg3 vineetg3 self-assigned this Feb 26, 2025
Successfully merging this pull request may close these issues.

Add support for more slicing cases with SUBSTR
3 participants