Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated chunker.py #986

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Updated chunker.py #986

wants to merge 1 commit into from

Conversation

Ranjana761
Copy link

Summary

The improved implementation addresses several key issues and adds important features:

  • Generic Type Support: Added proper generic type support using TypeVar, making the chunker work with any sequence type, not just text.
  • Better Error Handling:
  • Empty Input Handling: Properly handles empty input sequences by returning an empty list
  • Improved Documentation:

Copy link
Collaborator

@HenryL27 HenryL27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstrings and validations are appreciated, thank you.

It appears that some unit tests are failing. Can you make sure your implementation passes the tests? If the current implementation fails in some edge case can you add that to the tests? Otherwise I don't understand why it needs to be changed.
(poetry run pytest lib/sycamore/sycamore/tests/unit/functions/test_text_chunker.py)

Also please add a few test cases for the negative parameters

Comment on lines +77 to +79
if (start == 0 or
len(chunk) >= self._chunk_overlap_token_count or
end == len(tokens)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to understand why we need all these conditions. Is there ever a case, given the validations added in __init__ where this can evaluate to False?

if this is not the last chunk, then
len(chunk) = _chunk_token_count >= _chunk_overlap_token_count

if this is the last chunk then we add it anyway.

I think you only need the if start == 0 case for when there is only one rather small chunk, in which case it's also the last chunk.

Comment on lines +73 to +75
end = min(start + self._chunk_token_count, len(tokens))
chunk = tokens[start:end]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python can understand out-of-bounds slices and truncates appropriately. e.g.

>>> "01234"[2:10]
"234"

Comment on lines +83 to +84
if end == len(tokens):
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can just exit the for loop because we're at the end of the for loop in this case, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: I think this is the cause of the test failures - if the natural chunk boundary lines up exactly with the end of the sequence then you'll break prematurely, right?
overlap = 2, len = 5
"ABCDEFGH" -> "ABCDE", "DEFGH" instead of "ABCDE", "DEFGH", "GH"

Comment on lines +73 to +84
end = min(start + self._chunk_token_count, len(tokens))
chunk = tokens[start:end]

# Add chunk if it's the first chunk, maintains minimum size, or is the last piece
if (start == 0 or
len(chunk) >= self._chunk_overlap_token_count or
end == len(tokens)):
chunks.append(chunk)

# If we've processed all tokens, break
if end == len(tokens):
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting all these together, you can rewrite the loop more concisely as

for start in range(0, len(tokens), stride):
    chunks.append(tokens[start: start + self._chunk_token_count])

which is equivalent to the original list comprehension

Comment on lines +4 to 9
T = TypeVar('T')

class Chunker:
class Chunker(Generic[T]):
@abstractmethod
def chunk(self, tokens: list[Any]) -> list[Any]:
def chunk(self, tokens: List[T]) -> List[List[T]]:
pass
Copy link
Contributor

@MarkLindblad MarkLindblad Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generics were added in python 3.12. We want to be able to support >=3.9,<3.13. You will need to find another way of writing this.

python = ">=3.9,<3.13"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants