-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated chunker.py #986
base: main
Are you sure you want to change the base?
Updated chunker.py #986
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstrings and validations are appreciated, thank you.
It appears that some unit tests are failing. Can you make sure your implementation passes the tests? If the current implementation fails in some edge case can you add that to the tests? Otherwise I don't understand why it needs to be changed.
(poetry run pytest lib/sycamore/sycamore/tests/unit/functions/test_text_chunker.py
)
Also please add a few test cases for the negative parameters
if (start == 0 or | ||
len(chunk) >= self._chunk_overlap_token_count or | ||
end == len(tokens)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to understand why we need all these conditions. Is there ever a case, given the validations added in __init__
where this can evaluate to False?
if this is not the last chunk, then
len(chunk) = _chunk_token_count >= _chunk_overlap_token_count
if this is the last chunk then we add it anyway.
I think you only need the if start == 0 case for when there is only one rather small chunk, in which case it's also the last chunk.
end = min(start + self._chunk_token_count, len(tokens)) | ||
chunk = tokens[start:end] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python can understand out-of-bounds slices and truncates appropriately. e.g.
>>> "01234"[2:10]
"234"
if end == len(tokens): | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can just exit the for loop because we're at the end of the for loop in this case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: I think this is the cause of the test failures - if the natural chunk boundary lines up exactly with the end of the sequence then you'll break prematurely, right?
overlap = 2, len = 5
"ABCDEFGH" -> "ABCDE", "DEFGH" instead of "ABCDE", "DEFGH", "GH"
end = min(start + self._chunk_token_count, len(tokens)) | ||
chunk = tokens[start:end] | ||
|
||
# Add chunk if it's the first chunk, maintains minimum size, or is the last piece | ||
if (start == 0 or | ||
len(chunk) >= self._chunk_overlap_token_count or | ||
end == len(tokens)): | ||
chunks.append(chunk) | ||
|
||
# If we've processed all tokens, break | ||
if end == len(tokens): | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
putting all these together, you can rewrite the loop more concisely as
for start in range(0, len(tokens), stride):
chunks.append(tokens[start: start + self._chunk_token_count])
which is equivalent to the original list comprehension
T = TypeVar('T') | ||
|
||
class Chunker: | ||
class Chunker(Generic[T]): | ||
@abstractmethod | ||
def chunk(self, tokens: list[Any]) -> list[Any]: | ||
def chunk(self, tokens: List[T]) -> List[List[T]]: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generics were added in python 3.12. We want to be able to support >=3.9,<3.13. You will need to find another way of writing this.
sycamore/lib/sycamore/pyproject.toml
Line 16 in 96b61e0
python = ">=3.9,<3.13" |
Summary
The improved implementation addresses several key issues and adds important features: