New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Updated chunker.py #986

Open

Ranjana761 wants to merge 1 commit into aryn-ai:main from Ranjana761:patch-1

Ranjana761 commented Oct 26, 2024

Summary

The improved implementation addresses several key issues and adds important features:

Generic Type Support: Added proper generic type support using TypeVar, making the chunker work with any sequence type, not just text.
Better Error Handling:
Empty Input Handling: Properly handles empty input sequences by returning an empty list
Improved Documentation:


          Update chunker.py

ed2c015

HenryL27 reviewed

View reviewed changes

Collaborator

HenryL27 left a comment

Docstrings and validations are appreciated, thank you.

It appears that some unit tests are failing. Can you make sure your implementation passes the tests? If the current implementation fails in some edge case can you add that to the tests? Otherwise I don't understand why it needs to be changed.
(poetry run pytest lib/sycamore/sycamore/tests/unit/functions/test_text_chunker.py)

Also please add a few test cases for the negative parameters

lib/sycamore/sycamore/functions/chunker.py

Comment on lines +77 to +79

+                          if (start == 0 or
+                              len(chunk) >= self._chunk_overlap_token_count or
+                              end == len(tokens)):

Collaborator

HenryL27 Oct 26, 2024

I'm struggling to understand why we need all these conditions. Is there ever a case, given the validations added in __init__ where this can evaluate to False?

if this is not the last chunk, then
len(chunk) = _chunk_token_count >= _chunk_overlap_token_count

if this is the last chunk then we add it anyway.

I think you only need the if start == 0 case for when there is only one rather small chunk, in which case it's also the last chunk.

lib/sycamore/sycamore/functions/chunker.py

Comment on lines +73 to +75

		end = min(start + self._chunk_token_count, len(tokens))
		chunk = tokens[start:end]

Collaborator

HenryL27 Oct 26, 2024

python can understand out-of-bounds slices and truncates appropriately. e.g.

>>> "01234"[2:10]
"234"

lib/sycamore/sycamore/functions/chunker.py

Comment on lines +83 to +84

		if end == len(tokens):
		break

Collaborator

HenryL27 Oct 26, 2024

we can just exit the for loop because we're at the end of the for loop in this case, right?

Collaborator

HenryL27 Oct 26, 2024

edit: I think this is the cause of the test failures - if the natural chunk boundary lines up exactly with the end of the sequence then you'll break prematurely, right?
overlap = 2, len = 5
"ABCDEFGH" -> "ABCDE", "DEFGH" instead of "ABCDE", "DEFGH", "GH"

lib/sycamore/sycamore/functions/chunker.py

Comment on lines +73 to +84

+                          end = min(start + self._chunk_token_count, len(tokens))
+                          chunk = tokens[start:end]
+                          # Add chunk if it's the first chunk, maintains minimum size, or is the last piece
+                          if (start == 0 or
+                              len(chunk) >= self._chunk_overlap_token_count or
+                              end == len(tokens)):
+                              chunks.append(chunk)
+                          # If we've processed all tokens, break
+                          if end == len(tokens):
+                              break

Collaborator

HenryL27 Oct 26, 2024

putting all these together, you can rewrite the loop more concisely as

for start in range(0, len(tokens), stride):
    chunks.append(tokens[start: start + self._chunk_token_count])

which is equivalent to the original list comprehension

MarkLindblad reviewed

View reviewed changes

lib/sycamore/sycamore/functions/chunker.py

Comment on lines +4 to 9

    
              T = TypeVar('T')

              class Chunker:

              class Chunker(Generic[T]):

                  @abstractmethod

                  def chunk(self, tokens: list[Any]) -> list[Any]:

                  def chunk(self, tokens: List[T]) -> List[List[T]]:

                      pass

Contributor

MarkLindblad Oct 28, 2024 •

edited

Loading

Generics were added in python 3.12. We want to be able to support >=3.9,<3.13. You will need to find another way of writing this.

sycamore/lib/sycamore/pyproject.toml

Line 16 in 96b61e0

python = ">=3.9,<3.13"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet