Merge branch 'development' into feat-cohere
bhavnicksm authored Jan 6, 2025
2 parents 90e5ee4 + 3e3bef7 commit 3d6e6c1
Showing 3 changed files with 29 additions and 18 deletions.
16 changes: 12 additions & 4 deletions README.md
@@ -23,11 +23,19 @@ _The no-nonsense RAG chunking library that's lightweight, lightning-fast, and re

 </div>

-so i found myself making another RAG bot (for the 2342148th time) and meanwhile, explaining to my juniors about why we should use chunking in our RAG bots, only to realise that i would have to write chunking all over again unless i use the bloated software library X or the extremely feature-less library Y. _WHY CAN I NOT HAVE SOMETHING JUST RIGHT, UGH?_
+Ever found yourself building yet another RAG bot (your 2,342,148th one, but who's counting?), only to hit that all-too-familiar wall? You know the one - where you're stuck choosing between:

-Can't i just install, import and run chunking and not have to worry about dependencies, bloat, speed or other factors?
+- Library X: A behemoth that takes forever to install and probably includes three different kitchen sinks
+- Library Y: So bare-bones it might as well be a "Hello World" program
+- Writing it yourself? For the 2,342,149th time... sigh

-Well, with chonkie you can! (chonkie boi is a gud boi)
+And you think to yourself:
+
+> "WHY CAN'T THIS JUST BE SIMPLE?!" </br>
+> "Why do I need to choose between bloated and bare-bones?" </br>
+> "Why can't I just install, import, and CHONK?!" </br>
+Well, look no further than Chonkie! (a chonkie boi is a gud boi 🦛💕)

 **🚀 Feature-rich**: All the CHONKs you'd ever need </br>
 **✨ Easy to use**: Install, Import, CHONK </br>
@@ -37,7 +45,7 @@ Well, with chonkie you can! (chonkie boi is a gud boi)
 **🦛 Cute CHONK mascot**: psst it's a pygmy hippo btw </br>
 **❤️ [Moto Moto](#acknowledgements)'s favorite python library** </br>

-What're you waiting for, **just CHONK it**!
+**Chonkie** is a chunking library that "**just works™**". So what're you waiting for, **just CHONK it**!

 # Installation

8 changes: 6 additions & 2 deletions src/chonkie/chunker/recursive.py
@@ -53,9 +53,13 @@ def _split_text(self,
         # Usually a good idea to check if there are any splits that are too short in characters
         # and then merge them
         merged_splits = []
-        for split in splits:
+        for i, split in enumerate(splits):
             if len(split) < self.min_characters_per_chunk:
-                merged_splits[-1] += split
+                if merged_splits:
+                    merged_splits[-1] += split
+                else:
+                    splits[i+1] = split + splits[i+1] # When merge splits is empty, we merge the current split with the next split
+                    continue
             else:
                 merged_splits.append(split)
         splits = merged_splits
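
The merge logic in this hunk can be sketched as a standalone function. This is a minimal illustration, not chonkie's implementation: `merge_short_splits` and its `carry` variable are hypothetical names. The forward merge in the hunk indexes `splits[i+1]`, which would raise an `IndexError` if a short split is also the last one and nothing has been merged yet (for example, a single short split). The sketch below carries the short text forward instead, which is one possible way to avoid that case.

```python
from typing import List


def merge_short_splits(splits: List[str], min_characters: int) -> List[str]:
    """Merge splits shorter than min_characters into a neighbouring split.

    Hypothetical standalone helper for illustration; not part of chonkie's API.
    """
    merged: List[str] = []
    carry = ""  # short leading text waiting to be prepended to the next split
    for split in splits:
        split = carry + split
        carry = ""
        if len(split) < min_characters:
            if merged:
                # Usual case: glue the short piece onto the previous split.
                merged[-1] += split
            else:
                # Nothing to merge into yet: hold the text for the next split.
                carry = split
        else:
            merged.append(split)
    if carry:
        # Every split was short; keep the text rather than drop it.
        merged.append(carry)
    return merged
```

With `min_characters=5`, the input `["ab", "hello world", "x"]` collapses into one split, and a lone short split such as `["ab"]` is returned unchanged instead of raising.
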
23 changes: 11 additions & 12 deletions src/chonkie/chunker/token.py
@@ -85,6 +85,14 @@ def _create_chunks(
             current_index = end_index - overlap_length

         return chunks
+
+    def _token_group_generator(self, tokens: List[int]) -> Generator[List[int], None, None]:
+        """Generate chunks from a list of tokens."""
+        for start in range(0, len(tokens), self.chunk_size - self.chunk_overlap):
+            end = min(start + self.chunk_size, len(tokens))
+            yield tokens[start:end]
+            if end == len(tokens):
+                break

     def chunk(self, text: str) -> List[Chunk]:
         """Split text into overlapping chunks of specified token size.
@@ -102,9 +110,8 @@ def chunk(self, text: str) -> List[Chunk]:
         # Encode full text
         text_tokens = self._encode(text)

-        # Calculate chunk positions
-        token_groups = [text_tokens[start_index : min(start_index + self.chunk_size, len(text_tokens))]
-                        for start_index in range(0, len(text_tokens), self.chunk_size - self.chunk_overlap)]
+        # Calculate token groups and counts
+        token_groups = list(self._token_group_generator(text_tokens))
         token_counts = [len(toks) for toks in token_groups]

         # decode the token groups into the chunk texts
@@ -115,12 +122,6 @@ def chunk(self, text: str) -> List[Chunk]:

         return chunks

-    def _token_group_generator(self, tokens: List[int]) -> Generator[List[int], None, None]:
-        """Generate chunks from a list of tokens."""
-        for start in range(0, len(tokens), self.chunk_size - self.chunk_overlap):
-            end = min(start + self.chunk_size, len(tokens))
-            yield tokens[start:end]
-
     def _process_batch(self,
                        chunks: List[Tuple[List[int], int, int]],
                        full_text: str) -> List[Chunk]:
@@ -153,9 +154,7 @@ def _process_text_batch(self, texts: List[str]) -> List[List[Chunk]]:
                 continue

             # get the token groups
-            token_groups = []
-            for token_group in self._token_group_generator(tokens):
-                token_groups.append(token_group)
+            token_groups = list(self._token_group_generator(tokens))

             # get the token counts
             token_counts = [len(token_group) for token_group in token_groups]
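
To see what the added `break` in `_token_group_generator` changes, here is a small standalone sketch. It assumes `chunk_size - chunk_overlap` is positive. The free function `token_groups` and its `stop_at_end` flag are hypothetical, introduced only to compare the old and new behaviour side by side. They are not part of chonkie.

```python
from typing import Generator, List


def token_groups(tokens: List[int], chunk_size: int, chunk_overlap: int,
                 stop_at_end: bool = True) -> Generator[List[int], None, None]:
    """Yield overlapping windows of tokens (stand-in for the method in the diff)."""
    step = chunk_size - chunk_overlap  # assumed to be positive
    for start in range(0, len(tokens), step):
        end = min(start + chunk_size, len(tokens))
        yield tokens[start:end]
        if stop_at_end and end == len(tokens):
            # This window already reached the last token, so any later
            # window would only repeat the tail of this one.
            break


tokens = list(range(10))
with_break = list(token_groups(tokens, chunk_size=4, chunk_overlap=2))
without_break = list(token_groups(tokens, chunk_size=4, chunk_overlap=2,
                                  stop_at_end=False))

print(with_break)     # 4 groups, the last one ends at the final token
print(without_break)  # 5 groups, the extra one is [8, 9]
```

Without the early exit, the loop emits a trailing group `[8, 9]` that is fully contained in the previous group `[6, 7, 8, 9]`. With the `break`, the generator stops once a window has reached the end of the token list.
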