[FEAT] Support regular expressions for splitting in RecursiveChunker #144

sophiehenning · 2025-01-07T15:59:17Z

📋 Quick Check

I've checked this feature isn't already implemented or proposed
This feature is relevant to Chonkie's purpose (text chunking for RAG)

💡 Feature Description

Currently, only simple strings can be used for splitting texts in RecursiveRules. It would be great if regular expressions could be used as well, e.g., to better catch bold-faced headings in PDFs converted to Markdown format. Moreover, it would be good to have the flexibility to split before or after the match.

🛠️ Implementation Approach

Example of how this feature might work

from chonkie import RecursiveChunker, RecursiveRules, RecursiveLevel
import re

# Matches section headings like "**2 Related Work**"; to be used together with
# similar regexes for further levels, e.g. "**2.1 Topic 1**".
first_level_re = re.compile(r"(\n\*\*\d(?:\s\w+)+\s*\*\*)")
# regexes and split_before_match are the 2 proposed new attributes for the RecursiveLevel class.
custom_level = RecursiveLevel(regexes=[first_level_re], split_before_match=True)
custom_rules = RecursiveRules([custom_level])
custom_chunker = RecursiveChunker(rules=custom_rules)
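
As a quick sanity check of the heading idea, the snippet below runs a heading pattern against a small Markdown sample using only the standard `re` module. The sample text is made up for illustration, and the pattern is written so that multi-word titles such as "**2 Related Work**" are accepted; it does not touch any Chonkie API.

```python
import re

# Newline, then a bold heading: "**", one digit, one or more words, "**".
first_level_re = re.compile(r"(\n\*\*\d(?:\s\w+)+\s*\*\*)")

# Hypothetical sample of a PDF converted to Markdown.
sample = (
    "Intro text."
    "\n**1 Introduction**\nBody."
    "\n**2 Related Work**\nMore body."
)

# findall returns the content of the single capturing group for each match.
matches = first_level_re.findall(sample)
for heading in matches:
    print(repr(heading))
```

Note that `\w+` on its own only covers a single word, so a pattern of the form `\d\s\w+\s*\*\*` would miss two-word titles; the repeated `(?:\s\w+)+` group is what lets them through.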

Your implementation idea

In recursive.py:

    def _split_text(self,
                    text: str,
                    rule: RecursiveLevel,
                    sep: str = "🦛") -> List[str]:
        """Split the text into chunks using the delimiters or regexes."""
        # At every delimiter or regex match, add the sep.
        # Assumption: each rule uses at most one of the two options;
        # this would need to be ensured in the RecursiveLevel constructor.
        if rule.delimiters or rule.regexes:
            if rule.delimiters:
                for delimiter in rule.delimiters:
                    text = text.replace(delimiter, delimiter + sep)
            if rule.regexes:
                for regex in rule.regexes:
                    # r"\1" requires each regex to wrap its whole pattern in one capturing group.
                    if rule.split_before_match:
                        text = re.sub(regex, sep + r"\1", text)
                    else:
                        text = re.sub(regex, r"\1" + sep, text)
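
To make the idea easier to try out, here is a minimal standalone sketch of the same marking-and-splitting step, outside of Chonkie. The function name `split_text` and its keyword arguments are hypothetical stand-ins for the proposed `RecursiveLevel` attributes, and the final split on `sep` is an assumption about how the marked text would be consumed. It uses the whole match (`m.group(0)`) instead of `\1`, so the regexes do not need a capturing group.

```python
import re
from typing import List, Optional, Pattern


def split_text(
    text: str,
    delimiters: Optional[List[str]] = None,
    regexes: Optional[List[Pattern[str]]] = None,
    split_before_match: bool = False,
    sep: str = "🦛",
) -> List[str]:
    """Mark split points with sep, then split on sep (illustrative only)."""
    # Plain string delimiters: the split point goes after the delimiter.
    for delimiter in delimiters or []:
        text = text.replace(delimiter, delimiter + sep)
    # Regexes: the split point goes before or after the whole match.
    for regex in regexes or []:
        if split_before_match:
            text = regex.sub(lambda m: sep + m.group(0), text)
        else:
            text = regex.sub(lambda m: m.group(0) + sep, text)
    # Drop empty pieces that arise when a match sits at the very start or end.
    return [part for part in text.split(sep) if part]


heading_re = re.compile(r"\n\*\*\d(?:\s\w+)+\s*\*\*")
doc = "Intro.\n**1 Intro**\nBody one.\n**2 Related Work**\nBody two."
chunks = split_text(doc, regexes=[heading_re], split_before_match=True)
```

With `split_before_match=True`, each heading starts a new piece, so a heading stays attached to the body text that follows it. Concatenating the pieces gives back the original text.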

🎯 Why is this needed?

Section headings or other structural elements in a document can be much more variable than simple string delimiters can express.

bhavnicksm · 2025-01-07T17:04:25Z

Hey @sophiehenning! 😄

Thanks for opening a feature request! That's a brilliant suggestion~ I would love to have this in chonkie!

A few comments that I would like to make on this are (in no particular order):

I believe it's not strictly necessary to allow only one of delimiters or regexes. I think you can pass in both at the same level as well. For example, delimiters=['\n'] and regexes=[r'\b(https?|ftp|file)://\S+'] would work together just fine! Though I personally would prefer they be in two separate levels stylistically, we should not hold the user to it.
We can probably combine the regexes into one single long regex with something like "|".join(regexes) and compile it for efficiency, I think. My only doubt is whether it would create issues with adding the separator...?
Just a minor nit, but I'd prefer if split_before_match: bool was actually just boundary: Literal["before", "after"], for a shorter argument name and a more explicit choice on the user's end.
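
One possible way to combine these three points is sketched below. It is not Chonkie code: `split_on_patterns` and its parameters are hypothetical names. Each pattern is wrapped in a non-capturing group before joining with "|", and the separator is attached to the whole match rather than to `\1`. That should avoid the separator concern, since with a joined regex group 1 is only populated when the first alternative is the one that matched.

```python
import re
from typing import List, Literal


def split_on_patterns(
    text: str,
    patterns: List[str],
    boundary: Literal["before", "after"] = "after",
    sep: str = "🦛",
) -> List[str]:
    """Split text at every match of any pattern (illustrative only)."""
    # Wrap each alternative so "|" inside one pattern cannot leak into another.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    if boundary == "before":
        marked = combined.sub(lambda m: sep + m.group(0), text)
    else:
        marked = combined.sub(lambda m: m.group(0) + sep, text)
    return [part for part in marked.split(sep) if part]


patterns = [r"\n\*\*\d(?:\s\w+)+\s*\*\*", r"\bhttps?://\S+"]
text = "See https://example.com now.\n**1 Intro**\nBody."
before = split_on_patterns(text, patterns, boundary="before")
after = split_on_patterns(text, patterns, boundary="after")
```

One caveat with joining: capturing groups inside the individual patterns are renumbered in the combined regex, so patterns that rely on numbered backreferences would need extra care.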

Would you be willing to make a PR for this? I would be happy to support you with it~

Thanks! 😊

sophiehenning · 2025-01-08T12:26:24Z

Hi @bhavnicksm,
Thanks for the quick feedback and further suggestions!
I'd be willing to make a PR for this, but I need to check this with my employer first. Do I understand correctly from CONTRIBUTING.md that you do not require contributors to sign any kind of Contributor License Agreement?

bhavnicksm · 2025-01-08T12:52:54Z

Hey @sophiehenning!

> Do I understand correctly from CONTRIBUTING.md that you do not require contributors to sign any kind of Contributor License Agreement?

Yes, there's no CLA required on Chonkie to contribute! Since Chonkie is under the MIT License, all contributions made are open-source and free to use for everyone.

I hope that answers your question properly 😄

Thanks!

sophiehenning added the enhancement label Jan 7, 2025

sophiehenning assigned bhavnicksm Jan 7, 2025

sophiehenning closed this as not planned Jan 8, 2025

sophiehenning reopened this Jan 8, 2025

finnschwall mentioned this issue Jan 15, 2025: [FEAT] Advanced regex based parsing + XML+ chunk metadata #148 (open, 1 task)
