[FEAT] Support regular expressions for splitting in RecursiveChunker #144

Open
2 tasks done
sophiehenning opened this issue Jan 7, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@sophiehenning

📋 Quick Check

  • I've checked this feature isn't already implemented or proposed
  • This feature is relevant to Chonkie's purpose (text chunking for RAG)

💡 Feature Description

Currently, only plain strings can be used as split delimiters in RecursiveRules. It would be great if regular expressions could be used as well, e.g., to better catch bold-faced headings in PDFs converted to Markdown format. It would also be good to be able to choose whether to split before or after the match.

🛠️ Implementation Approach

Example of how this feature might work

from chonkie import RecursiveChunker, RecursiveRules, RecursiveLevel
import re

# Section headings like "**2 Related Work**"; to be used together with similar
# regexes for further levels, e.g. "**2.1 Topic 1**".
# (?:\s+\w+)+ allows multi-word titles; a single \w+ would not match "Related Work".
first_level_re = re.compile(r"(\n\*\*\d+(?:\s+\w+)+\s*\*\*)")
# regexes and split_before_match are the 2 proposed new attributes of RecursiveLevel.
custom_level = RecursiveLevel(regexes=[first_level_re], split_before_match=True)
custom_rules = RecursiveRules([custom_level])
custom_chunker = RecursiveChunker(rules=custom_rules)

Your implementation idea

In recursive.py:

    def _split_text(self,
                    text: str,
                    rule: RecursiveLevel,
                    sep: str = "🦛") -> List[str]:
        """Split the text into chunks using the delimiters or regexes."""
        # At every delimiter or regex match, insert the sep.
        # Assumption: each rule uses at most one of the two options;
        # this would need to be ensured in the RecursiveLevel constructor.
        if rule.delimiters or rule.regexes:
            if rule.delimiters:
                for delimiter in rule.delimiters:
                    text = text.replace(delimiter, delimiter + sep)
            if rule.regexes:
                # \1 assumes each regex wraps its whole pattern in one capture group.
                for regex in rule.regexes:
                    if rule.split_before_match:
                        text = re.sub(regex, sep + r"\1", text)
                    else:
                        text = re.sub(regex, r"\1" + sep, text)
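
To make the intended behaviour concrete, here is a minimal standalone sketch of the same idea outside of Chonkie. The function name `split_on_regexes` and the sample text are made up for illustration and are not part of the library. It uses `\g<0>` (the whole match) instead of `\1`, so the patterns do not need a capture group.

```python
import re
from typing import List

SEP = "🦛"  # placeholder separator, as in the snippet above


def split_on_regexes(text: str,
                     regexes: List[re.Pattern],
                     split_before_match: bool = True,
                     sep: str = SEP) -> List[str]:
    """Mark every regex match with sep, then split the text on sep."""
    for regex in regexes:
        # \g<0> refers to the whole match, so no capture group is required.
        # Note: sep must not contain backslashes, since it is part of the
        # replacement template.
        if split_before_match:
            replacement = sep + r"\g<0>"
        else:
            replacement = r"\g<0>" + sep
        text = regex.sub(replacement, text)
    # Drop empty pieces, e.g. when the text starts with a match.
    return [part for part in text.split(sep) if part]


# Hypothetical sample: bold-faced numbered headings in converted Markdown.
heading = re.compile(r"\n\*\*\d+(?:\s+\w+)+\s*\*\*")
doc = ("Intro text.\n**1 Introduction**\nBody one."
       "\n**2 Related Work**\nBody two.")

before = split_on_regexes(doc, [heading], split_before_match=True)
after = split_on_regexes(doc, [heading], split_before_match=False)
```

With `split_before_match=True` each heading starts a new piece, so it stays attached to the section body that follows it; with `False` the heading ends up at the tail of the preceding piece.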

🎯 Why is this needed?

Section headings or other structural elements in a document can be much more variable than simple string delimiters can express.

@sophiehenning sophiehenning added the enhancement New feature or request label Jan 7, 2025
@bhavnicksm
Collaborator

Hey @sophiehenning! 😄

Thanks for opening a feature request! That's a brilliant suggestion~ I would love to have this in chonkie!

A few comments that I would like to make on this are (in no particular order):

  • I don't think it's necessary to allow only one of delimiters or regexes. You can pass in both at the same level as well, for example delimiters=['\n'] and regexes=[r'\b(https?|ftp|file)://\S+'], and they would work together just fine! Though I personally would prefer they be in two separate levels stylistically, we should not hold the user to it.
  • We can probably combine the regexes into one single long regex with something like "|".join(regexes) and compile it once for efficiency, I think. My only doubt is whether it would create issues with adding the separator...?
  • Just a minor nit, but I'd prefer if split_before_match: bool was actually just boundary: Literal["before", "after"] for a shorter argument and explicit option pass from the user end.
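
The three points above could fit together roughly as follows. This is only a sketch under assumptions: `mark_boundaries` is a hypothetical helper, not Chonkie's API, and it applies `boundary` to plain delimiters too, whereas the snippet in the issue always puts the separator after a delimiter. On the separator doubt: in a joined pattern the capture groups get renumbered, and a `\1` that belongs to an alternative that did not match expands to an empty string, so a `\1`-based replacement can silently drop text. Using the whole match via `m.group(0)` avoids that.

```python
import re
from typing import List, Literal


def mark_boundaries(text: str,
                    delimiters: List[str],
                    regexes: List[str],
                    boundary: Literal["before", "after"] = "after",
                    sep: str = "🦛") -> str:
    """Insert sep before or after every delimiter or regex match."""
    # Escape plain delimiters so they can share one alternation with the
    # regexes; wrap each regex in a non-capturing group so that "|" inside
    # it cannot leak into its neighbours.
    parts = [re.escape(d) for d in delimiters]
    parts += [f"(?:{r})" for r in regexes]
    if not parts:
        return text
    combined = re.compile("|".join(parts))

    def add_sep(m: re.Match) -> str:
        # m.group(0) is the whole match, independent of group numbering.
        if boundary == "before":
            return sep + m.group(0)
        return m.group(0) + sep

    return combined.sub(add_sep, text)


sample = "See https://example.com/docs for details.\nNext line."
url_re = r"\b(https?|ftp|file)://\S+"

marked_after = mark_boundaries(sample, ["\n"], [url_re], boundary="after")
marked_before = mark_boundaries(sample, ["\n"], [url_re], boundary="before")
chunks = marked_after.split("🦛")
```

One caveat with this design: a single alternation finds non-overlapping matches in one left-to-right pass, so the result can differ from running each regex in sequence when matches overlap. The regexes are also passed as strings here, so per-pattern flags would have to be written inline, e.g. `(?i:...)`.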

Would you be willing to make a PR for this? I would be happy to support you with it~

Thanks! 😊

@sophiehenning
Author

sophiehenning commented Jan 8, 2025

Hi @bhavnicksm,
Thanks for the quick feedback and further suggestions!
I'd be willing to make a PR for this, but I need to check this with my employer first. Do I understand correctly from CONTRIBUTING.md that you do not require contributors to sign any kind of Contributor License Agreement?

@sophiehenning sophiehenning closed this as not planned Jan 8, 2025
@sophiehenning sophiehenning reopened this Jan 8, 2025
@bhavnicksm
Collaborator

Hey @sophiehenning!

> Do I understand correctly from CONTRIBUTING.md that you do not require contributors to sign any kind of Contributor License Agreement?

Yes, there's no CLA required on Chonkie to contribute! Since Chonkie is under the MIT License, all contributions made are open-source and free to use for everyone.

I hope that answers your question properly 😄

Thanks!
