📋 Quick Check
I've checked this feature isn't already implemented or proposed
This feature is relevant to Chonkie's purpose (text chunking for RAG)
💡 Feature Description
Currently, only simple strings can be used for splitting texts in RecursiveRules. It would be great if one additionally could use regular expressions for this purpose, e.g., to better catch bold-faced headings in PDFs converted to Markdown format. Moreover, it would be good to have the flexibility to split before or after the match.
🛠️ Implementation Approach
Example of how this feature might work
from chonkie import RecursiveChunker, RecursiveRules, RecursiveLevel
import re

# For section headings like "**2 Related Work**"; to be used together with similar
# regexes for further levels, e.g. "**2.1 Topic 1**"
first_level_re = re.compile(r"(\n\*\*\d\s\w+\s*\*\*)")

# regexes and split_before_match are the 2 new attributes for the RecursiveLevel class
custom_level = RecursiveLevel(regexes=[first_level_re], split_before_match=True)
custom_rules = RecursiveRules([custom_level])
custom_chunker = RecursiveChunker(rules=custom_rules)
Your implementation idea
In recursive.py:
def _split_text(self,
                text: str,
                rule: RecursiveLevel,
                sep: str = "🦛") -> List[str]:
    """Split the text into chunks using the delimiters."""
    # At every delimiter, replace it with the sep
    # Assumption: each rule uses at most one of the two options;
    # would need to ensure this in the RecursiveLevel constructor
    if rule.delimiters or rule.regexes:
        if rule.delimiters:
            for delimiter in rule.delimiters:
                text = text.replace(delimiter, delimiter + sep)
        if rule.regexes:
            for regex in rule.regexes:
                if rule.split_before_match:
                    text = re.sub(regex, sep + r"\1", text)
                else:
                    text = re.sub(regex, r"\1" + sep, text)
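To make the before/after behaviour concrete, here is a minimal standalone sketch of the same idea. It is not Chonkie code: `split_text` is a hypothetical helper, the final split on the separator is an assumed completion of the snippet above, and the heading pattern is an illustrative variant (it allows multi-word titles) rather than the one from the example.

```python
import re
from typing import List, Optional, Pattern

SEP = "🦛"  # same placeholder separator as in the proposal


def split_text(
    text: str,
    delimiters: Optional[List[str]] = None,
    regexes: Optional[List[Pattern[str]]] = None,
    split_before_match: bool = True,
    sep: str = SEP,
) -> List[str]:
    """Mark split points with `sep`, then split on it (hypothetical helper)."""
    for delimiter in delimiters or []:
        # Plain delimiters stay attached to the end of the preceding piece.
        text = text.replace(delimiter, delimiter + sep)
    for regex in regexes or []:
        # Each regex must wrap its whole pattern in one capturing group,
        # so that \1 refers to the full match.
        replacement = sep + r"\1" if split_before_match else r"\1" + sep
        text = regex.sub(replacement, text)
    # Drop empty pieces caused by a separator at the very start or end.
    return [piece for piece in text.split(sep) if piece]


# Illustrative first-level heading pattern (an assumption, not from the issue).
heading_re = re.compile(r"(\n\*\*\d+\s[^*\n]+\*\*)")
doc = "Intro text.\n**1 Introduction**\nBody one.\n**2 Related Work**\nBody two."

before = split_text(doc, regexes=[heading_re], split_before_match=True)
after = split_text(doc, regexes=[heading_re], split_before_match=False)
```

With `split_before_match=True` each heading starts a new piece together with its body; with `False` the heading stays at the end of the preceding piece. In both cases joining the pieces gives back the original text.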
🎯 Why is this needed?
Section headings or other structural elements in a document can be much more variable than simple string delimiters can express.
Thanks for opening a feature request! That's a brilliant suggestion~ I would love to have this in chonkie!
A few comments that I would like to make on this are (in no particular order):
I believe it's not strictly necessary to allow only one of delimiters or regexes. I think you can pass in both at the same level as well, e.g. delimiters = ['\n'] and regexes=r'\b(https?|ftp|file)://\S+', and they would be able to work together just fine! Though I personally would prefer they be in two separate levels stylistically, we should not hold the user to it.
We can probably combine the regexes into one single long regex with something like "|".join(regexes) and compile it for efficiency, I think. My only doubt is whether that would create issues with adding the separator...?
Just a minor nit, but I'd prefer if split_before_match: bool was actually just boundary: Literal["before", "after"] for a shorter argument and explicit option pass from the user end.
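One possible way around the separator concern, sketched under assumptions: pass a function to `re.sub` and use `match.group(0)`, so the separator placement no longer depends on group numbers such as `\1`. The name `mark_boundaries` is hypothetical, the patterns are taken as plain strings (flags on pre-compiled patterns would be lost), and the `boundary` parameter follows the `Literal["before", "after"]` suggestion above.

```python
import re
from typing import List, Literal


def mark_boundaries(
    text: str,
    patterns: List[str],
    boundary: Literal["before", "after"] = "before",
    sep: str = "🦛",
) -> str:
    """Insert `sep` before or after every match of any pattern (hypothetical helper)."""
    # Wrap each alternative in a non-capturing group so "|" cannot leak
    # across pattern boundaries.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))

    def place_sep(match: "re.Match[str]") -> str:
        # group(0) is always the whole match, whatever groups the patterns contain.
        whole = match.group(0)
        return sep + whole if boundary == "before" else whole + sep

    return combined.sub(place_sep, text)


patterns = [r"\b(https?|ftp|file)://\S+", r"\n\*\*\d+\s[^*\n]+\*\*"]
text = "See https://example.com now\n**1 Intro**\nBody"

pieces_before = mark_boundaries(text, patterns, "before").split("🦛")
pieces_after = mark_boundaries(text, patterns, "after").split("🦛")
```

Because the replacement is built from the whole match, the URL pattern's inner group `(https?|ftp|file)` does not interfere, which would not hold for a `\1`-based replacement on the joined regex.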
Would you be willing to make a PR for this? I would be happy to support you with it~
Hi @bhavnicksm,
Thanks for the quick feedback and further suggestions!
I'd be willing to make a PR for this, but I need to check this with my employer first. Do I understand correctly from CONTRIBUTING.md that you do not require contributors to sign any kind of Contributor License Agreement?
Do I understand correctly from CONTRIBUTING.md that you do not require contributors to sign any kind of Contributor License Agreement?
Yes, there's no CLA required on Chonkie to contribute! Since Chonkie is under the MIT License, all contributions made are open-source and free to use for everyone.