Currently, the HTML embedder splits a webpage into sections for each paragraph, code block, header etc.
This works pretty well for articles and blog posts, where information is divided that way and each paragraph can be used independently. However, it falls apart for most documentation, because it loses the relationship between a paragraph explaining code and the code it's explaining.
It would be cool if this splitting behavior were configurable. I think the default behavior is fine for most cases, but the ability to either split only on certain tokens or not split at all would be a great option to have.
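To make the request concrete, here's a rough sketch of what a configurable splitter could look like. Everything in it (`SplitMode`, `split_text`, the `separators` parameter) is a hypothetical name made up for illustration, not the embedder's actual API, and the "block" mode only approximates the current per-paragraph behavior by splitting on blank lines.

```python
import re
from enum import Enum


class SplitMode(Enum):
    # Hypothetical option names, not part of any existing API.
    BLOCK = "block"    # one section per blank-line-separated block
    TOKENS = "tokens"  # split only on caller-supplied separator strings
    NONE = "none"      # keep the whole page as a single section


def split_text(text, mode=SplitMode.BLOCK, separators=()):
    """Split extracted page text into sections according to `mode`."""
    if mode is SplitMode.NONE:
        parts = [text]
    elif mode is SplitMode.TOKENS:
        if not separators:
            raise ValueError("TOKENS mode needs at least one separator")
        # Escape the separators so they are matched literally.
        pattern = "|".join(re.escape(s) for s in separators)
        parts = re.split(pattern, text)
    else:
        # Stand-in for the default: every blank-line-separated block.
        parts = re.split(r"\n\s*\n", text)
    # Drop empty sections and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]
```

With `SplitMode.TOKENS` and a separator that only appears between topics, an explanatory paragraph and the code block after it stay in the same section.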
Yes, that makes sense. But most applications may still require splitting, since people use it for RAG and the chunk size needs to be moderate. Do you have any ideas in mind for the splitting strategy? One option I can think of is to convert the webpage to a Markdown-style format and chunk it like any other Markdown. Let me know if you have any other options.
Yeah I think the current splitting strategy makes sense and works well for most sites, it's just the few where it doesn't 😅
Something that groups content under its heading would be ideal for those remaining few, I think. If we can convert to Markdown to keep the context of heading, paragraph, code, etc., then that'd be even better!
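A minimal sketch of the heading-grouping idea, assuming the page has already been converted to Markdown by some other step. `chunk_by_heading` is a hypothetical helper name, not something that exists in the project. It starts a new chunk at each ATX heading (`#` to `######`) and ignores `#` lines inside fenced code blocks, so a paragraph and the code it explains end up in the same chunk.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")  # ATX heading such as "## Usage"
FENCE = re.compile(r"^(```|~~~)")      # opening or closing code fence


def chunk_by_heading(markdown):
    """Group Markdown lines into one chunk per heading."""
    chunks = []
    current = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            # Simplification: any fence line toggles the state.
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and current:
            # A heading outside a code block closes the previous chunk.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [c for c in chunks if c]
```

This doesn't handle indented fences, Setext (underlined) headings, or a maximum chunk size, so a long section would still need a second pass to keep chunks moderate for RAG.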