
Use Docling v2 hierarchical chunking instead of the existing context-aware chunking implementation #350

Open
Tracked by #374
jwm4 opened this issue Nov 8, 2024 · 1 comment

jwm4 commented Nov 8, 2024

Context

The following was done recently:

This allows us to ensure that we're using Docling v2 and getting all the latest quality improvements, and we were able to do so without breaking the existing code. However, the solution winds up being suboptimal because it does not make full use of the power of Docling v2. Specifically, by getting all the Docling outputs in the Docling v1.x format via doc.legacy_document, it misses the opportunity to benefit from enhancements in the Docling v2 format. Furthermore, the context-aware chunker is confusing and idiosyncratic, and it has hardly any comments. It doesn't seem to handle some obvious edge cases (e.g., long document elements that don't fit into a single chunk), and there are no comments explaining why it doesn't.

Much of the work needed to do context-aware chunking is already handled by the Docling v2 hierarchical chunker. However, the Docling v2 hierarchical chunker does not constrain the size of the chunks it produces; often the chunks are very small (much smaller than we would want for most practical uses such as SDG) and sometimes they are very large (and thus unusable for SDG without splitting or truncating). For those reasons, we need to post-process the outputs of the Docling v2 hierarchical chunker to split up large chunks and/or merge small chunks as needed.

There is a discussion in the docling repo about how to do this well, and I have proposed a candidate implementation that I reference in the discussion. Here is a basic overview of how it works:

  1. First, it goes through all the chunks (from the Docling hierarchical chunker) that contain multiple doc_items (e.g., itemized lists) and checks whether they are too long for the context window. If they are, it splits them into smaller chunks along the doc_item boundaries. For example, if it gets a list with 5 elements of 200 tokens each and the context window is 500 tokens, it will put the first two list elements into one chunk, the next two into a second chunk, and the last one into a third chunk. At this stage, if some doc_item is by itself too long for the context window, it becomes its own chunk.
  2. Next it goes through all the chunks and it finds chunks that are too large for the context window and splits them using a naive text splitter (currently semchunk).
  3. Finally, it goes through all the chunks from first to last, and whenever it encounters a sequence of chunks that combined would fit into a single context window, it merges them. This greedy, first-to-last approach is probably not ideal. For example, say you have 11 chunks each of length 50 and a context window of 512. This approach will observe that the first 10 chunks fit into the context window, merge them into one chunk, and then observe that there is only one chunk left of length 50 and keep it as is, so you wind up with one chunk of length 500 and one of length 50. However, for most purposes it probably would have been better to produce one chunk of length 300 and one of length 250, so you don't wind up with one tiny chunk at the end that is too small to be very useful. Rather than doing greedy, first-to-last merging, it would probably be better to do a non-greedy search for a solution that minimizes the number of chunks AND provides the greatest minimum chunk size. However, addressing this limitation is probably not particularly urgent or important. It is not in scope for this issue, but could be considered for a future issue.
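To make the three passes concrete, here is a hypothetical, simplified sketch. The function names, the toy whitespace tokenizer, and the word-boundary splitter (standing in for semchunk) are all illustrative, not part of Docling or of the actual proposed implementation:

```python
# Illustrative sketch of the three-pass post-processing described above.
# Assumption: a "chunk" is just a string, and tokens are whitespace-separated
# words. The real proposal works on Docling chunk objects and a real tokenizer.

def count_tokens(text):
    # Toy tokenizer: one token per whitespace-separated word.
    return len(text.split())

def split_along_doc_items(doc_items, max_tokens):
    """Pass 1: pack consecutive doc_items (e.g. list elements) into
    chunks that each fit the context window."""
    groups, current, current_len = [], [], 0
    for item in doc_items:
        n = count_tokens(item)
        if current and current_len + n > max_tokens:
            groups.append(current)
            current, current_len = [], 0
        current.append(item)
        current_len += n
    if current:
        groups.append(current)
    return ["\n".join(g) for g in groups]

def split_oversized(chunks, max_tokens):
    """Pass 2: naively split any chunk still over the limit (the real
    proposal uses semchunk; here we just cut on word boundaries)."""
    out = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_tokens:
            out.append(chunk)
        else:
            for i in range(0, len(words), max_tokens):
                out.append(" ".join(words[i:i + max_tokens]))
    return out

def merge_greedy(chunks, max_tokens):
    """Pass 3: greedy first-to-last merge of consecutive small chunks.
    This reproduces the 500/50 pathology described above."""
    merged, buffer, buffer_len = [], [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if buffer and buffer_len + n > max_tokens:
            merged.append("\n".join(buffer))
            buffer, buffer_len = [], 0
        buffer.append(chunk)
        buffer_len += n
    if buffer:
        merged.append("\n".join(buffer))
    return merged
```

With 11 chunks of 50 tokens and a 512-token window, merge_greedy yields chunks of 500 and 50 tokens, which is exactly the greedy pathology the non-greedy search would avoid.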

This implementation has gone through several rounds of revisions and incorporates a lot of changes from Panos on the Docling team. However, the Docling team still has some open concerns around my implementation:

  1. They don't like my dependency on semchunk, but we're still debating what alternatives to pursue.
  2. They would like the chunker to output a stream of chunks instead of building a full list of chunks before outputting all of them at once. That seems fine to me.

Once these open issues are resolved, some version of a fixed-size hierarchical chunker will hopefully be included in Docling. In that case, this issue in InstructLab could be as simple as calling the fixed-size hierarchical chunker in Docling. However, if we are in a hurry and/or want more control, we could use the existing hierarchical chunker in Docling and include some version of the code I propose for imposing a fixed size on its outputs.

Work to do

Either:

  • Wait for a fixed-size hierarchical chunker to be included in Docling.
  • Then use that fixed-size hierarchical chunker.

Or:

  • Use the existing (unlimited size) hierarchical chunker in Docling.
  • Add our own code to split and/or merge chunks as needed, possibly using my candidate implementation as a starting point.

Also, this work item will of course need extensive testing.

vagenas commented Dec 9, 2024

@jwm4 @aakankshaduggal @khaledsulayman

Wait for a fixed-size hierarchical chunker to be included in Docling. Then use that fixed-size hierarchical chunker.

This has now been released; please check out our recently announced Hybrid Chunker.

Let us know in case of any questions or help needed!
