
Use Docling v2 hierarchical chunking instead of the existing context-aware chunking implementation #350

Open
Tracked by #374
jwm4 opened this issue Nov 8, 2024 · 1 comment

jwm4 commented Nov 8, 2024

Context

The following was done recently:

This allows us to ensure that we're using Docling v2 and getting all the latest quality improvements, and we were able to do so without breaking the existing code. However, the solution winds up being suboptimal because it does not make full use of the power of Docling v2. Specifically, by getting all the Docling outputs in the Docling v1.x format via doc.legacy_document, it misses the opportunity to benefit from enhancements in the Docling v2 format. Furthermore, the context-aware chunker is confusing and idiosyncratic, and it has hardly any comments. It doesn't seem to handle some obvious edge cases (e.g., long document elements that don't fit into a single chunk), and there are no comments explaining why it doesn't.

Much of the work needed to do context-aware chunking is already handled by the Docling v2 hierarchical chunker. However, the Docling v2 hierarchical chunker does not constrain the size of the chunks it produces; often the chunks are very small (much smaller than we would want for most practical uses such as SDG) and sometimes they are very large (and thus unusable for SDG without splitting or truncating). For those reasons, we need to post-process the outputs of the Docling v2 hierarchical chunker to split up large chunks and/or merge small chunks as needed.

There is a discussion in the docling repo about how to do this well, and I have proposed a candidate implementation that I reference in the discussion. Here is a basic overview of how it works:

  1. First, it goes through all the chunks (from the Docling hierarchical chunker) that contain multiple doc_items (e.g., itemized lists) and checks whether they are too long for the context window. If they are, it splits them into smaller chunks along the doc_item boundaries. For example, if it gets a list with 5 elements of 200 tokens each and the context window is 500 tokens, it will put the first two list elements into one chunk, the next two into a second chunk, and the last one into a third chunk. At this stage, if some doc_item is by itself too long for the context window, it becomes its own chunk.
  2. Next it goes through all the chunks and it finds chunks that are too large for the context window and splits them using a naive text splitter (currently semchunk).
  3. Finally, it goes through all the chunks from first to last, and whenever it encounters a sequence of chunks that combined would fit into a single context window, it merges them. This greedy, first-to-last approach is probably not ideal. For example, say you have 11 chunks each of length 50 and a context window of 512. This approach will observe that the first 10 chunks fit into the context window, merge them into one chunk, and then observe that there is only one chunk left of length 50 and keep it as is, so you wind up with one chunk of length 500 and one of length 50. However, for most purposes it probably would have been better to produce one chunk of length 300 and one of length 250, so you don't wind up with one tiny chunk at the end that is too small to be very useful. Rather than doing greedy, first-to-last merging, it would probably be better to do a non-greedy search for a solution that minimizes the number of chunks AND provides the greatest minimum chunk size. However, addressing this limitation is probably not particularly urgent or important. It is not in scope for this issue, but could be considered for a future issue.
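To make the three passes concrete, here is a hypothetical, simplified sketch. The function names, the toy whitespace tokenizer, and the word-boundary splitter (standing in for semchunk) are all illustrative, not part of Docling or of the actual proposed implementation:

```python
# Illustrative sketch of the three-pass post-processing described above.
# Assumption: a "chunk" is just a string, and tokens are whitespace-separated
# words. The real proposal works on Docling chunk objects and a real tokenizer.

def count_tokens(text):
    # Toy tokenizer: one token per whitespace-separated word.
    return len(text.split())

def split_along_doc_items(doc_items, max_tokens):
    """Pass 1: pack consecutive doc_items (e.g. list elements) into
    chunks that each fit the context window."""
    groups, current, current_len = [], [], 0
    for item in doc_items:
        n = count_tokens(item)
        if current and current_len + n > max_tokens:
            groups.append(current)
            current, current_len = [], 0
        current.append(item)
        current_len += n
    if current:
        groups.append(current)
    return ["\n".join(g) for g in groups]

def split_oversized(chunks, max_tokens):
    """Pass 2: naively split any chunk still over the limit (the real
    proposal uses semchunk; here we just cut on word boundaries)."""
    out = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_tokens:
            out.append(chunk)
        else:
            for i in range(0, len(words), max_tokens):
                out.append(" ".join(words[i:i + max_tokens]))
    return out

def merge_greedy(chunks, max_tokens):
    """Pass 3: greedy first-to-last merge of consecutive small chunks.
    This reproduces the 500/50 pathology described above."""
    merged, buffer, buffer_len = [], [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if buffer and buffer_len + n > max_tokens:
            merged.append("\n".join(buffer))
            buffer, buffer_len = [], 0
        buffer.append(chunk)
        buffer_len += n
    if buffer:
        merged.append("\n".join(buffer))
    return merged
```

With 11 chunks of 50 tokens and a 512-token window, merge_greedy yields chunks of 500 and 50 tokens, which is exactly the greedy pathology the non-greedy search would avoid.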

This implementation has gone through several rounds of revisions and incorporates a lot of changes from Panos on the Docling team. However, the Docling team still has some open concerns around my implementation:

  1. They don't like my dependency on semchunk, but we're still debating what alternatives to pursue.
  2. They would like the chunker to output a stream of chunks instead of building a full list of chunks before outputting all of them at once. That seems fine to me.

Once these open issues are resolved, some version of a fixed-size hierarchical chunker will hopefully be included in Docling. In that case, this issue in InstructLab could be as simple as calling the fixed-size hierarchical chunker in Docling. However, if we are in a hurry and/or want more control, we could use the existing hierarchical chunker in Docling and include some version of the code I propose for imposing a fixed size on its outputs.

Work to do

Either:

  • Wait for a fixed-size hierarchical chunker to be included in Docling.
  • Then use that fixed-size hierarchical chunker.

Or:

  • Use the existing (unlimited size) hierarchical chunker in Docling.
  • Add our own code to split and/or merge chunks as needed, possibly using my candidate implementation as a starting point.

Also, this work item will of course need extensive testing.

vagenas commented Dec 9, 2024

@jwm4 @aakankshaduggal @khaledsulayman

Wait for a fixed-size hierarchical chunker to be included in Docling. Then use that fixed-size hierarchical chunker.

This has now been released; please check out our recently announced Hybrid Chunker.

Let us know in case of any questions or help needed!
