[Feature Request]: In `IngestionPipeline`, cache node transformations at node level #16394
Feature Description
In `IngestionPipeline`, cache the transformations on individual nodes, so that the same node+transformation can be retrieved from the cache, instead of requiring the whole list of nodes to be the same.
Reason
The hash used to key the cache is based on the whole list of Nodes (e.g. Documents), instead of individual Nodes. Even if only a single Node has changed, the transformations are executed on the whole list again, since the hash is different.
llama_index/llama-index-core/llama_index/core/ingestion/pipeline.py
Lines 55 to 102 in 535d0a4
Code example:
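To illustrate the point, here is a minimal, self-contained sketch. It is not the code from the referenced lines of `pipeline.py`; the helpers `list_level_key` and `node_level_keys` are hypothetical and only mimic the idea of hashing a whole batch versus hashing each node separately.

```python
import hashlib
from typing import List


def list_level_key(texts: List[str], transform_id: str) -> str:
    """Hypothetical batch-level key: one hash over the whole list of texts."""
    joined = "\x00".join(texts)
    return hashlib.sha256(f"{transform_id}\x00{joined}".encode("utf-8")).hexdigest()


def node_level_keys(texts: List[str], transform_id: str) -> List[str]:
    """Hypothetical node-level keys: one hash per text."""
    return [
        hashlib.sha256(f"{transform_id}\x00{text}".encode("utf-8")).hexdigest()
        for text in texts
    ]


monday = ["doc A", "doc B", "doc C"]
tuesday = ["doc A", "doc B (edited)", "doc C"]  # only one document changed

# Batch-level: a single edit changes the key, so the whole batch is a cache miss.
batch_hit = list_level_key(monday, "splitter") == list_level_key(tuesday, "splitter")

# Node-level: only the edited document misses.
node_hits = [
    m == t
    for m, t in zip(node_level_keys(monday, "splitter"), node_level_keys(tuesday, "splitter"))
]

print(batch_hit)  # False
print(node_hits)  # [True, False, True]
```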
Value of Feature
I would like to use `IngestionPipeline` as part of a data preparation step in an ML pipeline. The inputs are Documents, and outputs are embeddings that are passed to an ML model. There is no need for storage or retrieval. For example:
In cases where incoming data represents a full snapshot (e.g. daily snapshots of Documents in a production system), the set of Documents comprises:
It would be more efficient to run the transformations only on (1) and (2) and to retrieve the cached results for (3), especially if (3) represents a large proportion of the batch of data.
Currently, the cache works on the whole list of Documents passed to `IngestionPipeline.run()`, so even if only 1 Document has changed, all the other Documents are processed again.
Furthermore, the cache will become bloated over time, as each batch of data is likely to be unique (albeit with many overlapping Documents). It is extremely unlikely that the exact same list of Documents is encountered again, which diminishes the utility of the cache.
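The requested behaviour could look roughly like the sketch below. It uses only the standard library; `run_with_node_cache`, `fake_embed` and the plain `dict` cache are hypothetical stand-ins, not part of the llama_index API, and the sketch assumes a one-to-one transformation (one output per input text).

```python
import hashlib
from typing import Callable, Dict, List

calls: List[str] = []  # records which texts were actually transformed


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an expensive transformation (e.g. embedding)."""
    calls.append(text)
    return [float(len(text))]


def run_with_node_cache(
    texts: List[str],
    transform: Callable[[str], List[float]],
    transform_id: str,
    cache: Dict[str, List[float]],
) -> List[List[float]]:
    """Apply `transform` per text, keyed on (transform_id, text) instead of the whole list."""
    results = []
    for text in texts:
        key = hashlib.sha256(f"{transform_id}\x00{text}".encode("utf-8")).hexdigest()
        if key not in cache:  # cache miss: new or changed text
            cache[key] = transform(text)
        results.append(cache[key])
    return results


cache: Dict[str, List[float]] = {}

# First snapshot: everything is new, so all three texts are transformed.
run_with_node_cache(["a", "bb", "ccc"], fake_embed, "fake_embed_v1", cache)

# Second snapshot: one changed text, one new text, two unchanged texts.
second = run_with_node_cache(["a", "bb!", "ccc", "dddd"], fake_embed, "fake_embed_v1", cache)

print(calls)  # ['a', 'bb', 'ccc', 'bb!', 'dddd']
```

In this sketch the second run transforms only the changed and the new text, and the cache holds one entry per distinct text rather than one entry per distinct batch. Transformations that turn one node into several (such as splitters) would need the cached value to be a list of output nodes.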