-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat(api): add file_processor API skeleton #4113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This change adds a file_processor API skeleton that provides a foundationfor converting files into structured content for vector store ingestionwith support for chunking strategies and optional embedding generation. Signed-off-by: Alina Ryan <[email protected]>
b3ccdb2 to
2664aee
Compare
cdoern
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few comments to start out. Thanks for working on this!
| - provider_type: remote::weaviate | ||
| files: | ||
| - provider_type: inline::localfs | ||
| file_processor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have this API in starter? or should we exclude it until it graduated out of alpha / has more providers.
I know post_training is in here, but we had similar issues with that API being in starter due to its startup process/heavy dependencies (torch).
I feel like this API may be similar in that way. What do you think?
| files = "files" | ||
| prompts = "prompts" | ||
| conversations = "conversations" | ||
| file_processor = "file_processor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be plural like file_processors? like the APIs above it? This is kind of a nit, but just something to think about!
| async def initialize(self) -> None: | ||
| pass | ||
|
|
||
| async def process_file( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need a reference provider if that provider Is a no-op? Instead should we do with this what we did with SDG, where it is just a stub until an actual provider implementation is added? Otherwise this is dead code that someone could put in their run.yaml and get no output from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this. Let's first propose the new API, then add an implementation in another PR. Thanks!
This PR builds on the file processing workflow demonstrated in a recent Llama Stack community meeting, where we showcased file upload and processing capabilities through the UI. It introduces the backend API foundation that enables those integrations- specifically, a file_processor API skeleton that establishes a framework for converting files into structured content suitable for vector store ingestion, with support for configurable chunking strategies and optional embedding generation.
A follow-up PR will add an inline PyPDF provider implementation that can be invoked either within the vector store or as a standalone processor.
Related to:
#4114
#4003
#2484
Test Plan
Started the Llama Stack server using the starter distribution configuration to verify the new file_processor API is properly integrated.
Ran:
uv run llama stack run src/llama_stack/distributions/starter/run.yaml
Results:
The server started successfully using the starter distro config with no errors.
file_processorappeared in the list of available APIs.cc: @franciscojavierarceo @alimaredia