
Conversation

@alinaryan (Contributor) commented on Nov 9, 2025:

This PR builds on the file processing workflow demonstrated in a recent Llama Stack community meeting, where we showcased file upload and processing capabilities through the UI. It introduces the backend API foundation that enables those integrations: specifically, a `file_processor` API skeleton that establishes a framework for converting files into structured content suitable for vector store ingestion, with support for configurable chunking strategies and optional embedding generation.

A follow-up PR will add an inline PyPDF provider implementation that can be invoked either within the vector store or as a standalone processor.
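
For illustration, here is a minimal sketch of the kind of protocol this skeleton establishes. Only the `file_processor` API name and the `process_file` method appear in this PR; the parameter names, types, and `ProcessedContent` response model below are assumptions, not the actual interface:

```python
# Hypothetical sketch of the file_processor surface; parameters and the
# response model are illustrative guesses, not code from this PR.
from typing import Protocol

from pydantic import BaseModel


class ProcessedContent(BaseModel):
    # Assumed response shape: chunked text ready for vector store ingestion,
    # with optional embeddings when the caller requests them.
    chunks: list[str]
    embeddings: list[list[float]] | None = None


class FileProcessor(Protocol):
    async def process_file(
        self,
        file_id: str,
        chunking_strategy: dict | None = None,  # assumed: configurable chunking
        include_embeddings: bool = False,  # assumed: optional embedding generation
    ) -> ProcessedContent: ...
```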

Related to:
#4114
#4003
#2484

Test Plan

Started the Llama Stack server using the starter distribution configuration to verify the new file_processor API is properly integrated.

Ran:

```
uv run llama stack run src/llama_stack/distributions/starter/run.yaml
```

Results:
The server started successfully using the starter distro config with no errors. file_processor appeared in the list of available APIs.
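
As a quick spot check, one can also query the running server's route listing. A minimal sketch, assuming the default port 8321, a `/v1/inspect/routes` endpoint, and a `data` list in the response (all assumptions, not part of this PR):

```python
# Hypothetical verification step; the port, endpoint path, and response
# shape are assumptions, not from this PR.
import requests

resp = requests.get("http://localhost:8321/v1/inspect/routes", timeout=10)
resp.raise_for_status()
routes = [r["route"] for r in resp.json()["data"]]
matches = [r for r in routes if "file_processor" in r]
assert matches, "no file_processor routes found"
print(matches)
```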

cc: @franciscojavierarceo @alimaredia

This change adds a file_processor API skeleton that provides a foundation for converting files into structured content for vector store ingestion, with support for chunking strategies and optional embedding generation.

Signed-off-by: Alina Ryan <[email protected]>
@alinaryan force-pushed the add-file-processor-skeleton branch from b3ccdb2 to 2664aee on November 9, 2025 05:24
@alinaryan marked this pull request as draft on November 9, 2025 05:31
@cdoern (Contributor) left a comment:

A few comments to start out. Thanks for working on this!

```yaml
    - provider_type: remote::weaviate
  files:
    - provider_type: inline::localfs
  file_processor:
```
Review comment (Contributor):

Should we have this API in starter? Or should we exclude it until it graduates out of alpha / has more providers?

I know post_training is in here, but we had similar issues with that API being in starter due to its startup process/heavy dependencies (torch).

I feel like this API may be similar in that way. What do you think?

```python
files = "files"
prompts = "prompts"
conversations = "conversations"
file_processor = "file_processor"
```
Review comment (Contributor):

I wonder if this should be plural, like file_processors, to match the APIs above it? This is kind of a nit, but just something to think about!

```python
async def initialize(self) -> None:
    pass

async def process_file(
```
Review comment (Contributor):

Do we need a reference provider if that provider is a no-op? Should we instead do with this what we did with SDG, where it is just a stub until an actual provider implementation is added? Otherwise this is dead code that someone could put in their run.yaml and get no output from.
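
For context, a minimal sketch of the stub pattern being suggested; the class name and error message are illustrative, not code from this PR or from SDG:

```python
# Hypothetical stub provider: the API surface exists, but every call fails
# loudly until a real implementation lands, so a run.yaml entry can't
# silently produce no output.
class FileProcessorStub:
    async def initialize(self) -> None:
        pass

    async def process_file(self, *args, **kwargs):
        raise NotImplementedError(
            "file_processor has no provider implementation yet; "
            "a PyPDF provider is planned in a follow-up PR"
        )
```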

Review comment (Collaborator):
+1 on this. Let's first propose the new API, then add an implementation in another PR. Thanks!


Labels: CLA Signed (managed by the Meta Open Source bot)
