
Conversation

@alinaryan (Contributor) commented on Nov 9, 2025:

This PR builds on the file processing workflow demonstrated in a recent Llama Stack community meeting, where we showcased file upload and processing capabilities through the UI. It introduces the backend API foundation that enables those integrations: specifically, a `file_processor` API skeleton that establishes a framework for converting files into structured content suitable for vector store ingestion, with support for configurable chunking strategies and optional embedding generation.

A follow-up PR will add an inline PyPDF provider implementation that can be invoked either within the vector store or as a standalone processor.
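
For illustration, here is a minimal sketch of the kind of protocol this skeleton establishes. Only the `file_processor` API name and the `process_file` method appear in this PR; the parameter names, types, and `ProcessedContent` response model below are assumptions, not the actual interface:

```python
# Hypothetical sketch of the file_processor surface; parameters and the
# response model are illustrative guesses, not code from this PR.
from typing import Protocol

from pydantic import BaseModel


class ProcessedContent(BaseModel):
    # Assumed response shape: chunked text ready for vector store ingestion,
    # with optional embeddings when the caller requests them.
    chunks: list[str]
    embeddings: list[list[float]] | None = None


class FileProcessor(Protocol):
    async def process_file(
        self,
        file_id: str,
        chunking_strategy: dict | None = None,  # assumed: configurable chunking
        include_embeddings: bool = False,  # assumed: optional embedding generation
    ) -> ProcessedContent: ...
```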

Related to:
#4114
#4003
#2484

Test Plan

Started the Llama Stack server using the starter distribution configuration to verify the new file_processor API is properly integrated.

Ran:

```
uv run llama stack run src/llama_stack/distributions/starter/run.yaml
```

Results:
The server started successfully using the starter distro config with no errors. file_processor appeared in the list of available APIs.
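
As a quick spot check, one can also query the running server's route listing. A minimal sketch, assuming the default port 8321, a `/v1/inspect/routes` endpoint, and a `data` list in the response (all assumptions, not part of this PR):

```python
# Hypothetical verification step; the port, endpoint path, and response
# shape are assumptions, not from this PR.
import requests

resp = requests.get("http://localhost:8321/v1/inspect/routes", timeout=10)
resp.raise_for_status()
routes = [r["route"] for r in resp.json()["data"]]
matches = [r for r in routes if "file_processor" in r]
assert matches, "no file_processor routes found"
print(matches)
```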

cc: @franciscojavierarceo @alimaredia

This change adds a file_processor API skeleton that provides a foundation for converting files into structured content for vector store ingestion, with support for chunking strategies and optional embedding generation.

Signed-off-by: Alina Ryan <[email protected]>
@alinaryan force-pushed the add-file-processor-skeleton branch from b3ccdb2 to 2664aee on November 9, 2025 05:24
@alinaryan marked this pull request as draft on November 9, 2025 05:31
@cdoern (Contributor) left a comment:

A few comments to start out. Thanks for working on this!

```yaml
    - provider_type: remote::weaviate
  files:
    - provider_type: inline::localfs
  file_processor:
```
Review comment (Contributor):

Should we have this API in starter? Or should we exclude it until it graduates out of alpha / has more providers?

I know post_training is in here, but we had similar issues with that API being in starter due to its startup process/heavy dependencies (torch).

I feel like this API may be similar in that way. What do you think?

```python
files = "files"
prompts = "prompts"
conversations = "conversations"
file_processor = "file_processor"
```
Review comment (Contributor):

I wonder if this should be plural, like file_processors, to match the APIs above it? This is kind of a nit, but just something to think about!

```python
async def initialize(self) -> None:
    pass

async def process_file(
```
Review comment (Contributor):

Do we need a reference provider if that provider is a no-op? Should we instead do with this what we did with SDG, where it is just a stub until an actual provider implementation is added? Otherwise this is dead code that someone could put in their run.yaml and get no output from.
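
For context, a minimal sketch of the stub pattern being suggested; the class name and error message are illustrative, not code from this PR or from SDG:

```python
# Hypothetical stub provider: the API surface exists, but every call fails
# loudly until a real implementation lands, so a run.yaml entry can't
# silently produce no output.
class FileProcessorStub:
    async def initialize(self) -> None:
        pass

    async def process_file(self, *args, **kwargs):
        raise NotImplementedError(
            "file_processor has no provider implementation yet; "
            "a PyPDF provider is planned in a follow-up PR"
        )
```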

Review comment (Collaborator):
+1 on this. Let's first propose the new API, then add an implementation in another PR. Thanks!


Labels: CLA Signed (managed by the Meta Open Source bot)
