Add image and audio prompting API #71

domenic · 2025-01-20T07:01:59Z

Closes #40. Somewhat helps with #70.

README.md

michaelwasserman · 2025-01-21T20:21:55Z

README.md

+
+const response1 = await session.prompt([
+  "Give a helpful artistic critique of how well the second image matches the first:",
+  { type: "image", data: referenceImage },


Some models may accept only a single image or a single audio input per request. Consider describing the behavior for that edge case (e.g. throw a "NotSupportedError" DOMException) when an unsupported number or combination of image/audio prompt pieces is passed. Maybe also give a single-image example here for wider compatibility.
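A minimal sketch of the edge-case behavior suggested above. The limits object (`maxImages`, `maxAudio`) and the helper `checkPromptPieces` are hypothetical names for illustration only and are not part of the proposed API; the sketch just shows one way an unsupported number of image/audio pieces could be rejected with a "NotSupportedError" DOMException.

```javascript
// Hypothetical limits for a model that accepts at most one image and one
// audio clip per request (assumption, not taken from the proposal).
const singleInputLimits = { maxImages: 1, maxAudio: 1 };

// Illustrative helper (not a proposed API): counts the image and audio
// pieces in a prompt and throws if the model cannot accept that many.
function checkPromptPieces(pieces, limits) {
  let images = 0;
  let audio = 0;
  for (const piece of pieces) {
    if (typeof piece === "string") continue; // plain text piece
    if (piece.type === "image") images++;
    else if (piece.type === "audio") audio++;
  }
  if (images > limits.maxImages || audio > limits.maxAudio) {
    throw new DOMException(
      `Unsupported input: ${images} image(s), ${audio} audio clip(s)`,
      "NotSupportedError"
    );
  }
}
```

With these assumed limits, a prompt with one string and one image piece passes, while the two-image prompt from the README example would throw.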

I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.

I think the number-of-images aspect of this can be handled through the contextoverflow event: https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow.

Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context length limitations. The context length limit can also be reached earlier with large images.
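To illustrate the idea that too many (or too large) images surface as a context overflow rather than as a hard image-count limit, here is a rough sketch. `FakeSession`, `addInput`, `countOverflows`, and the per-image token costs are all made up for the example; only the event name "contextoverflow" comes from the linked README section.

```javascript
// Stand-in for a session (hypothetical). It only tracks a running token
// count and fires "contextoverflow" when the count exceeds the window.
class FakeSession extends EventTarget {
  constructor(maxTokens) {
    super();
    this.maxTokens = maxTokens;
    this.tokensUsed = 0;
  }

  // Adds one input with a made-up token cost.
  addInput(tokenCost) {
    this.tokensUsed += tokenCost;
    if (this.tokensUsed > this.maxTokens) {
      this.dispatchEvent(new Event("contextoverflow"));
    }
  }
}

// Feeds a list of image costs to a fresh session and reports how many
// times the overflow event fired.
function countOverflows(maxTokens, imageCosts) {
  const session = new FakeSession(maxTokens);
  let overflows = 0;
  session.addEventListener("contextoverflow", () => {
    overflows++;
  });
  for (const cost of imageCosts) session.addInput(cost);
  return overflows;
}
```

In this sketch a single large image can trigger the event just as a long run of small images can, which matches the point that the limit is about context length, not image count.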

Sending two requests to the backend may not work, though: the assistant is going to add its response in between.

Are there other limitations, such as no mixing of images and audio, or limits on the number of images or audio clips? I'll have to check what the server-side models do for their APIs.

michaelwasserman · 2025-01-31T07:06:10Z

README.md

+
+// Initial prompt lines
+
+dictionary AILanguageModelInitialPromptLineDict {


Is it worthwhile to have separate role enums and dictionaries for initial/not lines?
Would simply throwing exceptions for invalid 'system' role use in the impl suffice?

I realize you've explored this thoroughly, but I wonder why the idl couldn't be simpler, like:

dictionary AILanguageModelPromptDict {
  required AILanguageModelPromptRole role = "user";
  required AILanguageModelPromptType type = "text";
  required AILanguageModelPromptData data;
}
typedef (DOMString or AILanguageModelPromptDict) AILanguageModelPrompt;
typedef (AILanguageModelPrompt or sequence<AILanguageModelPrompt>) AILanguageModelPromptInput;
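As a rough illustration of how an implementation might consume the simpler shape suggested above, the sketch below normalizes a string, a single dict, or a sequence of either into a flat list of `{ role, type, data }` entries. `normalizePromptInput` is a hypothetical helper; it treats `role` and `type` as optional with the defaults "user" and "text" (an assumption), and it does not validate `data`.

```javascript
// Hypothetical helper mirroring the suggested typedefs:
//   AILanguageModelPrompt      = string or dict
//   AILanguageModelPromptInput = one prompt or a sequence of prompts
function normalizePromptInput(input) {
  const items = Array.isArray(input) ? input : [input];
  return items.map((item) => {
    if (typeof item === "string") {
      // A bare string is shorthand for a user text prompt.
      return { role: "user", type: "text", data: item };
    }
    // Assumed defaults: role "user", type "text".
    return {
      role: item.role ?? "user",
      type: item.type ?? "text",
      data: item.data,
    };
  });
}
```

For example, `normalizePromptInput("hi")` and `normalizePromptInput([{ data: "hi" }])` both yield a single user text entry under these assumptions.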

> Is it worthwhile to have separate role enums and dictionaries for initial/not lines?
> Would simply throwing exceptions for invalid 'system' role use in the impl suffice?

Yeah that's totally valid. If you as an implementer think that's simpler, we can do it.
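A sketch of what the "throw in the implementation" option could look like, assuming the "system" role is valid only in `initialPrompts` and that a `TypeError` is the exception used (the thread only says "throwing exceptions"). `rejectSystemRole` is a hypothetical helper name.

```javascript
// Hypothetical runtime check for non-initial prompt lines. With a single
// shared role enum, the IDL no longer rules out "system" here, so the
// implementation would have to reject it itself.
function rejectSystemRole(promptLines) {
  for (const line of promptLines) {
    if (typeof line !== "string" && line.role === "system") {
      throw new TypeError('The "system" role is only allowed in initialPrompts.');
    }
  }
}
```

The trade-off is that the restriction moves from the type definitions into prose and implementation checks.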

> I realize you've explored this thoroughly, but I wonder why the idl couldn't be simpler, like:

(Note that required and default values are not compatible.)

So, your version is mainly about unnesting the content, right?

I think it mostly works, it just departs from what's idiomatic with other chat completion APIs. Hmm.

Yeah, I imagined unnesting might slightly simplify IDL, impl, and atypical uses specifying more fields.
create({initialPrompts: [{role: "system", content: {type: "image", data: pixels}}]})
vs
create({initialPrompts: [{role: "system", type: "image", data: pixels}]})

I'm unfamiliar with the rationale behind other APIs' idiomatic designs, so I'd readily defer to folks who know more, and maybe even opt for the more familiar pattern.
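The two shapes compared above carry the same information, which a small conversion makes concrete. `unnestPrompt` below is a hypothetical helper that maps the nested `{ role, content }` form onto the unnested `{ role, type, data }` form; treating a string `content` as text is an assumption for the sketch.

```javascript
// Hypothetical conversion from the nested form
//   { role, content: { type, data } }   (or content as a plain string)
// to the unnested form
//   { role, type, data }
function unnestPrompt({ role, content }) {
  if (typeof content === "string") {
    return { role, type: "text", data: content };
  }
  return { role, type: content.type, data: content.data };
}
```

Because the mapping is mechanical in both directions, the choice between the two is mostly about familiarity versus a slightly flatter IDL.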

I think it's worth considering, since we're not matching the APIs exactly in other ways. I will try to outline the two possibilities in more detail to get developer feedback in #40, on Monday.

Add IDL for new LanguageModel[Factory] API types, etc., per:
webmachinelearning/prompt-api#71

API use with currently supported input types is unchanged.
API use with new input types throws TypeErrors for now.

Move create() WPTs into a new file as separate tests.

Bug: 385173789, 385173368
Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
Reviewed-by: Clark DuVall <[email protected]>
Commit-Queue: Mike Wasserman <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1414456}

This reverts commit 259d706.

Reason for revert: Failing win builder:
https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_

Original change's description:
> Prompt API: Add multimodal input IDL skeleton
>
> Add IDL for new LanguageModel[Factory] API types, etc., per:
> webmachinelearning/prompt-api#71
>
> API use with currently supported input types is unchanged.
> API use with new input types throws TypeErrors for now.
>
> Move create() WPTs into a new file as separate tests.
>
> Bug: 385173789, 385173368
> Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> Reviewed-by: Clark DuVall <[email protected]>
> Commit-Queue: Mike Wasserman <[email protected]>
> Cr-Commit-Position: refs/heads/main@{#1414456}

Bug: 385173789, 385173368
Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
Commit-Queue: Mike Wasserman <[email protected]>
Auto-Submit: Mike Wasserman <[email protected]>
Reviewed-by: Clark DuVall <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1414477}

This reverts commit 77f9f49.

Reason for revert: Workaround Windows IDL compiler path length issues

Original change's description:
> Revert "Prompt API: Add multimodal input IDL skeleton"
>
> This reverts commit 259d706.
>
> Reason for revert: Failing win builder:
> https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_
>
> Original change's description:
> > Prompt API: Add multimodal input IDL skeleton
> >
> > Add IDL for new LanguageModel[Factory] API types, etc., per:
> > webmachinelearning/prompt-api#71
> >
> > API use with currently supported input types is unchanged.
> > API use with new input types throws TypeErrors for now.
> >
> > Move create() WPTs into a new file as separate tests.
> >
> > Bug: 385173789, 385173368
> > Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> > Reviewed-by: Clark DuVall <[email protected]>
> > Commit-Queue: Mike Wasserman <[email protected]>
> > Cr-Commit-Position: refs/heads/main@{#1414456}
>
> Bug: 385173789, 385173368
> Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
> No-Presubmit: true
> No-Tree-Checks: true
> No-Try: true
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
> Commit-Queue: Mike Wasserman <[email protected]>
> Auto-Submit: Mike Wasserman <[email protected]>
> Reviewed-by: Clark DuVall <[email protected]>
> Cr-Commit-Position: refs/heads/main@{#1414477}

Bug: 385173789, 385173368, 394123703
Change-Id: I7a47d8ff8c6b6ae797c3198608859640ae81e1df
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6222573
Reviewed-by: Clark DuVall <[email protected]>
Commit-Queue: Mike Wasserman <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1415399}

Add image and audio prompting API

2a9f391

Closes #40. Somewhat helps with #70.

beaufortfrancois reviewed Jan 21, 2025

bradtriebwasser reviewed Jan 21, 2025

michaelwasserman reviewed Jan 21, 2025

domenic added 3 commits January 22, 2025 12:36

Respond to review feedback

e2e6752

More complicated typedefs!!

996364b

Missing []s

6839d63

michaelwasserman reviewed Jan 31, 2025

domenic mentioned this pull request Feb 4, 2025

Add multimodal API such as using image as part of prompt #40

Open

christianliebel mentioned this pull request Feb 4, 2025

FR: Real-time capabilities #80

Open

domenic commented Jan 20, 2025

michaelwasserman Jan 21, 2025

domenic Jan 22, 2025

sushraja-msft Jan 23, 2025 (edited)

michaelwasserman Jan 31, 2025

michaelwasserman Jan 31, 2025 (edited)

domenic Jan 31, 2025

michaelwasserman Jan 31, 2025

domenic Jan 31, 2025

