Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add image and audio prompting API #71

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Add image and audio prompting API #71

wants to merge 4 commits into from

Conversation

domenic
Copy link
Collaborator

@domenic domenic commented Jan 20, 2025

Closes #40. Somewhat helps with #70.

Closes #40. Somewhat helps with #70.
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved

const response1 = await session.prompt([
"Give a helpful artistic critique of how well the second image matches the first:",
{ type: "image", data: referenceImage },

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some models may only accept a single image or single audio input with each request. Consider describing that edge-case detail behavior (e.g. throw a "NotSupportedError" DOMException) when an unsupported number or combination of image/audio prompt pieces are passed. Maybe also give a single-image example here for wider compatibility.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.

Copy link

@sushraja-msft sushraja-msft Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think number of images aspect of this can be handled through the contextoverflow event. https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow.

Phi 3.5 vision https://huggingface.co/microsoft/Phi-3.5-vision-instruct, recommends at the most 16 images but passing more (say 17) will only run into context length limitations. The context length limitation can be reached earlier as well with large image.

Sending two requests to the backend may not work though, the assistant is going to add its response in between.

Are there other limitations in terms of no mixing of images/audio or limitations in number of images/audio. Ill have to check what the server-side models do for their API.


// Initial prompt lines

dictionary AILanguageModelInitialPromptLineDict {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile to have separate role enums and dictionaries for initial/not lines?
Would simply throwing exceptions for invalid 'system' role use in the impl suffice?

Copy link

@michaelwasserman michaelwasserman Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize you've explored this thoroughly, but I wonder why the idl couldn't be simpler, like:

dictionary AILanguageModelPromptDict {
  required AILanguageModelPromptRole role = "user";
  required AILanguageModelPromptType type = "text";
  required AILanguageModelPromptData data;
}

typedef (DOMString or AILanguageModelPromptDict) AILanguageModelPrompt;
typedef (AILanguageModelPrompt or sequence<AILanguageModelPrompt>) AILanguageModelPromptInput;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile to have separate role enums and dictionaries for initial/not lines?
Would simply throwing exceptions for invalid 'system' role use in the impl suffice?

Yeah that's totally valid. If you as an implementer think that's simpler, we can do it.

I realize you've explored this thoroughly, but I wonder why the idl couldn't be simpler, like:

(Note that required and default values are not compatible.)

So, your version is mainly about unnesting the content, right?

I think it mostly works, it just departs from what's idiomatic with other chat completion APIs. Hmm.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I imagined unnesting might slightly simplify IDL, impl, and atypical uses specifying more fields.
create({initialPrompts: [{role: "system", content: {type: "image", data: pixels}}]})
vs
create({initialPrompts: [{role: "system", type: "image", data: pixels}]})

I'm unfamiliar with rationale behind other APIs' idiomatic designs, so I'd readily defer to folks that know more, and maybe even opt for the more familiar pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth considering, since we're not matching the APIs exactly in other ways. I will try to outline the two possibilities in more detail to get developer feedback in #40, on Monday.

aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 1, 2025
Add IDL for new LanguageModel[Factory] API types, etc., per:
  webmachinelearning/prompt-api#71

API use with currently supported input types is unchanged.
API use with new input types throws TypeErrors for now.

Move create() WPTs into a new file as separate tests.

Bug: 385173789, 385173368
Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
Reviewed-by: Clark DuVall <[email protected]>
Commit-Queue: Mike Wasserman <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1414456}
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 1, 2025
This reverts commit 259d706.

Reason for revert: Failing win builder:
https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_

Original change's description:
> Prompt API: Add multimodal input IDL skeleton
>
> Add IDL for new LanguageModel[Factory] API types, etc., per:
>   webmachinelearning/prompt-api#71
>
> API use with currently supported input types is unchanged.
> API use with new input types throws TypeErrors for now.
>
> Move create() WPTs into a new file as separate tests.
>
> Bug: 385173789, 385173368
> Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> Reviewed-by: Clark DuVall <[email protected]>
> Commit-Queue: Mike Wasserman <[email protected]>
> Cr-Commit-Position: refs/heads/main@{#1414456}

Bug: 385173789, 385173368
Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
Commit-Queue: Mike Wasserman <[email protected]>
Auto-Submit: Mike Wasserman <[email protected]>
Reviewed-by: Clark DuVall <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1414477}
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 4, 2025
This reverts commit 77f9f49.

Reason for revert: Workaround Windows IDL compiler path length issues

Original change's description:
> Revert "Prompt API: Add multimodal input IDL skeleton"
>
> This reverts commit 259d706.
>
> Reason for revert: Failing win builder:
> https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_
>
> Original change's description:
> > Prompt API: Add multimodal input IDL skeleton
> >
> > Add IDL for new LanguageModel[Factory] API types, etc., per:
> >   webmachinelearning/prompt-api#71
> >
> > API use with currently supported input types is unchanged.
> > API use with new input types throws TypeErrors for now.
> >
> > Move create() WPTs into a new file as separate tests.
> >
> > Bug: 385173789, 385173368
> > Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> > Reviewed-by: Clark DuVall <[email protected]>
> > Commit-Queue: Mike Wasserman <[email protected]>
> > Cr-Commit-Position: refs/heads/main@{#1414456}
>
> Bug: 385173789, 385173368
> Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
> No-Presubmit: true
> No-Tree-Checks: true
> No-Try: true
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
> Commit-Queue: Mike Wasserman <[email protected]>
> Auto-Submit: Mike Wasserman <[email protected]>
> Reviewed-by: Clark DuVall <[email protected]>
> Cr-Commit-Position: refs/heads/main@{#1414477}

Bug: 385173789, 385173368, 394123703
Change-Id: I7a47d8ff8c6b6ae797c3198608859640ae81e1df
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6222573
Reviewed-by: Clark DuVall <[email protected]>
Commit-Queue: Mike Wasserman <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1415399}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add multimodal API such as using image as part of prompt
5 participants