
Add image and audio prompting API #71

Open. Wants to merge 4 commits into `main`. Showing changes from 2 commits.
README.md: 114 changes (101 additions, 13 deletions)

@@ -173,6 +173,64 @@ console.log(await promptWithCalculator("What is 2 + 2?"));

We'll likely explore more specific APIs for tool- and function-calling in the future; follow along in [issue #7](https://github.com/webmachinelearning/prompt-api/issues/7).

### Multimodal inputs

All of the above examples have been of text prompts. Some language models also support other inputs. Our design initially includes the potential to support images and audio clips as inputs. This is done by using objects in the form `{ type: "image", data }` and `{ type: "audio", data }` instead of strings. The `data` values can be the following:

* For image inputs: [`ImageBitmapSource`](https://html.spec.whatwg.org/#imagebitmapsource), i.e. `Blob`, `ImageData`, `ImageBitmap`, `VideoFrame`, `OffscreenCanvas`, `HTMLImageElement`, `SVGImageElement`, `HTMLCanvasElement`, or `HTMLVideoElement` (will get the current frame). Also raw bytes via `BufferSource` (i.e. `ArrayBuffer` or typed arrays).

* For audio inputs: for now, `Blob`, `AudioBuffer`, `HTMLAudioElement`. Also raw bytes via `BufferSource`. Other possibilities we're investigating include `AudioData` and `MediaStream`, but we're not yet sure if those are suitable to represent "clips".
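For illustration, here is a minimal, non-normative sketch of producing prompt pieces from a few of these sources. The URLs and variable names are hypothetical; `createImageBitmap()`, `AudioContext`, and `fetch()` are just standard platform APIs used to produce the accepted types:

```js
// An ImageBitmap produced from a same-origin <img> element:
const bitmap = await createImageBitmap(document.querySelector("img"));
const imagePiece = { type: "image", data: bitmap };

// An AudioBuffer, produced by decoding a fetched clip with the Web Audio API:
const audioContext = new AudioContext();
const clipBytes = await (await fetch("/clip.mp3")).arrayBuffer();
const audioPiece = { type: "audio", data: await audioContext.decodeAudioData(clipBytes) };

// Raw bytes (BufferSource) also work; the format is sniffed (see the details list below):
const rawImagePiece = { type: "image", data: await (await fetch("/photo.png")).arrayBuffer() };
```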

Sessions that will include these inputs need to be created using the `expectedInputTypes` option. This ensures that any necessary downloads happen as part of session creation, and that session creation fails if the model is not capable of such multimodal prompts.

A sample of using these APIs:

```js
const session = await ai.languageModel.create({
  expectedInputTypes: ["audio", "image"] // "text" is always expected
});

const referenceImage = await (await fetch("/reference-image.jpeg")).blob();
const userDrawnImage = document.querySelector("canvas");

const response1 = await session.prompt([
  "Give a helpful artistic critique of how well the second image matches the first:",
  { type: "image", data: referenceImage },
  { type: "image", data: userDrawnImage }
]);

console.log(response1);

const audioBlob = await captureMicrophoneInput({ seconds: 10 });

const response2 = await session.prompt([
  "My response to your critique:",
  { type: "audio", data: audioBlob }
]);
```

**Review thread** on the image prompt lines:

**Reviewer:** Some models may only accept a single image or a single audio input with each request. Consider describing that edge-case behavior (e.g., throw a `"NotSupportedError"` `DOMException`) when an unsupported number or combination of image/audio prompt pieces is passed. Maybe also give a single-image example here for wider compatibility.

**PR author:** I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.

**@sushraja-msft** (Jan 23, 2025): I think the number-of-images aspect of this can be handled through the `contextoverflow` event: https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow. Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context-length limitations. The context-length limitation can also be reached earlier with large images.

Sending two requests to the backend may not work, though; the assistant is going to add its response in between.

Are there other limitations, such as no mixing of images and audio, or limits on the number of image/audio pieces? I'll have to check what the server-side models do in their APIs.

Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)

Details:

* Cross-origin data that has not been exposed using the `Access-Control-Allow-Origin` header cannot be used with the prompt API, and will reject with a `"SecurityError"` `DOMException`. This applies to `HTMLImageElement`, `SVGImageElement`, `HTMLAudioElement`, `HTMLVideoElement`, `HTMLCanvasElement`, and `OffscreenCanvas`. Note that this is stricter than `createImageBitmap()`, which has a tainting mechanism that allows creating opaque image bitmaps from unexposed cross-origin resources. For the prompt API, such resources will simply fail; this includes attempts to use cross-origin-tainted canvases. (A short error-handling sketch follows this list.)

* Raw-bytes cases (`Blob` and `BufferSource`) will apply the appropriate sniffing rules ([for images](https://mimesniff.spec.whatwg.org/#rules-for-sniffing-images-specifically), [for audio](https://mimesniff.spec.whatwg.org/#rules-for-sniffing-audio-and-video-specifically)) and reject with a `"NotSupportedError"` `DOMException` if the format is not supported. This behavior is similar to that of `createImageBitmap()`.

* For animated images, only the first frame will be used (as with `createImageBitmap()`). In the future, animated image input may be supported via a separate opt-in, similar to video clip input; but in the initial version we don't want interoperability problems from some implementations supporting animated images and some not.

* `HTMLAudioElement` can also represent streaming audio data (e.g., when it is connected to a `MediaSource`). Such cases will reject with a `"NotSupportedError"` `DOMException` for now.

* `HTMLAudioElement` might be connected to an audio source (e.g., a URL) that is not totally downloaded when the prompt API is called. In such cases, calling into the prompt API will force the download to complete.

* Similarly for `HTMLVideoElement`, even a single frame might not yet be downloaded when the prompt API is called. In such cases, calling into the prompt API will force at least a single frame's worth of video to download. (The intent is to behave the same as `createImageBitmap(videoEl)`.)

* Text prompts can also be done via `{ type: "text", data: aString }`, instead of just `aString`. This can be useful for generic code.

* Attempting to supply an invalid combination, e.g. `{ type: "audio", data: anImageBitmap }`, `{ type: "image", data: anAudioBuffer }`, or `{ type: "text", data: anArrayBuffer }`, will reject with a `TypeError`.

* Attempting to give an image or audio prompt with the `"assistant"` role will currently reject with a `"NotSupportedError"` `DOMException`. (Although as we explore multimodal outputs, this restriction might be lifted in the future.)
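As a rough sketch of how these failure modes surface (reusing the `session` from the example above; `crossOriginImg` is a hypothetical `<img>` whose resource is not CORS-exposed):

```js
const crossOriginImg = document.querySelector("#cross-origin-img"); // hypothetical, not CORS-exposed

try {
  await session.prompt([
    "Describe this image:",
    { type: "image", data: crossOriginImg }
  ]);
} catch (e) {
  // "SecurityError" DOMException: cross-origin data that is not CORS-exposed (this example).
  // "NotSupportedError" DOMException: e.g. raw bytes in an unsupported format, or an
  //   image/audio piece with the "assistant" role.
  // TypeError: a mismatched piece, e.g. { type: "audio", data: anImageBitmap }.
  console.error(e.name, e.message);
}
```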

### Configuration of per-session parameters

In addition to the `systemPrompt` and `initialPrompts` options shown above, the currently-configurable model parameters are [temperature](https://huggingface.co/blog/how-to-generate#sampling) and [top-K](https://huggingface.co/blog/how-to-generate#top-k-sampling). The `params()` API gives the default, minimum, and maximum values for these parameters.
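For instance, a speculative sketch of combining `params()` with these options (the exact fields returned by `params()` are an assumption here, along the lines of `defaultTemperature`/`maxTemperature` and `defaultTopK`/`maxTopK`):

```js
const params = await ai.languageModel.params(); // assumed result shape; see the caveat above

const session = await ai.languageModel.create({
  // Stay within the advertised temperature range, and keep the default top-K.
  temperature: Math.min(params.defaultTemperature * 1.5, params.maxTemperature),
  topK: params.defaultTopK
});
```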
@@ -355,7 +413,11 @@ The method will return a promise that fulfills with one of the following availab
An example usage is the following:

```js
const options = { expectedInputLanguages: ["en", "es"], temperature: 2 };
const options = {
expectedInputLanguages: ["en", "es"],
expectedInputTypes: ["audio"],
temperature: 2
};

const supportsOurUseCase = await ai.languageModel.availability(options);

@@ -450,6 +512,7 @@ interface AILanguageModel : EventTarget {
readonly attribute unsigned long topK;
readonly attribute float temperature;
readonly attribute FrozenArray<DOMString>? expectedInputLanguages;
readonly attribute FrozenArray<AILanguageModelPromptType> expectedInputTypes; // always contains at least "text"

attribute EventHandler oncontextoverflow;

@@ -469,35 +532,60 @@ dictionary AILanguageModelCreateCoreOptions {
[EnforceRange] unsigned long topK;
float temperature;
sequence<DOMString> expectedInputLanguages;
}
sequence<AILanguageModelPromptType> expectedInputTypes;
};

dictionary AILanguageModelCreateOptions : AILanguageModelCreateCoreOptions {
AbortSignal signal;
AICreateMonitorCallback monitor;

DOMString systemPrompt;
sequence<AILanguageModelInitialPrompt> initialPrompts;
sequence<AILanguageModelInitialPromptLine> initialPrompts;
};

dictionary AILanguageModelPromptOptions {
AbortSignal signal;
};

dictionary AILanguageModelCloneOptions {
AbortSignal signal;
};

dictionary AILanguageModelInitialPrompt {
// The argument to the prompt() method and others like it

typedef (AILanguageModelPromptLine or sequence<AILanguageModelPromptLine>) AILanguageModelPromptInput;

// Initial prompt lines

dictionary AILanguageModelInitialPromptLineDict {
  required AILanguageModelInitialPromptRole role;
  required DOMString content;
  required AILanguageModelPromptContent content;
};

**Review thread** on `AILanguageModelInitialPromptLineDict`:

**Reviewer:** Is it worthwhile to have separate role enums and dictionaries for initial/not lines? Would simply throwing exceptions for invalid `"system"` role use in the impl suffice?

**@michaelwasserman** (Jan 31, 2025): I realize you've explored this thoroughly, but I wonder why the IDL couldn't be simpler, like:

```webidl
dictionary AILanguageModelPromptDict {
  required AILanguageModelPromptRole role = "user";
  required AILanguageModelPromptType type = "text";
  required AILanguageModelPromptData data;
};

typedef (DOMString or AILanguageModelPromptDict) AILanguageModelPrompt;
typedef (AILanguageModelPrompt or sequence<AILanguageModelPrompt>) AILanguageModelPromptInput;
```

**PR author:**

> Is it worthwhile to have separate role enums and dictionaries for initial/not lines? Would simply throwing exceptions for invalid 'system' role use in the impl suffice?

Yeah, that's totally valid. If you as an implementer think that's simpler, we can do it.

> I realize you've explored this thoroughly, but I wonder why the IDL couldn't be simpler, like:

(Note that `required` and default values are not compatible.)

So, your version is mainly about unnesting the content, right?

I think it mostly works; it just departs from what's idiomatic with other chat completion APIs. Hmm.

**Reviewer:** Yeah, I imagined unnesting might slightly simplify IDL, impl, and atypical uses specifying more fields:

`create({initialPrompts: [{role: "system", content: {type: "image", data: pixels}}]})`

vs.

`create({initialPrompts: [{role: "system", type: "image", data: pixels}]})`

I'm unfamiliar with the rationale behind other APIs' idiomatic designs, so I'd readily defer to folks who know more, and maybe even opt for the more familiar pattern.

**PR author:** I think it's worth considering, since we're not matching the APIs exactly in other ways. I will try to outline the two possibilities in more detail to get developer feedback in #40, on Monday.

dictionary AILanguageModelPrompt {
typedef (DOMString or AILanguageModelInitialPromptLineDict) AILanguageModelInitialPromptLine;

// Prompt lines

dictionary AILanguageModelPromptLineDict {
required AILanguageModelPromptRole role;
required DOMString content;
required AILanguageModelPromptContent content;
};

dictionary AILanguageModelPromptOptions {
AbortSignal signal;
};
typedef (DOMString or AILanguageModelPromptLineDict) AILanguageModelPromptLine;

dictionary AILanguageModelCloneOptions {
AbortSignal signal;
// Prompt content inside the lines

dictionary AILanguageModelPromptContentDict {
required AILanguageModelPromptType type;
required AILanguageModelPromptData data;
};

typedef (DOMString or AILanguageModelPrompt or sequence<AILanguageModelPrompt>) AILanguageModelPromptInput;
typedef (DOMString or AILanguageModelPromptContentDict) AILanguageModelPromptContent;

typedef (ImageBitmapSource or BufferSource or AudioBuffer or HTMLAudioElement or DOMString) AILanguageModelPromptData;
enum AILanguageModelPromptType { "text", "image", "audio" };

// Prompt roles inside the lines

enum AILanguageModelInitialPromptRole { "system", "user", "assistant" };
enum AILanguageModelPromptRole { "user", "assistant" };
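To tie these shapes together, a minimal sketch of how the dictionaries above might be used from JavaScript (the nested `content` form discussed in the thread above; `pixels` is a hypothetical `ImageBitmap`):

```js
const session = await ai.languageModel.create({
  expectedInputTypes: ["image"],
  initialPrompts: [
    { role: "system", content: "You critique the user's sketches." },   // AILanguageModelInitialPromptLineDict
    { role: "user", content: { type: "image", data: pixels } },         // nested AILanguageModelPromptContentDict
    { role: "assistant", content: "Nice line work, but the perspective is off." }
  ]
});
```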