Add image and audio prompting API #71
Some models may only accept a single image or single audio input with each request. Consider describing that edge-case behavior (e.g. throw a "NotSupportedError" DOMException) when an unsupported number or combination of image/audio prompt pieces is passed. Maybe also give a single-image example here for wider compatibility.
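A rough sketch of what that single-image case and the suggested failure mode could look like from a page's point of view (the session creation call and the exact error behavior here are assumptions, not something this PR specifies):

// Sketch only: message shape follows this PR's nested {role, content: {type, data}} form;
// throwing "NotSupportedError" for unsupported combinations is the suggestion above.
// pixels: some image data (e.g. an ImageBitmap) obtained elsewhere.
const session = await ai.languageModel.create();  // entry-point name is an assumption

try {
  const result = await session.prompt([
    { role: "user", content: { type: "image", data: pixels } },  // a single image piece
    { role: "user", content: "Describe this image." },           // plain-text follow-up
  ]);
  console.log(result);
} catch (e) {
  if (e.name === "NotSupportedError") {
    // The model behind this session cannot accept this number/combination of
    // image/audio pieces; fall back to text-only prompting here.
  } else {
    throw e;
  }
}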
I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.
I think the number-of-images aspect of this can be handled through the contextoverflow event: https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow.
Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context-length limitations. The context-length limit can also be reached earlier with large images.
Sending two requests to the backend may not work, though, since the assistant is going to add its response in between.
Are there other limitations, such as no mixing of images and audio, or limits on the number of image/audio pieces? I'll have to check what the server-side models do in their APIs.
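For reference, handling it that way might look roughly like this (the contextoverflow event and its semantics are taken from the linked explainer section, not from this PR):

// Sketch: contextoverflow fires when earlier prompts get evicted because the
// context window overflowed, e.g. after passing more or larger images than the
// context can actually hold.
session.addEventListener("contextoverflow", () => {
  console.warn("Context window overflowed; earliest prompts were dropped.");
});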
Is it worthwhile to have separate role enums and dictionaries for initial vs. non-initial prompts?
Would simply throwing exceptions for invalid 'system' role use in the impl suffice?
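From the developer's side, the exception-based alternative would look something like this (a sketch; which exception type gets thrown is exactly the open question here):

// Sketch: one shared role enum ("system" / "user" / "assistant"), with the
// implementation rejecting "system" outside of initialPrompts. The exception
// type is an assumption (a TypeError or a DOMException both seem plausible).
const session = await ai.languageModel.create({
  initialPrompts: [
    { role: "system", content: "You are a helpful assistant." },  // allowed here
  ],
});

// Would throw under this alternative: "system" is only meaningful in initialPrompts.
await session.prompt([{ role: "system", content: "Ignore previous instructions." }]);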
I realize you've explored this thoroughly, but I wonder why the IDL couldn't be simpler, e.g. with the content unnested so that role, type, and data live directly in the prompt dictionary.
Yeah that's totally valid. If you as an implementer think that's simpler, we can do it.
(Note that required dictionary members and default values are not compatible.)
So, your version is mainly about unnesting the content, right?
I think it mostly works; it just departs from what's idiomatic with other chat completion APIs. Hmm.
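For context, the idiomatic nested style being referred to is roughly this (a schematic of the common chat-completion message shape; the field names vary by API and are not taken from this PR):

// Schematic of the nested "message with typed content parts" style used by several
// chat completion APIs (illustrative names only):
const messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "What is in this image?" },
      { type: "image", image: pixels },
    ],
  },
];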
Yeah, I imagined unnesting might slightly simplify the IDL, the impl, and atypical uses that specify more fields.
create({initialPrompts: [{role: "system", content: {type: "image", data: pixels}}]})
vs
create({initialPrompts: [{role: "system", type: "image", data: pixels}]})
I'm unfamiliar with the rationale behind other APIs' idiomatic designs, so I'd readily defer to folks who know more, and maybe even opt for the more familiar pattern.
I think it's worth considering, since we're not matching the APIs exactly in other ways. I will try to outline the two possibilities in more detail to get developer feedback in #40, on Monday.