Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[new feature request] Is it possible to support screenshot as a server or resource? #48

Closed
wa008 opened this issue Nov 26, 2024 · 6 comments

Comments

@wa008
Copy link

wa008 commented Nov 26, 2024

I want LLM to read screenshot as a input, like what computer use do, but MCP can have a better permission control management.

Is it possible? or is there some clue or suggestion that can help me to contribute it.

@jspahrsummers
Copy link
Member

You could definitely build an MCP server for this. The server could offer one tool which takes a screenshot, and returns it as an image.

Does that help?

@deksden
Copy link

deksden commented Nov 28, 2024

Is it possible?

Puppeteer server sample have tool:

puppeteer_screenshot

  • Capture screenshots of the entire page or specific elements
  • Inputs:
    • name (string, required): Name for the screenshot
    • selector (string, optional): CSS selector for element to screenshot
    • width (number, optional, default: 800): Screenshot width
    • height (number, optional, default: 600): Screenshot height

@rmensing
Copy link

Not sure if this is within the scope here.

We know that the Claude desktop app, through MCP, has access to images using the screenshot tool in the puppeteer server, although the image is available to Claude in base64 encoded format.

We could probably modify the filesystem server to also make images available to Claude also.

The question is, though, how can we actually give it to Claude so he can view it the same as if we uploaded an image directly in chat?

@deksden
Copy link

deksden commented Nov 30, 2024

@rmensing I m sure you have to serve it base64 encoded, same as for any API request.

MCP server only provides some data ( context, prompt, tools) to host/client, and its up to you how you communicate with LLM. No difference where you get data: from MCP server or just performing generic API call to LLM.

@rmensing
Copy link

rmensing commented Dec 1, 2024

@deksden In chat it cannot directly use the base64 formatted image. Although it could write code to convert it back to the image format, it still does not have the ability to view it. The only reason base64 is used in the API is so it can be sent in a text format. once it is received it is converted back to an image more than likely.

There would need to be the functionality in the Claude desktop app MCP client implementation for it to attach the image, or other documents for that matter, the same as you do in the chat interface. Unfortunately we don't have any documentation on the interface between the MCP client code and the desktop app as far as what the desktop app can or can't do. My guess is that they could probably code it to accept the images and view them, it's only a matter of will they. It could be a case of them having a reason to not implement it.

@jspahrsummers
Copy link
Member

As this is supported in MCP, and demonstrated in the examples, I'm going to close this out as resolved. Please start a discussion if you have other questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants