-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make async clipboard APIs (read/write) to sanitize interoperably with setData/getData for text/html #150
Comments
I'm supportive of having |
How did the discrepancy arise in Chrome? Part of the plan with https://wicg.github.io/sanitizer-api/ is to get this all standardized. Does that meet the requirements you have? |
I don't know much about the specific question here, but on clipboard vs Sanitizer API: This is a v2 item. One major block of functionality missing is that the clipboard sanitizers do some form of style normalization, while the Sanitizer is presently HTML only. If you want to help this along: I haven't found any documentation from any browser about how the style normalization actually works. It seems all browser engines have implemented something along those lines, and somehow they all forgot to document what they actually do. Describing the existing functionality would help a lot in getting this into Sanitizer API. |
The I think the issue here is that due to strict sanitization performed by async clipboard APIs (that don't mutate the DOM directly) websites lose style information, meta tags etc when content is copied from Chromium and pasted into native apps. Office online & Office native apps use proprietary style formats in the HTML and the browser's in-built sanitizer removes this content from the HTML markup when it's read/written via async clipboard APIs. The sanitizer API can certainly be used by web authors to sanitize the HTML content if they want to, but we should give them the flexibility to decide whether to use the Browser sanitizer API or their own proprietary sanitizer service. |
No. One of the reasons we sanitize the markup content is to strip way any content that's invisible to the user so we strip away any invisible content as determined by the style & layout information. We're not gonna change that. |
@rniwa I don't think that is the issue here. Although Safari doesn't remove any meta, style etc tags from the custom pasteboard format which it creates during |
We only apply sanitization when the content is read or written across cross origin or cross applications, not when read or written within a single origin. |
@snianu's understanding seems correct (this was recently documented in lots of misc CL comments). fwiw I've drafted some documentation for my understanding of Chrome's clipboard, including a brief history/summary of how/what we sanitize. |
This means you trust the clipboard content during read? I think clipboard content should always be treated as untrusted. It doesn't matter if Safari wrote the html payload to the clipboard or some malicious or trusted native apps did. |
We want to standardize the clipboard sanitization procedure as it has caused issues for many popular apps while using async clipboard HTML read/write. The spec says that during write, we should only write the sanitized payload of the item, but doesn't say anything about what the process of sanitization looks like and what kind of data is expected in the clipboard item. e.g. when writing HTML payload using the async clibpoard write, what should be the format of the HTML markup string? Should it be a document fragment or complete HTML document? Also FWIW currently at least in Chromium & Firefox, we write unsanitized HTML content to the clipboard when web developers use To better describe the sanitization & clipboard write process for HTML payload we propose the following:
[1] Here is a GIF that shows the style loss when user copies and paste from Excel online to win32 Excel app: https://drive.google.com/file/d/1Nsyp1rUKc_NF4l0n-O05snAKabHAKeiG/view |
Not entirely related, but I'll chime in about the state of image (PNG) sanitization to provide a +1 for updating the spec to make the sanitization process more explicit. The spec mentions that images should be "safe", but (as you noted) is silent on the specifics of this sanitization, leading to inconsistencies across browsers. Currently in Chrome, images are sanitized on both read and write. Skia is used to strip almost all metadata, retaining only what fits in an We've proposed to change Chrome to only sanitize images on write to align with other browsers. Note that this allows a site to read an unsanitized PNG directly from the system clipboard, so the "only sanitize on write" approach there is not acceptable for HTML data (and seems to be the opposite direction of what's being proposed above for HTML). In updating the spec to be more explicit about sanitization procedures (which I agree we should do) we'll have to spell out the process for each type we care about, since there is no prescriptive sanitization process that works for all types. |
@snianu thanks for tackling this! Could you detail step 1 a bit more? If I do something like |
Early on during our work on the sanitizer, we've been eying the issue of incoherent, implicit sanitizer usage during copy&paste. I'm very much in favor of specifying things more explicitly. |
Be aware that start and end fragment comment tags are Windows-specific. On other operating systems (e.g. Ubuntu), that will have to be handled differently.
How is the HTML header info platform-specific?
|
Presumably |
Sure. Step 1 basically creates the document object from the HTML string provided by the web authors. It could just be a document fragment or a full HTML document . We use the DOMParser to create a well formed HTML document from the input, and then use the sanitizer API to strip out harmful contents from the markup. After this step, we insert platform specific header info into the serialized document and then write it to the clipboard. Here is a more detailed proposal for the HTML sanitization:
On Linux we have:
etc... |
@snianu I would also like to understand why the fragment parser does not suffice, as it seems to preserve the elements you indicated were dropped. |
If we parse the HTML markup string in the context of |
But as I showed elements such as |
Sept. WG Meeting: discussed @snianu's proposal above. Good learnings. No immediate actions/resolutions. |
@annevk maybe I can clarify... @snianu's goal is process markup like this:
Into a DOM that represents all the content basically as written above including the attributes of the HTML element. If we use the fragment parser we need to provide a context element. Let me know if I'm mistaken but I don't see anything in the spec about how we could we can initialize the fragment parser so that it sets up the HTML parser in the initial insertion mode. So the best we could do with the fragment parser given the second markup example I provided is to use an |
I must point out that those start fragment & end fragment marker is Windows specific construct that doesn't apply to Apple's platforms. In Apple's platforms, it is pasteboard data provider's responsibility to serialize HTML in such a way as to preserve all its styles (typically with inline style content attributes). |
@rniwa I agree... I wasn't trying to make any particular point by including those, I was just grabbing some real markup as supplied by Excel Online to setData. |
I get that. But your primary point is that we need some special way of initializing the fragment parser. I'm saying that this is not the case in Apple's platforms so this issue is at best applicable to some platforms; put it differently, the issue presents itself in different ways in different platforms. I don't think there is a way to create an interoperable behavior here given what gets put into the system's pasteboard / clipboard is dependent on that specific platform. |
I'm glad you brought that up. Actually, the discussion on the fragment parser is a bit of a tangent and a means to an end. The primary point is captured in Anupam's first sentence in this comment:
I do think it's possible to standardize what authors should expect to be sanitized away from the data that they supply to navigator.clipboard.write for our well-known mime-types. So the key point in Anupam's post is really this:
I believe his initial thinking is to sanitize clipboard using the Sanitizer's default configuration, which leaves things like @whsieh covered today in the Web Editing WG meeting that Safari prefers to also remove content that would be invisible like comments, display:none, opacity: 0, etc. So maybe we could define a configuration and/or upgrades to the sanitization algorithm that does that too if we decide that's important aspect to keep. Any objection to standardizing what a "sanitized copy" means to remove ambiguity from the clipboard API's write algorithm? |
This is an area of the browser engine (and the operating system) that we'd like to continue to make improvements on so I don't think we want to prematurely shoehorned into a very specific algorithm. Furthermore, even if someone were to make a concrete proposal and we were to agree with the generic algorithm in principle, it's very much unlikely that anything remotely close to it will be implemented in WebKit in the foreseeable feature since there are specific concerns and constraints we're dealing with Apple's platforms. |
I'd be very much in favor of removing ambiguity from the term "sanitized copy". It's very vague right now. @rniwa Can you share some of the constraints? Maybe there's a way to remove some grade of ambiguity while respecting those constraints? |
That's just your opinion.
I've mentioned how to get this time and time again. You can literally launch Microsoft Office products on macOS and copy stuff and see what kind of markup is put into the pasteboard. I'd emphasize my point once again: Different browser vendors have different priorities and as such, we're not interested in standardizing copy & paste sanitization behavior at this point since we clearly have differing opinion about what the proper level of protection for user's privacy or how much value should be put into the web & app compatibilities. |
Yes, Word and other Office native apps do provide a local image urls, but that doesn't expose user's mailing address. I think we have a separate issue to address this which might alleviate your concern regarding the user's home directory and temp folder. |
I'm done explaining our position here. We are under no obligation to agree to your proposals. We're not gonna change our implementation, and that's the end of this discussion. |
The Web Editing Working Group just discussed The full IRC log of that discussion<Travis> Topic: Sanitize clipboard issue<Travis> github: https://github.com//issues/150 <Travis> BoCupp: what do we need from the WG on this issue? <Travis> snianu: We will be taking this discussion to the chromium security group (for chromium only). <Travis> BoCupp: pastes are unpredictable... <Travis> .. can't tell if a 3rd party messes with the clipboard. <Travis> .. the question is whether there's some expected behavior from App to browser... <Travis> whsieh: We were thinking about adding things to the spec to fix compatibility in the short term. But, if we keep this the same (left to user agent discretion), we are OK with that. <Travis> BoCupp: We don't think that keeping the spec as-is will be harmful, etc. Propose we can close the issue now. <Travis> ACTION: BoCupp to close this issue; keep whsieh appraised of the security review (as FYI). |
Per the log of the last Web Editing WG meeting above, we agreed to leave the details of sanitization up to the UA, i.e. we won't attempt to standardize it at this time. Additionally, to leave room for Safari to implement cross-origin restrictions or whatever policy they would like to use to govern access to the pickled formats (which are also unsanitized) @snianu will land the PR for clipboard pickling without mentioning the We don't think this will affect the developer experience. We do expect that depending on whatever sanitization level the UA chooses to apply will just change the fidelity of the markup available to a paste operation, but developers don't need to write different code and do need to deal with any shape of markup from the clipboard already. Closing this issue. If you think there needs to be any further discussion please reactivate. Thanks! |
Even if we don't define the details of sanitization, which seems like a mistake and something that needs revisiting in due course, I do think we should require the legacy and new API to behave in an equivalent manner. I think it's very surprising that they end up with different results. |
@annevk I'm not clear on what you're proposing. If we aren't standardizing (at least for now) whether or not to sanitize or what sanitizing is then what are you proposing we should change? Do you just want us to write down that both APIs should be aligned in terms of whatever sanitization steps they choose to apply? |
In addition to @BoCupp-Microsoft 's comment above, we also want to set some expectations here. We are NOT going to change anything in the legacy DataTransfer APIs as that would break our first party apps and legacy Win32 Office apps that are still being used in enterprises. If the expectation here is to align async clipboard APIs with the DataTransfer APIs, then I propose to open a new issue as from the comments above it is pretty clear that we are not going to come to an agreement. At least on Windows we want to provide the best developer and user experience. We are not going to compromise on functionality that would worsen the experience on our platform. |
@BoCupp-Microsoft right. @snianu how would moving things to a new issue help matters? |
It probably wouldn't and that is why we decided to close this issue. I'm not sure why you reopened it? |
@annevk @mbrodesser Could you please clarify Mozilla's position on providing the sanitization behavior for async read/write methods for well-known formats? From @annevk's proposal here, it sounds like you agree that we shouldn't be sanitizing |
My proposal doesn't attempt to resolve this one way or another. It merely adds support for a parallel set of formats that are never sanitized. I think for built-in formats we want:
|
I think it does affect the well-known formats at least on Safari. Consider the below case: Now, on Chrome and Firefox, regardless of what origin the author queries text/html from, DataTransfer APIs always return unsanitized HTML format, but this is different from pickled format as the standard HTML format contains platform specific headers as described here. These headers are stripped out, but the HTML markup is a full HTML document, not a sanitized fragment. With your proposal, we have to return unsanitized HTML document with platform specific headers stripped out when authors queries |
If someone copies |
The way I understood Anne's proposal, if site X wrote "text/html" and then site X reads "text/html", in WebKit, there would be a single item "text/html" that would contain the original unsanitized data upon reading. If site Y were to read the same data copied from site X, it would contain the sanitized "text/html" data. |
@snianu and I discussed offline. I don't see any of the points discussed since this post from @annevk as being a blocker for us. Both of the points mentioned:
...allow for Apple's preferred same-origin behavior, as well as our preferred behavior for Chromium, which is to allow unsanitized access to well-known formats like |
Closing this issue as it's resolved in #174 |
Hi All,
Excel online team reached out to us on several issues they found with clipboard formats. Sanitization across legacy and async clipboard APIs is not consistent for some mime types. For example, while text/html when set with setData/getData, keeps meta tags in Chrome, async clipboard APIs strip them down.
This creates an issue for online text editing applications. Target apps, processing paste, rely on meta tags to infer information about the clipboard payload source. @snianu has wrote a detailed analysis which can be found here.
A matrix view of formats can be found here.
We propose that we make async clipboard read/write serialization behaviors consistent with legacy clipboard API such as setData/getData.
The reason being browsers already expose this information through setData/getData and making async clipboard read/write to behave in a compatible way which will ease its adoption over time.
If we can all agree on this, we will follow up with a PR clarifying the behavior in the Clipboard Spec which will also take care of #140
CC: @rniwa @whsieh @dway123 @garykac @BoCupp-Microsoft @megangardner @johanneswilm
The text was updated successfully, but these errors were encountered: