Solve rendering problem when iframe loads content from https or touches storage/cookies #92

weatherpattern · 2018-01-27T02:28:07Z

Scanner has rendering problems because we render diffs by getting an HTML string back from the differ and shoving it into an iframe. Unfortunately, that means the frame runs with extremely restricted security, so if it tries to load scripts or styles or images from https, it will fail, and if it tries to touch storage or cookies, it will throw an exception. A lot of the sites aren’t well coded to account for that, so the JavaScript code in them starts running but never finishes.

Solving this is kinda complicated and I’m not sure whether it is best done in the UI or DB layer

See Menu on Jan 18 scrape:
https://monitoring.envirodatagov.org/page/f4409490-e8f0-455d-ad8d-7ae0cf6aab74/0ad4bcbc-d751-44b0-bc41-434755416868..f690d39f-7f71-4fc7-93a4-89e7f01dfce8

Also:
https://monitoring.envirodatagov.org/page/dbd5f818-44a6-486c-a3ec-9b49cb6b72d0/9b7a5215-e2b9-42d1-857c-a2ef874c4f5b..d4cb9d0f-3042-4bfb-adde-e335f75a1ab0

Related to:
edgi-govdata-archiving/web-monitoring-db#91

The text was updated successfully, but these errors were encountered:

This got much more complex than anticipated! There are a few major bits here: - We now have a `MediaType` type for modeling and comparing different media types. It includes a concept of canonical types, so `parseMediaType('text/html').equals(parseMediaType('application/xhtml')) === true`. If we can reliably clean this up in the DB, we can get rid of that, but it's useful for now. Similarly, we can also get media types for file extensions (it only supports a minimal list of extensions we know we have in the DB). Once the DB has content types for all versions, we can get rid of that, too. - The `diffTypes` module now has a `diffTypesFor(mediaType)` method, allowing us to get the diffs that are appropriate to present for a given type of content. We've also added three "raw" diff types which indicate we should simply show the content in an `<iframe>` because we don't know how to diff it (or if the two versions in question have differing content types, which often happens when a PDF or image responded with an HTML error page [like a 404] for a given version). - The `SelectDiffType` form control now takes a list of diff types (from the above `diffTypesFor()`) to show. - There's a little extra complexity in retrieving an actual diff where we check to see if the diff type has an actual diff service associated with it ("raw" types don't, because they represent viewing the actual content, not a diff). We request the actual raw content there because we need to sniff it for HTML (when we present HTML, we need to set a `<base>` tag so associated images, styles, and scripts work). This part is definitely a little janky, but I don't see a clear improvement without a lot more other changes. It slightly exacerbates the existing race condition issue. The sniffing HTML situation should/could probably be fixed alongside edgi-govdata-archiving/web-monitoring#92, where we basically need some kind of proxy or pre-cleaned-up version to render. - Finally, this makes a slight tweak to the 'no changes' message; it shows a message indicating two versions are exactly the same if their hashes are identical, but shows the old message (with slightly more detail) if the hashes are different but `change_count` is 0. Fixes #179.

Mr0grog · 2018-02-06T20:43:07Z

To be more clear on this one, we need a proxy service to load the pages and diffs from (and that manipulates the HTML as needed along the way). This way, we load the pages from that service instead of stuffing client-side manipulated source code directly into the iframe—so the iframe actually has an origin and is allowed to do more things.

It’s not clear to me whether that proxy should be part of the UI’s existing server layer (because it is addressing concerns specific to the UI) or if it should live in the DB project (because it could probably be done in a relatively display-agnostic way and other potential UIs might benefit from it, or even just because we don’t want to put quite so much server logic into the UI project).

This also has other implications in the UI for how we handle getting metadata about a diff and potentially in processing for how we output these diffs and how we pack that metadata into different kinds of outputs (there is only one kind of output right now).

Add a setting (on by default) to transform subresource URLs (styles, scripts, images, etc.) such that they load from the Internet Archive’s Wayback Machine instead of from their original URL, which might be either missing or different from what it was at the time you are looking at a snapshot of. This also handily works around some (but not all) of the URL security issues in edgi-govdata-archiving/web-monitoring#92.

stale · 2019-01-09T22:37:08Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog · 2019-01-10T00:00:12Z

This is still definitely a major failing of our tools.

stale · 2019-07-09T00:21:01Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

weatherpattern changed the title ~~Solve rendering problem when iframe loads content from https and or touch storage/cooikes~~ Solve rendering problem when iframe loads content from https or touches storage/cookies Jan 27, 2018

This was referenced Feb 5, 2018

Only request diffs for types we *can* diff edgi-govdata-archiving/web-monitoring-ui#200

Merged

Block page redirects and ask user how to proceed edgi-govdata-archiving/web-monitoring-ui#201

Open

Mr0grog mentioned this issue Jul 31, 2018

Strip inline style attributes when "Remove Formatting" is on edgi-govdata-archiving/web-monitoring-ui#230

Closed

Mr0grog mentioned this issue Sep 23, 2018

Load page subresources from Wayback Machine edgi-govdata-archiving/web-monitoring-ui#289

Merged

stale bot added the stale label Jan 9, 2019

stale bot removed the stale label Jan 10, 2019

stale bot added the stale label Jul 9, 2019

Mr0grog added bug never-stale labels Jul 15, 2019

stale bot removed stale labels Jul 15, 2019

Mr0grog mentioned this issue May 18, 2020

Prevent opening up intra-page links in separate tab/window in diff view closes #494 edgi-govdata-archiving/web-monitoring-ui#540

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Solve rendering problem when iframe loads content from https or touches storage/cookies #92

Solve rendering problem when iframe loads content from https or touches storage/cookies #92

weatherpattern commented Jan 27, 2018

Mr0grog commented Feb 6, 2018 •

edited

Loading

stale bot commented Jan 9, 2019

Mr0grog commented Jan 10, 2019

stale bot commented Jul 9, 2019

Solve rendering problem when iframe loads content from https or touches storage/cookies #92

Solve rendering problem when iframe loads content from https or touches storage/cookies #92

Comments

weatherpattern commented Jan 27, 2018

Mr0grog commented Feb 6, 2018 • edited Loading

stale bot commented Jan 9, 2019

Mr0grog commented Jan 10, 2019

stale bot commented Jul 9, 2019

Mr0grog commented Feb 6, 2018 •

edited

Loading