Figure out how to display images from <en-media> tags inline in Datasette #5

Open
simonw opened this issue Oct 11, 2020 · 6 comments
Labels
enhancement New feature or request

Comments


simonw commented Oct 11, 2020

Relates to #1. Evernote XML looks like this:

<?xml version="1.0"?>
<en-note>
  <div>This note includes two images.</div>
  <div>
    <b>The Python logo</b>
  </div>
  <div>
    <en-media hash="61098c2c541de7f0a907c301dd6542da" type="image/svg+xml" width="125"/>
  </div>
  <div>
    <b>The Evernote logo</b>
  </div>
  <div>
    <en-media hash="91bd26175acac0b2ffdb6efac199f8ca" type="image/svg+xml" width="125"/>
  </div>
</en-note>

That hash is the md5 we use to store resources. It should be possible to turn these into embedded image tags, especially if done in conjunction with the https://github.com/simonw/datasette-media plugin.

simonw added the enhancement label on Oct 11, 2020

simonw commented Oct 11, 2020

We could even do server-side thumbnailing for some of these images, but I'm inclined to serve up the full size ones and set a width on the image element based on the width attribute on <en-media>.
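A minimal sketch of that rewrite, assuming the note body parses as plain XML (real exports may carry a DOCTYPE or entities that `xml.etree.ElementTree` rejects). The `MEDIA_PREFIX` path and the `enml_to_html` name are hypothetical; whatever URL datasette-media is configured to serve would go there. The `width` attribute is carried over unchanged onto the `<img>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical URL prefix; substitute the path your media plugin actually serves.
MEDIA_PREFIX = "/-/media/evernote"


def enml_to_html(enml: str, media_prefix: str = MEDIA_PREFIX) -> str:
    """Turn each <en-media> element into an <img> pointing at the resource."""
    root = ET.fromstring(enml)
    # Materialise the list first so renaming tags cannot disturb iteration.
    for el in list(root.iter("en-media")):
        md5_hash = el.attrib.pop("hash")
        el.attrib.pop("type", None)  # MIME type is not a valid <img> attribute
        el.tag = "img"
        el.set("src", f"{media_prefix}/{md5_hash}")
        # Any width="..." attribute is left in place, so the browser scales
        # the full-size image rather than the server thumbnailing it.
    root.tag = "div"  # <en-note> itself means nothing to a browser
    return ET.tostring(root, encoding="unicode", method="html")
```

Calling `enml_to_html()` on the sample note above should yield two `<img>` elements, each with `width="125"` and a `src` ending in the resource's md5 hash.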


simonw commented Oct 11, 2020

Alternatively, rather than relying on datasette-media this could base64-embed the images. evernote-to-sqlite could register itself as a Datasette plugin that knows how to do this.
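For reference, base64-embedding amounts to building a `data:` URI from the resource bytes and its MIME type. A small sketch, where `data_uri` is a hypothetical helper name and not part of evernote-to-sqlite:

```python
import base64


def data_uri(data: bytes, mime_type: str) -> str:
    """Build a data: URI suitable for an <img src="..."> attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The trade-off is page weight: every image is inlined at roughly 4/3 of its original size and cannot be cached separately by the browser.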

Maybe rename the column to evernote_content and register a render cell hook that knows how to rewrite those note bodies so that they are visible?

Might need to feed them through Bleach too, just in case any nasty code can get into them.
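To illustrate the kind of allowlist filtering Bleach performs, here is a standard-library sketch. It is explicitly not Bleach and not a substitute for it: a hand-rolled sanitizer is likely to miss edge cases a maintained library handles. The tag and attribute allowlists are assumptions chosen for the sample note, not a vetted policy.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists for illustration only.
ALLOWED_TAGS = {"div", "p", "b", "i", "br", "a", "img", "en-media"}
ALLOWED_ATTRS = {"href", "src", "width", "hash", "type"}
VOID_TAGS = {"br", "img"}
DROP_CONTENT = {"script", "style"}  # drop these tags and everything inside


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
            and not (value or "").strip().lower().startswith("javascript:")
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def sanitize(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Inline `style` and `on*` attributes are dropped because they are not in the attribute allowlist, which also addresses the clipped-webpage styling problem.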


simonw commented Oct 11, 2020

Or... I could do this client-side. JavaScript that looks for <en-media> tags and fetches the data using fetch() wouldn't be too hard to write.


simonw commented Oct 11, 2020

Maybe the best way to do this is with a custom route, /-/evernote/note-id - that way I can clean the HTML and resolve the other things in the <en-note> structure without using render_cell() and the like. My concern about using render_cell() is that it could lead to weird security problems when combined with ?sql= queries.
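Datasette's `register_routes()` plugin hook takes (regex, view function) pairs, so the route could be expressed as a pattern with a named group. The sketch below covers only the URL-matching half in plain Python; `NOTE_ROUTE` and `note_id_from_path` are hypothetical names, and the accepted character set for note IDs is an assumption.

```python
import re

# Hypothetical pattern for /-/evernote/<note-id>; anything up to the next
# slash is treated as the note ID.
NOTE_ROUTE = r"^/-/evernote/(?P<note_id>[^/]+)$"


def note_id_from_path(path: str):
    """Return the note ID for a matching path, or None if it does not match."""
    match = re.match(NOTE_ROUTE, path)
    return match.group("note_id") if match else None
```

Because the view only ever looks up a note by ID, it never renders HTML produced by an arbitrary `?sql=` query, which is the concern raised above.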


simonw commented Oct 11, 2020

... but it's still important to be able to get to the rendered note directly from the browse notes /evernote/notes page. Maybe use a simple render_cell() hook that just knows how to generate the link to the rendered note page?
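A link-only hook would need little more than an escaped anchor tag. A sketch of the HTML it might produce, assuming the /-/evernote/note-id route from the previous comment; `note_link_html` and the link label are hypothetical. Inside a real `render_cell()` the result would need to be wrapped in `markupsafe.Markup` so that Datasette does not escape it again.

```python
from html import escape
from urllib.parse import quote


def note_link_html(note_id, label: str = "View rendered note") -> str:
    """Build an <a> tag pointing at the (assumed) rendered-note route."""
    # Percent-encode the ID so unusual characters cannot break out of the path.
    href = "/-/evernote/" + quote(str(note_id), safe="")
    return f'<a href="{escape(href)}">{escape(label)}</a>'
```

Since the hook emits only a link built from an escaped ID, it stays safe even if it fires on the results of an arbitrary SQL query.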


simonw commented Oct 12, 2020

Here's my first attempt at a plugin for this:

from datasette import hookimpl
# jinja2.Markup was removed in Jinja2 3.1; markupsafe.Markup is the same class
from markupsafe import Markup

START = "<en-note"
END = "</en-note>"
TEMPLATE = """
<div style="max-width: 500px; white-space: normal; overflow-wrap: break-word;">{}</div>
""".strip()

# Client-side: replace each <en-media> with an <img> built from the
# base64-encoded resource fetched from Datasette's JSON API
EN_MEDIA_SCRIPT = """
Array.from(document.querySelectorAll('en-media')).forEach(el => {
    let hash = el.getAttribute('hash');
    let type = el.getAttribute('type');
    let path = `/evernote/resources_data/${hash}.json?_shape=array`;
    fetch(path).then(r => r.json()).then(rows => {
        let b64 = rows[0].data.encoded;
        let data = `data:${type};base64,${b64}`;
        el.innerHTML = `<img style="max-width: 300px" src="${data}">`;
    });
});
"""


@hookimpl
def render_cell(value, table):
    if not table:
        # Don't render content from arbitrary SQL queries, could be XSS hole
        return
    if not value or not isinstance(value, str):
        return
    value = value.strip()
    if value.startswith(START) and value.endswith(END):
        # Drop the leading "<en-note" and the trailing "</en-note>"
        trimmed = value[len(START) : -len(END)]
        # Drop the rest of the opening tag (its attributes and closing ">")
        trimmed = trimmed.split(">", 1)[1]
        # Replace those horrible double newlines
        trimmed = trimmed.replace("<div><br /></div>", "<br>")
        # Markup marks the HTML as safe, so it is rendered unescaped
        return Markup(TEMPLATE.format(trimmed))


@hookimpl
def extra_body_script():
    return EN_MEDIA_SCRIPT

It works!

It does however demonstrate that Evernote's "clip this webpage" feature means there is a LOT of weird HTML that can get into a note. It looks like they've filtered out the scripts but I wouldn't bet on it - they certainly don't filter out many of the inline styles. So running Bleach is almost certainly a good idea.
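The script in the plugin above reads `rows[0].data.encoded`, which relies on Datasette returning binary columns in JSON as an object of the form `{"$base64": true, "encoded": "..."}`. A Python sketch of decoding that shape and checking the payload against the `<en-media>` hash, on the assumption that the hash is the md5 of the raw resource bytes; `resource_bytes` is a hypothetical helper name.

```python
import base64
import hashlib


def resource_bytes(rows, expected_md5=None) -> bytes:
    """Decode the first row's binary `data` column from Datasette JSON."""
    cell = rows[0]["data"]
    if not (isinstance(cell, dict) and cell.get("$base64")):
        raise ValueError("data column is not a base64-wrapped binary value")
    raw = base64.b64decode(cell["encoded"], validate=True)
    # Assumption: the en-media hash is the md5 hex digest of the raw bytes.
    if expected_md5 is not None and hashlib.md5(raw).hexdigest() != expected_md5:
        raise ValueError("resource does not match the expected md5 hash")
    return raw
```

A server-side renderer could run this check before embedding a resource, which is not something the client-side script attempts.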

simonw changed the title from "Figure out how to display images en-media tags inline in Datasette" to "Figure out how to display images from <en-media> tags inline in Datasette" on Oct 12, 2020