[transformation] alt text with both single and double quotes has double-escaped double quotes #112

Open
zkamvar opened this issue Mar 28, 2023 · 2 comments

zkamvar commented Mar 28, 2023

tmp <- withr::local_tempfile()
writeLines(c("![\"data\" is a 3 by 3 numpy 'array'](../fig/python-zero-index.svg)"), tmp)
pegboard::Episode$new(tmp)$use_sandpaper()$show()
#> ---
#> ~
#> ---
#> 
#> ![](fig/python-zero-index.svg){alt="\\"data\\" is a 3 by 3 numpy 'array'"}

Created on 2023-03-28 with reprex v2.0.2

zkamvar commented Mar 31, 2023

This fails because we locate the curly-brace attribute block, take its contents, and insert them into an HTML <img> tag so that we can use {xml2}'s HTML parsing capabilities to extract the alt text in set_alt_attr():

pegboard/R/get_images.R

Lines 80 to 98 in e815ea1

set_alt_attr <- function(images, xpath, ns) {
  attrs <- xml2::xml_find_all(images, glue::glue("./{xpath}"), ns = ns)
  # We have the text of the alt text here, but it's possible that the alt text
  # was separated on different lines
  attr_texts <- xml2::xml_text(attrs)
  no_closing <- !grepl("[}]", attr_texts)
  if (any(no_closing)) {
    fixed_text <- purrr::map_chr(attrs[no_closing], get_broken_attr_text, ns)
    attr_texts[no_closing] <- fixed_text
  }
  htmls <- paste(gsub("[{](.+)[}]", "<img \\1/>", attr_texts), collapse = "\n")
  htmls <- xml2::read_html(htmls)
  alts <- xml2::xml_find_all(htmls, ".//img")
  alts <- xml2::xml_attr(alts, "alt")
  purrr::walk2(images, alts, function(img, alt) {
    if (!is.na(alt)) xml2::xml_set_attr(img, "alt", alt)
  })
  invisible(images)
}

The problem is that we are going from XML to HTML: the quotes are escaped in the XML, but once the text is extracted they look like ordinary quotes in the HTML, so the attribute value is truncated at the first inner quote.
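A rough illustration of that truncation, using only base R. It assumes the attribute text reaches the gsub() call with its inner quotes already unescaped. The regular expression standing in for the HTML parser is a simplification for illustration, not what {xml2} does internally.

```r
# Assumed input: the attribute block as plain text, inner quotes unescaped
attr_text <- "{alt=\"\"data\" is a 3 by 3 numpy 'array'\"}"

# Same substitution as in set_alt_attr(): wrap the attributes in an <img> tag
img_tag <- gsub("[{](.+)[}]", "<img \\1/>", attr_text)

# Stand-in for an attribute parser: the value ends at the first closing quote
naive_alt <- sub(".*alt=\"([^\"]*)\".*", "\\1", img_tag)

print(img_tag)
print(naive_alt)
```

The opening quote of the attribute is immediately followed by the first inner quote, so a parser that stops at the first closing quote never sees the rest of the alt text.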

One solution is to rewrite this so that we search for the boundaries between attributes (e.g. splitting on =["']) and then reconstruct the values from there, but it's a big enough undertaking that I am willing to hold off on this until we have the lessons transformed.
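A minimal sketch of what the boundary-splitting approach could look like, in base R. The function name split_attrs() is hypothetical and not part of pegboard. It assumes attributes are written as key="value" or key='value' separated by whitespace, and it would still be fooled by alt text that itself contains something like ` key="`.

```r
# Hypothetical helper: split "{key="value" key2='value2'}" on attribute
# boundaries instead of handing the text to an HTML parser.
split_attrs <- function(attr_text) {
  # Drop the surrounding curly braces, then pad with a leading space so that
  # every attribute, including the first, is preceded by whitespace
  inner <- sub("^[{]", "", sub("[}]$", "", trimws(attr_text)))
  inner <- paste0(" ", inner)
  # Each whitespace + key + = + quote sequence marks the start of an attribute
  m <- gregexpr("[[:space:]][[:alnum:]_-]+=[\"']", inner)[[1]]
  if (m[1] == -1) return(character(0))
  starts <- as.integer(m)
  ends <- c(starts[-1] - 1L, nchar(inner))
  chunks <- trimws(substring(inner, starts, ends))
  keys <- sub("=.*$", "", chunks)
  # The value sits between the opening quote and the final (closing) quote,
  # so any quotes inside the value are kept as they are
  vals <- substring(chunks, nchar(keys) + 3L, nchar(chunks) - 1L)
  stats::setNames(vals, keys)
}

res <- split_attrs("{alt=\"\"data\" is a 3 by 3 numpy 'array'\" width=\"50%\"}")
print(res)
```

Because the value is taken as everything up to the last quote before the next boundary, the inner single and double quotes never need to be escaped or re-parsed.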
