Feat: Optionally extract raw html instead of parse5 serialization #42

Open
wants to merge 1 commit into
base: master

@vbraun (Contributor) commented Aug 3, 2020

This adds a rawHtml option to extract the actual source HTML instead of the parse5 round-tripped version. I'm not sure it's a good idea, but I'm trying to replace a gettext extractor that does just this.

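import { GettextExtractor, HtmlExtractors } from 'gettext-extractor';

const extractor = new GettextExtractor();

extractor
    .createHtmlParser([
        HtmlExtractors.elementContent('translate, [translate]', {
            attributes: {
                context: 'translate-context',
                comment: 'translate-comment',
            },
            // Option proposed in this PR: keep the template source as-is
            rawHtml: true,
        }),
    ])
    .parseFilesGlob('./src/**/*.html');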

Documentation and lint still need fixing, but maybe it's not a good idea to start with? ;-)

@lukasgeiter (Owner) commented

Can you go into a bit more detail on the problem your change addresses? Is this just about HTML entities (similar to #36), or do you have other issues with the extracted contents?

@vbraun (Contributor, Author) commented Aug 6, 2020

Yes, it's about HTML entities: round-tripping through parse5 loses information. In particular, the angular-gettext-cli extractor doesn't do that and … extracts as literal. As a first step toward replacing it, I wanted to reproduce the extracted po file in an existing project, and found that I was unable to do so for various HTML entities.

Now one might argue that this is the correct way of doing things, since the DOM does that as well and you are going to match el.innerText / el.innerHTML anyway. And I'm open to editing my po files to move HTML entities around. Still, it seems that for full flexibility one should at least be able to have po files where the msgid is one of

  • innerHTML
  • innerText
  • actual source of the template
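
To make the three candidates concrete, here is a minimal sketch of how they can diverge for one template fragment. Everything in it is hypothetical and for illustration only: `decodeEntities`, `escapeText`, `msgidCandidates`, and the tiny entity table are not part of gettext-extractor or parse5, and the decoding only approximates what a real HTML parser does.

```typescript
// Hypothetical entity table; a real parser knows the full named-entity list.
const NAMED_ENTITIES = new Map<string, string>([
    ['amp', '&'],
    ['lt', '<'],
    ['gt', '>'],
    ['hellip', '\u2026'],
    ['nbsp', '\u00a0'],
]);

// Decode the few named entities above; unknown references are left untouched.
function decodeEntities(source: string): string {
    return source.replace(/&([a-z]+);/g, (match: string, name: string) =>
        NAMED_ENTITIES.get(name) ?? match);
}

// Re-escape decoded text roughly the way a serializer would (only &, <, > here).
function escapeText(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Build the three msgid candidates for a text-only fragment.
function msgidCandidates(source: string) {
    const innerText = decodeEntities(source);
    return {
        source,                           // actual source of the template
        innerText,                        // entities decoded
        innerHTML: escapeText(innerText), // decoded, then re-serialized
    };
}

const candidates = msgidCandidates('Tom &amp; Jerry&hellip;');
console.log(candidates);
```

With this input the three strings all differ: the source keeps both entities, the re-serialized form keeps only the escaped ampersand, and the text form keeps neither. That gap is why an existing po file built from raw source cannot be reproduced after a parse and serialize round trip.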

Slightly related question: getElementContent has some special handling for <, >, and &, but not for &nbsp; (U+00A0), even though that's also in the spec: https://html.spec.whatwg.org/multipage/parsing.html#escapingString
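
For reference, a rough sketch of the spec's escaping steps for text content (not attribute mode), as I read them. The function name `escapeTextPerSpec` is made up for this example and this is not the code in getElementContent. The point is only that U+00A0 is replaced alongside the other three characters.

```typescript
// Text-mode string escaping, following the steps of the linked spec section.
// Assumption: attribute mode is out of scope, so quotes are not escaped.
function escapeTextPerSpec(text: string): string {
    return text
        .replace(/&/g, '&amp;')       // must run first so later entities survive
        .replace(/\u00a0/g, '&nbsp;') // non-breaking space
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

const escaped = escapeTextPerSpec('a\u00a0<b> & c');
console.log(escaped);
```

Replacing the ampersand first matters: doing it last would turn the freshly written `&nbsp;` into `&amp;nbsp;`.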
