Translated comments shown in final document #9

Open
nmontesoro opened this issue Feb 15, 2023 · 1 comment

nmontesoro commented Feb 15, 2023

The contents of comments are translated correctly, but when the BeautifulSoup object is reconstructed the comment delimiters are lost, so the final translated document displays the comment text when opened in a web browser.

The issue, as far as I can work out, is that neither itag_of_soup nor soup_of_itag differentiates between a bs4.element.NavigableString and a bs4.element.Comment (which inherits from the former).

So itag_of_soup returns a str object regardless of whether it's processing a NavigableString or a Comment. When soup_of_itag is called, it checks whether the object passed to it is an instance of str and, if so, constructs a NavigableString. For comments, this drops the <!-- --> delimiters from the final document.
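
The effect can be reproduced with Beautiful Soup alone, without any translation step. The sketch below is not translatehtml's actual code: rebuild_lossy only mimics the behaviour described above, and rebuild_typed is a hypothetical alternative that rebuilds each string with its original class. The names HTML, round_trip, rebuild_lossy and rebuild_typed are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

HTML = "<p><!-- hidden --><b>shown</b></p>"


def rebuild_lossy(node):
    """Mimic the reported behaviour: every str becomes a plain NavigableString."""
    return NavigableString(str(node))


def rebuild_typed(node):
    """Hypothetical alternative: rebuild the string with the node's own class."""
    return type(node)(str(node))


def round_trip(html, rebuild):
    """Parse html, replace every string node via rebuild, and serialise again."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect the string nodes first, because the tree is modified in the loop.
    strings = [n for n in soup.descendants if isinstance(n, NavigableString)]
    for node in strings:
        node.replace_with(rebuild(node))
    return str(soup)


# A Comment passes all three isinstance checks, which is why a bare
# isinstance(x, str) test cannot tell it apart from ordinary text.
comment = BeautifulSoup(HTML, "html.parser").p.contents[0]
is_comment = isinstance(comment, Comment)
is_navstr = isinstance(comment, NavigableString)
is_str = isinstance(comment, str)

lossy = round_trip(HTML, rebuild_lossy)      # comment text becomes visible
preserved = round_trip(HTML, rebuild_typed)  # comment delimiters survive
```

Checking for Comment before falling back to the generic str case, or carrying the node's class through the intermediate representation, would be two ways to apply the same idea inside the library; which one fits better depends on how itag_of_soup and soup_of_itag are structured.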

Here's an example:

import argostranslate.translate
import translatehtml

# Original "file"
content = """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <!-- This should not be seen in a browser -->
        <h1>Welcome to Test!</h1>
    </body>
</html>
"""

# Define languages for translation from English to Hindi
en = argostranslate.translate.get_language_from_code("en")
hi = argostranslate.translate.get_language_from_code("hi")
ut = en.get_translation(hi)

# Translate the file with translate_html
content = translatehtml.translate_html(ut, content)

# Write the translated file
with open("test.html", "wt", encoding="utf-8") as fp:
    fp.write(str(content))

[Screenshot: Original file]
[Screenshot: Translation]

nmontesoro (Author) commented

A workaround might be to remove the comments from the tree before calling translate_html, like so:

from bs4 import BeautifulSoup, Comment

# Parse the original HTML and drop every comment node before translating
soup = BeautifulSoup(content, "html.parser")
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

Then pass str(soup) instead of content to translate_html.
