doctype is broken after translating #12

Open
dingedi opened this issue May 2, 2023 · 3 comments
Labels: bug, good first issue, help wanted

Comments

@dingedi
Contributor

dingedi commented May 2, 2023

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">

comes out of the translation broken, as

html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"

image

This doctype is just one example; in general, as soon as the doctype is complex, it breaks everything.

@PJ-Finlay
Collaborator

Is this from passing the DOCTYPE tag through translate-html, or is it going through the seq2seq model? I think the soup library should preserve the DOCTYPE, but the seq2seq model probably doesn't.

@dingedi
Contributor Author

dingedi commented May 3, 2023

I had originally tried via LibreTranslate. I just tried with translate-html directly, and the problem is similar.

image

@dingedi
Contributor Author

dingedi commented May 3, 2023

I think the problem comes from itag_of_soup:

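from bs4 import BeautifulSoup

# translate_html with debug prints added. itag_of_soup, translate_tags and
# soup_of_itag are the helpers translate-html already uses.
def translate_html(underlying_translation, html):
    soup = BeautifulSoup(html, "html.parser")
    print('SOUP: ', soup)  # the DOCTYPE is still intact at this point
    itag = itag_of_soup(soup)
    print('ITAG: ', itag)  # the "<!DOCTYPE " prefix and ">" are already gone
    translated_tag = translate_tags(underlying_translation, itag)
    translated_soup = soup_of_itag(translated_tag)
    return translated_soup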

Result:

SOUP:  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<p>hello</p>
ITAG:  <class 'argostranslate.tags.Tag'> "['html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"', <argostranslate.tags.Tag object at 0x7f4ebe57f750>]"
html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"<p>Hola.</p>
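
The output above is consistent with how Beautiful Soup stores a doctype. The following is a minimal sketch of that behaviour, not the translate-html code itself. It assumes `bs4` is installed and that itag_of_soup ends up calling `str()` on string nodes (an assumption based only on the ITAG output shown). `Doctype` subclasses `NavigableString`, so `str()` on it returns only the inner text, and the `<!DOCTYPE ` prefix and closing `>` are added only when the node is serialized.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype, NavigableString

HTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">\n'
    "<p>hello</p>"
)

soup = BeautifulSoup(HTML, "html.parser")

# The doctype is a top-level Doctype node, which is a kind of NavigableString.
doctype = next(node for node in soup.contents if isinstance(node, Doctype))
is_string_node = isinstance(doctype, NavigableString)

# str() gives the bare text without the "<!DOCTYPE " prefix or the ">" suffix.
inner_text = str(doctype)

# output_ready() is what serialization uses; it restores prefix and suffix.
serialized = doctype.output_ready()

print(is_string_node)
print(inner_text)
print(serialized.strip())
```

If that assumption holds, one possible direction for a fix would be to special-case `Doctype` nodes in itag_of_soup so they are carried through with their serialized form and not sent to the translator.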

argosopentech added the bug, help wanted, and good first issue labels on May 27, 2023