Incorrect comment parsing #237

Open
prettyroseslover opened this issue Mar 2, 2025 · 1 comment


@prettyroseslover

Hello!
I came across an HTML file that contains an HTML comment inside its <style> tags.
I am parsing it with scraper 0.23.1 as follows:

use scraper::Html;

fn main() {
    let html = r#"
<style><!-- /* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]-->
<div><!-- ignore comment --></div>
<h1>word1&euro;
word2&nbsp;word3  word4</h1>
<div> word5  &nbsp;<span>word6 <br> word7</span></div>
<p><span>word8 word9 word10</span></p>
<p>Some test
message<br></p>

    "#;
    let document = Html::parse_document(html);
    println!("{:?}", document);
}

However, scraper classifies the content of <style> as a Text node, not a Comment node. This does not seem like the expected behavior to me:

Node { parent: Some(NodeId(4)), prev_sibling: None, next_sibling: None, children: None, value: Text("<!-- /* Font Definitions */\n@font-face\n\t{font-family:Calibri;\n\tpanose- [AND SO ON...]-->") },
@nathaniel-daniel

Chrome also treats HTML comments inside <style> tags as text:

<html>
    <head>
        <style>
            <!-- Comment -->
            body {
                background-color: black;
            }
        </style>
    </head>
    <body>
        <script>
            console.log(document.querySelector('style').innerText);
        </script>
    </body>
</html>

Output:


            <!-- Comment -->
            body {
                background-color: black;
            }
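
This is consistent with the HTML parsing rules: <style> is a raw text element, so `<!--` inside it does not start a comment node and the whole body arrives as a single text node. CSS parsers generally skip a top-level `<!--` / `-->` pair on their own, but if the wrapper has to be removed before further processing, a minimal sketch could look like the one below. `strip_comment_wrapper` is a hypothetical helper, not part of scraper; it assumes the style text has already been extracted (for example from the Text node shown in the issue) and it only removes a wrapper that encloses the entire content.

```rust
/// Hypothetical helper (not part of scraper): if the raw text of a <style>
/// element is fully wrapped in `<!-- ... -->`, return the CSS inside the
/// wrapper. Otherwise return the text unchanged, apart from trimming
/// surrounding whitespace.
fn strip_comment_wrapper(style_text: &str) -> &str {
    let trimmed = style_text.trim();
    match trimmed
        .strip_prefix("<!--")
        .and_then(|rest| rest.strip_suffix("-->"))
    {
        // Both delimiters were present: keep only what is between them.
        Some(inner) => inner.trim(),
        // No enclosing wrapper: leave the content as it is.
        None => trimmed,
    }
}

fn main() {
    // Shortened version of the Word-generated style text from the issue.
    let raw = "<!-- /* Font Definitions */\n@font-face\n\t{font-family:Calibri;}\n-->";
    let css = strip_comment_wrapper(raw);
    assert!(css.starts_with("/* Font Definitions */"));
    assert!(css.ends_with("{font-family:Calibri;}"));
    println!("{css}");
}
```

A comment that covers only part of the stylesheet, as in the Chrome example above, is deliberately left untouched, because removing just the opening delimiter would change the meaning of the remaining text.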
 
