.child and .last_child not working when the children are on a separate HTML line #114

Open
LeDilam opened this issue May 12, 2024 · 0 comments

LeDilam commented May 12, 2024

Node.child and Node.last_child do not return the first and last child elements of the current node when those children are on a different line of the HTML than the node's opening or closing tag.
With Node.iter(), the first and last nodes of the iterator are what we expect to get. Node.iter() seems to always work, unlike Node.child and Node.last_child.

Here is a simple example:

from selectolax.parser import HTMLParser, Node

html = '''<!DOCTYPE html>
<head>
	<title>Simple Example</title>
</head>
<body>
	<div>
		<p>text1</p>
		<p>text2</p>
		<p>text3</p>
	</div>
</body>
'''

tree = HTMLParser(html)

my_div = tree.css_first("div")

print ("My div :")
print (my_div.html)

first_child = my_div.child	# Dedicated function : BUG
# first_child = next(my_div.iter())	# Using iter() : OK
print ("first_child :")
print (first_child.html)

last_child = my_div.last_child	# Dedicated function : BUG
# *_,last_child = my_div.iter()	# Using iter() : OK
print ("last_child :")
print (last_child.html)

It doesn't work. Both .child and .last_child print only whitespace:

My div :
<div>
      <p>text1</p>
      <p>text2</p>
      <p>text3</p>
   </div>
first_child :

      
last_child :

   

But if we change the HTML so that the first child is on the same line as the opening tag of the div:

html = '''<!DOCTYPE html>
<head>
	<title>Simple Example</title>
</head>
<body>
	<div><p>text1</p>
		<p>text2</p>
		<p>text3</p>
	</div>
</body>
'''

the first child works, but the last child still does not:

My div :
<div><p>text1</p>
      <p>text2</p>
      <p>text3</p>
   </div>
first_child :
<p>text1</p>
last_child :

   

And if we change the HTML so that the last child is on the same line as the closing tag of the div:

html = '''<!DOCTYPE html>
<head>
	<title>Simple Example</title>
</head>
<body>
	<div>
		<p>text1</p>
		<p>text2</p>
		<p>text3</p></div>
</body>
'''

the last child works, but the first child does not:

My div :
<div>
                <p>text1</p>
                <p>text2</p>
                <p>text3</p></div>
first_child :


last_child :
<p>text3</p>

And if we use .iter() instead of .child and .last_child on the original HTML:

first_child = next(my_div.iter())	# Using iter() : OK
print ("first_child :")
print (first_child.html)

*_,last_child = my_div.iter()	# Using iter() : OK
print ("last_child :")
print (last_child.html)

it works:

My div :
<div>
                <p>text1</p>
                <p>text2</p>
                <p>text3</p>
        </div>
first_child :
<p>text1</p>
last_child :
<p>text3</p>
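
The iter()-based workaround above can be wrapped in a small helper. This is only a sketch: `first_and_last_child` is a hypothetical name, not part of selectolax, and it assumes nothing about the node beyond the `iter()` method already used above.

```python
def first_and_last_child(node):
    """Return (first, last) of the nodes yielded by node.iter().

    Hypothetical helper, not a selectolax API. Returns (None, None)
    when iter() yields nothing, instead of raising StopIteration or
    ValueError like next(...) and *_, last = ... would.
    """
    first = last = None
    for child in node.iter():
        if first is None:
            first = child  # remember only the first node seen
        last = child       # keep overwriting until the iterator ends
    return first, last
```

With the example above, `first_child, last_child = first_and_last_child(my_div)` would replace the two separate iter() calls and walk the children only once.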

Clearly, something goes wrong with .child and .last_child depending on whether the wanted child node is on the same line as the start or end of the current node.
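
One possible explanation, not confirmed against the selectolax source: the newline and indentation between a tag and its neighbour form a whitespace-only text node, and .child / .last_child may be returning that text node while iter() skips text nodes. That would match the whitespace-only output shown above. The sketch below reproduces the same pattern with Python's standard `xml.dom.minidom` (a different parser, used here only as an analogy), where `firstChild` and `lastChild` are whitespace text nodes for this markup.

```python
from xml.dom.minidom import parseString

# Same <div> as in the report, with children on their own lines.
markup = "<div>\n\t<p>text1</p>\n\t<p>text2</p>\n\t<p>text3</p>\n</div>"
div = parseString(markup).documentElement

# firstChild / lastChild include text nodes, so here they are whitespace.
first_is_text = div.firstChild.nodeType == div.TEXT_NODE
last_is_text = div.lastChild.nodeType == div.TEXT_NODE
print(first_is_text, repr(div.firstChild.data))
print(last_is_text, repr(div.lastChild.data))

# Keeping only element nodes gives the children the report expects.
elements = [c for c in div.childNodes if c.nodeType == c.ELEMENT_NODE]
print(elements[0].toxml())
print(elements[-1].toxml())
```

If selectolax behaves the same way, checking whether the node returned by .child is a text node would tell us if this is a bug or only surprising behaviour.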
