Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preserve space in certain elements #429

Merged
merged 10 commits into from
Nov 20, 2023
Merged

Conversation

idoshamun
Copy link
Contributor

It's still WIP but @adbar, I need your advice and early review to see that I'm heading the right direction.
Currently I manage to preserve spaces in code elements but it breaks proper formatting in other elements. You can see the test cases. I'm not sure what will be the right way to address these cases.
It's mostly where <p>H <hi>i</hi>, <hi>w</hi></p>.
Basically content elements (p, item, etc) should preserve spaces as well I think

Closes #422

@adbar
Copy link
Owner

adbar commented Oct 11, 2023

Hi @idoshamun, I also think that's the idea. My remarks:

  • PRESERVE_SPACE_TAGS would need to be a set for faster "in" operations
  • Currently this functionality/line from sanitize() is missing from the PR:
    '\n'.join(filter(None, (line_processing(l) for l in text.splitlines())))

@idoshamun
Copy link
Contributor Author

@adbar good catch! I applied the changes please check again

@idoshamun
Copy link
Contributor Author

@adbar would you mind suggesting the next step?

@adbar
Copy link
Owner

adbar commented Nov 1, 2023

@idoshamun Sorry, I missed your last comment. The tests don't pass, do we need to adapt them to your changes or change the code further? That would be the next step IMO.

@idoshamun
Copy link
Contributor Author

@adbar this is exactly my debate, not sure if I need to adjust the tests or my code. Can you have a look and let me know your suggestion please?
Also regardless of the tests, does the code make sense to you before I proceed further with the changes?

@adbar
Copy link
Owner

adbar commented Nov 3, 2023

Hi @idoshamun, yes, the code makes sense so I'd start with the tests, I'm not sure why they fail.

@idoshamun
Copy link
Contributor Author

@adbar it's ready for your review!

@adbar
Copy link
Owner

adbar commented Nov 8, 2023

Thanks, I need to check the impact on general performance (accuracy and speed) so it's going to take a bit of time. I'm not completely sure regarding tables but this is the way to go for code segments.

Just to clarify: you inserted \xa0 spaces in the tests as a result of the str.replace('&nbsp;', '\u00A0') transformation, right?

@idoshamun
Copy link
Contributor Author

Just to clarify: you inserted \xa0 spaces in the tests as a result of the str.replace(' ', '\u00A0') transformation, right?

Yap.

Let me know if I can help with something :)

Copy link

codecov bot commented Nov 10, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0ebbbcb) 96.82% compared to head (0fca4ea) 96.84%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #429      +/-   ##
==========================================
+ Coverage   96.82%   96.84%   +0.01%     
==========================================
  Files          22       22              
  Lines        3342     3361      +19     
==========================================
+ Hits         3236     3255      +19     
  Misses        106      106              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Nov 10, 2023

So, the changes pass the tests, including on my benchmark, they are also not critical for performance. Thanks for the PR! I thought we could maybe make it better:

  • You're only checking the parent, it cannot be that a relevant element is more deeply nested, i.e. a tag like <code> is never far?
  • What about <pre>? I don't see it in your list but would be tempted to add it to SPACING_PROTECTED.
  • We could also let users decide by putting this list among the variables in settings.cfg, that's an open question, I'm thinking more and more to put constants there instead of embedded somewhere in the scripts.

@adbar adbar linked an issue Nov 10, 2023 that may be closed by this pull request
@idoshamun
Copy link
Contributor Author

idoshamun commented Nov 12, 2023

Thanks a lot for checking and the effort @adbar!

You're only checking the parent, it cannot be that a relevant element is more deeply nested, i.e. a tag like <code> is never far?

I can suggest a different approach then, whenever we detect a new preserve space element, we'd propagate down an attribute to indicate we need to preserve space to every child element. This way all the tree will preserve space. Let me know if I should go down this path.

What about <pre>? I don't see it in your list but would be tempted to add it to SPACING_PROTECTED.

You're right. I'll add the pre element to the set.

We could also let users decide by putting this list among the variables in settings.cfg, that's an open question, I'm thinking more and more to put constants there instead of embedded somewhere in the scripts.

Once you decide to apply this strategy we can consider it. But if it's not a common pattern in the library currently than I suggest we skip this one. Wdyt?

@adbar
Copy link
Owner

adbar commented Nov 13, 2023

The way we perform tree traversal is crucial because it can save time. I was just curious if elements such as code are deeply nested, possibly not but I have no data on that.
Using attributes on the HTML structure might even be better but I'm not sure that preserving spaces in all the tree would be beneficial. Since your PR is already an improvement we could leave it at that and see the rest later. It also depends on the time you can invest in it.

Yes, please add pre to the set.

We can see the configuration/options later, no worries.

@idoshamun
Copy link
Contributor Author

@adbar done!

@adbar
Copy link
Owner

adbar commented Nov 14, 2023

I still get this FutureWarning:
FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

I believe it's about statements like (p and p.tag in SPACING_PROTECTED).
You could also use if/else statements if the lines become too long.

@adbar
Copy link
Owner

adbar commented Nov 14, 2023

Another issue, the XML output is garbled instead of pretty-printed because of the PR but there is not test for that and I'm not sure why.

@idoshamun
Copy link
Contributor Author

@adbar I'm not sure exactly what I should change. Can you perhaps apply the fix please?

@adbar
Copy link
Owner

adbar commented Nov 14, 2023

OK, I'll take it from here and try to restore parts of the removed code and to avoid the warnings.

@adbar adbar changed the title preserve spaces in code elements preserve space in certain elements Nov 17, 2023
@adbar adbar merged commit abfe094 into adbar:master Nov 20, 2023
14 checks passed
@idoshamun
Copy link
Contributor Author

Thanks a lot @adbar for the support!
When is the next release?

@adbar
Copy link
Owner

adbar commented Nov 21, 2023

In the coming weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple spaces within a text element are not supported
2 participants