Feature request - Option to skip over an element #79

danncasey · 2022-09-30T01:41:21Z

Using markdownify 0.11.6, and it is working like a charm except for one thing. I'm scraping a site that has a YouTube video embedded in an iframe. In this case I need to keep it unchanged from how the site had it.

An option like --skip 'iframe', for example, would be great (ideally with some criteria, such as matching an id or a regex).

The following change produces the desired outcome. It's obviously just a quick hack, but it demonstrates the functionality.

In __init__.py, around line 143:

        for el in node.children:
            if isinstance(el, Comment) or isinstance(el, Doctype):
                continue
            elif isinstance(el, NavigableString):
                text += self.process_text(el)
            else:
                if el.name == 'iframe':
                    text += self.process_text(el)
                else:
                    text += self.process_tag(el, convert_children_as_inline)

test.html

<p>Need a way to preserve the original html for a given element.</p>
<i>Please don't discard my iframe :) </i>
<div class="ratio ratio-16x9" data-video="">
    <iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&amp;rel=0&amp;showinfo=0"></iframe>
</div>
<hr />

Produces:

===================


Need a way to preserve the original html for a given element.


*Please don't discard my iframe :)* 

<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&amp;rel=0&amp;showinfo=0"></iframe>



---


sopoforic · 2022-10-24T14:57:15Z

I have a similar issue. I have custom tags that I want to retain when converting, e.g. I'd like to be able to call something like:

md("<ul><li><foo>bar</foo></li></ul>", keep=['foo'])

and get back:

* <foo>bar</foo>

instead of:

* bar

Or, alternatively (or in addition), have an option to keep all unrecognized elements.

I can handle this with a custom converter, but it seems like it should be a pretty common use case, so it'd be nice if there were a simple option for it.
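To make the requested behaviour concrete, here is a minimal standard-library sketch of what a keep option could do. It is not markdownify's implementation or API: `KeepingConverter`, `VOID_TAGS`, and the `keep` parameter of this `md` function are hypothetical names used only for illustration, and the toy converter only understands `<li>`, `<i>`/`<em>`, and plain text.

```python
import html
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class KeepingConverter(HTMLParser):
    """Toy HTML-to-Markdown converter with a hypothetical `keep` option."""

    def __init__(self, keep=()):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []
        self.keep_depth = 0  # > 0 while inside a kept element

    def handle_starttag(self, tag, attrs):
        if self.keep_depth or tag in self.keep:
            # Entering (or already inside) a kept element: echo the raw tag.
            self.out.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.keep_depth += 1
        elif tag == "li":
            self.out.append("* ")
        elif tag in ("i", "em"):
            self.out.append("*")

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: echo it if kept, never nest.
        if self.keep_depth or tag in self.keep:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.keep_depth:
            if tag not in VOID_TAGS:
                self.keep_depth -= 1
                self.out.append(f"</{tag}>")
        elif tag == "li":
            self.out.append("\n")
        elif tag in ("i", "em"):
            self.out.append("*")

    def handle_data(self, data):
        # Text inside a kept element is emitted as HTML, so re-escape it.
        self.out.append(html.escape(data, quote=False) if self.keep_depth else data)


def md(source, keep=()):
    """Convert `source`, leaving every element named in `keep` as raw HTML."""
    parser = KeepingConverter(keep)
    parser.feed(source)
    parser.close()
    return "".join(parser.out).strip()


print(md("<ul><li><foo>bar</foo></li></ul>", keep=["foo"]))
print(md("<ul><li><foo>bar</foo></li></ul>"))
```

The depth counter is what keeps nested children of a kept element untouched as well; a real option would also need to decide how kept HTML interacts with escaping and whitespace handling.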

ZobaJakColbert · 2023-02-03T23:47:16Z

How do you write a custom converter for the foo tag?

I want to keep something like <span custom-style="MyStyle">

I know I can edit __init__.py around line 143 like this, but I want to know how to achieve this in a custom converter.

            else:
                if el.name == 'span' and el.get('custom-style'):
                    text += self.process_text(el)
                else:
                    text += self.process_tag(el, convert_children_as_inline)

sopoforic · 2023-02-07T17:15:48Z

Something like this, I guess:

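from markdownify import MarkdownConverter

class MyConverter(MarkdownConverter):
    def convert_span(self, el, text, convert_as_inline):
        if el.get('custom-style'):
            # Emit the element's own HTML (via process_text) instead of its converted children.
            return self.process_text(el)
        # Otherwise return the already-converted children, which is what
        # happens for a tag that has no converter.
        return text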

Then you get:

>>> MyConverter().convert('<i>hello <span custom-style="MyStyle">world</span> and <span>all</span></i>')
'*hello <span custom-style="MyStyle">world</span> and all*'
