Feature request - Option to skip over an element #79

Open
danncasey opened this issue Sep 30, 2022 · 3 comments
@danncasey

Using markdownify 0.11.6, and it is working like a charm except for one thing. I'm scraping a site that has a YouTube video embedded in an iframe. In this case I need to leave it unchanged from how the site had it.

An option like --skip 'iframe', for example, would be great (ideally with some criteria, such as matching an id or a regex).

The following change produces the desired outcome. It's obviously just a quick hack, but it demonstrates the functionality.

In __init__.py, around line 143:

        for el in node.children:
            if isinstance(el, Comment) or isinstance(el, Doctype):
                continue
            elif isinstance(el, NavigableString):
                text += self.process_text(el)
            else:
                if el.name == 'iframe':
                    text += self.process_text(el)
                else:
                    text += self.process_tag(el, convert_children_as_inline)

test.html

<p>Need a way to preserve the original html for a given element.</p>
<i>Please don't discard my iframe :) </i>
<div class="ratio ratio-16x9" data-video="">
    <iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&amp;rel=0&amp;showinfo=0"></iframe>
</div>
<hr />

Produces:

===================


Need a way to preserve the original html for a given element.


*Please don't discard my iframe :)* 

<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&amp;rel=0&amp;showinfo=0"></iframe>



---
@sopoforic

sopoforic commented Oct 24, 2022

I have a similar issue. I have custom tags that I want to retain when converting, e.g. I'd like to be able to call something like:

md("<ul><li><foo>bar</foo></li></ul>", keep=['foo'])

and get back:

* <foo>bar</foo>

instead of:

* bar

Or, alternatively (or in addition), have an option to keep all unrecognized elements.

I can handle this with a custom converter, but it seems like it should be a pretty common use case, so it'd be nice if there were a simple option for it.

@ZobaJakColbert

ZobaJakColbert commented Feb 3, 2023

How do you write a custom converter for the foo tag?

I want to keep something like <span custom-style="MyStyle">

I know I can edit __init__.py around line 143 like this, but I want to know how to achieve this in a custom converter.

            else:
                if el.name == 'span' and el.get('custom-style'):
                    text += self.process_text(el)
                else:
                    text += self.process_tag(el, convert_children_as_inline)

@sopoforic

Something like this, I guess:

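from markdownify import MarkdownConverter

class MyConverter(MarkdownConverter):
    def convert_span(self, el, text, convert_as_inline):
        # Keep spans that carry a custom-style attribute as raw HTML.
        if el.get('custom-style'):
            return self.process_text(el)
        # `text` already holds the converted children, so return it as-is.
        return text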

Then you get:

>>> MyConverter().convert('<i>hello <span custom-style="MyStyle">world</span> and <span>all</span></i>')
'*hello <span custom-style="MyStyle">world</span> and all*'
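
The keep=['foo'] behaviour asked for earlier in this thread, plus the "some criteria" idea from the original request, can be sketched without touching markdownify at all. The following is only a toy illustration of the semantics using the standard library's html.parser. The names KeepingConverter, to_markdown, keep and keep_if are hypothetical and are not part of markdownify, and only a handful of tags (i, em, p, li) are converted.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the nesting.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class KeepingConverter(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical, NOT markdownify's API)."""

    def __init__(self, keep=(), keep_if=None):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)  # tag names to pass through verbatim
        # Optional predicate (tag, attrs_dict) -> bool for finer criteria.
        self.keep_if = keep_if or (lambda tag, attrs: False)
        self.out = []
        self.kept_depth = 0  # > 0 while inside a kept element

    def _is_kept(self, tag, attrs):
        return (self.kept_depth > 0 or tag in self.keep
                or bool(self.keep_if(tag, dict(attrs))))

    def handle_starttag(self, tag, attrs):
        if self._is_kept(tag, attrs):
            # get_starttag_text() returns the start tag exactly as written.
            self.out.append(self.get_starttag_text())
            if tag not in VOID:
                self.kept_depth += 1
        elif tag in ("i", "em"):
            self.out.append("*")
        elif tag == "li":
            self.out.append("* ")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are emitted only when they are kept.
        if self._is_kept(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.kept_depth:
            self.kept_depth -= 1
            self.out.append(f"</{tag}>")
        elif tag in ("i", "em"):
            self.out.append("*")
        elif tag in ("p", "li"):
            self.out.append("\n")

    def handle_data(self, data):
        # Text inside kept elements is re-escaped so it stays valid HTML.
        self.out.append(escape(data, quote=False) if self.kept_depth else data)


def to_markdown(source, keep=(), keep_if=None):
    parser = KeepingConverter(keep, keep_if)
    parser.feed(source)
    parser.close()
    return "".join(parser.out).strip()


print(to_markdown("<ul><li><foo>bar</foo></li></ul>", keep=["foo"]))
# * <foo>bar</foo>
print(to_markdown(
    '<i>hello <span custom-style="MyStyle">world</span> and <span>all</span></i>',
    keep_if=lambda tag, attrs: tag == "span" and bool(attrs.get("custom-style")),
))
# *hello <span custom-style="MyStyle">world</span> and all*
```

A name list covers the simple case, while the predicate leaves room for matching on an id or a regex without adding more options. Whether markdownify itself should expose something like this is up to the maintainers.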
