Extraction of Youtube iframes and img elements with links #272

Open
sampathmende opened this issue Dec 5, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@sampathmende

I am not able to fetch image tags.
I am not able to fetch iframe tags.
These are the commands I ran from the command prompt on a Windows machine:

trafilatura --sitemap "https://www.lyricspulp.com/" --list > linklist.txt
trafilatura --sitemap homepage --list > linklist.txt
trafilatura -i linklist.txt --xml -o outputfile.txt
trafilatura -i linklist.txt --formatting --links --images --no-comments --xml -o outputfile.txt

@adbar adbar added the enhancement New feature or request label Dec 5, 2022
@adbar
Owner

adbar commented Dec 5, 2022

Hi @sampathmende, thanks for your feedback.

  • iframes are tricky: they could be missing, although I couldn't find an example in the webpage you mention
  • images are a problem here because they are embedded in links; this isn't part of the main focus (the text), but it could be corrected nonetheless

Example: https://web.archive.org/web/20221205103722/https://www.lyricspulp.com/2022/12/thee-thalapathy-lyrics.html

@sampathmende
Author

sampathmende commented Dec 5, 2022

This link has both an iframe and images:

https://lyricsnary.com/dirty-little-secret-nora-fatehi-zack-knight/
From this page I am not able to extract the images or the iframe, which is basically an embedded YouTube URL.
Basically, I am trying to extract lyrics from a lyrics website. I am able to extract the HTML elements, but not the images and iframes: the images come from YouTube and the iframe contains the embedded YouTube video URL.

@adbar
Owner

adbar commented Dec 5, 2022

The library is geared towards text extraction, and on the page you mention all of the main text is extracted correctly. Keeping elements that contain YouTube videos would require additional code.
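
As a stopgap, the media elements can be collected in a separate pass over the raw HTML. The sketch below is not part of trafilatura and does not use its API: it relies only on Python's standard `html.parser`, and the names `MediaCollector` and `collect_media`, the sample markup, and the substring check used to recognize YouTube embeds are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class MediaCollector(HTMLParser):
    """Collect img sources and YouTube iframe sources from raw HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.youtube_embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        src = dict(attrs).get("src")
        if not src:
            return
        if tag == "img":
            # Images are kept even when they are wrapped in a link
            self.images.append(src)
        elif tag == "iframe" and (
            "youtube.com/embed/" in src or "youtube-nocookie.com/embed/" in src
        ):
            # Assumption: YouTube embeds are recognized by their URL pattern
            self.youtube_embeds.append(src)


def collect_media(html):
    """Return (image_sources, youtube_embed_sources) found in an HTML string."""
    parser = MediaCollector()
    parser.feed(html)
    parser.close()
    return parser.images, parser.youtube_embeds


# Made-up sample markup: an image embedded in a link, one YouTube iframe, one unrelated iframe
sample = """
<article>
  <p>Some lyrics</p>
  <a href="https://example.org/full.jpg"><img src="https://example.org/thumb.jpg"></a>
  <iframe src="https://www.youtube.com/embed/VIDEO_ID"></iframe>
  <iframe src="https://example.org/ad"></iframe>
</article>
"""
images, embeds = collect_media(sample)
print(images)
print(embeds)
```

The results could then be stored alongside the text that trafilatura extracts from the same page. Lazy-loaded images that use attributes other than `src` would need extra handling.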

@adbar adbar changed the title from "is there a way to extract iframe tag including with other tags" to "Extraction of Youtube iframes and img elements with links" Dec 5, 2022