Use yt-dlp instead of Youtube API #177

Open
rgaudin opened this issue Sep 1, 2023 · 13 comments
@rgaudin
Member

rgaudin commented Sep 1, 2023

The YouTube Data API v3 has served us relatively well for four years now. I think it's time to move away from it because:

  • Its restrictions are annoying. We have to manage API keys, manually set the IP whitelist (this can't be done programmatically) and switch between keys from time to time to respect the quota (or suffer failures).
  • It's artificially limited and/or buggy/different from the web UI. We've seen several cases where public content is not available via the API.

youtube-dl and the fork we use (yt-dlp) have greatly improved in four years. Switching to yt-dlp would have the following benefits:

  • Access to whatever is visible online
  • No API keys needed anymore
  • More flexible target specification (I suppose): using YT URLs
  • Generic: this would be a separate task, but since yt-dlp supports many platforms and methods, it should enable retrieving videos from various places… ⚠️ ZIMs are not single videos. Not sure how we can reproduce a standard experience using data from different platforms. Don't expect a turnkey feature here.

This change would require a major revamp of the scraper, but that is partly because it's still a filesystem-based one.

Important feature checklist to test/PoC first:

  • get list of playlists with details (name, description)
  • get list of videos for each playlist
  • get video details (author, title, description, date)
  • download video (already via yt-dlp)
  • get video thumbnails (already via yt-dlp)
  • get video subtitles (already via yt-dlp)
  • get author's metadata (name, description) and branding (banner, profile)
@benoit74
Collaborator

I ran a few small tests of yt-dlp. The results are very positive.

Test context: Python 3.11.4, yt-dlp 2023.10.7

  • list of playlists with details: yes, many details included
  • list of videos for each playlist: yes, many details included
  • get video details: yes, many details included
  • author / channel metadata:
    • title and description are available, along with much other information
  • branding:
    • banner: multiple resolutions are provided, but only for the large image that the YouTube UI uses at TV resolutions. To get the cropped version used on computers, or the even smaller one used on phones, you have to crop it yourself.
    • profile picture: multiple resolutions provided
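
Since the banner has to be cropped by hand, here is a minimal sketch of computing a centered crop box. The helper name and the 2560x1440 to 2560x423 numbers are illustrative assumptions, not values taken from these tests; check the actual banner size and the target size you need.

```python
def centered_crop_box(width, height, target_width, target_height):
    """Return a (left, top, right, bottom) box centered in the source image."""
    if target_width > width or target_height > height:
        raise ValueError("target is larger than the source image")
    left = (width - target_width) // 2
    top = (height - target_height) // 2
    return (left, top, left + target_width, top + target_height)


# Illustrative numbers only (assumed): a 2560x1440 banner cropped
# to a 2560x423 horizontal strip.
box = centered_crop_box(2560, 1440, 2560, 423)
```

The tuple follows the (left, upper, right, lower) order that Pillow's Image.crop() expects, so it could be passed to it directly if Pillow is used for the actual cropping.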

Does it work for unusual channel names like @Madrasa, which does not work without a channel ID?

Yes, it even found 5698 videos ...

Does it work for user IDs (the old type, instead of channel IDs)?

Yes, it makes no difference (tested with DirtyBiology, which is a channel, and cestpassorcierofficiel, which is a user).

How to extract all the information mentioned above?

I used this code:

import yt_dlp
# ydl_opts is a dict of yt-dlp options, url is any of the URLs listed below
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=False)

You can also pass "process=False" to skip resolving details (e.g. when you only want the author info, not all of their videos).
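
As a rough sketch of the "author info only" case, the snippet below keeps just the channel-level fields of the returned dict. The helper names are hypothetical and the key names are assumptions based on typical yt-dlp output; they may differ between versions, so missing keys simply come back as None.

```python
def channel_header(info):
    """Keep only channel-level fields from a yt-dlp info dict.

    Key names are assumptions; any key that is absent is returned as None.
    """
    keys = ("id", "title", "description", "channel", "channel_id", "uploader_id")
    return {key: info.get(key) for key in keys}


def fetch_channel_header(url):
    """Fetch channel metadata without resolving the video entries."""
    import yt_dlp  # imported lazily so channel_header() works without it

    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        # process=False returns the raw extractor result, so the
        # channel's videos are not walked one by one
        info = ydl.extract_info(url, download=False, process=False)
    return channel_header(info)
```

Calling fetch_channel_header("https://www.youtube.com/@PhilippHagemeister") would need network access; channel_header() alone can be tried on any saved JSON dump.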

You can use the following URLs:

  • https://www.youtube.com/watch?v=BaW_jenozKc => get details about one video
  • https://www.youtube.com/@PhilippHagemeister => get details about one channel / user and all its videos
  • https://www.youtube.com/@PhilippHagemeister/playlists => get details about all playlists, and all videos in every playlist
  • https://www.youtube.com/channel/UCtqICqGbPSbTN09K1_7VZ3Q => get details about one channel by ID
  • https://www.youtube.com/playlist?list=PL5Pd1geIk9IUBWUoUUNyBehNl0q5D1IuE => get details about one playlist and its videos
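
To make the URL shapes above concrete, here is a small sketch that labels a URL by the kind of target it points to. The function and the label names are hypothetical; yt-dlp does its own URL matching internally, and this only mirrors the patterns in the list (plus the old /user/ form).

```python
from urllib.parse import parse_qs, urlparse


def youtube_url_kind(url):
    """Classify a YouTube URL by its path and query string."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    query = parse_qs(parsed.query)
    if path == "/watch" and "v" in query:
        return "video"
    if path == "/playlist" and "list" in query:
        return "playlist"
    if path.endswith("/playlists"):
        return "channel_playlists"
    if path.startswith(("/@", "/channel/", "/user/")):
        return "channel"
    return "unknown"
```

A scraper could use such a label to decide up front how much data a given target is likely to pull in.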

The main limitation is that it seems hard to request only part of the data (e.g. get the list of all playlists but not the videos within them).

Whole code used for the tests:

import json
import yt_dlp

def process_one(ydl, url, filename):
    info = ydl.extract_info(url, download=False, process=False)
    with open(filename, "w") as fh:
        # ℹ️ ydl.sanitize_info makes the info json-serializable
        json.dump(ydl.sanitize_info(info), fh, indent=2)

# ℹ️ See help(yt_dlp.YoutubeDL) for a list of available options and public functions
ydl_opts = {}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    process_one(ydl, "https://www.youtube.com/watch?v=BaW_jenozKc", "one_video.json")

@benoit74
Collaborator

I also checked: chapter information is returned.

@kelson42 kelson42 pinned this issue Oct 21, 2023
@kelson42
Contributor

This improvement has been on the table for quite a long time. Is anything stopping us from moving forward?

@rgaudin
Member Author

rgaudin commented Oct 21, 2023

Is anything stopping us from moving forward?

Time, priority 😉

@benoit74
Collaborator

As you can see, this is not even in the 2.2.0 release I'm preparing, because there are already lower-hanging fruits to tackle before this "big" change.

@joe-rabbit

Hello, I would like to work on this. How should I go about it? Thank you.

@rgaudin
Member Author

rgaudin commented Dec 27, 2023

Hello, I would like to work on this. How should I go about it? Thank you.

This ticket involves a large refactor of the codebase. It requires a good understanding of the current codebase and a detailed breakdown of how you'd do it.
Unless that's something you are willing to do, I'd advise you to look at other tickets.

@benoit74
Collaborator

benoit74 commented Dec 27, 2023 via email

@joe-rabbit

I will try my best :)

@benoit74 benoit74 added this to the 3.1.0 milestone Jun 15, 2024
@benoit74
Collaborator

I'm not so sure that moving everything to yt-dlp is a wise move. At the very least, it needs to be discussed again because of disadvantages that have not been discussed here so far.

I agree that the advantages are obvious; there is no need to list them again.

It was however unclear to me until now that there is a big downside to moving everything to yt-dlp. The problem is that Zimfarm workers are sometimes blacklisted for yt-dlp operations. We have not yet understood the exact circumstances, but we know that the ban is temporary (a few hours) and linked to the IP (they have nothing else to ban anyway).

Moving all operations from the YT API to yt-dlp means that we will be even more exposed to this ban (more operations probably means more bans) AND that the consequences of a ban will be more significant. If we implement #277 and continue to use the YT API instead of yt-dlp, we can refresh the ZIM (typically for UI enhancements) without using yt-dlp at all when the channel has not been updated, and hence without being impacted by a temporary ban.

I still consider that the advantages outweigh the disadvantages, especially since it is quite a rare edge case that a channel is unchanged since the last recipe execution, but I think it is very important that all of us are aware of this before modifying too much code.

@chapmanjacobd

even more subject to this ban (more operations)

While you are evaluating yt-dlp, and measuring the number of requests that it makes, I'd like to suggest turning on a few specific options for the initial metadata scan to reduce the number of network requests:

ydl_opts = {
    "skip_download": True,
    "lazy_playlist": True,
    "extract_flat": True,
}

This might be helpful:

https://github.com/chapmanjacobd/library/blob/f253959d6de2c980fe42238ede2b908ef762c4a8/xklb/createdb/tube_backend.py#L97

And then you can fan out the more detailed video metadata fetching across many IPs.
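
One possible shape for that two-step approach, as a sketch only: a flat scan collects video IDs, then the IDs are split into batches that separate workers (or IPs) could process. The helper names and the round-robin split are assumptions for illustration; the entry layout ("entries" holding dicts with an "id") is what a flat playlist scan typically returns, but it is worth verifying against real output.

```python
def flat_video_ids(info):
    """Collect video IDs from a flat playlist info dict."""
    entries = info.get("entries") or []
    return [entry["id"] for entry in entries if entry and entry.get("id")]


def split_round_robin(items, bucket_count):
    """Distribute items over bucket_count lists, one item at a time."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    buckets = [[] for _ in range(bucket_count)]
    for index, item in enumerate(items):
        buckets[index % bucket_count].append(item)
    return buckets


def fetch_flat_info(url):
    """Run the cheap metadata scan with the options suggested above."""
    import yt_dlp  # imported lazily so the helpers above work without it

    opts = {"skip_download": True, "lazy_playlist": True,
            "extract_flat": True, "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        return ydl.extract_info(url, download=False)
```

Each batch from split_round_robin(flat_video_ids(info), n) could then be handed to a different worker for the detailed per-video extraction.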

@benoit74
Collaborator

Very good point @chapmanjacobd, thank you for notifying us!

@benoit74 benoit74 modified the milestones: 3.1.0, 3.2.0 Sep 5, 2024
@benoit74 benoit74 modified the milestones: 3.2.0, 3.3.0 Oct 11, 2024
@benoit74 benoit74 modified the milestones: 3.3.0, 3.4.0 Nov 4, 2024
@enema-combatant

Going through proxies may be a solution. bluet/proxybroker2 may be the key to avoiding direct-IP bans.
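
On the yt-dlp side, a proxy is set through the "proxy" option of YoutubeDL. A minimal sketch of rotating through a list of proxies follows; the helper name and the proxy addresses are hypothetical, and how the list is obtained (proxybroker2 or anything else) is left out entirely.

```python
from itertools import cycle


def ydl_opts_per_proxy(base_opts, proxies):
    """Yield option dicts endlessly, each using the next proxy in turn."""
    for proxy in cycle(proxies):
        # copy the base options so the caller's dict is never mutated
        yield {**base_opts, "proxy": proxy}


# Hypothetical proxy addresses, for illustration only
proxies = ["socks5://127.0.0.1:1080", "http://127.0.0.1:8080"]
opts_iter = ydl_opts_per_proxy({"quiet": True}, proxies)
first_opts = next(opts_iter)
second_opts = next(opts_iter)
```

Each yielded dict can be passed to yt_dlp.YoutubeDL(...) for one batch of requests. Whether this actually avoids the bans described above is untested.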
