Use yt-dlp instead of Youtube API #177
Comments
I did some small tests. Test context: Python 3.11.4, yt-dlp 2023.10.7
Yes, it even found 5698 videos ...
Yes, it does not make a difference (tested for
I used this code:
One can also pass `process=False` to skip retrieving details (e.g. when you only want author info, not all of their videos). You can use the following URL:
The main limitation is that it seems hard to request only a small amount of data (e.g. get the list of all playlists but not the videos within them). Whole code used for the tests:
|
I also checked, chapters information is returned. |
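For readers following along, here is a minimal sketch of the kind of call described in this comment, assuming the `yt_dlp` Python package is installed. `YoutubeDL.extract_info` and its `process` argument come from yt-dlp itself; the helper names `fetch_info` and `summarize_chapters` are hypothetical, and this is not the code that was actually used for the tests.

```python
from typing import Any


def fetch_info(url: str, process: bool = True) -> dict[str, Any]:
    """Fetch yt-dlp metadata for a channel, playlist or video URL (no download).

    Hypothetical helper. With process=False, yt-dlp does not resolve
    unresolved references (URLs, playlist items), so less data is fetched.
    """
    import yt_dlp  # imported lazily so the pure helper below works without it

    opts = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        return ydl.extract_info(url, download=False, process=process)


def summarize_chapters(info: dict[str, Any]) -> list[tuple[float, str]]:
    """Return (start_time, title) pairs from the "chapters" field, if any."""
    chapters = info.get("chapters") or []
    return [(ch["start_time"], ch.get("title", "")) for ch in chapters]
```

Calling `fetch_info(some_channel_url, process=False)` would be the lightweight variant mentioned above, while `summarize_chapters(fetch_info(some_video_url))` reads the chapter data from a fully processed video result; both URLs are placeholders.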
This improvement has been on the table for quite a long time. Is anything stopping us from moving forward? |
Time, priority 😉 |
As you can see, this is not even in the 2.2.0 release I'm preparing, because there are already lower-hanging fruits to tackle before this "big" change. |
Hello, I would like to work on this. How do I go about it? Thank you. |
This ticket involves a large refactor of the codebase. It requires a good understanding of the current codebase and a detailed breakdown of how you'd do it. |
yt-dlp is already used to download the videos; what we want is to also use it
to get all information about the channels, users, playlists, videos, etc.
The plan is:
- identify all places where the scraper uses the youtube API to grab
information about channels, users, playlists, videos
- decide how this code could be refactored to use information from yt-dlp
(this is really the hard part, we want to use the exact same inputs,
produce the same ZIM in the end, but there is absolutely not a one-to-one
match between YouTube API and yt-dlp)
- implement the change
Parts one and two should be done without any coding.
And as Renaud said, this is hence a complex task, but definitely not
infeasible if you are ready to spend some time on it.
On Wed, Dec 27, 2023 at 19:52, rgaudin ***@***.***> wrote:
… Hello , I would like to work on this? how do I go about with this thank you
This ticket involves a large refactor of the codebase. It requires a good
understanding of the current codebase and a detailed breakdown of how you'd
do.
Unless that's something you are willing to do, I'd advise you look at
other tickets
|
i will try my best :) |
I'm not so sure that moving everything to yt-dlp is a wise move; at the very least it needs to be discussed again, because there are disadvantages that have not been discussed here so far. I agree that the advantages are obvious, no need to list them again. It was however unclear to me until now that there is a big downside to moving everything to yt-dlp.

The problem is that, for yt-dlp operations, Zimfarm workers are sometimes blacklisted. We have not yet understood the exact circumstances, but what we know is that the ban is temporary (a few hours) and linked to the IP (they have nothing else to ban anyway). Moving all operations from the YT API to yt-dlp means that we will be even more subject to this ban (more operations probably means more bans) AND the consequences of a ban will be more significant.

If we implement #277 and continue to use the YT API instead of yt-dlp, we can refresh the ZIM (typically for UI enhancements) without having to use yt-dlp at all if the channel has not been updated, and hence without being impacted by a temporary ban.

I still consider that the advantages outweigh the disadvantages, especially since it is quite a rare edge case that a channel has been unchanged since the last recipe execution, but I think it is very important that all of us are aware of this before modifying too much code. |
While you are evaluating yt-dlp, and measuring the number of requests that it makes, I'd like to suggest turning on a few specific options for the initial metadata scan to reduce the number of network requests:
This might be helpful: And then you can fan-out the more detailed video metadata fetching across many IPs |
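The specific options suggested in this comment did not survive in the thread, so the following is only a sketch of the general idea, assuming the `yt_dlp` Python package. `extract_flat`, `skip_download`, `quiet` and `proxy` are existing `YoutubeDL` options; the helper names and the round-robin split across proxies are illustrative assumptions, not the commenter's actual suggestion.

```python
from typing import Any, Optional


def flat_scan_options() -> dict[str, Any]:
    """Options for a cheap first pass: list playlist entries without resolving each video."""
    return {"quiet": True, "skip_download": True, "extract_flat": "in_playlist"}


def entry_ids(info: dict[str, Any]) -> list[str]:
    """Collect the "id" of each direct entry of a playlist-like info dict."""
    return [e["id"] for e in (info.get("entries") or []) if e and e.get("id")]


def list_video_ids(playlist_url: str) -> list[str]:
    """Hypothetical helper: run the flat scan on a playlist URL and return its video IDs."""
    import yt_dlp  # imported lazily so the pure helpers work without it

    with yt_dlp.YoutubeDL(flat_scan_options()) as ydl:
        info = ydl.extract_info(playlist_url, download=False)
    return entry_ids(info)


def split_round_robin(
    video_ids: list[str], proxies: list[Optional[str]]
) -> dict[Optional[str], list[str]]:
    """Assign video IDs to proxies in round-robin order (None means a direct connection)."""
    if not proxies:
        raise ValueError("at least one proxy (or None) is required")
    batches: dict[Optional[str], list[str]] = {proxy: [] for proxy in proxies}
    for index, video_id in enumerate(video_ids):
        batches[proxies[index % len(proxies)]].append(video_id)
    return batches


def detail_options(proxy: Optional[str]) -> dict[str, Any]:
    """Options for the detailed per-video pass, optionally routed through a proxy."""
    opts: dict[str, Any] = {"quiet": True, "skip_download": True}
    if proxy:
        opts["proxy"] = proxy
    return opts
```

Each batch returned by `split_round_robin` could then be processed by a separate worker that builds its `YoutubeDL` instance from `detail_options(proxy)`, so the expensive per-video requests are spread over several IPs while the initial listing stays cheap.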
Very good point @chapmanjacobd, thank you for notifying us! |
Going through proxies may be a solution. bluet/proxybroker2 may have the key for avoiding direct-IP bans. |
Youtube Data API v3 has served us relatively well for 4 years now. I think it's time to move away from it because:
youtube-dl and the fork we use (yt-dlp) have greatly improved in 4 years. Switching to it would have the following benefits:
This change would require an important revamp of the scraper, but partly because it's still a filesystem-based one.
Important feature check list to test/poc first: