This YouTube crawler crawls YouTube starting from one or more seed URLs provided by the user. It uses a hill-climbing algorithm driven by a views/likes score for each video. The hypothesis is that videos with good content earn more likes per view. Here's a video from Veritasium that explains how YouTube does not do a great job of providing users with good recommendations: https://youtu.be/fHsa9DqmId8
As such, users may benefit from this crawler's more exploratory approach to diversify their finds on YouTube.
Additionally, keyword-based matching and author-count-based suppression are used to further refine the results.
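The idea above can be sketched as a best-first crawl over a toy in-memory graph. This is only an illustration and may differ from how the crawler actually climbs: the function name crawl, the TOY_VIDEOS data and the dictionary keys "score", "author" and "related" are all hypothetical, and a priority queue stands in for whatever hill-climbing variant the script uses.

```python
import heapq
from collections import Counter


def crawl(seed_ids, videos, num_videos, max_author_count):
    """Best-first crawl over an in-memory toy graph (illustrative only).

    `videos` maps a video id to a dict with hypothetical keys:
    "score" (lower is better), "author" and "related" (list of ids).
    """
    # Min-heap ordered by score, so the lowest (best) score is expanded first.
    frontier = [(videos[v]["score"], v) for v in seed_ids]
    heapq.heapify(frontier)
    seen = set(seed_ids)
    author_counts = Counter()
    results = []

    while frontier and len(results) < num_videos:
        _, vid = heapq.heappop(frontier)
        author = videos[vid]["author"]
        # Author suppression: skip videos whose author already hit the cap.
        if author_counts[author] >= max_author_count:
            continue
        author_counts[author] += 1
        results.append(vid)
        # Queue unseen related videos for later expansion.
        for nxt in videos[vid]["related"]:
            if nxt not in seen and nxt in videos:
                seen.add(nxt)
                heapq.heappush(frontier, (videos[nxt]["score"], nxt))
    return results


# Hypothetical toy data: four videos, three of them by author "x".
TOY_VIDEOS = {
    "a": {"score": 5.0, "author": "x", "related": ["b", "c"]},
    "b": {"score": 2.0, "author": "x", "related": ["d"]},
    "c": {"score": 3.0, "author": "y", "related": []},
    "d": {"score": 1.0, "author": "x", "related": []},
}

ordered = crawl(["a"], TOY_VIDEOS, num_videos=10, max_author_count=2)
```

With a cap of 2 videos per author, video "d" is dropped even though it has the best score, because author "x" already appears twice.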
- Code requirements are captured in "requirements.txt";
all other imports are part of the Python 3.5+ standard library.
- To install the requirements:
pip install -r requirements.txt
- Uses "argparse" to parse input arguments from the command line.
- Argparse expects a path to a config file.
- The config file should contain the following:
seedUrls:
- "https://youtu.be/ONVpFtiD-fo"
- "https://youtu.be/P_fHJIYENdI"
outputDir: "knowledge/science/"
numVideos: 500
maxAuthorCount: 5
seedUrls - One or more links to YouTube videos
(preferably on similar topics)
outputDir - Where the final HTML file will be written
numVideos - Number of videos to crawl
maxAuthorCount - Maximum number of times an author may
appear in the results
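A minimal sketch of reading and checking such a config follows. It rests on several assumptions: the flag name "--config" is hypothetical (the README does not name the argument), the file is assumed to be YAML because of how the sample looks, and load_config assumes PyYAML is installed. The helper names parse_args, load_config and validate_config are illustrative and not taken from the script.

```python
import argparse

# Expected keys and types, taken from the sample config above.
REQUIRED_KEYS = {
    "seedUrls": list,
    "outputDir": str,
    "numVideos": int,
    "maxAuthorCount": int,
}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="YouTube crawler (sketch)")
    # "--config" is a hypothetical flag name; the real script may differ.
    parser.add_argument("--config", required=True, help="path to the config file")
    return parser.parse_args(argv)


def load_config(path):
    # Assumption: the config is YAML and PyYAML is available.
    import yaml
    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def validate_config(config):
    """Raise ValueError if a required key is missing or has the wrong type."""
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            raise ValueError("missing config key: {}".format(key))
        if not isinstance(config[key], expected_type):
            raise ValueError(
                "{} must be of type {}".format(key, expected_type.__name__)
            )
    if not config["seedUrls"]:
        raise ValueError("seedUrls must contain at least one URL")
    return config
```

Validating up front means a typo in the config fails immediately instead of partway through a long crawl.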
- Outputs:
A sorted HTML file, written to the given outputDir inside the "crawled_outputs" folder.
Format of the output:
Video Title (a hyperlink that opens the video in a new tab), Score, Author, Views, Likes, keywords, is_seed, priority (results are sorted by this key)
Score is calculated by the ratio:
Views / (Likes * log10(Likes))
- The smaller this number, the "better" the video.
If EVERY person who views a video also hits "like", Views/Likes approaches 1 and the score reduces to 1/log10(Likes).
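The score formula can be written as a small function. This is a sketch only: the name video_score is hypothetical, and returning infinity when likes is 0 or 1 (where log10 is undefined or zero) is an assumption, since the README does not say how the crawler handles that case.

```python
import math


def video_score(views, likes):
    """Views / (Likes * log10(Likes)); lower is better."""
    # log10(likes) is undefined at 0 and zero at 1, so the ratio cannot be
    # computed there. Treating these as the worst score is an assumption.
    if likes <= 1:
        return math.inf
    return views / (likes * math.log10(likes))


# 1000 views and 100 likes: 1000 / (100 * 2) = 5.0
example = video_score(1000, 100)
```

The log10 factor means that, at the same views-to-likes ratio, a video with more likes gets a lower (better) score.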
A keyword matching algorithm also influences the priority of the crawl,
where the keywords of the seedUrls are matched against the keywords of each other
video in the crawl.
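The README does not say which matching algorithm is used, so the following is only one plausible stand-in: a Jaccard overlap between the seed keywords and a candidate video's keywords. The function name keyword_overlap is hypothetical, and how such a value would be combined with the score into the final priority is left open here.

```python
def keyword_overlap(seed_keywords, video_keywords):
    """Jaccard similarity of two keyword lists, in the range 0.0 to 1.0."""
    # Normalise case and whitespace so "Physics " and "physics" match.
    seed = {k.strip().lower() for k in seed_keywords if k.strip()}
    video = {k.strip().lower() for k in video_keywords if k.strip()}
    if not seed or not video:
        return 0.0
    return len(seed & video) / len(seed | video)


# One shared keyword out of three distinct ones.
overlap = keyword_overlap(["Physics", "science"], ["science", "math"])
```

A higher overlap would suggest that the candidate is closer in topic to the seeds.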
- Sample command (Updated 12th Feb 2025):
$python3 youtube_crawler.py
crawling ... find progress in log file: smart_crawl.log
Output File will be named:
radio_triple_j_bbc_mahogany_deezer_1.html
HTML file './crawled_outputs/music/english/radio/radio_triple_j_bbc_mahogany_deezer_1.html' has been created successfully.
0.4 % crawling complete
HTML file './crawled_outputs/music/english/radio/radio_triple_j_bbc_mahogany_deezer_1.html' has been created successfully.
0.899 % crawling complete
HTML file './crawled_outputs/music/english/radio/radio_triple_j_bbc_mahogany_deezer_1.html' has been created successfully.
1.400 % crawling complete
HTML file './crawled_outputs/music/english/radio/radio_triple_j_bbc_mahogany_deezer_1.html' has been created successfully.
1.9 % crawling complete
HTML file './crawled_outputs/music/english/radio/radio_triple_j_bbc_mahogany_deezer_1.html' has been created successfully.
3.300 % crawling complete
.....
................................................................
--- Crawl took 1207.4183235168457 seconds ---
Alternatively, refer to the "smart_crawl.log" file for detailed progress on individual URLs.
- IMPORTANT NOTES:
A WAIT TIME (SET TO 1.1 SECONDS) IS ADDED AS A "POLITENESS POLICY" WHILE CRAWLING.
PLEASE DO NOT REDUCE IT, OR YOUTUBE MAY TREAT YOU AS A BOT.
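A politeness delay of this kind can be sketched as below, assuming the wait is applied between consecutive requests. The names polite_fetch_all and fetch are hypothetical; fetch stands for whatever function actually downloads a page, and the sleep parameter exists only so the delay can be replaced in tests.

```python
import time

WAIT_SECONDS = 1.1  # value stated in the note above


def polite_fetch_all(urls, fetch, wait_seconds=WAIT_SECONDS, sleep=time.sleep):
    """Call fetch(url) for each url, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        # No pause is needed before the very first request.
        if index > 0:
            sleep(wait_seconds)
        results.append(fetch(url))
    return results
```

With this delay, 500 videos cost at least about 550 seconds of waiting alone, which helps explain the crawl time shown in the sample output.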
- General Notes:
- The crawled file may contain slightly more links than specified.
- Links gathered may differ based on the geographic location crawled from.
- Some videos that are popular in that location may still show up despite having little relation to the seed links provided.
- Time taken and scores vary depending on factors such as the stats of the seed video provided, VPN use, etc.
- One can also crawl a channel's videos page, e.g.:
https://www.youtube.com/@cokestudio/videos
but it helps to also add particular videos from the channel as seeds so that relevant keywords can be extracted.