
Cache subtitles in S3 storage #287

Draft: dan-niles wants to merge 1 commit into main

Conversation

dan-niles (Collaborator)

This PR modifies the download_subtitles method to cache subtitles in S3. The modified method now works as follows (a rough sketch follows the list):

  • Fetch requested subtitle keys using yt-dlp (e.g., en, fr, de) and store them in requested_subtitle_keys.
  • Iterate through the requested_subtitle_keys and attempt to download each subtitle file from the S3 cache. If a file is successfully downloaded from the S3 cache, remove the corresponding key from requested_subtitle_keys.
  • For the remaining keys in requested_subtitle_keys, download the subtitles using yt-dlp.
  • Upload the newly downloaded subtitles to the S3 cache.
  • Save the information about the fetched subtitles in a local cache as a JSON file for future use.
  • Add the downloaded subtitles to the ZIM file.
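
Not the actual diff, but a minimal self-contained sketch of the flow above. It uses boto3 and yt-dlp's Python API directly as stand-ins for whatever wrappers the scraper really uses; the bucket name, the subtitles/{video_id}/{lang}.vtt key layout, and the function signature are illustrative assumptions:

```python
import json
from pathlib import Path

import boto3
from botocore.exceptions import ClientError
from yt_dlp import YoutubeDL

s3 = boto3.client("s3")
BUCKET = "subtitle-cache"  # hypothetical bucket name


def download_subtitles(video_id: str, out_dir: Path) -> list[str]:
    out_dir.mkdir(parents=True, exist_ok=True)
    url = f"https://www.youtube.com/watch?v={video_id}"

    # 1. ask yt-dlp which subtitle languages exist, without downloading anything
    with YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
        info = ydl.extract_info(url, download=False)
    requested_subtitle_keys = list(info.get("subtitles", {}))

    # 2. try the S3 cache first; keep only the keys we could not fetch from it
    missing = []
    for lang in requested_subtitle_keys:
        try:
            s3.download_file(
                BUCKET,
                f"subtitles/{video_id}/{lang}.vtt",
                str(out_dir / f"{video_id}.{lang}.vtt"),
            )
        except ClientError:
            missing.append(lang)  # cache miss: fetch with yt-dlp below

    # 3. download the remaining languages with yt-dlp
    if missing:
        opts = {
            "skip_download": True,
            "writesubtitles": True,
            "subtitleslangs": missing,
            "subtitlesformat": "vtt",
            "outtmpl": str(out_dir / "%(id)s"),
            "quiet": True,
        }
        with YoutubeDL(opts) as ydl:
            ydl.download([url])
        # 4. upload the newly downloaded files to the S3 cache
        for lang in missing:
            local = out_dir / f"{video_id}.{lang}.vtt"
            if local.exists():
                s3.upload_file(str(local), BUCKET, f"subtitles/{video_id}/{lang}.vtt")

    # 5. record what was fetched in a local JSON cache for future runs
    (out_dir / "subtitles.json").write_text(json.dumps(requested_subtitle_keys))
    # 6. the caller then adds the files in out_dir to the ZIM
    return requested_subtitle_keys
```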

Close #277

@dan-niles dan-niles self-assigned this Aug 4, 2024
@dan-niles dan-niles marked this pull request as ready for review August 4, 2024 07:14
@dan-niles dan-niles requested a review from benoit74 August 4, 2024 07:14

codecov bot commented Aug 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 34 lines in your changes missing coverage. Please review.

Project coverage is 1.50%. Comparing base (60b85b5) to head (d34a734).

Files                                 Patch %   Lines
scraper/src/youtube2zim/scraper.py    0.00%     34 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #287      +/-   ##
========================================
- Coverage   1.54%   1.50%   -0.05%     
========================================
  Files         11      11              
  Lines       1102    1132      +30     
  Branches     162     170       +8     
========================================
  Hits          17      17              
- Misses      1085    1115      +30     


benoit74 (Collaborator) left a comment

I'm a bit puzzled about this issue/PR now.

The original goal I expressed in the issue was to avoid being blocked by the yt-dlp ban when subtitles have not changed. That goal is not reached in this PR, since we still need to get the list of subtitles with yt-dlp.

At the same time, caching the list of subtitles as well is probably not desirable: we usually do not cache responses to API calls in S3, but rather resources that take time/resources to recompute (re-encoded videos/images). The idea of caching subtitles in S3 was probably already a deviation from this usual behavior.

To help sort this out, can you please share some data on whether this change makes the scraper run faster?

dan-niles (Collaborator, Author)

> The original goal I expressed in the issue was to avoid being blocked by the yt-dlp ban when subtitles have not changed.

The reason I used yt-dlp to get the list of subtitles is that the scraper doesn't know which subtitle languages need to be downloaded. Without the --all-subtitles flag, only the subtitles available by default for the video are downloaded; with the --all-subtitles flag, auto-generated subtitles are included as well.

To avoid calling yt-dlp entirely in this scenario, we could save two zipped files in the S3 cache (a rough sketch of the lookup follows the list):

  1. subtitles/{video_id}/default.zip - containing only the default subtitles.
  2. subtitles/{video_id}/all.zip - containing all subtitles when --all-subtitles is passed.
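
A minimal sketch of what that lookup could look like under the proposed layout; the S3 client setup, bucket name, and helper name are illustrative assumptions, not the scraper's real code:

```python
import zipfile
from pathlib import Path

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "subtitle-cache"  # hypothetical bucket, as in the sketch above


def fetch_cached_subtitles(video_id: str, all_subtitles: bool, out_dir: Path) -> bool:
    """Return True when subtitles came from the cache, so yt-dlp is never called."""
    variant = "all" if all_subtitles else "default"
    archive = out_dir / f"{variant}.zip"
    try:
        s3.download_file(BUCKET, f"subtitles/{video_id}/{variant}.zip", str(archive))
    except ClientError:
        return False  # cache miss: fall back to yt-dlp, then upload the new zip
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)  # unpack the cached subtitle files for the ZIM
    return True
```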

WDYT?

benoit74 (Collaborator) commented Aug 5, 2024

This seems like a potential improvement, but does it really help the scraper run faster? It has the drawback that we do not know when these cache entries should be invalidated and updated.

@dan-niles dan-niles marked this pull request as draft August 5, 2024 12:24
benoit74 (Collaborator) commented Aug 5, 2024

Let's pause this issue/PR to let me reflect on it a bit.

Successfully merging this pull request may close these issues: Cache subtitles on S3 as well (#277)