Use the CC file instead of the transcript #42

Open
reaper-sid opened this issue Jan 16, 2023 · 15 comments · May be fixed by #43

@reaper-sid

Youtube recently decided to merge multiple lines of the CC into each single line of the transcript. This makes youtube2Anki much less useful. I found that the CC file can be pulled as XML. You can find the links to the various CC files in the HTML of the video page below a section that looks like "captions":{"playerCaptionsTracklistRenderer":{"captionTracks":.

After replacing \u0026 with &, the URLs look like this:

https://www.youtube.com/api/timedtext?v=[video_id]&caps=asr&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=[expire_code]&sparams=ip,ipbits,expire,v,caps,xoaf&signature=[signature_code]&key=yt8&lang=en

Would it be possible to rewrite to use the CC file from those links instead of the transcript for a more granular set of data and timing?

Originally posted by @tube-CC in #40
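The lookup described above could be sketched roughly as follows. This is only an illustration, not the extension's code: `extractCaptionTracks` is a hypothetical helper, and it assumes the watch-page HTML embeds `"captionTracks":` followed by a JSON array. Parsing that array with `JSON.parse` also turns `\u0026` back into `&`, so no manual replacement is needed.

```javascript
// Hypothetical helper: pull the captionTracks array out of raw watch-page HTML.
// Assumes the array is valid JSON embedded in the page, as described in the issue.
function extractCaptionTracks(html) {
  const marker = '"captionTracks":'
  const start = html.indexOf(marker)
  if (start === -1) return []
  const open = html.indexOf('[', start + marker.length)
  if (open === -1) return []

  // Walk forward to the matching closing bracket, ignoring brackets inside strings.
  let depth = 0
  let inString = false
  for (let i = open; i < html.length; i++) {
    const ch = html[i]
    if (inString) {
      if (ch === '\\') i++ // skip the character that follows a backslash
      else if (ch === '"') inString = false
    } else if (ch === '"') {
      inString = true
    } else if (ch === '[') {
      depth++
    } else if (ch === ']' && --depth === 0) {
      try {
        // JSON.parse decodes escapes such as \u0026 into & automatically.
        return JSON.parse(html.slice(open, i + 1))
      } catch {
        return []
      }
    }
  }
  return []
}
```

Each returned entry would then carry the timedtext URL; the exact property names inside each entry are not shown in this thread, so treat them as something to confirm against a real page.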

@dobladov
Owner

> Youtube recently decided to merge multiple lines of the CC into each single line of the transcript. This makes youtube2Anki much less useful.

Can you provide an example of this? As far as I know, YouTube does not decide how people upload their subtitles; it may be that whoever created the CC wrote them with multiple lines at once.

> Would it be possible to rewrite to use the CC file from those links instead of the transcript for a more granular set of data and timing?

How can I get the signature_code? Can you put an example link that returns a valid XML?

For now, I would say this is a huge refactor of the code, and I would prefer not to do it: it looks like an undocumented API that could lose access at any moment, while the current solution of parsing the HTML can be adapted easily in case of changes.

It would be interesting if we could implement the XML approach and keep the HTML parsing as a fallback, but I would like to see how much work this would be.

@reaper-sid
Author

A working link can be parsed from any youtube watch page that has CC. I can't provide a working link here because the signature_code and the expire_code change each time the page is refreshed. More work for sure, but I have basically stopped using this extension because the transcripts are so bad now. Transcripts don't break for end of sentences or even change of speaker. I don't know why Youtube made this change, but it has made my language study harder.

@dobladov
Owner

Please share a link to one of the videos you mention that lack breaks at the end of sentences. I will try to get the signature_code and expire_code and compare them to what the extension obtains. Thanks.

@reaper-sid
Author

reaper-sid commented Jan 16, 2023

https://www.youtube.com/watch?v=V_v5Gcjgv3U
If you click on the transcript at 4:29, for example, you can see what I mean. Two people are talking. The CC gives six lines:

<text start="269.76" dur="0.6">Dr. Bai.</text>
<text start="270.88" dur="0.68">Why are you still here?</text>
<text start="271.92" dur="0.6">Time for meeting.</text>
<text start="273" dur="0.88">Only with patience</text>
<text start="273.96" dur="0.72">can you enjoy some</text>
<text start="274.68" dur="1.16">good hand grounding coffee.</text>

The transcript gives one:
4:29 4:39 Dr. Bai. Why are you still here? Time for meeting. Only with patience can you enjoy some good hand grounding coffee. ... 269 280

I might add that I use the timing to embed the videos directly into Anki. Not sure everybody is using it that way.
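To make the granularity difference concrete, here is a minimal sketch of turning `<text start="…" dur="…">` lines like the six above into timed cues. `parseTimedText` and `decodeEntities` are hypothetical names, the regex assumes the `start`-then-`dur` attribute order seen in this sample, and a real implementation might prefer a proper XML parser.

```javascript
// Hypothetical helper: decode the handful of XML entities likely to appear in captions.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&') // last, so "&amp;lt;" is not decoded twice
}

// Hypothetical helper: convert timedtext XML into { start, end, text } cues (seconds).
// Assumes each line has start="..." followed by dur="...", as in the sample above.
function parseTimedText(xml) {
  const pattern = /<text\s+start="([^"]+)"\s+dur="([^"]+)"[^>]*>([\s\S]*?)<\/text>/g
  const cues = []
  for (const [, start, dur, text] of xml.matchAll(pattern)) {
    const begin = Number(start)
    cues.push({ start: begin, end: begin + Number(dur), text: decodeEntities(text) })
  }
  return cues
}
```

With per-line `start` and `end` values, each card could point at its own short clip rather than the ten-second span the merged transcript row covers.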

@dobladov
Owner

I see what you mean now. I guess they did it to avoid a long list in the transcript.

Your proposal makes a lot of sense. I will check how to implement it when I have some free time.

@reaper-sid
Author

Cool! Let me know if you need any assistance, for example with testing.

@dobladov dobladov self-assigned this Jan 16, 2023
@dobladov dobladov added the enhancement New feature or request label Jan 16, 2023
@dobladov
Owner

I managed to get the data into the extension. The refactor for handling the data will take me a while, but it seems well worth it because users will no longer have to open the transcript manually.

Screenshot 2023-01-16 at 23 55 55

@reaper-sid
Author

Wow, I'm amazed that you were able to do that so quickly! Are you going to give the user the ability to select which CC language they want to use within the extension?

@dobladov
Owner

> Are you going to give the user the ability to select which CC language they want to use within the extension?

Yes, I got all the caption links. The user will have to select a language; this way there's only one URL to request the XML from.
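A rough sketch of that selection step, so that a single URL is requested. `pickTrack` is a hypothetical helper, and the `languageCode`, `kind`, and `baseUrl` property names are assumptions about the shape of each caption track rather than something confirmed in this thread.

```javascript
// Hypothetical helper: choose one caption track for the language the user selected.
// Assumes each track has a languageCode, and that auto-generated tracks are
// marked with kind: 'asr' (both property names are assumptions).
function pickTrack(tracks, languageCode) {
  const matches = tracks.filter((track) => track.languageCode === languageCode)
  // Prefer uploaded captions over auto-generated ones when both exist.
  return matches.find((track) => track.kind !== 'asr') ?? matches[0] ?? null
}
```

Returning `null` when no track matches leaves room for the existing transcript-based fallback.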

@dobladov
Owner

@tube-CC I was able to make a beta with the functionality you asked for. It takes the captions you mentioned from the script containing ytInitialData, but there's a big problem: that data is not reloaded after navigating to another video, so the information can only be loaded once, at the beginning. If you have any idea of how to get updated caption information, let me know so I can finish the feature and add it to the next release.

Screen.Recording.2023-01-20.at.16.01.38.mov

If you want to give it a test, you can use this package:
chrome://extensions/ -> Developer Mode -> Load unpacked -> the folder of the unpacked extension

@dobladov dobladov linked a pull request Jan 20, 2023 that will close this issue
@reaper-sid
Author

reaper-sid commented Jan 20, 2023

Testing the Beta
Using this playlist: https://www.youtube.com/playlist?list=PL6xVgUZ4UP2O_6Y4pRSVmVwT34NRJHJz0

  1. When I visit the first Episode, the extension shows a language-selection popup, which then takes me to the CC content.
    Clicking "Delete saved cards" takes me back to the language choice list.
  2. If I then visit another Episode and click the extension, I get the "Transcript not found" popup. If I have the transcript window open, it pulls that content into the extension rather than the CC content.
  3. If I follow step 1, then visit another Episode and click the browser's "Reload this page" button, the extension shows the language-selection popup again, which then takes me to the CC content. Reloading the page appears to reload the extension.

@dobladov
Owner

Point 2 is the issue I have: when you select another episode, the initial data I take the captions from is no longer valid, so I fall back to the previous system of getting the data from the view. I need a way to refresh this data.
Thanks for checking.

@reaper-sid
Author

Can all of the pulling and parsing be done at the time the extension button is clicked rather than when the page/extension is loaded? I suppose it would seem slower to the end user, but at least it would cause the data to reload on click.
Or could the extension reload on page load even if the previous page was on youtube.com?
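One way to picture the "load on click" idea is to key the caption data by video ID, so that stale data from a previous episode is never reused. This is only a sketch: `createCaptionCache` and `videoIdFromUrl` are hypothetical names, and `loadTracks` stands in for whatever mechanism ends up fetching fresh tracks, which is the open question in this thread.

```javascript
// Hypothetical helper: read the video ID from a watch URL's "v" query parameter.
function videoIdFromUrl(url) {
  return new URL(url).searchParams.get('v')
}

// Hypothetical sketch: wrap a loader so tracks are re-fetched whenever the
// current tab shows a different video than the one the cache was built for.
// loadTracks(videoId) is an assumed async function supplied by the caller.
function createCaptionCache(loadTracks) {
  let cached = { videoId: null, tracks: null }
  return async function getTracks(tabUrl) {
    const videoId = videoIdFromUrl(tabUrl)
    if (videoId !== cached.videoId) {
      cached = { videoId, tracks: await loadTracks(videoId) }
    }
    return cached.tracks
  }
}
```

Calling the returned function from the click handler with the active tab's URL would cost one extra load per new video and nothing for repeated clicks on the same one.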

@reaper-sid
Author

reaper-sid commented Jan 20, 2023

So, I got it to work. Starting on line 72 in popup.js:

```javascript
// Try to get the captions from the UI
const { subtitles } = await chrome.tabs.sendMessage(id, { type: 'getSubtitles', title, storageId })
if (subtitles) {
  mainState.subtitles = subtitles
  mainState.view = 'list'
  return
} else {
  chrome.runtime.reload()
}
```

You can see I added the else which forces the extension to reload. This is a hack, and probably has side effects, but you get the idea that I'm going for. You know your code, and probably can come up with a better implementation.

@dobladov
Owner

dobladov commented Jan 20, 2023

> Can all of the pulling and parsing be done at the time the extension button is clicked rather than when the page/extension is loaded? I suppose it would seem slower to the end user, but at least it would cause the data to reload on click.

This is what it does already. As you pointed out, either the extension reloads the page or I find a way to get the data that YouTube queries when another video is loaded. I have access to the yt object, but since it's an undocumented API I'm not sure how to get the captions from it.

I considered the reload, but I would like to keep it as a last resort solution.
