Create a vector search from youtube audio transcripts #289

Closed
Gautam-Rajeev opened this issue Feb 1, 2024 · 17 comments
@Gautam-Rajeev
Collaborator

Gautam-Rajeev commented Feb 1, 2024

Description

Parse all the videos from a YouTube channel or playlist, extract transcripts from their audio, and embed them in a vector DB to enable search and retrieval over the content.

Implementation Details

It will include the following:

  • Receive the channel link or playlist link from the user
  • Scrape the audio from all the videos in the channel/playlist
  • Extract the transcript, along with timestamps, from all the videos
  • Create chunks from the transcript (you can use basic chunks such as 4 minutes of audio, or any fancier chunking algorithm)
  • Summarise each video using an LLM call and store the summary as a separate chunk
  • Embed these in a vector DB using ColBERT (RAGatouille - LangChain). Use this for reference
  • Enable ColBERT search and retrieval on the content embeddings
  • When a question is searched, return the related content along with the YouTube link and the timestamps of the relevant content

Can use https://github.com/ytdl-org/youtube-dl for scraping
Can use https://www.youtube.com/@3blue1brown as initial test set for the above
Ticket for using ColBERT is covered here, you only need to make it work locally here using the notebook.
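
The basic time-based chunking and timestamped links described above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the `Chunk` class, `chunk_transcript`, and the segment layout (dicts with `text`, `start`, and `duration` in seconds) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str
    url: str      # link that opens the video at `start`


def _make_chunk(video_id: str, start: float, end: float, texts: List[str]) -> Chunk:
    # A `t=<seconds>s` query parameter makes YouTube start playback at that offset.
    url = f"https://www.youtube.com/watch?v={video_id}&t={int(start)}s"
    return Chunk(start=start, end=end, text=" ".join(texts), url=url)


def chunk_transcript(segments: List[Dict], video_id: str,
                     window_s: float = 240.0) -> List[Chunk]:
    """Group caption segments into consecutive windows of about `window_s` seconds.

    `segments` is assumed to be sorted by start time.
    """
    chunks: List[Chunk] = []
    texts: List[str] = []
    chunk_start: Optional[float] = None
    chunk_end = 0.0
    for seg in segments:
        if chunk_start is None:
            chunk_start = seg["start"]
        elif seg["start"] - chunk_start >= window_s:
            # The current window is full: emit it and start a new one here.
            chunks.append(_make_chunk(video_id, chunk_start, chunk_end, texts))
            texts, chunk_start = [], seg["start"]
        texts.append(seg["text"].strip())
        chunk_end = seg["start"] + seg.get("duration", 0.0)
    if texts:
        chunks.append(_make_chunk(video_id, chunk_start, chunk_end, texts))
    return chunks
```

Each chunk keeps its own start time and link, so a retrieved chunk can be returned together with the place in the video it came from.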

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

Pytorch/ Python, ML

Category

Feature

Mentor(s)

@GautamR-Samagra

Complexity

Medium


Hi!
Important Details - These following details are helpful for contributors to effectively identify and contribute to tickets.

  • Sub-Category - Please mention the sub-category if any for the ticket

Please update the ticket

@Neelesh2512

Guys, any of you can contribute. Let's not wait for approval. We can start working and raise a PR whenever we want 🙌🏻

@Gautam-Rajeev
Collaborator Author

Gautam-Rajeev commented Feb 2, 2024

Hi all. Glad to see the enthusiasm here :) You don't have to ask permission to begin working on tickets. Please raise PRs and comment links to the PRs here. I won't be assigning the ticket to anyone for now.

@ChakshuGautam
Collaborator

Hey team. Please raise a draft PR that we can review to see if everyone is going in the right direction. Thanks.

@kartikf4

kartikf4 commented Feb 2, 2024

@ChakshuGautam I'm facing this issue while working in a Colab environment:
DownloadError: ERROR: Unable to extract uploader id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
I have updated multiple times and tried other versions, but it's still not working for me.
Using yt-dlp for the same task works well, up to a certain extent. Should I continue with yt-dlp?

@ChakshuGautam
Collaborator

@kartikf4 Is this happening in a non-Colab env as well? Have you tried any alternatives to this package?

@kartikf4

kartikf4 commented Feb 3, 2024

@kartikf4 Is this happening in a non-Colab env as well? Have you tried any alternatives to this package?

@ChakshuGautam Well, I didn't try in a local env, but I did try the alternative, yt-dlp. Check here.

@ChakshuGautam
Collaborator

Probably has something to do with Colab. Let's do it locally.

@rachitavya

@kartikf4 Is this happening in a non-Colab env as well? Have you tried any alternatives to this package?

@ChakshuGautam Well, I didn't try in a local env, but I did try the alternative, yt-dlp. Check here.

Hey @kartikf4,
This one is doing fine here.

@anshuvermaa

Hi, I want to contribute to this. Can you assign it to me?

@Gautam-Rajeev
Collaborator Author

@ChakshuGautam https://pypi.org/project/youtube-transcript-api/ gives the transcripts for all videos in English/Hindi (from the auto-generated CC).
Can we clarify the merits of extracting the audio and transcribing it separately, beyond what the above already provides? Do we want to do that for Indian-language videos?

@xorsuyash
Collaborator

xorsuyash commented Feb 7, 2024

@ChakshuGautam, @GautamR-Samagra further progress on the issue:

  • Built a way to fetch the transcript from YouTube, but it only works when given the URL of a single video. For playlist and channel URLs we need to extract the channel ID. I was using youtube-dl for that; it successfully extracts playlist URLs but throws an error when extracting the URLs of the videos in a playlist. If anyone finds an alternative for that, let me know.
  • Created TranscriptChunker, which chunks the data according to a time frame given as input
  • Created a Quart web app for retrieving the data, chunking it, and then feeding it to ColBERT
  • Trying to set up Docker; will raise a PR soon.
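
One possible alternative for listing the videos of a playlist is yt-dlp's Python API in flat-extraction mode, which was reported earlier in this thread to work better than youtube-dl. The sketch below is untested against the project and makes assumptions: `entries_to_urls` and `list_video_urls` are hypothetical helper names, and only plain playlist URLs are handled (a channel URL may return nested tab entries, which this does not unpack).

```python
from typing import Dict, Iterable, List, Optional


def entries_to_urls(entries: Iterable[Optional[Dict]]) -> List[str]:
    """Turn flat playlist entries into watch URLs, skipping entries without an id."""
    urls = []
    for entry in entries:
        if entry and entry.get("id"):
            urls.append(f"https://www.youtube.com/watch?v={entry['id']}")
    return urls


def list_video_urls(playlist_url: str) -> List[str]:
    """List the video URLs of a playlist without downloading any media."""
    # Imported lazily so entries_to_urls() can be used without yt-dlp installed.
    from yt_dlp import YoutubeDL

    # extract_flat asks yt-dlp to return playlist entries without resolving each video.
    options = {"extract_flat": True, "quiet": True, "skip_download": True}
    with YoutubeDL(options) as ydl:
        info = ydl.extract_info(playlist_url, download=False)
    return entries_to_urls(info.get("entries") or [])
```

Calling `list_video_urls` needs network access and an installed `yt-dlp`; the returned URLs can then be passed one at a time to the existing single-video transcript code.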

@ChakshuGautam
Collaborator

@xorsuyash can you share a draft PR anyway so that we can review in chunks?

@xorsuyash xorsuyash mentioned this issue Feb 7, 2024
@xorsuyash
Collaborator

@ChakshuGautam I have raised a draft PR.

@rachitavya

rachitavya commented Feb 15, 2024

Hey @xorsuyash,

Let's drop the vector and ColBERT part until the issue is resolved.
For now, we'll keep it simple.

Single API:
param - yt video link
response - transcript.json
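
A rough sketch of that single-API shape, under stated assumptions: `video_id_from_url` and `build_transcript_response` are hypothetical names, the transcript segments are assumed to already be available as dicts with `text`, `start`, and `duration`, and the JSON layout is only one possible choice for transcript.json.

```python
import json
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")


def video_id_from_url(url: str) -> Optional[str]:
    """Extract the video id from common YouTube link shapes, or return None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the id as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.split("/")
        if len(parts) >= 3 and parts[1] in ("embed", "shorts"):
            return parts[2] or None
    return None


def build_transcript_response(url: str, segments: List[Dict]) -> str:
    """Serialise the transcript of one video as a JSON document."""
    video_id = video_id_from_url(url)
    if video_id is None:
        raise ValueError(f"not a recognisable YouTube video link: {url}")
    payload = {
        "video_id": video_id,
        "url": url,
        "segments": [
            {"text": s["text"], "start": s["start"], "duration": s.get("duration", 0.0)}
            for s in segments
        ],
    }
    # ensure_ascii=False keeps non-Latin captions (for example Hindi) readable.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Keeping the per-segment start times in the response means chunking and timestamped links can be added later without changing the API.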

Also I have some questions:

  • What happens when a video has neither custom captions nor auto-generated captions?
  • What happens when there are captions in multiple languages?

@xorsuyash
Collaborator

@rachitavya

  • If a video has any spoken-language audio, YouTube generates an auto-generated transcript. YouTube does not generate a transcript only for videos without transcribable audio, such as soundtracks.

  • Can you share some videos that have multiple languages so that I can test the API (to check the extent to which youtube_transcript_api can transcribe audio)?
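
One way to make the behaviour for the two cases above explicit is a small selection policy over the caption tracks available for a video. The sketch below is an assumption-laden illustration, not what the PR does: `pick_caption_track` is a hypothetical helper, each track is assumed to be a dict with `language_code` and `is_generated` keys, and the policy (manual captions first, then auto-generated, each in order of preferred language) is just one reasonable choice.

```python
from typing import Dict, List, Optional, Sequence


def pick_caption_track(tracks: List[Dict],
                       preferred_languages: Sequence[str] = ("en", "hi")) -> Optional[Dict]:
    """Choose one caption track, or return None when nothing suitable exists.

    Manually created captions win over auto-generated ones; within each group
    the order of `preferred_languages` decides.
    """
    for want_generated in (False, True):
        for lang in preferred_languages:
            for track in tracks:
                if (track["language_code"] == lang
                        and track["is_generated"] == want_generated):
                    return track
    # No captions in a preferred language: the caller could skip the video or
    # fall back to transcribing the audio separately.
    return None
```

Returning `None` rather than raising leaves the "no captions at all" decision to the caller, which is where the question of transcribing the audio separately would come in.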

@Gautam-Rajeev
Collaborator Author

@xorsuyash Thanks for completing this.

cc: @Shruti3004 , @ChakshuGautam


7 participants