Scaling a YouTube video transcriber from a notebook prototype to a production environment
This repository contains the scripts and configuration files used to implement the pipeline. The goal is to store transcriptions of YouTube videos in a database. The pipeline follows an ELTL (Extract, Load, Transform, Load) pattern; a code sketch of these steps follows the list below.
- Extract: the video metadata is extracted with the pytube library.
- Load: the YouTube video is downloaded and its metadata is stored in the database.
- Transform: the audio track is extracted and transcribed with the OpenAI Whisper API.
- Load: the transcriptions are loaded into the PostgreSQL database.
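As a rough, hedged sketch of one pipeline iteration (the function name, the audio handling, and the use of the current OpenAI Python client are assumptions for illustration, not necessarily what the scripts do):

```python
# Illustrative sketch of one pipeline iteration; names and client details are assumptions.
from pytube import YouTube
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def process_video(url: str) -> tuple[dict, str]:
    yt = YouTube(url)

    # Extract: collect metadata with pytube.
    metadata = {
        "video_id": yt.video_id,
        "title": yt.title,
        "author": yt.author,
        "length_seconds": yt.length,
    }

    # Load: download the audio stream; the metadata would be inserted
    # into the 'video' table at this point.
    audio_path = yt.streams.filter(only_audio=True).first().download(
        filename=f"{yt.video_id}.mp4"
    )

    # Transform: transcribe the downloaded audio with the Whisper API.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # Load: the transcript text would be inserted into the transcriptions table here.
    return metadata, transcript.text
```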
The proposed database consists of two tables. The first table, 'video', contains metadata that can be extracted from the video with the pytube library. The video ID is the primary key, since it uniquely identifies the video. The data is pre-processed in batches and loaded into this table.
The second table holds the transcriptions themselves. It has a foreign key referencing the video ID in the first table, and the relationship is one-to-many: the transcription API is not deterministic, so repeated runs can produce non-identical output. By changing certain criteria in the script, the metadata for each video ID can be re-extracted; when a significant change is detected (for example in the format or file size after a video is edited), a re-transcription can be triggered and stored as a new row referencing the same video ID.
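A minimal SQLAlchemy sketch of this two-table schema (table, column, and type names are illustrative assumptions, not necessarily those used in the actual scripts):

```python
# Hypothetical SQLAlchemy declaration of the two tables; names are assumptions.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Video(Base):
    __tablename__ = "video"

    video_id = Column(String, primary_key=True)   # YouTube video ID from pytube
    title = Column(String)
    author = Column(String)
    length_seconds = Column(Integer)
    publish_date = Column(DateTime)

    transcriptions = relationship("Transcription", back_populates="video")


class Transcription(Base):
    __tablename__ = "transcription"

    id = Column(Integer, primary_key=True, autoincrement=True)
    # Foreign key to 'video'; one video can have many transcriptions,
    # since re-transcription appends a new row for the same video ID.
    video_id = Column(String, ForeignKey("video.video_id"))
    text = Column(Text)
    created_at = Column(DateTime)

    video = relationship("Video", back_populates="transcriptions")
```

Because the transcriptions table uses a surrogate primary key rather than the video ID, multiple rows can share the same video_id, which matches the one-to-many relationship described above.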
The pipeline is implemented using the following technologies:
- Python 3.10
- Docker
- Pandas
- PostgreSQL
- SQLAlchemy
- Clone the repository
- Set the database environment variables in the db_creds.env file under the configs folder
db_user=
db_host=
db_name=jde_test
db_password=
db_port=
pg_admin_email=
pg_admin_password=
- Set the API key in the api_key.env file under the configs folder (a sketch of how both env files might be read follows the setup steps)
OPENAI_API_KEY=
- Set up the SSH connection to the virtual machine.
- Run the following command to start the pipeline:
ssh user@host 'docker compose -f "src/jde_test/docker-compose.yml" --env-file=src/jde_test/configs/db_creds.env up -d --build'
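This builds the images on the remote host and starts the stack detached: `-f` points docker compose at the compose file, `--env-file` injects the database credentials, `-d` runs the containers in the background, and `--build` rebuilds the images before starting.

As a hedged sketch of how the scripts might consume the two env files (the use of python-dotenv, the relative paths, and the variable handling are assumptions about the implementation):

```python
# Sketch: load credentials from the env files and build a Postgres connection.
# Assumes python-dotenv is installed and the files live under configs/.
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv("configs/db_creds.env")
load_dotenv("configs/api_key.env")

# Connection URL assembled from the variables listed above.
db_url = (
    f"postgresql://{os.environ['db_user']}:{os.environ['db_password']}"
    f"@{os.environ['db_host']}:{os.environ['db_port']}/{os.environ['db_name']}"
)
engine = create_engine(db_url)

# The OpenAI client reads OPENAI_API_KEY directly from the environment.
openai_api_key = os.environ["OPENAI_API_KEY"]
```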