Word-level & Pose-based Sign Language Interpreter

I am passionate about "AI for Good" projects, and I have conceived an idea to develop an AI system that translates sign language. Individuals who use sign language often rely on fingerspelling (signing each letter individually), primarily when they do not know the correct sign or when spelling out names. Understanding this, I aim to create a solution that is not only technically impressive but also genuinely beneficial for the community. My vision is to go beyond merely recognizing individual letters and instead provide live captions for sign language without requiring specialized hardware like gloves or glasses.

👉 Check the docs folder for more information about the implementation.

I. Proposed Solution

The input will be a video of deaf individuals using sign language, while the output will be the corresponding English text. The solution pipeline is structured as follows:

  1. Pose-to-Gloss: I plan to utilize MediaPipe to extract facial and hand landmarks from each frame. These coordinates will be formatted correctly and fed into a Transformer model. The goal is to classify the isolated signs or glosses represented by these coordinates. This approach has several advantages:
  • By using key points instead of raw video data, we can streamline processing. We only need to analyze a small set of coordinates (3 for each point) per frame, significantly improving efficiency for real-time applications. Additionally, key points are less affected by varying backgrounds, hand sizes, skin tones, and other factors that complicate traditional image classification models.
  • A sequence model will allow us to learn both temporal and spatial information (hand movements) from sequences of key points across frames, rather than classifying each frame in isolation, which can prolong prediction times.
  • I intend to collect and preprocess the WLASL dataset to train the Pose-to-Gloss model. Although this dataset contains around 2000 classes, it is limited to about 5-6 examples per word, leading to sparsity. To address this, I plan to adapt the best solution from the Google - Isolated Sign Language Recognition competition on Kaggle, which utilizes a Conv1D-Transformer model.
  2. Gloss-to-Text: This step involves translating the sequence of glosses into coherent, readable English text. As this is primarily a translation task, I plan to employ prompt engineering with OpenAI's GPT-4o mini to convert our classifier's gloss outputs into their appropriate translations without any additional fine-tuning.

II. Current Focus

Initially, our project will concentrate on American Sign Language. In future iterations, I plan to incorporate multilingual capabilities along with the following features:

  • Converting the output from text to audio.

  • Managing multiple signers within a single frame.

  • Implementing temporal segmentation to identify which frames contain sign language, enhancing translation accuracy and speed by allowing us to disregard irrelevant video content during inference.

  • Developing an end-to-end model for direct Pose-to-Text or even Pose-to-Audio. However, I anticipate challenges in processing entire videos compared to a defined set of key points. Additionally, I am uncertain whether omitting the gloss is a wise choice, as it provides direct insight into the signs being demonstrated.

  • Utilizing multimodal inputs to improve translation accuracy:

    • Audio Context: In mixed environments, incorporating audio from non-signers can provide context, helping to disambiguate signs based on spoken topics.

    • Visual Context: Integrating object detection or scene analysis can enhance understanding (e.g., recognizing a kitchen setting to interpret relevant signs).

For the demonstration, I envision creating a website or an extension for video conferencing platforms like Google Meet to generate live captions for deaf individuals. However, I recognize that this concept primarily aids non-signers in understanding deaf individuals rather than empowering deaf people to communicate effectively. My current time constraints prevent me from implementing a text-to-sign feature, so for now, I can only conceptualize this one-way communication demo, rather than a two-way interaction that facilitates communication from deaf individuals back to others.

III. Product High-level Design

The system is structured as a pipeline with distinct subsystems: Data Sources, Feature Store, MLOps Pipeline, Model Serving, Gloss-to-Text Translation, CI/CD Pipeline, and User Interface. Each subsystem is meticulously designed to handle specific aspects of the ASL translation workflow, from raw video data to a user-facing Streamlit app that provides real-time translations. The design integrates ClearML for MLOps automation, GitHub Actions for CI/CD, FastAPI for model serving, and GPT-4o-mini for natural language translation, culminating in a production-ready solution.

graph TD
    %% Data Sources
    subgraph "Data Sources 🌐"
        A1["πŸ“Ή WLASL Videos<br>(Kaggle: risangbaskoro/wlasl-processed<br>sttaseen/wlasl2000-resized)"] -->|Download| A2["πŸ“‚ Raw Data Storage<br>(datasets/wlasl-processed.zip)<br>(datasets/wlasl2000-resized.zip)"]
        A2 -->|Upload to ClearML| A3["πŸ“¦ ClearML Dataset<br>(WLASL2000)"]
    end

    %% Feature Store
    subgraph "Feature Store πŸ—„οΈ"
        A3 -->|Input Videos| B1["πŸ› οΈ MediaPipe<br>Landmark Extraction Task<br>(preprocessing/wlasl2landmarks.py)<br>(180 Keypoints/Frame: 42 Hand, 6 Pose, 132 Face)"]
        B1 -->|Takes ~2 days| B2["πŸ“¦ ClearML Dataset<br>(WLASL_landmarks.npz & WLASL_parsed_metadata.json)<br>Tags: stable"]
        B2 -->|Store Features| B3["πŸ“Š Feature Store<br>(Landmarks Features + Metadata: video_id, gloss, split)"]
    end

    %% MLOps Pipeline
    subgraph "MLOps Pipeline (ClearML) πŸš€"
        %% Data Preparation
        subgraph "Data Preparation πŸ”„"
            B3 -->|Load Features| C1["βš™οΈ Data Splitting<br>(step1_data_splitting.py)<br>(Splits: Train, Val, Test)"]
            C1 -->|Log Statistics| C2["πŸ“Š Feature Stats<br>(Num Frames/Video,<br>Video Count)<br>ClearML PLOTS"]
            C1 -->|"Output: ClearML Artifacts<br>(X_train, y_train)"| C3["βš™οΈ Data Augmentation<br>(Rotate, Zoom, Shift, Speedup, etc.)<br>(step2_data_augmentation.py)"]
            C3 -->|"Output: ClearML Artifacts<br>(X_train, y_train)"| C4["βš™οΈ Data Transformation<br>(Normalize: Nose-based<br>Pad: max_frames=195, padding_value=-100.0)<br>(step3_data_transformation.py)"]
            C1 -->|"Output: ClearML Artifacts<br>(X_val, y_val, X_test, y_test)"| C4
        end

        %% Multi-Model Training and Selection
        subgraph "Multi-Training & Selection 🧠"
            C4 -->|"Input: ClearML Artifacts<br>(X_train, y_train, X_val, y_val)"| D1["🧠 Pose-to-Gloss Models<br>1. GISLR (dim=192, kernel_size=17, dropout=0.2)<br>2. ConvNeXtTiny (include_preprocessing=False, include_top=True)<br>Prepare TF Dataset<br>(model_utils.py)"]
            D1 -->|"Mixed Precision<br>Parallel Training<br>(2 Colab A100 GPUs)"| D2["πŸ“ˆ Model Training<br>(step4_model_training.py)<br>(Batch Size: 128, Epochs: 100, learning_rate=1e-3, AdamW, ReductOnPlateau)"]
            D2 -->|Log Metrics| D3["πŸ“‰ Training Metrics<br>(Loss, Accuracy, Top-5)<br>ClearML SCALARS"]
            D2 -->|Compare val_accuracy| D4["πŸ“‹ Model Selection Task<br>(step5_model_selection.py)<br>(Select GISLR: Top-1 Accuracy 60%)"]
        end

        %% Hyperparameter Tuning
        subgraph "Hyperparameter Tuning πŸ”"
            D4 -->|Best Model| E1["πŸ” HyperParameterOptimizer<br>(step6_hyperparameter_tuning.py)<br>Search Space: learning_rate, dropout_rate, dim<br>Multi-objective (OptimizerOptuna): minimize val_loss & maximize val_accuracy<br>concurrent_tasks=2, total_max_jobs=2"]
            E1 -->|Best Parameters| E2["πŸ“Š Upload artifact<br>(best_job_id, best_hyperparameters, best_metrics)"]
            E1 -->|Log Metrics| E3["πŸ“‰ Updated Metrics<br>(val_accuracy: ~60%, val_top5_accuracy: ~87%)"]
        end

        %% Evaluation
        subgraph "Evaluation πŸ“Š"
            C4 -->|Test Set: X_test, y_test| F1["πŸ“Š Model Evaluation<br>(step7_model_evaluation.py)<br>"]
            E2 -->|Best Job ID, Best Weights| F1
            F1 -->|Log Metrics| F2["πŸ“‰ Evaluation Metrics<br>- Accuracy, Top-5<br>- Classification Report)<br>- Confusion Matrix<br>- ClearML PLOTS"]
        end

        %% Continuous Training
        subgraph "Continuous Training πŸ”„"
            G1["πŸ”” TriggerScheduler: New Data<br>(Monitor Tags: ['landmarks', 'stable'])<br>(trigger.py)"]
            G1 -->|Rerun Pipeline| H1
            B2 --> G1
            G2["πŸ”” TriggerScheduler: Performance Drop<br>(Monitor Test Accuracy < 60%)<br>(trigger.py)"]
            G2 -->|Rerun Pipeline| H1
            F2 --> G2
            H1["πŸ”„ PipelineController<br>(pipeline_from_tasks.py)<br>(Parameters: max_labels=100/300, batch_size=128, max_frames=195)<br>Tags: production"]
            H1 -->|Step 1| C1
            H1 -->|Step 2| C3
            H1 -->|Step 3| C4
            H1 -->|Step 4| D2
            H1 -->|Step 5| D4
            H1 -->|Step 6| E1
            H1 -->|Step 7| F1
            H1 -->|Tag Management| H2["🏷️ Production Tag<br>(Best Pipeline Selected)"]
        end
    end

    %% Model Serving
    subgraph "Model Serving 🌐"
        F1 -->|"Deploy Best Model<br>(TFLite Conversion)"| I1["🌐 FastAPI Endpoint<br>(serving/pose2gloss.py)<br>(Endpoints: /predict, /health, /metadata)<br>(Pydantic: Top-N Glosses with Scores)"]
    end

    %% Gloss-to-Text Translation
    subgraph "Gloss-to-Text Translation πŸ“"
        I1 -->|Top-5 Glosses with Scores| J1["πŸ“ Gloss-to-Text Translation<br>(Beam Search-like Prompt)<br>(gloss2text/translator.py)"]
        J1 -->|Call API| J2["🧠 GPT-4o-mini<br>(OpenAI API)<br>(max_tokens=100, temperature=0.7)"]
        J2 -->|Natural English| J1
        J1 -->|Log Metrics| J3["📊 Translation Metrics<br>(BLEU, ROUGE Scores)<br>ClearML PLOTS"]
    end

    %% CI/CD Pipeline
    subgraph "CI/CD Pipeline (GitHub Actions) πŸš€"
        K1["πŸ“‚ GitHub Repository<br>(SyntaxSquad)"]
        K1 -->|Push/PR| K2["πŸš€ GitHub Actions<br>(.github/workflows/pipeline.yml)"]
        K2 -->|CI: Test| K3["πŸ› οΈ Test Remote Runnable<br>(cicd/example_task.py)"]
        K3 -->|CI: Execute| K4["πŸ”„ Execute Pipeline<br>(cicd/pipeline_from_tasks.py)"]
        K4 -->|CI: Report| K5["πŸ“Š Report Metrics<br>(cicd/pipeline_reports.py)<br>(Comments on PR)"]
        K4 -->|CD: Tag| K6["🏷️ Production Tagging<br>(cicd/production_tagging.py)"]
        K6 -->|CD: Deploy| K7["🌍 Deploy FastAPI<br>(cicd/deploy_fastapi.py)<br>(Localhost Simulation)"]
        K7 --> I1
    end

    %% User Interface
    subgraph "User Interface πŸ–₯️"
        L1["πŸŽ₯ Webcam Input<br>(apps/streamlit_app.py)"]
        L1 -->|Extract Landmarks| B1
        L1 -->|Fetch Predictions| I1
        L1 -->|Display Translation| J1
        L1 -->|Render UI| L2["🖥️ Streamlit UI<br>(Real-Time Translation Display)"]
    end

    %% Styling for Color and Beauty
    style A1 fill:#FFD700,stroke:#DAA520
    style A2 fill:#FFD700,stroke:#DAA520
    style A3 fill:#FFD700,stroke:#DAA520
    style B1 fill:#32CD32,stroke:#228B22
    style B2 fill:#32CD32,stroke:#228B22
    style B3 fill:#32CD32,stroke:#228B22
    style C1 fill:#FF69B4,stroke:#FF1493
    style C2 fill:#FF69B4,stroke:#FF1493
    style C3 fill:#FF69B4,stroke:#FF1493
    style C4 fill:#FF69B4,stroke:#FF1493
    style D1 fill:#FF8C00,stroke:#FF4500
    style D2 fill:#FF8C00,stroke:#FF4500
    style D3 fill:#FF8C00,stroke:#FF4500
    style E1 fill:#00CED1,stroke:#008B8B
    style E2 fill:#00CED1,stroke:#008B8B
    style E3 fill:#00CED1,stroke:#008B8B
    style F1 fill:#1E90FF,stroke:#4682B4
    style F2 fill:#1E90FF,stroke:#4682B4
    style G1 fill:#ADFF2F,stroke:#7FFF00
    style G2 fill:#ADFF2F,stroke:#7FFF00
    style H1 fill:#FF6347,stroke:#FF4500
    style H2 fill:#FF6347,stroke:#FF4500
    style I1 fill:#FF1493,stroke:#C71585
    style J1 fill:#FFA500,stroke:#FF8C00
    style J2 fill:#FFA500,stroke:#FF8C00
    style J3 fill:#FFA500,stroke:#FF8C00
    style K1 fill:#DA70D6,stroke:#BA55D3
    style K2 fill:#DA70D6,stroke:#BA55D3
    style K3 fill:#DA70D6,stroke:#BA55D3
    style K4 fill:#DA70D6,stroke:#BA55D3
    style K5 fill:#DA70D6,stroke:#BA55D3
    style K6 fill:#DA70D6,stroke:#BA55D3
    style K7 fill:#DA70D6,stroke:#BA55D3
    style L1 fill:#FFDAB9,stroke:#FF7F50
    style L2 fill:#FFDAB9,stroke:#FF7F50

1. Data Sources 🌐

This subsystem handles the ingestion of raw data, specifically ASL videos from the WLASL dataset, sourced from Kaggle (risangbaskoro/wlasl-processed and sttaseen/wlasl2000-resized). The videos are downloaded as ZIP files (datasets/wlasl-processed.zip, datasets/wlasl2000-resized.zip) and uploaded to ClearML as a dataset (WLASL2000).

  • Data Sourcing Strategy: The use of 2 Kaggle datasets ensures robustness against missing videos, a challenge I identified in Sprint 1. This redundancy (primary dataset with a backup) is a best practice in data engineering, ensuring data availability for downstream tasks.
  • ClearML Dataset: Storing the raw data in ClearML (WLASL2000) facilitates versioning and traceability, key for MLOps Level 2. ClearML's dataset management allows tagging and monitoring, which supports continuous training triggers later in the pipeline.
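
For reference, here is a minimal sketch of how the downloaded archives can be registered as the WLASL2000 ClearML dataset. The ClearML project name is an assumption; the file paths follow the diagram above.

```python
# Sketch: register the downloaded WLASL archives as a versioned ClearML dataset.
from clearml import Dataset

dataset = Dataset.create(
    dataset_name="WLASL2000",
    dataset_project="SyntaxSquad",            # assumed project name
)
dataset.add_files("datasets/wlasl-processed.zip")
dataset.add_files("datasets/wlasl2000-resized.zip")
dataset.upload()     # push the archives to the configured ClearML storage
dataset.finalize()   # freeze this version so downstream tasks can pin it
```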

2. Feature Store 🗄️

This subsystem extracts landmarks from videos using MediaPipe, stores them as features, and registers them in ClearML. The process starts with the WLASL2000 dataset, extracts 180 keypoints per frame (42 hand, 6 pose, 132 face) using wlasl2landmarks.py, and saves the results as WLASL_landmarks.npz and WLASL_parsed_metadata.json. These are tagged as stable in ClearML and stored in a feature store with metadata (video_id, gloss, split).

  • Landmark Extraction: Using MediaPipe to extract 180 keypoints (42 hand, 6 pose, 132 face) is a strategic choice. The selection of landmarks focuses on expressive parts critical for ASL (hands for gestures, face for expressions, pose for body context), optimizing for both accuracy and computational efficiency. The process taking ~2 days highlights the scale of the dataset (~21k videos), but ClearML's task management ensures this is a one-time cost with reusable outputs.
  • Feature Store Design: Storing features with metadata (video_id, gloss, split) in a ClearML dataset enables efficient data access for training and inference. The stable tag ensures reliability for production pipelines, aligning with MLOps Level 2 requirements for robust feature management.
  • Technical Excellence: The feature store's integration with ClearML allows for versioning and reproducibility, critical for iterative development. The choice of 180 keypoints balances model input complexity with predictive power, showcasing expert-level design.
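
As an illustration of the extraction step, below is a minimal per-frame sketch using MediaPipe Holistic. It is not the actual preprocessing/wlasl2landmarks.py code: POSE_IDX and FACE_IDX are placeholders for the 6 pose and 132 face landmark indices actually kept by the pipeline.

```python
# Sketch of per-frame landmark extraction with MediaPipe Holistic.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic
POSE_IDX = list(range(6))     # placeholder: 6 of MediaPipe's 33 pose landmarks
FACE_IDX = list(range(132))   # placeholder: 132 of MediaPipe's 468 face landmarks

def frame_to_landmarks(frame, holistic):
    """Return a (180, 3) array: 21 + 21 hand, 6 pose, and 132 face keypoints (x, y, z)."""
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    def to_array(landmark_list, indices, expected):
        if landmark_list is None:                      # part not detected in this frame
            return np.zeros((expected, 3), dtype=np.float32)
        pts = np.array([[lm.x, lm.y, lm.z] for lm in landmark_list.landmark], dtype=np.float32)
        return pts[indices] if indices is not None else pts

    left = to_array(results.left_hand_landmarks, None, 21)
    right = to_array(results.right_hand_landmarks, None, 21)
    pose = to_array(results.pose_landmarks, POSE_IDX, 6)
    face = to_array(results.face_landmarks, FACE_IDX, 132)
    return np.concatenate([left, right, pose, face], axis=0)   # (180, 3)

# Usage:
# with mp_holistic.Holistic(static_image_mode=False) as holistic:
#     landmarks = frame_to_landmarks(frame, holistic)
```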

3. MLOps Pipeline (ClearML) 🚀

This is the core of the system, orchestrating data preparation, training, evaluation, and continuous training. Let's break down its subcomponents.

3.1 Data Preparation 🔄

Features from the feature store are processed through three tasks: Data Splitting, Data Augmentation, and Data Transformation. The splitting task (step1_data_splitting.py) divides data into train, validation, and test sets, logging statistics (e.g., number of frames, video count) as ClearML plots. Augmentation (step2_data_augmentation.py) applies transformations (rotate, zoom, shift, speedup), and transformation (step3_data_transformation.py) normalizes landmarks (nose-based) and pads sequences to max_frames=195 with padding_value=-100.0.

  • Splitting and Statistics: Logging feature stats as ClearML plots provides visibility into data distribution, aiding debugging and optimization. The splits ensure balanced datasets for training, validation, and testing, a critical step for model generalization.
  • Augmentation: Techniques like rotation, zooming, and speedup address the sparsity of the WLASL dataset (5-6 examples per gloss), artificially increasing diversity. This is a sophisticated approach to mitigate overfitting, especially given my model's tendency to overfit (noted in Sprint 2).
  • Normalization and Padding: Nose-based normalization standardizes landmark positions, reducing variance due to spatial differences. Padding to max_frames=195 with a distinct value (-100.0) ensures consistent input shapes for the model, with the value chosen to be outside the normalized range for easy masking during training.
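
A minimal sketch of the nose-based normalization and padding described above follows. NOSE_IDX is an assumed position of the nose keypoint within the 180-point layout; max_frames=195 and padding_value=-100.0 follow the pipeline parameters.

```python
# Sketch of the transformation step: nose-based normalization plus padding/truncation.
import numpy as np

MAX_FRAMES = 195
PADDING_VALUE = -100.0
NOSE_IDX = 42   # placeholder index of the nose landmark in the 180-keypoint layout

def transform_sequence(landmarks):
    """landmarks: (num_frames, 180, 3) -> (MAX_FRAMES, 180, 3), normalized and padded."""
    nose = landmarks[:, NOSE_IDX:NOSE_IDX + 1, :]    # (num_frames, 1, 3)
    normalized = landmarks - nose                    # recenter every frame on the nose

    padded = np.full((MAX_FRAMES, 180, 3), PADDING_VALUE, dtype=np.float32)
    num_frames = min(len(normalized), MAX_FRAMES)
    padded[:num_frames] = normalized[:num_frames]    # truncate or pad to a fixed length
    return padded
```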

3.2 Multi-Training & Selection 🧠

This subsystem trains 2 Pose-to-Gloss models: GISLR (dim=192, kernel_size=17, dropout=0.2) and ConvNeXtTiny (include_preprocessing=False, include_top=True). Training (step4_model_training.py) uses mixed precision on two Colab A100 GPUs, with batch_size=128, epochs=100, learning_rate=1e-3, AdamW optimizer, and ReduceLROnPlateau. Metrics (loss, accuracy, top-5) are logged as ClearML scalars. Model selection (step5_model_selection.py) chooses GISLR based on validation accuracy (60% top-1).

  • Model Choices: GISLR, adapted from the Google Isolated Sign Language Recognition competition, is a strong baseline with proven performance on landmark data. ConvNeXtTiny, a modern convolutional architecture, adds diversity by exploring different feature extraction paradigms, potentially capturing patterns GISLR misses.
  • Training Efficiency: Mixed precision and parallel training on A100 GPUs optimize computational resources, reducing training time (noted as 3-5 minutes in Sprint 2). The use of AdamW with ReduceLROnPlateau addresses my observation of overfitting, dynamically adjusting the learning rate to stabilize training.
  • Model Selection: Selecting GISLR based on validation accuracy (60% top-1) is pragmatic, given its performance aligns with the original WLASL paper (65.89% top-1 for WLASL100). Logging metrics as ClearML scalars enables experiment comparison, a hallmark of MLOps Level 2.
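
For illustration, here is a compact sketch of the training configuration (mixed precision, AdamW, ReduceLROnPlateau). build_gislr_model, train_ds, and val_ds are placeholders for the model constructor and tf.data pipelines built in model_utils.py, and the loss choice is an assumption.

```python
# Sketch of the training setup described above, not the exact step4_model_training.py code.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")        # mixed-precision training

model = build_gislr_model(dim=192, kernel_size=17, dropout=0.2)    # placeholder constructor

model.compile(
    optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",                        # assumed label encoding
    metrics=["accuracy",
             tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5, name="top5_accuracy")],
)

model.fit(
    train_ds,                        # batches of (landmarks, gloss_label), batch_size=128
    validation_data=val_ds,
    epochs=100,
    callbacks=[tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)],
)
```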

3.3 Hyperparameter Tuning 🔍

The HyperParameterOptimizer (step6_hyperparameter_tuning.py) tunes the GISLR model's hyperparameters (learning_rate, dropout_rate, dim) using Optuna, with a multi-objective goal (minimize val_loss, maximize val_accuracy). It runs 2 concurrent tasks with a total of 2 jobs, logging best parameters and metrics (val_accuracy ~60%, val_top5_accuracy ~87%) to ClearML.

  • Optimization Strategy: Using HyperParameterOptimizer with Optuna is a state-of-the-art approach for hyperparameter tuning, balancing exploration and exploitation. The multi-objective optimization ensures trade-offs between loss and accuracy are considered, critical given my model's overfitting tendencies.
  • Concurrency: Running 2 concurrent tasks maximizes GPU utilization on Colab, though the total of 2 jobs suggests a focused search, likely to manage resource constraints (noted in Sprint 2). The slight improvement in metrics (val_top5_accuracy from 86% to 87%) indicates effective tuning.
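
A minimal sketch of how such a tuning task can be set up with ClearML's HyperParameterOptimizer and Optuna is shown below. The base task ID, parameter paths, search ranges, and metric names are assumptions, and a single objective is shown for brevity, while the pipeline above uses the multi-objective form over val_loss and val_accuracy.

```python
# Sketch of the tuning setup with ClearML's HyperParameterOptimizer and OptimizerOptuna.
from clearml import Task
from clearml.automation import (DiscreteParameterRange, HyperParameterOptimizer,
                                UniformParameterRange)
from clearml.automation.optuna import OptimizerOptuna

Task.init(project_name="SyntaxSquad", task_name="step6_hyperparameter_tuning")  # assumed names

BASE_TASK_ID = "<step4_model_training_task_id>"   # placeholder: training task to clone

optimizer = HyperParameterOptimizer(
    base_task_id=BASE_TASK_ID,
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-2),
        UniformParameterRange("General/dropout_rate", min_value=0.1, max_value=0.4),
        DiscreteParameterRange("General/dim", values=[128, 192, 256]),
    ],
    objective_metric_title="val_accuracy",    # assumed scalar title logged by the training task
    objective_metric_series="val_accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerOptuna,
    max_number_of_concurrent_tasks=2,
    total_max_jobs=2,
)
optimizer.start_locally()   # run the search on the local machine
optimizer.wait()
optimizer.stop()
```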

3.4 Evaluation 📊

The evaluation task (step7_model_evaluation.py) assesses the best model (GISLR with tuned hyperparameters) on the test set, logging accuracy, top-5 accuracy, classification report, and confusion matrix as ClearML plots.

  • Comprehensive Metrics: Logging a classification report and confusion matrix provides deep insights into model performance across glosses, identifying specific areas for improvement (e.g., frequently confused glosses).
  • ClearML Integration: Visualizing metrics as plots in ClearML enhances interpretability, aligning with MLOps best practices for monitoring and debugging.
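
An illustrative sketch of the evaluation and logging flow: model, X_test, y_test, and gloss_names are placeholders supplied by earlier pipeline steps, and a ClearML Task is assumed to be already initialized.

```python
# Sketch of test-set evaluation with scikit-learn metrics reported to ClearML.
import numpy as np
from clearml import Logger
from sklearn.metrics import classification_report, confusion_matrix

probs = model.predict(X_test)                 # (num_samples, num_glosses)
y_pred = probs.argmax(axis=1)

print(classification_report(y_test, y_pred, target_names=gloss_names))

top5 = np.argsort(probs, axis=1)[:, -5:]      # indices of the 5 highest-scoring glosses
top5_acc = float(np.mean([label in row for label, row in zip(y_test, top5)]))

logger = Logger.current_logger()
logger.report_scalar(title="test", series="top5_accuracy", value=top5_acc, iteration=0)
logger.report_confusion_matrix(title="Confusion Matrix", series="test",
                               matrix=confusion_matrix(y_test, y_pred), iteration=0)
```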

3.5 Continuous Training 🔄

The PipelineController (pipeline_from_tasks.py) orchestrates the entire pipeline, with parameters (max_labels=100/300, batch_size=128, max_frames=195) and a production tag. TriggerScheduler monitors for new data (via tags landmarks, stable) and performance drops (test accuracy < 60%), rerunning the pipeline as needed.

  • Automation: The TriggerScheduler ensures continuous training by monitoring dataset tags and performance metrics, a cornerstone of MLOps Level 2. The new data trigger allows the system to adapt to fresh inputs, while the performance drop trigger (threshold at 60%) ensures reliability in production.
  • PipelineController: Centralizing the pipeline with PipelineController ensures end-to-end execution, with parameters customizable for experiments (e.g., WLASL100 vs. WLASL300). The production tag marks the pipeline for deployment, aligning with CI/CD requirements.
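
A minimal sketch of the two triggers using ClearML's TriggerScheduler: the project name, queue, polling interval, and pipeline task ID are assumptions rather than the exact trigger.py settings.

```python
# Sketch of a dataset trigger (new data) and a task trigger (performance drop).
from clearml.automation import TriggerScheduler

PIPELINE_TASK_ID = "<pipeline_controller_task_id>"   # placeholder

trigger = TriggerScheduler(pooling_frequency_minutes=30)

# New data: re-run the pipeline when a dataset version tagged 'landmarks' + 'stable' appears.
trigger.add_dataset_trigger(
    schedule_task_id=PIPELINE_TASK_ID,
    schedule_queue="default",
    trigger_project="SyntaxSquad",                   # assumed ClearML project
    trigger_on_tags=["landmarks", "stable"],
)

# Performance drop: re-run when the evaluation task reports test accuracy below 0.60.
trigger.add_task_trigger(
    schedule_task_id=PIPELINE_TASK_ID,
    schedule_queue="default",
    trigger_project="SyntaxSquad",
    trigger_on_metric="test",
    trigger_on_variant="accuracy",
    trigger_on_threshold=0.60,
    trigger_on_sign="min",
)

trigger.start()   # blocks and polls the ClearML backend
```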

4. Model Serving 🌐

The best model (GISLR) is converted to TFLite for efficient inference and deployed as a FastAPI endpoint (serving/pose2gloss.py). Endpoints include /predict (returns top-N glosses with scores), /health, and /metadata, using Pydantic for request/response validation.

  • TFLite Conversion: Converting the model to TFLite reduces inference latency and memory usage, critical for real-time applications like ASL translation. This optimization ensures the system can run on resource-constrained environments (e.g., local machines).
  • FastAPI Endpoints: The /predict endpoint leverages the top-N gloss prediction strategy (top-5 accuracy ~87%), enhancing translation quality. /health and /metadata endpoints provide operational insights, aligning with production best practices.
  • Pydantic Validation: Using Pydantic ensures robust input/output validation, improving API reliability and user experience.
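
Below is a minimal sketch of what such a FastAPI + TFLite serving layer can look like. The model path, gloss vocabulary, and response schema are assumptions, not the exact serving/pose2gloss.py implementation.

```python
# Sketch of a TFLite-backed /predict endpoint with Pydantic request/response models.
import numpy as np
import tensorflow as tf
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
interpreter = tf.lite.Interpreter(model_path="models/pose2gloss.tflite")  # assumed path
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]
GLOSSES = ["book", "drink", "computer"]   # placeholder vocabulary; loaded from metadata in practice

class PredictRequest(BaseModel):
    landmarks: list        # nested list with shape (MAX_FRAMES, 180, 3)
    top_n: int = 5

class GlossScore(BaseModel):
    gloss: str
    score: float

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=list[GlossScore])
def predict(req: PredictRequest):
    x = np.asarray(req.landmarks, dtype=np.float32)[None, ...]   # add batch dimension
    interpreter.set_tensor(input_detail["index"], x)
    interpreter.invoke()
    probs = interpreter.get_tensor(output_detail["index"])[0]
    top = probs.argsort()[::-1][: req.top_n]
    return [GlossScore(gloss=GLOSSES[i], score=float(probs[i])) for i in top]
```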

5. Gloss-to-Text Translation 📝

The top-5 glosses with scores from the FastAPI endpoint are fed into the Gloss-to-Text task (gloss2text/translator.py). A beam search-like prompt is used with GPT-4o-mini (max_tokens=100, temperature=0.7) to generate natural English translations. Metrics (BLEU, ROUGE) are logged as ClearML plots.

  • Beam Search-like Prompt: Using top-5 glosses with scores allows GPT-4o-mini to select the most coherent combination, significantly improving translation quality over a top-1 approach (given the 61% top-1 accuracy). This mimics beam search by considering multiple hypotheses, a novel application of LLMs in sign language translation.
  • GPT-4o-mini: The choice of GPT-4o-mini balances performance and latency, with temperature=0.7 introducing controlled creativity for natural translations. max_tokens=100 ensures concise outputs suitable for real-time display.
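
A minimal sketch of the beam-search-like prompting idea with the OpenAI Python client follows; the prompt wording is illustrative, not the exact prompt used in gloss2text/translator.py.

```python
# Sketch: feed per-sign gloss candidates with scores to GPT-4o-mini and get natural English back.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def glosses_to_text(top_glosses_per_sign):
    """top_glosses_per_sign: e.g. [[("MOTHER", 0.85), ("MOM", 0.10)], [("LOVE", 0.90)], ...]."""
    candidates = "\n".join(
        f"Sign {i + 1}: " + ", ".join(f"{gloss} ({score:.2f})" for gloss, score in options)
        for i, options in enumerate(top_glosses_per_sign)
    )
    prompt = (
        "Each sign below lists candidate ASL glosses with confidence scores. "
        "Pick the most coherent gloss for each sign and translate the sequence "
        "into natural English.\n" + candidates + "\nTranslation:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```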

6. CI/CD Pipeline (GitHub Actions) 🚀

The GitHub repository (SyntaxSquad) uses GitHub Actions (.github/workflows/pipeline.yml) for CI/CD. CI includes testing (cicd/example_task.py), pipeline execution, and reporting metrics as PR comments (cicd/pipeline_reports.py). CD tags the pipeline (cicd/production_tagging.py) and deploys the FastAPI endpoint to a localhost simulation.

  • CI Workflow: Automated testing ensures code quality, while pipeline execution verifies end-to-end functionality on each push/PR. Reporting metrics as PR comments enhances collaboration, a key MLOps Level 2 feature.
  • CD Workflow: Production tagging and FastAPI deployment ensure rapid updates in production, meeting the reliability goals of MLOps Level 2. The localhost simulation is a practical choice for academic projects, mimicking real-world deployment.

7. User Interface 🖥️

graph LR
    subgraph "Streamlit User Interface πŸ–₯️"
        A["πŸ‘€ User<br>(Deaf Individual)"]
        A -->|Access| B["πŸ–₯️ Streamlit App<br>(streamlit_app.py)<br>(Real-Time ASL Translation)"]
        B -->|Capture Video| C["πŸŽ₯ Webcam Input<br>(OpenCV: cv2.VideoCapture)"]
        C -->|Video Frames| D["πŸ› οΈ MediaPipe<br>Landmark Extraction<br>(180 Keypoints/Frame: 42 Hand, 6 Pose, 132 Face)"]
        D -->|Landmarks| E["πŸ“Š Display Landmarks<br>(Streamlit: st.image)<br>(Visualize 180 Keypoints in Real-Time)"]
        D -->|"Input: Landmarks<br>(frames, 180, 3)<br>(Fetch Predictions)"| F["🌐 FastAPI Endpoint<br>(serving/pose2gloss.py)<br>(/predict with payload {landmarks:ndarray(MAX_FRAMES, 180, 3), top_n})"]
        F -->|"Output: Top-N Glosses with Scores<br>(e.g., MOTHER: 0.85, LOVE: 0.90, FAMILY: 0.80)"| G["πŸ“ Gloss-to-Text Translation<br>(Beam Search-like Selection Prompt)<br>(gloss2text/translator.py)"]
        G -->|Call API| H["🧠 GPT-4o-mini<br>(OpenAI API)<br>(max_tokens=100, temperature=0.7)"]
        H -->|"Natural English<br>(e.g., 'Mother loves her family.')"| G
        G -->|Display| I["📜 Display Translation<br>(Streamlit: st.text)<br>(Show Natural English Text in Real-Time, Latency <1s)"]
        F -->|Display| J["📊 Gloss Predictions<br>(Streamlit: st.table)<br>(Show Top-N Glosses with Scores)"]
        B -->|Interact| K["🖱️ User Controls<br>(Streamlit: st.button)<br>(Start/Stop Translation, Adjust Settings: top_n)"]
    end

    %% Styling for Color and Beauty
    style A fill:#FF69B4,stroke:#FF1493
    style B fill:#FF69B4,stroke:#FF1493
    style C fill:#FFD700,stroke:#DAA520
    style D fill:#32CD32,stroke:#228B22
    style E fill:#FF69B4,stroke:#FF1493
    style F fill:#FF8C00,stroke:#FF4500
    style G fill:#FFA500,stroke:#FF8C00
    style H fill:#FFA500,stroke:#FF8C00
    style I fill:#FF69B4,stroke:#FF1493
    style J fill:#FF69B4,stroke:#FF1493
    style K fill:#FF69B4,stroke:#FF1493

The Streamlit UI (streamlit_app.py) provides a real-time ASL translation interface. It captures webcam input (cv2.VideoCapture), extracts landmarks using MediaPipe, fetches top-N gloss predictions via the FastAPI endpoint, and displays translations from the Gloss-to-Text task. The UI renders landmarks (st.image), gloss predictions (st.table), translations (st.text), and user controls (st.button for start/stop, top_n adjustment).

  • Real-Time Processing: The UI's ability to process webcam input, extract landmarks, and display translations with <1s latency is a technical feat, enabled by efficient TFLite inference and FastAPI serving.
  • User-Centric Design: Displaying landmarks, gloss predictions, and translations in real-time provides transparency, enhancing user trust. User controls for starting/stopping translation and adjusting top_n improve interactivity.
  • Integration: The seamless integration of webcam input, FastAPI predictions, and Gloss-to-Text translation demonstrates end-to-end system design, with Streamlit ensuring accessibility for deaf users.
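
A minimal sketch of such a Streamlit flow is shown below. The endpoint URL, the frame-buffering strategy, and the extract_landmarks / glosses_to_text helpers are placeholders, not the exact apps/streamlit_app.py code.

```python
# Sketch: capture webcam frames, call the FastAPI /predict endpoint, and display results.
import cv2
import requests
import streamlit as st

API_URL = "http://localhost:8000"   # assumed address of the FastAPI service
top_n = st.sidebar.slider("Top-N glosses", min_value=1, max_value=10, value=5)
frame_view, gloss_view, text_view = st.empty(), st.empty(), st.empty()

if st.button("Start translation"):
    cap = cv2.VideoCapture(0)                    # webcam input
    frames = []
    while cap.isOpened() and len(frames) < 195:  # collect up to max_frames landmark sets
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(extract_landmarks(frame))  # placeholder: MediaPipe helper, (180, 3) per frame
        frame_view.image(frame, channels="BGR")  # live preview
    cap.release()

    payload = {"landmarks": [f.tolist() for f in frames], "top_n": top_n}
    glosses = requests.post(f"{API_URL}/predict", json=payload).json()   # top-N glosses + scores
    gloss_view.table(glosses)
    text_view.text(glosses_to_text([[(g["gloss"], g["score"]) for g in glosses]]))  # placeholder helper
```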
