I am passionate about "AI for Good" projects, and I have conceived an idea to develop an AI system that translates sign language. Individuals who use sign language often rely on fingerspellingβsigning each letter individuallyβprimarily when they do not know the correct sign or when spelling out names. Understanding this, I aim to create a solution that is not only technically impressive but also genuinely beneficial for the community. My vision is to go beyond merely recognizing individual letters and instead provide live captions for sign language without requiring specialized hardware like gloves or glasses.
π Check the docs folders for more information about the implementation.
The input will be a video of deaf individuals using sign language, while the output will be the corresponding English text. The solution pipeline is structured as follows:
- Pose-to-Gloss: I plan to utilize MediaPipe to extract facial and hand landmarks from each frame. These coordinates will be formatted correctly and fed into a Transformer model. The goal is to classify the isolated signs or glosses represented by these coordinates. This approach has several advantages:
- By using key points instead of raw video data, we can streamline processing. We only need to analyze a small set of coordinates (3 for each point) per frame, significantly improving efficiency for real-time applications. Additionally, key points are less affected by varying backgrounds, hand sizes, skin tones, and other factors that complicate traditional image classification models.
- A sequence model will allow us to learn both temporal and spatial information (hand movements) from sequences of key points across frames, rather than classifying each frame in isolation, which can prolong prediction times.
- I intend to collect and preprocess the WLASL dataset to train the Pose-to-Gloss model. Although this dataset contains around 2000 classes, it is limited to about 5-6 examples per word, leading to sparsity. To address this, I plan to adapt the best solution from the Google - Isolated Sign Language Recognition competition on Kaggle, which utilizes a Conv1D-Transformer model.
- Gloss-to-Text: This step involves translating the sequence of glosses into coherent, readable English text. As this is primarily a translation task, I plan to employ prompt engineering with OpenAI's GPT-4o mini to convert our classifier's gloss outputs into their appropriate translations without any additional fine-tuning.
Initially, our project will concentrate on American Sign Language. In future iterations, I plan to incorporate multilingual capabilities along with the following features:
-
Converting the output from text to audio.
-
Managing multiple signers within a single frame.
-
Implementing temporal segmentation to identify which frames contain sign language, enhancing translation accuracy and speed by allowing us to disregard irrelevant video content during inference.
-
Developing an end-to-end model for direct Pose-to-Text or even Pose-to-Audio. However, I anticipate challenges in processing entire videos compared to a defined set of key points. Additionally, I am uncertain whether omitting the gloss is a wise choice, as it provides direct insight into the signs being demonstrated.
-
Utilizing multimodal inputs to improve translation accuracy:
-
Audio Context: In mixed environments, incorporating audio from non-signers can provide context, helping to disambiguate signs based on spoken topics.
-
Visual Context: Integrating object detection or scene analysis can enhance understanding (e.g., recognizing a kitchen setting to interpret relevant signs).
-
For the demonstration, I envision creating a website or an extension for video conferencing platforms like Google Meet to generate live captions for deaf individuals. However, I recognize that this concept primarily aids non-signers in understanding deaf individuals rather than empowering deaf people to communicate effectively. My current time constraints prevent me from implementing a text-to-sign feature, so for now, I can only conceptualize this one-way communication demo, rather than a two-way interaction that facilitates communication from deaf individuals back to others.
The system is structured as a pipeline with distinct subsystems: Data Sources, Feature Store, MLOps Pipeline, Model Serving, Gloss-to-Text Translation, CI/CD Pipeline, and User Interface. Each subsystem is meticulously designed to handle specific aspects of the ASL translation workflow, from raw video data to a user-facing Streamlit app that provides real-time translations. The design integrates ClearML for MLOps automation, GitHub Actions for CI/CD, FastAPI for model serving, and GPT-4o-mini for natural language translation, culminating in a production-ready solution.
graph TD
%% Data Sources
subgraph "Data Sources π"
A1["πΉ WLASL Videos<br>(Kaggle: risangbaskoro/wlasl-processed<br>sttaseen/wlasl2000-resized)"] -->|Download| A2["π Raw Data Storage<br>(datasets/wlasl-processed.zip)<br>(datasets/wlasl2000-resized.zip)"]
A2 -->|Upload to ClearML| A3["π¦ ClearML Dataset<br>(WLASL2000)"]
end
%% Feature Store
subgraph "Feature Store ποΈ"
A3 -->|Input Videos| B1["π οΈ MediaPipe<br>Landmark Extraction Task<br>(preprocessing/wlasl2landmarks.py)<br>(180 Keypoints/Frame: 42 Hand, 6 Pose, 132 Face)"]
B1 -->|Takes ~2 days| B2["π¦ ClearML Dataset<br>(WLASL_landmarks.npz & WLASL_parsed_metadata.json)<br>Tags: stable"]
B2 -->|Store Features| B3["π Feature Store<br>(Landmarks Features + Metadata: video_id, gloss, split)"]
end
%% MLOps Pipeline
subgraph "MLOps Pipeline (ClearML) π"
%% Data Preparation
subgraph "Data Preparation π"
B3 -->|Load Features| C1["βοΈ Data Splitting<br>(step1_data_splitting.py)<br>(Splits: Train, Val, Test)"]
C1 -->|Log Statistics| C2["π Feature Stats<br>(Num Frames/Video,<br>Video Count)<br>ClearML PLOTS"]
C1 -->|"Output: ClearML Artifacts<br>(X_train, y_train)"| C3["βοΈ Data Augmentation<br>(Rotate, Zoom, Shift, Speedup, etc.)<br>(step2_data_augmentation.py)"]
C3 -->|"Output: ClearML Artifacts<br>(X_train, y_train)"| C4["βοΈ Data Transformation<br>(Normalize: Nose-based<br>Pad: max_frames=195, padding_value=-100.0)<br>(step3_data_transformation.py)"]
C1 -->|"Output: ClearML Artifacts<br>(X_val, y_val, X_test, y_test)"| C4
end
%% Multi-Model Training and Selection
subgraph "Multi-Training & Selection π§ "
C4 -->|"Input: ClearML Artifacts<br>(X_train, y_train, X_val, y_val)"| D1["π§ Pose-to-Gloss Models<br>1. GISLR (dim=192, kernel_size=17, dropout=0.2)<br>2. ConvNeXtTiny (include_preprocessing=False, include_top=True)<br>Prepare TF Dataset<br>(model_utils.py)"]
D1 -->|"Mixed Precision<br>Parallel Training<br>(2 Colab A100 GPUs)"| D2["π Model Training<br>(step4_model_training.py)<br>(Batch Size: 128, Epochs: 100, learning_rate=1e-3, AdamW, ReductOnPlateau)"]
D2 -->|Log Metrics| D3["π Training Metrics<br>(Loss, Accuracy, Top-5)<br>ClearML SCALARS"]
D2 -->|Compare val_accuracy| D4["π Model Selection Task<br>(step5_model_selection.py)<br>(Select GISLR: Top-1 Accuracy 60%)"]
end
%% Hyperparameter Tuning
subgraph "Hyperparameter Tuning π"
D4 -->|Best Model| E1["π HyperParameterOptimizer<br>(step6_hyperparameter_tuning.py)<br>Search Space: learning_rate, dropout_rate, dim<br>Multi-objective (OptimizerOptuna): minimize val_loss & maximize val_accuracy<br>concurrent_tasks=2, total_max_jobs=2"]
E1 -->|Best Parameters| E2["π Upload artifact<br>(best_job_id, best_hyperparameters, best_metrics)"]
E1 -->|Log Metrics| E3["π Updated Metrics<br>(val_accuracy: ~60%, val_top5_accuracy: ~87%)"]
end
%% Evaluation
subgraph "Evaluation π"
C4 -->|Test Set: X_test, y_test| F1["π Model Evaluation<br>(step7_model_evaluation.py)<br>"]
E2 -->|Best Job ID, Best Weights| F1
F1 -->|Log Metrics| F2["π Evaluation Metrics<br>- Accuracy, Top-5<br>- Classification Report)<br>- Confusion Matrix<br>- ClearML PLOTS"]
end
%% Continuous Training
subgraph "Continuous Training π"
G1["π TriggerScheduler: New Data<br>(Monitor Tags: ['landmarks', 'stable'])<br>(trigger.py)"]
G1 -->|Rerun Pipeline| H1
B2 --> G1
G2["π TriggerScheduler: Performance Drop<br>(Monitor Test Accuracy < 60%)<br>(trigger.py)"]
G2 -->|Rerun Pipeline| H1
F2 --> G2
H1["π PipelineController<br>(pipeline_from_tasks.py)<br>(Parameters: max_labels=100/300, batch_size=128, max_frames=195)<br>Tags: production"]
H1 -->|Step 1| C1
H1 -->|Step 2| C3
H1 -->|Step 3| C4
H1 -->|Step 4| D2
H1 -->|Step 5| D4
H1 -->|Step 6| E1
H1 -->|Step 7| F1
H1 -->|Tag Management| H2["π·οΈ Production Tag<br>(Best Pipeline Selected)"]
end
end
%% Model Serving
subgraph "Model Serving π"
F1 -->|"Deploy Best Model<br>(TFLite Conversion)"| I1["π FastAPI Endpoint<br>(serving/pose2gloss.py)<br>(Endpoints: /predict, /health, /metadata)<br>(Pydantic: Top-N Glosses with Scores)"]
end
%% Gloss-to-Text Translation
subgraph "Gloss-to-Text Translation π"
I1 -->|Top-5 Glosses with Scores| J1["π Gloss-to-Text Translation<br>(Beam Search-like Prompt)<br>(gloss2text/translator.py)"]
J1 -->|Call API| J2["π§ GPT-4o-mini<br>(OpenAI API)<br>(max_tokens=100, temperature=0.7)"]
J2 -->|Natural English| J1
J1 -->|Log Metrics| J3["π Translation Metrics<br>(BLEU, ROUGE Scores)<br>ClearML PLOTS"]
end
%% CI/CD Pipeline
subgraph "CI/CD Pipeline (GitHub Actions) π"
K1["π GitHub Repository<br>(SyntaxSquad)"]
K1 -->|Push/PR| K2["π GitHub Actions<br>(.github/workflows/pipeline.yml)"]
K2 -->|CI: Test| K3["π οΈ Test Remote Runnable<br>(cicd/example_task.py)"]
K3 -->|CI: Execute| K4["π Execute Pipeline<br>(cicd/pipeline_from_tasks.py)"]
K4 -->|CI: Report| K5["π Report Metrics<br>(cicd/pipeline_reports.py)<br>(Comments on PR)"]
K4 -->|CD: Tag| K6["π·οΈ Production Tagging<br>(cicd/production_tagging.py)"]
K6 -->|CD: Deploy| K7["π Deploy FastAPI<br>(cicd/deploy_fastapi.py)<br>(Localhost Simulation)"]
K7 --> I1
end
%% User Interface
subgraph "User Interface π₯οΈ"
L1["π₯ Webcam Input<br>(apps/streamlit_app.py)"]
L1 -->|Extract Landmarks| B1
L1 -->|Fetch Predictions| I1
L1 -->|Display Translation| J1
L1 -->|Render UI| L2["π₯οΈ Streamlit UI<br>(Real-Time Translation Display)"]
end
%% Styling for Color and Beauty
style A1 fill:#FFD700,stroke:#DAA520
style A2 fill:#FFD700,stroke:#DAA520
style A3 fill:#FFD700,stroke:#DAA520
style B1 fill:#32CD32,stroke:#228B22
style B2 fill:#32CD32,stroke:#228B22
style B3 fill:#32CD32,stroke:#228B22
style C1 fill:#FF69B4,stroke:#FF1493
style C2 fill:#FF69B4,stroke:#FF1493
style C3 fill:#FF69B4,stroke:#FF1493
style C4 fill:#FF69B4,stroke:#FF1493
style D1 fill:#FF8C00,stroke:#FF4500
style D2 fill:#FF8C00,stroke:#FF4500
style D3 fill:#FF8C00,stroke:#FF4500
style E1 fill:#00CED1,stroke:#008B8B
style E2 fill:#00CED1,stroke:#008B8B
style E3 fill:#00CED1,stroke:#008B8B
style F1 fill:#1E90FF,stroke:#4682B4
style F2 fill:#1E90FF,stroke:#4682B4
style G1 fill:#ADFF2F,stroke:#7FFF00
style G2 fill:#ADFF2F,stroke:#7FFF00
style H1 fill:#FF6347,stroke:#FF4500
style H2 fill:#FF6347,stroke:#FF4500
style I1 fill:#FF1493,stroke:#C71585
style J1 fill:#FFA500,stroke:#FF8C00
style J2 fill:#FFA500,stroke:#FF8C00
style J3 fill:#FFA500,stroke:#FF8C00
style K1 fill:#DA70D6,stroke:#BA55D3
style K2 fill:#DA70D6,stroke:#BA55D3
style K3 fill:#DA70D6,stroke:#BA55D3
style K4 fill:#DA70D6,stroke:#BA55D3
style K5 fill:#DA70D6,stroke:#BA55D3
style K6 fill:#DA70D6,stroke:#BA55D3
style K7 fill:#DA70D6,stroke:#BA55D3
style L1 fill:#FFDAB9,stroke:#FF7F50
style L2 fill:#FFDAB9,stroke:#FF7F50
This subsystem handles the ingestion of raw data, specifically ASL videos from the WLASL dataset, sourced from Kaggle (risangbaskoro/wlasl-processed
and sttaseen/wlasl2000-resized
). The videos are downloaded as ZIP files (datasets/wlasl-processed.zip
, datasets/wlasl2000-resized.zip
) and uploaded to ClearML as a dataset (WLASL2000
).
- Data Sourcing Strategy: The use of 2 Kaggle datasets ensures robustness against missing videos, a challenge I identified in Sprint 1. This redundancy (primary dataset with a backup) is a best practice in data engineering, ensuring data availability for downstream tasks.
- ClearML Dataset: Storing the raw data in ClearML (
WLASL2000
) facilitates versioning and traceability, key for MLOps Level 2. ClearML's dataset management allows tagging and monitoring, which supports continuous training triggers later in the pipeline.
This subsystem extracts landmarks from videos using MediaPipe, stores them as features, and registers them in ClearML. The process starts with the WLASL2000
dataset, extracts 180 keypoints per frame (42 hand, 6 pose, 132 face) using wlasl2landmarks.py, and saves the results as WLASL_landmarks.npz
and WLASL_parsed_metadata.json. These are tagged as stable
in ClearML and stored in a feature store with metadata (video_id
, gloss
, split
).
- Landmark Extraction: Using MediaPipe to extract 180 keypoints (42 hand, 6 pose, 132 face) is a strategic choice. The selection of landmarks focuses on expressive parts critical for ASL (hands for gestures, face for expressions, pose for body context), optimizing for both accuracy and computational efficiency. The process taking ~2 days highlights the scale of the dataset (~21k videos), but ClearML's task management ensures this is a one-time cost with reusable outputs.
- Feature Store Design: Storing features with metadata (
video_id
,gloss
,split
) in a ClearML dataset enables efficient data access for training and inference. Thestable
tag ensures reliability for production pipelines, aligning with MLOps Level 2 requirements for robust feature management. - Technical Excellence: The feature store's integration with ClearML allows for versioning and reproducibility, critical for iterative development. The choice of 180 keypoints balances model input complexity with predictive power, showcasing expert-level design.
This is the core of the system, orchestrating data preparation, training, evaluation, and continuous training. Let's break down its subcomponents.
Features from the feature store are processed through three tasks: Data Splitting, Data Augmentation, and Data Transformation. The splitting task (step1_data_splitting.py) divides data into train, validation, and test sets, logging statistics (e.g., number of frames, video count) as ClearML plots. Augmentation (step2_data_augmentation.py) applies transformations (rotate, zoom, shift, speedup), and transformation (step3_data_transformation.py) normalizes landmarks (nose-based) and pads sequences to max_frames=195
with padding_value=-100.0
.
- Splitting and Statistics: Logging feature stats as ClearML plots provides visibility into data distribution, aiding debugging and optimization. The splits ensure balanced datasets for training, validation, and testing, a critical step for model generalization.
- Augmentation: Techniques like rotation, zooming, and speedup address the sparsity of the WLASL dataset (5-6 examples per gloss), artificially increasing diversity. This is a sophisticated approach to mitigate overfitting, especially given my model's tendency to overfit (noted in Sprint 2).
- Normalization and Padding: Nose-based normalization standardizes landmark positions, reducing variance due to spatial differences. Padding to
max_frames=195
with a distinct value (-100.0
) ensures consistent input shapes for the model, with the value chosen to be outside the normalized range for easy masking during training.
This subsystem trains 2 Pose-to-Gloss models: GISLR (dim=192
, kernel_size=17
, dropout=0.2
) and ConvNeXtTiny (include_preprocessing=False
, include_top=True
). Training (step4_model_training.py) uses mixed precision on two Colab A100 GPUs, with batch_size=128
, epochs=100
, learning_rate=1e-3
, AdamW optimizer, and ReduceLROnPlateau
. Metrics (loss, accuracy, top-5) are logged as ClearML scalars. Model selection (step5_model_selection.py
) chooses GISLR based on validation accuracy (60% top-1).
- Model Choices: GISLR, adapted from the Google Isolated Sign Language Recognition competition, is a strong baseline with proven performance on landmark data. ConvNeXtTiny, a modern convolutional architecture, adds diversity by exploring different feature extraction paradigms, potentially capturing patterns GISLR misses.
- Training Efficiency: Mixed precision and parallel training on A100 GPUs optimize computational resources, reducing training time (noted as 3-5 minutes in Sprint 2). The use of AdamW with
ReduceLROnPlateau
addresses my observation of overfitting, dynamically adjusting the learning rate to stabilize training. - Model Selection: Selecting GISLR based on validation accuracy (60% top-1) is pragmatic, given its performance aligns with the original WLASL paper (65.89% top-1 for WLASL100). Logging metrics as ClearML scalars enables experiment comparison, a hallmark of MLOps Level 2.
The HyperParameterOptimizer
(step6_hyperparameter_tuning.py) tunes the GISLR model's hyperparameters (learning_rate
, dropout_rate
, dim
) using Optuna, with a multi-objective goal (minimize val_loss
, maximize val_accuracy
). It runs 2 concurrent tasks with a total of 2 jobs, logging best parameters and metrics (val_accuracy ~60%, val_top5_accuracy ~87%) to ClearML.
- Optimization Strategy: Using
HyperParameterOptimizer
with Optuna is a state-of-the-art approach for hyperparameter tuning, balancing exploration and exploitation. The multi-objective optimization ensures trade-offs between loss and accuracy are considered, critical given my model's overfitting tendencies. - Concurrency: Running 2 concurrent tasks maximizes GPU utilization on Colab, though the total of 2 jobs suggests a focused search, likely to manage resource constraints (noted in Sprint 2). The slight improvement in metrics (val_top5_accuracy from 86% to 87%) indicates effective tuning.
The evaluation task (step7_model_evaluation.py) assesses the best model (GISLR with tuned hyperparameters) on the test set, logging accuracy, top-5 accuracy, classification report, and confusion matrix as ClearML plots.
- Comprehensive Metrics: Logging a classification report and confusion matrix provides deep insights into model performance across glosses, identifying specific areas for improvement (e.g., frequently confused glosses).
- ClearML Integration: Visualizing metrics as plots in ClearML enhances interpretability, aligning with MLOps best practices for monitoring and debugging.
The PipelineController
(pipeline_from_tasks.py) orchestrates the entire pipeline, with parameters (max_labels=100/300
, batch_size=128
, max_frames=195
) and a production
tag. TriggerScheduler
monitors for new data (via tags landmarks
, stable
) and performance drops (test accuracy < 60%), rerunning the pipeline as needed.
- Automation: The
TriggerScheduler
ensures continuous training by monitoring dataset tags and performance metrics, a cornerstone of MLOps Level 2. The new data trigger allows the system to adapt to fresh inputs, while the performance drop trigger (threshold at 60%) ensures reliability in production. - PipelineController: Centralizing the pipeline with
PipelineController
ensures end-to-end execution, with parameters customizable for experiments (e.g., WLASL100 vs. WLASL300). Theproduction
tag marks the pipeline for deployment, aligning with CI/CD requirements.
The best model (GISLR) is converted to TFLite for efficient inference and deployed as a FastAPI endpoint (serving/pose2gloss.py). Endpoints include /predict
(returns top-N glosses with scores), /health
, and /metadata
, using Pydantic for request/response validation.
- TFLite Conversion: Converting the model to TFLite reduces inference latency and memory usage, critical for real-time applications like ASL translation. This optimization ensures the system can run on resource-constrained environments (e.g., local machines).
- FastAPI Endpoints: The
/predict
endpoint leverages the top-N gloss prediction strategy (top-5 accuracy ~87%), enhancing translation quality./health
and/metadata
endpoints provide operational insights, aligning with production best practices. - Pydantic Validation: Using Pydantic ensures robust input/output validation, improving API reliability and user experience.
The top-5 glosses with scores from the FastAPI endpoint are fed into the Gloss-to-Text task (gloss2text/translator.py). A beam search-like prompt is used with GPT-4o-mini (max_tokens=100
, temperature=0.7
) to generate natural English translations. Metrics (BLEU, ROUGE) are logged as ClearML plots.
- Beam Search-like Prompt: Using top-5 glosses with scores allows GPT-4o-mini to select the most coherent combination, significantly improving translation quality over a top-1 approach (given the 61% top-1 accuracy). This mimics beam search by considering multiple hypotheses, a novel application of LLMs in sign language translation.
- GPT-4o-mini: The choice of GPT-4o-mini balances performance and latency, with
temperature=0.7
introducing controlled creativity for natural translations.max_tokens=100
ensures concise outputs suitable for real-time display.
The GitHub repository (SyntaxSquad
) uses GitHub Actions (.github/workflows/pipeline.yml) for CI/CD. CI includes testing (cicd/example_task.py), pipeline execution, and reporting metrics as PR comments (cicd/pipeline_reports.py). CD tags the pipeline (cicd/production_tagging.py) and deploys the FastAPI endpoint to a localhost simulation.
- CI Workflow: Automated testing ensures code quality, while pipeline execution verifies end-to-end functionality on each push/PR. Reporting metrics as PR comments enhances collaboration, a key MLOps Level 2 feature.
- CD Workflow: Production tagging and FastAPI deployment ensure rapid updates in production, meeting the reliability goals of MLOps Level 2. The localhost simulation is a practical choice for academic projects, mimicking real-world deployment.
graph LR
subgraph "Streamlit User Interface π₯οΈ"
A["π€ User<br>(Deaf Individual)"]
A -->|Access| B["π₯οΈ Streamlit App<br>(streamlit_app.py)<br>(Real-Time ASL Translation)"]
B -->|Capture Video| C["π₯ Webcam Input<br>(OpenCV: cv2.VideoCapture)"]
C -->|Video Frames| D["π οΈ MediaPipe<br>Landmark Extraction<br>(180 Keypoints/Frame: 42 Hand, 6 Pose, 132 Face)"]
D -->|Landmarks| E["π Display Landmarks<br>(Streamlit: st.image)<br>(Visualize 180 Keypoints in Real-Time)"]
D -->|"Input: Landmarks<br>(frames, 180, 3)<br>(Fetch Predictions)"| F["π FastAPI Endpoint<br>(serving/pose2gloss.py)<br>(/predict with payload {landmarks:ndarray(MAX_FRAMES, 180, 3), top_n})"]
F -->|"Output: Top-N Glosses with Scores<br>(e.g., MOTHER: 0.85, LOVE: 0.90, FAMILY: 0.80)"| G["π Gloss-to-Text Translation<br>(Beam Search-like Selection Prompt)<br>(gloss2text/translator.py)"]
G -->|Call API| H["π§ GPT-4o-mini<br>(OpenAI API)<br>(max_tokens=100, temperature=0.7)"]
H -->|"Natural English<br>(e.g., 'Mother loves her family.')"| G
G -->|Display| I["π Display Translation<br>(Streamlit: st.text)<br>(Show Natural English Text in Real-Time, Latency <1s)"]
F -->|Display| J["π Gloss Predictions<br>(Streamlit: st.table)<br>(Show Top-N Glosses with Scores)"]
B -->|Interact| K["π±οΈ User Controls<br>(Streamlit: st.button)<br>(Start/Stop Translation, Adjust Settings: top_n)"]
end
%% Styling for Color and Beauty
style A fill:#FF69B4,stroke:#FF1493
style B fill:#FF69B4,stroke:#FF1493
style C fill:#FFD700,stroke:#DAA520
style D fill:#32CD32,stroke:#228B22
style E fill:#FF69B4,stroke:#FF1493
style F fill:#FF8C00,stroke:#FF4500
style G fill:#FFA500,stroke:#FF8C00
style H fill:#FFA500,stroke:#FF8C00
style I fill:#FF69B4,stroke:#FF1493
style J fill:#FF69B4,stroke:#FF1493
style K fill:#FF69B4,stroke:#FF1493
The Streamlit UI (streamlit_app.py) provides a real-time ASL translation interface. It captures webcam input (cv2.VideoCapture
), extracts landmarks using MediaPipe, fetches top-N gloss predictions via the FastAPI endpoint, and displays translations from the Gloss-to-Text task. The UI renders landmarks (st.image
), gloss predictions (st.table
), translations (st.text
), and user controls (st.button
for start/stop, top_n
adjustment).
- Real-Time Processing: The UI's ability to process webcam input, extract landmarks, and display translations with <1s latency is a technical feat, enabled by efficient TFLite inference and FastAPI serving.
- User-Centric Design: Displaying landmarks, gloss predictions, and translations in real-time provides transparency, enhancing user trust. User controls for starting/stopping translation and adjusting
top_n
improve interactivity. - Integration: The seamless integration of webcam input, FastAPI predictions, and Gloss-to-Text translation demonstrates end-to-end system design, with Streamlit ensuring accessibility for deaf users.