This repository is a focused fork of our YOLO retraining template, specialized for detecting waste on streets. It uses DVC (Data Version Control) to track datasets, experiments, and trained models for reproducible results and easy collaboration.
- Large files: Keeps datasets and model weights out of Git history while versioning them alongside code.
- Reproducibility: The exact model used for a release is pinned by `dvc.lock` and can be restored with `dvc pull`.
## About this fork
- Task: waste detection
- Current classes (from `params.yaml`): `waste`, `cigarette` (kept deliberately small for now; we don't want to introduce too many classes while training data is limited)
- Pipeline: Ultralytics YOLOv8 with DVC-managed data and outputs
## Contents

- Models & Releases
- Initial Project Setup
- Managing Project Data
- Experiment Workflow
- Custom Classes
- Raw Data Structure
- Fine-tuning
- Folder Subsets
- Testing
## Models & Releases

This repo publishes trained models with each GitHub Release and also tracks the currently applied model via DVC.
- Latest model (main): The model currently applied on the main branch is the one referenced in `dvc.lock` under the `runs/` output. To fetch it locally, run `dvc pull` (requires access to the configured DVC remote).
- Release assets: Each release includes the trained weights and metadata so you don't need DVC to use the model:
  - `weights/best.pt`: the promoted YOLO weights for inference
  - `test_metrics.json`: evaluation metrics of the promoted run
  - `metadata.yaml`: training metadata (experiment name, epochs, image size, etc.)
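Once downloaded, the release weights work directly with the Ultralytics API. A minimal sketch (the image path and thresholds are placeholders):

```python
from ultralytics import YOLO

# Load the promoted weights from the release assets
model = YOLO("weights/best.pt")

# Run inference on a street photo (path and thresholds are placeholders)
results = model.predict("street_photo.jpg", imgsz=1280, conf=0.25)
results[0].show()  # visualize detected waste / cigarette boxes
```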
## Initial Project Setup

Follow these steps once after creating this project from the template.
1. Create the New Repository
- On the GitHub page for this template, click the "Use this template" button.
- Assign a name to your new repository (e.g., `waste-detection-2025`) and confirm its creation.
2. Clone Your New Repository
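For example (the URL below is a placeholder for your new repository):

```bash
git clone https://github.com/<your-org>/<your-repo>.git
cd <your-repo>
```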
3. Install Dependencies

This project uses Poetry for dependency management.

```bash
poetry install
poetry shell
```
4. Run the Interactive Setup Script
A helper script configures the project’s connection to remote storage and (optionally) custom classes.
```bash
python setup_project.py
```
The script will:
- prompt for a project & dataset name
- optionally ask for a comma-separated list of class names
- patch `.dvc/config` (remote URL) and `params.yaml` accordingly
- create the `raw_data/train` and `raw_data/test` folders
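After the script finishes, `.dvc/config` should contain a remote section along these lines (remote name and URL are illustrative, not the actual values):

```ini
[core]
    remote = storage
['remote "storage"']
    url = ssh://user@your-hetzner-box.example.com/dvc-storage
```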
5. Commit the Initial Configuration

The bootstrap script modifies configuration files. Commit these changes to save the project setup.
```bash
git add .dvc/config params.yaml
git commit -m "Initialize project configuration"
```
The project is now fully configured and ready for data.
## Managing Project Data

After the initial setup, follow this process whenever you add or update the raw dataset.
1. Add Your Data
- Place your new or updated data files into the `raw_data/train/` or `raw_data/test/` directories, following the structure guide below.
2. Track Data with DVC
- Use the `dvc add` command to tell DVC to track the state of your data directory.
```bash
dvc add raw_data
```

This command creates/updates a small `raw_data.dvc` file. This file acts as a pointer to the actual data, which DVC manages.
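The pointer file is plain YAML; the hash and sizes below are placeholders:

```yaml
# raw_data.dvc (illustrative contents)
outs:
- md5: 3f2a9c0e1b8d7a6f5e4d3c2b1a0f9e8d.dir
  size: 104857600
  nfiles: 1200
  path: raw_data
```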
3. Commit the Pointer File with Git
- Add the `.dvc` file to Git. This records the "version" of your data that corresponds to your code.
```bash
git add raw_data.dvc
git commit -m "Add new batch of training images"
```
4. Push Both Code and Data
Pushing is a two-step process: `dvc push` uploads your large data files to the shared Hetzner storage, and `git push` uploads your code and the small DVC pointer file.
```bash
# Step 1: Upload the actual data files to remote storage
dvc push

# Step 2: Upload the code and data pointers
git push
```
## Experiment Workflow

This project uses a structured workflow for training and promoting models. The key principle is to perform extensive experimentation locally and only commit significant, "winning" models to the main project history.
All training parameters live in `params.yaml`.
Key fields:
```yaml
data:
  dataset_name: waste-detection
  experiment_name: waste-detection
  # Optional: override COCO classes
  custom_classes:
    - "waste"
    - "cigarette"
  use_coco_classes: false  # fall back to COCO if true/empty
train:
  model_size: m
  image_size: 1280
  epochs: 50
  batch_size: 8
```
- If `custom_classes` is non-empty, those names become class 0…n-1.
- If it's empty and `use_coco_classes: true`, the predefined COCO subset is used.
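For the classes above, the prepared dataset's Ultralytics config would carry a `names` mapping roughly like this (illustrative excerpt):

```yaml
# Excerpt of the generated data.yaml (illustrative)
nc: 2
names:
  0: waste
  1: cigarette
```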
Execute experiments using the `dvc exp run` command. This process is entirely local and does not create any Git commits. Use the `-n` flag to assign a descriptive name to each run.
```bash
# Run an experiment with the settings from params.yaml
dvc exp run -n "large-model-150-epochs"
```
For quick iterations, you can override parameters from the command line with the `-S` flag:
```bash
# Test a different parameter without editing params.yaml
dvc exp run -n "test-smaller-batch-size" -S train.batch_size=4
```
Use `dvc exp show` to display a leaderboard of all local experiments. This table includes the parameters and performance metrics for each run, allowing for easy comparison.
```bash
# Sort the table by a key metric to find the best performer
dvc exp show --sort-by metrics/fitness --sort-order desc
```
Once you identify a superior experiment, promote it to become the official version in the main project branch.
1. Apply the winner's results to your workspace. This command updates your `params.yaml` and output files to match the state of the selected experiment.

   ```bash
   dvc exp apply <name-of-your-winning-experiment>
   ```
2. Commit this "Golden" version to Git. This is the only time a commit is made after a series of experiments.

   ```bash
   git add .
   git commit -m "Promote new model with 150 epochs, achieves 0.92 fitness"
   ```
3. Push the final result to the shared remotes.

   ```bash
   dvc push  # Uploads the winning model's data files
   git push  # Pushes the commit with the updated project state
   ```
## Custom Classes

This fork is configured for waste detection. By default, the classes are defined in `params.yaml`:
```yaml
data:
  custom_classes:
    - waste
    - cigarette
  use_coco_classes: false
```
To change classes, edit `params.yaml` or re-run `python setup_project.py` and follow the prompts. The pipeline will remap labels of imported datasets where a `data.yaml` is present.
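Conceptually, remapping rewrites the leading class ID of each YOLO label line. A minimal sketch, not the pipeline's actual code (function name and ID mapping are illustrative):

```python
def remap_label_file(path: str, id_map: dict[int, int]) -> None:
    """Translate source class IDs to this project's IDs in a YOLO label file."""
    with open(path) as f:
        lines = f.readlines()
    remapped = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        cls = int(parts[0])
        if cls in id_map:  # boxes with unknown classes are dropped
            remapped.append(" ".join([str(id_map[cls])] + parts[1:]) + "\n")
    with open(path, "w") as f:
        f.writelines(remapped)

# e.g. an imported dataset uses waste=3, cigarette=7; ours are 0 and 1
remap_label_file("labels/img_0001.txt", {3: 0, 7: 1})
```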
## Testing

This project includes unit tests and an end-to-end (E2E) pipeline smoke test.
- Using `unittest`
  - Run all tests: `poetry run python -m unittest -v`
- Using pytest (optional)
  - Install dev dependencies once: `poetry install --with dev`
  - Run tests: `poetry run pytest -q`
Notes
- The E2E test creates a tiny synthetic dataset under a temporary directory and runs both the prepare and train/eval stages. It also checks that scene metrics are exported in `results_comparison/results.csv`.
- Tests run on CPU with a minimal YOLOv8n configuration and a single epoch to keep runtime low.
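During development you can also target a single test; the module and file names below are placeholders for whatever exists under `tests/`:

```bash
# unittest: run one module
poetry run python -m unittest tests.test_pipeline -v

# pytest: run one file
poetry run pytest tests/test_pipeline.py -q
```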
## Raw Data Structure

Put your raw data in `raw_data/train/` (for train+val) and `raw_data/test/` (for the final hold-out set).
The importer now accepts any of the following:
| # | Layout | What to do | Notes |
|---|--------|------------|-------|
| 1 | CVAT YOLO export | Drop the whole export folder (`data.yaml`, `images/`, `labels/`, `train.txt`). | `train.txt` is automatically parsed. |
| 2 | Standard YOLO | Inside a subfolder create `images/` & `labels/`. | Class IDs will be remapped if needed. |
| 3 | Scene-based test sets | One subfolder per scene, each with its own `images/` & `labels/`. | Scene name is appended to filenames so metrics stay separate. |
| 4 | Any folder that already contains a `data.yaml` / `dataset.yaml` | Just copy it in. | Class IDs will be remapped if needed (names have to be the same as in `params.yaml`). |
Tip: When you use option 4 you can bring in public datasets or prior labeling runs “as is”.
The importer detects the YAML, builds a temporary copy, remaps the labels, and keeps going—no manual edits required.
Example scene‑based layout:
```text
raw_data/
└── test/
    ├── Wolfsburg/
    │   ├── images/
    │   └── labels/
    └── Carmel/
        ├── images/
        └── labels/
```
## Fine-tuning

You can fine-tune from a pre-trained checkpoint (e.g., a prior waste model) instead of starting from the base YOLOv8 weights.
Configure in `params.yaml` under `train`:
```yaml
train:
  finetune_mode: true                     # enable fine-tuning
  pretrained_model_path: taco-uav-model.pt
  finetune_lr: 0.0001                     # lower LR for fine-tuning
  finetune_epochs: 60                     # optional override for epochs
  freeze_backbone: false                  # optionally freeze early layers
```
Notes:
- When fine-tuning, the pipeline evaluates the chosen pretrained model as the baseline and compares it to the fine-tuned result.
- The DVC pipeline depends on `pretrained_model_path` so runs are reproducible. Make sure the file is available (via DVC or local path).
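As with any other parameter, fine-tuning can be toggled per experiment from the command line; the experiment name below is illustrative:

```bash
dvc exp run -n "finetune-from-taco-uav" \
  -S train.finetune_mode=true \
  -S train.finetune_lr=0.0001
```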
## Folder Subsets

To limit or oversample specific source folders during dataset preparation, use `prepare.folder_subsets` in `params.yaml` or the CLI override.
Example `params.yaml` configuration:
```yaml
prepare:
  val_split: 0.1
  test_split: 0.1
  augment_multiplier: 1
  folder_subsets:
    uavvaste: 0.5          # use 50% of images from this folder
    taco: 0.2              # use 20%
    cw32-08-07-train: 2    # 200% = oversample
```
CLI override (multiple allowed):
```bash
python yolov8_training/train_pipeline.py \
  --stage prepare -d waste-detection \
  --folder-subset uavvaste 0.5 \
  --folder-subset cw32-08-07-train 2.0
```
Behavior:
- Ratios between 0 and 1.0 subsample a folder.
- Ratios > 1.0 oversample by repeating images; cross-folder duplicates are removed so balancing doesn't leak duplicates between sources (see the sketch below).
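A minimal sketch of how a subset ratio could translate into an image list (illustrative, not the pipeline's actual implementation):

```python
import random

def apply_subset_ratio(images: list[str], ratio: float, seed: int = 42) -> list[str]:
    """Subsample (ratio < 1.0) or oversample by repetition (ratio > 1.0)."""
    rng = random.Random(seed)
    whole, frac = divmod(ratio, 1.0)
    out = images * int(whole)                             # full repetitions
    out += rng.sample(images, round(frac * len(images)))  # fractional remainder
    return out

# ratio 0.5 keeps about half; ratio 2.0 repeats every image twice
subset = apply_subset_ratio(["a.jpg", "b.jpg", "c.jpg", "d.jpg"], 0.5)
```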