Vla training #10

Open · wants to merge 9 commits into master from vla-training

Conversation

@timothygao8710 commented Jul 12, 2024

In this pull request, we fine-tune the open-source Vision-Language-Action model OpenVLA to give Stompy the ability to find the optimal next move based on:

  1. a language instruction ("what action should the robot make to push the cube to the target?"), and
  2. a monocular, third-person picture of the scene (a single 512x512 RGB image capturing the robot, target, cube, and environment).

  • A move is defined in a discrete action space as a 7-DoF vector giving the robot claw's delta x, y, z, roll, yaw, pitch, and pinch.
  • A move is considered "correct" if the predicted output token matches the expected output token for all 7 DoFs; the loss function is defined similarly (see the sketch below).
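As a minimal sketch of that correctness criterion (assuming one discrete token per DoF; the function and tensor names are hypothetical, not the actual OpenVLA API):

```python
import torch

def action_correct(pred_tokens: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    """An action counts as correct only if all 7 per-DoF tokens match.

    pred_tokens, target_tokens: (batch, 7) integer tensors of discretized
    actions (dx, dy, dz, droll, dyaw, dpitch, pinch), each token in [0, 255].
    Returns a (batch,) boolean tensor.
    """
    return (pred_tokens == target_tokens).all(dim=-1)

# Batch action accuracy would then be:
# acc = action_correct(pred, target).float().mean()
```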

OpenVLA is not zero-shot: it needs to be fine-tuned for each new (environment, task) pair.

The data processing scripts take in a directory of JSON files, each containing the optimal steps for a single episode, and collate all (current_image, optimal_next_step) tuples into a .h5 dataset. Right now this data is generated via PPO, which has access to all variables in the sim.
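As a rough illustration of that collation step (the JSON schema, file layout, and HDF5 keys below are assumptions, not the repo's exact format):

```python
# Hypothetical collation script: episode["steps"], "image", and
# "optimal_action" are illustrative field names.
import glob
import json

import h5py
import numpy as np

def collate_episodes(json_dir: str, out_path: str) -> None:
    images, actions = [], []
    for path in sorted(glob.glob(f"{json_dir}/*.json")):
        with open(path) as f:
            episode = json.load(f)
        for step in episode["steps"]:
            images.append(np.array(step["image"], dtype=np.uint8))            # 512x512x3 RGB
            actions.append(np.array(step["optimal_action"], dtype=np.int64))  # 7-DoF tokens
    with h5py.File(out_path, "w") as h5:
        h5.create_dataset("images", data=np.stack(images), compression="gzip")
        h5.create_dataset("actions", data=np.stack(actions))
```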

datasets.py loads the .h5 file into a custom PyTorch dataset, which is wrapped in a DataLoader and served with a custom batch size, image transformations, tokenizer, etc. in finetune.py.
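A minimal sketch of that dataset idea, assuming the .h5 layout from the collation example above (class name, keys, and transforms are illustrative, not the actual datasets.py contents):

```python
import h5py
import torch
from torch.utils.data import DataLoader, Dataset

class PushCubeDataset(Dataset):
    def __init__(self, h5_path: str, transform=None):
        self.h5_path = h5_path
        self.transform = transform
        with h5py.File(h5_path, "r") as h5:
            self.length = len(h5["actions"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open per item so the HDF5 handle isn't shared across DataLoader workers.
        with h5py.File(self.h5_path, "r") as h5:
            image = torch.from_numpy(h5["images"][idx])
            action = torch.from_numpy(h5["actions"][idx])
        if self.transform is not None:
            image = self.transform(image)
        return image, action

# In finetune.py this would be wrapped in a DataLoader, e.g.:
# loader = DataLoader(PushCubeDataset("pushcube.h5"), batch_size=8, shuffle=True)
```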

finetune.py further has configuration options for: learning_rate, use_lora, LoRA configs, num_epochs, pretrained_model_path, ...
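For a sense of the shape of that configuration (field names and defaults here are assumptions; the actual options in finetune.py may differ):

```python
# Illustrative config only, not the real finetune.py defaults.
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    pretrained_model_path: str = "openvla/openvla-7b"
    learning_rate: float = 2e-5
    num_epochs: int = 10
    batch_size: int = 8
    use_lora: bool = True
    lora_rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.05
```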

Training Observations and Logs:

Our first approach didn't work because we trained on only a single camera angle; we realized we needed to cover a larger distribution for the model to actually generalize, so we randomized the camera angle and the cube and target locations.
After a lot of tweaking and tricks we got to 30% action accuracy on validation, which is decent given that the model has to pick the right action out of (# discrete tokens) ^ (# degrees of freedom) = 256^7 ≈ 7.2×10^16 options for a prediction to count as correct.
But when we tested it, it still doesn't work: it works at the start, but the moment the arm "tweaks" / makes a non-optimal move, it goes out of distribution, because the training data only ever contains optimal trajectories.
So now we're training on more data to get the arm to learn error correction.

[Screenshot 2024-07-12 at 4:02:17 PM] Overfit run: gets around 30% action accuracy on the new, randomized push-cube task.

@codekansas closed this Jul 12, 2024
@codekansas deleted the vla-training branch July 12, 2024 22:37
@codekansas restored the vla-training branch July 12, 2024 22:37
@codekansas deleted the vla-training branch July 12, 2024 22:38
@codekansas restored the vla-training branch July 12, 2024 22:38
@codekansas reopened this Jul 12, 2024
@codekansas (Member)

oops sorry

@timothygao8710 (Author)

No worries! I'm starting to eliminate my competitive programming habits and to be more mindful when I contribute to repos.

@timothygao8710 (Author)

Also, I think this VLA stuff is independent of which sim we choose to use, and of whether we want to adapt it to real life in the future; it only needs access to the image and the next best move. Maybe not putting it in stompy_live is better? I also got rid of the env_norm stuff; that could go in a separate repo too.
