Training with more tokens #5
Hello, authors! Although QueCC is designed for very low token counts, I am curious whether you have tested it at higher token counts (the highest reported in the paper is 36). When I trained with 144 tokens, using the same training hyperparameters as for 36, I observed spikes in the loss. Have you tried training with 144 tokens? Thank you!
We didn't test anything higher than what was reported in the paper, since the design was motivated by our scaling laws, which show that for visual reasoning and understanding tasks, fewer visual tokens paired with more LLM parameters is more compute-optimal. We build on the TokenPacker (https://github.com/CircleRadon/TokenPacker) compression algorithm, which does evaluate at 144 tokens.
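As a quick point of reference for how these token counts arise, here is a minimal sketch of the arithmetic, assuming a CLIP-ViT-L/14 vision encoder at 336 px (a 24 × 24 = 576 patch grid) and a TokenPacker-style downsampling scale; the function and parameter names are illustrative, not this repo's actual configuration.

```python
# Minimal sketch: how the compressed visual token count follows from a
# TokenPacker-style downsampling scale, assuming a 24 x 24 = 576 patch
# grid (CLIP-ViT-L/14 at 336 px). Names are illustrative, not this
# repo's actual config keys.

def visual_token_count(grid_size: int = 24, scale: int = 4) -> int:
    """Number of compressed visual tokens for a given downsampling scale."""
    assert grid_size % scale == 0, "scale must evenly divide the patch grid"
    return (grid_size // scale) ** 2

if __name__ == "__main__":
    for s in (2, 3, 4):
        print(f"scale={s}: {visual_token_count(scale=s)} tokens")
    # scale=2: 144 tokens (the setting TokenPacker evaluates)
    # scale=3: 64 tokens
    # scale=4: 36 tokens (the highest setting reported in the paper)
```

Under this assumption, scale 2 recovers the 144-token setting that TokenPacker reports, while scale 4 gives the 36-token setting from the paper.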
I think the increase in loss after 2k steps is quite unexpected; I wouldn't expect it at 144 tokens. Can you check your setup at, say, 36 tokens?