- I'm thinking about min-max normalization before tokenization, but I'm not sure how it will affect the data distribution.
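Regarding the distribution question: since min-max normalization is an affine transform per series, it preserves the relative shape of the data while stretching small differences to span the full range, which is exactly what helps the binning. A minimal sketch (the function names and `feature_range` parameter here are hypothetical, not from any specific library):

```python
import numpy as np

def minmax_normalize(x: np.ndarray, feature_range=(0.0, 1.0)):
    """Affine map of x onto feature_range; returns stats for inversion."""
    lo, hi = feature_range
    x_min, x_max = x.min(), x.max()
    scaled = (x - x_min) / (x_max - x_min)        # maps to [0, 1]
    return scaled * (hi - lo) + lo, (x_min, x_max)  # keep stats to invert later

def minmax_denormalize(y: np.ndarray, stats, feature_range=(0.0, 1.0)):
    """Invert minmax_normalize using the stored (min, max) stats."""
    lo, hi = feature_range
    x_min, x_max = stats
    return (y - lo) / (hi - lo) * (x_max - x_min) + x_min

# Large offset, small variation: differences now span the whole [0, 1] range.
x = np.array([4001.0, 4003.0, 4008.0, 4010.0])
y, stats = minmax_normalize(x)
print(y)  # [0.0, 0.222..., 0.777..., 1.0]
```

One caveat: the per-series min and max must be stored to map model outputs back to the original scale, and forecasts that exceed the historical range will be clipped to it.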
- My data has relatively large values (4000+), but the differences between values are actually quite small (roughly 1 to 10).
As shown in this image, several points even get the same ids after mean-scale tokenization, which means the bin resolution is not fine enough to separate them properly.
I wonder whether a similar situation occurs in some of the pretraining data, and how to solve this if it is a problem.
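The collapse is easy to reproduce with a toy version of mean-scale tokenization. This is a sketch under assumed parameters (4096 uniform bins over [-15, 15], scaling by the mean absolute value); the actual tokenizer's settings may differ, but the effect is the same: after dividing by a mean near 4000, a series varying by 1-10 occupies a span far narrower than one bin.

```python
import numpy as np

# Assumed tokenizer parameters (illustrative, not the library's actual config).
N_BINS = 4096
LOW, HIGH = -15.0, 15.0

def mean_scale_tokenize(x: np.ndarray) -> np.ndarray:
    """Scale by the mean absolute value, then quantize into uniform bins."""
    scale = np.mean(np.abs(x))
    scaled = x / scale                      # values land near 1.0
    edges = np.linspace(LOW, HIGH, N_BINS + 1)
    return np.clip(np.digitize(scaled, edges) - 1, 0, N_BINS - 1)

# Offset ~4000, step 1: the scaled series spans ~0.0022,
# while each bin is (HIGH - LOW) / N_BINS ~ 0.0073 wide.
series = 4000.0 + np.arange(0.0, 10.0, 1.0)
ids = mean_scale_tokenize(series)
print(len(set(ids.tolist())))  # far fewer unique ids than points
```

Here all ten distinct values map to at most two token ids, matching the behavior in the screenshot. Subtracting a per-series offset (or min-max normalizing, as in the other comment) before scaling spreads the variation across many more bins.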