SFT data preprocessing #28

Open
gathierry opened this issue Jan 17, 2025 · 2 comments

@gathierry

Thanks for sharing the data and code.
Could you also share some details about how the raw videos are converted into the patch files?
You list the raw data sources on this page: https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data
but I suppose there are also some filtering steps to select only the 600k samples from them?

@liuzuyan
Collaborator

Hi, the raw videos are clipped into frames and saved in bytes format in patch files. We use patch files to increase I/O efficiency during training. We do not apply any downsampling or merging operations to our raw video sources; we simply combine all the videos from our source datasets. If you have any more detailed questions, please feel free to ask.
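For readers unfamiliar with the idea, here is a rough sketch of how frames stored as bytes in a single patch file can be read back with one seek. This is only an illustration under assumptions: the actual Oryx patch layout is not described in this thread, and the function names `write_patch` and `read_frame` as well as the (offset, size) index are hypothetical.

```python
import os
import tempfile
from typing import List, Tuple


def write_patch(path: str, frames: List[bytes]) -> List[Tuple[int, int]]:
    """Write encoded frames back to back into one file.

    Returns a hypothetical index of (offset, size) pairs, one per frame.
    This layout is an assumption, not the actual Oryx patch format.
    """
    index = []
    with open(path, "wb") as f:
        for frame in frames:
            index.append((f.tell(), len(frame)))
            f.write(frame)
    return index


def read_frame(path: str, offset: int, size: int) -> bytes:
    """Read a single frame with one seek instead of opening one file per frame."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


if __name__ == "__main__":
    # Stand-in byte strings; real frames would be encoded images (e.g. JPEG bytes).
    frames = [b"frame-0", b"frame-one", b"f2"]
    with tempfile.TemporaryDirectory() as tmp:
        patch_path = os.path.join(tmp, "patch_000000")
        index = write_patch(patch_path, frames)
        offset, size = index[1]
        print(read_frame(patch_path, offset, size))
```

Packing many small frames into a few large files avoids per-file open overhead, which is the usual reason such a layout helps I/O during training.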

@gathierry
Author

gathierry commented Jan 18, 2025

I see, thanks for the prompt reply.
So is it still possible to trace back from the current id oryx_0000... to the video / question ids in their original dataset?

When people try to add more data for training, this information can be very useful for deduplication.

We may also want to try different frame rates. The current frame rate is 1 FPS, right?
