SFT data preprocessing #28

gathierry · 2025-01-17T15:36:08Z

Thanks for sharing the data and code.
Could you also share some details about how the raw videos are converted to the patch files?
You listed the raw data sources on this page: https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data
But I suppose there are also some filtering steps to select only 600k samples from them?

liuzuyan · 2025-01-17T18:08:14Z

Hi, the raw videos are clipped into frames, and the frames are saved as bytes in patch files. We use patch files to improve I/O efficiency during training. We do not apply any downsampling or merging operations to our raw video sources; we simply add all the videos from our source datasets together. If you have any detailed questions, please feel free to ask further.
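To illustrate the general idea of packing frames as bytes into one file, here is a minimal sketch. It is not the actual Oryx preprocessing code: the exact patch layout is not described in this thread, so the function names `write_patch` and `read_patch` and the "start offset plus per-frame sizes" index are assumptions made for the example.

```python
import os
from typing import List, Tuple


def write_patch(path: str, frames: List[bytes]) -> Tuple[int, List[int]]:
    """Append already-encoded frames (e.g. JPEG bytes) to one patch file.

    Returns the byte offset where this video starts and the size of each
    frame, which is all a reader needs to slice the frames back out.
    (Hypothetical index format, assumed for illustration.)
    """
    sizes = []
    with open(path, "ab") as f:
        # Move to the end explicitly so the returned offset is reliable.
        start = f.seek(0, os.SEEK_END)
        for frame in frames:
            f.write(frame)
            sizes.append(len(frame))
    return start, sizes


def read_patch(path: str, start: int, sizes: List[int]) -> List[bytes]:
    """Read one video's frames back with a single seek and sequential reads."""
    frames = []
    with open(path, "rb") as f:
        f.seek(start)
        for size in sizes:
            frames.append(f.read(size))
    return frames
```

Storing many videos in one large file means a sample costs one seek plus a sequential read instead of one file open per frame, which is presumably where the I/O gain comes from.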

gathierry · 2025-01-18T02:54:01Z

I see, thanks for the prompt reply.
So is it still possible to trace back from the current id oryx_0000... to the video / question ids in their original dataset?

When people try to add more data for training, this information can be very useful for deduplication.

We may also want to try different frame rates. The current frame rate is 1 FPS, right?
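For context, this is the kind of frame selection I have in mind when re-extracting at a different rate. It is only a sketch: the helper name `sample_frame_indices` is hypothetical, and the 1 FPS default reflects my guess about the current data, not something confirmed in this thread.

```python
from typing import List


def sample_frame_indices(
    num_frames: int, native_fps: float, target_fps: float = 1.0
) -> List[int]:
    """Pick frame indices that approximate `target_fps` from a source video.

    `target_fps=1.0` is an assumed default, not a confirmed Oryx setting.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one frame, so indices cannot repeat when
    # the target rate exceeds the native rate.
    step = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices
```

With the original video IDs available, the same raw video could be re-sampled at, say, 2 FPS by calling `sample_frame_indices(num_frames, native_fps, 2.0)`.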
