Home
Gnurro edited this page May 17, 2021
FinetuneReFormatter offers multiple modes to prepare training data for GPT (or other LM) finetuning/training:
- SourceInspector mode, which comes with a text editor and tracking/finding of multiple common issues of raw scraped text data
- InitialPrep mode, which can be used to calculate various text statistics, like word count and token distribution, to convert text to ChunkFile (a rolling context data format saved as JSON), and to apply a few quick data tweaks
- ChunkStack mode, which can be used to view and edit ChunkFiles and helps with building rolling context text data
- ChunkCombiner mode, which can be used to combine ChunkFiles into proper training data text and allows additional batch formatting determined by chunk types
- TokenExplorer mode, which can be used to check the token dictionary for peculiarities
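To give a rough idea of what InitialPrep-style processing involves, here is a minimal Python sketch that counts words and splits text into JSON-serializable chunks. It is not FinetuneReFormatter's code: the function names `text_stats` and `to_chunks`, the `{"type", "text"}` chunk schema, and the use of whitespace splitting in place of a real LM tokenizer are all assumptions made for illustration, and the actual ChunkFile layout may differ.

```python
import json
from collections import Counter


def text_stats(text):
    """Return the word count and the three most common whitespace-separated words."""
    words = text.split()
    return {"word_count": len(words), "top_words": Counter(words).most_common(3)}


def to_chunks(text, words_per_chunk=4):
    """Split text into consecutive chunks of at most words_per_chunk words.

    The {"type", "text"} schema is hypothetical, not the real ChunkFile layout.
    """
    words = text.split()
    return [
        {"type": "generic", "text": " ".join(words[i:i + words_per_chunk])}
        for i in range(0, len(words), words_per_chunk)
    ]


sample = "the cat sat on the mat and the dog sat too"
stats = text_stats(sample)
chunks = to_chunks(sample)

print(stats["word_count"])
print(json.dumps(chunks, indent=2))
```

Tagging each chunk with a type is one way a later step could apply formatting per chunk type, in the spirit of what ChunkCombiner mode is described as doing.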
Current Wiki Version: Beta2