Gnurro edited this page May 17, 2021 · 2 revisions

Welcome to the FinetuneReFormatter wiki!

Features

FinetuneReFormatter offers multiple modes to prepare training data for GPT (or other LM) finetuning/training:

  • SourceInspector mode, which comes with a text editor and finds and tracks several common issues in raw scraped text data
  • InitialPrep mode, which calculates text statistics such as word count and token distribution, converts text to ChunkFile (a rolling context data format saved as JSON), and offers a few quick data tweaks
  • ChunkStack mode, which can be used to view and edit ChunkFiles and helps with building rolling context text data
  • ChunkCombiner mode, which can be used to combine ChunkFiles into proper training data text and allows additional batch formatting determined by chunk types
  • TokenExplorer mode, which can be used to check the token dictionary for peculiarities
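
To make the InitialPrep and ChunkCombiner ideas more concrete, here is a minimal Python sketch of the general workflow: count words, split text into chunks, save them as JSON, and join them back into training text. This page does not describe the actual ChunkFile schema, so the `text` and `type` keys, the function names, and the `max_words` parameter are all assumptions made for illustration, not FinetuneReFormatter's real format or API.

```python
import json


def word_count(text):
    """Count whitespace-separated words in the text."""
    return len(text.split())


def to_chunks(text, max_words=50):
    """Pack blank-line-separated paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own chunk.
    The dict layout ("text", "type") is hypothetical, not the real ChunkFile schema.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = word_count(para)
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            chunks.append({"text": "\n\n".join(current), "type": "generic"})
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append({"text": "\n\n".join(current), "type": "generic"})
    return chunks


def combine(chunks):
    """Join chunk texts back into a single training text."""
    return "\n\n".join(chunk["text"] for chunk in chunks)


def save_chunks(chunks, path):
    """Write the chunk list to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)


def load_chunks(path):
    """Read a chunk list back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Under these assumptions, `combine(to_chunks(text))` returns the original text whenever its paragraphs are separated by exactly one blank line, which is a handy sanity check before training on the combined output.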