Home

Gnurro edited this page May 17, 2021 · 2 revisions

Welcome to the FinetuneReFormatter wiki!

Features

FinetuneReFormatter offers multiple modes to prepare training data for GPT (or other LM) finetuning/training:

SourceInspector mode, which provides a text editor and finds and tracks several common issues in raw scraped text data
InitialPrep mode, which calculates text statistics such as word count and token distribution, converts text to ChunkFile (a rolling-context data format saved as JSON), and offers a few quick data tweaks
ChunkStack mode, which can be used to view and edit ChunkFiles and helps with building rolling-context text data
ChunkCombiner mode, which can be used to combine ChunkFiles into proper training data text and allows additional batch formatting determined by chunk types
TokenExplorer mode, which can be used to check the token dictionary for peculiarities
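
To give a rough idea of the kind of text statistics InitialPrep mode reports, here is a minimal Python sketch. It is not FinetuneReFormatter's own code: the function name `text_statistics` and its return keys are hypothetical, and it counts lowercase word-like runs as a stand-in for the tokenizer-specific tokens a GPT model actually uses.

```python
import re
from collections import Counter


def text_statistics(text, top_n=3):
    """Return simple statistics for a piece of raw text.

    Hypothetical helper for illustration only; words are approximated
    with a regex and are NOT real GPT (BPE) tokens.
    """
    # Lowercase the text and collect runs of letters, digits and apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    counts = Counter(words)
    return {
        "word_count": len(words),        # total number of words
        "unique_words": len(counts),     # size of the observed vocabulary
        "most_common": counts.most_common(top_n),  # frequency distribution head
    }


sample = "The cat sat. The cat ran. A dog sat."
stats = text_statistics(sample)
print(stats)
```

For real finetuning data, the distribution would be computed over the model's own token dictionary rather than over whitespace-style words.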

Current Wiki Version: Beta2