HowTo: Textadventure Reformatting from Source
This is intended to become a full walkthrough on how to turn any text into well-formatted textadventure data for GPT finetuning using FinetuneReFormatter.
Format specification and standards.
Textadventure text must contain player inputs separated by newlines, and marked with a character at the beginning of the line, which traditionally is > .
This is how it should look (roughly):
You are a wizard looking for a familiar.
> You read a book about types of familiars.
The book contains descriptions of typical familiars: Cats, ravens, toads and owls, but also more exotic familiars.
> You look up the more exotic familiars in the book.
There are chapters about salamanders, imps and aether wyrms.
> You read about aether wyrms.
Aether wyrms are small, dragonlike lizards with short wings. They are about as intelligent as a housecat, but fiercely loyal to their owners. They have low-level magical powers, making them a favorite for wizards who want a truly magical companion.
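The line-based convention described above can be checked mechanically. The following is a minimal sketch, not part of FinetuneReFormatter: `split_turns` is a hypothetical helper that labels each non-empty line as player input (leading `>`) or game output.

```python
def split_turns(text, marker=">"):
    """Label each non-empty line as ('input', ...) or ('output', ...)."""
    turns = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # empty lines carry no content
        if stripped.startswith(marker):
            # drop the marker and the space after it
            turns.append(("input", stripped[len(marker):].strip()))
        else:
            turns.append(("output", stripped))
    return turns


sample = (
    "You are a wizard looking for a familiar.\n"
    "> You read a book about types of familiars.\n"
    "The book contains descriptions of typical familiars.\n"
)
turns = split_turns(sample)
```

In well-formed data, inputs and outputs should alternate, so two consecutive `input` entries usually point to a missing result line.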
Input and source text sections, collectively called 'chunks', should have a certain number of tokens each, lying between certain thresholds to fit with the implementation that will eventually use the finetuned model.
For NovelAI, this is currently around 30 tokens for source text chunks. This value will be used in preparing ChunkFiles later. Making the handling of these limitations convenient is at the core of FinetuneReFormatter.
The traditional person perspective is second person (abbreviated as 2ndP), but third person (abbreviated as 3rdP) is also viable. Most player inputs should begin with either the second person pronoun 'You', or, for 3rdP texts, with the name of the acting character.
The traditional tense is present tense. Player inputs (actions) and game/AI outputs (results) are 'happening right now'. In terms of style, this is intended to allow for better player agency and to avoid the impression of actions and results 'having already happened', as past tense would suggest, while leaving the results open. Future tense might suggest a degree of predictability that would be unfitting for an adventure.
Past tense can be viable, but is better used for background information, and can help separate it from 'what's currently happening'.
What to look for when selecting texts to reformat, and which texts are easier to reformat.
CYOA texts are good as a base, as they usually already are in 2ndP and present tense. However, choices have to be pruned from the data, and a certain path through the choices should be chosen to be used as the source. Player inputs still have to be inserted, as choices are usually too far apart for textadventure.
These are often already in a fitting style, but usually lack player inputs. Some available sources have possible actions with the descriptive text, and adapting those into training data is as easy as picking one and using it as player input.
Logs from traditional text adventures and other, more modern sources can usually be adapted easily, if not used outright - given they are written well.
Reformatting from prose may take more effort than using sources mentioned above, but is the only way to incorporate classic literature and the like into a textadventure dataset. Since it takes the most extensive amount of editing, this will be the type of source data used for this how-to.
Other source types and their handling might be added in the future. (Might add examples of handling them along with the narrative prose examples to follow below.)
How to use the SourceInspector and InitialPrep modes to clean a source text and turn it into a ChunkFile.
(Large parts of this section will also be viable for the preparation of other source texts to be used for finetuning. Common source formatting issues will be handled here.)
Upon starting FinetuneReFormatter, the introduction screen will be shown:
Click the [Open File] button and select the source text file to be worked on.
Source text files should be UTF-8 plaintext files with the .txt file extension. ANSI text will be converted to UTF-8, but other encodings produce an incompatibility warning.
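To check a file before opening it, the encoding rule above can be approximated as follows. This is a sketch under assumptions, not the tool's actual loading code: 'ANSI' is taken to mean Windows-1252 (`cp1252`), and `decode_source` is a hypothetical name.

```python
def decode_source(raw: bytes):
    """Return (text, encoding_used): try UTF-8 first, then Windows-1252."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        # Assumption: 'ANSI' means the Windows-1252 code page
        return raw.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        raise ValueError("incompatible encoding: expected UTF-8 or ANSI text")


text, used = decode_source("café".encode("cp1252"))
```

Once decoded, the text can be written back with `encoding="utf-8"` to get a file that loads without conversion.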
Once the file is successfully loaded, its name will be shown in the window title.
The file will then be opened and FinetuneReFormatter will display its content in SourceInspector mode.
For this HowTo, the first chapter of Moby Dick will be used. For this purpose, it was simply copied from the Project Gutenberg version and pasted into a separate UTF-8 plaintext file using a text editor (NotePad++, but most text editors will work). This is mainly for convenience - smaller files can be handled quicker, especially when it comes to counting tokens and instant text flaw checking.
This is how the raw text looks in SourceInspector mode:
The text can be edited in the main text field. If instant token count is activated, tokens will be immediately recounted when the text is edited (not advised for large files).
NOTE: Edits will not be instantly saved to the file! A * after the file name in the window title indicates that the text was edited, but has not been saved to the file yet. The file can be saved through the File menu or by pressing Ctrl+S.
While headings like the chapter number and title are usually no issue in prose texts, they can be detrimental in textadventure data. This also applies to page numbers, footnotes, tables and many other formatting conventions - anything but continuous text. Hence, these should be removed.
Going through the text and removing these content flaws first can ease the later use of assisted and automated functions of FinetuneReFormatter.
One of the major flaws many text sources come with is the use of large amounts of line breaks/newline characters. GPT models perceive line breaks as changes in topic/theme - but they are often part of text sources for the sake of (human) readability. A line break occurring in the middle of a sentence will make it hard for the model to learn the connections in that sentence.
(The 'Remove block layout' QuickFix can be used to quickly fix the block layout seen above, but should be used with care.)
SourceInspector has multiple checkers to find the position of badly placed newline characters.
These can be selected using the dropdown at the top (a). To the right of the newline checker selection (b), the currently focused 'bad newline' occurrence and the total number of 'bad newlines' found by the active mode are shown. Navigate through occurrences with the [◄] back (c) and [►] forward (d) buttons - this will place the text cursor exactly before the occurrence.
(Negative occurrence numbers are visual only and will be fixed in upcoming versions. Click [◄] or [►] to reset the counter and show proper numbers.)
LineEnd checks the end of every line in the text for the presence of characters that usually occur at the end of lines and sentences. These can be changed in the settings.json file. If the line does not end in one of these 'line enders', the newline character at its end will be considered 'bad'.
InLine checks for 'line enders' anywhere in each line, and considers lines that contain none of them 'bad'.
NoDoubles checks for empty lines, considering two adjacent newline characters 'bad'. While these can be used as paragraph breaks in prose texts, they should be removed for textadventure data. (All double newlines can be automatically removed with a single button press in InitialPrep mode; see below.)
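The three checkers described above can be sketched roughly as follows. This is an illustration, not the SourceInspector implementation: the `LINE_ENDERS` tuple is an assumed default (the real list is read from settings.json), and the function names are hypothetical.

```python
import re

# Assumed 'line enders'; the actual values are configured in settings.json
LINE_ENDERS = (".", "!", "?", '"', ":")


def line_end_bad(text, enders=LINE_ENDERS):
    """LineEnd: indices of lines that do not end in a line ender."""
    lines = text.split("\n")
    # The last line has no newline after it, so it is not checked
    return [i for i, line in enumerate(lines[:-1])
            if line and not line.rstrip().endswith(enders)]


def in_line_bad(text, enders=LINE_ENDERS):
    """InLine: indices of non-empty lines containing no line ender at all."""
    lines = text.split("\n")
    return [i for i, line in enumerate(lines[:-1])
            if line and not any(e in line for e in enders)]


def no_doubles_bad(text):
    """NoDoubles: offsets of newlines directly followed by another newline."""
    return [m.start() for m in re.finditer(r"\n(?=\n)", text)]


text = "Call me\nIshmael. Some years\nago.\n\nNext paragraph."
bad_line_ends = line_end_bad(text)
bad_in_line = in_line_bad(text)
bad_doubles = no_doubles_bad(text)
```

In this sample, LineEnd flags the first two lines and InLine only the first, since the second line contains a period somewhere in it.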
Using the newline checkers, fixing 'bad newlines' can be as easy as clicking the [►] button, then pressing [Del] and [Space] on the keyboard. This will remove the 'bad newline' character.
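The manual [Del] and [Space] fix can also be expressed as a small batch operation. This is only a sketch with an assumed set of line enders and a hypothetical function name; unlike the manual approach, it applies the change everywhere without review, so the result should still be checked.

```python
def fix_line_end_newlines(text, enders=(".", "!", "?", '"', ":")):
    """Replace each newline that follows a non-line-ender with one space."""
    lines = text.split("\n")
    out = lines[0]
    for prev, line in zip(lines, lines[1:]):
        if prev and line and not prev.rstrip().endswith(enders):
            # 'bad' newline: join the two lines with a single space
            out = out.rstrip() + " " + line
        else:
            # good newline (or an empty line): keep it
            out += "\n" + line
    return out


fixed = fix_line_end_newlines("Call me\nIshmael. Some years\nago.\n\nNext.")
```

Empty lines are left alone here, so double newlines remain for the NoDoubles checker.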
This is the example text after all LineEnd flaws have been removed (paragraphs have also been consolidated into single lines):
However, there are still double newlines, as can be seen with NoDoubles newline checking mode active:
While these can be removed 'by hand' by going through the occurrences and deleting them, FinetuneReFormatter has 'QuickFixes' to automatically remove simple flaws like double newlines. These can be found in the form of buttons in InitialPrep mode.
To go to InitialPrep mode from SourceInspector mode, select it from the Mode menu at the top:
Switching back to SourceInspector mode works the same way. You can also quickly switch between modes by pressing [Ctrl]+[M] on the keyboard.
This is how the InitialPrep mode looks:
Here, various statistics can be calculated and simple fixes applied to the currently opened text. Chunking for further textadventure formatting is also done here.
Quick one-click text fixes can be found in the form of buttons at the bottom.
To automatically remove all double newlines still present in the current text, simply click the [Remove double newlines] button. (To check the result, go back to SourceInspector mode.)
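What this QuickFix does to the text can be approximated with a single regular expression. This is a sketch of the effect, not the tool's code; whether FinetuneReFormatter also collapses runs of three or more newlines is an assumption.

```python
import re


def remove_double_newlines(text):
    """Collapse every run of two or more newlines into a single newline."""
    return re.sub(r"\n{2,}", "\n", text)


cleaned = remove_double_newlines("Call me Ishmael.\n\nSome years ago.\n\n\nNext.")
```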
One of the core features of FinetuneReFormatter, the chunking utility in InitialPrep mode makes it easy to split source texts into 'chunks': Small text portions containing full sentences with a defined token amount. These are contained in the ChunkFile file format, which is used to conveniently handle chunks of different types and is intended to make creation of textadventure-formatted (and possibly other rolling-context) data easy and convenient.
Once major source text flaws have been fixed using SourceInspector and QuickFixes, the next step in textadventure reformatting is the creation of a ChunkFile. Default settings are set to fit with the default generation parameters of NovelAI.
First, FinetuneReFormatter splits the text into sentences (to check if anything goes wrong there, a .json array file with the text split into sentences can be exported by clicking the [Split into sentences and save] button).
Then it goes through the list of sentences, checking the number of tokens each contains.
It then creates a chunk and adds sentences from the list to it until adding the next sentence would exceed the target token number.
Once the limit for a chunk would be exceeded, it is added to the list of chunks and a new chunk is started.
Once all sentences are put into chunks, they are assigned a chunk type (currently sourceText, but this might become configurable in upcoming versions). These chunk types will later be used for convenient batch operations (see below).
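The chunking steps above amount to a greedy packing of sentences. The sketch below illustrates that idea under stated assumptions: `count_tokens` is a whitespace word count standing in for the real GPT tokenizer, `split_sentences` is a naive regex splitter, and all function names are hypothetical.

```python
import re


def count_tokens(text):
    # Stand-in only: a real setup would use the model's tokenizer
    return len(text.split())


def split_sentences(text):
    # Naive split after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_chunks(text, max_tokens=40, chunk_type="sourceText"):
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would exceed the limit
        if current and current_tokens + n > max_tokens:
            chunks.append({"text": " ".join(current), "type": chunk_type})
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append({"text": " ".join(current), "type": chunk_type})
    return chunks


chunks = make_chunks("One two three. Four five. Six seven eight nine.", max_tokens=5)
```

Because sentences are never split, a single sentence longer than the limit ends up as its own oversized chunk.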
Chunking can be configured with these options (more might be added in upcoming versions):
First, and most important, is the maximum number of tokens per chunk/token threshold (a). See above for its purpose. (High/low thresholds for later editing convenience might be added here in upcoming versions.)
For textadventure reformatting, placeholder chunks for player actions should be inserted. While chunks can be added and removed later (see below), adding placeholder chunks removes the need to add each player input chunk manually. To insert a placeholder chunk after each chunk of source text, check the 'Insert placeholder chunks' checkbox (b). Like source text chunks, or any chunk, these placeholder chunks will have a type and a corresponding tag. This type/tag can be set for all placeholder chunks by typing a tag (max 12 characters) into the 'Placeholder type tag' text field (c) - this is necessary to create proper ChunkFiles when the placeholder insert option is used. 'Placeholder text', set in the corresponding text field (d), will be added as the text content of each placeholder chunk. While it is not necessary, it can help with quickly identifying placeholder chunks later.
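The placeholder option described above interleaves one placeholder chunk after each source text chunk. A minimal sketch, with a hypothetical function name and with the tag and text defaults taken from the ChunkFile excerpt in this HowTo:

```python
def insert_placeholders(chunks, tag="playerInput", text="act"):
    """Return a new list with a placeholder chunk after every chunk."""
    if len(tag) > 12:
        # The tag field is limited to 12 characters
        raise ValueError("placeholder type tag must be at most 12 characters")
    out = []
    for chunk in chunks:
        out.append(chunk)
        out.append({"text": text, "type": tag})
    return out


source = [
    {"text": "Call me Ishmael.", "type": "sourceText"},
    {"text": "It is a way I have of driving off the spleen.", "type": "sourceText"},
]
with_placeholders = insert_placeholders(source)
```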
For ChunkFile export, a file suffix is added to the file name of the currently opened source text. This suffix automatically begins with an underscore (_) followed by the set maximum number of tokens per chunk, and ends with the string in the suffix text field (e). For example, if the source text file is named MDCh1_raw.txt, the ChunkFile will be named MDCh1_raw_40tknChunks.json with the settings shown above.
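The naming rule can be written out as a short helper. This is a sketch with a hypothetical function name; the default suffix string `tknChunks` is inferred from the example file name and is an assumption about the content of the suffix text field.

```python
from pathlib import Path


def chunkfile_name(source_path, max_tokens, suffix="tknChunks"):
    """Build the ChunkFile path next to the source text file."""
    p = Path(source_path)
    # <source stem> + "_" + <max tokens> + <suffix field> + ".json"
    return p.with_name(f"{p.stem}_{max_tokens}{suffix}.json")


name = chunkfile_name("MDCh1_raw.txt", 40).name
```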
Click the [Create chunks and save] button (f) to save the resulting ChunkFile to the same location the source text file is located.
Click the [Save chunking settings] button (g) to save the current settings to settings.json.
ChunkFiles are .json files containing data on the chunking settings used and chunk type handling data in addition to the created chunks. Each chunk has its text content and a type tag.
This is an excerpt of a ChunkFile created from the first chapter of Moby Dick, using the settings shown above:
{"projectData": {"targetTknsPerChunk": 40, "tagTypeData": {}}, "chunks": [{"text": "Call me Ishmael.", "type": "sourceText"}, {"text": "act", "type": "playerInput"}, {"text": "Some years ago\u2014never mind how long precisely\u2014having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.", "type": "sourceText"}, {"text": "act", "type": "playerInput"}, {"text": "It is a way I have of driving off the spleen and regulating the circulation.", "type": "sourceText"}, {"text": "act", "type": "playerInput"}, {"text": "Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people\u2019s hats off\u2014then, I account it high time to get to sea as soon as I can.", "type": "sourceText"}, ...
(Tag type handling data will be added later, see below.)
This is just for illustration, as the next step in textadventure reformatting uses the ChunkStack mode of FinetuneReFormatter.
How to use the ChunkStack mode to create a good context flow, what to take care of when editing chunks and anticipating player inputs.
ChunkFiles are opened the same way as source text files, by either starting FinetuneReFormatter and clicking [Open file], or clicking 'Open' in the File menu at the top.
However, FinetuneReFormatter will detect the file type and will make different modes available for different file types: When a .json file is opened, it will have the ChunkStack and ChunkCombiner modes. These modes are used to edit and handle ChunkFiles. (Opening non-ChunkFile .json files will lead to errors.)
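Since non-ChunkFile .json files lead to errors, it can help to verify a file's structure beforehand. The check below is a sketch based only on the keys visible in the excerpt above (`projectData`, `chunks`, `text`, `type`); `load_chunkfile` is a hypothetical helper, not a FinetuneReFormatter function.

```python
import json


def load_chunkfile(raw_json):
    """Parse a JSON string and verify it has the ChunkFile structure."""
    data = json.loads(raw_json)
    if not isinstance(data, dict) or "projectData" not in data or "chunks" not in data:
        raise ValueError("not a ChunkFile: 'projectData' or 'chunks' missing")
    for i, chunk in enumerate(data["chunks"]):
        if not isinstance(chunk, dict) or not {"text", "type"}.issubset(chunk):
            raise ValueError(f"chunk {i} lacks 'text' or 'type'")
    return data


raw = ('{"projectData": {"targetTknsPerChunk": 40, "tagTypeData": {}}, '
       '"chunks": [{"text": "Call me Ishmael.", "type": "sourceText"}]}')
chunkfile = load_chunkfile(raw)
```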
ChunkFiles will be shown in ChunkStack mode upon opening.
This is how the Moby Dick chapter 1 ChunkFile created above looks in ChunkStack mode:
ChunkStack mode shows each chunk in the ChunkFile and its data in a separate area. This allows chunks to be edited individually, with the number of tokens in each shown immediately while editing. The total number of tokens in all chunks currently in view is shown at the top.
To navigate through the ChunkFile, change the 'View beginning at chunk index:' number using the index number field (a). This can be done by clicking the arrows in the field, typing in a number after clicking on it or by moving the mouse cursor over it and scrolling with the mouse wheel. Chunks will be shown beginning from the chunk ID (short for index) set in the index number field. Chunk IDs are shown on each chunk (c). Use [Ctrl] + arrow keys to navigate through the chunk stack: [Ctrl]+[Down] to move view down the stack, [Ctrl]+[Up] to move view up the stack.
While the number of tokens in an individual chunk will be shown immediately when the chunk is edited, the total number of tokens in view will only update automatically when the view index is changed. To check the number of tokens in view after editing single chunks without changing the view position, click the [Count] button (b) at the top.
A single chunk in view has a text field (a) to edit its text content, the chunk's index/ID (b) in the ChunkFile, the number of tokens (c) in its text content and its type tag (d). Each chunk also has a dropdown menu (e) for further editing utilities.
NOTE: Chunk text content edits are immediately applied to the working data, but to preserve edits in the ChunkFile, it has to be saved. This can be done by selecting 'Save' in the File menu at the top or by pressing [Ctrl]+[S] on the keyboard.
The dropdown allows inserting new chunks above and below the individual chunk, as well as deleting it. New chunks inserted this way will have the 'generic' chunk type tag, and 'PLACEHOLDER' as text content. (Configuration options for this are planned for upcoming versions.)
Clicking 'Edit chunk type' makes the chunk type tag field (d) editable. This is useful to set inserted chunks to a fitting type.
NOTE: Editing type tags should be avoided, if possible, as a typo here will lead to complications later and might be hard to find (in v0.1.2/3, v0.2.0; improved type/tag handling is planned, potentially turning the type tag text field into a dropdown to select from chunk types defined elsewhere).
(More small fixes for chunk contents, like newline and trailing space character removal, might be added to this in upcoming versions.)
At this point in the textadventure reformatting process, most of the text work will be done.
If the text content of source text chunks does not conform to the standard textadventure person perspective (second person singular; 'you'/'me'/'my' for the player character) and standard textadventure tense (present tense; 'The wind blows.'/'You enter the room.'), this is the intended point to change these.
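When converting a first-person source like Moby Dick, it can help to list the chunks that still contain first-person singular pronouns. The following is a rough sketch and not a feature of FinetuneReFormatter: the pronoun list is an assumption, it produces false positives (for example 'mine' as a noun, or pronouns inside dialogue), and the actual rewriting remains manual work.

```python
import re

# Assumed pronoun list: first person singular only
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]e|[Mm]y|[Mm]yself|[Mm]ine)\b")


def flag_first_person(chunks, chunk_type="sourceText"):
    """Map chunk index -> first-person pronouns found in that chunk."""
    flagged = {}
    for i, chunk in enumerate(chunks):
        if chunk["type"] != chunk_type:
            continue  # e.g. skip player input placeholders
        hits = FIRST_PERSON.findall(chunk["text"])
        if hits:
            flagged[i] = hits
    return flagged


chunks = [
    {"text": "Call me Ishmael.", "type": "sourceText"},
    {"text": "act", "type": "playerInput"},
    {"text": "I thought I would sail about a little.", "type": "sourceText"},
    {"text": "You go to sea.", "type": "sourceText"},
]
flagged = flag_first_person(chunks)
```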
This requires some decisions on part of the person doing the reformatting: First of all, a character in the story has to be picked as the main/player character who will be addressed as 'you'.
While it is good to refer to one individual character as 'you' throughout the whole text, this might be hard, if not impossible, for some sources. In these cases it is viable to add a phrase along the lines of 'You are now Queequeg.', preferably as a player input chunk, to change perspective - switching characters. If this is done, it has to be done explicitly and at a fitting spot - this is crucial to not teach the model to just switch main character by itself or become 'confused' about who 'you' is when finetuned with the data.
NOTE: This applies similarly if the target person perspective is third person, but for third person data, pronouns have to be used very accurately. It might be useful to repeat names more often than would be considered 'good writing style' to make it easier to 'connect the dots' for the AI - even if it will lead to AI outputs with similar abundance of direct references/names. Indirect references (he = the man = Bertrand = the hoodlum = the one lifting the barrel) might be useful, but lower-parameter GPT models (below about 20B, as a guess) notoriously struggle with these.
The other main part of making the text conform to the textadventure format standards is changing the tense to present tense:
This means not only changing the grammatical tense, but also setting the events described in the source text to happen in the present, 'right now'. Depending on the source text, this can be rather easy (for example when reformatting CYOA source texts) or a great challenge to the editor's creative writing abilities (for example when reformatting verbose classic literature). When reformatting narrative prose, especially classic literature, great care should be taken to preserve as much of the original style and content as possible. In short - if the AI should write like Melville was your GM, keep as much Melville in the text as possible.
NOTE: This takes some care, as background information in past tense is viable to use. Simply 'bulldozing' tense might be detrimental!
Along with the basic editing above, this is when the major work to make textadventure format data is to be done: Adding the 'player inputs'.
This may sound trivial, but should be stated: The player inputs must make sense, leading to the following source text chunk in a sensible way.
Great care is to be taken to make the actions in player inputs line up with what follows - this is the most important factor for good textadventure data. It is well worth the time to go over player inputs again and see if they can be improved to better 'flow with the story'. Enforcing this dynamic is crucial for teaching the AI to follow 'instructions' and to elaborate on the results without producing illogical or outright disconnected outputs. Lapses in this back-and-forth in the data led to the 'unreliability' of earlier generative textadventure models. Again, as it can't be stated too often: Keep the flow up as first priority, even if this requires adding player inputs that may seem trivial or redundant (as shown in the image above).
Player inputs do not have to match the writing style or even the level of writing of the source text - it is probably beneficial to have them written in a less creative/sophisticated way than the prose of the source, as this teaches the AI not only to follow with a situationally fitting output, but also that player input lines, identified by the leading >, should not lead to less 'well-written' outputs/replies.
Think of it as if a player in a TTRPG game says "I hit it with my axe." - you would want the GM (the role which AI ideally fills here) to elaborate "The blade of your greataxe gleams in the light of the torches as you swing it in a wide arc, severing tendons and muscles as you slash the orc's arm." instead of "You hit. 12 DMG.".
FinetuneReFormatter can apply batch formatting additions depending on chunk type for textadventure reformatting convenience. This means that any leading and following text and formatting that the format standard demands do not have to be added to each chunk text individually. These can be automatically added when the ChunkFile is eventually combined. Thus player inputs can be inserted as just the variable content for each, as shown above, and standard formatting (line breaks before and after the player input, '> You ' at the start, '.' at the end) does not need to be written out at this point. (See below.)
Chunks - source text, but also player inputs - should conform to the token limits. The person doing the reformatting needs to decide if they want to pick 'oversized' chunks apart or preserve the original style of the source text - as this depends entirely on the source's style, it is hard to make recommendations on this. However, if the editor wants to keep the style of classic literature intact, 'oversized chunks' may be preferable over conforming to AI limitations. (See chunk 6 in the above example - fragmenting this run-on sentence would alter, if not remove, peculiarities of Melville's writing style, which the editor wants to preserve as much as possible in this case.)
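To decide which chunks deserve a second look, the oversized ones can be listed first. This is a sketch with hypothetical names; `count_tokens` is a whitespace word count used as a stand-in for the real tokenizer, so the numbers will differ from those shown in ChunkStack mode.

```python
def count_tokens(text):
    # Stand-in only: a real setup would use the model's tokenizer
    return len(text.split())


def oversized_chunks(chunks, max_tokens=40, chunk_type="sourceText"):
    """Return (index, token_count) for chunks of one type above the limit."""
    result = []
    for i, chunk in enumerate(chunks):
        if chunk["type"] != chunk_type:
            continue
        n = count_tokens(chunk["text"])
        if n > max_tokens:
            result.append((i, n))
    return result


chunks = [
    {"text": "a b c", "type": "sourceText"},
    {"text": "act", "type": "playerInput"},
    {"text": "a b c d e f", "type": "sourceText"},
]
too_long = oversized_chunks(chunks, max_tokens=5)
```

Whether a flagged chunk is then split or kept whole remains the editor's stylistic decision.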
Player inputs are less restricted in terms of token 'allowance' than the source text (as source text is what the AI is supposed to learn to produce, mainly) and can be as elaborate as needed to 'keep the flow'. They should not be overly long, though, and the focus should be more on simulating the average player (who tends to be quite lazy in their inputs) than writing elaborate inputs to 'enhance the story' - keep in mind that an application using the finetuned model will never output these player inputs, but will instead actively prune anything that looks like player inputs.
Once the whole ChunkFile has been reformatted, chunk type handling and export is done in ChunkCombiner mode.
How to add finishing touches, combine the chunks and export the resulting data.
To go to ChunkCombiner mode, select it from the Mode menu at the top.
This is how ChunkCombiner mode looks (initially):
FinetuneReFormatter will check the ChunkFile for chunk type tags when it enters ChunkCombiner mode. Total chunk amount, found chunk types and their amounts are shown at the top.
NOTE: Only chunk types of chunks in the ChunkFile are shown - a ChunkFile may contain handling data for more types, but these will not be shown if there are no chunks of these types in the file. (This may change in upcoming versions.)
NOTE: Order of chunk types is arbitrary and can change. This has no other effect than different order in the GUI.
ChunkFiles are created without chunk type handling data in InitialPrep mode. If the ChunkFile does not contain handling data for a chunk type that is found in the chunks, (not saved) is shown next to the chunk type tag (d), as seen on the image above.
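The summary shown at the top of ChunkCombiner mode can be reproduced from the ChunkFile data. This is a sketch only: it assumes that `tagTypeData` is a dictionary keyed by chunk type tag, which the empty `{}` in the excerpt suggests but does not confirm, and `summarize_types` is a hypothetical name.

```python
from collections import Counter


def summarize_types(chunkfile):
    """Return (total chunks, count per type, types without handling data)."""
    counts = Counter(chunk["type"] for chunk in chunkfile["chunks"])
    # Assumption: handling data is stored per type tag in tagTypeData
    saved = chunkfile.get("projectData", {}).get("tagTypeData", {})
    unsaved = sorted(t for t in counts if t not in saved)
    return len(chunkfile["chunks"]), dict(counts), unsaved


chunkfile = {
    "projectData": {"targetTknsPerChunk": 40, "tagTypeData": {}},
    "chunks": [
        {"text": "Call me Ishmael.", "type": "sourceText"},
        {"text": "act", "type": "playerInput"},
        {"text": "It is a way I have.", "type": "sourceText"},
    ],
}
total, per_type, not_saved = summarize_types(chunkfile)
```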
Check the 'Add newline before'/'Add newline after' checkboxes (e) to add line breaks around each chunk of the corresponding type on export.
PlayerInput chunks should have both of these to separate input lines from source text/intended output lines to follow the format standard.
The text in the prefix text field (f) will be added to the beginning of each chunk of the corresponding type. If a newline is also added to the beginning, this text will be added after the newline.
For player inputs, the standard prefix is '> You ' (note the space character at the end).
NOTE: This option is less viable for third person, but different chunk types for different acting characters can be made to still use the batch insert utility. Adding a new chunk type is as simple as using a new chunk type tag in ChunkStack mode.
The text in the suffix text field (g) will be added to the end of each chunk of the corresponding type. If a newline is also added to the end, this text will be added before the newline.
For player inputs, the standard suffix is a period (.).
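How the newline, prefix and suffix settings interact can be sketched as follows. This illustrates the rules described above and is not the ChunkCombiner code: the layout of the `handling` dictionary and its key names are hypothetical, and chunks are joined without any extra separator.

```python
def combine(chunks, handling):
    """Join chunk texts, applying per-type prefix, suffix and newlines."""
    parts = []
    for chunk in chunks:
        h = handling.get(chunk["type"], {})
        piece = h.get("prefix", "") + chunk["text"] + h.get("suffix", "")
        if h.get("newline_before"):
            piece = "\n" + piece  # the prefix comes after the newline
        if h.get("newline_after"):
            piece = piece + "\n"  # the suffix comes before the newline
        parts.append(piece)
    return "".join(parts)


# Hypothetical handling data using the standard player input settings
handling = {
    "playerInput": {"newline_before": True, "newline_after": True,
                    "prefix": "> You ", "suffix": "."},
}
chunks = [
    {"text": "You like to be called Ishmael.", "type": "sourceText"},
    {"text": "want to travel", "type": "playerInput"},
    {"text": "This year you think you will sail.", "type": "sourceText"},
]
combined = combine(chunks, handling)
```

With these settings, a player input stored as just 'want to travel' ends up on its own line as '> You want to travel.'.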
Once chunk type handling data has been set, click the [Save type handling data] button (a) at the top to store the data in the ChunkFile. Once the ChunkFile contains handling data for the chunk types shown, the (not saved) notification will disappear.
NOTE: Changes to chunk type handling data for chunk types that have defined handling data in the ChunkFile will not show the (not saved) notification again. Make sure to use the File menu 'Save' at the top or press [Ctrl]+[S] to save the changes to the ChunkFile.
Once chunk type handling is properly set and the handling data saved, ChunkCombiner should look like this:
The combined file suffix (b) will be appended to the ChunkFile name to create the name of the exported file.
For example: The ChunkFile name MD_chapter1_raw_40tknChunks.json will lead to the exported file name MD_chapter1_raw_40tknChunks_combined.txt
Finally, click the [Export combined chunks] button (c) to save the properly formatted data. The file will be saved to the location of the opened ChunkFile.
The final product is properly formatted textadventure data, ready to be used to finetune a GPT model.
It looks like this:
You like to be called Ishmael.
> You want to travel.
This year—never mind when precisely—having little or no money in your purse, and nothing particular to interest you on shore, you think you will sail about a little and see the watery part of the world.
> You think about what you like about sailing.
It is a way you have of driving off the spleen and regulating the circulation.
> You think about why you want to go to sea now.
Whenever you find yourself growing grim about the mouth; whenever it is a damp, drizzly November in your soul; whenever you find yourself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral you meet; and especially whenever your hypos get such an upper hand of you, that it requires a strong moral principle to prevent you from deliberately stepping into the street, and methodically knocking people’s hats off—then, you account it high time to get to sea as soon as you can.
> You think about alternatives.
This is your substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; you quietly take to the ship. There is nothing surprising in this.
(This HowTo will be improved and expanded over time as FinetuneReFormatter evolves. Please open an Issue if anything is missing, unclear or does not work as described here!)
Current Wiki Version: Beta2