
InitialPrep

Gnurro edited this page May 19, 2021 · 5 revisions

InitialPrep mode

View text statistics, calculate token distribution, export sentences, apply miscellaneous small tweaks and fixes to text and create ChunkFiles.

Stats

Basic text statistics are calculated immediately, but more expensive calculations are only run on demand when you click the buttons at the top. This avoids unresponsiveness when large texts are analysed.

Text Stats

Click the [Tokenize data] button to calculate the number of tokens and the number of unique tokens. This can take a while for large texts.

Token Distribution

Once the data has been tokenized, click the [Calculate token distribution] button to see the most common tokens in the text.
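As a rough illustration of what these two steps compute, here is a minimal sketch. It assumes a plain whitespace split as a stand-in for the tool's actual tokenizer, and the function name token_distribution is hypothetical.

```python
from collections import Counter


def token_distribution(tokens, top_n=10):
    """Return the top_n most common tokens together with their counts."""
    return Counter(tokens).most_common(top_n)


# Assumption: whitespace splitting stands in for the real tokenizer.
tokens = "the cat sat on the mat the end".split()

total_tokens = len(tokens)        # number of tokens
unique_tokens = len(set(tokens))  # number of unique tokens
most_common = token_distribution(tokens, top_n=3)
```

Counting is cheap once the token list exists, which matches the order in the interface: tokenize first, then calculate the distribution.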

Sentence File Export

Use this to export a JSON array containing the text split into single sentences. The file is saved to the directory containing the currently opened file.
Warning: This currently overwrites an existing sentence file without asking for confirmation!
(File selection dialog planned for upcoming versions.)
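The export could look roughly like the sketch below. The sentence splitting rule (a regex on end punctuation), the "_sentences.json" file name and both function names are assumptions made for illustration, not the tool's actual behaviour. The optional overwrite check shows one way to guard against the data loss mentioned in the warning.

```python
import json
import re
from pathlib import Path


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def export_sentences(source_path, text, refuse_overwrite=True):
    """Write the sentences as a JSON array next to the source file."""
    source = Path(source_path)
    # Hypothetical naming scheme for the sentence file.
    target = source.with_name(source.stem + "_sentences.json")
    if refuse_overwrite and target.exists():
        raise FileExistsError(f"{target} already exists")
    payload = json.dumps(split_sentences(text), ensure_ascii=False, indent=2)
    target.write_text(payload, encoding="utf-8")
    return target
```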

Chunking

Set chunk parameters and placeholder chunk inserts, create chunks, and export them as a ChunkFile.
Chunks are composed of full sentences from the source text.

Maximum tokens per chunk

This is the target number of tokens a chunk will contain. Sentences are added to each chunk until adding the next sentence from the source data would exceed this value. If a single sentence contains more tokens than this threshold, it ends up in a chunk of its own.
(This will be relabeled to better represent its functionality in upcoming versions.)
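The chunking rule described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: build_chunks and count_tokens are hypothetical names, sentences are joined with a single space, and token counting is passed in as a function because the tool's tokenizer is not specified here.

```python
def build_chunks(sentences, max_tokens, count_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A sentence longer than max_tokens becomes a chunk of its own.
    """
    chunks = []
    current = []      # sentences collected for the chunk being built
    current_len = 0   # token count of the chunk being built
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))  # joining with a space is an assumption
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_count(sentence):
    """Stand-in token counter: number of whitespace-separated words."""
    return len(sentence.split())


example = build_chunks(
    ["a b c.", "d e.", "f g h i j k.", "l."],
    max_tokens=5,
    count_tokens=whitespace_count,
)
```

In this example the third sentence has six tokens, more than the limit of five, so it forms a chunk by itself while the first two sentences share one.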

Placeholder Chunk Insertion

If the checkbox is checked, a placeholder chunk will be inserted after each chunk of source text.
Placeholder type tag is added to each placeholder chunk's data in the ChunkFile. This tag later helps with handling and export of the formatted data. (See ChunkCombiner mode.) The preset is 'generic', but any string of up to 12 characters can be used.
Placeholder text is inserted into each placeholder chunk as text content. This can be a string of any length, but it is recommended to keep it to something short like the preset 'PLACEHOLDER'.
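A minimal sketch of the interleaving follows. The 'text' and 'type' keys are the ones used in the ChunkFile, while the function name and the 'source' type label for regular chunks are hypothetical placeholders.

```python
def interleave_placeholders(source_chunks,
                            placeholder_type="generic",
                            placeholder_text="PLACEHOLDER",
                            source_type="source"):
    """Insert one placeholder chunk after every chunk of source text.

    source_type is a hypothetical label; the tool's actual type name
    for source text chunks may differ.
    """
    if len(placeholder_type) > 12:
        raise ValueError("placeholder type tag must be at most 12 characters")
    result = []
    for text in source_chunks:
        result.append({"text": text, "type": source_type})
        result.append({"text": placeholder_text, "type": placeholder_type})
    return result


with_placeholders = interleave_placeholders(["First chunk.", "Second chunk."])
```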

Chunk Creation and Saving

The chunk file suffix automatically contains the maximum number of tokens per chunk, followed by a short string, preset to 'tknChunks'. This suffix is appended to the file name of the exported ChunkFile.
Clicking the [Create chunks and save] button creates the chunks as described above and saves them to a ChunkFile (.json).
Warning: This currently overwrites an existing ChunkFile without asking for confirmation!

ChunkFile Data Format

The ChunkFile data format is JSON, containing a 'projectData' object and an array of chunk objects ('chunks').
The projectData object contains the target number of tokens per chunk specified at creation of the ChunkFile (targetTknsPerChunk) and a chunk type data object (tagTypeData). Chunk type data determines how chunks of different types will be formatted when they are combined into the finished formatted text. More on chunk type handling in the ChunkCombiner mode documentation.
The 'chunks' array contains the chunk objects, each containing the content 'text' of the chunk and its 'type'.
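To make the layout concrete, here is a hedged sketch of a loader that checks for the keys named above. It assumes 'projectData' and 'chunks' are top-level keys of one JSON object; the function name is hypothetical, and the contents of tagTypeData are left empty because they are described in the ChunkCombiner mode documentation.

```python
import json


def load_chunk_file(raw_json):
    """Parse ChunkFile JSON and check that the documented keys are present."""
    data = json.loads(raw_json)
    project = data["projectData"]
    for key in ("targetTknsPerChunk", "tagTypeData"):
        if key not in project:
            raise ValueError(f"projectData is missing '{key}'")
    for index, chunk in enumerate(data["chunks"]):
        if "text" not in chunk or "type" not in chunk:
            raise ValueError(f"chunk {index} needs both 'text' and 'type'")
    return data


# Illustrative content only; the values and the 'source' type are made up.
sample = json.dumps({
    "projectData": {"targetTknsPerChunk": 64, "tagTypeData": {}},
    "chunks": [
        {"text": "A chunk of source text.", "type": "source"},
        {"text": "PLACEHOLDER", "type": "generic"},
    ],
})
loaded = load_chunk_file(sample)
```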

Miscellaneous Tweaks And Fixes

[Remove spaces at line ends] button: Removes all space characters from the end of lines by replacing ' \n' with '\n'.
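An equivalent cleanup can be sketched with a regular expression. This is an illustration rather than the tool's own code: it removes whole runs of spaces in one pass and also trims the last line, which the tool's replacement may or may not do.

```python
import re


def remove_line_end_spaces(text):
    """Strip space characters (not tabs) from the end of every line."""
    # With re.MULTILINE, '$' matches before each newline and at the end of the text.
    return re.sub(r" +$", "", text, flags=re.MULTILINE)


cleaned = remove_line_end_spaces("first line  \nsecond line \nthird line ")
```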