InitialPrep
View text statistics, calculate token distribution, export sentences, apply miscellaneous small tweaks and fixes to the text, and create ChunkFiles.
Basic text statistics are calculated immediately, but more complicated calculations are done on demand by clicking the buttons at the top. This avoids unresponsiveness when large texts are analysed.
Click the [Tokenize data] button to calculate the number of tokens and the number of unique tokens. This can take a while for large texts.
Once the data has been tokenized, click the [Calculate token distribution] button to see the most common tokens in the text.
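The tokenizer InitialPrep uses is not specified on this page; as a rough illustration only, the sketch below splits text on whitespace and counts token frequencies, which is enough to show how the token count, unique-token count and a "most common tokens" distribution relate.

```typescript
// Rough sketch only: assumes naive whitespace tokenization, which may
// differ from the tokenizer InitialPrep actually uses.
function tokenStats(text: string) {
  const tokens = text.split(/\s+/).filter(t => t.length > 0);
  const counts = new Map<string, number>();
  for (const t of tokens) {
    counts.set(t, (counts.get(t) ?? 0) + 1);
  }
  // Sort by descending frequency to get the most common tokens.
  const topTokens = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10);
  return { total: tokens.length, unique: counts.size, topTokens };
}
```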
Use this to export a JSON array containing the text separated into single sentences. The file is saved to the directory containing the currently opened file.
Warning: This will currently overwrite an existing 'sentence file' without any warning!
(File selection dialog planned for upcoming versions.)
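The exact file name and sentence-splitting rules are not shown on this page, but based on the description above the exported 'sentence file' is simply a JSON array with one string per sentence, roughly like this:

```json
[
  "This is the first sentence of the source text.",
  "This is the second sentence.",
  "Each array element holds exactly one sentence."
]
```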
Set chunk parameters, placeholder chunk inserts, create chunks and export as ChunkFile.
Chunks are composed of full sentences from the source text.
This is the target number of tokens a chunk will contain. Sentences will be added to each chunk until adding the next sentence from the source data would exceed this value. If a single sentence contains more tokens than this threshold, it will result in a chunk that contains only that sentence.
(This will be relabeled to better represent its functionality in upcoming versions.)
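A minimal sketch of the chunking rule described above, assuming the sentences are already split out and using a hypothetical countTokens helper in place of the app's real tokenizer:

```typescript
// Sketch of the chunking rule: fill each chunk with full sentences until the
// next sentence would push it past the target; an over-long sentence becomes
// a chunk of its own. `countTokens` is a hypothetical stand-in.
function buildChunks(
  sentences: string[],
  targetTknsPerChunk: number,
  countTokens: (s: string) => number
): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const sentence of sentences) {
    const tokens = countTokens(sentence);
    if (current.length > 0 && currentTokens + tokens > targetTknsPerChunk) {
      chunks.push(current.join(" ")); // joining with a space is an assumption
      current = [];
      currentTokens = 0;
    }
    current.push(sentence);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```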
If the checkbox is checked, a placeholder chunk will be inserted after each chunk of source text.
A placeholder type tag is added to each placeholder chunk's data in the ChunkFile. These tags later help with handling and export of the formatted data (see ChunkCombiner mode). Currently this is 'generic', but any string of up to 12 characters can be used.
Placeholder text is inserted into each placeholder chunk as text content. This can be a string of any length, but it is recommended to keep it to something short like the preset 'PLACEHOLDER'.
The chunk file suffix automatically contains the maximum number of tokens per chunk, followed by a short string, preset to 'tknChunks'. This suffix will be appended to the file name of the ChunkFile to be exported.
Clicking the [Create chunks and save] button will create chunks, as described above, and then save them into a ChunkFile (.json).
Warning: This will currently overwrite an existing ChunkFile without any warning!
The ChunkFile data format is JSON, containing a 'projectData' object and an array of chunk objects ('chunks').
The projectData object contains the target number of tokens per chunk specified at creation of the ChunkFile (targetTknsPerChunk) and a chunk type data object (tagTypeData). Chunk type data determines how chunks of different types will be formatted when they are combined into the finished formatted text. More on chunk type handling in the ChunkCombiner mode documentation.
The 'chunks' array contains the chunk objects, each containing the content 'text' of the chunk and its 'type'.
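Putting the pieces named above together, a minimal ChunkFile might look roughly like the sketch below. The field names projectData, targetTknsPerChunk, tagTypeData, chunks, text and type come from this page; the contents of tagTypeData and the type value used for source-text chunks are assumptions.

```json
{
  "projectData": {
    "targetTknsPerChunk": 500,
    "tagTypeData": {}
  },
  "chunks": [
    { "text": "First chunk, built from full sentences of the source text.", "type": "source" },
    { "text": "PLACEHOLDER", "type": "generic" },
    { "text": "Second chunk of source text.", "type": "source" }
  ]
}
```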
[Remove spaces at line ends] button: Removes all space characters from the end of lines by replacing any run of spaces before a '\n' with just '\n'.
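A minimal sketch of that clean-up, assuming a simple regex replacement over the whole text:

```typescript
// Replace any run of spaces immediately before a line break with just '\n'.
function removeTrailingSpaces(text: string): string {
  return text.replace(/ +\n/g, "\n");
}
```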
Current Wiki Version: Beta2