
tokenize.pl options and examples

Shon Feder edited this page Feb 17, 2019 · 7 revisions

Examples

Tokenizing text

?- tokenize(`\tExample  Text.`, Tokens).
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] 

?- tokenize(`\tExample  Text.`, Tokens, [cntrl(false), pack(true), cased(true)]).
Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] 

?- tokenize(`\tExample  Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]).
	example  text.
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')],
Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] 

Tokenizing files

Given some file, e.g.,

$ cat test.txt 
1. Show examples of tokenizer.
2. Zip files.
3. Upload to the SWI-Prolog pack list!

We can use tokenize_file/2 to tokenize its contents:

?- tokenize_file('test.txt', Tokens).
Tokens = [word('1'), punct('.'), spc(' '), word(show), spc(' '), word(examples),
spc(' '), word(of), spc(' '), word(tokenizer), punct('.'), cntrl('\n'),
word('2'), punct('.'), spc(' '), word(zip), spc(' '), word(files), punct('.'),
cntrl('\n'), word('3'), punct('.'), spc(' '), word(upload), spc(' '), word(to),
spc(' '), word(the), spc(' '), word(swi), punct(-), word(prolog), spc(' '),
word(pack), spc(' '), word(list), punct(!), cntrl('\n')]

tokenize_file/3 is the same, but takes a list of options as its third argument.
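For instance, the same file could be tokenized while dropping the space and control-character tokens by passing the corresponding options (a sketch using the options documented in the Options section below; the resulting bindings are not shown here):

```prolog
% Tokenize test.txt, omitting space and control-character tokens.
?- tokenize_file('test.txt', Tokens, [spaces(false), cntrl(false)]).
```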

Converting tokens to text

Given a list of tokens, we can convert them back into a list of character codes using untokenize/2:

?- Tokens = [word("one"),spc(" "),word("two"),spc(" "),word("three"),spc(" "),punct("!")], untokenize(Tokens, Codes), format(`~s~n`, [Codes]).
one two three !
Tokens = [word("one"), spc(" "), word("two"), spc(" "), word("three"), spc(" "), punct("!")],
Codes = [111, 110, 101, 32, 116, 119, 111, 32, 116|...] 

Options

tokenize_file/3 and tokenize/3 both take an option list as their third argument. The two-place versions of these predicates are equivalent to calling the three-place versions with an empty list of options, i.e., with the defaults. That is, tokenize(Text, Tokens) is equivalent to tokenize(Text, Tokens, [cased(false), spaces(true), cntrl(true), punct(true), to(atoms), pack(false)]).
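This equivalence can be checked at the top level by running both forms and comparing the results (a minimal sketch; if the defaults are as listed, both calls should bind identical token lists):

```prolog
?- tokenize(`Hi!`, T2),
   tokenize(`Hi!`, T3, [cased(false), spaces(true), cntrl(true),
                        punct(true), to(atoms), pack(false)]),
   T2 == T3.
```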

Available options:

option   possible values               default   description
cased    true, false                   false     whether tokens preserve case
spaces   true, false                   true      whether spaces are included as tokens or omitted
cntrl    true, false                   true      whether control characters are included as tokens or omitted
punct    true, false                   true      whether punctuation marks are included as tokens or omitted
to       strings, atoms, chars, codes  atoms     the type of representation used for the tokens
pack     true, false                   false     whether to pack consecutive occurrences of identical tokens or simply repeat them

Examples of options:

?- tokenize(`Example String!!\n`, Tokens).
Tokens = [word(example), spc(' '), word(string), punct(!), punct(!), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [cased(true)]).
Tokens = [word('Example'), spc(' '), word('String'), punct(!), punct(!), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [spaces(false)]).
Tokens = [word(example), word(string), punct(!), punct(!), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [cntrl(false)]).
Tokens = [word(example), spc(' '), word(string), punct(!), punct(!)] 

?- tokenize(`Example String!!\n`, Tokens, [punct(false)]).
Tokens = [word(example), spc(' '), word(string), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [to(strings)]).
Tokens = [word("example"), spc(" "), word("string"), punct("!"), punct("!"), cntrl("\n")] 

?- tokenize(`Example String!!\n`, Tokens, [pack(true)]).
Tokens = [word(example, 1), spc(' ', 1), word(string, 1), punct(!, 2), cntrl('\n', 1)] 

?- tokenize(`Example String!!\n`, Tokens, [pack(true), cased(true), spaces(false)]).
Tokens = [word('Example', 1), word('String', 1), punct(!, 2), cntrl('\n', 1)]