
tokenize.pl options and examples

Shon Feder edited this page Feb 17, 2019 · 7 revisions

Examples

Tokenizing text

?- tokenize(`\tExample  Text.`, Tokens).
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] 

?- tokenize(`\tExample  Text.`, Tokens, [cntrl(false), pack(true), cased(true)]).
Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] 

?- tokenize(`\tExample  Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]).
	example  text.
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')],
Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] 

Tokenizing files

Given some file, e.g.,

$ cat test.txt 
1. Show examples of tokenizer.
2. Zip files.
3. Upload to the SWI-Prolog pack list!

We can use tokenize_file/2 to tokenize its contents:

?- tokenize_file('test.txt', Tokens).
Tokens = [word('1'), punct('.'), spc(' '), word(show), spc(' '), word(examples),
spc(' '), word(of), spc(' '), word(tokenizer), punct('.'), cntrl('\n'),
word('2'), punct('.'), spc(' '), word(zip), spc(' '), word(files), punct('.'),
cntrl('\n'), word('3'), punct('.'), spc(' '), word(upload), spc(' '), word(to),
spc(' '), word(the), spc(' '), word(swi), punct(-), word(prolog), spc(' '),
word(pack), spc(' '), word(list), punct(!), cntrl('\n')]

tokenize_file/3 is the same, but takes a list of options as its third argument.
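For instance, the same file could be tokenized while dropping the space and control-character tokens by passing the corresponding options (a sketch using the options documented in the Options section below; the resulting bindings are not shown here):

```prolog
% Tokenize test.txt, omitting space and control-character tokens.
?- tokenize_file('test.txt', Tokens, [spaces(false), cntrl(false)]).
```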

Converting tokens to text

Given a list of tokens, we can convert them back into a list of character codes using untokenize/2:

?- Tokens = [word("one"),spc(" "),word("two"),spc(" "),word("three"),spc(" "),punct("!")], untokenize(Tokens, Codes), format(`~s~n`, [Codes]).
one two three !
Tokens = [word("one"), spc(" "), word("two"), spc(" "), word("three"), spc(" "), punct("!")],
Codes = [111, 110, 101, 32, 116, 119, 111, 32, 116|...] 

Options

tokenize_file/3 and tokenize/3 both take an option list as their third argument. The two-place versions of these predicates are equivalent to calling the three-place versions with an empty list of options, i.e., with the defaults. That is, tokenize(Text, Tokens) is equivalent to tokenize(Text, Tokens, [cased(false), spaces(true), cntrl(true), punct(true), to(atoms), pack(false)]).
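This equivalence can be checked at the top level by running both forms and comparing the results (a minimal sketch; if the defaults are as listed, both calls should bind identical token lists):

```prolog
?- tokenize(`Hi!`, T2),
   tokenize(`Hi!`, T3, [cased(false), spaces(true), cntrl(true),
                        punct(true), to(atoms), pack(false)]),
   T2 == T3.
```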

Available options:

option   possible values               default   description
cased    true, false                   false     whether tokens preserve case
spaces   true, false                   true      whether spaces are included as tokens or omitted
cntrl    true, false                   true      whether control characters are included as tokens or omitted
punct    true, false                   true      whether punctuation marks are included as tokens or omitted
to       strings, atoms, chars, codes  atoms     the type of representation used for the tokens
pack     true, false                   false     whether to pack consecutive occurrences of identical tokens or simply repeat them

Examples of options:

?- tokenize(`Example String!!\n`, Tokens).
Tokens = [word(example), spc(' '), word(string), punct(!), punct(!), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [cased(true)]).
Tokens = [word('Example'), spc(' '), word('String'), punct(!), punct(!), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [spaces(false)]).
Tokens = [word(example), word(string), punct(!), punct(!), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [cntrl(false)]).
Tokens = [word(example), spc(' '), word(string), punct(!), punct(!)] 

?- tokenize(`Example String!!\n`, Tokens, [punct(false)]).
Tokens = [word(example), spc(' '), word(string), cntrl('\n')] 

?- tokenize(`Example String!!\n`, Tokens, [to(strings)]).
Tokens = [word("example"), spc(" "), word("string"), punct("!"), punct("!"), cntrl("\n")] 

?- tokenize(`Example String!!\n`, Tokens, [pack(true)]).
Tokens = [word(example, 1), spc(' ', 1), word(string, 1), punct(!, 2), cntrl('\n', 1)] 

?- tokenize(`Example String!!\n`, Tokens, [pack(true), cased(true), spaces(false)]).
Tokens = [word('Example', 1), word('String', 1), punct(!, 2), cntrl('\n', 1)]