# tokenize.pl options and examples

Shon Feder edited this page Feb 17, 2019 · 7 revisions
```prolog
?- tokenize(`\tExample  Text.`, Tokens).
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')]

?- tokenize(`\tExample  Text.`, Tokens, [cntrl(false), pack(true), cased(true)]).
Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)]

?- tokenize(`\tExample  Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]).
	example  text.
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')],
Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...]
```
Given some file, e.g.,

```
$ cat test.txt
1. Show examples of tokenizer.
2. Zip files.
3. Upload to the SWI-Prolog pack list!
```

we can use `tokenize_file/2` to tokenize its contents:
```prolog
?- tokenize_file('test.txt', Tokens).
Tokens = [word('1'), punct('.'), spc(' '), word(show), spc(' '), word(examples),
          spc(' '), word(of), spc(' '), word(tokenizer), punct('.'), cntrl('\n'),
          word('2'), punct('.'), spc(' '), word(zip), spc(' '), word(files), punct('.'),
          cntrl('\n'), word('3'), punct('.'), spc(' '), word(upload), spc(' '), word(to),
          spc(' '), word(the), spc(' '), word(swi), punct(-), word(prolog), spc(' '),
          word(pack), spc(' '), word(list), punct(!), cntrl('\n')]
```
`tokenize_file/3` is the same, but takes a list of options.
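To illustrate, the same file can be tokenized while omitting the space tokens. This transcript is not taken from the original page; it is inferred from the `spaces(false)` behavior of `tokenize/3` documented below:

```prolog
?- tokenize_file('test.txt', Tokens, [spaces(false)]).
Tokens = [word('1'), punct('.'), word(show), word(examples), word(of),
          word(tokenizer), punct('.'), cntrl('\n'), word('2'), punct('.'),
          word(zip), word(files), punct('.'), cntrl('\n'), word('3'), punct('.'),
          word(upload), word(to), word(the), word(swi), punct(-), word(prolog),
          word(pack), word(list), punct(!), cntrl('\n')]
```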
Given some list of tokens, we can turn them back into a list of character codes using `untokenize/2`:
```prolog
?- Tokens = [word("one"), spc(" "), word("two"), spc(" "), word("three"), spc(" "), punct("!")],
   untokenize(Tokens, Codes),
   format(`~s~n`, [Codes]).
one two three !
Tokens = [word("one"), spc(" "), word("two"), spc(" "), word("three"), spc(" "), punct("!")],
Codes = [111, 110, 101, 32, 116, 119, 111, 32, 116|...]
```
`tokenize_file/3` and `tokenize/3` both take an option list as their third argument. The two-place versions of these predicates are equivalent to calling the three-place versions with an empty option list, i.e., with the defaults. So `tokenize(Text, Tokens)` is equivalent to `tokenize(Text, Tokens, [cased(false), spaces(true), cntrl(true), punct(true), to(atoms), pack(false)])`.
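As a quick sanity check of that equivalence (this transcript is not from the original page), the two calls should produce identical token lists:

```prolog
?- tokenize(`abc def`, T1),
   tokenize(`abc def`, T2, [cased(false), spaces(true), cntrl(true),
                            punct(true), to(atoms), pack(false)]),
   T1 == T2.
T1 = T2, T2 = [word(abc), spc(' '), word(def)].
```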
| option name | possible values | description | default |
|---|---|---|---|
| cased | true, false | whether tokens preserve case | false |
| spaces | true, false | whether spaces are included as tokens or omitted | true |
| cntrl | true, false | whether control characters are included as tokens or omitted | true |
| punct | true, false | whether punctuation marks are included as tokens or omitted | true |
| to | strings, atoms, chars, codes | set the type of representation used by the tokens | atoms |
| pack | true, false | whether to pack consecutive occurrences of identical tokens or simply repeat the tokens | false |
```prolog
?- tokenize(`Example String!!\n`, Tokens).
Tokens = [word(example), spc(' '), word(string), punct(!), punct(!), cntrl('\n')]

?- tokenize(`Example String!!\n`, Tokens, [cased(true)]).
Tokens = [word('Example'), spc(' '), word('String'), punct(!), punct(!), cntrl('\n')]

?- tokenize(`Example String!!\n`, Tokens, [spaces(false)]).
Tokens = [word(example), word(string), punct(!), punct(!), cntrl('\n')]

?- tokenize(`Example String!!\n`, Tokens, [cntrl(false)]).
Tokens = [word(example), spc(' '), word(string), punct(!), punct(!)]

?- tokenize(`Example String!!\n`, Tokens, [punct(false)]).
Tokens = [word(example), spc(' '), word(string), cntrl('\n')]

?- tokenize(`Example String!!\n`, Tokens, [to(strings)]).
Tokens = [word("example"), spc(" "), word("string"), punct("!"), punct("!"), cntrl("\n")]

?- tokenize(`Example String!!\n`, Tokens, [pack(true)]).
Tokens = [word(example, 1), spc(' ', 1), word(string, 1), punct(!, 2), cntrl('\n', 1)]

?- tokenize(`Example String!!\n`, Tokens, [pack(true), cased(true), spaces(false)]).
Tokens = [word('Example', 1), word('String', 1), punct(!, 2), cntrl('\n', 1)]
```
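Since `cased(true)` preserves capitalization and the remaining defaults keep every space, control character, and punctuation mark, tokenizing and then untokenizing should reproduce the input exactly. This transcript is inferred from the examples above, not taken from the original page:

```prolog
?- tokenize(`Example String!!\n`, Tokens, [cased(true)]),
   untokenize(Tokens, Codes),
   format('~s', [Codes]).
Example String!!
```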