From 1221082a127ffbb463680b9cd9abbd62f5f3c54c Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 11 May 2019 22:03:56 -0400 Subject: [PATCH 01/25] Add design notes (#25) --- design_notes.md | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) create mode 100644 design_notes.md diff --git a/design_notes.md b/design_notes.md new file mode 100644 index 0000000..e2be8ed --- /dev/null +++ b/design_notes.md @@ -0,0 +1,45 @@ +# Design Notes + +Initially extracted from conversation with +[@Anniepoo](https://github.com/Anniepoo) and [@nicoabie](https://github.com/nicoabie) in +##prolog on [freenode](https://freenode.net/). + +The library started as a very simple and lightweight set of predicates for a +common, but very limited, form of lexing. As we extend it, we aim to maintain a +modest scope in order to achieve a sweet spot between ease of use and powerful +flexibility. + +## Scope and Aims + +`tokenize` does not aspire to become an industrial strength lexer generator. We +aim to serve most users' needs between raw input and a structured form ready for +parsing by a DCG. + +If a user is parsing a language with keywords such as `class`, `module`, etc., +and wants to distinguish these from variable names, `tokenize` isn't going to +give them this out of the box. But it should provide an easy means of achieving +this result through a subsequent lexing pass. + +## Some Model Users + +* somebody making a computer language + * needs to be able to distinguish keywords, variables and literals + * needs to be able to identify comments +* somebody making a parser for an interactive fiction game + * needs to handle stuff like "William O. N'mutu-O'Connell went to the market" +* somebody wanting to analyze human texts + * wanting to do some analysis on New York Times articles, they want to first + process the articles into meaningful tokens + +## Design Rules + +* We don't parse. +* Every token generated is callable (i.e., an atom or compound). + * Example of an possible compound token: `spc(' ')`. + * Example of a possible atom token: `escape`. + tokenization need to return tokens represented with the same arity) +* Users should be able to determine the kind of token by unification. +* Users should be able to clearly see and specify the precedence for tokenization + * E.g., given `"-12.3"`, `numbers, punctuation` should yield `[number(-12.3)]` + while `punctuation, numbers` should yield `[pnct('-'), number(12), pnct('.'), + number(3)]`.
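The design rules above are easiest to appreciate with a token stream in hand. The sketch below is a minimal illustration of the "determine the kind of token by unification" rule; `describe_token/2` is a hypothetical helper invented for this note (it is not part of the library), and the token shapes are the `word/1`, `punct/1`, and `spc/1` forms the notes and README use at this point in the project's history.

```prolog
% A minimal sketch of dispatching on token kinds by unification.
% describe_token/2 is hypothetical, for illustration only.
describe_token(word(W), Desc)  :- format(atom(Desc), 'word: ~w', [W]).
describe_token(punct(P), Desc) :- format(atom(Desc), 'punctuation: ~w', [P]).
describe_token(spc(S), Desc)   :- format(atom(Desc), 'space: ~q', [S]).

% Applied to a token list of the shape tokenize/2 produces:
% ?- maplist(describe_token, [word(hello), punct(','), spc(' ')], Descriptions).
```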
From 45189c547bdd6cde751b4707576bd43dba376705 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 11 May 2019 22:14:30 -0400 Subject: [PATCH 02/25] Init circleci config (#27) Signed-off-by: Shon Feder --- .circleci/config.yml | 1 + 1 file changed, 1 insertion(+) create mode 100644 .circleci/config.yml diff --git a/.circleci/config.yml b/.circleci/config.yml new file mode 100644 index 0000000..22817d2 --- /dev/null +++ b/.circleci/config.yml @@ -0,0 +1 @@ +version: 2 From 4158a3f075ead34683c944896719fd6e0025d30c Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 11 May 2019 22:35:25 -0400 Subject: [PATCH 03/25] Run the test harness in the CI (#28) Signed-off-by: Shon Feder --- .circleci/config.yml | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/.circleci/config.yml b/.circleci/config.yml index 22817d2..a7f5ee6 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -1 +1,21 @@ version: 2 + +jobs: + build: + docker: + - image: swipl:stable + + steps: + - run: + # TODO Build custom image to improve build time + name: Install git + command: | + apt update -y + apt install git -y + + - checkout + + - run: + name: Run tests + command: | + ./test/test.pl From a89db7d0445c378b870911d4ab2ead2a719d23f7 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 12 May 2019 09:10:27 -0400 Subject: [PATCH 04/25] Add instructions for getting a basic development environment set up (#29) * Add link to design_notes.md Signed-off-by: Shon Feder --- CONTRIBUTING.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 87eda1c..16b731a 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -5,12 +5,56 @@ reports, etc. ## Code of Conduct -Please review and accept to our [code of conduct](CODE_OF_CONDUCT.md) prior to +Please review and accept our [code of conduct](CODE_OF_CONDUCT.md) prior to engaging in the project. +## Overall direction and aims + +Consult the `[design_notes.md](design_notes.md)` to see the latest codified +design philosophy and principles. + ## Setting up Development -TODO +1. Install swi-prolog's [swipl](http://www.swi-prolog.org/download/stable). + - Optionally, you may wish to use [swivm](https://github.com/fnogatz/swivm) to + manage multiple installed versions of swi-prolog. +2. Hack on the source code in `[./prolog](./prolog)`. +3. Run and explore your changes by loading the file in `swipl` (or using your + editor's IDE capabilities): + - Example in swipl + + ```prolog + # in ~/oss/tokenize on git:develop x [22:45:02] + $ cd ./prolog + + # in ~/oss/tokenize/prolog on git:develop x [22:45:04] + $ swipl + Welcome to SWI-Prolog (threaded, 64 bits, version 8.0.2) + SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software. + Please run ?- license. for legal details. + + For online help and background, visit http://www.swi-prolog.org + For built-in help, use ?- help(Topic). or ?- apropos(Word). + + % load the tokenize module + ?- [tokenize]. + true. + + % experiment + ?- tokenize("Foo bar baz", Tokens). + Tokens = [word(foo), spc(' '), word(bar), spc(' '), word(baz)]. + + % reload the module when you make changes to the source code + ?- make. + % Updating index for library /usr/local/Cellar/swi-prolog/8.0.2/libexec/lib/swipl/library/ + true. + + % finished + ?- halt. + ``` + +Please ask here or in `##prolog` on [freenode](https://freenode.net/) if you +need any help!
:) ## Running tests From 5e74e4e5d1c67addc5eda542ea16dd9b6c8d274b Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 12 May 2019 09:33:11 -0400 Subject: [PATCH 05/25] Fix design notes link (#31) Signed-off-by: Shon Feder --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 16b731a..8084dc9 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -10,7 +10,7 @@ engaging in the project. ## Overall direction and aims -Consult the `[design_notes.md](design_notes.md)` to see the latest codified +Consult the [`design_notes.md`](design_notes.md) to see the latest codified design philosophy and principles. ## Setting up Development From d7b0fe970141a2652e8072663e41c67679514172 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 12 May 2019 12:53:18 -0400 Subject: [PATCH 06/25] Explicitly set back_quotes for code lists in the tokenize module (#30) Closes #7 * Also removed trailing white space from the readme --- README.md | 8 ++++---- prolog/tokenize.pl | 3 +++ 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 82ec7d1..b8c0b73 100644 --- a/README.md +++ b/README.md @@ -2,22 +2,22 @@ ```prolog ?- tokenize(`\tExample Text.`, Tokens). -Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] +Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] ?- tokenize(`\tExample Text.`, Tokens, [cntrl(false), pack(true), cased(true)]). -Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] +Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] ?- tokenize(`\tExample Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]). example text. Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')], -Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] +Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] ``` # Description Module `tokenize` aims to provide a straightforward tool for tokenizing text into a simple format. It is the result of a learning exercise, and it is far from perfect. If there is sufficient interest from myself or anyone else, I'll try to improve it. -It is packaged as an SWI-Prolog pack, available [here](http://www.swi-prolog.org/pack/list?p=tokenize). Install it into your SWI-Prolog system with the query +It is packaged as an SWI-Prolog pack, available [here](http://www.swi-prolog.org/pack/list?p=tokenize). Install it into your SWI-Prolog system with the query ```prolog ?- pack_install(tokenize). diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index a177bf9..7dc877b 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -25,6 +25,9 @@ */ +% Ensure we interpret backs as enclosing code lists in this module. +:- set_prolog_flag(back_quotes, codes). + %% tokenize(+Text:list(code), -Tokens:list(term)) is semidet. % % @see tokenize/3 is called with an empty list of options: thus, with defaults. 
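A short note on the flag set in patch 6, since the rest of the series relies on it: `back_quotes` controls how back-quoted literals are read, and the `codes` value makes them read as lists of character codes, which is what the tokenizer's DCG rules expect. Roughly, at a toplevel where the flag is `codes`:

```prolog
% With back_quotes = codes, a back-quoted literal denotes a code list:
?- X = `abc`.
X = [97, 98, 99].

% If an embedding environment had set the flag to another value (e.g. string),
% the same literal would denote a different type and the grammar rules in
% tokenize.pl would no longer match, which is why the module now pins the flag
% explicitly instead of relying on the caller's configuration.
```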
From 15b1959ff02ad06c60b8bedf610a960a6a2095e9 Mon Sep 17 00:00:00 2001 From: Stefan Israelsson Tampe Date: Sun, 12 May 2019 20:57:58 +0200 Subject: [PATCH 07/25] add comment.pl, dcg that parses a stream of codes into comment recursive or not, tokens or just skip the comment --- prolog/comment.pl | 156 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 156 insertions(+) create mode 100644 prolog/comment.pl diff --git a/prolog/comment.pl b/prolog/comment.pl new file mode 100644 index 0000000..8e1a525 --- /dev/null +++ b/prolog/comment.pl @@ -0,0 +1,156 @@ +/* +module(tokenize(comment) + [comment/2, + comment_rec/2, + comment_token/2, + comment_token_rec/2]). +*/ + +dcgtrue(U,U). + +id([X|L]) --> [X],id(L). +id([]) --> dcgtrue. +id([X|L],[X|LL]) --> [X],id(L,LL). +id([],[]) --> dcgtrue. + +tr(S,SS) :- + atom(S) -> + ( + atom_codes(S,C), + SS=id(C) + ); + SS=S. + +eol --> {atom_codes('\n',E)},id(E). +eol(HS) --> {atom_codes('\n',E)},id(E,HS). + +comment_body(E) --> call(E),!. +comment_body(E) --> [_],comment_body(E). +comment_body(_) --> []. + +comment(S,E) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS), + comment_body(EE). + +line_comment(S) --> + {tr(S,SS)}, + comment_body(SS,eol). + +comment_body_token(E,Text) --> + call(E,HE),!, + {append(HE,[],Text)}. + +comment_body_token(E,[X|L]) --> + [X], + comment_body_token(E,L). + +comment_body_token(_,[]) --> []. + +comment_token(S,E,Text) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS,HS), + {append(HS,T,Text)}, + comment_body_token(EE,T). + +line_comment_token(S,Text) --> + {tr(S,SS)}, + comment_body_token(SS,eol,Text). + +comment_body_rec_cont(S,E,Cont,HE,Text) --> + {append(HE,T,Text)}, + comment_body_token_rec(S,E,Cont,T). + +comment_body_rec_start(HE,Text) --> + {append(HE,[],Text)}. + +comment_body_token_rec(_,E,Cont,Text) --> + call(E,HE), + call(Cont,HE,Text). + +comment_body_token_rec(S,E,Cont,Text) --> + call(S,HS), + {append(HS,T,Text)}, + comment_body_token_rec(S,E,comment_body_rec_cont(S,E,Cont),T). + +comment_body_token_rec(S,E,Cont,[X|L]) --> + [X], + comment_body_token_rec(S,E,Cont,L). + +comment_body_token_rec(_,_,_,_,_,[]) --> dcgtrue. + +comment_token_rec(S,E,Text) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS,HS), + {append(HS,T,Text)}, + comment_body_token_rec(SS,EE,comment_body_rec_start,T). + +comment_body_rec(_,E) --> + call(E). + +comment_body_rec(S,E) --> + call(S), + comment_body_rec(S,E), + comment_body_rec(S,E). + +comment_body_rec(S,E) --> + [_], + comment_body_rec(S,E). + +comment_body_rec(_,_). + +comment_rec(S,E) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS), + comment_body_rec(SS,EE). + +test(Tok,S,U) :- + atom_codes(S,SS), + call_dcg(Tok,SS,U). + +test_comment(S) :- + test(comment('<','>'),S,[]). + +test_comment_rec(S) :- + test(comment_rec('<','>'),S,[]). + +test_comment_token(S,T) :- + test(comment_token('<','>',TT),S,[]), + atom_codes(T,TT). + +test_comment_token_rec(S,T) :- + test(comment_token_rec('<','>',TT),S,[]), + atom_codes(T,TT). + +tester([]). +tester([X|L]) :- + write_term(test(X),[]), + ( + call(X) -> write(' ... OK') ; write(' ... FAIL') + ), + nl, + tester(L). + + +/* +tester( + [test_comment(''), + test_comment_rec('>'), + test_comment_token('',''), + test_comment_token_rec('>','>')]). +*/ + + + From 189066c9223c69a921edffedb20d3ac573010893 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nicol=C3=A1s=20Andr=C3=A9s=20Gallinal?= Date: Sun, 12 May 2019 17:56:29 -0300 Subject: [PATCH 08/25] Created a Makefile (#32) * Add a Makefile with test target. Updated CircleCI conf. 
Ability to run tests from within swipl repl. * Add make as dep for the docker image --- .circleci/config.yml | 6 +++--- CONTRIBUTING.md | 14 ++++++++++++-- Makefile | 20 ++++++++++++++++++++ test/test.pl | 18 +----------------- 4 files changed, 36 insertions(+), 22 deletions(-) create mode 100644 Makefile diff --git a/.circleci/config.yml b/.circleci/config.yml index a7f5ee6..dd98f9e 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -8,14 +8,14 @@ jobs: steps: - run: # TODO Build custom image to improve build time - name: Install git + name: Install Deps command: | apt update -y - apt install git -y + apt install git make -y - checkout - run: name: Run tests command: | - ./test/test.pl + make test diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8084dc9..1895faa 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -59,10 +59,20 @@ need any help! :) ## Running tests Tests are located in the [`./test`](./test) directory. To run the test suite, -simply execute the test file: +simply execute make test: ```sh -$ ./test/test.pl +$ make test % PL-Unit: tokenize .. done % All 2 tests passed ``` + +If inside the swipl repl, make sure to load the test file and query run_tests. + +```prolog +?- [test/test]. +?- run_tests. +% PL-Unit: tokenize .. done +% All 2 tests passed +true. +``` \ No newline at end of file diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..d5ae1c5 --- /dev/null +++ b/Makefile @@ -0,0 +1,20 @@ +.PHONY: all test clean + +version := $(shell swipl -q -s pack -g 'version(V),writeln(V)' -t halt) +packfile = quickcheck-$(version).tgz + +SWIPL := swipl + +all: test + +version: + echo $(version) + +check: test + +install: + echo "(none)" + +test: + @$(SWIPL) -s test/test.pl -g 'run_tests,halt(0)' -t 'halt(1)' + \ No newline at end of file diff --git a/test/test.pl b/test/test.pl index 49b1857..ed6de19 100755 --- a/test/test.pl +++ b/test/test.pl @@ -1,18 +1,3 @@ -#!/usr/bin/env swipl -/** Unit tests for the tokenize library - * - * To run these tests, execute this file - * - * ./test/test.pl - */ - -:- initialization(main, main). - -main(_Argv) :- - run_tests. - -:- begin_tests(tokenize). - :- dynamic user:file_search_path/2. :- multifile user:file_search_path/2. @@ -22,8 +7,7 @@ asserta(user:file_search_path(package, PackageDir)). :- use_module(package(tokenize)). - -% TESTS START HERE +:- begin_tests(tokenize). test('Hello, Tokenize!', [true(Actual == Expected)] From 4b9b0f82efdc5b8f7eb5cf24c0d652366a189e2b Mon Sep 17 00:00:00 2001 From: Anne Ogborn Date: Sun, 12 May 2019 16:28:06 -0700 Subject: [PATCH 09/25] Add tokenization of numbers (#34) --- .gitignore | 1 + prolog/tokenize.pl | 46 +++++++++++++++++++++++++++++----------------- test/test.pl | 27 +++++++++++++++++++++++++++ 3 files changed, 57 insertions(+), 17 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..b25c15b --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +*~ diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 7dc877b..e94f0aa 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -24,6 +24,7 @@ text. */ +:- use_module(library(dcg/basics), [eos//0, number//1]). % Ensure we interpret backs as enclosing code lists in this module. :- set_prolog_flag(back_quotes, codes). @@ -55,13 +56,19 @@ % Valid options are: % % * cased(+bool) : Determines whether tokens perserve cases of the source text. -% * spaces(+bool) : Determines whether spaces are represted as tokens or discarded. 
-% * cntrl(+bool) : Determines whether control characters are represented as tokens or discarded. -% * punct(+bool) : Determines whether punctuation characters are represented as tokens or discarded. -% * to(+on_of([strings,atoms,chars,codes])) : Determines the representation format used for the tokens. -% * pack(+bool) : Determines whether tokens are packed or repeated. +% * spaces(+bool) : Determines whether spaces are represted as tokens +% or discarded. +% * cntrl(+bool) : Determines whether control characters are represented +% as tokens or discarded. +% * punct(+bool) : Determines whether punctuation characters are represented +% as tokens or discarded. +% * to(+one_of([strings,atoms,chars,codes])) : Determines the +% representation format used for the tokens. +% * pack(+bool) : Determines whether tokens are packed or repeated. % TODO is it possible to achieve the proper semidet without the cut? +% Annie sez some parses are ambiguous, not even sure the cut should be +% there tokenize(Text, Tokens, Options) :- must_be(nonvar, Text), @@ -138,6 +145,8 @@ % % If dcg functor is identical to the option name with 'opt_' prefixed, % then the dcg functor can be omitted. +% +% opt(Opt, Default) --> { atom_concat('opt_', Opt, Opt_DCG) }, @@ -160,7 +169,7 @@ var(Default), \+ option(Opt, Opts), writeln("Unknown options passed to opt//3: "), write(Opt) - }. + }. % TODO use print_message for this %% non_opt(+DCG). % @@ -208,11 +217,12 @@ opt_pack(true) --> state(T0, T1), { phrase(pack_tokens(T1), T0) }. - - -%% POST PROCESSING + /******************************* + * POST_PROCESSING * + *******************************/ %% Convert tokens to alternative representations. +token_to(_, number(X), number(X)) :- !. token_to(Type, Token, Converted) :- ( Type == strings -> Conversion = inverse(string_codes) ; Type == atoms -> Conversion = inverse(atom_codes) @@ -231,22 +241,26 @@ pack(X, Count) --> [X], pack(X, 1, Count). -pack(_, Total, Total) --> call(eos). +pack(_, Total, Total) --> eos. pack(X, Total, Total), [Y] --> [Y], { Y \= X }. pack(X, Count, Total) --> [X], { succ(Count, NewCount) }, pack(X, NewCount, Total). -% PARSING + /******************************* + * PARSING * + *******************************/ + -tokens([T]) --> token(T), call(eos), !. +tokens([T]) --> token(T), eos, !. tokens([T|Ts]) --> token(T), tokens(Ts). % NOTE for debugging % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. -token(word(W)) --> word(W), call(eos), !. +token(number(N)) --> number(N), !. +token(word(W)) --> word(W), eos, !. token(word(W)),` ` --> word(W), ` `. token(word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). token(spc(S)) --> spc(S). @@ -258,7 +272,7 @@ spc(` `) --> ` `. sep --> ' '. -sep --> call(eos), !. +sep --> eos, !. word(W) --> csyms(W). @@ -269,7 +283,7 @@ % non ascii's -nasciis([C]) --> nascii(C), (call(eos), !). +nasciis([C]) --> nascii(C), eos, !. nasciis([C]),[D] --> nascii(C), [D], {D < 127}. nasciis([C|Cs]) --> nascii(C), nasciis(Cs). @@ -286,8 +300,6 @@ punct([P]) --> [P], {code_type(P, punct)}. cntrl([C]) --> [C], {code_type(C, cntrl)}. -eos([], []). - %% move to general module codes_to_lower([], []). diff --git a/test/test.pl b/test/test.pl index ed6de19..f7f281f 100755 --- a/test/test.pl +++ b/test/test.pl @@ -23,4 +23,31 @@ string_codes(Actual, Codes), Expected = "Goodbye, Tokenize!". + +test('tokenize 7.0', + [true(Actual == Expected)] + ) :- + tokenize("7.0", Actual), + Expected = [number(7.0)]. 
+ +test('untokenize 6.3', + [true(Actual == Expected)] + ) :- + untokenize([number(6.3)], Actual), + Expected = `6.3`. + + +test('tokenize number in other stuff', + [true(Actual == Expected)] + ) :- + tokenize("hi 7.0 x", Actual), + Expected = [word(hi), spc(' '), number(7.0), spc(' '), word(x)]. + +test('untokenize 6.3 in other stuff', + [true(Actual == Expected)] + ) :- + untokenize([word(hi), number(6.3)], Actual), + Expected = `hi6.3`. + + :- end_tests(tokenize). From 8e4e98fbce697bce46522b768be3e4bedeabe6b7 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 08:14:56 -0400 Subject: [PATCH 10/25] Improve comments and code ordering Signed-off-by: Shon Feder --- prolog/tokenize.pl | 40 ++++++++++++++++++++++++++++------------ 1 file changed, 28 insertions(+), 12 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index e94f0aa..b282bce 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -133,8 +133,11 @@ %% Dispatches dcgs by option-list functors, with default values. process_options --> + % Preprocessing opt(cased, false), + % Tokenization non_opt(tokenize_text), + % Postprocessing opt(spaces, true), opt(cntrl, true), opt(punct, true), @@ -184,7 +187,12 @@ state(S0), [S0] --> [S0]. state(S0, S1), [S1] --> [S0]. -%% Dispatching options: + +% Dispatching the option pipeline options: + + /*************************** + * PREPROCESSING * + ***************************/ opt_cased(true) --> []. opt_cased(false) --> state(Text, LowerCodes), @@ -194,8 +202,10 @@ string_codes(LowerStr, LowerCodes) }. -tokenize_text --> state(Text, Tokenized), - { phrase(tokens(Tokenized), Text) }. + + /*************************** + * POSTPROCESSING * + ***************************/ opt_spaces(true) --> []. opt_spaces(false) --> state(T0, T1), @@ -217,11 +227,8 @@ opt_pack(true) --> state(T0, T1), { phrase(pack_tokens(T1), T0) }. - /******************************* - * POST_PROCESSING * - *******************************/ -%% Convert tokens to alternative representations. +% Convert tokens to alternative representations. token_to(_, number(X), number(X)) :- !. token_to(Type, Token, Converted) :- ( Type == strings -> Conversion = inverse(string_codes) @@ -232,8 +239,11 @@ call_into_term(Conversion, Token, Converted). -%% Packing repeating tokens -% + /*********************************** + * POSTPROCESSING HELPERS * + ***********************************/ + +% Packing repeating tokens pack_tokens([T]) --> pack_token(T). pack_tokens([T|Ts]) --> pack_token(T), pack_tokens(Ts). @@ -247,11 +257,15 @@ pack(X, NewCount, Total). + /************************** + * TOKENIZATION * + **************************/ + +tokenize_text --> state(Text, Tokenized), + { phrase(tokens(Tokenized), Text) }. - /******************************* - * PARSING * - *******************************/ +% PARSING tokens([T]) --> token(T), eos, !. tokens([T|Ts]) --> token(T), tokens(Ts). @@ -292,6 +306,8 @@ ' ' --> space. ' ' --> space, ' '. + +% Any ... --> []. ... --> [_], ... . 
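With patches 9 and 10 applied, number tokens are in place and the module has been reorganised without changing behaviour. The abridged toplevel transcript below just restates the tests added above (space tokens are still the pre-rename `spc/1` form at this point in the series):

```prolog
?- [tokenize].
true.

% Numbers are matched via number//1 from library(dcg/basics):
?- tokenize("7.0", Tokens).
Tokens = [number(7.0)].

% ...and they compose with the existing word and space tokens:
?- tokenize("hi 7.0 x", Tokens).
Tokens = [word(hi), spc(' '), number(7.0), spc(' '), word(x)].

% untokenize/2 turns number tokens back into codes:
?- untokenize([word(hi), number(6.3)], Codes), atom_codes(Atom, Codes).
Atom = 'hi6.3'.
```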
From b52a48a8c772b0abcc848da468d9278e0725c946 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 22:05:51 -0400 Subject: [PATCH 11/25] Add tokenization of strings Closes #9 Signed-off-by: Shon Feder --- prolog/tokenize.pl | 26 +++++++++++++++++++++++++- test/test.pl | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+), 1 deletion(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index b282bce..5d1d053 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -273,16 +273,18 @@ % NOTE for debugging % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. +token(string(S)) --> string(S). token(number(N)) --> number(N), !. + token(word(W)) --> word(W), eos, !. token(word(W)),` ` --> word(W), ` `. token(word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). + token(spc(S)) --> spc(S). token(punct(P)) --> punct(P). token(cntrl(C)) --> cntrl(C). token(other(O)) --> nasciis(O). - spc(` `) --> ` `. sep --> ' '. @@ -290,6 +292,27 @@ word(W) --> csyms(W). +% TODO Make strings optional +% TODO Make open and close brackets configurable +string(S) --> string(`"`, `"`, S). +string(OpenBracket, CloseBracket, S) --> string_start(OpenBracket, CloseBracket, S). + +% A string starts when we encounter an OpenBracket +string_start(OpenBracket, CloseBracket, Cs) --> + OpenBracket, string_content(CloseBracket, Cs). + +% String content is everything up until we hit a CloseBracket +string_content(CloseBracket, []) --> CloseBracket, !. +% String content includes any character that isn't a CloseBracket or an escape. +string_content(CloseBracket, [C|Cs]) --> + [C], + {[C] \= CloseBracket, [C] \= `\\`}, + string_content(CloseBracket, Cs). +% String content includes any character following an escape, but not the escape +string_content(CloseBracket, [C|Cs]) --> + escape, [C], + string_content(CloseBracket, Cs). + csyms([L]) --> csym(L). csyms([L|Ls]) --> csym(L), csyms(Ls). @@ -306,6 +329,7 @@ ' ' --> space. ' ' --> space, ' '. +escape --> `\\`. % Any ... --> []. diff --git a/test/test.pl b/test/test.pl index f7f281f..a4ac2f2 100755 --- a/test/test.pl +++ b/test/test.pl @@ -24,6 +24,8 @@ Expected = "Goodbye, Tokenize!". +% NUMBERS + test('tokenize 7.0', [true(Actual == Expected)] ) :- @@ -50,4 +52,37 @@ Expected = `hi6.3`. +% STRINGS + +test('Extracts a string', + [true(Actual == Expected)] + ) :- + tokenize("\"a string\"", Actual), + Expected = [string('a string')]. + +test('Extracts a string among other stuff', + [true(Actual == Expected)] + ) :- + tokenize("Some other \"a string\" stuff", Actual), + Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. + +test("Extracts a string that includes escaped brackets", + [true(Actual == Expected)] + ) :- + tokenize(`"a \\"string\\""`, Actual), + Expected = [string('a "string"')]. + +test("Extracts a string that includes a doubly nested string", + [true(Actual == Expected)] + ) :- + tokenize(`"a \\"sub \\\\\\"string\\\\\\"\\""`, Actual), + Expected = [string('a "sub \\"string\\""')]. + +test("Untokenizes string things", + [true(Actual == Expected)] + ) :- + untokenize([string('some string')], ActualCodes), + string_codes(Actual, ActualCodes), + Expected = "\"some string\"". + :- end_tests(tokenize). From 5935798eaee0345e00ee11200f8d98153fa545d4 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 22:11:51 -0400 Subject: [PATCH 12/25] Remove tabs Yuck. How did tabs get in here! 
Signed-off-by: Shon Feder --- prolog/tokenize.pl | 44 ++++++++++++++++++++++---------------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 5d1d053..b0ee5d6 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -48,23 +48,23 @@ % % A token is one of: % -% * a word (contiguous alpha-numeric chars): `word(W)` -% * a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` -% * a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` -% * a space ( == ` `): `spc(S)`. +%* a word (contiguous alpha-numeric chars): `word(W)` +%* a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` +%* a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` +%* a space ( == ` `): `spc(S)`. % % Valid options are: % -% * cased(+bool) : Determines whether tokens perserve cases of the source text. -% * spaces(+bool) : Determines whether spaces are represted as tokens +%* cased(+bool) : Determines whether tokens perserve cases of the source text. +%* spaces(+bool) : Determines whether spaces are represted as tokens % or discarded. -% * cntrl(+bool) : Determines whether control characters are represented +%* cntrl(+bool) : Determines whether control characters are represented % as tokens or discarded. -% * punct(+bool) : Determines whether punctuation characters are represented +%* punct(+bool) : Determines whether punctuation characters are represented % as tokens or discarded. -% * to(+one_of([strings,atoms,chars,codes])) : Determines the +%* to(+one_of([strings,atoms,chars,codes])) : Determines the % representation format used for the tokens. -% * pack(+bool) : Determines whether tokens are packed or repeated. +%* pack(+bool) : Determines whether tokens are packed or repeated. % TODO is it possible to achieve the proper semidet without the cut? % Annie sez some parses are ambiguous, not even sure the cut should be @@ -190,9 +190,9 @@ % Dispatching the option pipeline options: - /*************************** - * PREPROCESSING * - ***************************/ +/*************************** +* PREPROCESSING * +***************************/ opt_cased(true) --> []. opt_cased(false) --> state(Text, LowerCodes), @@ -203,9 +203,9 @@ }. - /*************************** - * POSTPROCESSING * - ***************************/ +/*************************** +* POSTPROCESSING * +***************************/ opt_spaces(true) --> []. opt_spaces(false) --> state(T0, T1), @@ -239,9 +239,9 @@ call_into_term(Conversion, Token, Converted). - /*********************************** - * POSTPROCESSING HELPERS * - ***********************************/ +/*********************************** +* POSTPROCESSING HELPERS * +***********************************/ % Packing repeating tokens pack_tokens([T]) --> pack_token(T). @@ -257,9 +257,9 @@ pack(X, NewCount, Total). - /************************** - * TOKENIZATION * - **************************/ +/************************** +* TOKENIZATION * +**************************/ tokenize_text --> state(Text, Tokenized), { phrase(tokens(Tokenized), Text) }. 
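Patch 12 is whitespace-only, so the observable behaviour at this point is that of patch 11: a double-quoted span comes back as a single `string/1` token, and backslash-escaped quotes are folded into it. The queries below are the new tests restated in (abridged) toplevel form:

```prolog
% A quoted span becomes one token alongside the usual word and space tokens:
?- tokenize(`Some other "a string" stuff`, Tokens).
Tokens = [word(some), spc(' '), word(other), spc(' '), string('a string'), spc(' '), word(stuff)].

% Escaped quotes do not terminate the string token:
?- tokenize(`"a \\"string\\""`, Tokens).
Tokens = [string('a "string"')].
```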
From f8d6db7e2a1448ba33cb0af3a1c4b1379b4c88c1 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 22:18:24 -0400 Subject: [PATCH 13/25] Fix indentation of comment bullet points Signed-off-by: Shon Feder --- prolog/tokenize.pl | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index b0ee5d6..0299666 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -48,23 +48,24 @@ % % A token is one of: % -%* a word (contiguous alpha-numeric chars): `word(W)` -%* a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` -%* a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` -%* a space ( == ` `): `spc(S)`. +% * a word (contiguous alpha-numeric chars): `word(W)` +% * a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` +% * a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` +% * a space ( == ` `): `spc(S)`. % -% Valid options are: +% Valid options are: % -%* cased(+bool) : Determines whether tokens perserve cases of the source text. -%* spaces(+bool) : Determines whether spaces are represted as tokens -% or discarded. -%* cntrl(+bool) : Determines whether control characters are represented -% as tokens or discarded. -%* punct(+bool) : Determines whether punctuation characters are represented -% as tokens or discarded. -%* to(+one_of([strings,atoms,chars,codes])) : Determines the -% representation format used for the tokens. -%* pack(+bool) : Determines whether tokens are packed or repeated. +% * cased(+bool) : Determines whether tokens perserve cases of the source +% text. +% * spaces(+bool) : Determines whether spaces are represted as tokens or +% discarded. +% * cntrl(+bool) : Determines whether control characters are represented as +% tokens or discarded. +% * punct(+bool) : Determines whether punctuation characters are represented +% as tokens or discarded. +% * pack(+bool) : Determines whether tokens are packed or repeated. +% * to(+one_of([strings,atoms,chars,codes])) : Determines the representation +% format used for the tokens. % TODO is it possible to achieve the proper semidet without the cut? % Annie sez some parses are ambiguous, not even sure the cut should be From 9091f096a2f70e26b84d59f17098528e2b6dd56b Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 19 May 2019 17:21:47 -0400 Subject: [PATCH 14/25] Catch edge cases and preserve escaped characters in strings Thanks to @itampe for catching these in review. Signed-off-by: Shon Feder --- prolog/tokenize.pl | 19 ++++++++++--------- test/test.pl | 32 +++++++++++++++++++++++++++++++- 2 files changed, 41 insertions(+), 10 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 0299666..260d8e7 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -300,19 +300,20 @@ % A string starts when we encounter an OpenBracket string_start(OpenBracket, CloseBracket, Cs) --> - OpenBracket, string_content(CloseBracket, Cs). + OpenBracket, string_content(OpenBracket, CloseBracket, Cs). % String content is everything up until we hit a CloseBracket -string_content(CloseBracket, []) --> CloseBracket, !. +string_content(_OpenBracket, CloseBracket, []) --> CloseBracket, !. +% String content includes a bracket following an escape, but not the escape +string_content(OpenBracket, CloseBracket, [C|Cs]) --> + escape, (CloseBracket | OpenBracket), + {[C] = CloseBracket}, + string_content(OpenBracket, CloseBracket, Cs). 
% String content includes any character that isn't a CloseBracket or an escape. -string_content(CloseBracket, [C|Cs]) --> +string_content(OpenBracket, CloseBracket, [C|Cs]) --> [C], - {[C] \= CloseBracket, [C] \= `\\`}, - string_content(CloseBracket, Cs). -% String content includes any character following an escape, but not the escape -string_content(CloseBracket, [C|Cs]) --> - escape, [C], - string_content(CloseBracket, Cs). + {[C] \= CloseBracket}, + string_content(OpenBracket, CloseBracket, Cs). csyms([L]) --> csym(L). csyms([L|Ls]) --> csym(L), csyms(Ls). diff --git a/test/test.pl b/test/test.pl index a4ac2f2..b857cfb 100755 --- a/test/test.pl +++ b/test/test.pl @@ -54,6 +54,30 @@ % STRINGS +test('Tokenizing the empty strings', + [true(Actual == Expected)] + ) :- + tokenize(`""`, Actual), + Expected = [string('')]. + +test('Untokenizing an empty string', + [true(Actual == Expected)] + ) :- + untokenize([string('')], Actual), + Expected = `""`. + +test('Tokenizing a string with just two escapes', + [true(Actual == Expected)] + ) :- + tokenize(`"\\\\"`, Actual), + Expected = [string('\\\\')]. + +test('Untokenizing a string with just two characters', + [true(Actual == Expected)] + ) :- + untokenize([string('aa')], Actual), + Expected = `"aa"`. + test('Extracts a string', [true(Actual == Expected)] ) :- @@ -72,10 +96,16 @@ tokenize(`"a \\"string\\""`, Actual), Expected = [string('a "string"')]. +test("Tokenization preserves escaped characters", + [true(Actual == Expected)] + ) :- + tokenize(`"\\tLine text\\n"`, Actual), + Expected = [string('\\tline text\\n')]. + test("Extracts a string that includes a doubly nested string", [true(Actual == Expected)] ) :- - tokenize(`"a \\"sub \\\\\\"string\\\\\\"\\""`, Actual), + tokenize(`"a \\"sub \\\\"string\\\\"\\""`, Actual), Expected = [string('a "sub \\"string\\""')]. test("Untokenizes string things", From 53c83c03a0e71ec5a3bdf5c2c5eae762b0b4c8e4 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 19 May 2019 18:48:15 -0400 Subject: [PATCH 15/25] Use code lists consistently for readability in tests Signed-off-by: Shon Feder --- test/test.pl | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/test/test.pl b/test/test.pl index b857cfb..1519620 100755 --- a/test/test.pl +++ b/test/test.pl @@ -81,13 +81,13 @@ test('Extracts a string', [true(Actual == Expected)] ) :- - tokenize("\"a string\"", Actual), + tokenize(`"a string"`, Actual), Expected = [string('a string')]. test('Extracts a string among other stuff', [true(Actual == Expected)] ) :- - tokenize("Some other \"a string\" stuff", Actual), + tokenize(`Some other "a string" stuff`, Actual), Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. test("Extracts a string that includes escaped brackets", @@ -111,8 +111,7 @@ test("Untokenizes string things", [true(Actual == Expected)] ) :- - untokenize([string('some string')], ActualCodes), - string_codes(Actual, ActualCodes), - Expected = "\"some string\"". + untokenize([string('some string')], Actual), + Expected = `"some string"`. :- end_tests(tokenize). 
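After the escape-handling fixes of patch 14 (and the cosmetic switch to code-list literals in patch 15), the contract for string tokens is: the enclosing quotes are dropped, escaped quotes lose their backslash, and any other escape sequence is preserved verbatim. Two of the tests above, restated as abridged toplevel queries:

```prolog
% Escapes other than the quote are kept as-is inside the token
% (the default cased(false) also lower-cases the letters):
?- tokenize(`"\\tLine text\\n"`, Tokens).
Tokens = [string('\\tline text\\n')].

% untokenize/2 puts the surrounding quotes back:
?- untokenize([string('some string')], Codes), format("~s~n", [Codes]).
"some string"
```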
From 1e4a002d6e6fadc3346b0f8388604c19397992f8 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 19 Jun 2019 07:51:03 -0400 Subject: [PATCH 16/25] Add CircleCI badge to README --- README.md | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index b8c0b73..b033847 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,11 @@ -# Synopsis +# `pack(tokenize)` + +A modest tokenization library for SWI-Prolog, seeking a balance between +simplicity and flexibility. + +[![CircleCI](https://circleci.com/gh/shonfeder/tokenize.svg?style=svg)](https://circleci.com/gh/shonfeder/tokenize) + +## Synopsis ```prolog ?- tokenize(`\tExample Text.`, Tokens). @@ -13,7 +20,7 @@ Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.') Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] ``` -# Description +## Description Module `tokenize` aims to provide a straightforward tool for tokenizing text into a simple format. It is the result of a learning exercise, and it is far from perfect. If there is sufficient interest from myself or anyone else, I'll try to improve it. @@ -25,6 +32,6 @@ It is packaged as an SWI-Prolog pack, available [here](http://www.swi-prolog.org Please [visit the wiki](https://github.com/aBathologist/tokenize/wiki/tokenize.pl-options-and-examples) for more detailed instructions and examples, including a full list of options supported. -# Contributing +## Contributing See [CONTRIBUTING.md](./CONTRIBUTING.md). From 4a72e6089044d722bea0e67ce951db7398f1988a Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 19 Jun 2019 08:01:31 -0400 Subject: [PATCH 17/25] Tweak README title --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b033847..79f0e61 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# `pack(tokenize)` +# `pack(tokenize) :-` A modest tokenization library for SWI-Prolog, seeking a balance between simplicity and flexibility. From 645b9d7542f86db88598a98a4d84aedeee47fff3 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 08:17:50 -0400 Subject: [PATCH 18/25] Use conventional option processing The record-based approach used here is endorsed in https://eu.swi-prolog.org/pldoc/man?section=option --- prolog/tokenize.pl | 162 ++++++++++++++-------------------------- prolog/tokenize_opts.pl | 32 ++++++++ test/test.pl | 19 +++++ 3 files changed, 107 insertions(+), 106 deletions(-) create mode 100644 prolog/tokenize_opts.pl diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 260d8e7..a03d06f 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -24,9 +24,11 @@ text. */ + :- use_module(library(dcg/basics), [eos//0, number//1]). +:- use_module(tokenize_opts). -% Ensure we interpret backs as enclosing code lists in this module. +% Ensure we interpret back ticks as enclosing code lists in this module. :- set_prolog_flag(back_quotes, codes). %% tokenize(+Text:list(code), -Tokens:list(term)) is semidet. @@ -67,14 +69,17 @@ % * to(+one_of([strings,atoms,chars,codes])) : Determines the representation % format used for the tokens. -% TODO is it possible to achieve the proper semidet without the cut? +% TODO is it possible to achieve the proper semidet without the cut? 
% Annie sez some parses are ambiguous, not even sure the cut should be % there -tokenize(Text, Tokens, Options) :- +tokenize(Text, ProcessedTokens, Options) :- must_be(nonvar, Text), string_codes(Text, Codes), - phrase(process_options, [Options-Codes], [Options-Tokens]), + process_options(Options, PreOpts, PostOpts), + preprocess(PreOpts, Codes, ProcessedCodes), + phrase(tokens(Tokens), ProcessedCodes), + postprocess(PostOpts, Tokens, ProcessedTokens), !. %% untokenize(+Tokens:list(term), -Untokens:list(codes)) is semidet. @@ -123,111 +128,59 @@ read_file_to_codes(File, Codes, [encoding(utf8)]), tokenize(Codes, Tokens, Options). -% PROCESSING OPTIONS -% -% NOTE: This way of processing options is probably stupid. -% I will correct/improve/rewrite it if there is ever a good -% reason to. But for now, it works. -% -% TODO: Throw exception if invalid options are passed in. -% At the moment it just fails. - -%% Dispatches dcgs by option-list functors, with default values. -process_options --> - % Preprocessing - opt(cased, false), - % Tokenization - non_opt(tokenize_text), - % Postprocessing - opt(spaces, true), - opt(cntrl, true), - opt(punct, true), - opt(to, atoms), - opt(pack, false). - -%% opt(+OptionFunctor:atom, DefaultValue:nonvar) -% -% If dcg functor is identical to the option name with 'opt_' prefixed, -% then the dcg functor can be omitted. -% -% - -opt(Opt, Default) --> - { atom_concat('opt_', Opt, Opt_DCG) }, - opt(Opt, Default, Opt_DCG). - -%% opt(+OptionFunctor:atom, +DefaultValue:nonvar, +DCGFunctor:atom). -opt(Opt, Default, DCG) --> - state(Opts-Text0, Text0), - { - pad(Opt, Selection, Opt_Selection), - option(Opt_Selection, Opts, Default), - DCG_Selection =.. [DCG, Selection] - }, - DCG_Selection, - state(Text1, Opts-Text1). -%% This ugly bit should be dispensed with... -opt(Opt, Default, _) --> - state(Opts-_), - { - var(Default), \+ option(Opt, Opts), - writeln("Unknown options passed to opt//3: "), - write(Opt) - }. % TODO use print_message for this - -%% non_opt(+DCG). -% -% Non optional dcg to dispatch. Passes the object of concern -% without the options list, then recovers option list. - -non_opt(DCG) --> - state(Opts-Text0, Text0), - DCG, - state(Text1, Opts-Text1). - -state(S0), [S0] --> [S0]. -state(S0, S1), [S1] --> [S0]. - - -% Dispatching the option pipeline options: - -/*************************** -* PREPROCESSING * -***************************/ - -opt_cased(true) --> []. -opt_cased(false) --> state(Text, LowerCodes), - { - text_to_string(Text, Str), - string_lower(Str, LowerStr), - string_codes(LowerStr, LowerCodes) - }. +/*********************************** +* {PRE,POST}-PROCESSING HELPERS * +***********************************/ -/*************************** -* POSTPROCESSING * -***************************/ +preprocess(PreOpts, Codes, ProcessedCodes) :- + preopts_data(cased, PreOpts, Cased), + DCG_Rules = ( + preprocess_case(Cased) + ), + phrase(process_dcg_rules(DCG_Rules, ProcessedCodes), Codes). + +postprocess(PostOpts, Tokens, ProcessedTokens) :- + postopts_data(spaces, PostOpts, Spaces), + postopts_data(cntrl, PostOpts, Cntrl), + postopts_data(punct, PostOpts, Punct), + postopts_data(to, PostOpts, To), + postopts_data(pack, PostOpts, Pack), + DCG_Rules = ( + keep_token(space(_), Spaces), + keep_token(cntrl(_), Cntrl), + keep_token(punct(_), Punct), + convert_token(To) + ), + phrase(process_dcg_rules(DCG_Rules, PrePackedTokens), Tokens), + (Pack + -> phrase(pack_tokens(ProcessedTokens), PrePackedTokens) + ; ProcessedTokens = PrePackedTokens + ). 
-opt_spaces(true) --> []. -opt_spaces(false) --> state(T0, T1), - { exclude( =(spc(_)), T0, T1) }. -opt_cntrl(true) --> []. -opt_cntrl(false) --> state(T0, T1), - { exclude( =(cntrl(_)), T0, T1) }. +/*********************************** +* POSTPROCESSING HELPERS * +***********************************/ -opt_punct(true) --> []. -opt_punct(false) --> state(T0, T1), - { exclude( =(punct(_)), T0, T1) }. +% Process a stream through a pipeline of DCG rules +process_dcg_rules(_, []) --> eos, !. +process_dcg_rules(DCG_Rules, []) --> DCG_Rules, eos, !. +process_dcg_rules(DCG_Rules, [C|Cs]) --> + DCG_Rules, + [C], + process_dcg_rules(DCG_Rules, Cs). -opt_to(codes) --> []. -opt_to(Type) --> state(CodeTokens, Tokens), - { maplist(token_to(Type), CodeTokens, Tokens) }. +preprocess_case(true), [C] --> [C]. +preprocess_case(false), [CodeOut] --> [CodeIn], + { to_lower(CodeIn, CodeOut) }. -opt_pack(false) --> []. -opt_pack(true) --> state(T0, T1), - { phrase(pack_tokens(T1), T0) }. +keep_token(_, true), [T] --> [T]. +keep_token(Token, false) --> [Token]. +keep_token(Token, false), [T] --> [T], {T \= Token}. +convert_token(Type), [Converted] --> [Token], + {token_to(Type, Token, Converted)}. % Convert tokens to alternative representations. token_to(_, number(X), number(X)) :- !. @@ -239,11 +192,6 @@ ), call_into_term(Conversion, Token, Converted). - -/*********************************** -* POSTPROCESSING HELPERS * -***********************************/ - % Packing repeating tokens pack_tokens([T]) --> pack_token(T). pack_tokens([T|Ts]) --> pack_token(T), pack_tokens(Ts). @@ -275,6 +223,8 @@ % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. token(string(S)) --> string(S). + +% TODO Make numbers optional token(number(N)) --> number(N), !. token(word(W)) --> word(W), eos, !. diff --git a/prolog/tokenize_opts.pl b/prolog/tokenize_opts.pl new file mode 100644 index 0000000..b1e8c06 --- /dev/null +++ b/prolog/tokenize_opts.pl @@ -0,0 +1,32 @@ +:- module(tokenize_opts, + [process_options/3, + preopts_data/3, + postopts_data/3]). + +:- use_module(library(record)). + +% pre-processing options +:- record preopts( + cased:boolean=false + ). + +% post-processing options +:- record postopts( + spaces:boolean=true, + cntrl:boolean=true, + punct:boolean=true, + to:oneof([strings,atoms,chars,codes])=atoms, + pack:boolean=false + ). + +%% process_options(+Options:list(term), -PreOpts:term, -PostOpts:term) is semidet. +% +process_options(Options, PreOpts, PostOpts) :- + make_preopts(Options, PreOpts, Rest), + make_postopts(Rest, PostOpts, InvalidOptions), + throw_on_invalid_options(InvalidOptions). + +throw_on_invalid_options(InvalidOptions) :- + InvalidOptions \= [] + -> throw(invalid_options_given(InvalidOptions)) + ; true. diff --git a/test/test.pl b/test/test.pl index 1519620..405cab7 100755 --- a/test/test.pl +++ b/test/test.pl @@ -7,6 +7,8 @@ asserta(user:file_search_path(package, PackageDir)). :- use_module(package(tokenize)). +:- use_module(package(tokenize_opts)). + :- begin_tests(tokenize). test('Hello, Tokenize!', @@ -24,6 +26,23 @@ Expected = "Goodbye, Tokenize!". +% OPTION PROCESSING + +test('process_options/3 throws on invalid options') :- + catch( + process_options([invalid(true)], _, _), + invalid_options_given([invalid(true)]), + true + ). 
+ +test('process_options/3 sets valid options in opt records') :- + Options = [cased(false), spaces(false)], + process_options(Options, PreOpts, PostOpts), + preopts_data(cased, PreOpts, Cased), + postopts_data(spaces, PostOpts, Spaces), + assertion(cased:Cased == cased:false), + assertion(spaces:Spaces == spaces:false). + % NUMBERS test('tokenize 7.0', From e6877a32f46ba2deccef841b382529e2adcb57ac Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 21:30:47 -0400 Subject: [PATCH 19/25] Make string and number tokens optional --- prolog/tokenize.pl | 32 +++++++++++++++++--------------- prolog/tokenize_opts.pl | 24 ++++++++++++++++-------- test/test.pl | 39 +++++++++++++++++++++++++++++---------- 3 files changed, 62 insertions(+), 33 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index a03d06f..a92a7ac 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -76,9 +76,9 @@ tokenize(Text, ProcessedTokens, Options) :- must_be(nonvar, Text), string_codes(Text, Codes), - process_options(Options, PreOpts, PostOpts), + process_options(Options, PreOpts, TokenOpts, PostOpts), preprocess(PreOpts, Codes, ProcessedCodes), - phrase(tokens(Tokens), ProcessedCodes), + phrase(tokens(TokenOpts, Tokens), ProcessedCodes), postprocess(PostOpts, Tokens, ProcessedTokens), !. @@ -216,25 +216,28 @@ % PARSING -tokens([T]) --> token(T), eos, !. -tokens([T|Ts]) --> token(T), tokens(Ts). +tokens(Opts, [T]) --> token(Opts, T), eos, !. +tokens(Opts, [T|Ts]) --> token(Opts, T), tokens(Opts, Ts). % NOTE for debugging % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. -token(string(S)) --> string(S). +token(Opts, string(S)) --> + { tokenopts_data(strings, Opts, true) }, + string(S). -% TODO Make numbers optional -token(number(N)) --> number(N), !. +token(Opts, number(N)) --> + { tokenopts_data(numbers, Opts, true) }, + number(N), !. -token(word(W)) --> word(W), eos, !. -token(word(W)),` ` --> word(W), ` `. -token(word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). +token(_Opts, word(W)) --> word(W), eos, !. +token(_Opts, word(W)),` ` --> word(W), ` `. +token(_Opts, word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). -token(spc(S)) --> spc(S). -token(punct(P)) --> punct(P). -token(cntrl(C)) --> cntrl(C). -token(other(O)) --> nasciis(O). +token(_Opts, spc(S)) --> spc(S). +token(_Opts, punct(P)) --> punct(P). +token(_Opts, cntrl(C)) --> cntrl(C). +token(_Opts, other(O)) --> nasciis(O). spc(` `) --> ` `. @@ -243,7 +246,6 @@ word(W) --> csyms(W). -% TODO Make strings optional % TODO Make open and close brackets configurable string(S) --> string(`"`, `"`, S). string(OpenBracket, CloseBracket, S) --> string_start(OpenBracket, CloseBracket, S). diff --git a/prolog/tokenize_opts.pl b/prolog/tokenize_opts.pl index b1e8c06..688077e 100644 --- a/prolog/tokenize_opts.pl +++ b/prolog/tokenize_opts.pl @@ -1,6 +1,7 @@ :- module(tokenize_opts, - [process_options/3, + [process_options/4, preopts_data/3, + tokenopts_data/3, postopts_data/3]). :- use_module(library(record)). @@ -10,6 +11,12 @@ cased:boolean=false ). +% tokenization options +:- record tokenopts( + numbers:boolean=true, + strings:boolean=true + ). + % post-processing options :- record postopts( spaces:boolean=true, @@ -21,12 +28,13 @@ %% process_options(+Options:list(term), -PreOpts:term, -PostOpts:term) is semidet. % -process_options(Options, PreOpts, PostOpts) :- - make_preopts(Options, PreOpts, Rest), - make_postopts(Rest, PostOpts, InvalidOptions), - throw_on_invalid_options(InvalidOptions). 
+process_options(Options, PreOpts, TokenOpts, PostOpts) :- + make_preopts(Options, PreOpts, Rest0), + make_postopts(Rest0, PostOpts, Rest1), + make_tokenopts(Rest1, TokenOpts, InvalidOpts), + throw_on_invalid_options(InvalidOpts). -throw_on_invalid_options(InvalidOptions) :- - InvalidOptions \= [] - -> throw(invalid_options_given(InvalidOptions)) +throw_on_invalid_options(InvalidOpts) :- + InvalidOpts \= [] + -> throw(invalid_options_given(InvalidOpts)) ; true. diff --git a/test/test.pl b/test/test.pl index 405cab7..7d58dc1 100755 --- a/test/test.pl +++ b/test/test.pl @@ -28,19 +28,27 @@ % OPTION PROCESSING -test('process_options/3 throws on invalid options') :- +test('process_options/4 throws on invalid options') :- catch( - process_options([invalid(true)], _, _), + process_options([invalid(true)], _, _, _), invalid_options_given([invalid(true)]), true ). -test('process_options/3 sets valid options in opt records') :- - Options = [cased(false), spaces(false)], - process_options(Options, PreOpts, PostOpts), +test('process_options/4 sets valid options in opt records') :- + Options = [ + cased(false), % non-default preopt + strings(false), % non-default tokenopt + spaces(false) % non-default postopt + ], + process_options(Options, PreOpts, TokenOpts, PostOpts), + % Fetch the options that were set preopts_data(cased, PreOpts, Cased), + tokenopts_data(strings, TokenOpts, Strings), postopts_data(spaces, PostOpts, Spaces), + % These compounds are just ensure informative output on failure assertion(cased:Cased == cased:false), + assertion(strings:Strings == strings:false), assertion(spaces:Spaces == spaces:false). % NUMBERS @@ -57,7 +65,6 @@ untokenize([number(6.3)], Actual), Expected = `6.3`. - test('tokenize number in other stuff', [true(Actual == Expected)] ) :- @@ -70,6 +77,12 @@ untokenize([word(hi), number(6.3)], Actual), Expected = `hi6.3`. +test('can disable number tokens', + [true(Actual == Expected)] + ) :- + tokenize("hi 7.0 x", Actual, [numbers(false)]), + Expected = [word(hi), spc(' '), word('7'), punct('.'), word('0'), spc(' '), word(x)]. + % STRINGS @@ -109,25 +122,31 @@ tokenize(`Some other "a string" stuff`, Actual), Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. -test("Extracts a string that includes escaped brackets", +test('Extracts a string that includes escaped brackets', [true(Actual == Expected)] ) :- tokenize(`"a \\"string\\""`, Actual), Expected = [string('a "string"')]. -test("Tokenization preserves escaped characters", +test('Tokenization preserves escaped characters', [true(Actual == Expected)] ) :- tokenize(`"\\tLine text\\n"`, Actual), Expected = [string('\\tline text\\n')]. -test("Extracts a string that includes a doubly nested string", +test('Extracts a string that includes a doubly nested string', [true(Actual == Expected)] ) :- tokenize(`"a \\"sub \\\\"string\\\\"\\""`, Actual), Expected = [string('a "sub \\"string\\""')]. -test("Untokenizes string things", +test('can disable string tokens', + [true(Actual == Expected)] + ) :- + tokenize(`some "string".`, Actual, [numbers(false)]), + Expected = [word(some), spc(' '), string(string), punct('.')]. 
+ +test('Untokenizes string things', [true(Actual == Expected)] ) :- untokenize([string('some string')], Actual), From 3349b9b666c47fe5ea41e28d9c91ba80854a9123 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 22:39:39 -0400 Subject: [PATCH 20/25] Rename 'spc' token to 'space' --- CONTRIBUTING.md | 4 ++-- README.md | 6 +++--- design_notes.md | 2 +- prolog/tokenize.pl | 6 +++--- test/test.pl | 12 ++++++------ 5 files changed, 15 insertions(+), 15 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1895faa..d1ae63f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -42,7 +42,7 @@ design philosophy and principles. % experiment ?- tokenize("Foo bar baz", Tokens). - Tokens = [word(foo), spc(' '), word(bar), spc(' '), word(baz)]. + Tokens = [word(foo), space(' '), word(bar), space(' '), word(baz)]. % reload the module when you make changes to the source code ?- make. @@ -75,4 +75,4 @@ If inside the swipl repl, make sure to load the test file and query run_tests. % PL-Unit: tokenize .. done % All 2 tests passed true. -``` \ No newline at end of file +``` diff --git a/README.md b/README.md index 79f0e61..47ac380 100644 --- a/README.md +++ b/README.md @@ -9,14 +9,14 @@ simplicity and flexibility. ```prolog ?- tokenize(`\tExample Text.`, Tokens). -Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] +Tokens = [cntrl('\t'), word(example), space(' '), space(' '), word(text), punct('.')] ?- tokenize(`\tExample Text.`, Tokens, [cntrl(false), pack(true), cased(true)]). -Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] +Tokens = [word('Example', 1), space(' ', 2), word('Text', 1), punct('.', 1)] ?- tokenize(`\tExample Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]). example text. -Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')], +Tokens = [cntrl('\t'), word(example), space(' '), space(' '), word(text), punct('.')], Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] ``` diff --git a/design_notes.md b/design_notes.md index e2be8ed..e84fade 100644 --- a/design_notes.md +++ b/design_notes.md @@ -35,7 +35,7 @@ this result through a subsequent lexing pass. * We don't parse. * Every token generated is callable (i.e., an atom or compound). - * Example of an possible compound token: `spc(' ')`. + * Example of an possible compound token: `space(' ')`. * Example of a possible atom token: `escape`. tokenization need to return tokens represented with the same arity) * Users should be able to determine the kind of token by unification. diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index a92a7ac..6923d64 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -53,7 +53,7 @@ % * a word (contiguous alpha-numeric chars): `word(W)` % * a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` % * a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` -% * a space ( == ` `): `spc(S)`. +% * a space ( == ` `): `space(S)`. % % Valid options are: % @@ -234,12 +234,12 @@ token(_Opts, word(W)),` ` --> word(W), ` `. token(_Opts, word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). -token(_Opts, spc(S)) --> spc(S). +token(_Opts, space(S)) --> space(S). token(_Opts, punct(P)) --> punct(P). token(_Opts, cntrl(C)) --> cntrl(C). token(_Opts, other(O)) --> nasciis(O). -spc(` `) --> ` `. +space(` `) --> ` `. sep --> ' '. sep --> eos, !. 
diff --git a/test/test.pl b/test/test.pl index 7d58dc1..9e17e36 100755 --- a/test/test.pl +++ b/test/test.pl @@ -15,12 +15,12 @@ [true(Actual == Expected)] ) :- tokenize("Hello, Tokenize!", Actual), - Expected = [word(hello),punct(','),spc(' '),word(tokenize),punct(!)]. + Expected = [word(hello),punct(','),space(' '),word(tokenize),punct(!)]. test('Goodbye, Tokenize!', [true(Actual == Expected)] ) :- - Tokens = [word('Goodbye'),punct(','),spc(' '),word('Tokenize'),punct('!')], + Tokens = [word('Goodbye'),punct(','),space(' '),word('Tokenize'),punct('!')], untokenize(Tokens, Codes), string_codes(Actual, Codes), Expected = "Goodbye, Tokenize!". @@ -69,7 +69,7 @@ [true(Actual == Expected)] ) :- tokenize("hi 7.0 x", Actual), - Expected = [word(hi), spc(' '), number(7.0), spc(' '), word(x)]. + Expected = [word(hi), space(' '), number(7.0), space(' '), word(x)]. test('untokenize 6.3 in other stuff', [true(Actual == Expected)] @@ -81,7 +81,7 @@ [true(Actual == Expected)] ) :- tokenize("hi 7.0 x", Actual, [numbers(false)]), - Expected = [word(hi), spc(' '), word('7'), punct('.'), word('0'), spc(' '), word(x)]. + Expected = [word(hi), space(' '), word('7'), punct('.'), word('0'), space(' '), word(x)]. % STRINGS @@ -120,7 +120,7 @@ [true(Actual == Expected)] ) :- tokenize(`Some other "a string" stuff`, Actual), - Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. + Expected = [word(some),space(' '),word(other),space(' '),string('a string'),space(' '),word(stuff)]. test('Extracts a string that includes escaped brackets', [true(Actual == Expected)] @@ -144,7 +144,7 @@ [true(Actual == Expected)] ) :- tokenize(`some "string".`, Actual, [numbers(false)]), - Expected = [word(some), spc(' '), string(string), punct('.')]. + Expected = [word(some), space(' '), string(string), punct('.')]. test('Untokenizes string things', [true(Actual == Expected)] From 5135c88ca5ebc93f78ddd0979ae4bf29fe4aa873 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 22:49:55 -0400 Subject: [PATCH 21/25] Add a changelog --- CHANGELOG.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 CHANGELOG.md diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..0c58002 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,23 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +The format is based on [Keep a Changelog][keep-a-change-log], and this project +adheres to [Semantic Versioning][semantic-versioning]. + +[keep-a-change-log]: https://keepachangelog.com/en/1.0.0/ +[semantic-versioning]: https://semver.org/spec/v2.0.0.html + +## [Unreleased] + +### Added + +- Support for numbers by [@Annipoo](https://github.com/Anniepoo) #34 +- Support for strings #37 +- Code of Conduct #23 + +### Changed + +- Spaces are now tagged with `space` instead of `spc`. #41 +- Tokenization of numbers and strings is enabled by default. #40 +- Options are now processed by a more conventional means #39 From b32ea01b712b37ab58b9161c7155a5f8aa645a6d Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 22 Jun 2019 18:51:25 -0400 Subject: [PATCH 22/25] Update the pack's home page info --- pack.pl | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pack.pl b/pack.pl index 174019f..c7ecabf 100644 --- a/pack.pl +++ b/pack.pl @@ -1,10 +1,10 @@ name(tokenize). -title('A nascent tokenization library'). +title('A simple tokenization library'). version('0.1.2'). 
-download('https://github.com/aBathologist/tokenize/release/*.zip'). +download('https://github.com/shonfeder/tokenize/release/*.zip'). author('Shon Feder', 'shon.feder@gmail.com'). packager('Shon Feder', 'shon.feder@gmail.com'). maintainer('Shon Feder', 'shon.feder@gmail.com'). -home('https://github.com/aBathologist/tokenize'). +home('https://github.com/shonfeder/tokenize'). From 26ce2bbb7be15168ed071f5da0f428805b95bb5f Mon Sep 17 00:00:00 2001 From: itampe <50549914+itampe@users.noreply.github.com> Date: Sun, 23 Jun 2019 01:40:57 +0200 Subject: [PATCH 23/25] Cleanup, bug fixes, and tests for comment.pl (#36) * Refactored and simplified the code * Introduce cut's to not leave choice points and lead to execution runaway * Add example with kind of comment * use copy_term of start and end tag * pldoc compliance. * removed complexity with specidfic atom treatment of matchers. now they're just matchers --- Makefile | 1 - prolog/comment.pl | 179 ++++++++++++++++-------------------------- test/test_comments.pl | 104 ++++++++++++++++++++++++ 3 files changed, 173 insertions(+), 111 deletions(-) create mode 100644 test/test_comments.pl diff --git a/Makefile b/Makefile index d5ae1c5..044b64f 100644 --- a/Makefile +++ b/Makefile @@ -17,4 +17,3 @@ install: test: @$(SWIPL) -s test/test.pl -g 'run_tests,halt(0)' -t 'halt(1)' - \ No newline at end of file diff --git a/prolog/comment.pl b/prolog/comment.pl index 8e1a525..cea7fd6 100644 --- a/prolog/comment.pl +++ b/prolog/comment.pl @@ -1,44 +1,60 @@ -/* -module(tokenize(comment) - [comment/2, - comment_rec/2, - comment_token/2, - comment_token_rec/2]). +:- module(comment, + [comment//2, + comment_rec//2, + comment_token//3, + comment_token_rec//3]). + +/** Tokenizing comments +This module defines matchers for comments used by the tokenize module. (Note +that we will use matcher as a name for dcg rules that match parts of the codes +list). + +@author Stefan Israelsson Tampe +@license LGPL v2 or later + +Interface Note: +Start and End matchers is a matcher (dcg rule) that is either evaluated with no +extra argument (--> call(StartMatcher)) and it will just match it's token or it +can have an extra argument producing the codes matched by the matcher e.g. used +as --> call(StartMatcher,MatchedCodes). The matchers match start and end codes +of the comment, the 2matcher type will represent these kinds of dcg rules or +matchers 2 is because they support two kinds of arguments to the dcg rules. +For examples +see: + + @see tests/test_comments.pl + +The matchers predicates exported and defined are: + + comment(+Start:2matcher,+End:2matcher) + - anonymously match a non recursive comment + + comment_rec(+Start:2matcher,+End:2matcher,2matcher) + - anonymously match a recursive comment + + coment_token(+Start:2matcher,+End:2matcher,-Matched:list(codes)) + - match an unrecursive comment outputs the matched sequence used + for building a resulting comment token + + coment_token_rec(+Start:2matcher,+End:2matcher,-Matched:list(codes)) + - match an recursive comment outputs the matched sequence used + for building a resulting comment token */ -dcgtrue(U,U). -id([X|L]) --> [X],id(L). -id([]) --> dcgtrue. -id([X|L],[X|LL]) --> [X],id(L,LL). -id([],[]) --> dcgtrue. -tr(S,SS) :- - atom(S) -> - ( - atom_codes(S,C), - SS=id(C) - ); - SS=S. +%% comment(+Start:2matcher,+End:2matcher) +% non recursive non tokenizing matcher -eol --> {atom_codes('\n',E)},id(E). -eol(HS) --> {atom_codes('\n',E)},id(E,HS). - comment_body(E) --> call(E),!. comment_body(E) --> [_],comment_body(E). 
-comment_body(_) --> []. - + comment(S,E) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS), - comment_body(EE). + call(S), + comment_body(E). -line_comment(S) --> - {tr(S,SS)}, - comment_body(SS,eol). +%% comment_token(+Start:2matcher,+End:2matcher,-Matched:list(codes)) +% non recursive tokenizing matcher comment_body_token(E,Text) --> call(E,HE),!, @@ -48,57 +64,45 @@ [X], comment_body_token(E,L). -comment_body_token(_,[]) --> []. - comment_token(S,E,Text) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS,HS), + call(S,HS), {append(HS,T,Text)}, - comment_body_token(EE,T). + comment_body_token(E,T). -line_comment_token(S,Text) --> - {tr(S,SS)}, - comment_body_token(SS,eol,Text). +%% comment_token_rec(+Start:2matcher,+End:2matcher,-Matched:list(codes)) +% recursive tokenizing matcher -comment_body_rec_cont(S,E,Cont,HE,Text) --> - {append(HE,T,Text)}, - comment_body_token_rec(S,E,Cont,T). - -comment_body_rec_start(HE,Text) --> - {append(HE,[],Text)}. +% Use this as the initial continuation, will just tidy up the matched result +% by ending the list with []. +comment_body_rec_start(_,_,[]). comment_body_token_rec(_,E,Cont,Text) --> - call(E,HE), - call(Cont,HE,Text). + call(E,HE),!, + {append(HE,T,Text)}, + call(Cont,T). comment_body_token_rec(S,E,Cont,Text) --> - call(S,HS), + call(S,HS),!, {append(HS,T,Text)}, - comment_body_token_rec(S,E,comment_body_rec_cont(S,E,Cont),T). + comment_body_token_rec(S,E,comment_body_token_rec(S,E,Cont),T). comment_body_token_rec(S,E,Cont,[X|L]) --> [X], comment_body_token_rec(S,E,Cont,L). -comment_body_token_rec(_,_,_,_,_,[]) --> dcgtrue. - comment_token_rec(S,E,Text) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS,HS), + call(S,HS), {append(HS,T,Text)}, - comment_body_token_rec(SS,EE,comment_body_rec_start,T). + comment_body_token_rec(S,E,comment_body_rec_start,T). + +%% comment_rec(+Start:2matcher,+End:2matcher) +% recursive non tokenizing matcher comment_body_rec(_,E) --> - call(E). + call(E),!. comment_body_rec(S,E) --> - call(S), + call(S),!, comment_body_rec(S,E), comment_body_rec(S,E). @@ -106,51 +110,6 @@ [_], comment_body_rec(S,E). -comment_body_rec(_,_). - comment_rec(S,E) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS), - comment_body_rec(SS,EE). - -test(Tok,S,U) :- - atom_codes(S,SS), - call_dcg(Tok,SS,U). - -test_comment(S) :- - test(comment('<','>'),S,[]). - -test_comment_rec(S) :- - test(comment_rec('<','>'),S,[]). - -test_comment_token(S,T) :- - test(comment_token('<','>',TT),S,[]), - atom_codes(T,TT). - -test_comment_token_rec(S,T) :- - test(comment_token_rec('<','>',TT),S,[]), - atom_codes(T,TT). - -tester([]). -tester([X|L]) :- - write_term(test(X),[]), - ( - call(X) -> write(' ... OK') ; write(' ... FAIL') - ), - nl, - tester(L). - - -/* -tester( - [test_comment(''), - test_comment_rec('>'), - test_comment_token('',''), - test_comment_token_rec('>','>')]). -*/ - - - + call(S), + comment_body_rec(S,E). diff --git a/test/test_comments.pl b/test/test_comments.pl new file mode 100644 index 0000000..aa7f907 --- /dev/null +++ b/test/test_comments.pl @@ -0,0 +1,104 @@ +:- dynamic user:file_search_path/2. +:- multifile user:file_search_path/2. + +% Add the package source files relative to the current file location +:- prolog_load_context(directory, Dir), + atom_concat(Dir, '/../prolog', PackageDir), + asserta(user:file_search_path(package, PackageDir)). + +:- use_module(package(comment)). +:- begin_tests(tokenize_comment). + +id(X) --> {atom_codes(X,XX)},XX. +id(X,XX) --> {atom_codes(X,XX)},XX. 
+ +mytest(Tok,S,U) :- + atom_codes(S,SS), + call_dcg(Tok,SS,U). + +test_comment(S) :- + mytest(comment(id('<'),id('>')),S,[]). + +test_comment_rec(S) :- + mytest(comment_rec(id('<'),id('>')),S,[]). + +test_comment_token(S,T) :- + mytest(comment_token(id('<'),id('>'),TT),S,[]), + atom_codes(T,TT). + +test_comment_token_rec(S,T) :- + mytest(comment_token_rec(id('<'),id('>'),TT),S,[]), + atom_codes(T,TT). + +start(AA) :- + ( + catch(b_getval(a,[N,A]),_,N=0) -> + true; + N=0 + ), + NN is N + 1, + ( + N == 0 -> + AA = _; + AA = A + ), + b_setval(a,[NN,AA]). + +end(A) :- + b_getval(a,[N,A]), + NN is N - 1, + b_setval(a,[NN,A]). + +left(A) --> + {atom_codes(A,AA)}, + AA, + {start(B)}, + [B]. + +left(A,C) --> + {atom_codes(A,AA)}, + AA, + {start(B)}, + [B], + {append(AA,[B],C)}. + +right(A) --> + {end(B)}, + [B], + {atom_codes(A,AA)}, + AA. + +right(A,C) --> + {end(B)}, + [B], + {atom_codes(A,AA)}, + AA, + {append([B],AA,C)}. + +test_adapt(S,T) :- + mytest(comment_token_rec(left('<'),right('>'),TT),S,[]), + atom_codes(T,TT). + + +:- multifile test/2. + +test('Test comment',[true(test_comment(''))]) :- true. +test('Test comment_rec',[true(test_comment_rec('>'))]) :- true. +test('Test comment_token',[true(A == B)]) :- + A='', + test_comment_token(A,B). + +test('Test comment_token_rec',[true(A == B)]) :- + A='>', + test_comment_token(A,B). + +test('Test comment_token_rec advanced 1',[true(A == B)]) :- + A='<1 alla2> <1 balla2> 1>1>', + test_adapt(A,B). + +test('Test comment_token_rec advanced 2',[true(A == B)]) :- + A='<2 alla1> <2 balla1> 2>2>', + test_adapt(A,B). + + +:- end_tests(tokenize_comment). From 97b9378de5eb740c3638af6e3ce9f5851e9754dc Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 22 Jun 2019 22:16:50 -0400 Subject: [PATCH 24/25] Extract WIP comment code into separate directory --- comment-wip/README.md | 4 ++++ {prolog => comment-wip}/comment.pl | 0 {test => comment-wip}/test_comments.pl | 0 3 files changed, 4 insertions(+) create mode 100644 comment-wip/README.md rename {prolog => comment-wip}/comment.pl (100%) rename {test => comment-wip}/test_comments.pl (100%) diff --git a/comment-wip/README.md b/comment-wip/README.md new file mode 100644 index 0000000..c1c3fd9 --- /dev/null +++ b/comment-wip/README.md @@ -0,0 +1,4 @@ +WIP code towards tokenization of comments. + +It was extracted here because it's not ready for release, but we want to keep it +available for the author to resume work on it. diff --git a/prolog/comment.pl b/comment-wip/comment.pl similarity index 100% rename from prolog/comment.pl rename to comment-wip/comment.pl diff --git a/test/test_comments.pl b/comment-wip/test_comments.pl similarity index 100% rename from test/test_comments.pl rename to comment-wip/test_comments.pl From c45fb74a9fb76c19d7f555b89d542e6d8ea74ad6 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 22 Jun 2019 22:47:56 -0400 Subject: [PATCH 25/25] Bump version --- CHANGELOG.md | 27 ++++++++++++++++++++++++--- pack.pl | 2 +- 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0c58002..08dd184 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,9 @@ adheres to [Semantic Versioning][semantic-versioning]. [keep-a-change-log]: https://keepachangelog.com/en/1.0.0/ [semantic-versioning]: https://semver.org/spec/v2.0.0.html -## [Unreleased] +## [unreleased] + +## [1.0.0] ### Added @@ -18,6 +20,25 @@ adheres to [Semantic Versioning][semantic-versioning]. ### Changed -- Spaces are now tagged with `space` instead of `spc`. 
#41 -- Tokenization of numbers and strings is enabled by default. #40 +- Spaces are now tagged with `space` instead of `spc` #41 +- Tokenization of numbers and strings is enabled by default #40 - Options are now processed by a more conventional means #39 +- The location for the pack's home is updated + +## [0.1.2] + +Prior to changelog. + +## [0.1.1] + +Prior to changelog. + +## [0.1.0] + +Prior to changelog. + +[unreleased]: https://github.com/shonfeder/tokenize/compare/v1.0.0...HEAD +[1.0.0]: https://github.com/shonfeder/tokenize/compare/v0.1.2...v1.0.0 +[0.1.2]: https://github.com/shonfeder/tokenize/compare/v0.1.1...v0.1.2 +[0.1.1]: https://github.com/shonfeder/tokenize/compare/v0.1.0...v0.1.1 +[0.1.0]: https://github.com/shonfeder/tokenize/releases/tag/v0.1.0 diff --git a/pack.pl b/pack.pl index c7ecabf..68438aa 100644 --- a/pack.pl +++ b/pack.pl @@ -1,7 +1,7 @@ name(tokenize). title('A simple tokenization library'). -version('0.1.2'). +version('1.0.0'). download('https://github.com/shonfeder/tokenize/release/*.zip'). author('Shon Feder', 'shon.feder@gmail.com').
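
The comment matchers introduced in PATCH 23 above (and parked under `comment-wip/`
in PATCH 24) are easiest to see with a small worked example. The sketch below is
illustrative only and is not part of any patch in this series: the `id//1` and
`id//2` helpers mirror the ones in `comment-wip/test_comments.pl`, the `/* ... */`
delimiters are arbitrary stand-ins (the module leaves the Start/End matchers
entirely up to the caller), the load path assumes `swipl` is started at the
repository root, and the expected answers are inferred from the tests rather than
verified against a released build.

```prolog
% Illustrative sketch only -- not part of any patch in this series.
:- use_module('comment-wip/comment').

% Minimal matchers in the style of comment-wip/test_comments.pl:
id(X)     --> { atom_codes(X, Cs) }, Cs.   % match the delimiter, no capture
id(X, Cs) --> { atom_codes(X, Cs) }, Cs.   % match it and return its codes

% comment//2 only recognizes the span:
% ?- phrase(comment(id('/*'), id('*/')), `/* a comment */`).
% true.

% comment_token//3 also captures the matched codes:
% ?- phrase(comment_token(id('/*'), id('*/'), T), `/* a comment */ tail`, Rest),
%    atom_codes(Comment, T).
% Comment should unify with '/* a comment */', leaving ` tail` in Rest.
```

Because the Start and End arguments are ordinary DCG rules, the same predicates can
be pointed at any delimiter scheme the caller defines, and the `comment_rec//2` /
`comment_token_rec//3` variants handle nested comments in the same way.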