From 1221082a127ffbb463680b9cd9abbd62f5f3c54c Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 11 May 2019 22:03:56 -0400 Subject: [PATCH 01/25] Add design notes (#25) --- design_notes.md | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) create mode 100644 design_notes.md diff --git a/design_notes.md b/design_notes.md new file mode 100644 index 0000000..e2be8ed --- /dev/null +++ b/design_notes.md @@ -0,0 +1,45 @@ +# Design Notes + +Initially extracted from conversation with +[@Anniepoo](https://github.com/Anniepoo) and [@nicoabie](https://github.com/nicoabie) in +##prolog on [freenode](https://freenode.net/). + +The library started as a very simple and lightweight set of predicates for a +common, but very limited, form of lexing. As we extend it, we aim to maintain a +modest scope in order to achieve a sweet spot between ease of use and powerful +flexibility. + +## Scope and Aims + +`tokenize` does not aspire to become an industrial strength lexer generator. We +aim to serve most users' needs between raw input and a structured form ready for +parsing by a DCG. + +If a user is parsing a language with keywords such as `class`, `module`, etc., +and wants to distinguish these from variable names, `tokenize` isn't going to +give them this out of the box. But it should provide an easy means of achieving +this result through a subsequent lexing pass. + +## Some Model Users + +* somebody making a computer language + * needs to be able to distinguish keywords, variables and literals + * needs to be able to identify comments +* somebody making a parser for an interactive fiction game + * needs to handle stuff like "William O. N'mutu-O'Connell went to the market" +* somebody wanting to analyze human texts + * wanting to do some analysis on New York Times articles, they want to first + process the articles into meaningful tokens + +## Design Rules + +* We don't parse. +* Every token generated is callable (i.e., an atom or compound). + * Example of an possible compound token: `spc(' ')`. + * Example of a possible atom token: `escape`. + tokenization need to return tokens represented with the same arity) +* Users should be able to determine the kind of token by unification. +* Users should be able to clearly see and specify the precedence for tokenization + * E.g., given `"-12.3"`, `numbers, punctuation` should yield `[number(-12.3)]` + while `punctuation, numbers` should yield `[pnct('-'), number(12), pnct('.'), + number(3)]`.
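The design rules above are easiest to appreciate with a token stream in hand. The sketch below is a minimal illustration of the "determine the kind of token by unification" rule; `describe_token/2` is a hypothetical helper invented for this note (it is not part of the library), and the token shapes are the `word/1`, `punct/1`, and `spc/1` forms the notes and README use at this point in the project's history.

```prolog
% A minimal sketch of dispatching on token kinds by unification.
% describe_token/2 is hypothetical, for illustration only.
describe_token(word(W), Desc)  :- format(atom(Desc), 'word: ~w', [W]).
describe_token(punct(P), Desc) :- format(atom(Desc), 'punctuation: ~w', [P]).
describe_token(spc(S), Desc)   :- format(atom(Desc), 'space: ~q', [S]).

% Applied to a token list of the shape tokenize/2 produces:
% ?- maplist(describe_token, [word(hello), punct(','), spc(' ')], Descriptions).
```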
From 45189c547bdd6cde751b4707576bd43dba376705 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 11 May 2019 22:14:30 -0400 Subject: [PATCH 02/25] Init circleci config (#27) Signed-off-by: Shon Feder --- .circleci/config.yml | 1 + 1 file changed, 1 insertion(+) create mode 100644 .circleci/config.yml diff --git a/.circleci/config.yml b/.circleci/config.yml new file mode 100644 index 0000000..22817d2 --- /dev/null +++ b/.circleci/config.yml @@ -0,0 +1 @@ +version: 2 From 4158a3f075ead34683c944896719fd6e0025d30c Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 11 May 2019 22:35:25 -0400 Subject: [PATCH 03/25] Run the test harness in the CI (#28) Signed-off-by: Shon Feder --- .circleci/config.yml | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/.circleci/config.yml b/.circleci/config.yml index 22817d2..a7f5ee6 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -1 +1,21 @@ version: 2 + +jobs: + build: + docker: + - image: swipl:stable + + steps: + - run: + # TODO Build custom image to improve build time + name: Install git + command: | + apt update -y + apt install git -y + + - checkout + + - run: + name: Run tests + command: | + ./test/test.pl From a89db7d0445c378b870911d4ab2ead2a719d23f7 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 12 May 2019 09:10:27 -0400 Subject: [PATCH 04/25] Add instructions for getting a basic development environment set up (#29) * Add link to design_notes.md Signed-off-by: Shon Feder --- CONTRIBUTING.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 87eda1c..16b731a 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -5,12 +5,56 @@ reports, etc. ## Code of Conduct -Please review and accept to our [code of conduct](CODE_OF_CONDUCT.md) prior to +Please review and accept our [code of conduct](CODE_OF_CONDUCT.md) prior to engaging in the project. +## Overall direction and aims + +Consult the `[design_notes.md](design_notes.md)` to see the latest codified +design philosophy and principles. + ## Setting up Development -TODO +1. Install swi-prolog's [swipl](http://www.swi-prolog.org/download/stable). + - Optionally, you may wish to use [swivm](https://github.com/fnogatz/swivm) to + manage multiple installed versions of swi-prolog. +2. Hack on the source code in `[./prolog](./prolog)`. +3. Run and explore your changes by loading the file in `swipl` (or using your + editor's IDE capabilities): + - Example in swipl + + ```prolog + # in ~/oss/tokenize on git:develop x [22:45:02] + $ cd ./prolog + + # in ~/oss/tokenize/prolog on git:develop x [22:45:04] + $ swipl + Welcome to SWI-Prolog (threaded, 64 bits, version 8.0.2) + SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software. + Please run ?- license. for legal details. + + For online help and background, visit http://www.swi-prolog.org + For built-in help, use ?- help(Topic). or ?- apropos(Word). + + % load the tokenize module + ?- [tokenize]. + true. + + % experiment + ?- tokenize("Foo bar baz", Tokens). + Tokens = [word(foo), spc(' '), word(bar), spc(' '), word(baz)]. + + % reload the module when you make changes to the source code + ?- make. + % Updating index for library /usr/local/Cellar/swi-prolog/8.0.2/libexec/lib/swipl/library/ + true. + + % finished + ?- halt. + ``` + +Please ask here or in `##prolog` on [freenode](https://freenode.net/) if you +need any help!
:) ## Running tests From 5e74e4e5d1c67addc5eda542ea16dd9b6c8d274b Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 12 May 2019 09:33:11 -0400 Subject: [PATCH 05/25] Fix design notes link (#31) Signed-off-by: Shon Feder --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 16b731a..8084dc9 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -10,7 +10,7 @@ engaging in the project. ## Overall direction and aims -Consult the `[design_notes.md](design_notes.md)` to see the latest codified +Consult the [`design_notes.md`](design_notes.md) to see the latest codified design philosophy and principles. ## Setting up Development From d7b0fe970141a2652e8072663e41c67679514172 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 12 May 2019 12:53:18 -0400 Subject: [PATCH 06/25] Explicitly set back_quotes for code lists in the tokenize module (#30) Closes #7 * Also removed trailing white space from the readme --- README.md | 8 ++++---- prolog/tokenize.pl | 3 +++ 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 82ec7d1..b8c0b73 100644 --- a/README.md +++ b/README.md @@ -2,22 +2,22 @@ ```prolog ?- tokenize(`\tExample Text.`, Tokens). -Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] +Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] ?- tokenize(`\tExample Text.`, Tokens, [cntrl(false), pack(true), cased(true)]). -Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] +Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] ?- tokenize(`\tExample Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]). example text. Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')], -Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] +Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] ``` # Description Module `tokenize` aims to provide a straightforward tool for tokenizing text into a simple format. It is the result of a learning exercise, and it is far from perfect. If there is sufficient interest from myself or anyone else, I'll try to improve it. -It is packaged as an SWI-Prolog pack, available [here](http://www.swi-prolog.org/pack/list?p=tokenize). Install it into your SWI-Prolog system with the query +It is packaged as an SWI-Prolog pack, available [here](http://www.swi-prolog.org/pack/list?p=tokenize). Install it into your SWI-Prolog system with the query ```prolog ?- pack_install(tokenize). diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index a177bf9..7dc877b 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -25,6 +25,9 @@ */ +% Ensure we interpret backs as enclosing code lists in this module. +:- set_prolog_flag(back_quotes, codes). + %% tokenize(+Text:list(code), -Tokens:list(term)) is semidet. % % @see tokenize/3 is called with an empty list of options: thus, with defaults. 
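A short note on the flag set in patch 6, since the rest of the series relies on it: `back_quotes` controls how back-quoted literals are read, and the `codes` value makes them read as lists of character codes, which is what the tokenizer's DCG rules expect. Roughly, at a toplevel where the flag is `codes`:

```prolog
% With back_quotes = codes, a back-quoted literal denotes a code list:
?- X = `abc`.
X = [97, 98, 99].

% If an embedding environment had set the flag to another value (e.g. string),
% the same literal would denote a different type and the grammar rules in
% tokenize.pl would no longer match, which is why the module now pins the flag
% explicitly instead of relying on the caller's configuration.
```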
From 15b1959ff02ad06c60b8bedf610a960a6a2095e9 Mon Sep 17 00:00:00 2001 From: Stefan Israelsson Tampe Date: Sun, 12 May 2019 20:57:58 +0200 Subject: [PATCH 07/25] add comment.pl, dcg that parses a stream of codes into comment recursive or not, tokens or just skip the comment --- prolog/comment.pl | 156 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 156 insertions(+) create mode 100644 prolog/comment.pl diff --git a/prolog/comment.pl b/prolog/comment.pl new file mode 100644 index 0000000..8e1a525 --- /dev/null +++ b/prolog/comment.pl @@ -0,0 +1,156 @@ +/* +module(tokenize(comment) + [comment/2, + comment_rec/2, + comment_token/2, + comment_token_rec/2]). +*/ + +dcgtrue(U,U). + +id([X|L]) --> [X],id(L). +id([]) --> dcgtrue. +id([X|L],[X|LL]) --> [X],id(L,LL). +id([],[]) --> dcgtrue. + +tr(S,SS) :- + atom(S) -> + ( + atom_codes(S,C), + SS=id(C) + ); + SS=S. + +eol --> {atom_codes('\n',E)},id(E). +eol(HS) --> {atom_codes('\n',E)},id(E,HS). + +comment_body(E) --> call(E),!. +comment_body(E) --> [_],comment_body(E). +comment_body(_) --> []. + +comment(S,E) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS), + comment_body(EE). + +line_comment(S) --> + {tr(S,SS)}, + comment_body(SS,eol). + +comment_body_token(E,Text) --> + call(E,HE),!, + {append(HE,[],Text)}. + +comment_body_token(E,[X|L]) --> + [X], + comment_body_token(E,L). + +comment_body_token(_,[]) --> []. + +comment_token(S,E,Text) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS,HS), + {append(HS,T,Text)}, + comment_body_token(EE,T). + +line_comment_token(S,Text) --> + {tr(S,SS)}, + comment_body_token(SS,eol,Text). + +comment_body_rec_cont(S,E,Cont,HE,Text) --> + {append(HE,T,Text)}, + comment_body_token_rec(S,E,Cont,T). + +comment_body_rec_start(HE,Text) --> + {append(HE,[],Text)}. + +comment_body_token_rec(_,E,Cont,Text) --> + call(E,HE), + call(Cont,HE,Text). + +comment_body_token_rec(S,E,Cont,Text) --> + call(S,HS), + {append(HS,T,Text)}, + comment_body_token_rec(S,E,comment_body_rec_cont(S,E,Cont),T). + +comment_body_token_rec(S,E,Cont,[X|L]) --> + [X], + comment_body_token_rec(S,E,Cont,L). + +comment_body_token_rec(_,_,_,_,_,[]) --> dcgtrue. + +comment_token_rec(S,E,Text) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS,HS), + {append(HS,T,Text)}, + comment_body_token_rec(SS,EE,comment_body_rec_start,T). + +comment_body_rec(_,E) --> + call(E). + +comment_body_rec(S,E) --> + call(S), + comment_body_rec(S,E), + comment_body_rec(S,E). + +comment_body_rec(S,E) --> + [_], + comment_body_rec(S,E). + +comment_body_rec(_,_). + +comment_rec(S,E) --> + { + tr(S,SS), + tr(E,EE) + }, + call(SS), + comment_body_rec(SS,EE). + +test(Tok,S,U) :- + atom_codes(S,SS), + call_dcg(Tok,SS,U). + +test_comment(S) :- + test(comment('<','>'),S,[]). + +test_comment_rec(S) :- + test(comment_rec('<','>'),S,[]). + +test_comment_token(S,T) :- + test(comment_token('<','>',TT),S,[]), + atom_codes(T,TT). + +test_comment_token_rec(S,T) :- + test(comment_token_rec('<','>',TT),S,[]), + atom_codes(T,TT). + +tester([]). +tester([X|L]) :- + write_term(test(X),[]), + ( + call(X) -> write(' ... OK') ; write(' ... FAIL') + ), + nl, + tester(L). + + +/* +tester( + [test_comment(''), + test_comment_rec('>'), + test_comment_token('',''), + test_comment_token_rec('>','>')]). +*/ + + + From 189066c9223c69a921edffedb20d3ac573010893 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nicol=C3=A1s=20Andr=C3=A9s=20Gallinal?= Date: Sun, 12 May 2019 17:56:29 -0300 Subject: [PATCH 08/25] Created a Makefile (#32) * Add a Makefile with test target. Updated CircleCI conf. 
Ability to run tests from within swipl repl. * Add make as dep for the docker image --- .circleci/config.yml | 6 +++--- CONTRIBUTING.md | 14 ++++++++++++-- Makefile | 20 ++++++++++++++++++++ test/test.pl | 18 +----------------- 4 files changed, 36 insertions(+), 22 deletions(-) create mode 100644 Makefile diff --git a/.circleci/config.yml b/.circleci/config.yml index a7f5ee6..dd98f9e 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -8,14 +8,14 @@ jobs: steps: - run: # TODO Build custom image to improve build time - name: Install git + name: Install Deps command: | apt update -y - apt install git -y + apt install git make -y - checkout - run: name: Run tests command: | - ./test/test.pl + make test diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8084dc9..1895faa 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -59,10 +59,20 @@ need any help! :) ## Running tests Tests are located in the [`./test`](./test) directory. To run the test suite, -simply execute the test file: +simply execute make test: ```sh -$ ./test/test.pl +$ make test % PL-Unit: tokenize .. done % All 2 tests passed ``` + +If inside the swipl repl, make sure to load the test file and query run_tests. + +```prolog +?- [test/test]. +?- run_tests. +% PL-Unit: tokenize .. done +% All 2 tests passed +true. +``` \ No newline at end of file diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..d5ae1c5 --- /dev/null +++ b/Makefile @@ -0,0 +1,20 @@ +.PHONY: all test clean + +version := $(shell swipl -q -s pack -g 'version(V),writeln(V)' -t halt) +packfile = quickcheck-$(version).tgz + +SWIPL := swipl + +all: test + +version: + echo $(version) + +check: test + +install: + echo "(none)" + +test: + @$(SWIPL) -s test/test.pl -g 'run_tests,halt(0)' -t 'halt(1)' + \ No newline at end of file diff --git a/test/test.pl b/test/test.pl index 49b1857..ed6de19 100755 --- a/test/test.pl +++ b/test/test.pl @@ -1,18 +1,3 @@ -#!/usr/bin/env swipl -/** Unit tests for the tokenize library - * - * To run these tests, execute this file - * - * ./test/test.pl - */ - -:- initialization(main, main). - -main(_Argv) :- - run_tests. - -:- begin_tests(tokenize). - :- dynamic user:file_search_path/2. :- multifile user:file_search_path/2. @@ -22,8 +7,7 @@ asserta(user:file_search_path(package, PackageDir)). :- use_module(package(tokenize)). - -% TESTS START HERE +:- begin_tests(tokenize). test('Hello, Tokenize!', [true(Actual == Expected)] From 4b9b0f82efdc5b8f7eb5cf24c0d652366a189e2b Mon Sep 17 00:00:00 2001 From: Anne Ogborn Date: Sun, 12 May 2019 16:28:06 -0700 Subject: [PATCH 09/25] Add tokenization of numbers (#34) --- .gitignore | 1 + prolog/tokenize.pl | 46 +++++++++++++++++++++++++++++----------------- test/test.pl | 27 +++++++++++++++++++++++++++ 3 files changed, 57 insertions(+), 17 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..b25c15b --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +*~ diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 7dc877b..e94f0aa 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -24,6 +24,7 @@ text. */ +:- use_module(library(dcg/basics), [eos//0, number//1]). % Ensure we interpret backs as enclosing code lists in this module. :- set_prolog_flag(back_quotes, codes). @@ -55,13 +56,19 @@ % Valid options are: % % * cased(+bool) : Determines whether tokens perserve cases of the source text. -% * spaces(+bool) : Determines whether spaces are represted as tokens or discarded. 
-% * cntrl(+bool) : Determines whether control characters are represented as tokens or discarded. -% * punct(+bool) : Determines whether punctuation characters are represented as tokens or discarded. -% * to(+on_of([strings,atoms,chars,codes])) : Determines the representation format used for the tokens. -% * pack(+bool) : Determines whether tokens are packed or repeated. +% * spaces(+bool) : Determines whether spaces are represted as tokens +% or discarded. +% * cntrl(+bool) : Determines whether control characters are represented +% as tokens or discarded. +% * punct(+bool) : Determines whether punctuation characters are represented +% as tokens or discarded. +% * to(+one_of([strings,atoms,chars,codes])) : Determines the +% representation format used for the tokens. +% * pack(+bool) : Determines whether tokens are packed or repeated. % TODO is it possible to achieve the proper semidet without the cut? +% Annie sez some parses are ambiguous, not even sure the cut should be +% there tokenize(Text, Tokens, Options) :- must_be(nonvar, Text), @@ -138,6 +145,8 @@ % % If dcg functor is identical to the option name with 'opt_' prefixed, % then the dcg functor can be omitted. +% +% opt(Opt, Default) --> { atom_concat('opt_', Opt, Opt_DCG) }, @@ -160,7 +169,7 @@ var(Default), \+ option(Opt, Opts), writeln("Unknown options passed to opt//3: "), write(Opt) - }. + }. % TODO use print_message for this %% non_opt(+DCG). % @@ -208,11 +217,12 @@ opt_pack(true) --> state(T0, T1), { phrase(pack_tokens(T1), T0) }. - - -%% POST PROCESSING + /******************************* + * POST_PROCESSING * + *******************************/ %% Convert tokens to alternative representations. +token_to(_, number(X), number(X)) :- !. token_to(Type, Token, Converted) :- ( Type == strings -> Conversion = inverse(string_codes) ; Type == atoms -> Conversion = inverse(atom_codes) @@ -231,22 +241,26 @@ pack(X, Count) --> [X], pack(X, 1, Count). -pack(_, Total, Total) --> call(eos). +pack(_, Total, Total) --> eos. pack(X, Total, Total), [Y] --> [Y], { Y \= X }. pack(X, Count, Total) --> [X], { succ(Count, NewCount) }, pack(X, NewCount, Total). -% PARSING + /******************************* + * PARSING * + *******************************/ + -tokens([T]) --> token(T), call(eos), !. +tokens([T]) --> token(T), eos, !. tokens([T|Ts]) --> token(T), tokens(Ts). % NOTE for debugging % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. -token(word(W)) --> word(W), call(eos), !. +token(number(N)) --> number(N), !. +token(word(W)) --> word(W), eos, !. token(word(W)),` ` --> word(W), ` `. token(word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). token(spc(S)) --> spc(S). @@ -258,7 +272,7 @@ spc(` `) --> ` `. sep --> ' '. -sep --> call(eos), !. +sep --> eos, !. word(W) --> csyms(W). @@ -269,7 +283,7 @@ % non ascii's -nasciis([C]) --> nascii(C), (call(eos), !). +nasciis([C]) --> nascii(C), eos, !. nasciis([C]),[D] --> nascii(C), [D], {D < 127}. nasciis([C|Cs]) --> nascii(C), nasciis(Cs). @@ -286,8 +300,6 @@ punct([P]) --> [P], {code_type(P, punct)}. cntrl([C]) --> [C], {code_type(C, cntrl)}. -eos([], []). - %% move to general module codes_to_lower([], []). diff --git a/test/test.pl b/test/test.pl index ed6de19..f7f281f 100755 --- a/test/test.pl +++ b/test/test.pl @@ -23,4 +23,31 @@ string_codes(Actual, Codes), Expected = "Goodbye, Tokenize!". + +test('tokenize 7.0', + [true(Actual == Expected)] + ) :- + tokenize("7.0", Actual), + Expected = [number(7.0)]. 
+ +test('untokenize 6.3', + [true(Actual == Expected)] + ) :- + untokenize([number(6.3)], Actual), + Expected = `6.3`. + + +test('tokenize number in other stuff', + [true(Actual == Expected)] + ) :- + tokenize("hi 7.0 x", Actual), + Expected = [word(hi), spc(' '), number(7.0), spc(' '), word(x)]. + +test('untokenize 6.3 in other stuff', + [true(Actual == Expected)] + ) :- + untokenize([word(hi), number(6.3)], Actual), + Expected = `hi6.3`. + + :- end_tests(tokenize). From 8e4e98fbce697bce46522b768be3e4bedeabe6b7 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 08:14:56 -0400 Subject: [PATCH 10/25] Improve comments and code ordering Signed-off-by: Shon Feder --- prolog/tokenize.pl | 40 ++++++++++++++++++++++++++++------------ 1 file changed, 28 insertions(+), 12 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index e94f0aa..b282bce 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -133,8 +133,11 @@ %% Dispatches dcgs by option-list functors, with default values. process_options --> + % Preprocessing opt(cased, false), + % Tokenization non_opt(tokenize_text), + % Postprocessing opt(spaces, true), opt(cntrl, true), opt(punct, true), @@ -184,7 +187,12 @@ state(S0), [S0] --> [S0]. state(S0, S1), [S1] --> [S0]. -%% Dispatching options: + +% Dispatching the option pipeline options: + + /*************************** + * PREPROCESSING * + ***************************/ opt_cased(true) --> []. opt_cased(false) --> state(Text, LowerCodes), @@ -194,8 +202,10 @@ string_codes(LowerStr, LowerCodes) }. -tokenize_text --> state(Text, Tokenized), - { phrase(tokens(Tokenized), Text) }. + + /*************************** + * POSTPROCESSING * + ***************************/ opt_spaces(true) --> []. opt_spaces(false) --> state(T0, T1), @@ -217,11 +227,8 @@ opt_pack(true) --> state(T0, T1), { phrase(pack_tokens(T1), T0) }. - /******************************* - * POST_PROCESSING * - *******************************/ -%% Convert tokens to alternative representations. +% Convert tokens to alternative representations. token_to(_, number(X), number(X)) :- !. token_to(Type, Token, Converted) :- ( Type == strings -> Conversion = inverse(string_codes) @@ -232,8 +239,11 @@ call_into_term(Conversion, Token, Converted). -%% Packing repeating tokens -% + /*********************************** + * POSTPROCESSING HELPERS * + ***********************************/ + +% Packing repeating tokens pack_tokens([T]) --> pack_token(T). pack_tokens([T|Ts]) --> pack_token(T), pack_tokens(Ts). @@ -247,11 +257,15 @@ pack(X, NewCount, Total). + /************************** + * TOKENIZATION * + **************************/ + +tokenize_text --> state(Text, Tokenized), + { phrase(tokens(Tokenized), Text) }. - /******************************* - * PARSING * - *******************************/ +% PARSING tokens([T]) --> token(T), eos, !. tokens([T|Ts]) --> token(T), tokens(Ts). @@ -292,6 +306,8 @@ ' ' --> space. ' ' --> space, ' '. + +% Any ... --> []. ... --> [_], ... . 
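With patches 9 and 10 applied, number tokens are in place and the module has been reorganised without changing behaviour. The abridged toplevel transcript below just restates the tests added above (space tokens are still the pre-rename `spc/1` form at this point in the series):

```prolog
?- [tokenize].
true.

% Numbers are matched via number//1 from library(dcg/basics):
?- tokenize("7.0", Tokens).
Tokens = [number(7.0)].

% ...and they compose with the existing word and space tokens:
?- tokenize("hi 7.0 x", Tokens).
Tokens = [word(hi), spc(' '), number(7.0), spc(' '), word(x)].

% untokenize/2 turns number tokens back into codes:
?- untokenize([word(hi), number(6.3)], Codes), atom_codes(Atom, Codes).
Atom = 'hi6.3'.
```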
From b52a48a8c772b0abcc848da468d9278e0725c946 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 22:05:51 -0400 Subject: [PATCH 11/25] Add tokenization of strings Closes #9 Signed-off-by: Shon Feder --- prolog/tokenize.pl | 26 +++++++++++++++++++++++++- test/test.pl | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+), 1 deletion(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index b282bce..5d1d053 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -273,16 +273,18 @@ % NOTE for debugging % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. +token(string(S)) --> string(S). token(number(N)) --> number(N), !. + token(word(W)) --> word(W), eos, !. token(word(W)),` ` --> word(W), ` `. token(word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). + token(spc(S)) --> spc(S). token(punct(P)) --> punct(P). token(cntrl(C)) --> cntrl(C). token(other(O)) --> nasciis(O). - spc(` `) --> ` `. sep --> ' '. @@ -290,6 +292,27 @@ word(W) --> csyms(W). +% TODO Make strings optional +% TODO Make open and close brackets configurable +string(S) --> string(`"`, `"`, S). +string(OpenBracket, CloseBracket, S) --> string_start(OpenBracket, CloseBracket, S). + +% A string starts when we encounter an OpenBracket +string_start(OpenBracket, CloseBracket, Cs) --> + OpenBracket, string_content(CloseBracket, Cs). + +% String content is everything up until we hit a CloseBracket +string_content(CloseBracket, []) --> CloseBracket, !. +% String content includes any character that isn't a CloseBracket or an escape. +string_content(CloseBracket, [C|Cs]) --> + [C], + {[C] \= CloseBracket, [C] \= `\\`}, + string_content(CloseBracket, Cs). +% String content includes any character following an escape, but not the escape +string_content(CloseBracket, [C|Cs]) --> + escape, [C], + string_content(CloseBracket, Cs). + csyms([L]) --> csym(L). csyms([L|Ls]) --> csym(L), csyms(Ls). @@ -306,6 +329,7 @@ ' ' --> space. ' ' --> space, ' '. +escape --> `\\`. % Any ... --> []. diff --git a/test/test.pl b/test/test.pl index f7f281f..a4ac2f2 100755 --- a/test/test.pl +++ b/test/test.pl @@ -24,6 +24,8 @@ Expected = "Goodbye, Tokenize!". +% NUMBERS + test('tokenize 7.0', [true(Actual == Expected)] ) :- @@ -50,4 +52,37 @@ Expected = `hi6.3`. +% STRINGS + +test('Extracts a string', + [true(Actual == Expected)] + ) :- + tokenize("\"a string\"", Actual), + Expected = [string('a string')]. + +test('Extracts a string among other stuff', + [true(Actual == Expected)] + ) :- + tokenize("Some other \"a string\" stuff", Actual), + Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. + +test("Extracts a string that includes escaped brackets", + [true(Actual == Expected)] + ) :- + tokenize(`"a \\"string\\""`, Actual), + Expected = [string('a "string"')]. + +test("Extracts a string that includes a doubly nested string", + [true(Actual == Expected)] + ) :- + tokenize(`"a \\"sub \\\\\\"string\\\\\\"\\""`, Actual), + Expected = [string('a "sub \\"string\\""')]. + +test("Untokenizes string things", + [true(Actual == Expected)] + ) :- + untokenize([string('some string')], ActualCodes), + string_codes(Actual, ActualCodes), + Expected = "\"some string\"". + :- end_tests(tokenize). From 5935798eaee0345e00ee11200f8d98153fa545d4 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 22:11:51 -0400 Subject: [PATCH 12/25] Remove tabs Yuck. How did tabs get in here! 
Signed-off-by: Shon Feder --- prolog/tokenize.pl | 44 ++++++++++++++++++++++---------------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 5d1d053..b0ee5d6 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -48,23 +48,23 @@ % % A token is one of: % -% * a word (contiguous alpha-numeric chars): `word(W)` -% * a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` -% * a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` -% * a space ( == ` `): `spc(S)`. +%* a word (contiguous alpha-numeric chars): `word(W)` +%* a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` +%* a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` +%* a space ( == ` `): `spc(S)`. % % Valid options are: % -% * cased(+bool) : Determines whether tokens perserve cases of the source text. -% * spaces(+bool) : Determines whether spaces are represted as tokens +%* cased(+bool) : Determines whether tokens perserve cases of the source text. +%* spaces(+bool) : Determines whether spaces are represted as tokens % or discarded. -% * cntrl(+bool) : Determines whether control characters are represented +%* cntrl(+bool) : Determines whether control characters are represented % as tokens or discarded. -% * punct(+bool) : Determines whether punctuation characters are represented +%* punct(+bool) : Determines whether punctuation characters are represented % as tokens or discarded. -% * to(+one_of([strings,atoms,chars,codes])) : Determines the +%* to(+one_of([strings,atoms,chars,codes])) : Determines the % representation format used for the tokens. -% * pack(+bool) : Determines whether tokens are packed or repeated. +%* pack(+bool) : Determines whether tokens are packed or repeated. % TODO is it possible to achieve the proper semidet without the cut? % Annie sez some parses are ambiguous, not even sure the cut should be @@ -190,9 +190,9 @@ % Dispatching the option pipeline options: - /*************************** - * PREPROCESSING * - ***************************/ +/*************************** +* PREPROCESSING * +***************************/ opt_cased(true) --> []. opt_cased(false) --> state(Text, LowerCodes), @@ -203,9 +203,9 @@ }. - /*************************** - * POSTPROCESSING * - ***************************/ +/*************************** +* POSTPROCESSING * +***************************/ opt_spaces(true) --> []. opt_spaces(false) --> state(T0, T1), @@ -239,9 +239,9 @@ call_into_term(Conversion, Token, Converted). - /*********************************** - * POSTPROCESSING HELPERS * - ***********************************/ +/*********************************** +* POSTPROCESSING HELPERS * +***********************************/ % Packing repeating tokens pack_tokens([T]) --> pack_token(T). @@ -257,9 +257,9 @@ pack(X, NewCount, Total). - /************************** - * TOKENIZATION * - **************************/ +/************************** +* TOKENIZATION * +**************************/ tokenize_text --> state(Text, Tokenized), { phrase(tokens(Tokenized), Text) }. 
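Patch 12 is whitespace-only, so the observable behaviour at this point is that of patch 11: a double-quoted span comes back as a single `string/1` token, and backslash-escaped quotes are folded into it. The queries below are the new tests restated in (abridged) toplevel form:

```prolog
% A quoted span becomes one token alongside the usual word and space tokens:
?- tokenize(`Some other "a string" stuff`, Tokens).
Tokens = [word(some), spc(' '), word(other), spc(' '), string('a string'), spc(' '), word(stuff)].

% Escaped quotes do not terminate the string token:
?- tokenize(`"a \\"string\\""`, Tokens).
Tokens = [string('a "string"')].
```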
From f8d6db7e2a1448ba33cb0af3a1c4b1379b4c88c1 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 15 May 2019 22:18:24 -0400 Subject: [PATCH 13/25] Fix indentation of comment bullet points Signed-off-by: Shon Feder --- prolog/tokenize.pl | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index b0ee5d6..0299666 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -48,23 +48,24 @@ % % A token is one of: % -%* a word (contiguous alpha-numeric chars): `word(W)` -%* a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` -%* a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` -%* a space ( == ` `): `spc(S)`. +% * a word (contiguous alpha-numeric chars): `word(W)` +% * a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` +% * a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` +% * a space ( == ` `): `spc(S)`. % -% Valid options are: +% Valid options are: % -%* cased(+bool) : Determines whether tokens perserve cases of the source text. -%* spaces(+bool) : Determines whether spaces are represted as tokens -% or discarded. -%* cntrl(+bool) : Determines whether control characters are represented -% as tokens or discarded. -%* punct(+bool) : Determines whether punctuation characters are represented -% as tokens or discarded. -%* to(+one_of([strings,atoms,chars,codes])) : Determines the -% representation format used for the tokens. -%* pack(+bool) : Determines whether tokens are packed or repeated. +% * cased(+bool) : Determines whether tokens perserve cases of the source +% text. +% * spaces(+bool) : Determines whether spaces are represted as tokens or +% discarded. +% * cntrl(+bool) : Determines whether control characters are represented as +% tokens or discarded. +% * punct(+bool) : Determines whether punctuation characters are represented +% as tokens or discarded. +% * pack(+bool) : Determines whether tokens are packed or repeated. +% * to(+one_of([strings,atoms,chars,codes])) : Determines the representation +% format used for the tokens. % TODO is it possible to achieve the proper semidet without the cut? % Annie sez some parses are ambiguous, not even sure the cut should be From 9091f096a2f70e26b84d59f17098528e2b6dd56b Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 19 May 2019 17:21:47 -0400 Subject: [PATCH 14/25] Catch edge cases and preserve escaped characters in strings Thanks to @itampe for catching these in review. Signed-off-by: Shon Feder --- prolog/tokenize.pl | 19 ++++++++++--------- test/test.pl | 32 +++++++++++++++++++++++++++++++- 2 files changed, 41 insertions(+), 10 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 0299666..260d8e7 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -300,19 +300,20 @@ % A string starts when we encounter an OpenBracket string_start(OpenBracket, CloseBracket, Cs) --> - OpenBracket, string_content(CloseBracket, Cs). + OpenBracket, string_content(OpenBracket, CloseBracket, Cs). % String content is everything up until we hit a CloseBracket -string_content(CloseBracket, []) --> CloseBracket, !. +string_content(_OpenBracket, CloseBracket, []) --> CloseBracket, !. +% String content includes a bracket following an escape, but not the escape +string_content(OpenBracket, CloseBracket, [C|Cs]) --> + escape, (CloseBracket | OpenBracket), + {[C] = CloseBracket}, + string_content(OpenBracket, CloseBracket, Cs). 
% String content includes any character that isn't a CloseBracket or an escape. -string_content(CloseBracket, [C|Cs]) --> +string_content(OpenBracket, CloseBracket, [C|Cs]) --> [C], - {[C] \= CloseBracket, [C] \= `\\`}, - string_content(CloseBracket, Cs). -% String content includes any character following an escape, but not the escape -string_content(CloseBracket, [C|Cs]) --> - escape, [C], - string_content(CloseBracket, Cs). + {[C] \= CloseBracket}, + string_content(OpenBracket, CloseBracket, Cs). csyms([L]) --> csym(L). csyms([L|Ls]) --> csym(L), csyms(Ls). diff --git a/test/test.pl b/test/test.pl index a4ac2f2..b857cfb 100755 --- a/test/test.pl +++ b/test/test.pl @@ -54,6 +54,30 @@ % STRINGS +test('Tokenizing the empty strings', + [true(Actual == Expected)] + ) :- + tokenize(`""`, Actual), + Expected = [string('')]. + +test('Untokenizing an empty string', + [true(Actual == Expected)] + ) :- + untokenize([string('')], Actual), + Expected = `""`. + +test('Tokenizing a string with just two escapes', + [true(Actual == Expected)] + ) :- + tokenize(`"\\\\"`, Actual), + Expected = [string('\\\\')]. + +test('Untokenizing a string with just two characters', + [true(Actual == Expected)] + ) :- + untokenize([string('aa')], Actual), + Expected = `"aa"`. + test('Extracts a string', [true(Actual == Expected)] ) :- @@ -72,10 +96,16 @@ tokenize(`"a \\"string\\""`, Actual), Expected = [string('a "string"')]. +test("Tokenization preserves escaped characters", + [true(Actual == Expected)] + ) :- + tokenize(`"\\tLine text\\n"`, Actual), + Expected = [string('\\tline text\\n')]. + test("Extracts a string that includes a doubly nested string", [true(Actual == Expected)] ) :- - tokenize(`"a \\"sub \\\\\\"string\\\\\\"\\""`, Actual), + tokenize(`"a \\"sub \\\\"string\\\\"\\""`, Actual), Expected = [string('a "sub \\"string\\""')]. test("Untokenizes string things", From 53c83c03a0e71ec5a3bdf5c2c5eae762b0b4c8e4 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sun, 19 May 2019 18:48:15 -0400 Subject: [PATCH 15/25] Use code lists consistently for readability in tests Signed-off-by: Shon Feder --- test/test.pl | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/test/test.pl b/test/test.pl index b857cfb..1519620 100755 --- a/test/test.pl +++ b/test/test.pl @@ -81,13 +81,13 @@ test('Extracts a string', [true(Actual == Expected)] ) :- - tokenize("\"a string\"", Actual), + tokenize(`"a string"`, Actual), Expected = [string('a string')]. test('Extracts a string among other stuff', [true(Actual == Expected)] ) :- - tokenize("Some other \"a string\" stuff", Actual), + tokenize(`Some other "a string" stuff`, Actual), Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. test("Extracts a string that includes escaped brackets", @@ -111,8 +111,7 @@ test("Untokenizes string things", [true(Actual == Expected)] ) :- - untokenize([string('some string')], ActualCodes), - string_codes(Actual, ActualCodes), - Expected = "\"some string\"". + untokenize([string('some string')], Actual), + Expected = `"some string"`. :- end_tests(tokenize). 
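After the escape-handling fixes of patch 14 (and the cosmetic switch to code-list literals in patch 15), the contract for string tokens is: the enclosing quotes are dropped, escaped quotes lose their backslash, and any other escape sequence is preserved verbatim. Two of the tests above, restated as abridged toplevel queries:

```prolog
% Escapes other than the quote are kept as-is inside the token
% (the default cased(false) also lower-cases the letters):
?- tokenize(`"\\tLine text\\n"`, Tokens).
Tokens = [string('\\tline text\\n')].

% untokenize/2 puts the surrounding quotes back:
?- untokenize([string('some string')], Codes), format("~s~n", [Codes]).
"some string"
```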
From 1e4a002d6e6fadc3346b0f8388604c19397992f8 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 19 Jun 2019 07:51:03 -0400 Subject: [PATCH 16/25] Add CircleCI badge to README --- README.md | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index b8c0b73..b033847 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,11 @@ -# Synopsis +# `pack(tokenize)` + +A modest tokenization library for SWI-Prolog, seeking a balance between +simplicity and flexibility. + +[![CircleCI](https://circleci.com/gh/shonfeder/tokenize.svg?style=svg)](https://circleci.com/gh/shonfeder/tokenize) + +## Synopsis ```prolog ?- tokenize(`\tExample Text.`, Tokens). @@ -13,7 +20,7 @@ Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.') Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] ``` -# Description +## Description Module `tokenize` aims to provide a straightforward tool for tokenizing text into a simple format. It is the result of a learning exercise, and it is far from perfect. If there is sufficient interest from myself or anyone else, I'll try to improve it. @@ -25,6 +32,6 @@ It is packaged as an SWI-Prolog pack, available [here](http://www.swi-prolog.org Please [visit the wiki](https://github.com/aBathologist/tokenize/wiki/tokenize.pl-options-and-examples) for more detailed instructions and examples, including a full list of options supported. -# Contributing +## Contributing See [CONTRIBUTING.md](./CONTRIBUTING.md). From 4a72e6089044d722bea0e67ce951db7398f1988a Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Wed, 19 Jun 2019 08:01:31 -0400 Subject: [PATCH 17/25] Tweak README title --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b033847..79f0e61 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# `pack(tokenize)` +# `pack(tokenize) :-` A modest tokenization library for SWI-Prolog, seeking a balance between simplicity and flexibility. From 645b9d7542f86db88598a98a4d84aedeee47fff3 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 08:17:50 -0400 Subject: [PATCH 18/25] Use conventional option processing The record-based approach used here is endorsed in https://eu.swi-prolog.org/pldoc/man?section=option --- prolog/tokenize.pl | 162 ++++++++++++++-------------------------- prolog/tokenize_opts.pl | 32 ++++++++ test/test.pl | 19 +++++ 3 files changed, 107 insertions(+), 106 deletions(-) create mode 100644 prolog/tokenize_opts.pl diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index 260d8e7..a03d06f 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -24,9 +24,11 @@ text. */ + :- use_module(library(dcg/basics), [eos//0, number//1]). +:- use_module(tokenize_opts). -% Ensure we interpret backs as enclosing code lists in this module. +% Ensure we interpret back ticks as enclosing code lists in this module. :- set_prolog_flag(back_quotes, codes). %% tokenize(+Text:list(code), -Tokens:list(term)) is semidet. @@ -67,14 +69,17 @@ % * to(+one_of([strings,atoms,chars,codes])) : Determines the representation % format used for the tokens. -% TODO is it possible to achieve the proper semidet without the cut? +% TODO is it possible to achieve the proper semidet without the cut? 
% Annie sez some parses are ambiguous, not even sure the cut should be % there -tokenize(Text, Tokens, Options) :- +tokenize(Text, ProcessedTokens, Options) :- must_be(nonvar, Text), string_codes(Text, Codes), - phrase(process_options, [Options-Codes], [Options-Tokens]), + process_options(Options, PreOpts, PostOpts), + preprocess(PreOpts, Codes, ProcessedCodes), + phrase(tokens(Tokens), ProcessedCodes), + postprocess(PostOpts, Tokens, ProcessedTokens), !. %% untokenize(+Tokens:list(term), -Untokens:list(codes)) is semidet. @@ -123,111 +128,59 @@ read_file_to_codes(File, Codes, [encoding(utf8)]), tokenize(Codes, Tokens, Options). -% PROCESSING OPTIONS -% -% NOTE: This way of processing options is probably stupid. -% I will correct/improve/rewrite it if there is ever a good -% reason to. But for now, it works. -% -% TODO: Throw exception if invalid options are passed in. -% At the moment it just fails. - -%% Dispatches dcgs by option-list functors, with default values. -process_options --> - % Preprocessing - opt(cased, false), - % Tokenization - non_opt(tokenize_text), - % Postprocessing - opt(spaces, true), - opt(cntrl, true), - opt(punct, true), - opt(to, atoms), - opt(pack, false). - -%% opt(+OptionFunctor:atom, DefaultValue:nonvar) -% -% If dcg functor is identical to the option name with 'opt_' prefixed, -% then the dcg functor can be omitted. -% -% - -opt(Opt, Default) --> - { atom_concat('opt_', Opt, Opt_DCG) }, - opt(Opt, Default, Opt_DCG). - -%% opt(+OptionFunctor:atom, +DefaultValue:nonvar, +DCGFunctor:atom). -opt(Opt, Default, DCG) --> - state(Opts-Text0, Text0), - { - pad(Opt, Selection, Opt_Selection), - option(Opt_Selection, Opts, Default), - DCG_Selection =.. [DCG, Selection] - }, - DCG_Selection, - state(Text1, Opts-Text1). -%% This ugly bit should be dispensed with... -opt(Opt, Default, _) --> - state(Opts-_), - { - var(Default), \+ option(Opt, Opts), - writeln("Unknown options passed to opt//3: "), - write(Opt) - }. % TODO use print_message for this - -%% non_opt(+DCG). -% -% Non optional dcg to dispatch. Passes the object of concern -% without the options list, then recovers option list. - -non_opt(DCG) --> - state(Opts-Text0, Text0), - DCG, - state(Text1, Opts-Text1). - -state(S0), [S0] --> [S0]. -state(S0, S1), [S1] --> [S0]. - - -% Dispatching the option pipeline options: - -/*************************** -* PREPROCESSING * -***************************/ - -opt_cased(true) --> []. -opt_cased(false) --> state(Text, LowerCodes), - { - text_to_string(Text, Str), - string_lower(Str, LowerStr), - string_codes(LowerStr, LowerCodes) - }. +/*********************************** +* {PRE,POST}-PROCESSING HELPERS * +***********************************/ -/*************************** -* POSTPROCESSING * -***************************/ +preprocess(PreOpts, Codes, ProcessedCodes) :- + preopts_data(cased, PreOpts, Cased), + DCG_Rules = ( + preprocess_case(Cased) + ), + phrase(process_dcg_rules(DCG_Rules, ProcessedCodes), Codes). + +postprocess(PostOpts, Tokens, ProcessedTokens) :- + postopts_data(spaces, PostOpts, Spaces), + postopts_data(cntrl, PostOpts, Cntrl), + postopts_data(punct, PostOpts, Punct), + postopts_data(to, PostOpts, To), + postopts_data(pack, PostOpts, Pack), + DCG_Rules = ( + keep_token(space(_), Spaces), + keep_token(cntrl(_), Cntrl), + keep_token(punct(_), Punct), + convert_token(To) + ), + phrase(process_dcg_rules(DCG_Rules, PrePackedTokens), Tokens), + (Pack + -> phrase(pack_tokens(ProcessedTokens), PrePackedTokens) + ; ProcessedTokens = PrePackedTokens + ). 
-opt_spaces(true) --> []. -opt_spaces(false) --> state(T0, T1), - { exclude( =(spc(_)), T0, T1) }. -opt_cntrl(true) --> []. -opt_cntrl(false) --> state(T0, T1), - { exclude( =(cntrl(_)), T0, T1) }. +/*********************************** +* POSTPROCESSING HELPERS * +***********************************/ -opt_punct(true) --> []. -opt_punct(false) --> state(T0, T1), - { exclude( =(punct(_)), T0, T1) }. +% Process a stream through a pipeline of DCG rules +process_dcg_rules(_, []) --> eos, !. +process_dcg_rules(DCG_Rules, []) --> DCG_Rules, eos, !. +process_dcg_rules(DCG_Rules, [C|Cs]) --> + DCG_Rules, + [C], + process_dcg_rules(DCG_Rules, Cs). -opt_to(codes) --> []. -opt_to(Type) --> state(CodeTokens, Tokens), - { maplist(token_to(Type), CodeTokens, Tokens) }. +preprocess_case(true), [C] --> [C]. +preprocess_case(false), [CodeOut] --> [CodeIn], + { to_lower(CodeIn, CodeOut) }. -opt_pack(false) --> []. -opt_pack(true) --> state(T0, T1), - { phrase(pack_tokens(T1), T0) }. +keep_token(_, true), [T] --> [T]. +keep_token(Token, false) --> [Token]. +keep_token(Token, false), [T] --> [T], {T \= Token}. +convert_token(Type), [Converted] --> [Token], + {token_to(Type, Token, Converted)}. % Convert tokens to alternative representations. token_to(_, number(X), number(X)) :- !. @@ -239,11 +192,6 @@ ), call_into_term(Conversion, Token, Converted). - -/*********************************** -* POSTPROCESSING HELPERS * -***********************************/ - % Packing repeating tokens pack_tokens([T]) --> pack_token(T). pack_tokens([T|Ts]) --> pack_token(T), pack_tokens(Ts). @@ -275,6 +223,8 @@ % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. token(string(S)) --> string(S). + +% TODO Make numbers optional token(number(N)) --> number(N), !. token(word(W)) --> word(W), eos, !. diff --git a/prolog/tokenize_opts.pl b/prolog/tokenize_opts.pl new file mode 100644 index 0000000..b1e8c06 --- /dev/null +++ b/prolog/tokenize_opts.pl @@ -0,0 +1,32 @@ +:- module(tokenize_opts, + [process_options/3, + preopts_data/3, + postopts_data/3]). + +:- use_module(library(record)). + +% pre-processing options +:- record preopts( + cased:boolean=false + ). + +% post-processing options +:- record postopts( + spaces:boolean=true, + cntrl:boolean=true, + punct:boolean=true, + to:oneof([strings,atoms,chars,codes])=atoms, + pack:boolean=false + ). + +%% process_options(+Options:list(term), -PreOpts:term, -PostOpts:term) is semidet. +% +process_options(Options, PreOpts, PostOpts) :- + make_preopts(Options, PreOpts, Rest), + make_postopts(Rest, PostOpts, InvalidOptions), + throw_on_invalid_options(InvalidOptions). + +throw_on_invalid_options(InvalidOptions) :- + InvalidOptions \= [] + -> throw(invalid_options_given(InvalidOptions)) + ; true. diff --git a/test/test.pl b/test/test.pl index 1519620..405cab7 100755 --- a/test/test.pl +++ b/test/test.pl @@ -7,6 +7,8 @@ asserta(user:file_search_path(package, PackageDir)). :- use_module(package(tokenize)). +:- use_module(package(tokenize_opts)). + :- begin_tests(tokenize). test('Hello, Tokenize!', @@ -24,6 +26,23 @@ Expected = "Goodbye, Tokenize!". +% OPTION PROCESSING + +test('process_options/3 throws on invalid options') :- + catch( + process_options([invalid(true)], _, _), + invalid_options_given([invalid(true)]), + true + ). 
+ +test('process_options/3 sets valid options in opt records') :- + Options = [cased(false), spaces(false)], + process_options(Options, PreOpts, PostOpts), + preopts_data(cased, PreOpts, Cased), + postopts_data(spaces, PostOpts, Spaces), + assertion(cased:Cased == cased:false), + assertion(spaces:Spaces == spaces:false). + % NUMBERS test('tokenize 7.0', From e6877a32f46ba2deccef841b382529e2adcb57ac Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 21:30:47 -0400 Subject: [PATCH 19/25] Make string and number tokens optional --- prolog/tokenize.pl | 32 +++++++++++++++++--------------- prolog/tokenize_opts.pl | 24 ++++++++++++++++-------- test/test.pl | 39 +++++++++++++++++++++++++++++---------- 3 files changed, 62 insertions(+), 33 deletions(-) diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index a03d06f..a92a7ac 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -76,9 +76,9 @@ tokenize(Text, ProcessedTokens, Options) :- must_be(nonvar, Text), string_codes(Text, Codes), - process_options(Options, PreOpts, PostOpts), + process_options(Options, PreOpts, TokenOpts, PostOpts), preprocess(PreOpts, Codes, ProcessedCodes), - phrase(tokens(Tokens), ProcessedCodes), + phrase(tokens(TokenOpts, Tokens), ProcessedCodes), postprocess(PostOpts, Tokens, ProcessedTokens), !. @@ -216,25 +216,28 @@ % PARSING -tokens([T]) --> token(T), eos, !. -tokens([T|Ts]) --> token(T), tokens(Ts). +tokens(Opts, [T]) --> token(Opts, T), eos, !. +tokens(Opts, [T|Ts]) --> token(Opts, T), tokens(Opts, Ts). % NOTE for debugging % tokens(_) --> {length(L, 200)}, L, {format(L)}, halt, !. -token(string(S)) --> string(S). +token(Opts, string(S)) --> + { tokenopts_data(strings, Opts, true) }, + string(S). -% TODO Make numbers optional -token(number(N)) --> number(N), !. +token(Opts, number(N)) --> + { tokenopts_data(numbers, Opts, true) }, + number(N), !. -token(word(W)) --> word(W), eos, !. -token(word(W)),` ` --> word(W), ` `. -token(word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). +token(_Opts, word(W)) --> word(W), eos, !. +token(_Opts, word(W)),` ` --> word(W), ` `. +token(_Opts, word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). -token(spc(S)) --> spc(S). -token(punct(P)) --> punct(P). -token(cntrl(C)) --> cntrl(C). -token(other(O)) --> nasciis(O). +token(_Opts, spc(S)) --> spc(S). +token(_Opts, punct(P)) --> punct(P). +token(_Opts, cntrl(C)) --> cntrl(C). +token(_Opts, other(O)) --> nasciis(O). spc(` `) --> ` `. @@ -243,7 +246,6 @@ word(W) --> csyms(W). -% TODO Make strings optional % TODO Make open and close brackets configurable string(S) --> string(`"`, `"`, S). string(OpenBracket, CloseBracket, S) --> string_start(OpenBracket, CloseBracket, S). diff --git a/prolog/tokenize_opts.pl b/prolog/tokenize_opts.pl index b1e8c06..688077e 100644 --- a/prolog/tokenize_opts.pl +++ b/prolog/tokenize_opts.pl @@ -1,6 +1,7 @@ :- module(tokenize_opts, - [process_options/3, + [process_options/4, preopts_data/3, + tokenopts_data/3, postopts_data/3]). :- use_module(library(record)). @@ -10,6 +11,12 @@ cased:boolean=false ). +% tokenization options +:- record tokenopts( + numbers:boolean=true, + strings:boolean=true + ). + % post-processing options :- record postopts( spaces:boolean=true, @@ -21,12 +28,13 @@ %% process_options(+Options:list(term), -PreOpts:term, -PostOpts:term) is semidet. % -process_options(Options, PreOpts, PostOpts) :- - make_preopts(Options, PreOpts, Rest), - make_postopts(Rest, PostOpts, InvalidOptions), - throw_on_invalid_options(InvalidOptions). 
+process_options(Options, PreOpts, TokenOpts, PostOpts) :- + make_preopts(Options, PreOpts, Rest0), + make_postopts(Rest0, PostOpts, Rest1), + make_tokenopts(Rest1, TokenOpts, InvalidOpts), + throw_on_invalid_options(InvalidOpts). -throw_on_invalid_options(InvalidOptions) :- - InvalidOptions \= [] - -> throw(invalid_options_given(InvalidOptions)) +throw_on_invalid_options(InvalidOpts) :- + InvalidOpts \= [] + -> throw(invalid_options_given(InvalidOpts)) ; true. diff --git a/test/test.pl b/test/test.pl index 405cab7..7d58dc1 100755 --- a/test/test.pl +++ b/test/test.pl @@ -28,19 +28,27 @@ % OPTION PROCESSING -test('process_options/3 throws on invalid options') :- +test('process_options/4 throws on invalid options') :- catch( - process_options([invalid(true)], _, _), + process_options([invalid(true)], _, _, _), invalid_options_given([invalid(true)]), true ). -test('process_options/3 sets valid options in opt records') :- - Options = [cased(false), spaces(false)], - process_options(Options, PreOpts, PostOpts), +test('process_options/4 sets valid options in opt records') :- + Options = [ + cased(false), % non-default preopt + strings(false), % non-default tokenopt + spaces(false) % non-default postopt + ], + process_options(Options, PreOpts, TokenOpts, PostOpts), + % Fetch the options that were set preopts_data(cased, PreOpts, Cased), + tokenopts_data(strings, TokenOpts, Strings), postopts_data(spaces, PostOpts, Spaces), + % These compounds are just ensure informative output on failure assertion(cased:Cased == cased:false), + assertion(strings:Strings == strings:false), assertion(spaces:Spaces == spaces:false). % NUMBERS @@ -57,7 +65,6 @@ untokenize([number(6.3)], Actual), Expected = `6.3`. - test('tokenize number in other stuff', [true(Actual == Expected)] ) :- @@ -70,6 +77,12 @@ untokenize([word(hi), number(6.3)], Actual), Expected = `hi6.3`. +test('can disable number tokens', + [true(Actual == Expected)] + ) :- + tokenize("hi 7.0 x", Actual, [numbers(false)]), + Expected = [word(hi), spc(' '), word('7'), punct('.'), word('0'), spc(' '), word(x)]. + % STRINGS @@ -109,25 +122,31 @@ tokenize(`Some other "a string" stuff`, Actual), Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. -test("Extracts a string that includes escaped brackets", +test('Extracts a string that includes escaped brackets', [true(Actual == Expected)] ) :- tokenize(`"a \\"string\\""`, Actual), Expected = [string('a "string"')]. -test("Tokenization preserves escaped characters", +test('Tokenization preserves escaped characters', [true(Actual == Expected)] ) :- tokenize(`"\\tLine text\\n"`, Actual), Expected = [string('\\tline text\\n')]. -test("Extracts a string that includes a doubly nested string", +test('Extracts a string that includes a doubly nested string', [true(Actual == Expected)] ) :- tokenize(`"a \\"sub \\\\"string\\\\"\\""`, Actual), Expected = [string('a "sub \\"string\\""')]. -test("Untokenizes string things", +test('can disable string tokens', + [true(Actual == Expected)] + ) :- + tokenize(`some "string".`, Actual, [numbers(false)]), + Expected = [word(some), spc(' '), string(string), punct('.')]. 
+ +test('Untokenizes string things', [true(Actual == Expected)] ) :- untokenize([string('some string')], Actual), From 3349b9b666c47fe5ea41e28d9c91ba80854a9123 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 22:39:39 -0400 Subject: [PATCH 20/25] Rename 'spc' token to 'space' --- CONTRIBUTING.md | 4 ++-- README.md | 6 +++--- design_notes.md | 2 +- prolog/tokenize.pl | 6 +++--- test/test.pl | 12 ++++++------ 5 files changed, 15 insertions(+), 15 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1895faa..d1ae63f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -42,7 +42,7 @@ design philosophy and principles. % experiment ?- tokenize("Foo bar baz", Tokens). - Tokens = [word(foo), spc(' '), word(bar), spc(' '), word(baz)]. + Tokens = [word(foo), space(' '), word(bar), space(' '), word(baz)]. % reload the module when you make changes to the source code ?- make. @@ -75,4 +75,4 @@ If inside the swipl repl, make sure to load the test file and query run_tests. % PL-Unit: tokenize .. done % All 2 tests passed true. -``` \ No newline at end of file +``` diff --git a/README.md b/README.md index 79f0e61..47ac380 100644 --- a/README.md +++ b/README.md @@ -9,14 +9,14 @@ simplicity and flexibility. ```prolog ?- tokenize(`\tExample Text.`, Tokens). -Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] +Tokens = [cntrl('\t'), word(example), space(' '), space(' '), word(text), punct('.')] ?- tokenize(`\tExample Text.`, Tokens, [cntrl(false), pack(true), cased(true)]). -Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] +Tokens = [word('Example', 1), space(' ', 2), word('Text', 1), punct('.', 1)] ?- tokenize(`\tExample Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]). example text. -Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')], +Tokens = [cntrl('\t'), word(example), space(' '), space(' '), word(text), punct('.')], Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] ``` diff --git a/design_notes.md b/design_notes.md index e2be8ed..e84fade 100644 --- a/design_notes.md +++ b/design_notes.md @@ -35,7 +35,7 @@ this result through a subsequent lexing pass. * We don't parse. * Every token generated is callable (i.e., an atom or compound). - * Example of an possible compound token: `spc(' ')`. + * Example of an possible compound token: `space(' ')`. * Example of a possible atom token: `escape`. tokenization need to return tokens represented with the same arity) * Users should be able to determine the kind of token by unification. diff --git a/prolog/tokenize.pl b/prolog/tokenize.pl index a92a7ac..6923d64 100644 --- a/prolog/tokenize.pl +++ b/prolog/tokenize.pl @@ -53,7 +53,7 @@ % * a word (contiguous alpha-numeric chars): `word(W)` % * a punctuation mark (determined by `char_type(C, punct)`): `punct(P)` % * a control character (determined by `char_typ(C, cntrl)`): `cntrl(C)` -% * a space ( == ` `): `spc(S)`. +% * a space ( == ` `): `space(S)`. % % Valid options are: % @@ -234,12 +234,12 @@ token(_Opts, word(W)),` ` --> word(W), ` `. token(_Opts, word(W)), C --> word(W), (punct(C) ; cntrl(C) ; nasciis(C)). -token(_Opts, spc(S)) --> spc(S). +token(_Opts, space(S)) --> space(S). token(_Opts, punct(P)) --> punct(P). token(_Opts, cntrl(C)) --> cntrl(C). token(_Opts, other(O)) --> nasciis(O). -spc(` `) --> ` `. +space(` `) --> ` `. sep --> ' '. sep --> eos, !. 
diff --git a/test/test.pl b/test/test.pl index 7d58dc1..9e17e36 100755 --- a/test/test.pl +++ b/test/test.pl @@ -15,12 +15,12 @@ [true(Actual == Expected)] ) :- tokenize("Hello, Tokenize!", Actual), - Expected = [word(hello),punct(','),spc(' '),word(tokenize),punct(!)]. + Expected = [word(hello),punct(','),space(' '),word(tokenize),punct(!)]. test('Goodbye, Tokenize!', [true(Actual == Expected)] ) :- - Tokens = [word('Goodbye'),punct(','),spc(' '),word('Tokenize'),punct('!')], + Tokens = [word('Goodbye'),punct(','),space(' '),word('Tokenize'),punct('!')], untokenize(Tokens, Codes), string_codes(Actual, Codes), Expected = "Goodbye, Tokenize!". @@ -69,7 +69,7 @@ [true(Actual == Expected)] ) :- tokenize("hi 7.0 x", Actual), - Expected = [word(hi), spc(' '), number(7.0), spc(' '), word(x)]. + Expected = [word(hi), space(' '), number(7.0), space(' '), word(x)]. test('untokenize 6.3 in other stuff', [true(Actual == Expected)] @@ -81,7 +81,7 @@ [true(Actual == Expected)] ) :- tokenize("hi 7.0 x", Actual, [numbers(false)]), - Expected = [word(hi), spc(' '), word('7'), punct('.'), word('0'), spc(' '), word(x)]. + Expected = [word(hi), space(' '), word('7'), punct('.'), word('0'), space(' '), word(x)]. % STRINGS @@ -120,7 +120,7 @@ [true(Actual == Expected)] ) :- tokenize(`Some other "a string" stuff`, Actual), - Expected = [word(some),spc(' '),word(other),spc(' '),string('a string'),spc(' '),word(stuff)]. + Expected = [word(some),space(' '),word(other),space(' '),string('a string'),space(' '),word(stuff)]. test('Extracts a string that includes escaped brackets', [true(Actual == Expected)] @@ -144,7 +144,7 @@ [true(Actual == Expected)] ) :- tokenize(`some "string".`, Actual, [numbers(false)]), - Expected = [word(some), spc(' '), string(string), punct('.')]. + Expected = [word(some), space(' '), string(string), punct('.')]. test('Untokenizes string things', [true(Actual == Expected)] From 5135c88ca5ebc93f78ddd0979ae4bf29fe4aa873 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Fri, 21 Jun 2019 22:49:55 -0400 Subject: [PATCH 21/25] Add a changelog --- CHANGELOG.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 CHANGELOG.md diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..0c58002 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,23 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +The format is based on [Keep a Changelog][keep-a-change-log], and this project +adheres to [Semantic Versioning][semantic-versioning]. + +[keep-a-change-log]: https://keepachangelog.com/en/1.0.0/ +[semantic-versioning]: https://semver.org/spec/v2.0.0.html + +## [Unreleased] + +### Added + +- Support for numbers by [@Annipoo](https://github.com/Anniepoo) #34 +- Support for strings #37 +- Code of Conduct #23 + +### Changed + +- Spaces are now tagged with `space` instead of `spc`. #41 +- Tokenization of numbers and strings is enabled by default. #40 +- Options are now processed by a more conventional means #39 From b32ea01b712b37ab58b9161c7155a5f8aa645a6d Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 22 Jun 2019 18:51:25 -0400 Subject: [PATCH 22/25] Update the pack's home page info --- pack.pl | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pack.pl b/pack.pl index 174019f..c7ecabf 100644 --- a/pack.pl +++ b/pack.pl @@ -1,10 +1,10 @@ name(tokenize). -title('A nascent tokenization library'). +title('A simple tokenization library'). version('0.1.2'). 
-download('https://github.com/aBathologist/tokenize/release/*.zip'). +download('https://github.com/shonfeder/tokenize/release/*.zip'). author('Shon Feder', 'shon.feder@gmail.com'). packager('Shon Feder', 'shon.feder@gmail.com'). maintainer('Shon Feder', 'shon.feder@gmail.com'). -home('https://github.com/aBathologist/tokenize'). +home('https://github.com/shonfeder/tokenize'). From 26ce2bbb7be15168ed071f5da0f428805b95bb5f Mon Sep 17 00:00:00 2001 From: itampe <50549914+itampe@users.noreply.github.com> Date: Sun, 23 Jun 2019 01:40:57 +0200 Subject: [PATCH 23/25] Cleanup, bug fixes, and tests for comment.pl (#36) * Refactored and simplified the code * Introduce cut's to not leave choice points and lead to execution runaway * Add example with kind of comment * use copy_term of start and end tag * pldoc compliance. * removed complexity with specidfic atom treatment of matchers. now they're just matchers --- Makefile | 1 - prolog/comment.pl | 179 ++++++++++++++++-------------------------- test/test_comments.pl | 104 ++++++++++++++++++++++++ 3 files changed, 173 insertions(+), 111 deletions(-) create mode 100644 test/test_comments.pl diff --git a/Makefile b/Makefile index d5ae1c5..044b64f 100644 --- a/Makefile +++ b/Makefile @@ -17,4 +17,3 @@ install: test: @$(SWIPL) -s test/test.pl -g 'run_tests,halt(0)' -t 'halt(1)' - \ No newline at end of file diff --git a/prolog/comment.pl b/prolog/comment.pl index 8e1a525..cea7fd6 100644 --- a/prolog/comment.pl +++ b/prolog/comment.pl @@ -1,44 +1,60 @@ -/* -module(tokenize(comment) - [comment/2, - comment_rec/2, - comment_token/2, - comment_token_rec/2]). +:- module(comment, + [comment//2, + comment_rec//2, + comment_token//3, + comment_token_rec//3]). + +/** Tokenizing comments +This module defines matchers for comments used by the tokenize module. (Note +that we will use matcher as a name for dcg rules that match parts of the codes +list). + +@author Stefan Israelsson Tampe +@license LGPL v2 or later + +Interface Note: +Start and End matchers is a matcher (dcg rule) that is either evaluated with no +extra argument (--> call(StartMatcher)) and it will just match it's token or it +can have an extra argument producing the codes matched by the matcher e.g. used +as --> call(StartMatcher,MatchedCodes). The matchers match start and end codes +of the comment, the 2matcher type will represent these kinds of dcg rules or +matchers 2 is because they support two kinds of arguments to the dcg rules. +For examples +see: + + @see tests/test_comments.pl + +The matchers predicates exported and defined are: + + comment(+Start:2matcher,+End:2matcher) + - anonymously match a non recursive comment + + comment_rec(+Start:2matcher,+End:2matcher,2matcher) + - anonymously match a recursive comment + + coment_token(+Start:2matcher,+End:2matcher,-Matched:list(codes)) + - match an unrecursive comment outputs the matched sequence used + for building a resulting comment token + + coment_token_rec(+Start:2matcher,+End:2matcher,-Matched:list(codes)) + - match an recursive comment outputs the matched sequence used + for building a resulting comment token */ -dcgtrue(U,U). -id([X|L]) --> [X],id(L). -id([]) --> dcgtrue. -id([X|L],[X|LL]) --> [X],id(L,LL). -id([],[]) --> dcgtrue. -tr(S,SS) :- - atom(S) -> - ( - atom_codes(S,C), - SS=id(C) - ); - SS=S. +%% comment(+Start:2matcher,+End:2matcher) +% non recursive non tokenizing matcher -eol --> {atom_codes('\n',E)},id(E). -eol(HS) --> {atom_codes('\n',E)},id(E,HS). - comment_body(E) --> call(E),!. comment_body(E) --> [_],comment_body(E). 
-comment_body(_) --> []. - + comment(S,E) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS), - comment_body(EE). + call(S), + comment_body(E). -line_comment(S) --> - {tr(S,SS)}, - comment_body(SS,eol). +%% comment_token(+Start:2matcher,+End:2matcher,-Matched:list(codes)) +% non recursive tokenizing matcher comment_body_token(E,Text) --> call(E,HE),!, @@ -48,57 +64,45 @@ [X], comment_body_token(E,L). -comment_body_token(_,[]) --> []. - comment_token(S,E,Text) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS,HS), + call(S,HS), {append(HS,T,Text)}, - comment_body_token(EE,T). + comment_body_token(E,T). -line_comment_token(S,Text) --> - {tr(S,SS)}, - comment_body_token(SS,eol,Text). +%% comment_token_rec(+Start:2matcher,+End:2matcher,-Matched:list(codes)) +% recursive tokenizing matcher -comment_body_rec_cont(S,E,Cont,HE,Text) --> - {append(HE,T,Text)}, - comment_body_token_rec(S,E,Cont,T). - -comment_body_rec_start(HE,Text) --> - {append(HE,[],Text)}. +% Use this as the initial continuation, will just tidy up the matched result +% by ending the list with []. +comment_body_rec_start(_,_,[]). comment_body_token_rec(_,E,Cont,Text) --> - call(E,HE), - call(Cont,HE,Text). + call(E,HE),!, + {append(HE,T,Text)}, + call(Cont,T). comment_body_token_rec(S,E,Cont,Text) --> - call(S,HS), + call(S,HS),!, {append(HS,T,Text)}, - comment_body_token_rec(S,E,comment_body_rec_cont(S,E,Cont),T). + comment_body_token_rec(S,E,comment_body_token_rec(S,E,Cont),T). comment_body_token_rec(S,E,Cont,[X|L]) --> [X], comment_body_token_rec(S,E,Cont,L). -comment_body_token_rec(_,_,_,_,_,[]) --> dcgtrue. - comment_token_rec(S,E,Text) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS,HS), + call(S,HS), {append(HS,T,Text)}, - comment_body_token_rec(SS,EE,comment_body_rec_start,T). + comment_body_token_rec(S,E,comment_body_rec_start,T). + +%% comment_rec(+Start:2matcher,+End:2matcher) +% recursive non tokenizing matcher comment_body_rec(_,E) --> - call(E). + call(E),!. comment_body_rec(S,E) --> - call(S), + call(S),!, comment_body_rec(S,E), comment_body_rec(S,E). @@ -106,51 +110,6 @@ [_], comment_body_rec(S,E). -comment_body_rec(_,_). - comment_rec(S,E) --> - { - tr(S,SS), - tr(E,EE) - }, - call(SS), - comment_body_rec(SS,EE). - -test(Tok,S,U) :- - atom_codes(S,SS), - call_dcg(Tok,SS,U). - -test_comment(S) :- - test(comment('<','>'),S,[]). - -test_comment_rec(S) :- - test(comment_rec('<','>'),S,[]). - -test_comment_token(S,T) :- - test(comment_token('<','>',TT),S,[]), - atom_codes(T,TT). - -test_comment_token_rec(S,T) :- - test(comment_token_rec('<','>',TT),S,[]), - atom_codes(T,TT). - -tester([]). -tester([X|L]) :- - write_term(test(X),[]), - ( - call(X) -> write(' ... OK') ; write(' ... FAIL') - ), - nl, - tester(L). - - -/* -tester( - [test_comment(''), - test_comment_rec('>'), - test_comment_token('',''), - test_comment_token_rec('>','>')]). -*/ - - - + call(S), + comment_body_rec(S,E). diff --git a/test/test_comments.pl b/test/test_comments.pl new file mode 100644 index 0000000..aa7f907 --- /dev/null +++ b/test/test_comments.pl @@ -0,0 +1,104 @@ +:- dynamic user:file_search_path/2. +:- multifile user:file_search_path/2. + +% Add the package source files relative to the current file location +:- prolog_load_context(directory, Dir), + atom_concat(Dir, '/../prolog', PackageDir), + asserta(user:file_search_path(package, PackageDir)). + +:- use_module(package(comment)). +:- begin_tests(tokenize_comment). + +id(X) --> {atom_codes(X,XX)},XX. +id(X,XX) --> {atom_codes(X,XX)},XX. 
+ +mytest(Tok,S,U) :- + atom_codes(S,SS), + call_dcg(Tok,SS,U). + +test_comment(S) :- + mytest(comment(id('<'),id('>')),S,[]). + +test_comment_rec(S) :- + mytest(comment_rec(id('<'),id('>')),S,[]). + +test_comment_token(S,T) :- + mytest(comment_token(id('<'),id('>'),TT),S,[]), + atom_codes(T,TT). + +test_comment_token_rec(S,T) :- + mytest(comment_token_rec(id('<'),id('>'),TT),S,[]), + atom_codes(T,TT). + +start(AA) :- + ( + catch(b_getval(a,[N,A]),_,N=0) -> + true; + N=0 + ), + NN is N + 1, + ( + N == 0 -> + AA = _; + AA = A + ), + b_setval(a,[NN,AA]). + +end(A) :- + b_getval(a,[N,A]), + NN is N - 1, + b_setval(a,[NN,A]). + +left(A) --> + {atom_codes(A,AA)}, + AA, + {start(B)}, + [B]. + +left(A,C) --> + {atom_codes(A,AA)}, + AA, + {start(B)}, + [B], + {append(AA,[B],C)}. + +right(A) --> + {end(B)}, + [B], + {atom_codes(A,AA)}, + AA. + +right(A,C) --> + {end(B)}, + [B], + {atom_codes(A,AA)}, + AA, + {append([B],AA,C)}. + +test_adapt(S,T) :- + mytest(comment_token_rec(left('<'),right('>'),TT),S,[]), + atom_codes(T,TT). + + +:- multifile test/2. + +test('Test comment',[true(test_comment(''))]) :- true. +test('Test comment_rec',[true(test_comment_rec('>'))]) :- true. +test('Test comment_token',[true(A == B)]) :- + A='', + test_comment_token(A,B). + +test('Test comment_token_rec',[true(A == B)]) :- + A='>', + test_comment_token(A,B). + +test('Test comment_token_rec advanced 1',[true(A == B)]) :- + A='<1 alla2> <1 balla2> 1>1>', + test_adapt(A,B). + +test('Test comment_token_rec advanced 2',[true(A == B)]) :- + A='<2 alla1> <2 balla1> 2>2>', + test_adapt(A,B). + + +:- end_tests(tokenize_comment). From 97b9378de5eb740c3638af6e3ce9f5851e9754dc Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 22 Jun 2019 22:16:50 -0400 Subject: [PATCH 24/25] Extract WIP comment code into separate directory --- comment-wip/README.md | 4 ++++ {prolog => comment-wip}/comment.pl | 0 {test => comment-wip}/test_comments.pl | 0 3 files changed, 4 insertions(+) create mode 100644 comment-wip/README.md rename {prolog => comment-wip}/comment.pl (100%) rename {test => comment-wip}/test_comments.pl (100%) diff --git a/comment-wip/README.md b/comment-wip/README.md new file mode 100644 index 0000000..c1c3fd9 --- /dev/null +++ b/comment-wip/README.md @@ -0,0 +1,4 @@ +WIP code towards tokenization of comments. + +It was extracted here because it's not ready for release, but we want to keep it +available for the author to resume work on it. diff --git a/prolog/comment.pl b/comment-wip/comment.pl similarity index 100% rename from prolog/comment.pl rename to comment-wip/comment.pl diff --git a/test/test_comments.pl b/comment-wip/test_comments.pl similarity index 100% rename from test/test_comments.pl rename to comment-wip/test_comments.pl From c45fb74a9fb76c19d7f555b89d542e6d8ea74ad6 Mon Sep 17 00:00:00 2001 From: Shon Feder Date: Sat, 22 Jun 2019 22:47:56 -0400 Subject: [PATCH 25/25] Bump version --- CHANGELOG.md | 27 ++++++++++++++++++++++++--- pack.pl | 2 +- 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0c58002..08dd184 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,9 @@ adheres to [Semantic Versioning][semantic-versioning]. [keep-a-change-log]: https://keepachangelog.com/en/1.0.0/ [semantic-versioning]: https://semver.org/spec/v2.0.0.html -## [Unreleased] +## [unreleased] + +## [1.0.0] ### Added @@ -18,6 +20,25 @@ adheres to [Semantic Versioning][semantic-versioning]. ### Changed -- Spaces are now tagged with `space` instead of `spc`. 
#41 -- Tokenization of numbers and strings is enabled by default. #40 +- Spaces are now tagged with `space` instead of `spc` #41 +- Tokenization of numbers and strings is enabled by default #40 - Options are now processed by a more conventional means #39 +- The location for the pack's home is updated + +## [0.1.2] + +Prior to changelog. + +## [0.1.1] + +Prior to changelog. + +## [0.1.0] + +Prior to changelog. + +[unreleased]: https://github.com/shonfeder/tokenize/compare/v1.0.0...HEAD +[1.0.0]: https://github.com/shonfeder/tokenize/compare/v0.1.2...v1.0.0 +[0.1.2]: https://github.com/shonfeder/tokenize/compare/v0.1.1...v0.1.2 +[0.1.1]: https://github.com/shonfeder/tokenize/compare/v0.1.0...v0.1.1 +[0.1.0]: https://github.com/shonfeder/tokenize/releases/tag/v0.1.0 diff --git a/pack.pl b/pack.pl index c7ecabf..68438aa 100644 --- a/pack.pl +++ b/pack.pl @@ -1,7 +1,7 @@ name(tokenize). title('A simple tokenization library'). -version('0.1.2'). +version('1.0.0'). download('https://github.com/shonfeder/tokenize/release/*.zip'). author('Shon Feder', 'shon.feder@gmail.com').
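
The comment matchers introduced in PATCH 23 above (and parked under `comment-wip/`
in PATCH 24) are easiest to see with a small worked example. The sketch below is
illustrative only and is not part of any patch in this series: the `id//1` and
`id//2` helpers mirror the ones in `comment-wip/test_comments.pl`, the `/* ... */`
delimiters are arbitrary stand-ins (the module leaves the Start/End matchers
entirely up to the caller), the load path assumes `swipl` is started at the
repository root, and the expected answers are inferred from the tests rather than
verified against a released build.

```prolog
% Illustrative sketch only -- not part of any patch in this series.
:- use_module('comment-wip/comment').

% Minimal matchers in the style of comment-wip/test_comments.pl:
id(X)     --> { atom_codes(X, Cs) }, Cs.   % match the delimiter, no capture
id(X, Cs) --> { atom_codes(X, Cs) }, Cs.   % match it and return its codes

% comment//2 only recognizes the span:
% ?- phrase(comment(id('/*'), id('*/')), `/* a comment */`).
% true.

% comment_token//3 also captures the matched codes:
% ?- phrase(comment_token(id('/*'), id('*/'), T), `/* a comment */ tail`, Rest),
%    atom_codes(Comment, T).
% Comment should unify with '/* a comment */', leaving ` tail` in Rest.
```

Because the Start and End arguments are ordinary DCG rules, the same predicates can
be pointed at any delimiter scheme the caller defines, and the `comment_rec//2` /
`comment_token_rec//3` variants handle nested comments in the same way.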