Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ER: Regex-defined custom tokens #20

Open
numist opened this issue Dec 7, 2020 · 2 comments
Open

ER: Regex-defined custom tokens #20

numist opened this issue Dec 7, 2020 · 2 comments

Comments

@numist
Copy link
Contributor

numist commented Dec 7, 2020

It occurs to me that a C-based custom token matcher is overkill in most cases. For example:

.token hex '0x0BADCAFE' '0x248c'
.token oct '0644' '0777771'
.token bin '11001011b' '10b'

Could be completely defined in the language specification using regexes:

.token hex /0x[0-9A-Fa-f]+/
.token oct /0[1-7][0-7]*/
.token bin /[01]+b/
@numist
Copy link
Contributor Author

numist commented Dec 19, 2020

This turned out to not be too complicated to implement in my toy engineering calculator, though integrating it into the parser generator is another thing entirely:

#include "parser.h"
#include <assert.h>
#include <regex.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE(X) (sizeof(X)/sizeof(X[0]))
#endif

struct owl_token match_token(const char *string, void *unused) {
  static struct {
    enum owl_token_type token_type;
    const char *regex_str;
  } regex_tokenizers[] = {
    { .token_type = OWL_TOKEN_HEX, .regex_str = "^0[xX][0-9a-fA-F]+" },
    { .token_type = OWL_TOKEN_HEX, .regex_str = "^[0-9a-fA-F]+[hH]" },
    { .token_type = OWL_TOKEN_OCT, .regex_str = "^0[1-7][0-7]*" },
    { .token_type = OWL_TOKEN_OCT, .regex_str = "^[0-7]+[oO]" },
    { .token_type = OWL_TOKEN_BIN, .regex_str = "^[01]+[bB]" },
    { .token_type = OWL_TOKEN_SCI, .regex_str = "^[0-9]+\\.[0-9]+[eE][-]?[0-9]+" }
  };

  struct owl_token result = owl_token_no_match;
  for (int i = 0; i < ARRAY_SIZE(regex_tokenizers); i++) {
    regex_t re;
    int rc = regcomp(&re, regex_tokenizers[i].regex_str, REG_EXTENDED);
    assert(0 == rc);
    regmatch_t re_matches;
    rc = regexec(&re, string, 1, &re_matches, 0);
    assert(0 == rc || REG_NOMATCH == rc);
    if (0 == rc) {
      assert(re_matches.rm_so == 0);
      if (re_matches.rm_eo > result.length) {
        result = (struct owl_token){
          .length = (unsigned long)re_matches.rm_eo,
          .type = regex_tokenizers[i].token_type
        };
      }
    }
  }
  return result;
}

The regexes can be compiled once and reused (the compiled regex_ts are even documented to be thread-safe!) but doing so in a portable way is a bit of a crapshoot… I guess pthread_once could work, but even compiling them once per owl_tree_create_* would probably net a decent win.

@ianh
Copy link
Owner

ianh commented Dec 19, 2020

There are two big things I'd want here that POSIX regexes can't provide:

token-level ambiguity detection
If there are two rules like [0-9]+ (for integers) and [0-9]+(\.[0-9]*)? (for real numbers) which match the same text (123), the conflict should be reported instead of silently choosing one or the other at parse time.

reverse lexing for ambiguity reporting
Ambiguities in the grammar are reported by finding a sequence of tokens that can produce two different parse trees. After finding this ambiguous sequence of tokens, the ambiguity checker generates a string which tokenizes back into that sequence. Each token has to be "reverse lexed" into a string which will turn back into that token. E.g., here the number token becomes the string 1:

$ owl -g 'x = number : a  number : b'
error: this grammar is ambiguous

. 1 

  can be parsed in two different ways: as

. 1   
  x:a 

  or as

. 1   
  x:b 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants