ER: Regex-defined custom tokens #20

numist · 2020-12-07T05:51:01Z

It occurs to me that a C-based custom token matcher is overkill in most cases. For example:

.token hex '0x0BADCAFE' '0x248c'
.token oct '0644' '0777771'
.token bin '11001011b' '10b'

These could be defined entirely in the language specification using regexes:

.token hex /0x[0-9A-Fa-f]+/
.token oct /0[1-7][0-7]*/
.token bin /[01]+b/


numist · 2020-12-19T06:01:18Z

This turned out to not be too complicated to implement in my toy engineering calculator, though integrating it into the parser generator is another thing entirely:

#include "parser.h"
#include <assert.h>
#include <regex.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE(X) (sizeof(X)/sizeof(X[0]))
#endif

struct owl_token match_token(const char *string, void *unused) {
  static struct {
    enum owl_token_type token_type;
    const char *regex_str;
  } regex_tokenizers[] = {
    { .token_type = OWL_TOKEN_HEX, .regex_str = "^0[xX][0-9a-fA-F]+" },
    { .token_type = OWL_TOKEN_HEX, .regex_str = "^[0-9a-fA-F]+[hH]" },
    { .token_type = OWL_TOKEN_OCT, .regex_str = "^0[1-7][0-7]*" },
    { .token_type = OWL_TOKEN_OCT, .regex_str = "^[0-7]+[oO]" },
    { .token_type = OWL_TOKEN_BIN, .regex_str = "^[01]+[bB]" },
    { .token_type = OWL_TOKEN_SCI, .regex_str = "^[0-9]+\\.[0-9]+[eE][-]?[0-9]+" }
  };

  struct owl_token result = owl_token_no_match;
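  for (size_t i = 0; i < ARRAY_SIZE(regex_tokenizers); i++) {
    regex_t re;
    int rc = regcomp(&re, regex_tokenizers[i].regex_str, REG_EXTENDED);
    assert(0 == rc);
    regmatch_t re_matches;
    rc = regexec(&re, string, 1, &re_matches, 0);
    // regcomp allocates on every call, so release the compiled pattern
    // once the match offsets have been copied into re_matches.
    regfree(&re);
    assert(0 == rc || REG_NOMATCH == rc);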
    if (0 == rc) {
      assert(re_matches.rm_so == 0);
      if (re_matches.rm_eo > result.length) {
        result = (struct owl_token){
          .length = (unsigned long)re_matches.rm_eo,
          .type = regex_tokenizers[i].token_type
        };
      }
    }
  }
  return result;
}

The regexes can be compiled once and reused (the compiled regex_ts are even documented to be thread-safe!) but doing so in a portable way is a bit of a crapshoot… I guess pthread_once could work, but even compiling them once per owl_tree_create_* would probably net a decent win.
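
For what it's worth, here is a rough sketch of the "compile once per tree" variant, which sidesteps pthread_once entirely. Everything named here (enum token_type, tokenizer_set_create, tokenizer_set_match, tokenizer_set_destroy) is hypothetical and stands in for whatever the generated parser would actually provide. The idea is just that the compiled patterns live in a context object that could be handed to the matcher through its void * argument.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the generated parser's token types. */
enum token_type { TOKEN_NONE, TOKEN_HEX, TOKEN_OCT, TOKEN_BIN };

static const struct {
  enum token_type type;
  const char *pattern;
} token_specs[] = {
  { TOKEN_HEX, "^0[xX][0-9a-fA-F]+" },
  { TOKEN_OCT, "^0[1-7][0-7]*" },
  { TOKEN_BIN, "^[01]+[bB]" },
};
#define TOKEN_SPEC_COUNT (sizeof(token_specs) / sizeof(token_specs[0]))

/* Holds one compiled regex per entry in token_specs. */
struct tokenizer_set {
  regex_t compiled[TOKEN_SPEC_COUNT];
};

/* Compiles every pattern once. Returns NULL on allocation or compile failure. */
struct tokenizer_set *tokenizer_set_create(void) {
  struct tokenizer_set *set = malloc(sizeof *set);
  if (!set) return NULL;
  for (size_t i = 0; i < TOKEN_SPEC_COUNT; i++) {
    if (regcomp(&set->compiled[i], token_specs[i].pattern, REG_EXTENDED) != 0) {
      /* Free only the patterns that compiled successfully before this one. */
      while (i > 0) regfree(&set->compiled[--i]);
      free(set);
      return NULL;
    }
  }
  return set;
}

void tokenizer_set_destroy(struct tokenizer_set *set) {
  if (!set) return;
  for (size_t i = 0; i < TOKEN_SPEC_COUNT; i++) regfree(&set->compiled[i]);
  free(set);
}

/* Returns the length of the longest match at the start of `string` and
   stores its type in *type. Returns 0 (and TOKEN_NONE) if nothing matches. */
size_t tokenizer_set_match(const struct tokenizer_set *set, const char *string,
                           enum token_type *type) {
  assert(set != NULL && string != NULL && type != NULL);
  size_t best = 0;
  *type = TOKEN_NONE;
  for (size_t i = 0; i < TOKEN_SPEC_COUNT; i++) {
    regmatch_t m;
    if (regexec(&set->compiled[i], string, 1, &m, 0) == 0 &&
        (size_t)m.rm_eo > best) {
      best = (size_t)m.rm_eo;
      *type = token_specs[i].type;
    }
  }
  return best;
}
```

The set would be created alongside the tree, passed as the matcher's context pointer, and destroyed with the tree, so no global state or once-initialization is needed.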

ianh · 2020-12-19T08:35:06Z

There are two big things I'd want here that POSIX regexes can't provide:

1. Token-level ambiguity detection
If there are two rules like [0-9]+ (for integers) and [0-9]+(\.[0-9]*)? (for real numbers) which match the same text (123), the conflict should be reported instead of silently choosing one or the other at parse time.

2. Reverse lexing for ambiguity reporting
Ambiguities in the grammar are reported by finding a sequence of tokens that can produce two different parse trees. After finding this ambiguous sequence of tokens, the ambiguity checker generates a string which tokenizes back into that sequence. Each token has to be "reverse lexed" into a string which will turn back into that token. E.g., here the number token becomes the string 1:

$ owl -g 'x = number : a  number : b'
error: this grammar is ambiguous

. 1 

  can be parsed in two different ways: as

. 1   
  x:a 

  or as

. 1   
  x:b
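
To make the token-level ambiguity point concrete, here is a small sketch of the tie that a longest-match loop over POSIX regexes would resolve silently. The helper names (match_length, patterns_tie_on) are made up for illustration. Note that this only checks a single input at run time; it is not the grammar-time conflict detection described above, which is the part POSIX regexes give no help with.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Length of the match of `pattern` at the start of `text`, or -1 if the
   pattern fails to compile or does not match. Patterns are expected to
   be anchored with ^ by the caller. */
static long match_length(const char *pattern, const char *text) {
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;
  regmatch_t m;
  int rc = regexec(&re, text, 1, &m, 0);
  regfree(&re);
  return rc == 0 ? (long)m.rm_eo : -1;
}

/* True if both patterns match a non-empty prefix of `text` of the same
   length, i.e. a longest-match rule cannot pick between them and would
   fall back to table order. */
bool patterns_tie_on(const char *a, const char *b, const char *text) {
  long la = match_length(a, text);
  long lb = match_length(b, text);
  return la > 0 && la == lb;
}
```

With the two rules from the first point, the input 123 is a tie (both match three characters), while 123.5 is not, since only the real-number rule consumes the fraction.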
