ER: Regex-defined custom tokens #20
This turned out to not be too complicated to implement in my toy engineering calculator, though integrating it into the parser generator is another thing entirely:

#include "parser.h"
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE(X) (sizeof(X)/sizeof(X[0]))
#endif

/* Custom token matcher: tries every regex against the start of `string`
   and returns the longest match, or owl_token_no_match if none apply. */
struct owl_token match_token(const char *string, void *unused) {
    static struct {
        enum owl_token_type token_type;
        const char *regex_str;
    } regex_tokenizers[] = {
        { .token_type = OWL_TOKEN_HEX, .regex_str = "^0[xX][0-9a-fA-F]+" },
        { .token_type = OWL_TOKEN_HEX, .regex_str = "^[0-9a-fA-F]+[hH]" },
        { .token_type = OWL_TOKEN_OCT, .regex_str = "^0[1-7][0-7]*" },
        { .token_type = OWL_TOKEN_OCT, .regex_str = "^[0-7]+[oO]" },
        { .token_type = OWL_TOKEN_BIN, .regex_str = "^[01]+[bB]" },
        { .token_type = OWL_TOKEN_SCI, .regex_str = "^[0-9]+\\.[0-9]+[eE][-]?[0-9]+" }
    };
    struct owl_token result = owl_token_no_match;
    (void)unused;
    for (size_t i = 0; i < ARRAY_SIZE(regex_tokenizers); i++) {
        regex_t re;
        int rc = regcomp(&re, regex_tokenizers[i].regex_str, REG_EXTENDED);
        assert(0 == rc);
        regmatch_t re_matches;
        rc = regexec(&re, string, 1, &re_matches, 0);
        regfree(&re); /* release the compiled pattern; it is rebuilt on every call */
        assert(0 == rc || REG_NOMATCH == rc);
        if (0 == rc) {
            /* Every pattern is anchored with ^, so a match must start at offset 0. */
            assert(re_matches.rm_so == 0);
            /* Keep the longest match; on a tie the earlier table entry wins. */
            if ((size_t)re_matches.rm_eo > result.length) {
                result = (struct owl_token){
                    .length = (unsigned long)re_matches.rm_eo,
                    .type = regex_tokenizers[i].token_type
                };
            }
        }
    }
    return result;
}

The regexes can be compiled once and reused (the compiled |
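The compile-once idea mentioned above could look roughly like the sketch below. It is a standalone illustration, not the code from the calculator: `token_kind`, `tokenizers`, `compile_tokenizers`, and `match_longest` are hypothetical names, and the enum stands in for the generated `owl_token_type` so the snippet builds without `parser.h`. It also assumes single-threaded use, since the lazy initialization is not synchronized.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical stand-in for the generated owl token types. */
enum token_kind { TOKEN_NONE, TOKEN_HEX, TOKEN_OCT, TOKEN_BIN, TOKEN_SCI };

struct tokenizer {
    enum token_kind kind;
    const char *pattern;
    regex_t compiled; /* filled in on first use, then reused */
};

static struct tokenizer tokenizers[] = {
    { .kind = TOKEN_HEX, .pattern = "^0[xX][0-9a-fA-F]+" },
    { .kind = TOKEN_HEX, .pattern = "^[0-9a-fA-F]+[hH]" },
    { .kind = TOKEN_OCT, .pattern = "^0[1-7][0-7]*" },
    { .kind = TOKEN_OCT, .pattern = "^[0-7]+[oO]" },
    { .kind = TOKEN_BIN, .pattern = "^[01]+[bB]" },
    { .kind = TOKEN_SCI, .pattern = "^[0-9]+\\.[0-9]+[eE][-]?[0-9]+" }
};

#define TOKENIZER_COUNT (sizeof(tokenizers) / sizeof(tokenizers[0]))

/* Compile every pattern exactly once; later calls return immediately. */
static void compile_tokenizers(void) {
    static int done = 0;
    if (done) return;
    for (size_t i = 0; i < TOKENIZER_COUNT; i++) {
        int rc = regcomp(&tokenizers[i].compiled, tokenizers[i].pattern,
                         REG_EXTENDED);
        assert(rc == 0);
        (void)rc;
    }
    done = 1;
}

/* Returns the length of the longest match at the start of `string`
   (0 if nothing matches) and stores the matching kind in `*kind`. */
static size_t match_longest(const char *string, enum token_kind *kind) {
    compile_tokenizers();
    size_t best = 0;
    *kind = TOKEN_NONE;
    for (size_t i = 0; i < TOKENIZER_COUNT; i++) {
        regmatch_t m;
        if (regexec(&tokenizers[i].compiled, string, 1, &m, 0) == 0 &&
            (size_t)m.rm_eo > best) {
            best = (size_t)m.rm_eo;
            *kind = tokenizers[i].kind;
        }
    }
    return best;
}
```

The compiled patterns are deliberately never passed to `regfree`, because they live for the lifetime of the process. A real integration would probably want an explicit teardown hook instead.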
There are two big things I'd want here that POSIX regexes can't provide: token-level ambiguity detection, and reverse lexing for ambiguity reporting.
It occurs to me that a C-based custom token matcher is overkill in most cases. For example:
Could be completely defined in the language specification using regexes: