A special LL(1) parser designed for parsing token ids generated by an LLM transformer. It is intended for use in a constrained decoder. A constrained decoder typically tests the top k tokens with the highest probabilities for validity and then adds the highest-probability valid token to the generated output. This parser lets us test tokens without adding them to the parse tree. Once a token has been chosen for the transformer's generated output, it can then be added to the parse tree.
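The test-then-commit loop described above might look like the following minimal sketch. `ToyParser`, `test`, `apply`, and `pick_token` are hypothetical names used for illustration only, not this project's actual API, and the toy "grammar" is just a fixed list of allowed token ids per position.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyParser:
    """Hypothetical stand-in for the token parser."""
    expected: list[set[int]]  # allowed token ids at each position
    accepted: list[int] = field(default_factory=list)

    def test(self, token_id: int) -> bool:
        # Check validity without changing parser state.
        pos = len(self.accepted)
        return pos < len(self.expected) and token_id in self.expected[pos]

    def apply(self, token_id: int) -> None:
        # Commit the token; only now does the parser state advance.
        if not self.test(token_id):
            raise ValueError(f"token {token_id} is not valid here")
        self.accepted.append(token_id)


def pick_token(parser: ToyParser, scores: dict[int, float], k: int) -> int | None:
    """Return the highest-scoring valid token among the top k, or None."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    for token_id in top_k:
        if parser.test(token_id):
            return token_id
    return None


parser = ToyParser(expected=[{7, 8}, {3}])
# Token 5 scores highest but is invalid, so the next valid candidate wins.
chosen = pick_token(parser, {5: 0.6, 8: 0.3, 7: 0.1}, k=3)
if chosen is not None:
    parser.apply(chosen)
```

Keeping `test` free of side effects is what allows several candidates to be tried against the same parser state before one is committed.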
Another feature useful for a constrained decoder is the ability to override the parser's validity logic by simply never applying a tested token, even if the parser considers it valid. The test results provide the id of the rule that the token was tested against. This rule id gives context for judging validity beyond what the parser itself can understand. One example is table column names: if the parser thinks a token is a valid column name when it really isn't, the validity decision can be overridden.
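A minimal sketch of the override idea, assuming the test result exposes a validity flag and a rule id. `TokenCheck`, `RULE_COLUMN_NAME`, `KNOWN_COLUMNS`, and `accept` are hypothetical and stand in for whatever the real parser and schema provide.

```python
from dataclasses import dataclass

RULE_COLUMN_NAME = 42            # hypothetical rule id for "column name"
KNOWN_COLUMNS = {"id", "email"}  # hypothetical table schema


@dataclass(frozen=True)
class TokenCheck:
    """Hypothetical shape of a test result returned by the parser."""
    valid: bool
    rule_id: int


def accept(result: TokenCheck, token_text: str) -> bool:
    """Combine the parser's verdict with schema knowledge the parser lacks."""
    if not result.valid:
        return False
    if result.rule_id == RULE_COLUMN_NAME:
        # The grammar allows any identifier here; the schema narrows it down.
        return token_text in KNOWN_COLUMNS
    return True
```

A token rejected by `accept` is simply never applied to the parser, so the decoder moves on to the next candidate. Real tokens are often sub-word pieces, so an actual check would likely need prefix matching rather than whole-name lookup.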
This step is unnecessary unless the vlad grammar rules have changed. The generated parser files should already exist within the project and be integrated with it.
Java is needed only as a dependency of Antlr.
sudo apt install default-jre
sudo apt install default-jdk
Antlr 4 parses our grammar file and converts it into the information needed by our main token-based parser.
pip install antlr4-tools
pip install antlr4-python3-runtime
This project is a token-based parser that uses an Antlr parser to load the grammar of the target language represented by the tokens.
cd scripts
sh build_sub_parser.sh
[Grammars](https://github.com/antlr/grammars-v4/tree/master)