Add support for SPL (Splunk query language) #1970
Open
Hello, here is a proposal to add support for SPL (Splunk query language).
SPL is technically fileless, meaning that it does not live in a dedicated file with a specific extension. However, I have assumed here that users could store a query in a FILENAME.spl file, since that is the first extension that comes to mind. Technically, Splunk stores its knowledge objects in INI configuration files, parts of which can be SPL queries. Still, syntax highlighting for SPL can be very helpful when sharing pieces of queries in threads, such as on GitHub or GitLab.
It is a stretch to consider SPL a real language: it is extremely permissive and can be ambiguous. Fortunately, we do not need to validate the syntax here, just highlight notable elements.
I'll now share some information to explain why the lexer is built the way it is. Theoretically, SPL's grammar is very basic; however, achieving a useful level of syntax highlighting requires compensating for a lot of exceptions.
Basic syntax and implicit Search command
SPL is basically a succession of commands. A command starts with a pipe, followed by its name and then its arguments or input data, in the form "| commandName arguments".
There are several types of commands, but the most important one for us here is the "generating command": a type of command that does not need data as input but provides data as output. Each query should start with a generating command.
However, for ease of use, as with a search engine, Splunk considers the "search" command to be implicitly used when the query does not start with an explicit command call like "| commandName". That is the first exception that gets in our way: the lexer needs to be able to handle this implicit default state. A query written without a leading command is exactly the same thing as that query prefixed with "| search".
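To illustrate the implicit "search" rule, here is a minimal Python sketch. It is not the lexer code from this PR; the function name `normalize_query` and the sample queries are made up for the example, and it only shows how the implicit form could be rewritten into the explicit one.

```python
def normalize_query(query: str) -> str:
    """Make the implicit leading "search" command explicit.

    Hypothetical helper for illustration only: a query that does not
    start with a pipe is treated as if it started with "| search".
    """
    stripped = query.lstrip()
    if stripped.startswith("|"):
        # The query already starts with an explicit command call.
        return stripped
    return "| search " + stripped


# Both forms below describe the same query once normalized.
print(normalize_query("index=main error"))
print(normalize_query("| search index=main error"))
```

A lexer cannot rewrite its input like this, of course; it has to start in a state that behaves as if "| search" had already been read.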
Subqueries
Splunk allows subqueries inside queries; they are executed beforehand. There is no actual limit on how deeply they can be nested, and they can occur nearly anywhere in the query. At runtime, their output is technically more SPL. They are written between square brackets. A subquery follows almost the same rules as a query, except that the first pipe (the one that starts a command) is optional. That's another exception to handle.
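The nesting described above can be sketched with a simple bracket stack. This is only an illustration in Python, not the lexer from this PR: the function name `subquery_spans` is hypothetical, and the handling of double-quoted strings with backslash escapes is a simplifying assumption about SPL quoting.

```python
def subquery_spans(query: str) -> list[tuple[int, int, int]]:
    """Return (start, end, depth) for every [...] subquery in the query.

    Brackets inside double-quoted strings are ignored. Spans are
    reported in the order their closing bracket is met, so inner
    subqueries come before the ones that contain them.
    """
    spans = []
    stack = []          # start offsets of the currently open brackets
    in_quotes = False
    i = 0
    while i < len(query):
        ch = query[i]
        if in_quotes and ch == "\\":
            i += 2      # skip the escaped character inside a string
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "[":
                stack.append(i)
            elif ch == "]" and stack:
                start = stack.pop()
                spans.append((start, i + 1, len(stack) + 1))
        i += 1
    return spans


query = 'index=a [search index=b [inputlookup x.csv] | fields host] "[text]"'
for start, end, depth in subquery_spans(query):
    print(depth, query[start:end])
```

In a real lexer the same idea usually takes the form of a state that is pushed on an opening bracket and popped on the closing one, which is what makes arbitrary nesting depth possible.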
End of query
There is technically no way to indicate that we have reached the end of a query apart from EOF.
Consequently, it is unfortunately not really possible to put several queries in the same file or block and expect all of them to be highlighted correctly. This is due to the implicit "search" command at the beginning of a query: if one query ends and the next starts in that implicit state, we cannot know where the previous query ends and the new one begins.
Arguments position and nature
Command arguments can be positioned anywhere after the command name. Some arguments are parameters that the command expects, and each command has a different set of known parameters. Arguments are written as argName=value. Almost anything else is treated as input for the command. Note that inputs can also have the shape field=value, such as the data filters given to a search command. This is why we have a large dictionary listing the expected arguments for each command, so that we can highlight only what is valid in the context of the current command.
Operators
Apart from the usual arithmetic operators, which do not really matter here, Splunk commands use a wide variety of operators. There are the usual boolean operators for defining conditions (AND, OR, NOT, etc.). There are also operators like "BY" or "GROUPBY" for aggregations and "AS" for aliases or renaming, as well as operators that introduce a specific section of the command, as in a SQL query, such as "WHERE" or "FROM". We want to highlight them, but only when they appear in an appropriate location. This is why we have a large dictionary defining the possible advanced operators.
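The context-dependent highlighting of arguments and operators can be sketched as a lookup keyed by the current command. The Python below is only an illustration, not the dictionaries from this PR: the table names, the `classify` function and the few entries shown are hypothetical and heavily abbreviated, and comparing names case-insensitively is a simplification.

```python
import re

# Hypothetical, abbreviated tables; the real dictionaries are much larger.
COMMAND_ARGS = {
    "stats": {"allnum", "delim", "partitions"},
    "head": {"limit", "null", "keeplast"},
}
COMMAND_OPERATORS = {
    "stats": {"BY", "AS"},
    "rename": {"AS"},
    "search": {"AND", "OR", "NOT"},
}


def classify(command: str, token: str) -> str:
    """Classify one token in the context of the current command."""
    match = re.fullmatch(r"(\w+)=(.*)", token)
    if match:
        # name=value is an argument only if the command expects that name;
        # otherwise it is plain input, e.g. a field=value filter.
        known = COMMAND_ARGS.get(command, set())
        return "argument" if match.group(1).lower() in known else "input"
    if token.upper() in COMMAND_OPERATORS.get(command, set()):
        return "operator"
    return "input"


print(classify("stats", "allnum=true"))   # known argument of stats
print(classify("search", "host=web01"))   # field filter, so plain input
print(classify("stats", "BY"))            # operator valid for stats
print(classify("search", "BY"))           # not an operator for search
```

The same token is therefore highlighted differently depending on which command it belongs to, which is the reason for keeping per-command tables rather than one global keyword list.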
Functions
Some commands allow the use of various functions, such as aggregation functions, for more advanced features. There are several types of functions, in this lexer we have considered the following ones:
Some commands support functions, but only of some specific types, and we want to highlight only valid calls. This is why we have several data structures listing the functions of a given type and the commands that support them. Splunk also allows aggregation commands to use eval functions inside aggregation functions. That's another exception to handle.
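As a rough picture of these data structures, here is a minimal Python sketch. It is an assumption about the general shape only, not the structures used in this PR: the names `FUNCTIONS_BY_TYPE`, `COMMANDS_BY_FUNCTION_TYPE` and `is_highlighted` are hypothetical, and each set holds just a few sample entries.

```python
# Hypothetical, abbreviated tables for illustration.
FUNCTIONS_BY_TYPE = {
    "aggregation": {"count", "avg", "sum", "dc"},
    "eval": {"if", "len", "lower", "round"},
}
COMMANDS_BY_FUNCTION_TYPE = {
    "aggregation": {"stats", "chart", "timechart"},
    "eval": {"eval", "where"},
}


def is_highlighted(command: str, func: str, inside_aggregation: bool = False) -> bool:
    """Decide whether a function call is valid for the current command."""
    for func_type, names in FUNCTIONS_BY_TYPE.items():
        if func in names and command in COMMANDS_BY_FUNCTION_TYPE[func_type]:
            return True
    # Exception: aggregation commands accept eval functions, but only
    # when the call is nested inside an aggregation function.
    return (
        inside_aggregation
        and func in FUNCTIONS_BY_TYPE["eval"]
        and command in COMMANDS_BY_FUNCTION_TYPE["aggregation"]
    )


print(is_highlighted("stats", "count"))
print(is_highlighted("eval", "count"))
print(is_highlighted("stats", "lower", inside_aggregation=True))
```

The `inside_aggregation` flag stands in for whatever lexer state records that we are currently within the parentheses of an aggregation function.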
There are some other specifics and tricks, but I'll spare you the details since they are less important.
Example of highlighted SPL
Official syntax highlighting
Splunk provides syntax highlighting in its interface, but it is surprisingly light. I have pushed it a little further here, with some personal choices based on past experience. It was also often inspired by the syntax highlighting proposed here: https://github.com/ChrisYounger/highlighter
Deprecated commands
As in Chris Younger's implementation, deprecated SPL commands are not highlighted, even though they can still be found in the official documentation (with a warning). This is done to alert users that a command they are using should no longer be used.