Add support for SPL (Splunk query language) #1970

Open
wants to merge 7 commits into
base: master

romain-durban

Hello, here is a proposal to add support for SPL (Splunk query language).

SPL is technically fileless: it does not live in a dedicated file with a specific file extension. I have nevertheless assumed that users could store a query in a FILENAME.spl file, since that is the first extension that comes to mind. Splunk itself stores its knowledge objects in INI configuration files, parts of which can be SPL queries. Even so, syntax highlighting for SPL is very helpful when sharing pieces of queries in threads, such as on GitHub or GitLab.

It is a stretch to consider SPL a real language: it is extremely permissive and can be ambiguous. Fortunately, we do not need to validate the syntax here, just highlight notable elements.

I'll now share some background to explain why the lexer is built this way. In theory, SPL's grammar is very basic; in practice, achieving useful syntax highlighting requires a lot of compensation.

Basic syntax and implicit Search command

SPL is basically a succession of commands. A command starts with a pipe, followed by its name and then arguments/input data:

| commandA arg1=true fieldA, fieldB
| commandB

There are several types of commands, but the most important one for us is the "generating command": a command that does not need data as input but produces data as output. Each query should start with a generating command.
However, for ease of use, as in a search engine, Splunk assumes that the "search" command is implicitly used if the query does not start with an explicit command call like | commandName. That's the first exception that will get in our way: the lexer needs to handle this implicit default state.

This:

index=_internal sourcetype=splunkd

is the exact same thing as:

| search index=_internal sourcetype=splunkd
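
To make the equivalence above concrete, here is a minimal Python sketch that rewrites a query so the implicit command becomes explicit. The function name `normalize_query` is made up for this illustration and is not part of the lexer in this PR; it only looks at the first non-blank character.

```python
def normalize_query(query: str) -> str:
    """Make the implicit leading `search` command explicit.

    Hypothetical helper for illustration only: a query that does not
    start with a pipe is treated as arguments to `search`.
    """
    stripped = query.lstrip()
    if not stripped or stripped.startswith("|"):
        # Empty query, or the first command is already explicit.
        return stripped
    return "| search " + stripped


print(normalize_query("index=_internal sourcetype=splunkd"))
print(normalize_query("| commandB"))
```

A lexer cannot rewrite its input like this, so it has to start in a state that behaves as if `| search` had already been read.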

Subqueries

Splunk allows subqueries inside queries; they are executed beforehand. There is no actual limit on how deeply they can be nested, and they can occur nearly anywhere in the query; at runtime, their output technically becomes more SPL. They are delimited by square brackets. A subquery follows almost the same rules as a query, except that the first pipe (which starts a command) is optional. That's another exception to handle.
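
The nesting rule can be sketched with an explicit stack of states, one entry per open bracket. This is a simplified Python illustration and not the lexer from this PR: the names `TOKEN` and `list_commands` are invented here, and quoted strings are ignored, so a pipe or bracket inside quotes would be misread.

```python
import re

# A token is a bracket, a pipe, or a run of other non-space characters.
TOKEN = re.compile(r"[\[\]|]|[^\s\[\]|]+")


def list_commands(query):
    """Return (depth, command_name) pairs in order of appearance."""
    commands = []
    # One state per nesting level: "start", "name" (pipe seen), or "args".
    stack = ["start"]
    for tok in TOKEN.findall(query):
        depth = len(stack) - 1
        state = stack[-1]
        if tok == "[":
            if state == "start" and depth == 0:
                commands.append((0, "search"))  # implicit search
            stack[-1] = "args"
            stack.append("start")               # enter the subquery
        elif tok == "]":
            if depth > 0:
                stack.pop()                     # leave the subquery
        elif tok == "|":
            stack[-1] = "name"
        elif state == "name" or (state == "start" and depth > 0):
            # After a pipe, or as the first word of a subquery
            # (where the leading pipe is optional).
            commands.append((depth, tok))
            stack[-1] = "args"
        elif state == "start":
            commands.append((0, "search"))      # implicit search
            stack[-1] = "args"
        # Any other token is an argument of the current command.
    return commands


print(list_commands(
    "index=_internal [search index=main | fields host] | stats count by host"
))
```

The top level and the subquery differ only in how the "start" state treats a first word: at depth 0 it belongs to the implicit `search`, inside brackets it is the command name itself.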

End of query

There is technically no way to indicate that a query has ended, apart from EOF.
Consequently, it is unfortunately not really possible to put several queries in the same file/block and expect all of them to be highlighted correctly every time. This is due to the implicit "search" command at the beginning of a query: if one query ends and the next one starts in that implicit state, we cannot know where the previous query ends and where the new one begins.

Arguments position and nature

Command arguments can be positioned anywhere after the command name. Some arguments are parameters that the command expects, and each command has a different set of known parameters. Parameters are written as argName=value. Almost anything else is treated as input for the command. Note that inputs can also have the shape field=value, as in a search command when providing data filters. This is why we have a large dictionary listing the expected arguments for each command, so that we highlight only what is valid in the context of the current command.
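
As a rough Python sketch of that lookup: the dictionary below is a tiny illustrative subset chosen for this example, not the dictionary shipped in this PR, and `KNOWN_ARGS`, `PAIR` and `classify_pairs` are hypothetical names.

```python
import re

# Illustrative subset only: parameters each command is assumed to accept.
KNOWN_ARGS = {
    "timechart": {"span", "limit", "useother"},
    "top": {"limit", "showperc"},
    "dedup": {"keepevents", "consecutive"},
}

# Matches name=value pairs; values are taken up to the next whitespace.
PAIR = re.compile(r"(\w+)=(\S+)")


def classify_pairs(command, text):
    """Label each name=value pair as a command argument or plain input."""
    known = KNOWN_ARGS.get(command, set())
    return [
        (name, "argument" if name in known else "input")
        for name, _value in PAIR.findall(text)
    ]


print(classify_pairs("timechart", "span=1h count by host"))
print(classify_pairs("search", "index=_internal span=1h"))
```

The same `span=1h` text is an argument for one command and ordinary input for another, which is why the lookup has to be keyed by the current command.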

Operators

Apart from the usual arithmetic operators, which do not really matter here, Splunk commands use a wide variety of operators. There are the usual boolean operators for defining conditions (AND, OR, NOT etc.). There are also operators like "BY" or "GROUPBY" for aggregations and "AS" for aliases/renaming, as well as operators that introduce a specific section of the command, as in a SQL query, such as "WHERE", "FROM" etc. We want to highlight them, but only when they appear in an appropriate location. This is why we have a large dictionary defining the possible advanced operators.
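
The idea of highlighting an operator only where it is meaningful can be sketched in Python as follows. The `OPERATORS` table is a small illustrative subset and `operator_tokens` is a hypothetical helper, neither taken from this PR; case is ignored and arguments are split on whitespace purely to keep the sketch short.

```python
# Illustrative subset only: operators assumed valid for a few commands.
OPERATORS = {
    "search": {"AND", "OR", "NOT"},
    "stats": {"BY", "AS"},
    "tstats": {"FROM", "WHERE", "BY", "GROUPBY", "AS"},
    "rename": {"AS"},
}


def operator_tokens(command, args):
    """Return the words in `args` that are operators for `command`."""
    allowed = OPERATORS.get(command, set())
    return [word for word in args.split() if word.upper() in allowed]


print(operator_tokens("stats", "count AS total BY host"))
print(operator_tokens("search", "status=500 BY host"))
```

In the second call `BY` is left alone because the table does not list it for `search`, so it would be rendered as ordinary input.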

Functions

Some commands allow the use of various functions, such as aggregation functions, for more advanced features. There are several types of functions; in this lexer we have considered the following ones:

  • Eval functions: functions that evaluate data/fields; they basically transform data
  • Aggregation functions: functions used in aggregating commands, as in SQL, such as count, avg etc.
  • Convert functions: specifically used for data type conversions
  • Filter functions: specifically used for data filtering

Some commands support functions, but only of specific types, and we want to highlight only valid calls. This is why we have several data structures listing the functions of a given type and the commands that support them. Splunk also allows aggregation commands to use eval functions inside aggregation functions. That's another exception to handle.
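
A compact Python sketch of those data structures, under stated assumptions: the function sets and the command-to-type mapping are short illustrative samples, the names `FUNCTIONS`, `COMMAND_FUNCTION_TYPES` and `valid_calls` are invented for this example, and the `eval(...)` wrapper used inside aggregations is not itself classified.

```python
import re

# Illustrative subset only: a few functions per function type.
FUNCTIONS = {
    "eval": {"if", "lower", "round", "strftime"},
    "aggregation": {"count", "avg", "sum", "max"},
    "convert": {"ctime", "mktime", "dur2sec"},
}

# Which function types each command is assumed to accept.
COMMAND_FUNCTION_TYPES = {
    "eval": {"eval"},
    "where": {"eval"},
    "stats": {"aggregation", "eval"},  # eval functions nested in aggregations
    "convert": {"convert"},
}

# A call is an identifier followed by an opening parenthesis.
CALL = re.compile(r"\b(\w+)\s*\(")


def valid_calls(command, text):
    """Return the function names in `text` that `command` supports."""
    types = COMMAND_FUNCTION_TYPES.get(command, set())
    allowed = set().union(*(FUNCTIONS[t] for t in types))
    return [name for name in CALL.findall(text) if name in allowed]


print(valid_calls("stats", 'count(eval(lower(user)="admin")) AS admins'))
print(valid_calls("eval", "total=count(bytes)"))
```

The sketch flattens the nesting rule: for `stats` it accepts eval functions anywhere rather than only inside an aggregation function, which a real lexer would track with an extra state.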

There are some other specificities/tricks, but I'll spare you the detail, they are less important.

Example of highlighted SPL

spl_example

Official syntax highlighting

Splunk provides syntax highlighting in its interface, but it is surprisingly light. I have pushed it a little further here, with some personal choices based on past experience. It was also often inspired by the syntax highlighting proposed here: https://github.com/ChrisYounger/highlighter

Deprecated commands

As in Chris Younger's implementation, deprecated SPL commands are not highlighted, even though they can be found in the official documentation (with a warning). This is done to alert the user that the command being used should not be.

Pushing this first working version, further tests will be done later
Getting rid of tabs
Removed the :search_command state to reduce redundancy and playing instead with the states stack
Fixed some issues, now covering some commands which were missing and now better highlighting some special operators, args and functions
Fixed a minor syntax error in the example, for good measure
One of the regex for multiline comments did not have the multiline option and was raising an error on \n
@romain-durban
Author

Hello @tancnle,

Sorry, I did not notice the lexer was raising one Error token because it was on a "\n".
I have fixed it (a multiline mode was missing on one of the regexes).

Added missing EOF newline to comply to the linelint rule