By default, Pinot transforms null values coming from the data source to a default value determined by the type of the corresponding column (or as specified in the schema). Eg: for INT column, the default will be 0 and for STRING column, the default is "null"
. This transformation is necessary to ensure all the indices can be built correctly during segment creation. However, we're now unable to keep track of the null values in the Pinot table and hence cannot support queries such as:
select count(*) from my_table where column IS NOT NULL
There is a workaround by matching with default values in the filter predicate. However, this is error prone since oftentimes it's difficult to distinguish valid values from the default null values. Therefore, we added first class NULL value support in Pinot for overcoming this limitation. As of today, the latest version supports NULL filter predicates only. Generic support for NULL handling in query execution is in progress (eg: within aggregation functions such as count
or sum
).
To turn on NULL
handling, simply enable the boolean flag in the table index config called as nullHandlingEnabled
(please see tableIndexConfig section). Please note - this will cause Pinot to use additional memory and disk space per segment. The details are as follows:
During data ingestion (either realtime/offline) eachGenericRow
object derived from the original data source record keeps track of all the column names containing null values. This is done as part of the NullValueTransformer
. For each such column, the segment creation logic updates a NULL value vector (implemented by a roaring bitmap) with the corresponding document ID. Effectively, at the end of the segment creation process we get a per column NULL value vector which can give us the set of document IDs containing null values for that column. This per column vector is then exposed through the DataSource
interface for use in query execution.
During Query execution, if the query includes a IS NULL
or IS NOT NULL
predicate as shown above, we fetch the NULL value vector for the corresponding column within FilterPlanNode
and retrieve the corresponding bitmap which represents all document IDs containing NULL values for that column. This bitmap is then used to create a BitmapBasedFilterOperator
which does the actual filtering operation.