Releases: bullet-db/bullet-core
Supports extended field extraction for Filters and Aggregations
This release lets us use the new extended extraction of fields supported by bullet-record-0.2.2.
Filters and Aggregations can now use the extended dot notation to access Maps, Maps of Maps or Lists and Lists of Maps.
Partially addresses #47.
QueryManager#size and hasQuery(String)
A minor release adding two new helpers to the QueryManager: int size() and boolean hasQuery(String).
Another category in QueryCategorizer
This release adds a new category for Queriers in QueryCategorizer. After categorizing queries (with or without a BulletRecord), you can use QueryCategorizer#getHasData to get the queries that have new data. This is in addition to the getDone, getRateLimited and getClosed queriers.
Query Partitioner and other changes
This release:
- Adds a Partitioner concept to bullet-core and an implementation: SimpleEqualityPartitioner
- Adds a few new validator helper methods to com.yahoo.bullet.common.Validator and a couple more helpers to BulletConfig
- Moves the FilterOperations, ProjectionOperations, AggregationOperations, WindowingOperations and PostAggregationOperations to a nested package com.yahoo.bullet.querying.operations
- Rewrites StringFilterClause instances to ObjectFilterClause on Query#configure(BulletConfig)
- Removes the AutoCloseable interface from com.yahoo.bullet.pubsub.PubSub
- Adds a hasNull abstract method to com.yahoo.bullet.parsing.Clause and makes the RELATIONALS and LOGICALS member variables a Set instead of a List
Partitioner
In the filtering stage, current Bullet implementations take all the queries (Querier instances) and feed every BulletRecord to each one. With partitioning, the goal is to minimize the number of queries that need to see a given BulletRecord (ideally, only the queries that need to see it do see it). This release adds an interface, com.yahoo.bullet.querying.partitioning.Partitioner, and provides a com.yahoo.bullet.querying.partitioning.SimpleEqualityPartitioner implementation that ONLY works for queries with equality filters (==) connected by and operations (AND) on singular values.
For example, suppose queries to your instance of Bullet commonly have equality filters ANDed on the fields A, B.c and D, such as the query A == foo AND B.c == bar AND D == null. Using the fields [A, B.c, D] will make sure that records with the values foo, bar and null for those fields will be seen only by the queries that have those filters (or subsets of them), including queries with no filters on these fields (and queries with no filters at all).
SimpleEqualityPartitioner#getKeys(Query) returns a set of size 1. This means the queries need not be duplicated after partitioning. However, SimpleEqualityPartitioner#getKeys(BulletRecord) will return a set of keys representing the queries that this record needs to be presented to. The size of this set can be up to 2^(# of fields), since the keys cover all combinations of null and the actual value in the record for each field. If the fields are [A, B.c, D] and a record has these values: [A: foo, B.c: bar, D: baz, ...], the keys will be the concatenation of the items in each of the following tuples using the user-configured delimiter (not necessarily in this order):
[foo, bar, baz]
[foo, null, baz]
[foo, bar, null]
[foo, null, null]
[null, bar, baz]
[null, bar, null]
[null, null, baz]
[null, null, null]
Using these keys and presenting the record to all the queries with the same key will ensure that the record is seen by exactly the queries that need to see it. This can be a huge saving, depending on the cardinality of these fields and the number of queries that get removed from the overall pool of queries.
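The key scheme described above can be sketched in a few lines. This is a minimal illustration only, not the SimpleEqualityPartitioner source: the class PartitionKeySketch and its methods recordKeys and queryKey are hypothetical names, and it assumes field values have already been extracted as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the actual SimpleEqualityPartitioner code.
final class PartitionKeySketch {
    // For a record: every combination of "actual value" or "null" per field,
    // joined by the delimiter. Yields up to 2^n distinct keys for n fields.
    static Set<String> recordKeys(List<String> values, String delimiter) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < values.size(); i++) {
            String prefix = i == 0 ? "" : delimiter;
            List<String> next = new ArrayList<>();
            for (String key : keys) {
                next.add(key + prefix + values.get(i));
                next.add(key + prefix + "null");
            }
            keys = next;
        }
        // A set removes duplicates, e.g. when a record value is itself null.
        return new LinkedHashSet<>(keys);
    }

    // For a query: exactly one key, using "null" for fields it does not filter on.
    static String queryKey(List<String> fields, Map<String, String> filters, String delimiter) {
        List<String> parts = new ArrayList<>();
        for (String field : fields) {
            parts.add(String.valueOf(filters.get(field)));
        }
        return String.join(delimiter, parts);
    }

    public static void main(String[] args) {
        Set<String> keys = recordKeys(Arrays.asList("foo", "bar", "baz"), "|");
        keys.forEach(System.out::println);

        Map<String, String> filters = new HashMap<>();
        filters.put("A", "foo");
        filters.put("D", "baz");
        String key = queryKey(Arrays.asList("A", "B.c", "D"), filters, "|");
        // The record is presented to this query only if its key is among the record's keys.
        System.out.println(key + " matches: " + keys.contains(key));
    }
}
```

A query with filters on only A and D gets a key with null in the B.c position, which is why the record's key set has to include the null combinations.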
Note: If you use the partitioner when your workload of queries doesn't have these kinds of filters, nothing bad will happen, but you will do unnecessary work to compute that your queries don't fit the partitioner and will default partition most, if not all, of them.
To configure a Partitioner, these new settings with their defaults have been added:
bullet.query.partitioner.enable: false
bullet.query.partitioner.class.name: "com.yahoo.bullet.querying.partitioning.SimpleEqualityPartitioner"
bullet.query.partitioner.equality.fields: null
bullet.query.partitioner.equality.delimiter: "|"
AutoCloseable Pubsub Components, HttpClient 4.3.6
This release is a small improvement to the PubSub interfaces: it makes the com.yahoo.bullet.pubsub.{PubSub,Publisher,Subscriber} classes implement the AutoCloseable interface.
It also bumps the org.apache.httpcomponents.HttpClient dependency to 4.3.6 for a security bugfix.
Better Order By, Smaller Serializations, Transient Fields
This release changes the Order By post aggregation: a query may now contain only one Order By in its post aggregations. Instead of a list of string field names and a single ascending or descending direction, you now specify a list of field objects, each with a name and a direction.
Order By changes
Before:
{
"filters": {},
"aggregation": {},
"postAggregations": [
{ "type": "ORDERBY", "fields": ["A", "B"], "direction": "DESC" }
]
}
This only let you sort a set of fields in one direction. It also let you specify other order bys in the post aggregations that would destroy the ordering of the first.
Now:
{
"filters": {},
"aggregation": {},
"postAggregations": [
{ "type": "ORDERBY", "fields": [ { "field": "A", "direction": "ASC" }, { "field": "B", "direction": "DESC" } ] }
]
}
This lets you sort by multiple fields in different directions, using the later fields to break ties. It is now an error to have multiple order bys in the list of post aggregations. This closes #54.
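The per-field direction semantics can be pictured as a chain of comparators, where each later field is only consulted on ties. The following is a rough sketch and not Bullet's implementation: OrderBySketch, SortField and row are hypothetical names, records are modeled as plain maps, and values are assumed to be non-null and mutually comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; records are modeled as plain maps.
final class OrderBySketch {
    // Stand-in for one { "field": ..., "direction": ... } entry.
    static final class SortField {
        final String field;
        final boolean descending;

        SortField(String field, String direction) {
            this.field = field;
            this.descending = "DESC".equals(direction);
        }
    }

    // Assumes both values are non-null and of the same Comparable type.
    @SuppressWarnings("unchecked")
    static int compareValues(Object left, Object right) {
        return ((Comparable<Object>) left).compareTo(right);
    }

    // Chains one comparator per field; later fields only break ties.
    static Comparator<Map<String, Object>> comparator(List<SortField> sortFields) {
        Comparator<Map<String, Object>> result = (a, b) -> 0;
        for (SortField sortField : sortFields) {
            Comparator<Map<String, Object>> next =
                (a, b) -> compareValues(a.get(sortField.field), b.get(sortField.field));
            Comparator<Map<String, Object>> directed = sortField.descending ? next.reversed() : next;
            result = result.thenComparing(directed);
        }
        return result;
    }

    static Map<String, Object> row(Object a, Object b) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("A", a);
        record.put("B", b);
        return record;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows =
            new ArrayList<>(Arrays.asList(row(2, "x"), row(1, "x"), row(1, "y")));
        // Mirrors the query above: A ascending, then B descending.
        rows.sort(comparator(Arrays.asList(new SortField("A", "ASC"), new SortField("B", "DESC"))));
        rows.forEach(System.out::println);
    }
}
```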
Projection
Previously, if the projection field in a Query was null, or the fields in a projection were null or empty, all the fields would be projected regardless of whether a computation post aggregation was added to create new fields. It is now possible to get just your post aggregation fields by adding a projection in this manner:
{
"projection": { "fields": { } },
"postAggregations": [{
"type": "COMPUTATION",
"expression": {
"left": { "value": { "kind": "FIELD", "value": "a" } },
"right": { "value": { "kind": "VALUE", "value": "5" } },
"operation": "+"
}
}]
}
This will produce just a + 5 in your result.
In the future, we will use #57 to support pre-aggregation computations that might express this in a more natural manner.
Other changes
This release also fixes #53 and drops the size of the serialized Querier object. This should reduce some of the state management overhead in https://github.com/bullet-db/bullet-spark
There are also some internal changes to support transient fields, which are needed for post aggregations.
Post Aggregations!
This release adds post aggregations to Bullet queries! It also adds casting support to both filtering and post aggregations.
Casting
Filters allowed you to specify a list of values with the following format:
"value": {
"kind": "FIELD | VALUE",
"value": "foo.bar"
}
Now a new key, type, has been added (in addition to kind and value), which you can use to cast the constant or the extracted field. Note that if the casting fails, this filter will be ignored.
Post Aggregations
Post aggregations are specified by adding a new top-level field to the query: postAggregations, which is a list of the various post aggregations. The order of the post aggregations in this list determines the order in which they are evaluated. Post aggregations can refer to the results of previous post aggregations in the list to chain them.
{
"filters": [],
"aggregation": {},
"postAggregations": []
}
We start with two kinds of post aggregations:
1. ORDER BY
This orders result records based on the given fields (in ascending order by default). To sort the records in descending order, use the DESC direction. You can specify any fields in each record or from previous post aggregations. Note that the ordering is fully typed, so the types of the fields will be used. If multiple fields are specified, ties are broken using the fields in order from left to right.
{
"type": "ORDERBY",
"fields": ["A", "B"],
"direction": "DESC"
}
2. COMPUTATION
This lets you perform arithmetic on the results in a fully nested way. We currently support +, -, * and / as operations. The format for this is:
{
"type": "COMPUTATION",
"expression": {}
}
2.1 Expressions
For future extensibility, the expression in this post aggregation is free form. Currently, we support binary arithmetic operations that can be nested (implying parentheses). This forms a tree of expressions. The leaves of this tree resolve to atomic values such as fields or constants. So, there are two kinds of expressions.
A) Binary Expressions
{
"operation": "+",
"left": {},
"right": {},
"type": "INTEGER | FLOAT | BOOLEAN | DOUBLE | LONG | STRING"
}
where left and right are themselves expressions and type is used to force cast the result to the given type.
B) Unary Expressions
{
"value": {
"kind": "FIELD | VALUE",
"value": "foo.bar",
"type": "INTEGER | FLOAT | BOOLEAN | DOUBLE | LONG | STRING"
}
}
This is the same value definition used for filtering mentioned above. It can be used to extract fields from the record as your chosen type or to use constants as your chosen type.
If casting fails in any of the expressions, the expression is ignored.
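To make the tree structure concrete, here is a small sketch of how such an expression tree could be evaluated recursively. It is an illustration under stated assumptions rather than Bullet's evaluator: ExpressionSketch and its methods are hypothetical names, records are flat maps keyed by the full field name, only numeric types are handled, arithmetic is done in double before the optional cast, and a failed cast throws instead of causing the expression to be ignored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Bullet's actual expression evaluator.
final class ExpressionSketch {
    interface Expression {
        Number evaluate(Map<String, Object> record);
    }

    // Applies the optional "type" cast. Only numeric types are handled here.
    static Number cast(Number value, String type) {
        if (type == null) {
            return value;
        }
        switch (type) {
            case "INTEGER": return Integer.valueOf(value.intValue());
            case "LONG": return Long.valueOf(value.longValue());
            case "FLOAT": return Float.valueOf(value.floatValue());
            case "DOUBLE": return Double.valueOf(value.doubleValue());
            default: throw new IllegalArgumentException("Unsupported type in this sketch: " + type);
        }
    }

    // Leaf with kind FIELD: reads a numeric field from the record.
    static Expression field(String name, String type) {
        return record -> cast((Number) record.get(name), type);
    }

    // Leaf with kind VALUE: parses a constant.
    static Expression value(String text, String type) {
        return record -> cast(Double.valueOf(Double.parseDouble(text)), type);
    }

    // Binary node: evaluates both children, applies the operation, then casts.
    static Expression binary(String operation, Expression left, Expression right, String type) {
        return record -> {
            double l = left.evaluate(record).doubleValue();
            double r = right.evaluate(record).doubleValue();
            double result;
            switch (operation) {
                case "+": result = l + r; break;
                case "-": result = l - r; break;
                case "*": result = l * r; break;
                case "/": result = l / r; break;
                default: throw new IllegalArgumentException("Unknown operation: " + operation);
            }
            return cast(Double.valueOf(result), type);
        };
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("foo.bar", 10L);
        // foo.bar (as LONG) + ((1.2 as DOUBLE) / (1 as INTEGER)) as FLOAT
        Expression c = binary("+",
            field("foo.bar", "LONG"),
            binary("/", value("1.2", "DOUBLE"), value("1", "INTEGER"), "FLOAT"),
            null);
        System.out.println(c.evaluate(record));
    }
}
```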
Example
Putting all these together, here is a complete example of post aggregation. Stay tuned for bullet-bql-0.2.0 for the much more concise BQL version of these. This first computes a new field with forced casts, C: CAST(foo.bar, LONG) + CAST((CAST(1.2, DOUBLE) / CAST(1, INTEGER)), FLOAT), or C: foo.bar + (1.2 / 1), for each record in the result window and then orders the result by foo.baz first and then by the new field C.
{
"postAggregations":[
{
"type":"COMPUTATION",
"expression":{
"operation":"+",
"left":{
"value":{
"kind":"FIELD",
"value":"foo.bar",
"type":"LONG"
}
},
"right":{
"operation":"/",
"left":{
"value":{
"kind":"VALUE",
"value":"1.2",
"type":"DOUBLE"
}
},
"right":{
"value":{
"kind":"VALUE",
"value":"1",
"type":"INTEGER"
}
},
"type":"FLOAT"
},
"newName":"C"
}
},
{
"type":"ORDERBY",
"fields":[
"foo.baz", "C"
],
"direction":"ASC"
}
]
}
Sliding window, filtering against other fields, SIZEIS, CONTAINSKEY, and CONTAINSVALUE
Sliding Window
This release lets you specify an Emit Type Record window with Include Type Record and Last set to more than 1. This is the general case of the Reactive window: a Sliding window. As before, we currently only support this for Raw aggregations.
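As a rough mental model of this window, the sketch below keeps the last N records and returns the current window contents each time a record arrives. It is not Bullet's windowing code: SlidingWindowSketch is a hypothetical class, and emitting on every single record is an assumption made for the illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration; assumes a result is emitted for every record.
final class SlidingWindowSketch<T> {
    private final int last;
    private final Deque<T> window = new ArrayDeque<>();

    SlidingWindowSketch(int last) {
        if (last < 1) {
            throw new IllegalArgumentException("last must be at least 1");
        }
        this.last = last;
    }

    // Adds a record, evicts the oldest one past the limit and
    // returns a copy of the records currently in the window.
    List<T> add(T record) {
        window.addLast(record);
        if (window.size() > last) {
            window.removeFirst();
        }
        return new ArrayList<>(window);
    }

    public static void main(String[] args) {
        SlidingWindowSketch<String> window = new SlidingWindowSketch<>(2);
        for (String record : new String[] {"a", "b", "c"}) {
            System.out.println(window.add(record));
        }
    }
}
```

With last set to 1, this degenerates to the Reactive window, where each result contains only the newest record.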
New Relational operations
Addresses #36. We have added support for three new relational filter operations:
- SIZEIS lets you check if a complex field (MAP, LIST) has a given size. You can nest this in a NOT logical operator to get SIZE NOT IS. If you apply this to a STRING field, the length of the string is used for the comparison.
- CONTAINSKEY lets you check if a MAP field contains the given key. This MAP field can be a top level MAP field or a MAP field inside a LIST of MAPs.
- CONTAINSVALUE lets you check if a MAP field or a LIST field contains the given value. In the case of a LIST: if the LIST contains primitive values, they are compared against the given value, and if it is a LIST of MAPs, the values of the MAPs are checked.
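The semantics of the three operations can be sketched with plain Java collections. This is only an illustration: RelationalSketch and its methods are hypothetical names, not Bullet's filter code, and treating a LIST of MAPs in CONTAINSKEY as "any MAP in the LIST has the key" is an interpretation of the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the three operations on plain Java values.
final class RelationalSketch {
    // SIZEIS: size of a MAP or LIST, or the length of a STRING.
    static boolean sizeIs(Object field, int size) {
        if (field instanceof Map) {
            return ((Map<?, ?>) field).size() == size;
        }
        if (field instanceof List) {
            return ((List<?>) field).size() == size;
        }
        if (field instanceof String) {
            return ((String) field).length() == size;
        }
        return false;
    }

    // CONTAINSKEY: a MAP, or (as an assumption) any MAP inside a LIST of MAPs.
    static boolean containsKey(Object field, String key) {
        if (field instanceof Map) {
            return ((Map<?, ?>) field).containsKey(key);
        }
        if (field instanceof List) {
            for (Object item : (List<?>) field) {
                if (item instanceof Map && ((Map<?, ?>) item).containsKey(key)) {
                    return true;
                }
            }
        }
        return false;
    }

    // CONTAINSVALUE: MAP values, LIST elements, or the values of MAPs in a LIST.
    static boolean containsValue(Object field, Object value) {
        if (field instanceof Map) {
            return ((Map<?, ?>) field).containsValue(value);
        }
        if (field instanceof List) {
            for (Object item : (List<?>) field) {
                boolean found = item instanceof Map
                    ? ((Map<?, ?>) item).containsValue(value)
                    : value.equals(item);
                if (found) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Object> map = Collections.singletonMap("k", (Object) "v");
        System.out.println(sizeIs("abc", 3));
        System.out.println(containsKey(Arrays.asList(map), "k"));
        System.out.println(containsValue(Arrays.asList("a", "b"), "c"));
    }
}
```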
Filter against other fields
You can now filter against other fields, not just constants!
Relational filters (not Logical) have been extended to support a new format:
Old Relational Filter
{
"operation": "SIZEIS",
"field": "foo",
"values": ["1"]
}
New Relational Filter
{
"operation": "SIZEIS",
"field": "foo",
"values": [{ "kind": "VALUE", "value": "1"}]
}
The old formats are still accepted. However, you may not mix and match both styles in the same values.
To compare to another field, simply use:
{
"kind": "FIELD",
"value": "foo.bar"
}
This feature will let us support casting (#37).
[BUG] - RESTPubSub Closes All HTTP Connections
This release fixes a bug in the RESTPubSub to ensure that the RESTPublisher and RESTSubscriber close all HTTP connections when they are finished posting or getting data from the PubSub REST endpoints.
It also changes a default setting to:
bullet.query.rate.limit.max.emit.count: 50
So now, by default, the backend will return a rate-limit error at 50 messages (or more) every 100 milliseconds.
Added RESTPublisher HTTP Timeout Setting
This release adds a RESTPubSub setting to configure an HTTP timeout for the RESTPublisher. The previous timeout setting only applied to the RESTSubscriber.
The two settings to configure the HTTP timeout for RESTSubscriber and RESTPublisher are now:
bullet.pubsub.rest.subscriber.connect.timeout.ms
bullet.pubsub.rest.publisher.connect.timeout.ms
Both will default to 5 seconds if not included in your settings file.
The old bullet.pubsub.rest.connect.timeout.ms setting should be removed from your settings file.