BigQL provides a SQL Query Layer for Google Cloud Bigtable.
Cloud Bigtable is Google's fully managed NoSQL Big Data database service. Each table contains rows and columns. Each row/column intersection can contain multiple cells. Each cell contains a unique timestamped version of the data for that row and column. Thus Bigtable is often used to store time series data.
BigQL provides a SQL query layer to run aggregation queries on Bigtable.
```
pip install bigql
```
Using the weather balloon example data shown in the "Single-timestamp unserialized" schema design:
| Row key | pressure | temperature | humidity | altitude |
|---|---|---|---|---|
| us-west2#3698#2021-03-05-1200 | 94558 | 9.6 | 61 | 612 |
| us-west2#3698#2021-03-05-1201 | 94122 | 9.7 | 62 | 611 |
| us-west2#3698#2021-03-05-1202 | 95992 | 9.5 | 58 | 602 |
| us-west2#3698#2021-03-05-1203 | 96025 | 9.5 | 66 | 598 |
| us-west2#3698#2021-03-05-1204 | 96021 | 9.6 | 63 | 624 |
After initializing the client and registering the table:

```python
from bigql.client import Client

# config follows the official python bigtable client
client = Client(config)

client.register_table(
    "weather_balloons",
    instance_id="INSTANCE_ID",
    column_families={
        "measurements": {
            "only_read_latest": True,
            "columns": {
                "pressure": int,
                "temperature": str,
                "humidity": int,
                "altitude": int
            }
        }
    }
)
```
We can then calculate the average pressure over the period:
```python
client.query("measurements", """
    SELECT avg(pressure) FROM weather_balloons
    WHERE
      "_row_key" BETWEEN 'us-west2#3698#2021-03-05-1200' AND 'us-west2#3698#2021-03-05-1204'
""")
```
Or, with row key decomposition:

```python
client.register_table(
    xxx,
    row_key_identifiers=["location", "balloon_id", "event_minute"],
    row_key_separator="#"
)
```
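Here `xxx` presumably stands in for the table name and arguments from the first `register_table` call; a combined registration might look like this sketch:

```python
# Sketch: the column family definition from above plus row key decomposition
client.register_table(
    "weather_balloons",
    instance_id="INSTANCE_ID",
    column_families={
        "measurements": {
            "only_read_latest": True,
            "columns": {
                "pressure": int,
                "temperature": str,
                "humidity": int,
                "altitude": int
            }
        }
    },
    row_key_identifiers=["location", "balloon_id", "event_minute"],
    row_key_separator="#"
)
```

The decomposed identifiers can then be used directly in queries: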
```python
client.query("measurements", """
    SELECT balloon_id, avg(pressure) FROM weather_balloons
    WHERE
      location = 'us-west2'
      AND balloon_id IN ('3698', '3700')
      AND event_minute BETWEEN '2021-03-05-1200' AND '2021-03-05-1204'
    GROUP BY 1
""")
```
The output of a query is a list of `pyarrow.RecordBatch`. It can easily be converted to a Python dictionary (`to_pydict`) or a pandas DataFrame (`to_pandas`).
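For example (a minimal sketch, reusing the row-key-decomposition query above):

```python
import pyarrow as pa

batches = client.query("measurements", """
    SELECT balloon_id, avg(pressure) FROM weather_balloons
    WHERE
      location = 'us-west2'
      AND balloon_id IN ('3698', '3700')
      AND event_minute BETWEEN '2021-03-05-1200' AND '2021-03-05-1204'
    GROUP BY 1
""")

for batch in batches:           # each element is a pyarrow.RecordBatch
    print(batch.to_pydict())    # dict of column name -> list of values

# or combine all batches into a single pandas DataFrame
df = pa.Table.from_batches(batches).to_pandas()
```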
Each cell in Bigtable has a timestamp. `SELECT "_timestamp"` returns a float number, representing seconds since the Unix epoch. The following is an example that buckets timestamps into 10-minute intervals:

```sql
SELECT to_timestamp_seconds(cast(floor("_timestamp" / 600) * 600 as bigint)) as interval
```
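A sketch of a full query using this expression (assuming the `weather_balloons` table registered above; not taken from the upstream docs):

```python
# Sketch: average pressure per 10-minute bucket over the example period
client.query("measurements", """
    SELECT
      to_timestamp_seconds(cast(floor("_timestamp" / 600) * 600 as bigint)) as interval,
      avg(pressure)
    FROM weather_balloons
    WHERE
      "_row_key" BETWEEN 'us-west2#3698#2021-03-05-1200' AND 'us-west2#3698#2021-03-05-1204'
    GROUP BY 1
""")
```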
BigQuery can also query Bigtable as an external data source. However, as of 2022-01, it:

- only supports the "us-central1" and "europe-west1" regions
- only supports querying by "rowkey"
- by default can run up to 4 concurrent queries against a Bigtable external data source
BigQL, by contrast, provides the following SQL functionality (several of the supported clauses are combined in the sketch after this list):

- ✅ Insert Into
- ✅ Select *
- ✅ Select column(s)
- ✅ Filter (WHERE): "=", "IN", "BETWEEN", ">", ">=", "<", "<="
- ✅ GROUP BY
- ✅ ORDER BY
- ✅ HAVING
- ✅ LIMIT
- ✅ Aggregate (e.g. avg, sum, count)
- ✅ AND
- ✅ Alias
- ✅ Cast
- ✅ Common Math Functions
- Common Date/Time Functions
- OR ???
- Join ???
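A hedged sketch (not taken from the upstream docs) combining several of these clauses, GROUP BY, HAVING, ORDER BY, LIMIT and an alias, on the example table:

```python
client.query("measurements", """
    SELECT balloon_id, avg(pressure) AS avg_pressure
    FROM weather_balloons
    WHERE
      location = 'us-west2'
      AND balloon_id IN ('3698', '3700')
      AND event_minute BETWEEN '2021-03-05-1200' AND '2021-03-05-1204'
    GROUP BY 1
    HAVING avg(pressure) > 90000
    ORDER BY avg_pressure DESC
    LIMIT 10
""")
```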
Query optimizations:

- ✅ Partition Pruning
- ✅ Projection pushdown
- ❌ Predicate pushdown (value range and value regex)
  - does not work well, because Bigtable's value filters apply to all cells, not only the predicate column
- Limit Pushdown ???
Known limitations:

- for row key encoding, only string is supported
- for single/composite row keys, identifiers support "=" and "IN"; additionally, the last identifier also supports "BETWEEN"
- for qualifiers, only string and integer (64-bit big-endian encoding) values are supported (see the sketch after this list)
- subqueries and common table expressions are not supported
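To illustrate the value encoding limitation, here is a hedged sketch of writing compatible cells with the official python bigtable client (the project id and row key are illustrative assumptions):

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project")  # assumed project id
table = client.instance("INSTANCE_ID").table("weather_balloons")

row = table.direct_row("us-west2#3698#2021-03-05-1205")
# Python ints are written as 64-bit big-endian signed integers,
# which is the integer encoding BigQL can read
row.set_cell("measurements", "pressure", 94100)
# string values are stored as UTF-8 bytes
row.set_cell("measurements", "temperature", b"9.6")
row.commit()
```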
BigQL depends on:

- sqloxide and sqlparser-rs: SQL parser
- python-bigtable: official python bigtable client
- datafusion-python: in-memory query engine
- pyarrow: in-memory columnar store / dataframe