
add basic query implementation #19

Merged (1 commit, Jul 21, 2022)

Conversation

@michaelsbradleyjr (Contributor) commented Jul 20, 2022

This PR adds a basic query implementation for base Datastore, NullDatastore, and most importantly SQLiteDatastore.

Closes #8.

An implementation for FileSystemDatastore will be skipped at this time, though it can be revisited in the future.

The immediate use case for this facility is codex-storage/nim-codex#138; it's also relevant to codex-storage/nim-codex#131.

For codex-storage/nim-codex#137, if we want to use TieredDatastore, then some decisions need to be made re: the desired behavior, as explained below.

General Notes

The first choice was how to implement the facility: should it be method query or iterator query? The second choice was whether it should be sync or async.

A method query approach would naturally involve returning seq[(key, data)] (entire results for the query), while iterator query would involve yielding (key, data) one at a time.

I chose to proceed with iterator query and made it async:

```nim
type
  QueryResponse* = tuple[key: Key, data: seq[byte]]

iterator query*(self: SQLiteDatastore, query: Query): Future[QueryResponse] =
  ...
```

Note that this is presently "faux async" just like method get, etc.

Types Query and QueryKey

The original vision of Python Datastore (embraced by various implementations in JavaScript and Go) is to have a Query class/object that provides a high-level almost declarative API for specifying what to query, how to filter, sort, and so on.

The basic implementation in this PR does not attempt to fulfill that vision. Rather, a Query object is constructed only with a QueryKey, and at the moment QueryKey is simply an alias for Key.

In the case of SQLiteDatastore, the idea is that a QueryKey will use the GLOB syntax supported by SQLite to match multiple rows.
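For illustration, the backing SQL can be imagined roughly like this (a sketch only; the actual table and column names, and the statement text in the PR, may differ):

```nim
# Sketch: a GLOB-based SELECT that iterator query over SQLite might
# prepare internally. "Store", "id", and "data" are assumptions.
const
  queryStmtStr = """
    SELECT id, data FROM Store
    WHERE id GLOB ?;
  """
```

The QueryKey string would be bound to the single `?` parameter, so SQLite itself performs the multi-row matching.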

Imagine that a Codex manifest with CID abcd1234 is stored with key manifest:abcd1234. To retrieve all the blocks stored for that manifest, one could do something like:

```nim
# note: `blk` rather than `block`, since `block` is a reserved keyword in Nim
for blk in ds.query(Query.init("manifest:abcd1234/block:*")):
  let
    # key will look like "manifest:abcd1234/block:[cid]"
    (key, data) = await blk

  ...
```

Limitations of SQLite GLOB

SQLite's GLOB syntax is not a form of regular expressions. The wildcard * is greedy, so the following won't work for querying all the manifests but not their blocks (per the hypothetical example above):

```nim
ds.query(Query.init("manifest:*[^/]"))
```

That would match all manifests and all their blocks.

If desirable, there are a number of ways to work around that at the application level (i.e. in a program using SQLiteDatastore), or we could consider enabling PCRE support in SQLite and using REGEXP instead of GLOB for the QueryKey parameter.
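An application-level workaround could be sketched like this (assumed usage; `process` is a placeholder, and as elsewhere in this PR the `await` would sit inside an async context):

```nim
import std/strutils

# Query broadly, then filter out block keys at the application level
# so only top-level manifest entries remain.
for res in ds.query(Query.init("manifest:*")):
  let (key, data) = await res
  if "/" notin $key:        # keep only keys without a path separator
    process(key, data)
```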

Async + Iterator

Consider the support built into the JavaScript language for asynchronous iteration: async generator functions, Symbol.asyncIterator, and for-await...of loops.

Those language constructs work together so that async iteration is well-defined and distinct from sync iteration. For example, for-await...of needs to be used in the general case of async iteration because it's not generally possible to know in advance how many times an async iterable will yield a value. In other words, you can't use Promise.all(); you need to step through each promised value, awaiting each in turn. Or you can use a sugar lib like IxJS to avoid manually writing for-await...of loops.

We don't currently have libs or lang features for Nim that are dedicated to async iteration. So, it's important to understand that it will be up to the caller to await on each step in a loop over async iterator query, at least in the general case, i.e. instead of trying to collect all the responses into chronos' allFinished().

Because iterator query for SQLiteDatastore is "faux async", it technically could work to exhaust the iterator and run the collected Futures through allFinished(). But when we move to a case like TieredDatastore (depending on how that works), or when we have a multi-threaded solution where the actual querying of the database takes place on a worker thread, then it's not feasible to do that, at least not without additional machinery.

TieredDatastore

In the original Python Datastore, there is this doc comment for class TieredDatastore:

Datastores should be arranged in order of completeness,
with the most complete datastore last, as it will handle query calls.

And below that:

```python
def query(self, query):
    '''Returns a sequence of objects matching criteria expressed in `query`.
    The last datastore will handle all query calls, as it has a (if not
    the only) complete record of all objects.
    '''
    # queries hit the last (most complete) datastore
    return self._stores[-1].query(query)
```

Indeed, it's not clear how a query implementation (in Nim, Python, or other lang) should operate internally if a query op were to span multiple stores.

So iterator query for nim-datastore's TieredDatastore should probably have the last datastore directly handle queries.

However, that does not mean that application-specific derivatives of TieredDatastore could not have specialized logic for e.g. method get vs. iterator query.

Codex, for example, might have this arrangement re: nim-datastore layers in a derivative of TieredDatastore:

LRU cache -> SQLite -> Network

While method get may attempt to resolve a key from Network if neither LRU cache nor SQLite has it, iterator query could bypass LRU cache and go directly to SQLite while also not attempting to fall through to Network.
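As a hedged sketch (every name here is an assumption, not code in this repo), such a derivative might look like:

```nim
type
  CodexTieredDatastore = ref object of TieredDatastore
    # assumed tier order: stores[0] = LRU cache,
    # stores[1] = SQLiteDatastore, stores[2] = network-backed store

method get(self: CodexTieredDatastore, key: Key): Future[seq[byte]] =
  # fall through the tiers: LRU cache, then SQLite, then the Network
  ...

iterator query(self: CodexTieredDatastore, q: Query): Future[QueryResponse] =
  # bypass the LRU cache and never fall through to the Network:
  # delegate directly to the SQLite tier
  for fut in self.stores[1].query(q):
    yield fut
```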

TODO in future PRs

We could prevent GLOB characters (*, ?, [, ]) from being used in strings passed to Key.init() while allowing them to be used in QueryKey.init().
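One possible shape for that guard (a sketch only; hasGlobChars is a hypothetical helper, not in this PR):

```nim
proc hasGlobChars(s: string): bool =
  # true if the string contains any SQLite GLOB metacharacter
  for c in s:
    if c in {'*', '?', '[', ']'}:
      return true
  false

# Key.init("a*b")      -> would be rejected
# QueryKey.init("a*b") -> would still be allowed
```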

In #11 we look toward a refactor that will be generic with respect to parameters and return types. That will provide additional flexibility, allowing concrete implementations to diverge in terms of their choice of base Key and other base types, how e.g. Key relates to Query and QueryKey (and even data layout in the backend), and making it simpler for applications to specialize on types and behaviors for get, query, etc. for a specific backend on which they choose to build.

@michaelsbradleyjr changed the base branch from master to prepare_persistent July 20, 2022 16:24
@Bulat-Ziganshin (Contributor) left a comment:

I accept this PR, except for the need to test queries with empty blobs returned. Otherwise, it's OK to merge.

```nim
v = sqlite3_errcode(sqlite3_db_handle(s))

if not (v in [SQLITE_OK, SQLITE_ROW, SQLITE_DONE]):
  raise (ref Defect)(msg: $sqlite3_errstr(v))
```
Contributor:

This has become a repetitive pattern; it could be factored out sometime in the future.

```nim
  SQLITE_TRANSIENT_GCSAFE)

if not (v == SQLITE_OK):
  raise (ref Defect)(msg: $sqlite3_errstr(v))
```
Contributor:

As I said before, we should discuss when to raise Defect and when to return a failure, but for the moment let's accept the already-implemented approach.

@michaelsbradleyjr (Contributor, Author) commented Jul 20, 2022:

In this case I wasn't sure of the best approach because we can't return from an inline iterator. We could yield a failure and break the internal loop, or leave it to the user to break in their loop over the iterator when an iteration-step yields a failure. Either way the cleanup in the defer block will be triggered.

Since I wasn't sure, for now I chose to raise Defect on any error, either directly or via nim-result's expect.
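For the record, the yield-a-failure alternative could be sketched like this (signatures are assumptions; Result stands in for whatever nim-result type we'd settle on):

```nim
iterator query*(self: SQLiteDatastore,
                q: Query): Future[Result[QueryResponse, ref CatchableError]] =
  ...

# the caller breaks on the first failure; the iterator's defer-based
# cleanup still runs when the loop is abandoned
for fut in ds.query(q):
  let res = await fut
  if res.isErr:
    break
  let (key, data) = res.get
```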

```nim
dataLen = sqlite3_column_bytes(s, queryStmtDataCol)
dataBytes = cast[ptr UncheckedArray[byte]](blob)
data = @(toOpenArray(dataBytes, 0, dataLen - 1))
fut = newFuture[QueryResponse]()
```
Contributor:

Will it work with blob == nil? Your testing code for queries doesn't include empty blobs.

Contributor (Author):

I'll add some tests for that. It should already work, since the approach is roughly the same as for method get, which supports empty blobs.
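The guard could look roughly like this (variable names follow the fragment quoted above; this sketch mirrors how method get handles empty blobs):

```nim
let
  blob = sqlite3_column_blob(s, queryStmtDataCol)
  dataLen = sqlite3_column_bytes(s, queryStmtDataCol)

var data: seq[byte]
if not blob.isNil and dataLen > 0:
  let dataBytes = cast[ptr UncheckedArray[byte]](blob)
  data = @(toOpenArray(dataBytes, 0, dataLen - 1))
# else: data remains the empty seq, so a nil blob yields @[]
```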

Contributor (Author):

I've accounted for empty blobs in 32596eb.

Base automatically changed from prepare_persistent to master July 20, 2022 22:11
@michaelsbradleyjr force-pushed the 8_Add_query_implementation branch from 979ceac to 6ef19c4 on July 21, 2022 03:56
@michaelsbradleyjr (Contributor, Author) commented:

> test queries with empty blobs returned. Otherwise, it's OK for merge

I've accounted for empty blobs in 32596eb.

I also replaced the original PR description with notes that I thought to be relevant now and in the future, even if at present we move forward with the PR as-is.

Let me know what you think, and if we're good to go, I'll squash my [wip] commits (really this time 🤦) and then merge it.

@michaelsbradleyjr marked this pull request as ready for review July 21, 2022 04:13
@Bulat-Ziganshin (Contributor) commented:

Your PR comments should be a document of their own.

As long as you've tested empty blobs, this PR is OK for me.

As for the chosen API: ideally it would have been debated before implementation. But the API you chose is already implemented, so let's merge it and get to work on integration.

Successfully merging this pull request may close these issues: Add query implementation for listing blocks in a store