
add basic query implementation #19

Merged (1 commit, Jul 21, 2022)

Conversation

@michaelsbradleyjr (Contributor) commented Jul 20, 2022

This PR adds a basic query implementation for base Datastore, NullDatastore, and most importantly SQLiteDatastore.

Closes #8.

An implementation for FileSystemDatastore will be skipped at this time, though it can be revisited in the future.

The immediate use case for this facility is codex-storage/nim-codex#138; it's also relevant to codex-storage/nim-codex#131.

For codex-storage/nim-codex#137, if we want to use TieredDatastore, then some decisions need to be made re: the desired behavior, as explained below.

General Notes

The first choice was how to implement the facility: should it be method query or iterator query? The second choice was whether it should be sync or async.

A method query approach would naturally involve returning seq[(key, data)] (entire results for the query), while iterator query would involve yielding (key, data) one at a time.

I chose to proceed with iterator query and made it async:

```nim
type
  QueryResponse* = tuple[key: Key, data: seq[byte]]

iterator query*(self: SQLiteDatastore, query: Query): Future[QueryResponse] =
  ...
```

Note that this is presently "faux async" just like method get, etc.

Types Query and QueryKey

The original vision of Python Datastore (embraced by various implementations in JavaScript and Go) is to have a Query class/object that provides a high-level almost declarative API for specifying what to query, how to filter, sort, and so on.

The basic implementation in this PR does not attempt to fulfill that vision. Rather, a Query object is constructed only with a QueryKey, and at the moment QueryKey is simply an alias for Key.

In the case of SQLiteDatastore, the idea is that a QueryKey will use the GLOB syntax supported by SQLite to match multiple rows.
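For illustration, the backing SQL can be imagined roughly like this (a sketch only; the actual table and column names, and the statement text in the PR, may differ):

```nim
# Sketch: a GLOB-based SELECT that iterator query over SQLite might
# prepare internally. "Store", "id", and "data" are assumptions.
const
  queryStmtStr = """
    SELECT id, data FROM Store
    WHERE id GLOB ?;
  """
```

The QueryKey string would be bound to the single `?` parameter, so SQLite itself performs the multi-row matching.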

Imagine that a Codex manifest with CID abcd1234 is stored with key manifest:abcd1234. To retrieve all the blocks stored for that manifest, one could do something like:

```nim
# note: `blk` rather than `block`, since `block` is a reserved keyword in Nim
for blk in ds.query(Query.init("manifest:abcd1234/block:*")):
  let
    # key will look like "manifest:abcd1234/block:[cid]"
    (key, data) = await blk

  ...
```

Limitations of SQLite GLOB

SQLite's GLOB syntax is not a form of regular expressions. The wildcard * is greedy, so the following won't work for querying all the manifests but not their blocks (per the hypothetical example above):

```nim
ds.query(Query.init("manifest:*[^/]"))
```

That would match all manifests and all their blocks.

If desirable, there are a number of ways to work around that at the application level (i.e. in a program using SQLiteDatastore), or we could consider enabling PCRE support in SQLite and using REGEXP instead of GLOB for the QueryKey parameter.
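An application-level workaround could be sketched like this (assumed usage; `process` is a placeholder, and as elsewhere in this PR the `await` would sit inside an async context):

```nim
import std/strutils

# Query broadly, then filter out block keys at the application level
# so only top-level manifest entries remain.
for res in ds.query(Query.init("manifest:*")):
  let (key, data) = await res
  if "/" notin $key:        # keep only keys without a path separator
    process(key, data)
```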

Async + Iterator

Consider the support built into the JavaScript language for asynchronous iteration: async generator functions, Symbol.asyncIterator, and for-await...of loops.

Those language constructs work together so that async iteration is well-defined and distinct from sync iteration. For example, for-await...of needs to be used in the general case of async iteration because it's not generally possible to know in advance how many times an async iterable will yield a value. In other words, you can't use Promise.all(); you need to step through each promised value, awaiting each in turn. Or you can use a sugar lib like IxJS to avoid manually writing for-await...of loops.

We don't currently have libs or lang features for Nim that are dedicated to async iteration. So, it's important to understand that it will be up to the caller to await on each step in a loop over async iterator query, at least in the general case, i.e. instead of trying to collect all the responses into chronos' allFinished().

Because iterator query for SQLiteDatastore is "faux async", it technically could work to exhaust the iterator and run the collected Futures through allFinished(). But when we move to a case like TieredDatastore (depending on how that works), or when we have a multi-threaded solution where the actual querying of the database takes place on a worker thread, then it's not feasible to do that, at least not without additional machinery.

TieredDatastore

In the original Python Datastore, there is this doc comment for class TieredDatastore:

Datastores should be arranged in order of completeness,
with the most complete datastore last, as it will handle query calls.

And below that:

```python
def query(self, query):
    '''Returns a sequence of objects matching criteria expressed in `query`.
    The last datastore will handle all query calls, as it has a (if not
    the only) complete record of all objects.
    '''
    # queries hit the last (most complete) datastore
    return self._stores[-1].query(query)
```

Indeed, it's not clear how a query implementation (in Nim, Python, or other lang) should operate internally if a query op were to span multiple stores.

So iterator query for nim-datastore's TieredDatastore should probably have the last datastore directly handle queries.

However, that does not mean that application-specific derivatives of TieredDatastore could not have specialized logic for e.g. method get vs. iterator query.

Codex, for example, might have this arrangement re: nim-datastore layers in a derivative of TieredDatastore:

LRU cache -> SQLite -> Network

While method get may attempt to resolve a key from Network if neither LRU cache nor SQLite has it, iterator query could bypass LRU cache and go directly to SQLite while also not attempting to fall through to Network.
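As a hedged sketch (every name here is an assumption, not code in this repo), such a derivative might look like:

```nim
type
  CodexTieredDatastore = ref object of TieredDatastore
    # assumed tier order: stores[0] = LRU cache,
    # stores[1] = SQLiteDatastore, stores[2] = network-backed store

method get(self: CodexTieredDatastore, key: Key): Future[seq[byte]] =
  # fall through the tiers: LRU cache, then SQLite, then the Network
  ...

iterator query(self: CodexTieredDatastore, q: Query): Future[QueryResponse] =
  # bypass the LRU cache and never fall through to the Network:
  # delegate directly to the SQLite tier
  for fut in self.stores[1].query(q):
    yield fut
```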

TODO in future PRs

We could prevent GLOB characters (*, ?, [, ]) from being used in strings passed to Key.init() while allowing them to be used in QueryKey.init().
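One possible shape for that guard (a sketch only; hasGlobChars is a hypothetical helper, not in this PR):

```nim
proc hasGlobChars(s: string): bool =
  # true if the string contains any SQLite GLOB metacharacter
  for c in s:
    if c in {'*', '?', '[', ']'}:
      return true
  false

# Key.init("a*b")      -> would be rejected
# QueryKey.init("a*b") -> would still be allowed
```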

In #11 we look toward a refactor that will be generic with respect to parameters and return types. That will provide additional flexibility, allowing concrete implementations to diverge in terms of their choice of base Key and other base types, how e.g. Key relates to Query and QueryKey (and even data layout in the backend), and making it simpler for applications to specialize on types and behaviors for get, query, etc. for a specific backend on which they choose to build.

@michaelsbradleyjr changed the base branch from master to prepare_persistent July 20, 2022 16:24
@Bulat-Ziganshin (Contributor) left a comment:

I accept this PR, except for the need to test queries with empty blobs returned. Otherwise, it's OK to merge.

```nim
v = sqlite3_errcode(sqlite3_db_handle(s))

if not (v in [SQLITE_OK, SQLITE_ROW, SQLITE_DONE]):
  raise (ref Defect)(msg: $sqlite3_errstr(v))
```
Contributor:

This has become a repetitive pattern; it could be factored out sometime in the future.

```nim
  SQLITE_TRANSIENT_GCSAFE)

if not (v == SQLITE_OK):
  raise (ref Defect)(msg: $sqlite3_errstr(v))
```
Contributor:

As I said before, we should discuss when to raise Defect and when to return a failure, but for the moment let's accept the already-implemented approach.

@michaelsbradleyjr (Contributor, Author) commented Jul 20, 2022:

In this case I wasn't sure of the best approach because we can't return from an inline iterator. We could yield a failure and break the internal loop, or leave it to the user to break in their loop over the iterator when an iteration-step yields a failure. Either way the cleanup in the defer block will be triggered.

Since I wasn't sure, for now I chose to raise Defect on any error, either directly or via nim-result's expect.
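For the record, the yield-a-failure alternative could be sketched like this (signatures are assumptions; Result stands in for whatever nim-result type we'd settle on):

```nim
iterator query*(self: SQLiteDatastore,
                q: Query): Future[Result[QueryResponse, ref CatchableError]] =
  ...

# the caller breaks on the first failure; the iterator's defer-based
# cleanup still runs when the loop is abandoned
for fut in ds.query(q):
  let res = await fut
  if res.isErr:
    break
  let (key, data) = res.get
```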

```nim
dataLen = sqlite3_column_bytes(s, queryStmtDataCol)
dataBytes = cast[ptr UncheckedArray[byte]](blob)
data = @(toOpenArray(dataBytes, 0, dataLen - 1))
fut = newFuture[QueryResponse]()
```
Contributor:

Will it work with blob == nil? Your testing code for queries doesn't include empty blobs.

Contributor (Author):

I'll add some tests for that. It should already work, since the approach is roughly the same as for method get, which supports empty blobs.
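The guard could look roughly like this (variable names follow the fragment quoted above; this sketch mirrors how method get handles empty blobs):

```nim
let
  blob = sqlite3_column_blob(s, queryStmtDataCol)
  dataLen = sqlite3_column_bytes(s, queryStmtDataCol)

var data: seq[byte]
if not blob.isNil and dataLen > 0:
  let dataBytes = cast[ptr UncheckedArray[byte]](blob)
  data = @(toOpenArray(dataBytes, 0, dataLen - 1))
# else: data remains the empty seq, so a nil blob yields @[]
```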

Contributor (Author):

I've accounted for empty blobs in 32596eb.

Base automatically changed from prepare_persistent to master July 20, 2022 22:11
@michaelsbradleyjr force-pushed the 8_Add_query_implementation branch from 979ceac to 6ef19c4 on July 21, 2022 03:56
@michaelsbradleyjr (Contributor, Author) commented:

> test queries with empty blobs returned. Otherwise, it's OK for merge

I've accounted for empty blobs in 32596eb.

I also replaced the original PR description with notes that I thought to be relevant now and in the future, even if at present we move forward with the PR as-is.

Let me know what you think, and if we're good to go, I'll squash my [wip] commits (really this time 🤦) and then merge it.

@michaelsbradleyjr marked this pull request as ready for review July 21, 2022 04:13
@Bulat-Ziganshin (Contributor) commented:

Your PR comments should be a document of their own.

As long as you've tested empty blobs, this PR is OK for me.

As for the chosen API: ideally it would have been debated before implementation. But the API you chose is already implemented, so let's merge it and get to work on integration.

Successfully merging this pull request may close these issues: Add query implementation for listing blocks in a store