ANS-104 bundle indexing #24

Adds the DB schema required for indexing data items for GraphQL querying. Also includes a table for tracking bundle status (processed_at + data_item_count). Bundles use a separate SQLite DB (similar to data) to reduce lock contention and support greater bootstrapping flexibility.

Adds the wiring needed to use the new bundle DB in both the StandaloneSqlite class and the tests.

Extracts some small helper functions so they can be used when constructing data item rows too.

Records ANS-104 metadata in new data item tables. Flushing to stable data items tables is not yet implemented. Also implements propagation of a root parent transaction ID to the ANS-104 unbundler. A root parent transaction ID is needed to efficiently find and sort data items when executing GQL queries.

Adds SQL to flush stable data items to the stable data item and data item tags tables as well as remove flushed data from the new data item tables. This is still relatively unoptimized and is not yet exhaustive in its cleanup of stale data.

…ushing PE-3769 Clears the heights on data items > fork height when forks occur and updates data items related to L1 TXs when L1 TX heights are set. Also adds a height condition to the query for data items to flush to stable to avoid unnecessary work when joining to L1 stable tables to retrieve canonical heights. Note: further optimization may still be possible. It may be possible to eliminate one of the joins by replacing it with a join to stable_block_transactions if we add height to stable_block_transactions, though it's unclear how much performance improvement that would yield.

…E-3769 Sometimes we can't fetch transactions when indexing a block. In those cases we still know the height, so we should ensure the height is set on any associated data items.

Further simplifies joins when copying new data items to the stable tables and cleaning up stale data items. Rather than getting height from stable L1 tables, we rely on height on the new data items and only join to stable L1 tables to get the block transaction index.

Combines stable transactions and stable data items using a UNION in the SQL query. Each subquery in the UNION has its own ORDER BY and LIMIT. This allows the sub-selects to do most of the work before the union is computed. This change also implements returning parent/bundledIn for data items. However, filtering based on bundledIn and sorting data items by ID are not functional yet and will be implemented in future commits.

Adds data items to GQL sorting and cursors. Data items are sorted by ID after block height and block TX index. ID was chosen as opposed to bundle offsets or indexes because we want duplicates of the same item sorted consistently where possible. Also, bundle data item indexes are potentially confusing when data item filtering is used. In order to accomplish this, the cursor condition in the query was changed from a simple numeric comparison to a set of comparisons against the cursor components. An OR is required in the comparisons to avoid comparing against irrelevant conditions (e.g. block TX index comparison when height > cursor height). This clutters the WHERE conditions, but is still fairly readable. Also it may perform better since it makes the height comparison legible to SQLite.
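The compound cursor comparison described above can be sketched as follows. This is a minimal illustration for ascending order, not the code from the commit: the `Cursor` shape, the `isAfterCursor` helper, and the column and parameter names in the SQL fragment are all assumptions.

```typescript
// Hypothetical cursor shape; the field names are assumptions for illustration.
interface Cursor {
  height: number;
  blockTransactionIndex: number;
  id: string;
}

// Returns true when `row` sorts strictly after `cursor` in ascending
// (height, blockTransactionIndex, id) order. Each OR branch only looks at a
// less significant component when the more significant ones are equal.
function isAfterCursor(row: Cursor, cursor: Cursor): boolean {
  return (
    row.height > cursor.height ||
    (row.height === cursor.height &&
      (row.blockTransactionIndex > cursor.blockTransactionIndex ||
        (row.blockTransactionIndex === cursor.blockTransactionIndex &&
          row.id > cursor.id)))
  );
}

// The same condition written as a parameterized SQL fragment.
// Column and parameter names are assumed, not taken from the real schema.
const cursorWhereSql = `
  height > @height
  OR (height = @height AND (
    block_transaction_index > @block_transaction_index
    OR (block_transaction_index = @block_transaction_index AND id > @id)
  ))`;
```

Keeping `height > @height` as a standalone top-level comparison is what leaves the height condition visible to the query planner, as opposed to folding all components into one derived numeric value.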

Implements the GraphQL bundledIn/parent filter (parent is deprecated). Filtering on 'null' matches only L1 transactions. Data item queries are skipped in that case. This ensures users do not pay a performance penalty if they only want to query L1. Similarly, L1 transactions are skipped if a bundledIn filter is specified.
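The skipping rules above amount to a small decision function. The sketch below is illustrative only: the `BundledInFilter` type and the `planQueries` helper are hypothetical names, and it assumes an absent filter is represented as `undefined`.

```typescript
// Hypothetical representation of the filter argument:
//   undefined -> no bundledIn filter supplied
//   null      -> match only L1 transactions
//   string[]  -> match data items contained in these bundles
type BundledInFilter = string[] | null | undefined;

interface QueryPlan {
  transactions: boolean; // run the L1 transaction query
  dataItems: boolean; // run the data item query
}

// Decides which of the two queries needs to run for a given filter.
function planQueries(bundledIn: BundledInFilter): QueryPlan {
  if (bundledIn === null) {
    // L1 only: skip the data item query entirely.
    return { transactions: true, dataItems: false };
  }
  if (bundledIn !== undefined) {
    // L1 transactions are never inside a bundle, so skip them.
    return { transactions: false, dataItems: true };
  }
  return { transactions: true, dataItems: true };
}
```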

Adds support for querying data items that have not yet been flushed to the stable (> 50 blocks old) tables. Note, there are still some edge cases to work out with this and new data querying in general. In particular, we don't currently support querying data that has not yet been associated with a block or data that is technically stable but was indexed late (e.g. due to missing chunks) and has not yet been flushed.

Adds a simple queue + worker to index data items (similar to the one for indexing nested data). Currently there is no back pressure or other congestion control, so if the queue gets too backed up it may crash the service. This issue will be addressed in a future commit.

…PE-3769 We pause the stream to give the downstream consumer a chance to set up its own handler before data starts flowing. Of course, this still has to happen relatively quickly since node.js + the OS won't buffer indefinitely once data starts flowing over the network, but it should still prevent some obvious app level races.

Adds queries to retrieve data item tags and return them in GraphQL. In the SQLite DB implementation these are separate queries for convenience. If we were making requests to something like PostgreSQL we'd probably bundle this into the main query.

Since tags are retrieved in a second query by data_item_id, this significantly improves the performance of retrieving tags for data items that have not yet been flushed to the stable data items table (stable data items already have a similar index).

…ing data items PE-3769 We don't want a data item with the same ID to appear multiple times in GraphQL, so we only insert unique IDs into new_data_items. However, we'd still like to have a record of all the bundles containing a particular ID. This is important if a bundle is removed (due to content moderation) or the parent association needs to be changed for any other unforeseen reason.

The wrong id column was being used for new data items and data item was missing from the stable data item join (not needed for transactions since height and block index are sufficient).

Adds a try/catch in the worker thread to log errors. Also alters the error handling in workers so that workers no longer immediately exit when an error occurs. Instead they wait till an error threshold is reached (currently 100 errors) and then exit. This preserves some level of "fail fast" error handling while reducing overhead of creating a new worker after every error.
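A minimal sketch of the thresholded error handling described above. The helper names (`createErrorBudget`, `runTask`) are hypothetical, and whether the real worker exits on the 100th error or only after it is an assumption; here the exit happens when the count reaches the threshold.

```typescript
// Default threshold taken from the commit message ("currently 100 errors").
const MAX_WORKER_ERRORS = 100;

// Tracks how many errors a worker has seen. record() returns true once the
// threshold is reached, which is the signal for the worker to exit.
function createErrorBudget(maxErrors: number = MAX_WORKER_ERRORS) {
  let errorCount = 0;
  return {
    record(): boolean {
      errorCount += 1;
      return errorCount >= maxErrors;
    },
  };
}

// Runs one unit of work. Errors are logged rather than propagated, and the
// worker only exits once the error budget is exhausted.
function runTask(
  task: () => void,
  budget: { record(): boolean },
  log: (error: unknown) => void,
  exit: () => void,
): void {
  try {
    task();
  } catch (error) {
    log(error);
    if (budget.record()) {
      exit();
    }
  }
}
```

In a real worker thread, `exit` would presumably wrap something like `process.exit(1)` and `log` would go to the service logger; both are injected here so the logic can be exercised in isolation.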

Adds WIP bundle schema docs generated by SchemaSpy. Run ./scripts/schemaspy to generate the docs in ./docs/sqlite/bundles. SchemaSpy properties and schema metadata are stored in ./docs/sqlite/bundles.properties and ./docs/sqlite/bundles.meta.xml respectively.

…tems PE-3769 Adds a parent_index and filter_id to bundle_data_items. parent_index (numeric index of the parent bundle in its parent bundle) distinguishes between data_items contained in duplicate parents in the same bundle. filter_id records the filter that caused the data item to be indexed (useful when determining what needs to potentially be reprocessed later).

This moves filtering down into the parser so that we can (in a future commit) emit an event that indicates how many data items within each bundle matched the filter. We want that in order to detect bundles that failed to import successfully. There are a couple of side benefits of this too - 1. it moves more work out of the main thread; 2. it reduces the number of messages that go back to the main thread.

Adds unbundle complete events containing - filter string used to match data items, total data item count, matched data item count. These events will be used to index bundles in the DB. The filter string is included so that we know which bundles need reprocessing when it's changed.

Use a canonical JSON representation for filters to avoid storing the same filter multiple times in the DB.
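To illustrate why a canonical form helps, the sketch below serializes JSON with object keys sorted recursively, so two filters that differ only in key order produce the same string. This is a stand-in under stated assumptions, not the serializer used in the commit (which may rely on a library); the `canonicalize` function and the example filter fields are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes a JSON value with object keys in sorted order and no
// insignificant whitespace. Array order is preserved since it is meaningful.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  const keys = Object.keys(value).sort();
  const members = keys.map(
    (key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`,
  );
  return `{${members.join(",")}}`;
}

// Two spellings of the same (hypothetical) filter yield the same string,
// so only one row would be needed to store it.
const filterA = canonicalize({ tags: [{ name: "a", value: "b" }], always: true });
const filterB = canonicalize({ always: true, tags: [{ value: "b", name: "a" }] });
```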

Records the filter string used to determine which data items to match on the bundle_data_items table in the DB. When filters change, this can be used to help determine what to re-index.

Records bundle records that include first and last timestamps for queuing, skipping, unbundling, and indexing (note: indexing timestamp column is present, but not yet set). Data item counts, both total and matched by the index filter, are also recorded as well as the IDs of the filters used to match both the bundle and the data items in it. These can be used later to decide when to reprocess bundles. Note: 'last_fully_indexed_at' is handled slightly differently from other 'last_*' timestamps. Most are not overwritten if they're already set but 'last_fully_indexed_at' is. It's assumed that if the bundle record is being updated in some way it means the bundle is being reprocessed and its indexing status should be cleared unless it's explicitly set as part of the update.

… PE-4054 The recursive case when getting parent data was incorrectly passing the original ID instead of the parent ID. That led to infinite recursion since it was continually finding the same parent and then trying to download it. This change corrects that and fixes what appeared to be an issue with passing the size for nested bundles. The size should always be the original size. It's only the offset that should be added to during recursion.
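The corrected recursion can be sketched as below. This is an illustration of the rule stated above (recurse on the parent ID, accumulate only the offset, keep the original size), not the actual implementation; `ParentInfo`, `RootLocation`, `resolveRootLocation`, and the in-memory `Map` lookup are hypothetical stand-ins for whatever the real code queries.

```typescript
// Hypothetical record describing where an item sits inside its parent.
interface ParentInfo {
  parentId: string;
  offset: number; // byte offset of the item within the parent
}

interface RootLocation {
  rootId: string;
  offset: number; // byte offset of the original item within the root
  size: number; // size of the original item, unchanged by recursion
}

// Walks up the parent chain until an item with no parent is found.
function resolveRootLocation(
  id: string,
  size: number,
  parents: Map<string, ParentInfo>,
  offset: number = 0,
): RootLocation {
  const parent = parents.get(id);
  if (parent === undefined) {
    return { rootId: id, offset, size };
  }
  // Recurse with the parent's ID (passing `id` again is what would loop
  // forever), add the offset, and pass the size through untouched.
  return resolveRootLocation(parent.parentId, size, parents, offset + parent.offset);
}
```

For example, an item of 50 bytes at offset 100 in a nested bundle that itself starts at offset 2000 in the root resolves to offset 2100 in the root with size 50.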

Small change - removes one unnecessary fallback and adds a couple comments explaining the size logic.

Adds a bundle repair worker that queries the `bundles` and `bundle_data_items` tables to determine which bundles have been fully imported. It does this by setting bundle `last_fully_indexed_at` based on a comparison of `bundle_data_items` for each bundle to `matched_data_item_count` on the bundles (taking filters into account) and then using those `last_fully_indexed_at` timestamps to determine if the bundle should be reprocessed.

Adds ANS104_NESTED_BUNDLE_INDEXED and ANS104_BUNDLE_INDEXED events. ANS104_NESTED_BUNDLE_INDEXED is emitted when a nested ANS-104 bundle is indexed and ready for processing, and ANS104_BUNDLE_INDEXED is a more general event that is emitted when either a nested ANS-104 or an L1 ANS-104 bundle is ready for processing. Also modifies existing bundle event handling logic to use the new combined event and handle both L1 TXs and data items.

… PE-4115 Adds a process that resets bundle timestamps for bundles that were processed with different filters than are currently in use. Since the process creates some DB load even if the filters are unchanged, it is only enabled when the FILTER_CHANGE_REPROCESS environment variable is set to true. In the future we may optimize this further by keeping a log of filter changes. That would enable more efficient queries based on comparing timestamps (< filter change time) rather than filter IDs (using an inequality).

…f rehashing Prior to this change we were hashing the owner key to get the owner address. This change uses the owner address from the data item instead. These should always be the same value so rehashing is unnecessary. Note: I ran a test comparing the values and on the sample of data items I processed there were no differences.

In order to simplify filter construction, if owner_address is set in a filter, but only owner is present on the matchable item (L1 TXs don't include the address), hash owner on-demand to produce an owner_address and match against that.
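The on-demand hashing can be sketched as follows. An Arweave address is the base64url-encoded SHA-256 digest of the owner public key; the `MatchableItem` shape and the helper names below are hypothetical and only illustrate the fallback, not the project's matcher code.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of an item being matched. L1 TXs carry only `owner`;
// data items may already carry `owner_address`.
interface MatchableItem {
  owner?: string; // base64url-encoded public key
  owner_address?: string; // base64url-encoded address
}

// Derives the address from a base64url-encoded owner key.
function ownerToAddress(owner: string): string {
  return createHash("sha256")
    .update(Buffer.from(owner, "base64url"))
    .digest("base64url");
}

// Matches against owner_address when present, hashing owner only as a fallback.
function matchesOwnerAddress(item: MatchableItem, expected: string): boolean {
  const address =
    item.owner_address ??
    (item.owner !== undefined ? ownerToAddress(item.owner) : undefined);
  return address === expected;
}
```

Preferring an existing `owner_address` keeps the hash off the common path, so the extra work is only paid for items that lack the address.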

Commits on Jul 14, 2023

Commits on Jul 17, 2023