ANS-104 bundle indexing #24

Merged
merged 33 commits into from
Jul 18, 2023

Commits on Jul 14, 2023

  1. feat(bundles): add bundle/data item GQL index schema PE-3769

    Adds the DB schema required for indexing data items for GraphQL
    querying. Also includes a table for tracking bundle status (processed_at
    + data_item_count). Bundles use a separate SQLite DB (similar to data)
    to reduce lock contention and support greater bootstrapping flexibility.
    djwhitt committed Jul 14, 2023
    b4418f7 View commit details
  2. feat(sqlite): add bundle DB support to StandaloneSqlite PE-3769

    Adds the wiring needed to use the new bundle DB in both the
    StandaloneSqlite class and the tests.
    djwhitt committed Jul 14, 2023
    4f0cee8 View commit details
  3. refactor(sqlite): extract tx row construction helper functions PE-3769

    Extracting some small helper functions so they can be used when constructing
    data item rows too.
    djwhitt committed Jul 14, 2023
    8b4d8e2 View commit details
  4. feat(bundles): index ANS-104 bundles in new data item tables PE-3769

    Records ANS-104 metadata in new data item tables. Flushing to stable
    data items tables is not yet implemented. Also implements propagation of
    a root parent transaction ID to the ANS-104 unbundler. A root parent
    transaction ID is needed to efficiently find and sort data items when
    executing GQL queries.
    djwhitt committed Jul 14, 2023
    ab15e67 View commit details
  5. feat(bundles): save stable ans-104 data items PE-3769

    Adds SQL to flush stable data items to the stable data item and data
    item tags tables as well as remove flushed data from the new data item
    tables. This is still relatively unoptimized and is not yet exhaustive
    in its cleanup of stale data.
    djwhitt committed Jul 14, 2023
    bb70d8c View commit details
  6. feat(bundles): improve data item height tracking + optimize stable flushing PE-3769
    
    Clears the heights on data items > fork height when forks occur and
    updates data items related to L1 TXs when L1 TX heights are set. Also
    adds a height condition to the query for data items to flush to stable
    to avoid unnecessary work when joining to L1 stable tables to retrieve
    canonical heights.
    
    Note: further optimization may still be possible. It may be possible to
    eliminate one of the joins by replacing it with a join to
    stable_block_transactions if we add height to stable_block_transactions.
    Though, it's unclear how much performance improvement that would yield.
    djwhitt committed Jul 14, 2023
    98e76a9 View commit details
  7. fix(bundles): set data item heights even when L1 TX retrieval fails PE-3769
    
    Sometimes we can't fetch transactions when indexing a block. In those
    cases we still know the height, so we should ensure the height is set on
    any associated data items.
    djwhitt committed Jul 14, 2023
    4534830 View commit details
  8. perf(sqlite bundles): remove more data item flushing joins PE-3769

    Further simplifies joins when copying new data items to the stable
    tables and cleaning up stale data items. Rather than getting height from
    stable L1 tables, we rely on height on the new data items and only join
    to stable L1 tables to get the block transaction index.
    djwhitt committed Jul 14, 2023
    e5122f1 View commit details
  9. feat(sqlite bundles): add ability to query stable data items PE-3769

    Combines stable transactions and stable data items using a UNION in the
    SQL query. Each subquery in the UNION has its own ORDER BY and LIMIT.
    This allows the sub-selects to do most of the work before the union is
    computed. This change also implements returning parent/bundledIn for data
    items. However, filtering based on bundledIn and sorting data items by
    ID are not functional yet and will be implemented in future commits.
    djwhitt committed Jul 14, 2023
    adad9a3 View commit details
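
The per-subquery ORDER BY and LIMIT described in commit 9 can be illustrated in memory. This is only a sketch of the idea, not the SQL used in the PR: `unionWithLimits` and its parameters are hypothetical names, and plain arrays stand in for the stable transaction and stable data item tables.

```typescript
// Takes the first `limit` rows from each source before merging, mirroring a
// UNION whose subqueries each carry their own ORDER BY and LIMIT.
function unionWithLimits<T>(
  sources: T[][],
  limit: number,
  compare: (a: T, b: T) => number,
): T[] {
  const candidates: T[] = [];
  for (const rows of sources) {
    // Copy before sorting so the caller's arrays are left untouched.
    const sorted = rows.slice().sort(compare);
    candidates.push(...sorted.slice(0, limit));
  }
  // Only the pre-trimmed candidates need to be sorted again.
  return candidates.sort(compare).slice(0, limit);
}
```

Any row in the overall top N must also be in the top N of its own source, so trimming each source first loses nothing while keeping the final merge small.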
  10. feat(sqlite graphql): include data items in sorting and cursors PE-3769

    Adds data items to GQL sorting and cursors. Data items are sorted by ID
    after block height and block TX index. ID was chosen as opposed to
    bundle offsets or indexes because we want duplicates of the same item
    sorted consistently where possible. Also, bundle data item indexes are
    potentially confusing when data item filtering is used.
    
    In order to accomplish this, the cursor condition in the query was
    changed from a simple numeric comparison to a set of comparisons against
    the cursor components. An OR is required in the comparisons to avoid
    comparing against irrelevant conditions (e.g. block TX index comparison
    when height > cursor height). This clutters the WHERE conditions, but is
    still fairly readable. Also it may perform better since it makes the
    height comparison legible to SQLite.
    djwhitt committed Jul 14, 2023
    3c904ca View commit details
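
The cursor comparison from commit 10 can be sketched as a comparison over (height, block TX index, ID). The following is a minimal illustration, assuming ascending sort order; the `Cursor` shape, the function name, and the SQL column and parameter names are all assumptions rather than the PR's actual code. For descending order the comparison operators would flip.

```typescript
// Hypothetical cursor shape; the field names are assumptions for illustration.
interface Cursor {
  height: number;
  blockTransactionIndex: number;
  id: string;
}

// Returns true when `row` sorts strictly after `cursor` in ascending
// (height, blockTransactionIndex, id) order. The ORs keep lower-priority
// components from being compared when a higher-priority one already decides.
function isAfterCursor(row: Cursor, cursor: Cursor): boolean {
  return (
    row.height > cursor.height ||
    (row.height === cursor.height &&
      (row.blockTransactionIndex > cursor.blockTransactionIndex ||
        (row.blockTransactionIndex === cursor.blockTransactionIndex &&
          row.id > cursor.id)))
  );
}

// The same condition as a parameterized SQL fragment (names assumed).
const afterCursorSql = `
  height > @height OR (
    height = @height AND (
      block_transaction_index > @block_transaction_index OR (
        block_transaction_index = @block_transaction_index AND id > @id
      )
    )
  )`;
```

Keeping `height` as its own top-level comparison is what the commit message refers to as making the height comparison legible to SQLite.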
  11. feat(sqlite graphql): add bundledIn/parent filter support PE-3769

    Implements the GraphQL bundledIn/parent filter (parent is deprecated).
    Filtering on 'null' matches only L1 transactions. Data item queries are
    skipped in that case. This ensures users do not pay a performance
    penalty if they only want to query L1. Similarly, L1 transactions are
    skipped if a bundledIn filter is specified.
    djwhitt committed Jul 14, 2023
    3fdf15a View commit details
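
The query-skipping behavior in commit 11 amounts to a small decision table. The sketch below assumes a particular encoding of the filter (`undefined` for "no filter", `null` for "L1 only", an array for specific parent bundle IDs); that encoding and all names are hypothetical.

```typescript
// `undefined`: no bundledIn filter; `null`: L1 transactions only;
// array: IDs of parent bundles. This encoding is an assumption.
type BundledInFilter = string[] | null | undefined;

interface QueryPlan {
  queryTransactions: boolean;
  queryDataItems: boolean;
}

// Decides which of the two subqueries needs to run at all.
function planQueries(bundledIn: BundledInFilter): QueryPlan {
  if (bundledIn === undefined) {
    return { queryTransactions: true, queryDataItems: true };
  }
  if (bundledIn === null) {
    // L1 transactions have no parent bundle, so data items cannot match.
    return { queryTransactions: true, queryDataItems: false };
  }
  // Only data items can be bundled in something.
  return { queryTransactions: false, queryDataItems: true };
}
```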
  12. feat(sqlite graphql): support querying "new" data items PE-3769

    Adds support for querying data items that have not yet been flushed to
    the stable (> 50 blocks old) tables. Note, there are still some edge
    cases to work out with this and new data querying in general. In
    particular, we don't currently support querying data that has not yet
    been associated with a block or data that is technically stable but was
    indexed late (e.g. due to missing chunks) and has not yet been flushed.
    djwhitt committed Jul 14, 2023
    0f8894b View commit details
  13. feat(ans-104 bundles): add worker to index data items PE-3769

    Adds a simple queue + worker to index data items (similar to the one for
    indexing nested data). Currently there is no back pressure or other
    congestion control so if the queue gets too backed up it may crash the
    service. This issue will be addressed in a future commit.
    djwhitt committed Jul 14, 2023
    f78b6da View commit details
  14. fix(data): pause the cache stream after setting up internal handlers PE-3769
    
    We pause the stream to give the downstream consumer a chance to set up
    its own handler before data starts flowing. Of course, this still has
    to happen relatively quickly since node.js + the OS won't buffer
    indefinitely once data starts flowing over the network, but it should
    still prevent some obvious app level races.
    djwhitt committed Jul 14, 2023
    96d0680 View commit details
  15. fix(bundles graphql): correctly return data items tags PE-3769

    Adds queries to retrieve data item tags and return them in GraphQL. In
    the SQLite DB implementation these are separate queries for convenience.
    If we were making requests to something like PostgreSQL we'd probably
    bundle this into the main query.
    djwhitt committed Jul 14, 2023
    6993449 View commit details
  16. perf(sqlite graphql): add new_data_item data_item_id index PE-3769

    Since tags are retrieved in a second query by data_item_id, this
    significantly improves the performance of retrieving tags for data items
    that have not yet been flushed to the stable data items table (stable
    data items already have a similar index).
    djwhitt committed Jul 14, 2023
    1832d23 View commit details
  17. feat(sqlite bundles): record all parent/child relationships for matching data items PE-3769
    
    We don't want a data item with the same ID to appear multiple times in
    GraphQL, so we only insert unique IDs into new_data_items. However, we'd
    still like to have a record of all the bundles containing a particular
    ID. This is important if a bundle is removed (due to content moderation)
    or the parent association needs to be changed for any other unforeseen
    reason.
    djwhitt committed Jul 14, 2023
    9218dba View commit details
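
The split described in commit 17 (one row per data item ID for GraphQL, plus a record of every containing bundle) can be modeled with two in-memory collections. This is a sketch only: the class and method names are hypothetical, and mapping the second collection to `bundle_data_items` is an assumption based on the table name used in later commits.

```typescript
interface DataItemRow {
  id: string;
  parentId: string;
}

class DataItemIndex {
  // Stands in for new_data_items: at most one row per data item ID.
  readonly items = new Map<string, DataItemRow>();
  // Stands in for bundle_data_items (assumed): every pairing ever seen.
  readonly relationships: DataItemRow[] = [];

  add(id: string, parentId: string): void {
    // First writer wins, so GraphQL never sees the same ID twice.
    if (!this.items.has(id)) {
      this.items.set(id, { id, parentId });
    }
    const seen = this.relationships.some(
      (r) => r.id === id && r.parentId === parentId,
    );
    if (!seen) {
      this.relationships.push({ id, parentId });
    }
  }

  // All bundles known to contain the given data item.
  parentsOf(id: string): string[] {
    return this.relationships
      .filter((r) => r.id === id)
      .map((r) => r.parentId);
  }
}
```

With the full relationship list available, a removed bundle can be replaced as the recorded parent by any other bundle returned from `parentsOf`.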
  18. fix(sqlite bundles): correct join condition for data item tags PE-3769

    The wrong id column was being used for new data items and data item was
    missing from the stable data item join (not needed for transactions
    since height and block index are sufficient).
    djwhitt committed Jul 14, 2023
    29b2ee9 View commit details
  19. chore(sqlite): improve worker error logging PE-3769

    Adds a try/catch in the worker thread to log errors. Also alters the
    error handling in workers so that workers no longer immediately exit
    when an error occurs. Instead they wait till an error threshold is
    reached (currently 100 errors) and then exit. This preserves some level
    of "fail fast" error handling while reducing overhead of creating a new
    worker after every error.
    djwhitt committed Jul 14, 2023
    199bfe4 View commit details
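
The error-threshold behavior from commit 19 might look roughly like the following. Only the threshold of 100 comes from the commit message; the class, the function, the injected `log` callback, and the choice to exit on the 100th error (rather than after it) are assumptions.

```typescript
// Threshold taken from the commit message; everything else is hypothetical.
const MAX_WORKER_ERRORS = 100;

class WorkerErrorTracker {
  private errorCount = 0;

  // Records an error and reports whether the worker should now exit.
  recordError(): boolean {
    this.errorCount += 1;
    return this.errorCount >= MAX_WORKER_ERRORS;
  }
}

// Runs a job, logging failures instead of letting them end the worker.
// Returns true when the caller should shut the worker down.
function runJob(
  job: () => void,
  tracker: WorkerErrorTracker,
  log: (message: string) => void,
): boolean {
  try {
    job();
    return false;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    log(`worker error: ${message}`);
    return tracker.recordError();
  }
}
```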
  20. doc(sqlite): add WIP bundle schema docs PE-3769

    Adds WIP bundle schema docs generated by SchemaSpy. Run
    ./scripts/schemaspy to generate the docs in ./docs/sqlite/bundles.
    SchemaSpy properties and schema metadata are stored in
    ./docs/sqlite/bundles.properties and ./docs/sqlite/bundles.meta.xml
    respectively.
    djwhitt committed Jul 14, 2023
    11f7e9d View commit details
  21. feat(sqlite bundles): add filter_id and parent_index to bundle_data_items PE-3769
    
    Adds a parent_index and filter_id to bundle_data_items. parent_index
    (numeric index of the parent bundle in its parent bundle) distinguishes
    between data_items contained in duplicate parents in the same bundle.
    filter_id records the filter that caused the data item to be indexed
    (useful when determining what needs to potentially be reprocessed
    later).
    djwhitt committed Jul 14, 2023
    56a6dfb View commit details
  22. refactor(bundles ans-104): push filtering down into worker PE-3769

    This moves filtering down into the parser so that we can (in a future
    commit) emit an event that indicates how many data items within each
    bundle matched the filter. We want that in order to detect bundles that
    failed to import successfully. There are a couple of side benefits of
    this too: (1) it moves more work out of the main thread; (2) it reduces
    the number of messages that go back to the main thread.
    djwhitt committed Jul 14, 2023
    2a6bf3e View commit details
  23. feat(bundles ans-104): emit unbundle complete events PE-3769

    Adds unbundle complete events containing - filter string used to match
    data items, total data item count, matched data item count. These events
    will be used to index bundles in the DB. The filter string is included
    so that we know which bundles need reprocessing when it's changed.
    djwhitt committed Jul 14, 2023
    db55aba View commit details
  24. feat(bundles filters): canonicalize bundle filter string PE-3769

    Use a canonical JSON representation for filters to avoid storing the
    same filter multiple times in the DB.
    djwhitt committed Jul 14, 2023
    24b219e View commit details
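
One common way to get a canonical JSON string, as commit 24 calls for, is to serialize with object keys sorted recursively. The commit message does not say which canonical form the PR uses, so the approach and the `canonicalize` name below are assumptions.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes a JSON value with object keys sorted recursively, so that
// filters differing only in key order produce the same string.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is significant and is preserved.
    return '[' + value.map((element) => canonicalize(element)).join(',') + ']';
  }
  const keys = Object.keys(value).sort();
  const entries = keys.map(
    (key) => JSON.stringify(key) + ':' + canonicalize(value[key] as Json),
  );
  return '{' + entries.join(',') + '}';
}
```

Two filters that differ only in key order then map to one stored string, which is what prevents duplicate filter rows.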
  25. feat(bundles filters): record data item filters in the DB PE-3769

    Records the filter string used to determine which data items to match on
    the bundle_data_items table in the DB. When filters change, this can be
    used to help determine what to re-index.
    djwhitt committed Jul 14, 2023
    3e76b93 View commit details
  26. feat(bundles): add bundle process tracking PE-3769

    Records bundle records that include first and last timestamps for
    queuing, skipping, unbundling, and indexing (note: indexing timestamp
    column is present, but not yet set). Data item counts, both total and
    matched by the index filter, are also recorded as well as the IDs of the
    filters used to match both the bundle and the data items in it. These
    can be used later to decide when to reprocess bundles.
    
    Note: 'last_fully_indexed_at' is handled slightly differently from other
    'last_*' timestamps. Most are not overwritten if they're already set but
    'last_fully_indexed_at' is. It's assumed that if the bundle record is
    being updated in some way it means the bundle is being reprocessed and
    its indexing status should be cleared unless it's explicitly set as
    part of the update.
    djwhitt committed Jul 14, 2023
    38cb2c1 View commit details
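
One possible reading of the note in commit 26 is that most `last_*` timestamps keep their stored value when an update omits them, while `last_fully_indexed_at` is always taken from the update and therefore cleared when omitted. The sketch below encodes that reading; the interface, the function name, and every field other than the one mirroring `last_fully_indexed_at` are assumptions.

```typescript
// Hypothetical subset of a bundle record. Only lastFullyIndexedAt mirrors a
// column named in the commit message (last_fully_indexed_at).
interface BundleTimestamps {
  lastQueuedAt: number | null;
  lastUnbundledAt: number | null;
  lastFullyIndexedAt: number | null;
}

function applyBundleUpdate(
  existing: BundleTimestamps,
  update: Partial<BundleTimestamps>,
): BundleTimestamps {
  return {
    // Kept from the existing record unless the update supplies a value.
    lastQueuedAt: update.lastQueuedAt ?? existing.lastQueuedAt,
    lastUnbundledAt: update.lastUnbundledAt ?? existing.lastUnbundledAt,
    // Always taken from the update: omitting it clears the indexed status.
    lastFullyIndexedAt: update.lastFullyIndexedAt ?? null,
  };
}
```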
  27. fix(bundles data): fix infinite recursion when parent data is missing PE-4054
    
    The recursive case when getting parent data was incorrectly passing the
    original ID instead of the parent ID. That led to infinite recursion
    since it was continually finding the same parent and then trying to
    download it. This change corrects that and fixes what appeared to be an
    issue with passing the size for nested bundles. The size should
    always be the original size. It's only the offset that should be added
    to during recursion.
    djwhitt committed Jul 14, 2023
    ff07818 View commit details
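
The recursion fix in commit 27 can be illustrated with a simplified parent lookup. This sketch replaces the real download path with an in-memory map, and all names (`ParentInfo`, `DataRegion`, `resolveRegion`) are hypothetical; it shows only the two invariants the commit message states: recurse on the parent ID, and accumulate the offset while leaving the size unchanged.

```typescript
// Hypothetical record describing where an item lives inside its parent.
interface ParentInfo {
  parentId: string;
  offset: number; // byte offset of the item within its parent
}

interface DataRegion {
  rootId: string;
  offset: number;
  size: number;
}

// Walks up the parent chain until it reaches an item with no parent.
// `size` is passed through unchanged; only `offset` accumulates.
function resolveRegion(
  id: string,
  size: number,
  parents: Map<string, ParentInfo>,
  offset: number = 0,
): DataRegion {
  const parent = parents.get(id);
  if (parent === undefined) {
    return { rootId: id, offset, size };
  }
  // Recurse with the parent's ID. Passing `id` again here would look up the
  // same parent forever, which is the bug the commit describes.
  return resolveRegion(parent.parentId, size, parents, offset + parent.offset);
}
```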
  28. refactor(data cache): simplify and comment cache size logic PE-4054

    Small change - removes one unnecessary fallback and adds a couple of
    comments explaining the size logic.
    djwhitt committed Jul 14, 2023
    58665d9 View commit details
  29. feat(bundles repair): add bundle repair worker PE-4041

    Adds a bundle repair worker that queries `bundles` and
    `bundle_data_items` tables to determine which bundles have been fully
    imported. It does this by setting bundle `last_fully_indexed_at` based
    on a comparison of `bundle_data_items` for each bundle to
    `matched_data_item_count` on the bundles (taking filters into account)
    and then using those `last_fully_indexed_at` timestamps to determine if
    the bundle should be reprocessed.
    djwhitt committed Jul 14, 2023
    477511e View commit details
  30. feat(sqlite bundles): index nested ANS-104 bundles PE-3639

    Adds ANS104_NESTED_BUNDLE_INDEXED and ANS104_BUNDLE_INDEXED events.
    ANS104_NESTED_BUNDLE_INDEXED is emitted when a nested ANS-104 bundle is
    indexed and ready for processing and ANS104_BUNDLE_INDEXED is a more
    general event that is emitted when either a nested ANS-104 or a L1
    ANS-104 bundle is ready for processing. Also modifies existing bundle
    event handling logic to use the new combined event and handle both L1
    TXs and data items.
    djwhitt committed Jul 14, 2023
    9405022 View commit details
  31. feat(bundles): add a process to reindex bundles after a filter change PE-4115
    
    Adds a process that resets bundle timestamps for bundles that were
    processed with different filters than are currently in use. Since the
    process creates some DB load even if the filters are unchanged, it is
    only enabled when the FILTER_CHANGE_REPROCESS environment variable is
    set to true. In the future we may optimize this further by keeping a log
    of filter changes. That would enable more efficient queries based on
    comparing timestamps (< filter change time) rather than filter IDs
    (using an inequality).
    djwhitt committed Jul 14, 2023
    a75455d View commit details
  32. refactor(bundles ans-104): use owner address from data item instead of rehashing
    
    Prior to this change we were hashing the owner key to get the owner
    address. This change uses the owner address from the data item instead.
    These should always be the same value so rehashing is unnecessary.
    
    Note: I ran a test comparing the values and on the sample of data items
    I processed there were no differences.
    djwhitt committed Jul 14, 2023
    66181c7 View commit details

Commits on Jul 17, 2023

  1. feat(filters): support on-demand owner hashing PE-4214

    In order to simplify filter construction, if owner_address is set in a
    filter, but only owner is present on the matchable item (L1 TXs don't
    include the address), hash owner on-demand to produce an owner_address
    and match against that.
    djwhitt committed Jul 17, 2023
    d3e9457 View commit details
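
The on-demand hashing described in the PE-4214 commit can be sketched as a fallback inside the match function. The `owner` and `owner_address` field names come from the commit message; the function name and the injected `hashOwner` callback are hypothetical, with the callback standing in for the real owner-key hashing so the sketch stays dependency-free.

```typescript
interface MatchableItem {
  owner?: string;
  owner_address?: string;
}

// Matches an item against an expected owner address, hashing `owner` only
// when the item does not already carry an address (as with L1 TXs).
function matchesOwnerAddress(
  item: MatchableItem,
  expectedAddress: string,
  hashOwner: (owner: string) => string,
): boolean {
  if (item.owner_address !== undefined) {
    // Address already present: no hashing needed.
    return item.owner_address === expectedAddress;
  }
  if (item.owner !== undefined) {
    // Hash lazily, only for items that lack an address.
    return hashOwner(item.owner) === expectedAddress;
  }
  return false;
}
```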