Convert entire backend to use BIGINT primary keys #32

Merged
wpfl-dbt merged 16 commits into feature/new-ingest-process from feature/int-cluster-pk
Dec 19, 2024


@wpfl-dbt wpfl-dbt commented Dec 19, 2024

Context

The new ingest process moves hashing to the server when inserting results. This means we can work with integers locally and use them as keys in the database too, which should speed everything up, as far less data gets moved around than with hashes.
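To make the size argument concrete, here is a rough sketch rather than Matchbox's actual code. It assumes SHA-256 hashes (this PR does not say which hash function is used) and a hypothetical `HashRegistry` that hands out sequential integer IDs on the server side, so callers only ever handle integers.

```python
import hashlib
import struct


class HashRegistry:
    """Hypothetical server-side registry mapping content hashes to integer IDs."""

    def __init__(self) -> None:
        self._ids: dict[bytes, int] = {}

    def id_for(self, payload: bytes) -> int:
        # Hashing happens here, on the "server"; callers only ever see the int.
        digest = hashlib.sha256(payload).digest()  # SHA-256 is an assumption
        # Reuse the ID of a known hash, otherwise allocate the next one.
        return self._ids.setdefault(digest, len(self._ids) + 1)


registry = HashRegistry()
first = registry.id_for(b"cluster-a")
second = registry.id_for(b"cluster-b")
again = registry.id_for(b"cluster-a")  # same payload, same ID

hash_bytes = hashlib.sha256(b"cluster-a").digest_size  # bytes per hash key
bigint_bytes = struct.calcsize("<q")  # bytes per signed 64-bit integer key
```

Under these assumptions each key shrinks from `hash_bytes` to `bigint_bytes`, and the saving repeats in every foreign key column that references it.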

Changes proposed in this pull request

erDiagram
    Sources {
        bigint resolution_id PK,FK
        string alias
        string schema
        string table
        string id
        jsonb indices
    }
    Clusters {
        bigint cluster_id PK,FK
        bytes cluster_hash
        bigint dataset FK
        array[string] source_pk
    }
    Contains {
        bigint parent PK,FK
        bigint child PK,FK
    }
    Probabilities {
        bigint model PK,FK
        bigint cluster PK,FK
        float probability
    }
    Resolutions {
        bigint resolution_id PK,FK
        bytes resolution_hash
        enum type
        string name
        string description
        float truth
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        int level
        float truth_cache
    }

    Sources |o--|| Resolutions : ""
    Sources ||--o{ Clusters : ""
    Clusters ||--o{ Contains : "parent, child"
    Clusters ||--o{ Probabilities : ""
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ ResolutionFrom : "child, parent"
  • Move entire ORM to use integer primary keys
  • Temporarily adapt client to pass tests with this change
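As a rough illustration of the integer-key layout in the diagram above, the sketch below builds two of the tables in an in-memory SQLite database. This is not the project's ORM: SQLite's `INTEGER` stands in for `BIGINT`, the table names are lower-cased from the diagram, only a subset of columns is included, and the hash values and the `add_cluster` helper are placeholders invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE clusters (
        cluster_id   INTEGER PRIMARY KEY,  -- stand-in for BIGINT
        cluster_hash BLOB                  -- hash is kept, but is no longer the key
    );
    CREATE TABLE contains (
        parent INTEGER NOT NULL REFERENCES clusters (cluster_id),
        child  INTEGER NOT NULL REFERENCES clusters (cluster_id),
        PRIMARY KEY (parent, child)
    );
    """
)


def add_cluster(cluster_hash: bytes) -> int:
    """Insert a cluster and return the integer key the database assigned."""
    cur = conn.execute(
        "INSERT INTO clusters (cluster_hash) VALUES (?)", (cluster_hash,)
    )
    return cur.lastrowid


parent = add_cluster(b"\x01" * 32)  # placeholder 32-byte hashes
child = add_cluster(b"\x02" * 32)
conn.execute("INSERT INTO contains (parent, child) VALUES (?, ?)", (parent, child))

# The parent/child join now compares integers rather than hash bytes.
rows = conn.execute(
    "SELECT c.cluster_id FROM contains AS k "
    "JOIN clusters AS c ON c.cluster_id = k.child WHERE k.parent = ?",
    (parent,),
).fetchall()
```

The hash stays on the row for lookups, while every relationship table only stores the integer keys.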

Guidance to review

Start with the ORM and new README diagram, then the server side changes, then the client.

I've chosen to rename hash to id in data coming out of Matchbox. I'm worried this name is too generic. What do you think?

Checklist:

  • My code follows the style guidelines of this project
  • New and existing unit tests pass locally with my changes

@wpfl-dbt wpfl-dbt requested a review from leo-mazzone December 19, 2024 13:00
@wpfl-dbt wpfl-dbt marked this pull request as ready for review December 19, 2024 13:00

@leo-mazzone leo-mazzone left a comment


We reviewed this on a call; it all looks good enough.

@wpfl-dbt wpfl-dbt merged commit 0ad8b4a into feature/new-ingest-process Dec 19, 2024
3 checks passed
@wpfl-dbt wpfl-dbt deleted the feature/int-cluster-pk branch December 19, 2024 18:09