Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to choose the columns we index on #22

Merged
merged 16 commits into from
Dec 12, 2024

Conversation

wpfl-dbt
Copy link
Collaborator

@wpfl-dbt wpfl-dbt commented Dec 5, 2024

Context

Matchbox currently indexes every column. This makes it difficult to control how data is hashed, and is vulnerable to column renames or reorderings. Ticket. Feature shopping list.

Changes proposed in this pull request

  • Adds column specification to datasets.toml config file. You can configure:
    • Order of columns
    • Alias of columns
    • Datatype of columns
    • A special * character as a shorthand for "select all"
  • Adds table aliasing, column specification and hashing to Source object, ensuring private data isn't sent to the server
  • Adds Sources.alias and Sources.indices to PostgreSQL backend, enabling reconstruction of Source by client if they have permission to access the table
  • Adds a check and warning to query if a non-indexed column is selected
  • Adds unit tests to cover these additions

Guidance to review

  • Check unit tests pass
  • Confirm you're happen with the two extra columns in Sources, especially the datatype of the column indexes (JSONB with base64-encoded hashes, as pure bytes isn't permitted)
  • Confirm you're happy with the security model of what's stored vs what's retrieved
    • If you don't have warehouse creds, you can't know what columns have been used to index in Matchbox
    • BUT, internal users who have access and know the table can recover them
    • AND hypothetical external uses can check whether their hashes are matching ours as they tinker with federation settings
  • Note SourceColumn.type isn't actually used anywhere yet -- the hashing algorithm casts everything to string anyway

Checklist:

  • My code follows the style guidelines of this project
  • New and existing unit tests pass locally with my changes

@wpfl-dbt wpfl-dbt marked this pull request as ready for review December 6, 2024 18:23
@lmazz1-dbt lmazz1-dbt self-assigned this Dec 9, 2024
@wpfl-dbt wpfl-dbt merged commit c426882 into main Dec 12, 2024
5 checks passed
@wpfl-dbt wpfl-dbt deleted the feature/index-columns branch December 12, 2024 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants