Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added match function to the backend #17

Merged
merged 24 commits into from
Dec 13, 2024
Merged

Added match function to the backend #17

merged 24 commits into from
Dec 13, 2024

Conversation

wpfl-dbt
Copy link
Collaborator

@wpfl-dbt wpfl-dbt commented Dec 3, 2024

Context

Top of the features shopping list and vital for replicating the existing Company Matching endpoint using Matchbox is match().

Given an ID in a source dataset, it should return the paired keys from the target dataset from the perspective of a point of truth.

While there's plenty of edge cases and ways we might extend this, this is the vanilla implementation so we have something to work with.

Changes proposed in this pull request

  • Adds match() as a function throughout the library: client, server base class, PostgreSQL adapter
  • Adds simple unit tests based on identifying "Revolution Inc" across the warehouse data
  • Adds indices to Clusters and Contains to facilitate speed on this query
  • Modifies batch_ingest() to permit use of indices via a new isolate_table function
  • Adds some helper function to visualise the data subgraph -- I used them to help me debug and felt they were useful enough to leave in
  • Minor change to Models.get_lineage_to_dataset(): the dataset model is now also returned, with accompanying change in query() logic

Guidance to review

Focus on matchbox.server.postgresql.utils.query -- this is where the biggest, most confusing changes are.

  • My unit tests lack some coverage. I don't have any examples of simple 1:1 because my source data has no examples of this. I think this is okay for now -- do you agree?
    • Note 1:0 is now covered
  • I chose to implement the IDs as sets rather than str | list[str] | None because I think it's more elegant
  • "Source" and "target" aren't language used elsewhere in the system, but I think it's justified as this is a reasonably unique operation for the system. Do you agree with this language?
  • I considered implementing the API out of scope. Do you agree?

Relevant links

Checklist:

  • My code follows the style guidelines of this project
  • New and existing unit tests pass locally with my changes

@wpfl-dbt wpfl-dbt marked this pull request as ready for review December 8, 2024 14:11
Copy link
Collaborator

@lmazz1-dbt lmazz1-dbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the API endpoint is out of scope, but I'd like to see more tests

@wpfl-dbt
Copy link
Collaborator Author

I agree that the API endpoint is out of scope, but I'd like to see more tests

Added 1:0 coverage as discussed.

@wpfl-dbt wpfl-dbt merged commit 0ed6046 into main Dec 13, 2024
5 checks passed
@wpfl-dbt wpfl-dbt deleted the feature/match branch December 13, 2024 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants