Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison level builders initial work #1689

Merged
merged 22 commits into from
Nov 2, 2023

Conversation

ADBond
Copy link
Contributor

@ADBond ADBond commented Nov 1, 2023

Implements ComparisonLevelBuilder base class, as well as subclasses for:

  • NullLevel
  • ElseLevel
  • ExactMatchLevel
  • LevenshteinLevel

Starting work on #1679, and particularly based on the mini-implementation in this comment.

WIP - a few things not done but thought it would be good to get some early eyes before doing too much polishing.
In particular to do:

  • docstrings fleshed out
  • possibly revise implementation of create_level_dict / configure
  • possible further hiding of dialect from subclasses (see comment 👇 )

Comment on lines 45 to 49
def create_sql(self, sql_dialect):
col = self.input_column(sql_dialect)
lev_fn = dialect_lookup[sql_dialect].levenshtein_function_name
return f"{lev_fn}({col.name_l()}, {col.name_r()}) <= {self.distance_threshold}"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could hide dialect_lookup from people implementing concrete levels using a trick like this.
This is probably a bit simpler for contributers, having something like

lev_fn = self.dialect.levenshtein_function_name

without having to worry about the fact that the object doesn't really have a dialect, but it comes at some added complexity to the base class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they would also then have (potentially) direct access to things like name_l etc without needing to directly go via InputColumn

@ADBond
Copy link
Contributor Author

ADBond commented Nov 1, 2023

Haven't currently paid close attention to things like sensible naming and the like, but just to have something fully working to look at initially

"""
self.col_name = col_name
# TODO: do we need to set these to None, or let ComparisonLevel handle missings?
self.m_probability = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i think you're right - these are probably not needed. They are a relic from when i was wondering about whether hte user should also be able to do like LevenshteinLevel.m_probability = 0.9, but I don't think we're envisaging users should do that

They're probably unnecessary.

from abc import ABC, abstractproperty


class SplinkDialect(ABC):
Copy link
Member

@RobinL RobinL Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the care you've taken in naming things unambiguously! Good to be super clear about sql_dialect:str the sqlglot dialect, and the SplinkDialect!


@final
def get_comparison_level(self, sql_dialect: str):
return ComparisonLevel(self.create_level_dict(sql_dialect))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an argument for turning sql_dialect into a SplinkDialect here? so that e.g. the

def create_sql(self, dialect:SplinkDialect):
        col = self.input_column(dialect) #have to modify self.input_column fn slightly
        lev_fn = dialect.levenshtein_function_name
        return f"{lev_fn}({col.name_l()}, {col.name_r()}) <= {self.distance_threshold}"
        ```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this is simpler than the contextmanager stuff. Also feels quite natural in that whoever is adding these levels will usually have to be making changes to SplinkDialect and it's subclasses anyhow, so it's natural to then be working with these directly when making the sql rather than needing to go via a lookup

@RobinL
Copy link
Member

RobinL commented Nov 1, 2023

This looks great to me!

@ADBond ADBond marked this pull request as ready for review November 1, 2023 19:18
@ADBond
Copy link
Contributor Author

ADBond commented Nov 1, 2023

Okay I think this is probably done as far as this particular strand goes (unless there's anything I've missed!)

One important thing I haven't been too careful about is sorting out which properties/methods are private, which would probably be useful to do here, as these are specifically user-facing classes. But I thought that this would probably become clearer after a few iterations of implementing additional levels

@RobinL
Copy link
Member

RobinL commented Nov 1, 2023

Yeah, happy with this, really good!

Agree - can decide on what's private after adding a few more levels

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants