-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparison level builders initial work #1689
Comparison level builders initial work #1689
Conversation
heavily based on Robin's final sketch
for things where it doesn't make sense - ElseLevel, potentially CustomLevel
these will probably all eventually go (along with much of the helper functionality), but waiting to see what kind of pattern emerges with cl and ctl
splink/comparison_level_library.py
Outdated
def create_sql(self, sql_dialect): | ||
col = self.input_column(sql_dialect) | ||
lev_fn = dialect_lookup[sql_dialect].levenshtein_function_name | ||
return f"{lev_fn}({col.name_l()}, {col.name_r()}) <= {self.distance_threshold}" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could hide dialect_lookup
from people implementing concrete levels using a trick like this.
This is probably a bit simpler for contributers, having something like
lev_fn = self.dialect.levenshtein_function_name
without having to worry about the fact that the object doesn't really have a dialect, but it comes at some added complexity to the base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they would also then have (potentially) direct access to things like name_l
etc without needing to directly go via InputColumn
Haven't currently paid close attention to things like sensible naming and the like, but just to have something fully working to look at initially |
splink/comparison_level_creator.py
Outdated
""" | ||
self.col_name = col_name | ||
# TODO: do we need to set these to None, or let ComparisonLevel handle missings? | ||
self.m_probability = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i think you're right - these are probably not needed. They are a relic from when i was wondering about whether hte user should also be able to do like LevenshteinLevel.m_probability = 0.9
, but I don't think we're envisaging users should do that
They're probably unnecessary.
from abc import ABC, abstractproperty | ||
|
||
|
||
class SplinkDialect(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the care you've taken in naming things unambiguously! Good to be super clear about sql_dialect:str
the sqlglot dialect, and the SplinkDialect!
|
||
@final | ||
def get_comparison_level(self, sql_dialect: str): | ||
return ComparisonLevel(self.create_level_dict(sql_dialect)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there an argument for turning sql_dialect into a SplinkDialect here? so that e.g. the
def create_sql(self, dialect:SplinkDialect):
col = self.input_column(dialect) #have to modify self.input_column fn slightly
lev_fn = dialect.levenshtein_function_name
return f"{lev_fn}({col.name_l()}, {col.name_r()}) <= {self.distance_threshold}"
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this is simpler than the contextmanager stuff. Also feels quite natural in that whoever is adding these levels will usually have to be making changes to SplinkDialect
and it's subclasses anyhow, so it's natural to then be working with these directly when making the sql rather than needing to go via a lookup
This looks great to me! |
…llowed values in one place (at least on the creator side)
Okay I think this is probably done as far as this particular strand goes (unless there's anything I've missed!) One important thing I haven't been too careful about is sorting out which properties/methods are private, which would probably be useful to do here, as these are specifically user-facing classes. But I thought that this would probably become clearer after a few iterations of implementing additional levels |
Yeah, happy with this, really good! Agree - can decide on what's private after adding a few more levels |
Implements
ComparisonLevelBuilder
base class, as well as subclasses for:NullLevel
ElseLevel
ExactMatchLevel
LevenshteinLevel
Starting work on #1679, and particularly based on the mini-implementation in this comment.
WIP - a few things not done but thought it would be good to get some early eyes before doing too much polishing.
In particular to do:
create_level_dict
/configure