-
Notifications
You must be signed in to change notification settings - Fork 274
Conversation
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
fixing #859 |
Signed-off-by: Sarad Mohanan <[email protected]>
Can you add a fuller description on your approach in this PR? I'll plan to review this next week when it's ready! |
Signed-off-by: Sarad Mohanan <[email protected]>
@sungchun12 I have updated the description please have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I really like this approach and you did a great job of cleaning up the implementation emails and reducing kwargs
bloat for function calls.
Once you update the data class for a pydantic class and update the relevant attribute calls, I'll give this another review and aim to merge this week!
@@ -14,7 +14,7 @@ | |||
def connect_to_table( | |||
db_info: Union[str, dict], | |||
table_name: Union[DbPath, str], | |||
key_columns: str = ("id",), | |||
key_columns: Tuple[str] = ("id",), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, good catch. You're reminding me that we need to add static type checking to this codebase
data_diff/cli_options.py
Outdated
|
||
|
||
@dataclass | ||
class CliOptions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing all this really puts into perspective just how many CLI options are available that need validation. Can you refactor this using pydantic?
Until we have static type checking in place, runtime type validation will serve this better
quick pseudo-code
from typing import Optional, Literal, Dict, Union, Tuple
from pydantic import BaseModel
class CliOptions(BaseModel):
bisection_factor: int
bisection_threshold: int
table_write_limit: int
database1: Union[str, Dict, None] = None
table1: Optional[str] = None
database2: Union[str, Dict, None] = None
table2: Optional[str] = None
key_columns: Tuple[str, ...] = ()
update_column: Optional[str] = None
columns: Tuple[str, ...] = ()
limit: Optional[int] = None
materialize_to_table: Optional[str] = None
min_age: Optional[str] = None
max_age: Optional[str] = None
stats: bool = False
debug: bool = False
json_output: bool = False
verbose: bool = False
version: bool = False
interactive: bool = False
no_tracking: bool = False
case_sensitive: bool = False
assume_unique_key: bool = False
sample_exclusive_rows: bool = False
materialize_all_rows: bool = False
threads: Union[None, int, Literal["serial"]] = None
threads1: Optional[int] = None
threads2: Optional[int] = None
threaded: bool = False
where: Optional[str] = None
algorithm: Literal["auto", "joindiff", "hashdiff"] = "auto"
conf: Optional[str] = None
run: Optional[str] = None
dbt: bool = False
cloud: bool = False
dbt_profiles_dir: Optional[str] = None
dbt_project_dir: Optional[str] = None
select: Optional[str] = None
state: Optional[str] = None
prod_database: Optional[str] = None
prod_schema: Optional[str] = None
__conf__: Optional[Dict] = None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will have to change the attribute calls for the rest of the files as a result
tests/test_main.py
Outdated
assert threaded | ||
assert threads == 6 | ||
_set_threads(cli_options) | ||
assert str(value_error.exception) == "Error: threads must be of type int, or value must be 'serial'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pydantic will resolve this at runtime validation, but this custom error message is nice
Co-authored-by: Sung Won Chung <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
Signed-off-by: Sarad Mohanan <[email protected]>
@sungchun12 I have addressed the review comments please take a look |
@sar009 thank you! I'll review later this week. |
Mirror this PR for the full test suite to uncover any edge cases: #871 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, really good!
Just need new tests for these methods/functions given this codebase hasn't tested them before and we need to make sure the new args we're passing are doing exactly as expected.
_apply_config
apply_config_from_file
_print_result
_data_diff
create_schema
diff_schemas
f"Maximum number of rows to write when creating materialized or sample tables, per thread. " | ||
f"Default={TABLE_WRITE_LIMIT}" | ||
), | ||
type=int, | ||
metavar="COUNT", | ||
) | ||
@click.option( | ||
"-j", | ||
"--threads", | ||
default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default=None, | |
default=None, | |
type=int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cli_options.dbt_project_dir = os.path.expanduser(cli_options.dbt_project_dir) | ||
|
||
if cli_options.dbt: | ||
dbt_diff(cli_options, log_status_handler=log_handlers.get("log_status_handler")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to verify how cli_options kwargs are being passed into this method
|
||
return new_kw | ||
cli_options.run_args = run_args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense instead of returning new_kw
since cli_options
stores config state now
from pydantic import BaseModel, PositiveInt | ||
|
||
|
||
class CliOptions(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is bloated and will become bloated over time, but I'm willing to accept that tradeoff for better runtime validation of flag types and simplification of kwargs**
across method calls. Overall, this PR has opened my eyes that a major refactor needs to happen across lots of the codebase for simpler data flows, but in the short-term, this approach is worth it.
This pull request has been marked as stale because it has been open for 60 days with no activity. If you would like the pull request to remain open, please comment on the pull request and it will be added to the triage queue. Otherwise, it will be closed in 7 days. |
Although we are closing this pull request as stale, it's not gone forever. PRs can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage. |
The
__main__.py
is blotted with a lot of CLI variables impacting readability and loose typing. So namespacing the CLI options under the classCliOptions
.