-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🔧 refactor ex0 #70
base: master
Are you sure you want to change the base?
🔧 refactor ex0 #70
Conversation
|
||
|
||
class Job(ETL_Base): | ||
def transform(self): | ||
url = self.jargs.api_inputs['path'] | ||
def transform(self) -> sql.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typing on return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 Nice
Wow, impressive. Thanks a lot @AlejandroUPC . I really like it. |
local_path = tmp_dir+'/tmp_file.csv.gz' | ||
open(local_path, 'wb').write(resp.content) # creating local copy, necessary for sc_sql.read.csv, TODO: check to remove local copy step. | ||
self.logger.info('Copied file locally at {}.'.format(local_path)) | ||
os.makedirs(tmp_dir, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would break since tmp_dir = 'tmp'
line is deleted above.
def transform(self, some_events, other_events): | ||
df = self.query(""" | ||
def transform( | ||
self, some_events="some_events", other_events="other_events" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inputs have to be spark dataframes or pandas dataframes. When they are spark dataframes, the framework register all of them into the spark SQL environment so they can be used directly inside queries, like those in self.query()
.
FROM some_events se | ||
JOIN other_events oe on se.session_id=oe.session_id | ||
FROM {some_events} se | ||
JOIN {other_events} oe on se.session_id=oe.session_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea. I have been thinking of doing something like that but haven't gotten to it yet. I think it would need to be done differently since the input variables are not expected to be strings.
|
||
@staticmethod | ||
def date_diff_sec(x,y): | ||
return int((y-x).total_seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 functions above are needed for the transform to work.
end_date = datetime(year=2001, month=1, day=1) | ||
expected = 366 * 24 * 3600 | ||
result = Job.date_diff_sec(start_date, end_date) | ||
assert expected == result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 Thanks for the extra unit-tests
|
||
@staticmethod | ||
def date_diff_sec(start_dt: datetime, end_dt: datetime) -> int: | ||
return int((end_dt - start_dt).total_seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see, you moved these functions into the framework. Good idea but in that case, I think I would keep them out of it until I see more cases of jobs requiring them.
Converts a string date to a datetime to then parse to a target format. | ||
""" | ||
wiki_dt = datetime.strptime(wiki_dt_str, "%Y%m%d%H%M%S") | ||
return wiki_dt.strftime("%Y-%m-%d %H:%M:%S") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, much better than the code I had ! I guess I didn't know about this option at the time.
could you tell me what linter standard you were using ? is it "black" ? |
I use black because I am used to it, regarding the quotes you are right, the PEP 8 doesn't talk about it but blacks justifies it here https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#:~:text=Why%20settle%20on%20double%20quotes,fonts%20and%20syntax%20highlighting%20used. I think I should have learnt a little bit the project before further doing changes although I won't lie to you (and I am just being honest) there are some major parts that need to be handled. See you on Wednesday the 8th! |
Thanks @AlejandroUPC for the "black" take on double quotes. Interesting. |
There is a bunch of things I would like to integrate from this branch but I think they would need to be integrated in separate branches of smaller scope. This one is huge and cannot be merged as is (conflicts).
If you can put some or all of these things in a separate branch (or several), I will happily merge them. If you prefer continuing to work from that branch, no problem, let me know, I will do it and tag you in the PRs. Regarding the "black" formatting style, I looked into it more. I found it makes the code less concise and conciseness is something I like about python in general. So, I would like to avoid moving the existing code to that standard for now. I am good with new code being formatted that way though. I added PEP8 checks in CI, so as long as it passes the CI checks, it's good to me. Let me know if I missed some other type of updates you made in that PR. |
test_input: str, expected: Tuple[str, str, List[str]] | ||
) -> None: | ||
with pytest.raises(RuntimeError) as e: | ||
result = FSOpsDispatcher().split_s3_path(test_input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tests, thanks a lot. I'd like to merge them
This is done, did some changes added tests, refactor, typing, linting but the project seems a bit complex to keep working.
I would need the time to understand more and use it, feel free to merge if you want @arthurprevot