Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🔧 refactor ex0 #70

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

AlejandroUPC
Copy link
Contributor

@AlejandroUPC AlejandroUPC commented May 28, 2022

This is done, did some changes added tests, refactor, typing, linting but the project seems a bit complex to keep working.
I would need the time to understand more and use it, feel free to merge if you want @arthurprevot



class Job(ETL_Base):
def transform(self):
url = self.jargs.api_inputs['path']
def transform(self) -> sql.DataFrame:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typing on return

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 Nice

@AlejandroUPC AlejandroUPC marked this pull request as ready for review May 28, 2022 11:23
@arthurprevot
Copy link
Owner

Wow, impressive. Thanks a lot @AlejandroUPC . I really like it.
I think the code would need more unit-tests to be able to merge the PR in one go though ! Or it will have to be merged by chunks. I see some bits that will probably break if I merge it as is, and that are not captured by unit-tests yet.

local_path = tmp_dir+'/tmp_file.csv.gz'
open(local_path, 'wb').write(resp.content) # creating local copy, necessary for sc_sql.read.csv, TODO: check to remove local copy step.
self.logger.info('Copied file locally at {}.'.format(local_path))
os.makedirs(tmp_dir, exist_ok=True)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would break since tmp_dir = 'tmp' line is deleted above.

def transform(self, some_events, other_events):
df = self.query("""
def transform(
self, some_events="some_events", other_events="other_events"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inputs have to be spark dataframes or pandas dataframes. When they are spark dataframes, the framework register all of them into the spark SQL environment so they can be used directly inside queries, like those in self.query().

FROM some_events se
JOIN other_events oe on se.session_id=oe.session_id
FROM {some_events} se
JOIN {other_events} oe on se.session_id=oe.session_id
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea. I have been thinking of doing something like that but haven't gotten to it yet. I think it would need to be done differently since the input variables are not expected to be strings.


@staticmethod
def date_diff_sec(x,y):
return int((y-x).total_seconds())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 functions above are needed for the transform to work.

end_date = datetime(year=2001, month=1, day=1)
expected = 366 * 24 * 3600
result = Job.date_diff_sec(start_date, end_date)
assert expected == result
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 Thanks for the extra unit-tests


@staticmethod
def date_diff_sec(start_dt: datetime, end_dt: datetime) -> int:
return int((end_dt - start_dt).total_seconds())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, you moved these functions into the framework. Good idea but in that case, I think I would keep them out of it until I see more cases of jobs requiring them.

Converts a string date to a datetime to then parse to a target format.
"""
wiki_dt = datetime.strptime(wiki_dt_str, "%Y%m%d%H%M%S")
return wiki_dt.strftime("%Y-%m-%d %H:%M:%S")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, much better than the code I had ! I guess I didn't know about this option at the time.

@arthurprevot
Copy link
Owner

could you tell me what linter standard you were using ? is it "black" ?
For strings, I usually try to stick to ' instead of " when possible, just a bit easier visually. I think pep8 doesn't enforce one or the other.

@AlejandroUPC
Copy link
Contributor Author

could you tell me what linter standard you were using ? is it "black" ? For strings, I usually try to stick to ' instead of " when possible, just a bit easier visually. I think pep8 doesn't enforce one or the other.

I use black because I am used to it, regarding the quotes you are right, the PEP 8 doesn't talk about it but blacks justifies it here https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#:~:text=Why%20settle%20on%20double%20quotes,fonts%20and%20syntax%20highlighting%20used.

I think I should have learnt a little bit the project before further doing changes although I won't lie to you (and I am just being honest) there are some major parts that need to be handled.

See you on Wednesday the 8th!

@arthurprevot
Copy link
Owner

Thanks @AlejandroUPC for the "black" take on double quotes. Interesting.
Yes, for sure there is a lot of room for improvement. I'd be happy to get more details on what you see. We can talk about in on jan 8, or before if you want.

@arthurprevot
Copy link
Owner

There is a bunch of things I would like to integrate from this branch but I think they would need to be integrated in separate branches of smaller scope. This one is huge and cannot be merged as is (conflicts).
The elements from this PR I would like to integrate:

  • The unit-tests you added
  • The static types
  • The use of with statement where possible
  • The addition of docstrings.

If you can put some or all of these things in a separate branch (or several), I will happily merge them. If you prefer continuing to work from that branch, no problem, let me know, I will do it and tag you in the PRs.

Regarding the "black" formatting style, I looked into it more. I found it makes the code less concise and conciseness is something I like about python in general. So, I would like to avoid moving the existing code to that standard for now. I am good with new code being formatted that way though. I added PEP8 checks in CI, so as long as it passes the CI checks, it's good to me.

Let me know if I missed some other type of updates you made in that PR.

test_input: str, expected: Tuple[str, str, List[str]]
) -> None:
with pytest.raises(RuntimeError) as e:
result = FSOpsDispatcher().split_s3_path(test_input)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests, thanks a lot. I'd like to merge them

@AlejandroUPC AlejandroUPC marked this pull request as draft June 21, 2024 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants