🔧 refactor ex0 #70

AlejandroUPC · 2022-05-28T10:26:13Z

This is done. I made some changes: added tests, refactoring, typing, and linting. The project seems a bit complex to keep working on, though.
I would need time to understand it better and use it. Feel free to merge if you want, @arthurprevot.

AlejandroUPC · 2022-05-28T10:26:30Z

jobs/examples/ex0_extraction_job.py



 class Job(ETL_Base):
-    def transform(self):
-        url = self.jargs.api_inputs['path']
+    def transform(self) -> sql.DataFrame:


Added typing on the return value.

arthurprevot · 2022-05-28T15:22:55Z

Wow, impressive. Thanks a lot @AlejandroUPC . I really like it.
I think the code would need more unit-tests to be able to merge the PR in one go, though! Otherwise it will have to be merged in chunks. I see some bits that will probably break if I merge it as is, and that are not covered by unit-tests yet.

arthurprevot · 2022-05-28T15:30:00Z

jobs/examples/ex0_extraction_job.py

-        local_path = tmp_dir+'/tmp_file.csv.gz'
-        open(local_path, 'wb').write(resp.content)  # creating local copy, necessary for sc_sql.read.csv, TODO: check to remove local copy step.
-        self.logger.info('Copied file locally at {}.'.format(local_path))
+        os.makedirs(tmp_dir, exist_ok=True)


This would break, since the tmp_dir = 'tmp' line is deleted above.

arthurprevot · 2022-05-28T15:38:49Z

jobs/examples/ex1_frameworked_job.py

-    def transform(self, some_events, other_events):
-        df = self.query("""
+    def transform(
+        self, some_events="some_events", other_events="other_events"


The inputs have to be Spark dataframes or pandas dataframes. When they are Spark dataframes, the framework registers all of them in the Spark SQL environment so they can be used directly inside queries, like those in self.query().

arthurprevot · 2022-05-28T15:42:03Z

jobs/examples/ex1_frameworked_job.py

-            FROM some_events se
-            JOIN other_events oe on se.session_id=oe.session_id
+            FROM {some_events} se
+            JOIN {other_events} oe on se.session_id=oe.session_id


I like the idea. I have been thinking of doing something like that but haven't gotten to it yet. I think it would need to be done differently since the input variables are not expected to be strings.
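
To illustrate the concern, here is a minimal sketch using a hypothetical stand-in class (FakeDataFrame is not part of yaetos, Spark, or pandas). It shows that str.format only yields a usable table name when the argument is a string; with a dataframe-like object, the object's default repr ends up in the SQL text.

```python
class FakeDataFrame:
    """Hypothetical stand-in for a Spark or pandas dataframe."""

    def __init__(self, rows):
        self.rows = rows


QUERY_TEMPLATE = 'SELECT se.session_id FROM {some_events} se'

# With a string argument, the placeholder becomes a usable table name.
query_from_string = QUERY_TEMPLATE.format(some_events='some_events')

# With a dataframe-like object, the default repr is substituted instead,
# which is not valid SQL.
query_from_object = QUERY_TEMPLATE.format(some_events=FakeDataFrame(rows=[]))

print(query_from_string)
print(query_from_object)
```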

arthurprevot · 2022-05-28T15:47:22Z

jobs/examples/ex2_frameworked_job.py

-
-    @staticmethod
-    def date_diff_sec(x,y):
-        return int((y-x).total_seconds())


These 2 functions above are needed for the transform to work.

arthurprevot · 2022-05-28T15:48:38Z

tests/jobs/examples/test_ex2_frameworked_job.py

+    end_date = datetime(year=2001, month=1, day=1)
+    expected = 366 * 24 * 3600
+    result = Job.date_diff_sec(start_date, end_date)
+    assert expected == result


💯 Thanks for the extra unit-tests
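
For reference, a self-contained sketch of the check this test performs. The start_date value is not visible in the hunk above, so 2000-01-01 is an assumption, chosen because 2000 is a leap year and therefore matches the expected 366 days. date_diff_sec is reproduced here as a plain function rather than imported from the project.

```python
from datetime import datetime


def date_diff_sec(start_dt: datetime, end_dt: datetime) -> int:
    """Return the number of whole seconds between two datetimes."""
    return int((end_dt - start_dt).total_seconds())


start_date = datetime(year=2000, month=1, day=1)  # assumed, not shown in the hunk
end_date = datetime(year=2001, month=1, day=1)

expected = 366 * 24 * 3600  # 2000 is a leap year, so the span is 366 days
result = date_diff_sec(start_date, end_date)
assert expected == result
```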

arthurprevot · 2022-05-28T15:54:10Z

yaetos/etl_utils.py

+
+    @staticmethod
+    def date_diff_sec(start_dt: datetime, end_dt: datetime) -> int:
+        return int((end_dt - start_dt).total_seconds())


Oh I see, you moved these functions into the framework. Good idea, but in that case I think I would keep them out of it until I see more jobs requiring them.

arthurprevot · 2022-05-28T16:00:12Z

yaetos/etl_utils.py

+        Converts a string date to a datetime to then parse to a target format.
+        """
+        wiki_dt = datetime.strptime(wiki_dt_str, "%Y%m%d%H%M%S")
+        return wiki_dt.strftime("%Y-%m-%d %H:%M:%S")


Yes, much better than the code I had! I guess I didn't know about this option at the time.
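
A small sketch of the conversion shown in the hunk above. The function name format_wiki_dt and the sample input are hypothetical; only the two format strings come from the PR.

```python
from datetime import datetime


def format_wiki_dt(wiki_dt_str: str) -> str:
    """Convert a compact 'YYYYMMDDHHMMSS' string to 'YYYY-MM-DD HH:MM:SS'.

    Hypothetical name; the body mirrors the two calls in the diff.
    """
    # strptime parses the compact string into a datetime object.
    wiki_dt = datetime.strptime(wiki_dt_str, '%Y%m%d%H%M%S')
    # strftime renders that datetime in the target format.
    return wiki_dt.strftime('%Y-%m-%d %H:%M:%S')


formatted = format_wiki_dt('20220528102613')  # sample input, chosen for illustration
print(formatted)
```

Going through a datetime object also validates the input: a malformed string makes strptime raise a ValueError instead of producing a wrong timestamp.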

arthurprevot · 2022-05-28T16:06:04Z

Could you tell me which linter standard you were using? Is it "black"?
For strings, I usually try to stick to ' instead of " when possible; it is just a bit easier visually. I think PEP 8 doesn't enforce one or the other.

AlejandroUPC · 2022-05-28T19:08:00Z

> Could you tell me which linter standard you were using? Is it "black"? For strings, I usually try to stick to ' instead of " when possible; it is just a bit easier visually. I think PEP 8 doesn't enforce one or the other.

I use black because I am used to it. Regarding the quotes, you are right: PEP 8 doesn't recommend one over the other, but black justifies its choice here: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#:~:text=Why%20settle%20on%20double%20quotes,fonts%20and%20syntax%20highlighting%20used.

I think I should have learned the project a little before making further changes. That said, I won't lie to you (I am just being honest): there are some major parts that need to be handled.

See you on Wednesday the 8th!

arthurprevot · 2022-05-28T21:52:24Z

Thanks @AlejandroUPC for the "black" take on double quotes. Interesting.
Yes, for sure there is a lot of room for improvement. I'd be happy to get more details on what you see. We can talk about it on June 8, or before if you want.

arthurprevot · 2022-05-31T12:14:19Z

There are a bunch of things I would like to integrate from this branch, but I think they would need to be integrated in separate branches of smaller scope. This one is huge and cannot be merged as is (conflicts).
The elements from this PR I would like to integrate:

- The unit-tests you added
- The static types
- The use of with statements where possible
- The addition of docstrings

If you can put some or all of these things in a separate branch (or several), I will happily merge them. If you prefer continuing to work from that branch, no problem, let me know, I will do it and tag you in the PRs.

Regarding the "black" formatting style, I looked into it more. I found it makes the code less concise, and conciseness is something I like about Python in general. So I would like to avoid moving the existing code to that standard for now. I am fine with new code being formatted that way, though. I added PEP8 checks in CI, so as long as the code passes the CI checks, it's fine with me.

Let me know if I missed some other type of updates you made in that PR.

arthurprevot · 2022-05-31T12:32:48Z

tests/yaetos/test_env_dispatchers.py

+    test_input: str, expected: Tuple[str, str, List[str]]
+) -> None:
+    with pytest.raises(RuntimeError) as e:
+        result = FSOpsDispatcher().split_s3_path(test_input)


Nice tests, thanks a lot. I'd like to merge them.

🔧 refactor ex0

4f62bdd

AlejandroUPC added 6 commits May 28, 2022 12:29

🔧 refactor ex1

e890156

🔧 refactor ex1_raw_job

b8e47cf

🔧 adding tests

8c24fc2

🔧 refactor ex2_fworked_job

f4eee73

🔧 random refactor

0b61cc0

🔧 big commit

de7ce9b

AlejandroUPC marked this pull request as ready for review May 28, 2022 11:23

arthurprevot reviewed May 28, 2022

This was referenced May 30, 2022

Linting, fixing broken code #72

Merged

Linting, part 2, basic formatting compliance #73

Merged

AlejandroUPC added 3 commits May 31, 2022 03:20

✨ refactor and some tests on env_dispatchers

603423a

🔧 big refactor, class following pep8, context managers,etc..

b996fe5

🔧 forgot to commit file

432a97f

arthurprevot reviewed May 31, 2022

arthurprevot mentioned this pull request May 31, 2022

Proposals for improvement #69

Closed

AlejandroUPC marked this pull request as draft June 21, 2024 23:38
