Support single tenancy & more + move to v0.11.8 #116

Merged
arthurprevot merged 45 commits into master from support_signle_tenancy
Jun 12, 2024

Conversation

arthurprevot
Owner

@arthurprevot arthurprevot commented Jun 12, 2024

Added the ability to define and use multiple sections in mode_specific_params. This is useful for supporting a single-tenancy architecture, where the data lake is split across buckets.
Involved:

  • New params "name_base_in_param" and "name_base_out_param" to handle cases where the input and output base_path differ (not ideal yet; to be improved later).
  • New jobs: ex17_multimode_params_job and ex18_base_path_in_out_job.
  • New example section "your_extra_tenant" in mode_specific_params.
  • New unit tests.
  • Updated all hardcoded references to dev_local, dev_EMR or prod_EMR.
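To illustrate the idea of combining several mode_specific_params sections, here is a minimal sketch in plain Python. It is not the project's implementation: the function `resolve_mode_params`, the list-of-modes argument, the "later section wins" rule, and the bucket names are all assumptions made for the example. Only the section names (dev_local, prod_EMR, your_extra_tenant), the job name, and the `base_path` key come from this PR.

```python
from typing import Any, Dict, List


def resolve_mode_params(
    base: Dict[str, Any],
    mode_specific_params: Dict[str, Dict[str, Any]],
    modes: List[str],
) -> Dict[str, Any]:
    """Overlay the sections named in `modes` onto `base`, in order.

    Hypothetical helper: later sections override earlier ones.
    """
    resolved = dict(base)
    for mode in modes:
        if mode not in mode_specific_params:
            raise KeyError(f"Unknown mode section: {mode!r}")
        resolved.update(mode_specific_params[mode])
    return resolved


# Hypothetical sections; bucket names are placeholders.
MODE_SPECIFIC_PARAMS = {
    "dev_local": {"base_path": "./data"},
    "prod_EMR": {"base_path": "s3://shared-bucket/data", "deploy": "EMR"},
    "your_extra_tenant": {"base_path": "s3://tenant-bucket/data"},
}

params = resolve_mode_params(
    {"job_name": "ex17_multimode_params_job"},
    MODE_SPECIFIC_PARAMS,
    ["prod_EMR", "your_extra_tenant"],
)
print(params)
```

In this sketch the tenant section only overrides `base_path`, so the cluster settings from prod_EMR are reused while the data is read from the tenant's own bucket.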

Other

  • More code to register tables to Athena, from a spark df and using Glue.
  • Cleaned up and fixed the spark-submit setup in EMR steps to use python launcher.py --job_name instead of py_job, which may not be registered in jobs_metadata.yml.
  • Made emr_ec2_role and emr_role overridable.
  • Fixed loading a pandas df from parquet in AWS (there was a problem with individual files and the tmp folder).
  • Allow expanding params from a job, using the new expand_params().
  • Allow use of glob when loading a spark df from multiple paths (regex to be added later).
  • More flexible handling of */'{latest}'/ and */'{now}'/.
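As a rough illustration of how `{now}` and `{latest}` placeholders in a path can be resolved, the sketch below works on a local filesystem with the standard library only. The function name `resolve_path`, the timestamp format, and the "lexicographically last sub-folder is the latest" rule are assumptions for the example, not the project's actual behavior.

```python
import glob
import os
from datetime import datetime, timezone
from typing import Optional


def resolve_path(template: str, now: Optional[datetime] = None) -> str:
    """Replace '{now}' with a timestamp, or '{latest}' with the newest sub-folder.

    Hypothetical helper: '{latest}' picks the lexicographically last directory,
    which matches the newest one when folders are named with sortable timestamps.
    """
    if "{now}" in template:
        now = now or datetime.now(timezone.utc)
        return template.replace("{now}", now.strftime("%Y%m%d_%H%M%S"))
    if "{latest}" in template:
        prefix, suffix = template.split("{latest}", 1)
        candidates = sorted(d for d in glob.glob(prefix + "*") if os.path.isdir(d))
        if not candidates:
            raise FileNotFoundError(f"No folder found under {prefix!r}")
        return candidates[-1] + suffix
    return template


# Deterministic example for '{now}' with a fixed timestamp.
stamped = resolve_path("out/{now}/", datetime(2024, 6, 12, 23, 18, 0))
print(stamped)  # out/20240612_231800/
```

A job would typically write with `{now}` and read with `{latest}`, so that a downstream job picks up the most recent output without hardcoding a date.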

Moved to v0.11.8. Published to PyPI.

…r_file by error. Led to a problem when deploying to the cluster.

"python launcher_file.py job_name=some_job_depending_on_generic_py_job"
  - would be submitted as "spark-submit copy_job.py ..." and break because copy_job.py is not in job_metadata.yml.
  - instead of "spark-submit launcher_file.py job_name=some_job
…nd_params(). To be used for job specific params like "base_path" = "some/path/{some_param}/{other_param}/"
…lity to load individual file (fix tmp folder in that case)
…erent base_path for input and output. Works (see new job) but there are better solutions. To be revisited.
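The commit notes above mention expanding job-specific params such as "base_path" = "some/path/{some_param}/{other_param}/". A minimal sketch of that kind of expansion is shown below. The real expand_params() signature and behavior are not shown in this PR, so `expand_params_sketch` is a hypothetical stand-in, and leaving unknown placeholders (such as `{latest}`) untouched is an assumption.

```python
from typing import Any, Dict


class _KeepUnknown(dict):
    """Mapping that leaves placeholders with no matching param untouched."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def expand_params_sketch(params: Dict[str, Any]) -> Dict[str, Any]:
    """Fill '{name}' placeholders in string params from the other params.

    Hypothetical stand-in, not the library's expand_params().
    """
    lookup = _KeepUnknown(params)
    return {
        key: value.format_map(lookup) if isinstance(value, str) else value
        for key, value in params.items()
    }


expanded = expand_params_sketch(
    {
        "some_param": "a",
        "other_param": "b",
        "base_path": "some/path/{some_param}/{other_param}/",
        "input_path": "data/{latest}/",  # unknown placeholder is kept as-is
    }
)
print(expanded["base_path"])  # some/path/a/b/
```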
@arthurprevot arthurprevot changed the title from "Support single tenancy + parquet fix" to "Support single tenancy & more + move to v0.11.8" on Jun 12, 2024
@arthurprevot arthurprevot merged commit 6256440 into master Jun 12, 2024
2 checks passed
@arthurprevot arthurprevot deleted the support_signle_tenancy branch June 12, 2024 23:18