
AMP-97070 Move Spark metadata upon transfer completion #8

Open · wants to merge 1 commit into base: AMP-96980
Conversation

LeontiBrechko (Collaborator)

Description

Move Spark transfer metadata to a subdirectory of the target export S3 location.
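The move described above can be sketched as a key-rewriting step. Everything below is a hypothetical illustration, not the PR's actual code: the helper name and the `spark_metadata/` subdirectory name are assumptions.

```python
def spark_metadata_destination_key(key: str, export_prefix: str,
                                   meta_subdir: str = 'spark_metadata/') -> str:
    # Hypothetical helper: map a Spark metadata object (e.g. a _SUCCESS
    # marker) from the root of the export prefix to a subdirectory of
    # that same prefix. The real PR may use different names.
    filename = key[len(export_prefix):]
    return export_prefix + meta_subdir + filename

# A _SUCCESS marker at the export root moves under the metadata subdir:
print(spark_metadata_destination_key('export/2024/_SUCCESS', 'export/2024/'))
# -> export/2024/spark_metadata/_SUCCESS
```

In practice the rewritten key would feed a server-side copy followed by a delete of the original object.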

Testing

print(f'Identified bucket: {bucket}, prefix: {prefix}')

# List all files in the s3_path directory
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
Collaborator

Will this return folders as well? If only files are returned, then we are good.

@LeontiBrechko (Collaborator, Author), Apr 4, 2024

Only files. The same holds for listing objects in our Java repo using Amplitude's S3 wrapper.

The example job mentioned in the description has logs of what was discovered (note that the /meta folder exists at the point this method executes and is not listed).
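For context on the thread above: ListObjectsV2 with Delimiter='/' returns only objects directly under the prefix in Contents, while deeper keys (such as those under a /meta folder) are rolled up into CommonPrefixes. A minimal pure-Python simulation of that grouping (the function is illustrative, not part of the PR):

```python
def simulate_list_with_delimiter(keys, prefix, delimiter='/'):
    # Mimics S3 ListObjectsV2 grouping: keys directly under the prefix
    # go to Contents; keys nested one level deeper or more collapse
    # into CommonPrefixes (the "folders").
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

contents, folders = simulate_list_with_delimiter(
    ['export/part-00000.csv', 'export/_SUCCESS', 'export/meta/commit'],
    'export/')
print(contents)  # ['export/part-00000.csv', 'export/_SUCCESS']
print(folders)   # ['export/meta/']
```

This is why iterating over `response['Contents']` sees "only files": the nested folder appears under `CommonPrefixes` instead.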

if '://' not in s3_uri_with_spark_metadata:
    raise ValueError(f'Invalid s3 URI: {s3_uri_with_spark_metadata}. Expected to contain "://".')
bucket, prefix = s3_uri_with_spark_metadata.split('://')[1].split('/', 1)
bucket = replace_double_slashes_with_single_slash(bucket)
Collaborator

Curious where the double slashes are coming from?

Collaborator Author

For the bucket, it shouldn't.

It's just sanitization in case an unnormalized s3 URI is provided (e.g. s3:////bucket////prefix////).
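As a sketch of what the helper under test likely does (the regex-based body is an assumption; only the function name comes from the PR):

```python
import re

def replace_double_slashes_with_single_slash(s: str) -> str:
    # Collapse any run of consecutive slashes into a single slash.
    # Sketch only; the PR's real implementation may differ.
    return re.sub(r'/{2,}', '/', s)

print(replace_double_slashes_with_single_slash('//bucket////prefix////'))
# -> /bucket/prefix/
```

Note that it is applied to the bucket and prefix parts rather than the whole URI, since collapsing slashes would mangle the `s3://` scheme separator.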

    expected_output = '/path/to/file/with/double/slashes/end/'
    self.assertEqual(expected_output, replace_double_slashes_with_single_slash(input_string))

def test_move_spark_metadata_to_separate_s3_folder(self):
Collaborator

Thanks for adding tests!

@fzqgriffin (Collaborator) left a comment

LGTM. Let's hold off on merging until AMP-96980 is approved so the other PR can stay focused on the event mutation work.
