feature: massive logs streaming #424
Conversation
A few comments -- I'll follow up offline.
-    content = log_provider.get_log_content(task, logtype)
-    results[log_key] = json.dumps({"log_hash": current_hash, "content": content})
+    total_lines = count_total_lines(local_paths)
+    results[log_key] = json.dumps({"log_hash": current_hash, "content_paths": local_paths, "line_count": total_lines})
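The `count_total_lines` helper referenced in the new line is not shown in this hunk; a minimal sketch of what such a helper could look like (the name and signature are taken from the diff, the body is an assumption, not the PR's actual implementation):

```python
# Hypothetical sketch of the count_total_lines helper referenced above --
# only its name appears in the diff, so the body here is assumed.
def count_total_lines(local_paths):
    """Count lines across the downloaded log files without loading them whole."""
    total = 0
    for path in local_paths:
        with open(path, "rb") as f:
            # Read in 1 MB chunks so very large log files never sit in memory.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                total += chunk.count(b"\n")
    return total
```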
Mostly for my education, but why json.dumps? We then seem to be doing a lot of loads. Is this serialized somewhere outside the process?
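The question is not answered in this thread, but one plausible reason (an assumption, not confirmed here) is that these values are handed to the service's cache as plain strings, so each result has to survive a string round trip. A minimal sketch of that pattern:

```python
import json

# Assumed pattern: cache values are stored as strings, so the producer
# serializes with json.dumps and every consumer parses with json.loads.
payload = json.dumps({
    "log_hash": "abc123",                         # example values only
    "content_paths": ["/tmp/logs/attempt0.log"],
    "line_count": 42,
})

record = json.loads(payload)   # consumer side, possibly another process
assert record["line_count"] == 42
```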
    *task.path_components,
    attempt
)
paths = dict(
Could the non-underscore method get_log_location not be used here? (I am always a little wary of using internal methods.)
    log_paths[name] = None
else:
    shutil.move(path, log_paths[name])
return [val for val in log_paths.values() if val is not None]
I wonder if this whole method could be simplified by using the task datastore's load_logs (possibly through the filecache, although I'm not 100% sure) with the addition of something like a "write_to_disk_only" option. In other words, you seem to be doing a lot of what that code already does, with the only difference being that you don't open the logs but just keep them on disk. I feel we could make that small change there (without modifying any of the storage layers, keeping it at the task datastore level) and it should simplify this code a lot. Basically, your line 280 (or something akin to it) would go around line 984 of task_datastore.py (protected by a flag or something).
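A rough sketch of the suggestion, purely illustrative: the function below is not Metaflow's actual load_logs, and the write_to_disk_only flag and its arguments are assumptions about how such an option might look at the task datastore level.

```python
import os
import shutil

def load_logs(downloaded, write_to_disk_only=False, destination_dir="."):
    """Illustrative only. 'downloaded' is assumed to be a list of
    (stream_name, tmp_path) pairs already fetched to temporary files."""
    if not write_to_disk_only:
        # Roughly the existing behaviour: return the log contents in memory.
        results = []
        for name, path in downloaded:
            with open(path, "rb") as f:
                results.append((name, f.read()))
        return results
    # Proposed behaviour: move the temporary files into place and return
    # their paths, so massive logs never have to be held in memory.
    moved = []
    for name, path in downloaded:
        target = os.path.join(destination_dir, os.path.basename(path))
        shutil.move(path, target)
        moved.append((name, target))
    return moved
```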
Yep, as a follow-up I will look into hoisting all the logic required by fetch_logs into the Metaflow client side, as something akin to dump_logs(stream, destination_path).
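A hedged sketch of what such a client-side helper could look like; dump_logs itself does not exist yet, and driving it off Task.loglines() is only one possible way to implement it.

```python
from metaflow import Task

def dump_logs(task_pathspec, stream, destination_path):
    """Stream one log stream of a finished task to a local file,
    line by line, without holding the whole log in memory."""
    task = Task(task_pathspec)
    with open(destination_path, "w") as out:
        # loglines yields (timestamp, line) pairs; only the text is kept here.
        for _, line in task.loglines(stream):
            out.write(line + "\n")
    return destination_path

# Hypothetical usage (pathspec is illustrative):
# dump_logs("MyFlow/123/train/4567", "stdout", "/tmp/train_stdout.log")
```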
An alternative approach to streaming massive logs that does not require any changes on the Metaflow client side.
Intended to supersede #423
TODO: