Add LogStore to retain build logs, and a S3LogStore implementation #967
Conversation
Awesome! I think this is a great approach. I had been thinking we should do it at the deployment/mybinder.org level, but this does seem a lot simpler.
repo2docker/app.py (outdated):

```python
            and self.s3_logs_secret_key
            and self.s3_logs_bucket
        ):
            logfile = tempfile.NamedTemporaryFile("w", delete=False)
```
These build logs have been useful in the past; maybe they should be kept anyway, even when not uploading to S3?
Thanks for reviewing! I started with repo2docker for this prototype because it was easier to get the logs, but I've since figured out how to do it on the BinderHub side: jupyterhub/binderhub#1161. Obviously all your review comments from here also apply there. I think BinderHub is a better location since it integrates multiple components, though repo2docker is a bit simpler. Let me know which you prefer and I'll put more time into it.
Came here to find out why "store logs in S3" needed changes in repo2docker. My guess at an implementation would have been either changes to the BinderHub chart (inject a sidecar) or to BinderHub itself. Is it worth discussing the pros and cons? Having thought about it for a bit while writing this, I am starting to think that r2d is the right place (sidecar or "native"), not bhub. By putting the code into r2d we can restart the hub pod without losing logs, and maintain the kind of "fire and forget" setup we have right now for builds (bhub asks for a build to start and then doesn't need to maintain any state about it).
About naming the bucket: I think there are two options:

- a name derived from the repository URL
- a name derived from the resolved ref (the repository URL plus the commit that was built)
(the "derived from" is meant to represent us hashing/mangling/escaping/processing the inputs to generate a bucket name)

I like naming it after the resolved ref, as that generates unique names for free and we can have a history of previous builds. However, I can't (yet) think of a nice way for a user to find those logs again. The scenario I am thinking about is this: you build, rebuild, and rebuild the same repo again by adding commits to your repository, and it becomes quite tricky to (re)find the log of a previous build. I think this is a user interface problem which we could solve in BinderHub with some GUI, by clever structuring of the bucket names, or some other way. I think the right level of complexity for a user would be presenting them with a chronologically sorted list of build logs that has a link back to the source repository (in the state it was in when we built it) and maybe even the human-recognisable ref that they typed into the BinderHub form.
I see two pieces of information useful for solving this problem:

- an object-key prefix derived from the repository, shared by all logs for that repo
- per-object metadata describing the individual build (e.g. the resolved ref)

Key information: a prefix can be used to filter files to a short, manageable per-repo list; metadata cannot, hence the prefix. Bucket storage like S3/GCS does not support things like list-files-matching-metadata. Then we can have a BinderHub API based on this info to retrieve the build log history for a repo, e.g. along the lines of the sketch below.
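To make the prefix/metadata split concrete, here is a minimal sketch of how such an API could query the store with boto3; the bucket name, prefix scheme, and endpoint are illustrative assumptions, not the PR's actual layout:

```python
# Illustrative only: bucket name, prefix scheme, and endpoint are assumptions
import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

# The key prefix narrows the listing to a single repository's build logs
resp = s3.list_objects_v2(
    Bucket="mybinder", Prefix="buildlogs/github.com/binder-examples/conda/"
)
for obj in resp.get("Contents", []):
    # Metadata (e.g. the resolved ref) cannot be filtered server-side,
    # so it has to be fetched per object
    head = s3.head_object(Bucket="mybinder", Key=obj["Key"])
    print(obj["Key"], obj["LastModified"], head["Metadata"])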
For the benefit of anyone new to S3 I thought I'd clear up a bit of terminology. You have an S3 server, a bucket, and a key which references an object. With many public clouds you also have a region but ignore that for now. This means the URL to your object (in this case a log-file) is something like The So in my original example: Server: |
This adds unnecessary complexity since it depends on the S3 server, so rely on the user to know how to get the URL
Add boto3 to test dependencies
repo2docker knows nothing about the logstore other than it should copy the Docker build logs to it
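For anyone skimming the thread, a minimal sketch of what that separation could look like, assuming the class and trait names visible in the example config below (`LogStore`, `S3LogStore`, `endpoint`, `access_key`, `secret_key`, `bucket`, `logname`, `metadata`); the actual PR code may differ:

```python
# A sketch only: names are taken from the example config in this thread
import boto3
from traitlets import Dict, Unicode
from traitlets.config import Configurable


class LogStore(Configurable):
    """Abstract build-log store: repo2docker only hands it a finished log file."""

    logname = Unicode("repo2docker.log", help="Object name for the log").tag(config=True)
    metadata = Dict(help="Extra metadata to attach to the log").tag(config=True)

    def upload(self, logfile):
        """Copy the local file at path `logfile` to the store."""
        raise NotImplementedError


class S3LogStore(LogStore):
    """LogStore implementation that copies the log to an S3-compatible server."""

    endpoint = Unicode(help="S3 endpoint URL").tag(config=True)
    access_key = Unicode(help="S3 access key").tag(config=True)
    secret_key = Unicode(help="S3 secret key").tag(config=True)
    bucket = Unicode(help="Destination bucket").tag(config=True)

    def upload(self, logfile):
        s3 = boto3.client(
            "s3",
            endpoint_url=self.endpoint,
            aws_access_key_id=self.access_key,
            aws_secret_access_key=self.secret_key,
        )
        # S3 object metadata is a plain string-to-string mapping
        s3.upload_file(
            logfile, self.bucket, self.logname, ExtraArgs={"Metadata": self.metadata}
        )
```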
I think this is quite close to being ready. I took @minrk's suggestion (thanks!) of using a Configurable, which I think makes things a lot nicer. The main issue left is how to configure `logname`:

```python
c.Repo2Docker.logstore = 'repo2docker.logstore.S3LogStore'
c.S3LogStore.endpoint = 'http://192.168.1.1:9000'
c.S3LogStore.access_key = 'minio'
c.S3LogStore.secret_key = 'minio123'
c.S3LogStore.bucket = 'mybinder'
c.S3LogStore.logname = 'buildlogs/repo2docker.log'
c.S3LogStore.metadata = {"R2D-Key": "value"}
```

This is probably OK for the static S3 connection parameters, but the logname will change with every build, so the config would need to be rewritten. Options:
Any thoughts? I quite like option 3 as it matches other Jupyter/IPython applications, and hides the new feature from a typical user. It's not currently working in R2D though; was this a deliberate design choice? It seemed easy enough to add:

```diff
diff --git a/repo2docker/__main__.py b/repo2docker/__main__.py
index 907460d..5ea22d8 100644
--- a/repo2docker/__main__.py
+++ b/repo2docker/__main__.py
@@ -232,7 +232,8 @@ def make_r2d(argv=None):
         print(__version__)
         sys.exit(0)

-    args = get_argparser().parse_args(argv)
+    print(argv)
+    args, unknown_args = get_argparser().parse_known_args(argv)

     r2d = Repo2Docker()

@@ -240,6 +241,9 @@ def make_r2d(argv=None):
         r2d.log_level = logging.DEBUG

     r2d.load_config_file(args.config)
+    if unknown_args:
+        print("Unknown args: ", unknown_args)
+        r2d.parse_command_line(unknown_args)

     if args.appendix:
         r2d.appendix = args.appendix
```
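With that change, per-build settings could be passed straight through on the command line in the usual traitlets `--Class.trait=value` style. A hypothetical invocation, assuming the passthrough above:

```sh
# Hypothetical: override the per-build log name without rewriting the config file
jupyter-repo2docker --config s3config.py \
    --S3LogStore.logname=buildlogs/asd123/repo2docker.log \
    https://github.com/binder-examples/conda
```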
On K8s sidecar vs native in BinderHub: AFAIK, since the repo2docker container logs to stdout, it's not possible to capture the logs in a sidecar without changing repo2docker to log to a location the sidecar logging agent can read, such as a file on a shared volume or a network socket. I might be wrong on this though. BinderHub obtains the logs using the K8s API.
I'm on board with the idea, and I love that you have created an abstract LogStore class where S3 is one implementation of it. Happy to help with review efforts if you want to bring this to life again @manics!
Prototype of uploading repo2docker build logs to S3 (jupyterhub/binderhub#1156)
If you don't have access to an S3 server, run a local one:
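For example, a throwaway local Minio server can be run with Docker; the credentials here are assumptions chosen to match the example config in this thread (newer Minio releases use `MINIO_ROOT_USER`/`MINIO_ROOT_PASSWORD` instead):

```sh
# Run a disposable Minio server on localhost:9000 (data is lost on exit)
docker run --rm -p 9000:9000 \
    -e MINIO_ACCESS_KEY=minio \
    -e MINIO_SECRET_KEY=minio123 \
    minio/minio server /data
```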
- Download an S3 client, e.g. the Minio client from https://min.io/download
- Add a temporary client config called `mybindertest` for the local test server
- Create a bucket called `mybinder`
- Allow public downloads from the bucket (example `mc` commands for these steps are sketched below)
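The steps above might look something like this with the `mc` client; subcommand names vary between mc releases (newer versions use `mc alias set` and `mc anonymous set download`):

```sh
# Point mc at the local test server (alias and credentials from the steps above)
mc config host add mybindertest http://localhost:9000 minio minio123
# Create the bucket
mc mb mybindertest/mybinder
# Allow anonymous (public) downloads from the bucket
mc policy set download mybindertest/mybinder
```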
Repo2docker S3 logs config (example below):
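A config for the local test server, mirroring the example earlier in the thread; the file name is an assumption, and the per-build `logname` is inferred from the URL in the next step:

```python
# binder_config.py (name assumed): S3LogStore settings for the local Minio server
c.Repo2Docker.logstore = 'repo2docker.logstore.S3LogStore'
c.S3LogStore.endpoint = 'http://localhost:9000'
c.S3LogStore.access_key = 'minio'
c.S3LogStore.secret_key = 'minio123'
c.S3LogStore.bucket = 'mybinder'
c.S3LogStore.logname = 'buildlogs/asd123/repo2docker.log'
```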
Build with image name `asd123` (see the example command below). Logs should then be available at http://localhost:9000/mybinder/buildlogs/asd123/repo2docker.log
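A build might then be run like this; `--config`, `--image-name`, and `--no-run` are existing repo2docker flags, and the example repository is the one used in the play.min.io demo below:

```sh
# Build (but don't run) the image, uploading the build log per the config above
jupyter-repo2docker --config binder_config.py --image-name asd123 --no-run \
    https://github.com/binder-examples/conda
```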
If you can't be bothered to set up your own Minio server, you can use the public demo one at play.min.io. The demo credentials are public, and I've already run the above, so assuming no-one else has modified it you just need:

https://play.minio.io/mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log