
Persist and Serve TaskRun Logs #198

Closed
adambkaplan opened this issue Jun 17, 2022 · 16 comments
Labels
area/roadmap: Issues that are part of the project (or organization) roadmap (usually an epic)
kind/feature: Categorizes issue or PR as related to a new feature.

Comments

@adambkaplan
Contributor

Feature request

Enhance Results to do the following:

  1. Store TaskRun step logs.
  2. Provide an API where logs for each TaskRun step can be served (see the sketch after this list).
  3. Use the existing Results RBAC controls to ensure logs are only served to those who have been granted the right permissions.
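
A minimal sketch of what fetching a stored step log from the Results apiserver could look like, assuming a REST surface and a hypothetical log resource name; the exact endpoint, resource naming, and the RESULTS_TOKEN environment variable are illustrative assumptions, not a settled API:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Hypothetical log resource name: <parent>/results/<result uid>/logs/<log name>.
	name := "default/results/12345678-aaaa-bbbb-cccc-123456789012/logs/my-taskrun-step-build"
	url := "https://results.example.com/apis/results.tekton.dev/v1alpha2/parents/" + name

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Reuse the caller's token so the apiserver can apply the existing
	// Results RBAC checks before serving the log.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RESULTS_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Sprintf("unexpected status: %s", resp.Status))
	}
	io.Copy(os.Stdout, resp.Body) // stream the step log to stdout
}
```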

Use case

CI/CD users expect to view the full logs of any given step in a build/pipeline process.
This is primarily driven by two use cases:

  1. Debugging a failed build/pipeline.
  2. Auditing a build/pipeline process.

This is done in Tekton today by serving the underlying container logs from a TaskRun pod. These are stored on the host node and can be lost due to TaskRun pruning, cluster maintenance, or other mechanisms that delete the underlying pod.
For auditing purposes, build logs may need to be retained for as long as a particular version of the software is supported.

The most common means of persisting Kubernetes logs today is with log forwarding tools like fluentd and analysis engines like ElasticSearch, Amazon CloudWatch, and Grafana Loki.
These stacks are optimized to stream logs across systems for analysis in real time (this is a good thing!).
They are not built to retain and serve individual log files.

This feature request proposes that the Results watcher and apiserver be extended to store logs for TaskRun steps.
These logs can then be fetched from the apiserver through an API endpoint.
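
As a rough illustration of the watcher side, the step logs could be streamed out of the TaskRun pod through the Kubernetes API before the pod is pruned and handed to whatever storage the apiserver uses; the store callback below is a placeholder for illustration, not an existing Results function:

```go
package watcher

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// persistStepLogs streams the log of every step container in a TaskRun pod
// and passes it to a storage callback before the pod can be deleted.
func persistStepLogs(ctx context.Context, kc kubernetes.Interface, pod *corev1.Pod,
	store func(step string, log io.Reader) error) error {
	for _, step := range pod.Spec.Containers {
		req := kc.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{
			Container: step.Name, // Tekton runs one "step-..." container per step
		})
		rc, err := req.Stream(ctx)
		if err != nil {
			return err
		}
		err = store(step.Name, rc)
		rc.Close()
		if err != nil {
			return err
		}
	}
	return nil
}
```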

adambkaplan added the kind/feature label on Jun 17, 2022
@adambkaplan
Contributor Author

Credit to @CathalOConnorRH who did a lot of research on our end with fluentd and Loki that led to this feature request.

cc @wlynch - this follows up the "Logs with Tekton Results" item discussed during the Tekton Community Summit.

@khrm
Contributor

khrm commented Jun 22, 2022

The most common means of persisting Kubernetes logs today is with log forwarding tools like fluentd and analysis engines like ElasticSearch, Amazon CloudWatch, and Grafana Loki. These stacks are optimized to stream logs across systems for analysis in real time (this is a good thing!). They are not built to retain and serve individual log files.

The ELK stack is optimized for long-term log storage. We are using it in OpenShift Logging. At present, we can view PipelineRun and TaskRun logs in OpenShift Logging, though there are some issues with this UX.

This feature request proposes that the Results watcher and apiserver be extended to store logs for TaskRun steps. These logs can then be fetched by the apiserver from an API endpoint.

The problem with storing logs in Postgres/MySQL is that they aren't built for this.

I will start working on this problem next week, in two phases. We had a discussion about this on Slack, but unfortunately it got lost.

First phase: a Kubernetes REST API service/proxy that gives us data from Tekton Results.
Second phase: design a plugin architecture in this service for fetching logs from various sources such as ELK, Splunk, etc. A sketch of one possible plugin interface follows below.
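
A sketch of what that second-phase plugin surface might look like, assuming a small Go interface per log source; all names here are illustrative and not part of any existing Results API:

```go
package logs

import (
	"context"
	"io"
)

// Fetcher retrieves the stored log for a single TaskRun step from an
// external aggregator (ELK, Splunk, Loki, ...).
type Fetcher interface {
	FetchLog(ctx context.Context, recordName string) (io.ReadCloser, error)
}

// registry maps a configured backend name to a constructor, so new log
// sources can be added without changing the proxy itself.
var registry = map[string]func(config map[string]string) (Fetcher, error){}

// Register makes a log source available under the given backend name.
func Register(name string, newFetcher func(config map[string]string) (Fetcher, error)) {
	registry[name] = newFetcher
}
```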

@vdemeester
Member

@khrm we discussed this a bit offline as well. There are two things that could be done as part of tektoncd/results, independent of ELK or anything else:

  • add something to the Results API to fetch the logs, so that we don't need another API for the logs
  • where we store the logs — this could be pluggable, and we could think of different "storage" backends (standard file, ELK, …); see the sketch after this list
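
For the second bullet, the storage side could be kept behind an equally small interface so that file, ELK, or object storage backends are interchangeable; again a sketch under assumed names, not the interface Results ended up with:

```go
package logs

import (
	"context"
	"io"
)

// Store persists a log stream and returns an opaque location that the
// Results API can later use to serve the log back to clients.
type Store interface {
	WriteLog(ctx context.Context, recordName string, log io.Reader) (location string, err error)
}
```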

@khrm
Contributor

khrm commented Jun 22, 2022

Yes, that's what I am planning to do after adding a proxy service.

@sayan-biswas
Contributor

@khrm I have added a REST proxy for the existing gRPC server, as part of some changes required to work with KCP.

https://github.com/sayan-biswas/tekton-results/blob/33d111248f3c6f7400a001030d7c3170d8aef174/cmd/api/main.go#L153

This branch has the proxy changes without the KCP changes. If you are thinking of implementing something like this, then I can create a PR next week to merge this.
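
The usual way to put a REST proxy in front of an existing gRPC service is grpc-gateway, and the linked branch appears to follow that pattern; a rough sketch, where the generated handler name and import path are assumptions based on the Results proto rather than verified code:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	rpb "github.com/tektoncd/results/proto/v1alpha2/results_go_proto" // assumed import path
)

func main() {
	ctx := context.Background()
	mux := runtime.NewServeMux()

	// Dial the in-cluster gRPC apiserver and translate HTTP/JSON calls into gRPC.
	opts := []grpc.DialOption{grpc.WithTransportCredentials(insecure.NewCredentials())}
	if err := rpb.RegisterResultsHandlerFromEndpoint(ctx, mux, "localhost:50051", opts); err != nil {
		log.Fatal(err)
	}

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```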

@jb-2020

jb-2020 commented Jun 23, 2022

One note: the Dashboard team has a MinIO walkthrough for log persistence. I only bring this up because, in the context of #82 ([integrations] Tekton Dashboard), it would be very nice if this change to Results were also usable in the Dashboard.

@adambkaplan
Contributor Author

I have submitted #203 as an initial proof of concept. There is a lot here - @khrm @vdemeester do you think this warrants a TEP?

@daniel-maganto

At Allianz Direct we have created a solution that fetches logs from S3 and shows them in the Tekton Dashboard, so long-term logs remain available when you need to delete a TaskRun from your cluster.
https://github.com/allianz-direct/tekton-s3-log-reader

@adambkaplan
Contributor Author

@afrittoli also pointed out that Tekton's dogfooding CI manually forwards logs to GCS with Tekton Tasks. It looks like we have, at minimum, the following use cases:

  1. No log forwarding or retrieval (the watcher and/or apiserver can be configured not to forward logs).
  2. Retrieve logs through the Results API, with log forwarding done externally (for example, Fluentd to Elasticsearch, Loki, etc.). A "driver" understands how to retrieve logs from the log aggregator.
  3. Retrieve logs through the Results API, with log forwarding done by the watcher. The apiserver supports drivers that retrieve and store logs in the backends below (see the sketch after this list):
    1. Local disk
    2. S3 (AWS or a compatible object storage service)
    3. GCS
    4. Azure or other cloud provider object storage
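
Tying the driver idea to the Store interface sketched earlier, driver selection could be a simple switch over a configured storage type; only the local-disk case is spelled out here, and the configuration keys and names are assumptions rather than the options Results actually ships:

```go
package logs

import (
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

type fileStore struct{ root string }

// WriteLog saves the log under <root>/<recordName> and returns that path.
func (f fileStore) WriteLog(ctx context.Context, recordName string, log io.Reader) (string, error) {
	path := filepath.Join(f.root, filepath.FromSlash(recordName))
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", err
	}
	out, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer out.Close()
	if _, err := io.Copy(out, log); err != nil {
		return "", err
	}
	return path, nil
}

// NewStore picks a log storage driver from configuration.
func NewStore(storageType string, cfg map[string]string) (Store, error) {
	switch storageType {
	case "file": // local disk
		return fileStore{root: cfg["logs_path"]}, nil
	case "s3", "gcs", "azure": // object storage drivers would plug in here
		return nil, fmt.Errorf("driver %q not implemented in this sketch", storageType)
	default:
		return nil, fmt.Errorf("unsupported log storage type %q", storageType)
	}
}
```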

@adambkaplan
Contributor Author

Update: this feature was captured in TEP-0117, which was approved as a provisional proposal.

@tekton-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@adambkaplan
Contributor Author

/remove-lifecycle stale

@vdemeester
Member

/area roadmap

tekton-robot added the area/roadmap label on Feb 15, 2023
@adambkaplan
Contributor Author

@tektoncd/results-maintainers I think we can call this "done" and mark TEP-0117 as implemented. Thoughts?

@adambkaplan
Contributor Author

/close

This was implemented in v0.5.0

@tekton-robot

@adambkaplan: Closing this issue.

In response to this:

/close

This was implemented in v0.5.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
