
Implement logging infrastructure #311

Closed · 37 tasks done
snizhana-dynnyk opened this issue Apr 27, 2021 · 14 comments

Comments

snizhana-dynnyk (Contributor) commented Apr 27, 2021

User Story

  • As a Giant Swarm engineer, I want to be able to access a history of logs for the components managing our platform, in order to be able to both investigate ongoing operational issues and provide details for incident reports

Tasks

Related

teemow (Member) commented Apr 28, 2021

This visibility is also interesting for customers. Similar to #182.

teemow (Member) commented May 4, 2021

via @anvddriesch: With CAPI, a better interface for distributed logging becomes more important. There are many controllers in the CAPI implementations compared to the single operator we use today, so if you want to debug what is going on, e.g. during cluster creation, it would be good to have the logs in one place.

snizhana-dynnyk (Contributor, Author) commented Aug 17, 2021

As a requirement (coming from https://github.com/giantswarm/giantswarm/issues/11489) we want the k8s audit log to be kept.

See also https://github.com/giantswarm/giantswarm/issues/9576

JosephSalisbury (Contributor):

[Screenshot attached: 2022-03-15 at 13:59:24]

teemow added this to Roadmap May 10, 2022
teemow moved this to Future (> 6 months) in Roadmap May 10, 2022
weatherhog moved this from Future (> 6 months) to Mid Term (3-6 months) in Roadmap Jun 17, 2022
TheoBrigitte self-assigned this Dec 7, 2022
TheoBrigitte moved this from Mid Term (3-6 months) to Future (> 6 months) in Roadmap Dec 7, 2022
TheoBrigitte (Member):

Status update

  • We have a central Loki instance set up on a Giant Swarm cluster
  • We are ingesting all logs (journald + pods)
  • We are ingesting logs from some MCs (anteater, gauss, gorilla, otter)
  • Cost is currently around $1/month per MC
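
To make the "logs in one place" part concrete, here is a minimal sketch of how such a central Loki instance can be queried over its HTTP API. The endpoint URL and label names are hypothetical; only the `/loki/api/v1/query_range` endpoint itself is standard Loki.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// Minimal sketch: fetch the last hour of logs for a set of containers from a
// central Loki instance. The URL and labels below are illustrative only.
func main() {
	lokiURL := "https://loki.example.giantswarm.io/loki/api/v1/query_range"

	params := url.Values{}
	// LogQL stream selector; the label scheme is an assumption.
	params.Set("query", `{installation="gauss", namespace="giantswarm", container=~"capi-.*"}`)
	params.Set("start", fmt.Sprintf("%d", time.Now().Add(-time.Hour).UnixNano()))
	params.Set("end", fmt.Sprintf("%d", time.Now().UnixNano()))
	params.Set("limit", "100")

	resp, err := http.Get(lokiURL + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode just enough of the response to print the raw log lines.
	var result struct {
		Data struct {
			Result []struct {
				Stream map[string]string `json:"stream"`
				Values [][2]string       `json:"values"` // [timestamp, line]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}

	for _, stream := range result.Data.Result {
		for _, v := range stream.Values {
			fmt.Printf("%s %s: %s\n", v[0], stream.Stream["container"], v[1])
		}
	}
}
```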

TheoBrigitte (Member):

Note from sig-product sync and further async discussions
https://docs.google.com/document/d/1Dbl_76JqeUlG16FhHU4P4kM41B3OP0HCkZLONXFH0f0/edit#
https://github.com/giantswarm/giantswarm/pull/24939

  • We want to log as much as possible, but have alerts in place in case some component logs too much, so we can adjust log levels and retention.
  • Document the current setup and retention policy.
  • Loki as a managed app, for customers to monitor their apps
  • Centralized vs. per-installation setup
    • Some customers have security concerns about data going out
    • Storage might be hard on installations
    • Automating S3 bucket creation
  • Is cost an issue?
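
As a rough illustration of the "alert when a component logs too much" point in the list above (a sketch only; the endpoint, labels and threshold are assumptions, and in practice this would more likely live in Loki's ruler as an alerting rule), one can ask Loki for per-container log volume with `bytes_over_time`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// Hypothetical threshold: flag containers producing more than 500 MiB of logs per hour.
const thresholdBytes = 500 * 1024 * 1024

func main() {
	base := "https://loki.example.giantswarm.io/loki/api/v1/query"

	params := url.Values{}
	// Sum log volume per container over the last hour (labels are illustrative).
	params.Set("query", `sum by (container) (bytes_over_time({installation="gauss"}[1h]))`)

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A metric query returns a vector: one sample per label set,
	// with the value encoded as [unix_timestamp, "number-as-string"].
	var result struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  [2]interface{}    `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}

	for _, sample := range result.Data.Result {
		bytes, _ := strconv.ParseFloat(sample.Value[1].(string), 64)
		if bytes > thresholdBytes {
			fmt.Printf("container %q logged %.0f bytes in the last hour; consider adjusting its log level or retention\n",
				sample.Metric["container"], bytes)
		}
	}
}
```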

teemow (Member) commented Dec 8, 2022

@TheoBrigitte imo Loki shouldn't be a managed app to monitor applications. Instead we need to build centralized logging for the platform that supports the infrastructure and applications. And this is the same for Prometheus.

And it needs to be scoped by management cluster. I am sorry that you started working on centralized logging, but this is not an option. I could have told you that before. Imo we need to get better at communicating decisions like that. This is quite an impactful decision that you took in the team, and it is not aligned with our general architecture.

Examples of why this doesn't work:

  • Private environments for financial service providers and insurers won't allow us to ship logging data outside.
  • Edge clusters need to be rather resilient against reduced or broken internet connectivity, so this design would get in the way of going to the edge.
  • We have always considered the decentralized management clusters as separate failure domains which limit the blast radius. Centralized logging would introduce a SPOF for all installations.
  • This would introduce linear fixed costs for us that scale up with each customer. So far almost all of our costs are not linear to the number of customers.
  • We make a promise that customers can keep their environment with all features if they decide not to work with us anymore. This would break that promise.

So from my side centralized logging is not an option. I am happy to talk about this on Tuesday, but please don't invest any more time into the centralized solution.

TheoBrigitte (Member) commented Jan 12, 2023

The recent discussions around our logging architecture, about the pros and cons of the centralized vs. distributed approach, have been summarized in this RFC.

We are now working on setting up Loki following the distributed approach.

QuentinBisson:

The current plan to install Loki is explained in the RFC, but we forgot to talk about Promtail, the log ingester.

After discussions with the team we think the following idea is the safest bet.

Motivation

Getting a new app or config change deployed to a management cluster is rather straightforward if the application is deployed through an app collection (create a release and voilà), but deploying the app or config change to a workload cluster is rather tedious (and that's without taking into account the need for a customer's approval of the change). This causes us pain (postmortems being opened, silenced alerts for issues that are already fixed, and so on).

Idea

Our idea is to deploy the promtail app as part of the observability bundle, with promtail disabled by default for now (creating a new release of the bundle before we create the new Vintage releases, namely aws 18.2.0 and azure 19.0.0).
The App CR will reference a configmap and secret (through the App CR extra config) that will be managed by a new operator (the current name idea is logging-operator).

This operator will be in charge of creating/updating/deleting the configmap for each cluster so we can dynamically update the promtail config on each cluster (management cluster and workload cluster alike).

This will allow us to configure the application at runtime (feature flagging per MC and so on). This operator will also be used to implement multi-tenancy (whether at cluster or organization level remains to be seen) without having to ask customers to upgrade again.
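
To make the configmap handling more concrete, below is a minimal sketch (all names, the values layout and the package are assumptions, not the actual logging-operator implementation) of how such an operator could upsert the per-cluster Promtail config with client-go:

```go
package logging

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensurePromtailConfig creates or updates the ConfigMap referenced by the
// promtail App CR's extra config for one cluster. The ConfigMap name, labels
// and rendered values are hypothetical.
func ensurePromtailConfig(ctx context.Context, k8sClient kubernetes.Interface, clusterNamespace, clusterName string, loggingEnabled bool) error {
	name := fmt.Sprintf("%s-logging-extraconfig", clusterName)

	desired := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: clusterNamespace,
			Labels:    map[string]string{"app.kubernetes.io/managed-by": "logging-operator"},
		},
		Data: map[string]string{
			// Helm values handed to the promtail app through the App CR extra config;
			// this is where the feature flag and per-cluster tenant would be set.
			"values": fmt.Sprintf("promtail:\n  enabled: %t\n  tenant: %s\n", loggingEnabled, clusterName),
		},
	}

	current, err := k8sClient.CoreV1().ConfigMaps(clusterNamespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = k8sClient.CoreV1().ConfigMaps(clusterNamespace).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Only update when the rendered config actually changed, to avoid
	// triggering needless App reconciliations.
	if current.Data["values"] == desired.Data["values"] {
		return nil
	}
	current.Data = desired.Data
	_, err = k8sClient.CoreV1().ConfigMaps(clusterNamespace).Update(ctx, current, metav1.UpdateOptions{})
	return err
}
```

The secret referenced by the App CR could be handled the same way for values that should not live in a ConfigMap.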

TheoBrigitte (Member):

Added TODOs about the Promtail deployment that were discussed in the last refinement session.

TheoBrigitte mentioned this issue Jan 24, 2023
puja108 moved this from Future (> 6 months) to Near Term (1-3 months) in Roadmap Feb 9, 2023
puja108 moved this from Near Term (1-3 months) to Mid Term (3-6 months) in Roadmap Feb 9, 2023
TheoBrigitte moved this from Mid Term (3-6 months) to Ready Soon (<4 weeks) in Roadmap Jul 3, 2023
TheoBrigitte (Member):

Status update

We have made good progress on the dynamic configuration part, which we need in order to configure Grafana, Promtail, and the multitenant-proxy to make logs flow through our infrastructure. There are some last bits and pieces left (more details here: https://github.com/giantswarm/giantswarm/issues/27146) plus some testing to be done.
After this we enter the testing phase, first on Giant Swarm installations, and then we open the trial for customers.
So in short:

  1. Test dynamic configuration
  2. Testing on GiantSwarm installations
  3. Trial on customer AWS vintage installations
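
As a rough sketch of the multitenant-proxy part of that flow (the upstream URL, port and tenant mapping are assumptions; only the X-Scope-OrgID header is standard Loki multi-tenancy), a small reverse proxy in front of Loki could inject the tenant per cluster or organization:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Minimal multi-tenant proxy sketch: map the caller to a tenant and inject
// Loki's X-Scope-OrgID header before forwarding. How the tenant is derived
// (cluster vs. organization level) is still open in the discussion above.
func main() {
	lokiURL, err := url.Parse("http://loki-gateway.loki.svc")
	if err != nil {
		panic(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(lokiURL)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tenant := tenantFor(r)
		if tenant == "" {
			http.Error(w, "unknown tenant", http.StatusUnauthorized)
			return
		}
		r.Header.Set("X-Scope-OrgID", tenant) // Loki's multi-tenancy header
		proxy.ServeHTTP(w, r)
	})

	http.ListenAndServe(":8080", handler)
}

// tenantFor is a placeholder for the real tenant mapping; here it simply uses
// the basic-auth username as the tenant.
func tenantFor(r *http.Request) string {
	user, _, ok := r.BasicAuth()
	if !ok {
		return ""
	}
	return user
}
```

Whether the tenant ends up being the cluster or the organization can then be changed in one place, which matches the goal of not asking customers to upgrade again.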

TheoBrigitte (Member):

Status update

TheoBrigitte (Member) commented Sep 21, 2023

Status update

QuentinBisson commented Mar 19, 2024

We are almost there; we are only missing https://github.com/giantswarm/giantswarm/issues/28726, so technically only https://github.com/giantswarm/giantswarm/issues/29776 remains to be done here.

github-project-automation bot moved this from In Progress ⛏️ to Done ✅ in Roadmap Mar 25, 2024