
Implement logging infrastructure #311

Closed · 37 tasks done
snizhana-dynnyk opened this issue Apr 27, 2021 · 14 comments

Comments

snizhana-dynnyk (Contributor) commented Apr 27, 2021

User Story

  • As a Giant Swarm engineer, I want to be able to access a history of logs for the components managing our platform, in order to be able to both investigate ongoing operational issues and provide details for incident reports

Tasks

Related

teemow (Member) commented Apr 28, 2021

This visibility is also interesting for customers. Similar to #182.

teemow (Member) commented May 4, 2021

via @anvddriesch: With CAPI, a better interface for distributed logging becomes more important. There are many controllers in the CAPI implementations compared to the single operator we use today, so if you want to debug what is going on, e.g. during cluster creation, it would be good to have the logs in one place.

snizhana-dynnyk (Contributor, Author) commented Aug 17, 2021

As a requirement (coming from https://github.com/giantswarm/giantswarm/issues/11489) we want the k8s audit log to be kept.

See also https://github.com/giantswarm/giantswarm/issues/9576

JosephSalisbury (Contributor):

[Screenshot attached: 2022-03-15 at 13:59:24]

teemow added this to Roadmap May 10, 2022
teemow moved this to Future (> 6 months) in Roadmap May 10, 2022
weatherhog moved this from Future (> 6 months) to Mid Term (3-6 months) in Roadmap Jun 17, 2022
TheoBrigitte self-assigned this Dec 7, 2022
TheoBrigitte moved this from Mid Term (3-6 months) to Future (> 6 months) in Roadmap Dec 7, 2022
TheoBrigitte (Member):

Status update

  • We have a central Loki instance set up on a Giant Swarm cluster
  • We are ingesting all logs (journald + pods)
  • We are ingesting logs from some MCs (anteater, gauss, gorilla, otter)
  • Cost is currently around $1/month per MC
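
To make the "logs in one place" part concrete, here is a minimal sketch of how such a central Loki instance can be queried over its HTTP API. The endpoint URL and label names are hypothetical; only the `/loki/api/v1/query_range` endpoint itself is standard Loki.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// Minimal sketch: fetch the last hour of logs for a set of containers from a
// central Loki instance. The URL and labels below are illustrative only.
func main() {
	lokiURL := "https://loki.example.giantswarm.io/loki/api/v1/query_range"

	params := url.Values{}
	// LogQL stream selector; the label scheme is an assumption.
	params.Set("query", `{installation="gauss", namespace="giantswarm", container=~"capi-.*"}`)
	params.Set("start", fmt.Sprintf("%d", time.Now().Add(-time.Hour).UnixNano()))
	params.Set("end", fmt.Sprintf("%d", time.Now().UnixNano()))
	params.Set("limit", "100")

	resp, err := http.Get(lokiURL + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode just enough of the response to print the raw log lines.
	var result struct {
		Data struct {
			Result []struct {
				Stream map[string]string `json:"stream"`
				Values [][2]string       `json:"values"` // [timestamp, line]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}

	for _, stream := range result.Data.Result {
		for _, v := range stream.Values {
			fmt.Printf("%s %s: %s\n", v[0], stream.Stream["container"], v[1])
		}
	}
}
```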

TheoBrigitte (Member):

Note from sig-product sync and further async discussions
https://docs.google.com/document/d/1Dbl_76JqeUlG16FhHU4P4kM41B3OP0HCkZLONXFH0f0/edit#
https://github.com/giantswarm/giantswarm/pull/24939

  • We want to log as much as possible, but have alerts in place in case some component logs too much, so we can adjust log levels and retention.
  • Document the current setup and retention policy.
  • Loki as a managed app, for customers to monitor their apps
  • Centralized vs. per-installation setup
    • Some customers have security concerns about data going out
    • Storage might be hard on installations
    • Automating S3 bucket creation
  • Is cost an issue?
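
As a rough illustration of the "alert when a component logs too much" point in the list above (a sketch only; the endpoint, labels and threshold are assumptions, and in practice this would more likely live in Loki's ruler as an alerting rule), one can ask Loki for per-container log volume with `bytes_over_time`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// Hypothetical threshold: flag containers producing more than 500 MiB of logs per hour.
const thresholdBytes = 500 * 1024 * 1024

func main() {
	base := "https://loki.example.giantswarm.io/loki/api/v1/query"

	params := url.Values{}
	// Sum log volume per container over the last hour (labels are illustrative).
	params.Set("query", `sum by (container) (bytes_over_time({installation="gauss"}[1h]))`)

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A metric query returns a vector: one sample per label set,
	// with the value encoded as [unix_timestamp, "number-as-string"].
	var result struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  [2]interface{}    `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}

	for _, sample := range result.Data.Result {
		bytes, _ := strconv.ParseFloat(sample.Value[1].(string), 64)
		if bytes > thresholdBytes {
			fmt.Printf("container %q logged %.0f bytes in the last hour; consider adjusting its log level or retention\n",
				sample.Metric["container"], bytes)
		}
	}
}
```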

teemow (Member) commented Dec 8, 2022

@TheoBrigitte imo Loki shouldn't be a managed app to monitor applications. Instead we need to build centralized logging for the platform that supports the infrastructure and applications. And this is the same for Prometheus.

And it needs to be scoped by management cluster. I am sorry that you started working on centralized logging, but this is not an option. I could have told you that before. Imo we need to get better at communicating decisions like that. This is quite an impactful decision that you took in the team, and it is not aligned with our general architecture.

Examples of why this doesn't work:

  • Private environments for financial service providers and insurers won't allow us to ship logging data outside.
  • Edge clusters need to be rather resilient against reduced or broken internet connectivity, so this design would get in the way of going to the edge.
  • We have always considered the decentralized management clusters as separate failure domains which limit the blast radius. Centralized logging would introduce a SPOF for all installations.
  • This would introduce linear fixed costs for us that scale up with each customer. So far almost all of our costs are not linear to the number of customers.
  • We make a promise that customers can keep their environment with all features if they decide not to work with us anymore. This would break that promise.

So from my side centralized logging is not an option. I am happy to talk about this on Tuesday, but please don't invest any more time into the centralized solution.

TheoBrigitte (Member) commented Jan 12, 2023

The recent discussions around our logging architecture, about the pros and cons of the centralized vs. distributed approach, have been summarized in this RFC.

We are now working on setting up Loki following the distributed approach.

QuentinBisson:

The current plan to install Loki is explained in the RFC, but we forgot to talk about Promtail, the log ingester.

After discussions with the team we think the following idea is the safest bet.

Motivation

Getting a new app or config change deployed to a management cluster is rather straightforward if the application is deployed through an app collection (create a release and voilà), but deploying the app or config change to a workload cluster is rather tedious (and that's without taking into account the need for a customer's approval of the change). This causes us pain (postmortems being opened, silenced alerts for issues that are already fixed, and so on).

Idea

Our idea is to deploy the promtail app as part of the observability bundle, with promtail disabled by default for now (creating a new release of the bundle before we create the new Vintage releases, namely aws 18.2.0 and azure 19.0.0).
The App CR will reference a configmap and secret (through the App CR extra config) that will be managed by a new operator (the current name idea is logging-operator).

This operator will be in charge of creating/updating/deleting the configmap for each cluster so we can dynamically update the promtail config on each cluster (management cluster and workload cluster alike).

This will allow us to configure the application at runtime (feature flagging per MC and so on). This operator will also be used to implement multi-tenancy (whether at cluster or organization level remains to be seen) without having to ask customers to upgrade again.
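
To make the configmap handling more concrete, below is a minimal sketch (all names, the values layout and the package are assumptions, not the actual logging-operator implementation) of how such an operator could upsert the per-cluster Promtail config with client-go:

```go
package logging

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensurePromtailConfig creates or updates the ConfigMap referenced by the
// promtail App CR's extra config for one cluster. The ConfigMap name, labels
// and rendered values are hypothetical.
func ensurePromtailConfig(ctx context.Context, k8sClient kubernetes.Interface, clusterNamespace, clusterName string, loggingEnabled bool) error {
	name := fmt.Sprintf("%s-logging-extraconfig", clusterName)

	desired := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: clusterNamespace,
			Labels:    map[string]string{"app.kubernetes.io/managed-by": "logging-operator"},
		},
		Data: map[string]string{
			// Helm values handed to the promtail app through the App CR extra config;
			// this is where the feature flag and per-cluster tenant would be set.
			"values": fmt.Sprintf("promtail:\n  enabled: %t\n  tenant: %s\n", loggingEnabled, clusterName),
		},
	}

	current, err := k8sClient.CoreV1().ConfigMaps(clusterNamespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = k8sClient.CoreV1().ConfigMaps(clusterNamespace).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Only update when the rendered config actually changed, to avoid
	// triggering needless App reconciliations.
	if current.Data["values"] == desired.Data["values"] {
		return nil
	}
	current.Data = desired.Data
	_, err = k8sClient.CoreV1().ConfigMaps(clusterNamespace).Update(ctx, current, metav1.UpdateOptions{})
	return err
}
```

The secret referenced by the App CR could be handled the same way for values that should not live in a ConfigMap.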

TheoBrigitte (Member):

Added TODOs about the Promtail deployment that were discussed in the last refinement session.

TheoBrigitte mentioned this issue Jan 24, 2023
puja108 moved this from Future (> 6 months) to Near Term (1-3 months) in Roadmap Feb 9, 2023
puja108 moved this from Near Term (1-3 months) to Mid Term (3-6 months) in Roadmap Feb 9, 2023
TheoBrigitte moved this from Mid Term (3-6 months) to Ready Soon (<4 weeks) in Roadmap Jul 3, 2023
TheoBrigitte (Member):

Status update

We have made good progress on the dynamic configuration part, which we need in order to configure Grafana, Promtail, and the multitenant-proxy to make logs flow through our infrastructure. There are some last bits and pieces left (more details here: https://github.com/giantswarm/giantswarm/issues/27146) plus some testing to be done.
After this we enter the testing phase, first on Giant Swarm installations, and then we open the trial for customers.
So in short:

  1. Test dynamic configuration
  2. Testing on GiantSwarm installations
  3. Trial on customer AWS vintage installations
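
As a rough sketch of the multitenant-proxy part of that flow (the upstream URL, port and tenant mapping are assumptions; only the X-Scope-OrgID header is standard Loki multi-tenancy), a small reverse proxy in front of Loki could inject the tenant per cluster or organization:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Minimal multi-tenant proxy sketch: map the caller to a tenant and inject
// Loki's X-Scope-OrgID header before forwarding. How the tenant is derived
// (cluster vs. organization level) is still open in the discussion above.
func main() {
	lokiURL, err := url.Parse("http://loki-gateway.loki.svc")
	if err != nil {
		panic(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(lokiURL)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tenant := tenantFor(r)
		if tenant == "" {
			http.Error(w, "unknown tenant", http.StatusUnauthorized)
			return
		}
		r.Header.Set("X-Scope-OrgID", tenant) // Loki's multi-tenancy header
		proxy.ServeHTTP(w, r)
	})

	http.ListenAndServe(":8080", handler)
}

// tenantFor is a placeholder for the real tenant mapping; here it simply uses
// the basic-auth username as the tenant.
func tenantFor(r *http.Request) string {
	user, _, ok := r.BasicAuth()
	if !ok {
		return ""
	}
	return user
}
```

Whether the tenant ends up being the cluster or the organization can then be changed in one place, which matches the goal of not asking customers to upgrade again.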

TheoBrigitte (Member):

Status update

TheoBrigitte (Member) commented Sep 21, 2023

Status update

QuentinBisson commented Mar 19, 2024

We are almost there; we are only missing https://github.com/giantswarm/giantswarm/issues/28726, so technically only https://github.com/giantswarm/giantswarm/issues/29776 remains to be done here.

github-project-automation bot moved this from In Progress ⛏️ to Done ✅ in Roadmap Mar 25, 2024