Self Service Log Ingestion #3518

Open
1 of 5 tasks
Tracked by #3515
Rotfuks opened this issue Jun 24, 2024 · 21 comments
Labels: team/atlas (Team Atlas)

@Rotfuks (Contributor) commented Jun 24, 2024

Motivation

We want customers to be able to ingest whatever data is relevant for them in a self-service way, and this includes logs. So we need to provide a way for them to add their own data sources for logs.

Todo

  • Investigate how exactly we could empower customers to add their own log sources - for example PodLogs https://github.com/giantswarm/giantswarm/issues/29072
  • Implement any needed changes to make it happen
  • Create a documentation draft describing how customers can add their own sources of events/logs to be monitored
  • Get some feedback from AE about the documentation
  • Make sure the documentation is published in the new observability platform docs

Outcome

  • Customers can now add their own sources of events to be monitored by the observability platform
  • There are docs and educational content out there showing them how it's done
@QuentinBisson

We see that we can use PodLogs, but do we want to force customers to create PodLogs resources for log ingestion? Can we allow them to collect logs at the namespace level (with annotations and so on)?

@Rotfuks (Contributor, Author) commented Jul 22, 2024

How much effort is it to create PodLogs for customers? I would love to have something label-based where we can just say "add this label and it's automatically ingested", because that is quite flexible and intuitive. I believe it will also help us with multi-tenancy.

@QuentinBisson

The issue I have is not that PodLogs don't make sense, but I would think they should only be needed on rare occasions. Ideally, an annotation/label on the pod or namespace should be enough to derive the tenant for most logs, and that would also make profile and trace collection easier. I would only use PodLogs if the pod needs a custom pipeline imo.

What I'm not sure about is whether we can get all logs for a namespace when it's annotated, unless the pod has its own label or is covered by a PodLogs resource.

I would think we could do something with drops but I'm not sure. Maybe @TheoBrigitte knows if log sources can exclude data taken from other sources?

@TheoBrigitte (Member) commented Aug 26, 2024

When using Alloy as the logging agent installed within a workload cluster, we can configure it in a way that allows it to retrieve logs from specific namespaces and/or pods.

This solution makes use of two different, mutually exclusive PodLogs resources (see the sketch after this comment).

Those PodLogs would be configured by us and customers would only deal with labels on their resources.

With this solution we might face a problem with resource usage on the Kubelets: since all the log traffic would go through the Kubernetes API, the network and CPU usage on the Kubelets might become problematic, especially when many/all pods are monitored.
Alloy does not currently provide another way to select targets based on their namespace. The usual loki.source.file does not suffer from the Kubelet resource usage problem, as logs are read directly from the node where Alloy is running, but it does not allow selecting pods by namespace.

I opened an upstream issue requesting that the namespace metadata be added to the discovery.kubernetes component; this would allow us to avoid using PodLogs and their overhead.
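
For illustration, a rough sketch of what the two mutually exclusive PodLogs could look like (untested; the giantswarm.io/logging label key is only an assumed example, not the actual configuration):

apiVersion: monitoring.grafana.com/v1alpha2
kind: PodLogs
metadata:
  name: namespace-logging
spec:
  # Select every pod in namespaces that opted in via the label.
  namespaceSelector:
    matchLabels:
      giantswarm.io/logging: enabled
  selector: {}
---
apiVersion: monitoring.grafana.com/v1alpha2
kind: PodLogs
metadata:
  name: pod-logging
spec:
  # Select individually labelled pods, but only in namespaces that did
  # NOT opt in, so the two PodLogs never match the same pod twice.
  namespaceSelector:
    matchExpressions:
    - key: giantswarm.io/logging
      operator: DoesNotExist
  selector:
    matchLabels:
      giantswarm.io/logging: enabled

Customers would then only need to set the label on a namespace or on individual pods.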

TheoBrigitte self-assigned this Aug 27, 2024
@TheoBrigitte (Member)

Did you take a look at this? https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes/

Looking at it, this would be simpler than the local.file_match currently used in our solution, but I also do not see the benefit over loki.source.podlogs: you get rid of the need for PodLogs resources, but you also lose the ability to filter on namespace labels, and you still have the network and CPU overhead on the Kubernetes API server.
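
For reference, a minimal sketch of what using loki.source.kubernetes could look like (not our actual configuration; the write endpoint is a placeholder):

discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  // Tails container logs through the Kubernetes API instead of reading
  // files from the node, so no hostPath mounts are needed.
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}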

@QuentinBisson

I quite like that we do not have to run it as a daemonset though :D

But why do you not have the namespace? I thought those should give you __meta_kubernetes_namespace in the loki.process or relabel phase?

@QuentinBisson

Oh, you meant namespace labels, never mind.

@TheoBrigitte (Member)

Using a combination of the loki.relabel and loki.source.podlogs components, it is possible to set the tenant ID based on a given label from the pod or its namespace.

In the following example, the tenant ID is taken from the pod label foo.

Here are the Alloy config and the PodLogs resource I used:

  • Alloy config
loki.source.podlogs "default" {
  forward_to = [loki.relabel.default.receiver]
}

loki.relabel "default" {
  forward_to = [loki.write.default.receiver]

  // Copy the value of the "foo" label into the special __tenant_id__ label.
  rule {
    action = "replace"
    source_labels = ["foo"]
    target_label  = "__tenant_id__"
    replacement = "$1"
    regex = "(.*)"
  }

  // Drop the "foo" label so it does not end up as a stream label in Loki.
  rule {
    action = "labeldrop"
    regex = "^foo$"
  }
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}
  • PodLogs (note: this will select all pods from all namespaces, change the selectors to fit your needs)
apiVersion: monitoring.grafana.com/v1alpha2
kind: PodLogs
metadata:
  name: pod-tenant-id-from-label
spec:
  selector: {}
  namespaceSelector: {}
  relabelings:
  - action: replace
    sourceLabels: ["__meta_kubernetes_pod_label_foo"]
    targetLabel: "foo"
    replacement: "$1"
    regex: "(.*)"

It is also possible to set the tenant ID using the loki.process component, which has a tenant stage that allows exactly this: setting the tenant ID. But from there, only the log entry content is accessible.
More info at https://grafana.com/docs/alloy/latest/reference/components/loki/loki.process/#stagetenant-block
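
A minimal sketch of that alternative (assuming the tenant is carried in a customer_id field of JSON-formatted log lines; the field name is just an example):

loki.process "default" {
  forward_to = [loki.write.default.receiver]

  // Parse the log line as JSON and extract the customer_id field.
  stage.json {
    expressions = { "customer_id" = "" }
  }

  // Set the tenant from the extracted value.
  stage.tenant {
    source = "customer_id"
  }
}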

@TheoBrigitte (Member)

Current prototype idea

[image: prototype diagram]

Improvements we want to explore:

  • Work around the Kubelet traffic limitation by fetching logs from local disk, using either loki.source.kubernetes or newer features like join or the logs.alloy module
  • Avoid duplicated targets
  • How to provide access to Alloy components like loki.process

@TheoBrigitte (Member)

Using loki.source.kubernetes would only allow selecting pods, not namespaces, using labels, as the pod's namespace labels are not exposed in this component. That's why I opened an upstream issue asking to expose those: grafana/alloy#1550

@TheoBrigitte (Member)

The potential new join feature would not help in our case, as it would only allow enriching metadata in the loki.relabel component; there would still be no way to pass the resulting targets into loki.source.file.

@QuentinBisson commented Sep 23, 2024

What if you enrich then drop logs, instead of trying to discover only those we should "scrape"?

@TheoBrigitte (Member)

What if you enrich then drop logs?

There would still be no way to match the resulting targets against a local file, as the loki.relabel and loki.source.file components cannot be connected.

@TheoBrigitte (Member)

logs.alloy also does not help, as it's mainly a wrapper around existing Alloy components.

@QuentinBisson

What if you enrich then drop logs?

There would still be no way to match the resulting targets against a local file, as the loki.relabel and loki.source.file components cannot be connected.

If you join based on the labels extracted from loki.source.file and the ones from Kubernetes discovery, is that not possible? 🤔 It might be interesting to bring this to the next community meeting.

@TheoBrigitte (Member)

What if you enrich then drop logs?

There would still be no way to match the resulting targets against a local file, as the loki.relabel and loki.source.file components cannot be connected.

If you join based on the labels extracted from loki.source.file and the ones from Kubernetes discovery, is that not possible? 🤔 It might be interesting to bring this to the next community meeting.

The namespace metadata is only present when using the loki.source.podlogs component, and this component cannot be chained with loki.source.file. The discovery.kubernetes component does not expose namespace metadata, and the join proposed upstream would only happen in the loki.relabel stage, which also cannot be linked into loki.source.file.

loki.source.file is only compatible with components exporting targets (https://grafana.com/docs/alloy/latest/reference/compatibility/#targets-exporters), which in our case means discovery.kubernetes or local.file_match, so we cannot access namespace metadata unless discovery.kubernetes exposes it directly. See the sketch below for what that pipeline looks like.
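
To make the constraint concrete, a rough sketch of the file-based pipeline (a generic example, not our production config; the pod label key is a placeholder). Pod labels are usable in the relabel rules, namespace labels are not:

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  // Pod labels are available here (e.g. __meta_kubernetes_pod_label_*),
  // but namespace labels are not, which is the limitation discussed above.
  rule {
    action        = "keep"
    source_labels = ["__meta_kubernetes_pod_label_giantswarm_io_logging"]
    regex         = "enabled"
  }

  // Build the path of the container log files on the node.
  rule {
    action        = "replace"
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    replacement   = "/var/log/pods/*$1/*.log"
    target_label  = "__path__"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.write.default.receiver]
}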

@TheoBrigitte (Member)

We can't load components like loki.process dynamically into Alloy.

The way to load dynamic configuration into Alloy is via modules. A module is described by a declare block, which only accepts argument and export blocks, meaning there is no way to pass any of the stage blocks of loki.process into it. There is a module import example using loki.process here: https://grafana.com/docs/alloy/latest/get-started/modules/#example
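
Roughly, a module looks like the following sketch (hypothetical names): values can be wired in through arguments and results exposed through exports, but the loki.process stages have to be hard-coded inside the declare block.

declare "log_pipeline" {
  // Values can be passed in...
  argument "write_to" {
    optional = false
  }

  // ...but stage blocks cannot; they must be written out here.
  loki.process "default" {
    forward_to = argument.write_to.value

    stage.drop {
      older_than = "1h"
    }
  }

  export "receiver" {
    value = loki.process.default.receiver
  }
}

It could then be instantiated with something like log_pipeline "default" { write_to = [loki.write.default.receiver] }, but the stages themselves remain fixed.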

@TheoBrigitte (Member)

It is currently not possible to use a Kyverno policy to label the kube-system namespace, as Kyverno lacks the permissions to do so:

$ cat kube-system-logging.cpol.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: kube-system-logging
spec:
  admission: true
  background: true
  mutateExistingOnPolicyUpdate: true
  rules:
    - name: kube-system-enable-logging
      match:
        resources:
          kinds:
            - Namespace
          name: kube-system
      mutate:
        patchesJson6902: '[{"op":"add","path":"/metadata/labels/giantswarm.io~1logging","value":"enabled"}]'
        targets:
        - kind: Namespace
          name: kube-system

$ k apply -f kube-system-logging.cpol.yaml
Error from server: error when creating "kube-system-logging.cpol.yaml": admission webhook "validate-policy.kyverno.svc" denied the request: path: spec.rules[0].mutate.targets.: auth check fails, additional privileges are required for the service account 'system:serviceaccount:kyverno:kyverno-background-controller': failed to get GVR for kind /Namespace; failed to get GVR for kind /Namespace
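
For reference, the label this policy tries to add is equivalent to applying it manually:

kubectl label namespace kube-system giantswarm.io/logging=enabled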

@TheoBrigitte (Member)

Just linking the Alloy internal tenant_id label used for tenant override https://github.com/grafana/alloy/blob/8f1be0e86b0ced53e73cb30d228aa736b1380d89/internal/component/common/loki/client/client.go#L35

@TheoBrigitte (Member)

Load testing the log pipeline with Alloy as the logging agent, using PodLogs and loki-canary.

Loki chart values to run only the canary on a workload cluster:

global:
  clusterDomain: cluster.local
  dnsService: coredns
  image:
    registry: gsoci.azurecr.io
  podSecurityStandards:
    enforced: true
loki:
  enabled: false
lokiCanary:
  enabled: true
  push: false
  mode: deployment
  deployment:
    replicaCount: 30
    strategy:
      type: RollingUpdate
  extraArgs:
  - -tenant-id=playground
  - -interval=5ms
  - -pruneinterval=60s
  - -size=384
multiTenantAuth:
  enabled: false
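
As a rough back-of-the-envelope (an estimate only, assuming each canary pod emits one log line per -interval): 30 replicas × 1 line / 5 ms = 6,000 lines/s, and at -size=384 bytes per line that is roughly 2.3 MB/s of raw log volume for the single playground tenant.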

This allows reaching ~10k log lines for a single tenant:

[image]

The goal is to reproduce what we currently have on alba/peu01

[image]

Still need to compare the Promtail vs Alloy setup.
