Improve e2e troubleshooting (#448)
* Improve e2e troubleshooting

Improve / fix some issues with e2e tests:
- Add more logs; print some useful information such as when cluster is
  still up
- Improve readiness checks (e.g. agent pods were crashing)
- Use more up to date templates for loki and kafka (similar to what we
  have in docs repo)

* remove -a flag; do not tag e2e

Tagging e2e is not necessary and has undesired side effects, such as
excluding the e2e source files from building/linting, which can
hide some problems

* Add doc
jotak authored Dec 16, 2024
1 parent 48eb61b commit 889f4f1
Showing 13 changed files with 276 additions and 72 deletions.
7 changes: 4 additions & 3 deletions Makefile
@@ -144,7 +144,7 @@ docker-generate: ## Create the container that generates the eBPF binaries
.PHONY: compile
compile: ## Compile ebpf agent project
@echo "### Compiling project"
- GOARCH=${GOARCH} GOOS=$(GOOS) go build -mod vendor -a -o bin/netobserv-ebpf-agent cmd/netobserv-ebpf-agent.go
+ GOARCH=${GOARCH} GOOS=$(GOOS) go build -mod vendor -o bin/netobserv-ebpf-agent cmd/netobserv-ebpf-agent.go

.PHONY: build-and-push-bc-image
build-and-push-bc-image: docker-generate ## Build and push bytecode image
@@ -153,7 +153,7 @@ build-and-push-bc-image: docker-generate ## Build and push bytecode image
.PHONY: test
test: ## Test code using go test
@echo "### Testing code"
- GOOS=$(GOOS) go test -mod vendor -a ./... -coverpkg=./... -coverprofile cover.all.out
+ GOOS=$(GOOS) go test -mod vendor ./pkg/... ./cmd/... -coverpkg=./... -coverprofile cover.all.out

.PHONY: cov-exclude-generated
cov-exclude-generated:
@@ -175,7 +175,8 @@ tests-e2e: prereqs ## Run e2e tests
go clean -testcache
# making the local agent image available to kind in two ways, so it will work in different
# environments: (1) as image tagged in the local repository (2) as image archive.
- $(OCI_BIN) build . --build-arg TARGETARCH=$(GOARCH) -t localhost/ebpf-agent:test
+ rm -f ebpf-agent.tar || true
+ $(OCI_BIN) build . --build-arg LDFLAGS="" --build-arg TARGETARCH=$(GOARCH) -t localhost/ebpf-agent:test
$(OCI_BIN) save -o ebpf-agent.tar localhost/ebpf-agent:test
GOOS=$(GOOS) go test -p 1 -timeout 30m -v -mod vendor -tags e2e ./e2e/...

4 changes: 4 additions & 0 deletions README.md
@@ -132,6 +132,10 @@ make generate

Regularly tested on Fedora.

+ ### Running end-to-end tests
+
+ Refer to the specific documentation: [e2e readme](./e2e/README.md)
+
## Known issues

### External Traffic in Openshift (OVN-Kubernetes CNI)
66 changes: 66 additions & 0 deletions e2e/README.md
@@ -0,0 +1,66 @@
## eBPF Agent e2e tests

e2e tests can be run with:

```bash
make tests-e2e
```

If you use podman, you may need to run it as root instead:

```bash
sudo make tests-e2e
```

### What it does

The command builds an image from the current code, including the pre-generated BPF bytecode, starts a KIND cluster, and deploys the agent on it. It also deploys a typical NetObserv stack, which includes flowlogs-pipeline and Loki and/or Kafka.

It then runs a couple of smoke tests on that cluster, such as sending pings between pods and verifying that the expected flows are created.

The tests leverage the Kubernetes [e2e-framework](https://github.com/kubernetes-sigs/e2e-framework). They are based on manifest files that you can find in [this directory](./cluster/base/).

### How to troubleshoot

During the tests, you can run any `kubectl` command against the KIND cluster.

If you use podman/root and don't want to open a root session, you can simply copy the root kube config:

```bash
sudo cp /root/.kube/config /tmp/agent-kind-kubeconfig
sudo -E chown $USER:$USER /tmp/agent-kind-kubeconfig
export KUBECONFIG=/tmp/agent-kind-kubeconfig
```

Then:

```bash
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
flp-29bmd 1/1 Running 0 6s
loki-7c98dfd6d4-c8q9m 1/1 Running 0 56s
```

### Cleanup

The KIND cluster should be cleaned up automatically after the tests. Sometimes it isn't, for example after a forced exit or certain kinds of failure.
When that happens, you should see a message telling you to clean up the cluster manually:

```
^CSIGTERM received, cluster might still be running
To clean up, run: kind delete cluster --name basic-test-cluster20241212-125815
FAIL github.com/netobserv/netobserv-ebpf-agent/e2e/basic 172.852s
```

If no such message is displayed, you can retrieve the name of the cluster to delete manually:

```bash
$ kind get clusters
basic-test-cluster20241212-125815

$ kind delete cluster --name=basic-test-cluster20241212-125815
Deleting cluster "basic-test-cluster20241212-125815" ...
Deleted nodes: ["basic-test-cluster20241212-125815-control-plane"]
```

If the cluster is not cleaned up, a subsequent run of the e2e tests will fail because addresses (ports) are already in use.
11 changes: 5 additions & 6 deletions e2e/basic/common.go
@@ -1,5 +1,3 @@
- //go:build e2e
-
package basic

import (
@@ -37,7 +35,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
return ctx
},
).Assess("correctness of client -> server (as Service) request flows",
- func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+ func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
lq := bt.lokiQuery(t,
`{DstK8S_OwnerName="server",SrcK8S_OwnerName="client"}`+
`|="\"DstAddr\":\"`+pci.serverServiceIP+`\""`)
@@ -82,7 +80,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
return ctx
},
).Assess("correctness of client -> server (as Pod) request flows",
- func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+ func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
lq := bt.lokiQuery(t,
`{DstK8S_OwnerName="server",SrcK8S_OwnerName="client"}`+
`|="\"DstAddr\":\"`+pci.serverPodIP+`\""`)
@@ -124,7 +122,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
return ctx
},
).Assess("correctness of server (from Service) -> client response flows",
- func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+ func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
lq := bt.lokiQuery(t,
`{DstK8S_OwnerName="client",SrcK8S_OwnerName="server"}`+
`|="\"SrcAddr\":\"`+pci.serverServiceIP+`\""`)
@@ -167,7 +165,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
return ctx
},
).Assess("correctness of server (from Pod) -> client response flows",
- func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+ func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
lq := bt.lokiQuery(t,
`{DstK8S_OwnerName="client",SrcK8S_OwnerName="server"}`+
`|="\"SrcAddr\":\"`+pci.serverPodIP+`\""`)
@@ -282,6 +280,7 @@ func (bt *FlowCaptureTester) lokiQuery(t *testing.T, logQL string) tester.LokiQu
query, err = bt.Cluster.Loki().Query(1, logQL)
require.NoError(t, err)
require.NotNil(t, query)
+ require.NotNil(t, query.Data)
require.NotEmpty(t, query.Data.Result)
}, test.Interval(time.Second))
result := query.Data.Result[0]
3 changes: 1 addition & 2 deletions e2e/basic/flow_test.go
@@ -1,5 +1,3 @@
- //go:build e2e
-
package basic

import (
@@ -152,6 +150,7 @@ func getPingFlows(t *testing.T, newerThan time.Time, expectedBytes int) (sent, r
}, test.Interval(time.Second))

test.Eventually(t, time.Minute, func(t require.TestingT) {
+ // testCluster.Loki().DebugPrint(100, `{app="netobserv-flowcollector",DstK8S_OwnerName="pinger"}`)
query, err = testCluster.Loki().
Query(1, fmt.Sprintf(`{SrcK8S_OwnerName="server",DstK8S_OwnerName="pinger"}`+
`|~"\"Proto\":1[,}]"`+ // Proto 1 == ICMP
74 changes: 63 additions & 11 deletions e2e/cluster/base/02-loki.yml
@@ -20,6 +20,11 @@ data:
server:
http_listen_port: 3100
grpc_listen_port: 9096
+ grpc_server_max_recv_msg_size: 10485760
+ http_server_read_timeout: 1m
+ http_server_write_timeout: 1m
+ log_level: error
+ target: all
common:
path_prefix: /loki-store
storage:
@@ -31,9 +36,32 @@ data:
instance_addr: 127.0.0.1
kvstore:
store: inmemory
+ compactor:
+ compaction_interval: 5m
+ retention_enabled: true
+ retention_delete_delay: 2h
+ retention_delete_worker_count: 150
+ frontend:
+ compress_responses: true
+ ingester:
+ chunk_encoding: snappy
+ chunk_retain_period: 1m
+ query_range:
+ align_queries_with_step: true
+ cache_results: true
+ max_retries: 5
+ results_cache:
+ cache:
+ enable_fifocache: true
+ fifocache:
+ max_size_bytes: 500MB
+ validity: 24h
+ parallelise_shardable_queries: true
+ query_scheduler:
+ max_outstanding_requests_per_tenant: 2048
schema_config:
configs:
- - from: 2020-10-24
+ - from: 2022-01-01
store: boltdb-shipper
object_store: filesystem
schema: v11
@@ -47,15 +75,39 @@ data:
active_index_directory: /loki-store/index
shared_store: filesystem
cache_location: /loki-store/boltdb-cache
- datasource.yaml: |
- apiVersion: 1
- datasources:
- - name: Loki
- type: loki
- access: proxy
- url: http://localhost:3100
- isDefault: true
- version: 1
+ cache_ttl: 24h
+ limits_config:
+ ingestion_rate_strategy: global
+ ingestion_rate_mb: 10
+ ingestion_burst_size_mb: 10
+ max_label_name_length: 1024
+ max_label_value_length: 2048
+ max_label_names_per_series: 30
+ reject_old_samples: true
+ reject_old_samples_max_age: 15m
+ creation_grace_period: 10m
+ enforce_metric_name: false
+ max_line_size: 256000
+ max_line_size_truncate: false
+ max_entries_limit_per_query: 10000
+ max_streams_per_user: 0
+ max_global_streams_per_user: 0
+ unordered_writes: true
+ max_chunks_per_query: 2000000
+ max_query_length: 721h
+ max_query_parallelism: 32
+ max_query_series: 10000
+ cardinality_limit: 100000
+ max_streams_matchers_per_query: 1000
+ max_concurrent_tail_requests: 10
+ retention_period: 24h
+ max_cache_freshness_per_query: 5m
+ max_queriers_per_tenant: 0
+ per_stream_rate_limit: 3MB
+ per_stream_rate_limit_burst: 15MB
+ max_query_lookback: 0
+ min_sharding_lookback: 0s
+ split_queries_by_interval: 1m
---
apiVersion: apps/v1
kind: Deployment
@@ -83,7 +135,7 @@ spec:
name: loki-config
containers:
- name: loki
- image: grafana/loki:2.4.1
+ image: grafana/loki:2.9.0
volumeMounts:
- mountPath: "/loki-store"
name: loki-store