GATEWAYS-4306: exporting metrics for conntrack per zone #137
base: master
Conversation
Please provide a descriptive commit message.
Sure! For now I have updated the description.
Pull Request Overview
This PR implements a new connection tracking monitoring system that leverages netlink to directly access kernel conntrack data. The system provides zone-based monitoring capabilities for granular network traffic analysis and DDoS detection.
- Adds a new ConntrackService that uses netlink to query kernel conntrack entries (a minimal sketch of the approach follows after this list)
- Integrates the conntrack service into the main OVS client with proper lifecycle management
- Updates dependencies to support the new conntrack functionality
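As context for the review below, here is a minimal sketch of the zone-based counting described above, assuming the github.com/ti-mo/conntrack API that the diff snippets later in this thread call (conntrack.Dial, Conn.Dump, Flow.Zone). The function names and the per-zone tally are illustrative only, not the PR's ConntrackService code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/ti-mo/conntrack"
)

// countByZone dumps the kernel conntrack table once and tallies entries per
// zone. This is a sketch of the approach, not the PR's ConntrackService.
func countByZone() (map[uint16]int, error) {
	conn, err := conntrack.Dial(nil) // netlink socket to the conntrack subsystem
	if err != nil {
		return nil, fmt.Errorf("dialing conntrack: %w", err)
	}
	defer conn.Close()

	flows, err := conn.Dump(nil) // full table dump, as in the diff below
	if err != nil {
		return nil, fmt.Errorf("dumping conntrack: %w", err)
	}

	counts := make(map[uint16]int)
	for _, f := range flows {
		counts[f.Zone]++ // Flow.Zone carries the conntrack zone ID
	}
	return counts, nil
}

func main() {
	counts, err := countByZone()
	if err != nil {
		log.Fatal(err)
	}
	for zone, n := range counts {
		fmt.Printf("zone %d: %d entries\n", zone, n)
	}
}
```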
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
| --- | --- |
| ovsnl/conntrack.go | New service implementing conntrack entry retrieval and conversion from kernel data |
| ovsnl/client.go | Integration of ConntrackService into the main client with initialization and cleanup |
| go.mod | Dependency updates, including the ti-mo/conntrack library and a Go version upgrade |
ovsnl/conntrack.go
Outdated
// Start dump in goroutine
go func() {
	defer close(flowChan)
	flows, err := s.client.Dump(nil)
I am curious how well Dump scales, and whether you should be using DumpFilter or DumpExpect instead (or maybe even some new variant that just counts, if that's all we need).
How many entries did you scale to in your test setup?
I did a POC but have not tested under heavy traffic yet. I will check DumpFilter and DumpExpect as well and get back to you.
You won't need heavy traffic. Just scale up to a million or two conntrack entries and see if it performs well.
Updated the code and checked with 1 million conntrack entries; it is working. Just a heads up: the code still needs a lot of cleanup.
That's encouraging. While you are at it, could you try the max conntrack limit as well?
Looking at the snapshot from your scaled run, it does show an IRQ plateau for the duration of the run. I am assuming there were no CPU lockup messages from the kernel during this run, correct?
Did you get a chance to optimize the frequency of metrics collection?
On another note, this collection should be controlled with a config knob, and we should slow-roll it carefully.
Also cc @jcooperdo for another pair of eyes.
@do-msingh I was working on the issue you caught from the screenshot: the conntrack count was not being refreshed properly. It looks under control now. So far I have tested seeding conntracks to a specific droplet on a hypervisor. In this screenshot you will see only around a 400K jump because, for easy testing, I kept the timeout at 10 minutes, but I actually created 2.6 million conntracks. I will test scenarios like a 1-hour timeout and seeding conntracks to multiple droplets (10-20), and see how the system performs. Will keep this thread posted.
While testing with a 1-hour timeout and 2.6M conntracks created against 1 droplet in a single zone, there are some small discrepancies due to processing delay (for example, we missed 12k events while running the sync for the 2.6M conntracks).
I can fix this later as an improvement task.
In the same way, when I tested creating conntracks without my changes in openvswitch_exporter, the graph looks like the above.
The build is failing due to the Go version on my local machine; this repository uses an older version. Once the code is signed off, I will install the older version and push. Keeping it like this for now.
ovsnl/conntrack.go
Outdated
// NewZoneMarkAggregator creates a new aggregator with its own listening connection.
func NewZoneMarkAggregator(s *ConntrackService) (*ZoneMarkAggregator, error) {
	log.Printf("Creating new conntrack zone mark aggregator...")
Could you remove these logs?
ovsnl/conntrack.go
Outdated
	return nil, fmt.Errorf("failed to create listening connection: %w", err)
}

log.Printf("Successfully created conntrack listening connection")
same as above
ovsnl/conntrack.go
Outdated
// Start dump in goroutine
go func() {
	defer close(flowChan)
	flows, err := s.client.Dump(nil)
This script would be nice to integrate with Chef, and maybe export some metrics using node exporter so we can build some dashboards around it. In your tests, could you run at scale for an extended period, like a couple of hours, and check average CPU utilization? Do you only see CPU spikes around the time metrics are collected, and for how long? Also, @jcooper had a suggestion to reduce the frequency of collecting the metrics, or to optimize it to reduce load.
Lastly, can you check dmesg output as well at scale to make sure we are not missing anything?
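A rough sketch of what a config-gated, interval-driven collection loop could look like; the flag names, defaults, and the runCollector/collectOnce helpers are hypothetical, not part of this PR:

```go
package main

import (
	"context"
	"flag"
	"log"
	"time"
)

// Hypothetical knobs; the names and defaults are illustrative only.
var (
	enableConntrack   = flag.Bool("collect-conntrack", false, "enable per-zone conntrack metrics")
	conntrackInterval = flag.Duration("conntrack-interval", time.Minute, "how often to collect conntrack metrics")
)

// runCollector calls collectOnce on a fixed interval until ctx is cancelled.
func runCollector(ctx context.Context, collectOnce func() error) {
	if !*enableConntrack {
		return // slow roll: collection stays off unless explicitly enabled
	}
	t := time.NewTicker(*conntrackInterval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := collectOnce(); err != nil {
				log.Printf("conntrack collection failed: %v", err)
			}
		}
	}
}

func main() {
	flag.Parse()
	runCollector(context.Background(), func() error {
		log.Println("collecting per-zone conntrack counts") // stand-in for the real dump
		return nil
	})
}
```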
ovsnl/conntrack.go
Outdated
}

// ForceSync performs a manual sync (disabled for large tables)
func (a *ZoneMarkAggregator) ForceSync() error {
This method doesn't appear to be used; is it needed?
Removed it.
ovsnl/conntrack.go
Outdated
}

// IsHealthy checks if the aggregator is in a healthy state
func (a *ZoneMarkAggregator) IsHealthy() bool {
This method doesn't appear to be used; is it needed?
Removed it; I missed cleaning it up. I had added it for debugging purposes.
ovsnl/conntrack.go
Outdated
	CPUs int
}

// ConntrackService manages the connection to the kernel's conntrack via Netlink.
How does ConntrackService do what the comment suggests? The only references to it I can find are no-op constructors/closers.
refactored
ovsnl/conntrack.go
Outdated
// primary counts (zone -> mark -> count)
mu     sync.RWMutex
counts map[uint16]map[uint32]int
Do we gain any benefit from mapping by zone -> mark? Could we instead simplify this by mapping by ZmKey?
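For illustration, the flattened keying suggested here could look like the sketch below; ZmKey matches the name used in later revisions of this PR, while the surrounding type, method, and the sync import are made up:

```go
// ZmKey identifies a (zone, mark) pair so counts can live in one flat map
// instead of a nested zone -> mark -> count structure.
type ZmKey struct {
	Zone uint16
	Mark uint32
}

// zoneMarkCounts is an illustrative holder for the flattened counts.
type zoneMarkCounts struct {
	mu     sync.RWMutex
	counts map[ZmKey]int
}

func (c *zoneMarkCounts) inc(zone uint16, mark uint32) {
	c.mu.Lock()
	c.counts[ZmKey{Zone: zone, Mark: mark}]++
	c.mu.Unlock()
}
```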
ovsnl/conntrack.go
Outdated
	return out
}

// GetTotalCount returns the total counted entries (best-effort)
Is this used by anything? Would it return the same value as nf_conntrack_count?
Removed it; I missed cleaning it up. I had added it for debugging purposes.
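For comparison with nf_conntrack_count mentioned above, the kernel's global count can be read straight from procfs; a minimal sketch (the helper name is made up, the sysctl path is the standard one):

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"strconv"
)

// readConntrackCount returns the kernel's global conntrack entry count,
// i.e. the value exposed by the net.netfilter.nf_conntrack_count sysctl.
func readConntrackCount() (int, error) {
	b, err := os.ReadFile("/proc/sys/net/netfilter/nf_conntrack_count")
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(string(bytes.TrimSpace(b)))
}

func main() {
	n, err := readConntrackCount()
	if err != nil {
		panic(err)
	}
	fmt.Println("nf_conntrack_count:", n)
}
```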
Co-authored-by: jcooperdo <[email protected]>
Pull Request Overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
ovsnl/conntrack.go
Outdated
deltas := a.destroyDeltas
a.destroyDeltas = make(map[ZmKey]int)
a.deltaMu.Unlock()
Copilot AI · Oct 10, 2025
Similar race condition as in applyDeltasImmediately: the mutex is unlocked before the deltas map is fully processed. Other goroutines could start adding to the new destroyDeltas map while the old deltas are still being applied, potentially causing inconsistent state.
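For reference, the swap-then-apply pattern being discussed generally looks like the sketch below (field names mirror the snippet above, but the type and method are illustrative, and ZmKey is as sketched earlier). With this shape, deltas that arrive during the apply loop land in the fresh map and are applied on the next flush, so no delta is lost and neither map is mutated concurrently:

```go
// Illustrative shape of the aggregator fields involved.
type aggregator struct {
	deltaMu       sync.Mutex
	destroyDeltas map[ZmKey]int

	countsMu sync.RWMutex
	counts   map[ZmKey]int
}

func (a *aggregator) flushDestroyDeltas() {
	// Swap the delta map under its own lock so event workers keep
	// accumulating into a fresh map without blocking on the flush.
	a.deltaMu.Lock()
	deltas := a.destroyDeltas
	a.destroyDeltas = make(map[ZmKey]int, len(deltas))
	a.deltaMu.Unlock()

	// Apply the swapped-out deltas under the counts lock; anything that
	// arrives in the meantime is picked up by the next flush.
	a.countsMu.Lock()
	for k, d := range deltas {
		a.counts[k] -= d
		if a.counts[k] <= 0 {
			delete(a.counts, k)
		}
	}
	a.countsMu.Unlock()
}
```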
ovsnl/client.go
Outdated
if len(errs) > 0 {
	return fmt.Errorf("errors closing client: %v", errs)
}
Copilot AI · Oct 10, 2025
[nitpick] The error handling could be improved by using a more structured approach. Consider using errors.Join (Go 1.20+) or a similar pattern to properly combine multiple errors instead of formatting them into a single string.
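A sketch of the errors.Join approach Copilot suggests; closeAll and the []io.Closer parameter are illustrative stand-ins for the client's sub-services:

```go
import (
	"errors"
	"io"
)

// closeAll closes each sub-service and combines any failures into a single
// error while keeping the individual errors visible to errors.Is / errors.As.
func closeAll(closers []io.Closer) error {
	var errs []error
	for _, c := range closers {
		if err := c.Close(); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when no errors were collected (Go 1.20+)
}
```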
Co-authored-by: Anit Gandhi <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Anit Gandhi <[email protected]>
Co-authored-by: Anit Gandhi <[email protected]>
ovsnl/conntrack_linux.go
Outdated
go func() {
	a.initialSnapshotComplete = true
	a.initialSnapshotError = nil
initialSnapshotError seems like it's always nil: it defaults to nil, is set again here, and then is seemingly never used. The same goes for initialSnapshotComplete.
Also, why are these set in a goroutine?
My bad, those variables were used while debugging issues under heavy load. I missed cleaning them up; I will do that in the next commit.
ovsnl/conntrack_linux.go
Outdated
if atomic.LoadInt64(&a.eventCount)%100 == 0 {
	runtime.Gosched()
}
what is this for?
I was testing with 2.6M conntracks sent to a single VM, and also ran a test where multiple VMs receive heavy conntrack traffic. The intention is to tell the scheduler to yield the current goroutine's time slice and let others run; I found this can improve concurrency and responsiveness in high-throughput or tight-loop scenarios. Doing it for every event would be inefficient, so I added the if condition.
ovsnl/conntrack_linux_test.go
Outdated
}

// Clean up
agg.Stop()
Would recommend replacing this with t.Cleanup(agg.Stop) up on line 58.
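For reference, the suggested pattern registers the stop call right after the aggregator is constructed; a sketch (the setup of svc is omitted and assumed from earlier in the test):

```go
func TestZoneMarkAggregator(t *testing.T) {
	agg, err := NewZoneMarkAggregator(svc) // svc created earlier in the test
	if err != nil {
		t.Fatalf("creating aggregator: %v", err)
	}
	// Runs at teardown even if a later assertion calls t.Fatal, replacing the
	// manual agg.Stop() at the end of the test.
	t.Cleanup(agg.Stop)

	// ... assertions ...
}
```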
ovsnl/conntrack_linux.go
Outdated
for i := 0; i < eventWorkerCount; i++ {
	a.wg.Add(1)
	go a.eventWorker(i)
}

a.wg.Add(1)
go a.destroyFlusher()

a.wg.Add(1)
go a.startHealthMonitoring()
Nowadays in Go 1.25+, you can use a.wg.Go, without having to do a.wg.Add + defer a.wg.Done.
done
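For reference, the Go 1.25 form described above could look like the sketch below, assuming the module's go directive is bumped to 1.25 and the workers no longer call wg.Done themselves:

```go
// Before (Go <= 1.24): manual bookkeeping, with wg.Done inside each worker.
a.wg.Add(1)
go a.destroyFlusher()

// After (Go 1.25+): WaitGroup.Go wraps Add(1)/Done() around the call itself.
for i := 0; i < eventWorkerCount; i++ {
	a.wg.Go(func() { a.eventWorker(i) })
}
a.wg.Go(a.destroyFlusher)
a.wg.Go(a.startHealthMonitoring)
```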
ovsnl/conntrack_common.go
Outdated
// metrics / health
eventCount    int64
lastEventTime time.Time
eventRate     float64
It looks like eventRate gets protected by countsMu. Assuming that's intentional, it should be grouped up at the top like:

countsMu  sync.RWMutex
counts    map[ZmKey]int // primary counts (zmKey -> count) - simplified flat mapping
eventRate float64
done
ovsnl/conntrack_common.go
Outdated
destroyDeltas map[ZmKey]int

// metrics / health
eventCount int64
Would recommend using atomic.Int64 as the type and using the methods on that type; since Go 1.19 that has been the more recommended pattern for atomics.
done
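A sketch of the atomic.Int64 form recommended above; the surrounding struct and method are illustrative:

```go
import (
	"runtime"
	"sync/atomic"
)

type eventCounter struct {
	// atomic.Int64's zero value is ready to use; its Add/Load methods replace
	// the atomic.AddInt64 / atomic.LoadInt64 free functions on a plain int64.
	eventCount atomic.Int64
}

func (c *eventCounter) onEvent() {
	if c.eventCount.Add(1)%100 == 0 {
		runtime.Gosched() // same periodic yield as in the diff above
	}
}
```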
Co-authored-by: Anit Gandhi <[email protected]>