Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[usm] service discovery, system-probe ignore configured services by name #30702

Merged

Conversation

yuri-lipnesh
Copy link
Contributor

@yuri-lipnesh yuri-lipnesh commented Nov 4, 2024

What does this PR do?

This code adds a hashmap to store processes that should be excluded from processing based on the service name.
This code adds excludes service name 'datadog-agent' from processing.

Motivation

USMON-1226
Some processes cannot be excluded based on the command name, which may be a generic name such as "agent".
In this case, a service can be excluded based on the service name. When a service name is identified, the process ID is stored in an internal table to exclude it from processing.

Describe how to test/QA your changes

This code passed service discovery unit test on local VM
PASS
ok github.com/DataDog/datadog-agent/pkg/collector/corechecks/servicediscovery/module 38.766s

Started couple of programs on local VM and dogfeeded to dddev environment.
Checked that services were reported on APM page "Identify and close your observability gaps"
Checked that services are not reported anymore after adding services to the list of bypassed services.

Possible Drawbacks / Trade-offs

Additional Notes

@yuri-lipnesh yuri-lipnesh requested a review from a team as a code owner November 4, 2024 12:37
@github-actions github-actions bot added component/system-probe medium review PR review might take time labels Nov 4, 2024
@agent-platform-auto-pr
Copy link
Contributor

agent-platform-auto-pr bot commented Nov 4, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=48395348 --os-family=ubuntu

Note: This applies to commit a777efa

@yuri-lipnesh yuri-lipnesh added changelog/no-changelog team/usm The USM team qa/done QA done before merge and regressions are covered by tests team/apm-onboarding labels Nov 4, 2024
@@ -0,0 +1,51 @@
// Unless explicitly stated otherwise all files in this repository are licensed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it in a new file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source file impl_linux.go already reached 600 lines, which I think it considered large file, I didn't want to increase this file further, this is why new file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is misleading to have methods of the same struct defined on multiple files
let's fold it back to the original file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 14 to 17
// ignoreServices is a list of service names that should not be reported as a service.
var ignoreServices = map[string]struct{}{
"datadog-agent": {},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only datadog agent?

Copy link
Contributor Author

@yuri-lipnesh yuri-lipnesh Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meantime only service defined in scope of the ticket to be bypassed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is pretty much irrelevant
Customers with the datadog agent don't set DD_SERVICE for us, and we don't do that by default

So I'm concerned the current implementation does not provide any actual value and impact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to delete this PR and close Jira ticket?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I want to improve the ticket description, and make the implementation better for the general case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated Jira ticket and title of this PR. Now this PR allows to add excluded service name to system-probe config. The config has one default service "datadog-agent"

@github-actions github-actions bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Nov 4, 2024
Copy link

cit-pr-commenter bot commented Nov 4, 2024

Regression Detector

Comment on lines 37 to 45
func (s *discovery) shouldIgnoreService(name string, proc *process.Process) bool {
s.mux.Lock()
_, found := ignoreServices[name]
if found {
s.ignorePids[proc.Pid] = true
}
s.mux.Unlock()

return found
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a function called should.... should only report true/false, and not changing an internal state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this function

Comment on lines 21 to 23
s.mux.Lock()
s.ignorePids[pid] = true
s.mux.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always prefer defered unlock, it guarantees future modifications won't miss the unlock

Suggested change
s.mux.Lock()
s.ignorePids[pid] = true
s.mux.Unlock()
s.mux.Lock()
defer s.mux.Unlock()
s.ignorePids[pid] = true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very short frequently used function, 'defer' instruction requires additional allocation, which is not necessary in this case. This function is not expected to grow significantly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a very negligible optimization.
Please follow the effective go guidelines https://go.dev/doc/effective_go#defer
Those are meant to reduce human errors and mistakes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this function to use 'defer'


// shouldIgnoreService saves pid of the process if the service should be excluded from handling
// and returns true for such process.
func (s *discovery) shouldIgnoreService(name string, proc *process.Process) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to get Process here, if you care only about the PID

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this code and separated ignored service and ignored pid

makeAlias(t, test.comm, serverBin)
bin := filepath.Join(serverDir, test.comm)
cmd := exec.CommandContext(ctx, bin)
cmd.Env = append(os.Environ(), test.envs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why os.Environ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 63 to 64
err := cmd.Start()
require.NoError(t, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err := cmd.Start()
require.NoError(t, err)
require.NoError(t, cmd.Start())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// wait until the service name becomes available
info, err := discovery.getServiceInfo(proc)
assert.NoError(collect, err)
assert.NotEmpty(collect, info.ddServiceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're expecting for a certain value (based on the environment variable value), use it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this unit test intends to check that Pid has been bypassed, there are other unit test covering service name extraction. I do not think this unit test should check match of service name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why do you test ddServiceName?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added service name verification

Comment on lines 503 to 505
if s.shouldIgnoreService(name, proc) {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you add the pid to the ignore list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this code and separated ignored service and addition of ignored pid

@guyarb
Copy link
Contributor

guyarb commented Nov 5, 2024

The ticket and PR should not focus merely on DD_SERVICE, but to filter entries by service name

Please fix the code, PR description and title accordingly

@yuri-lipnesh yuri-lipnesh changed the title [usm] service discovery, system-probe ignore service DD_SERVICE=datadog-agent [usm] service discovery, system-probe ignore discovered service datadog-agent Nov 5, 2024
@yuri-lipnesh yuri-lipnesh requested a review from a team as a code owner November 5, 2024 18:17
@yuri-lipnesh yuri-lipnesh changed the title [usm] service discovery, system-probe ignore discovered service datadog-agent [usm] service discovery, system-probe ignore configured services by name Nov 5, 2024
// loadIgnoredServices saves names that should not be reported as a service
func (config *discoveryConfig) loadIgnoredServices(services []string) {
if len(services) == 0 {
log.Debug("loading ignored services found empty commands list")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste bug, commands list is inaccurate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@yuri-lipnesh yuri-lipnesh requested a review from guyarb November 6, 2024 14:36
@yuri-lipnesh
Copy link
Contributor Author

/merge

@dd-devflow
Copy link

dd-devflow bot commented Nov 7, 2024

Devflow running: /merge

View all feedbacks in Devflow UI.


2024-11-07 17:42:40 UTC ℹ️ MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.


2024-11-07 21:22:32 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 24m.

@@ -404,6 +404,7 @@ func InitSystemProbeConfig(cfg pkgconfigmodel.Config) {
cfg.BindEnvAndSetDefault(join(discoveryNS, "enabled"), false)
cfg.BindEnvAndSetDefault(join(discoveryNS, "cpu_usage_update_delay"), "60s")
cfg.BindEnvAndSetDefault(join(discoveryNS, "ignored_command_names"), []string{"chronyd", "cilium-agent", "containerd", "dhclient", "dockerd", "kubelet", "livenessprobe", "local-volume-pr", "sshd", "systemd"})
cfg.BindEnvAndSetDefault(join(discoveryNS, "ignored_services"), []string{"datadog-agent", "trace-agent", "process-agent", "datadog-cluster-agent"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add security-agent and other processes listed in cmd/*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APM Distributed Tracing team provided us with some test environment with service discovery enabled, I took some DD services from that environment. Apparently security is not enabled there, that's why I don't see that service-agent in the discovery list. I would be careful about adding names that I don't know how they show up in discovery. This is why I prefer not to extend this list at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuri-lipnesh - Bryce is mentioning another DD product that it is not enabled in your environment. Please open a follow up PR to address that, and run PR in dogfooding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created USMON-1339

@dd-mergequeue dd-mergequeue bot merged commit 2a4706d into main Nov 7, 2024
294 checks passed
@dd-mergequeue dd-mergequeue bot deleted the yuri.lipnesh/USMON-1226_discovery_ignore_service branch November 7, 2024 21:45
@github-actions github-actions bot added this to the 7.61.0 milestone Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog component/system-probe long review PR is complex, plan time to review it qa/done QA done before merge and regressions are covered by tests team/apm-onboarding team/usm The USM team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants