
Memory usage buildup #78

Open · MarcoipPolo opened this issue Apr 18, 2024 · 5 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

@MarcoipPolo

Describe the bug

After running the deployment for some time, I noticed that it uses more and more memory over its lifetime. This should not happen.

What do you see?

[screenshot: memory usage graph]

What do you expect to see?

Memory usage stabilizes at a certain point.

List the steps that must be taken to reproduce this issue

  1. Start application
  2. Monitor the application over time

Version

Version 0.2.0 of the Helm chart

Additional information

No response

MarcoipPolo added the bug label on Apr 18, 2024
TwiN added the help wanted and good first issue labels on Apr 19, 2024
@TwiN (Owner) commented Apr 19, 2024

I can 100% confirm this. I noticed it happening as well & have not been able to figure out why. For now, I decided to just set a fairly low memory limit and let it get occasionally OOMKilled.

Would strongly appreciate it if somebody could investigate.

@MarcoipPolo (Author)

Not sure, but the first thing that comes to mind is the logs that get output every time the TTL controller is triggered. Maybe this article can help: https://www.codereliant.io/memory-leaks-with-pprof/
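
For reference, the heap-profiling approach that article describes boils down to something like the sketch below (mine, not from this repo); the resulting profile can then be inspected with go tool pprof /tmp/heap.pprof:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes a heap profile to the given path.
func dumpHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	runtime.GC() // run a GC first so the profile reflects live allocations only
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

func main() {
	dumpHeapProfile("/tmp/heap.pprof")
}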

@David2011Hernandez commented Nov 20, 2024

Hi, nicely spotted @MarcoipPolo. I faced this a while ago; unfortunately, I did not raise it here nor document the way I profiled the code 😢. BTW, I am really grateful to @TwiN for open-sourcing this project.

In any case, the bottom line of my research is:

  • a big chunk of the memory allocation happens while marshalling and unmarshalling the objects fetched from the k8s API server when looping through all resources:
// DoReconcile goes over all API resources specified, retrieves all sub resources and deletes those who have expired
func DoReconcile(dynamicClient dynamic.Interface, eventManager *kevent.EventManager, resources []*metav1.APIResourceList) bool {
	for _, resource := range resources {
		if len(resource.APIResources) == 0 {
			continue
		}
		gv := strings.Split(resource.GroupVersion, "/")
		gvr := schema.GroupVersionResource{}
		if len(gv) == 2 {
			gvr.Group = gv[0]
			gvr.Version = gv[1]
		} else if len(gv) == 1 {
			gvr.Version = gv[0]
		} else {
			continue
		}
		for _, apiResource := range resource.APIResources {
			// Make sure that we can list and delete the resource. If we can't, then there's no point querying it.
			verbs := apiResource.Verbs.String()
			if !strings.Contains(verbs, "list") || !strings.Contains(verbs, "delete") {
				continue
			}
			// List all items under the resource
			gvr.Resource = apiResource.Name
			var list *unstructured.UnstructuredList
			var continueToken string
			var err error
			for list == nil || continueToken != "" {
				list, err = dynamicClient.Resource(gvr).List(context.TODO(), metav1.ListOptions{TimeoutSeconds: &listTimeoutSeconds, Continue: continueToken, Limit: ListLimit})
				if err != nil {
					log.Printf("Error checking %s from %s: %s", gvr.Resource, gvr.GroupVersion(), err)
					continue
				}
				if list != nil {
					continueToken = list.GetContinue()
				}
				if debug {
					log.Println("Checking", len(list.Items), gvr.Resource, "from", gvr.GroupVersion())
				}
				for _, item := range list.Items { // mainly here, if I recall properly
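
Since the expiry check only needs object metadata (annotations and timestamps), one idea, sketched minimally below and not something from the repo, would be client-go's metadata-only client, which decodes PartialObjectMetadata instead of full objects and so should allocate far less per item:

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the controller runs in-cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
	// Each item carries names, labels, annotations and timestamps only,
	// instead of the fully unmarshalled object.
	list, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{Limit: 500})
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range list.Items {
		_ = item.GetAnnotations() // e.g. read the TTL annotation here
	}
}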

My solution at the time was to add some sort of filter like the one below, besides changing the constants to variables so that I can modify them using environment variables, for example:

// DoReconcile goes over all API resources specified, retrieves all sub resources and deletes those who have expired
func DoReconcile(dynamicClient dynamic.Interface, eventManager *kevent.EventManager, resources []*metav1.APIResourceList, listLimit int64, throttleDuration time.Duration, resourceGroupVersionToKeep map[string]struct{}) bool {
	for _, resource := range resources {
		if len(resource.APIResources) == 0 {
			continue
		}
		// only keep the resource group version we want to check
		if _, ok := resourceGroupVersionToKeep[resource.GroupVersion]; !ok {
			if debug != "" {
				log.Println("Skipping", resource.GroupVersion)
			}
			continue
		}
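
For completeness, here is a hypothetical helper (the names are mine, not the repo's, and it assumes the os and strings imports) showing how the resourceGroupVersionToKeep map could be built from an environment variable:

// resourceGroupVersionsFromEnv builds the keep-map from a hypothetical
// comma-separated variable, e.g. RESOURCE_GROUP_VERSIONS="v1,apps/v1,batch/v1".
func resourceGroupVersionsFromEnv() map[string]struct{} {
	keep := make(map[string]struct{})
	for _, gv := range strings.Split(os.Getenv("RESOURCE_GROUP_VERSIONS"), ",") {
		if gv = strings.TrimSpace(gv); gv != "" {
			keep[gv] = struct{}{}
		}
	}
	return keep
}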

FYI, I am not a Go expert.

@David2011Hernandez

I have also added pprof behind a gate:


import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
	// start a pprof server when debugging is enabled
	if debug != "" {
		go func() {
			log.Println(http.ListenAndServe("localhost:6060", nil))
		}()
	}
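
With that gate enabled, a heap profile can be pulled from the running controller with, for example: go tool pprof http://localhost:6060/debug/pprof/heap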

@TwiN (Owner) commented Dec 6, 2024

I suspect your fix may just hide the memory leak since more resources are skipped 😔
That being said, thanks for bringing it up. I'll investigate when I have some time.
