
script taking too long to gather results #22

Closed
doccaz opened this issue Dec 2, 2021 · 18 comments
Assignees
Labels: bug (Something isn't working) · enhancement (New feature or request) · question (Further information is requested)

Comments


doccaz commented Dec 2, 2021

It's been reported that lately vercheck.py is taking around 10-12 minutes to complete, and is stuck in the "gathering results..." message for quite a while.

I've noticed that something changed in the SCC infrastructure that's leaving at least "max_threads" connections in the CLOSE_WAIT state until the kernel's tcp_fin_timeout runs out for each one. The default for recent kernels is 60 seconds.

@doccaz doccaz self-assigned this Dec 2, 2021
@doccaz doccaz added the bug Something isn't working label Dec 2, 2021

doccaz commented Dec 2, 2021

So far I've tried the following:

  • Enabled TCP keep-alive on the pool's sockets. The HTTPConnection object is instantiated by the urllib3 PoolManager class, so I patched its default socket options:

    HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    ]

    That didn't make a difference.
  • Added an HTTP "Connection: close" header: the HTTP/1.1 specification says that when a request carries 'Connection': 'close', the server must close the socket after responding. This does close the connection immediately, but it negates the advantages of using a connection pool.
  • Tried preload_content=False, which makes the program read and process the response body itself (normally used for streaming content). I implemented this (and had to decompress the data manually with gzip). No discernible difference.
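The manual-decompression path from the last bullet can be sketched like this; a locally compressed payload stands in for a live SCC response, so the URL and request plumbing are omitted:

```python
import gzip
import json

# Stand-in for a raw, still-compressed response body, as urllib3 hands it
# back when preload_content=False / decode_content=False are used.
raw_body = gzip.compress(json.dumps({"data": []}).encode("utf-8"))

# The caller is then responsible for decompressing and parsing it:
decoded = json.loads(gzip.decompress(raw_body).decode("utf-8"))
print(decoded)
```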


doccaz commented Dec 17, 2021

It appears we're suffering some kind of rate limiting on the SCC side (to be confirmed). It doesn't matter what max_threads value I use, or even how many packages there are: both a 565-package and a 1565-package supportconfig take around the same time to complete (~10 minutes). But at least we're no longer getting "zombie" CLOSE_WAIT connections after adding the "Connection: close" header.

Closing this for now, will check later with SCC staff about rate limiting.

@doccaz doccaz closed this as completed Dec 17, 2021
@doccaz doccaz added the question Further information is requested label Dec 18, 2021
@doccaz doccaz reopened this Dec 18, 2021

doccaz commented Dec 18, 2021

Waiting for SCC team answer on this...


doccaz commented Jan 19, 2022

It turns out that there are some severe restrictions on requests per minute for the API. I'll start working on a local JSON cache to try and mitigate that.

@doccaz doccaz added the enhancement New feature or request label Jan 19, 2022
@doccaz doccaz pinned this issue Jan 19, 2022

doccaz commented Jan 19, 2022

My initial idea is to cache the responses for each product ID in a single local JSON file (/var/cache/.../something.json?). I also need to make sure that cache entries are "expired" after a set period of time (5-7 days?). This is necessary to guarantee a) that we don't keep version information that is too old in the cache (which I think could grow quite a bit), and b) that we don't present information that is too old.
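A minimal sketch of that expiry check, assuming entries carry an ISO-8601 'timestamp' field like the cache records shown later in this thread (the 5-day TTL is just one of the candidate values):

```python
from datetime import datetime, timedelta

CACHE_TTL = timedelta(days=5)  # candidate expiry period (5-7 days under discussion)

def is_expired(entry, now=None):
    """True if a cache entry's 'timestamp' is older than the TTL."""
    now = now or datetime.now()
    return now - datetime.fromisoformat(entry["timestamp"]) > CACHE_TTL

fresh = {"name": "kernel-default", "timestamp": datetime.now().isoformat()}
stale = {"name": "kernel-default", "timestamp": "2020-01-01T00:00:00"}
print(is_expired(fresh), is_expired(stale))
```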


doccaz commented Feb 2, 2022

I implemented a JSON file cache, and so far it's working okay. However, I'm facing a few file/record locking issues due to multiple threads trying to update the cache simultaneously. So, it works for the most part, but some records are being dropped/overwritten.

Working on getting a singleton class for my "cache manager" to coordinate this properly.
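One way to sketch that singleton (names are hypothetical; the real cache manager presumably also holds the JSON data and file-I/O logic):

```python
import threading

class CacheManager:
    """Process-wide singleton so every thread shares one cache and one lock."""
    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        with cls._instance_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._data_lock = threading.Lock()
                inst._entries = []
                cls._instance = inst
        return cls._instance

    def add(self, entry):
        # All mutations go through one lock, so records can no longer be
        # dropped or overwritten by concurrent threads.
        with self._data_lock:
            self._entries.append(entry)
```

With this in place, every thread that calls CacheManager() gets the same object, so updates serialize through the data lock instead of racing on the file.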


doccaz commented Feb 11, 2022

Just updating: the singleton class is working fine. I'm just ironing out some issues with file locations/permissions.
Should I go for /var/cache/scc-tools/scc_data.json, or assume something in ~/.config/cache? Or both?

The first option requires the user to have permission to read and write files there.


Thr3d commented Feb 12, 2022

I like the /var/cache/ approach but that does raise some issues in situations where multiple users on a system could be running the script as their user. What would be the best way to handle permissions there?

A way both could work is to default to /var/cache/scc-tools/. If that directory exists and is writable, use it, otherwise fallback to ~/.config/cache/scc-tools/.
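That fallback could look something like this (cache_dir is a hypothetical helper; the per-user path follows the ~/.cache location that shows up in the script's output later in this thread):

```python
import os

SYSTEM_DIR = "/var/cache/scc-tools"

def cache_dir():
    """Prefer the shared system cache when it exists and is writable,
    otherwise fall back to a per-user directory."""
    if os.path.isdir(SYSTEM_DIR) and os.access(SYSTEM_DIR, os.W_OK):
        return SYSTEM_DIR
    user_dir = os.path.expanduser("~/.cache/scc-tools")
    os.makedirs(user_dir, exist_ok=True)
    return user_dir

print(cache_dir())
```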


doccaz commented Mar 9, 2022

I haven't been idle: for the past couple of weeks I was solving the problem of making a true singleton class for the cache manager, so the cache data could be safely used/updated by the multiple threads with proper locks.

Also, I had to work out routines for de-duplication/validation/expiration of entries in the cache, as well as defining a common data format compatible with the already existing search engine.

Besides that, I hit a snag that needed some serious thinking: whenever you search through the API, SCC automatically returns package entries from the related modules that compose that product, such as the Basesystem module. The problem is that these also have their own unique product IDs.

So, for example, say I searched for the package 'kernel-default' for product ID 2134 (SLED 15 SP3 x86_64) and received results from product ID 2145 (which corresponds to the Basesystem module for 15 SP3 x86_64). I added this to the cache, but on subsequent searches the corresponding entry would NOT be found, since we look it up by the original product ID (2134). To solve this, I need a table that correlates each module product ID to the base product IDs it belongs to. This information can be retrieved from RMT in the product list.

So, I now have a table inside the class with information like this:
2145: { 'name': 'Basesystem Module 15 SP3 x86_64', 'products': [2136, 2140, 2134]},

... the downside is that I'll need to code this information manually like I already did for the product list (and for the same reasons -- I want the tool to be self-contained and to not require authentication to SCC). But that's life...
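The lookup that table enables can be sketched as follows (entry_matches is a hypothetical helper; the table row is the Basesystem example above):

```python
# Hand-maintained table: module product ID -> base product IDs it belongs to
MODULE_PRODUCTS = {
    2145: {"name": "Basesystem Module 15 SP3 x86_64",
           "products": [2136, 2140, 2134]},
}

def entry_matches(entry_product_id, requested_product_id):
    """A cached entry matches either the requested product ID directly,
    or a module product ID that belongs to the requested product."""
    if entry_product_id == requested_product_id:
        return True
    module = MODULE_PRODUCTS.get(entry_product_id)
    return module is not None and requested_product_id in module["products"]

# A Basesystem (2145) entry now satisfies a search for SLED 15 SP3 (2134):
print(entry_matches(2145, 2134))
```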


doccaz commented Mar 9, 2022

Current output for a search with a missing cache file (and yes, I already implemented the /var/cache vs ~/.cache logic as you suggested, @Thr3d ):

erico@suselab-erico:~/Projetos/scc-tools> ./vercheck.py -p 2134 -n kernel-default
searching for package "kernel-default" in product id "2134" (SUSE Linux Enterprise Desktop 15 SP3 x86_64)
Error reading the package cache from /home/erico/.cache/scc-tools/scc_data.json (non-fatal)
Error reading the package cache from /home/erico/.cache/scc-tools/scc_data.json (non-fatal)
searching for kernel-default for product ID 2134 in SCC
latest version for kernel-default on product ID 2134 is 5.3.18-150300.59.54.1
writing 16 items to cache at /home/erico/.cache/scc-tools/scc_data.json

In this case, we received 16 entries from SCC (which may span multiple compatible module product IDs).

If we examine the data returned as the latest version for the package, we see that it references product ID 2145 (Basesystem):

      "id" : 21911366,
      "release" : "150300.59.54.1",
      "timestamp" : "2022-03-09T16:36:29.799891",
      "name" : "kernel-default",
      "version" : "5.3.18",
      "repository" : "Basesystem Module 15 SP3 x86_64",
      "arch" : "x86_64",
      "products" : [
         {
            "edition" : "15 SP3",
            "identifier" : "sle-module-basesystem/15.3/x86_64",
            "name" : "Basesystem Module",
            "free" : true,
            "id" : 2145,
            "architecture" : "x86_64",
            "type" : "module"
         }

A subsequent search will hit this cached entry by using the table I mentioned earlier:

erico@suselab-erico:~/Projetos/scc-tools> ./vercheck.py -p 2134 -n kernel-default
searching for package "kernel-default" in product id "2134" (SUSE Linux Enterprise Desktop 15 SP3 x86_64)
loaded 16 items from cache (/home/erico/.cache/scc-tools/scc_data.json)
found kernel-default for product ID 2134 (cached)
latest version for kernel-default on product ID 2134 is 5.3.18-150300.59.54.1
writing 16 items to cache at /home/erico/.cache/scc-tools/scc_data.json

Now I need to go through with my supportconfig testcases and see how it affects performance.

doccaz added a commit that referenced this issue Mar 13, 2022

doccaz commented Mar 15, 2022

I think I've solved all the cache inconsistencies and thread-synchronization issues; my testcases all ran to completion.
Cached entries come up instantly (and the corresponding threads are being closed correctly too, and fast!), but every now and then a request to SCC may be "stuck" for a few minutes due to their rate limiting (still...). Fortunately, these become less and less frequent as the cache fills. I opted to hold cache entries for up to 5 days; after one expires, the tool fires a new search request to SCC.

I added a script named "vercheck-cache.py" to the repository. If someone else could test it, I'd be glad to close this issue.

@tuttipazzo

Erico,
What happens to the expired cache entries?


doccaz commented Mar 15, 2022

Erico, What happens to the expired cache entries?

They get individually deleted from the cache, then a search request is made to SCC to re-populate the cached results for that package/product.

@tuttipazzo

Erico, What happens to the expired cache entries?

They get individually deleted from the cache, then a search request is made to SCC to re-populate the cached results for that package/product.

But isn't the point of the cache to avoid having to go to SCC? Plus, that package info is going to be useful for customers who don't always update to the latest packages.


Thr3d commented Mar 15, 2022

@tuttipazzo Your question makes it sound like there should never be a reason to invalidate the cache. The caching is mostly for those that commonly make queries. If someone uses it so rarely that the cache is fully expired then is there really a value in using cached data that is likely out of date vs ensuring the latest data is polled?


doccaz commented Mar 15, 2022

Currently, I'm setting the expiry date for entries to 5 days. Would you suggest a different value? Or would a parameter to set a different period be useful?


doccaz commented Mar 16, 2022

I worked yesterday with Jesse, we got it working again on Concord. I guess I was abusing the Python runtime a little too much with all the threads (and Concord is too fast), so we were getting deadlocks in Python's thread synchronization. But it appears to be working fine with the cache now, with over 13,000 cache entries and no lockups so far.

It's not perfect though: I only write to the cache at the end of the session, not atomically. With multiple sessions running concurrently, the one that finishes last "wins", so we could be losing a few results that could be written to the cache. The reason for that is simple: I don't want to kill Concord's disks with thousands of small writes 😁

I'll work on adapting it to SQLite later, which should group these writes in a smarter way.
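A sketch of how SQLite could batch those writes, one transaction per session instead of thousands of small file writes (the schema and names are hypothetical):

```python
import sqlite3

def write_batch(conn, entries):
    """Write many cache records in a single transaction (one commit/fsync)."""
    with conn:  # the context manager commits everything at once
        conn.executemany(
            "INSERT OR REPLACE INTO cache (name, product_id, version) "
            "VALUES (?, ?, ?)",
            entries,
        )

conn = sqlite3.connect(":memory:")  # on disk this would live under the cache dir
conn.execute("CREATE TABLE cache (name TEXT, product_id INTEGER, version TEXT, "
             "PRIMARY KEY (name, product_id))")
write_batch(conn, [("kernel-default", 2134, "5.3.18-150300.59.54.1"),
                   ("zypper", 1939, "1.14.51-3.52.1")])
print(conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0])
```

INSERT OR REPLACE also means a later session simply overwrites a stale row instead of the last writer clobbering the whole file.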

Anyway, the speed gains are still quite good. Any delays we're getting are due to the API server's rate limiting.


doccaz commented Mar 17, 2022

@tuttipazzo coincidentally, during one of my tests just moments ago, I got an expired cache entry. This is how it works:

cached data for zypper is too old (-6 days), discarding cache entry
removing record from cache: {'id': 21851832, 'name': 'zypper', 'arch': 'x86_64', 'version': '1.14.51', 'release': '3.52.1', 'products': [{'id': 2219, 'name': 'SUSE Linux Enterprise Server LTSS', 'identifier': 'SLES-LTSS/15.1/x86_64', 'type': 'extension', 'free': False, 'edition': '15 SP1', 'architecture': 'x86_64'}], 'timestamp': '2022-03-12T02:15:30.193223', 'repository': 'Basesystem Module 15 SP2 x86_6415 SP2x86_64', 'product_id': 1939}
searching for zypper for product ID 1939 in SCC

So the record in the cache was 6 days old, and a request for newer information was immediately queued to SCC.

I'm closing this issue; I guess the main purpose has been achieved.

@doccaz doccaz closed this as completed Mar 17, 2022