Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a persistent cache for tldextract across CI runs #3716

Closed
sarayourfriend opened this issue Jan 29, 2024 · 5 comments
Closed

Use a persistent cache for tldextract across CI runs #3716

sarayourfriend opened this issue Jan 29, 2024 · 5 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: mgmt Related to repo management and automations

Comments

@sarayourfriend
Copy link
Collaborator

Current Situation

tldextract is a library used by the ingestion server to help extract TLDs (not a surprise, based on the library's name!).

It relies on updated public suffix information, which it requests the first time it runs. This information is cached, and we should persist it accross test runs.

Suggested Improvement

Configure tldextract's cache location to a place we can easily cache in GitHub Actions. Use tldextract --update with the configured cache location, as shown in the library docs, to update the cached data as needed, and prevent unnecessary network requests in tests.

Benefit

Prevent unnecessary network requests during test runs.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: mgmt Related to repo management and automations labels Jan 29, 2024
@sarayourfriend
Copy link
Collaborator Author

@krysal @stacimc will the ingestion worker still have this dependency? I was wondering whether this is no longer needed since we moved URL clean up out of the data refresh in #430?

@AetherUnbound
Copy link
Collaborator

It does still have that dependency, although I do not believe it will be carried forward once we have the new indexer workers running. We should be able to close this issue.

@AetherUnbound AetherUnbound closed this as not planned Won't fix, can't repro, duplicate, stale Sep 3, 2024
@stacimc
Copy link
Contributor

stacimc commented Sep 3, 2024

The indexer worker definitely will not have this dependency. I will leave it to @obulat to confirm if it needed elsewhere/where the cleanup is happening as I'm not sure 😅

@stacimc
Copy link
Contributor

stacimc commented Sep 3, 2024

@AetherUnbound does the catalog's usage of tldextract not affect this?

@AetherUnbound
Copy link
Collaborator

@stacimc Because of the pytest-socket plugin we use and the example overrides we have for testing, tldextract does not make any outgoing requests during the catalog's testing and any caching it does is likely insignificant and not worth the effort to attempt to cache (IMO)!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: mgmt Related to repo management and automations
Projects
Archived in project
Development

No branches or pull requests

3 participants