feat(PE-6695/PE-6696): remove resolver, add redis support for arns cache #200
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop     #200     +/-   ##
===========================================
+ Coverage    68.53%   70.16%   +1.63%
===========================================
  Files           32       32
  Lines         7795     7829     +34
  Branches       438      438
===========================================
+ Hits          5342     5493    +151
+ Misses        2452     2336    -116
+ Partials         1        0      -1
resolver:
  image: ghcr.io/ar-io/arns-resolver:${RESOLVER_IMAGE_TAG:-7fe02ecda2027e504248d3f3716579f60b561de5}
  restart: on-failure
  ports:
    - 6000:6000
  environment:
    - PORT=6000
    - LOG_LEVEL=${LOG_LEVEL:-info}
    - IO_PROCESS_ID=${IO_PROCESS_ID:-}
    - RUN_RESOLVER=${RUN_RESOLVER:-false}
    - EVALUATION_INTERVAL_MS=${EVALUATION_INTERVAL_MS:-}
    - ARNS_CACHE_TTL_MS=${RESOLVER_CACHE_TTL_MS:-}
    - ARNS_CACHE_PATH=${ARNS_CACHE_PATH:-./data/arns}
    - AO_CU_URL=${AO_CU_URL:-}
    - AO_MU_URL=${AO_MU_URL:-}
    - AO_GATEWAY_URL=${AO_GATEWAY_URL:-}
    - AO_GRAPHQL_URL=${AO_GRAPHQL_URL:-}
  volumes:
    - ${ARNS_CACHE_PATH:-./data/arns}:/app/data/arns
  networks:
    - ar-io-network
👋
3c62e33 to 22b8f71
With on-demand resolution, we no longer need to support the arns-resolver. We can fetch arns resolutions quickly via AO and cache them locally or in redis. This replaces the default `MemoryArNSCache` with a configurable KvBufferStore that supports `redis` or a local `node-cache`. When resolving an arns name, the `CompositeArNSResolver` checks the provided cache and the TTL of the record. If the name is not in the cache, it uses the available arns resolvers (on-demand and/or another gateway) to get resolution data. An operator who wants to disable caching of arns names and always resolve to the latest can set ARNS_CACHE_TTL_SECONDS to 0. Additionally, prometheus metrics are available for the hit/miss rate of the arns cache and for total resolution times.
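The lookup order described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Resolver`, `TtlCache`, and `resolveWithCache` are hypothetical names, the `NameResolution` shape is assumed, and a plain `Map` stands in for redis or node-cache. The sketch also applies a single cache-wide TTL, which is a simplification of checking each record's TTL.

```typescript
// Hypothetical shapes for illustration; the real types in the PR may differ.
interface NameResolution {
  name: string;
  resolvedId: string | undefined;
  ttl: number; // seconds
}

interface Resolver {
  resolve(name: string): Promise<NameResolution>;
}

// In-memory stand-in for the configurable cache (redis or node-cache in the PR).
class TtlCache {
  private entries = new Map<string, { value: NameResolution; expiresAt: number }>();
  private ttlSeconds: number;
  private now: () => number;

  constructor(ttlSeconds: number, now: () => number = Date.now) {
    this.ttlSeconds = ttlSeconds;
    this.now = now;
  }

  get(name: string): NameResolution | undefined {
    const entry = this.entries.get(name);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Expired entries count as a miss.
      this.entries.delete(name);
      return undefined;
    }
    return entry.value;
  }

  set(name: string, value: NameResolution): void {
    // A TTL of 0 disables caching, so every lookup goes to the resolvers.
    if (this.ttlSeconds <= 0) return;
    this.entries.set(name, { value, expiresAt: this.now() + this.ttlSeconds * 1000 });
  }
}

// Check the cache first, then try each resolver in order until one succeeds.
async function resolveWithCache(
  name: string,
  cache: TtlCache,
  resolvers: Resolver[],
): Promise<NameResolution> {
  const cached = cache.get(name);
  if (cached !== undefined) return cached;

  for (const resolver of resolvers) {
    try {
      const resolution = await resolver.resolve(name);
      if (resolution.resolvedId !== undefined) {
        cache.set(name, resolution);
        return resolution;
      }
    } catch {
      // This resolver failed; fall through to the next one.
    }
  }
  return { name, resolvedId: undefined, ttl: 0 };
}
```

Constructing `new TtlCache(0)` in this sketch mirrors the effect of setting ARNS_CACHE_TTL_SECONDS to 0: nothing is stored, so each call reaches the resolvers.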
22b8f71 to 037417e
…or arns in cache This is to avoid doing it in the composite resolver
Walkthrough
The changes involve a comprehensive reconfiguration of the ARNS resolution system, including the removal of the
Changes
Sequence Diagram(s)
sequenceDiagram
    participant Client
    participant ARNSResolver
    participant Cache
    participant DataSource
    Client->>ARNSResolver: Request ARNS resolution
    ARNSResolver->>Cache: Check for cached result
    alt Cache hit
        Cache-->>ARNSResolver: Return cached result
    else Cache miss
        ARNSResolver->>DataSource: Resolve ARNS
        DataSource-->>ARNSResolver: Return resolved ARNS
        ARNSResolver->>Cache: Store result in cache
    end
    ARNSResolver-->>Client: Return ARNS resolution result
});
const resolution = await resolver.resolve(name);
if (resolution.resolvedId !== undefined) {
  await this.cache.set(name, Buffer.from(JSON.stringify(resolution)));
I was thinking about the need to `await` the cache set here. For other caches we generally don't wait for it to be set.
In this case, I imagine if we call `resolve` for the same name several times (while the first is being cached), the subsequent calls, made while the name is not cached yet, will not wait for the cache, right? If so, I don't think we should wait for the cache to be set.
What do you think?
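For reference, the non-blocking alternative raised in this comment could look something like the sketch below. `AsyncCache` and `resolveWithoutAwaitingCache` are hypothetical names used only for illustration, and values are plain strings rather than buffers. The point it shows is that a write which is not awaited still needs a `catch`, otherwise a failed cache write becomes an unhandled rejection.

```typescript
// Hypothetical minimal cache interface for illustration.
interface AsyncCache {
  set(key: string, value: string): Promise<void>;
}

// Returns the resolved value as soon as it is available and lets the
// cache write finish (or fail) in the background.
async function resolveWithoutAwaitingCache(
  name: string,
  cache: AsyncCache,
  fetchResolution: (name: string) => Promise<string>,
): Promise<string> {
  const value = await fetchResolution(name);

  // Fire and forget: the caller is not delayed by a slow cache, and a
  // failed write is logged instead of rejecting the request.
  cache.set(name, value).catch((error: unknown) => {
    console.warn(`Failed to cache resolution for ${name}`, error);
  });

  return value;
}
```

The trade-off is that callers arriving while the write is still pending will miss the cache and resolve again.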
good call out, removed in f1bd0cf.
when thinking about the implications of this, i also realized we aren't handling concurrent requests well. I'd like to propose adding a request cache in the arns middleware that protects against concurrent calls to `nameResolver.resolve(name)` by storing the promise in a `NodeCache`
// check if this instance is already in the process of resolving the requested name, and return that promise if so, otherwise set it in the cache
const getArnsResolutionPromise = async (): Promise<NameResolution> => {
  if (arnsRequestCache.has(arnsSubdomain)) {
    const arnsResolutionPromise =
      arnsRequestCache.get<Promise<NameResolution>>(arnsSubdomain);
    if (arnsResolutionPromise) {
      return arnsResolutionPromise;
    }
  }
  const arnsResolutionPromise = nameResolver.resolve(arnsSubdomain);
  arnsRequestCache.set(arnsSubdomain, arnsResolutionPromise);
  return arnsResolutionPromise;
};
const start = Date.now();
const { resolvedId, ttl, processId } =
  await getArnsResolutionPromise().finally(() => {
    // remove from cache after resolution
    arnsRequestCache.del(arnsSubdomain);
  });
metrics.arnsResolutionTime.observe(Date.now() - start);
if (resolvedId === undefined) {
  sendNotFound(res);
  return;
}
res.header(headerNames.arnsResolvedId, resolvedId);
res.header(headerNames.arnsTtlSeconds, ttl.toString());
res.header(headerNames.arnsProcessId, processId);
// TODO: add a header for arns cache status
res.header('Cache-Control', `public, max-age=${ttl}`);
dataHandler(req, res, next);
After adding this change, I did some tests using `hey`.
Test: 100 concurrent requests for an arns name
Before:
Each arns request created an independent promise to AO, resulting in a wide range of resolution times. Some requests were rate limited or throttled and had to fall back to the `TrustedGatewayArNSResolver`.
❯ hey -n 100 -c 100 -t 60 -host 'gateways.example.com' http://localhost:4000
Summary:
Total: 65.3634 secs
Slowest: 65.3604 secs
Fastest: 17.1049 secs
Average: 31.1251 secs
Requests/sec: 1.5299
Total data: 172414200 bytes
Size/request: 1724142 bytes
Response time histogram:
17.105 [1] |■
21.930 [71] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
26.756 [3] |■■
31.582 [0] |
36.407 [0] |
41.233 [0] |
46.058 [0] |
50.884 [0] |
55.709 [0] |
60.535 [6] |■■■
65.360 [19] |■■■■■■■■■■■
Latency distribution:
10% in 19.6950 secs
25% in 20.5840 secs
50% in 21.3610 secs
75% in 59.9175 secs
90% in 63.3682 secs
95% in 64.5496 secs
99% in 65.3604 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0074 secs, 17.1049 secs, 65.3604 secs
DNS-lookup: 0.0026 secs, 0.0016 secs, 0.0042 secs
req write: 0.0002 secs, 0.0000 secs, 0.0020 secs
resp wait: 27.6524 secs, 15.5326 secs, 65.3426 secs
resp read: 3.4650 secs, 0.0111 secs, 6.5513 secs
Status code distribution:
[200] 100 responses
Logs - these are printed 100 times (one for each request), and several requests hit rate limits from AO infrastructure, forcing the `CompositeArNSResolver` to fall back to gateways to get resolution data
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.052Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.052Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.054Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.054Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.055Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.056Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.056Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.057Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.060Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.060Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.060Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.064Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.065Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.065Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.069Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.069Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.069Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.074Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.074Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.074Z"}
core-1 | warn: Unable to resolve name: fetch failed {"class":"OnDemandArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.078Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.078Z","type":"TrustedGatewayArNSResolver"}
core-1 | info: Resolving name... {"class":"TrustedGatewayArNSResolver","name":"joose","timestamp":"2024-09-10T14:19:59.078Z"}
After:
All 100 requests were satisfied by a single request to AO, and resolution times were approximately the same for every request.
❯ hey -n 100 -c 100 -t 60 -host 'gateways.example.com' http://localhost:4000
Summary:
Total: 44.0368 secs
Slowest: 44.0354 secs
Fastest: 43.6696 secs
Average: 43.8166 secs
Requests/sec: 2.2708
Total data: 172414200 bytes
Size/request: 1724142 bytes
Response time histogram:
43.670 [1] |■
43.706 [0] |
43.743 [8] |■■■■■■■■■■
43.779 [22] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■
43.816 [31] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
43.853 [10] |■■■■■■■■■■■■■
43.889 [17] |■■■■■■■■■■■■■■■■■■■■■■
43.926 [3] |■■■■
43.962 [4] |■■■■■
43.999 [3] |■■■■
44.035 [1] |■
Latency distribution:
10% in 43.7457 secs
25% in 43.7732 secs
50% in 43.8027 secs
75% in 43.8615 secs
90% in 43.9001 secs
95% in 43.9574 secs
99% in 44.0354 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0091 secs, 43.6696 secs, 44.0354 secs
DNS-lookup: 0.0021 secs, 0.0015 secs, 0.0029 secs
req write: 0.0002 secs, 0.0000 secs, 0.0016 secs
resp wait: 42.9208 secs, 42.9030 secs, 42.9412 secs
resp read: 0.8865 secs, 0.7390 secs, 1.0992 secs
Status code distribution:
[200] 100 responses
Logs - only printed once for all 100 requests, no falling back to gateway resolver necessary
core-1 | info: Cache miss for arns name {"class":"CompositeArNSResolver","name":"gateways","timestamp":"2024-09-10T13:34:09.999Z"}
core-1 | info: Attempting to resolve name with resolver {"class":"CompositeArNSResolver","name":"gateways","timestamp":"2024-09-10T13:34:09.999Z","type":"OnDemandArNSResolver"}
core-1 | info: Resolving name... {"class":"OnDemandArNSResolver","name":"gateways","timestamp":"2024-09-10T13:34:10.000Z"}
This should also help avoid slamming AO on fresh or expired names, and it is likely a pattern we could extend to other request paths.
cc @djwhitt
@@ -616,6 +623,7 @@ export const shutdown = async (express: Server) => {
  eventEmitter.removeAllListeners();
  arIODataSource.stopUpdatingPeers();
  dataSqliteWalCleanupWorker?.stop();
  await arnsResolverCache.close();
👍
This protects against multiple calls to `nameResolver.resolve(name)` while resolving a single name. The cache is given a short TTL and requests are removed after they are resolved. The best way to verify this behavior is to trigger concurrent requests to resolve an arns name (e.g. using `hey`) and observe that the logs for resolving the name only appear once for all the requests. Once the original promise is resolved, every outstanding request is resolved with the result and the name & promise are removed from the request cache.
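The coalescing behavior described in this commit message can be reproduced in isolation with a small sketch. It is an assumption-laden simplification: a plain `Map` replaces the `NodeCache` used in the middleware, there is no short TTL on entries, and `inFlight` and `resolveOnce` are hypothetical names.

```typescript
// Tracks resolutions that are currently in flight, keyed by name.
const inFlight = new Map<string, Promise<string>>();

// Returns the pending promise for `name` if one exists; otherwise starts
// a new resolution and shares its promise with later callers.
function resolveOnce(
  name: string,
  resolve: (name: string) => Promise<string>,
): Promise<string> {
  const pending = inFlight.get(name);
  if (pending !== undefined) return pending;

  const promise = resolve(name).finally(() => {
    // Remove the entry once settled so the next request resolves afresh.
    inFlight.delete(name);
  });
  inFlight.set(name, promise);
  return promise;
}
```

Issuing many `resolveOnce` calls for the same name before the first settles results in one underlying call, which matches the single set of resolver log lines reported in the "After" test.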
Do you think it's worth putting a circuit breaker around the AO requests?
      resolver,
      message: error.message,
      stack: error.stack,
    });
  }
}
Should we fall back to the cached resolution here, perhaps with some staleness threshold, if we have one?
sure, can add that, we can use the TTL of the cache as the staleness threshold. that would mean if the cache has it, and we can't fetch anything new - return what the cache has until it expires.
modified to return the cached resolution data on error - if we have it - here - b9e4fe6
I do think we should, especially considering how long it's taken to resolve the issues we've seen today. Will add.
PR to add the circuit breaker: #201
This should help avoid slamming AO when there are intermittent issues
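As background for this thread, a circuit breaker in its simplest form can be sketched as below. This is not the implementation in #201, whose details are not shown here; `CircuitBreaker`, `failureThreshold`, and `cooldownMs` are hypothetical names, and the half-open handling is deliberately naive (concurrent trial calls are not limited to one).

```typescript
// Minimal circuit breaker: after `failureThreshold` consecutive failures it
// rejects calls immediately until `cooldownMs` has elapsed, then lets a
// trial call through.
class CircuitBreaker<T> {
  private consecutiveFailures = 0;
  private openedAt: number | undefined = undefined;
  private readonly action: () => Promise<T>;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    action: () => Promise<T>,
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = Date.now,
  ) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  async fire(): Promise<T> {
    if (this.openedAt !== undefined && this.now() - this.openedAt < this.cooldownMs) {
      // Open: skip the upstream call entirely.
      throw new Error('Circuit open: skipping call');
    }
    try {
      const result = await this.action();
      // Success closes the circuit and resets the failure count.
      this.consecutiveFailures = 0;
      this.openedAt = undefined;
      return result;
    } catch (error) {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

Wrapping the AO request in something like `breaker.fire()` would let a composite resolver move on to its next resolver quickly while the circuit is open, instead of waiting on a request that is likely to fail.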
…solvers If we have the resolution in the cache, and we fail to fetch new data from the resolvers, return the cached resolution data. This data will expire based on the ARNS_CACHE_TTL_SECONDS config variable.
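A rough sketch of this fallback, under stated assumptions: `resolveWithFallback` is a hypothetical name, the `NameResolution` shape is assumed, and a `Map` stands in for the real cache, so cache-level expiry driven by ARNS_CACHE_TTL_SECONDS is not modeled.

```typescript
// Assumed shape for illustration only.
interface NameResolution {
  resolvedId: string | undefined;
  ttl: number; // seconds
}

// Try to fetch fresh data; if that fails, return whatever the cache still holds.
async function resolveWithFallback(
  name: string,
  cache: Map<string, NameResolution>,
  resolve: (name: string) => Promise<NameResolution>,
): Promise<NameResolution | undefined> {
  try {
    const fresh = await resolve(name);
    if (fresh.resolvedId !== undefined) {
      cache.set(name, fresh);
      return fresh;
    }
  } catch (error) {
    console.warn(`Unable to resolve ${name}, falling back to cache`, error);
  }
  // Possibly stale, but preferable to failing while the entry is still cached.
  return cache.get(name);
}
```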
With on-demand resolution, we no longer need to support the arns-resolver. We can fetch and cache arns resolutions quickly via AO and gateways. In addition to removing the `standalone-arns-resolver` implementation and service from docker, this PR replaces the default `MemoryArNSCache` with a configurable KvBufferStore that supports `redis` or a local `node-cache`. When resolving an arns name, the `CompositeArNSResolver` checks the provided cache and the TTL of the record. If the name is not in the cache, it uses the available arns resolvers (on-demand and/or another gateway) to get resolution data. An operator who wants to disable caching of arns names and always resolve to the latest can set ARNS_CACHE_TTL_SECONDS to 0. Additionally, prometheus metrics are available for the hit/miss rate of the arns cache and for total resolution times.
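To make the buffer-based cache idea concrete, here is a small sketch of a key-value buffer store and the JSON round trip used when caching a resolution. The interface below is an assumption for illustration and is not copied from the repository; only the `Buffer.from(JSON.stringify(resolution))` serialization and the existence of a `close()` call are taken from the diff excerpts in this thread. `InMemoryKvBufferStore`, `cacheResolution`, and `readResolution` are hypothetical names.

```typescript
import { Buffer } from 'node:buffer';

// Assumed interface; a redis-backed or node-cache-backed store would
// implement the same methods.
interface KvBufferStore {
  get(key: string): Promise<Buffer | undefined>;
  set(key: string, value: Buffer): Promise<void>;
  del(key: string): Promise<void>;
  close(): Promise<void>;
}

// Map-backed stand-in used here instead of redis or node-cache.
class InMemoryKvBufferStore implements KvBufferStore {
  private data = new Map<string, Buffer>();

  async get(key: string): Promise<Buffer | undefined> {
    return this.data.get(key);
  }
  async set(key: string, value: Buffer): Promise<void> {
    this.data.set(key, value);
  }
  async del(key: string): Promise<void> {
    this.data.delete(key);
  }
  async close(): Promise<void> {
    this.data.clear();
  }
}

interface StoredResolution {
  resolvedId: string | undefined;
  ttl: number;
}

// Serialize the resolution as JSON bytes, as in the diff excerpt above.
async function cacheResolution(
  store: KvBufferStore,
  name: string,
  resolution: StoredResolution,
): Promise<void> {
  await store.set(name, Buffer.from(JSON.stringify(resolution)));
}

// Read the bytes back and parse them; a missing key is a cache miss.
async function readResolution(
  store: KvBufferStore,
  name: string,
): Promise<StoredResolution | undefined> {
  const raw = await store.get(name);
  if (raw === undefined) return undefined;
  return JSON.parse(raw.toString('utf8')) as StoredResolution;
}
```

Storing opaque buffers keeps the store interface independent of the resolution shape, which is what allows the backing store to be swapped by configuration.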
Logs
Metrics