follow trust-dns to its new name: hickory #5912

Merged
merged 17 commits into from
Aug 15, 2024

@ahl
Contributor

ahl commented Jun 18, 2024

obviates #4439 (if this works)

@smklein
Collaborator

smklein commented Jun 18, 2024

As a heads-up, when I wrote qorb, I just started using hickory from the get-go (see: #5876)

I ran into some issues specifically within the Hickory DNS client resolver. Although hickory DNS doesn't take any slog logs as arguments, it does use tracing, and when I manually added a tracing subscriber I got more info (this is a big motivation for me writing RFD 489).

Anyway, when hickory DNS clients made requests through qorb, I saw a bunch of tracing messages that looked kinda like:

WARN trust_dns_proto::udp::udp_client_stream: dropped malformed message waiting for id: 21579 err: unexpected end of input reached

This seemed to match some of the symptoms described by hickory-dns/hickory-dns#2140, which were triggered by an upgrade across the 0.22 -> 0.23 boundary. When I enabled edns, I stopped seeing the "unexpected end of input reached" messages.

I still would like to dig into the underlying root cause more here, but if you see failing DNS client requests with this upgrade, hopefully this can be a useful trail-of-breadcrumbs.

@smklein
Collaborator

smklein commented Aug 6, 2024

From the logs, I'm seeing the following from the internal-dns logs:

2024-08-06T00:00:00.384Z	ERRO	dns-server (dns): failed to handle incoming DNS message: SERVFAIL: server is not authoritative for name: "oxz_cockroachdb_49283f4a-b51b-4b52-8a00-dda887133849."
    peer_addr = [fd00:1122:3344:101::4]:44059
    req_id = 952b5a26-0a7a-419d-b88c-b6047b03ce2c

It also appears CockroachDB initialization isn't completing, based on this error

@ahl
Contributor Author

ahl commented Aug 13, 2024

> From the logs, I'm seeing the following from the internal-dns logs:
>
> 2024-08-06T00:00:00.384Z	ERRO	dns-server (dns): failed to handle incoming DNS message: SERVFAIL: server is not authoritative for name: "oxz_cockroachdb_49283f4a-b51b-4b52-8a00-dda887133849."
>     peer_addr = [fd00:1122:3344:101::4]:44059
>     req_id = 952b5a26-0a7a-419d-b88c-b6047b03ce2c
>
> It also appears CockroachDB initialization isn't completing, based on this error

To follow up on this: I see this same error in successful builds. I also homed in on a similar failure for _nexus._tcp.control-plane.oxide.internal., but I see that in some CI runs as well.

I'm trying to get some tracing information out of hickory dns.

@ahl
Contributor Author

ahl commented Aug 13, 2024

I see messages coming into the internal DNS server:

1032	2024-08-13T22:10:31.961Z	DEBG	dns-server (dns): message_request
    mr = MessageRequest {\n    header: Header {\n        id: 56913,\n        message_type: Query,\n        op_code: Query,\n        authoritative: false,\n        truncation: false,\n        recursion_desired: true,\n        recursion_available: false,\n        authentic_data: false,\n        checking_disabled: false,\n        response_code: NoError,\n        query_count: 1,\n        answer_count: 0,\n        name_server_count: 0,\n        additional_count: 0,\n    },\n    query: WireQuery {\n        query: LowerQuery {\n            name: LowerName(\n                Name("_cockroach._tcp.control-plane.oxide.internal."),\n            ),\n            original: Query {\n                name: Name("_cockroach._tcp.control-plane.oxide.internal."),\n                query_type: SRV,\n                query_class: IN,\n            },\n        },\n        original: [\n            10,\n            95,\n            99,\n            111,\n            99,\n            107,\n            114,\n            111,\n            97,\n            99,\n            104,\n            4,\n            95,\n            116,\n            99,\n            112,\n            13,\n            99,\n            111,\n            110,\n            116,\n            114,\n            111,\n            108,\n            45,\n            112,\n            108,\n            97,\n            110,\n            101,\n            5,\n            111,\n            120,\n            105,\n            100,\n            101,\n            8,\n            105,\n            110,\n            116,\n            101,\n            114,\n            110,\n            97,\n            108,\n            0,\n            0,\n            33,\n            0,\n            1,\n        ],\n    },\n    answers: [],\n    name_servers: [],\n    additionals: [],\n    sig0: [],\n    edns: None,\n}
    peer_addr = [fd00:1122:3344:101::6]:49500
    req_id = 958e5f1f-eb9e-4c34-818f-173b527978ac

But the corresponding request times out:

27	2024-08-13T22:11:01.968Z	WARN	dnswait: DNS query failed; will try again
    delay = 818.849846ms
    error = request timed out

Note that this crdb zone is fd00:1122:3344:101::6, which matches the peer_addr in the message_request log above.
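The `original` byte array in the message_request log above is the question as it appeared on the wire: length-prefixed labels, a zero byte for the root, then a 16-bit type and a 16-bit class. A minimal sketch of decoding it, using only the standard library, follows. The names `Question` and `decode_question` are made up for illustration and are not part of hickory or omicron; the sketch assumes an uncompressed name, which is what the log shows.

```rust
/// A decoded DNS question: dotted name, numeric type, numeric class.
#[derive(Debug, PartialEq)]
pub struct Question {
    pub name: String,
    pub qtype: u16,
    pub qclass: u16,
}

/// Decode an uncompressed wire-format question (RFC 1035, section 4.1.2).
/// Returns None if the input is cut short or uses name compression.
pub fn decode_question(buf: &[u8]) -> Option<Question> {
    let mut pos = 0;
    let mut name = String::new();
    loop {
        let len = *buf.get(pos)? as usize;
        pos += 1;
        if len == 0 {
            break; // the zero-length root label ends the name
        }
        if len > 63 {
            return None; // compression pointer or invalid label length
        }
        let label = buf.get(pos..pos + len)?;
        name.push_str(&String::from_utf8_lossy(label));
        name.push('.');
        pos += len;
    }
    if name.is_empty() {
        name.push('.'); // the root name itself
    }
    // After the name come QTYPE and QCLASS, both big-endian u16.
    let tail = buf.get(pos..pos + 4)?;
    Some(Question {
        name,
        qtype: u16::from_be_bytes([tail[0], tail[1]]),
        qclass: u16::from_be_bytes([tail[2], tail[3]]),
    })
}
```

Fed the 50 bytes from the log, this yields the name `_cockroach._tcp.control-plane.oxide.internal.` with type 33 (SRV) and class 1 (IN), which agrees with the `query_type: SRV, query_class: IN` fields printed alongside it.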

@davepacheco
Collaborator

I finally put this up on a4x2 so we could debug the helios-deploy failure interactively. The problem readily reproduced: the system got stuck bringing up the CockroachDB zones:

root@g0:~# zoneadm list
global
oxz_switch
oxz_internal_dns_fd3abde8-1f2c-44bf-83bc-1a6479524260
oxz_ntp_56c2fb6f-add4-4325-9ef3-c0bd6c35de1a
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65
oxz_cockroachdb_b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1
root@g0:~# zlogin oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65
[Connected to zone 'oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65' pts/6]
The illumos Project     helios-2.0.22694        May 2024
root@oxz_cockroachdb_a1ea5264:~# svcs
STATE          STIME    FMRI
...
offline*        2:50:13 svc:/oxide/cockroachdb:default

and it's the same problem we saw in helios-deploy: dnswait is sitting there waiting to get a response from the DNS servers:

root@oxz_cockroachdb_a1ea5264:~# tail -f $(svcs -L cockroachdb) | looker
02:50:43.957Z WARN dnswait: DNS query failed; will try again
    delay = 139.591848ms
    error = request timed out
02:51:14.103Z WARN dnswait: DNS query failed; will try again
    delay = 381.765426ms
    error = request timed out
02:51:44.490Z WARN dnswait: DNS query failed; will try again
    delay = 1.23004047s
    error = request timed out
02:52:15.728Z WARN dnswait: DNS query failed; will try again
    delay = 2.408434624s
    error = request timed out
02:52:48.141Z WARN dnswait: DNS query failed; will try again
    delay = 5.939410988s
    error = request timed out
02:53:24.086Z WARN dnswait: DNS query failed; will try again
    delay = 6.926888473s
    error = request timed out
02:54:01.021Z WARN dnswait: DNS query failed; will try again
    delay = 8.993978337s
    error = request timed out
02:54:40.018Z WARN dnswait: DNS query failed; will try again
    delay = 31.943154409s
    error = request timed out
02:55:41.964Z WARN dnswait: DNS query failed; will try again
    delay = 82.841641439s
    error = request timed out
02:57:34.803Z WARN dnswait: DNS query failed; will try again
    delay = 136.735336465s
    error = request timed out

but the DNS servers are happily reporting receiving and responding to these requests:

root@oxz_internal_dns_fd3abde8:~# tail -n 20 -f $(svcs -L internal_dns) | looker
04:19:12.024Z DEBG dns-server (dns): message_request
    mr = MessageRequest {\n    header: Header {\n        id: 24737,\n        message_type: Query,\n        op_code: Query,\n        authoritative: false,\n        truncation: false,\n        recursion_desired: true,\n        recursion_available: false,\n        authentic_data: false,\n        checking_disabled: false,\n        response_code: NoError,\n        query_count: 1,\n        answer_count: 0,\n        name_server_count: 0,\n        additional_count: 0,\n    },\n    query: WireQuery {\n        query: LowerQuery {\n            name: LowerName(\n                Name("_cockroach._tcp.control-plane.oxide.internal."),\n            ),\n            original: Query {\n                name: Name("_cockroach._tcp.control-plane.oxide.internal."),\n                query_type: SRV,\n                query_class: IN,\n            },\n        },\n        original: [\n            10,\n            95,\n            99,\n            111,\n            99,\n            107,\n            114,\n            111,\n            97,\n            99,\n            104,\n            4,\n            95,\n            116,\n            99,\n            112,\n            13,\n            99,\n            111,\n            110,\n            116,\n            114,\n            111,\n            108,\n            45,\n            112,\n            108,\n            97,\n            110,\n            101,\n            5,\n            111,\n            120,\n            105,\n            100,\n            101,\n            8,\n            105,\n            110,\n            116,\n            101,\n            114,\n            110,\n            97,\n            108,\n            0,\n            0,\n            33,\n            0,\n            1,\n        ],\n    },\n    answers: [],\n    name_servers: [],\n    additionals: [],\n    sig0: [],\n    edns: None,\n}
    peer_addr = [fd00:1122:3344:101::4]:64374
    req_id = 1b3c41d1-5d74-4ce1-9380-10c2f41ae002
zones
zone control-plane.oxide.internal
04:19:12.024Z DEBG dns-server (store): query key
    key = _cockroach._tcp
zones
zone control-plane.oxide.internal
04:19:12.025Z DEBG dns-server (store): query key
    key = 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host
zones
zone control-plane.oxide.internal
04:19:12.025Z DEBG dns-server (store): query key
    key = a1ea5264-563d-40a9-8446-6a32951e5c65.host
zones
zone control-plane.oxide.internal
04:19:12.025Z DEBG dns-server (store): query key
    key = a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host
zones
zone control-plane.oxide.internal
04:19:12.026Z DEBG dns-server (store): query key
    key = b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host
zones
zone control-plane.oxide.internal
04:19:12.026Z DEBG dns-server (store): query key
    key = b6736d88-bffb-4361-9857-c8ac7eab4ab8.host
04:19:12.026Z DEBG dns-server (dns): dns response
    additional_records = [Record { name_labels: Name("17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal"), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(AAAA(fd00:1122:3344:102::4))) }, Record { name_labels: Name("a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal"), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(AAAA(fd00:1122:3344:101::3))) }, Record { name_labels: Name("a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal"), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(AAAA(fd00:1122:3344:103::3))) }, Record { name_labels: Name("b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal"), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(AAAA(fd00:1122:3344:101::4))) }, Record { name_labels: Name("b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal"), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(AAAA(fd00:1122:3344:102::3))) }]
    peer_addr = [fd00:1122:3344:101::4]:64374
    query = LowerQuery { name: LowerName(Name("_cockroach._tcp.control-plane.oxide.internal.")), original: Query { name: Name("_cockroach._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN } }
    records = [Record { name_labels: Name("_cockroach._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 32221, target: Name("17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal") })) }, Record { name_labels: Name("_cockroach._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 32221, target: Name("a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal") })) }, Record { name_labels: Name("_cockroach._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 32221, target: Name("a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal") })) }, Record { name_labels: Name("_cockroach._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 32221, target: Name("b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal") })) }, Record { name_labels: Name("_cockroach._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 32221, target: Name("b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal") })) }]
    req_id = 1b3c41d1-5d74-4ce1-9380-10c2f41ae002

I was immediately able to confirm that dig can query all of the DNS servers in /etc/resolv.conf and get the correct answers:

root@oxz_cockroachdb_a1ea5264:~# cat /etc/resolv.conf 
nameserver fd00:1122:3344:3::1
nameserver fd00:1122:3344:2::1
nameserver fd00:1122:3344:1::1

root@oxz_cockroachdb_a1ea5264:~# awk '$1 == "nameserver"{ print $2 }' /etc/resolv.conf  | while read ip; do echo checking $ip; dig -t SRV _cockroach._tcp.control-plane.oxide.internal. @$ip +short; echo; done
checking fd00:1122:3344:3::1
0 0 32221 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal.
0 0 32221 a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal.
0 0 32221 a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal.
0 0 32221 b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal.
0 0 32221 b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal.

checking fd00:1122:3344:2::1
0 0 32221 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal.
0 0 32221 a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal.
0 0 32221 a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal.
0 0 32221 b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal.
0 0 32221 b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal.

checking fd00:1122:3344:1::1
0 0 32221 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal.
0 0 32221 a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal.
0 0 32221 a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal.
0 0 32221 b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal.
0 0 32221 b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal.
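The awk one-liner above pulls the nameserver addresses out of /etc/resolv.conf before handing each one to dig. The same extraction can be sketched in Rust with only the standard library. `parse_nameservers` is a hypothetical helper written for this note; it is not what hickory's `read_system_conf` does internally, which handles many more directives.

```rust
use std::net::IpAddr;

/// Extract the addresses from `nameserver` lines in resolv.conf-style text.
/// Lines that are not `nameserver <addr>` or whose address does not parse
/// are skipped.
pub fn parse_nameservers(conf: &str) -> Vec<IpAddr> {
    conf.lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next()) {
                (Some("nameserver"), Some(addr)) => addr.parse().ok(),
                _ => None,
            }
        })
        .collect()
}
```

For the three-line resolv.conf shown above, this returns the three addresses in file order, the same order the shell loop checks them in.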

So there's no problem with the networking stack here and the server at least seems to be working. @ahl had already confirmed that the problem was not readily reproducible when locally running the DNS server and dnsadm, which is rather surprising. We wondered if this was an issue only with release builds and did the local test with those, but that didn't reproduce the problem either.

Of course, I was also able to reproduce this running dnswait myself in the zone:

# /opt/oxide/internal-dns-cli/bin/dnswait cockroach 2>&1 | looker
note: configured to log to "/dev/stderr"
03:21:18.442Z INFO dnswait: using system configuration
03:21:48.449Z WARN dnswait: DNS query failed; will try again
    delay = 374.992936ms
    error = request timed out
...

I also used snoop -d oxControlService17 udp port 53 in the CockroachDB zone to verify that the traffic looks like what we'd expect, and it mostly does:

root@oxz_cockroachdb_a1ea5264:~# snoop -d oxControlService17 udp port 53
Using device oxControlService17 (promiscuous mode)
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:3::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:2::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:3::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
fd00:1122:3344:2::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:3::1 DNS C 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.4.4.3.3.2.2.1.1.0.0.d.f.ip6.arpa. IN PTR ?
fd00:1122:3344:3::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R  Error: 2(Server Fail)
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:2::1 DNS C 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.4.4.3.3.2.2.1.1.0.0.d.f.ip6.arpa. IN PTR ?
fd00:1122:3344:2::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R  Error: 2(Server Fail)
...
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:1::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:1::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:3::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:2::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:2::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
fd00:1122:3344:3::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:1::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:1::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:3::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:2::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:3::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
fd00:1122:3344:2::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:1::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:1::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:3::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:2::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:2::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
fd00:1122:3344:3::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:1::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:1::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:3::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local -> fd00:1122:3344:2::1 DNS C _cockroach._tcp.control-plane.oxide.internal. IN SRV ?
fd00:1122:3344:3::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 
fd00:1122:3344:2::1 -> oxz_cockroachdb_a1ea5264-563d-40a9-8446-6a32951e5c65.local DNS R _cockroach._tcp.control-plane.oxide.internal. IN SRV 

In particular, we see the zone querying all three servers and getting responses for the SRV queries.
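The PTR queries in the snoop output (the ones answered with Server Fail) use the ip6.arpa reverse-lookup name for a nameserver address: the 32 hex nibbles of the IPv6 address in reverse order, separated by dots. A small sketch of that mapping is below; `ip6_arpa_name` is a name invented for this example, and the sketch makes no claim about which program issued those queries.

```rust
use std::net::Ipv6Addr;

/// Build the ip6.arpa reverse-lookup name for an IPv6 address:
/// every nibble, least significant first, followed by "ip6.arpa.".
pub fn ip6_arpa_name(addr: Ipv6Addr) -> String {
    let mut name = String::with_capacity(73);
    for &byte in addr.octets().iter().rev() {
        // Within each byte the low nibble comes before the high nibble.
        for nibble in [byte & 0x0f, byte >> 4] {
            name.push(char::from_digit(u32::from(nibble), 16).unwrap());
            name.push('.');
        }
    }
    name.push_str("ip6.arpa.");
    name
}
```

Applied to fd00:1122:3344:3::1, this produces the `1.0.0.0. ... .0.0.d.f.ip6.arpa.` name seen in the capture. The internal DNS servers presumably answer SERVFAIL because they are not authoritative for that zone, which would match the "server is not authoritative" errors logged earlier in this thread.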

Around this point I noticed this message in the dnswait output:

03:21:18.442Z INFO dnswait: using system configuration

and wondered if that could be related. @ahl immediately noticed that he had enabled eDNS in our internal DNS resolver constructor that accepts a specific list of (our) resolvers:
https://github.com/oxidecomputer/omicron/pull/5912/files#diff-b1212398b8c6caf455365ebfb06b4347121ff72a555070b6221545961e43deeaR60

but had not changed the path used by dnswait when loading the system configuration. (That change is not trivial -- see 10b92eb for the fix.) With the edns line removed from the first constructor, the problem became readily reproducible locally. (@ahl did I have that right?)

Wondering if eDNS was on the scene, I went back to the full dig output:

root@oxz_cockroachdb_a1ea5264:~# dig -t SRV _cockroach._tcp.control-plane.oxide.internal. @fd00:1122:3344:3::1 

; <<>> DiG 9.18.14 <<>> -t SRV _cockroach._tcp.control-plane.oxide.internal. @fd00:1122:3344:3::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54500
;; flags: qr rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.internal. IN SRV

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal.

;; ADDITIONAL SECTION:
17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:102::4
a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:101::3
a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:103::3
b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:101::4
b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:102::3

;; Query time: 6 msec
;; SERVER: fd00:1122:3344:3::1#53(fd00:1122:3344:3::1) (UDP)
;; WHEN: Thu Aug 15 04:28:26 UTC 2024
;; MSG SIZE  rcvd: 652

I do not see EDNS here. When using EDNS, dig prints something like this for the query part:

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096

and this in the response:

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232

At the same time, the "rcvd: 652" above shows the packet received was 652 bytes. RFC 1035 limits normal UDP (non-EDNS) DNS messages to 512 bytes. I'm not sure this is definitely wrong, but at best it doesn't seem sound to expect clients to handle this well. I filed #6342 for this.
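To make the size rule concrete, here is a minimal sketch of the check a server could apply before sending a UDP response. It assumes RFC 1035's 512-byte limit for non-EDNS messages and RFC 6891's rule that an advertised payload size below 512 is treated as 512. The function names are hypothetical and this is not code from our DNS server.

```rust
/// Classic DNS-over-UDP message limit (RFC 1035).
pub const CLASSIC_UDP_LIMIT: usize = 512;

/// Largest UDP response the client has said it can accept.
/// `advertised` is the UDP payload size from the query's EDNS OPT record,
/// or None when the query carried no OPT record.
pub fn udp_response_limit(advertised: Option<u16>) -> usize {
    match advertised {
        // RFC 6891: values lower than 512 are treated as 512.
        Some(size) => usize::from(size).max(CLASSIC_UDP_LIMIT),
        None => CLASSIC_UDP_LIMIT,
    }
}

/// True when a response of `len` bytes exceeds what the client accepts,
/// so the server would be expected to shorten it and set the TC bit.
pub fn needs_truncation(len: usize, advertised: Option<u16>) -> bool {
    len > udp_response_limit(advertised)
}
```

Under these assumptions, `needs_truncation(652, None)` is true: the 652-byte response is over the limit for a client that did not use EDNS, whereas a client advertising 1232 bytes could take it whole.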

I also saved a complete packet capture of the DNS traffic for a few queries and responses. It's not notable except that Wireshark also reports nothing about EDNS being used.

I have not verified any of the following but here's my best guess about what was happening:

  • Our server completely ignores all of this: it doesn't claim to support EDNS, nor does it truncate its responses at 512 bytes like it's supposed to. It just sends big packets out. I think this because I don't see any code to handle any of this, plus we see the large, not-truncated response in dig.
  • dig is not strict on the receiving side and prints whatever it got, so it works by accident with our server.
  • The new hickory-dns is accidentally strict on the receiving side. Without edns enabled, it either has a buffer that's too small and thinks it got back garbage or it explicitly drops packets that were too large. So we see this failure. I'm inferring this because setting edns changes the behavior even though we're not using edns.

@davepacheco
Collaborator

I just learned about dig +qr:

root@oxz_cockroachdb_a1ea5264:~# dig +qr -t SRV _cockroach._tcp.control-plane.oxide.internal. @fd00:1122:3344:3::1

; <<>> DiG 9.18.14 <<>> +qr -t SRV _cockroach._tcp.control-plane.oxide.internal. @fd00:1122:3344:3::1
;; global options: +cmd
;; Sending:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51079
;; flags: rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 0d3addd966bdb5e2
;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.internal. IN SRV

;; QUERY SIZE: 85

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51079
;; flags: qr rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.internal. IN SRV

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal.

;; ADDITIONAL SECTION:
17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:102::4
a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:101::3
a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:103::3
b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:101::4
b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:102::3

;; Query time: 2 msec
;; SERVER: fd00:1122:3344:3::1#53(fd00:1122:3344:3::1) (UDP)
;; WHEN: Thu Aug 15 04:39:22 UTC 2024
;; MSG SIZE  rcvd: 652

If I'm reading this right, dig actually is sending its query with EDNS and advertising a max response size of 1232 bytes. So maybe the server is doing something reasonable for dig? I'm not sure if it's supposed to have sent EDNS information in the response. But also, if I use +bufsize=400 to make the max response size smaller (as advertised by the client), I can see that it successfully changed the message from client to server, but the server still sent too much back:

root@oxz_cockroachdb_a1ea5264:~# dig +qr -t SRV _cockroach._tcp.control-plane.oxide.internal. @fd00:1122:3344:3::1 +bufsize=400

; <<>> DiG 9.18.14 <<>> +qr -t SRV _cockroach._tcp.control-plane.oxide.internal. @fd00:1122:3344:3::1 +bufsize=400
;; global options: +cmd
;; Sending:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59109
;; flags: rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 400
; COOKIE: c99f7affaee9c498
;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.internal. IN SRV

;; QUERY SIZE: 85

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59109
;; flags: qr rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.internal. IN SRV

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal.

;; ADDITIONAL SECTION:
17e75e8c-57ca-4fd9-8011-614bc2e72c98.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:102::4
a1ea5264-563d-40a9-8446-6a32951e5c65.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:101::3
a41a5a55-a1b1-47c4-8271-9d0355e9d65e.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:103::3
b54a4d35-d0dc-4bd4-8bc1-032809ebb6d1.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:101::4
b6736d88-bffb-4361-9857-c8ac7eab4ab8.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:102::3

;; Query time: 2 msec
;; SERVER: fd00:1122:3344:3::1#53(fd00:1122:3344:3::1) (UDP)
;; WHEN: Thu Aug 15 04:41:13 UTC 2024
;; MSG SIZE  rcvd: 652
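The "ADDITIONAL: 1" and "OPT PSEUDOSECTION" in the Sending part of the dig output are what EDNS looks like on the wire: one extra pseudo-record whose class field carries the advertised UDP payload size. The sketch below builds such a query by hand with the standard library only. `build_query` is a hypothetical helper for illustration, not hickory's encoder, and it omits the cookie option and AD flag that dig adds.

```rust
/// Build a DNS query for `name` (dotted ASCII). When `edns_udp_size` is
/// Some, append an EDNS0 OPT pseudo-record advertising that payload size.
pub fn build_query(id: u16, name: &str, qtype: u16, edns_udp_size: Option<u16>) -> Vec<u8> {
    let mut msg = Vec::new();
    // Header: ID, flags (only RD set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    msg.extend_from_slice(&id.to_be_bytes());
    msg.extend_from_slice(&0x0100u16.to_be_bytes());
    msg.extend_from_slice(&1u16.to_be_bytes());
    msg.extend_from_slice(&0u16.to_be_bytes());
    msg.extend_from_slice(&0u16.to_be_bytes());
    let arcount: u16 = if edns_udp_size.is_some() { 1 } else { 0 };
    msg.extend_from_slice(&arcount.to_be_bytes());
    // Question: length-prefixed labels, root label, QTYPE, QCLASS = IN.
    for label in name.split('.').filter(|l| !l.is_empty()) {
        msg.push(label.len() as u8); // assumes each label is at most 63 bytes
        msg.extend_from_slice(label.as_bytes());
    }
    msg.push(0);
    msg.extend_from_slice(&qtype.to_be_bytes());
    msg.extend_from_slice(&1u16.to_be_bytes());
    // OPT pseudo-record (RFC 6891): 11 bytes when it carries no options.
    if let Some(size) = edns_udp_size {
        msg.push(0); // owner name: root
        msg.extend_from_slice(&41u16.to_be_bytes()); // TYPE = OPT
        msg.extend_from_slice(&size.to_be_bytes()); // CLASS = UDP payload size
        msg.extend_from_slice(&0u32.to_be_bytes()); // TTL = ext. rcode, version 0, flags
        msg.extend_from_slice(&0u16.to_be_bytes()); // RDLEN = 0, no options
    }
    msg
}
```

For the `_cockroach._tcp.control-plane.oxide.internal.` SRV query this gives 62 bytes without EDNS and 73 bytes with it. Adding dig's cookie option (a 4-byte option header plus the 8-byte client cookie) accounts for the "QUERY SIZE: 85" shown above.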

Comment on lines 55 to 66
    /// Construct a new DNS resolver from the system configuration.
    pub fn new_from_system_conf(
        log: slog::Logger,
    ) -> Result<Self, ResolveError> {
        let (rc, mut opts) = hickory_resolver::system_conf::read_system_conf()?;
        opts.edns0 = true;

        let resolver = TokioAsyncResolver::tokio(rc, opts);

        Ok(Self { log, resolver })
    }

Contributor Author

this is one of the few things that I think is "new"; @davepacheco (or others) please let me know if you think there's a better approach

Comment on lines +385 to +387
let mut resolver_opts = ResolverOpts::default();
// Enable edns for potentially larger records
resolver_opts.edns0 = true;
Collaborator

This block repeats a lot and the implications have been subtle -- I just wonder if we can/should put this into a common helper with more of an explanation.

@ahl ahl merged commit 66ac7b3 into main Aug 15, 2024
24 checks passed