Confusing AXFR errors #1624

kalfeher · 2025-01-11T01:15:05Z

Description

When attempting an xfr from zonedata.iis.se, for se. using a non-tsig signed transfer I see the following error consistently:
dns: overflowing header size

When attempting an xfr from zonedata.iis.se, for nu. using a non-tsig signed transfer I see the following two errors occasionally:
dns: overflow unpacking uint16
dns: buffer size too small
However the transfer attempt for nu. will also work as expected about half of the time.

I'm using just the high level APIs for a simple transfer. AXFR via dig @zonedata.iis.se axfr se works as expected.

Steps to reproduce

func main() {
	name := "se."
	primary := "zonedata.iis.se:53"
	t := new(dns.Transfer)
	m := new(dns.Msg)
	m.SetAxfr(name)
	c, err := t.In(m, primary)
	if err != nil {
		fmt.Println(err)
		return
	}
	for r := range c {
		if r.Error != nil {
			fmt.Println(r.Error)
			return
		}
		for _, a := range r.RR {
			fmt.Println(a.String() + "\n")
		}
	}
}

Expected Behaviour

All records received for zone file (~8106000 RRs), finishing with the SOA RR.

Actual Behaviour

For SE zone I receive the error dns: overflowing header size at some point during the transfer. Number of records received before the error is not consistent.

For NU zone I receive the one of the following errors at some point during the transfer dns: overflow unpacking uint16 or dns: buffer size too small. As with the SE zone, the number of records received before the error is not consistent. Approx half of my transfer attempts succeed for the NU zone.

The text was updated successfully, but these errors were encountered:

miekg · 2025-01-24T15:07:33Z

thanks for the repro. I just ran that on my laptop , without seeing any error.

Was some record removed from zone?

kalfeher · 2025-01-25T00:18:36Z

Unclear if changes have been made to the zone or platform. I reported the issue to IIS who shared internally, but I've not heard back.
The symptoms remain for me. My pcaps indicate that the records restart at some point (sometimes after only a few hundred RRs other times after a few thousand), which results a TCP segment that is never received. Then it's just DUP and retransmissions and eventually the connection fails. By restart, I mean the SOA is re-sent and RRs begin again from the top of the zone.
I suspect this is related to some special behaviour of the zonedata.iis.se host, particularly over high latency connections. I'm fetching from Australia.
I see the same symptoms in pcaps using dig, but the connection remains resilient and eventually the axfr completes.

As far as this DNS library is concerned, I'm trying to understand the errors in order to be more resilient to the behaviour of the host.

miekg · 2025-01-29T19:07:59Z

[ Quoting ***@***.***> in "Re: [miekg/dns] Confusing AXFR erro..." ]

Unclear if changes have been made to the zone or platform. I reported the issue to IIS who shared internally, but I've not heard back. The symptoms remain for me. My pcaps indicate that the records restart at some point (sometimes after only a few hundred RRs other times after a few thousand), which results a TCP segment that is never received. Then it's just DUP and retransmissions and eventually the connection fails. By restart, I mean the SOA is re-sent and RRs begin again from the top of the zone. I suspect this is related to some special behaviour of the zonedata.iis.se host, particularly over high latency connections. I'm fetching from Australia. I see the same symptoms in pcaps using dig, but the connection remains resilient and eventually the axfr completes. As far as this DNS library is concerned, I'm trying to understand the errors in order to be more resilient to the behaviour of the host.

Happy to include fixes need here. But resend tcp segments, or not something this lib should see (i think).... so that would hint at deeper golang problems? (minimal tcpdump would help here I guess?) Cheers!

kalfeher · 2025-02-03T05:28:19Z

se_cap_golang-filtered-01.pcap.gz

I've attached a pcap starting a few records before the first malformed RR is seen. The library returns dns: overflowing header size at the end.

miekg · 2025-02-22T12:20:56Z

after loading it up in wireshark, after the TCP stuff, the first DNS seen according to ws is:

124 0.004490 13.49.252.56 10.0.7.154 DNS 16978 Unknown operation (15) 0x65cd[Malformed Packet], Unknown operation (3) response 0x7f1c Unknown error (15)[Malformed Packet]

... and looking at ws this is indeed completely broken packet.

Then until the next red block, no valid DNS packets are seen.

That would mean we could do better and erroring when seeing:

Frame 124: 16978 bytes on wire (135824 bits), 16978 bytes captured (135824 bits)
Ethernet II, Src: 02:7c:c0:b7:c6:d0 (02:7c:c0:b7:c6:d0), Dst: MS-NLB-PhysServer-32_0f:66:0b:89:5f (02:2f:66:0b:89:5f)
Internet Protocol Version 4, Src: 13.49.252.56, Dst: 10.0.7.154
Transmission Control Protocol, Src Port: 53, Dst Port: 36464, Seq: 1227329, Ack: 1, Len: 16912
[7 Reassembled TCP Segments (57118 bytes): #111(3), #119(3624), #120(15704), #121(16912), #122(13288), #123(3624), #124(3963)]
Domain Name System (query)
[Malformed Packet: DNS]
    [Expert Info (Error/Malformed): Malformed Packet (Exception occurred)]
        [Malformed Packet (Exception occurred)]
        [Severity level: Error]
        [Group: Malformed]
Domain Name System (response)
    Length: 3111
    Transaction ID: 0x7f1c
    Flags: 0x9c0f Unknown operation response, Unknown error
        1... .... .... .... = Response: Message is a response
        .001 1... .... .... = Opcode: Unknown (3)
        .... .1.. .... .... = Authoritative: Server is an authority for domain
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...0 .... .... = Recursion desired: Don't do query recursively
        .... .... 0... .... = Recursion available: Server can't do recursive queries
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
        .... .... ...0 .... = Non-authenticated data: Unacceptable
        .... .... .... 1111 = Reply code: Unknown (15)
    Questions: 37926
    Answer RRs: 47346
    Authority RRs: 64511
    Additional RRs: 49665
    Queries
[Malformed Packet: DNS]
    [Expert Info (Error/Malformed): Malformed Packet (Exception occurred)]
        [Malformed Packet (Exception occurred)]
        [Severity level: Error]
        [Group: Malformed]

Which would a better opcode check or something?

miekg · 2025-02-22T12:44:04Z

what does this do?

diff --git a/msg.go b/msg.go
index 5fa7f9e8..ead1f5f4 100644
--- a/msg.go
+++ b/msg.go
@@ -885,6 +885,13 @@ func (dns *Msg) Unpack(msg []byte) (err error) {
        }

        dns.setHdr(dh)
+       if _, ok := OpcodeToString[dns.Opcode]; !ok {
+               return fmt.Errorf("bad opcode %d", dns.Opcode)
+       }
+       if _, ok := RcodeToString[dns.Rcode]; !ok {
+               return fmt.Errorf("bad rcode %d", dns.Rcode)
+       }
+
        return dns.unpack(dh, msg, off)
 }

miekg · 2025-02-23T07:20:54Z

hmmm this will give you a saner error, but is not the root cause.

This may give more hints:

diff --git a/msg_helpers.go b/msg_helpers.go
index acec21f7..0c189bdc 100644
--- a/msg_helpers.go
+++ b/msg_helpers.go
@@ -5,6 +5,7 @@ import (
        "encoding/base64"
        "encoding/binary"
        "encoding/hex"
+       "fmt"
        "net"
        "sort"
        "strings"
@@ -134,7 +135,7 @@ func (hdr RR_Header) packHeader(msg []byte, off int, compression compressionMap,
 func truncateMsgFromRdlength(msg []byte, off int, rdlength uint16) (truncmsg []byte, err error) {
        lenrd := off + int(rdlength)
        if lenrd > len(msg) {
-               return msg, &Error{err: "overflowing header size"}
+               return msg, &Error{err: fmt.Sprintf("overflowing header size, got size %d, but msg is %d", lenrd, len(msg))}
        }
        return msg[:lenrd], nil
 }

kalfeher · 2025-02-24T00:48:38Z

That second error message provides more helpful information. The changes to the Unpack function did not result in any difference. The packets I received never triggered those additional error conditions in the Unpack function.
With both changes applied I saw the following errors:

dns: overflow unpacking uint32
dns: overflow unpacking uint16
dns: overflowing header size, got size 9620, but msg is 9577
dns: overflowing header size, got size 9387, but msg is 9264

The message in truncateMsgFromRdlength is certainly more informative now.

miekg · 2025-02-24T09:07:23Z

what happens if you dont error here, but just return, i.e.

if lenrd > len(msg) {
     return msg
}

obv, the first 2 error from your comment, are leading up to the last two, so it might make sense to instrument those parts well. I'll check and post another diff.

I'm more confused that dig continues unfazed when looking at that tcp dump TBH

kalfeher · 2025-02-24T11:27:12Z

Changing the code to

if lenrd > len(msg) {
    return msg, nil
}

resulted in these errors being returned:
dns: overflow unpacking uint16
dns: bad rdlength
dns: bad rdlength
dns: buffer size too small
dns: bad rdlength

I've done more testing with dig and it's behaviour confuses me as well. When I dig @zonedata.iis.se axfr se from that host instead of ceasing the transfer when packet errors occur, the transfer is re-started. The SOA RR is re-sent and the zone file contents will start from the top again. This can happen once or twice during an AXFR.

AFAIK this is not behaviour that is part of the protocol, but is probably the result of server logic. The axfr will complete, although I'll have duplicate or triplicate RRs. So the connection stays resilient to malformed packets, but I don't end up with a clean version of the zone. IMO this is probably not something that makes sense to copy. But maybe this is a question for DNSOP as to what should the correct client behaviour be?

My personal opinion is that the right thing to do would be to drop the connection and return a clear error

miekg · 2025-02-24T14:27:14Z

restarting the axfr is indeed not specified, although you can add it to the code. I think godns is probably right by calling this transfer broken, some part is not being (re)send and this messes up the (length) counters and there is no mechanism to fix that in the DNS. It's hard to distill something sane, error wise, from that situation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confusing AXFR errors #1624

Confusing AXFR errors #1624

kalfeher commented Jan 11, 2025

miekg commented Jan 24, 2025

kalfeher commented Jan 25, 2025

miekg commented Jan 29, 2025 via email

kalfeher commented Feb 3, 2025

miekg commented Feb 22, 2025

miekg commented Feb 22, 2025

miekg commented Feb 23, 2025 •

edited

Loading

kalfeher commented Feb 24, 2025

miekg commented Feb 24, 2025

kalfeher commented Feb 24, 2025 •

edited

Loading

miekg commented Feb 24, 2025

Confusing AXFR errors #1624

Confusing AXFR errors #1624

Comments

kalfeher commented Jan 11, 2025

Description

Steps to reproduce

Expected Behaviour

Actual Behaviour

miekg commented Jan 24, 2025

kalfeher commented Jan 25, 2025

miekg commented Jan 29, 2025 via email

kalfeher commented Feb 3, 2025

miekg commented Feb 22, 2025

miekg commented Feb 22, 2025

miekg commented Feb 23, 2025 • edited Loading

kalfeher commented Feb 24, 2025

miekg commented Feb 24, 2025

kalfeher commented Feb 24, 2025 • edited Loading

miekg commented Feb 24, 2025

miekg commented Feb 23, 2025 •

edited

Loading

kalfeher commented Feb 24, 2025 •

edited

Loading