Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - New DialerGroupConfig interface for modules to given greater control over connection (L4/TLS/etc) establishment #506

Draft
wants to merge 43 commits into
base: master
Choose a base branch
from

Conversation

phillip-stephens
Copy link
Contributor

Description and Motivation

Some users of ZGrab modules, especially those who are doing scanning of L7 hosts as part of a larger scanning pipeline, desire having more control over the connections ZGrab establishes/uses in order to scan.

This PR intends to modify the interface between protocol modules and the ZGrab framework in order to give users the ability to have full control over ZGrab's connections. This could include:

  • setting the source IP/Port for Layer 4 connections on TCP/UDP
  • Use a pre-existing connection for scanning a given target
  • Upon a successful TLS handshake with a target, to update some telemetry with an outside service of the state of this handshake.

In sum, we want to give the user the ability to take full control of establishing a connection to a scanned host. We also want all existing functionality of ZGrab as a CLI tool to remain unaffected.

Protocol Differences in needs for a Connection

First, let's delineate between 2 classes of protocol.

Transport-Agnostic

The majority of protocols are transport-agnostic and can run on any L4 connection or a TLS-wrapped connection. There may be standard norms (HTTPS on TCP port 443, NTP on port 123 over UDP, etc) but the protocol will work over any if both sides agree.

Non-Transport-Agnostic

Some L7 protocols have certain demands of the underlying transport. For example, HTTPS needs to be able to follow re-directs to HTTP sites and vice-versa. The HTTP module will usually require both a L4 connection and a TLS-wrapped connection to be able to follow both of these types of redirects.

There are other protocols as well, some (SMTP for ex.) want to establish an L4 connection, send a STARTTLS packet, and then proceed with a TLS handshake. We could not just hand this module a pre-existing, post-handshake TLS connection and expect the protocol to work.

Dialer Groups

In order to accomplish this in a clean abstraction, we've settled on the idea of a DialerGroup that is taken as input in every call to Scan(). Modules will provide a "default" dialer group setting the dialers to be as they'd commonly expect. Transport-agnostic modules will set the TransportAgnosticDialer to their default L4/TLS settings, minding any flags the user sets.

For example, NTP would set the dialer to be a UDP dialer connecting to port 123, by default.

From module.go:

type DialerGroup struct {
	// ProtocolAgnosticDialer should be used by most modules that do not need control over the transport layer.
	// It abstracts the underlying transport protocol so a module can  deal with just the L7 logic. Any protocol that
	// doesn't need to know about the underlying transport should use this.
	// If the transport is a TLS connection, the dialer should provide a zgrab2.TLSConnection so the underlying log can be
	// accessed.
	ProtocolAgnosticDialer func(ctx context.Context, target *ScanTarget) (net.Conn, error)
	// L4Dialer will be used by any module that needs to have a TCP/UDP connection. Think of following a redirect to an
	// http:// server, or a module that needs to start with a TCP connection and then upgrade to TLS as part of the protocol.
	// The layered function is needed since we set DialerGroups at Scanner.Init, but modules like HTTP will modify the
	// Dialer based on the target.
	L4Dialer func(target *ScanTarget) func(ctx context.Context, network, addr string) (net.Conn, error)
	// TLSWrapper is a function that takes an existing net.Conn and upgrades it to a TLS connection. This is useful for
	// modules that need to start with a TCP connection and then upgrade to TLS later as part of the protocol.
	// Cannot be used for QUIC connections, as QUIC connections are not "upgraded" from a L4 connection
	TLSWrapper func(ctx context.Context, target *ScanTarget, l4Conn net.Conn) (*TLSConnection, error)

Protocols like HTTP will instead use the L4Dialer and TLSWrapper to specify the two types of connections they require.

Users that require full control over connections can set these dialers to whichever they wish and the Scan() function will utilize them as specified.

The new Scan() function looks as follows:

	// Scan connects to a host. The result should be JSON-serializable
	Scan(ctx context.Context, t *ScanTarget, dialer *DialerGroup) (ScanStatus, any, error)

DialerGroupConfig

Rather than having module authors define dialer groups, we'll have modules define a DialerGroupConfig that encapsulates what a module needs from its dialer.

// DialerGroupConfig lets modules communicate what they'd need in a dialer group
type DialerGroupConfig struct {
	// TransportProtocol is the L4 transport the module needs.
	L4TransportProtocol TransportProtocol
	// NeedSeparateL4Dialer indicates whether the module needs a dedicated L4 dialer in its DialerGroup.
	// Some modules' protocols need to send some command (STARTTLS) after L4 connection and before TLS handshake.
	// Others like http need to follow redirects to https:// and http:// servers, which requires both a TLS and L4 (TCP) conn.
	// If this is true, the framework will ensure dialerGroup.L4Dialer is set.
	// If false, the framework will set the TransportAgnosticDialer
	NeedSeparateL4Dialer bool
	BaseFlags            *BaseFlags
	// TLSEnabled indicates whether the module needs a TLS connection. The behavior depends on if NeedsL4Dialer is true.
	// If NeedsL4Dialer is true, the framework will set up a TLSWrapper in the DialerGroup so a module can access both.
	// If NeedsL4Dialer is false, the framework will set up a TLS dialer as the TransportAgnosticDialer since the module has indicated it only needs a TLS connection.
	TLSEnabled bool
	TLSFlags   *TLSFlags // must be non-nil if TLSEnabled is true
	UDPFlags   *UDPFlags // must be non-nil if L4TransportProtocol is TransportUDP
}

With this information, the framework can build a default dialer group to be used with standard CLI operation.

How it all comes together

The overall flow will look as follows:

  1. Framework starts, passing the user's scan flags to the module.
  2. During module Init(scanFlags), a scan module will use the user-provided scan flags (entries like what dst port to scan on, TLS flags to control aspects of the TLS connection, etc) to set a DialerGroupConfig, providing the framework with all the info it needs to build the dialers.
  3. Prior to scanning a target, the Framework will call GetDialerGroupConfig to get a module's dialer group config for "default" operation.
  4. The Framework will use this config to build a dialerGroup.
  5. The Framework will then pass these dialers into Scan(ctx context.Context, dialerGroup *DialerGroup, t *ScanTarget).
  6. The module will use whatever dialers are passed into Scan() to connect to the target.

This let's users of the modules themselves inject their own dialers into Scan() in Step 5.

In conclusion, this interface let's modules define a certain "default behavior" as each L7 protocol has differing norms on what connections it usually uses, while also giving developers who use these modules in larger scanning systems the ability to fully control all aspects of a connection.

Testing

WIP

… ensure the target always has the port set correctly
@phillip-stephens phillip-stephens changed the title WIP - New interface for modules to given greater control over connection (L4/TLS/etc) establishment WIP - New DialerGroupConfig interface for modules to given greater control over connection (L4/TLS/etc) establishment Mar 14, 2025
@Seanstoppable
Copy link
Contributor

Tangent/question:
I've been meaning to experiment with allowing a module to upgrade/retry it's connection with TLS, or maybe try TLS and just continue with TCP if the handshake fails
My reasoning being that a number of these protocols actually use TLS on the same port and it seems cheaper to just try to continue with the scan than do a second one.

This probably could just be accomplished with the multiple module, and defining order appropriately, bu maybe this would also make it easier to sub in my own dialergroup to accomplish that?

@@ -165,39 +175,27 @@ func (scanner *Scanner) Protocol() string {
return "amqp091"
}

func (scanner *Scanner) Scan(target zgrab2.ScanTarget) (zgrab2.ScanStatus, any, error) {
conn, err := target.Open(&scanner.config.BaseFlags)
func (scanner *Scanner) GetDialerGroupConfig() *zgrab2.DialerGroupConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be identical in all the classes. Is it really necessary?
Alternately, can we maybe simplify with struct embedding to clean up all the repetition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you with avoiding repetition, but other than the function header, each module will have a unique implementation of the DialerGroupConfig. So I'd argue it's not too much more repetition.

Additionally, we'd lose the assertion that each module will provide this functionality, rather than expecting some struct to be embedded at the implementation level. This approach gives us compile-time checks that a module author has provided what the Framework requires.

scanner.dialerGroupConfig = &zgrab2.DialerGroupConfig{
L4TransportProtocol: zgrab2.TransportTCP,
BaseFlags: &f.BaseFlags,
TLSEnabled: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why force tls vs predicate on UseTLS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I didn't intend to drop this feature from the module. Let me go back and double-check the other modules too.

s.dialerGroupConfig = &zgrab2.DialerGroupConfig{
L4TransportProtocol: zgrab2.TransportTCP,
BaseFlags: &f.BaseFlags,
TLSEnabled: f.UseTLS,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is banner the only one where TLSEnabled is based on a config arg?
See my other comment for why is TLSEnabled true in a lot of modules where something may or may not be over TLS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd figured that in cases where we have a separate L4 Dialer it was fine to set TLSEnabled to true since this just populates the TLSWrapper and then it's up to Scan() to use it or not depending on the TLSEnabled flag.

That being said, I think my approach was more prone to misconfiguration so I've moved the other modules to be similar to Banner.

}
if f.UseTLS {
scanner.dialerGroupConfig.TLSFlags = &f.TLSFlags
scanner.dialerGroupConfig.TLSEnabled = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example where we are setting TLSEnabled slightly differently than banner.
I wonder if we should just have some default "TLS Optional if enabled" dialer group for consistency, and then most of the configuration is in cases where NeedSeparateL4Dialer is true.

@phillip-stephens
Copy link
Contributor Author

Tangent/question: I've been meaning to experiment with allowing a module to upgrade/retry it's connection with TLS, or maybe try TLS and just continue with TCP if the handshake fails My reasoning being that a number of these protocols actually use TLS on the same port and it seems cheaper to just try to continue with the scan than do a second one.

This probably could just be accomplished with the multiple module, and defining order appropriately, bu maybe this would also make it easier to sub in my own dialergroup to accomplish that?

The DialerGroup just gives you a way to control how connections are made individually, not what to do on a failed handshake, etc. Might be worthwhile to add a --retry-with-tls option to the module of interest and have this handled in Scan similar to how the http module does it.

Copy link
Contributor

@silascutler silascutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incredible work.

Copy link
Member

@codyprime codyprime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this, excited for this functionality! I was going to jump into some actual e2e testing, but I had some design thoughts before I do. Apologies in advance if I misunderstood some design elements.

if err != nil {
logrus.Errorf("could not close connection to %v: %v", conn.RemoteAddr(), err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about what led to the need for CloseConnAndHandleError; did you see errors in testing? My concern is this will be very noisy with sufficient scanning volume (e.g. broken connections prior to Close, etc.), and I'm not sure we care to know if the Close failed. Maybe logrus.Debugf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured it was best practice to log any error that occurred since if we silently errored when closing connections, it'd be a lot easier to run out of sockets and just seems generally like a state we'd want to know about. And since all modules need to close connections, thought I'd just write a wrapper for it to make it easier to change 😄.

That being said, I don't particularly have an issue with making it Debug so if someone notices they're running out of sockets they at least have a mechanism to learn why. I had figured tho that closing the connection more so just freed the socket and flushed any buffers, which if we couldn't close would be something I'd want to know about. WDYT?

defer func() {
err := sshClient.Close()
if err != nil {
log.Errorf("error closing SSH client for target %s: %v", t.String(), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition is this log.Error will be too noisy; do we care if we get an error on Close()?

TLSEnabled bool
TLSFlags *TLSFlags // must be non-nil if TLSEnabled is true
UDPFlags *UDPFlags // must be non-nil if L4TransportProtocol is TransportUDP
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the design correctly:

Just as scan modules registered themselves, they now also provide a way to get a module-scoped DialerGroupConfig for when a framework registers a module (or at least, prior to scan). So a framework will in effect:

  • allocate a module
  • call GetDialerGroupConfig so it knows what type of connection is needed
  • call Scan to perform that scan

Here are some design thoughts:

With a static DialGroupConfig scoped to the scan module level, there are some things that become rigidly defined (the TransportAgnosticDialer is fixed to whatever transport the static config contains, for instance), while others (L4Dialer) allow per-target configuration of transport protocol.

I worry this could be an issue, because a module may want to opportunistically scan udp or tcp, for instance. Or, to have the BaseFlags.Timeout apply to the total scan time of all connections in aggregate that comprise a protocol scan (so individual Dial connections within a scanner may want different timeouts)

It could be we just say this behavior is explicitly not supported. However, it may be cheap to achieve.

For each Scan, the framework passes in a zgrab2.ScanTarget struct, which is then also passed in to all the Dial implementations.

A lot of the DialerGroupConfig options actually feel like ScanTarget scoped fields to me: which transport protocol to use, if a compat separate dialer is required, udp flags, tls flags. I think that is why some elements like TLSEnable and sometimes-static sometimes-not transport protocols feels a bit off to me, as well as the need to pass TLSWrapper around in some modules. I think this may indicate that the config is attached at the wrong level.

if zgrab2.ScanTarget was not a struct, but rather an interface, then we could rely on a framework to implement zgrab2.ScanTarget:
e.g.:

type ScanTarget interface {
    IP()             net.IP
    Port()           *uint
    // ... etc.
    Dial(zgrab2.DialerGroupConfig) // ...
    L4Dialer(zgrab2.DialerGroupConfig) // ...
    TLSWrapper(zgrab2.DialerGroupConfig, net.Conn) // ...
}

Then, Scan can accept anything that fulfills ScanTarget, and the scan module can make connections by just calling Dial, L4Dialer, TLSWrapper with the appropriate DialerGroupConfig. The framework that implements ScanTarget would then be responsible for each of the methods to do the "right thing" given the DialerGroupConfig. The DialerGroupConfig as part of registration (or pre-Scan call) becomes unnecessary then.

Does this makes sense, or did I overlook some design intended functionality that would make this approach less optimal than the current implementation?

Copy link
Contributor Author

@phillip-stephens phillip-stephens Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful response here. It's a very fair point that the current design means a module can't really scan a target on both TCP and UDP and that's probably something we don't want to write-off.

I think I have a solution though for this, which lets a module provide a network and get either TCP or UDP.

dialerGroup.L4Dialer = func(scanTarget *ScanTarget) func(ctx context.Context, network, addr string) (net.Conn, error) {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		switch network {
		case "udp", "udp4", "udp6":
			return GetDefaultUDPDialer(config.BaseFlags, config.UDPFlags)(ctx, scanTarget)
		case "tcp", "tcp4", "tcp6":
			return GetDefaultTCPDialer(config.BaseFlags)(ctx, scanTarget)
		default:
			return nil, fmt.Errorf("unsupported network type: %s", network)
		}
	}
}

This way, a module can do both by toggling the network type when they dial and we can provide both a UDP dialer and TCP one.

Back to your suggested interface, I wonder what the difference is in passing dialers around within the ScanTarget (whether it's a struct or interface) vs. what we have now of passing in dialers in a call to Scan() with a ScanTarget? I think it maps better to my mental model of a ZGrab scan target where it's the "thing" you're scanning, not the connection mechanism by which you scan.

Additionally, to your point about a module scanning both UDP and tcp, this would be possible with a module just using the L4 dialer and selecting network = tcp or network = udp. In my head, I think a module should request a TransportAgnosticDialer if they don't need any further configuration or use of the dialer. They don't need to re-send a query over a different L4 protocol, they just do the protocol over whatever connection is provided.

Secondly, to your point about timeouts, having more fine-grained control over timeouts seems like a good idea, but I'm not sure how this design precludes that future addition. We could still add these as some ScanTargetTimeout and ConnectionTimeout to differentiate between timeout for an entire ScanTarget vs. a specific connection (that might be re-tried with TCP/UDP).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants