Conversation

Contributor

@kllinzy kllinzy commented Sep 26, 2025

Prefer SSM Credentials over EC2Role Credentials for ECS-A (External=true).

Summary

This change updates the Linux credential provider chain to prefer the RotatingSharedCredentialsProvider over EC2RoleProvider. The motivation is that, when running ECS-A on an EC2 instance that has been registered with SSM, the EC2 Role credentials were being used whenever the EC2 instance had an IAM role. This caused the RegisterContainerInstance attempt to fail with the error "The identity document and identity document signature were not valid".
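To illustrate why the ordering matters, here is a minimal, self-contained sketch of a "first provider that succeeds wins" chain. It is not the agent's actual code: the `Provider` interface, `staticProvider`, and `resolveFirst` are hypothetical stand-ins, and the real chain contains other providers that are omitted here. Only the two provider names are taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Credentials is a simplified stand-in for an AWS credential value.
type Credentials struct {
	AccessKeyID string
	Source      string // name of the provider that produced the credentials
}

// Provider is a hypothetical minimal interface, not the SDK's real one.
type Provider interface {
	Retrieve() (Credentials, error)
}

// staticProvider simulates a provider that either has credentials or does not.
type staticProvider struct {
	name      string
	available bool
}

func (p staticProvider) Retrieve() (Credentials, error) {
	if !p.available {
		return Credentials{}, errors.New(p.name + ": no credentials available")
	}
	return Credentials{AccessKeyID: "EXAMPLE", Source: p.name}, nil
}

// resolveFirst walks the chain in order and returns the first successful result.
func resolveFirst(chain []Provider) (Credentials, error) {
	lastErr := errors.New("empty provider chain")
	for _, p := range chain {
		creds, err := p.Retrieve()
		if err == nil {
			return creds, nil
		}
		lastErr = err
	}
	return Credentials{}, fmt.Errorf("no provider returned credentials: %w", lastErr)
}

func main() {
	// An EC2 instance registered with SSM that also has an IAM role:
	// both providers can return credentials.
	ssm := staticProvider{name: "RotatingSharedCredentialsProvider", available: true}
	ec2 := staticProvider{name: "EC2RoleProvider", available: true}

	oldCreds, _ := resolveFirst([]Provider{ec2, ssm}) // order before this change
	newCreds, _ := resolveFirst([]Provider{ssm, ec2}) // order after this change

	fmt.Println("old order picks:", oldCreds.Source)
	fmt.Println("new order picks:", newCreds.Source)
}
```

In this sketch, when only one of the two providers can return credentials, both orderings pick the same one; the order only changes the outcome when both are available, which is the ECS-A-on-EC2 case described above.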

Implementation details

This change updates the Linux credential provider chain and is based on this stale PR: #4155. I tried to follow the feedback provided there by combining the Linux and Windows files into one and using the same order of precedence for both. I can't say I fully understand the implications of that; I was just trying to follow the instructions.

I would also love confirmation from someone who knows this code better that this won't break anyone. The scenario I'm concerned about is a user who set up an EC2 instance as a container instance for an ECS cluster but incorrectly labeled it External. I couldn't work it out in the one day I had, but I'm worried that setup "works" today because the EC2 Role is being used, and that this change would make it stop "working" because the SSM credential would be selected instead. Hopefully that setup never worked and this change can't break anyone, but I wanted to call it out just in case.

Testing

make test: this failed on my machine, but only because of a version number mismatch coming from amazon-ecs-cni-plugins (output below). I also ran all the unit tests in the ecs-agent package, both with go test -tags=unit ./... and from my IDE's test runner; all passed.

I had to update one test because it had different conditions for Windows and Linux, which are now the same. I also added one test to make sure the RotatingSharedCredentialsProvider is used ahead of the EC2RoleProvider when External=true.

And here's the one test that did fail on me:

=== RUN   TestCNIPluginVersionNumber
    plugin_test.go:39: 
        	Error Trace:	/home/XXXXXX/Repos/aws/amazon-ecs-agent/agent/ecscni/plugin_test.go:39
        	Error:      	Not equal: 
        	            	expected: "2024.09.0"
        	            	actual  : "2020.09.0"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-2024.09.0
        	            	+2020.09.0
        	Test:       	TestCNIPluginVersionNumber
--- FAIL: TestCNIPluginVersionNumber (0.00s)

New tests cover the changes: Yes

Description for the changelog

Enhancement - Use same precedence for credential providers in Windows and Linux

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?
No

Does this PR include the addition of new environment variables in the README?
No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@kllinzy kllinzy requested a review from a team as a code owner September 26, 2025 21:57
@sparrc sparrc changed the base branch from master to dev September 26, 2025 22:00
@kllinzy kllinzy force-pushed the master branch 2 times, most recently from 48a1bad to 8d539af Compare September 26, 2025 22:32
Contributor Author

kllinzy commented Sep 27, 2025

I think I'm out of my depth on the Linux unit test failures. I'm running on Linux and seeing those same tests pass:

Running tool: /usr/local/go/bin/go test -timeout 30s -tags unit -run ^TestMetricsToPublishMetricRequestsNonIdleStatsSourcePaginationWithTaskNumber$ github.com/aws/amazon-ecs-agent/ecs-agent/tcs/client -test.count=1

=== RUN   TestMetricsToPublishMetricRequestsNonIdleStatsSourcePaginationWithTaskNumber
--- PASS: TestMetricsToPublishMetricRequestsNonIdleStatsSourcePaginationWithTaskNumber (0.00s)
PASS
ok      github.com/aws/amazon-ecs-agent/ecs-agent/tcs/client    0.006s

and

Running tool: /usr/local/go/bin/go test -timeout 30s -tags unit -run ^TestSessionReconnectsWithBackoffOnNonEOFError$ github.com/aws/amazon-ecs-agent/ecs-agent/acs/session -test.count=1

=== RUN   TestSessionReconnectsWithBackoffOnNonEOFError
1758937470014681195 [Debug] logger=structured msg="Received connect to ACS message. Attempting connect to ACS" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE"
1758937470014738944 [Error] logger=structured msg="Failed to connect to ACS" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" error="not EOF"
1758937470014744024 [Warn] logger=structured msg="ACS WebSocket connection closed" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" error="not EOF"
1758937470014748563 [Info] logger=structured msg="Waiting before reconnecting to ACS" reconnectDelay="1ms" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE"
1758937470015843982 [Info] logger=structured msg="Done waiting; reconnecting to ACS" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE"
1758937470015861596 [Debug] logger=structured msg="Received connect to ACS message. Attempting connect to ACS" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE"
1758937470015928442 [Error] logger=structured msg="Failed to connect to ACS" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" error="EOF"
1758937470015954882 [Info] logger=structured msg="ACS WebSocket connection closed for a valid reason" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE"
1758937470015961515 [Info] logger=structured msg="Reconnecting to ACS immediately without waiting" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE"
1758937470015966855 [Info] logger=structured msg="ACS session ended (context closed)" containerInstanceARN="arn:aws:ecs:us-west-2:123456789012:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" reason="context canceled"
--- PASS: TestSessionReconnectsWithBackoffOnNonEOFError (0.00s)
PASS
ok      github.com/aws/amazon-ecs-agent/ecs-agent/acs/session   0.009s

I did check the go version:

go version go1.24.3 linux/amd64

So I don't quite match (1.24.3 versus 1.24.6), but I'd be very surprised if that's the cause. Is there anything known that my environment needs to match to get the same results as the test environment?

It's also possible I'm just being naive: I don't see these tests in the output when I run make test locally; I'm running them from my IDE or by typing go test ....

singholt
singholt previously approved these changes Oct 6, 2025
Contributor

@singholt singholt left a comment

Thanks for your contribution!

prateekchaudhry
prateekchaudhry previously approved these changes Oct 6, 2025
Contributor Author

kllinzy commented Oct 7, 2025

Sorry, I tried to just accept your recommendations in the GitHub UI, but that (obviously) didn't run make gomod, so the copy in /agent and the copy in /ecs-agent differed. I fixed it and pushed again.

@prateekchaudhry prateekchaudhry merged commit 2f77c81 into aws:dev Oct 7, 2025
40 of 42 checks passed
@harishxr harishxr mentioned this pull request Oct 7, 2025

4 participants