Provider produced inconsistent result after apply when interacting with remote secondary datacenters #249
Comments
Hi @danieleva, thanks for reporting this issue. I did not test much with federated datacenters; the provider certainly behaves weirdly in these cases and is probably not coherent for each resource. I will have a look in the coming days to find the best way to proceed. The retry solution looks appropriate for ACLs, but I would like to make sure it is.
@remilapeyre Any updates on this? It is still an issue with the latest versions of Terraform and the Consul provider.
Hi @rrijkse, I ran some tests and found the way I want to implement this. I will come back to this issue and try to fix all the resources that currently show this behaviour.
We experienced the same issue and solved it by configuring the provider to use the primary datacenter.
I also found what seems to be a related behaviour when creating intentions on a federated secondary datacenter. Note the intention is in fact created.

```
2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 17
```
Given the error provided, "error: failed to read config entry after setting it", it seems a workaround may be to catch that error and retry for a few cycles with increasing wait times (e.g., 1 s, then 2, then 4, then 8) before finally giving up.
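A minimal sketch of that catch-and-retry idea in Go (the helper and the wrapped read function are hypothetical illustrations, not provider code):

```go
package main

import (
	"fmt"
	"time"
)

// retryWithBackoff retries readFn with doubling delays (1s, 2s, 4s, 8s, ...)
// and gives up after the given number of attempts.
func retryWithBackoff(attempts int, readFn func() error) error {
	delay := 1 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = readFn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Stand-in for the read that fails with "failed to read config entry
	// after setting it" until replication catches up.
	calls := 0
	err := retryWithBackoff(4, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("failed to read config entry after setting it")
		}
		return nil
	})
	fmt.Println("result:", err) // prints "result: <nil>" once the read succeeds
}
```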
Hi @remilapeyre: did you manage to advance on this issue? At least you might apply @danieleva's suggested workaround ("A naive workaround, adding `time.Sleep(10 * time.Second)` before the return in resourceConsulACLPolicyCreate to allow for acl replication to complete fixes the problem") until you find the time/inspiration for a better solution; quite possibly a lower wait time would do the trick, as I also saw replication times in the 1 to 3 second range. TIA
I opened a PR to fix this. |
I published the patched version to the registry to make it easier to validate/test: https://registry.terraform.io/providers/7fELF/consul/latest |
I could test it today with the following versions definition:

```hcl
terraform {
  required_version = "= 1.4.6"

  required_providers {
    consul = {
      source  = "7fELF/consul"
      version = "= 2.20.1"
    }

    null = "= 3.2.1"
  }
}
```

I can still reproduce the bug on the first `terraform apply`.
The intention is nevertheless properly created and I can see it in the Consul web UI. A second `terraform apply` then completes without the error. This is exactly the same behaviour I got with consul = "= 2.17.0".

Also relevant is the code in main.tf (the Consul access variables come from the shell environment and point to a remote secondary datacenter):

```hcl
# Loops through intentions
resource "consul_config_entry" "intentions" {
  for_each = {
    for intention in local.intentions :
    intention.name => intention
  }

  name = each.value.name
  kind = "service-intentions"

  config_json = jsonencode({
    Sources = [
      for source in each.value.sources : {
        Name       = source
        Type       = "consul"
        Action     = "allow"
        Namespace  = "default"
        Partition  = "default"
        Precedence = 9
      }
    ]
  })
}
```
Thanks for testing my patch @next-jesusmanuelnavarro
I'm not a service mesh user, but according to the docs, setting a replication token also enables service mesh data replication. So to also fix it, I need to figure out:
On this I can be of little help, as I don't administer my Consul cluster, I'm just a user of it (in fact, I can't even list policies with my credentials). All I can say, if that's what you mean, is that my use case covers service-intentions, service-defaults, service-resolver and service-splitter config entries. https://developer.hashicorp.com/consul/docs/connect/config-entries/service-intentions
Hi, this is a long-standing issue and the patch from @7fELF looks like the right way forward to fix this. I wish this could be handled automatically by the Consul Go client, but we should move forward with the current approach first and improve the situation for all users of the Go client later. Regarding the inconsistency with the config entry, I'm not sure the same fix is applicable, but I will look into that as well.
Any update on this? I'm experiencing the same issue in my federated clusters when pointing to the secondary DC.
Terraform Version
Terraform v0.14.7
registry.terraform.io/hashicorp/consul v2.11.0
consul 1.9.4
Affected Resource(s)
consul_acl_policy
Reproducing the issue requires some setup.
I have 2 consul datacenters, WAN federated with ACL replication enabled. The primary is in US, secondary in Asia/Pacific.
There is a ~200ms latency on the WAN connection used for federation.
If Terraform is configured to connect to the Consul API in the remote (secondary) datacenter, `consul_acl_policy` creation fails with a "Provider produced inconsistent result after apply" error.
If I force the provider to use the primary datacenter, the resource is created correctly.
Debug logs on Consul show the issue. In both cases the provider is connected to a server in the secondary datacenter; the only difference is whether the provider is configured with `datacenter = "secondary"` or with `datacenter = "primary"`.
In both cases the first part of the flow is identical: the provider creates the policy with a `PUT /v1/acl/policy`. The behaviour changes when reading the policy back with `GET /v1/acl/policy/<policy_id>`: against the secondary datacenter the read returns an `ACL not found` error and the apply breaks.
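For illustration, the same read-after-write sequence can be exercised outside Terraform with the Consul Go API client; this is a hypothetical standalone sketch (address, token and policy contents are placeholders), not the provider's code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Point the client at a server in the secondary datacenter
	// (placeholder address and token).
	cfg := api.DefaultConfig()
	cfg.Address = "consul.secondary.example.com:8500"
	cfg.Token = "REPLACE_ME"

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Write the policy; ACL writes are handled by the primary datacenter.
	policy, _, err := client.ACL().PolicyCreate(&api.ACLPolicy{
		Name:  "example-policy",
		Rules: `node_prefix "" { policy = "read" }`,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Read it back immediately from the secondary. Until ACL replication
	// catches up this can return "ACL not found", which is the race the
	// provider hits.
	_, _, err = client.ACL().PolicyRead(policy.ID, nil)
	fmt.Println("read-back result:", err)
}
```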
A naive workaround, adding `time.Sleep(10 * time.Second)` before the return in `resourceConsulACLPolicyCreate` to allow ACL replication to complete, fixes the problem, but I don't think that's the proper way to address this.
The provider documentation is not clear on what the configuration should be when dealing with federated datacenters.
If the datacenter parameter in the provider must be configured to point at the primary, that should be explicit in the documentation, in addition to ensuring all the resources specify the datacenter they refer to when it is not the primary.
IMHO a better option would be to add some retry logic to the resources, to account for the delay and the eventually consistent nature of ACL federation. In my tests the replication is still very fast, usually under 1 s, so a configurable retry with exponential backoff would handle it nicely.
If you agree on the retry solution, I'm happy to provide a PR for it.
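One possible shape for that retry, assuming the provider keeps using the terraform-plugin-sdk/v2 helpers (the function name, error matching and timeout handling here are illustrative guesses, not an actual patch):

```go
package provider

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/hashicorp/consul/api"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// waitForACLReplication polls the policy in the datacenter the provider is
// connected to until the policy is visible or the timeout expires, so the
// read-back after a create does not fail while replication is catching up.
func waitForACLReplication(ctx context.Context, client *api.Client, policyID string, timeout time.Duration) error {
	return resource.RetryContext(ctx, timeout, func() *resource.RetryError {
		_, _, err := client.ACL().PolicyRead(policyID, nil)
		if err != nil {
			if strings.Contains(err.Error(), "ACL not found") {
				// Not replicated yet: worth retrying.
				return resource.RetryableError(fmt.Errorf("waiting for ACL replication: %w", err))
			}
			// Anything else is a real failure.
			return resource.NonRetryableError(err)
		}
		return nil
	})
}
```

In `resourceConsulACLPolicyCreate`, something along these lines could run right after the policy is created and before the final read that populates state, bounded by the resource's create timeout.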
[GH-167] partially addressed this, but didn't add any retry logic.
Thanks :)