This repository has been archived by the owner on Jul 15, 2021. It is now read-only.

RTR Server keeps reporting not ready when fetching validated ROA prefixes #283

Closed
KillerDAN opened this issue Sep 28, 2020 · 12 comments


@KillerDAN

KillerDAN commented Sep 28, 2020

Sep 28 12:11:15 rpki01 rpki-rtr-server.sh: 2020-09-28 12:11:15.767 INFO 2148 --- [eduler_Worker-3] n.r.r.r.a.v.RefreshCacheController : fetching validated roa prefixes from http://localhost:8080/api/objects/validated
Sep 28 12:11:19 rpki01 rpki-rtr-server.sh: 2020-09-28 12:11:19.482 INFO 2148 --- [eduler_Worker-3] n.r.r.r.a.v.RefreshCacheController : validator http://localhost:8080/api/objects/validated not ready yet, will retry later

But if I curl the same endpoint myself, the data comes back fine:

curl http://localhost:8080/api/objects/validated | (head; tail)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
{
"data" : {
"ready" : false,
"trustAnchors" : [ {
"type" : "trust-anchor",
"id" : 1,
"name" : "AfriNIC RPKI Root",
"locations" : [ "https://rpki.afrinic.net/repository/AfriNIC.cer", "rsync://rpki.afrinic.net/repository/AfriNIC.cer" ],
"subjectPublicKeyInfo" : "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAxsAqAhWIO+ON2Ef9oRDMpKxv+AfmSLIdLWJtjrvUyDxJPBjgR+kVrOHUeTaujygFUp49tuN5H2C1rUuQavTHvve6xNF5fU3OkTcqEzMOZy+ctkbde2SRMVdvbO22+TH9gNhKDc9l7Vu01qU4LeJHk3X0f5uu5346YrGAOSv6AaYBXVgXxa0s9ZvgqFpim50pReQe/WI3QwFKNgpPzfQL6Y7fDPYdYaVOXPXSKtx7P4s4KLA/ZWmRL/bobw/i2fFviAGhDrjqqqum+/9w1hElL/vqihVnV18saKTnLvkItA/Bf5i11Yhw2K7qv573YWxyuqCknO/iYLTR1DToBZcZUQIDAQAB",
"rsyncPrefetchUri" : "rsync://rpki.afrinic.net/repository/",
100 15.7M 0 15.7M 0 0 13.1M 0 --:--:-- 0:00:01 --:--:-- 13.1M
"prefix" : "45.11.116.0/22",
"maxLength" : 24
}, {
"asn" : "4214120002",
"prefix" : "185.168.163.0/24",
"maxLength" : 24
} ],
"routerCertificates" : [ ]
}
}

@ties
Member

ties commented Sep 28, 2020

Hi,

The JSON contains ready: false – that's why the rtr-server indicates that the validator is not ready. Can you show the output of http://<validator>/api/healthcheck? This should show which trust anchor is not ready.

Are you running the most recent release of the validator (from last Friday)? Over the last months there have been a number of stability and memory consumption improvements (and, less fortunately, bug fixes).

Have you restarted the validator recently?
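
For readers who want to reproduce the check outside the rtr-server, here is a minimal Python sketch of the readiness test described above. It only assumes what the quoted output shows: the response has a top-level "data" object with a boolean "ready" field. The helper names (fetch_json, validator_is_ready) are made up for illustration, and this is not the rtr-server's actual implementation.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def validator_is_ready(payload):
    """Return True only if the payload's data.ready flag is exactly true."""
    return payload.get("data", {}).get("ready") is True


# Trimmed-down payload with the same shape as the output quoted in this issue.
sample = {"data": {"ready": False, "trustAnchors": [], "routerCertificates": []}}
print(validator_is_ready(sample))  # False
```

Against a live validator this could be called as validator_is_ready(fetch_json("http://localhost:8080/api/objects/validated")). The full response was about 15.7 MB in the report above, so expect the download to take a moment.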

@KillerDAN
Author

KillerDAN commented Sep 28, 2020

curl http://localhost:8080/api/healthcheck
{
"data" : {
"overalStatus" : "OK",
"trustAnchorReady" : [ {
"taName" : "AfriNIC RPKI Root",
"complete" : true
}, {
"taName" : "APNIC RPKI Root",
"complete" : true
}, {
"taName" : "ARIN",
"complete" : true
}, {
"taName" : "LACNIC RPKI Root",
"complete" : true
}, {
"taName" : "RIPE NCC RPKI Root",
"complete" : true
} ],
"bgpDumpReady" : {
"https://www.ris.ripe.net/dumps/riswhoisdump.IPv6.gz" : true,
"https://www.ris.ripe.net/dumps/riswhoisdump.IPv4.gz" : true
},
"databaseStatus" : {
"READONLY_TRANSACTIONS" : "453003",
"DISK_USAGE" : "3908584025",
"BYTES_WRITTEN" : "107046519385",
"FLUSHED_TRANSACTIONS" : "24388",
"ACTIVE_TRANSACTIONS" : "1",
"BYTES_MOVED_BY_GC" : "2074024265",
"BYTES_READ" : "55349581824",
"TRANSACTIONS" : "74709",
"UTILIZATION_PERCENT" : "47"
},
"buildInformation" : {
"version" : "3.1-2020.09.25.11.16"
}
}
}
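
The healthcheck output above can also be checked programmatically. The following Python sketch lists trust anchors and BGP dumps that are not reported as complete. It relies only on the field names visible in the output above ("trustAnchorReady", "taName", "complete", "bgpDumpReady"); the function names and the sample payload, including the incomplete "Example TA" entry, are invented for illustration.

```python
def incomplete_trust_anchors(healthcheck):
    """Return the taName of every trust anchor whose 'complete' flag is not true."""
    anchors = healthcheck.get("data", {}).get("trustAnchorReady", [])
    return [
        ta.get("taName", "<unnamed>")
        for ta in anchors
        if ta.get("complete") is not True
    ]


def bgp_dumps_not_ready(healthcheck):
    """Return the URLs of BGP dumps that are not flagged as ready."""
    dumps = healthcheck.get("data", {}).get("bgpDumpReady", {})
    return [url for url, ready in dumps.items() if ready is not True]


# Hypothetical payload: "Example TA" is made up to show a failing case.
sample = {
    "data": {
        "trustAnchorReady": [
            {"taName": "ARIN", "complete": True},
            {"taName": "Example TA", "complete": False},
        ],
        "bgpDumpReady": {
            "https://www.ris.ripe.net/dumps/riswhoisdump.IPv4.gz": True,
        },
    }
}
print(incomplete_trust_anchors(sample))  # ['Example TA']
print(bgp_dumps_not_ready(sample))       # []
```

With the output posted above, both functions would return empty lists, which matches the observation that the healthcheck looked fine while /api/objects/validated still reported "ready" : false.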

I had previously checked the validator GUI and everything seemed OK.

I am currently running the latest release, but the issue was already noticeable on the previous one.

I have restarted the validator, the rtr-server and the host machine at least twice each. The situation has persisted for over 2 days, so it is not caused by a recent restart of any of the components.

@ties
Member

ties commented Sep 28, 2020

It looks good (now). There were recent changes to the behaviour when the validator has been running for a longer period of time or on machines with a fast connection/many cores.

Do you still encounter the issue you explained above?

@KillerDAN
Author

Yes, I do. That is the point. It has been like this for over two days now.

Sep 28 14:46:15 rpki01 rpki-rtr-server.sh: 2020-09-28 14:46:15.767 INFO 2148 --- [eduler_Worker-5] n.r.r.r.a.v.RefreshCacheController : fetching validated roa prefixes from http://localhost:8080/api/objects/validated
Sep 28 14:46:18 rpki01 rpki-rtr-server.sh: 2020-09-28 14:46:18.419 INFO 2148 --- [eduler_Worker-5] n.r.r.r.a.v.RefreshCacheController : validator http://localhost:8080/api/objects/validated not ready yet, will retry later
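
To put a number on how long the "not ready" state has lasted, the rtr-server log can be scanned for these messages. The Python sketch below assumes the log format shown in this issue (a "YYYY-MM-DD HH:MM:SS.mmm" timestamp and the literal text "not ready yet, will retry later"); the function name not_ready_span is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp written by the rtr-server itself (not the syslog prefix).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")
NOT_READY = "not ready yet, will retry later"


def not_ready_span(log_lines):
    """Return (first, last) timestamps of 'not ready' lines, or None if none exist."""
    stamps = []
    for line in log_lines:
        if NOT_READY not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return None
    return min(stamps), max(stamps)


# The two "not ready" lines quoted in this issue.
log = [
    "Sep 28 12:11:19 rpki01 rpki-rtr-server.sh: 2020-09-28 12:11:19.482 INFO 2148 --- [eduler_Worker-3] n.r.r.r.a.v.RefreshCacheController : validator http://localhost:8080/api/objects/validated not ready yet, will retry later",
    "Sep 28 14:46:18 rpki01 rpki-rtr-server.sh: 2020-09-28 14:46:18.419 INFO 2148 --- [eduler_Worker-5] n.r.r.r.a.v.RefreshCacheController : validator http://localhost:8080/api/objects/validated not ready yet, will retry later",
]
first, last = not_ready_span(log)
print(last - first)  # 2:34:58.937000
```

This only measures the time between the first and last "not ready" message. It does not prove that no successful refresh happened in between, so it is a rough indicator at best.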

@ties
Member

ties commented Sep 28, 2020

Hi --- ok, that's clear.

In this situation it is best to try to reset the local database. I just tested our latest release (provisioned using an automated install in a clean vm) and did get to a stable state, where the rtr server is ready.

Please let us know if that helps!

If that resolves the issue, it would help if you could share the current database with us so we can investigate, because at the moment we do not have a copy of a database that gets stuck with the latest release.

@KillerDAN
Author

Hello, resetting the local database worked. Thank you.

How can I share the backup of the database with you? It is quite big, about 2.5 GB.

@ties
Member

ties commented Sep 29, 2020

The download size is no problem for me if you have a place to store it.

We can also skip further investigation and only do so if this happens again.

There was a bug in 3.1-2020.09.18.13.38 that should recover after a restart and that we fixed in 3.1-2020.09.25.11.16. Since a restart did not fix your issue, you were not hitting that. So I'm not sure what was causing your issue.

@ties
Member

ties commented Sep 29, 2020

When I ran a validator using the copy of the database from before the reset, it converged after some time. However, I then noticed that the validator reports "not ready" quite often.

I think you were hitting the situation I've described in #284. I'm closing this issue for now but please re-open this if you hit it later.

@ties ties closed this as completed Sep 29, 2020
@KillerDAN
Author

I do not think it is a timing issue.
It ran like this for over two days without ever providing data to the RTR server, and it had probably been doing so long before that.

@ties
Member

ties commented Sep 30, 2020

Ok... Thanks for the clarification.

In hindsight it would have been interesting to see how long ago each trust anchor was last updated (from the web interface or from rpkivalidator_last_validation_run in the Prometheus /metrics output). If you are using Prometheus, I recommend that you monitor those.
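
As a rough illustration of the monitoring suggested above, this Python sketch pulls the rpkivalidator_last_validation_run samples out of a Prometheus text-format scrape and flags the ones that look stale. Several things here are assumptions rather than facts from this thread: that the metric value is a Unix timestamp in seconds, the label name trust_anchor used in the sample text, and the function names. The parser is deliberately simple and would mis-handle label values that contain a "}" character.

```python
METRIC = "rpkivalidator_last_validation_run"


def parse_metric_samples(metrics_text, metric_name=METRIC):
    """Return (labels, value) pairs for metric_name from Prometheus text output."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or not line.startswith(metric_name):
            continue
        rest = line[len(metric_name):]
        if rest[:1] not in ("{", " "):
            continue  # a different metric that merely shares the prefix
        if rest.startswith("{"):
            labels, _, tail = rest[1:].partition("}")
        else:
            labels, tail = "", rest
        fields = tail.split()
        if fields:
            samples.append((labels, float(fields[0])))
    return samples


def stale_samples(samples, now, max_age_seconds):
    """Return labels of samples older than max_age_seconds (ASSUMES epoch seconds)."""
    return [labels for labels, value in samples if now - value > max_age_seconds]


# Hypothetical scrape: label names and values are invented for illustration.
text = """\
# TYPE rpkivalidator_last_validation_run gauge
rpkivalidator_last_validation_run{trust_anchor="ARIN"} 1601290000
rpkivalidator_last_validation_run{trust_anchor="RIPE NCC RPKI Root"} 1601200000
"""
samples = parse_metric_samples(text)
print(stale_samples(samples, now=1601290600, max_age_seconds=3600))
```

In a real setup the same condition is more naturally expressed as a Prometheus alerting rule. Check the actual /metrics output of your validator for the real label names and units before relying on anything like this.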

It could also be a deadlock due to a high number of threads – we have a report of this in #277. Are you running it on a machine with a large number of cores?

@KillerDAN
Author

We had 4 vCPUs when the issue first came up. Over the following days we gave it 4 more vCPUs to cope with the high CPU usage and load we observed, thinking it could be a timing or processing issue. It did not help.
What do you consider a "large number of cores"? Is there a current recommendation to follow?
It is now running properly with 8 vCPUs after resetting the local database as suggested.

@ties
Member

ties commented Oct 1, 2020

That number of cores sounds fine. My personal long-running instance works fine with two (Skylake) vCPUs. The validator will make use of more cores when they are available; the load peaks will then use all cores but be shorter.

I would recommend 2 cores as a minimum, four (or eight) should work well.

The issue I described with a high number of cores occurred on a machine with 48 cores and was reproducible on a machine with 56 cores, so I don't think that situation applies (we don't hit it on machines with 16 hyperthreads).
