RTR Server keeps reporting not ready when fetching validated ROA prefixes #283

KillerDAN · 2020-09-28T11:14:57Z

Sep 28 12:11:15 rpki01 rpki-rtr-server.sh: 2020-09-28 12:11:15.767 INFO 2148 --- [eduler_Worker-3] n.r.r.r.a.v.RefreshCacheController : fetching validated roa prefixes from http://localhost:8080/api/objects/validated
Sep 28 12:11:19 rpki01 rpki-rtr-server.sh: 2020-09-28 12:11:19.482 INFO 2148 --- [eduler_Worker-3] n.r.r.r.a.v.RefreshCacheController : validator http://localhost:8080/api/objects/validated not ready yet, will retry later

But if I curl this URL, the output comes back clean:

curl http://localhost:8080/api/objects/validated | (head; tail)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
{
"data" : {
"ready" : false,
"trustAnchors" : [ {
"type" : "trust-anchor",
"id" : 1,
"name" : "AfriNIC RPKI Root",
"locations" : [ "https://rpki.afrinic.net/repository/AfriNIC.cer", "rsync://rpki.afrinic.net/repository/AfriNIC.cer" ],
"subjectPublicKeyInfo" : "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAxsAqAhWIO+ON2Ef9oRDMpKxv+AfmSLIdLWJtjrvUyDxJPBjgR+kVrOHUeTaujygFUp49tuN5H2C1rUuQavTHvve6xNF5fU3OkTcqEzMOZy+ctkbde2SRMVdvbO22+TH9gNhKDc9l7Vu01qU4LeJHk3X0f5uu5346YrGAOSv6AaYBXVgXxa0s9ZvgqFpim50pReQe/WI3QwFKNgpPzfQL6Y7fDPYdYaVOXPXSKtx7P4s4KLA/ZWmRL/bobw/i2fFviAGhDrjqqqum+/9w1hElL/vqihVnV18saKTnLvkItA/Bf5i11Yhw2K7qv573YWxyuqCknO/iYLTR1DToBZcZUQIDAQAB",
"rsyncPrefetchUri" : "rsync://rpki.afrinic.net/repository/",
100 15.7M 0 15.7M 0 0 13.1M 0 --:--:-- 0:00:01 --:--:-- 13.1M
"prefix" : "45.11.116.0/22",
"maxLength" : 24
}, {
"asn" : "4214120002",
"prefix" : "185.168.163.0/24",
"maxLength" : 24
} ],
"routerCertificates" : [ ]
}
}

ties · 2020-09-28T11:49:16Z

Hi,

The json contains ready: false – that's why the rtr-server indicates that the validator is not ready. Can you show the output of http://<validator>/api/healthcheck? This should show which trust anchor is not ready.
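For anyone who wants to check the flag without scrolling through the full response, here is a minimal Python sketch. It is not the rtr-server's own code (that logic is only described above); the function names `validator_ready` and `fetch_ready` are made up for illustration, and the default URL is the one from the log lines in this issue.

```python
import json
from urllib.request import urlopen


def validator_ready(payload: dict) -> bool:
    """Return True only if the response carries "ready": true under "data"."""
    return payload.get("data", {}).get("ready") is True


def fetch_ready(url: str = "http://localhost:8080/api/objects/validated") -> bool:
    # Hypothetical helper: downloads the whole response (about 15.7M in the
    # curl output above) and keeps only the readiness flag.
    with urlopen(url, timeout=60) as resp:
        return validator_ready(json.load(resp))


# Offline demo using the shape of the JSON shown in this issue.
sample = json.loads('{"data": {"ready": false, "trustAnchors": []}}')
print(validator_ready(sample))
```

Calling `fetch_ready()` on the validator host should return `False` for as long as the rtr-server logs "not ready yet".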

Are you running the most recent release (last Friday) of the validator? Over the last months there have been a number of stability and memory consumption improvements (and, less fortunately, bug fixes).

Have you restarted the validator recently?

KillerDAN · 2020-09-28T12:10:02Z

curl http://localhost:8080/api/healthcheck
{
"data" : {
"overalStatus" : "OK",
"trustAnchorReady" : [ {
"taName" : "AfriNIC RPKI Root",
"complete" : true
}, {
"taName" : "APNIC RPKI Root",
"complete" : true
}, {
"taName" : "ARIN",
"complete" : true
}, {
"taName" : "LACNIC RPKI Root",
"complete" : true
}, {
"taName" : "RIPE NCC RPKI Root",
"complete" : true
} ],
"bgpDumpReady" : {
"https://www.ris.ripe.net/dumps/riswhoisdump.IPv6.gz" : true,
"https://www.ris.ripe.net/dumps/riswhoisdump.IPv4.gz" : true
},
"databaseStatus" : {
"READONLY_TRANSACTIONS" : "453003",
"DISK_USAGE" : "3908584025",
"BYTES_WRITTEN" : "107046519385",
"FLUSHED_TRANSACTIONS" : "24388",
"ACTIVE_TRANSACTIONS" : "1",
"BYTES_MOVED_BY_GC" : "2074024265",
"BYTES_READ" : "55349581824",
"TRANSACTIONS" : "74709",
"UTILIZATION_PERCENT" : "47"
},
"buildInformation" : {
"version" : "3.1-2020.09.25.11.16"
}
}
}

I previously checked the validator GUI and everything seemed OK.

I am currently running the latest release, but this issue was also noticeable on the previous one.

I have restarted the validator, the rtr-server, and the host machine at least twice each, but the situation has persisted for over 2 days, so it is not caused by a recent restart of any of the components.
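To make the healthcheck output easier to scan, the following Python sketch lists the trust anchors that are not yet complete. The field names (`trustAnchorReady`, `taName`, `complete`) come from the output above; the function name `incomplete_trust_anchors` is hypothetical, and the sample is trimmed to two of the five anchors.

```python
import json
from typing import List


def incomplete_trust_anchors(health: dict) -> List[str]:
    """Return the names of trust anchors whose "complete" flag is not true."""
    anchors = health.get("data", {}).get("trustAnchorReady", [])
    return [ta["taName"] for ta in anchors if not ta.get("complete", False)]


# Trimmed copy of the healthcheck response shown above.
sample = json.loads("""
{
  "data": {
    "trustAnchorReady": [
      {"taName": "AfriNIC RPKI Root", "complete": true},
      {"taName": "ARIN", "complete": true}
    ]
  }
}
""")
print(incomplete_trust_anchors(sample))
```

For the response above the list is empty, even though /api/objects/validated still reported "ready" : false a short while earlier.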

ties · 2020-09-28T13:13:22Z

It looks good (now). There were recent changes to the behaviour when the validator has been running for a longer period of time or on machines with a fast connection/many cores.

Do you still encounter the issue you explained above?

KillerDAN · 2020-09-28T13:47:09Z

Yes, I do. That is the point. It has been like this for over two days now.

Sep 28 14:46:15 rpki01 rpki-rtr-server.sh: 2020-09-28 14:46:15.767 INFO 2148 --- [eduler_Worker-5] n.r.r.r.a.v.RefreshCacheController : fetching validated roa prefixes from http://localhost:8080/api/objects/validated
Sep 28 14:46:18 rpki01 rpki-rtr-server.sh: 2020-09-28 14:46:18.419 INFO 2148 --- [eduler_Worker-5] n.r.r.r.a.v.RefreshCacheController : validator http://localhost:8080/api/objects/validated not ready yet, will retry later

ties · 2020-09-28T14:33:07Z

Hi --- ok, that's clear.

In this situation it is best to try to reset the local database. I just tested our latest release (provisioned using an automated install in a clean vm) and did get to a stable state, where the rtr server is ready.

Please let us know if that helps!

If that resolves the issue, it would help if you could share the current database with us so we can investigate, because at the moment we do not have a copy of a database that gets stuck with the latest release.

KillerDAN · 2020-09-28T16:25:01Z

Hello, resetting the local database worked. Thank you.

How can I share the backup database with you? It is quite big, ~2.5GB.

ties · 2020-09-29T08:57:52Z

The download size is no problem for me if you have a place to store it.

We can also skip further investigation and only do so if this happens again.

There was a bug in 3.1-2020.09.18.13.38 that should recover after a restart and that we fixed in 3.1-2020.09.25.11.16. Since a restart did not fix your issue, you were not hitting that. So I'm not sure what was causing your issue.

ties · 2020-09-29T15:06:23Z

When running a validator using the copy of the database from before the reset, the validator converged for me after some time. However, I then noticed that the validator reports "not ready" quite often.

I think you were hitting the situation I've described in #284. I'm closing this issue for now but please re-open this if you hit it later.

KillerDAN · 2020-09-30T10:52:11Z

I do not think it is a timing issue.
It ran like this for over two days without ever providing data to the RTR server, and had probably been doing so long before that.

ties · 2020-09-30T15:14:03Z

Ok... Thanks for the clarification.

In hindsight it would have been interesting to see how long ago each trust anchor had last been updated (from the web interface or from rpkivalidator_last_validation_run in the Prometheus /metrics). If you are using Prometheus I recommend that you monitor those.
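If you do not have Prometheus alerting set up, a rough Python sketch like the one below can flag stale validation runs from the plain-text /metrics output. It rests on two assumptions that are not confirmed in this thread: that the metric value is a Unix timestamp in seconds, and that each sample carries a label identifying the trust anchor (the `trust_anchor` label in the sample is hypothetical). The function names are made up for illustration.

```python
import re
import time
from typing import Dict, List, Optional

METRIC = "rpkivalidator_last_validation_run"

# One sample line of the Prometheus text format: name, optional {labels}, value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def last_validation_runs(metrics_text: str) -> Dict[str, float]:
    """Map each label set of the metric to its sample value."""
    runs: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group("name") == METRIC:
            runs[match.group("labels") or ""] = float(match.group("value"))
    return runs


def stale(runs: Dict[str, float], max_age_seconds: float,
          now: Optional[float] = None) -> List[str]:
    """Return label sets whose last run is older than max_age_seconds.

    ASSUMPTION: values are Unix timestamps in seconds.
    """
    now = time.time() if now is None else now
    return sorted(k for k, ts in runs.items() if now - ts > max_age_seconds)


# Made-up sample; the label name and the values are illustrative only.
sample = (
    "# TYPE rpkivalidator_last_validation_run gauge\n"
    'rpkivalidator_last_validation_run{trust_anchor="ARIN"} 1000\n'
    'rpkivalidator_last_validation_run{trust_anchor="RIPE NCC RPKI Root"} 4000\n'
)
print(stale(last_validation_runs(sample), max_age_seconds=3600, now=5000))
```

Check the actual /metrics output of your validator first and adjust the unit and label handling if the assumptions do not hold.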

It could also be a deadlock due to a high number of threads – we have a report of this in #277. Are you running it on a machine with a large number of cores?

KillerDAN · 2020-10-01T17:40:02Z

Initially we had 4 vCPUs when the issue first came up. In the following days we "fed" it 4 more vCPUs to cope with the high CPU usage and load we observed, thinking it could be a time/processing issue. It did not help.
What do you consider a "large number of cores"? Is there a recommendation to follow currently?
It is currently running properly with 8 vCPUs after resetting the local database as suggested.

ties · 2020-10-01T18:09:08Z

That number of cores sounds fine. My personal long-running instance works fine with two (Skylake) vCPUs. The validator will use more cores when they are available; the load peaks then use all cores but are shorter.

I would recommend 2 cores as a minimum, four (or eight) should work well.

The issue I described with a high number of cores occurred on a machine with 48 cores and was reproducible on a machine with 56 cores, so I don't think that situation applies (we don't hit it on machines with 16 hyperthreads).

ties mentioned this issue Sep 29, 2020

Validator can become "not ready" a large part of the time #284

Closed

ties closed this as completed Sep 29, 2020
