gRPC gets stuck in SSL handshake exception when running documentation tests from Linux developer systems #257

Closed
novoj opened this issue Sep 13, 2023 · 7 comments

@novoj
Collaborator

novoj commented Sep 13, 2023

If we run all the documentation tests locally on Linux, or a single document with a large number of examples (such as fetching.md), the Java / gRPC tests start to fail at a certain point, and no Java / gRPC test succeeds after that. The problem doesn't occur on GitHub CI or on Windows developer machines. Other protocols (REST/GraphQL) on the same server keep working. When the test run is restarted, gRPC works again for a while, so the server can recover from the problem. The problem is probably server-side, as HAProxy sees the requests and logs that the server didn't respond.

We should investigate this issue to avoid potential problems on production systems in the future. First, we need to enable more logging on the server side, or add logging at some early stage of request processing in gRPC, and observe what happens where.

@novoj novoj added the bug Something isn't working label Sep 13, 2023
@novoj novoj added this to the Alpha milestone Sep 13, 2023
@novoj
Collaborator Author

novoj commented Sep 16, 2023

The problem also occurred locally, with sun.security.validator.ValidatorException: the local Java environment doesn't trust the server certificate:

[
[
  Trust Anchors: [[
  Trusted CA cert: [
[
  Version: V3
  Subject: CN=evitaDB-CA-selfSigned
  Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11

  Key:  Sun RSA public key, 2048 bits
  params: null
  modulus: 21182750260160774293320731316303415468851655991244789180583673183040271091351406466075036536381217053183358505587630045565757052594780362512230007780218542650225029946186355776474961632610607874807175940801728945101406659536509357635676937753142655807141470212783743944892618690594908136015158097120653915111914692229499139606747535422728431464810411621257400558305224700776990401481946738651816080585866816975178640306340642214659752722614462264785323175858681907481416363929125668336852753199530768874170394330325606412435126620305617920965573263784082517101687145459598428154385750333098726312911494818806099848551
  public exponent: 65537
  Validity: [From: Sat Sep 16 09:05:20 CEST 2023,
               To: Sun Sep 15 09:05:20 CEST 2024]
  Issuer: CN=evitaDB-CA-selfSigned
  SerialNumber: [    018a9cce e812]

Certificate Extensions: 4
[1]: ObjectId: 2.5.29.35 Criticality=false
AuthorityKeyIdentifier [
KeyIdentifier [
0000: 3C 39 00 96 2A 4B 74 64   BF 06 E6 66 47 A0 0E 0F  <9..*Ktd...fG...
0010: E2 44 F8 03                                        .D..
]
]

[2]: ObjectId: 2.5.29.19 Criticality=true
BasicConstraints:[
  CA:true
  PathLen: no limit
]

[3]: ObjectId: 2.5.29.15 Criticality=true
KeyUsage [
  DigitalSignature
  Key_CertSign
]

[4]: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 3C 39 00 96 2A 4B 74 64   BF 06 E6 66 47 A0 0E 0F  <9..*Ktd...fG...
0010: E2 44 F8 03                                        .D..
]
]

]
  Algorithm: [SHA256withRSA]
  Signature:
0000: A3 15 E5 68 9B FD 91 CA   4F 6B DE B0 10 6C 37 6F  ...h....Ok...l7o
0010: AC 34 8C FF B0 EB CB 84   43 58 5C C7 9B D0 55 A0  .4......CX\...U.
0020: 45 DB 59 B8 FB 04 D3 9D   EC 25 25 BE 8A A2 5B A9  E.Y......%%...[.
0030: C5 6A FA 48 BA 4A 21 37   DC 8B B1 B4 49 99 6D DA  .j.H.J!7....I.m.
0040: 10 80 6A 0C CF 46 16 93   68 99 62 F7 40 E8 1F 6D  ..j..F..h.b.@..m
0050: FB B8 59 01 A1 70 21 F4   87 55 8A 07 DB 9D B0 B0  ..Y..p!..U......
0060: 9A B9 8C 50 87 42 37 33   1B A6 99 D5 BF B4 FA 8F  ...P.B73........
0070: 79 EE 0A 28 19 36 D1 9F   D5 F5 9C 6D 3A 81 84 C9  y..(.6.....m:...
0080: F0 28 03 7F 5D 51 CD DD   16 0B EA DC 3E D1 B7 20  .(..]Q......>.. 
0090: 47 16 25 99 99 9A 23 E1   65 3C EB 7A 7D 03 FD 5D  G.%...#.e<.z...]
00A0: 58 81 61 93 7F 7E 4F 8D   64 E6 AF 92 C0 8E A1 1F  X.a...O.d.......
00B0: 41 A9 50 51 AA 14 18 CE   4B 94 4B 07 5B F1 4A A0  A.PQ....K.K.[.J.
00C0: 69 39 8F F1 AA D5 CD 62   A3 A2 9D 82 43 6E CF 45  i9.....b....Cn.E
00D0: 5B 2A B6 07 EA C1 2F 9F   33 FF 83 42 2F A4 86 E8  [*..../.3..B/...
00E0: F3 8C 44 3C B4 94 23 6E   F3 12 58 F8 8A DB FE 45  ..D<..#n..X....E
00F0: 2D EE 80 C1 F6 27 2A E4   B8 AC F0 54 94 AD 90 51  -....'*....T...Q

]
]
  Initial Policy OIDs: any
  Validity Date: null
  Signature Provider: null
  Default Revocation Enabled: false
  Explicit Policy Required: false
  Policy Mapping Inhibited: false
  Any Policy Inhibited: false
  Policy Qualifiers Rejected: true
  Target Cert Constraints: null
  Certification Path Checkers: [[]]
  CertStores: [[]]
]  Maximum Path Length: 5
]

What is weird is that it sometimes also fails on demo.evitadb.io, which has a valid Let's Encrypt (LE) certificate.

@novoj
Collaborator Author

novoj commented Sep 17, 2023

It seems that the problem I observed on the local evitaDB instance was a different one. When running the documentation tests I don't see the ValidatorException. I also observed the problem in an implementation other than gRPC:

image

@novoj
Collaborator Author

novoj commented Sep 18, 2023

There are a lot of similar bugs being reported on the web - some of them are summarised in this issue: reactor/reactor-netty#907

It looks like the exception itself is misleading: it can occur whenever the server doesn't respond within a 10s interval, and may not be related to SSL problems at all. The recommendation there is to increase reactor.netty.ioWorkerCount (https://projectreactor.io/docs/netty/release/reference/index.html), but that's not easy for us because Netty is hidden under the gRPC layer.
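To illustrate why an unresponsive peer can surface as an "SSL" failure, here is a minimal sketch using only the JDK (not gRPC, Netty, or evitaDB code). It assumes a local listener that never calls `accept()`, so the TCP connect succeeds but the TLS handshake never gets an answer; the class and method names are made up for this example, and the socket read timeout merely stands in for the handshake timeout mentioned above.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

class SilentPeerHandshake {

    /**
     * Starts a TLS handshake against a local listener that never calls accept()
     * and returns the exception the client ends up with (null if there was none).
     */
    static IOException handshakeAgainstSilentPeer(int timeoutMillis) throws IOException {
        // The OS completes the TCP connect from the listen backlog, but nobody ever answers.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket plain = new Socket()) {
            plain.connect(new InetSocketAddress(silent.getInetAddress(), silent.getLocalPort()), timeoutMillis);
            plain.setSoTimeout(timeoutMillis); // stand-in for the handshake timeout

            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            try (SSLSocket tls = (SSLSocket) factory.createSocket(plain, "localhost", silent.getLocalPort(), true)) {
                tls.startHandshake(); // sends ClientHello, then waits for a reply that never comes
                return null;
            } catch (IOException e) {
                return e; // raised from inside the TLS layer although no certificate was ever seen
            }
        }
    }

    public static void main(String[] args) throws IOException {
        IOException e = handshakeAgainstSilentPeer(500);
        System.out.println(e == null ? "handshake completed" : e.getClass().getName() + ": " + e.getMessage());
    }
}
```

The exact exception class depends on the JDK version, but in every case the failure is reported from the handshake even though the cause is simply a peer that did not respond in time.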

I've also checked whether the clients are closed properly in the documentation tests, in case too many open clients were keeping their connections to the server and thus blocking other clients, but this doesn't seem to be the case.
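A check of this kind can be sketched as follows. `DemoClient` is a hypothetical stand-in for a client that holds one server connection; it is not the evitaDB client API, and the counter exists only for the illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ClientLeakCheck {

    /** Number of DemoClient instances that are currently open. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for a client that holds one server connection. */
    static final class DemoClient implements AutoCloseable {
        DemoClient() {
            OPEN.incrementAndGet();
        }

        String query(String q) {
            return "result of " + q;
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Runs the given number of examples and returns how many clients are still open afterwards. */
    static int runExamples(int count) {
        for (int i = 0; i < count; i++) {
            // try-with-resources closes the client even if the example throws
            try (DemoClient client = new DemoClient()) {
                client.query("example " + i);
            }
        }
        return OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("clients still open: " + runExamples(100));
    }
}
```

If every example releases its client this way, the counter returns to zero after the run; a non-zero value would point to leaked connections.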

So, unfortunately, I have no solution, even after closer examination.

@lukashornych
Collaborator

Yeah, I've come across some of the mentioned issues in the past, but also couldn't come up with a solution. Maybe we will have to find a way to set the ioWorkerCount.

@novoj
Collaborator Author

novoj commented Dec 18, 2023

It seems the problem was caused partly by IPv6 and partly by a too-low connection limit on our HAProxy server. @Khertys, can you elaborate on this? We can close the issue now, since it has stopped happening in recent documentation test runs.

@novoj
Collaborator Author

novoj commented Dec 18, 2023

The changes on the HAProxy side were related to the following setting: https://docs.haproxy.org/2.6/configuration.html#4.2-maxconn

We initially set it to 30. The TCP mode probably plays a role here (rather than HTTP mode, as it is for Tomcat): once those connections are exhausted, HAProxy probably cuts further ones off. We increased it to 100 on demo (port 5556) and the problems went away.
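The effect of such a cap can be sketched with a toy model. This is not HAProxy's implementation: it assumes that every client holds its connection for the whole run (as a long-lived gRPC connection would), that a client gives up after a short wait, and that 50 clients are active at once, which is a made-up number rather than one measured in the documentation tests.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class MaxConnModel {

    /**
     * Counts how many of the given clients obtain a connection slot within waitMillis
     * when at most maxconn connections are allowed and no client releases its slot.
     */
    static int admitted(int maxconn, int clients, long waitMillis) throws InterruptedException {
        Semaphore slots = new Semaphore(maxconn);
        int served = 0;
        for (int i = 0; i < clients; i++) {
            // Slots are never released here: each admitted client keeps its connection open.
            if (slots.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                served++;
            }
        }
        return served;
    }

    public static void main(String[] args) throws InterruptedException {
        int clients = 50; // hypothetical number of simultaneously open client connections
        System.out.println("maxconn=30:  " + admitted(30, clients, 10) + " of " + clients + " served");
        System.out.println("maxconn=100: " + admitted(100, clients, 10) + " of " + clients + " served");
    }
}
```

With a cap of 30, only the first 30 clients are served and the remaining 20 time out; with a cap of 100, all 50 are served. That matches the observed pattern of gRPC calls failing only after a certain point in a long test run.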

@novoj
Collaborator Author

novoj commented Dec 18, 2023

The documentation tests now work reliably over both IPv4 and IPv6. No code changes were necessary, only changes to the HAProxy server settings.

@novoj novoj closed this as completed Dec 18, 2023