Release 0.6.3 #481

chibenwa · 2022-10-03T03:53:15Z

quantranhong1999 · 2022-10-05T03:49:20Z

Change log: https://ci.linagora.com/linagora/lgs/software-linagora-com-website/-/merge_requests/40
Helm James: https://gitlab.com/linagora_private/saas/tools/helm-charts/helm-james/-/merge_requests/15
staging env: https://gitlab.com/linagora_private/saas/tmail/emaily-staging-apps/-/merge_requests/14
sandbox env: https://gitlab.com/linagora_private/saas/tmail/sandbox-emaily-apps/-/merge_requests/10

quantranhong1999 · 2022-10-06T11:28:14Z

IMAP

A bit slower overall than 0.6.2: notable: unselect, searchUnseen, searchDeleted, fetchBody. Maybe relate to search override.

JMAP-Draft

A bit better overall than 0.6.2: same 99th, better 95th and mean. sendMessage 99th slower ~400ms which is understandable regarding Rspamd integration.

JMAP-RFC

3 times hanging:

---- PushPlatformValidation ----------------------------------------------------
[#########################################################################-] 99%
          waiting: 0      / active: 18     / done: 6482  
================================================================================

================================================================================
2022-10-06 11:15:55                                        2243s elapsed
---- Requests ------------------------------------------------------------------
> Global                                                   (OK=826453 KO=28    )
> prepare / retrieveAccountId                              (OK=6500   KO=0     )
> prepare / getMailboxes                                   (OK=6494   KO=6     )
> prepare / emailQuery                                     (OK=12987  KO=1     )
> prepare / emailGetState                                  (OK=6500   KO=0     )
> Open websocket                                           (OK=13059  KO=0     )
> Enable push                                              (OK=6500   KO=0     )
> ping                                                     (OK=521648 KO=2     )
> markAsSeen                                               (OK=18294  KO=1     )
> mailboxChanges                                           (OK=39077  KO=0     )
> emailGet                                                 (OK=104157 KO=16    )
> emailChanges                                             (OK=39081  KO=0     )
> getMailboxes                                             (OK=5305   KO=0     )
> emailQuery                                               (OK=26047  KO=0     )
> submitEmails                                             (OK=12952  KO=2     )
> markAsFlagged                                            (OK=2625   KO=0     )
> markAsAnswered                                           (OK=5227   KO=0     )
---- Errors --------------------------------------------------------------------
> emailGet: Failed to build request: No attribute named 'emailId     77 (40.10%)
s' is defined 
> mailboxChanges: Failed to build request: No attribute named 'm     35 (18.23%)
ailboxState' is defined 
> emailQuery: Failed to build request: No attribute named 'rando     24 (12.50%)
mMailbox' is defined 
> jsonPath($.methodResponses[0][1].list[*]).find.exists, found n     14 ( 7.29%)
othing
> emailQuery: Failed to build request: No attribute named 'inbox     12 ( 6.25%)
ID' is defined 
> markAsSeen: Failed to build request: No attribute named 'email      9 ( 4.69%)
Ids' is defined 
> submitEmails: Failed to build request: No attribute named 'dra      6 ( 3.13%)
ftMailboxId' is defined 
> jsonPath($.methodResponses[0][1].state).find.exists, found not      5 ( 2.60%)
hing
> j.i.IOException: Premature close                                    4 ( 2.08%)
> jsonPath($.methodResponses[0][1].ping).find.is(dummy), but act      2 ( 1.04%)
ually found nothing
> jsonPath($.methodResponses[0][1].ids[*]).findAll.exists, found      1 ( 0.52%)
 nothing
> i.g.h.c.i.RequestTimeoutException: Request timeout to jmap-san      1 ( 0.52%)
dbox.upn.integration-open-paas.org/51.210.211.11:443 after 600...
> markAsAnswered: Failed to build request: No attribute named 'e      1 ( 0.52%)
mailIds' is defined 
> j.n.c.ClosedChannelException                                        1 ( 0.52%)

---- PushPlatformValidation ----------------------------------------------------
[#########################################################################-] 99%

No error log on James side, not sure what happens yet.

Gatling log: https://app.zenhub.com/files/325713638/acca925e-4e76-45f3-b5ec-639344ec8c51/download
Did not have time to check the Gatling log, will do it tmr.

My run params: /root/upn/james/james-gatling, 6500 users, 15m user inject, 20m scenario.

Arsnael · 2022-10-07T02:11:41Z

You could try to do a jmap rfc run without the changed params you did recently on the helm james conf. If it finishes, it means it's one of those. If not then it's more related to the release?

chibenwa · 2022-10-07T02:36:18Z

Flamme graph would be nice for IMAP too IMO

Arsnael · 2022-10-07T03:06:47Z

Hmm regarding IMAP being a bit slower... simple question first. Do you do a little warmup run first before doing the main one or not?

quantranhong1999 · 2022-10-07T03:07:23Z

Hmm regarding IMAP being a bit slower... simple question first. Do you do a little warmup run first before doing the main one or not?

Sure I do, for any case.

Arsnael · 2022-10-07T03:12:51Z

Alright, flame graph it is then :)

quantranhong1999 · 2022-10-07T07:55:30Z

So with the same JMAP RFC simulation, TMail 0.6.2 was ok whereas TMail 0.6.3 (without changed configuration) hangs.

I suspect JMAP RFC 0.6.3 degraded a bit when handling many connections simultaneously (only a few hangs though ~ 10/6500 users).

I adapt the Gatling simulation to early quit for users who can not open a WebSocket connection/ has a premature close HTTP connection: linagora/james-gatling#151

This could avoids those users from trying to do further actions and hung the simulation. I am not sure this could make the simulation 100% stable yet... at least better than all hang right now.

JMAP-RFC result

Basically the same to me.

quantranhong1999 · 2022-10-07T11:00:27Z

I re-test IMAP with a kinda unlimited 2000 concurrent req limit for IMAP -> the perf is the same. -> 200 is a appropriate value for our preprod env.
IMAP flame graph (5 minutes): flame_graph.zip. We can have a look at this together next week.

quantranhong1999 · 2022-10-10T11:13:38Z

Some error logs collected

LDAP

"level":"ERROR","thread":"blocking-call-wrapper-535","logger":"org.apache.james.user.ldap.ReadOnlyLDAPUser","message":"Unexpected error upon authentication for [email protected]","context":"default","exception":"com.unboundid.ldap.sdk.LDAPException: An error occurred while attempting to connect to server 51.91.141.10:389:  IOException(LDAPException(resultCode=82 (local error), errorMessage='A thread was interrupted while waiting for the connect thread to establish a connection to 51.91.141.10:389

Another one:

"level":"ERROR","thread":"blocking-call-wrapper-536","logger":"org.apache.james.jmap.core.ProblemDetails","message":"Unexpected error upon API request","context":"default","exception":"org.apache.james.user.api.UsersRepositoryException: Unable check user existence from ldap\n\tat org.apache.james.user.ldap.ReadOnlyLDAPUsersDAO.getUserByName(ReadOnlyLDAPUsersDAO.java:285)\n\tat com.linagora.tmail.combined.identity.CombinedUserDAO.test(CombinedUserDAO.java:84)\n\tat com.linagora.tmail.combined.identity.CombinedUsersRepository.test(CombinedUsersRepository.java:20)\n\tat org.apache.james.jmap.http.BasicAuthenticationStrategy.$anonfun$isValid$1(BasicAuthenticationStrategy.scala:132)\n\tat org.apache.james.jmap.http.BasicAuthenticationStrategy.$anonfun$isValid$1$adapted(BasicAuthenticationStrategy.scala:132)\n\tat reactor.core.publisher.MonoCallable.call(MonoCallable.java:92)\n\tat reactor.core.publisher.FluxSubscribeOnCallable$CallableSubscribeOnSubscription.run(FluxSubscribeOnCallable.java:227)\n\tat reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)\n\tat reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: com.unboundid.ldap.sdk.LDAPSearchException: Search processing was interrupted while waiting for a response from server 51.91.141.10:389.\n\tat com.unboundid.ldap.sdk.AbstractConnectionPool.search(AbstractConnectionPool.java:2160)\n\

These 2 errors only happen when running JMAP RFC with 10k users (we usually do this for 6.5k users). With many users we overload the LDAP -> some blocking thread hang (waiting for LDAP response) over reactor TTL (60s) and then be interrupted?

Cassandra timeout

"level":"ERROR","thread":"s1-io-0","mdc":{"protocol":"IMAP","mailbox":"INBOX","selectedMailbox":"baf2cdb0-307a-11eb-ad60-d91e26006be3","ip":"10.2.2.0","action":"STATUS","sessionId":"SID-tkgtbjglvbwf","user":"[email protected]","parameters":"StatusDataItems{statusItems=[MESSAGES, RECENT, UID_NEXT, UNSEEN]}"},"logger":"org.apache.james.imap.processor.AbstractMailboxProcessor","message":"Unexpected error during IMAP processing","context":"default","exception":"com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.\n"

A few IMAP STATUS commands fail because of this. Maybe because we overload Cassandra then it fails with requests requiring a high consistency level?

RabbitMQ

"level":"ERROR","thread":"AMQP Connection 51.210.36.126:5672","logger":"com.rabbitmq.client.impl.ForgivingExceptionHandler","message":"An unexpected connection driver error occurred","context":"default","exception":"java.net.SocketTimeoutException: Timeout during Connection negotiation\n\tat com.rabbitmq.client.impl.AMQConnection.handleSocketTimeout(AMQConnection.java:835)\n\tat com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:747)\n\tat com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47)\n\tat com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:666)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n

Likely a driver error more than James itself. Could because RabbitMQ overloaded?

Rspamd

"level":"ERROR","thread":"spooler-388","logger":"org.apache.james.mailetcontainer.impl.ProcessorImpl","message":"Exception calling org.apache.james.rspamd.RspamdScanner: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response\n

I observe this only happens when Rspamd is down or overloaded. Maybe we just need to make sure the mail processing still operates when Rspamd is unavailable. (or it is already?)

Do you consider these blockers?

chibenwa · 2022-10-10T14:53:43Z

LDAP issues

JMAP RFC calls uses basic auth: all JMAP requests hits the LDAP, which is somehow an unrealistic scenario.

Maybe we should have configuration within gatling-jmap on the authentication method?

Use basic-auth
Use X-User header (like with krakend)

Could be a start?

I can live with this flow for now but can you please create a ticket on james-gatling repo?

chibenwa · 2022-10-10T14:59:17Z

Cassandra timeout
A few IMAP STATUS commands fail because of this. Maybe because we overload Cassandra then it fails with requests requiring a high consistency level?

All our requests use QUORUM consistency level. If needed, I can re-explain in detail why. In short: good compromise between consistency and availabilty. That's somewhat of a standard practice in the industry.

Likely yes it shows Cassandra can be the bottleneck.

I would be interested to get the count of error too. Are we speaking of 10? 10.000?

Finally by offloading some of the searches to Cassandra we might load Cassandra a bit more than we should.

chibenwa · 2022-10-10T15:00:43Z

RabbitMQ

What is suspicious is that it even tries to reconnect. What's the time of the log? Was it when re-deploying/scaling James?

I already noticed scaling/re-deploying james can knock out the RabbitMQ cluster.

chibenwa · 2022-10-10T15:03:05Z

Rspamd

rspamd.RspamdScanner: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response\n

I observe this only happens when Rspamd is down or overloaded. Maybe we just need to make sure the mail processing still operates when Rspamd is unavailable. (or it is already?)

Mailet configuration allows fine grain error management.

We should ensure all default configuration is written so that we ignore error of the RspamdScanner mailet.

Can you open a ticket on linagora/james-project for this?

chibenwa · 2022-10-10T15:05:14Z

Do you consider these blockers?

None of these are blockers!

But there is some really valuable lessons to be learned here. I'd be happy to dig a bit more the unknown parts here, especially the Cassandra errors.

Thanks for this very valuable report.

chibenwa · 2022-10-10T15:07:00Z

Regarding Cassandra if this is #481 (comment) with 1 failed status then it do not seem too bad to me, likely we should avoid spending too much time on this.

chibenwa · 2022-10-10T15:10:13Z

Regarding Cassandra, search overrides for searchUnseen might filter a lot of data when executed against Cassandra, and might in the end be more optimal executed against a search index...

Maybe a little configuration tweak to be made: get rid of NotDeletedWithRangeSearchOverride ?

(ticket?)

quantranhong1999 · 2022-10-11T02:56:23Z

I would be interested to get the (IMAP STATUS) count of error too. Are we speaking of 10? 10.000?

just a few ~10

What's the time of the (RabbitMQ) log? Was it when re-deploying/scaling James?

There are only 3 consecutive ones that I am not sure of the timing. Could be as you said.

Tickets:

quantranhong1999 self-assigned this Oct 4, 2022

quantranhong1999 added the I Time spent (1 halth day) label Oct 5, 2022

quantranhong1999 added II Time spent (2 halth day) and removed I Time spent (1 halth day) labels Oct 5, 2022

quantranhong1999 added IV Time spent (4 halth day) and removed II Time spent (2 halth day) labels Oct 7, 2022

quantranhong1999 added V Time spent (5 halth day) and removed IV Time spent (4 halth day) labels Oct 10, 2022

Arsnael closed this as completed Oct 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 0.6.3 #481

Release 0.6.3 #481

chibenwa commented Oct 3, 2022 •

edited by Arsnael

Loading

quantranhong1999 commented Oct 5, 2022

quantranhong1999 commented Oct 6, 2022

Arsnael commented Oct 7, 2022

chibenwa commented Oct 7, 2022

Arsnael commented Oct 7, 2022

quantranhong1999 commented Oct 7, 2022

Arsnael commented Oct 7, 2022

quantranhong1999 commented Oct 7, 2022

quantranhong1999 commented Oct 7, 2022

quantranhong1999 commented Oct 10, 2022 •

edited

Loading

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022 •

edited

Loading

quantranhong1999 commented Oct 11, 2022 •

edited

Loading

Release 0.6.3 #481

Release 0.6.3 #481

Comments

chibenwa commented Oct 3, 2022 • edited by Arsnael Loading

quantranhong1999 commented Oct 5, 2022

quantranhong1999 commented Oct 6, 2022

IMAP

JMAP-Draft

JMAP-RFC

Arsnael commented Oct 7, 2022

chibenwa commented Oct 7, 2022

Arsnael commented Oct 7, 2022

quantranhong1999 commented Oct 7, 2022

Arsnael commented Oct 7, 2022

quantranhong1999 commented Oct 7, 2022

JMAP-RFC result

quantranhong1999 commented Oct 7, 2022

quantranhong1999 commented Oct 10, 2022 • edited Loading

Some error logs collected

LDAP

Cassandra timeout

RabbitMQ

Rspamd

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022

chibenwa commented Oct 10, 2022 • edited Loading

quantranhong1999 commented Oct 11, 2022 • edited Loading

chibenwa commented Oct 3, 2022 •

edited by Arsnael

Loading

quantranhong1999 commented Oct 10, 2022 •

edited

Loading

chibenwa commented Oct 10, 2022 •

edited

Loading

quantranhong1999 commented Oct 11, 2022 •

edited

Loading