Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release 0.6.3 #481

Closed
6 tasks done
chibenwa opened this issue Oct 3, 2022 · 18 comments
Closed
6 tasks done

Release 0.6.3 #481

chibenwa opened this issue Oct 3, 2022 · 18 comments
Assignees
Labels
V Time spent (5 halth day)

Comments

@chibenwa
Copy link
Member

chibenwa commented Oct 3, 2022

  • Release tag
  • Changelog
  • Deploy on sandbox
  • Performance tests
  • Deploy on staging
  • Deploy on LNG
@quantranhong1999 quantranhong1999 self-assigned this Oct 4, 2022
@quantranhong1999 quantranhong1999 added the I Time spent (1 halth day) label Oct 5, 2022
@quantranhong1999
Copy link
Member

@quantranhong1999 quantranhong1999 added II Time spent (2 halth day) and removed I Time spent (1 halth day) labels Oct 5, 2022
@quantranhong1999
Copy link
Member

IMAP

image.png

A bit slower overall than 0.6.2: notable: unselect, searchUnseen, searchDeleted, fetchBody. Maybe relate to search override.

JMAP-Draft

image.png

A bit better overall than 0.6.2: same 99th, better 95th and mean. sendMessage 99th slower ~400ms which is understandable regarding Rspamd integration.

JMAP-RFC

3 times hanging:

---- PushPlatformValidation ----------------------------------------------------
[#########################################################################-] 99%
          waiting: 0      / active: 18     / done: 6482  
================================================================================

================================================================================
2022-10-06 11:15:55                                        2243s elapsed
---- Requests ------------------------------------------------------------------
> Global                                                   (OK=826453 KO=28    )
> prepare / retrieveAccountId                              (OK=6500   KO=0     )
> prepare / getMailboxes                                   (OK=6494   KO=6     )
> prepare / emailQuery                                     (OK=12987  KO=1     )
> prepare / emailGetState                                  (OK=6500   KO=0     )
> Open websocket                                           (OK=13059  KO=0     )
> Enable push                                              (OK=6500   KO=0     )
> ping                                                     (OK=521648 KO=2     )
> markAsSeen                                               (OK=18294  KO=1     )
> mailboxChanges                                           (OK=39077  KO=0     )
> emailGet                                                 (OK=104157 KO=16    )
> emailChanges                                             (OK=39081  KO=0     )
> getMailboxes                                             (OK=5305   KO=0     )
> emailQuery                                               (OK=26047  KO=0     )
> submitEmails                                             (OK=12952  KO=2     )
> markAsFlagged                                            (OK=2625   KO=0     )
> markAsAnswered                                           (OK=5227   KO=0     )
---- Errors --------------------------------------------------------------------
> emailGet: Failed to build request: No attribute named 'emailId     77 (40.10%)
s' is defined 
> mailboxChanges: Failed to build request: No attribute named 'm     35 (18.23%)
ailboxState' is defined 
> emailQuery: Failed to build request: No attribute named 'rando     24 (12.50%)
mMailbox' is defined 
> jsonPath($.methodResponses[0][1].list[*]).find.exists, found n     14 ( 7.29%)
othing
> emailQuery: Failed to build request: No attribute named 'inbox     12 ( 6.25%)
ID' is defined 
> markAsSeen: Failed to build request: No attribute named 'email      9 ( 4.69%)
Ids' is defined 
> submitEmails: Failed to build request: No attribute named 'dra      6 ( 3.13%)
ftMailboxId' is defined 
> jsonPath($.methodResponses[0][1].state).find.exists, found not      5 ( 2.60%)
hing
> j.i.IOException: Premature close                                    4 ( 2.08%)
> jsonPath($.methodResponses[0][1].ping).find.is(dummy), but act      2 ( 1.04%)
ually found nothing
> jsonPath($.methodResponses[0][1].ids[*]).findAll.exists, found      1 ( 0.52%)
 nothing
> i.g.h.c.i.RequestTimeoutException: Request timeout to jmap-san      1 ( 0.52%)
dbox.upn.integration-open-paas.org/51.210.211.11:443 after 600...
> markAsAnswered: Failed to build request: No attribute named 'e      1 ( 0.52%)
mailIds' is defined 
> j.n.c.ClosedChannelException                                        1 ( 0.52%)

---- PushPlatformValidation ----------------------------------------------------
[#########################################################################-] 99%

No error log on James side, not sure what happens yet.

Gatling log: https://app.zenhub.com/files/325713638/acca925e-4e76-45f3-b5ec-639344ec8c51/download
Did not have time to check the Gatling log, will do it tmr.

My run params: /root/upn/james/james-gatling, 6500 users, 15m user inject, 20m scenario.

@quantranhong1999 quantranhong1999 added IV Time spent (4 halth day) and removed II Time spent (2 halth day) labels Oct 7, 2022
@Arsnael
Copy link
Member

Arsnael commented Oct 7, 2022

You could try to do a jmap rfc run without the changed params you did recently on the helm james conf. If it finishes, it means it's one of those. If not then it's more related to the release?

@chibenwa
Copy link
Member Author

chibenwa commented Oct 7, 2022

Flamme graph would be nice for IMAP too IMO

@Arsnael
Copy link
Member

Arsnael commented Oct 7, 2022

Hmm regarding IMAP being a bit slower... simple question first. Do you do a little warmup run first before doing the main one or not?

@quantranhong1999
Copy link
Member

Hmm regarding IMAP being a bit slower... simple question first. Do you do a little warmup run first before doing the main one or not?

Sure I do, for any case.

@Arsnael
Copy link
Member

Arsnael commented Oct 7, 2022

Alright, flame graph it is then :)

@quantranhong1999
Copy link
Member

So with the same JMAP RFC simulation, TMail 0.6.2 was ok whereas TMail 0.6.3 (without changed configuration) hangs.

I suspect JMAP RFC 0.6.3 degraded a bit when handling many connections simultaneously (only a few hangs though ~ 10/6500 users).

I adapt the Gatling simulation to early quit for users who can not open a WebSocket connection/ has a premature close HTTP connection: linagora/james-gatling#151

This could avoids those users from trying to do further actions and hung the simulation. I am not sure this could make the simulation 100% stable yet... at least better than all hang right now.

JMAP-RFC result

image

Basically the same to me.

@quantranhong1999
Copy link
Member

I re-test IMAP with a kinda unlimited 2000 concurrent req limit for IMAP -> the perf is the same. -> 200 is a appropriate value for our preprod env.
IMAP flame graph (5 minutes): flame_graph.zip. We can have a look at this together next week.

@quantranhong1999 quantranhong1999 added V Time spent (5 halth day) and removed IV Time spent (4 halth day) labels Oct 10, 2022
@quantranhong1999
Copy link
Member

quantranhong1999 commented Oct 10, 2022

Some error logs collected

LDAP

"level":"ERROR","thread":"blocking-call-wrapper-535","logger":"org.apache.james.user.ldap.ReadOnlyLDAPUser","message":"Unexpected error upon authentication for [email protected]","context":"default","exception":"com.unboundid.ldap.sdk.LDAPException: An error occurred while attempting to connect to server 51.91.141.10:389:  IOException(LDAPException(resultCode=82 (local error), errorMessage='A thread was interrupted while waiting for the connect thread to establish a connection to 51.91.141.10:389

Another one:

"level":"ERROR","thread":"blocking-call-wrapper-536","logger":"org.apache.james.jmap.core.ProblemDetails","message":"Unexpected error upon API request","context":"default","exception":"org.apache.james.user.api.UsersRepositoryException: Unable check user existence from ldap\n\tat org.apache.james.user.ldap.ReadOnlyLDAPUsersDAO.getUserByName(ReadOnlyLDAPUsersDAO.java:285)\n\tat com.linagora.tmail.combined.identity.CombinedUserDAO.test(CombinedUserDAO.java:84)\n\tat com.linagora.tmail.combined.identity.CombinedUsersRepository.test(CombinedUsersRepository.java:20)\n\tat org.apache.james.jmap.http.BasicAuthenticationStrategy.$anonfun$isValid$1(BasicAuthenticationStrategy.scala:132)\n\tat org.apache.james.jmap.http.BasicAuthenticationStrategy.$anonfun$isValid$1$adapted(BasicAuthenticationStrategy.scala:132)\n\tat reactor.core.publisher.MonoCallable.call(MonoCallable.java:92)\n\tat reactor.core.publisher.FluxSubscribeOnCallable$CallableSubscribeOnSubscription.run(FluxSubscribeOnCallable.java:227)\n\tat reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)\n\tat reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: com.unboundid.ldap.sdk.LDAPSearchException: Search processing was interrupted while waiting for a response from server 51.91.141.10:389.\n\tat com.unboundid.ldap.sdk.AbstractConnectionPool.search(AbstractConnectionPool.java:2160)\n\

These 2 errors only happen when running JMAP RFC with 10k users (we usually do this for 6.5k users). With many users we overload the LDAP -> some blocking thread hang (waiting for LDAP response) over reactor TTL (60s) and then be interrupted?

Cassandra timeout

"level":"ERROR","thread":"s1-io-0","mdc":{"protocol":"IMAP","mailbox":"INBOX","selectedMailbox":"baf2cdb0-307a-11eb-ad60-d91e26006be3","ip":"10.2.2.0","action":"STATUS","sessionId":"SID-tkgtbjglvbwf","user":"[email protected]","parameters":"StatusDataItems{statusItems=[MESSAGES, RECENT, UID_NEXT, UNSEEN]}"},"logger":"org.apache.james.imap.processor.AbstractMailboxProcessor","message":"Unexpected error during IMAP processing","context":"default","exception":"com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.\n"

A few IMAP STATUS commands fail because of this. Maybe because we overload Cassandra then it fails with requests requiring a high consistency level?

RabbitMQ

"level":"ERROR","thread":"AMQP Connection 51.210.36.126:5672","logger":"com.rabbitmq.client.impl.ForgivingExceptionHandler","message":"An unexpected connection driver error occurred","context":"default","exception":"java.net.SocketTimeoutException: Timeout during Connection negotiation\n\tat com.rabbitmq.client.impl.AMQConnection.handleSocketTimeout(AMQConnection.java:835)\n\tat com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:747)\n\tat com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47)\n\tat com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:666)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n

Likely a driver error more than James itself. Could because RabbitMQ overloaded?

Rspamd

"level":"ERROR","thread":"spooler-388","logger":"org.apache.james.mailetcontainer.impl.ProcessorImpl","message":"Exception calling org.apache.james.rspamd.RspamdScanner: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response\n

I observe this only happens when Rspamd is down or overloaded. Maybe we just need to make sure the mail processing still operates when Rspamd is unavailable. (or it is already?)

Do you consider these blockers?

@chibenwa
Copy link
Member Author

LDAP issues

JMAP RFC calls uses basic auth: all JMAP requests hits the LDAP, which is somehow an unrealistic scenario.

Maybe we should have configuration within gatling-jmap on the authentication method?

  • Use basic-auth
  • Use X-User header (like with krakend)

Could be a start?

I can live with this flow for now but can you please create a ticket on james-gatling repo?

@chibenwa
Copy link
Member Author

Cassandra timeout
A few IMAP STATUS commands fail because of this. Maybe because we overload Cassandra then it fails with requests requiring a high consistency level?

All our requests use QUORUM consistency level. If needed, I can re-explain in detail why. In short: good compromise between consistency and availabilty. That's somewhat of a standard practice in the industry.

Likely yes it shows Cassandra can be the bottleneck.

I would be interested to get the count of error too. Are we speaking of 10? 10.000?

Finally by offloading some of the searches to Cassandra we might load Cassandra a bit more than we should.

@chibenwa
Copy link
Member Author

RabbitMQ

What is suspicious is that it even tries to reconnect. What's the time of the log? Was it when re-deploying/scaling James?

I already noticed scaling/re-deploying james can knock out the RabbitMQ cluster.

@chibenwa
Copy link
Member Author

Rspamd

rspamd.RspamdScanner: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response\n

I observe this only happens when Rspamd is down or overloaded. Maybe we just need to make sure the mail processing still operates when Rspamd is unavailable. (or it is already?)

Mailet configuration allows fine grain error management.

We should ensure all default configuration is written so that we ignore error of the RspamdScanner mailet.

Can you open a ticket on linagora/james-project for this?

@chibenwa
Copy link
Member Author

Do you consider these blockers?

None of these are blockers!

But there is some really valuable lessons to be learned here. I'd be happy to dig a bit more the unknown parts here, especially the Cassandra errors.

Thanks for this very valuable report.

@chibenwa
Copy link
Member Author

Regarding Cassandra if this is #481 (comment) with 1 failed status then it do not seem too bad to me, likely we should avoid spending too much time on this.

@chibenwa
Copy link
Member Author

chibenwa commented Oct 10, 2022

Regarding Cassandra, search overrides for searchUnseen might filter a lot of data when executed against Cassandra, and might in the end be more optimal executed against a search index...

Maybe a little configuration tweak to be made: get rid of NotDeletedWithRangeSearchOverride ?

(ticket?)

@quantranhong1999
Copy link
Member

quantranhong1999 commented Oct 11, 2022

I would be interested to get the (IMAP STATUS) count of error too. Are we speaking of 10? 10.000?

just a few ~10

What's the time of the (RabbitMQ) log? Was it when re-deploying/scaling James?

There are only 3 consecutive ones that I am not sure of the timing. Could be as you said.

Tickets:

@Arsnael Arsnael closed this as completed Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V Time spent (5 halth day)
Projects
None yet
Development

No branches or pull requests

3 participants