Openfire Clustering Plugin Status

Over the past few weeks a great deal of defect fixing and optimization has gone into the plugin, and load testing has accompanied that optimization work for the past two weeks. The plugin has gone from being unable to log in more than 1,000 users per machine to peaking at 17,000 users logged in and sending messages to each other. I believe that with further tuning the JBoss clustering implementation could be made faster, more reliable, and production ready.

What follows is an outline of what has already been done to optimize and test the cluster as well as a roadmap for the optimizations that are still needed and the challenges facing the implementation.

Environment

The current testing environment consists of five VMs. Each is a dual-core Xeon running at 2332.512 MHz with 2 GB of RAM. Three of these machines run the Openfire cluster, and the remaining two are used to generate load against the three-node Openfire cluster.

The JVM for the Openfire server is being run with the following options:

-Djava.net.preferIPv4Stack=true -Xms2048m -Xmx2048m -XX:NewSize=768M -XX:MaxNewSize=768M 
-XX:MaxTenuringThreshold=4 -XX:SurvivorRatio=6 -XX:+ScavengeBeforeFullGC -XX:PermSize=256M -XX:MaxPermSize=256M 
-XX:+HandlePromotionFailure -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=3 
-XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -verbosegc 
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/openfire/gc.dat

There are also a few machine settings that must be changed for optimal cluster performance; more information is available on the setup page.

Testing

Clustering is being tested using a program called Tsung (http://tsung.erlang-projects.org). The Tsung script connects a new client to the cluster every 0.01 seconds. Each client then retrieves a 30-user roster, sends a message to an online user, sends a message to an offline user, and disconnects.
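
The Tsung scenario itself is defined in XML; purely to illustrate what each simulated client does, the steps look roughly like the following sketch using the Smack client library (the host name and account credentials are placeholders, and the real load is generated by Tsung, not by Java clients):

    import org.jivesoftware.smack.Roster;
    import org.jivesoftware.smack.XMPPConnection;
    import org.jivesoftware.smack.XMPPException;
    import org.jivesoftware.smack.packet.Message;

    // Rough equivalent of one Tsung session: log in, fetch the roster,
    // message an online and an offline contact, then disconnect.
    public class LoadClient {
        public static void main(String[] args) throws XMPPException {
            XMPPConnection conn = new XMPPConnection("cluster.example.com"); // placeholder host
            conn.connect();
            conn.login("user0001", "secret");            // placeholder credentials

            Roster roster = conn.getRoster();            // ~30 entries in the test accounts
            System.out.println("Roster size: " + roster.getEntryCount());

            Message toOnline = new Message("user0002@cluster.example.com", Message.Type.chat);
            toOnline.setBody("hello (online target)");
            conn.sendPacket(toOnline);

            Message toOffline = new Message("user9999@cluster.example.com", Message.Type.chat);
            toOffline.setBody("hello (offline target)");
            conn.sendPacket(toOffline);

            conn.disconnect();
        }
    }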

Under this load test the cluster has reached a peak of 17,000 logged-in users per node. At that point memory and CPU usage on each node begin to max out the system resources, and response time for connecting new users and for messaging between clients slows considerably.

Optimization

The biggest single reason for the performance increase has been changing the replication mode from synchronous to asynchronous. Because nodes in the cluster no longer wait for a response from every other node, we have been able to scale from the few thousand users we could initially connect to tens of thousands. This did, however, introduce other issues: for example, if a task for a newly logged-in user on node A arrives at node B before the cache has replicated, the task will fail. This can lead to messages being missed or to incorrect presence broadcasts.

As a temporary solution to this issue, a task may attempt to execute a few times before failing. Currently a packet routing task will attempt to execute 5 times within 500 ms before failing. So far the only situation where this has been observed is just after a user logs in, while their session information has yet to replicate to all caches. To scale further, a better solution may be needed for this case.
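
As a sketch of the idea only (not the plugin's actual code), the retry behaviour looks roughly like this, using the 5-attempt / 500 ms budget described above:

    // Illustrative retry loop: attempt the routing task up to 5 times over
    // roughly 500 ms, giving cache replication time to catch up.
    public final class RetryExample {

        public static void main(String[] args) throws InterruptedException {
            boolean delivered = false;
            for (int attempt = 1; attempt <= 5 && !delivered; attempt++) {
                delivered = lookupSessionAndRoutePacket();  // hypothetical routing step
                if (!delivered) {
                    Thread.sleep(100);  // ~5 x 100 ms = the 500 ms budget described above
                }
            }
            if (!delivered) {
                System.err.println("Routing task failed: session not yet replicated");
            }
        }

        // Hypothetical stand-in for looking up the freshly created session and
        // delivering the packet; returns false while replication has not completed.
        private static boolean lookupSessionAndRoutePacket() {
            return false;
        }
    }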

Other optimizations have focused on reducing the size of the objects passed between cluster nodes. One example is the ClusterSession and ClusteredClientSession: most of the fields on these objects were being serialized when typically the only thing needed is the JID. Those fields have been replaced with tasks that can be executed to retrieve the information if it is needed. These tasks are rarely called, and by using them in place of the fields, messaging between nodes in the cluster has improved significantly.
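
A rough illustration of the pattern follows; the class and method names are hypothetical rather than the plugin's actual API. The remote stub keeps only the JID, and anything else is fetched with a small task run on the node that owns the real session:

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    // Hypothetical sketch: instead of serializing every session field across the
    // cluster, ship only the JID and fetch other attributes on demand with a task
    // that runs on the node owning the real session.
    public class GetSessionStatusTask implements Externalizable {

        private String jid;       // the only state that crosses the wire
        private int status;       // filled in on the owning node

        public GetSessionStatusTask() { }             // required for deserialization

        public GetSessionStatusTask(String jid) {
            this.jid = jid;
        }

        // Executed on the node that holds the real session object.
        public void run() {
            status = lookupLocalSessionStatus(jid);   // hypothetical local lookup
        }

        public int getResult() {
            return status;
        }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(jid);                        // a single string, not a whole session
        }

        public void readExternal(ObjectInput in) throws IOException {
            jid = in.readUTF();
        }

        private static int lookupLocalSessionStatus(String jid) {
            return 0;  // placeholder
        }
    }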

Openfire also has an internal cache locking mechanism. Its initial implementation was causing thread deadlocks and slowdowns. Since the switch to asynchronous replication, the locking has been moved out of Openfire and JBoss Cache handles it itself. This has eliminated the deadlocking and thread-slowdown issues that were being seen in the cluster and has greatly increased the speed of user login, presence updates, and messaging between nodes.

Roadmap

To get the JBossCache clustering plugin to a production-ready state, there will need to be improvements in the program's memory usage, in how quickly tasks can be sent over the wire and processed, and in the number of users that can log into a single node. Below I have outlined the challenges ahead of this implementation and some of the work items that need to be done in order to deliver a highly reliable, scalable clustering plugin for Openfire.

Challenges

Memory Usage

Currently memory usage can climb very high and stay there. All cache information from each node in the cluster is distributed to every node in the cluster, which can cause memory usage to skyrocket; with no eviction policy or caps in place, this could become a real problem as the number of users and nodes in the cluster grows.

Recommendation

Implement an eviction policy. Some caches, such as the user session cache, need to be able to grow without limits; others can be capped and evicted incrementally with minimal impact on the speed of the cluster. Some caches could also be converted to strictly local caches, since their information is only needed on the node where the user session originates.
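
JBoss Cache provides its own eviction configuration, which is where these caps would actually be set; purely as a language-level illustration of a capped, incrementally evicting cache (not the JBoss Cache API), consider:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustration of a capped, incrementally evicting local cache (LRU):
    // the least recently used entry is dropped once the cap is reached. The real
    // change would use JBoss Cache's eviction configuration rather than this class.
    public class BoundedLocalCache<K, V> extends LinkedHashMap<K, V> {

        private final int maxEntries;

        public BoundedLocalCache(int maxEntries) {
            super(16, 0.75f, true);   // access-order = true gives LRU behaviour
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }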

If the number of users and cluster nodes keeps growing, some caches may need to be offloaded into a store that is not a replicated JBoss Cache. Something such as a memcached cluster could serve as the back-end store for user sessions should the cluster grow to such a size.

Something that would also help both testing and the implementation itself would be to move the cluster nodes onto more powerful hardware. Processor speed and RAM are the two resources currently keeping the cluster nodes from hosting more users.

Cluster Speed

Currently, under high load, it may take a few seconds for a user to log in or perform an action (retrieve presence, send a message, obtain a roster, etc.). Some of these delays are transparent and some are not, but in either case the system needs to become more efficient to scale up to more simultaneous users.

Recommendation

Currently the marshalling of tasks sent over the wire is inefficient. Tasks are written using built-in Java serialization and passed over the wire. Many tasks are executed each second between nodes, so picking a more efficient protocol to encode these messages should be a high priority. Using a wire format such as Thrift (http://incubator.apache.org/thrift/) or Protocol Buffers (http://code.google.com/p/protobuf/) would allow for much smaller messages as well as faster marshalling and unmarshalling of tasks.
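
To make the cost concrete, the snippet below (illustrative only, with invented field values) compares default Java serialization of a small task-like object against a hand-packed encoding of the same fields, which is roughly the kind of saving a Thrift or protobuf message would provide:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Rough comparison of wire sizes: default Java serialization carries class
    // metadata with every object, while a compact encoding (as Thrift or protobuf
    // would produce) writes little more than the field values themselves.
    public class WireSizeDemo {

        static class RoutePacketTask implements Serializable {   // invented example task
            String fromJid = "alice@example.com/laptop";
            String toJid = "bob@example.com";
            String body = "hello";
        }

        public static void main(String[] args) throws Exception {
            RoutePacketTask task = new RoutePacketTask();

            // Default Java serialization.
            ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(javaBytes);
            oos.writeObject(task);
            oos.close();

            // Hand-packed encoding of the same three fields.
            ByteArrayOutputStream packedBytes = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(packedBytes);
            dos.writeUTF(task.fromJid);
            dos.writeUTF(task.toJid);
            dos.writeUTF(task.body);
            dos.close();

            System.out.println("Java serialization: " + javaBytes.size() + " bytes");
            System.out.println("Packed encoding:    " + packedBytes.size() + " bytes");
        }
    }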

Along the same lines, the ExternalizableUtilStrategy could be replaced with something more efficient. A more efficient strategy for serializing objects would go a long way toward increasing the speed of the cluster nodes; perhaps this too could be replaced with something like Thrift.

The ProcessPacketTask could also be made more efficient. In the current implementation, when the ProcessPacketTask is run on the target server the entire processing stack is invoked again, even though the packet has already been partially processed on the server that originated the task. This causes repeat calls back to the originating node for JID information. These unnecessary calls could be eliminated by including the JID information directly in the task.
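
A sketch of that change follows; the class and field names are hypothetical and the real ProcessPacketTask differs, but the idea is simply to serialize the originating JID alongside the packet so the receiving node has everything it needs locally:

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    // Hypothetical sketch: carry the sender's JID inside the task itself so the
    // target node does not have to call back to the originating node for it.
    public class ProcessPacketTaskSketch implements Externalizable {

        private String senderJid;   // included up front instead of fetched remotely
        private String packetXml;   // the (already partially processed) packet

        public ProcessPacketTaskSketch() { }

        public ProcessPacketTaskSketch(String senderJid, String packetXml) {
            this.senderJid = senderJid;
            this.packetXml = packetXml;
        }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(senderJid);
            out.writeUTF(packetXml);
        }

        public void readExternal(ObjectInput in) throws IOException {
            senderJid = in.readUTF();
            packetXml = in.readUTF();
        }
    }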

Implementation Incomplete

The clustering interface is not yet fully implemented. The majority of the functionality is working (users can send messages, retrieve rosters, and get presence information), but several parts of the interface remain unfinished, including finding the ClusterComponentSession, ConnectionMultiplexerSession, IncomingServerSession, and ClusterOutgoingSession.

Recommendation

Find the unfinished portions of the plugin and finish them!

Tasks

The following tasks must be completed for the JBossClustering plugin to become production worthy:
* Implement cache eviction for JBossCache
* Split out caches that should be local from replicated caches
* Convert the wire format of tasks to a more efficient format (Thrift, protobuf?)
* Explore the possibility of using a more efficient wire protocol for task return objects (Thrift, protobuf?)
* Fix inefficiencies in the task-processing code
* Implement the missing pieces of the plugin
