Performance Engineering Series – Tuning for Performance - Case Study – 4

 ActiveMQ Crash

Introduction

In this article, let us look at an interesting case of ActiveMQ crashing, the symptoms it exhibited, and how it was resolved.  ActiveMQ is a widely used message broker that integrates with a wide range of producing and consuming applications.

In the current context, it is used for sending email notifications: the test application places a message on the queue, which is then consumed by other interfacing applications that deliver the email notifications.  ActiveMQ runs in a redundant active-standby mode.  If the active node goes down for any reason, the other node immediately picks up the processing and continues until it, in turn, goes down.

ActiveMQ is started using a Java Service Wrapper based on the community release version rather than the regular commercial edition.  The limitation this imposes is explained later in the article.

Problem

It was observed that the ActiveMQ queues fill up with pending messages, or with messages that cannot be delivered and end up in the DLQ (Dead Letter Queue).  This build-up eventually fills the disk space used by ActiveMQ.

From the application's perspective, the flow starts with a UI action in which the user creates an email; this triggers a DB-side action that inserts the relevant data and updates a few other tables for housekeeping.  This application-side logic then branches into asynchronous and synchronous paths.  In the asynchronous path, the application simply places a message on the queue and does not track the status of the message.

In the synchronous path, however, the application initiates a DB-side transaction that waits until the message is successfully processed by the consumer.  This is where the problem started.
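
To make the two paths concrete, here is a minimal sketch of what a fire-and-forget send and a blocking request/reply send might look like with a standard JMS client.  This is only an illustration and not the application's actual code; the broker URLs, queue name and class names are assumptions.

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Hypothetical illustration of the two send paths; names and URLs are assumptions.
public class EmailNotificationSender {

    // failover: transport lets the client switch to the standby broker automatically
    private final ConnectionFactory factory =
            new ActiveMQConnectionFactory("failover:(tcp://node1:61616,tcp://node2:61616)");

    // Asynchronous path: place the message on the queue and return immediately.
    public void sendAsync(String payload) throws JMSException {
        Connection con = factory.createConnection();
        try {
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("EMAIL.NOTIFICATIONS"));
            producer.send(session.createTextMessage(payload));
        } finally {
            con.close();
        }
    }

    // Synchronous path: send, then block until the consumer replies (or a timeout).
    // While this call waits, the caller's DB transaction and connection stay open,
    // which is how a backed-up queue can exhaust the connection pool.
    public String sendAndWait(String payload, long timeoutMs) throws JMSException {
        Connection con = factory.createConnection();
        try {
            con.start(); // required to receive the reply
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            TemporaryQueue replyQueue = session.createTemporaryQueue();

            TextMessage msg = session.createTextMessage(payload);
            msg.setJMSReplyTo(replyQueue);
            session.createProducer(session.createQueue("EMAIL.NOTIFICATIONS")).send(msg);

            Message reply = session.createConsumer(replyQueue).receive(timeoutMs);
            return (reply instanceof TextMessage) ? ((TextMessage) reply).getText() : null;
        } finally {
            con.close();
        }
    }
}

The key point is that the synchronous variant ties the caller's progress to consumer throughput, so any backlog on the queue propagates back into the application.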

As the messages piled up, the dependent application created more DB connections to process new incoming requests, which ate up the connection pool and eventually left the application in an unresponsive state.  The intent behind using these two paths was unclear, but it was necessary to first understand the problem, and an attempt was made to remediate it through external measures while the entire code was being investigated for improvements.

As a first step, the issue was replicated in the test environment across various use cases involving the email subsystem.

Diagnosis

The following images of ActiveMQ from the test system show the message delivery issue -

Disk space problems

user@<node2>:/usr/share/activemq/data [UNKNOWN]
# cd kahadb/
user@<node2>:/usr/share/activemq/data/kahadb [UNKNOWN]
# du -sh
4.0G    .
----------------------
user@node1:/usr/share/activemq/data [UNKNOWN]
# cd kahadb/
user@node1:/usr/share/activemq/data/kahadb [UNKNOWN]
# du -sh
2.6G    .

JVM OOM messages

mq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///xxxxxxxxx@616 | WARN  |  | org.eclipse.jetty.util.thread.strategy.EatWhatYouKill | qtp1669904747-74
java.lang.OutOfMemoryError: GC overhead limit exceeded
| INFO  | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@4f6a5034: startup date []; root of context hierarchy | org.apache.activemq.xbean.XBeanBrokerFactory$1 | WrapperSimpleAppMain
 | INFO  | Using Persistence Adapter: MultiKahaDBPersistenceAdapter[/usr/share/activemq/bin/linux-x86-64/../../data/kahadb][KahaDBPersistenceAdapter[/usr/share/activemq/bin/linux-x86-64/../../data/kahadb/queue#3a#2f#2fActiveMQ.DLQ], KahaDBPersistenceAdapter[/usr/share/activemq/bin/linux-x86-64/../../data/kahadb/#210]] | org.apache.activemq.broker.BrokerService | WrapperSimpleAppMain

-----------------------------------

| WARN  | java.lang.OutOfMemoryError: GC overhead limit exceeded | org.eclipse.jetty.io.ManagedSelector | qtp134105631-759
 | WARN  | /api/jolokia/read/org.apache.activemq:type=Broker,brokerName=localhost | org.eclipse.jetty.server.HttpChannel | qtp134105631-386
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:111)[:1.8.0_271]
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:194)[:1.8.0_271]
        at org.json.simple.JSONObject.writeJSONString(JSONObject.java:69)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONObject.writeJSONString(JSONObject.java:73)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONValue.writeJSONString(JSONValue.java:135)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONArray.writeJSONString(JSONArray.java:53)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONArray.writeJSONString(JSONArray.java:59)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONValue.writeJSONString(JSONValue.java:135)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONObject.writeJSONString(JSONObject.java:67)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONObject.writeJSONString(JSONObject.java:73)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONValue.writeJSONString(JSONValue.java:135)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONObject.writeJSONString(JSONObject.java:67)[json-simple-1.1.1.jar:]
        at org.json.simple.JSONObject.writeJSONString(JSONObject.java:73)[json-simple-1.1.1.jar:]
        at org.jolokia.util.IoUtil.streamResponseAndClose(IoUtil.java:34)[jolokia-core-1.6.2.jar:]
        at org.jolokia.http.AgentServlet.sendStreamingResponse(AgentServlet.java:541)[jolokia-core-1.6.2.jar:]
        at org.jolokia.http.AgentServlet.sendResponse(AgentServlet.java:524)[jolokia-core-1.6.2.jar:]
        at org.jolokia.http.AgentServlet.handle(AgentServlet.java:336)[jolokia-core-1.6.2.jar:]
        at org.jolokia.http.AgentServlet.doGet(AgentServlet.java:277)[jolokia-core-1.6.2.jar:]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.servlet.ServletHolder$NotAsyncServlet.service(ServletHolder.java:1386)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:755)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1617)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:226)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:566)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1610)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)[jetty-all-9.4.27.v20200227-uber.jar:9.4.27.v20200227]

Solution(s)

The problem was determined to be caused by the following factors -

  • Undersized heap for the ActiveMQ JVM
  • Untuned GC settings for the ActiveMQ JVM
  • Undersized KahaDB disk space
  • Lack of monitoring of the queues (in test systems)

To fix the above problems, the following recommendations were made; these were implemented and tested in the performance test system with good success -

  • Heap size increased to 4GB (from 1GB)
  • GC algorithm changed to G1GC
  • KahaDB space increased to 16GB (from 4GB)
  • GC logs enabled for the JVM
  • UI side monitoring enabled for the ActiveMQ

Edit the file '/usr/share/activemq/bin/linux-x86-64/wrapper.conf' to change the heap size and GC algorithm and to enable GC logs, as below -

# Initial Java Heap Size (in MB)

wrapper.java.initmemory=4096

# Maximum Java Heap Size (in MB)

wrapper.java.maxmemory=4096

NOTE: Because of the community release version of the wrapper, this is the maximum heap that can be set, irrespective of the total memory available on the host.

For higher heap sizes, an upgrade of the wrapper is also necessary, as described on this page - https://activemq.apache.org/java-service-wrapper

wrapper.java.additional.13=-XX:+UseG1GC

wrapper.java.additional.14=-Xloggc:/var/log/activemq/activemq-gc-%t.log

wrapper.java.additional.15=-XX:+PrintGCDetails

Edit the file '/etc/activemq/activemq.xml' to change the KahaDB space limits, as below -

<systemUsage>
    <systemUsage sendFailIfNoSpace="true">
        <memoryUsage>
            <memoryUsage percentOfJvmHeap="70" />
        </memoryUsage>
        <storeUsage>
            <storeUsage limit="16 gb"/>
        </storeUsage>
        <tempUsage>
            <tempUsage limit="16 gb"/>
        </tempUsage>
    </systemUsage>
</systemUsage>

Additionally, in the test system, all pending messages were discarded on restart by adding the parameter deleteAllMessagesOnStartup="true" to the broker element -

<broker xmlns="http://activemq.apache.org/schema/core" brokerName="localhost" deleteAllMessagesOnStartup="true" dataDirectory="${activemq.data}" schedulerSupport="true">

NOTE - This needs to be implemented with caution in production systems, as it is often necessary to retain and account for undelivered messages in order to identify issues.
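
The monitoring gap was addressed through the ActiveMQ web console, but queue depth can also be checked programmatically.  Below is a minimal sketch that reads the QueueSize attribute over JMX; the JMX URL, broker name and queue names are assumptions and need to match the actual broker configuration.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical queue-depth check over JMX; the URL and names are assumptions.
public class QueueDepthCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://node1:1099/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            for (String queue : new String[] {"EMAIL.NOTIFICATIONS", "ActiveMQ.DLQ"}) {
                // Standard ActiveMQ 5.x queue MBean naming pattern
                ObjectName name = new ObjectName(
                        "org.apache.activemq:type=Broker,brokerName=localhost,"
                                + "destinationType=Queue,destinationName=" + queue);
                Long depth = (Long) mbs.getAttribute(name, "QueueSize");
                System.out.printf("%s -> %d pending messages%n", queue, depth);
            }
        }
    }
}

An alert on the DLQ depth or KahaDB usage crossing a threshold would have surfaced this problem well before the disk filled up.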

Results in test systems

After these changes, ActiveMQ processed around 60 messages/sec without any crashes, and the heap was fully utilized in a healthy pattern without excessive GC.

The following screenshots show the usage (node1 currently active) -


GC log entries of ActiveMQ after the changes

12662.436: [GC pause (G1 Evacuation Pause) (young), 0.0353496 secs]

   [Parallel Time: 33.5 ms, GC Workers: 8]

      [GC Worker Start (ms): Min: 12662436.2, Avg: 12662436.4, Max: 12662436.7, Diff: 0.6]

      [Ext Root Scanning (ms): Min: 0.3, Avg: 0.6, Max: 1.1, Diff: 0.8, Sum: 5.1]

      [Update RS (ms): Min: 18.6, Avg: 25.4, Max: 26.8, Diff: 8.2, Sum: 203.5]

         [Processed Buffers: Min: 48, Avg: 63.5, Max: 72, Diff: 24, Sum: 508]

      [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4]

      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]

      [Object Copy (ms): Min: 5.5, Avg: 6.9, Max: 14.2, Diff: 8.7, Sum: 55.1]

      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]

         [Termination Attempts: Min: 1, Avg: 2.4, Max: 5, Diff: 4, Sum: 19]

      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.7]

      [GC Worker Total (ms): Min: 32.8, Avg: 33.1, Max: 33.3, Diff: 0.5, Sum: 264.9]

      [GC Worker End (ms): Min: 12662469.4, Avg: 12662469.5, Max: 12662469.6, Diff: 0.2]

   [Code Root Fixup: 0.0 ms]

   [Code Root Purge: 0.0 ms]

   [Clear CT: 0.4 ms]

   [Other: 1.4 ms]

      [Choose CSet: 0.0 ms]

      [Ref Proc: 0.4 ms]

      [Ref Enq: 0.0 ms]

      [Redirty Cards: 0.4 ms]

      [Humongous Register: 0.1 ms]

      [Humongous Reclaim: 0.0 ms]

      [Free CSet: 0.3 ms]

   [Eden: 584.0M(584.0M)->0.0B(564.0M) Survivors: 10.0M->24.0M Heap: 3654.4M(4096.0M)->3085.4M(4096.0M)]

 [Times: user=0.25 sys=0.00, real=0.04 secs]

Next steps

As described in the earlier sections, it was still necessary to review the application code logic that caused the DB connections to saturate.  In addition, a move from the community edition of the wrapper to the commercial edition was recommended.  This being a very old application, such problems were never faced in the past, so the wrapper never came into focus.  With the growth in usage of the application over the years, however, the limitations have become evident and warranted a detailed investigation.

Conclusions and learning

I hope this article presented an overview of how ActiveMQ is used and the areas to look into when issues arise, without getting into the actual application code.  There is an ocean of documentation around ActiveMQ, along with many discussion forums covering associated problems, and I urge the reader to browse through them to gain further knowledge.

Finally, as I always state, there can be many solutions to a problem; the one implemented depends on the thought process employed and may not be the only option.  This holds true in the current situation as well, and I encourage the reader to debug further and discover potential alternative solutions and approaches.

Cheers and see you in another article!

