Terracotta Clustering Best Practices

The following sections contain advice on optimizing the operations of a Terracotta cluster.

Analyze Java Garbage Collection (GC)
Detect Memory Pressure Using the Terracotta Logs
Reduce Swapping
Keep Disks Local
Do Not Interrupt!
Diagnose Client Disconnections
Bring a Cluster Back Up in Order
Manage Sessions in a Cluster

Analyze Java Garbage Collection (GC)

Long GC cycles are one of the most common causes of issues in a Terracotta cluster because a full GC event pauses all threads in the JVM. Servers disconnecting clients, clients dropping servers, OutOfMemoryErrors, and timed-out processes are just some of the problems long GC cycles can cause.

Having a clear understanding of how your application behaves with respect to creating garbage, and how that garbage is being collected, is necessary for avoiding or solving these issues.

Printing and Analyzing GC Logs

The most effective way to gain that understanding is to create a profile of GC in your application by using tools made for that purpose. Consider using JVM options to generate logs of GC activity:

-verbose:gc
-Xloggc:<filename>
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps

Apply an appropriate parsing and visualization tool to GC log files to help analyze their contents.

Observing GC Statistics With jstat

One way to observe GC statistics is by using the Java utility jstat. The following command will produce a log of GC statistics, updated every ten seconds:

jstat -gcutil <pid> 10s 1000000

An important statistic is the Full Garbage Collection Time (the FGCT column). Because this value is cumulative, the difference between consecutive readings is the time spent in full collections during that interval, which is roughly how long the system was paused. A jump of more than a few seconds will not be acceptable in most application contexts.
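
The comparison of consecutive readings can be automated. The following is a minimal sketch, assuming the jstat output has been captured as a list of lines; the class name FgctDelta and the sample readings are hypothetical and are not taken from a real system. The FGCT column is located by its header name because the set of columns printed by jstat differs between JDK versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: per-interval full-GC time derived from captured "jstat -gcutil" output. */
final class FgctDelta {

  /** Returns the increase of the FGCT column between consecutive readings, in seconds. */
  static List<Double> deltas(List<String> lines) {
    List<Double> result = new ArrayList<Double>();
    int column = -1;
    double previous = Double.NaN;
    for (String line : lines) {
      String[] fields = line.trim().split("\\s+");
      int headerIndex = Arrays.asList(fields).indexOf("FGCT");
      if (headerIndex >= 0) {
        column = headerIndex;      // header line; remember where FGCT is
        continue;
      }
      if (column < 0 || fields.length <= column) {
        continue;                  // blank line, or data before any header
      }
      double current;
      try {
        current = Double.parseDouble(fields[column]);
      } catch (NumberFormatException e) {
        continue;                  // not a numeric reading; skip it
      }
      if (!Double.isNaN(previous)) {
        result.add(current - previous);
      }
      previous = current;
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative readings only, not measurements.
    List<String> sample = Arrays.asList(
        "  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT",
        "  0.00  12.50  40.10  55.20  60.00    120    1.250     3    0.500    1.750",
        "  0.00  12.50  80.30  55.20  60.00    120    1.250     3    0.500    1.750",
        " 10.00   0.00   5.00  30.00  60.00    121    1.260     4    4.750    6.010");
    for (double delta : deltas(sample)) {
      System.out.println("full-GC time in interval (s): " + delta);
    }
  }
}
```

An interval whose delta is larger than the pause your cluster tolerates is a candidate for the solutions described in the next section.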

Solutions to Problematic GC

Once your application's typical GC cycles are understood, consider one or more of the following solutions:

Using BigMemory to eliminate the drag GC imposes on performance in large heaps.

BigMemory opens up off-heap memory for use by Java applications, and off-heap memory is not subject to GC.
Configuring the HealthChecker parameters in the Terracotta cluster to account for the observed GC cycles.

Increase nodes' tolerance of inactivity in other nodes due to GC cycles.
Tuning the GC parameters to change the way GC runs in the heap.

If running on multi-core machines and no collector is specifically configured, consider -XX:+UseParallelGC and -XX:+UseParallelOldGC.

If running multiple JVMs or application processes on the same machine, tune the number of concurrent threads in the parallel collector with -XX:ParallelGCThreads=<number>.

Another collector is Concurrent Mark Sweep (CMS). This collector is normally not recommended (especially for Terracotta servers) due to certain performance and operational issues it raises. However, depending on the hosting platform and on how the application uses its data, it may boost performance and may be worth testing.
If running on a 64-bit JVM, and if your JDK supports it, use -XX:+UseCompressedOops.

This setting can substantially reduce the memory footprint of the object pointers used by the JVM.

Detect Memory Pressure Using the Terracotta Logs

Terracotta server and client logs contain messages that help you track memory usage. Locations of server and client logs are configured in the Terracotta configuration file, tc-config.xml.

You can view the state of memory usage in a node by finding messages similar to the following:

2011-12-04 14:47:43,341 [Statistics Logger] ... memory free : 39.992699 MB
2011-12-04 14:47:43,341 [Statistics Logger] ... memory used : 1560.007301 MB
2011-12-04 14:47:43,341 [Statistics Logger] ... memory max : 1600.000000 MB

These messages can indicate that the node is running low on memory and could soon experience an OutOfMemoryError. You could take one or more of the following actions:

Increase the heap memory available to Terracotta.

Heap memory available to Terracotta is indicated by the message 2011-12-04 14:47:43,341 [Statistics Logger] ... memory max : 1600.000000 MB.
If increasing heap memory is problematic due to long GC cycles, consider the remedies suggested in Analyze Java Garbage Collection (GC).
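
To watch for this condition without reading the logs by hand, the messages can be parsed. The following is a minimal sketch, assuming the log lines end in the "memory used : ... MB" form shown above; the class name MemoryLogCheck and the 90 percent threshold are hypothetical choices, not Terracotta settings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: computes heap usage from Statistics Logger messages. */
final class MemoryLogCheck {

  // Matches the tail of the message only; the elided middle of the line is ignored.
  private static final Pattern MEMORY_LINE =
      Pattern.compile("memory (free|used|max)\\s*:\\s*([0-9]+(?:\\.[0-9]+)?) MB");

  /** Returns used/max as a percentage based on the last values seen, or NaN if either is missing. */
  static double usedPercent(List<String> logLines) {
    Map<String, Double> latest = new HashMap<String, Double>();
    for (String line : logLines) {
      Matcher m = MEMORY_LINE.matcher(line);
      if (m.find()) {
        latest.put(m.group(1), Double.parseDouble(m.group(2)));
      }
    }
    Double used = latest.get("used");
    Double max = latest.get("max");
    if (used == null || max == null || max == 0.0) {
      return Double.NaN;
    }
    return 100.0 * used / max;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "2011-12-04 14:47:43,341 [Statistics Logger] ... memory free : 39.992699 MB",
        "2011-12-04 14:47:43,341 [Statistics Logger] ... memory used : 1560.007301 MB",
        "2011-12-04 14:47:43,341 [Statistics Logger] ... memory max : 1600.000000 MB");
    double percent = usedPercent(lines);
    double threshold = 90.0; // hypothetical alert level
    if (percent > threshold) {
      System.out.println("Heap usage is above the alert level: " + percent + "%");
    }
  }
}
```

For the three messages shown above, used memory is about 97.5 percent of the maximum, which is the kind of reading that calls for one of the listed actions.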

Reduce Swapping

An operating system (OS) that is swapping to disk can substantially slow down or even stop your application. If the OS is under pressure because Terracotta servers—along with other processes running on a host—are squeezing the available memory, then memory will start to be paged in and out. This type of operation, when too frequent, requires either tuning of the swap parameters or a permanent solution to a chronic lack of RAM.

Many tools are available to help you diagnose swapping. Popular options include using a built-in command-line utility. On Linux, for example:

See available RAM with free -m (display memory statistics in megabytes). Pay attention to swap utilization.
vmstat displays swap-in ("si") and swap-out ("so") numbers. Non-zero values indicate swapping activity. Set vmstat to refresh on a short interval to detect trends.
Process status can be used to get detailed information on all processes running on a node. For example, ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pmem displays processes ordered by memory use. You can also sort by virtual memory size ("vsize") and real memory size ("rss") to focus on both the most memory-consuming processes and their in-memory footprint.

Keep Disks Local

To provide scalability and persistence, and (when necessary) ease pressure on memory to use it more efficiently, Terracotta servers write and read data from a disk-based database. This database should always be on local disks to avoid potential issues from delays or disconnections.

Avoid using SAN, NFS/NAS, and other networked disk stores that could cause lock timeouts and trigger a TCDatabaseException. If you must use a storage system that is not local, avoid using Terracotta persistent mode with the Terracotta Server Array to reduce or eliminate writes to disk.

Hate to see your Terracotta servers rely on disk to ease pressure on memory? Consider adding BigMemory.

Do Not Interrupt!

Ensure that your application does not interrupt clustered threads. This is a common error that can cause the Terracotta client to shut down or go into an error state, after which it will have to be restarted.

The Terracotta client library runs with your application and is often involved in operations which your application is not necessarily aware of. These operations can get interrupted, something the Terracotta client cannot anticipate. Interrupting clustered threads, in effect, puts the client into a state which it cannot handle.
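
One way to stop application threads without interrupting them is cooperative cancellation: the thread checks a flag between units of work and exits on its own. The following is a minimal sketch in plain Java; the class name CooperativeWorker and the unit of work are hypothetical placeholders, and nothing here is a Terracotta API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: a worker that is stopped with a flag rather than with Thread.interrupt(). */
final class CooperativeWorker implements Runnable {

  private final AtomicBoolean stopRequested = new AtomicBoolean(false);
  private final AtomicInteger completed = new AtomicInteger();
  private final Runnable unitOfWork;

  CooperativeWorker(Runnable unitOfWork) {
    this.unitOfWork = unitOfWork;
  }

  @Override
  public void run() {
    // The flag is checked only between units, so a unit in progress always runs to completion.
    while (!stopRequested.get()) {
      unitOfWork.run();
      completed.incrementAndGet();
    }
  }

  /** Asks the worker to exit after the current unit of work. */
  void requestStop() {
    stopRequested.set(true);
  }

  int completedUnits() {
    return completed.get();
  }

  public static void main(String[] args) throws InterruptedException {
    CooperativeWorker worker = new CooperativeWorker(new Runnable() {
      @Override
      public void run() {
        // Placeholder for work that touches clustered state.
        Thread.yield();
      }
    });
    Thread thread = new Thread(worker, "worker");
    thread.start();
    worker.requestStop(); // used in place of thread.interrupt()
    thread.join();
    System.out.println("units completed: " + worker.completedUnits());
  }
}
```

Keep in mind that some standard library calls interrupt threads on your behalf, for example ExecutorService.shutdownNow() and Future.cancel(true), so they deserve the same caution when the threads involved are clustered.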

Diagnose Client Disconnections

If clients disconnect on a regular basis, try the following to diagnose the cause:

Analyze the Terracotta client logs for potential issues, such as long GC cycles.
Analyze the Terracotta server logs for disconnection information and any rejections of reconnection attempts by the client.
See the operator events panel in the Terracotta Developer Console for disconnection events, and note the reason.

If the disconnections are due to long GC cycles or inconsistent network connections in the client, consider the remedies suggested in this section. If disconnections continue to happen, and you are using Ehcache, consider configuring caches with nonstop behavior and enabling rejoin.

Bring a Cluster Back Up in Order

Terracotta servers that are configured to persist data across restarts are operating in "permanent-store" mode. This type of cluster attempts to restore all server data (all "shared" data) when the servers return, and to remember previous clients. This persistence mode is configured in the Terracotta configuration file, tc-config.xml.

In an orderly shutdown, any passive (backup) servers should be taken down first and brought up last. (Note that clients should be brought down before any servers are brought down.) This ensures a database that remains in sync between active and passive servers. Once all passive servers are down, active servers can be shut down. When restoring the cluster, the last server to go down should be brought up first.

If a cluster crashes, and no passive server takes over (even temporarily), then the last active server to go down should be brought up first.
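
The ordering rule amounts to restarting servers in the reverse of the order in which they went down. The following is a minimal sketch, assuming an operations script records the shutdown order; the class name RestartOrder and the server names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch: derives the startup order from a recorded shutdown order. */
final class RestartOrder {

  /** The last server to go down is the first to come up, and so on. */
  static List<String> startupOrder(List<String> shutdownOrder) {
    List<String> order = new ArrayList<String>(shutdownOrder);
    Collections.reverse(order);
    return order;
  }

  public static void main(String[] args) {
    // Passive servers were taken down first, the active server last.
    List<String> shutdownOrder = Arrays.asList("passive-2", "passive-1", "active");
    System.out.println(startupOrder(shutdownOrder));
  }
}
```

With this shutdown order, the active server comes up first and the passive servers come up last, matching the orderly procedure described above.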

Manage Sessions in a Cluster

Make sure the configured time zone and system time are consistent across all application servers. If they differ, a session may appear expired when accessed on different nodes.
Set -Dcom.tc.session.debug.sessions=true and -Dcom.tc.session.debug.invalidate=true to generate more debugging information in the client logs.
All clustered session implementations (including Terracotta Sessions) require that a mutated session object be put back into the session after it is mutated. If the call is missing, the change is known only to the local node, not to the cluster. For example:
```
 HttpSession session = request.getSession();
 Map m = (Map) session.getAttribute("foo"); // getAttribute() returns Object, so a cast is required.
 m.clear();
 session.setAttribute("foo", m); // Without this call, the clear() is not effective across the cluster.
```

Without a setAttribute() call, the session becomes inconsistent across the cluster. Sticky sessions can mask this issue, but as soon as the session is accessed on another node, its state does not match the expected one. To view the inconsistency on a single client node, add the Terracotta property -Dcom.tc.session.clear.on.access=true to force locally cached sessions to be cleared with every access.

If third-party code cannot be refactored to fix this problem, and you are running Terracotta 3.6.0 or higher, you can write a servlet filter that calls setAttribute() at the end of every request. Note that this solution may substantially degrade performance.

    package controller.filter;

    import java.io.IOException;
    import java.util.Enumeration;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    public class IterateFilter implements Filter {

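      public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        // Let the rest of the chain (and the servlet) handle the request first.
        chain.doFilter(request, response);

        // Pass false so that no session is created for requests that did not use one.
        HttpSession session = ((HttpServletRequest) request).getSession(false);
        if (session != null) {
          @SuppressWarnings("rawtypes")
          Enumeration e = session.getAttributeNames();
          while (e.hasMoreElements()) {
            String name = (String) e.nextElement();
            Object value = session.getAttribute(name);
            // Re-set each attribute so that in-place mutations reach the cluster.
            session.setAttribute(name, value);
          }
        }
      }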

      public void init(FilterConfig filterConfig) throws ServletException {
        // No initialization required.
      }

      public void destroy() {
        // No resources to release.
      }
    }