Recovery Scenarios

The recovery scenarios in the following sections assume the default health check settings from tc.properties:

l2.healthcheck.l1.ping.idletime = 5000
l2.healthcheck.l1.ping.interval = 1000
l2.healthcheck.l1.ping.probes = 3
l2.healthcheck.l1.socketConnectTimeout = 5
l2.healthcheck.l1.socketConnectCount = 10

l2.healthcheck.l2.ping.idletime = 5000
l2.healthcheck.l2.ping.interval = 1000
l2.healthcheck.l2.ping.probes = 3
l2.healthcheck.l2.socketConnectTimeout = 5
l2.healthcheck.l2.socketConnectCount = 10

l1.healthcheck.l2.ping.idletime = 5000
l1.healthcheck.l2.ping.interval = 1000
l1.healthcheck.l2.ping.probes = 3
l1.healthcheck.l2.socketConnectTimeout = 5
l1.healthcheck.l2.socketConnectCount = 13

  1. Default health monitoring parameters in tc.properties
    • L1 - L2 detection of failure
      • 109 secs, if L2 is reachable and L1 can initiate a new socket connection. This allows a maximum GC pause of 109 secs on L2.
      • 13 secs, if connectivity to L2 is broken and L1 cannot create a new socket connection to L2.
    • L2 - L2 detection of failure
      • 85 secs, if the peer L2 is reachable and the other L2 can initiate a new socket connection to it. This applies in both directions (active to passive and passive to active).
      • 13 secs, if connectivity to the peer L2 is broken and the other L2 cannot create a new socket connection to it. This also applies in both directions.
    • L2 - L1 detection of failure
      • 85 secs, if L1 is reachable and L2 can initiate a new socket connection. This allows a maximum of 85 secs GC on L1.
      • 13 secs, if the connectivity to L1 is broken and L2 cannot create a new socket connection.
      • If L2 cannot initiate a socket connection during the first connection cycle (for example, because of firewall settings), a socket connection failure message is printed in the server logs and all the L2→L1 health check properties for that client are multiplied by a factor of 10.
  2. Reconnect properties
    • L2 - L1 reconnect parameters
      • l2.l1reconnect.enabled = true (default is false)
      • l2.l1reconnect.timeout.millis = 15000 (default is 5000)
    • L2 - L2 reconnect properties
      • l2.nha.tcgroupcomm.reconnect.enabled = true (default is false)
      • l2.nha.tcgroupcomm.reconnect.timeout = 15000 (default is 2000)
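
The detection times listed under item 1 can be reproduced from the health check settings with a little arithmetic. The sketch below is an assumption inferred from the listed numbers, not a statement of the product's exact algorithm: it treats socketConnectTimeout as a multiplier of ping.interval and assumes that one probe cycle costs ping.interval * ping.probes + socketConnectTimeout * ping.interval. The class and method names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """One direction of health checking, e.g. the l2.healthcheck.l1.* settings."""
    ping_idletime_ms: int
    ping_interval_ms: int
    ping_probes: int
    socket_connect_timeout: int  # assumed to be a multiplier of ping_interval_ms
    socket_connect_count: int

    def cycle_ms(self) -> int:
        # Assumed cost of one probe cycle: all ping probes plus one socket connect attempt.
        return (self.ping_interval_ms * self.ping_probes
                + self.socket_connect_timeout * self.ping_interval_ms)

    def reachable_ms(self) -> int:
        # Peer is reachable (for example, paused in GC): every socket connect
        # succeeds, so the cycle repeats socket_connect_count times.
        return self.ping_idletime_ms + self.socket_connect_count * self.cycle_ms()

    def unreachable_ms(self) -> int:
        # Connectivity is broken: the first socket connect fails, so only one cycle runs.
        return self.ping_idletime_ms + self.cycle_ms()


# Values copied from the tc.properties listing above.
L2_TO_L1 = HealthCheck(5000, 1000, 3, 5, 10)
L2_TO_L2 = HealthCheck(5000, 1000, 3, 5, 10)
L1_TO_L2 = HealthCheck(5000, 1000, 3, 5, 13)

for name, hc in [("L1 - L2", L1_TO_L2), ("L2 - L2", L2_TO_L2), ("L2 - L1", L2_TO_L1)]:
    print(name, hc.reachable_ms() // 1000, "secs reachable,",
          hc.unreachable_ms() // 1000, "secs unreachable")
```

Under these assumptions the only setting that separates the 109 secs figure from the 85 secs figure is socketConnectCount (13 versus 10), while the 13 secs figure does not depend on it.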

With the above parameters set:

  • Max GC allowed at L1 before it is quarantined from the cluster = L2-L1 health monitoring = 85 secs
  • Max GC allowed at passive L2 before it is quarantined by active L2 = L2-L2 health monitoring = 85 secs
  • Max GC allowed at active L2 before:
    • Passive takes over = L2-L2 health monitoring (85 secs) + election time (5 secs) = 90 secs
    • L1 disconnects from the active L2 and tries to connect to another L2 = L1-L2 health monitoring = 109 secs (a server paused in GC is still reachable, so the 109 secs figure for a reachable L2 applies)
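
The quarantine and takeover windows above are simple sums, which the following sketch spells out. It assumes the same inferred timing formula as the health check figures (idle time plus socketConnectCount probe cycles) and takes the 5 secs election time from the text. The function and variable names are hypothetical.

```python
def detection_secs(idletime_ms, interval_ms, probes, connect_timeout, connect_count):
    """Assumed time to declare a reachable but unresponsive peer dead, in seconds."""
    # One cycle: all ping probes plus one socket connect attempt, where
    # connect_timeout is assumed to be a multiplier of the ping interval.
    cycle_ms = interval_ms * probes + connect_timeout * interval_ms
    return (idletime_ms + connect_count * cycle_ms) // 1000


ELECTION_TIME_SECS = 5  # election time as stated in the text

# l2.healthcheck.l1.* and l2.healthcheck.l2.* share the same default values.
l2_l1 = detection_secs(5000, 1000, 3, 5, 10)
l2_l2 = detection_secs(5000, 1000, 3, 5, 10)

gc_windows = {
    "L1, before L2 quarantines it": l2_l1,
    "passive L2, before the active quarantines it": l2_l2,
    "active L2, before the passive takes over": l2_l2 + ELECTION_TIME_SECS,
}

for node, secs in gc_windows.items():
    print(f"max GC at {node}: {secs} secs")
```

Changing socketConnectCount or the ping settings in tc.properties shifts all three windows, so it can help to recompute them before tuning for long GC pauses.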