Re: [Robustness] DDP Checkpointing (Part 2)

12 Apr 2000

      Part 2:

-Qiaobing

Terry L Anderson wrote:
...snip...
...
It would seem preferable to either break up state information into a
number of subsets (to avoid resending large amounts each time) or to
only send differences.  Since DDP gives reliable delivery to the other
members of the pool, perhaps only changes need be sent.  Did your trials
solve this problem?  Perhaps each call would be a separate subset.
I think it's a good idea to define a checkpoint call record structure
for each call that should contain the minimal information you need to
recover the call. The call processing engine will checkpoint the call
record at the proper time. You can, for example, index the call record
by call ref number. In general, what and when to checkpoint is solely
an application decision. This is another big reason we left
checkpointing out of DDP.
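
A minimal sketch of such a per-call checkpoint record, indexed by call
ref number. The field names (call_ref, state, remote_addr) and the
CheckpointStore class are hypothetical illustrations, not part of DDP;
the dict stands in for whatever storage the server pool actually shares.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class CallRecord:
    """Minimal state needed to recover one call (hypothetical fields)."""
    call_ref: int
    state: str
    remote_addr: str


class CheckpointStore:
    """Stand-in for storage shared by the server pool (an assumption)."""

    def __init__(self):
        self._records = {}

    def checkpoint(self, record: CallRecord) -> None:
        # Store plain serialized data; the latest checkpoint for a call wins.
        self._records[record.call_ref] = json.dumps(asdict(record))

    def recover(self, call_ref: int) -> CallRecord:
        # Rebuild the record a backup would use to resume the call.
        return CallRecord(**json.loads(self._records[call_ref]))

    def release(self, call_ref: int) -> None:
        # Drop the record once the call has ended.
        self._records.pop(call_ref, None)
```

The application, not the store, decides when to call checkpoint(), which
matches the point that what and when to checkpoint is an application
decision.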
...
One issue is how to solve reliable end-to-end delivery.  SCTP would
guarantee delivery to the other end of the connection but the H.225 and H.245
protocols may pass through intermediate nodes (e.g., routing GKs).  If a
node fails after acknowledging the message but before sending it on,
end-to-end delivery fails and no live element knows.  Either we need far
If that message is crucial to the continuation of the call, the receiver
of the message should probably checkpoint the info received before
acking the sender. This way, if the receiver dies after acking the
sender, the back-up of the receiver will still be able to get the info
from the shared memory. In most cases, the back-up server knows it's
taking over a failed peer, and will examine the recovered info from the
shared memory and decide what to do with the call.
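
The ordering described here (checkpoint first, ack second) can be
sketched as follows. This is only an illustration: the function names
are hypothetical, and a plain dict stands in for the shared memory that
the back-up server can read.

```python
def receive(shared, call_ref, msg, send_ack):
    """Handle a message that is crucial to the continuation of the call."""
    # Checkpoint before acking, so the info survives a crash after the ack.
    shared[call_ref] = msg
    send_ack(call_ref)


def take_over(shared, decide):
    """Run by the back-up after it learns that its peer has failed."""
    # Examine every recovered entry and let the application decide per call.
    return {ref: decide(ref, msg) for ref, msg in shared.items()}
```

If the two lines in receive() were swapped, a failure between them would
leave the sender holding an ack for information that no live server has.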
...
end acknowledgement or backup elements must accept delivery
responsibility once acknowledgement has been sent by its later failed
peer.  DDP would guarantee delivery but the recipient would have to
checkpoint this message before acknowledgement to guarantee that a peer
taking over would know that there was an outstanding message that it
must deliver and receive acknowledgement for.  This mechanism would
require a checkpoint after receiving a message but before acknowledging
it, and again after receiving acknowledgement from its next
neighbor.  Does this seem reasonable?
It's reasonable. Again, which message/response needs to be checkpointed
is REALLY an application decision which should be made case by case in
each call scenario. In most cases, not every message in a call flow
needs to be checkpointed since the call model itself usually has some
error recovery mechanism built in and it can recover by itself if the
failure hits at certain points of the call.
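
For a message that does need this treatment, the two-checkpoint
mechanism asked about above might look like the sketch below. The Relay
class, its callbacks, and the use of a message id as key are all
assumptions made for illustration; a dict again stands in for the shared
checkpoint storage.

```python
class Relay:
    """Intermediate node that accepts delivery responsibility (sketch)."""

    def __init__(self, shared, send_ack, forward):
        self.shared = shared      # outstanding messages, keyed by message id
        self.send_ack = send_ack  # acks the previous hop
        self.forward = forward    # sends the message to the next hop

    def on_message(self, msg_id, msg):
        # Checkpoint 1: record the message before acknowledging it.
        self.shared[msg_id] = msg
        self.send_ack(msg_id)
        self.forward(msg_id, msg)

    def on_downstream_ack(self, msg_id):
        # Checkpoint 2: the next hop has it, so it is no longer outstanding.
        self.shared.pop(msg_id, None)


def resend_outstanding(shared, forward):
    """Run by a peer taking over: re-deliver whatever was never acked."""
    for msg_id, msg in list(shared.items()):
        forward(msg_id, msg)
```

A peer that takes over cannot tell whether the failed node already
forwarded a message, so the next hop may see it twice; this sketch
assumes duplicates are tolerated or filtered by message id.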
...
Unless the amount of data sent in
such checkpointing is kept very small, this would seem to add too much
network activity to be acceptable.
It's always good practice to minimize the checkpointed call record to
include only critical information of the call. Also, the checkpointing
traffic is localized to the server group. Proper network engineering
can help a lot (e.g., don't let the checkpointing traffic of a pair of
heavy-duty call servers run through a bottleneck router that is also
handling signalling traffic).


Qiaobing Xie