Re: [Robustness] DDP Checkpointing (Part 2)
Part 2:

-Qiaobing

Terry L Anderson wrote:
...snip...
It would seem preferable to either break up state information into a number of subsets (to avoid resending large amounts each time) or to only send differences. Since DDP gives reliable delivery to the other members of the pool, perhaps only changes need be sent. Did your trials solve this problem? Perhaps each call would be a separate subset.
I think it's a good idea to define a checkpoint call record structure for each call, containing the minimal information you need to recover the call. The call processing engine will checkpoint the call record at the appropriate time. You can, for example, index the call record by call reference number. In general, what and when to checkpoint is solely an application decision. This is another big reason we left checkpointing out of DDP.
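A minimal sketch of such a call record, indexed by call reference number. Everything here is hypothetical and for illustration only: the field names, the CheckpointStore class, and the use of JSON are assumptions, not part of DDP or of any existing call server. The dictionary stands in for the memory shared with the back-up server.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class CallRecord:
    """Minimal state needed to recover one call (illustrative fields only)."""
    call_ref: int        # call reference number, used as the index
    state: str           # e.g. "SETUP" or "CONNECTED"
    calling_party: str
    called_party: str


class CheckpointStore:
    """Hypothetical stand-in for the memory shared with the back-up server."""

    def __init__(self):
        self._records = {}  # call_ref -> serialized call record

    def checkpoint(self, record: CallRecord) -> None:
        # A new checkpoint overwrites the earlier one for the same call.
        self._records[record.call_ref] = json.dumps(asdict(record))

    def recover(self, call_ref: int) -> CallRecord:
        # Rebuild the record from its serialized form after a takeover.
        return CallRecord(**json.loads(self._records[call_ref]))

    def discard(self, call_ref: int) -> None:
        # Drop the record once the call has ended.
        self._records.pop(call_ref, None)
```

The application still decides when to call checkpoint(); the store only fixes what a record looks like and how it is looked up.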
One issue is how to solve reliable end-to-end delivery. SCTP would guarantee delivery to the other end of the connection, but the H.225 and H.245 protocols may pass through intermediate nodes (e.g., routing GKs). If a node fails after acknowledging the message but before sending it on, end-to-end delivery fails and no live element knows.
If that message is crucial to the continuation of the call, the receiver of the message should probably checkpoint the info received before acking the sender. This way, if the receiver dies after acking the sender, the back-up of the receiver will still be able to get the info from the shared memory. In most cases, the back-up server knows it's taking over from a failed peer, and will examine the recovered info from the shared memory and decide what to do with the call.
Either we need far-end acknowledgement, or backup elements must accept delivery responsibility once an acknowledgement has been sent by a peer that later fails. DDP would guarantee delivery, but the recipient would have to checkpoint this message before acknowledging it, to guarantee that a peer taking over would know that there was an outstanding message that it must deliver and receive acknowledgement for. This mechanism would require a checkpoint after receipt of a message but before sending its acknowledgement, and again after receiving acknowledgement from the next neighbor. Does this seem reasonable?
It's reasonable. Again, which message/response needs to be checkpointed is REALLY an application decision, which should be made case by case in each call scenario. In most cases, not every message in a call flow needs to be checkpointed, since the call model itself usually has some error-recovery mechanism built in and can recover by itself if the failure hits at certain points of the call.
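The ordering discussed above (checkpoint on receipt, then ack, then forward, then checkpoint again once the next neighbor acks) can be sketched roughly as follows. This is an illustration under assumptions, not a DDP or SCTP API: RelayNode, its method names, and the plain dict used as shared memory are all hypothetical.

```python
class RelayNode:
    """Hypothetical intermediate node that checkpoints before acking."""

    def __init__(self, shared_store, send_downstream):
        self.store = shared_store              # dict shared with the back-up (assumption)
        self.send_downstream = send_downstream # callable(msg_id, payload)

    def receive(self, msg_id, payload):
        # First checkpoint: record the message as outstanding BEFORE acking,
        # so a back-up can still find it if this node dies right after the ack.
        self.store[msg_id] = payload
        return ("ACK", msg_id)

    def forward(self, msg_id):
        # Pass the message on to the next neighbor.
        self.send_downstream(msg_id, self.store[msg_id])

    def on_downstream_ack(self, msg_id):
        # Second checkpoint: the next neighbor has the message, so it is
        # no longer outstanding.
        self.store.pop(msg_id, None)

    def take_over(self):
        # Called on the back-up: resend everything the failed peer had
        # acknowledged but not yet seen acknowledged downstream.
        for msg_id, payload in list(self.store.items()):
            self.send_downstream(msg_id, payload)
```

Note that a takeover can resend a message the failed peer had already forwarded but not yet seen acked, so in this sketch the next neighbor has to tolerate duplicates, for example by checking the message id.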
Unless the amount of data sent in such checkpointing is kept very small, this would seem to add too much network activity to be acceptable.
It's always good practice to minimize the checkpointed call record so that it includes only the critical information of the call. Also, the checkpointing traffic is localized to the server group. Proper network engineering can help a lot (e.g., don't let the checkpointing traffic of a pair of heavy-duty call servers run through a bottleneck router that is also handling signalling traffic).
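One way to keep the checkpoint traffic small, along the lines of the "only send differences" suggestion earlier in the thread, is to send just the fields that changed since the last checkpoint. The sketch below is only an illustration: record_delta and apply_delta are hypothetical helpers, call records are assumed to be flat dicts, and fields are assumed never to be removed from a record.

```python
_MISSING = object()  # sentinel so that a newly added field is always seen as changed


def record_delta(previous: dict, current: dict) -> dict:
    """Return only the fields whose values differ from the last checkpoint."""
    return {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }


def apply_delta(previous: dict, delta: dict) -> dict:
    """Rebuild the current record on the back-up from the last copy plus a delta."""
    merged = dict(previous)
    merged.update(delta)
    return merged
```

This relies on the reliable, ordered delivery within the server group mentioned above: a lost or reordered delta would leave the back-up with a stale record, so a full record would still need to be sent when a call is first checkpointed.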
participants (1)
-
Qiaobing Xie