FW: Issue in H.323 robustness not addressed by SCTP/DDP

Qiaobing Xie xieqb at CIG.MOT.COM
Wed Apr 26 19:52:58 EDT 2000


See my comments below.

-Qiaobing

Archana Nehru wrote:
>
> Subject: RE: Issue in H.323 robustness not addressed by SCTP/DDP
> Date: Wed, 26 Apr 2000 14:55:02 -0700
> From: Archana Nehru <archie at trillium.com>
> To: 'Mailing list for parties associated with ITU-T Study Group 16'
>      <ITU-SG16 at mailbag.cps.intel.com>
> CC: "'xieqb at CIG.MOT.COM'" <xieqb at CIG.MOT.COM>
>
> Qiaobing,
>
> Could you explain how the SCTP/DDP layer below would handle PROBLEM B? My
> understanding of SCTP/DDP is based on the internet draft that exists
> right now. Just to make clear that we are in sync about the problem, let
> me explain our view of SCTP/DDP functionality
> for this particular problem of H.323 robustness.
>
> a) SCTP will help H.323 to route data to multiple IP addresses belonging to
> the SAME node (multi-homed host).
>
> b) DDP extends this functionality by allowing an H.323 endpoint to send data
> not only to multiple IP addresses belonging to the SAME node but also to
> physically DIFFERENT nodes. This is achieved by translating a "NAME" to a
> group of IP addresses.

Correct.
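
To make (b) concrete, here is a minimal Python sketch of the
name-to-address-group idea. The NamePool class, its method names, the
pool name, and the addresses are all made up for illustration; none of
this is taken from the DDP draft.

```python
class NamePool:
    """Hypothetical name -> group-of-addresses table (not the DDP API)."""

    def __init__(self):
        self._pools = {}    # name -> ordered list of member addresses
        self._down = set()  # addresses currently believed unreachable

    def register(self, name, addresses):
        self._pools[name] = list(addresses)

    def mark_down(self, address):
        self._down.add(address)

    def resolve(self, name):
        """Return the first pool member that is not marked down."""
        for address in self._pools.get(name, []):
            if address not in self._down:
                return address
        raise LookupError(f"no reachable member for {name!r}")


pool = NamePool()
pool.register("gk.example", ["10.0.0.1", "10.0.0.2"])
primary = pool.resolve("gk.example")    # first member of the pool
pool.mark_down(primary)                 # transport reports a failure
alternate = pool.resolve("gk.example")  # next member takes over
```

The sender only ever addresses the name; which physical node answers is
decided at resolve time.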

> If there is any other functionality in this context that is relevant for the
> discussion here, please do point it out.

Another important piece is the shared memory (or other similar
mechanism) for checkpointing your state information. From DDP's view,
checkpointing is application specific, i.e., the application (owner of
the call model) is the only one that knows when and what to checkpoint.
The replication mechanism used for the checkpointing is implementation
specific, i.e., an implementor can employ various methods ranging from
a shared hard disk array to special hardware-assisted reflected memory.
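
A rough sketch of that split, in Python: the application decides when to
call save(), and the replication mechanism hides behind an interface.
CheckpointStore and InMemoryStore are hypothetical names for this
illustration only; a real backend would be a shared disk array,
reflected memory, or similar.

```python
from abc import ABC, abstractmethod
import copy


class CheckpointStore(ABC):
    """Replication mechanism; the implementation is up to the implementor."""

    @abstractmethod
    def save(self, call_id, state):
        ...

    @abstractmethod
    def load(self, call_id):
        ...


class InMemoryStore(CheckpointStore):
    """Stand-in backend that keeps snapshots in a local dict."""

    def __init__(self):
        self._data = {}

    def save(self, call_id, state):
        # Store a copy so later changes to the live state are not
        # visible until the application checkpoints again.
        self._data[call_id] = copy.deepcopy(state)

    def load(self, call_id):
        # Returns None when the call was never checkpointed.
        return copy.deepcopy(self._data.get(call_id))


store = InMemoryStore()
state = {"call_id": "c1", "phase": "connected"}
store.save("c1", state)          # application chooses this moment
state["phase"] = "releasing"     # later change, not checkpointed
recovered = store.load("c1")     # what a standby would see
```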

>
> Now coming to problem B that we had stated (I am modifying the diagram for
> clarity):
>
>                               (CRASH)     RELCOMPLETE
>            EP2 <--------------  GK   <------------  EP1
>           (SCTP/DDP)           (SCTP/DDP)         (SCTP/DDP)
>
>     EP1 sends a RELCOMPLETE to EP2 via the GK.
>
>                | H.323 layer of GK|
>                |  |            ^  |
>                |  |            |  |
>                ---|------------|---
>   Outgoing     |  | DDP/SCTP   |  |
>   RELCOMP<---------             ------------ Incoming RELCOMP
>                |                  |
>                --------------------
>
> The SCTP/DDP layer receives the RELCOMP and sends an ACK back to EP1, so
> SCTP/DDP's job is over. Now the SCTP/DDP layer sends the RELCOMP message to
> the H.323 layer and the H.323 layer crashes. So there is no context of that
> RELCOMP message on the STANDBY. Our point is that this problem is outside
> the domain of SCTP/DDP.

This is what will most likely happen with DDP/SCTP:

Without receiving a RELCOMP, EP2's H.323 layer will time out and resend
RELEASE, and the resend will prompt the DDP layer at EP2 to detect the
faulted GK and route the resent RELEASE to the alternate GK. Of course,
the alternate GK needs to be able to fetch the call state info of this
call from, for example, a shared network memory containing checkpointed
call state data, and continue the call release sequence of this call.
This resent RELEASE may surprise EP1 a little; it appears as a
duplicate to EP1. But EP1 simply needs to reply with another RELCOMP to
EP2, which DDP will of course route through the alternate GK.

From the application's view, except for a time-out at EP2 and a
duplicate RELEASE at EP1, the call flow continues. The take-over is
carried out transparently by DDP and SCTP.
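
The sequence can be walked through with a small Python simulation. This
is a sketch under simplifying assumptions: the Gatekeeper class,
send_via_pool(), and ep1_handle() are invented for illustration, the
shared dict stands in for the shared network memory, and the reply is
modeled as the return value of the forwarding call rather than as a
separate message.

```python
class Gatekeeper:
    def __init__(self, name, shared_state):
        self.name = name
        self.shared_state = shared_state  # stands in for shared memory
        self.alive = True

    def forward(self, call_id, message, deliver):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        if call_id not in self.shared_state:
            raise KeyError(f"no checkpointed state for call {call_id}")
        return deliver(message)


def send_via_pool(pool, call_id, message, deliver):
    """Try each gatekeeper in turn, skipping ones that have failed."""
    for gk in pool:
        try:
            return gk.name, gk.forward(call_id, message, deliver)
        except ConnectionError:
            continue
    raise ConnectionError("no gatekeeper reachable")


def ep1_handle(message):
    # EP1 already considers the call released; a duplicate RELEASE
    # simply gets another RELCOMP in reply.
    return "RELCOMP" if message == "RELEASE" else None


shared = {"call-1": {"phase": "releasing"}}
primary = Gatekeeper("gk-primary", shared)
standby = Gatekeeper("gk-standby", shared)

primary.alive = False  # GK crashes before forwarding EP1's RELCOMP
# EP2's H.323 timer expires and it resends RELEASE:
used, reply = send_via_pool(
    [primary, standby], "call-1", "RELEASE", ep1_handle
)
```

Note that the standby can only continue the release sequence because
the call state was checkpointed before the crash.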

Several things should also be noted in your example:

1) When the H.323 layer crashes, SCTP/DDP will most likely crash too;
there is no such thing as a partially crashed unix process. (However, if
you use a multi-threaded programming model, you can have your SCTP/DDP
running in a separate thread from H.323, and you may have your H.323
thread spinning and eating messages while your SCTP/DDP thread is still
functioning normally. But I would not consider that an application
failure: from DDP/SCTP's viewpoint, it is indistinguishable from a
design flaw.)

2) Since SCTP is reliable, the SCTP peer at EP1 will eventually detect
the failure of the SCTP endpoint at GK and notify its DDP layer to stop
sending to this GK.
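
The detection in 2) essentially comes down to counting transmissions
that go unacknowledged. A minimal Python sketch, assuming a simple
per-peer error counter; the PeerMonitor class and the max_retrans
threshold value are illustrative and not taken from the draft:

```python
class PeerMonitor:
    """Counts consecutive unacknowledged transmissions to one peer."""

    def __init__(self, max_retrans=5):
        self.max_retrans = max_retrans  # assumed threshold
        self.error_count = 0
        self.reachable = True

    def on_timeout(self):
        """A retransmission timer expired without an acknowledgement."""
        self.error_count += 1
        if self.error_count > self.max_retrans:
            self.reachable = False  # upper layer should stop using peer
        return self.reachable

    def on_ack(self):
        """Any acknowledgement from the peer clears the counter."""
        self.error_count = 0


monitor = PeerMonitor(max_retrans=2)
results = [monitor.on_timeout() for _ in range(3)]
```

Once reachable turns False, the upper layer would be told to stop
sending to this GK.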

>
> Btw, are you suggesting that we do a "checkpointing" in the H.323 layer for
> every message we received? I am sure you will agree that that will be very
> expensive.

Checkpointing is the cost you have to pay for the redundancy. But the
checkpointing traffic can be completely localized within the server pool
if you engineer your network right. You don't checkpoint on every
message; it really depends on your call flow and how much you want to
recover. To preserve stable calls only (that's what most telecom
switches do today; correct me if I am wrong on this), you only
checkpoint a call when it becomes stable.
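
As an illustration of how little gets written under a stable-calls-only
policy, here is a Python sketch. The phase names and the choice of
"connected" as the only stable phase are assumptions for this example,
and a plain dict stands in for the replicated store.

```python
STABLE_PHASES = {"connected"}  # assumed definition of a stable call


def maybe_checkpoint(store, call_id, old_phase, new_phase):
    """Write a checkpoint only on a transition into a stable phase."""
    if new_phase in STABLE_PHASES and old_phase != new_phase:
        store[call_id] = {"phase": new_phase}
        return True
    return False


store = {}
phases = ["setup", "proceeding", "alerting", "connected", "connected"]
writes = 0
previous = None
for phase in phases:
    if maybe_checkpoint(store, "call-1", previous, phase):
        writes += 1
    previous = phase
```

Five call-state events produce a single checkpoint write here; a call
that fails during setup is simply not recovered.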

>
> -Regards
> Archana
>
> >-----Original Message-----
> >From: Qiaobing Xie [mailto:xieqb at CIG.MOT.COM]
> >Sent: Wednesday, April 26, 2000 1:22 PM
> >To: ITU-SG16 at mailbag.cps.intel.com
> >Subject: Re: Issue in H.323 robustness not addressed by SCTP/DDP
> >
> >
> >Archana,
> >
> >One thing you might have missed is that the DDP/SCTP fault-tolerance
> >model is designed to provide robustness to the application in a
> >*transparent* fashion. The state synchronization issue (your PROBLEM B)
> >is a non-issue for the DDP/SCTP model. In our model, a back-up GK will
> >automatically kick in and continue forwarding the RELCOMPLETE to EP2,
> >without either EP even noticing that the failure ever happened at all!
> >There is NO application involvement required in this scenario.
> >
> >-Qiaobing
> >
> >Archana Nehru wrote:
> >>
> >> Hello,
> >>
> >> We think that SCTP/DDP by itself is not a complete solution
> >> for robustness (see PROBLEM B below) and certain changes
> >> need to be made in the H.323 layer to achieve robustness.
> >> For the sake of clarity, we restate the issues we need to address in
> >> order to achieve robustness:
> >>
> >> In the current H.323 specs, if the TCP connection for an H.323 call
> >> goes down, the call is lost. To overcome this problem, we need:
> >>
> >> A. Fail over mechanism
> >>
> >>    Whenever an endpoint detects that the other side is down (e.g.: TCP
> >>    connection failure/ no ACKs received in Annex E) the endpoint can
> >>    save an active H.323 call, if it knows about a "recovery H.323
> >>    address".
> >>
> >>    The "recovery address" is the back-up address that the endpoint can
> >>    use to re-establish a TCP connection (for TCP) or to resend Annex E
> >>    data (UDP). From the endpoint's point of view, the "recovery address"
> >>    represents a node that has enough information about the H.323 call
> >>    to continue processing as if the failure had never occurred.
> >>
> >>    The failure in the node could have been one of the following types:
> >>
> >>    1. Transport failure:  e.g. failed NIC, congested network.
> >>
> >>    2. Node failure: e.g. the entire gatekeeper fails. In this case, we
> >>    need a synchronization mechanism between the gatekeeper and its
> >>    backup so the active calls can be saved.
> >>
> >> B. Handle Call State Synchronization
> >>    We need to make sure that both legs of an H.323 call are in sync. When
> >>    an intermediate node (e.g. GK) fails, messages from an endpoint can
> >>    get lost. E.g., take the example of a lost RELEASE COMPLETE in the
> >>    following scenario:
> >>
> >>                              (CRASH)     RELCOMPLETE
> >>           EP2 <--------------  GK   <------------  EP1
> >>
> >>    EP1 sends a RELCOMPLETE to EP2 via the GK. The GK crashes before
> >>    forwarding the RELCOMPLETE from EP1 to EP2. As a result, EP1 thinks
> >>    the call is released, whereas EP2 thinks the call is up.
> >>
> >>    As Paul has pointed out: several H.245 messages are problematic--
> >>    especially those related to conferencing, such as chair control,
> >>    terminal join/left, terminal you are seeing, etc.
> >>    UserInputIndication and any other "indication" message that does not
> >>    require a response is an issue.
> >>
> >> POSSIBLE SOLUTION(s):
> >> ---------------------
> >>
> >> Solution to Problem A:
> >> ----------------------
> >> This problem can be solved using SCTP/DDP or modifying Annex E to
> >> include alternate addresses.
> >>
> >> Solution to PROBLEM B:
> >> ----------------------
> >>
> >> This problem cannot be solved using SCTP/DDP as it is inherent in the
> >> H.323 protocol. If we take the same example as above:
> >>
> >>                                 (CRASH)     RELCOMPLETE
> >>           EP2 <--------------  GK   <------------  EP1
> >>          (SCTP/DDP)           (SCTP/DDP)         (SCTP/DDP)
> >>
> >> What happens if the GK fails just after its SCTP layer finished sending
> >> an SCTP-ACK for the RELCOMPLETE message to EP1?  EP1 receives the
> >> SCTP-ACK and therefore considers the call released but EP2 never
> >> receives the RELCOMPLETE message. It is important to note here that
> >> "checkpointing" in the H.323 layer of the GK will not help since the ACK
> >> at the SCTP level is generated before the RELCOMPLETE message is
> >> delivered to the H.323 layer of the GK.
> >>
> >> So we can solve the problem by having an "END-to-END acknowledgement
> >> mechanism"  to make sure that EP1 and EP2 are in sync even when the
> >> intermediate node fails.
> >>
> >> One approach as suggested by Paul is to modify Annex E to have
> >> end-to-end acknowledgement.  We want to point out that actually this is
> >> an H.323 layer problem.  By introducing end-to-end ack into Annex E, we
> >> will be trying to solve a protocol layer problem by making modifications
> >> in the transport layer mechanisms.  The problem of synchronization comes
> >> from the fact that the H.323 layer does not have an ACK for every
> >> message that is sent out.
> >>
> >> Alternatively, if we introduce an ACK packet for every H.323 message
> >> that currently has no ACK (e.g: H.245 commands/indications or H.225
> >> RELEASE COMPLETE), we can address the problem cleanly. This ACK message
> >> will be supported only by the nodes that support robustness. Unlike the
> >> Annex-E approach, this approach is independent of the transport layer
> >> protocol below H.323, and can also be applied to SCTP/DDP.
> >>
> >> Comments are welcome on this issue.
> >>
> >> Regards,
> >> Archana
> >>
> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> For help on this mail list, send "HELP ITU-SG16" in a message to
> >> listserv at mailbag.intel.com


