Re: Issue in H.323 robustness not addressed by SCTP/DDP

26 Apr 2000

      Qiaobing,

Could you explain how the SCTP/DDP layer below would handle PROBLEM B? My
understanding of SCTP/DDP is based on the Internet-Draft that exists
right now. To make sure that we are in sync about the problem, let
me explain our view of the SCTP/DDP functionality that is relevant to
this particular problem of H.323 robustness.

a) SCTP will help H.323 to route data to multiple IP addresses belonging to
the SAME node (multi-homed host).

b) DDP extends this functionality by allowing an H.323 endpoint to send data
not only to multiple IP addresses belonging to the SAME node but also to
physically DIFFERENT nodes. This is achieved by translating a "NAME" to a
group of IP addresses.
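
To illustrate point b), here is a rough sketch of what "translating a NAME
to a group of IP addresses" could look like from the sender's side. The
name "gk-pool", the addresses, and the helper send_to_name() are all made up
for illustration; none of this is taken from the SCTP or DDP drafts.

```python
# Hypothetical name table: one "NAME" maps to addresses on DIFFERENT nodes.
# The name, addresses, and helper below are illustrative assumptions only.
NAME_TABLE = {
    "gk-pool": ["10.0.0.1", "10.0.0.2", "10.0.1.1"],
}


def send_to_name(name, message, is_reachable):
    """Try each address behind `name` in order; return the first reachable one."""
    for address in NAME_TABLE[name]:
        if is_reachable(address):
            return address  # a real stack would transmit `message` here
    raise ConnectionError(f"no reachable address for {name!r}")


# Simulate the primary node (10.0.0.1) being down.
down = {"10.0.0.1"}
chosen = send_to_name("gk-pool", "RELCOMPLETE", lambda a: a not in down)
print(chosen)  # 10.0.0.2
```

Note that the sketch only picks a node that is reachable; it says nothing
about whether that node knows the state of the call.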

If there is any other functionality in this context that is relevant for the
discussion here, please do point it out.

Now coming to problem B that we had stated (I am modifying the diagram for
clarity):

                              (CRASH)     RELCOMPLETE
           EP2 <--------------  GK   <------------  EP1
          (SCTP/DDP)           (SCTP/DDP)         (SCTP/DDP)

    EP1 sends a RELCOMPLETE to EP2 via the GK.

               | H.323 layer of GK|
               |  |            ^  |
               |  |            |  |
               ---|------------|---
  Outgoing     |  | DDP/SCTP   |  |
  RELCOMP<---------             ------------ Incoming RELCOMP
               |                  |
               --------------------

The SCTP/DDP layer receives the RELCOMP and sends an ACK back to EP1, so
SCTP/DDP's job is over. The SCTP/DDP layer then passes the RELCOMP message up
to the H.323 layer, and the H.323 layer crashes. As a result, there is no
context for that RELCOMP message on the STANDBY. Our point is that this
problem is outside the domain of SCTP/DDP.
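
The sequence above can be sketched as a toy simulation. This is only a
model of the ordering we are worried about (transport ACK first, delivery
to the H.323 layer second); the class and function names are hypothetical
and it does not use any real SCTP or H.323 stack.

```python
class Gatekeeper:
    """Toy GK: a transport layer below an H.323 layer (names are illustrative)."""

    def __init__(self, crash_before_forward=False):
        self.crash_before_forward = crash_before_forward
        self.forwarded = []  # messages the H.323 layer managed to forward

    def transport_receive(self, message):
        # Step 1: the transport builds its ACK as soon as it has the data.
        ack = f"ACK({message})"
        # Step 2: only then is the message handed up to the H.323 layer.
        try:
            self.h323_deliver(message)
        except RuntimeError:
            pass  # the sender never learns of this; it still gets the ACK
        return ack

    def h323_deliver(self, message):
        if self.crash_before_forward:
            raise RuntimeError("H.323 layer crashed")
        self.forwarded.append(message)


def run(crash):
    """Return the call state as seen by (EP1, EP2) after EP1 sends RELCOMPLETE."""
    gk = Gatekeeper(crash_before_forward=crash)
    ack = gk.transport_receive("RELCOMPLETE")
    ep1_state = "released" if ack else "up"
    ep2_state = "released" if "RELCOMPLETE" in gk.forwarded else "up"
    return ep1_state, ep2_state


print(run(crash=False))  # ('released', 'released')
print(run(crash=True))   # ('released', 'up')
```

In the crash case EP1 holds a valid transport ACK, yet the two endpoints
disagree about the call state.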

Btw, are you suggesting that we do "checkpointing" in the H.323 layer for
every message we receive? I am sure you will agree that that would be very
expensive.

-Regards
Archana
...
-----Original Message-----
From: Qiaobing Xie [mailto:xieqb@CIG.MOT.COM]
Sent: Wednesday, April 26, 2000 1:22 PM
To: ITU-SG16@mailbag.cps.intel.com
Subject: Re: Issue in H.323 robustness not addressed by SCTP/DDP
Archana,
One thing you might have missed is that the DDP/SCTP fault-tolerance
model is designed to provide robustness to the application in a
*transparent* fashion. The state synchronization issue (your PROBLEM B)
is a non-issue in the DDP/SCTP model. In our model, a back-up GK will
automatically kick in and continue forwarding the RELCOMPLETE to EP2,
without either EP even noticing that the failure ever happened at all!
There is NO application involvement required in this scenario.
-Qiaobing
Archana Nehru wrote:
...
Hello,
We think that SCTP/DDP by itself is not a complete solution
for robustness (see PROBLEM B below) and certain changes
need to be made in the H.323 layer to achieve robustness.
For the sake of clarity, we restate the issues we need to address in
order to achieve robustness:
In the current H.323 specs, if the TCP connection for a
...
down, the call is  lost. To overcome this problem, we need:
A. Fail over mechanism
   Whenever an endpoint detects that the other side is down (e.g. TCP
   connection failure or no ACKs received in Annex E), the endpoint can
   save an active H.323 call if it knows about a "recovery H.323
   address".
   The "recovery address" is the back-up address that the endpoint can
   use to re-establish a TCP connection (for TCP) or to resend Annex E
   data (UDP). From the endpoint's point of view, the "recovery address"
   represents a node that has enough information about the H.323 call
   to continue processing as if the failure had never occurred.
   The failure in the node could have been one of the following types:
   1. Transport failure: e.g. failed NIC, congested network.
   2. Node failure: e.g. the entire gatekeeper fails. In this case, we
      need a synchronization mechanism between the gatekeeper and its
      backup so the active calls can be saved.
B. Handle Call State Synchronization
   We need to make sure that both legs of an H.323 call are in sync. When
   an intermediate node (e.g. GK) fails, messages from an endpoint can
   get lost. Take the example of a lost RELEASE COMPLETE in the
   following scenario:
                              (CRASH)     RELCOMPLETE
           EP2 <--------------  GK   <------------  EP1
   EP1 sends a RELCOMPLETE to EP2 via the GK. The GK crashes before
   forwarding the RELCOMPLETE from EP1 to EP2. As a result, EP1 thinks
   the call is released, whereas EP2 thinks the call is up.
As Paul has pointed out: several H.245 messages are problematic--
   especially those related to conferencing, such as chair control,
   terminal join/left, terminal you are seeing, etc.
   UserInputIndication and any other "indication" message
H.323 call goes
that does not
...
require a response is an issue.
POSSIBLE SOLUTION(s):
---------------------
Solution to Problem A:
----------------------
This problem can be solved using SCTP/DDP or modifying Annex E to
include alternate addresses.
Solution to PROBLEM B:
----------------------
This problem cannot be solved using SCTP/DDP as it is inherent in the
H.323 protocol. If we take the same example as above:
                             (CRASH)     RELCOMPLETE
          EP2 <--------------  GK   <------------  EP1
         (SCTP/DDP)           (SCTP/DDP)         (SCTP/DDP)
What happens if the GK fails just after its SCTP layer has finished sending
an SCTP-ACK for the RELCOMPLETE message to EP1? EP1 receives the
SCTP-ACK and therefore considers the call released, but EP2 never
receives the RELCOMPLETE message. It is important to note here that
"checkpointing" in the H.323 layer of the GK will not help, since the ACK
at the SCTP level is generated before the RELCOMPLETE message is delivered
to the H.323 layer of the GK.
So we can solve the problem by having an "END-to-END acknowledgement
mechanism" to make sure that EP1 and EP2 are in sync even when the
intermediate node fails.
One approach, as suggested by Paul, is to modify Annex E to have
end-to-end acknowledgement. We want to point out that this is actually an
H.323 layer problem. By introducing an end-to-end ACK into Annex E, we
would be trying to solve a protocol layer problem by making modifications
in the transport layer mechanisms. The problem of synchronization comes
from the fact that the H.323 layer does not have an ACK for every
message that is sent out.
Alternatively, if we introduce an ACK packet for every H.323 message
that currently has no ACK (e.g. H.245 commands/indications or H.225
RELEASE COMPLETE), we can address the problem cleanly. This ACK message
will be supported only by the nodes that support robustness. Unlike the
Annex E approach, this approach is independent of the transport
protocol layer below H.323, and can also be applied to SCTP/DDP.
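A minimal sketch of the H.323-level ACK idea follows, assuming the sender
simply retries through a backup GK when no end-to-end ACK comes back. The
"H323-ACK" message and all class and function names are hypothetical; no
such message exists in the current H.323 specs.

```python
class Endpoint2:
    """Far endpoint: applies the message, then returns an application-level ACK."""

    def __init__(self):
        self.call_state = "up"

    def receive(self, message):
        if message == "RELCOMPLETE":
            self.call_state = "released"
        return f"H323-ACK({message})"  # hypothetical H.323-level ACK


def make_gk(ep2, crashed):
    """Return a forwarding function; a crashed GK silently drops the message."""
    def forward(message):
        if crashed:
            return None  # no end-to-end ACK comes back
        return ep2.receive(message)
    return forward


def send_with_end_to_end_ack(message, gatekeepers):
    """Resend via each GK in turn until an H.323-level ACK arrives."""
    for attempt, forward in enumerate(gatekeepers, start=1):
        ack = forward(message)
        if ack is not None:
            return attempt, ack
    raise TimeoutError("no end-to-end ACK from the far endpoint")


ep2 = Endpoint2()
primary = make_gk(ep2, crashed=True)   # crashes before forwarding
backup = make_gk(ep2, crashed=False)
attempt, ack = send_with_end_to_end_ack("RELCOMPLETE", [primary, backup])
print(attempt, ack, ep2.call_state)  # 2 H323-ACK(RELCOMPLETE) released
```

Here EP1 treats the call as released only once EP2's ACK arrives, so a
transport-level ACK from the failed GK is not mistaken for delivery.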
Comments are welcome on this issue.
Regards,
Archana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For help on this mail list, send "HELP ITU-SG16" in a message to
listserv@mailbag.intel.com

Archana Nehru
