RE: H.248 and UTF-8 strings.

23 Jun 2005

      Hi Christian

As you correctly stated, all the "extra chars" are in the region between
0x80 and 0xff (0x80 to 0xf7, to be precise).
I am not aware of any special meaning attached to the characters in this
region, so I cannot see why these "extra chars" need
to be escaped or quoted in any way.

The naive question is what is wrong with extending SafeChar to contain
this region?
I.e.
   SafeChar             = DIGIT / ALPHA / "+" / "-" / "&" /
                          "!" / "_" / "/" / "\'" / "?" / "@" /
                          "^" / "`" / "~" / "*" / "$" / "\" /
                          "(" / ")" / "%" / "|" / "." / %x80-F7 

Or, if for some reason the UTF-8 strings may be used only in VALUE,
then:

  VALUE                = quotedString / 1*(SafeChar / %x80-F7)
  quotedString         = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP) DQUOTE
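
To make the second variant concrete, here is a minimal sketch of a checker for the extended VALUE rule, written as a byte-level regular expression in Python. The names PROPOSED_VALUE and is_proposed_value are hypothetical, and the sketch assumes that the "\'" entry in SafeChar stands for a plain apostrophe.

```python
import re

# SafeChar from Annex B.2, as a regex character-class body over bytes.
# Assumption: the ABNF entry "\'" is read as a plain apostrophe.
SAFE = rb"0-9A-Za-z+\-&!_/'?@^`~*$\\()%|."
REST = rb";\[\]{}:,#<>="      # RestChar
HIGH = rb"\x80-\xf7"          # the proposed %x80-F7 extension

# VALUE = quotedString / 1*(SafeChar / %x80-F7)
PROPOSED_VALUE = re.compile(
    rb'"[' + SAFE + HIGH + REST + rb' \t]*"'   # quotedString
    rb'|[' + SAFE + HIGH + rb']+'              # unquoted form
)

def is_proposed_value(text: str) -> bool:
    """Return True if the UTF-8 encoding of text matches the extended VALUE."""
    return PROPOSED_VALUE.fullmatch(text.encode("utf-8")) is not None
```

With this sketch, an unquoted string such as café would be accepted, while a string containing a space or a comma would still have to use the quoted form.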

And...

Assuming that a way to fix this is found, can it be fixed in version 3?
If yes, what is the procedure?
Do you want me to bring a relevant contribution to the July meeting?

Thanks,
Sasha

-----Original Message-----
From: Christian Groves [mailto:christian.groves@ericsson.com] 
Sent: Wednesday, June 22, 2005 8:16 PM
To: Sasha Ruditsky
Cc: itu-sg16@external.cisco.com; Angelo.Contardi@italtel.it
Subject: Re: H.248 and UTF-8 strings.

Hello Sasha,

This problem was raised recently on the Megaco list. I've had some
off-line discussion with Angelo (the person who raised the problem) and
currently there are 2 proposed solutions (I hope he will send this to
the Megaco list):

1) The simple one: code the UTF-8 string in "Octet Mode". This is a BAD
solution from the transmission-efficiency point of view because it
"halves the TX band": to TX one UTF-8 char (max 4 octets) I must TX
max 2 x 4 = 8 ASCII chars.
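
The doubling can be sketched in a few lines of Python. The sketch assumes that "Octet Mode" means sending every octet as two hexadecimal digits; the function name octet_mode is hypothetical.

```python
def octet_mode(text: str) -> str:
    """Hex-encode the UTF-8 octets of text, two ASCII chars per octet."""
    return text.encode("utf-8").hex().upper()

# A 4-octet UTF-8 char (here U+1F600) becomes 8 ASCII chars on the wire.
four_octet_char = "\U0001F600"
wire = octet_mode(four_octet_char)
```

Every octet costs two ASCII chars, so the overhead is a constant factor of 2 regardless of the text.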

2) The complicated one: allow the ABNF quoted form of VALUE to TX ALL
octets 0x01-0xFF, except 0x22, which must be ESCAPED with "\", as is
already done for the ABNF Local and Remote Descriptors (see SDP). Note that
'\0' (0x00) is NOT ALLOWED in this new quoted string form, just as in the
present one, but this is not a problem: in UTF-8 the char '\0'
(0x00) is the same as in ASCII (string terminator) and is NOT used to
code "non-ASCII" UTF-8 chars, i.e. the chars > 0x7F that need more
than one octet to be encoded (from 2 to 4 octets). In fact,
all the octets used to code such a UTF-8 char are above 0x7F (have
the MSBit = 1).
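
A minimal Python sketch of this second proposal follows. The helper names are hypothetical. The proposal only mentions escaping 0x22, so the sketch does not escape the backslash itself; how a literal "\" should be handled is left open here.

```python
def quote_value(text: str) -> bytes:
    """Build the proposed quoted form: raw UTF-8 octets, 0x22 escaped with a backslash."""
    raw = text.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("0x00 is not allowed in the quoted form")
    return b'"' + raw.replace(b'"', b'\\"') + b'"'

def extra_octets_have_msb_set(text: str) -> bool:
    """Check that every octet of every non-ASCII char has its top bit set."""
    return all(
        octet & 0x80
        for ch in text
        if ord(ch) > 0x7F
        for octet in ch.encode("utf-8")
    )
```

Because no octet of a multi-octet UTF-8 char falls below 0x80, none of them can be mistaken for 0x22, 0x00 or any other ASCII delimiter.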

Regards, Christian

Sasha Ruditsky wrote:
...
Hi
I'm trying to understand how H.248 supports UTF-8 string properties.
According to H.248 the string property is encoded as a UTF-8 string.
UTF-8 encoding is defined by the following table:
Scalar Value                  1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx             0xxxxxxx
00000yyy yyxxxxxx             110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx             1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx    11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
I.e. all the character codes between x80 and xf7 need to be supported.
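
The table can be sketched directly as bit operations in Python. The function name utf8_encode is hypothetical, and the sketch limits itself to Unicode scalar values (up to U+10FFFF, surrogates excluded), so in practice the largest lead byte it produces is 0xF4 rather than 0xF7.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value following the four rows of the table."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Every byte produced by the last three rows is 0x80 or above, which is the region in question.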
According to H.248 Annex B.2:
The ABNF in this section uses the VALUE construct (or lists of VALUE
constructs) to encode various package element values (properties, 
signal parameters, etc.).
The VALUE is defined as follows:
  VALUE                = quotedString / 1*(SafeChar)
  SafeChar             = DIGIT / ALPHA / "+" / "-" / "&" /
                          "!" / "_" / "/" / "\'" / "?" / "@" /
                          "^" / "`" / "~" / "*" / "$" / "\" /
                          "(" / ")" / "%" / "|" / "."
  ALPHA                = %x41-5A / %x61-7A ; A-Z / a-z
  DIGIT                = %x30-39         ; 0-9
  quotedString         = DQUOTE *(SafeChar / RestChar/ WSP) DQUOTE
  RestChar             = ";" / "[" / "]" / "{" / "}" / ":" / "," / "#" /
                          "<" / ">" / "="
  WSP                  = SP / HTAB ; white space
  SP                   = %x20        ; space
  HTAB                 = %x09        ; horizontal tab
  DQUOTE               = %x22            ; " (Double Quote)
So I believe this excludes the x80-xff characters.
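
A small Python sketch of this check: it lists the octets of a UTF-8 string that the current VALUE grammar does not allow. The names SAFE_CHAR, REST_CHAR, WSP and rejected_octets are hypothetical, and "\'" in SafeChar is assumed to mean a plain apostrophe.

```python
import string

# Octet sets taken from the Annex B.2 rules quoted above.
SAFE_CHAR = set(
    (string.digits + string.ascii_letters + "+-&!_/'?@^`~*$\\()%|.").encode("ascii")
)
REST_CHAR = set(b";[]{}:,#<>=")
WSP = set(b" \t")

def rejected_octets(text: str, quoted: bool = False) -> list:
    """Return the UTF-8 octets of text that the current VALUE rule excludes."""
    allowed = (SAFE_CHAR | REST_CHAR | WSP) if quoted else SAFE_CHAR
    return [octet for octet in text.encode("utf-8") if octet not in allowed]
```

For a string such as café, the sketch reports the octets 0xC3 and 0xA9 in both the quoted and the unquoted form.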
So the question is: how is the text encoding defined in Annex B supposed
to encode UTF-8 strings?
Thanks,
Sasha

Sasha Ruditsky
