H.248 and UTF-8 strings.

Sasha Ruditsky sasha at RADVISION.COM
Fri Jun 24 11:41:27 EDT 2005


Hi Tom

I am reading RFC2234 and in 2.4 it says:

   "External representations of terminal value characters will vary
   according to constraints in the storage or transmission environment.
   Hence, the same ABNF-based grammar may have multiple external
   encodings, such as one for a 7-bit US-ASCII environment, another for
   a binary octet environment and still a different one when 16-bit
   Unicode is used.  Encoding details are beyond the scope of ABNF,
   although Appendix A (Core) provides definitions for a 7-bit US-ASCII
   environment as has been common to much of the Internet."

So from this I conclude that the encoding is not defined by RFC2234.


Now in Appendix A (a.k.a. section 6) one can find the following.
Section 6.1 shows
OCTET          =  %x00-FF
However, section 6.2 says that only 7-bit values are allowed.

Taking this into account, and the fact that Appendix A of RFC 2234 is an
informative (not normative) part of the document, I do not see how and
why it should be followed.

You are saying that "We adopted that core for Megaco".
I have failed to find any reference in the Megaco documents to RFC2234
Appendix A or section 6.2.

And as I wrote before, the Megaco document does have the following
definition:
nonEscapeChar        = ( "\}" / %x01-7C / %x7E-FF )
which definitely requires an encoding that supports 8-bit values.

And the Megaco document claims support for UTF-8 strings, which also need
8-bit value support.
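
To illustrate the point about 8-bit values, here is a minimal Python sketch.
It only models the single-octet ranges of the nonEscapeChar rule quoted above
(the "\}" escape is left out), and the helper names and the sample string are
made up for illustration. It shows that the UTF-8 form of a non-ASCII string
contains octets above 0x7F, which fit %x7E-FF but not a 7-bit encoding.

```python
def in_non_escape_char(octet: int) -> bool:
    """True if the octet falls in %x01-7C or %x7E-FF (single-octet part
    of nonEscapeChar; the two-character "\\}" escape is not modelled)."""
    return 0x01 <= octet <= 0x7C or 0x7E <= octet <= 0xFF


def is_seven_bit(octet: int) -> bool:
    """True if the octet can be carried in a 7-bit US-ASCII environment."""
    return octet <= 0x7F


# Hypothetical sample value: "Zurich" with u-umlaut, a space, and a euro sign.
sample = "Z\u00fcrich \u20ac"
octets = sample.encode("utf-8")

# Octets that a 7-bit encoding cannot carry.
high_octets = [b for b in octets if not is_seven_bit(b)]

print([hex(b) for b in octets])
print("all octets allowed by nonEscapeChar ranges:",
      all(in_non_escape_char(b) for b in octets))
print("octets needing the 8th bit:", len(high_octets))
```
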


Regards,
Sasha


-----Original Message-----
From: Tom Taylor [mailto:taylor at nortel.com] 
Sent: Friday, June 24, 2005 11:07 AM
To: Sasha Ruditsky
Cc: Christian Groves; itu-sg16 at external.cisco.com;
Angelo.Contardi at ITALTEL.IT
Subject: Re: H.248 and UTF-8 strings.

Appendix A in which section 6.2 appears is "a convenient core for
specific grammars".  We adopted that core for Megaco.  In fact, ABNF can
be more general.  When I look at the first paragraph of section 2.3 I
see:

   "Rules resolve into a string of terminal values, sometimes called
    characters.  In ABNF a character is merely a non-negative integer.
    In certain contexts a specific mapping (encoding) of values into a
    character set (such as ASCII) will be specified."

Note further down, however, that:

   "Literal text strings" [i.e. strings between quotes]
   "are interpreted as a concatenated set of
    printable characters."

and printable characters are defined in Appendix A to be

    VCHAR          =  %x21-7E
                                ; visible (printing) characters


BTW I seem to be unsubscribed from the SG 16 (Cisco) list and therefore
no longer authorized to post to it.  The list is therefore not seeing my
responses.

Sasha Ruditsky wrote:
> Hi Tom
> 
> Then I am flummoxed by the line:
>      OCTET          =  %x00-FF
> Which appears several lines before the 6.2 in the same RFC2234.
> 
> In addition Megaco already has the following definition:
>  nonEscapeChar        = ( "\}" / %x01-7C / %x7E-FF )
> 
> How then are these encoded?
> 
> Regards,
> Sasha
> 
> 
> -----Original Message-----
> From: Tom Taylor [mailto:taylor at nortel.com]
> Sent: Friday, June 24, 2005 9:30 AM
> To: Christian Groves
> Cc: Sasha Ruditsky; itu-sg16 at external.cisco.com; 
> Angelo.Contardi at ITALTEL.IT
> Subject: Re: H.248 and UTF-8 strings.
> 
> I'd guess the restriction came from RFC 2234 (ABNF).  See paragraph
> 6.2.
> 
> Christian Groves wrote:
> 
>>Hello Sasha,
>>
>>You're quite welcome to bring in a contribution to the July Meeting on
>>this issue (addressed to Q.3 - same procedure as any question),
>>although I hope that there will be some agreement on the solution on
>>the Megaco list. Those that can remember that far back know that I
>>wasn't really a proponent of the text encoding so I can't remember why
>>these "extra chars" were excluded in the first place. I've added Tom
>>to see if he can remember. In terms of solution I would support adding
>>this to VALUE only as indicated below.
>>
>>Regards, Christian
>>
>>Sasha Ruditsky wrote:
>>
>>
>>>Hi Christian
>>>
>>>As you correctly stated all the "extra chars" are the region between 
>>>0x80 and 0xff (0xf7 to be precise).
>>>I am not aware of any special meaning of the characters from this
>>>region, so as a result I cannot understand why these "extra chars"
>>>need to be escaped or quoted in any way.
>>>
>>>The naive question is what is wrong with extending SafeChar to 
>>>contain this region?
>>>I.e.
>>>   SafeChar             = DIGIT / ALPHA / "+" / "-" / "&" /
>>>                          "!" / "_" / "/" / "\'" / "?" / "@" /
>>>                          "^" / "`" / "~" / "*" / "$" / "\" /
>>>                          "(" / ")" / "%" / "|" / "." / %x80-F7
>>>
>>>Or, if for some reason the UTF-8 strings may be used only in VALUE,
>>>then:
>>>
>>>  VALUE                = quotedString / 1*(SafeChar / %x80-F7)
>>>  quotedString         = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP)
>>>                         DQUOTE
>>>
>>>
>>>
>>>And...
>>>
>>>Assuming that the way to fix this is found, can it be fixed in ver 3?
>>>If yes, then what is the procedure?
>>>Do you want me to bring relevant contribution to July meeting?
>>>
>>>Thanks,
>>>Sasha
>>>
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Christian Groves [mailto:christian.groves at ericsson.com] Sent: 
>>>Wednesday, June 22, 2005 8:16 PM
>>>To: Sasha Ruditsky
>>>Cc: itu-sg16 at external.cisco.com; Angelo.Contardi at italtel.it
>>>Subject: Re: H.248 and UTF-8 strings.
>>>
>>>Hello Sasha,
>>>
>>>This problem was raised recently on the Megaco list. I've had some
>>>off-line discussion with Angelo (the person who raised the problem)
>>>and currently there are 2 proposed solutions (I hope he will send
>>>this to the Megaco list):
>>>
>>>1) The simple one: code the UTF-8 string in "Octet Mode". This is a
>>>BAD solution from the transmission efficiency point of view because
>>>it "halves the TX band": to TX one UTF-8 char (max 4 octets) I
>>>must TX max 2 x 4 = 8 ASCII chars.
>>>
>>>2) The complicated one: allow the ABNF quoted form of VALUE to TX ALL
>>>octets 0x01-0xFF, except 0x22, which should be ESCAPED with "\",
>>>as already done for the ABNF Local and Remote Descriptor (see SDP). Note
>>>that '\0' (0x00) is NOT ALLOWED in this new quoted string form, as in
>>>the present one, but this is not a problem because in UTF-8 the char '\0'
>>>(0x00) is the same as in ASCII (string terminator) and is NOT used to
>>>code "non ASCII" UTF-8 chars, i.e. all those chars > 0x7F that require
>>>more than one octet to be encoded (from 2 to 4 octets). In
>>>fact the "extra chars" needed to code a UTF-8 char are all above
>>>0x7F (have the MSBit = 1).
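
The size trade-off between the two proposals above can be sketched roughly in
Python. This is only an illustration under stated assumptions: "Octet Mode" is
assumed to mean two hex digits per octet (consistent with the "2 x 4 = 8"
arithmetic above), only 0x22 is escaped in the quoted form as the proposal
describes, and both function names are hypothetical.

```python
def hex_octet_form(text: str) -> str:
    """Proposal 1 (assumed): send each UTF-8 octet as two hex digits."""
    return text.encode("utf-8").hex().upper()


def quoted_form(text: str) -> bytes:
    """Proposal 2 (sketch): send raw UTF-8 octets between double quotes,
    escaping 0x22 with a backslash and rejecting 0x00."""
    raw = text.encode("utf-8")
    if 0x00 in raw:
        raise ValueError("0x00 is not allowed in the quoted form")
    return b'"' + raw.replace(b'"', b'\\"') + b'"'


# A character that needs 4 octets in UTF-8 (U+1F600) costs 8 ASCII
# characters in hex form, but 4 octets plus the two quotes when sent raw.
four_octet_char = "\U0001F600"
print(len(hex_octet_form(four_octet_char)))   # hex digits sent
print(len(quoted_form(four_octet_char)))      # octets sent, quotes included

# Every octet of a multi-octet UTF-8 sequence has the most significant
# bit set, so none of them can collide with 0x00 or 0x22.
print(all(b > 0x7F for b in four_octet_char.encode("utf-8")))
```
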
>>>
>>>Regards, Christian
>>>
>>>Sasha Ruditsky wrote:
>>>
>>>
>>>
>>>>Hi
>>>>
>>>>I'm trying to understand how H.248 supports UTF-8 string properties.
>>>>According to H.248 the string property is encoded as UTF-8 string.
>>>>
>>>>UTF-8 encoding is defined by the following table:
>>>>
>>>>Scalar Value                 1st Byte 2nd Byte 3rd Byte 4th Byte
>>>>00000000 0xxxxxxx            0xxxxxxx
>>>>00000yyy yyxxxxxx            110yyyyy 10xxxxxx
>>>>zzzzyyyy yyxxxxxx            1110zzzz 10yyyyyy 10xxxxxx
>>>>000uuuuu zzzzyyyy yyxxxxxx   11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>>>>
>>>>
>>>>I.e. all the character codes between x80 and xf7 need to be
>>>>supported.
>>>>
>>>>According to H.248 Annex B.2:
>>>>
>>>>The ABNF in this section uses the VALUE construct (or lists of VALUE
>>>>constructs) to encode various package element values (properties, 
>>>>signal parameters, etc.).
>>>>
>>>>The VALUE is defined as follows:
>>>>
>>>> VALUE                = quotedString / 1*(SafeChar)
>>>> SafeChar             = DIGIT / ALPHA / "+" / "-" / "&" /
>>>>                         "!" / "_" / "/" / "\'" / "?" / "@" /
>>>>                         "^" / "`" / "~" / "*" / "$" / "\" /
>>>>                         "(" / ")" / "%" / "|" / "."
>>>> ALPHA                = %x41-5A / %x61-7A ; A-Z / a-z
>>>> DIGIT                = %x30-39         ; 0-9
>>>> quotedString         = DQUOTE *(SafeChar / RestChar/ WSP) DQUOTE
>>>> RestChar             = ";" / "[" / "]" / "{" / "}" / ":" / "," / "#" /
>>>>                         "<" / ">" / "="
>>>> WSP                  = SP / HTAB ; white space
>>>> SP                   = %x20        ; space
>>>> HTAB                 = %x09        ; horizontal tab
>>>> DQUOTE               = %x22            ; " (Double Quote)
>>>>
>>>>
>>>>So I believe this excludes the x80-xff characters.
>>>>
>>>>So the question is how the text encoding defined in Annex B is
>>>>supposed to encode UTF-8 strings?
>>>>
>>>>Thanks,
>>>>Sasha
>>>>
>>>
>>>
>>
> 
> 
> 
> 