[Megaco] The ABNF coding of VALUE doesn't allow to represent a generic UTF-8 string/char

Fri Jun 24 15:14:39 EDT 2005

Angelo,

Today every printable ASCII character except the double quote can be
specified by the VALUE production (some of them only in the quotedString
form).  Adding UTF-8 does not change this.  The part of UTF-8 below 0x80
is exactly ASCII, so the way VALUE encodes those characters does not
change and does not need to.

UTF-8 characters above 0x7F are encoded as multiple bytes, and each byte
of such a character is guaranteed to be > 0x7F.  So a UTF-8 character is
either < 0x80, and simply the equivalent of ASCII, or it is encoded as
multiple bytes, every one of which is > 0x7F.  A byte of a UTF-8
multibyte character can never collide with _any_ ASCII character.

Perhaps the table below will help clarify:

Scalar Value                  1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx             0xxxxxxx
00000yyy yyxxxxxx             110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx             1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx    11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

Notice that "one byte" UTF-8 characters have their high-order bit cleared
(0) and that all bytes of all multi-byte UTF-8 encodings have the
high-order bit set (1).  In other words, every byte of a multi-byte
character is > 0x7F.
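
As a quick illustration of this property, here is a minimal Python sketch
(not part of the Megaco text; the function names are made up for this
example).  It encodes code points with Python's built-in UTF-8 codec and
checks that a single-byte character equals its ASCII value while every
byte of a multi-byte character is > 0x7F.  Surrogate code points are
skipped because they cannot be encoded in UTF-8.

```python
def is_ascii_or_all_high(code_point: int) -> bool:
    """True if the UTF-8 form is one ASCII byte or only bytes > 0x7F."""
    encoded = chr(code_point).encode("utf-8")
    if code_point < 0x80:
        # One-byte form: identical to the ASCII value.
        return len(encoded) == 1 and encoded[0] == code_point
    # Multi-byte form: no byte may fall in the ASCII range.
    return len(encoded) > 1 and all(b > 0x7F for b in encoded)


def check_range(start: int, stop: int) -> bool:
    """Check every encodable code point in [start, stop)."""
    return all(
        is_ascii_or_all_high(cp)
        for cp in range(start, stop)
        if not 0xD800 <= cp <= 0xDFFF  # surrogates are not encodable
    )


if __name__ == "__main__":
    print(check_range(0x0000, 0x110000))  # scans the whole code space
    print("\u20ac".encode("utf-8"))       # the euro sign, three high bytes
```

In particular the double quote (0x22) can never appear inside the
encoding of a non-ASCII character, so a parser scanning for DQUOTE is not
confused by multi-byte sequences.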

So changing the productions to:
  VALUE                = quotedString / 1*(SafeChar / %x80-F7)
  quotedString         = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP)
                         DQUOTE

as Sasha Ruditsky suggests will allow UTF-8 characters to be used in
VALUEs.  ASCII characters (and their UTF-8 equivalents < 0x80) that are in
the RestChar set will need to be quoted just as they always have been.
UTF-8 characters above 0x7F may or may not be quoted, just like the ASCII
characters in the SafeChar set.
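
To make the effect of the changed productions concrete, the following is
a rough byte-level sketch in Python.  The SafeChar, RestChar and WSP sets
are assumed to match RFC 3525 Annex B.2 and should be verified against
the specification; is_value and the set names are hypothetical.  The
sketch only checks which octets are admitted, it does not verify that the
high octets form well-formed UTF-8.

```python
import string

DQUOTE = 0x22

# Assumption: these sets mirror SafeChar / RestChar / WSP in RFC 3525
# Annex B.2; check them against the specification before relying on them.
SAFE_CHAR = frozenset(
    (string.ascii_letters + string.digits + "+-&!_/'?@^`~*$\\()%|.").encode("ascii")
)
REST_CHAR = frozenset(b";[]{}:,#<>=")
WSP = frozenset(b" \t")
HIGH = frozenset(range(0x80, 0xF8))  # the proposed %x80-F7 range

UNQUOTED_OK = SAFE_CHAR | HIGH
QUOTED_OK = UNQUOTED_OK | REST_CHAR | WSP


def is_value(data: bytes) -> bool:
    """Match data against the proposed VALUE production (octet level only)."""
    if len(data) >= 2 and data[0] == DQUOTE and data[-1] == DQUOTE:
        # quotedString = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP) DQUOTE
        return all(b in QUOTED_OK for b in data[1:-1])
    # 1*(SafeChar / %x80-F7): at least one octet, none from RestChar or WSP
    return len(data) >= 1 and all(b in UNQUOTED_OK for b in data)


if __name__ == "__main__":
    for sample in ["caf\u00e9", "a=b", '"a=b"', '"na\u00efve text"']:
        print(repr(sample), is_value(sample.encode("utf-8")))
```

Under these assumptions an unquoted value containing non-ASCII
characters is accepted, while "a=b" is accepted only in its quoted form,
which matches the behaviour described above.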

--Stephen

> -----Original Message-----
> From: Contardi Angelo [mailto:Angelo.Contardi at italtel.it] 
> Sent: Friday, June 24, 2005 12:25 PM
> To: Steve Cipolli
> Subject: R: [Megaco] The ABNF coding of VALUE doesn't allow
> to represent a generic UTF-8 string/char
> 
> 
> 
> 
> -----Original Message-----
> From: Steve Cipolli [mailto:SCipolli at radvision.com]
> Sent: Friday, June 24, 2005 18:01
> To: Contardi Angelo; Megaco IETF - Mail List (E-mail)
> Cc: Christian Groves (E-mail)
> Subject: RE: [Megaco] The ABNF coding of VALUE doesn't allow
> to represent a generic UTF-8 string/char
> 
> 
> 
> Two points:
> 
> 1. It is not necessary to quote the UTF-8 characters as you 
> define in your second solution.  VALUE can be extended to 
> accept characters (quoted and unquoted) in the range 0x80-0xFF.  
> 
> The problem is that VALUE doesn't allow it to "carry" ALL ASCII
> (0x00-0x7F) either, which is NEEDED to code UTF-8 characters < 0x80.
> So the "extension" 0x80-0xFF alone is NOT enough to allow the TX
> of ALL UTF-8 with ABNF. The quotation is NEEDED because if
> you "scan" an input sequence and allow a VALUE to be "any
> sequence of one or more 0x00-0xFF", all "tokens" are identified
> as VALUEs. The extension of the quoted form of VALUE I propose
> doesn't increase the number of forms of VALUE, it just EXTENDS
> the current quoted form of VALUE.
> 
> 2. Providing a mechanism to allow double quotes (") in a 
> (double) quotedstring is essentially a separate issue, since 
> addition of UTF-8 chars does not motivate the need for this 
> mechanism nor complicate its addition.
> 
> See point 1.
> 
> --Stephen
> 
> > -----Original Message-----
> > From: megaco-bounces at ietf.org
> > [mailto:megaco-bounces at ietf.org] On Behalf Of Contardi Angelo
> > Sent: Friday, June 24, 2005 11:34 AM
> > To: Megaco IETF - Mail List (E-mail)
> > Cc: Christian Groves (E-mail)
> > Subject: [Megaco] The ABNF coding of VALUE doesn't allow to
> > represent a generic UTF-8 string/char
> > 
> > 
> > Hello,
> > 
> > from my previous private discussion with Christian Groves,
> > I have deduced this:
> > 
> > The ABNF coding of VALUE doesn't allow representing a
> > generic UTF-8 string or char (RFC 3629) because:
> > 
> > 1) The FIRST byte of a UTF-8 char may be ANY ASCII char in
> >    the range 0x00-0x7E and, for instance, the ASCII chars
> >    0x00-0x07 and 0x22 (") are not allowed in ANY ABNF form
> >    of VALUE.
> > 
> > 2) Furthermore, UTF-8 chars "greater than 0x7E" need "ASCII
> >    chars" (or, better, octets) in the range 0x80-0xFF (not
> >    ALL of them), which are not allowed in ANY ABNF form of
> >    VALUE.
> > 
> > So, to "correct" this problem, I can suggest two possible solutions:
> > 
> > 1) The simple one: code the UTF-8 string/char in "Octet
> >    Mode", as described in ANNEX B.3. This is a BAD solution
> >    from the transmission-efficiency point of view because it
> >    "halves the TX band": to TX one UTF-8 char (max 4 octets)
> >    I must TX up to 2 x 4 = 8 octets (ASCII chars).
> > 
> > 2) The difficult one: allow the ABNF quoted form of VALUE to
> >    TX ALL ASCII chars 0x01-0xFF (the range 0x80-0xFF is more
> >    properly named OCTET in RFC 2234), except 0x22 ("), which
> >    should be ESCAPED with "\", as already done for the ABNF
> >    Local and Remote Descriptors (see SDP). Note that '\0'
> >    (0x00) is NOT ALLOWED in this new quoted string form, as
> >    in the present one, but that is not a problem because in
> >    UTF-8 the char '\0' (0x00) is the same as in ASCII (string
> >    terminator) and is NOT used to code "non-ASCII" UTF-8
> >    chars, i.e. all those chars > 0x7F that require more than
> >    one octet to be encoded (from 2 to 4 octets). In fact the
> >    "extra chars" needed to code a UTF-8 char are all above
> >    0x7F.
> > 
> > While the 1st mode doesn't require any modification of the
> > ABNF syntax (but is it applicable in every case?), the 2nd one
> > requires this ABNF modification:
> > 
> >       quotedString         = DQUOTE *(quotedChar) DQUOTE
> >   	
> >       ; %x22 (") is allowed just if "escaped" with "\"
> >       quotedChar           = ( %x01-21 / "\" DQUOTE / %x23-FF )
> > 
> > In a parser implementation, I think, this is not a
> > "terrible complication" and can be solved in the same way as
> > the "octetString" of Local and RemoteDescriptor. It also
> > requires implementing an ESCAPE "strip (rx) / padding (tx)"
> > mechanism, as already required for SDP.
> > 
> > P.S.: I suppose NO ONE needs to send the "string terminator"
> >       '\0' (0x00) in a UTF-8 string or char. If my assertion
> >       is false, in "solution 2)" the %x00 should also be
> >       "escaped". Conversely, "solution 1)" can already
> >       "transport" the %x00 octet.
> > 
> > Best regards
> > 
> >         o o o o o o o . . .   ___________________________________
> >         o      _____           ||        Angelo Contardi         |
> >       .][__n_n_|DD[  ====_____  |   angelo.contardi at italtel.it   |
> >      >(________|__|_[_________]_|________________________________|
> >      _/oo OOOOO oo`  ooo   ooo  'o!o!o                      o!o!o`
> > 
> > _______________________________________________
> > Megaco mailing list
> > Megaco at ietf.org
> > https://www1.ietf.org/mailman/listinfo/megaco
> > 
> > 
> > 
> 
> 