[Megaco] The ABNF coding of VALUE doesn't allow to represent a generic UTF-8 string/char
Steve Cipolli
SCipolli at RADVISION.COM
Fri Jun 24 15:14:39 EDT 2005
Angelo,
Today all ASCII characters except double quotes can be specified by the
VALUE production (some of the characters need to be in the quotedString
form). Adding UTF-8 does not change this. The part of UTF-8 that is <
0x80 is exactly ASCII and does not change or need to change the way in
which VALUE is encoded.
UTF-8 characters above 0x7F are encoded as multiple bytes in which each
byte of the character is guaranteed to be > 0x7F. So UTF-8 characters are
either < 0x80 and are simply the equivalent of ASCII or they are encoded
into multiple bytes in which every byte of the character is > 0x7F. There
will never be a circumstance where a byte of a UTF-8 multibyte character
collides with _any_ ASCII character.
Perhaps the table below will help clarify:

Scalar Value                  1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx             0xxxxxxx
00000yyy yyxxxxxx             110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx             1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx    11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
Notice that "one byte" UTF-8 characters have their high-order bit cleared
(0) and that all bytes of all multi-byte UTF-8 encodings have the
high-order bit set (1). In other words, every byte of a multi-byte
character is > 0x7F.
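
The property can be checked mechanically. The following is only a minimal sketch in Python; the sample characters and the helper name `bytes_never_collide_with_ascii` are illustrative choices, not part of the Megaco text.

```python
# Sample characters covering 1-, 2-, 3- and 4-byte UTF-8 encodings
# (illustrative choices: "A", e-acute, euro sign, musical G clef).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001d11e"]


def bytes_never_collide_with_ascii(ch: str) -> bool:
    """Return True if the UTF-8 encoding of ch follows the rule above.

    A one-byte encoding must be < 0x80 (plain ASCII); in a multi-byte
    encoding every byte must be > 0x7F.
    """
    encoded = ch.encode("utf-8")
    if len(encoded) == 1:
        return encoded[0] < 0x80
    return all(b > 0x7F for b in encoded)


for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), "
          f"rule holds = {bytes_never_collide_with_ascii(ch)}")
```
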
So changing the productions to:
VALUE = quotedString / 1*(SafeChar / %x80-F7)
quotedString = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP) DQUOTE
as Sasha Ruditsky suggests will allow UTF-8 characters to be used in
VALUEs. ASCII characters (and UTF-8 equivalents < 0x80) that are in the
RestChar set will need to be quoted just as they have always been. UTF-8
characters above 0x7F may or may not be quoted, just like the ASCII
characters in the SafeChar set.
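
A rough way to try the extended productions is to express them as byte-level regular expressions. This is only a sketch: the SafeChar and RestChar sets below are written from my recollection of RFC 3525 Annex B and should be checked against the specification, and the function name `is_value` is hypothetical.

```python
import re

# ASSUMPTION: SafeChar and RestChar as recalled from RFC 3525 Annex B;
# verify against the specification before relying on them.
SAFE_CHAR = rb"0-9A-Za-z+\-&!_/'?@^`~*$\\()%|."
REST_CHAR = rb";\[\]{}:,#<>="
HIGH_OCTET = rb"\x80-\xf7"   # the proposed %x80-F7 extension
WSP = rb" \t"                # SP / HTAB

# VALUE = quotedString / 1*(SafeChar / %x80-F7)
UNQUOTED = re.compile(rb"[" + SAFE_CHAR + HIGH_OCTET + rb"]+")
# quotedString = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP) DQUOTE
QUOTED = re.compile(
    rb'"[' + SAFE_CHAR + HIGH_OCTET + REST_CHAR + WSP + rb']*"'
)


def is_value(octets: bytes) -> bool:
    """Return True if octets match the extended VALUE production."""
    return bool(UNQUOTED.fullmatch(octets) or QUOTED.fullmatch(octets))


print(is_value("café".encode("utf-8")))     # unquoted, multi-byte char
print(is_value(b"a=b"))                     # RestChar outside quotes
print(is_value('"a=b, café"'.encode("utf-8")))
```

Note that the sketch only checks octet ranges, as the ABNF does; it does not verify that the high octets form well-formed UTF-8 sequences.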
--Stephen
> -----Original Message-----
> From: Contardi Angelo [mailto:Angelo.Contardi at italtel.it]
> Sent: Friday, June 24, 2005 12:25 PM
> To: Steve Cipolli
> Subject: R: [Megaco] The ABNF coding of VALUE doesn't allow
> to represent a generic UTF-8 string/char
>
>
>
>
> -----Original Message-----
> From: Steve Cipolli [mailto:SCipolli at radvision.com]
> Sent: Friday, June 24, 2005 18:01
> To: Contardi Angelo; Megaco IETF - Mail List (E-mail)
> Cc: Christian Groves (E-mail)
> Subject: RE: [Megaco] The ABNF coding of VALUE doesn't allow
> to represent a generic UTF-8 string/char
>
>
>
> Two points:
>
> 1. It is not necessary to quote the UTF-8 characters as you
> define in your second solution. VALUE can be extended to
> accept characters (quoted and unquoted) in the range 0x80-0xFF.
>
> The problem is that VALUE doesn't allow ALL ASCII
> (0x00-0x7F) to be "carried" either, which is NEEDED to code
> UTF-8 characters < 0x80. So the "extension" 0x80-0xFF alone
> is NOT enough to allow the TX of ALL UTF-8 with ABNF. Quoting
> is NEEDED because, if you "scan" an input sequence and allow
> a VALUE to be "any sequence of one or more 0x00-0xFF", all
> "tokens" are identified as VALUEs. The extension of the
> quoted form of VALUE that I propose doesn't increase the
> number of forms of VALUE; it just EXTENDS the current quoted
> form of VALUE.
>
> 2. Providing a mechanism to allow double quotes (") in a
> (double) quotedstring is essentially a separate issue, since
> addition of UTF-8 chars does not motivate the need for this
> mechanism nor complicate its addition.
>
> See point 1.
>
> --Stephen
>
> > -----Original Message-----
> > From: megaco-bounces at ietf.org
> > [mailto:megaco-bounces at ietf.org] On Behalf Of Contardi Angelo
> > Sent: Friday, June 24, 2005 11:34 AM
> > To: Megaco IETF - Mail List (E-mail)
> > Cc: Christian Groves (E-mail)
> > Subject: [Megaco] The ABNF coding of VALUE doesn't allow to
> > represent a generic UTF-8 string/char
> >
> >
> > Hello,
> >
> > From my previous private discussion with Christian Groves,
> > I have deduced the following:
> >
> > The ABNF coding of VALUE does not allow a generic UTF-8
> > string or char (RFC 3629) to be represented, because:
> >
> > 1) The FIRST byte of a UTF-8 char may be ANY ASCII char in
> > the range 0x00-0x7E and, for instance, ASCII 0x00-0x07 and
> > 0x22 (") are not allowed in ANY ABNF form of VALUE.
> >
> > 2) Furthermore, UTF-8 chars "greater than 0x7E" need "ASCII
> > chars" (or, better, octets) in the range 0x80-0xFF (not ALL
> > of them), which are not allowed in ANY ABNF form of VALUE.
> >
> > So, to "correct" this problem, I can suggest two possible solutions:
> >
> > 1) The simple one: code the UTF-8 string/char in "Octet
> > Mode", as described in ANNEX B.3. This is a BAD solution
> > from the point of view of transmission efficiency because
> > it "halves the TX band": to TX one UTF-8 char (max 4
> > octets) I must TX max 2 x 4 = 8 octets (ASCII chars).
> >
> > 2) The difficult one: allow the ABNF quoted form of VALUE to
> > TX ALL ASCII chars 0x01-0xFF (those in the range 0x80-0xFF
> > are more properly named OCTET in RFC 2234), except 0x22 ("),
> > which should be ESCAPED with "\", as already done for the
> > ABNF Local and Remote Descriptors (see SDP). Note that '\0'
> > (0x00) is NOT ALLOWED in this new quoted string form, as in
> > the present one, but this is not a problem because in UTF-8
> > the char '\0' (0x00) is the same as in ASCII (string
> > terminator) and is NOT used to code "non-ASCII" UTF-8
> > chars, i.e. all those chars > 0x7F that require more than
> > one "ASCII char" (octet) to be encoded (from 2 to 4). In
> > fact the "extra chars" needed to code a UTF-8 char are all
> > above 0x7F.
> >
> > While the 1st mode doesn't require any modification of the
> > ABNF syntax (but is it applicable in every case?), the 2nd one
> > requires this ABNF modification:
> >
> > quotedString = DQUOTE *(quotedChar) DQUOTE
> >
> > ; %x22 (") is allowed just if "escaped" with "\"
> > quotedChar = ( %x01-21 / "\" DQUOTE / %x23-FF )
> >
> > In a parser implementation, I think, this is not a
> > "terrible complication" and can be solved in the same way as
> > the "octetString" of the Local and Remote Descriptors. It also
> > requires implementing an ESCAPE "strip (rx) / padding (tx)"
> > mechanism, as already required for SDP.
> >
> > P.S.: I suppose NO ONE needs to send the "string terminator"
> > '\0' (0x00) in a UTF-8 string or char. If my assumption is
> > false, in "solution 2)" the %x00 should also be "escaped".
> > Conversely, "solution 1)" can already "transport" the %x00
> > octet.
> >
> > Best regards
> >
> > o o o o o o o . . . ___________________________________
> > o _____ || Angelo Contardi |
> > .][__n_n_|DD[ ====_____ | angelo.contardi at italtel.it |
> > >(________|__|_[_________]_|________________________________|
> > _/oo OOOOO oo` ooo ooo 'o!o!o o!o!o`
> >
> > _______________________________________________
> > Megaco mailing list
> > Megaco at ietf.org
> > https://www1.ietf.org/mailman/listinfo/megaco
> >
> >
> >
>
>