Angelo,

Today all ASCII characters except the double quote can be specified by the VALUE production (some of them only in the quotedString form). Adding UTF-8 does not change this. The part of UTF-8 that is < 0x80 is exactly ASCII, so it neither changes nor needs to change the way VALUE is encoded. UTF-8 characters above 0x7F are encoded as multiple bytes, and each byte of such a character is guaranteed to be > 0x7F. So a UTF-8 character is either < 0x80, and simply the equivalent of ASCII, or it is encoded as multiple bytes in which every byte is > 0x7F. A byte of a multi-byte UTF-8 character will never collide with _any_ ASCII character.

Perhaps the table below will help clarify:

Scalar Value                 1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx            0xxxxxxx
00000yyy yyxxxxxx            110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx            1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

Notice that "one byte" UTF-8 characters have their high-order bit cleared (0) and that all bytes of all multi-byte UTF-8 encodings have the high-order bit set (1). In other words, every byte of a multi-byte character is > 0x7F.

So changing the productions to:

VALUE = quotedString / 1*(SafeChar / %x80-F7)
quotedString = DQUOTE *(SafeChar / %x80-F7 / RestChar / WSP) DQUOTE

as Sasha Ruditsky suggests will allow UTF-8 characters to be used in VALUEs. ASCII characters (and their UTF-8 equivalents < 0x80) that are in the RestChar set will need to be quoted just as they always have been. UTF-8 characters above 0x7F may or may not be quoted, just like the ASCII characters in the SafeChar set.

--Stephen
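The byte-level property described above (single-byte characters are < 0x80, every byte of a multi-byte character is > 0x7F) can be checked with a minimal sketch. It assumes Python's built-in UTF-8 encoder; the helper name `no_ascii_collision` is hypothetical and used only for illustration.

```python
def no_ascii_collision(text: str) -> bool:
    """Return True if the UTF-8 encoding of every character in `text`
    is either one byte < 0x80 or several bytes that are all > 0x7F."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) == 1:
            # Single-byte form: must be plain ASCII (high-order bit clear).
            if encoded[0] >= 0x80:
                return False
        else:
            # Multi-byte form: every byte must have the high-order bit set.
            if any(b <= 0x7F for b in encoded):
                return False
    return True


# One sample per row of the table: 1, 2, 3 and 4 byte encodings.
samples = ["A", "\u00e9", "\u20ac", "\U00010348"]
for ch in samples:
    bits = " ".join(f"{b:08b}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}  {bits}")
```

The printed bit patterns can be compared against the rows of the table: only the one-byte form starts with a 0 bit.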
-----Original Message----- From: Contardi Angelo [mailto:Angelo.Contardi@italtel.it] Sent: Friday, June 24, 2005 12:25 PM To: Steve Cipolli Subject: R: [Megaco] The ABNF coding of VALUE doesn't allow to represent ageneric UTF-8 string/char
-----Original Message----- From: Steve Cipolli [mailto:SCipolli@radvision.com] Sent: Friday, June 24, 2005 18:01 To: Contardi Angelo; Megaco IETF - Mail List (E-mail) Cc: Christian Groves (E-mail) Subject: RE: [Megaco] The ABNF coding of VALUE doesn't allow to represent ageneric UTF-8 string/char
Two points:
1. It is not necessary to quote the UTF-8 characters as you define in your second solution. VALUE can be extended to accept characters (quoted and unquoted) in the range 0x80-0xFF.
The problem is that VALUE also does not allow ALL ASCII characters (0x00-0x7F) to be carried, and these are NEEDED to code UTF-8 characters < 0x80. So extending it with 0x80-0xFF alone is NOT enough to allow ALL UTF-8 to be transmitted with the ABNF. Quoting is NEEDED because, if you scan an input sequence and allow a VALUE to be "any sequence of one or more octets in 0x00-0xFF", every token is identified as a VALUE. The extension of the quoted form of VALUE that I propose does not increase the number of forms of VALUE; it just EXTENDS the current quoted form of VALUE.
2. Providing a mechanism to allow double quotes (") in a (double) quotedstring is essentially a separate issue, since addition of UTF-8 chars does not motivate the need for this mechanism nor complicate its addition.
See point 1.
--Stephen
-----Original Message----- From: megaco-bounces@ietf.org [mailto:megaco-bounces@ietf.org] On Behalf Of Contardi Angelo Sent: Friday, June 24, 2005 11:34 AM To: Megaco IETF - Mail List (E-mail) Cc: Christian Groves (E-mail) Subject: [Megaco] The ABNF coding of VALUE doesn't allow to represent ageneric UTF-8 string/char
Hello,
From my previous private discussion with Christian Groves, I have deduced the following:
The ABNF coding of VALUE cannot represent a generic UTF-8 string or char (RFC 3629) because:
1) The FIRST byte of a UTF-8 char may be ANY ASCII char in the range 0x00-0x7E and, for instance, ASCII 0x00-0x07 and 0x22 (") are not allowed in ANY ABNF form of VALUE.
2) Furthermore, UTF-8 chars "greater than 0x7E" need "ASCII chars" (more precisely, octets) in the range 0x80-0xFF (not ALL of them), which are not allowed in ANY ABNF form of VALUE.
So, to "correct" this problem, I can suggest two possible solutions:
1) The simple one: code the UTF-8 string/char in "Octet Mode", as described in ANNEX B.3. This is a BAD solution in terms of transmission efficiency because it halves the TX bandwidth: to TX one UTF-8 char (max 4 octets) I must TX up to 2 x 4 = 8 octets (ASCII chars).
2) The difficult one: allow the ABNF quoted form of VALUE to TX ALL chars 0x01-0xFF (those in the range 0x80-0xFF are more properly named OCTET in RFC 2234), except 0x22 ("), which should be ESCAPED with "\", as is already done for the ABNF Local and Remote Descriptors (see SDP). Note that '\0' (0x00) is NOT ALLOWED in this new quoted string form, just as in the present one, but this is not a problem: in UTF-8 the char '\0' (0x00) is the same as in ASCII (string terminator) and is NOT used to code "non-ASCII" UTF-8 chars, i.e. all those chars > 0x7F that require more than one octet to be encoded (from 2 to 4 octets). In fact the "extra chars" needed to code a UTF-8 char are all above 0x7F.
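The cost of solution 1) can be illustrated with a small sketch. It assumes that the "Octet Mode" of ANNEX B.3 writes each octet as two hexadecimal ASCII digits, which is what the "2 x 4 = 8" figure above implies; the function name `octet_mode_encode` is hypothetical.

```python
def octet_mode_encode(text: str) -> str:
    """Hex-code the UTF-8 octets of `text`: two ASCII hex digits per octet.
    (Assumed reading of the Annex B.3 octet coding, for illustration only.)"""
    return text.encode("utf-8").hex().upper()


# Compare raw UTF-8 length with the hex-coded length for 1..4 octet chars.
for ch in ["A", "\u00e9", "\u20ac", "\U00010348"]:
    raw = ch.encode("utf-8")
    coded = octet_mode_encode(ch)
    print(f"U+{ord(ch):04X}: {len(raw)} octet(s) raw, {len(coded)} on the wire ({coded})")
```

Under this assumption every character, including plain ASCII, takes twice as many octets on the wire as its UTF-8 form.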
While the 1st solution doesn't require any modification of the ABNF syntax (but is it applicable in every case?), the 2nd one requires this ABNF modification:
quotedString = DQUOTE *(quotedChar) DQUOTE
; %x22 (") is allowed only if "escaped" with "\"
quotedChar = ( %x01-21 / "\" DQUOTE / %x23-FF )
In a parser implementation, I think, this is not a "terrible complication" and can be solved in the same way as the "octetString" of the Local and Remote Descriptors. It also requires implementing an ESCAPE "strip (RX) / padding (TX)" mechanism, as already required for SDP.
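A minimal sketch of the escape "padding (TX) / strip (RX)" step for the proposed quotedChar rule follows. It works on the octets between the two DQUOTEs only, and the function names `escape_quoted` and `unescape_quoted` are hypothetical; this is an illustration under those assumptions, not an existing Megaco parser routine.

```python
def escape_quoted(value: bytes) -> bytes:
    """TX side: pad every DQUOTE (0x22) with a preceding backslash.
    %x00 is rejected because the proposed quotedChar rule excludes it."""
    if b"\x00" in value:
        raise ValueError("%x00 is not allowed in the proposed quoted form")
    return value.replace(b'"', b'\\"')


def unescape_quoted(wire: bytes) -> bytes:
    """RX side: strip the backslash in front of every escaped DQUOTE."""
    return wire.replace(b'\\"', b'"')


# Round trip of a value containing double quotes and a 2-octet UTF-8 char.
value = 'say "h\u00e9"'.encode("utf-8")
wire = escape_quoted(value)
assert unescape_quoted(wire) == value
print(wire)
```

Since %x23-FF still contains the backslash itself, a value whose last octet is "\" would run into the closing DQUOTE under the grammar as written; the sketch does not try to resolve that case.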
P.S.: I suppose NO ONE needs to send the "string terminator" '\0' (0x00) in a UTF-8 string or char. If my assumption is false, then in "solution 2)" the %x00 should also be "escaped". Conversely, "solution 1)" can already "transport" the %x00 octet.
Best regards
Angelo Contardi
angelo.contardi@italtel.it
_______________________________________________ Megaco mailing list Megaco@ietf.org https://www1.ietf.org/mailman/listinfo/megaco