Issue 000523.1

000523.1

R. Brender

Representation

Reconsider 991118.1 re UTF-8

Proposal
--------

Reconsider 991118.1, because it is not upward compatible with existing
practice.

If we really want to allow/require UTF-8, add an attribute which specifies
the encoding in use within a module.

Proposal 1 -- Simplest

    Add DW_AT_use_utf8, a flag. If present, the string representation in
    use is UTF-8. If not present, the string representation in use is
    unspecified.

Proposal 2 -- Harder

    Add DW_AT_string_encoding, a constant (or possibly a block). If present,
    the string representation in use is specified by the value of the constant
    (or block). If not present, the string representation in use is
    unspecified.

    Define some method for "naming" the possible encodings/character sets.
    There are a variety to choose from, and we should avoid defining
    our own scheme. (But I am unprepared to recommend a particular scheme
    at the moment -- it has been too long since I worked on large character
    set issues in a prior life. I will do so, however, if Proposal 1 is
    deemed not sufficient.)

Discussion
----------

During the Draft2 editorial review, we paused over this statement
now in the Introduction:

    "Strings whose representation is not otherwise dictated by the data types
    of the target system (such as names used in declarations of the source
    program) are represented using UNICODE (see Section 7.5.4)."

Section 7.5.4 is more specific, specifying UNICODE UTF-8 and explaining
the UNICODE relationship to ISO 10646.

Mike and Dave convinced me that the qualification regarding "whose
representation is not otherwise dictated by the data types of the target
system" is unnecessary. Class string values are in fact used only for such
names of one sort or another, so the question of target type context never
arises. Well, almost. The attribute DW_AT_const_value for a named constant
entry can be a string, which is a borderline case in this regard. However,
the classes constant and block are also available and are clearly target
type dependent -- so specifying that class string has a specific representation
does not limit expressibility in any way. I guess that is probably OK,
though some might still regard that as an incompatibility with DWARF V2.

But the big issue is whether specifying UTF-8 (or indeed, *any* specific
representation) for class string generally will be regarded as too big an
incompatibility to be accepted.

[Aside: If you don't understand what UTF-8 is, take a look at either
http://www.talisman.org or http://prominence.com/books/net/cd/html/utf.html.]

Note that it is not sufficient to claim that implementations today are
generally "eight bit clean" and so can handle UTF-8 without problem. For
that same reason, they can handle any of the 8-bit character sets (Latin-1,
etc).

However, if a DWARF file is in fact using a character set other than *7*-bit
ASCII (which equals ISO 646, which is a common subset of the vast majority
of world character sets, EBCDIC and its many close friends excepted), then
that file will not be a valid DWARF V2.1 file according to 991118.1. That is,
UTF-8 is not binary upward compatible with any 8-bit character set -- 8-bit
ASCII, ISO Latin-1, whatever!
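
The incompatibility can be seen in a few lines of Python. This is only a
sketch: the sample name and the helper is_valid_utf8 are assumptions made for
illustration and do not come from any DWARF producer or consumer. It shows
that a pure 7-bit ASCII name is the same byte sequence under both encodings,
while a name containing a Latin-1 character is not even well-formed UTF-8.

```python
# Illustrative sketch only; the sample name is hypothetical.
name = "caf\u00e9"  # "cafe" with an acute accent on the final letter

ascii_bytes = "cafe".encode("ascii")    # 7-bit ASCII only
latin1_bytes = name.encode("latin-1")   # one byte per character
utf8_bytes = name.encode("utf-8")       # the accented letter takes two bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(ascii_bytes))
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

A file whose names look like ascii_bytes is unaffected by 991118.1; a file
whose names look like latin1_bytes is the case at issue here.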

I think the best we can do is provide the suggested DW_AT_use_utf8 attribute
and recommend its use for character sets/representations other than 7-bit
ASCII. If a producer elects not to do so, then we let the current caveat
emptor (consumer beware) situation persist. In particular, I see no need
or real advantage for us to go whole hog the other way and provide a general
scheme for specifying use of arbitrary or even selected character sets.
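
As a rough sketch of what this means for a consumer, the following Python
shows one possible way to interpret a class string value depending on whether
DW_AT_use_utf8 is present. The function name decode_name, its parameters, and
the fallback policy for the unspecified case are all assumptions chosen for
illustration; the proposal itself says nothing about how a consumer should
behave when the flag is absent.

```python
def decode_name(raw: bytes, uses_utf8: bool) -> str:
    """Decode a class string value (hypothetical consumer helper).

    uses_utf8 stands for "DW_AT_use_utf8 was present"; how that flag is
    read from the debugging information is outside this sketch.
    """
    if uses_utf8:
        # The producer has promised UTF-8; substitute U+FFFD for any
        # malformed bytes rather than failing outright.
        return raw.decode("utf-8", errors="replace")
    # Encoding unspecified (consumer beware). Assumed policy: trust only
    # the 7-bit ASCII subset and show every other byte as an escape.
    return raw.decode("ascii", errors="backslashreplace")


print(decode_name(b"caf\xc3\xa9", uses_utf8=True))
print(decode_name(b"main", uses_utf8=False))
print(decode_name(b"caf\xe9", uses_utf8=False))
```

Escaping the non-ASCII bytes in the unflagged case is only one defensible
choice; a consumer could equally fall back to a locale-defined encoding.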

Proposal 1 approved. The DW_AT_use_utf8 attribute applies only to
DW_TAG_compilation_unit entries.