010219.1 A D. Anderson Compression Duplicate Dwarf elimination

This proposal replaces 991026.3.

This describes a way to emit dwarf that is shared
by multiple compilation units, avoiding duplication.
It deals with #include duplications, function duplications,
and more. Parts of this have been implemented in gcc
(as is described below) and parts are simply invention.

The following has major sections, each preceded by a line of '='
characters that ends in a tag such as <3.1 wording>.

Section numbers here are draft 5 section numbers.

The first section is revised wording for section 3.1

The second section is revised wording for the section
"Out-of-Line Instances of Inline Subroutines"
with respect to abstract roots.

The third section is a new document section, which
we call 3.8 just to be specific here.
Some parts of this third section might be better
as appendices. <3.8 compression>

The fourth is a revision to section 5.6.1.

We present these in the order a
reader of the dwarf document will encounter them.

===================================================<3.1 wording>
Replace the first paragraph of Section 3.1,

    "An object file may be derived from one or more compilation units.
    Each such compilation unit is described by a debugging information
    unit with the tag DW_TAG_compile_unit or the tag DW_TAG_subunit.
    In simple normal compilation, a single compilation unit with
    the tag DW_TAG_compile_unit is emitted per object file, and
    DW_TAG_subunit is not emitted.
    When DWARF space compression and duplication elimination and
    the like is being done, additional compilation-unit-tags
    may be emitted in an object file (and these additional
    compilation-unit-tags may be DW_TAG_compile_unit or
    DW_TAG_subunit as appropriate).

    "A DW_TAG_compile_unit entry owns debugging information entries that
    represent declarations made in the corresponding compilation unit.
    A DW_TAG_subunit entry owns debugging information entries that represent
    some portion of the declarations made in a related compilation unit.

    <i>A DW_TAG_subunit does not necessarily correspond to any
    source language
    syntax; it is part of a mechanism by which a compiler may attempt
    to make the DWARF description of a program more
    space efficient. See section 3.8, "Space Compression".</i>

    <i>The place where the declarations of a subunit logically
    occur is indicated by means of a DW_TAG_import_subunit
    debugging information entry that refers to the subunit.</i>"

    See section 3.8,
    "Space Compression", for the definition and use of
    DW_TAG_subunit and for the use of
    DW_TAG_compile_unit when the producer
    is attempting to save space in the debugging information.

Following bullet 10, remove the paragraph
  "A compilation unit entry owns debugging information entries
  that represent the declarations made in the corresponding
  compilation unit."
as that information is now earlier in 3.1.

===================================================< wording>
Part of this proposal is that item 2 at the end of the section
"Out-of-Line Instances of Inline Subroutines"
have its wording changed to:

2. The root entry for a concrete out-of-line-instance tree
is normally owned by the same parent entry that owns
the root entry of the associated abstract instance; however,
there is no requirement that the abstract and concrete
out-of-line-instance trees be owned by the same parent entry.

===================================================<3.8 compression>

Section 3.8 Data Compression

3.8.1 motivation

DWARF2 can use a lot of disk space, especially for C++.

The incredible depth and complexity of headers for C++ means
many many (possibly thousands of) declarations are repeated in every
compilation unit.

C++ templates mean some functions and their dwarf get duplicated.

For maximum flexibility, implementations want to be able to
move functions around (so that frequently called code can be
placed to avoid excessive instruction-page references or
icache thrashing). Putting the dwarf2 for all the
functions in a single compilation unit makes this difficult. Consider, for
example, a function that is dead (never called): how can its
unneeded dwarf information be removed?

Since discussing this seems inextricably tied to
object-file aspects, various object-format-specific
terms are used. Such terms are intended to
aid in explaining the concepts,
not to prescribe use of one object format
or another.

3.8.2 Overview

The solution is to break up the debug information into
separate sections and separate compilation units in
the output from compiling a single source file.

We'll use some traditional section naming here but
aside from the dwarf sections, the names are just meant
to suggest traditional contents as a way
of explaining the approach, not to be limiting in any way
on an implementation.

Where a traditional relocatable-object output from a
single source file might contain sections named:


A relocatable object from a compilation system attempting
some duplicate-dwarf elimination might contain


    followed (or preceded; order is not significant) by a series
    of 'section groups'
    section-group 1
    section-group N

where section groups might or might not contain executable code
(.text sections).

The contents of a section group could be
discarded as a group (if determined appropriate by a linker).
For example, if a linker determined that section-group 1
from A.o and section-group 3 from B.o were identical it
could discard one group and arrange that all references in A.o and
B.o were to apply to the remaining one of the two identical
section groups. This space compression
definition is intended to
make that 'arranging' trivial and automatic because
the references are simply to external names and the linker
already knows how to match up references and definitions.
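
The discard-and-rebind behaviour described above can be sketched as a toy
model. This is only an illustration, not a description of any real linker:
the class names, the link() function and the dictionary layout are all
hypothetical, and real linkers work on sections and relocations rather than
Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class SectionGroup:
    name: str        # group identifier (a linker global symbol)
    defines: list    # global DIE labels defined inside the group


@dataclass
class RelocatableObject:
    filename: str
    groups: list = field(default_factory=list)
    references: list = field(default_factory=list)  # external names used


def link(objects):
    """Keep the first group seen per name, then resolve references by name."""
    kept = {}     # group name -> filename whose copy was kept
    symbols = {}  # global DIE label -> filename supplying the definition
    for obj in objects:
        for group in obj.groups:
            if group.name in kept:
                continue  # same identifier: discard this copy as a whole
            kept[group.name] = obj.filename
            for label in group.defines:
                symbols[label] = obj.filename
    # Ordinary symbol resolution; no DWARF-specific logic is needed here.
    resolved = {(obj.filename, ref): symbols[ref]
                for obj in objects for ref in obj.references}
    return kept, resolved
```

If A.o and B.o both carry a group with the same identifier, only the copy
from A.o is kept, and the references in B.o bind to it through the normal
external-name lookup.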

What is minimally needed from the object file format
(outside of dwarf2 itself, and normal object/linker facilities
such as simple relocations):

    A means of having multiple .debug_info etc sections from
    a single compilation.

    A means of identifying a section-group (giving it a name).

    A means of identifying which groups of sections go together
    (the elf Section Group, or COMDAT, notion) so that
    a group can be treated as a group (kept or discarded).

    A means of referencing from inside one .debug_info
    compilation-unit to another .debug_info compilation unit
    (DW_FORM_ref_addr provides this).

The remainder of this section uses current UNIX and Elf
terminology for specificity, though
nothing here is inherently Elf or UNIX specific.

3.8.3 terminology

The following terms are not all used, but the sketchy
definitions may help communicate the meaning and use of Section
Groups.

Relocatable-object. A simple object file, to be bound together
with others to make an executable or shared-library. Also
known as a '.o'. It may contain SECTION GROUPs (defined
below). Many UNIX static-linkers have a -r flag which
enables the creation of a new relocatable-object from several
relocatable-objects as input. Static linker implementors (and
linker users) have to realize that -r may impact duplicate
handling and possibly even executable correctness, depending on
exactly what the static linker does with -r.

Shared-library. Also known as Dynamic Shared Object. All
static relocations are done. It may reference other
shared-libraries and use such at run time. Never contains
SECTION GROUPs.

Executable. An application. It may reference shared-libraries
and use such at run time. All static relocations are done.
Never contains SECTION GROUPs.

3.8.4 Example 1, C++
A simple example (a sketch of parts of the relocatable object
a compiler might output to an assembler -- showing
assembler-like output so we can show the labels):

Source file wa.h
struct A {
    int i;
};

Source file wa.C
#include "wa.h"
int
f(A &a)
{
    return a.i + 2;
}

Base CU sections of the relocatable object:

== section .text
    [function f code]

== section .debug_info
    DW_TAG_compile_unit
.L1 (local):
    DW_TAG_reference_type
        DW_AT_type ref to DW.cpp.wa.h.123456.3
    DW_TAG_subprogram
        DW_AT_name "f"
        DW_AT_type ref to DW.cpp.wa.h.123456.2
        DW_TAG_formal_parameter
            DW_AT_name "a"
            DW_AT_type ref to <.L1>
== section .debug_abbrev
== section .debug_aranges
== section .debug_line

SectionGroup sections (COMDAT sections) of the same relocatable
object:

group identifier my.compiler.company.cpp.wa.h.123456 (linker global symbol)

== section .debug_info
DW.cpp.wa.h.123456.1: (linker global symbol)
    DW_TAG_compile_unit
        DW_AT_language DW_LANG_C_plus_plus
DW.cpp.wa.h.123456.2: (linker global symbol)
    DW_TAG_base_type
        DW_AT_name "int"
DW.cpp.wa.h.123456.3: (linker global symbol)
    DW_TAG_structure_type
        DW_AT_name "A"
DW.cpp.wa.h.123456.4: (linker global symbol)
        DW_TAG_member
            DW_AT_name "i"
            DW_AT_type      DW_FORM_ref to DW.cpp.wa.h.123456.2,
                (it is a local ref, so the more compact
                DW_FORM_ref can be used)
== section .debug_abbrev
== section .debug_line

This example is C++-like in that it uses
DW_TAG_compile_unit for the Section Group, implying
that the contents of the compilation unit are
globally visible (following the language rules).

3.8.5 Fortran Example

For a Fortran example, consider the following function:

---- File CommonStuff.fh ----

    IMPLICIT INTEGER(A-Z)
    COMMON /Common1/ C(100)
    PARAMETER(SEVEN = 7)

---- Func.f ----

    FUNCTION FOO (N)
    INCLUDE 'CommonStuff.fh'
    FOO = C(N + SEVEN)
    RETURN
    END


==== Section Group:

Group identifier
    my.f90.company.f90.CommonStuff.fh.654321 (linker global symbol)

== Section .debug_info

DW.myf90.CommonStuff.fh.654321.1: (linker global symbol)
    DW_TAG_subunit
        ! ...standard compilation unit attributes, including...

DW.myf90.CommonStuff.fh.654321.2: (linker global symbol)
3$:     DW_TAG_array_type
            ! unnamed
            DW_AT_type(reference to DW.f90.F90$main.f.2)! base type INTEGER
            DW_TAG_subrange_type
                DW_AT_type(
                    reference to DW.f90.F90$main.f.2) ! base type INTEGER
                DW_AT_lower_bound(constant 1)
                DW_AT_upper_bound(constant 100)

DW.myf90.CommonStuff.fh.654321.3: (linker global symbol)
        DW_TAG_common_block
            DW_AT_name("Common1")
            DW_AT_location(Address of common block Common1)
            DW_TAG_variable
                DW_AT_name("C")
                DW_AT_type(reference to 3$)
                DW_AT_location(address of C)

DW.myf90.CommonStuff.fh.654321.4: (linker global symbol)
        DW_TAG_constant
            DW_AT_name("SEVEN")
            DW_AT_type(
                reference to DW.f90.F90$main.f.2) ! base type INTEGER
            DW_AT_const_value(constant 7)

== section .debug_abbrev

==== Section Group end

==== Sections for primary compilation unit

== section .text

== section .debug_abbrev

== section .debug_info

    DW_TAG_compile_unit
        DW_TAG_subprogram
            DW_AT_name("FOO")
            DW_AT_type(reference to DW.f90.F90$main.f.2)! base type INTEGER

            DW_TAG_common_inclusion ! For Common1
                DW_AT_common_reference(
                    reference to DW.myf90.CommonStuff.fh.654321.3)

            DW_TAG_variable ! For function result
                DW_AT_name("FOO")
                DW_AT_type(
                    reference to DW.f90.F90$main.f.2) ! base type INTEGER

== section .debug_line


A companion main program such as

---- Main.f ----

    INCLUDE 'CommonStuff.fh'
    C(50) = 8
    PRINT *, 'Result = ', FOO(50 - SEVEN)
    END


would have the same representation for the section group
(my.f90.company.f90.CommonStuff.fh.654321) corresponding
to the included file plus the following corresponding to the remainder
of the main subprogram.

== section .debug_info

DW.f90.F90$main.f.1: (linker global symbol)
        DW_AT_name(F90$main)

DW.f90.F90$main.f.2: (linker global symbol)
    DW_TAG_base_type
        DW_AT_name("INTEGER")

DW.f90.F90$main.f.3: (linker global symbol)
        ... (other base types)

            DW_TAG_import_subunit
                DW_AT_import(
                    ref to DW.myf90.CommonStuff.fh.654321.1)
            DW_TAG_common_inclusion ! for Common1
                DW_AT_common_reference(
                    ref to DW.myf90.CommonStuff.fh.654321.3)

== section .debug_abbrev

== section .debug_line

Note that the included part of each compilation is represented using
DW_TAG_subunit because the included declarations are not independently
visible as global entities.

3.8.6 Naming

A precise description of the means of deriving names
usable by the linker to access dwarf entities
is not part of the
dwarf2 specification;
it is a quality-of-implementation issue.

Nonetheless, an outline of a usable approach is given here
to make this more understandable and to guide implementors.

Section Groups (Elf) must have a section group name.
For the above example a name like
    <producer-prefix>.<file-designator>.<gidnumber>
would suffice, where
    <producer-prefix> is some string specific to the producer,
            which has a language-designation embedded in
            the name when appropriate.
            Or the language name could be embedded in the <gidnumber>.
    <file-designator> names the file, such as wa.h in the example.
    <gidnumber> is a string generated to identify
        that specific wa.h header file in such a way that
        a) a 'matching' output from another
           compile generates the same <gidnumber>
        b) a non-matching (say because of #defines) output generates
           a different <gidnumber>.
        <i>It may be useful to think of a <gidnumber> as a
        kind of hash code.</i>

So, for example, the section group for the trivial example wa.h above
is assigned my.compiler.company.cpp.wa.h.123456

The section-group-name is a name assigned to an entire
section group.

Global labels for DIEs (the need for these is explained below) within
a section group could be
such as
    <prefix>.<file-designator>.<gidnumber>.<die-number>
where
        <prefix> distinguishes this as a dwarf debug info name,
        and should identify the producer and when appropriate,
        the language.
        <die-number> could be a number sequentially assigned
            to entities (tokens, perhaps) found during compilation.
        <file-designator>, <gidnumber> are as above.

It is up to the producer to ensure that, whenever the <die-number>s
assigned in separate compilations would not match properly,
a distinct <gidnumber> is generated.
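
As a rough illustration of the naming outline above, the following sketch
derives a <gidnumber> by hashing the header text as seen after macro
expansion, so that a compile with different #defines yields a different
number. The use of SHA-256, the 16-hex-digit truncation and the function
names are assumptions made for this example only; nothing in this proposal
prescribes them.

```python
import hashlib


def gid_number(expanded_header_text: str) -> str:
    """Hypothetical <gidnumber>: a hash of the header as actually compiled."""
    digest = hashlib.sha256(expanded_header_text.encode("utf-8")).hexdigest()
    return str(int(digest[:16], 16))


def group_name(producer_prefix: str, file_designator: str, gid: str) -> str:
    # <producer-prefix>.<file-designator>.<gidnumber>
    return f"{producer_prefix}.{file_designator}.{gid}"


def die_label(prefix: str, file_designator: str, gid: str, die_number: int) -> str:
    # <prefix>.<file-designator>.<gidnumber>.<die-number>
    return f"{prefix}.{file_designator}.{gid}.{die_number}"
```

Because the <die-number>s are assigned sequentially, two compilations only
agree on them when they saw the same declarations in the same order, which
is why the hash has to cover everything that can change that order.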

This means that every point in the section-group
.debug_info that could be referenced from outside
by *any* compilation unit
must normally have an external name
generated for it in the linker symbol table, whether the
current compile references all those points or not.
(The completeness of the set of names generated
is a quality of implementation issue.)

Note that only section-groups that are designated as
duplicate-removal-applies actually require the
external labels for DIEs as all other section group sections
can use 'local' labels (section-relative relocations).
(This is a consequence of separate compilation, not
a rule imposed by this document).
Local labels would be references with DW_FORM_ref4
or DW_FORM_ref8 (these are affected by
relocations so DW_FORM_ref_udata, DW_FORM_ref1
and DW_FORM_ref2 are normally not usable and
DW_FORM_ref_addr is not necessary for a local label).

Implementations should clearly document their naming
conventions.

3.8.7 DW_TAG_subunit and DW_TAG_compile_unit

A Section Group compilation unit using
DW_TAG_compile_unit is like
any other compilation unit, in that its contents
would be evaluated by consumers as if it were an
ordinary compilation unit.

However, consider a #include within a C++ namespace
declaration or within a function definition as examples where
the DIEs in the Section Group should not be used
independently of being referenced from elsewhere.
They are not (necessarily) file-level entities.
This also applies to Fortran INCLUDE lines
when declarations are included
into a procedure or module context.

Consequently a compiler would use DW_TAG_subunit
in place of DW_TAG_compile_unit in a section-group whenever
the section-group contents are not necessarily globally-visible.
This directs consumers to ignore that compilation unit
when scanning top level declarations and definitions.
The DW_TAG_subunit 'compilation unit' will be
referenced from elsewhere and the referencing locations
give the appropriate context in which the DW_TAG_subunit
is to be scanned.

A DW_TAG_subunit may have, as appropriate, any of
the attributes assigned to a DW_TAG_compile_unit.

3.8.8 DW_TAG_import_subunit

A DW_TAG_import_subunit debugging information entry
may have a DW_AT_import attribute referencing
a DW_TAG_compile_unit or DW_TAG_subunit debugging information
entry.

A DW_TAG_import_subunit debugging information entry
refers to a DW_TAG_compile_unit or DW_TAG_subunit debugging information
entry to specify that the DW_TAG_compile_unit or DW_TAG_subunit
logically appears at the point of the DW_TAG_import_subunit.
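
One way a consumer might act on these two tags is sketched below with a toy
in-memory DIE tree. The Die class, its import_target field (standing in for
DW_AT_import) and both functions are hypothetical; the sketch only shows the
intended scoping effect, not a real DWARF reader.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Die:
    tag: str
    name: str = ""
    children: list = field(default_factory=list)
    import_target: Optional["Die"] = None  # stands in for DW_AT_import


def names_in_scope(die, out=None):
    """Names declared directly in `die`, with imported units expanded in place."""
    if out is None:
        out = []
    for child in die.children:
        if child.tag == "DW_TAG_import_subunit" and child.import_target is not None:
            names_in_scope(child.import_target, out)
        elif child.name:
            out.append(child.name)
    return out


def top_level_names(units):
    """Scan file-level declarations, ignoring DW_TAG_subunit units."""
    names = []
    for unit in units:
        if unit.tag == "DW_TAG_subunit":
            continue  # only meaningful at the point where it is imported
        names_in_scope(unit, names)
    return names
```

With the Fortran example, SEVEN is then visible inside FOO (through the
import) but is not reported as a file-level name.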

3.8.9 DW_FORM_ref_addr

Use DW_FORM_ref_addr to refer from the debugging information
entries of one compilation unit to those of another.

When referencing into a removable-section-group .debug_info
from another .debug_info (from anywhere), the global DIE
name (see 3.8.6) should be used as an external symbol and a relocation
generated based on that.

When referencing into a non-section-group .debug_info,
from another .debug_info
(from anywhere) DW_FORM_ref_addr is still the form to be used, but
a section-relative relocation generated by use of
a non-exported name (often called an 'internal
name') may be used.

3.8.10 #include compression

C++ has a much greater problem than C with the number and size
of the headers included and the amount of data in each, but
even with C there is substantial header file information duplication.

A reasonable approach is to put each header file in its
own section group, using the naming rules mentioned above.
The section groups would be marked to ensure duplicate removal.
All data instances and code instances (even if they came from
the header files above) would be put into
non-section-group sections such as the base object file .debug_info
section.

There are projects where there is no predefined order in which headers
are #included, and where odd interactions mean that the source of the
definition of some subtype differs depending on the order of
inclusion. Whether by intention or by error, this does happen. In
such a case the 'signature' of the header had better be very
precise, or users will be quite annoyed when the debugger
works in a way that does not reflect the real source.

3.8.11 eliminating function duplication

Function templates (C++) result in code for the same template
instantiation being compiled into multiple archives or
relocatable-objects. The linker wants to keep only one of a
given entity. The debug information and everything else emitted for
the duplicate copies of the function should be discarded, keeping just one copy.

For each such code group (function template in this example)
the compiler assigns a name for the group which will match all
other instantiations of this function but match nothing else.
(And the elf section group 'remove duplicates' flag would be
set.)

The second and subsequent definitions seen by the static linker
are simply discarded.

References to other .debug_info sections (for DIEs) follow the
approach suggested above, but the naming rule might be slightly
different as <file-designator> should be interpreted as

3.8.12 single-function-per-dwarf-compilation-unit

This is related to the section group above (as implementations
may want to produce a single relocatable-object with multiple
section groups, one per function).
One purpose of such is to allow a linker to easily
reorganize the order of functions in the executable
(perhaps to improve cache performance).
Another is to make it easy for a linker to completely
remove unused functions.
These would not be marked as 'remove duplicates', since
the functions are not duplicates of anything.

Each function is given a compilation unit (and a section group).

Each compilation unit is complete, with text, data, and dwarf
sections.

And there is a compilation unit that has the file-level
declarations and definitions. Other per-function
compilation-unit dwarf information (.debug_info) points to this
file-level compilation unit's .debug_info entries.

ELF Note:
    The section groups could have the section group flag
    set to zero (see the Elf section group definition near the end
    of this document) so there is no need for a unique
    section group name.

Here the section groups can use DW_FORM_ref_addr
and internal labels (section-relative relocations) to refer to
the main object file sections, as the section groups here are
either deleted as unused or kept. There is no possibility
(aside from error) of a group from some other compilation being
used in place of one of these groups.
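
The "deleted as unused or kept" decision can be pictured as a reachability
walk over per-function section groups. The sketch below is an assumption
about how a linker might do this, with a hypothetical live_groups() function
and a plain dictionary as the reference graph; since each group carries its
own text, data and dwarf, dropping a group drops its dwarf too.

```python
def live_groups(references, roots):
    """Return the set of section groups reachable from the root groups.

    references: dict mapping a group name to the group names it refers to.
    roots: group names that must be kept (for example, the entry point).
    """
    live = set()
    stack = list(roots)
    while stack:
        group = stack.pop()
        if group in live:
            continue
        live.add(group)
        stack.extend(references.get(group, ()))
    return live
```

Every group not in the returned set is unreferenced and can be discarded as
a whole.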

3.8.13 Inlining and out-of-line-instances

Abstract instances and concrete-out-of-line instances
may be put in distinct compilation units using Section Groups.
This makes possible some useful duplicate dwarf-elimination.

No special provision for eliminating class duplication
resulting from template instantiation is made here,
though nothing prevents eliminating such duplicates
using Section Groups.

3.8.14 gcc example
[perhaps this should be in appendix, and just referenced.]
gcc-specific rules, mentioned to try to make this clearer

This is with respect to what will likely be
in gcc version 3.0
and is not turned on by default as of February 2001.
is an example of a section-group name.

At this time gcc does not implement Elf Section Groups, but
instead uses a section name like
instead with the linker applying special rules.
gcc will probably transition to the Elf Section Group rules.

is a sample name of a DIE in the wa.h section group,
92485121 being the <gidnumber> and 4 being the die id.

Data is never in these header section groups, but is
always in the object file base sections. Only
types are put in the section groups.
Abstract inline DIEs are planned to be put in
though this is not done as of February 2001.

3.8.15 ELF Section Group
[ This should probably be in an appendix or left out entirely.
Ron Brender suggests leaving it out, and he's probably right.
This is part of the generic Elf ABI.
The generic ELF abi is on line at http://www.sco.com/developer/devspecs/
for download but as of Feb 23 this did not have the comdat

The following attempts to be an accurate rendering of section
group but should be taken as general information only.
The generic Elf specification is the only true definition.

A section of type SHT_GROUP defines a grouping of sections.

In the section header of an SHT_GROUP section, the sh_link field
gives the section number of a symbol table
section. The sh_info field gives the symbol index of the
identifying entry of this section group. The symbol indexed is
the "identifying symbol" of the section group.

The content of an SHT_GROUP section is
a) A single flag word. If set to 1, it
   means that duplicates are to be discarded.
   If not set, some other criterion might be applied
   by the linker to discard the section group (such as
   removing unreferenced functions).
b) A list of section numbers.
   The listed section numbers are the sections in this section group.
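
As a small illustration of that layout, the following sketch decodes the
contents of an SHT_GROUP section, treating it as a sequence of 32-bit words
as in the generic ELF ABI. The function name and the little_endian parameter
are hypothetical, and the caller is assumed to have already extracted the
raw section bytes.

```python
import struct

GRP_COMDAT = 0x1  # "discard duplicates" flag bit in the generic ELF ABI


def parse_group_section(data: bytes, little_endian: bool = True):
    """Split SHT_GROUP contents into (discard_duplicates, section_numbers)."""
    if len(data) < 4 or len(data) % 4 != 0:
        raise ValueError("SHT_GROUP contents must be a non-empty array of 32-bit words")
    count = len(data) // 4
    fmt = ("<" if little_endian else ">") + f"{count}I"
    words = struct.unpack(fmt, data)
    flags, members = words[0], list(words[1:])
    return bool(flags & GRP_COMDAT), members
```

A header section group would typically decode with the flag set, while a
per-function group that is only a candidate for dead-code removal would
decode with it clear.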

Only relocatable-objects have identifying symbols or section groups.

Given two section groups with the same identifying symbol the
linker will simply discard and ignore the second group and all
its sections.

References to the sections comprising a group from sections
outside the group must be made via symbol table entries with
STB_GLOBAL or STB_WEAK binding and section SHN_UNDEF. If there
is a definition of the same symbol in the relocatable-object
containing the references, it must have a separate symbol table
entry from the references. Sections outside the group may not
reference symbols with STB_LOCAL binding for addresses
contained in the group's sections, including symbols with type
STT_SECTION.

No non-symbol references may exist from outside a section group
to the inside of the group.

A symbol table entry that is defined relative to one of the
group's sections and that is contained in a symbol table
section that is not part of the group, must be removed if the
group members are discarded.

===================================================<5.6.1 wording>

[From Jason. His comments changed to an italics entry here.
All 6 original paragraphs are mentioned here and 2 new ones
added. No paragraphs deleted. ]

    5.6.1 General Structure Description

        [ first, second paragraphs kept as is.]

        [third paragraph, kept as is:]
        An incomplete structure, union or class type
        is represented by a structure, union or class entry that
        does not have a byte size attribute and that has a
        DW_AT_declaration attribute.

[ new material inserted after the above paragraph]
+ If a structure, union or class entry represents the defining
+ declaration of a structure, class or union member of another
+ structure, class or union, the entry has a DW_AT_specification
+ attribute, whose value is a reference to the debugging information
+ entry representing the incomplete declaration described above.
+ Structure, union or class entries containing the
+ DW_AT_specification attribute do not need to duplicate information
+ provided by the declaration entry referenced by the specification
+ attribute. In particular, such entries do not need to contain an
+ attribute for the name of the structure, union or class whose
+ definition they represent.

        The members ... [paragraph kept unchanged]

        <i> For C and C++, data member ... [ paragraph kept
        unchanged] </i>

[Following paragraph shown before and after a tiny change, marked with !
This is the last paragraph in 5.6.1 draft 5]
        If the definition for a given member of the structure, union
        or class does not appear within the body of the declaration,
        that member also has a debugging information entry
        describing its definition. That entry will have a
        DW_AT_specification attribute referencing the debugging
        entry owned by the body of the structure, union or class
        debugging entry and representing a non-defining declaration
!       of the data or function member.
        The referenced entry will
        not have
        information about the location of that member (low and
        high pc attributes for function members, location descriptions
        for data members) and will have a DW_AT_declaration attribute.

        If the definition for a given member of the structure, union
        or class does not appear within the body of the declaration,
        that member also has a debugging information entry
        describing its definition. That entry will have a
        DW_AT_specification attribute referencing the debugging
        entry owned by the body of the structure, union or class
        debugging entry and representing a non-defining declaration
!       of the data, function or type member.
        The referenced entry will
        not have
        information about the location of that member (low and
        high pc attributes for function members, location descriptions
        for data members) and will have a DW_AT_declaration attribute.

[Following inserted after above paragraph, making the following
the final paragraph of 5.6.1. ]

+This is useful for nested classes which are defined outside of their
+containing class definition, as in:
+struct A {
+ struct B;
+};
+struct A::B { };
+The two different structs could be
+put into different CUs so that dwarf compression (section 3.8)
+can be used to eliminate duplicates.

===================================================<Appendix A>

Add DW_TAG_subunit, with same attributes as
DW_TAG_compile_unit (or simply indicate both have the same list)

Add DW_TAG_import_subunit with DW_AT_import attribute (only the one,
I guess, I don't see a problem with DECL being there too)



Jason Merrill outlined this design in
a posting to the dwarf2 mailing list 19 Jan 2001 and
helped a lot in refining this document.

Ron Brender made crucial contributions to the design
and the document.

Jim Dehnert helped with an earlier version of this document.

Accepted with the following suggestions:  editorial changes to present description of
section groups earlier and with perhaps less reference to ELF.  Also, possibly change
the name DW_TAG_subunit to another, more descriptive name.