General Layout

Idea for a 64-bit extension of IFF (sort of).

Toplevel tag, 'LIFF' (although 'IFFX' is also possible, needs thought).

LIFF will use little-endian byte ordering for structural elements (like RIFF, and unlike IFF 85).

'LIFX' is defined as a variant in which structural elements are big-endian.
I will spec that an implementation should be able to handle both interchangeably.

Valid FOURCC tags will normally be restricted to only have characters in the range of 0x20(' ') to 0x7E('~').

Other tags may be used in a context-specific manner (eg: as indices into a strings table or whatever). This divides the space roughly into 3 parts (plus a small reserved range):
0x00000000-0x1FFFFFFF: lower range, interpreted as context-specific integers;
0x20000000-0x7EFFFFFF: middle/fourcc range, reserved for fourcc values;
0x7F000000-0x7FFFFFFF: reserved;
0x80000000-0xFFFFFFFF: upper range, again, interpreted as context-specific integers.

A reader is allowed to reject files with tags outside the valid fourcc range in cases where they are not expected, or to reject files with invalid tags within the fourcc range.

Integer tags should not be used as literal constant tags.
For now I will specify that integer values in fourcc codes will default to little-endian byte ordering (presumably the fourcc values, unlike the other values, are always read in a certain manner, and thus supporting variable endianness is likely to be more painful than using a constant one).
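The ranges and the little-endian default above can be sketched roughly as follows. The function name and the returned labels are my own for illustration and are not part of the spec.

```python
def classify_tag(raw: bytes) -> str:
    """Classify a 4-byte tag by the ranges listed above."""
    if len(raw) != 4:
        raise ValueError("a tag is exactly 4 bytes")
    value = int.from_bytes(raw, "little")  # tags default to little-endian
    if value <= 0x1FFFFFFF:
        return "lower-int"
    if value <= 0x7EFFFFFF:
        # Inside the fourcc range, every byte must be 0x20..0x7E.
        if all(0x20 <= b <= 0x7E for b in raw):
            return "fourcc"
        return "invalid"
    if value <= 0x7FFFFFFF:
        return "reserved"
    return "upper-int"
```

A reader could then reject "invalid" and "reserved" tags outright, and reject the integer ranges wherever they are not expected.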

UUIDs may also be used for identifying chunks.
In the case of UUIDs, it may make sense to give each type's UUID in the spec.
Whether to use UUIDs or FOURCCs as the basic convention is left to the format designer, as is whether to give UUIDs directly (UUID) or via aliasing (UIDS).
I may need to come up with how the implementation should handle this.

Strings may also be a sane approach for tag identification. However, strings may conceivably be used for "tag names" rather than "tag types", or for extension types with FOURCCs as the base types, or whatever.

Normally (the case of chunks <2GB), chunk headers resemble generic IFF ones, with the tag followed by a 32-bit length:

DWORD len;

In the case where the chunk size exceeds 2GB, it is coded as:

DWORD pad; //a special value 0xFFFFFFFF
QWORD len; //a 64 bit length

In which case, len prefixes the chunk data.

Like IFF and RIFF, chunks are to be padded to even byte boundaries (ie: a pad byte follows the data if the length is odd).
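As a rough sketch of reading either header form: the function name, the returned tuple, and the little_endian switch (LIFF vs LIFX ordering) are my own. It assumes the tag comes before the length, as in IFF, and that the pad byte is not counted in len.

```python
import struct

ESCAPE = 0xFFFFFFFF  # the special 'pad' value marking a 64-bit length

def read_chunk_header(buf: bytes, pos: int, little_endian: bool = True):
    """Return (tag, length, data_start, next_chunk) for the chunk at pos."""
    order = "<" if little_endian else ">"
    tag = buf[pos:pos + 4]  # kept as raw bytes
    (length,) = struct.unpack_from(order + "I", buf, pos + 4)
    data_start = pos + 8
    if length == ESCAPE:
        # Long form: the real length is the QWORD after the pad value.
        (length,) = struct.unpack_from(order + "Q", buf, pos + 8)
        data_start = pos + 16
    # Assumption: the pad byte is not counted in len (as in IFF/RIFF).
    next_chunk = data_start + length + (length & 1)
    return tag, length, data_start, next_chunk
```

For example, a 3-byte chunk at offset 0 has its data at offset 8 and the next chunk at offset 12.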

Specific Layout

Primitive Chunks

Junk will contain a glob of junk data.
Junk should be zeroed if possible.

UUID (possible)
UUID Identified chunk of data.
Possible for "unique" chunk types.

UUID uuid;        //uuid value
//optionally followed by any content

Where UUID is a 128-bit (16 byte) unique number generated with the UUID algorithm.

UIDS (possible)
Array of registered UUIDs.

UUID uuid;      //uuid value
FOURCC fcc;     //fourcc value to be registered as
FOURCC name;    //name for this UUID
FOURCC desc;    //description
DWORD flag;     //flags for this UUID
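A minimal sketch of splitting a UIDS payload into entries. It assumes entries are packed back to back with no padding (32 bytes each), that the UUID is stored as 16 raw bytes in the usual RFC 4122 order, and that flag is little-endian. None of this is settled above, and the names are hypothetical.

```python
import struct
import uuid

# uuid, fcc, name, desc, flag: 16 + 4 + 4 + 4 + 4 = 32 bytes (assumed packing)
ENTRY = struct.Struct("<16s4s4s4sI")

def parse_uids(data: bytes):
    """Split a UIDS payload into a list of (uuid, fcc, name, desc, flag)."""
    if len(data) % ENTRY.size:
        raise ValueError("UIDS payload is not a whole number of entries")
    entries = []
    for raw_uuid, fcc, name, desc, flag in ENTRY.iter_unpack(data):
        # fcc, name and desc stay as raw 4-byte tags; they may be string tags.
        entries.append((uuid.UUID(bytes=raw_uuid), fcc, name, desc, flag))
    return entries
```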

A UUID does not actually change the type of a fourcc value, but instead "aliases with" a fourcc value.
A UUID can be used to verify the uniqueness of a tag, or identify tags in some cases.

If name and desc are string tags, then they become effectively bound to the registered fourcc, and that fourcc references a particular name and description.

If name is a valid fourcc value, then it may make sense to have fcc as the same value.

This chunk is only allowed within the toplevel chunk.

Compound Chunks

Like IFF and RIFF, the first 4 bytes of a typical compound chunk are to be used for a Type ID (FOURCC).

DWORD len;
FOURCC tyid;

Or, in the 64-bit case (following the 0xFFFFFFFF pad):

QWORD len;
FOURCC tyid;

The toplevel chunks ('LIFF', or 'LIFX' for the big-endian variant) are the main file-level chunks.
May contain STRS and UIDS chunks (UIDS is disallowed elsewhere in the file).
The type id is interpreted relative to the contents (string tables or UUIDs).

LIST is the basic compound grouping.
Contents are a collection of chunks representing various things, with a context given via the type id.
The type id is interpreted relative to the containing chunk.

SLST (optional)
Defined as a variant of LIST which contains an 'STRS' chunk, and implies all lower-range integer members to be indices into this chunk.
There are no limits placed on the location of the STRS chunk, however, only one is allowed within the group.

CAT and PROP (mentioned but should not be used)
'CAT ' and 'PROP' are mentioned here, however, these should not be used in general (I will not define them as basic/required parts of the spec).
If used, they should be used according to IFF 85.


'    ' vs 'JUNK'.
IMO, both should be recognized as junk, though I have a preference for 'JUNK' (as it is more recognizable).

When writing a chunk, it is possible that the size could exceed 2GB (or 4GB), and it is not possible to write the chunk length in some cases.
A thought here is a "reasonableness" assertion, eg, if this chunk can reasonably exceed the size limit, then it makes sense to start off coding it with a 64 bit length.
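The "reasonableness" idea could look something like this on the writer side. The may_be_large flag is my own name for the assertion, and the 2GB cutoff for the short form is assumed from the text above.

```python
import struct

LIMIT_32 = 0x7FFFFFFF  # assumed cutoff: larger lengths need the 64-bit form

def chunk_header(tag: bytes, length: int, may_be_large: bool = False) -> bytes:
    """Build a little-endian chunk header, long form if needed or requested."""
    if len(tag) != 4:
        raise ValueError("a tag is exactly 4 bytes")
    if may_be_large or length > LIMIT_32:
        # Long form: 0xFFFFFFFF pad value, then the 64-bit length.
        return tag + struct.pack("<IQ", 0xFFFFFFFF, length)
    return tag + struct.pack("<I", length)
```

A writer that does not know the final size up front could emit the long form with a placeholder length and patch the 8 length bytes once the chunk is finished, since the header size no longer depends on the final length.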

I will state that multiple toplevel chunks are allowed.
There should not be any non-chunk data following after the end of a chunk.

As a restriction, subsequent toplevel chunks should only be used if their existence is implied in a previous chunk (or by context). This is to fight off possible junk data, or additional chunks unrelated to the previous ones in some cases.

Need a lot more thought about implementation details and "semantics".

Possible Conventions

I will comment on a few conventions (not required, but may sensibly be adhered to).
A lot comes down, however, to the particular format involved.

All uppercase tags should not be used for context-specific uses, and should instead only be used for context-independent/"well known" tags.
All lowercase tags should be used for general data members.
Tags starting with non-alphabetic characters should be data members and similar.

Another possibility is a convention based on the casing of each character (eg: similar to that used for PNG). Taking the 4 characters of a tag as ABCD:
A (uppercase) must be understood for "correct" processing;
B (uppercase) well known, applies to general tags and not context-specific ones;
C unused;
D (uppercase) unsafe to copy if not understood.

This, however, will not apply to non-alphabetic characters.
I do recommend checking the characters explicitly, as:
Tags contain only ASCII;
This can rule out non-alphabetic characters (which default to "lower-case").
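A small sketch of the character checking suggested above, assuming the letters A to D stand for the four character positions of the tag. The function and key names are made up for illustration.

```python
def casing_flags(tag: str) -> dict:
    """Read the per-character casing convention from a 4-character tag."""
    if len(tag) != 4 or not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("expected 4 printable ASCII characters")

    def upper(c: str) -> bool:
        # Only 'A'-'Z' count as uppercase; everything else is "lower-case".
        return "A" <= c <= "Z"

    return {
        "must_understand": upper(tag[0]),
        "well_known": upper(tag[1]),
        # tag[2] is unused by the convention
        "unsafe_to_copy": upper(tag[3]),
    }
```

Checking characters this way means a tag like "a1cd" simply has no flags set, since the digit counts as "lower-case".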

It may make sense to enforce this convention within a particular domain.

The lower range is used for offsets into a strings table.
The strings table is defined as an 'STRS' chunk within the same form or list, consisting of a number of null-terminated strings.
Index 0 of the STRS chunk should always be an empty string, to be viewed as a "NULL" string (sensibly, 1 could potentially also be an empty string, but would be viewed as "empty" rather than "null", if this distinction matters).

LIFF test {
    STRS {
        "", "", "hello", "world"
    }
    2 {"a string for hello"}
    8 {"a string for world"}
}
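In the example above, 2 and 8 are byte offsets into the STRS data. A lookup might be sketched like this. The function name is hypothetical, and ASCII strings are an assumption, since the encoding is not fixed here.

```python
def lookup_string(strs: bytes, tag: int):
    """Return the string at byte offset `tag`, or None if outside the table."""
    if not 0 <= tag < len(strs):
        return None  # not a string offset: treat the tag as a plain integer
    end = strs.find(b"\x00", tag)
    if end < 0:
        end = len(strs)  # tolerate a missing final terminator
    return strs[tag:end].decode("ascii")  # assumption: ASCII strings

# Payload of the STRS chunk from the example: "", "", "hello", "world"
STRS = b"\x00\x00hello\x00world\x00"
```

Note that this does not check that an offset lands on the start of a string. A stricter reader might want to.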

Null should not be used as a tag name (unless this is what is desired).
An empty string should be used instead if an empty string is desired.

It is up to the format whether or not to allow string chunks to be inherited.

Values in the lower range falling outside the strings table are still treated as integers. This can allow allocating tags starting at the upper end of the lower range and working downward.

LIFF test {
    STRS {
        "", "", "foo", "bar"
    }
    LIST 2 {
        6 {"baz"}
    }
}

The upper range is thus fully context-specific integers, and could be further subdivided if needed, eg:
0x80000000-0xBFFFFFFF, positive integers;
0xC0000000-0xFFFFFFFF, negative integers.
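One way this subdivision could be decoded is to treat the low 31 bits as a two's complement value. This mapping is purely my assumption, as the text only names the two sub-ranges, and decode_upper is a made-up name.

```python
def decode_upper(tag: int) -> int:
    """Map an upper-range tag to a signed integer (assumed 31-bit encoding)."""
    if not 0x80000000 <= tag <= 0xFFFFFFFF:
        raise ValueError("not an upper-range tag")
    value = tag & 0x7FFFFFFF   # drop the bit that marks the upper range
    if value >= 0x40000000:    # 0xC0000000 and up: the negative half
        value -= 0x80000000
    return value
```

Under this reading 0x80000005 is 5 and 0xFFFFFFFF is -1.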