T3 Portable Binary Encoding

All T3 binary files are encoded in a portable format that allows a binary file created on one type of computer to be used without any changes with T3 implementations on other types of computers. To achieve this binary portability, T3 binary files use a portable encoding that represents datatypes in a standard format. Each T3 implementation translates between the standard format and the local representation of the datatype when reading and writing files.

Datatypes

The following are the portable datatypes and their encoding:

UTF-8 Text

Unicode text is encoded in UTF-8. This encoding represents each 16-bit Unicode character with one, two, or three bytes:

FromToBinary Encoding
0x00000x007f0bbbbbbb
0x00800x07ff110bbbbb 10bbbbbb
0x08000xffff1110bbbb 10bbbbbb 10bbbbbb

The bits of the 16-bit value are encoded into the b's in the table above with the most significant group of bits in the first byte. So: 0x571 has the 16-bit Unicode binary pattern 0000011001110001, and is encoded in UTF-8 as 11011001 10110001.

Note that UTF-8 encodes the most significant bits in the first bytes; this is different from the byte ordering used for other types (such as integers), which are all stored in little-endian format (least-significant byte first). The reason for this disparity is that this ordering makes it possible to compare two UTF-8 strings in a byte-wise fashion. (This type of magnitude comparison is not always especially useful, since it does not produce correct results for a localized sorting order, but it at least produces a uniform sorting order based on the Unicode code points stored in the string and hence may be useful in certain cases for building internal indices and tables.)

Integer values

Integer values are stored in little-endian format (i.e., least-significant byte first) in fixed-size byte arrays. These values are not aligned on any particular boundary in the file. These values can be interpreted as signed or unsigned. Signed values are encoded in 2's-complement notation.

16-bit integers are stored as 2-byte arrays. The first byte has the low-order 8 bits, the second byte has the high-order 8 bits.

32-bit integers are stored as 4-byte arrays. The first byte has the low-order 8 bits, the second byte has the next more significant 8 bits, the third byte has the next more significant 8 bits, and the fourth byte has the most significant 8 bits.

Data Holders

The T3 VM uses run-time typing, which allows certain types of variables to hold any type of value; this type of value is tagged with its type, so that the VM can interpret the value correctly whenever it is used.

In order to store these "variant" types, the VM defines a composite type called a data holder. This composite contains the type information along with the value.

To store a data holder portably, we store a 5-byte array. The first byte contains the type ID value. The remaining 4 bytes encode the value using the standard primitive type encodings; the table below shows the correspondence between the primitive types and their encodings.

When an encoding does not take up the full 4 bytes, the value is packed into the earlier bytes, and the later bytes have arbitrary values. For example, a property ID is encoded in a data holder as follows:

Byte IndexValue
06 (the type code for VM_PROP)
1low-order 8 bits of property ID value
2high-order 8 bits of property ID value
3arbitrary
4arbitrary

Type ID's

The table below shows the assigned ID values for the primitive types. (The types shown in italics are reserved for internal use by implementations and will never appear in portable files; we list them for the sake of completeness, but they'll never be stored persistently and thus are not relevant to the portable file format.)

Type NameType IDDescriptionValue Encoding
VM_NIL1nil (boolean "false" or null pointer)none
VM_TRUE2boolean "true"none
VM_STACK3Reserved for implementation use for storing native machine pointers to stack frames (see note below) none
VM_CODEPTR4Reserved for implementation use for storing native machine pointers to code (see note below) none
VM_OBJ5object reference as a 32-bit unsigned object ID number UINT4
VM_PROP6property ID as a 16-bit unsigned numberUINT2
VM_INT7integer as a 32-bit signed numberINT4
VM_SSTRING8single-quoted string; 32-bit unsigned constant pool offsetUINT4
VM_DSTRING9double-quoted string; 32-bit unsigned constant pool offsetUINT4
VM_LIST10list constant; 32-bit unsigned constant pool offset UINT4
VM_CODEOFS11code offset; 32-bit unsigned code pool offset UINT4
VM_FUNCPTR12function pointer; 32-bit unsigned code pool offset UINT4
VM_EMPTY13no value (this is useful in some cases to represent an explicitly unused data slot, such as a slot that has never been initialized) none
VM_NATIVE_CODE14Reserved for implementation use for storing native machine pointers to native code (see note below) none
VM_ENUM15enumerated constant; 32-bit integer UINT4
VM_BIFPTR16built-in function pointer; 32-bit integer, encoding the function set dependency table index in the high-order 16 bits, and the function's index within its set in the low-order 16 bits. UINT4
VM_OBJX17Reserved for implementation use for an executable object, as a 32-bit object ID number (see note below) UINT4

Note that types 3 (VM_STACK), 4 (VM_CODEPTR), 14 (VM_NATIVE_CODE), and 17 (VM_OBJX) are reserved for implementation use. These will never appear in a portable binary file; we list them here only for completeness. These types are intended to allow implementations to store native datatypes (such as native machine pointers) for which there is no meaningful portable representation. Implementations are free to use these types for any purposes of their own; the names and descriptions in the table are for mnemonic value only and shouldn't be taken to imply a required use for these types.

Type Names

The file format specifications use the following names to refer to the portable datatypes:

NameDescription
SBYTESigned 8-bit byte
UBYTEUnsigned 8-bit byte
UTF8Unicode text encoded as UTF-8
INT2Signed 16-bit (2-byte) integer
UINT2Unsigned 16-bit (2-byte) integer
INT4Signed 32-bit (4-byte) integer
UINT4Unsigned 32-bit (4-byte) integer
DATA_HOLDERData holder for any primitive type

Copyright © 2001, 2006 by Michael J. Roberts.
Revision: September, 2006