Byte Packing

Byte packing is a new feature in TADS 3.1 designed to simplify binary file manipulation. The byte packer lets you read and write complex data structures in binary files with very compact code. You don't have to manually twiddle bits or shuffle bytes; the byte packer provides automatic translations between TADS datatypes and many common byte formats. It's not tied to any set of pre-defined file formats, either - it's really a "mini language" that makes it possible to read and write almost any binary file format you're likely to come across.

The TADS byte packer is based on the similar facilities in Perl and PHP. If you know one of those languages, you'll find all of this pretty familiar - but be aware that there are some differences, so you should at least read the reference section below.

If you're not already familiar with the basic ideas from Perl or PHP, skip down past the reference tables, and you'll find a full tutorial introduction.

Quick reference

Methods

File
    packBytes(format, ...)
    unpackBytes(format)
ByteArray
    packBytes(index, format, ...)
    ByteArray.packBytes(format, ...)
    unpackBytes(index, format)
String
    String.packBytes(format, ...)
    unpackBytes(format)

Type codes

a Latin-1 character string, padded with null bytes. When unpacking, null bytes are removed from the end of the string.
A Latin-1 character string, padded with spaces. When unpacking, trailing spaces are removed from the end of the string.
b Byte string. The source value for packing can be a string or ByteArray; if a string, each character is packed as a byte, so all characters must be in the Unicode range 0 to 255. Unpacks as a ByteArray.
c 8-bit signed integer ("char", in C terminology), -128 to 127; unpacks to integer. Packs to one byte.
C 8-bit unsigned integer, 0 to 255; unpacks to integer.
d Double-precision floating point number ("double" in C). Packed in standard IEEE 754-2008 64-bit binary interchange format, but in little-endian byte order. Use d> to pack in the IEEE standard big-endian order. Packs to 8 bytes; unpacks as BigNumber. This type can store about 16 decimal digits of precision, and can represent absolute values up to 1.7976931348623158e+308.
f Single-precision floating point number ("float" in C). Packed in standard IEEE 754-2008 32-bit binary interchange format, but in little-endian byte order. Use f> to pack in the IEEE standard big-endian order. Packs to 4 bytes; unpacks as BigNumber. This type can store about 7 decimal digits of precision, and can represent absolute values up to 3.402823466e+38.
h Packs from a string containing hexadecimal digits to a byte string, packing two hex digits into each byte. The digits are packed low nibble first: e.g., the string '14' is packed into a byte value 0x41, which is the ASCII character 'A'. A repeat count gives the length in hex digits for the unpacked string, but the "!" suffix changes this to the packed byte length.
H Same as 'h', but packs the high nibble of each digit pair first: e.g., the string '41' is packed into byte value 0x41.
k Compressed unsigned integer value. Packs as a series of base-128 (7-bit) "digits", one digit per byte, most significant byte first; the 8th bit (0x80) is set on each byte except the last, to indicate where the value ends. Unpacks as a regular integer if the value fits in a 32-bit signed value, otherwise unpacks as BigNumber. This format can in principle store values of unlimited size, but there's an implementation limit of about 10^500. Only unsigned (non-negative) values can be stored with this type.
l 32-bit signed integer ("long" in C terms), -2,147,483,648 to 2,147,483,647; unpacks as an integer. Packs to four bytes.
L 32-bit unsigned integer, 0 to 4,294,967,295. Unpacks as BigNumber, since a regular TADS integer can only store positive values up to 2,147,483,647.
q 64-bit signed integer ("quad word"), -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807; unpacks as BigNumber. Packs to eight bytes.
Q 64-bit unsigned integer, 0 to 18,446,744,073,709,551,615.
s 16-bit signed integer ("short" in C terms), -32,768 to 32,767; unpacks as an integer. Packs to two bytes.
S 16-bit unsigned integer, 0 to 65,535; unpacks as integer.
u UTF-8 character string, padded with null bytes. The length is in bytes in UTF-8 notation.
U UTF-8 character string, padded with spaces.
w "Wide character" Unicode string, in UCS-2 format, 16 bits per character; padded with null characters. The length is in characters, but the "!" suffix changes to the packed byte length (two bytes per character). The default byte order is little-endian; use w> for big-endian.
W Wide characters, UCS-2 format, padded with space characters.
x Packs a null byte; on unpack, skips forward by one byte in the file. Doesn't consume an argument value when packing, or produce a result value when unpacking. If there's a count suffix, packs or skips that many bytes. With "!", packs null bytes or skips bytes to the given alignment boundary, relative to the start of the current format string.
X Skips backwards one byte in the file; doesn't consume an argument value or produce a result value. With a count suffix, skips backwards by that many bytes. With "!", skips back to the nearest previous alignment boundary.
@ Moves the file pointer to a byte offset from the start of the current group (in parentheses or square brackets), or from the start of the format string if not in a group. The offset is given as the suffix: @15 moves to byte offset 15. When packing, if the position is moved forward, the bytes from the current position to the new position are filled with null bytes. Each iteration of a repeated group resets the zero point for the group. If the ! qualifier is used (e.g., @15!), the position is relative to the start of the entire format string rather than the current group.

When unpacking, @? returns the current byte offset in the file relative to the current group, and @!? returns the offset relative to the whole format string. These codes do nothing when packing.

"text" Packs the literal bytes text. Each character in text is treated as a byte value, so text can only contain characters in the Unicode range 0-255. To use a quote mark (") within text, write two in a row ("").

On unpacking, this simply skips the number of bytes implied by text. This doesn't return any value in the result list, and it doesn't check that the bytes in the file actually match the literal bytes given.

{hex-digits} Packs the literal bytes encoded by the series of hex digit pairs. For example, '{4142434445}' packs the bytes "ABCDE", since 0x41 is the ASCII code for "A", 0x42 for "B", etc.

On unpacking, this simply skips the number of bytes implied by the digit string. This doesn't return any value in the result list, and it doesn't check that the bytes in the file actually match the literal bytes given.
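As a rough illustration of the two literal forms, the sketch below packs a small header into a string. The "DEMO" signature and the header layout are made up for this example and don't correspond to any real file format; the program assumes a plain build against the system library.

```tads3
#include <tads.h>

main(args)
{
    /*
     *   Pack a hypothetical 8-byte header: the literal signature "DEMO",
     *   a big-endian unsigned 16-bit version number, and a literal CR-LF
     *   pair written as hex digits.  Only the version number consumes an
     *   argument; the literals come straight from the format string.
     */
    local fmt = '"DEMO" S> {0D0A}';
    local hdr = String.packBytes(fmt, 2);

    /*
     *   Unpack with the same format.  The literals are skipped without
     *   being checked, so the version number is the only result value.
     */
    local vals = hdr.unpackBytes(fmt);
    local version = vals[1];
    "version = <<version>>\n";
}
```

Because the unpacker doesn't compare literal bytes against the file, a reader that needs to validate a signature would have to unpack it as data (for example with a4) and compare it itself.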

Qualifiers

A type code can be followed by one or more qualifiers. The order of the qualifiers doesn't matter.
number For integer and floating point types (cCsSlLqQdf), specifies the repeat count: the given number of values are written. E.g., s5 packs five 16-bit integer values.

For string types (aAbuUwWhH), the number qualifier specifies the length of the unpacked string. For aAbuU, the length is in bytes; for wW, it's in characters; for hH, it's in hex digits in the unpacked string. The "!" qualifier changes the count to reflect the packed byte length for wWhH.

For padding (xX), specifies the number of bytes to skip.

For positioning (@), specifies the offset from the start of the group (or the start of the format string, if not in a group).

Combining the * qualifier with a number qualifier (e.g., H30*) makes the number an upper limit when unpacking (and has no effect when packing); see below.

* In place of a numeric qualifier, * means "infinity" for the count or length. Packs all remaining argument values for a numeric type, or the full string for a string type. Unpacks the entire rest of the file.

You can also combine * with a numeric count for unpacking, as in H30*. The combination of a number and * means to unpack up to the numeric limit, but stop earlier if there's not enough source material to fulfill the count. Normally, trying to unpack more items than are actually available would cause an error, because the unpacker would try to read past the end of the source bytes. When using a count and * with a multi-byte item (such as one of the integer types) or a group, the file must end exactly at an item or group boundary - that is, there must not be any extraneous bytes after the last item. This is because the unpacker checks to see if it has reached the very end of the source object just before unpacking each iteration of the repeated item; if extraneous bytes follow the last item or group, the unpacker will think there's another item available and will go ahead with unpacking it, triggering a read error when the end of the source object is encountered midway through the iteration.

:type (where type is one of the type codes listed above, such as q for a quad-word integer) Equivalent to specifying the packed byte length of the given type as a numeric repeat count. For example, :q is equivalent to a repeat count of 8, because the packed size of the q type is 8 bytes. This is most useful with the x and X codes (e.g., X:q skips backwards 8 bytes), but can be used with any item.
0 For types auwhH, specifies null termination as the length indicator. When packing, a null byte (or a two-byte null character, for type 'w') is added after the end of the string. When unpacking, the length is determined by reading characters until reaching the null terminator. Unpacking skips the null terminator in the file, and doesn't include it in the result value.

'0' can be combined with a fixed length by adding the length qualifier after the '0', as in 'a015'. This packs a fixed-width string as normal, then adds the null terminator (so 'a015' packs to 16 bytes). When unpacking, the null terminator is simply skipped.

Not allowed for other types.

! For wide character (wW) and hex digit (hH) string types, changes the length count into a byte length.

For padding (xX), changes the length count into an alignment size. E.g., x4! pads just enough to position the next byte at a multiple of 4 bytes from the start of the format string.

For positioning (@), makes the position relative to the start of the entire format string (rather than the current group).

When unpacking a square bracketed group, specifies that each iteration of the group is to be unpacked into a sublist, rather than unpacking the whole group into a single sublist. For example, unpacking [L S]3! returns a list of three sublists, each of which contains two elements (a long integer and a short integer).

? Changes @ to a query operator: when unpacking, @? returns an integer value giving the current byte offset within the file, relative to the start of the current group, and @?! returns the file offset relative to the start of the entire format string.

Has no effect with other types.

> Change the byte order of a multi-byte type to big-endian. For integers, the most significant byte of the value is packed first. For floating-point numbers, packs the byte with the sign bit and exponent first, followed by the mantissa bytes from most to least significant. For wide Unicode strings, packs the more significant byte of each character first (this doesn't change the order of the characters themselves, though, obviously).

This can be applied to a group, in which case it makes everything in the group big-endian by default.

< Use little-endian byte order. This is the default for all types, but < can be used to override the order for an item within a group that has the > qualifier, because everything within a group inherits the endian-ness of the group. For integers, this packs the least significant byte first. For floating-point numbers, it packs the mantissa bytes first, from least significant to most significant, followed by the byte with the sign bit and exponent. For wide Unicode strings, packs the less significant byte of each character first.

This can be applied to a group, in which case it makes everything within the group little-endian by default.

~ When unpacking an integer type, uses the smallest type that will hold the value. Specifically, when unpacking L, q, and Q, returns an integer value if the packed value fits in a 32-bit signed integer (-2,147,483,648 to 2,147,483,647), otherwise returns a BigNumber. These types always return BigNumber by default, even for values that would fit an integer.

~ can be applied to a parenthesized or square-bracketed group, in which case it applies to each individual item within the group.

This qualifier has no effect when packing.

% When packing values, ignores type limit overflows, and instead packs a truncated value. The value stored in case of overflow depends on the type:
  • Integers (cCsSlLqQ): a value that's too large for the type is "truncated" by dropping the most significant bits until it fits. For example, packBytes('s%', 0x12345678) will pack the value 0x5678, dropping all but the low-order 16 bits.
  • Character and byte strings (aAb): The qualifier affects each character individually in the string. Each character that's outside the 0-255 range is truncated to 8 bits unsigned.
  • Floating point (df): A value that's too large to represent in the IEEE 754-2008 format is stored as "infinity", which is a special, distinguished value within the IEEE and BigNumber type systems.

% can be applied to a parenthesized or square-bracketed group, in which case it applies to each individual item within the group.

This qualifier has no effect when unpacking. The default packing behavior when this qualifier isn't specified is to throw an error on overflow.

Other syntax

( ) Parentheses group a set of items. Groups can be repeated by using a count suffix, as in (l s)3, or a length prefix, as in C/(l s).

The following attributes can be applied to a group: < > ~ %. Attributes applied to a group are inherited by everything within the group. For example, (l s q)> is equivalent to l> s> q>.

[ ] Square brackets group items the same as parentheses, but indicate that the grouped items are taken from a list in the packing arguments, or unpacked into a list in the unpacking results. For example, unpacking 's [l]3 s' might return [1, [2, 3, 4], 5].

As with parenthesized groups, when any of the qualifiers < > ~ % are applied to the group, they're inherited by all items within the group.

If the ! modifier is used, each iteration of the group is packed from or unpacked into a separate sublist. Without this modifier, the entire group is packed/unpacked as a single sublist.

/ Count prefix. Use the syntax count-type / repeated-item. The count-type value is packed first, as a length prefix, then the repeated items are packed: fp.packBytes('S/l', 100, 200, 300) packs an unsigned 16-bit integer "3" as the number of items, then three 32-bit integers. If the repeated item is a string, the count is the length of the string in the normal units for the type: 'S/a' packs the length in characters of the string, immediately followed by the string. The repeated item can be a group, which lets you pack a count prefix for a complex item such as a counted-length string or group of integers: 'S/(C/a)' packs a prefix-counted list of prefix-length strings, 'S/[a4 s l]' packs a prefix-counted list of structures with three elements each.

Be careful when repeating a fixed-length string. When an item has both a count prefix and a repeat count suffix, the prefix overrides the suffix. For example, C/A4 means "one counted-length Latin-1 string", because the "C/" prefix supersedes the "4" suffix. To pack a counted-length list of four-character strings, you must use C/(A4).
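The sketch below tries out a count prefix and a square-bracketed group, packing into strings so that no file is needed. It's only an illustration of the rules described in this section, assuming a plain program built against the system library; the variable names are arbitrary.

```tads3
#include <tads.h>

main(args)
{
    /*
     *   'S/l' takes its count from the number of values supplied: a
     *   16-bit count of 3, then three 32-bit integers.  By the sizes
     *   given above, that's 2 + 3*4 = 14 packed bytes.
     */
    local counted = String.packBytes('S/l', 100, 200, 300);
    local countedLen = counted.length();
    "S/l packed <<countedLen>> bytes\n";

    /*
     *   Square brackets take their values from a list argument when
     *   packing, and produce a sublist when unpacking, so the result
     *   here should have the same shape as the arguments:
     *   [1, [2, 3, 4], 5].
     */
    local fmt = 's [l]3 s';
    local grouped = String.packBytes(fmt, 1, [2, 3, 4], 5);
    local vals = grouped.unpackBytes(fmt);
    local innerLen = vals[2].length();
    "middle element holds <<innerLen>> values\n";
}
```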

Miscellaneous notes

Spaces can be used anywhere in a format string. They're simply ignored, so you can use them for grouping to make the code easier to read.

All of the signed integer types use "two's complement" notation in the packed format.

Introduction to byte packing

The byte packer makes it easier to read and write binary files, by taking care of the details of translating between TADS data values and different byte formats. It provides a wide range of translations for common byte formats, including signed and unsigned integers from 8 to 64 bits, IEEE standard single- and double-precision floating point formats, big-endian and little-endian byte ordering for all multi-byte types, fixed-length strings, variable-length strings using length prefixes, ASCII, Latin-1, and Unicode strings, and more. The byte packer has a compact and powerful programming interface that lets you read and write complex structures with a couple of lines of code.

One key thing to understand is that the byte packer doesn't have any knowledge of any particular file format - for example, it doesn't have a special mode for JPEG files. It might have been convenient if it did, but it would also be limiting, since there's no way it could know about every format out there (not to mention that new formats are being invented all the time). The byte packer is more like a toolkit. Instead of knowing about particular file formats, it knows about the common components that make up most file formats. Taken together, these components let you build your own readers and writers for almost any format you might encounter.

Packing and unpacking

The byte packer has two basic operations: packing and unpacking. Packing is the process of taking one or more variable values in your program (numbers, strings, etc) and converting them to bytes in a file - you pack the values into bytes. Unpacking is the reverse, where you read bytes from a file and convert them into TADS values in your program.

You can actually pack and unpack bytes into and out of other objects besides files - specifically ByteArrays and strings. Everything works the same way regardless of the underlying byte storage location. So whenever we talk about reading or writing "the file", understand that we're talking generically about whatever underlying data source you're using.
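To make the three storage options concrete, here's a minimal sketch that packs the same kind of data into a new string, a new ByteArray, and an existing ByteArray, using the method signatures listed in the quick reference. It assumes a plain program built against the system library.

```tads3
#include <tads.h>
#include <bytearr.h>

main(args)
{
    /* pack into a new string: each packed byte becomes one character */
    local str = String.packBytes('s2', 1, 2);

    /* pack into a newly created ByteArray */
    local arr = ByteArray.packBytes('s2', 1, 2);

    /* pack into an existing ByteArray, starting at byte index 1 */
    local buf = new ByteArray(16);
    buf.packBytes(1, 's2', 3, 4);

    /* unpacking uses the same format string for all three sources */
    local a = str.unpackBytes('s2');
    local b = arr.unpackBytes(1, 's2');
    local c = buf.unpackBytes(1, 's2');
    "string: <<a[1]>>, <<a[2]>>\n";
    "new array: <<b[1]>>, <<b[2]>>\n";
    "existing array: <<c[1]>>, <<c[2]>>\n";
}
```

Packing into a string is a convenient way to experiment with format strings, since there's no file to open or clean up.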

You pack bytes using the packBytes() method. For files, this is a method on the File object. It's essentially a replacement for writeBytes(). Instead of having to prepare the individual byte values yourself, as you do with writeBytes(), the packBytes() method combines the steps of translating data values into the desired byte representation and writing the resulting bytes to the file. (You can still use writeBytes() - it's not obsolete by any means - and you can freely mix packBytes() and writeBytes() calls for the same file. You probably won't need to, though, since packBytes() can do anything writeBytes() can do, usually with a lot less hassle.)

Unpacking uses the unpackBytes() method, which is also a method on the File object. This method can serve as a replacement for readBytes(). It combines the steps of reading bytes from the file and translating them into data values.

Format strings

As we explained earlier, the byte packer doesn't have a "JPEG mode" - it doesn't have any built-in knowledge of any particular file formats. Instead, it relies on you to tell it how a file is structured. You have to tell the packer two things: the TADS datatype you want, and the byte format to use for the type. It's not enough to just know the TADS type, since there's no single, universal format for any type. Even for something as simple as an integer, there are several variations. You have to tell the byte packer which storage variation to use for each value.

The key to these conversions is called a format string. This is a string that you write in a little sub-language that defines the type conversions.

The format string language is pretty simple. The basic idea is that you write one "type code" (which is a single character, usually a letter) for each value you want to convert. The packer steps through the format string and matches up each type code with the corresponding value in the argument list, in order.

As a simple example, the type code for a 32-bit signed integer is "l". That's a lower-case "L", not a digit 1. The "L" stands for "long integer". This and a lot of the other type names come from the C programming language, so if you know C they'll be immediately intuitive, and if you don't know C they'll seem pretty random. C's type system has integer types in several sizes - meaning the amount of memory they take up, usually measured in bits. The smallest is the "char" type, which sounds like it's for character strings but is really just a small integer type, taking up only 8 bits. The next size up is "short integer", which is usually 16 bits; then "long integer", at 32 bits. The latest generation of processors also have 64-bit integers, which C calls "long longs" - but we call these "quads", because they take up four "words" of memory, and because we're already using the letter "l" for the 32-bit long.

Anyway, on to the example. To write a series of integers to a file, you write something like this:

local fp = File.openRawFile('myfile.bin', FileAccessWrite);
fp.packBytes('l l l', 1, 2, 3);

The format string is the first argument to packBytes - in this case, the string 'l l l'. The spaces in the string are meaningless; you can include spaces anywhere in the string without changing the meaning. As we'll see shortly, items can be more complex than a single character, so spacing things out can help make your code more readable.

The code above matches up each 'l' in the format string to an integer value in the argument list, and writes the value to the file as a series of four bytes. The 'l' format takes a 32-bit integer value and splits it up into four 8-bit chunks, writing each 8-bit chunk as a byte in the file. Bytes are 8 bits each, so a 32-bit integer divides up evenly into 4 bytes. In the file, we arrange the bytes in order from least significant (i.e., containing the lowest bit places of the number) to most significant. So after running this code, the file contains 12 bytes, which look like this in hexadecimal format:

01 00 00 00 02 00 00 00 03 00 00 00

To unpack this file - that is, read the bytes from the file and convert them back into TADS data values - we use the unpackBytes method. We once again need a format string to tell us how to interpret the bytes. Fortunately, the unpacking format string uses exactly the same syntax as the packing format string, so there's no new syntax to learn for this part. In fact, in most cases, you'll use exactly the same format string to unpack a given set of values that you used to pack the values in the first place. That's the case here:

local fp = File.openRawFile('myfile.bin', FileAccessRead);
local vals = fp.unpackBytes('l l l');

The unpacker reads just enough bytes to satisfy the items in the format string, converting each item into the corresponding data value. For an integer type such as 'l', the unpacker converts the bytes into an integer value. The unpackBytes() function returns a list containing the values it unpacked, in sequence, so vals now contains the list [1, 2, 3].
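Putting the two halves together, a complete round trip might look something like the sketch below. It assumes a plain program built against the system library, and that the interpreter's file safety settings allow writing myfile.bin in the working directory; error handling is left out to keep it short.

```tads3
#include <tads.h>
#include <file.h>

main(args)
{
    /* write three 32-bit integers, then close the file */
    local fp = File.openRawFile('myfile.bin', FileAccessWrite);
    fp.packBytes('l l l', 1, 2, 3);
    fp.closeFile();

    /* reopen the file and read the values back with the same format */
    fp = File.openRawFile('myfile.bin', FileAccessRead);
    local vals = fp.unpackBytes('l l l');
    fp.closeFile();

    /* vals should now be the list [1, 2, 3] */
    "<<vals[1]>> <<vals[2]>> <<vals[3]>>\n";
}
```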

Repeat counts

The format string syntax has a shorthand for a repeated item. Instead of writing 'l l l', we can write:

fp.packBytes('l3', 1, 2, 3);

The "3" after the "l" means "pack (or unpack) three copies of this". You can use this notation with any integer or floating point type.

For string types, which we'll come to shortly, you can also use a number suffix, but it means something different. For a string, a number suffix specifies the length of the string rather than a repeat count.
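The difference between the two meanings of the number suffix is easy to see by comparing packed lengths. This is just a sketch, packing into strings for convenience and assuming a plain program built against the system library.

```tads3
#include <tads.h>

main(args)
{
    /* numeric type: the 3 is a repeat count, so three values are packed */
    local nums = String.packBytes('l3', 1, 2, 3);
    local numsLen = nums.length();

    /* string type: the 8 is a length, so one 8-byte field is packed */
    local name = String.packBytes('a8', 'abc');
    local nameLen = name.length();

    "l3 packed <<numsLen>> bytes; a8 packed <<nameLen>> bytes\n";

    /* the null padding added by 'a' is removed again when unpacking */
    local vals = name.unpackBytes('a8');
    local text = vals[1];
    "unpacked '<<text>>'\n";
}
```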

More on integers

Let's go back and take another look at the file we've been working on. Recall that the bytes in the file, in hexadecimal, are:

01 00 00 00 02 00 00 00 03 00 00 00

The first integer value, 1, takes up the first four bytes of the file. Remember that type code 'l' means "32-bit integer", and 32 bits take up 4 bytes, because it's 8 bits to the byte. The default byte ordering is little-endian - least significant byte first - so the value 1, which we can write in hex as 0x00000001, comes out as the byte sequence 01 00 00 00. The values 2 and 3 follow in the same format.

It might be easier to see how endian-ness works with a larger number that isn't mostly zeros. Let's take 305,419,896. Why this number? Because it happens to have a nice hex representation: 0x12345678. Remember elementary school and the tens place, hundreds place, thousands place, etc.? With hex numbers we have the same idea, but of course it's not multiples of 10, but rather multiples of 16. In 0x12345678, the highest "place" digit is the 1, just as in our decimal version the highest would be the 3 in the hundred-millions place. We call this highest-place digit the "most significant digit", because it's the one that carries the biggest single slice of the number. The next most significant hex digit is the 2, and so on down to the 8.

There's yet another way we can look at this, which is to split up the number into pairs of hex digits: 12 34 56 78. Now if we consider each pair to be a "place", we can see that the most significant pair is 12, and the least significant is 78. Is it starting to make sense why we wanted to write this in hex? Note how we've split the integer into four pieces, exactly like 'l' splits it into four bytes. In fact, the four hex digit pairs correspond exactly to the four bytes. This is no accident; it's the main reason computer programmers like hex so much.

Just as we had the most significant digit pair, we now have the most significant byte. Endian-ness is all about how we arrange those bytes in the file. In little-endian order, we write them in sequence from least significant byte to most significant, which in this case would give us 78 56 34 12. In big-endian order, we write them the other way around, 12 34 56 78.
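One way to check the two byte orders for yourself is to pack 0x12345678 both ways and dump the result with the H type code, which unpacks each byte as two hex digits, high nibble first. This is a sketch that packs into strings rather than a file, assuming a plain program built against the system library.

```tads3
#include <tads.h>

main(args)
{
    local val = 0x12345678;

    /* default little-endian order: bytes 78 56 34 12 */
    local le = String.packBytes('l', val);

    /* big-endian order: bytes 12 34 56 78 */
    local be = String.packBytes('l>', val);

    /* 'H*' unpacks everything that's left as one string of hex digits */
    local leHex = le.unpackBytes('H*')[1];
    local beHex = be.unpackBytes('H*')[1];
    "little-endian: <<leHex>>\n";
    "big-endian:    <<beHex>>\n";
}
```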

The packer has variations for four sizes of signed integers: 8-bit ('c', for "character"), 16-bit ('s', for "short integer"), 32-bit ('l', for "long integer"), and 64-bit ('q', for "quad word integer").

For each integer size, the packer also has an "unsigned" version. "Unsigned" means that the value doesn't have a plus or minus sign - it's always taken to be positive or zero, so you can't store a negative value in an unsigned integer slot. Why would you want such a limitation? It's because by throwing out the negative numbers, you roughly double the highest positive value that the slot can hold. For example, a signed short (16-bit) integer can hold values from -32,768 to +32,767, but an unsigned short can hold values from 0 to 65,535. In a lot of cases, a particular value simply can't be negative because the physical quantity it represents can't be negative - for example, if an integer represents the height or width of a picture, only positive values are meaningful, since there's no such thing as a negative width. When you know for a fact that a value can never be negative, you can use an unsigned integer field in order to get the extra capacity for storing higher positive values.

The unsigned type codes are all simply the upper-case versions of the signed equivalents: C, S, L, and Q.

Finally, the packer lets you control the byte order. Recall that the packer always uses little-endian byte order by default. Many file formats call for big-endian order, though, so the packer lets you override the default. To use big-endian order for any integer type, place a > after the type code. You can also mark a type as explicitly little-endian by putting a < after the type code.

So if we wanted to change our format to store big-endian, unsigned, 16-bit values, here's what we'd write:

fp.packBytes('S3>', 1, 2, 3);

The file would now look like this:

00 01 00 02 00 03

The 3 and the > are both suffix codes. These apply to the immediately preceding item only. For example, if we wrote 'lS3>', this would write one signed little-endian long, followed by three unsigned big-endian shorts. You can probably see how spaces would help make this clearer: 'l S3>' means exactly the same thing but is a bit easier to read. (Then again, spaces can also make things less clear: 'lS 3>' means exactly the same thing as 'l S3>', even though it might look like the 'l' and 'S' are meant to be grouped. But the space doesn't change anything, no matter where you put it.)
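Here's a short sketch of the 'l S3>' example, showing that the suffixes attach only to the S. It packs into a string so the bytes can be dumped as hex digits, and assumes a plain program built against the system library.

```tads3
#include <tads.h>

main(args)
{
    /*
     *   One signed little-endian long, then three unsigned big-endian
     *   shorts.  By the rules above, the packed bytes are:
     *   01 00 00 00 | 00 01 | 00 02 | 00 03
     */
    local fmt = 'l S3>';
    local data = String.packBytes(fmt, 1, 1, 2, 3);

    /* dump the packed bytes as hex digits, high nibble first */
    local hex = data.unpackBytes('H*')[1];
    "<<hex>>\n";

    /* the same format string recovers the four values */
    local vals = data.unpackBytes(fmt);
    "<<vals[1]>> <<vals[2]>> <<vals[3]>> <<vals[4]>>\n";
}
```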

The types c, C, s, and S all correspond to TADS integer values. Note that these types have smaller range than the TADS integer type. If you try to pack a value that doesn't fit, it'll trigger a "numeric overflow" error. For example, this will cause an error:

fp.packBytes('c', 1000);
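If you'd rather handle the error than let it end the program, you can wrap the call in try/catch. The sketch below assumes the overflow is reported as an ordinary run-time exception derived from the system library's Exception class; it packs into a string so that no file is needed.

```tads3
#include <tads.h>

main(args)
{
    try
    {
        /* 1000 is outside the -128 .. 127 range of type 'c' */
        String.packBytes('c', 1000);
        "packed without error\n";
    }
    catch (Exception exc)
    {
        /* show whatever message the exception carries */
        "packing failed: ";
        exc.displayException();
        "\n";
    }
}
```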

The ranges for the integer types are as follows:

Code  Description                Range
c     Signed "char", 8 bits      -128 .. 127
C     Unsigned "char", 8 bits    0 .. 255
s     Signed "short", 16 bits    -32,768 .. 32,767
S     Unsigned "short", 16 bits  0 .. 65,535
l     Signed "long", 32 bits     -2,147,483,648 .. 2,147,483,647
L     Unsigned "long", 32 bits   0 .. 4,294,967,295
q     Signed "quad", 64 bits     -9,223,372,036,854,775,808 .. 9,223,372,036,854,775,807
Q     Unsigned "quad", 64 bits   0 .. 18,446,744,073,709,551,615

The type 'l' (lower-case L, signed 32-bit long) corresponds exactly to the TADS integer type, so you can't trigger an overflow with it.

'L' is the unsigned 32-bit long, so it'll cause an error if you try to pack a negative value into this type. On the other hand, TADS integers can't store values as large as an 'L' can, so it's impossible to overflow this type in the positive direction with an integer value. You can, however, store such large values in a BigNumber. When you unpack an item with type 'L', by default, the value is returned as a BigNumber. This is true even if the value would fit into a regular 32-bit integer. The reasoning is that even if the unpacked value would fit a regular integer, you're asserting via the 'L' that you're using it as an unsigned 32-bit value, so you might perform arithmetic on the value that would push it over the regular integer type's limits. You can override this by using the ~ qualifier, as in 'L~', which tells the unpacker to return the value as a regular integer if it'll fit, otherwise as a BigNumber.

The types 'q' and 'Q' are for 64-bit integers. TADS doesn't have a 64-bit integer type per se - regular TADS integer values have to fit in 32 bits. But the BigNumber type is readily capable of storing any value that will fit in a 64-bit integer, so in most cases you'll use BigNumbers as the source values when packing types q and Q. You can also use regular integers, of course; the packer automatically "sign extends" a regular integer value to the full 64 bits to fill out the file slot. When unpacking q and Q items, the unpacker always returns BigNumber values by default, even when the unpacked value would fit in an integer, just as for type L. And as with type L, you can override this with the ~ qualifier, which unpacks a q or Q value as an integer when it'll fit, and as a BigNumber when it won't.

The main reason for using ~, by the way, is that integers are quite a lot faster than BigNumber values for most calculations, and use less memory. If you're doing anything very complicated with the unpacked data, or reading very large files, this could make a difference. It's probably not worth worrying about for small files or simple processing.
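A quick way to see the effect of ~ is to unpack the same bytes with and without it and look at the type of the result. The sketch below uses dataType() for the check and packs into a string for convenience; it assumes a plain program built against the system library.

```tads3
#include <tads.h>

main(args)
{
    /* pack a small value into an unsigned 32-bit slot */
    local data = String.packBytes('L', 100);

    /* plain 'L' unpacks as a BigNumber, 'L~' as an integer when it fits */
    local plain = data.unpackBytes('L')[1];
    local small = data.unpackBytes('L~')[1];

    /* an integer reports TypeInt; a BigNumber is an object, so it won't */
    local plainKind = (dataType(plain) == TypeInt ? 'integer' : 'not an integer');
    local smallKind = (dataType(small) == TypeInt ? 'integer' : 'not an integer');
    "L  unpacked as: <<plainKind>>\n";
    "L~ unpacked as: <<smallKind>>\n";
}
```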

Ignoring overflows

If an integer value is out of bounds for the item type, as listed in the table above, the packer throws a "numeric overflow" error by default. You can tell the packer to ignore these errors, though. To do this, add the % qualifier to the type. For example, to pack a short integer without checking for overflow, you'd write 's%'.

The % qualifier tells the packer to "truncate" any integer values that don't fit in the type code's range. This means that the packer simply discards as many bits of the value as needed to make it fit, at the most significant end of the value. For example, if you pack 0x123456 with 's%', only the low-order 16 bits are actually packed, which means the value stored is 0x3456.

The reason that the % keeps only the lowest bits of the value is that this is the behavior typical in C or Perl/php programs in similar situations. This behavior thus provides a degree of compatibility for programs ported from or based on code written in those languages. It's not exactly safe or programmer-friendly, since careless use could lead you to create corrupted files without realizing it, but that's why it's not the default.

The symbol %, by the way, is meant to suggest a "modulo" or remainder calculation, which is exactly what happens when a value overflows. An overflowing value is effectively reduced modulo 2^N, where N is the number of bits in the type (i.e., it's divided by one more than the largest unsigned value the type can hold, and only the remainder is kept). For a 16-bit type, that's the remainder modulo 65536.
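To make the truncation rule concrete, here is a minimal Python sketch. It models the arithmetic only and is not TADS code. The function names are hypothetical, and the signed reinterpretation assumes two's-complement storage, which the text above doesn't spell out.

```python
def truncate_unsigned(value, bits):
    """Keep only the low-order `bits` bits of value (same as value mod 2**bits)."""
    return value & ((1 << bits) - 1)


def truncate_signed(value, bits):
    """Truncate, then read the kept bits back as a two's-complement number (assumed)."""
    low = value & ((1 << bits) - 1)
    return low - (1 << bits) if low >= 1 << (bits - 1) else low
```

With these definitions, truncate_unsigned(0x123456, 16) gives 0x3456, matching the 's%' example above.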

Data conversions

Up to now we've mostly taken for granted that there's an obvious correspondence between packed formats and TADS value types. The various integer formats (c C s S l) translate to and from TADS integers, and the "quad" formats (q Q) are for BigNumbers. The odd man out is the L format, which is too big for a TADS integer half the time, so it gets promoted to BigNumber when unpacked.

For unpacking purposes, that natural correspondence is exactly what the unpacker uses to determine the type of each returned value. (But remember that you can also use the ~ qualifier for q, Q, and L, to unpack into integers instead of BigNumbers whenever possible.)

When packing, though, you get a lot more flexibility. The byte packer will automatically convert whatever type you supply to the suitable type for the format:

Compressed integers

There's one more integer format, which takes a rather different approach. Type 'k' stores an unsigned integer of any size, using a compressed format. You can pack an integer or BigNumber (or a string, which will be converted to BigNumber first) to this type. 'k' is an unsigned type, so the value can't be less than zero, but there's no hard upper limit (there is an implementation limit that's currently about 10^500, though).

On unpacking, a 'k' item will be converted to a regular integer if it fits, otherwise it'll be returned as a BigNumber value.

The packed byte format for 'k' uses a variable number of bytes. The length depends on how large the integer value is - the larger the value, the more bytes it takes. That's what makes it a compressed type. A type like 'l' always uses the same number of bytes no matter what value is stored, which often results in storing lots of extraneous bytes full of zeros. 'k', on the other hand, only stores as many bytes as needed to represent the actual values, which can cut down on the overall file size if the most likely values are small numbers.

The format is as follows: take the binary representation of the value and divide it into 7-bit chunks, starting from the least significant bit. Find the most significant chunk that contains a non-zero bit, and discard the all-zero chunks above it. Now store the remaining chunks in bytes, from most significant to least significant. Set the 8th bit (the high-order bit, 0x80) on every byte except the last (least significant).
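The encoding steps can be sketched in a few lines of Python. This is a model of the byte layout, not the TADS implementation. The names pack_k and unpack_k are hypothetical, and storing zero as a single zero byte is an assumption.

```python
def pack_k(value):
    """Encode a non-negative integer as base-128 chunks, most significant first."""
    if value < 0:
        raise ValueError("'k' is an unsigned type")
    chunks = [value & 0x7F]              # last byte: low 7 bits, high bit clear
    value >>= 7
    while value:
        chunks.append((value & 0x7F) | 0x80)   # earlier bytes: high bit set
        value >>= 7
    return bytes(reversed(chunks))       # zero comes out as b'\x00' (assumed)


def unpack_k(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:              # high bit clear marks the last byte
            return value, pos
```

For example, 300 is binary 10 0101100, so it packs as the two bytes 0x82 0x2C, while any value up to 127 takes a single byte.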

The Perl documentation calls this format a "BER compressed integer" (BER is for Binary Encoded Representation). That terminology seems to be a source of some confusion, because "BER" is more commonly used in reference to a standard called ASN.1, which defines BER as the confusingly similar-sounding but completely unrelated Basic Encoding Rules. To be clear, the 'k' coding doesn't have anything to do with ASN.1. The TADS byte packer includes this format because it reportedly comes up from time to time in existing file formats.

Character strings

The byte packer can store strings in fixed-sized chunks of the file, or with varying lengths. It can translate between TADS's internal Unicode format and single-byte Latin-1, Unicode UTF-8 encoding, or Unicode UCS-2 encoding.

Here are the basic string types:

You've probably noticed the pattern: the lower-case version of a code pads with null bytes, and the upper-case version pads with spaces.

You might notice that the byte packer doesn't have codes for the full plethora of character sets that you can use for ordinary text files - it can only work with Latin-1 and the two Unicode formats. That shouldn't be a limitation in practice, since virtually all standard binary formats that you're likely to encounter will themselves use ASCII (which is a subset of Latin-1), Latin-1, or Unicode. If you should find yourself with a need for, say, Latin-2 conversions, you'll need an extra step: convert the string to or from a ByteArray, and use the 'b' code (see "byte strings" below) to pack or unpack it.

If you pack a string into type 'a' or 'A', any characters outside of the Latin-1 range are written as '?'. The u, U, w, and W formats can represent every character that TADS can represent internally.

If you use the % qualifier with 'a' or 'A', it changes the behavior for characters outside the Latin-1 range. 'a%' or 'A%' treat characters outside the range as integer overflows, which are then truncated to fit the 8-bit character type using the same scheme we saw earlier for regular integer fields. For example, the character U+0170 is truncated to 8 bits, yielding U+0070.

Some existing file formats store strings in fixed-length fields. In some cases, this is because all possible strings for a given field are of exactly the same length; for example, the chunk type field in a TADS .t3 file is always exactly four characters. In many cases, strings in a fixed-length field can still vary in length, up to the fixed maximum; when a shorter string is stored, it's simply padded out to the full length by adding spaces or "null" characters (zero bytes) after the end of the string. For example, if you have a 16-byte field, you can pack a string into it like this:

fp.packBytes('a16', 'test string');

The resulting bytes in the file will look like this (hex 20 is the ASCII code for a space character):

t e s t 20 s t r i n g 00 00 00 00 00

If you write a string that's longer than 16 characters with 'a16', the string will be truncated to 16 characters.

For a, A, u, and U, the length is given in bytes, as stored in the file. For a and A, this is obvious, because one character always equals one byte in Latin-1. For u and U, though, it can be a little complicated, because a single character can turn into one, two, or three bytes in UTF-8. If you write a string with 'u16', you can store up to 16 characters - but since some characters might take two or three bytes, the actual number stored might be smaller. The byte packer will never store an "incomplete" character - in other words, it'll never cut off the string between two bytes making up one character. If the packer has to cut a 'u' string short, it'll do so on a full character boundary, so that the packed bytes always constitute a well-formed string. The reason that u and U lengths are specified in bytes rather than characters is that the whole notion of "fixed length" applies to the file's bytes. "16 UTF-8 characters" isn't a fixed byte length, whereas "16 bytes in UTF-8 format" is.
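The character-boundary rule for 'u' fields can be modeled in Python as follows. This is a sketch of the described behavior, not TADS code. The name pack_u_fixed is hypothetical, and the null padding of the unused bytes follows the lower-case convention mentioned earlier.

```python
def pack_u_fixed(text, size, pad=b"\x00"):
    """Fit text into exactly `size` UTF-8 bytes without splitting a character."""
    out = b""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(out) + len(encoded) > size:
            break                        # the next character wouldn't fit whole
        out += encoded
    return out + pad * (size - len(out))  # fill the rest of the field
```

Six three-byte characters packed into a 16-byte field keep only five characters (15 bytes) plus one byte of padding.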

For w and W, the length is given in characters. The actual byte length is twice the character length: if you write a 'w16' string, you're storing 32 bytes. Likewise, when unpacking, 'w16' reads 32 bytes from the file. You can override this by adding the "!" qualifier: this says that the count should be in bytes after all. So if you write 'w16!', the packer will only write 16 bytes, which will only be room for 8 characters.

You specify a fixed-length string using a number suffix on the string type code. You'll recall that with integers, a numeric suffix means "write this many copies of the type". With strings, it's different: a number after a string means "write this many characters in the string".

For varying-length strings, the most common approach is to store the length of a string somewhere in the file before the string. That way, when the file reader comes to the string, it knows in advance how many bytes it has to read to fetch the string. A frequent idiom is to store the length of the string immediately preceding the string's bytes. This is so common that the byte packer has a special syntax for it: you first specify one of the integer codes, then you write a slash "/", followed by the string code. The string code in this case doesn't need a length suffix, since the "/" tells the packer to write the exact length of the string. For example:

fp.packBytes('C/a', 'Hi!');

This first stores an 8-bit unsigned integer containing the length, then the bytes of the string in Latin-1 format (which is identical to ASCII for these characters). Here's how the file looks:

03 H i !

When unpacking, the file reader knows from the "/" to interpret the "C" code as a length prefix, so it knows exactly how many bytes to read for the string.

Note that the number stored for a "/" length prefix will always use the same length units for the type that a count suffix would use. In other words "C/a" and "C/u" store the length of the string in bytes, while "C/w" stores the length in characters.

By the way, you can also use a string type for the length count. This can be useful if you're working with a file format that's based on human-readable text. When you use a string format for the length prefix, the packer converts the length value to a string, in decimal format; the unpacker converts the string back to a number to use as the length. For example:

fp.packBytes('A3/a', 'Hello from the string packer!');

That code stores "29 " - two digits plus a space to fill out the three bytes of the 'A3' format - followed by the bytes of the string.

There's one catch with using a string type as the length prefix: you can only use a fixed-length string. 'A3/' is fine, but 'A*/' won't work.
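The two length-prefix styles can be modeled in Python with the standard struct module. This is a sketch of the byte layout, not TADS code, and the function names are hypothetical.

```python
import struct


def pack_counted(text):
    """Model of 'C/a': one unsigned length byte, then the Latin-1 bytes."""
    data = text.encode("latin-1", errors="replace")   # '?' for other characters
    return struct.pack("B", len(data)) + data


def pack_text_counted(text, width=3):
    """Model of 'A3/a': decimal length, space-padded to `width` bytes."""
    data = text.encode("latin-1", errors="replace")
    return str(len(data)).ljust(width).encode("ascii") + data


def unpack_counted(buf):
    """Read the length byte first, then exactly that many string bytes."""
    (count,) = struct.unpack_from("B", buf, 0)
    return buf[1:1 + count].decode("latin-1")
```

The reader never has to search for the end of the string, because the prefix tells it how far to read.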

Null-terminated strings

In some file formats, varying-length strings aren't stored with a length prefix, but have their extents marked by adding a null byte at the end of the string. This is the way C/C++ programs represent strings in memory, so some file formats use the same approach.

The special '0' qualifier lets you pack and unpack null-terminated strings. For example:

fp.packBytes('a0', 'Null-terminated!');

This is stored in the file as follows:

N u l l - t e r m i n a t e d ! 00

When unpacking, the unpacker reads a character at a time from the file, and stops when it reaches the null. The null isn't part of the returned string; the unpacker ends the string just before it, then skips the null in the file so that the reader is positioned at the next value in the file.

You can use null termination with a, u, w, h, and H strings. It's not allowed with the space-padded versions because it would be confusing to use different characters for padding and termination. With a, u, h, and H strings, one null byte is stored at the end of the string; with w strings, two null bytes are stored, since every character in a w string takes two bytes.

You can combine null termination with fixed-length strings. To do this, be sure to write the length after the '0', since doing it the other way around would make the zero look like part of the length value. For example, to write a fixed length of 16 plus null termination, use 'a016' (not 'a160', which specifies a 160-character string). When you use null termination with a fixed-length string, the packer writes the null in addition to the fixed length portion - so 'a016' writes 17 bytes in total.

When packing, there's no real value to combining fixed-length and null termination. The packer simply writes the fixed-length string as normal, then writes an extra null byte (or pair of bytes for 'w') after the string. This guarantees null termination, but you could do the same thing by adding an 'x' after the string (or 'x2' for the 'w' format). When unpacking, though, there's a useful extra feature with the combination. The unpacker reads the fixed-length string as normal, skips the null byte in the file, then scans the string for embedded null characters. If it finds one, it terminates the string there. This is useful because some third-party file writers might leave garbage in a fixed-length string field after the first null, knowing that readers will ignore anything past the null. (This is particularly likely with C programs that copy memory structures directly to disk.) Without the '0' qualifier, the ordinary a, u, and w types remove null padding from the end of the string, but they don't scan for earlier embedded nulls.
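A Python model of the two unpacking behaviors may help. It's a sketch, not TADS code. The function names are hypothetical, and the second function takes the field size as a parameter in place of the count suffix.

```python
def unpack_a0(buf, pos=0):
    """Model of 'a0': read up to the null; return (string, next position)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("latin-1"), end + 1    # skip the terminator


def unpack_a0_fixed(buf, size, pos=0):
    """Model of 'a0' plus a count: read `size` bytes and the extra null,
    then cut the string at the first embedded null, if any."""
    field = buf[pos:pos + size]
    text = field.split(b"\x00", 1)[0]                 # drop garbage after a null
    return text.decode("latin-1"), pos + size + 1
```

The second function shows why the combination is useful: any leftover bytes after the first null in the field are discarded rather than returned as part of the string.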

Byte strings

Sometimes you'll want to store a string or a ByteArray's contents as raw bytes, rather than as text characters. There are three codes for this:

When packing, the "b" format is almost identical to the "a" format. It has only one difference: if you pack a character string using the "b" format, and it contains any character with a Unicode value outside of the range 0-255, the packer throws an error ("numeric overflow"). This is because you can't store a higher number value in a single byte. The "a" code, on the other hand, quietly stores a "?" for each such character.

You can change the overflow handling of the "b" format by adding the "%" qualifier. "b%" acts as though it were packing a series of 8-bit integers with truncation, so any character outside of the 0-255 range is simply truncated to 8 bits. For example, character U+0170 is written as 0x70, since only the low-order 8 bits are kept.

When unpacking, the b format has an important difference from "a". Whereas "a" returns a String value with the unpacked data, "b" returns a ByteArray. You can unpack the same data either way; it's just a matter of which is more convenient for you.

The length for the b format (as a repeat count suffix, or as a "/" prefix) is counted in bytes, as you'd probably expect.

The h and H formats give you a third way to unpack raw bytes. These codes unpack into strings containing printable hex digits. For example, suppose we have a file with these bytes:

H e l l o !

Now let's unpack it with 'H12'. This returns the string '48656C6C6F21'. The first pair of digits, '48', is the hex value of the ASCII code for 'H'. The second pair, '65', is the hex code for 'e'. And so on.

Note that 'H12' unpacks only six bytes. This is because the units for h and H are unpacked digits, and each byte in the file corresponds to two digits in the unpacked string. You can override this and change the units to bytes using the "!" suffix. Changing the code to 'H6!' unpacks six bytes, which will return a string of 12 hex digits.
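The relationship between digits and bytes can be checked with a short Python sketch. It's a model only, with hypothetical function names. It assumes an even digit count and upper-case hex output, as in the example above.

```python
def unpack_hex_digits(buf, digits):
    """Model of 'H<digits>': two hex digits per byte, so read digits // 2 bytes."""
    return buf[:digits // 2].hex().upper()


def unpack_hex_bytes(buf, nbytes):
    """Model of 'H<nbytes>!': the count is in bytes, yielding 2 * nbytes digits."""
    return buf[:nbytes].hex().upper()
```

Both calls read the same six bytes from the "Hello!" example and differ only in the units of the count.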

Floating point values

The byte packer has support for two floating point formats: 'd' for "double precision" and 'f' for "float". These are both represented in the file using the IEEE 754-2008 interchange format. 'd' is stored using the binary 64-bit sub-format (base 2, 11-bit exponent, 52-bit stored mantissa), and 'f' uses the binary 32-bit sub-format (base 2, 8-bit exponent, 23-bit stored mantissa, or 24 bits of precision counting the implicit leading bit).

BigNumber values can store much larger absolute values than the 'd' and 'f' types can represent. If you pack a BigNumber value that doesn't fit, a numeric overflow error is thrown. You can change this by adding the % qualifier. 'd%' and 'f%' won't throw errors when confronted with numbers that are too large, but instead pack "infinity" values. "Infinity" is a special distinguished value in the IEEE type scheme, meaning that the result of a calculation was too large to store in the type.

For consistency with the integer types, the default byte order for these types is little-endian. For a floating point value, little-endian means that the least significant byte of the mantissa is stored first, and the byte with the sign bit and exponent is stored last. Note that this byte order is backwards from the IEEE standard, which calls for big-endian order. You can force standard big-endian order using the > suffix as usual: d> and f> store precisely the formats defined in the standard.
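Python's struct module uses the same IEEE 754 binary formats, so it can illustrate the two byte orders and the overflow-to-infinity behavior. This is a sketch, not TADS code. The function names are hypothetical, and mapping negative overflows to negative infinity is an assumption.

```python
import struct


def pack_double(value, big_endian=False):
    """Model of 'd' (little-endian default) and 'd>' (IEEE big-endian order)."""
    return struct.pack(">d" if big_endian else "<d", value)


def pack_float_saturating(value):
    """Model of 'f%': store an infinity when the value is too large for 32 bits."""
    try:
        return struct.pack("<f", value)
    except OverflowError:                 # raised when value exceeds the 'f' range
        return struct.pack("<f", float("inf") if value > 0 else float("-inf"))
```

Packing 1.0 in big-endian order gives the bytes 3F F0 00 00 00 00 00 00. The little-endian form is the same eight bytes reversed, so the byte holding the sign and the top of the exponent comes last.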

The IEEE 754-2008 interchange format is a standard, portable format. It also happens to be the native format (modulo endian-ness) on a number of platforms, but that's beside the point - it doesn't matter to TADS whether your platform uses this format or some other format, or doesn't have a native floating point type at all. TADS converts directly between the standard IEEE 754-2008 representation and its internal BigNumber representation, guaranteeing that the conversions are identical across all platforms.

The 'd' and 'f' types are provided mostly for the sake of completeness. Floating point numbers seem to be rare in binary file formats, probably because (a) the IEEE standard for interchange formats is quite recent, and before that there really wasn't a well-defined universal format, and (b) even if there had been a portable standard, it's quite complex to translate from one floating point format to another, so in all likelihood no one would have bothered anyway. Case in point, the Perl and php byte packers both explicitly punt on this: they simply store the native machine formats. Now that a portable standard exists, though, it's possible that we'll see floats used more readily in future binary formats; so you could look at the inclusion of these types in TADS as a bit of future-proofing. It also has a pragmatic use: the "native" formats on PCs and Linux systems just happen to be byte-for-byte identical to the IEEE formats, so those supposedly non-portable Perl and php output files will just happen to be readable in TADS with the 'd' and 'f' formats, as long as the files were created on a PC.

If you're defining your own custom file format, you can of course use these types for storing floating point BigNumber values. Be aware that these types have limited precision, though: the 'd' type can store the equivalent of about 16 decimal digits, and 'f' stores a mere 7. If you need to preserve higher precisions, you're better off storing a BigNumber as a string value.

Grouping

You can group a series of codes using parentheses, ( ). This lets you apply a repeat count and a byte order suffix to a whole series of items at once.

Suppose you're working with a file format that stores pairs of names and numbers. For each one, we'll write 'C/a l' - a counted-length string, followed by a 32-bit integer. Now suppose we have six of these to store. We could write it as 'C/a l C/a l C/a l C/a l C/a l C/a l'. But that's tedious; using parentheses and a repeat count suffix, we could shorten this to '(C/a l)6'.

Note that the whole group is repeated on each iteration. The packer runs through the entire contents of the group once, then starts over at the beginning of the group for each repetition.

Groups can be nested: '(l (C/a)2)3' is the same as 'l C/a C/a l C/a C/a l C/a C/a'.

One of the simplest uses of groups is to repeat a string format. Remember that a repeat count suffix (or a "/" prefix) for a string format specifies the number of bytes or characters in the string. If you want to pack or unpack multiple strings, use a group. For example, to write four Latin-1 strings with a fixed length of 15 bytes, you can write '(a15)4'. To write six length-prefixed Unicode strings, use '(C/u)6'.

Grouping is also handy if you need to apply a byte-order override to a group of items. For example, you can simplify 'l> s> s> S>' by grouping it as '(l s s S)>'. Within a group, you can override the group byte order: '(l< s s S)>' treats all of the items as big-endian except the first.

List grouping

You can also use square brackets, [ ], for grouping. Square bracket groups work the same way as parenthesis groups for the repeat count and endian-ness modifier. The difference is that the argument value for packing must be a list, and the result value when unpacking is represented as a list.

When packing, a square bracket group reads its contents from a list in the arguments:

fp.packBytes('C/a [l s]3', 'string', [1, 2, 3, 4, 5]);

Note how the [l s]3 corresponds to the list value [1, 2, 3, 4, 5] in the arguments.

You might also notice that the format code [l s]3 specifies six items to pack, but the list argument only contains five values. What happens with that sixth packed item? When a value list is too short for a square-bracket group in the format list, the packer simply packs a default value for each missing item. For integers or floating point numbers, the default is zero; for string types, it's an empty string.

If a list argument contains too many items, on the other hand, the packer simply ignores the extra items.

When unpacking, a square bracket group is unpacked as a sublist in the result list. We can unpack the file we just packed above, like this:

local lst = fp.unpackBytes('C/a [l s]3');

This returns the list ['string', [1, 2, 3, 4, 5, 0]]. As you can see, the unpacker creates a sublist for the grouped item. Whenever the unpacker sees a square-bracket group in the format list, it uses a list for that group in the return value list.

Note that the sublist for the group contains six entries, even though our original input list for packBytes() had five values. Remember what we said about the default: the packer stored a default value of zero in the file for the missing sixth slot. The unpacker didn't supply the default - that was actually stored in the file when we packed it. The unpacker simply returned the zero value that it read from the file.
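The default-fill behavior of the '[l s]3' part can be modeled in Python with the struct module. This is a sketch of the byte layout, not TADS code. The function names are hypothetical, and the code assumes the default little-endian byte order.

```python
import struct


def pack_list_group(values, repeat):
    """Model of '[l s]<repeat>': int32/int16 pairs; missing items become 0
    and extra items are ignored."""
    needed = 2 * repeat
    padded = (list(values) + [0] * needed)[:needed]
    return struct.pack("<" + "lh" * repeat, *padded)


def unpack_list_group(buf, repeat):
    """Read the pairs back as one flat list, defaults included."""
    return list(struct.unpack("<" + "lh" * repeat, buf))
```

Packing five values with a repeat count of 3 produces 18 bytes, and reading them back yields six values, the last being the zero that was actually written to the file.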

Structure grouping

When packing or unpacking a group, you can also tell the packer to treat each iteration of the group as a separate sublist. This is often convenient when packing or unpacking object structures.

To pack or unpack a group as a list per iteration, use the "!" with the square-bracketed group.

Let's revisit the example above, but this time unpack the group data into sublists:

local lst = fp.unpackBytes('C/a [l s]3!');

In this case, the return list will be ['string', [1, 2], [3, 4], [5, 0]]. Rather than returning the entire group as a single sublist with six items, the unpacker returns a separate sublist for each iteration of the group.
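For comparison, here is a Python sketch of the per-iteration result shape. It's a model, not TADS code. The function name is hypothetical, and the data is assumed to be little-endian int32/int16 pairs.

```python
import struct


def unpack_struct_group(buf, repeat):
    """Model of '[l s]<repeat>!': one two-element sublist per iteration."""
    record = struct.Struct("<lh")        # one 'l s' pair, 6 bytes
    return [list(record.unpack_from(buf, i * record.size))
            for i in range(repeat)]
```

Each sublist corresponds to one record in the file, which is what makes the result easy to map onto objects.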

This can be handy when you're unpacking data into object structures, since it lets you use list iteration functions like mapAll() to transform the unpacked data into objects. For example, if we have a file that contains a list of structures, each of which consists of a 32-bit integer and a 16-bit integer, we could read the file into a custom object with something like this:

class InfoObj: object
  construct(lst)
  {
     aLong = lst[1];
     aShort = lst[2];
  }
  aLong = nil
  aShort = nil
;

readInfoObjects(fp)
{
  return fp.unpackBytes('[L S]5!').mapAll({x: new InfoObj(x)});
}

This tells unpackBytes() to read five of the long/short structures from the file, returning each one as a list with two elements. The overall unpackBytes() result is then a list of five of these sublists. We apply mapAll() to the list, transforming each sublist into an InfoObj instance that we construct from the sublist data. This leaves us with a list of five InfoObj objects. In one line, we've decoded this section of the file into structured object data.

Variable-length groups

Square-bracket groups have another important ability: you can use them with the "/" prefix count syntax. Going back to the example above, suppose that we don't want to use a fixed count of three for the list group, but instead use the actual length of the list. We could do this using the usual "/" syntax, but this time we apply it to the whole group rather than an individual item:

fp.packBytes('C/a C/[l s]', 'string', [1, 2, 3, 4, 5]);

When you use the "/" prefix with a square-bracket group, the packer figures out how many iterations of the group will be needed to store all of the items in the list value, and uses that as the repeat count. It writes out the repeat count prefix, just as when you use "/" with a single item, then iterates through the list. So for the example above, the bytes in the file will be:

06 s t r i n g 03 01 00 00 00 02 00 03 00 00 00 04 00 05 00 00 00 00 00

When we unpack with the same format string, the result will be ['string', [1, 2, 3, 4, 5, 0]], just as when we explicitly entered the repeat count of 3.
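The way the 'C/[l s]' part derives its repeat count can be sketched in Python. This models only the group portion of the example, not the string before it. The function name is hypothetical, and little-endian byte order is assumed.

```python
import struct


def pack_counted_group(values):
    """Model of 'C/[l s]': a count byte, then that many int32/int16 pairs."""
    repeat = -(-len(values) // 2)        # iterations needed, rounded up
    padded = list(values) + [0] * (2 * repeat - len(values))
    return struct.pack("<B", repeat) + struct.pack("<" + "lh" * repeat, *padded)
```

Five values need three iterations of a two-item group, so the count byte is 03 and the final short is filled with the default zero.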

You can use a string type, such as 'A', for the length prefix type, just as with the length prefix for a single item.

Auto repeat count

Sometimes you'll want to pack a list of values, all of the same type, without knowing in advance how many values there are. As we've seen, you can use the "/" prefix to automatically count up the elements and pack them, but that also stores the counter prefix. When you want automatic counting without storing the count, you can use the special repeat count "*":

fp.packBytes('s*', 1, 2, 3, 4, 5);

The * repeat count simply packs everything remaining in the argument list, without storing a repeat count anywhere. When a starred item matches up with a list value, the packer writes out everything in the list to the starred item, then continues with the next item as normal:

fp.packBytes('s* a15', [1, 2, 3], 'hello!');

That writes the three 16-bit integer values, followed by the string.

On unpacking, a * means unpack the whole rest of the file. For example, to unpack an entire file into a byte array, you could simply write

local fp = File.openRawFile('test.bin', FileAccessRead);
local b = fp.unpackBytes('b*');
fp.closeFile();

Be careful with * on unpacking, since it could read a lot of data if you accidentally use it in the middle of a large file.

"Up to" repeat counts

It's sometimes useful to pack or unpack a repeated item up to some limit, or to the end of the actual data being packed/unpacked, whichever comes first. So far, we've seen how to pack or unpack an exact number of items, as in s30; we've also seen how to pack the whole rest of the argument list, or unpack the whole rest of the file, as in s*. But what if you want to unpack 30 items, but stop if the file runs out of data before unpacking all 30?

You can do this using an "up to" count. Specify an up-to count by combining a numeric limit and *. For example, s30* packs or unpacks up to 30 short-integer values, but stops if the end of the arguments (when packing) or file (unpacking) is encountered midway through the set.

When packing, the limiting factor for most types is the argument list itself (or the sublist, if packing a grouped sublist). For character string types, the limiting factor is the length of the string. For example, packing a 10-character string with 'a30*' packs all 10 characters of the string, with no padding added, while packing a 50-character string with the same 'a30*' packs only the first 30 characters of the string.

When unpacking, the limiting factor is the length of the file, ByteArray, or string you're unpacking from.
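The "whichever comes first" rule amounts to taking a minimum, as this Python sketch shows. It's a model, not TADS code. The function names are hypothetical, and ignoring a leftover odd byte at the end of the data is an assumption.

```python
import struct


def unpack_shorts_up_to(buf, limit):
    """Model of unpacking 's<limit>*': stop at the limit or the end of data."""
    count = min(limit, len(buf) // 2)    # whole 16-bit items available
    return list(struct.unpack_from("<%dh" % count, buf, 0))


def pack_a_up_to(text, limit):
    """Model of packing 'a<limit>*': at most `limit` bytes, with no padding."""
    return text.encode("latin-1", errors="replace")[:limit]
```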

Padding and positioning

There are a couple of special codes that don't pack or unpack data values, but rather move the read or write position in the file.

The code 'x' adds padding to the file when packing, in the form of null bytes (bytes with the value 0). Use a count suffix with 'x' to write more than one null byte: 'x4' writes four null bytes. 'x' doesn't consume any values from the argument list. During unpacking, 'x' simply skips forward by one byte (or by multiple bytes, if there's a suffix count). The unpacker simply skips the bytes; it doesn't add any values to the result list for the skipped bytes.

'x' is often handy when working with standard file formats. Many formats require specific byte layouts with some areas filled with null bytes for padding, which is exactly what 'x' takes care of.

Every so often, a file format calls for a fixed byte value or string of byte values other than zeros. There are two format codes that help with this. First, you can enclose a string of ordinary characters in double quotes in the pack string, and the packer will simply write those characters as bytes:

fp.packBytes('"hello"');

This packs the bytes h, e, l, l, o. The characters within the double quotes are treated as byte values, so each character must be in the Unicode value range 0-255, or an error will occur. Note that this format code doesn't use any argument values - it simply packs the bytes you specify directly in the format string. By the same token, this code won't return any values when unpacking. In fact, this format code doesn't do anything when unpacking except skip the number of bytes implied. It doesn't even verify that the bytes in the file match the text in the format code. In other words, unpacking '"hello"' is exactly the same as unpacking 'xxxxx' or 'x5'.

The second "literal" packing format is a string of hex digit pairs enclosed in curly braces. Each pair of hex digits gives a byte value. For example, we could write our '"hello"' example above like this instead, and get the same effect:

fp.packBytes('{68 65 6C 6C 6F}');

You can, of course, use repeat counts with these formats. For example, you could write 100 ASCII "A" bytes to a file like so:

fp.packBytes('"A"100');

Our next code doesn't write anything at all - in fact, it sort of "unwrites". 'X' moves the file position backwards by one byte. With a suffix count, it moves the file position back multiple bytes. For example, 'X15' moves the file position backwards by 15 bytes. As with 'x', this code doesn't consume any argument values when packing, and it doesn't produce any output values when unpacking.

You can use 'X' during unpacking to unpack the same bytes more than once, with different interpretations. For example, 'l X4 H8' unpacks four bytes as a 32-bit signed integer, then goes back and unpacks the same four bytes again as a hex byte string.
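The effect of 'l X4 H8' can be modeled in Python by tracking the read position explicitly. This is a sketch, not TADS code. The function name is hypothetical, and little-endian byte order is assumed.

```python
import struct


def unpack_l_then_hex(buf):
    """Model of 'l X4 H8': read an int32, back up 4 bytes, reread as hex."""
    pos = 0
    (value,) = struct.unpack_from("<l", buf, pos)
    pos += 4                             # position after the 'l'
    pos -= 4                             # 'X4' moves back over the same bytes
    digits = buf[pos:pos + 4].hex().upper()   # 'H8': 8 digits from 4 bytes
    return [value, digits]
```

Note that the hex string shows the bytes in file order, so a little-endian integer appears with its low-order byte first.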

'x' and 'X' let you specify the repeat count using the size of another item type code. To do this, write ":code" as the suffix. For example, 'x:l' writes four null bytes, because the size of an 'l' code is four bytes. If you use a character or byte string type after ":", the effective count is the size of a single character for that type: 'x:a' counts as one byte, 'x:w' as two bytes. 'x:u' and 'x:h' count as one byte each.

'@' lets you set a position in the file, relative to the start of the current parenthesized or square-bracketed group. You use a number suffix with @, but it's not the usual repeat count. The number after @ gives a byte offset from the start of the current group, starting at 0 for the first byte of the group. The file is positioned to that offset before the next byte to be packed or unpacked.

If you use @ outside of any group, it refers to the offset from the start of the whole format string.

For example, packing 'L @3 x' first packs a 32-bit integer into four bytes, then moves the file pointer to the last byte of the integer (offset 3 means three bytes after the first byte, which is the fourth byte written by 'L'), then overwrites that last byte with a null byte. So this forces the high-order 8 bits of the integer to be zero, effectively truncating the integer to 24 bits.
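The 'L @3 x' example can be reproduced byte for byte in Python. This is a sketch, not TADS code. The function name is hypothetical, and little-endian byte order is assumed.

```python
import struct


def pack_L_at3_x(value):
    """Model of 'L @3 x': write a uint32, then overwrite offset 3 with a null."""
    buf = bytearray(struct.pack("<L", value))   # four bytes, low byte first
    buf[3] = 0                                   # '@3' seeks, 'x' writes 00
    return bytes(buf)
```

Because the integer is little-endian, offset 3 holds the high-order byte, which is why zeroing it leaves only the low 24 bits.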

If a group containing an @ item is repeated, the @ offset refers to the start of the current iteration of the group. In other words, the starting point for @ is reset for each repetition of the group.

Alignment

The 'x' and 'X' codes have one more trick up their sleeves: they make it easy to add "alignment" padding. In some file formats, an item might have to be aligned on a particular boundary, meaning its byte location has to be a multiple of some size. This is especially common for file formats designed to be read directly into memory structures, because this kind of alignment is a hardware requirement on many machines. The most common alignment requirements are even alignment, meaning simply that each item has to be at an even numbered byte offset in the file, and size alignment, meaning that an item has to be at a byte offset that's a multiple of its own size (e.g., a 32-bit value would have to be at a multiple of 4 bytes, a 64-bit value at a multiple of 8 bytes, etc).

To do alignment with 'x' and 'X', add the '!' qualifier. For 'x', this tells the packer to add enough padding to get to the next multiple of the size. For example, 'x2!' makes sure the next item will be at an even byte offset. If the current offset is already even, 'x2!' doesn't do anything; if the offset is odd, it adds a single null byte. When unpacking, 'x2!' simply skips ahead to the next even byte offset.

'Xsize!' moves the file position backwards to the nearest previous multiple of the size. As with 'x', if the current byte offset is already a multiple of the size, 'Xsize!' does nothing.

'x!' and 'X!' do nothing if you don't specify a size. The default size is one byte, and of course every offset is an exact multiple of 1.

You can combine the '!' and ':type' qualifiers to align to the size of a type. For example, 'x:l!' aligns to a four-byte (32-bit) boundary.

The offsets for 'x' and 'X' are always counted from the start of the current format string (not from the start of the file or the start of the group).
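The alignment arithmetic is simple enough to state as two Python one-liners. This is a sketch, not TADS code, and the function names are hypothetical. The offset argument stands for the position counted from the start of the format string.

```python
def pad_bytes_forward(offset, size):
    """Model of 'x<size>!': null bytes needed to reach the next multiple of size."""
    return (size - offset % size) % size


def bytes_to_back_up(offset, size):
    """Model of 'X<size>!': bytes to move back to the previous multiple of size."""
    return offset % size
```

At offset 5, 'x2!' adds one byte and 'x4!' adds three. At offset 4, both add nothing.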