CharacterSet
TADS 3 uses the Unicode character set to represent all strings internally. Unicode is an international standard that was designed to be capable of representing, in a single character set, characters from every natural language in use throughout the world. Since most computers use other character sets for the display, keyboard, and file system, though, it is often necessary to translate strings between the Unicode characters that TADS uses internally and the local character sets that those components use. In almost all cases, TADS performs this translation automatically; when you display a string, for example, TADS translates the string to the display character set, and when you read a string from the keyboard, TADS translates the local character encoding to Unicode in the returned string.
In some cases, though, it's useful to be able to translate characters to and from Unicode, or from one local character set to another, under explicit program control. For these situations, TADS provides the CharacterSet intrinsic class. This class encapsulates a "character mapping," which defines the correspondences between local character codes and Unicode character codes.
To create a CharacterSet object, you use the new operator, specifying the name of the character set you want to translate to or from:
local cs = new CharacterSet('us-ascii');
The CharacterSet object can then be used to specify the encoding to use for explicit character translations. You can use a CharacterSet in these situations:
- You can specify the encoding of a text file you are reading or writing, by passing the CharacterSet to File.openTextFile().
- You can specify the interpretation of raw bytes in a ByteArray by passing the CharacterSet to the mapToString() method.
- You can specify how to encode a string into raw bytes by passing the CharacterSet to the mapToByteArray() method of a String.
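The last two uses can be combined into a simple round trip. The following is a minimal sketch, not a definitive implementation: roundTripUtf8 is a hypothetical helper name, and the code assumes the standard headers are on the include path.

```tads3
#include <tads.h>
#include <charset.h>
#include <bytearr.h>

/* hypothetical helper: encode a string as UTF-8 bytes, then decode it again */
roundTripUtf8(str)
{
    /* UTF-8 is one of the built-in mappings, so it is always available */
    local cs = new CharacterSet('utf-8');

    /* encode the string into raw bytes */
    local bytes = str.mapToByteArray(cs);

    /* interpret the raw bytes as UTF-8 text again */
    local decoded = bytes.mapToString(cs);

    "Encoded length in bytes: <<bytes.length()>>\n";
    "Decoded text matches original: <<decoded == str ? 'yes' : 'no'>>\n";

    return decoded;
}
```

Calling roundTripUtf8('caf\u00e9') would exercise both directions of the mapping with a string that contains one non-ASCII character.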
In addition, CharacterSet provides a few methods that let you get information about the character mapping it describes.
Note: when using the CharacterSet class, you should #include <charset.h>.
Built-in and external character mappings
TADS 3 has several pre-defined character mappings built in to the system:
- 'US-ASCII' - the 7-bit ASCII character set. This "least common denominator" character set is available on practically every modern computer. Most computers extend this set by adding an additional set of accented letters and punctuation, but the extended sets vary by operating system and localization.
- 'ISO-8859-1' - the ISO 8859-1 character set, also known as ISO Latin-1. This is an 8-bit character set that contains the ASCII characters plus a set of punctuation and accented letters for Western European languages. This character set is not supported on all computers, but it has become widely supported because of its status as the default character set for HTTP.
- 'UTF-8' - the Unicode UTF-8 encoding. This encoding represents each 16-bit Unicode character as one, two, or three bytes; it is designed to be especially compact when coding strings that consist mostly of the ASCII subset of Unicode.
- 'UTF-16BE' - the 16-bit Unicode character set, in "big endian" representation: this means that each 16-bit character is encoded in a pair of 8-bit bytes, with the more significant byte first.
- 'UTF-16LE' - the 16-bit Unicode character set, in "little endian" representation: this means that each 16-bit character is encoded in a pair of 8-bit bytes, with the less significant byte first.
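One way to see how the two 16-bit encodings lay out their bytes is to encode the same character with each and print the resulting bytes in hexadecimal. This is only a sketch: showBytes and compareByteOrder are hypothetical names introduced for illustration.

```tads3
#include <tads.h>
#include <charset.h>
#include <bytearr.h>

/* hypothetical helper: print the bytes of str under the named encoding */
showBytes(str, csName)
{
    local bytes = str.mapToByteArray(new CharacterSet(csName));

    "<<csName>>:";
    for (local i = 1 ; i <= bytes.length() ; ++i)
        " <<toString(bytes[i], 16)>>";
    "\n";
}

/* hypothetical driver: encode one character both ways */
compareByteOrder()
{
    /* U+0416 has two different bytes, so the byte order is visible */
    showBytes('\u0416', 'utf-16be');
    showBytes('\u0416', 'utf-16le');
}
```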
The character sets above are available on every TADS 3 interpreter. In addition, TADS has a mechanism that allows new character set definitions to be added with external mapping files - see the section on character maps for details. You can use any character set for which an external mapping file exists on the local system, simply by using the mapping name in the CharacterSet constructor. (However, don't use the ".tcm" or other filename suffix - just use the plain mapping name.)
The standard TADS 3 distribution includes a full suite of external character mapping files, including all of the 8-bit Windows, MS-DOS, and Macintosh code pages, and the ISO Latin-1 through Latin-10 character sets. This standard suite is normally included with all distributions on all TADS 3 platforms, but individual platforms may add or delete some of these standard encodings, so it's best not to assume that a mapping is present just because it's in the standard suite. You can check at run-time to see if a given mapping is available using the isMappingKnown() method.
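Because an external mapping might be missing on a given platform, a program can try a list of candidate names and settle on the first one that is present. The sketch below assumes a hypothetical helper, firstKnownCharSet; the name 'MacRoman' is only illustrative and depends on which mapping files are installed.

```tads3
#include <tads.h>
#include <charset.h>

/*
 *   hypothetical helper: return a CharacterSet for the first name in
 *   the list whose mapping is available, or nil if none is available
 */
firstKnownCharSet(names)
{
    foreach (local name in names)
    {
        local cs = new CharacterSet(name);
        if (cs.isMappingKnown())
            return cs;
    }
    return nil;
}

/* hypothetical usage: prefer an external mapping, fall back on built-ins */
pickOutputCharSet()
{
    return firstKnownCharSet(['MacRoman', 'iso-8859-1', 'us-ascii']);
}
```

Since the built-in mappings are available on every interpreter, ending the list with one of them means the helper should not return nil in practice.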
Unknown character mappings
You can create a CharacterSet object that refers to a character mapping that doesn't exist on the local system. This is legal and won't cause any errors at the time you create the object; however, if you try to use the object to perform any character mapping, an exception - UnknownCharSetException - will be thrown.
You can check to see if a character mapping is known by calling the isMappingKnown() method after creating the CharacterSet object. If this method returns true, the character set is known and you can use it to perform character mapping.
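As an alternative to checking in advance, a program can attempt the mapping and handle the exception. The following is a sketch under the assumption that the program is compiled with the standard system library, which is where the exception class is expected to be defined; encodeOrNil is a hypothetical helper name.

```tads3
#include <tads.h>
#include <charset.h>
#include <bytearr.h>

/* hypothetical helper: encode str with the named mapping, or return nil */
encodeOrNil(str, csName)
{
    /* creating the object succeeds even if the mapping is unknown */
    local cs = new CharacterSet(csName);

    try
    {
        /* the mapping attempt is what can fail */
        return str.mapToByteArray(cs);
    }
    catch (UnknownCharSetException exc)
    {
        "Character set '<<csName>>' is not available on this system.\n";
        return nil;
    }
}
```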
It is legal to create a CharacterSet referring to an unknown mapping because it would otherwise be impossible to save the state of a program that contains a CharacterSet object and then restore the state on another computer without the same character mappings.
CharacterSet methods
getName()
isMappable(val)
Note that it is legal to map a string even if it contains unmappable characters, because the mapping process will simply map any unmappable characters to the "default" character defined in the character mapping. The default character varies by character set - it's part of the Unicode-to-local mapping definition - but it usually indicates visually that a character is missing; in some character sets it looks like an empty rectangle, and in others it's simply a question mark.
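A program that wants to avoid silent substitution can test a string before mapping it. This is a minimal sketch; checkAscii is a hypothetical helper name.

```tads3
#include <tads.h>
#include <charset.h>

/* hypothetical helper: report whether str can be written as plain ASCII */
checkAscii(str)
{
    local cs = new CharacterSet('us-ascii');

    if (cs.isMappable(str))
        "Every character has an ASCII equivalent.\n";
    else
        "At least one character would be replaced by the default character.\n";
}
```

For instance, checkAscii('caf\u00e9') passes a string whose last character lies outside the 7-bit ASCII range.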
isMappingKnown()
isRoundTripMappable(val)
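The four methods can be exercised together in a small diagnostic routine. This is only a sketch: describeCharSet is a hypothetical name, and the code assumes that isRoundTripMappable() accepts a string argument in the same way as isMappable().

```tads3
#include <tads.h>
#include <charset.h>

/* hypothetical helper: print what a CharacterSet reports about itself */
describeCharSet(csName, sample)
{
    local cs = new CharacterSet(csName);

    "Name: <<cs.getName()>>\n";

    /* stop early for unknown mappings, since mapping queries need a real mapping */
    if (!cs.isMappingKnown())
    {
        "This mapping is not available on this system.\n";
        return;
    }

    "Sample mappable: <<cs.isMappable(sample) ? 'yes' : 'no'>>\n";
    "Sample round-trip mappable: <<
        cs.isRoundTripMappable(sample) ? 'yes' : 'no'>>\n";
}
```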
Examples
Example 1: Using a CharacterSet to determine if the local machine is capable of displaying Cyrillic characters.
If you're writing a game in Russian, you would probably want to make sure the player's computer is capable of displaying Cyrillic characters - if it weren't, the player probably wouldn't be able to read most of the text in your game. You can do this by creating a CharacterSet object for the local system's display character set, and then testing a string of characters for mappability with the isMappable() method.
#include <tads.h>
#include <charset.h>

testCyrillic(args)
{
    /* get the local display character set */
    local cs = new CharacterSet(getLocalCharSet(CharsetDisplay));

    /*
     *   Check a few representative Cyrillic alphabetic characters
     *   (see http://www.unicode.org/charts/); warn only if they
     *   CANNOT be mapped to the display character set
     */
    if (!cs.isMappable('\u0410\u0411\u041a\u042f\u0430\u0431\u044f'))
        "Warning: This game uses Cyrillic characters. Your system
        does not appear to be localized for Russian, so the text in
        this game might not display properly. You might need to adjust
        your system localization settings to display Cyrillic characters
        before you can play this game. If you change your localization
        settings, please close and then re-start the game to ensure the
        new settings are used.";
}
Example 2: Translating a file from one character set to another.
This isn't a very typical situation for most games, but suppose you wanted to write a program that reads a text file that was saved in one character set and saves it in a different character set - say, translate the file from the Macintosh Roman character set to ISO Latin-1. To do this, you would need a Mac Roman mapping definition on your computer, because this isn't one of the built-in character sets; assuming we had this mapping file (let's say it's called "MacRoman.tcm"), we could perform the translation quite easily using the text file functions.
#include <tads.h>
#include <file.h>
#include <charset.h>

translate(inFileName, outFileName)
{
    local inFile, outFile;
    local csMac, csISO;

    /* create the character set objects */
    csMac = new CharacterSet('MacRoman');
    csISO = new CharacterSet('iso-8859-1');

    /* open the files */
    inFile = File.openTextFile(inFileName, FileAccessRead, csMac);
    outFile = File.openTextFile(outFileName, FileAccessWrite, csISO);
    if (inFile == nil || outFile == nil)
    {
        "Error: cannot open files.\n";
        return;
    }

    /* read text and write it back out */
    for (;;)
    {
        local txt;

        /* read a line of input; stop if at end of file */
        txt = inFile.readFile();
        if (txt == nil)
            break;

        /* write it out */
        outFile.writeFile(txt);
    }

    /* close the files */
    inFile.closeFile();
    outFile.closeFile();
}
Note that creating CharacterSet objects isn't strictly necessary in this example, since we could have more simply passed the name of the character set directly to File.openTextFile(). However, if we were going to use the same character set with more than one file, it's more efficient to use the CharacterSet object, since that way we only have to load the mapping file once.
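The sharing case might look like the following sketch, in which one CharacterSet object serves every file in a list. countLines is a hypothetical helper name, and the code assumes that a failure to open a file is reported through an exception that the caller handles.

```tads3
#include <tads.h>
#include <file.h>
#include <charset.h>

/* hypothetical helper: count the lines in several ISO Latin-1 text files */
countLines(fileNames)
{
    /* create the character set once and reuse it for every file */
    local cs = new CharacterSet('iso-8859-1');
    local total = 0;

    foreach (local name in fileNames)
    {
        local f = File.openTextFile(name, FileAccessRead, cs);

        /* readFile() returns nil at end of file */
        while (f.readFile() != nil)
            ++total;

        f.closeFile();
    }

    return total;
}
```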