CharacterSet
TADS 3 uses the Unicode character set to represent all strings internally. Unicode is an international standard that was designed to be capable of representing, in a single character set, characters from every natural language in use throughout the world. Since most computers use other character sets for the display, keyboard, and file system, though, it is often necessary to translate strings between the Unicode characters that TADS uses internally and the local coding systems. In almost all cases, TADS performs this translation automatically; when you display a string, for example, TADS translates the string to the display character set, and when you read a string from the keyboard, TADS translates the local character encoding to Unicode in the returned string.
In some cases, though, it's useful to be able to translate characters to and from Unicode, or from one local character set to another, under explicit program control. For example, you might want to read or write an external disk file using a particular character set. For situations like this, TADS provides the CharacterSet intrinsic class. This class encapsulates a "character mapping," which defines the correspondences between local character codes and Unicode character codes.
Construction
To create a CharacterSet object, you use the new operator, specifying the name of the character set you want to translate to or from:
local cs = new CharacterSet('us-ascii');
The CharacterSet object can then be used to specify the encoding to use for explicit character translations. You can use a CharacterSet in these situations:
- You can specify the encoding of a text file you are reading or writing, by passing the CharacterSet to File.openTextFile().
- You can specify the interpretation of raw bytes in a ByteArray by passing the CharacterSet to the mapToString() method.
- You can specify how to encode a string into raw bytes by passing the CharacterSet to the mapToByteArray() method of a String.
In addition, CharacterSet provides a few methods that let you get information about the character mapping it describes.
Note: when using the CharacterSet class, you should #include <charset.h>.
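As a minimal sketch of the second and third uses listed above, the following function encodes a string into raw bytes and then decodes it again with the same CharacterSet. The function name roundTripDemo and the sample text are illustrative only; the sketch assumes the standard headers are on the include path.

```tads3
#include <tads.h>
#include <charset.h>
#include <bytearr.h>

/* Round-trip a string through an explicit encoding (illustrative). */
roundTripDemo()
{
    /* the built-in UTF-8 mapping is available on every interpreter */
    local cs = new CharacterSet('utf-8');

    /* encode a string into raw bytes in the chosen character set */
    local bytes = 'caf\u00E9'.mapToByteArray(cs);

    /* decode the raw bytes back into a TADS (Unicode) string */
    local str = bytes.mapToString(cs);

    "Encoded length: <<bytes.length()>> bytes; decoded text: <<str>>\n";
}
```

Because UTF-8 can represent every character in a TADS string, the decoded string should match the original; with a smaller target set, unmappable characters would be approximated or replaced during encoding.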
Handling unmappable characters
TADS represents strings internally in Unicode. Unicode combines essentially all of the world's alphabets and symbols into one character set. It can represent many thousands of unique characters.
Most proprietary character sets, on the other hand, are limited to a relatively small number of unique characters - often 256, which is the number of different characters you can represent with an 8-bit byte. These smaller sets are usually designed to represent only the characters needed for a small group of related languages. For example, Latin-1 only includes the Roman alphabet, and only has the accented letters that are commonly used in Western European languages (German, French, Italian, Spanish, etc.).
When you convert a string from Unicode to one of these smaller proprietary character sets, it's entirely possible that the string will contain Unicode characters that don't exist in the target set. For example, Latin-1 doesn't contain any of the Greek alphabet. So what happens if you convert a TADS string containing Greek letters to Latin-1?
There are two possibilities when a character in a string you're mapping doesn't exist in the target character set.
The first is that it's replaced with an "approximation". This means that the mapper will substitute a similar looking character or group of characters for the original. For example, the plain ASCII character set doesn't include the copyright symbol, © (Unicode U+00A9), but it does have a visual approximation: the string "(c)". The most common approximation is for accented letters. Most of the mappers will replace an accented letter with its unaccented equivalent when the accented version doesn't exist in the target set. For example, if you convert a string containing "E with caron" (U+011A) to Latin-1, the mapper will substitute an unaccented E.
The second possibility is that the character will be replaced with a "missing character" symbol. This approach is used when there's no good substitute, such as when trying to map a Chinese character to Latin-1. The missing character symbol is up to the mapper to define; for some character sets it's a graphical symbol, often an empty rectangle, while for others it's an ordinary question mark.
The CharacterSet object has two methods that let you determine how a character is handled. isMappable() tells you whether or not the character has a mapping - this returns true if it has any mapping, exact or approximate. isRoundTripMappable() tells you whether the character has a one-to-one mapping, which usually means that there's an exact equivalent in the target set, since approximations are almost never one-to-one.
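The two methods can be combined to classify how a given character would be handled. This is only a sketch: describeMapping is a hypothetical helper name, and the "probably an approximation" wording reflects the rule of thumb above rather than anything the methods report directly.

```tads3
#include <tads.h>
#include <charset.h>

/* Report how one character would be handled by a given character set. */
describeMapping(cs, ch)
{
    if (cs.isRoundTripMappable(ch))
        "<<ch>> has a one-to-one mapping in <<cs.getName()>>.\n";
    else if (cs.isMappable(ch))
        "<<ch>> has a mapping in <<cs.getName()>>, but it isn't one-to-one,
        so it's probably an approximation.\n";
    else
        "<<ch>> has no mapping in <<cs.getName()>>; a missing-character
        symbol would be substituted.\n";
}
```

For instance, describeMapping(new CharacterSet('iso-8859-1'), '\u011A') would test the "E with caron" case mentioned earlier.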
Built-in and external character mappings
TADS 3 has several pre-defined character mappings built in to the system:
- 'US-ASCII' - the 7-bit ASCII character set. This "least common denominator" character set is available on practically every modern computer. Most computers extend this set by adding an additional set of accented letters and punctuation, but the extended sets vary by operating system and localization.
- 'ISO-8859-1' - the ISO 8859-1 character set, also known as ISO Latin-1. This is an 8-bit character set that contains the ASCII characters plus a set of punctuation and accented letters for Western European languages. This character set is not supported on all computers, but it has become widely supported because of its status as the default character set for HTTP.
- 'UTF-8' - the Unicode UTF-8 encoding. This encoding represents each 16-bit Unicode character as one, two, or three bytes; it is designed to be especially compact when coding strings that consist mostly of the ASCII subset of Unicode.
- 'UTF-16BE' - the 16-bit Unicode character set, in "big endian" representation: this means that each 16-bit character is encoded in a pair of 8-bit bytes, with the more significant byte first.
- 'UTF-16LE' - the 16-bit Unicode character set, in "little endian" representation: this means that each 16-bit character is encoded in a pair of 8-bit bytes, with the less significant byte first.
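To see the difference between the two byte orders, you can encode the same string with each mapping and print the resulting bytes in hexadecimal. The helper name showBytes is hypothetical, and this is a sketch rather than a definitive utility.

```tads3
#include <tads.h>
#include <charset.h>
#include <bytearr.h>

/* Show the byte sequence produced for a string in a given encoding. */
showBytes(str, csName)
{
    /* encode the string using the named character set */
    local bytes = str.mapToByteArray(new CharacterSet(csName));

    /* print each byte in hex; ByteArray indices start at 1 */
    "<<csName>>:";
    for (local i = 1 ; i <= bytes.length() ; ++i)
        " <<toString(bytes[i], 16)>>";
    "\n";
}
```

Calling showBytes('A', 'utf-16be') and then showBytes('A', 'utf-16le') lets you compare the order of the two bytes in each encoding.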
The character sets above are available on every TADS 3 interpreter. In addition, TADS can load external mapping files, which makes it extensible to almost any character set. See the section on character maps for details. You can use any character set for which an external mapping file exists on the local system, simply by using the mapping name in the CharacterSet constructor. (Don't use the ".tcm" or other filename suffix - just use the base name of the mapping file.)
The standard TADS 3 distribution includes a full suite of external character mapping files, including all of the 8-bit Windows, MS-DOS, and Macintosh code pages, and the ISO Latin-1 through Latin-10 character sets. Most TADS distributions contain the whole standard set, but individual platforms may add or delete some of the encodings, so it's best to check at run-time. Use the isMappingKnown() method to determine if a character set is available.
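One way to apply the run-time check is to try a list of candidate names in order of preference and use the first one that's available. The function below is a sketch under that assumption; firstKnownCharSet is a hypothetical name, not part of the CharacterSet class.

```tads3
#include <tads.h>
#include <charset.h>

/*
 *   Return a CharacterSet for the first name in the list whose mapping
 *   is available on this system, or nil if none of them is available.
 */
firstKnownCharSet(names)
{
    foreach (local name in names)
    {
        /* creating the object is always safe; then test availability */
        local cs = new CharacterSet(name);
        if (cs.isMappingKnown())
            return cs;
    }

    /* none of the candidates is available */
    return nil;
}
```

For example, firstKnownCharSet(['cp1251', 'koi8-r', 'utf-8']) prefers a Cyrillic code page but falls back on UTF-8, which is built in.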
Here's a list of the standard character sets included with most of the official TADS distributions. The names aren't sensitive to case.
The "synonyms" column lists other names you can use to refer to the same character set. The synonyms aren't there to give you more stuff to memorize - just the opposite. They're names that other programming languages might use for the same character sets. TADS accepts the common synonyms so that if you're already accustomed to using a certain name from another system, you don't have to remember a different name when using TADS.
Name(s) | Synonyms | Description |
---|---|---|
US-ASCII | ASCII, US_ASCII, ASC7DFLT, ISO646-US, ISO-IR-6, CP367, US | 7-bit US ASCII. This character set contains only the Roman alphabet (without any accented letters), the digits, and a few common punctuation marks. |
UTF-8 | UTF8 | Unicode UTF-8. This is an encoding of Unicode that uses a varying number of bytes per character. It's designed to be compatible with pre-Unicode applications. This encoding is common in Internet protocols. |
UCS-2LE | UCS2LE, UTF-16LE, UTF16LE, UTF_16LE, UnicodeL, Unicode-L, Unicode-LE | Unicode UCS-2, little-endian byte order. This is an encoding of Unicode that stores each character in two bytes, with the low-order byte of each pair stored first. This is a common encoding in Windows applications that use Unicode. Technically, TADS uses UCS-2, not UTF-16. The latter is an upwardly compatible extension that can encode "supplementary" characters outside of the 16-bit range, by using two 16-bit elements known as surrogates. TADS accepts the UTF-16 names as synonyms because of the basic format compatibility, but TADS doesn't actually recognize surrogate pairs internally; it will incorrectly treat each pair as two ordinary characters. The compatible design of the encoding means that TADS won't corrupt data in this format and will largely process it correctly, but it will display each supplementary character as a pair of unknown/missing characters, and it will count surrogate pairs as two characters for the purposes of string lengths and the like. |
UCS-2BE | UCS2BE, UTF-16BE, UTF16BE, UTF_16BE, UnicodeB, Unicode-B, Unicode-BE | Unicode UCS-2, big-endian byte order. This is an encoding of Unicode that stores each character in two bytes, with the high-order byte of each pair stored first. This is the default UCS-2 byte order for most programs on non-Windows platforms. |
Latin-1 | ISO-8859-1, ISO_8859-1, ISO_8859_1, ISO8859-1, ISO8859_1, 8859-1, 8859_1, ISO-IR-100, Latin1, L1, CP819, ISO1 | Western Europe. This character set contains all of the ASCII characters, plus a set of accented Roman characters used in Western European languages (Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish). |
Latin-2 | Latin2, ISO-2, ISO2, 8859-2, ISO8859-2, ISO-8859-2, ISO_8859-2, ISO_8859_2, L2 | Central and Eastern Europe. Includes ASCII characters plus accented characters for Central and Eastern European languages that use the Roman alphabet (Bosnian, Polish, Croatian, Czech, Slovak, Slovene, Serbian, Hungarian). |
Latin-3 | same variations as Latin-2 | South Europe (Turkish, Maltese, Esperanto). |
Latin-4 | same variations as Latin-2 | North Europe (Estonian, Latvian, Lithuanian, Greenlandic, Sami). |
Latin-5 | same variations as Latin-2 | Latin/Cyrillic. Includes the ASCII characters plus the Cyrillic alphabet. |
Latin-6 | same variations as Latin-2 | Latin/Arabic. Includes the ASCII characters plus the most common Arabic language characters. |
Latin-7 | same variations as Latin-2 | Latin/Greek. Includes the ASCII characters plus the Greek alphabet. |
Latin-8 | same variations as Latin-2 | Latin/Hebrew. Includes the ASCII characters plus the Hebrew alphabet. |
Latin-9 | same variations as Latin-2 | Latin/Turkish/Kurdish. Includes most Latin-1 characters, but replaces some Icelandic letters in Latin-1 with Turkish letters. |
Latin-10 | same variations as Latin-2 | Latin/Nordic. A rearrangement of Latin-4 designed for Nordic languages. |
KOI8-R | | Russian Cyrillic. |
CP437 | 437, DOS437, DOS-437 | MS-DOS code page 437 (the original IBM PC code page; contains ASCII plus a complement of accented Roman letters, line-drawing characters, and miscellaneous symbols) |
CP737 | 737, DOS737, DOS-737 | MS-DOS code page 737 (Greek) |
CP775 | 775, DOS775, DOS-775 | MS-DOS code page 775 (Estonian, Lithuanian, Latvian) |
CP850 | 850, DOS850, DOS-850 | MS-DOS code page 850 ("Multilingual", mostly Western Europe; retains many symbols and line drawing characters from CP437, but adds more accented Roman letters) |
CP852 | 852, DOS852, DOS-852 | MS-DOS code page 852 ("Slavic", Central and Eastern Europe) |
CP855 | 855, DOS855, DOS-855 | MS-DOS code page 855 (Cyrillic) |
CP857 | 857, DOS857, DOS-857 | MS-DOS code page 857 (Turkish) |
CP860 | 860, DOS860, DOS-860 | MS-DOS code page 860 (Portuguese) |
CP861 | 861, DOS861, DOS-861 | MS-DOS code page 861 (Icelandic) |
CP862 | 862, DOS862, DOS-862 | MS-DOS code page 862 (Hebrew) |
CP863 | 863, DOS863, DOS-863 | MS-DOS code page 863 (French Canadian) |
CP864 | 864, DOS864, DOS-864 | MS-DOS code page 864 (Arabic) |
CP865 | 865, DOS865, DOS-865 | MS-DOS code page 865 (Nordic) |
CP866 | 866, DOS866, DOS-866 | MS-DOS code page 866 (Cyrillic) |
CP869 | 869, DOS869, DOS-869 | MS-DOS code page 869 (Greek) |
CP874 | 874, DOS874, DOS-874 | MS-DOS code page 874 (Thai) |
CP1250 | 1250, Win1250, Win-1250, Windows1250, Windows-1250 | Windows code page 1250 (Central and Eastern Europe; similar to Latin-2, but not compatible, since some characters are rearranged) |
CP1251 | 1251, Win1251, Win-1251, Windows1251, Windows-1251 | Windows code page 1251 (Cyrillic; mostly equivalent to Latin-5) |
CP1252 | 1252, Win1252, Win-1252, Windows1252, Windows-1252 | Windows code page 1252 (Western Europe; this is a superset of Latin-1, with some added punctuation characters) |
CP1253 | 1253, Win1253, Win-1253, Windows1253, Windows-1253 | Windows code page 1253 (Greek; mostly equivalent to Latin-7, but not fully compatible) |
CP1254 | 1254, Win1254, Win-1254, Windows1254, Windows-1254 | Windows code page 1254 (Turkish; mostly equivalent to Latin-9) |
CP1255 | 1255, Win1255, Win-1255, Windows1255, Windows-1255 | Windows code page 1255 (Hebrew; a mostly compatible superset of Latin-8) |
CP1256 | 1256, Win1256, Win-1256, Windows1256, Windows-1256 | Windows code page 1256 (Arabic) |
CP1257 | 1257, Win1257, Win-1257, Windows1257, Windows-1257 | Windows code page 1257 (Baltic) |
CP1258 | 1258, Win1258, Win-1258, Windows1258, Windows-1258 | Windows code page 1258 (Vietnamese) |
Mac | | Mac OS Roman |
MacCyr | | Mac OS Cyrillic |
MacIceland | | Mac OS Icelandic |
MacCE | | Mac OS Central Europe |
MacGreek | | Mac OS Greek |
MacTur | | Mac OS Turkish |
Unknown character mappings
You can create a CharacterSet object that refers to a character mapping that doesn't exist on the local system. This is legal and won't cause any errors at the time you create the object; however, if you try to use the object to perform any character mapping, an exception - UnknownCharSetException - will be thrown.
You can check to see if a character mapping is known by calling the isMappingKnown() method after creating the CharacterSet object. If this method returns true, the character set is known and you can use it to perform character mapping.
It's legal to create a CharacterSet referring to an unknown mapping because it would otherwise be impossible to save the state of a program that contains a CharacterSet object and then restore the state on another computer without the same character mappings.
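If you'd rather attempt the mapping and handle failure afterwards, you can catch the exception instead of testing first. The sketch below assumes the program is linked with the standard system library, which is where the UnknownCharSetException class is expected to be defined; tryEncode is a hypothetical helper name.

```tads3
#include <tads.h>
#include <charset.h>
#include <bytearr.h>

/* Encode a string, returning nil if the mapping isn't available. */
tryEncode(str, csName)
{
    /* creating the object succeeds even if the mapping is unknown */
    local cs = new CharacterSet(csName);

    try
    {
        /* the actual mapping attempt is what can fail */
        return str.mapToByteArray(cs);
    }
    catch (UnknownCharSetException exc)
    {
        "Character set '<<cs.getName()>>' isn't available on this system.\n";
        return nil;
    }
}
```

Checking isMappingKnown() up front is usually simpler; the try/catch form is mainly useful when the CharacterSet object comes from elsewhere, such as a restored saved state.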
CharacterSet methods
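getName()
Returns the name of the character set as a string. This is the name that was used to create the CharacterSet object.
isMappable(val)
Returns true if the character or characters in val have a mapping in this character set, whether exact or approximate. val can be a string, as in Example 1 below.
isMappingKnown()
Returns true if the character mapping is known on the local system, which means that the object can be used to perform character mapping.
isRoundTripMappable(val)
Returns true if the character or characters in val have a one-to-one mapping in this character set, which usually means that there's an exact equivalent in the set.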
Examples
Example 1: Using a CharacterSet to determine if the local machine is capable of displaying Cyrillic characters.
If you're writing a game in Russian, you would probably want to make sure the player's computer is capable of displaying Cyrillic characters - if it weren't, the player probably wouldn't be able to read most of the text in your game. You can do this by creating a CharacterSet object for the local system's display character set, and then testing a string of characters for mappability with the isMappable() method.
    #include <tads.h>
    #include <charset.h>

    testCyrillic(args)
    {
        /* get the local display character set */
        local cs = new CharacterSet(getLocalCharSet(CharsetDisplay));

        /*
         *   Check a few representative Cyrillic alphabetic characters
         *   (see http://www.unicode.org/charts/), and warn the player if
         *   the display character set can't map them
         */
        if (!cs.isMappable('\u0410\u0411\u041a\u042f\u0430\u0431\u044f'))
            "Warning: This game uses Cyrillic characters. Your system does
            not appear to be localized for Russian, so the text in this game
            might not display properly. You might need to adjust your system
            localization settings to display Cyrillic characters before you
            can play this game. If you change your localization settings,
            please close and then re-start the game to ensure the new
            settings are used.";
    }
Example 2: Translating a file from one character set to another.
This isn't a very typical situation for most games, but suppose you wanted to write a program that reads a text file that was saved in one character set and saves it in a different character set - say, translates the file from the Macintosh Roman character set to ISO Latin-1. To do this, you would need a Mac Roman mapping definition on your computer, because this isn't one of the built-in character sets; assuming we had this mapping file (let's say it's called "MacRoman.tcm"), we could perform the translation quite easily using the text file functions.
    #include <tads.h>
    #include <file.h>
    #include <charset.h>

    translate(inFileName, outFileName)
    {
        local inFile, outFile;
        local csMac, csISO;

        /* create the character set objects */
        csMac = new CharacterSet('MacRoman');
        csISO = new CharacterSet('iso-8859-1');

        /* open the files */
        inFile = File.openTextFile(inFileName, FileAccessRead, csMac);
        outFile = File.openTextFile(outFileName, FileAccessWrite, csISO);
        if (inFile == nil || outFile == nil)
        {
            "Error: cannot open files.\n";
            return;
        }

        /* read text and write it back out */
        for (;;)
        {
            local txt;

            /* read a line of input; stop if at end of file */
            txt = inFile.readFile();
            if (txt == nil)
                break;

            /* write it out */
            outFile.writeFile(txt);
        }

        /* close the files */
        inFile.closeFile();
        outFile.closeFile();
    }
Note that creating CharacterSet objects isn't strictly necessary in this example, since we could have more simply passed the name of the character set directly to File.openTextFile(). However, if we were going to use the same character set with more than one file, it's more efficient to use the CharacterSet object, since that way we only have to load the mapping file once.
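The two approaches might look like this in practice. This is a sketch only; the function names openLatin1 and openAllLatin1 are hypothetical, and error handling is omitted for brevity.

```tads3
#include <tads.h>
#include <file.h>
#include <charset.h>

/* One-off use: pass the character set name directly. */
openLatin1(fname)
{
    return File.openTextFile(fname, FileAccessRead, 'iso-8859-1');
}

/* Repeated use: create the CharacterSet once and share it. */
openAllLatin1(fnames)
{
    /* a single object serves every file in the list */
    local cs = new CharacterSet('iso-8859-1');

    /* open each named file for reading with the shared character set */
    return fnames.mapAll({f: File.openTextFile(f, FileAccessRead, cs)});
}
```

The second form takes a list of filenames and returns a list of open File objects, which the caller is responsible for closing.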