Unicode And Character Encoding


There seems to be quite some confusion regarding the term ‘Unicode’.
In this document, we will start from scratch, and try to describe, as briefly as possible, the most common character sets and encoding protocols.


Fundamentals



Bit

A digit in the binary numeral system, or, as a unit of measurement, the information capacity of one binary digit.
For a computer's memory, a bit is the smallest modifiable unit (as a pixel is, e.g., for a 256 color digital image).


Byte

A contiguous sequence of a fixed number of bits.
For a computer's memory, a byte is the smallest unit of memory addressing - though a byte is divided into bits, you will never find a computer with e.g. 65536.5 bytes of RAM, or a variable that occupies less than 1 byte of RAM.
Though by definition a byte does not necessarily consist of 8 bits (it depends on the CPU's architecture), in this document the term 'byte' refers to 8 sequential bits, since on most modern computers a byte's size is actually 8 bits (an octet).



Byte Order (Big Endian / Little Endian)

Byte order defines how a processor interprets 16bit (two byte) or larger integer values: 1 byte = 8 bits, 2 bytes = 16 bits, 4 bytes = 32 bits, etc.
Let’s assume we want to store the numbers 1 and 256 in memory, on a machine that stores the least significant byte first (Little Endian) and on another that stores the most significant byte first (Big Endian).

Little Endian, least significant byte first, or left to right (x86)

bits  decimal  byte 0  byte 1  byte 2  byte 3  Hex Values To Dec Values
  8      1     #01                             #01 = 1*(2^0)
 16      1     #01     #00                     #01.00 = 1*(2^0) + 0*(2^8)
 32      1     #01     #00     #00     #00     #01.00.00.00 = 1*(2^0) + 0*(2^8) + 0*(2^16) + 0*(2^24)
  8    256     x       out of range: a byte's value ranges from 0 to 255
 16    256     #00     #01                     #00.01 = 0*(2^0) + 1*(2^8)
 32    256     #00     #01     #00     #00     #00.01.00.00 = 0*(2^0) + 1*(2^8) + 0*(2^16) + 0*(2^24)
Big Endian, most significant byte first, right to left, or Network Byte Order (Sun SPARC, Motorola, PowerPC)

bits  decimal  byte 0  byte 1  byte 2  byte 3  Hex Values To Dec Values
  8      1     #01                             #01 = 1*(2^0)
 16      1     #00     #01                     #00.01 = 0*(2^8) + 1*(2^0)
 32      1     #00     #00     #00     #01     #00.00.00.01 = 0*(2^24) + 0*(2^16) + 0*(2^8) + 1*(2^0)
  8    256     x       out of range: a byte's value ranges from 0 to 255
 16    256     #01     #00                     #01.00 = 1*(2^8) + 0*(2^0)
 32    256     #00     #00     #01     #00     #00.00.01.00 = 0*(2^24) + 0*(2^16) + 1*(2^8) + 0*(2^0)

This table demonstrates how data stored in RAM or in a file is translated to 8, 16 and 32bit integer values by LE and BE processors.
note: though only #00 and #01 have been used above, every byte's value can range from #00 to #FF (0-255)
E.g: the maximum unsigned integer supported by a 32bit system is (LE:)
#FF.FF.FF.FF = 255*(2^0) + 255*(2^8) + 255*(2^16) + 255*(2^24) = 4294967295 (=4GB)
while the maximum positive number when using signed integers will be:
#FF.FF.FF.7F = 255*(2^0) + 255*(2^8) + 255*(2^16) + 127*(2^24) = 2147483647 (=2GB = Director's maxInteger)
Values from #00.00.00.80 (unsigned: 2147483648, signed: -2147483648) to #FF.FF.FF.FF (unsigned: 4294967295, signed: -1) are interpreted as negative numbers.
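The unsigned/signed boundary above can be verified with a short sketch (Python is used here purely for illustration; the standard struct module's '<I' and '<i' formats select unsigned and signed 32bit little endian interpretation):

```python
import struct

# The same four bytes, interpreted as an unsigned ('<I') and as a
# signed ('<i') 32bit little endian integer.
data = b'\xff\xff\xff\xff'
assert struct.unpack('<I', data)[0] == 4294967295   # unsigned maximum
assert struct.unpack('<i', data)[0] == -1           # signed interpretation

# The largest positive signed value: #FF.FF.FF.7F in LE byte order.
assert struct.unpack('<i', b'\xff\xff\xff\x7f')[0] == 2147483647

# #00.00.00.80 is where the signed range turns negative.
assert struct.unpack('<i', b'\x00\x00\x00\x80')[0] == -2147483648
```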

Single byte values are not affected by the machine’s byte ordering.
For larger values, however, knowing the byte order of the data is mandatory. If the machine that generated the data uses a different byte order from the machine that is accessing the data, the latter must reverse the order of the bytes (for 16bit values AB->BA, for 32bit values ABCD->DCBA) before interpreting the data as numeric values.
In the table above, if a machine that uses LE ordering (as x86 machines do) tries to open the data stored by the second machine (BE) as 16bit values without reversing them, it will read 1 as 256 and 256 as 1.
Byte order conversion is required when: 1. the machine that generated the data uses a different byte order than the machine that accesses the data (the client), and 2. the client is trying to interpret a 2, 4, 8... byte (16, 32, 64bit, etc) block as a numeric value.
Based on the above, it's not hard to conclude that working with data in a reversed byte order results in more work for the CPU. And though the overhead will be minor when loading a small file into memory, it will probably be quite noticeable, as far as speed is concerned, when e.g. performing a search in a portable database containing 32bit integer values.
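The 1-reads-as-256 example can be demonstrated directly (again an illustrative Python sketch; struct's '<' and '>' prefixes force LE and BE interpretation of the same bytes):

```python
import struct

data = b'\x00\x01'  # two bytes, as stored by the BE machine for the value 1

# A little endian reader takes the first byte as least significant...
assert struct.unpack('<H', data)[0] == 256
# ...while a big endian reader takes it as most significant.
assert struct.unpack('>H', data)[0] == 1

# For 32bit values, reversing the bytes converts between the two orders.
le = struct.pack('<I', 256)              # b'\x00\x01\x00\x00'
assert struct.unpack('>I', le[::-1])[0] == 256
```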



Fonts

A font is a library of 'character entries'. Each 'character' contains data that is passed to the system when the specific character is requested from the font - or, more accurately, the system examines the font and decides which character's data it wants. The system then uses that information to e.g. draw the character on the screen.
Each character belongs to one or more encoding sets (e.g. ANSI, ISO, Unicode).
Each character contains data that the system will use (e.g. to draw the character on screen).
Each character contains one or more unique [encoding type:character code] pairs, that define the encoding(s) the character belongs to, and the character's id for the particular encoding.
When the system is instructed to display a string with a specific font, it will try to retrieve from the font the data required to draw each character. If some of the string's characters are not included in the font, it's up to the parent application to decide what to do - usually, it will use an alternative font to display the character (retrieve character data from).
A font may or may not support a specific encoding. Furthermore, it may support an encoding set only partially, meaning that some characters - usually entire sub-sets of characters belonging to particular languages - are not included in the font.
A Unicode font is a font that contains Unicode mapping information.
When a Unicode string containing the Greek small letter alpha is displayed, the system will search to see if a Unicode character 945 exists in the font. If so, it will use it to display the character.
Likewise, when ANSI encoding is used, the system will seek the character 225 of the ANSI-Greek (1253) code page for the alpha character.
Usually, in such a font, to save space, both uni-945 and ansiCP1253-225 will be mapped to the same character data. However, this is entirely up to the font's designer.



Single Byte Strings (8 bits per character)

A byte’s value ranges from #00 to #FF (0-255), giving 256 unique values.
Accessing single bytes in memory is what computers do best. And since each byte can have a value of 0-255, the single byte approach offers a fixed range of 256 possible values to which unique characters can be assigned. However, a computer’s memory couldn’t care less which character is assigned to each value. For it, the decimal value 65 is always 65, no matter which character a word processor should interpret this value as. For the record, a word processor will usually interpret the decimal value 65 as the capital “A”, but that is not always the case (see 'Fonts' above).


Single Byte Character Sets (Code Pages)

As mentioned above, a single byte string’s contents are raw data, each byte of which ranges between the decimal values 0 and 255.
Since computers and people’s brains work differently, it’s ok for us if the computer uses whatever method it prefers to store the data, but it wouldn’t be all that ok if we had to know that 65,66,67 means ABC.
That’s where code pages come in handy. Single byte code pages are nothing more than one-to-one value to character maps - each value corresponds to a unique character (spaces, tabs etc are considered characters) and vice versa.
A word processor will access a single byte string byte after byte. For each byte, it will check the code page that it is instructed (*) to use, to find which character is associated with the particular value in that code page. Then, it will check if the character exists in the font selected to display the string. If it does, it will draw the character on the screen – otherwise, the behaviour depends on the word processor.
Most non-symbol code pages use a globally agreed upon set of characters for the first 128 (0-127) values. These characters include the Latin alphabet (caps + small), numbers, punctuation and other commonly used characters. The rest of the values vary.
Code pages were invented mostly to support multiple languages - usually, to tell the computer how to interpret (which character to map to) values of 128 and above. There are several standard code page sets (ANSI, ISO etc). Depending on the code page selected, a single byte per character string may appear correctly, or totally incomprehensible (*).
Unless some symbols or non-standard code pages (e.g. a code page for which 65 does not mean 'A') were used when creating the string, a word processor will most probably open and display a text file that contains only 7bit characters (chars with decimal values 0-127), no questions asked. If, on the other hand, values of 128 and above exist, the processor might ask the user to decide which code page to use to display the text.

(*) When a string is saved as plain single byte per character text, what is actually saved is raw binary data. No extra information regarding the code page is saved. When that file is opened, it’s up to the opening application to decide which code page to use.
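The one-to-one mapping can be sketched as follows (an illustrative Python example; the cp1253 and cp1252 codecs correspond to the Windows Greek and Western ANSI code pages):

```python
# The same raw byte means different things under different code pages.
raw = b'\xe1'  # decimal 225

assert raw.decode('cp1253') == 'α'   # Windows Greek: small alpha
assert raw.decode('cp1252') == 'á'   # Windows Western: a with acute accent

# Bytes 0-127 decode identically under both code pages.
assert b'ABC'.decode('cp1253') == b'ABC'.decode('cp1252') == 'ABC'
assert b'ABC' == bytes([65, 66, 67])
```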


Multi Byte Character Sets

Though sets of 128 characters (values 128-255) are enough to hold all the characters of most western languages, this is hardly the case with eastern languages. To address this issue, the MBCS (or DBCS) system was invented. Such a set has a primary code page, consisting of values that correspond to single byte characters, plus values that correspond to sub-code pages (lead bytes). A lead byte is followed by another single byte character, referring to a position in the sub-code page pointed to by the lead byte.
E.g., when the character #81 (dec: 129), followed by #9A (dec: 154), is encountered in a string on a Japanese Windows system, the 154th character of the ANSI CodePage 932_81 will be displayed (http://www.microsoft.com/globaldev/reference/dbcs/932/932_81.mspx).
According to the above, a DBCS (as Microsoft calls MBCS) string is not a string that consists of double byte characters, but a string in which certain two-byte combinations are mapped to a single displayable character of a sub-codepage.
[link: SBCS and MBCS Code Pages supported by Windows]
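The lead byte mechanism can be demonstrated with code page 932, the Japanese Windows ANSI code page (Python's cp932 codec, for illustration; #82.#A0 is the Shift-JIS encoding of hiragana A):

```python
# In code page 932, #82 is a lead byte: the pair #82.#A0 selects a
# single displayable character from the 82 sub-codepage.
data = b'\x82\xa0'
text = data.decode('cp932')

assert text == 'あ'      # HIRAGANA LETTER A: one character...
assert len(text) == 1
assert len(data) == 2    # ...encoded as two bytes.

# Bytes below #80 remain ordinary single byte characters.
assert b'A\x82\xa0B'.decode('cp932') == 'AあB'
```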


Universal Character Set and Unicode

ISO/IEC 10646, or Universal Multiple-Octet Coded Character Set, or, in short, Universal Character Set (UCS), is what we usually refer to when we speak of Unicode. UCS is the first officially standardized character set aiming to include all characters, digits and symbols of all written languages. The Unicode standard is based on UCS, but adds rules for collation, normalisation of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic.
references:
A short overview of ISO/IEC 10646 and Unicode
Universal Character Set


Unicode Character Sets & Mappings



UCS-2

UCS-2 is the simplest UCS set. Each character in a UCS-2 string consists of 2 bytes, from a palette of 65536 unique entries. UCS-2 supports all BMP (Basic Multilingual Plane) characters. UCS-2 character values range from #0000 to #FFFF (with some ranges unused/reserved).
UCS-2 was the method used by earlier OSs, like NT 4.0, to support multiple languages. As it turned out, however, a 16bit palette (65K characters) was just not enough. And though a one-to-one mapping of all the characters in use today would require a 32bit character set (4 bytes per character), the amount of memory a string of such characters would occupy makes this approach (UTF-32) quite inefficient. The need for more than 65536 unique characters, plus the inefficiency of using 4 bytes for each character, combined with backwards compatibility with UCS-2 systems, led to the creation of UTF-16 - and to the loss of the original UCS-2's simplicity.


UTF-16

UTF-16 is a superset of UCS-2.
A UTF-16 string consists of a combination of 16bit and 32bit values.
A UTF-16 string that contains only UCS-2 characters will always be 2 * (the number of characters in the string) bytes long. If non UCS-2 characters are included, the size of the string in bytes will be 2 * (the number of UCS-2 characters) + 4 * (the number of non UCS-2 characters).
To handle Unicode strings, Windows uses what Microsoft calls 'Wide Characters'. A Wide Character is always a 16bit value (two bytes). Therefore, a UCS-2 character consists of a single Wide Character, while a non UCS-2 character consists of two Wide Characters.
[ tech: unlike MBCS, where the first byte of a pair selects the codepage and the second the entry, the 32bit value of a non UCS-2 character in a UTF-16 string encodes a single character code: if a Wide Character's value is between #D800 and #DBFF (a high surrogate) and the following Wide Character's value is between #DC00 and #DFFF (a low surrogate), the two Wide Characters are combined (by keeping the last 10 bits of each 16bit value) into a single 20bit value, which is added to #10000 to obtain the character's code. It goes without saying that working with UTF-16 is significantly slower than working with UCS-2. ]
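The surrogate pair arithmetic can be checked with a small sketch (illustrative Python; U+1D11E, the musical G clef symbol, is a well known non-BMP character):

```python
import struct

# MUSICAL SYMBOL G CLEF, U+1D11E: a character outside the BMP.
ch = '\U0001D11E'
data = ch.encode('utf-16-le')
assert len(data) == 4        # two Wide Characters (a surrogate pair)

hi, lo = struct.unpack('<2H', data)
assert 0xD800 <= hi <= 0xDBFF    # high surrogate
assert 0xDC00 <= lo <= 0xDFFF    # low surrogate

# Keep the last 10 bits of each value, combine to 20 bits, add #10000.
code = ((hi & 0x3FF) << 10 | (lo & 0x3FF)) + 0x10000
assert code == 0x1D11E
```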



Unicode Encoding Protocols



Unicode encoding & BOM

Though Unicode character sets are standardized, there are various Unicode encoding protocols. This is often a source of confusion, since people tend to consider e.g. UTF-7 and UTF-8 as different character sets (or something), while in actuality they are just methods of encoding Unicode strings. Something like saving an image in TIFF or BMP format: the image is the same, the only difference is the way the original data is encoded, according to the protocol.
The fastest method to save a Unicode string would be to just dump the binary data from memory to a file. Since, however, a Unicode string is affected by the underlying OS’s byte order, some extra information is required in order to make the string platform independent. Therefore, a small ‘header’ called the Byte Order Mark, or BOM, is often saved to the file, and the Unicode data follows. The application that opens a Unicode text file should check for the BOM, which will inform it whether it should reverse the bytes (e.g. a UTF-16 file saved on Mac, opened on Windows) or not (Mac to Mac, or Windows to Windows).


UTF-16 LE/BE

Saving UTF-16 on Mac and Windows requires no extra encoding; it is therefore the fastest format for saving Unicode strings to files, but it is byte order dependent. To create a portable file, a BOM should be prefixed to the data:
UTF-16 Little Endian BOM: #FF.FE
UTF-16 Big Endian BOM: #FE.FF
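A BOM-aware reader can be sketched in a few lines (illustrative Python; the codecs module exposes the two BOM constants listed above, and decode_utf16 is a hypothetical helper written for this example):

```python
import codecs

# The two UTF-16 BOM variants.
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert codecs.BOM_UTF16_BE == b'\xfe\xff'

def decode_utf16(data):
    """Strip the BOM and decode with the byte order it announces."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode('utf-16-le')
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode('utf-16-be')
    raise ValueError('no BOM: byte order unknown')

# The same character 'A', saved by an LE and by a BE machine.
assert decode_utf16(b'\xff\xfeA\x00') == 'A'
assert decode_utf16(b'\xfe\xff\x00A') == 'A'
```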


UTF-8

The most commonly used Unicode encoding. Unlike UTF-16, it is a portable (byte order independent) format. Other advantages of UTF-8 are readability and smaller size (for texts dominated by Latin characters), compared to UTF-16. Characters with values 0-127 are stored as single byte characters, so parts of a UTF-8 file (e.g. Latin characters, spaces and digits) can be displayed properly even by applications that do not support Unicode.
Though unaffected by byte order, a header is usually prefixed to UTF-8 encoded files, so that applications can identify them as such.
BOM: #EF.BB.BF
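The variable width and the BOM can both be observed directly (illustrative Python; the 'utf-8-sig' codec is Python's name for UTF-8 with the #EF.BB.BF header):

```python
# ASCII characters stay single byte in UTF-8; others grow.
assert 'A'.encode('utf-8') == b'A'                 # 1 byte (values 0-127)
assert 'α'.encode('utf-8') == b'\xce\xb1'          # 2 bytes
assert len('\U0001D11E'.encode('utf-8')) == 4      # 4 bytes (non-BMP)

# Byte order does not matter, but a BOM is often prefixed anyway;
# the 'utf-8-sig' codec writes it on encode and strips it on decode.
assert 'A'.encode('utf-8-sig') == b'\xef\xbb\xbfA'
assert b'\xef\xbb\xbfA'.decode('utf-8-sig') == 'A'
```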


UTF-7

Encodes the string to 7bit values (0-127), and is therefore not the most compact format. Still used by certain applications, and may be useful for opening older files.
Like UTF-8, it is byte order independent. It’s not uncommon to encounter BOM-less UTF-7 files.
BOM: #2B.2F.76.38.2D
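A quick sketch of the 7bit property (illustrative Python; non-ASCII runs are wrapped in a modified base64 form between '+' and '-'):

```python
# UTF-7 maps everything into the 7bit range 0-127.
encoded = 'α'.encode('utf-7')
assert all(b < 128 for b in encoded)
assert encoded == b'+A7E-'       # base64 of the UTF-16BE bytes #03.#B1

assert b'+A7E-'.decode('utf-7') == 'α'
assert 'ABC'.encode('utf-7') == b'ABC'   # plain ASCII passes through
```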




Abstract Definitions



Character

The smallest entity of a string.
For single byte strings, a character is 8 bits, or 1 byte, long.
For double byte, or wide character, strings, a character is 16 bits, or 2 bytes, long.
Though the above two definitions are quite straightforward for SBCS (8bit) and UCS-2 (16bit) strings, confusion may arise as to what is considered a character for e.g. MBCS and UTF-16 strings.
In Director, for instance, on a double byte system, a 'char' of a string returns an 8 or 16bit value, while in C++ a 'char' is defined as an 8bit value (and a wchar as 16bit).
Based on the above, there is no direct method of accessing the binary data (byte by byte) of a string on a double byte (e.g. Japanese) system.


Digit

A single unit or numeral in a counting system.
In the Xtra, we'll be using the term 'digit' of a string to refer to its 8 or 16bit components.
Working with digits (dgt) is much faster than working with characters (chr), while in many cases (1. all SBCS CodePages and 2. double byte strings containing only UCS-2 characters) they both return the same results.

For SBCS and MBCS strings, a digit will always be an 8bit (one byte) value.
For Double (or Wide) Character Strings, a digit will always be a 16bit (two bytes) value.
An MBCS string may consist of single and double byte characters, while a UTF-16 string may consist of double and quad byte characters.
SBCS:   ( bytes = digits ) = chars
MBCS:   ( bytes = digits ) >= chars | digits = single byte chars + 2 * double byte chars
UCS-2:  ( bytes / 2 = digits ) = chars
UTF-16: ( bytes / 2 = digits ) >= chars | digits = UCS-2 chars + 2 * non UCS-2 chars
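The UTF-16 line above can be checked with a sketch, assuming a 'digit' of a double byte string corresponds to one 16bit UTF-16 code unit (illustrative Python; utf16_digits is a hypothetical helper written for this example):

```python
def utf16_digits(s):
    """Number of 16bit units in the UTF-16 form of s."""
    return len(s.encode('utf-16-le')) // 2

bmp = 'αβγ'                 # three UCS-2 characters
assert utf16_digits(bmp) == len(bmp) == 3      # digits = chars

mixed = 'A\U0001D11E'       # one UCS-2 + one non UCS-2 character
assert len(mixed) == 2                         # characters
assert utf16_digits(mixed) == 3                # digits = 1 + 2*1
```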



Spacing Characters

A spacing character is a character that adds printable data to the string, expanding its printable size. Alphanumeric characters, punctuation marks, symbols and white spaces are considered spacing characters.


Non-Spacing Characters

A non-spacing character is a character that adds printable data, but does not expand the printable width of the string. E.g. accents that are composited onto alpha characters are considered non-spacing characters.


White Spaces

A character that expands the printable size of the string, without adding printable data. Spaces, tabs, line breaks and special formatting characters are considered white space characters.


String

A sequence of characters.


String Object, or String

An object in memory holding the information needed to interpret a block of memory as a string.
Besides the address and length, this information may contain the CodePage, bytesPerDigit and other values.


Null Terminated String, or C-String

A sequence of characters terminated with a null (decimal value: 0) character. Using null terminated strings is common and convenient, since they are natively supported by C/C++, but, especially for large strings, they are highly CPU inefficient. For the CPU, a string is just a block of memory, defined by its first character's address in memory and its length. Therefore, two values are required to keep a reference to a string. With null terminated strings, however, only the first character's address is stored, and the end must be found at run time: the CPU searches the characters byte after byte, starting at the address of the first character. The first null byte it encounters is considered the end of the string - it then subtracts the address of the first character from the address of the terminator to calculate the string's length (for 16bit strings, it searches for a null 16bit value - two sequential, aligned null bytes - and the address difference is divided by 2 to calculate the length).
A null terminated string cannot contain null bytes - any characters after the first null byte occurrence are discarded.
Director's string objects follow the CPU-efficient method (storing both the starting address and the length), and can therefore hold binary data. Some of its functions, however, plus a number of its own (e.g. the Multiuser Xtra's text send) or third party Xtras, don't.
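The scan-for-null behaviour and the data loss it implies can be sketched as follows (illustrative Python; c_strlen is a hypothetical helper mimicking C's strlen):

```python
def c_strlen(data):
    """Scan byte after byte until the first null, like C's strlen."""
    length = 0
    while data[length] != 0:
        length += 1
    return length

buffer = b'binary\x00data'     # contains an embedded null byte

assert c_strlen(buffer) == 6
assert buffer[:c_strlen(buffer)] == b'binary'   # 'data' is lost

# A (pointer, length) string object keeps everything:
assert len(buffer) == 11
```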


Single Byte String

A string consisting of 8bit digits.
For the Xtra, both SBCS and MBCS strings are held in single byte string objects. They can be accessed by digit (8bit, fast access) or by character (8 or 16bit, depending on the string's code page).


Double Byte (or Wide Char) String

A string consisting of 16bit digits - a string whose smallest unit is considered to be two bytes long (the CPU reads two bytes at a time).
Any string can be treated as a double byte string, as long as its length in bytes is even.
For the Xtra, both pure UCS-2 and UTF-16 strings are held in double byte string objects. They can be accessed by digit (16bit, fast access) or by character (16 or 32bit - UCS-2 or non UCS-2 characters).


Binary String

An 8bit per character string that may contain any character value. The contents of an .exe file, for example, can be considered a binary string.
Actually, all types of strings can be considered binary strings, but not all binary strings can be considered e.g. double byte or null terminated strings.



Links

http://en.wikipedia.org/wiki/UTF-16
http://jrgraphix.net/research/unicode_blocks.php?block=0
http://www.unicode.org/charts/
http://www.alanwood.net/unicode/aegean_numbers.html