Unicode Encode / Decode


String objects consist of three major values: a buffer that holds the string's binary data, a property denoting whether the string is a DDS or an SDS, and the CodePage. For strDs, the CodePage is used only if the object is requested to create and return a strS object (DDS to SDS conversion).
The difference between strDs and strSs is that the Xtra considers each digit of a strD to be two bytes wide. This means that a strD and a strS can hold identical binary data and simply treat it differently.
So, a strD is nothing more than a strS that is instructed to treat its data as pairs of bytes. You can even use the 'forceS' or 'forceD' commands to change a strX's type, without modifying or duplicating the strX's data.

A strD assumes its data to be in UTF-16LE format (the native Windows format for wide character strings).
So, if you have a strS that holds data in UTF-16LE format (e.g. loaded from a unicode text file), you can instruct the Xtra to treat it as a strD by using the forceD command. This command will instantly turn the strS into a strD.
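
The "same data, two readings" idea is not specific to the Xtra. As a rough illustration in Python (not Lingo, and not the Xtra's own code), the same six bytes can be read as six one-byte digits or as three two-byte UTF-16LE digits. The variable names are made up for this example.

```python
# Six bytes of UTF-16LE data: the Greek letters alpha, beta, gamma.
raw = bytes.fromhex("B103B203B303")

# SDS-like reading: every byte is one digit.
single_width_length = len(raw)

# DDS-like reading: every pair of bytes is one digit (UTF-16LE).
double_width_text = raw.decode("utf-16-le")

print(single_width_length)     # 6
print(len(double_width_text))  # 3
print(double_width_text)       # αβγ
```

The bytes themselves never change. Only the width used to step through them does, which matches the description of forceS and forceD above.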

So, what are the toU/uToD commands there for?
First, to support converting between UTF-16LE and other formats, like utf-7, utf-8 and utf-16BE (MacOS native).
Second, to support BOMs. BOMs (Byte Order Marks) are short stamps that can be prefixed to files or strings to describe the type of data that follows. If a BOM exists, the application accessing the data should treat everything from the end of the BOM to the end of the file according to the type described in the BOM. Note that a properly encoded utf-7, utf-8 or utf-16 string would not normally begin with one of these byte sequences for any other reason.
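
In terms of bytes, each BOM is the code point U+FEFF written in the encoding it announces. The Python sketch below (an illustration outside the Xtra) lists the BOM bytes for the four formats and checks that claim. The dictionary keys merely mirror the Xtra's symbols, and the utf-7 entry is assumed to be the "+/v8-" form that appears in this document's examples.

```python
import codecs

# BOM bytes per format. The keys mirror the Xtra's symbols for readability only.
BOMS = {
    "u8": codecs.BOM_UTF8,        # EF BB BF
    "u16": codecs.BOM_UTF16_LE,   # FF FE
    "u16b": codecs.BOM_UTF16_BE,  # FE FF
    "u7": b"+/v8-",               # ASCII-only form, as seen in this document
}

# Python codec names assumed to correspond to each symbol.
PYTHON_CODECS = {"u8": "utf-8", "u16": "utf-16-le", "u16b": "utf-16-be", "u7": "utf-7"}

for name, bom in BOMS.items():
    # Decoding a BOM on its own yields the single code point U+FEFF.
    decoded = bom.decode(PYTHON_CODECS[name])
    print(name, bom.hex(" "), decoded == "\ufeff")
```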

Tips:
- 'u' strings are strS strings.
- 'u' strings hold binary data that can be unicode-decoded to DDS, i.e. unicode strings (strDs).
- 'u' is to a DD string what e.g. PNG is to an image: a method to encode and store the actual data.


[1]
_u(uStrS {,uFormat} {, CP}  )  | uStrS.uToD({,uFormat} {, CP} ) Unicode Encoded SDS to DDS
  Returns <strD> / <err>
  uStrS String unicode data (strS, or Director string)
  uFormat Symbol unicode format of uStrS. Possible values are #u7, #u8, #u16, #u16b
  CP Integer CodePage to be set as the CP of the resulting <str>
     
  Description Attempts to decode a unicode encoded string to a strD.
uStrS can be holding e.g. the contents of a unicode text file.
If uStrS contains a BOM, the uFormat value is ignored.
If uFormat has been specified and uStrS contains no BOM, the Xtra will try to decode the data of uStrS using uFormat (e.g. utf-7 decode, when uFormat = #u7).
If uStrS contains no BOM and no uFormat has been specified, the Xtra will try to utf-8 decode the data of uStrS.
     
  Examples d=_s("B103B203B303310432043304").hexBlockToD() --convert a hex block to the strD 'αβγбвг'
d.pop()            --just checking...
uStr=d.toU(1)     --returns a strS holding utf-8 data (default) including BOM.
put uStr           --display raw utf-8 data
-- ο»ΏΞ±Ξ²Ξ³Π±Π²Π³ --as displayed on a Greek system (display does not affect the actual data)
dd=uStr.uToD()     --decode the unicode encoded data back to a strD.
put dd.pop()=d     --check if the result and the original strings are equal (+ view the result)

The string "+A7EDsgOz-", as far as utf-8 is concerned, is the literal string "+A7EDsgOz-". For utf-7, however, it is the BOM-less representation of the Greek characters "αβγ":
put _u("+A7EDsgOz-")
-- +A7EDsgOz- --No BOM & encoding not utf-8. We have to tell the Xtra which decoding method to use.
put _u("+A7EDsgOz-", #u7)  --or: put _s("+A7EDsgOz-").uToD(#u7)
-- αβγ                     --we now got the correct result

uStr = _s("abc-αβγ", 1253).toU(#u7, 1) --adding BOM when encoding...
put uStr
-- "+/v8-abc-+A7EDsgOz-"
put uStr.uToD()                         --...so, no need to tell the Xtra which decoder to use.
-- abc-αβγ
 
     
  Notes If uStrS is actually a strD, the operation will be performed on its SDS equivalent (automatic internal conversion, CP dependent).
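
The decoding precedence described for _u / uToD (BOM first, then uFormat, then utf-8) can be sketched in Python as follows. This is an illustration of the rule only, not the Xtra's implementation: u_to_text, CODECS and BOMS are hypothetical names, the format strings stand in for the Xtra's symbols, and only the "+/v8" utf-7 BOM variant is checked.

```python
import codecs

# Hypothetical mapping from the Xtra's format symbols to Python codec names.
CODECS = {"u7": "utf-7", "u8": "utf-8", "u16": "utf-16-le", "u16b": "utf-16-be"}

# Known BOMs. Only the "+/v8" utf-7 variant is checked in this sketch.
BOMS = [
    (codecs.BOM_UTF8, "u8"),
    (codecs.BOM_UTF16_LE, "u16"),
    (codecs.BOM_UTF16_BE, "u16b"),
    (b"+/v8", "u7"),
]


def u_to_text(data, u_format=None):
    """Decode bytes using the precedence: BOM, then u_format, then utf-8."""
    for bom, name in BOMS:
        if data.startswith(bom):
            # A BOM wins over any requested format. Decode everything, then
            # drop the leading U+FEFF that the BOM itself decodes to.
            text = data.decode(CODECS[name])
            return text[1:]
    # No BOM: use the requested format, or fall back to utf-8.
    return data.decode(CODECS[u_format or "u8"])


print(u_to_text(b"+A7EDsgOz-"))           # no BOM, no format: read as utf-8
print(u_to_text(b"+A7EDsgOz-", "u7"))     # no BOM, utf-7 requested
print(u_to_text(b"+/v8-abc-+A7EDsgOz-"))  # utf-7 BOM present, format detected
```

The three calls correspond to the three utf-7 cases in the examples above: the literal string, the explicitly utf-7 decoded string, and the BOM-prefixed string that needs no format hint.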

[2]
strD.toU({,uFormat} {, addBOM} ) DDS to Unicode Encoded SDS
  Returns <strS> / <err>
  uFormat Symbol unicode format of the resulting strS. Possible values are #u7, #u8 (default), #u16, #u16b
  addBOM Boolean. When true, the Xtra will include the uFormat's BOM in the string. Byte Order Mark is a 'stamp', prefixed to the string's binary data, that describes both the encoding and byte order of the data that follows.
     
  Description Attempts to unicode encode the content of strD, using the uFormat protocol.
     
  Examples put _d("abc-αβγ").toU()       --utf-8 encode, no BOM
-- abc-Ξ±Ξ²Ξ³                  --display, as shown on a Greek system
put _d("abc-αβγ").toU(1)     -- utf-8 encode, with BOM
-- ο»Ώabc-Ξ±Ξ²Ξ³
put _d("abc-αβγ").toU(#u7)    -- utf-7 encode, no BOM
-- abc-+A7EDsgOz-              --system independent display: utf-7 uses ASCII characters only.
put _d("abc-αβγ").toU(#u7, 1) --utf-7 encode, addBOM
-- +/v8-abc-+A7EDsgOz-
 
     
  Notes If strD is actually a strS, the operation will be performed on its DDS equivalent (automatic internal conversion, CP dependent).
#u16 is the format strDs use internally. When uFormat=#u16, the resulting string will contain a copy of the original strD's data, prefixed with the utf-16 byte order mark if addBOM is true.
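
For readers who want to reproduce the encoded output outside Director, here is a minimal Python sketch of the same idea. It is not the Xtra's code: to_u and CODECS are hypothetical names, and the BOM is produced by encoding U+FEFF in front of the text. Decoding the utf-8 bytes through code page 1253 imitates what a Greek system displays.

```python
# Hypothetical mapping from the Xtra's format symbols to Python codec names.
CODECS = {"u7": "utf-7", "u8": "utf-8", "u16": "utf-16-le", "u16b": "utf-16-be"}


def to_u(text, u_format="u8", add_bom=False):
    """Encode text; with add_bom, U+FEFF is encoded in front of the data."""
    prefix = "\ufeff" if add_bom else ""
    return (prefix + text).encode(CODECS[u_format])


text = "abc-αβγ"

# utf-8 bytes shown through the Greek code page 1253, as on a Greek system.
print(to_u(text).decode("cp1253"))                # abc-Ξ±Ξ²Ξ³
print(to_u(text, add_bom=True).decode("cp1253"))  # ο»Ώabc-Ξ±Ξ²Ξ³

# With "u16" the payload is the UTF-16LE data itself, optionally after FF FE.
plain = to_u(text, "u16")
with_bom = to_u(text, "u16", add_bom=True)
print(with_bom == b"\xff\xfe" + plain)            # True

# utf-7 output is plain ASCII, so it displays the same on any system.
print(to_u(text, "u7", add_bom=True).decode("ascii"))
```

The garbled-looking characters in the utf-8 examples are therefore a display effect only: the bytes are valid utf-8, shown one byte at a time through a single-byte code page.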

[3]
strS.uType() get a string's unicode type, by checking for a BOM
  Returns Symbol unicode format / 0 / <err>
     
  Description Checks if strS's data starts with a known BOM. If so, a symbol specifying the string's unicode encoding type is returned. Otherwise, 0.
Unicode types: #u7: utf-7, #u8: utf-8, #u16: utf-16, #u16b: utf-16 big endian.
 
     
  Examples put _s("abcd").uType()
-- 0
put _d("abcd").toU(#u7).uType() --this will return 0, since no BOM was added while encoding.
-- 0
put _d("abcd").toU(#u7, 1).uType()
-- #u7
     
  Notes If strS is actually a DDS, the Xtra will try to convert the data to SDS, and perform the operation on the SDS data. If the conversion fails, an <err> will be returned.