String Conversions

Converting SDS to DDS and vice versa, plus translating (mapping) strings.

[1]
strS.nmc( {#get} ) | xStr(#nmc) 'no map' character
  Returns <nmc strS> / <err>
  #get When used, the Xtra will return the current NMC
     
  Description Sets strS as the No Map Character (NMC).
If strS is an empty string, the NMC will be cleared, and the system's default character will be used.

To get the current NMC's value, use xStr(#nmc), or _s().nmc(#get)

The NMC is used by the toS() command: any DD characters that cannot be mapped to an SD character of the target CP will be replaced with the NMC. If the NMC is not set, the character used to display unmappable characters is decided by the system (usually '?').
     
  Examples put _s().nmc(#get)  --check the current nmc value.
-- <Void>
put _d("abc-αβγ", 1253).toS(1251)
-- abc-???          --unmappable characters are displayed using the default '?' character.
put _s("#").nmc()   --set the NMC to '#'
-- #
put _d("abc-αβγ", 1253).toS(1251)
-- abc-###          --now '#' is used for all unmappable characters.
put _s().nmc()      --reset the NMC. the '?' character will be used in subsequent toS() calls.
-- <Void>

Since a strD object may contain cached SD data from previous calls or automatic conversions, you should clear the strD's cache before using a new NMC.

--Create a strD, with CP=1253. strD will contain Cyrillic, as well as Greek characters.
d=_d("[GR.αβγ]",1253).app("[CY.αβγ]", 1251)
put _s().app(d)      --appending the string to a Greek SDS. SDS data will be cached by d
-- [GR.αβγ][CY.???]  --'?' was used for unmappable (Cyrillic) characters.
put _s("$").nmc()    --set the NMC to '$'
-- $
put d
-- [GR.αβγ][CY.$$$]  --'$' used for unmappable characters.
put _s().app(d)      --unlike the put command above, .app() accesses cached data, if available.
-- [GR.αβγ][CY.???]  --So, the previously cached SD string is appended.
put d.cacheSz(0)     --Clear any cached data...
-- -16
put _s().app(d)      --and try appending again.
-- [GR.αβγ][CY.$$$]

put _s("@").nmc()      --changing nmc again. 'd' now holds the new cached SD data.
-- @
put _s("",1251).app(d) --appending to a strS with different code page will not use the cached data.
-- [GR.@@@][CY.αβγ]
     
  Notes Though NMC accepts double digit characters, only the first digit will be used during conversion (Windows).
According to the Microsoft documentation, using a custom NMC (or 'default character') will affect the performance of the toS() command.
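
For readers who want to try the replacement behaviour outside Director, here is a minimal Python sketch of the same idea. It is not the Xtra's implementation: `to_sds` and its `nmc` parameter are hypothetical names chosen for illustration, and Python's codecs perform no 'best fit' mapping, so results may differ from WideCharToMultiByte for some characters.

```python
def to_sds(text, codepage, nmc="?"):
    """Encode text to a code page, replacing unmappable characters with nmc.

    Hypothetical helper: mimics the role of the NMC, not the Xtra's code.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codepage)
        except UnicodeEncodeError:
            # The character does not exist in the target code page.
            out += nmc.encode(codepage)
    return bytes(out)


# Greek letters do not exist in the Cyrillic code page 1251.
print(to_sds("abc-αβγ", "cp1251"))           # default replacement character
print(to_sds("abc-αβγ", "cp1251", nmc="#"))  # custom replacement character
```

The first call replaces the three Greek letters with '?', the second with '#', which mirrors the two toS() calls in the examples above.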

[2]
strD.toS( {CP} {,#cc} {,flags} ) convert DDS to SDS
  Returns <strS> / <err>
  CP CodePage to be used for the conversion. If set, it overrides the CP stored in the strD object.
  flags Flags to be used.
     
  Description Creates a new strS, by converting the unicode strD to a SBPC or MBPC SDS object (strS).
If a CP is passed, it will be used instead of the strD's CP for the conversion, and CP will become the new object's CP.

The strD to strS conversions are performed using the Windows 'WideCharToMultiByte' function
[Related blog]

In most cases, you don't need to pass any flags to this command.
However, the Xtra provides access to all flags of the WideCharToMultiByte function, as they may be required in special cases. The table below lists the flags and their C++ equivalents, along with instructions and examples that have been kept as simple as possible.
(The Microsoft documentation for these flags is rather confusing.)

flags:
Each entry lists the Xtra flag, its WideCharToMultiByte equivalent, and a description.

(added automatically)  WC_COMPOSITECHECK
  Added automatically if any of the #compXXX flags below is selected. Only one #compXXX flag can be used at a time.
  Compose: combine two or more characters into one, if possible.
  If two or more sequential characters of the source string can be combined into a single precomposed character, and that character exists in the destination CP, the precomposed character is returned instead of the sequence.
  E.g.: letter followed by accent -> letter with accent.
  Selecting one of the following #compXXX flags enables this method and specifies how to handle exceptions.

#comp  WC_COMPOSITECHECK | WC_SEPCHARS (0)
  Compose. Any extra non-spacing characters will be returned as single characters.
  E.g.: letter, accentA, accentB -> letter with accentA, accentB

#compXdropNs  WC_COMPOSITECHECK | WC_DISCARDNS
  Compose when possible, discarding extra non-spacing characters.
  E.g.: letter, accentA, accentB -> letter with accentA (accentB is ignored)

#compXnmc  WC_COMPOSITECHECK | WC_DEFAULTCHAR
  Try to compose. If a non-spacing character cannot be added to the composition, return the NMC for the entire sequence.
  E.g.:
  letterA, accentA, accentB -> NMC
  letterA, accentA, accentAA -> letterA with accents A and AA

#nbToNmc  WC_NO_BEST_FIT_CHARS
  Windows 98, Windows 2000 and later.
  Replace any non-bidirectional characters with the NMC.
  A non-bidirectional character is a DD character that can be mapped to an SD character, but converting that SD character back to DD does not return the original DD character:
  a.toS().toD()<>a

#nmcErr  WC_ERR_INVALID_CHARS
  If the source string contains unmappable characters, return <err>. Since the WC_ERR_INVALID_CHARS flag is Vista+ only, custom code has been used for this command to support all Windows versions.
     
  Examples put _d("abc-αβγ").toS()
-- abc-αβγ
put _d("abc-αβγ", 1251).toS()
-- abc-αβγ
put _d("abc-αβγ", 1251).toS(1253) -- the original data were converted to Cyrillic Unicode characters.
-- abc-???                        --the Greek codepage 1253 contains no Cyrillic characters.

d=_dcs("0075", "0308") --u ̈  (2 DD characters)
put d.tos(1252).pop().cList(#hex)
-- [75, A8]            --u¨(2 SD characters)
put d.tos(1252, #comp).pop().toClip().cList(#hex)
-- [FC]                --ü (1 SD character)

d=_dcs("0075", "0308", "0308").cp(1252)      --u ̈̈ (3 characters - overlapping)
put d.toS(#comp).pop().cList(#hex)        --ü¨ (1 composite + 1 non spacing character)
-- [FC, A8]
put d.toS(#compXdropNs).pop().cList(#hex) --ü (1 composite character)
-- [FC]
put d.toS(#compXnmc).pop().cList(#hex)    --? (1 unmappable character)
-- [3F]
put d.toS(#compXnmc, #nmcErr)               --same command as above, but with #nmcErr enabled.
-- <xErr 1113 No mapping for the Unicode character exists in the target multi-byte code page.>

put d.toS(#nbToNmc).pop().cList(#hex)     --u?? (1 character + 2 non-bidir -> nmc)
-- [75, 3F, 3F]
put d.toS(1252, #comp, #nbToNmc).pop().cList(#hex) -- ü? (1 composite + 1 non-bidirectional)
-- [FC, 3F]
 
     
  Notes If strD is actually a SDS, the result of the sequence strD.toD().toS( {cp} {,flags}) will be returned.
The .toS() command does not use any cached data stored in the source object.
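
The effect of composing before conversion can be reproduced outside the Xtra with Unicode normalization. The sketch below is an illustration only, under the assumption that NFC normalization is an acceptable stand-in for WC_COMPOSITECHECK; `hex_bytes` and `to_sds_composed` are hypothetical names. Python's cp1252 codec does not map the combining diaeresis (U+0308) to A8 the way Windows best-fit mapping does in the examples above, so leftover non-spacing characters appear here as '?' (3F).

```python
import unicodedata


def hex_bytes(data):
    """Return the bytes as a list of two-digit hex strings."""
    return [f"{b:02X}" for b in data]


def to_sds_composed(text, codepage):
    """Compose (NFC) first, then encode; unmappable characters become '?'."""
    composed = unicodedata.normalize("NFC", text)
    return composed.encode(codepage, errors="replace")


one_accent = "u\u0308"          # u + combining diaeresis
two_accents = "u\u0308\u0308"   # u + two combining diaereses

# Without composition the accent has no cp1252 equivalent in Python.
print(hex_bytes(one_accent.encode("cp1252", errors="replace")))
# With composition the pair becomes the single character ü (FC).
print(hex_bytes(to_sds_composed(one_accent, "cp1252")))
# Only the first accent can be composed; the second one is left over.
print(hex_bytes(to_sds_composed(two_accents, "cp1252")))
```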

[3]
strD.toSL( {#standard} {,#method} ) convert DDS to a list of SBCS SDSs
  Returns <strS> / <err>
  CP CodePage to be used for the conversion. If set, it overrides the CP stored in the strD object.
  #standard #win (default) or #mac. Specifies the type of SDS strings (CodePages) to return.
  #method #err: if the string contains unmappable characters, return an <err>
#ref: if the string contains unmappable characters, return a reference to the first unmappable character
#prop: return a propList of CP:strS pairs.
#propX: return a propList of CP:strS pairs. The property CP will be 0 for all 7bit characters, and -1 for unmappable characters.
     
  Description Attempts to convert strD to a sequence of SBPC strSs.
If a character (or sequence of characters) can't be converted to a strS (e.g. belonging to a DBPC code page), that part of the string will be returned as a strD.

If #prop or #propX is used, the result will be a propList containing cp:strS pairs.
The cp property of each pair will be:
-1: for parts of the original string that could not be mapped to SBPC code pages (the value will be a strD).
0: (#propX only) if the strS's characters exist in all SBPC code pages.
CP: if the strS's characters belong to a specific code page.

The CP values of the parsed str objects will always be valid code pages. Preferred order: strD's CP > glbCP > 1252(win) / 10000(mac)
To display the returned strings in a text or field member, use for each string a font that matches its code page.
     
  Examples put _d("abcαβγ").app("abcαβγ", 1251).toSL()
-- [abcαβγabc, αβγ]
put _d("abcαβγ").app("abcαβγ", 1251).toSL(#prop)
-- [1253: abcαβγabc, 1251: αβγ]
put _d("abcαβγ").app("abcαβγ", 1251).toSL(#propX)
-- [0: abc, 1253: αβγ, 0: abc, 1251: αβγ]

put _d("abcαβγ").app("abcαβγ", 1251).toSL(#mac, #prop)
-- [10000: abc, 10006: αβηabc, 10007: αβγ]
put _d("abcαβγ").app("abcαβγ", 1251).toSL(#mac, #propX)
-- [0: abc, 10006: αβη, 0: abc, 10007: αβγ]
     
  Notes For the first example, you could use the Arial-Greek font to display the first str in the list, and Arial-Cyr for the second.
For methods other than #propX, the results of this command may vary for machines with different language settings, since language (defaultCP) affects the preferred order.

[4]
strS.toD( {CP} {,flags} ) convert SDS to DDS
  Returns <strD> / <err>
  CP Forces the Xtra to use CP instead of strS's CP for the conversion.
  flags Flags to be used.
     
  Description Creates a new strD, by converting the SBPC or MBPC strS to a unicode DDS object (strD).
If a CP is passed, it will be used instead of the strS's CP for the conversion, and CP will become the new object's CP.

The strS to strD conversion is performed using the Windows 'MultiByteToWideChar' function.
[Related blog]
 
flags:
Each entry lists the Xtra flag, its MultiByteToWideChar equivalent, and a description.

(default)  MB_PRECOMPOSED
  Do not split precomposed characters (default).

#decomp  MB_COMPOSITE
  If strS contains characters that can be decomposed to a sequence of characters, return the decomposed characters.
  E.g.: letter with accent -> letter, accent

#glyphsOnly  MB_USEGLYPHCHARS
  related blog (msie: setting encoding to utf-8 may be required)

#nmcErr  MB_ERR_INVALID_CHARS
  This should normally be #invErr, but, for simplification, the toS command's flag has been used.
  Check if an invalid character exists in the source string.
  A lead byte of an MBCS SDS that is not followed by a legal byte is considered an invalid character.
  Note that invalid characters may also be contained in strings with a Unicode CP, but a strX object will never have such a code page: Unicode conversions are handled by the .uToD() / .toU() commands.
     
  Examples s=_sc("c4", 1252)           --Ä = [00C4] | A¨=[0041, 0308]
put s.toD().cList(#hex), s.toD(#decomp).cList(#hex)
-- [00C4] [0041, 0308] --the contents of the strings created with #comp and with #decomp

put _sc("82", 932 ).toD()
--
put _sc("82", 932).toD(#nmcErr)
-- <xErr 1113 No mapping for the Unicode character exists in the target multi-byte code page.>
     
  Notes If strS is actually a DDS, the result will be a copy of the original object { with CP=CP }
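
The same two ideas, decomposition after decoding and rejecting a dangling lead byte, can be tried in Python. This is a sketch under the assumption that NFD normalization approximates MB_COMPOSITE; `code_units` and `is_valid_sds` are hypothetical helper names, not part of the Xtra.

```python
import unicodedata


def code_units(text):
    """Return the characters of text as four-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


def is_valid_sds(data, codepage):
    """True if the bytes decode cleanly in the given code page."""
    try:
        data.decode(codepage)
        return True
    except UnicodeDecodeError:
        return False


precomposed = bytes([0xC4]).decode("cp1252")            # Ä as one character
decomposed = unicodedata.normalize("NFD", precomposed)  # A + combining diaeresis
print(code_units(precomposed), code_units(decomposed))

# In code page 932 the byte 82 is a lead byte; alone it is invalid.
print(is_valid_sds(b"\x82", "cp932"))
print(is_valid_sds(b"\x82\xa0", "cp932"))
```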

[5]
strX.map( flags ) map (foldString)
  Returns <strX> / <err>
  flags Flags to be used.
     
  Description Returns a new string after processing the original strX according to flags.

The processing is performed by the Windows 'FoldString' function. [Related blog]

flags:
Each entry lists the Xtra flag, its FoldString equivalent, and a description.

#comp  MAP_PRECOMPOSED
  If strX contains a sequence of characters that can be composed to a single precomposed character, return the precomposed character.
  E.g.: letter, accent -> letter with accent

#decomp  MAP_COMPOSITE
  If strX contains accented characters that can be decomposed to a sequence of characters, return the decomposed characters.
  E.g.: letter with accent -> letter, accent

#lgDecomp  MAP_EXPAND_LIGATURES
  Decompose ligatures to character sequences.
  E.g.: æ -> ae

#stdDec  MAP_FOLDDIGITS
  Map the decimal digits of scripts that use their own digit characters to the equivalent Unicode digits 0-9.
  E.g.: map the Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩ to 0123456789
  [Related blog]

#cZone  MAP_FOLDCZONE
  See FoldString, remarks section.
     
  Examples d=_dcs("0041", "0308")
put d, d.map(#comp), d.map(#comp).map(#decomp)
-- A¨ Ä A¨             --as seen on Western (1252) systems.

d=_dc("00e6")        
put d, d.map(#lgDecomp)
-- æ ae                --as seen on Western (1252) systems.
indic=_dcs("0660", "0661", "0662")    --٠١٢ = 012
put indic.pop().cList(#hex), indic.map(#stdDec).pop()
-- [0660, 0661, 0662] 012 --Arabic-Indic digits ٠١٢ mapped to 012


Filtering out accents and capitalizing text, e.g. for indexing or accent-insensitive searches:
s=_s("Ελληνικό κείμενο")
put s.map(#decomp).sRep("´").upper
-- ΕΛΛΗΝΙΚΟ ΚΕΙΜΕΝΟ

or:
put s.toD().map(#decomp).toS().sRep("΄").upper
-- ΕΛΛΗΝΙΚΟ ΚΕΙΜΕΝΟ

     
  Notes This command can be called on strSs, as long as they use the system's codepage, but may not work as expected for all CodePages.
It is suggested to use this instead: result = strS.toD().map(flags).toS()
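
For comparison, the accent-filtering and digit-folding recipes can be approximated with Python's standard library. This is only a sketch of the technique: it assumes NFD normalization in place of MAP_COMPOSITE and `unicodedata.decimal` in place of MAP_FOLDDIGITS, and the function names are hypothetical.

```python
import unicodedata


def strip_accents_upper(text):
    """Decompose, drop the combining marks, then capitalize."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.upper()


def fold_digits(text):
    """Map any Unicode decimal digit to its ASCII 0-9 equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


print(strip_accents_upper("Ελληνικό κείμενο"))
print(fold_digits("\u0660\u0661\u0662"))  # Arabic-Indic digits 0, 1, 2
```

Dropping every combining mark is broader than removing a single accent character, so it also handles text that carries other diacritics.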

[6]
strX.isS() / strX.isD() check string type (DDS / SDS)
  Returns True / False
     
  Description isS() returns true if strX is a SDS, and false if it is a DDS.
isD() returns true if strX is a DDS, and false if it is a SDS.
     
  Examples put _s("abc").isS()
-- 1
     
  Notes  

[7]
strD.forceS() treat strD as strS
  Returns <strD as S> / <err>
     
  Description Forces the Xtra to treat strD as strS. The original object is returned (as strS) to the command line.
     
  Examples str=_d("ab")
put str.cList(#hex)
-- [0061, 0062]
put str.forceS().cList(#hex)
-- [61, 00, 62, 00]
     
  Notes You can use this command to turn a strD into a strS object, so that you can access or modify its binary content more easily. The object can be converted back to strD using the strS.forceD() command.

[8]
strS.forceD() treat strS as strD
  Returns <strS as D> / <err>
     
  Description Forces the Xtra to treat strS as strD.
If strS's length in bytes is not even, an <err> is returned.
Otherwise, the original object will be returned (as strD) to the command line.
     
  Examples str=_s("abcd")
put str.cList(#hex)
-- [61, 62, 63, 64]
put str.forceD().cList(#hex)
-- [6261, 6463]

put _s("abc").forceD() --odd number of bytes.
-- <xErr 66623 InvalidData>
     
  Notes
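
The byte-level view behind forceS() and forceD() can be illustrated in Python. The sketch assumes 16-bit little-endian storage, as the Windows examples above suggest; `force_s` and `force_d` are hypothetical stand-ins, not the Xtra's functions.

```python
import struct


def force_s(text):
    """View a 16-bit string as its raw little-endian bytes."""
    return text.encode("utf-16-le")


def force_d(data):
    """View raw bytes as 16-bit little-endian units; odd lengths are rejected."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


print([f"{b:02X}" for b in force_s("ab")])       # bytes of the 16-bit string
print([f"{u:04X}" for u in force_d(b"abcd")])    # 16-bit units of the bytes
```

No conversion takes place in either direction: the same bytes are simply read with a different unit size, which is why an odd byte count cannot be treated as a DDS.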