Unicode, My Old Friend
Published 2018-11-26
Do you use emojis in JavaScript? Do you want to?
Would you prefer that they look like this:
"🙌😂👍🎉😍🔥✨💯😏✌️"
Or like this?
"ðððððð¥â¨ð¯ðâï¸"
(and just in case the device you're viewing on happens to force the second string above into recognizable characters, it should look like gobbledygook)
The good news today is this: Imma help you understand JavaScript strings - so that you can keep 'em looking like 💩.
So what's the difference between the two? Well, to be precise it's just 7 lines of VanillaJS:
function ucs2ToBinaryString(str) {
  var escstr = encodeURIComponent(str);
  var binstr = escstr.replace(/%([0-9A-F]{2})/ig, function(_, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
  return binstr;
}
Or, more importantly these 10 lines, which do the opposite:
function binaryStringToUcs2(binstr) {
  var escstr = binstr.replace(/(.)/g, function (m, p) {
    var code = p.charCodeAt(0).toString(16).toUpperCase();
    if (code.length < 2) {
      code = '0' + code;
    }
    return '%' + code;
  });
  return decodeURIComponent(escstr);
}
Now it's true that Node.js Buffers handle utf-8 / ucs2 / binary conversion correctly in every case that I ever recall encountering. It's great.
Buffer.from(str, 'utf8').toString('binary');
Buffer.from(str, 'binary').toString('utf8');
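To make those two one-liners concrete, here's a minimal round-trip sketch. It assumes you're running in Node.js (Buffer isn't a browser global), and the sample string is just one I picked for illustration:

```javascript
// Round trip: UCS-2 string -> binary string -> UCS-2 string
var str = 'Hello, 中国!';

// Each UTF-8 byte becomes one character in the binary string
var bin = Buffer.from(str, 'utf8').toString('binary');

// Treat each character as one byte again, then decode those bytes as UTF-8
var back = Buffer.from(bin, 'binary').toString('utf8');

console.log(str.length);   // 10 (UTF-16 code units)
console.log(bin.length);   // 14 (UTF-8 bytes)
console.log(back === str); // true
```

The binary string is longer because each of the two Chinese characters takes 3 bytes in UTF-8.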
It's also true that, if you're lucky, the browser you're working with supports TextEncoder, and you can go through TypedArrays like this:
var encoder = new TextEncoder();
var buf = encoder.encode("Hello, 中国!");
var bin = '';
buf.forEach(function (i) {
  bin += String.fromCharCode(i);
});
And back again:
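var decoder = new TextDecoder('utf-8');
// "Hello, 中国!" as a binary string; written with escapes because
// some of these bytes are invisible or control characters
var bin = "Hello, \xE4\xB8\xAD\xE5\x9B\xBD!";
var arr = [];
bin.split('').forEach(function (c) {
  arr.push(c.charCodeAt(0));
});
decoder.decode(Uint8Array.from(arr));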
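If you'd rather not retype those loops every time, the two TextEncoder / TextDecoder snippets can be wrapped into a pair of helpers. This is only a sketch: the function names are hypothetical (mine, not from any library), and it assumes TextEncoder and TextDecoder are available as globals.

```javascript
// Hypothetical helper: UCS-2 string -> binary string (one char per UTF-8 byte)
function textToBinaryString(str) {
  var bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  var bin = '';
  bytes.forEach(function (b) {
    bin += String.fromCharCode(b);
  });
  return bin;
}

// Hypothetical helper: binary string -> UCS-2 string
function binaryStringToText(bin) {
  // Assumes every character in `bin` is in the range 0x00-0xFF
  var bytes = Uint8Array.from(bin, function (c) {
    return c.charCodeAt(0);
  });
  return new TextDecoder('utf-8').decode(bytes);
}

var roundTripped = binaryStringToText(textToBinaryString('Hello, 中国!'));
console.log(roundTripped === 'Hello, 中国!'); // true
```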
So now you know the hard way, the easy way, and the overly complicated way.
Next up: Why. And after that: Why.
Backstory
Recently I was doing a deep dive into HTTPS-related cryptography, which led to my post from a week and some change ago, CSR, My Old Friend.
For the most part I was just doing the simple stuff - y'know, converting between base64, hex, buffers and strings. However, I also came across a case where I needed to handle international characters that were embedded into an arbitrary byte stream.
It's a far cry from emojis, but it's actually the same problem and, unlike emojis, it's relevant to the type of stuff that I typically do (being the atypical person that I am).
Unicode, My Old Friend
Many people don't realize this but traditionally JavaScript has two different types of strings: UCS-2 and binary.
If you want a demonstration of the difference between the two, just take a look at the beginning of the article.
UCS-2 is essentially an off-brand UTF-16 - for handling Unicode, like Emojis and Chinese (UTF-16le, to be precise).
The binary variant is technically latin-1, but that's neither here nor there really.
The important things are these:
1. UTF-8 is Lossy
UTF-8 (and its variants) is a lossy format when decoding: any byte sequence that isn't valid UTF-8 is converted to the "Replacement Character" � (yet another old friend!).
You can easily go from a well-formed UCS-2 string to binary and back, but going in the opposite direction (arbitrary binary to UCS-2 and back) won't always work.
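Here's a small sketch of what that loss looks like in practice, assuming TextEncoder and TextDecoder are available. The three sample bytes are my own: 0xFF is a byte that can never appear in well-formed UTF-8.

```javascript
var decoder = new TextDecoder('utf-8');
var encoder = new TextEncoder();

// "Hi" followed by a byte that is never valid in UTF-8
var original = Uint8Array.from([ 72, 105, 0xFF ]);

// The invalid byte is silently swapped for U+FFFD, the Replacement Character
var text = decoder.decode(original);
console.log(text === 'Hi\uFFFD'); // true

// Encoding again gives the bytes of U+FFFD (EF BF BD), not the original 0xFF
var reencoded = Array.from(encoder.encode(text));
console.log(reencoded); // [ 72, 105, 239, 191, 189 ]
```

Once the byte has been replaced there's no way to tell what it originally was, which is why arbitrary bytes are safer kept as a binary string (or a TypedArray) than decoded as UTF-8.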
2. Counting is hard
For example, what would you expect here?
"🎉🔥✨💯".length
4? That would seem most desirable - the number of characters.
Or how about 15, the number of bytes?
Nope! You get 7!!
But why? Well, counting is a whole topic unto itself, so I've split it into a second article, here:
How to count Unicode characters in Javascript
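As a quick sketch of where the three numbers come from (this isn't a substitute for the full article): the string is written with escapes here so that no invisible characters sneak in, and it assumes TextEncoder is available.

```javascript
// Same four emoji as above: 🎉 🔥 ✨ 💯
var str = '\u{1F389}\u{1F525}\u2728\u{1F4AF}';

// .length counts UTF-16 code units; three of the four emoji need two each
console.log(str.length); // 7

// Iterating a string walks it by code point instead
console.log(Array.from(str).length); // 4

// UTF-8 bytes: 4 + 4 + 3 + 4
console.log(new TextEncoder().encode(str).length); // 15
```

Note that counting code points still isn't the same as counting what a person would call a character: some emoji are built from several code points.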
3. binary is the gateway drug
Despite the many advances in JavaScript in recent years (Node Buffers, TextEncoder, etc), it bears repeating that there's no school like the old school.
There's no school like the old school.
(for repetition's sake)
btoa() and atob() work reliably on binary strings, and it's easy to convert binary strings to hex.
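A minimal sketch of both, assuming btoa() and atob() exist as globals (they do in browsers and in recent versions of Node). The binaryStringToHex name is hypothetical, just something I made up for the example.

```javascript
// "Hello, 中国!" as a binary string (one character per UTF-8 byte)
var bin = 'Hello, \xE4\xB8\xAD\xE5\x9B\xBD!';

// Base64 and back, no surprises
var b64 = btoa(bin);
console.log(atob(b64) === bin); // true

// Hypothetical helper: two hex digits per character of the binary string
function binaryStringToHex(binstr) {
  return binstr.split('').map(function (c) {
    var hex = c.charCodeAt(0).toString(16);
    return hex.length < 2 ? '0' + hex : hex;
  }).join('');
}

console.log(binaryStringToHex('Hi!')); // "486921"

// btoa() refuses anything that isn't a binary string
var threw = false;
try {
  btoa('中国');
} catch (e) {
  threw = true;
}
console.log(threw); // true
```

That last part is the reason for converting to a binary string first: btoa() throws on any character above 0xFF rather than guessing at an encoding.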
And although I tend to use Uint8Array a lot myself (though I'm not always sure why), it has a lot of weird caveats and things that neither fail explicitly, nor work as expected. So be wary.
As a quick example:
// Array works as expected
var arr = [ 72, 101, 108, 108, 111 ];
arr.map(function (i) {
  return String.fromCharCode(i);
}).join('');
// "Hello"
// Uint8Array neither works nor fails...
// (it's ES3 all over again)
// map() on a typed array returns another Uint8Array, so each
// string is coerced back to a number (NaN, which is stored as 0)
var arr = Uint8Array.from([ 72, 101, 108, 108, 111 ]);
arr.map(function (i) {
  return String.fromCharCode(i);
}).join('');
// "00000"
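For what it's worth, here are two ways around that particular caveat, as a sketch rather than the one true fix. Both get the values out of the typed array before turning them into strings.

```javascript
var bytes = Uint8Array.from([ 72, 101, 108, 108, 111 ]);

// Array.from() maps into a plain Array, so the strings survive
var viaArrayFrom = Array.from(bytes, function (i) {
  return String.fromCharCode(i);
}).join('');
console.log(viaArrayFrom); // "Hello"

// Or pass all of the bytes as arguments in a single call
var viaApply = String.fromCharCode.apply(null, bytes);
console.log(viaApply); // "Hello"
```

The apply() version passes every byte as a separate argument, so it can blow the call stack on very large arrays; for big buffers, stick with a loop or Array.from().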
That's the good, bad, and ugly.
Good times, eh?
Hopefully that was all useful, interesting, and enlightening.
Now you know the new ways, the old ways (for the inevitable times that you'll run into where you need to use the old school methods), and you're armed and ready to get 💩 done.
😂
You may also like:
- Unicode String to a UTF-8 TypedArray Buffer in JavaScript
- UTF-8, TypedArrays, Base64, Unicode, and You
- Convert a TypedArray Buffer to Base64 in JavaScript
- JavaScript Encoding by @mathias (The OG UCS-2 / UTF-8 JavaScript explanation)
By AJ ONeal