UTF-8, TypedArrays, Base64, Unicode, and You
Watch on YouTube: youtu.be/G2LGrfcik8A
This is definitely a case of going down the rabbit hole.
I started on a journey of wanting to perform an md5sum on a text file in the browser and ended up learning all sorts of stuff about web crypto.
You want one of these three things:
- Unibabel UTF-8 <--> (Typed)Array <--> Base64
At ~100 lines of (not minified) code, it's prolly the most lightweight of your options.
Also, it only needs old-school DOM APIs (TypedArrays optional), so it even works in old / crappy browsers.
Unibabel.utf8ToBuffer("I ½ ♥ 𩶘"); // [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
Unibabel.utf8ToBase64("I ½ ♥ 𩶘"); // 'SSDCvSDimaUg8Km2mA=='
Node.js' Buffer in the browser: no DOM (pure JS), robust, and all the features you expect.
Mix and Match
If you want the simplest, most poor man's solution, Unibabel is already that.
If you want to get a little more fancy, but keep it slim, you can mix and match.
Base64 <--> (Typed)Array
UTF-8 <--> (Typed)Array
You can either use Unibabel's ES3 implementation (way down below) with decodeURIComponent and some fancy binary string handling, or use TextEncoder or a polyfill.
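For the TextEncoder route, a minimal sketch might look like this. It assumes a runtime that already has TextEncoder and TextDecoder as globals (modern browsers, recent Node.js), and the helper names utf8ToBytes and bytesToUtf8 are made up for illustration.

```javascript
// Sketch only: assumes global TextEncoder / TextDecoder exist (or a polyfill is loaded).
// utf8ToBytes and bytesToUtf8 are hypothetical helper names.
function utf8ToBytes(str) {
  // TextEncoder always produces UTF-8 and returns a Uint8Array
  return new TextEncoder().encode(str);
}

function bytesToUtf8(bytes) {
  // Decode a Uint8Array of UTF-8 bytes back into a JavaScript string
  return new TextDecoder('utf-8').decode(bytes);
}

var encoded = utf8ToBytes("I ½ ♥ 𩶘");
console.log(Array.from(encoded));
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
console.log(bytesToUtf8(encoded) === "I ½ ♥ 𩶘"); // true
```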
"I ½ ♥ 𩶘"
I don't know what '𩶘' is or what it means, but it seems to be the only 4-byte UTF-8 character that I could find that displays in all browsers and fonts and devices.
A table of UTF-8 characters
Consider the string "I ½ ♥ 𩶘".
8 === "I ½ ♥ 𩶘".length // Wrong! It's 7 characters; '𩶘' counts as 2 UTF-16 code units
Also, it's 13 bytes long.
13 === Buffer.from("I ½ ♥ 𩶘", 'utf-8').length;
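To see all three numbers side by side, here's a small sketch. It assumes an ES2015+ runtime with a global TextEncoder, and measure is just a hypothetical helper name.

```javascript
// Sketch only: measure() is a hypothetical helper, not part of any library.
function measure(str) {
  return {
    // .length counts UTF-16 code units, so astral characters count twice
    utf16Units: str.length,
    // Array.from iterates a string by code point, so '𩶘' counts once
    codePoints: Array.from(str).length,
    // TextEncoder encodes to UTF-8, giving the byte length
    utf8Bytes: new TextEncoder().encode(str).length
  };
}

console.log(measure("I ½ ♥ 𩶘"));
// { utf16Units: 8, codePoints: 7, utf8Bytes: 13 }
```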
Not all UTF-8 is created equal
There's more than one way to skin a cat and, unfortunately, there's also more than one way to encode UTF-8.
The original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes per character, but the current and far more popular standard (RFC 3629) stops at 4 bytes.
Here's an algorithm to encode UTF-8 as 1, 2, 3, 4, 5, or 6 bytes: https://gist.github.com/coolaj86/024a2f332c47306c336c (and the original source: https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding)
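For comparison, the 4-byte flavor boils down to a handful of bit shifts. Here's a minimal sketch (not the code from the gist; codePointToUtf8 is a hypothetical name) that encodes a single code point. The older 5- and 6-byte forms would continue the same pattern with 0xF8 and 0xFC lead bytes.

```javascript
// Sketch only: encodes one Unicode code point as 1 to 4 UTF-8 bytes.
// codePointToUtf8 is a hypothetical name used for illustration.
function codePointToUtf8(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    // Out of Unicode range, or a lone surrogate: not encodable as UTF-8
    throw new RangeError('Invalid code point: ' + cp);
  }
  if (cp < 0x80) {
    return [cp]; // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (cp < 0x10000) {
    return [
      0xE0 | (cp >> 12), // 1110xxxx
      0x80 | ((cp >> 6) & 0x3F),
      0x80 | (cp & 0x3F)
    ];
  }
  return [
    0xF0 | (cp >> 18), // 11110xxx
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F)
  ];
}

console.log(codePointToUtf8("½".codePointAt(0))); // [ 194, 189 ]
console.log(codePointToUtf8("♥".codePointAt(0))); // [ 226, 153, 165 ]
console.log(codePointToUtf8("𩶘".codePointAt(0))); // [ 240, 169, 182, 152 ]
```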
What this means is that when choosing a polyfill, you have to choose wisely.
In the browser your best option for UTF-8 <--> TypedArray <--> Base64 will be to use TextEncoder and beatgammit's base64 Polyfill.
If you want more features you should use Feross' Buffer.
If you want tiny code that works in old browsers, but that probably isn't very performant, you can use my hacks (below).
Unibabel: An ancient, lightweight solution that somehow works
It's quite surprising, but after a lot of trying this and that and different tutorials, the best solution seems to be the one that worked even way back in 1999. (WHAT!?)
- https://github.com/coolaj86/unibabel-js UTF-8 <--> (Typed)Array <--> Base64
bower install --save unibabel
It's so small and simple the whole thing fits in just these few lines of code:
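A minimal sketch of the UTF-8 half of that idea, written in ES3 style, is below. It is not necessarily Unibabel's exact source, and the function names are illustrative assumptions. The trick is that encodeURIComponent already knows how to produce UTF-8 bytes (as %XX escapes), and decodeURIComponent knows how to read them back.

```javascript
// Sketch only: ES3-style helpers with hypothetical names.
function utf8ToBinaryString(str) {
  // encodeURIComponent emits non-ASCII characters as %XX UTF-8 byte escapes;
  // turn each escape into a single character whose char code is that byte
  return encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

function binaryStringToUtf8(binstr) {
  // Escape every byte as %XX and let decodeURIComponent do the UTF-8 decoding
  var escaped = '';
  var i;
  var hex;
  for (i = 0; i < binstr.length; i += 1) {
    hex = binstr.charCodeAt(i).toString(16).toUpperCase();
    escaped += '%' + (hex.length < 2 ? '0' + hex : hex);
  }
  return decodeURIComponent(escaped);
}

function binaryStringToArray(binstr) {
  // One array element per byte; a plain Array works where TypedArrays don't exist
  var out = [];
  var i;
  for (i = 0; i < binstr.length; i += 1) {
    out.push(binstr.charCodeAt(i));
  }
  return out;
}

var binstr = utf8ToBinaryString("I ½ ♥ 𩶘");
console.log(binaryStringToArray(binstr));
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
console.log(binaryStringToUtf8(binstr) === "I ½ ♥ 𩶘"); // true
```

Note that decodeURIComponent throws a URIError if the bytes aren't valid UTF-8, so this only round-trips real UTF-8 text, not arbitrary binary data.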
TypedArray <--> Base64
Let's say you want to take any arbitrary array of bytes (not just UTF-8) and put it into base64 (or vice-versa), even on an old browser. Here's an idea for that:
Notice that it's just a few fancy uses of APIs we've had since ES3:
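One way to sketch it, assuming btoa and atob are available (they are old DOM functions rather than ES3 proper, and recent Node.js versions expose them too); bytesToBase64 and base64ToBytes are hypothetical names:

```javascript
// Sketch only: helper names are hypothetical; assumes global btoa / atob.
function bytesToBase64(bytes) {
  // Build a "binary string": one character per byte, char codes 0 to 255
  var binstr = '';
  var i;
  for (i = 0; i < bytes.length; i += 1) {
    binstr += String.fromCharCode(bytes[i]);
  }
  return btoa(binstr);
}

function base64ToBytes(b64) {
  var binstr = atob(b64);
  // Use a Uint8Array when the runtime has one, otherwise fall back to a plain Array
  var bytes = (typeof Uint8Array !== 'undefined')
    ? new Uint8Array(binstr.length)
    : new Array(binstr.length);
  var i;
  for (i = 0; i < binstr.length; i += 1) {
    bytes[i] = binstr.charCodeAt(i);
  }
  return bytes;
}

var sample = [73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152];
console.log(bytesToBase64(sample)); // 'SSDCvSDimaUg8Km2mA=='
console.log(Array.from(base64ToBytes('SSDCvSDimaUg8Km2mA==')).join(',') === sample.join(',')); // true
```

Building the string one character at a time is slow for big buffers, but it sidesteps the argument-count limits you can hit with String.fromCharCode.apply on large arrays.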
By AJ ONeal