UTF-8, TypedArrays, Base64, Unicode, and You
Published 2015-6-4
Watch on YouTube: youtu.be/G2LGrfcik8A
See also
- Direct TypedArray to Base64 conversion in JavaScript
- Direct Unicode to Uint8Array conversion in JavaScript
Unibabel Demo
This is definitely a case of going down the rabbit hole.
I started on a journey of wanting to perform an md5sum on a text file in the browser and ended up learning all sorts of stuff about web crypto.
Through the journey the thing that kept biting me over and over again was JavaScript's lack of support for UTF-8.
TL;DR
You want one of these three things:
coolaj86's Unibabel
- Unibabel UTF-8 <--> (Typed)Array <--> Base64
At ~100 lines of (not minified) code, it's prolly the most lightweight of your options.
Also, it only needs old-school DOM APIs (TypedArrays optional), so it even works in old / crappy browsers.
Unibabel.utf8ToBuffer("I ½ ♥ 𩶘");
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
Unibabel.utf8ToBase64("I ½ ♥ 𩶘");
// 'SSDCvSDimaUg8Km2mA=='
Feross' Buffer
Node.js' Buffer in the browser. No DOM (pure JS), robust, with all the features you expect.
Mix and Match
If you want the simplest, most poor man's solution, Unibabel is already that.
If you want to get a little more fancy, but keep it slim, you can mix and match.
Base64 <--> (Typed)Array
You'll either use Beatgammit's base64-js (which Feross also uses), Mozilla's StringView, or see below for the ES3 solution with atob, btoa, and binary string magic.
UTF-8 <--> (Typed)Array
You can either use Unibabel's ES3 implementation (way down below) with encodeURIComponent, decodeURIComponent, and some fancy binary string handling, or TextEncoder or a polyfill.
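For the TextEncoder route, a minimal sketch might look like the following. It assumes the standard TextEncoder and TextDecoder globals are present (modern browsers and recent Node.js); older browsers would need a polyfill. The variable names are just for illustration.

```javascript
// Assumes the standard TextEncoder / TextDecoder globals exist.
var encoder = new TextEncoder();          // TextEncoder always produces UTF-8
var decoder = new TextDecoder('utf-8');

var bytes = encoder.encode("I ½ ♥ 𩶘");    // a Uint8Array of UTF-8 bytes
var text = decoder.decode(bytes);         // back to a JavaScript string

console.log(bytes.length);                // 13
console.log(text === "I ½ ♥ 𩶘");         // true
```

Once you have the Uint8Array, any byte-oriented Base64 encoder can take it from there.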
"I ½ ♥ 𩶘"
I don't know what '𩶘' is or what it means, but it seems to be the only 4-byte UTF-8 character that I could find that displays in all browsers and fonts and devices.
A table of UTF-8 characters
bytes | |||||
---|---|---|---|---|---|
1 | ~ | a | 0 | $ | ! |
2 | ¶ | ñ | ε | ¢ | ë |
3 | ♥ | ☢ | ☃ | ‱ | ♣ |
4 | 💩 | 𐑶 | 𐐦 | 🃏 | 𝄢 |
Consider the string "I ½ ♥ 𩶘".
It's 7 characters (4 glyphs and 3 spaces) long, but JavaScript counts them as 8, because .length counts UTF-16 code units and '𩶘' takes two of them (a surrogate pair).
8 === "I ½ ♥ 𩶘".length // Wrong!
Also, it's 13 bytes long when encoded as UTF-8.
13 === Buffer.from("I ½ ♥ 𩶘", 'utf-8').length;
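To see all three counts side by side without any library, something like this sketch should do. It assumes an ES2015 engine for Array.from; the byte count uses the same percent-encoding trick that shows up later in this post.

```javascript
var str = "I ½ ♥ 𩶘";

// .length counts UTF-16 code units, so the 4-byte character counts twice.
var codeUnits = str.length;                 // 8

// Array.from walks the string by code point, so it counts once.
var codePoints = Array.from(str).length;    // 7

// encodeURIComponent emits one %XX per non-ASCII UTF-8 byte (and for spaces);
// collapsing each %XX to a single character leaves one character per byte.
var utf8Bytes = encodeURIComponent(str).replace(/%[0-9A-F]{2}/g, 'x').length; // 13

console.log(codeUnits, codePoints, utf8Bytes);
```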
Not all UTF-8 are created equal
There's more than one way to skin a cat and, unfortunately, there's also more than one way to encode UTF-8.
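It turns out that the original UTF-8 specification (RFC 2279) allowed sequences of up to 6 bytes, but the current and far more popular standard (RFC 3629) stops at 4 bytes, which is enough to cover every Unicode code point up to U+10FFFF.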
Here's an algorithm to encode UTF-8 as 1, 2, 3, 4, 5, or 6 bytes: https://gist.github.com/coolaj86/024a2f332c47306c336c (and the original source: https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding)
What this means is that when choosing a polyfill, you have to choose wisely.
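For comparison, here is a rough sketch of the 4-byte flavor for a single code point. The function name encodeCodePoint is made up for this example, and it is not taken from any of the libraries above. It also skips details a real encoder should handle, such as rejecting surrogate code points.

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes,
// limited to the 4-byte form (code points up to U+10FFFF).
// Sketch only: it does not reject the surrogate range U+D800..U+DFFF.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF) {
    throw new RangeError('code point out of range: ' + cp);
  }
  if (cp < 0x80) {            // 1 byte: 0xxxxxxx
    return [cp];
  }
  if (cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F)
  ];
}

encodeCodePoint(0x49);     // [73]                  "I"
encodeCodePoint(0xBD);     // [194, 189]            "½"
encodeCodePoint(0x2665);   // [226, 153, 165]       "♥"
encodeCodePoint(0x29D98);  // [240, 169, 182, 152]  "𩶘"
```

A 6-byte encoder would simply keep going with two more branches for larger values, which is why the two kinds of polyfill can disagree on input that isn't valid Unicode.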
Best Solutions
In the browser, your best option for UTF-8 <--> TypedArray <--> Base64 will be to use TextEncoder and beatgammit's base64-js polyfill.
If you want more features you should use Feross' Buffer.
If you want tiny code that works in old browsers, but that probably isn't very performant, you can use my hacks (below).
Unibabel: An ancient, lightweight solution that somehow works
It's quite surprising, but after a lot of trying this and that and different tutorials, the best solution seems to be the one that worked even way back in 1999. (WHAT!?)
- https://github.com/coolaj86/unibabel-js UTF-8 <--> (Typed)Array <--> Base64
bower install --save unibabel
It's so small and simple the whole thing fits in just these few lines of code:
From UTF-8
to TypedArray
to Base64
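A sketch of this direction, in the spirit of the technique described here rather than a copy of Unibabel's source, could look like this. The helper names utf8ToBinaryString and binaryStringToBuffer are chosen for illustration, and btoa is assumed to be available as a global.

```javascript
// Sketch only; helper names are illustrative, not necessarily Unibabel's.

// UTF-8 text -> "binary string" (one character per byte, char codes 0-255).
function utf8ToBinaryString(str) {
  // encodeURIComponent turns every non-ASCII character into %XX UTF-8 bytes;
  // converting each %XX back into a single char gives one char per byte.
  return encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Binary string -> Uint8Array (or a plain Array where TypedArrays are missing).
function binaryStringToBuffer(binstr) {
  var buf = typeof Uint8Array !== 'undefined' ? new Uint8Array(binstr.length) : [];
  for (var i = 0; i < binstr.length; i += 1) {
    buf[i] = binstr.charCodeAt(i);
  }
  return buf;
}

function utf8ToBuffer(str) {
  return binaryStringToBuffer(utf8ToBinaryString(str));
}

function utf8ToBase64(str) {
  // btoa only accepts char codes 0-255, which is exactly what a binary string is.
  return btoa(utf8ToBinaryString(str));
}

utf8ToBase64("I ½ ♥ 𩶘"); // 'SSDCvSDimaUg8Km2mA=='
```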
To UTF-8
from TypedArray
from Base64
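The reverse direction can be sketched the same way: turn the bytes into %XX escapes and let decodeURIComponent do the UTF-8 decoding. Again, the helper names are illustrative rather than Unibabel's exact source, and atob is assumed to be available as a global.

```javascript
// Sketch only; helper names are illustrative, not necessarily Unibabel's.

// "Binary string" (one char per byte) -> UTF-8 text.
function binaryStringToUtf8(binstr) {
  // Percent-escape every byte, then let decodeURIComponent decode the UTF-8.
  var escstr = binstr.replace(/[\s\S]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return '%' + (hex.length < 2 ? '0' + hex : hex);
  });
  return decodeURIComponent(escstr); // throws URIError on invalid UTF-8
}

// Uint8Array (or plain Array of bytes) -> binary string.
function bufferToBinaryString(buf) {
  var binstr = '';
  for (var i = 0; i < buf.length; i += 1) {
    binstr += String.fromCharCode(buf[i]);
  }
  return binstr;
}

function bufferToUtf8(buf) {
  return binaryStringToUtf8(bufferToBinaryString(buf));
}

function base64ToUtf8(b64) {
  return binaryStringToUtf8(atob(b64));
}

base64ToUtf8('SSDCvSDimaUg8Km2mA=='); // 'I ½ ♥ 𩶘'
```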
TypedArray <--> Base64
Let's say you want to take any arbitrary array of bytes (not just UTF-8) and put it into base64 (or vice-versa), even on an old browser. Here's an idea for that:
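One way to sketch the idea, assuming btoa and atob are available as globals, is below. The function names are picked for this example; the loop-based conversion is simple rather than fast.

```javascript
// Sketch only: bytes <-> Base64 by way of a "binary string".

function bufferToBase64(buf) {
  var binstr = '';
  for (var i = 0; i < buf.length; i += 1) {
    binstr += String.fromCharCode(buf[i]); // one char per byte
  }
  return btoa(binstr);
}

function base64ToBuffer(b64) {
  var binstr = atob(b64);
  // Fall back to a plain Array where TypedArrays are missing.
  var buf = typeof Uint8Array !== 'undefined' ? new Uint8Array(binstr.length) : [];
  for (var i = 0; i < binstr.length; i += 1) {
    buf[i] = binstr.charCodeAt(i);
  }
  return buf;
}

bufferToBase64([0, 255, 128]); // 'AP+A'
base64ToBuffer('AP+A');        // bytes 0, 255, 128
```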
Notice that it's just a few fancy uses of APIs we've had since ES3:
- btoa
- atob
- encodeURIComponent
- decodeURIComponent
Appendix
By AJ ONeal