UTF-8, TypedArrays, Base64, Unicode, and You

Watch on YouTube: youtu.be/G2LGrfcik8A

Unibabel Demo

This is definitely a case of going down the rabbit hole.

I started on a journey of wanting to perform an md5sum on a text file in the browser and ended up learning all sorts of stuff about web crypto.

Throughout the journey, the thing that kept biting me over and over again was JavaScript's lack of built-in support for UTF-8.

TL;DR

You want one of these three things:

coolaj86's Unibabel

Unibabel UTF-8 <--> (Typed)Array <--> Base64

At ~100 lines of (not minified) code, it's prolly the most lightweight of your options.

Also, it only needs old-school DOM APIs (TypedArrays optional), so it even works in old / crappy browsers.

Unibabel.utf8ToBuffer("I ½ ♥ 𩶘");
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
Unibabel.utf8ToBase64("I ½ ♥ 𩶘");
// 'SSDCvSDimaUg8Km2mA=='

Feross' Buffer

https://github.com/feross/buffer

Node.js' Buffer in the browser. No DOM (pure JS), robust, with all the features you expect.
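Here's a rough sketch of what that looks like. It assumes a `Buffer` that follows the Node.js API: in Node.js it's a global, and in the browser you'd have to load Feross' Buffer first (how you load it depends on your bundler, so that part is left out). The variable names are just mine.

```javascript
// Assumption: `Buffer` is available (global in Node.js, or provided by
// feross/buffer in the browser).
var buf = Buffer.from("I ½ ♥ 𩶘", "utf8");

// UTF-8 bytes -> Base64
var b64 = buf.toString("base64");

// Base64 -> UTF-8 bytes -> string
var roundTrip = Buffer.from(b64, "base64").toString("utf8");

console.log(buf.length); // 13
console.log(b64);        // 'SSDCvSDimaUg8Km2mA=='
console.log(roundTrip === "I ½ ♥ 𩶘"); // true
```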

Mix and Match

If you want the simplest, most poor man's solution, Unibabel is already that.

If you want to get a little fancier, but keep it slim, you can mix and match.

Base64 <--> (Typed)Array

You can use beatgammit's base64-js (which Feross also uses), Mozilla's StringView, or the old-school solution further down, which uses atob, btoa, and some binary string magic.

UTF-8 <--> (Typed)Array

You can either use Unibabel's old-school implementation (way down below) with encodeURIComponent, decodeURIComponent, and some fancy binary string handling, or use TextEncoder (or a polyfill for it).
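For the TextEncoder route, a minimal sketch looks like this. It assumes TextEncoder and TextDecoder exist as globals (modern browsers and recent Node.js); on older browsers you'd need a polyfill, which isn't shown here.

```javascript
// TextEncoder always encodes to UTF-8; TextDecoder defaults to UTF-8 as well.
var encoder = new TextEncoder();
var decoder = new TextDecoder("utf-8");

// string -> Uint8Array of UTF-8 bytes
var bytes = encoder.encode("I ½ ♥ 𩶘");

// Uint8Array of UTF-8 bytes -> string
var text = decoder.decode(bytes);

console.log(Array.from(bytes));
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
console.log(text === "I ½ ♥ 𩶘"); // true
```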

"I ½ ♥ 𩶘"

I don't know what '𩶘' is or what it means, but it seems to be the only 4-byte UTF-8 character that I could find that displays in all browsers and fonts and devices.

A table of UTF-8 characters

bytes	example characters
1	~	a	0	$	!
2	¶	ñ	ε	¢	ë
3	♥	☢	☃	‱	♣
4	💩	𐑶	𐐦	🃏	𝄢

Consider the string "I ½ ♥ 𩶘".

It's 7 characters (4 glyphs and 3 spaces) long, but JavaScript counts 8, because .length counts UTF-16 code units and '𩶘' takes two of them (a surrogate pair).

8 === "I ½ ♥ 𩶘".length // Wrong!

Also, it's 13 bytes long.

13 === Buffer.from("I ½ ♥ 𩶘", 'utf-8').length; // new Buffer(...) is deprecated
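To see all three numbers side by side, here's a small sketch. It assumes an ES2015+ environment (for Array.from over a string) with TextEncoder available; the variable names are just for illustration.

```javascript
var str = "I ½ ♥ 𩶘";

// .length counts UTF-16 code units; 𩶘 needs two of them (a surrogate pair)
var codeUnits = str.length;

// Array.from walks the string by code point, so the surrogate pair counts once
var codePoints = Array.from(str).length;

// TextEncoder produces the UTF-8 bytes
var byteCount = new TextEncoder().encode(str).length;

console.log(codeUnits, codePoints, byteCount); // 8 7 13
```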

Not all UTF-8 is created equal

There's more than one way to skin a cat and, unfortunately, there's also more than one way to encode UTF-8.

The original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes, but the current and far more popular standard (RFC 3629) stops at 4 bytes, which is enough to reach the last Unicode code point, U+10FFFF.

Here's an algorithm to encode UTF-8 as 1, 2, 3, 4, 5, or 6 bytes: https://gist.github.com/coolaj86/024a2f332c47306c336c (and the original source: https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding)

What this means is that when choosing a polyfill, you have to choose wisely.
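To make the 4-byte flavor concrete, here's a sketch of encoding a single code point by hand. This is my own illustration of the RFC 3629 bit layout, not the code from the gist above; the function name encodeCodePoint is made up, and it doesn't bother rejecting lone surrogates.

```javascript
// Encode one Unicode code point as UTF-8 (at most 4 bytes).
// Assumption: cp is an integer; surrogate code points are not rejected here.
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    // 0xxxxxxx
    return [cp];
  }
  if (cp < 0x800) {
    // 110xxxxx 10xxxxxx
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp < 0x10000) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  if (cp <= 0x10FFFF) {
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [
      0xF0 | (cp >> 18),
      0x80 | ((cp >> 12) & 0x3F),
      0x80 | ((cp >> 6) & 0x3F),
      0x80 | (cp & 0x3F)
    ];
  }
  throw new RangeError("Not a valid Unicode code point: " + cp);
}

console.log(encodeCodePoint("I".codePointAt(0)));  // [ 73 ]
console.log(encodeCodePoint("½".codePointAt(0)));  // [ 194, 189 ]
console.log(encodeCodePoint("♥".codePointAt(0)));  // [ 226, 153, 165 ]
console.log(encodeCodePoint("𩶘".codePointAt(0))); // [ 240, 169, 182, 152 ]
```

A 6-byte encoder just keeps going with two more branches, which is why a polyfill built on the older definition can produce bytes that a strict 4-byte decoder rejects.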

Best Solutions

In the browser, your best option for UTF-8 <--> TypedArray <--> Base64 will be to use TextEncoder and beatgammit's base64-js polyfill.

If you want more features you should use Feross' Buffer.

If you want tiny code that works in old browsers, but that probably isn't very performant, you can use my hacks (below).

Unibabel: An ancient, lightweight solution that somehow works

It's quite surprising, but after a lot of trying this and that and different tutorials, the best solution seems to be the one that worked even way back in 1999. (WHAT!?)

https://github.com/coolaj86/unibabel-js UTF-8 <--> (Typed)Array <--> Base64

bower install --save unibabel

It's so small and simple the whole thing fits in just these few lines of code:

From UTF-8

to TypedArray
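A sketch of the idea, written from the description above rather than copied from the library source (so the helper names and details are my assumptions and may differ from what Unibabel actually ships): encodeURIComponent percent-encodes the UTF-8 bytes, and turning each %XX back into a single character gives a "binary string" that maps one-to-one onto bytes.

```javascript
// UTF-8 string -> binary string (one char per byte, char codes 0..255)
function utf8ToBinaryString(str) {
  // encodeURIComponent emits %XX (uppercase hex) for every UTF-8 byte it escapes;
  // a literal "%" in the input becomes %25, so this replace is unambiguous
  return encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// binary string -> Uint8Array (or a plain Array where TypedArrays are missing)
function binaryStringToBuffer(binstr) {
  var buf = (typeof Uint8Array !== "undefined")
    ? new Uint8Array(binstr.length)
    : new Array(binstr.length);
  for (var i = 0; i < binstr.length; i += 1) {
    buf[i] = binstr.charCodeAt(i);
  }
  return buf;
}

function utf8ToBuffer(str) {
  return binaryStringToBuffer(utf8ToBinaryString(str));
}

console.log(Array.prototype.slice.call(utf8ToBuffer("I ½ ♥ 𩶘")));
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
```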

to Base64
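Again a sketch under the same caveat (my reconstruction of the technique, not the library's exact code): btoa only accepts characters in the 0..255 range, so the string goes through the percent-encoding trick first. It assumes btoa is available as a global.

```javascript
// UTF-8 string -> Base64
function utf8ToBase64(str) {
  // Step 1: UTF-8 string -> binary string (one char per UTF-8 byte)
  var binstr = encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
  // Step 2: binary string -> Base64 (btoa throws on char codes above 255,
  // which is why calling btoa(str) directly fails for most non-ASCII text)
  return btoa(binstr);
}

console.log(utf8ToBase64("I ½ ♥ 𩶘"));
// 'SSDCvSDimaUg8Km2mA=='
```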

To UTF-8

from TypedArray
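Going the other way is the same trick in reverse. This sketch (helper names are mine, not necessarily the library's) turns every byte into %XX and lets decodeURIComponent do the actual UTF-8 decoding; it will throw a URIError if the bytes aren't valid UTF-8.

```javascript
// Array or Uint8Array of bytes -> binary string
function bufferToBinaryString(buf) {
  var binstr = "";
  for (var i = 0; i < buf.length; i += 1) {
    binstr += String.fromCharCode(buf[i]);
  }
  return binstr;
}

// binary string -> UTF-8 string
// Assumption: every char code in binstr is in the range 0..255
function binaryStringToUtf8(binstr) {
  // [\s\S] instead of . so that newlines get escaped too
  var escstr = binstr.replace(/[\s\S]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "%" + (hex.length < 2 ? "0" + hex : hex);
  });
  return decodeURIComponent(escstr);
}

function bufferToUtf8(buf) {
  return binaryStringToUtf8(bufferToBinaryString(buf));
}

console.log(bufferToUtf8([ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]));
// 'I ½ ♥ 𩶘'
```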

from Base64
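And from Base64, sketched the same way (my reconstruction, assuming atob is available as a global): atob hands back a binary string, which then gets percent-encoded and decoded as UTF-8.

```javascript
// Base64 -> UTF-8 string
function base64ToUtf8(b64) {
  // Step 1: Base64 -> binary string (one char per byte)
  var binstr = atob(b64);
  // Step 2: binary string -> %XX%XX... -> UTF-8 string
  var escstr = binstr.replace(/[\s\S]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "%" + (hex.length < 2 ? "0" + hex : hex);
  });
  return decodeURIComponent(escstr);
}

console.log(base64ToUtf8("SSDCvSDimaUg8Km2mA=="));
// 'I ½ ♥ 𩶘'
```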

TypedArray <--> Base64

Let's say you want to take any arbitrary array of bytes (not just UTF-8) and put it into Base64 (or vice versa), even on an old browser. Here's an idea for that:
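A minimal sketch of that idea, assuming btoa and atob are available as globals. The function names are mine, and it falls back to a plain Array where Uint8Array doesn't exist.

```javascript
// bytes (Array or Uint8Array) -> Base64
function bufferToBase64(buf) {
  var binstr = "";
  for (var i = 0; i < buf.length; i += 1) {
    // one character per byte, char codes 0..255
    binstr += String.fromCharCode(buf[i]);
  }
  return btoa(binstr);
}

// Base64 -> bytes (Uint8Array if available, otherwise a plain Array)
function base64ToBuffer(b64) {
  var binstr = atob(b64);
  var buf = (typeof Uint8Array !== "undefined")
    ? new Uint8Array(binstr.length)
    : new Array(binstr.length);
  for (var i = 0; i < binstr.length; i += 1) {
    buf[i] = binstr.charCodeAt(i);
  }
  return buf;
}

var b64 = bufferToBase64([ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]);
console.log(b64); // 'SSDCvSDimaUg8Km2mA=='
console.log(Array.prototype.slice.call(base64ToBuffer(b64)));
// [ 73, 32, 194, 189, 32, 226, 153, 165, 32, 240, 169, 182, 152 ]
```

Building the string one character at a time is slow for big buffers, but it avoids the argument-count limits you can hit with String.fromCharCode.apply on large arrays.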

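Notice that it's just a few fancy uses of APIs that browsers have had for ages (encodeURIComponent and decodeURIComponent since ES3, btoa and atob as old-school DOM functions):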

btoa
atob
encodeURIComponent
decodeURIComponent

Appendix

https://mathiasbynens.be/notes/javascript-encoding

By AJ ONeal
