How to count Unicode characters in JavaScript
Published 2018-11-27
Consider the following two strings:
String A:
"🙌😂👍🎉😍🔥✨💯😏✌️"
String B:
"ðððððð¥â¨ð¯ðâï¸"
(image of the above text, for reference)
Guess what!? They're the same string!!
Don't believe me? Let me help your unbelief:
It's quite obvious to even the most casual of observers that String A shows 10 characters, but that's not what .length reports. It says there are 19 characters.
Huh!?
But wait, there's more! String B shows... a lot of characters (15 to be exact, depending on your web browser) and yet... .length reports 41. And you know what?
That's the closest to the truth.
So what makes the difference?
The first is a set of Unicode emojis interpreted as our good friend utf-8.
The second is the same set of Unicode emojis, byte for byte, interpreted as our (sometimes better, sometimes worse) friend latin-1.
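To make the "same bytes, two readings" idea concrete, here is a minimal sketch. It assumes a runtime with a global TextDecoder (modern browsers and recent Node.js), and the variable names are made up for illustration.

```javascript
// The three UTF-8 bytes of the sparkles emoji (U+2728).
var bytes = [0xE2, 0x9C, 0xA8];

// Reading 1: treat every byte as its own latin-1 character.
var asLatin1 = String.fromCharCode.apply(null, bytes);

// Reading 2: decode the very same bytes as UTF-8.
// Assumption: TextDecoder exists as a global in this runtime.
var asUtf8 = new TextDecoder("utf-8").decode(new Uint8Array(bytes));

console.log(asLatin1.length);      // 3
console.log(asUtf8.length);        // 1
console.log(asUtf8 === "\u2728");  // true
```

The latin-1 reading is what String B looks like: one visible (or invisible) character per byte.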
Many people don't realize this but traditionally JavaScript has two different types of strings: UCS-2 and binary.
UCS-2 is essentially an off-brand (but workable) UTF-16, for handling Unicode, like emojis and Chinese.
The binary variant is technically latin-1, but whatever.
And what about the length?
Counting binary is fairly easy.
Binary strings are well-behaved, reliably single-byte strings.
That means that string length and byte length are synonymous:
"Hello".length === 5
makes sense.
However, the same is not true for Unicode. Counting Unicode is... not so easy.
In JavaScript, Unicode strings are always stored as two-byte units, even if the character only requires a single byte, or even if the character requires a dozen bytes (and displays as a single character).
As such, when you have a character that needs more than 2 bytes, JavaScript will report its length incorrectly.
For example, this is really, really wrong: "💩".length === 2.
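A short sketch of where that 2 comes from: the emoji is stored as a pair of two-byte units (a surrogate pair), and .length counts the units. The escape sequence is used so the exact units are visible; the variable name is only illustrative.

```javascript
// 💩 is U+1F4A9, stored as the surrogate pair D83D DCA9.
var poo = "\uD83D\uDCA9";

console.log(poo.length);                      // 2
console.log(poo.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(poo.charCodeAt(1).toString(16));  // "dca9" (low surrogate)
console.log(poo.codePointAt(0).toString(16)); // "1f4a9" (the real code point, ES6)
```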
Likewise, you can't rely on String.prototype.split() for anything other than simple ASCII characters. Don't believe me? Well that's stupid, but you can try "🎉🔥✨💯".split('') to convince yourself.
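Here is a small sketch of what goes wrong, next to one possible alternative. Array.from is ES6 and is swapped in here only for comparison; it splits by code point, not by what a reader would call a character. The string is written with escapes so the units are unambiguous.

```javascript
// 🎉🔥✨💯 spelled out as UTF-16 code units.
var party = "\uD83C\uDF89\uD83D\uDD25\u2728\uD83D\uDCAF";

// split('') cuts between every two-byte unit, tearing surrogate pairs apart.
var byUnit = party.split("");

// Array.from walks the string by code point, keeping the pairs together.
var byCodePoint = Array.from(party);

console.log(byUnit.length);                      // 7
console.log(byCodePoint.length);                 // 4
console.log(byUnit[0] === "\uD83C");             // true: a lone high surrogate
console.log(byCodePoint[0] === "\uD83C\uDF89");  // true: the whole 🎉
```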
Counting Bytes
If all you're interested in is the byte-length of Unicode characters, VanillaJS can do that for you quite easily.
First you have to escape it with encodeURIComponent(str), which will replace all non-ASCII characters with hex escape sequences (each denoted by a preceding %), and then you replace the escapes with binary strings.
That looks like this:
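function ucs2ToBinaryString(str) {
  // Percent-encode the string as UTF-8, e.g. "✨" becomes "%E2%9C%A8"
  var escstr = encodeURIComponent(str);
  // Turn each %XX escape back into a single character with that byte value
  var binstr = escstr.replace(/%([0-9A-F]{2})/ig, function(match, hex) {
    var i = parseInt(hex, 16);
    return String.fromCharCode(i);
  });
  return binstr;
}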
Tada! Reliable byte counts!
ucs2ToBinaryString("🙌😂👍🎉😍🔥✨💯😏✌️").length
// 41 bytes long
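If the runtime provides a global TextEncoder (an assumption; it is not part of the ES5-friendly approach above), the same byte count can be sketched more directly. The function name utf8ByteLength is made up for this example.

```javascript
// Counts the bytes needed to store a string as UTF-8.
// Assumption: TextEncoder exists as a global (modern browsers, recent Node.js).
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength("Hello"));         // 5
console.log(utf8ByteLength("\u2728"));        // 3 (✨)
console.log(utf8ByteLength("\uD83D\uDCA9"));  // 4 (💩)
```

One difference worth knowing: encodeURIComponent throws on a lone surrogate, while TextEncoder quietly substitutes the replacement character U+FFFD.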
Counting Characters
When I can, I prefer to build on no-nonsense solutions that are available in JavaScript ES5.1 and earlier, and which can be easily polyfilled.
Although String.prototype.codePointAt(i) wasn't introduced until ES6, and it isn't something that can be easily polyfilled in just a few lines of code, I find it to be a very elegant solution with great support.
Anyway, the point is that we can hack together a decent solution using the new "💩".codePointAt(0) which, unlike the old "💩".charCodeAt(0), will return the whole code point rather than just the first half of a surrogate pair.
Then, by keeping track of how many times the character's code point can be shifted by 8 bits (point >> 8) until it reaches zero, you can arrive at how many bytes the code point needs, divide that by 2 (rounding up) to get how many UCS-2 units it occupies, and advance to the next full character in the string.
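function countCodePoints(str) {
  var point;
  var index;
  var width = 0;
  var len = 0;
  for (index = 0; index < str.length;) {
    // Full code point starting at this index (ES6)
    point = str.codePointAt(index);
    // Count how many bytes it takes to hold the code point
    width = 0;
    while (point) {
      width += 1;
      point = point >> 8;
    }
    // 1-2 bytes is one UCS-2 unit, 3 bytes is two; always advance at
    // least one unit so that "\u0000" (width 0) can't loop forever
    index += Math.max(1, Math.ceil(width / 2));
    len += 1;
  }
  return len;
}

var str = "こんにちは世界!";
var len = countCodePoints(str);
console.log("character length is", len);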
That yields 8, which is correct... in this case.
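For anyone stuck on ES5.1 without codePointAt, a rough equivalent can be sketched with charCodeAt alone by treating a high surrogate followed by a low surrogate as one code point. This is an alternative technique, not the approach above, and the name countCodePointsEs5 is invented for the example.

```javascript
function countCodePointsEs5(str) {
  var len = 0;
  var index = 0;
  var unit;
  var next;
  while (index < str.length) {
    unit = str.charCodeAt(index);
    index += 1;
    // A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF)
    // is one code point spread over two units, so skip the second unit.
    if (unit >= 0xD800 && unit <= 0xDBFF && index < str.length) {
      next = str.charCodeAt(index);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        index += 1;
      }
    }
    len += 1;
  }
  return len;
}

console.log(countCodePointsEs5("こんにちは世界!")); // 8
console.log(countCodePointsEs5("\uD83D\uDCA9"));    // 1 (💩)
```

A lone surrogate with no partner is counted as one code point of its own, which matches what codePointAt would report.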
But it's not that simple...
It's an 80/20 thing
Our humble solution isn't bad, but it's not perfect either.
In the particular case of whole characters it happens to be right, but start adding accent marks (or, more technically, combining marks that stack onto a base character to form a single grapheme) and it's all bunk again:
var str = "Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘";
console.log("character length is", countCodePoints(str));
(image of the above text, for reference)
Yes, our old friend ZALGO yields a length of 57, so... what the Heinz is going on!?
Now, what we both want is for you to scroll down three or four more lines and to be greeted by a beautiful, succinct snippet of code that solves that problem too.
Sadly, that isn't the case. I don't know any clever tricks short of loading 1,700+ lines of hard-coded common Unicode accents to solve that problem.
However, if that's what you're into I will suggest the grapheme-splitter package.
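For runtimes that ship the newer Intl.Segmenter API (an assumption; many older browsers and Node.js versions don't have it), the lookup tables are built in and a short sketch becomes possible. The function name countGraphemes is made up, and this is an illustration, not a replacement for testing against your own text.

```javascript
function countGraphemes(str) {
  // Assumption: Intl.Segmenter is available; bail out loudly if it isn't.
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    throw new Error("Intl.Segmenter is not available in this runtime");
  }
  var segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  var count = 0;
  // Each segment is one user-perceived character (grapheme cluster).
  for (var segment of segmenter.segment(str)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("e\u0301"));             // 1 (e + combining acute accent)
console.log(countGraphemes("Z\u0351\u036B\u0343")); // 1 (Z with three stacked marks)
console.log(countGraphemes("Z\u0351A\u036B"));      // 2
```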
The Bottom line...s
Counting bytes isn't too hard.
Counting characters, however... is nigh unto impossible.
I've done all I can do to help you. Good luck!
P.S. If you enjoyed this and you're looking for a deeper dive into Unicode and JavaScript you may also enjoy Unicode, My Old Friend (JavaScript).
By AJ ONeal