The Surrogate Pair Calculator etc.


A surrogate pair is defined by the Unicode Standard as “a representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.”  Since Unicode is a 21-bit standard, surrogate pairs are needed by applications that use UTF-16, such as JavaScript, to represent characters whose code points do not fit in 16 bits (those above U+FFFF).  (UTF-8, the most common HTML encoding, uses its own variable-length scheme for high code points and does not use surrogate pairs.)


[The original page provides interactive calculators: one converts a Unicode scalar value (hexadecimal 10000 – 10FFFF, or decimal 65536 – 1114111) to a surrogate pair; one converts a surrogate pair (hexadecimal D800 – DBFF and DC00 – DFFF, or decimal 55296 – 56319 and 56320 – 57343) back to a scalar value; one decomposes a typed character; and a browsable surrogate pair table covers the supplementary planes.]

The algorithm for converting to and from surrogate pairs is not widely published on the internet.  (But the code here has been “borrowed” a time or two!)  The official source is The Unicode Standard, Version 3.0 (not later versions), Section 3.7, Surrogates.


Conversion of a Unicode scalar value S to a surrogate pair <H, L>:

H = (S - 10000₁₆) / 400₁₆ + D800₁₆
L = (S - 10000₁₆) % 400₁₆ + DC00₁₆

where the operator “/” is defined in Section 0.2, Notational Conventions, as “integer division (rounded down),” and “%” as “modulo operation; equivalent to the integer remainder for positive numbers.”

Sample JavaScript:

H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
L = ((S - 0x10000) % 0x400) + 0xDC00;
return String.fromCharCode(H, L);
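The three lines above can be wrapped in a small function for reuse (the name `toSurrogatePair` is illustrative, not from the source):

```javascript
// Convert a Unicode scalar value S (0x10000 - 0x10FFFF) to a surrogate pair,
// using the division/modulo algorithm shown above.
function toSurrogatePair(S) {
  var H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
  var L = ((S - 0x10000) % 0x400) + 0xDC00;
  return String.fromCharCode(H, L);
}

toSurrogatePair(0x1F603);  // "\uD83D\uDE03" (😃)
```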

Alternately, the high and low surrogate values may be derived from the high and low 10 bits of a calculated value (from Wikipedia):

S' = yyyyyyyyyyxxxxxxxxxx  // S - 0x10000
 H = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
 L = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx

Sample JavaScript using bitwise operators:

S = S - 0x10000;
H = 0xD800 + (S >> 10);
L = 0xDC00 + (S & 0x3FF);
return String.fromCharCode(H, L);
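As a sanity check (my own sketch, not part of the source), the bitwise form can be verified against the division/modulo form over the entire supplementary range:

```javascript
// Compare the two formulations for every supplementary code point.
// Both should produce identical surrogate pairs (function names are illustrative).
function pairByDivision(S) {
  var H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
  var L = ((S - 0x10000) % 0x400) + 0xDC00;
  return String.fromCharCode(H, L);
}
function pairByBits(S) {
  var T = S - 0x10000;
  return String.fromCharCode(0xD800 + (T >> 10), 0xDC00 + (T & 0x3FF));
}
var allMatch = true;
for (var S = 0x10000; S <= 0x10FFFF; S++) {
  if (pairByDivision(S) !== pairByBits(S)) { allMatch = false; break; }
}
// allMatch remains true: dividing by 0x400 is the same as shifting right by 10,
// and taking the remainder is the same as masking the low 10 bits.
```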

The conversion of a surrogate pair <H, L> to a scalar value S:

S = (H - D800₁₆) * 400₁₆ + (L - DC00₁₆) + 10000₁₆

Sample JavaScript:

S = ((H - 0xD800) * 0x400) + (L - 0xDC00) + 0x10000;
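Wrapped as a function with a round-trip check (the name `fromSurrogatePair` is illustrative, not from the source):

```javascript
// Convert a surrogate pair <H, L> back to a Unicode scalar value.
function fromSurrogatePair(H, L) {
  return ((H - 0xD800) * 0x400) + (L - 0xDC00) + 0x10000;
}

fromSurrogatePair(0xD83D, 0xDE03).toString(16);  // "1f603"
```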
 

Here is some miscellaneous test information.  The encoding of this document is UTF-8.

The following test character is the emoji at codepoint 1F603.


Written with numeric character references, as the original codepoint and as a surrogate pair, using hexadecimal and decimal:
&#x1F603;  😃
&#128515;  😃
&#xD83D;&#xDE03;  ��
&#55357;&#56835;  ��

(HTML numeric character references must use the code point itself; references to surrogate code points are invalid and render as replacement characters, as shown.)


When using JavaScript, the character must be scripted as a surrogate pair:
\uD83D\uDE03 
String.fromCharCode(0xD83D, 0xDE03); 
String.fromCharCode(55357, 56835); 

It does not display when scripted using the original codepoint, because String.fromCharCode truncates each argument to a 16-bit code unit:
String.fromCharCode(0x1F603); 
String.fromCharCode(128515); 

But if your browser supports ECMAScript 6, a code point escape enclosed in braces will display it:
\u{1F603} 
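ECMAScript 6 also added String.fromCodePoint and String.prototype.codePointAt, which work with code points directly and make the manual pair calculation unnecessary in modern code:

```javascript
String.fromCodePoint(0x1F603);   // "😃" — accepts supplementary code points
"\u{1F603}".codePointAt(0);      // 128515 (0x1F603) — reads the whole pair
"\u{1F603}".charCodeAt(0);       // 55357 (0xD83D) — only the high surrogate
```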


When reading a character, JavaScript “sees” it as a surrogate pair, even when the HTML is written as a single character.
The following form field contains the single HTML reference &#x1F603; 
theField.value.length:  2
theField.value.charCodeAt(0):  55357
theField.value.charCodeAt(1):  56835
theField.value.charCodeAt(0).toString(16).toUpperCase():  D83D
theField.value.charCodeAt(1).toString(16).toUpperCase():  DE03
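These readings can be reproduced in script (a sketch; the string literal stands in for theField.value):

```javascript
var value = "\uD83D\uDE03";  // the same two code units JavaScript sees for &#x1F603;
value.length;                                      // 2
value.charCodeAt(0);                               // 55357
value.charCodeAt(1);                               // 56835
value.charCodeAt(0).toString(16).toUpperCase();    // "D83D"
value.charCodeAt(1).toString(16).toUpperCase();    // "DE03"
```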