The Surrogate Pair Calculator etc.


A surrogate pair is defined by the Unicode Standard as “a representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.”  Since Unicode is a 21-bit standard, surrogate pairs are needed by applications that use UTF-16, such as JavaScript, to represent characters whose code points do not fit in 16 bits (those above U+FFFF).  (UTF-8, the most common HTML encoding, uses its own variable-length scheme for high code points and does not use surrogate pairs.)


[The original page provides interactive calculators: one converts a Unicode scalar value (hexadecimal 10000 – 10FFFF, or decimal 65536 – 1114111) to a surrogate pair; one converts a surrogate pair (hexadecimal D800 – DBFF and DC00 – DFFF, or decimal 55296 – 56319 and 56320 – 57343) back to a scalar value; one decomposes a typed character; and a browsable surrogate pair table covers the supplementary planes.]

The algorithm for converting to and from surrogate pairs is not widely published on the internet.  (But the code here has been “borrowed” a time or two!)  The official source is The Unicode Standard, Version 3.0 (not later versions), Section 3.7, Surrogates.


Conversion of a Unicode scalar value S to a surrogate pair <H, L>:

H = (S - 10000₁₆) / 400₁₆ + D800₁₆
L = (S - 10000₁₆) % 400₁₆ + DC00₁₆

where the operator “/” is defined in Section 0.2, Notational Conventions, as “integer division (rounded down),” and “%” as “modulo operation; equivalent to the integer remainder for positive numbers.”

Sample JavaScript:

H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
L = ((S - 0x10000) % 0x400) + 0xDC00;
return String.fromCharCode(H, L);
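The three lines above can be wrapped in a small function for reuse (the name `toSurrogatePair` is illustrative, not from the source):

```javascript
// Convert a Unicode scalar value S (0x10000 - 0x10FFFF) to a surrogate pair,
// using the division/modulo algorithm shown above.
function toSurrogatePair(S) {
  var H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
  var L = ((S - 0x10000) % 0x400) + 0xDC00;
  return String.fromCharCode(H, L);
}

toSurrogatePair(0x1F603);  // "\uD83D\uDE03" (😃)
```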

Alternately, the high and low surrogate values may be derived from the high and low 10 bits of a calculated value (from Wikipedia):

S' = yyyyyyyyyyxxxxxxxxxx  // S - 0x10000
 H = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
 L = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx

Sample JavaScript using bitwise operators:

S = S - 0x10000;
H = 0xD800 + (S >> 10);
L = 0xDC00 + (S & 0x3FF);
return String.fromCharCode(H, L);
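As a sanity check (my own sketch, not part of the source), the bitwise form can be verified against the division/modulo form over the entire supplementary range:

```javascript
// Compare the two formulations for every supplementary code point.
// Both should produce identical surrogate pairs (function names are illustrative).
function pairByDivision(S) {
  var H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
  var L = ((S - 0x10000) % 0x400) + 0xDC00;
  return String.fromCharCode(H, L);
}
function pairByBits(S) {
  var T = S - 0x10000;
  return String.fromCharCode(0xD800 + (T >> 10), 0xDC00 + (T & 0x3FF));
}
var allMatch = true;
for (var S = 0x10000; S <= 0x10FFFF; S++) {
  if (pairByDivision(S) !== pairByBits(S)) { allMatch = false; break; }
}
// allMatch remains true: dividing by 0x400 is the same as shifting right by 10,
// and taking the remainder is the same as masking the low 10 bits.
```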

The conversion of a surrogate pair <H, L> to a scalar value S:

S = (H - D800₁₆) * 400₁₆ + (L - DC00₁₆) + 10000₁₆

Sample JavaScript:

S = ((H - 0xD800) * 0x400) + (L - 0xDC00) + 0x10000;
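Wrapped as a function with a round-trip check (the name `fromSurrogatePair` is illustrative, not from the source):

```javascript
// Convert a surrogate pair <H, L> back to a Unicode scalar value.
function fromSurrogatePair(H, L) {
  return ((H - 0xD800) * 0x400) + (L - 0xDC00) + 0x10000;
}

fromSurrogatePair(0xD83D, 0xDE03).toString(16);  // "1f603"
```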
 

Here is some miscellaneous test information.  The encoding of this document is UTF-8.

The following test character is the emoji at codepoint 1F603.


Written with numeric character references, as the original codepoint and as a surrogate pair, using hexadecimal and decimal:
&#x1F603;  😃
&#128515;  😃
&#xD83D;&#xDE03;  ��
&#55357;&#56835;  ��

(HTML numeric character references must use the code point itself; references to surrogate code points are invalid and render as replacement characters, as shown.)


When using JavaScript, the character must be scripted as a surrogate pair:
\uD83D\uDE03 
String.fromCharCode(0xD83D, 0xDE03); 
String.fromCharCode(55357, 56835); 

It does not display when scripted using the original codepoint, because String.fromCharCode truncates each argument to a 16-bit code unit:
String.fromCharCode(0x1F603); 
String.fromCharCode(128515); 

But if your browser supports ECMAScript 6, a code point escape enclosed in braces will display it:
\u{1F603} 
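ECMAScript 6 also added String.fromCodePoint and String.prototype.codePointAt, which work with code points directly and make the manual pair calculation unnecessary in modern code:

```javascript
String.fromCodePoint(0x1F603);   // "😃" — accepts supplementary code points
"\u{1F603}".codePointAt(0);      // 128515 (0x1F603) — reads the whole pair
"\u{1F603}".charCodeAt(0);       // 55357 (0xD83D) — only the high surrogate
```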


When reading a character, JavaScript “sees” it as a surrogate pair, even when the HTML is written as a single character.
The following form field contains the single HTML reference &#x1F603; 
theField.value.length:  2
theField.value.charCodeAt(0):  55357
theField.value.charCodeAt(1):  56835
theField.value.charCodeAt(0).toString(16).toUpperCase():  D83D
theField.value.charCodeAt(1).toString(16).toUpperCase():  DE03
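These readings can be reproduced in script (a sketch; the string literal stands in for theField.value):

```javascript
var value = "\uD83D\uDE03";  // the same two code units JavaScript sees for &#x1F603;
value.length;                                      // 2
value.charCodeAt(0);                               // 55357
value.charCodeAt(1);                               // 56835
value.charCodeAt(0).toString(16).toUpperCase();    // "D83D"
value.charCodeAt(1).toString(16).toUpperCase();    // "DE03"
```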