Introducing UTF-8 support for Azure SQL Database

UTF-8 support is now in preview for Azure SQL Database, giving more options to companies extending their businesses to a global scale, where providing multilingual database applications and services is critical to meeting customer demands and specific market regulations.

If the above requirements resonate with you and your business, then you need a database that supports the full Unicode character set. While this was already possible by choosing a collation that allows supplementary characters and using the NCHAR or NVARCHAR data types, only UTF-16 was supported until now. Azure SQL Database and SQL Server 2019 introduce support for UTF-8 as well, under the CHAR and VARCHAR data types. Choosing whether UTF-16 or UTF-8 better supports your requirements is the next step, and this blog, together with the documentation here, should provide more clarity on the topic. So please read on!

How is it implemented?

To limit the amount of changes required for the above scenarios, UTF-8 is enabled in the existing data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating or changing an object's collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8. Refer to Set or Change the Database Collation and Set or Change the Column Collation for more details on how to perform those changes. Note that NCHAR and NVARCHAR remain unchanged and continue to use UCS-2/UTF-16 encoding.

You can see all available UTF-8 collations by executing the following command in your SQL Database:

SELECT Name, Description FROM fn_helpcollations()
WHERE Name LIKE '%UTF8';

Functional comparison between UTF-8 and UTF-16

UTF-8 and UTF-16 both handle the same Unicode characters, and both are variable-length encodings that require up to 32 bits (4 bytes) per character. However, there are important differences that drive the choice of whether to use UTF-8 or UTF-16 in your multilingual database or column:

  • UTF-8 encodes the common ASCII characters, including the English alphabet and digits, using a single byte (8 bits). ASCII characters (code points 0 to 127) use 1 byte, code points 128 to 2,047 use 2 bytes, and code points 2,048 to 65,535 use 3 bytes. Code points 65,536 to 1,114,111 use 4 bytes and represent the supplementary character range.
  • UTF-16, by contrast, uses at least 16 bits (2 bytes) for every character in code points 0 to 65,535 (available in UCS-2 and UTF-16 alike), while code points 65,536 to 1,114,111 use the same 4 bytes as UTF-8.

The table below outlines these boundaries:

Code Range (hexadecimal)   Code Range (decimal)   Bytes with UTF-8   Bytes with UTF-16
000000 – 00007F (ASCII)    0 – 127                1                  2
000080 – 0007FF            128 – 2,047            2                  2
000800 – 00FFFF            2,048 – 65,535         3                  2
010000 – 10FFFF            65,536 – 1,114,111     4                  4

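These boundaries can be checked with any standards-conforming UTF-8/UTF-16 implementation. As an illustrative sketch outside SQL Server, here is how Python reports the encoded size of one sample character from each range (using utf-16-le to avoid counting the 2-byte byte-order mark that plain utf-16 prepends):

```python
# Compare UTF-8 vs UTF-16 storage for one sample character per range.
samples = [
    ("A", 0x41),      # ASCII: 1 byte in UTF-8, 2 in UTF-16
    ("é", 0xE9),      # 128-2,047: 2 bytes in both encodings
    ("€", 0x20AC),    # 2,048-65,535: 3 bytes in UTF-8, 2 in UTF-16
    ("😀", 0x1F600),  # supplementary range: 4 bytes in both
]
for ch, cp in samples:
    assert ord(ch) == cp  # sanity-check the code point
    print(f"U+{cp:06X}  utf-8: {len(ch.encode('utf-8'))} bytes  "
          f"utf-16: {len(ch.encode('utf-16-le'))} bytes")
```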

Considerations for choosing UTF-8 or UTF-16

If your dataset uses primarily ASCII characters (which represent the majority of Latin-based text), significant storage savings can be achieved compared to the UTF-16 data types.

For example, changing an existing column data type from NCHAR(10) to CHAR(10) using a UTF-8 enabled collation translates into nearly a 50 percent reduction in storage requirements. This is because NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
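The arithmetic behind that saving can be sketched outside SQL Server. A 10-character ASCII string costs 20 bytes of character data in UTF-16 but only 10 in UTF-8 (the figures above additionally include SQL Server's per-value storage overhead on top of the raw encoded bytes):

```python
# Raw encoded size of a 10-character ASCII string in each encoding.
s = "0123456789"
utf16_bytes = len(s.encode("utf-16-le"))  # 2 bytes per ASCII character
utf8_bytes = len(s.encode("utf-8"))       # 1 byte per ASCII character
print(utf16_bytes, utf8_bytes)            # 20 10 -> roughly a 50% saving
```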

What if your dataset is not predominantly ASCII? Above the ASCII range, almost all Latin-based scripts, as well as Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna, and N'Ko, use 2 bytes per character in both UTF-8 and UTF-16 (code points 128 to 2,047).

But Chinese, Japanese, and Korean characters fall mostly in the range 2,048 to 65,535, where they use 3 bytes in UTF-8 but only 2 bytes in UTF-16. If your dataset is mostly in this character range, then UTF-16 is preferred.
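A quick sketch of that difference in Python, using a short Japanese string as an illustration: in this range UTF-16 data is about one-third smaller than the UTF-8 equivalent.

```python
# CJK characters in the 2,048-65,535 range: 3 bytes each in UTF-8,
# but only 2 bytes each in UTF-16.
s = "日本語"
print(len(s.encode("utf-8")))      # 9 bytes (3 characters x 3)
print(len(s.encode("utf-16-le")))  # 6 bytes (3 characters x 2)
```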

In the supplementary character range (65,536 to 1,114,111) there is no storage difference between UTF-8 and UTF-16: both use 4 bytes per character.

To read more about Unicode support in SQL Database, including details on UTF-8 support, see here.

How to convert to UTF-8?

Two methods to convert data are described next.

1) Convert column data. Imagine you have an existing table that is encoded in UCS-2/UTF-16 (depending on whether a supplementary character enabled collation is used) or non-Unicode, like the following examples:

CREATE TABLE dbo.MyTable (MyString NVARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC);
CREATE TABLE dbo.MyTable (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI);

To convert it in place, run an ALTER statement like the following example:

ALTER TABLE dbo.MyTable
ALTER COLUMN MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8;

This is very easy to implement; however, it is a potentially blocking operation, which may pose an issue for large tables and busy applications.

2) Copy and replace. This method involves copying data to a new table where the target column(s) are already in UTF-8, and then replacing the old table. The following T-SQL example illustrates this using the second example table above:

CREATE TABLE dbo.MyTable2 (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT INTO dbo.MyTable2
SELECT * FROM dbo.MyTable;
DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.MyTable2', 'MyTable';

This method is much faster, but handling complex schemas with many dependencies (foreign keys, primary keys, triggers, defaults) and synchronizing the tail of the table requires much more preparation.

In conclusion, keep the topics discussed here in mind when choosing which encoding to use in your database or columns, especially the predominant language or character set you expect to store in a specific table column.

Pedro Lopes ( @SQLPedro ) – Senior Program Manager


This article was originally published by Microsoft's ITOps Talk Blog. You can find the original article here.