UTF-8 support is now in preview for Azure SQL Database, allowing more options to companies extending their businesses to a global scale, where the requirement of providing global multilingual database applications and services is critical to meet customer demands, and specific market regulations.
If the above requirements resonate with you and your business, then you need a database that supports the full Unicode character set. While this was already possible by choosing a collation that allowed “supplementary characters” and using the NCHAR or NVARCHAR data types, only UTF-16 was supported until now. Azure SQL Database and SQL Server 2019 introduce support for UTF-8 as well, under the CHAR and VARCHAR data types. Choosing whether to use UTF-16 or UTF-8 to support your requirements is the next step, and this blog together with the documentation here should provide more clarity on the topic. So please read on!
How is it implemented?
To limit the amount of changes required for the above scenarios, UTF-8 is enabled in existing the data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8. Refer to Set or Change the Database Collation and Set or Change the Column Collation for more details on how to perform those changes. Note that NCHAR and NVARCHAR remains unchanged and allows UCS-2/UTF-16 encoding.
You can see all available UTF-8 collations by executing the following command in your SQL Database:
SELECT Name, Description FROM fn_helpcollations() WHERE Name LIKE '%UTF8';
Functional comparison between UTF-8 and UTF-16
UTF-8 and UTF-16 both handle the same Unicode characters, and both are variable length encodings that require up to 32 bits per character. However, there are important differences that drive the choice of whether to use UTF-8 or UTF-16 in your multilingual database or column:
- UTF-8 encodes the common ASCII characters including English and numbers using 8-bits. ASCII characters (0-127) use 1 byte, code points 128 to 2047 use 2 bytes, and code points 2048 to 65535 use 3 bytes. The code points 65536 to 1114111 use 4 bytes, and represent the character range for Supplementary Characters.
- But UTF-16 uses at least 16-bits for every character in code points 0 to 65535 (available in UCS-2 and UTF-16 alike), and code points 65536 to 1114111 use the same 4 bytes as UTF-8.
The table below outlines these storage boundaries:
Code Range (hexadecimal)
Code Range (decimal)
Storage bytes with UTF-8
Storage bytes with UTF-16
000000 – 00007F (ASCII)
0 – 127
000080 – 00009F
128 – 159
000800 – 003FFF
2,048 – 16,383
010000 – 03FFFF
65,536 – 262,143
Considerations for choosing UTF-8 or UTF-16
If your dataset uses primarily ASCII characters (which represent majority of Latin alphabets), significant storage savings may be achieved as compared to UTF-16 data types.
For example, changing an existing column data type from NCHAR(10) to CHAR(10) using an UTF-8 enabled collation, translates into nearly 50% reduction in storage requirements. This is because NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
What if your dataset is not predominately ASCII? Above the ASCII range, almost all Latin alphabets, but also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna and N’Ko will use 2 bytes per character in both UTF-8 and UTF-16 (128 to 2047).
But Chinese, Japanese, or Korean characters are represented starting in the range 2048 to 65535, and use 3 bytes in UTF-8, but only 2 bytes in UTF-16. If your dataset is mostly in this character range then using UTF-16 is preferred.
In the Supplementary character range (65536 to 1114111) there is no measurable difference between UTF-8 and UTF-16 encoding.
To read more about Unicode support in SQL Database, including details on UTF-8 support, see here.
How to convert to UTF-8?
Two methods to convert data are described next.
1) Convert column data. Imagine you have a current table that is encoded in UCS-2/UTF-16 (depends on whether a supplementary character enabled collation is used) or non-Unicode, like the following examples:
CREATE TABLE dbo.MyTable (MyString NVARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC); CREATE TABLE dbo.MyTable (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI);
To convert it in-place, run an alter statement, like the following example:
ALTER TABLE dbo.MyTable ALTER COLUMN MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8
It’s very easy to implement, however this is a possibly blocking operation which may pose an issue for large tables and busy applications.
2) Copy and Replace. This method involves copying data to a new table where the target column(s) are already in UTF-8, and then replace the old table. The following T-SQL example illustrates this using the second example table above:
CREATE TABLE dbo.MyTable2 (VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8); INSERT INTO dbo.MyTable2 SELECT * FROM dbo.MyTable; DROP TABLE dbo.MyTable; EXEC sp_rename 'dbo.MyTable2', 'dbo.MyTable’;
This method is much faster, but handling complex schemas with many dependencies (FKs, PKs, Triggers, DFs) and tail of the table synch requires much more preparation.
In conclusion, keep the topics discussed here in mind when choosing what encoding to use in your database or columns. Especially in terms of what is the predominant language/character set that’s expected to store in a specific table column.
Pedro Lopes ( @SQLPedro ) – Senior Program Manager