Describe the bug
According to the following 2019 blog post, Microsoft added Unicode support to CHAR and VARCHAR columns via variable-length UTF-8 encoding with the new UTF-8 collations: https://techcommunity.microsoft.com/blog/sqlserver/introducing-utf-8-support-for-sql-server/734928
This is also described in the Microsoft documentation: https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver17#utf-8-support
Although this seems to work in the command-line interface (sqlcmd), I am not able to store/encode and retrieve/decode UTF-8 encoded strings using mssql-python. Looking at the repository, there seems to be no test that covers this functionality.
For example, the following excerpt from tests/test_013_encoding_decoding.py suggests that the developers are aware that mssql-python does not properly support UTF-8 collations, yet this limitation is not mentioned in the documentation. That is why I am flagging this as a bug.
```python
def test_setdecoding_with_unicode_data(db_connection):
    """Test setdecoding with actual Unicode data operations.

    Note: VARCHAR columns in SQL Server use the database's default collation
    (typically Latin1/CP1252) and cannot reliably store Unicode characters.
    Only NVARCHAR columns properly support Unicode. This test focuses on
    NVARCHAR columns and ASCII-safe data for VARCHAR columns.
    """
```
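As a side note on the docstring above: the legacy code-page behavior it describes can be illustrated in plain Python, with no database or driver involved. This is only an illustration of how CP1252 conversion loses data, not mssql-python code:

```python
# CP1252 (the typical legacy default-collation code page) covers Western
# European characters, but unencodable characters become b'?'.
german = "Grüße".encode("cp1252", errors="replace")
chinese = "你好世界".encode("cp1252", errors="replace")
print(german)   # b'Gr\xfc\xdfe' -- lossless, CP1252 can represent ü and ß
print(chinese)  # b'????' -- every CJK character replaced
```

The replacement pattern matches the `'????'` results shown in the table below, which is consistent with a lossy code-page conversion happening somewhere on the client side.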
To reproduce
```python
# Quick test script:
import mssql_python
import dotenv
import os

SQL_CHAR = 1
SQL_WCHAR = -8
TEST_DB_TABLE = "test_utf8"

dotenv.load_dotenv()

# Try different connection string formats:
conn_str = (
    f"SERVER={os.getenv('DB_SERVER')};"
    f"DATABASE={os.getenv('DATABASE')};"
    f"UID={os.getenv('DB_USER')};"
    f"PWD={os.getenv('DB_PASSWORD')};"
    "TrustServerCertificate=yes"
)
conn = mssql_python.connect(conn_str)
cursor = conn.cursor()

# Check if table exists before running tests
print("Checking if table exists...")
result = cursor.execute(f"SELECT object_id FROM sys.tables WHERE name = '{TEST_DB_TABLE}'")
exists = result.fetchone() is not None
if not exists:
    print(f"\nCreating table '{TEST_DB_TABLE}'")
    cursor.execute(f"""
        CREATE TABLE {TEST_DB_TABLE} (
            id INT PRIMARY KEY,
            varchar_data VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
            varchar_data_workaround VARBINARY(100), -- Store UTF-8 bytes directly
            nvarchar_data NVARCHAR(100)
        )
    """)
    cursor.commit()
    print(f"✅ Table '{TEST_DB_TABLE}' created successfully!")

# Test Unicode data
unicode_test_cases = [
    ("ASCII", "Hello World"),
    ("German", "Grüße"),
    ("Chinese", "你好世界"),
    ("Japanese", "こんにちは"),
    ("Russian", "Привет"),
    ("Mixed", "Hello 世界"),
    ("Emoji", "😀😃😄😁"),
]

# Configure encodings properly:
# - SQL_CHAR encoding affects VARCHAR columns
# - SQL_WCHAR encoding affects NVARCHAR columns
conn.setencoding(encoding="utf-8", ctype=SQL_CHAR)  # For VARCHAR
conn.setdecoding(SQL_CHAR, encoding="utf-8", ctype=SQL_CHAR)
# NVARCHAR always uses UTF-16LE (SQL_WCHAR)
conn.setencoding(encoding="utf-16le", ctype=SQL_WCHAR)  # For NVARCHAR
conn.setdecoding(SQL_WCHAR, encoding="utf-16le", ctype=SQL_WCHAR)

# Clear table
cursor.execute(f"DELETE FROM {TEST_DB_TABLE}")
cursor.commit()

results = []
for i, (test_name, unicode_text) in enumerate(unicode_test_cases):
    # Insert Unicode data
    cursor.execute(
        f"INSERT INTO {TEST_DB_TABLE} VALUES "
        f"({i + 1}, '{unicode_text}', 0x{unicode_text.encode('utf-8').hex()}, '{unicode_text}')"
    )
    cursor.commit()
    # Retrieve data
    cursor.execute(
        f"SELECT varchar_data, varchar_data_workaround, nvarchar_data "
        f"FROM {TEST_DB_TABLE} WHERE id = {i + 1}"
    )
    result = cursor.fetchone()
    # Store all results
    results.append((test_name, unicode_text, result))

# Display all results in tabular format
print("\n" + "=" * 80)
print(f"{'Test Case':<15} {'Original':<25} {'VARCHAR Result':<25} "
      f"{'VARCHAR Workaround':<25} {'NVARCHAR Result':<25}")
print("=" * 80)
for test_name, unicode_text, result in results:
    varchar_result = repr(result[0])[:23]
    varchar_workaround_result = repr(result[1].decode('utf-8'))[:23]
    nvarchar_result = repr(result[2])[:23]
    print(f"{test_name:<15} {repr(unicode_text)[:23]:<25} {varchar_result:<25} "
          f"{varchar_workaround_result:<25} {nvarchar_result:<25}")
```
Decoding the VARCHAR result fails.
```
================================================================================
Test Case       Original                  VARCHAR Result            VARCHAR Workaround        NVARCHAR Result
================================================================================
ASCII           'Hello World'             'Hello World'             'Hello World'             'Hello World'
German          'Grüße'                   b'Gr\xfc\xdfe'            'Grüße'                   'Grüße'
Chinese         '你好世界'                 '????'                    '你好世界'                 '你好世界'
Japanese        'こんにちは'               '?????'                   'こんにちは'               'こんにちは'
Russian         'Привет'                  '??????'                  'Привет'                  'Привет'
Mixed           'Hello 世界'               'Hello ??'                'Hello 世界'               'Hello 世界'
Emoji           '😀😃😄😁'                 '????????'                '😀😃😄😁'                 '😀😃😄😁'
```
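A detail worth noting in the German row: the VARCHAR result came back as the raw bytes b'Gr\xfc\xdfe', which is exactly the CP1252 encoding of 'Grüße', not its UTF-8 encoding. My hypothesis (an assumption on my part, not something I have verified in the driver source) is that the data is routed through the legacy code page instead of being decoded as UTF-8. A plain-Python check of the returned bytes:

```python
# Bytes returned for varchar_data in the German test case.
raw = b"Gr\xfc\xdfe"
print(raw.decode("cp1252"))                    # Grüße -- valid CP1252
print(raw.decode("utf-8", errors="replace"))   # Gr��e -- not valid UTF-8
```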
I assume the problem lies in the decoding step, since decoding works in sqlcmd:
```
1> :setvar SQLCMDMAXVARTYPEWIDTH 30
2> :setvar SQLCMDMAXFIXEDTYPEWIDTH 30
3> SELECT varchar_data, varchar_data_workaround, nvarchar_data FROM test_utf8;
4> GO
varchar_data                   varchar_data_workaround          nvarchar_data
------------------------------ -------------------------------- ------------------------------
Hello World                    0x48656C6C6F20576F726C64         Hello World
Grüße                          0x4772C3BCC39F65                 Grüße
你好世界                        0xE4BDA0E5A5BDE4B896E7958C       你好世界
こんにちは                      0xE38193E38293E381ABE381A1E381AF こんにちは
Привет                         0xD09FD180D0B8D0B2D0B5D182       Привет
Hello 世界                      0x48656C6C6F20E4B896E7958C       Hello 世界
😀😃😄😁                        0xF09F9880F09F9883F09F9884F09F98 😀😃😄😁
```
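The hex dumps in the varchar_data_workaround column also confirm that the bytes stored on the server are valid UTF-8, so the data itself is intact and only the client-side decode fails. For example, the hex shown for the German test case decodes cleanly:

```python
# 0x4772C3BCC39F65 is the hex sqlcmd shows for the German test case;
# it is well-formed UTF-8 for the original string.
stored = bytes.fromhex("4772C3BCC39F65")
print(stored.decode("utf-8"))  # Grüße
```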
Expected behavior
Proper decoding of Unicode strings from VARCHAR columns with the new UTF-8 collations when the connection is configured with:

```python
conn.setencoding(encoding="utf-8", ctype=SQL_CHAR)  # For VARCHAR
conn.setdecoding(SQL_CHAR, encoding="utf-8", ctype=SQL_CHAR)
```
Further technical details
Python version: 3.14.4
SQL Server version: SQL Server 2022 (docker mssql/server:2022-latest)
Operating system: Docker container