Monday, 19 April 2021

How come I can decode a UTF-8 byte string to ISO8859-1 and back again without any UnicodeEncodeError/UnicodeDecodeError?

How come the following works without any errors in Python?

>>> '你好'.encode('UTF-8').decode('ISO8859-1')
'ä½\xa0å¥½'
>>> _.encode('ISO8859-1').decode('UTF-8')
'你好'

I would have expected it to fail with a UnicodeEncodeError or UnicodeDecodeError.

Is there some property of ISO8859-1 and UTF-8 such that I can take any UTF-8 encoded string and decode it to an ISO8859-1 string, which can later be reversed to get the original UTF-8 string?
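For what it's worth, a quick experiment suggests the ISO8859-1 round trip works for arbitrary bytes, not just the output of a UTF-8 encode (the byte values below are just made up for the test):

>>> import os
>>> raw = bytes(range(256)) + os.urandom(16)  # every possible byte value plus some random ones
>>> text = raw.decode('ISO8859-1')            # decoding never raises
>>> text.encode('ISO8859-1') == raw           # encoding back gives the exact same bytes
True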

I'm working with an older database that only supports the ISO8859-1 character set. It seems the developers were able to store Chinese and other languages in it by decoding UTF-8-encoded strings as ISO8859-1 and storing the resulting garbage string in the database. Downstream systems that query this database then have to encode the garbage string as ISO8859-1 and decode the result as UTF-8 to get the correct string.
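If I'm reading it right, the intermediate garbage string only contains characters that ISO8859-1 can represent, which would be why the database accepts it. Here is my rough reconstruction of the write/read path (the actual store and query calls are omitted, since I don't have the schema in front of me):

>>> doc = '你好, world'
>>> stored = doc.encode('UTF-8').decode('ISO8859-1')   # what gets written to the ISO8859-1 database
>>> all(ord(ch) <= 0xFF for ch in stored)              # every character fits in ISO8859-1
True
>>> stored.encode('ISO8859-1').decode('UTF-8') == doc  # what downstream systems recover
True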

I would have assumed that such a process would not work at all.

What am I missing?




from How come I can decode a UTF-8 byte string to ISO8859-1 and back again without any UnicodeEncodeError/UnicodeDecodeError?
