The old "%" string formatting and the new string .format() method handle unicode differently
Today I learned that the old style "%" string formatting and the new string .format() method behave differently when interpolating unicode strings. I was surprised to find out that one of these lines raised an error while the other did not:
'%s' % u'O\u2019Connor'
'{}'.format(u'O\u2019Connor')
The old style "%" formatting operation returns a unicode string if any of the values is a unicode string, even when the format string is a non-unicode string:
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '%s' % u'O\u2019Connor'
u'O\u2019Connor'
The new string .format() method, called on a non-unicode string with a unicode string argument, tries to encode the unicode string to a non-unicode string (bytestring), possibly raising a UnicodeEncodeError:
>>> '{}'.format(u'O\u2019Connor')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1: ordinal not in range(128)
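As far as I can tell, this is the same failure you get by encoding the unicode string with the ASCII codec by hand. Here is a minimal sketch of that step, assuming the default encoding is ASCII (the variable names are mine, for illustration only):

```python
# Illustration only: reproduce by hand the encoding step that appears to
# happen implicitly when a non-unicode template formats a unicode value.
name = u'O\u2019Connor'

try:
    name.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    # Keep a reference so the exception can be inspected afterwards.
    error = exc

# U+2019 (right single quotation mark) has no ASCII representation,
# so the codec rejects the character at index 1 of the string.
print(type(error).__name__)
print(error.encoding, error.start)
```

The exception carries the codec name and the offending position, which matches the "position 1" in the traceback above.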
I guess the correct thing to do is to start with a unicode format string:
>>> u'{}'.format(u'O\u2019Connor')
u'O\u2019Connor'
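If the format string comes from somewhere I don't control, one option might be to decode it before formatting. This is only a sketch: format_text is a hypothetical helper of my own, not part of the standard library, and it assumes byte-string templates contain nothing but ASCII:

```python
def format_text(template, *args, **kwargs):
    """Hypothetical helper: always format with a unicode template.

    Byte-string templates are decoded first, on the assumption that
    they contain only ASCII characters.
    """
    if isinstance(template, bytes):
        template = template.decode('ascii')
    return template.format(*args, **kwargs)


# A byte-string template no longer triggers an implicit ASCII encode.
greeting = format_text(b'Hello, {}!', u'O\u2019Connor')
print(repr(greeting))
```

Decoding the template up front means the result is always a unicode string, so any encoding to bytes can happen explicitly, and with a codec I choose, at the point of output.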
See also
- Documentation on the old style "%" string formatting
- Documentation on the new str.format method and the new format string syntax
- My old blog post on UnicodeEncodeError