Wednesday, 19 June 2019

Using Python textwrap.shorten for string but with bytes width

I'd like to shorten a string using textwrap.shorten or a function like it. The string may contain non-ASCII characters. What's special here is that the maximum width applies to the encoded bytes of the string, not to its character count. The motivation is that several database column definitions and some message buses have a byte-based maximum length.

For example:

>>> import textwrap
>>> s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'

# Available function that I tried:
>>> textwrap.shorten(s, width=27)
'☺ Ilsa, le méchant ☺ [...]'
>>> len(_.encode())
31  # I want ⩽27

# Desired function:
>>> shorten_to_bytes_width(s, width=27)
'☺ Ilsa, le méchant [...]'
>>> len(_.encode())
27  # I want and get ⩽27

It's okay for the implementation to require a width greater than or equal to the length of the whitespace-stripped placeholder [...], i.e. 5.

The text should not be shortened any more than necessary. Some buggy implementations use optimizations that occasionally result in excessive shortening.
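To make the requirements concrete, here is a minimal sketch of what such a function might look like. It is not an accepted answer, only an illustration under stated assumptions: whitespace is collapsed the way textwrap.shorten does it, the encoding defaults to UTF-8, and the `placeholder` and `encoding` parameters are names chosen here for illustration. It tries the longest word prefix first, so it never shortens more than necessary, at the cost of re-encoding each candidate.

```python
def shorten_to_bytes_width(text: str, width: int,
                           placeholder: str = " [...]",
                           encoding: str = "utf-8") -> str:
    """Shorten text so that its encoded form is at most `width` bytes.

    A sketch only: words are dropped from the end, as textwrap.shorten does,
    but lengths are measured in encoded bytes rather than characters.
    """
    words = text.split()          # collapses runs of whitespace
    collapsed = " ".join(words)
    if len(collapsed.encode(encoding)) <= width:
        return collapsed          # already fits, no placeholder needed

    stripped = placeholder.strip()
    if len(stripped.encode(encoding)) > width:
        raise ValueError("placeholder too large for max width")

    # Try the longest prefix of whole words first, so nothing is dropped
    # unnecessarily. Each candidate is re-encoded; simple, not optimized.
    for n in range(len(words) - 1, 0, -1):
        candidate = " ".join(words[:n]) + placeholder
        if len(candidate.encode(encoding)) <= width:
            return candidate

    # Not even one word fits alongside the placeholder.
    return stripped
```

With the string from the example above and width=27, this sketch returns '☺ Ilsa, le méchant [...]', which encodes to 27 bytes.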

The question "Using textwrap.wrap with bytes count" is similar, but it is about textwrap.wrap, not textwrap.shorten. Only the latter function uses a placeholder ([...]), which makes this question sufficiently distinct.

Caution: Do not rely on any of the answers here for shortening a JSON-encoded string to a fixed number of bytes. For that case, substitute text.encode() with json.dumps(text) when measuring the length.
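A short illustration of why the JSON case differs, assuming the default json.dumps settings (ensure_ascii=True): the JSON form adds quotes and escapes each non-ASCII character, so it can be longer than the UTF-8 encoding of the same text. The sample string is chosen here only for demonstration.

```python
import json

text = "méchant ☺"

utf8_len = len(text.encode())       # 12 bytes in UTF-8
json_text = json.dumps(text)        # '"m\\u00e9chant \\u263a"'
json_len = len(json_text)           # 21 characters, all ASCII

# With ensure_ascii=True the JSON text is pure ASCII, so its character
# count equals its byte count.
assert json_len == len(json_text.encode())
```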


