0xed unicode

5/18/2023

This is different than what would happen if you constructed a UTF-16 string with two invalid surrogates in the right order, which is that you’d get a correctly encoded character. Specifically, this is an instance of the WTF-8 extension of UTF-8. Since the String type is UTF-8 based, that means having two separate well-formed but invalid characters with surrogate code points. Julia does allow you to use \uD83D\uDE3F to create a String value with two invalid surrogate code points: julia> "\uD83D\uDE3F" If those in charge of the language decided that \T or even \q were to be the tab escape, then that might be a poor choice due to the confusion it would cause, but it wouldn’t be “wrong” per-se.Īlong those same lines, I view the fact that many languages parse surrogate pairs specified via \u into their respective correct code points (regardless of how strings are stored internally) as a convenience that there is little to no reason to not support (especially for languages that do not, or did not in the past, support the \U, or equivalent, escape, such as JavaScript).Īnd, now that I think about it, I would argue that tying any of this escape sequence stuff to the encoding used internally would be a leaky abstraction. The only requirement is that they behave according to the language specification. There is no inherent meaning in \t any more than there is in \x09, \u0009, or \U00000009. They are somewhat arbitrary substitutions / macros. However, string escapes aren’t byte sequences of a particular encoding. Yes, I am aware that surrogate pairs are a UTF-16-specific construct. That they are exposed in the string escapes is a bit of a leaky abstraction in languages like JavaScript. Surrogate pairs are not Unicode characters, they are a property of the UTF-16 encoding, which Julia doesn’t use-there are no surrogate pairs in UTF-8. Many languages allow you to specify the surrogate pair via a double- \u sequence.

0 Comments

Author

Archives

Categories

0xed unicode

Leave a Reply.