Text about surrogate pairs has confusing example. #516

@lrhn

The text says:

A UTF-16 surrogate code point, even if in a valid UTF-16 surrogate pair, e.g. \uD83D\uDE03 or \UD83DDE03.

That's confusing because \UD83DDE03 is not a valid surrogate pair; it represents the (nonexistent) code point U+D83DDE03.
Using it as an example suggests that there is some way to interpret an eight-digit escape as two 16-bit code units, so that, for example, "\u00410042" could be a valid way to write the string "AB".
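To make the difference concrete, here is a rough Python sketch of the arithmetic. The helper name `combine_surrogates` is made up for illustration and is not part of the spec under discussion.

```python
MAX_CODE_POINT = 0x10FFFF  # largest code point Unicode defines


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# \uD83D\uDE03 read as a surrogate pair
pair_value = combine_surrogates(0xD83D, 0xDE03)

# \UD83DDE03 read as what it literally says: one eight-digit code point
eight_digit_value = 0xD83DDE03

print(hex(pair_value))                      # 0x1f603
print(eight_digit_value <= MAX_CODE_POINT)  # False
```

The pair decodes to U+1F603, while the eight-digit form is far outside the code point range, so the two escapes are not two spellings of the same thing.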

(A string is a sequence of valid code points that are not in the surrogate range. That's the same as a sequence of Unicode scalar values: the scalar values are exactly the Unicode code points except the surrogates.)
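A minimal sketch of that definition, assuming code points are handled as plain integers (the function name `is_scalar_value` is hypothetical):

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points U+0000..U+10FFFF outside the surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


print(is_scalar_value(0x1F603))     # True
print(is_scalar_value(0xD83D))      # False: high surrogate
print(is_scalar_value(0xDE03))      # False: low surrogate
print(is_scalar_value(0xD83DDE03))  # False: beyond U+10FFFF
```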

The same place, just above, also says that the following is not allowed in a string literal:

  • An invalid Unicode code point, e.g. \u2FE0.

Is this only if the value occurs as an escape, or also if the source contains the literal U+2FE0 code point?
(There is no specification of what input source is, other than that it contains "characters", so likely it's a sequence of scalar values.)

The Unicode specification does not define any code point as "invalid".
Is it any code point which is currently unassigned?
(If so, which strings are valid depends on which Unicode version is being validated against.)
Does that include code points that are reserved? Or assigned, but as non-characters?
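To show why the answer matters, here is a hedged sketch of the distinct categories that "invalid" could mean. It uses Python's `unicodedata` module, whose answers depend on the Unicode version bundled with the interpreter; the `classify` helper and its labels are my own, not anything from the spec.

```python
import unicodedata


def classify(cp: int) -> str:
    """Roughly classify a code point; labels are illustrative only."""
    if not 0 <= cp <= 0x10FFFF:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "unassigned"  # in this Unicode version only
    if category == "Co":
        return "private-use"
    return "assigned"


# The result for U+2FE0 depends on this version number.
print(unicodedata.unidata_version)
for cp in (0x0041, 0x2FE0, 0xD83D, 0xE000, 0xFFFF):
    print(f"U+{cp:04X}: {classify(cp)}")
```

Only the "unassigned" answer can change between Unicode versions; the surrogate and noncharacter ranges are fixed, which is why a definition of "invalid" based on assignment would make string validity version-dependent.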
