What encoding should a URI in an OpenID Connect or OAuth discovery document use for an internationalized domain name (IDN)?

So, what encoding should a URI in an OpenID Connect or OAuth discovery document use for an internationalized domain name such as “müsik.example.com”? [1]

One option is to represent it in the encoding of the discovery document itself, which, as of January 2018, MUST be UTF-8.

Another option is to represent it in punycode. Then “müsik.example.com” becomes “xn--msik-0ra.example.com”. This achieves greater client compatibility but is less user-friendly, because you cannot read the punycoded string. Well, in the above case you can sort of guess, but “お筝.example.com” becomes “xn--t8j937t.example.com”, which is absolutely not readable. You cannot even remotely guess.
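For readers who want to reproduce the conversion, here is a minimal sketch using the "idna" codec from Python's standard library. That codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a statement of what any particular client does. The helper name to_ascii_host is made up for this example.

```python
def to_ascii_host(host: str) -> str:
    """Return the ASCII (punycode) form of a host name.

    Uses Python's built-in "idna" codec (IDNA 2003, RFC 3490).
    Hypothetical helper name, for illustration only.
    """
    return host.encode("idna").decode("ascii")


for host in ("müsik.example.com", "お筝.example.com"):
    # Each non-ASCII label is converted separately and gets the "xn--" prefix.
    print(host, "->", to_ascii_host(host))
```

The reverse direction works the same way: decoding the ASCII form with the "idna" codec gives back the readable name.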

So, which is better?

During the OpenID Connect standardization, many of us thought that allowing Unicode characters in the URI would cause too much trouble, such as administrator phishing through look-alike characters. So, for the authority section at least, I would opt for ASCII strings, but that is just a practice.

Spec-wise, the only provision that OpenID Connect has on Unicode is this:

14.  String Operations

Processing some OpenID Connect messages requires comparing values in the messages to known values. For example, the Claim Names returned by the UserInfo Endpoint might be compared to specific Claim Names such as sub. Comparing Unicode strings, however, has significant security implications.

Therefore, comparisons between JSON strings and other Unicode strings MUST be performed as specified below:

  1. Remove any JSON applied escaping to produce an array of Unicode code points.
  2. Unicode Normalization [USA15] MUST NOT be applied at any point to either the JSON string or to the string it is to be compared against.
  3. Comparisons between the two strings MUST be performed as a Unicode code point to code point equality comparison.

In several places, this specification uses space delimited lists of strings. In all such cases, a single ASCII space character (0x20) MUST be used as the delimiter.
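To make the three comparison steps concrete, here is a small sketch in Python. It assumes the value arrives as a JSON string literal. The function name claim_equals is hypothetical and not part of any OpenID Connect library.

```python
import json


def claim_equals(json_literal: str, known: str) -> bool:
    """Compare a JSON string literal with a known value, code point by code point."""
    # Step 1: json.loads removes the JSON escaping and yields a Python str,
    # which is a sequence of Unicode code points.
    value = json.loads(json_literal)
    # Step 2: deliberately no unicodedata.normalize() call on either side.
    # Step 3: == on two str objects is a code point to code point comparison.
    return value == known


precomposed = "m\u00fcsik"   # "ü" as the single code point U+00FC
decomposed = "mu\u0308sik"   # "u" followed by U+0308 COMBINING DIAERESIS

print(claim_equals('"m\\u00fcsik"', precomposed))  # escaped form of the same code points
print(claim_equals(json.dumps(decomposed), precomposed))  # looks alike, different code points
```

The two host-like strings render identically on screen, yet the second comparison fails because no normalization is applied. That is exactly the behaviour the quoted section requires.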

Unfortunately, it does not say anything about the encoding of the document, but per RFC 8259, a JSON document that is not confined within a closed environment MUST be represented in UTF-8. (Note: this was not required until December 2017! We need to update the reference in the Discovery spec.)

On the other hand, RFC3986 states:

The reg-name syntax allows percent-encoded octets in order to represent non-ASCII registered names in a uniform way that is independent of the underlying name resolution technology. Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters. URI producing applications must not use percent-encoding in host unless it is used to represent a UTF-8 character sequence. When a non-ASCII registered name represents an internationalized domain name intended for resolution via the DNS, the name must be transformed to the IDNA encoding [RFC3490] prior to name lookup. URI producers should provide these registered names in the IDNA encoding, rather than a percent-encoding, if they wish to maximize interoperability with legacy URI resolvers.

This makes it clear that a URL can actually contain UTF-8 encoded characters in its host.

So, https://müsik.example.com/ is a valid URL. The requirement is that the user-agent must transform the hostname to punycode before submitting to the name resolver.

Thus, an RFC 3986 compliant client must be able to cope with a UTF-8 URI as long as the encoding is clear from the underlying context.
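As an illustration of the two host representations the RFC 3986 passage mentions, the sketch below percent-encodes the UTF-8 octets of a host and then turns the percent-encoded form into the IDNA encoding. It uses only urllib.parse and the built-in "idna" codec (IDNA 2003). Both function names are invented for this example.

```python
from urllib.parse import quote, unquote


def percent_encode_host(host: str) -> str:
    """UTF-8 encode the host, then percent-encode each non-ASCII octet."""
    # quote() uses UTF-8 by default; letters, digits and "." are left as-is.
    return quote(host, safe="")


def percent_encoded_to_idna(host: str) -> str:
    """Turn a percent-encoded UTF-8 host into its IDNA (punycode) form."""
    # unquote() decodes the percent-encoded octets as UTF-8 by default.
    return unquote(host).encode("idna").decode("ascii")


encoded = percent_encode_host("müsik.example.com")
print(encoded)
print(percent_encoded_to_idna(encoded))
```

Here "ü" becomes the two octets C3 BC in the percent-encoded form, and the same host becomes an "xn--" label once it is prepared for a DNS lookup.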

My personal conclusion is then:

  1. Since the discovery document is UTF-8, it should use a UTF-8 encoded authority section.
  2. Since the JWT header and body are JSON, they MUST be UTF-8.
  3. The client library SHOULD transform the UTF-8 authority section to punycode before submitting it to the DNS resolver. A client MUST make sure that the library it is using does so, unless it is using a modern, IDN-enabled DNS resolver.
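Item 3 could look roughly like the following sketch on the client side. It is only one possible approach and rests on a few assumptions: the URL has no userinfo part, the host is not an IPv6 literal, and the IDNA 2003 rules of Python's built-in "idna" codec are acceptable. The function name to_resolvable_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_resolvable_url(url: str) -> str:
    """Rewrite a URL so that its host is in punycode before any DNS lookup.

    Assumptions (illustration only): no userinfo, no IPv6 literal.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return url  # nothing to convert, e.g. a relative reference

    # Convert every non-ASCII label to its "xn--" form; ASCII hosts pass through.
    ascii_host = host.encode("idna").decode("ascii")

    # Rebuild the authority section, keeping an explicit port if there was one.
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(to_resolvable_url("https://müsik.example.com/.well-known/openid-configuration"))
```

The discovery document can then keep the readable UTF-8 form, and only the value handed to the resolver or HTTP stack is converted.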

A punycoded string is extremely user-unfriendly for Asian and other non-Latin characters. You just cannot read it. I would advocate using UTF-8 in the JSON document and pointing client developers to item 3 above if they get an error with it, but YMMV.

Footnotes

  1. This question was brought up by Block Allen in the OpenID AB/Connect WG: http://lists.openid.net/pipermail/openid-specs-ab/Week-of-Mon-20180402/006722.html