[MQTT-44] Specific details for UTF-8 Strings - OASIS Technical Committees Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.1
Fix Version/s: None
Component/s: core
Labels: None

Proposal:


Add the following to section 2.1.2

The encoded data must be well-formed UTF-8 as defined by the Unicode spec [Unicode], and restated in RFC 3629 [RFC 3629]. In particular the encoded data MUST NOT include encodings of codepoints between U+D800 and U+DFFF. If a receiver (server or client) receives a control packet containing ill-formed UTF-8 it MUST close the network connection.

The data MUST NOT include an encoding of the null character U+0000. If a receiver (server or client) receives a control packet containing U+0000 it MUST close the network connection.
The data SHOULD NOT include encodings of the Unicode codepoints listed below. If a receiver (server or client) receives a control packet containing any of them it MAY close the network connection.

U+0001..U+001F control characters
U+007F..U+009F control characters
Codepoints defined in the Unicode spec to be noncharacters (for example U+0FFFF)

The UTF-8 encoded sequence 0xEF 0xBB 0xBF is always to be interpreted to mean U+FEFF ("ZERO WIDTH NO-BREAK SPACE") wherever it appears in a string and must never be skipped over or stripped off by a packet receiver.
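As an illustration of the rules proposed for section 2.1.2, the following is a minimal Python sketch, not part of the proposal. The function names `check_mqtt_string` and `has_discouraged` are hypothetical, and raising `ValueError` stands in for "close the network connection". It relies on Python's strict `utf-8` codec, which rejects encoded surrogates and does not strip a leading 0xEF 0xBB 0xBF.

```python
def check_mqtt_string(data: bytes) -> str:
    """Decode string data, raising ValueError where the proposal says MUST close.

    Hypothetical helper: a real receiver would close the network connection.
    """
    try:
        # The strict "utf-8" codec rejects ill-formed input, including
        # encodings of U+D800..U+DFFF. Unlike "utf-8-sig", it keeps a
        # leading EF BB BF as the character U+FEFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("ill-formed UTF-8") from exc
    if "\x00" in text:
        raise ValueError("string contains U+0000")
    return text


def has_discouraged(text: str) -> bool:
    """Return True if text contains a code point from the SHOULD NOT list."""
    for ch in text:
        cp = ord(ch)
        # U+0001..U+001F and U+007F..U+009F control characters
        if 0x01 <= cp <= 0x1F or 0x7F <= cp <= 0x9F:
            return True
        # Noncharacters: U+FDD0..U+FDEF plus the last two code points
        # of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return True
    return False
```

A receiver following this sketch would call `check_mqtt_string` on every string field and then decide, as a matter of local policy, whether to act on `has_discouraged`.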

Add the following to section 4.7.3
When it performs subscription matching the server does not perform any normalization of Topic Names/Filters, or any modification or substitution of unrecognised characters. Each non-wildcarded level in the Topic Filter has to match the corresponding level in the Topic Name character for character for the match to succeed.

Non-normative comment. The UTF-8 encoding rules mean that the comparison of Topic Filter and Topic Name could be performed either by comparing the encoded UTF-8 bytes, or by comparing decoded Unicode characters.
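The matching rule above can be sketched as follows. This is a simplified illustration, not the specification's algorithm: `topic_matches` is a hypothetical name, the `+` and `#` wildcards are assumed to behave as in the MQTT specification, and other rules (such as special handling of topics beginning with `$`) are omitted. The point is that levels are compared exactly, so two canonically equivalent but differently normalized strings do not match.

```python
import unicodedata


def topic_matches(topic_filter: str, topic_name: str) -> bool:
    """Match a Topic Name against a Topic Filter with no normalization."""
    filter_levels = topic_filter.split("/")
    name_levels = topic_name.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches all remaining levels (or none).
            return True
        if i >= len(name_levels):
            return False
        # Non-wildcarded levels must be equal character for character.
        if level != "+" and level != name_levels[i]:
            return False
    return len(filter_levels) == len(name_levels)


# "é" as one code point (NFC) versus "e" plus a combining accent (NFD).
nfc = "caf\u00e9"
nfd = unicodedata.normalize("NFD", nfc)

# The two forms look identical when rendered but do not match.
print(topic_matches("menu/" + nfc, "menu/" + nfd))

# Comparing decoded characters and comparing encoded bytes agree.
print((nfc == nfd) == (nfc.encode("utf-8") == nfd.encode("utf-8")))
```

Because well-formed UTF-8 gives each code point sequence exactly one byte sequence, a server could run the same comparison on the raw bytes without decoding.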

Update the normative reference (in section 1.3) to Unicode 6.3


Description

This issue is based on comments in ~~MQTT-24~~ and is opened as a core issue for discussion on the MQTT TC call. I discussed it with my co-editor Andy, who suggested opening a core issue for TC discussion.

from ~~MQTT-24~~
-------------------

> We should also make a simple statement that UTF-8 encodings MUST NOT have a three character initial BOM.

> A clarification that the encoding MUST NOT be Java's Modified UTF-8, and can contain ASCII NULL

> At the same time, it's probably worth noting too that certain Unicode combinations are invalid in UTF-8 - the use of surrogate pairs from UTF-16 re-encoded and certain non-transmissible characters (e.g. U+FFFE from memory) - these normally delimit the last 2 characters in a multilingual plane. These restrictions are only a minor burden for Java implementations using the naive methods in string / character. These restrictions serve to stop propagation of bad data through a network of nodes.

> Implementations MAY decide to not support the use of ASCII NUL and C0 / C1 control codes / MAY decide to place additional restrictions on supported characters
