
    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: core
    • Labels: None
    • Proposal:

      Add the following to section 2.1.2

      The encoded data must be well-formed UTF-8 as defined by the Unicode spec [Unicode], and restated in RFC 3629 [RFC 3629]. In particular the encoded data MUST NOT include encodings of codepoints between U+D800 and U+DFFF. If a receiver (server or client) receives a control packet containing ill-formed UTF-8 it MUST close the network connection.

      The data MUST NOT include an encoding of the null character U+0000. If a receiver (server or client) receives a control packet containing U+0000 it MUST close the network connection.
      The data SHOULD NOT include encodings of the Unicode codepoints listed below. If a receiver (server or client) receives a control packet containing any of them it MAY close the network connection.

      U+0001..U+001F control characters
      U+007F..U+009F control characters
      Codepoints defined in the Unicode spec to be noncharacters (for example U+0FFFF)

      The UTF-8 encoded sequence 0xEF 0xBB 0xBF is always to be interpreted to mean U+FEFF ("ZERO WIDTH NO-BREAK SPACE") wherever it appears in a string and must never be skipped over or stripped off by a packet receiver.
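
      The 2.1.2 rules above could be checked roughly as follows. This is a minimal sketch, not text from the proposal: the names `Verdict`, `is_noncharacter`, `is_discouraged` and `check_mqtt_string` are hypothetical, and it assumes Python's strict "utf-8" codec is an acceptable test for well-formedness (it rejects surrogate encodings and overlong forms, and it does not strip a leading 0xEF 0xBB 0xBF).

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical result type for this sketch."""
    OK = "ok"                  # nothing objectionable found
    MAY_CLOSE = "may_close"    # a SHOULD NOT codepoint is present; receiver MAY close
    MUST_CLOSE = "must_close"  # ill-formed UTF-8 or U+0000; receiver MUST close


def is_noncharacter(cp: int) -> bool:
    # Unicode noncharacters: U+FDD0..U+FDEF, plus the last two
    # codepoints of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_discouraged(cp: int) -> bool:
    # The SHOULD NOT list: C0 controls (minus U+0000, handled separately),
    # U+007F..U+009F, and noncharacters.
    return 0x0001 <= cp <= 0x001F or 0x007F <= cp <= 0x009F or is_noncharacter(cp)


def check_mqtt_string(payload: bytes) -> Verdict:
    try:
        # Strict decoding fails on encoded surrogates (U+D800..U+DFFF),
        # overlong forms such as 0xC0 0x80, and truncated sequences.
        text = payload.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return Verdict.MUST_CLOSE
    if "\u0000" in text:
        return Verdict.MUST_CLOSE
    if any(is_discouraged(ord(ch)) for ch in text):
        return Verdict.MAY_CLOSE
    # U+FEFF is deliberately left in place: it is ordinary data here,
    # never a byte order mark to be skipped.
    return Verdict.OK
```

      Under these assumptions, `check_mqtt_string(b"\xed\xa0\x80")` (an encoded surrogate) and `check_mqtt_string(b"a\x00b")` both yield `Verdict.MUST_CLOSE`, while a string that starts with 0xEF 0xBB 0xBF is accepted with the U+FEFF retained.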

      Add the following to section 4.7.3
      When it performs subscription matching the server does not perform any normalization of Topic Names/Filters, or any modification or substitution of unrecognised characters. Each non-wildcarded level in the Topic Filter has to match the corresponding level in the Topic Name character for character for the match to succeed.

      Non-normative comment. The UTF-8 encoding rules mean that the comparison of Topic Filter and Topic Name could be performed either by comparing the encoded UTF-8 bytes, or by comparing decoded Unicode characters.
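
      To illustrate the 4.7.3 text, here is a minimal sketch of level-by-level matching with no normalization. The function name `topic_matches` is hypothetical, the sketch assumes the filter and name are already valid (it does not check wildcard placement or treat topics starting with `$` specially), and it accepts either two `str` values or two `bytes` values to show that both comparisons agree.

```python
def topic_matches(topic_filter, topic_name):
    """Match a Topic Filter against a Topic Name, level by level.

    Both arguments must be the same type: decoded str, or UTF-8 bytes.
    No normalization or character substitution is applied.
    """
    if isinstance(topic_filter, str):
        sep, single, multi = "/", "+", "#"
    else:
        sep, single, multi = b"/", b"+", b"#"

    filter_levels = topic_filter.split(sep)
    name_levels = topic_name.split(sep)

    for i, level in enumerate(filter_levels):
        if level == multi:
            # Multi-level wildcard: matches the parent and everything below it.
            return True
        if i >= len(name_levels):
            return False
        # A non-wildcarded level must be equal character for character
        # (or byte for byte) to the corresponding Topic Name level.
        if level != single and level != name_levels[i]:
            return False
    return len(filter_levels) == len(name_levels)
```

      With this sketch, `topic_matches("menu/caf\u00e9", "menu/cafe\u0301")` is false even though both strings render identically, because the precomposed and decomposed forms are different character sequences. Comparing the UTF-8 encodings of the same two strings gives the same answer.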

      Update the normative reference (in section 1.3) to Unicode 6.3


      Description

      This issue is based on comments in MQTT-24 and is opened as a Core issue for discussion on the MQTT TC call. I discussed it with my co-editor Andy, who suggested opening a core issue for TC discussion.

      from MQTT-24
      -------------------

      > We should also make a simple statement that UTF-8 encodings MUST NOT have a three-byte initial BOM.

      > A clarification that the encoding MUST NOT be Java's Modified UTF-8, and can contain ASCII NULL

      > At the same time, it's probably worth noting too that certain Unicode combinations are invalid in UTF-8: the use of surrogate pairs from UTF-16 re-encoded, and certain non-transmissible characters (e.g. U+FFFE, from memory), which normally delimit the last 2 characters in a multilingual plane. These restrictions are only a minor burden for Java implementations using the naive methods in String / Character. These restrictions serve to stop propagation of bad data through a network of nodes.

      > Implementations MAY decide to not support the use of ASCII NUL and C0 / C1 control codes / MAY decide to place additional restrictions on supported characters

            People

            • Assignee: Andrew Banks (Andrew_Banks)
            • Reporter: Rahul Gupta (ragupta2)