Section 3.3.2.3.2 says "• 1 (0x01) Byte Indicates that the Payload is UTF-8 Encoded Character Data. The UTF-8 data in the Payload does not include a length prefix, nor is it subject to the restrictions described in section 1.5.4."
There are only two mandatory restrictions in section 1.5.4:
- "The character data in a UTF-8 Encoded String MUST be well-formed UTF-8 as defined by the Unicode specification [Unicode] and restated in RFC 3629 [RFC3629]."
- "A UTF-8 Encoded String MUST NOT include an encoding of the null character U+0000."
I can see why you might want to relax the second requirement, but if the payload is not required to conform to the first one either, then any sequence of bytes could be put in the payload.
At the end of the section it says "The receiver MAY validate that the Payload is of the format indicated, and if it is not send a PUBACK, PUBREC, or DISCONNECT with Reason Code of 0x99 (Payload format invalid) as described in section 4.13."
However, if any sequence of bytes is permitted, how can the receiver ever reject a payload as not being of the format indicated?
Is the idea that the receiver can choose what kind of validation it performs? For example, receiver A could do no validation at all, receiver B could validate that the payload is well-formed UTF-8, and receiver C could require it to be well-formed and not include U+0000?
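To make the question concrete, here is a rough Python sketch of the three receivers I have in mind. The function names and the constant name are my own and purely illustrative; nothing here is taken from the specification beyond the 0x99 Reason Code quoted above. It assumes Python's strict UTF-8 decoder is an acceptable stand-in for "well-formed UTF-8".

```python
from typing import Callable, Optional

# Reason Code quoted from the specification text above.
PAYLOAD_FORMAT_INVALID = 0x99


def validate_none(payload: bytes) -> bool:
    """Receiver A (hypothetical): accepts any sequence of bytes."""
    return True


def validate_well_formed(payload: bytes) -> bool:
    """Receiver B (hypothetical): requires well-formed UTF-8 only."""
    try:
        # Strict decoding rejects truncated sequences, overlong forms
        # and encoded surrogates.
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def validate_well_formed_no_nul(payload: bytes) -> bool:
    """Receiver C (hypothetical): well-formed UTF-8 and no U+0000."""
    # In well-formed UTF-8 the byte 0x00 can only be the encoding of U+0000.
    return validate_well_formed(payload) and b"\x00" not in payload


def reason_code(payload: bytes,
                validator: Callable[[bytes], bool]) -> Optional[int]:
    """Return 0x99 if the chosen validator rejects the payload, else None."""
    return None if validator(payload) else PAYLOAD_FORMAT_INVALID
```

With this sketch, the payload b"a\x00b" is accepted by receivers A and B but rejected by receiver C, and b"\xff\xfe" is accepted only by receiver A. So the same message could be accepted or rejected with 0x99 depending on which receiver it reaches, which is the ambiguity I am asking about.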