[MQTT-2] UTF-8 for will messages - OASIS Technical Committees Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.1
Fix Version/s: 3.1.1
Component/s: core
Labels:
None

Proposal:

463 (Description of Connect Message) and 614 (Payload). Change "The payload contains one or more UTF-8 encoded strings" to "The payload contains one or more encoded fields"
615 Change "These strings, if present.." to "These fields, if present.."
Each of 620 / 637 / 649 / 661 (Client ID, Will Topic, User Name, Password) change "UTF-encoded string" to "field", and add a further sentence saying "It is a UTF-8 encoded string".

641..643 Change the paragraph
"If the Will Flag=1, this is the next UTF-8 encoded string. The Will Message defines the content of the message that is published to the Will Topic if the client is unexpectedly disconnected. The Will Message can contain zero characters."
to say

"If the Will Flag=1, this is the next encoded field. The Will Message defines the payload content of the message that is published to the Will Topic if the client is unexpectedly disconnected. This field, if present, must consist of a 2-byte length (MSB followed by LSB) followed by the payload for the Will Message expressed as a sequence of zero or more bytes. The length gives the number of bytes in the payload that follows and does not include the 2 bytes taken up by the length itself."

645...647 Change the paragraph
"Although the Will Message is UTF-8 encoded in the CONNECT message, when it is published to the Will Topic only the bytes of the message are sent, not the first two length bytes. The message must therefore only consist of 7-bit ASCII characters."
to say

"When the Will Message is published to the Will Topic its payload consists only of the payload portion of this field, not the first two length bytes"
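As a rough illustration of the field layout described in the proposal, the following Python sketch builds a Will Message field as a 2-byte length (MSB first) followed by the raw payload bytes. The function name `encode_will_message` and the constant `MAX_FIELD_LENGTH` are assumptions made for this example; they do not come from the specification.

```python
import struct

MAX_FIELD_LENGTH = 0xFFFF  # largest value a 2-byte length can hold


def encode_will_message(payload: bytes) -> bytes:
    """Return the Will Message field: 2-byte length (MSB first) + payload."""
    if len(payload) > MAX_FIELD_LENGTH:
        raise ValueError("payload does not fit in a 2-byte length")
    # ">H" packs an unsigned 16-bit integer in big-endian order (MSB, then LSB).
    # The length counts only the payload bytes, not the 2 length bytes.
    return struct.pack(">H", len(payload)) + payload
```

Because the payload is treated as plain bytes, the sketch accepts content that is not valid UTF-8, and an empty payload produces a field consisting of just the two zero length bytes.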

Description

The current 3.1 specification states that the will message is encoded in UTF-8 in the CONNECT message but will be published in ASCII encoding by an MQTT broker. This is a major inconsistency in the specification since this is the only case where ASCII encoding is used.

Here's the relevant citation from the specification:
"Although the Will Message is UTF-8 encoded in the CONNECT message, when it is published to the Will Topic only the bytes of the message are sent, not the first two length bytes. The message must therefore only consist of 7-bit ASCII characters."

The payload of a PUBLISH can of course be any raw bytes, but in the case of the will message we should consider removing the inconsistency from the spec. I see two possibilities:

1. The will message in the CONNECT message is not UTF-8 encoded but ASCII encoded.
2. The will message in the will PUBLISH is UTF-8. This would collide with the current spec because empty payloads are possible according to the 3.1 spec (in the case of UTF-8, IIRC, two length bytes have to be sent even with an empty message).

I would vote for option two because this would remove the inconsistency in the spec, and the will message is encoded in UTF-8 in the CONNECT message anyway. I don't think the overhead of the two length bytes in the case of an empty message is a serious problem. We could discuss whether, in the case of an empty payload (= empty UTF-8 string), broker implementations should remove the length bytes automatically to reduce the overhead in PUBLISH messages.

Attachments

Activity

Nick O'Leary (Inactive) added a comment - 02/May/13 4:29 AM

This has been discussed before on the mqtt.org mailing list, and is documented on the wiki: http://mqtt.org/wiki/doku.php/will_message_utf8_support

In summary, the spec is misleading to use terms like 'UTF-8' and ASCII when talking about the encoding of the will message. The will message payload is just like any other message payload - it is just a blob of bytes. The payload bytes might contain a string in some encoding, but that is only of interest to the application sending/receiving the message.

The spec uses the term 'UTF-8 encoded' as lazy shorthand for "encoded with two bytes to represent the length of the payload, followed by the payload itself".

The sentence you cite, although well-intentioned to prevent misunderstanding about what bytes get sent in the payload, in fact adds nothing but confusion and ought to be removed.

Peter Niblett (Inactive) added a comment - 23/May/13 10:11 AM

Firstly, a comment on Dominik's remark about zero-length messages. The UTF-8 RFC (3629) doesn't mention length prefixes (that's something that the Java writeUTF method puts on the front of its string encodings). So there should be nothing to stop you coding a zero-length Will Message in the Connect message, and you will then get zero bytes in the payload of the message itself.

This is because of the part of the sentence that says "when it is published to the Will Topic only the bytes of the message are sent, not the first two length bytes"

This is useful as it makes it clear that the length bytes are part of the variable header structure, not the message itself. This part should be kept in the specification.
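To make the point about the length bytes concrete, here is a minimal Python sketch of how a server might derive the published payload from the Will Message field in CONNECT. The helper name `will_publish_payload` is hypothetical, and the error handling is only one possible choice.

```python
import struct


def will_publish_payload(field: bytes) -> bytes:
    """Strip the 2-byte length prefix and return the bytes to publish."""
    if len(field) < 2:
        raise ValueError("field is too short to hold a 2-byte length")
    # The first two bytes are the length, MSB first.
    (length,) = struct.unpack(">H", field[:2])
    payload = field[2:2 + length]
    if len(payload) != length:
        raise ValueError("field is shorter than its declared length")
    # Only these bytes go into the PUBLISH payload, never the length prefix.
    return payload
```

With this reading, a field of `b"\x00\x00"` yields an empty payload, which matches the zero-length case discussed above.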

However there are two other questions here
1. What is the rationale for saying "The message must therefore only consist of 7-bit ASCII characters."? As far as I can see, there is none and this sentence should be deleted.
2. Should the spec actually define this field to be UTF-8 at all?

Nick has given an explanation of how this came about. We can't go through the spec now and change all references to UTF-8 into "2-byte length-prefixed" binary - that would be a significant change to the current protocol. However there's a reasonable case here to say that we could change this field to be binary. That would make the Will Message consistent with other MQTT messages.

Raphael Cohen (Inactive) added a comment - 23/May/13 10:21 AM

I think the consistency is very important. There's a slight chance this breaks existing implementations that do strict UTF-8 validation on this field, but it should be trivial for clients and servers to support the new definition.
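The compatibility concern can be sketched as follows. This is an illustrative Python example, not taken from any real broker: `accept_will_message` and its `strict_utf8` switch are invented names that contrast a server which validates the field as UTF-8 with one that treats it as binary.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def accept_will_message(payload: bytes, strict_utf8: bool) -> bool:
    """Decide whether a Will Message payload is accepted.

    strict_utf8=True models an implementation that validates the field
    as UTF-8; strict_utf8=False models the binary definition.
    """
    if strict_utf8:
        return is_valid_utf8(payload)
    return True
```

A payload such as `b"\xff\xfe"` is rejected by the strict variant and accepted by the binary one, which is the behavioural difference an existing implementation would need to remove.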

There's also a good argument for a zero-length will message - but does this have any impact on the way some brokers do deletes for retained messages? Not sure it does, but raising the question for others to think about.

So, if we do make this change, then we do need to change the MQTT version / identifier.

Peter Niblett (Inactive) added a comment - 05/Jun/13 12:02 PM

As Raph says, making the field binary will mean that any existing server implementation that validates the characters in this field will have to change if it is to be compliant. We would also encourage clients to expose this field in their APIs in a way that allows binary data to be supplied. However, I think it is the right thing to do.

Zero length messages are already allowed by the Input Spec, so we shouldn't forbid their use now. I assume that you can use the Will Mechanism to delete a retained message should you really want to do that.

In order to progress this issue, I will make the following proposal (line numbers from wd 03 draft)
--------------------------------------------------------------------------------------

463 (Description of Connect Message) and 614 (Payload). Change "The payload contains one or more UTF-8 encoded strings" to "The payload contains one or more encoded fields"
615 Change "These strings, if present.." to "These fields, if present.."
Each of 620 / 637 / 649 / 661 (Client ID, Will Topic, User Name, Password) change "UTF-encoded string" to "field", and add a further sentence saying "It is a UTF-8 encoded string".

641..643 Change the paragraph
"If the Will Flag=1, this is the next UTF-8 encoded string. The Will Message defines the content of the message that is published to the Will Topic if the client is unexpectedly disconnected. The Will Message can contain zero characters."
to say

"If the Will Flag=1, this is the next encoded field. The Will Message defines the payload content of the message that is published to the Will Topic if the client is unexpectedly disconnected. This field, if present, must consist of a 2-byte length (MSB followed by LSB) followed by the payload for the Will Message expressed as a sequence of zero or more bytes. The length gives the number of bytes in the payload that follows and does not include the 2 bytes taken up by the length itself."

645...647 Change the paragraph
"Although the Will Message is UTF-8 encoded in the CONNECT message, when it is published to the Will Topic only the bytes of the message are sent, not the first two length bytes. The message must therefore only consist of 7-bit ASCII characters."
to say

"When the Will Message is published to the Will Topic its payload consists only of the payload portion of this field, not the first two length bytes"

---------------
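The field order in this proposal could be parsed roughly as shown in the Python sketch below. All names (`ConnectPayload`, `parse_connect_payload`, `_read_field`) are assumptions for illustration, and the flags are passed in as booleans rather than read from the variable header. Following the proposal text, Client ID, Will Topic, User Name and Password are decoded as UTF-8, while the Will Message is kept as raw bytes.

```python
import struct
from typing import NamedTuple, Optional, Tuple


class ConnectPayload(NamedTuple):
    client_id: str
    will_topic: Optional[str]
    will_message: Optional[bytes]  # raw bytes, not decoded
    user_name: Optional[str]
    password: Optional[str]


def _read_field(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Read one 2-byte length-prefixed field starting at pos."""
    if pos + 2 > len(buf):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">H", buf, pos)
    start, end = pos + 2, pos + 2 + length
    if end > len(buf):
        raise ValueError("truncated field body")
    return buf[start:end], end


def parse_connect_payload(buf: bytes, will_flag: bool,
                          user_name_flag: bool,
                          password_flag: bool) -> ConnectPayload:
    raw, pos = _read_field(buf, 0)
    client_id = raw.decode("utf-8")  # always present, first field
    will_topic = will_message = user_name = password = None
    if will_flag:
        raw, pos = _read_field(buf, pos)
        will_topic = raw.decode("utf-8")
        # The Will Message is binary under this proposal: no decoding.
        will_message, pos = _read_field(buf, pos)
    if user_name_flag:
        raw, pos = _read_field(buf, pos)
        user_name = raw.decode("utf-8")
    if password_flag:
        raw, pos = _read_field(buf, pos)
        password = raw.decode("utf-8")
    return ConnectPayload(client_id, will_topic, will_message,
                          user_name, password)
```

For example, the buffer `b"\x00\x02c1\x00\x01t\x00\x02\xff\x00"` with only the Will Flag set parses to client ID `c1`, Will Topic `t`, and a two-byte Will Message that is not valid UTF-8.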

Richard Coppen (Inactive) added a comment - 06/Jun/13 12:13 PM

discussed on TC call 06.06.2013
Peter's proposal agreed

Rahul Gupta (Inactive) added a comment - 19/Jun/13 3:12 AM

Changes done in WD05 -
---------------------------------

line 601 -> The payload contains one or more encoded fields
line 775 -> The payload of the CONNECT control packet contains one or more encoded fields, based on the flags in the variable header.
line 780 -> The Client Identifier is always present and is the first field in the payload. The Client identifier is a UTF-8 encoded string,
line 797 -> If Will Flag=1, this is the next field in payload. Will Topic is a UTF-8 encoded string. The
line 808 -> If the Will Flag=1, this is the next field in payload. Will Message is a UTF-8 encoded string. The Will Message defines the payload content of the message that is published to the Will Topic if the client is unexpectedly disconnected. This field, if present, must consist of a 2-byte length (MSB followed by LSB) followed by the payload for the Will Message expressed as a sequence of zero or more bytes. The length gives the number of bytes in the payload that follows and does not include the 2 bytes taken up by the length itself.
line 815 -> When the Will Message is published to the Will Topic its payload consists only of the payload portion of this field, not the first two length bytes
line 818 -> If the User Name Flag=1, this is the next field in payload. User Name is a UTF-8 encoded string.
line 825 -> If the Password flag is set (1), then next field is the password corresponding to the user name which is connecting, and can be used by the server for authentication of the client. Password is a UTF-8 encoded string.

Richard Coppen (Inactive) added a comment - 24/Jun/13 9:42 AM

Resolved in WD05

Richard Coppen (Inactive) added a comment - 28/Jun/13 9:39 AM

Changes in WD05 (line numbers slightly out due to edits)

People

Assignee:

Rahul Gupta (Inactive)

Reporter:

Dominik Obermaier (Inactive)

Watchers:

1

Dates

Created:

02/May/13 3:58 AM

Updated:

28/Jun/13 9:39 AM

Resolved:

24/Jun/13 9:42 AM