[MQTT-320] Expectations of timing accuracy in MQTT implementations - OASIS Technical Committees Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: No Action
Affects Version/s: 5
Fix Version/s: None
Component/s: core
Labels:
None

Proposal:

Non-normative.

In this specification time intervals in seconds are used to indicate when some future event should occur, such as the Keep Alive time.
These intervals specify the minimum time before the event will occur, with no assurance of the exact time interval. It is likely that
for relatively short time intervals, the margin of error will be greater than that for relatively large values.


Description

There are a number of aspects of MQTT which involve timings, starting with the keepalive interval, and some new features introduced in MQTT 5 such as message expiration.

These timers are denoted in seconds (I don't think we have any exceptions to that), but the implementation of those timers in both servers and clients may not actually be at the resolution of small numbers of seconds. I suggest that we include some wording to limit the expectation of accuracy when a small number of seconds is used on such a timer, where "small" is to be defined, but could be less than 10 for instance.

Attachments

Activity

4 older comments

Ken Borgendale (Inactive) added a comment - 09/Nov/16 10:34 PM

For batch timeouts you would generally not want to expire something too soon, but for clock skew or administrative action this can certainly happen. Clock skew is normally pretty small except at very small time intervals. On the other hand I could see an argument for only checking session expiration every minute which would give a very large accuracy issue for an interval of 1 second.

Administrative configuration or action is another issue, somewhat separate from timer accuracy. The client might ask us to keep session state for 10 years, but we might only authorize that user for 10 days. For keepalive we added a return value in CONNACK for this. For message and session expiration this does not really make sense. We do have non-normative text pointing out that an infinite expiration interval does not really mean we will keep the data forever, because of hardware or administrative issues. We need to have similar language for other timeouts.


Andrew Banks (Inactive) added a comment - 10/Nov/16 11:38 AM

Below are paraphrased versions of what we currently have by way of timers
in the specification and some notes on each one.

There is no need for any mention
of absolute time of day or the accuracy with which the time of day is set
because all of the timers use time intervals.

The use of "minimum" intervals and "after" means that implementations are
permitted to make allowances for the measurement accuracy of these intervals.
For example, if there is doubt that a time interval has passed, then the
implementation is allowed to wait longer, until it is certain.

Keep Alive
----------

The MINIMUM time interval in seconds before the client is presumed dead if no
packets are received by the server.

The server MUST disconnect the client AFTER 1.5 times the keep alive interval.

PINGRESP should be sent PROMPTLY by the server; its arrival at the client is
advisory and it's up to the client to decide what to do if it does not arrive.

Note: The "1.5 times" came about to make it clear which end should add a fudge factor
and how much. We wanted to avoid situations where both ends had made
different assumptions.

Note: This is a case where the client should adopt early processing to send the PINGREQ
rather than late processing.

Note: Clock drift can be important here because both ends are measuring the time
interval. The "1.5 times" means this is unlikely to be an issue in practice.
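
A minimal sketch of the Keep Alive arithmetic described above, assuming both ends
measure intervals with a monotonic clock such as Python's time.monotonic(). The
function names and the early_fraction parameter are hypothetical and are not taken
from the specification; only the "1.5 times" factor and the "send PINGREQ early"
advice come from the notes above.

```python
import time


def server_deadline(last_packet_at: float, keep_alive_s: int) -> float:
    """Monotonic instant after which the server may presume the client dead."""
    return last_packet_at + 1.5 * keep_alive_s


def server_should_disconnect(now: float, last_packet_at: float, keep_alive_s: int) -> bool:
    """True only once the 1.5x grace period has definitely passed."""
    if keep_alive_s == 0:
        # A Keep Alive of 0 turns the mechanism off.
        return False
    # Strictly greater: if in doubt, wait longer.
    return now > server_deadline(last_packet_at, keep_alive_s)


def client_ping_due(last_sent_at: float, keep_alive_s: int, early_fraction: float = 0.25) -> float:
    """Monotonic instant at which the client sends PINGREQ, deliberately early.

    early_fraction is an illustrative safety margin, not a value from the spec.
    """
    return last_sent_at + keep_alive_s * (1.0 - early_fraction)


if __name__ == "__main__":
    start = time.monotonic()
    print("client pings after", client_ping_due(start, 60) - start, "s")
    print("server gives up after", server_deadline(start, 60) - start, "s")
```

The client errs on the early side and the server on the late side, so a modest
disagreement between the two clocks does not by itself cause a disconnect.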

Session Expiry
--------------

The MINIMUM time interval in seconds after network disconnection before the
client and server can delete the session state.

Note: Clock drift and clock accuracy could be important here. The specification
advises the client to use the Session Present flag, not the time calculation to
determine if the server still has session state.

Will Delay
----------

The MINIMUM time interval in seconds which the server must wait before publishing
the will message.

Publication Expiry Interval
---------------------------

The MINIMUM time interval in seconds during which the Server should try to
deliver a message. AFTER this time the Server will destroy the message.

The publish packet sent to the client contains the remaining expiry interval
as calculated by the server.

Note: An implementation might decide to store the time of day when delivery
was first attempted. So long as it uses the same clock to calculate the
remaining time, clock skew is unimportant.

Note: The message received by the client might contain an overestimate of
the message lifetime but never an underestimate.
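
One way the "overestimate but never underestimate" property could be obtained is
to round the remaining interval up to a whole second before forwarding. The sketch
below assumes the server stamps each message with a single monotonic clock on
arrival; remaining_expiry and its parameters are hypothetical names used only for
illustration.

```python
import math
from typing import Optional


def remaining_expiry(expiry_interval_s: int, received_at: float, now: float) -> Optional[int]:
    """Remaining expiry interval in whole seconds, or None if the message has expired.

    received_at and now must come from the same clock so that skew between
    clocks cannot affect the result.
    """
    elapsed = now - received_at
    remaining = expiry_interval_s - elapsed
    if remaining <= 0:
        return None  # the minimum lifetime has passed; the message may be discarded
    # Round up: the client may see a slightly long lifetime, never a short one.
    return math.ceil(remaining)


if __name__ == "__main__":
    print(remaining_expiry(10, received_at=100.0, now=103.5))
    print(remaining_expiry(10, received_at=100.0, now=110.0))
```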


Ed Briggs [X] (Inactive) added a comment - 13/Nov/16 8:27 PM

I agree with Andrew's general approach. I would like to point out some counter-examples to a few of the assumptions discussed so far. These are things I encounter in my work on time synchronization.

1. Interval timers are preferable for timeout purposes, but not all systems have interval timers available. For those that do, the resolution of the timer is an important parameter. A timer with 1 second resolution will have an accuracy of +/- 1 second. Some implementations make the following mistake: if the timer of interest (say the MQTT Keep Alive timer) is (say) 60 seconds, they will implement a tick timer that wakes up every 30 or 60 seconds. The result is a 50% probability that the time interval will be too short, and by a substantial amount.

2. Embedded system manufacturers (automotive and consumer markets) are switching to cheaper, less accurate ceramic resonators in place of quartz crystals for clocks, and these typically have a manufacturing frequency tolerance of +/- 0.5%, which means that over a period of 24 hours, the time interval may be off by +/- 432 seconds (7 min, 12 sec), and any two systems may diverge by twice that (14 min, 24 sec). The size of the error increases with the size of the interval. There is an additional 0.1% error due to thermal conditions over an operating temperature range of -40 to 120 F. Automotive devices are subject to wide temperature variations.

3. Not all systems have an interval timer. Some will use time-of-day measurements for this purpose, and this presents some massive problems:
a.) Some systems, like the ubiquitous Raspberry Pi, have no CMOS or battery clock, so the time is set to 1-Jan-1970 when the system starts. If the time is later set to an external reference, there is a jump of 46 years between two adjacent timestamps. And when the system is rebooted, there is a step backward of 46 years. Any time interval measurements would be, well, unreliable.

4. Those systems that have RTC chips with CMOS/Battery backup may be set to an incorrect value, causing the same sort of leap as in #3 if the clock is ever set or synchronized (say with NTP.) Some RTC chips may not be set to UTC, and so there will be seasonal time-zone adjustments that invalidate the time intervals.

5. There are leap seconds in UTC which typically cause a step backward of 1 second (different systems behave differently, but this is a common case)

6. Some embedded systems designers are prohibited from synchronizing the TOD clock to an external source (e.g. GPS, NIST or ntp.org, etc) because of security concerns.
Both are easily spoofed.

7. If an external time reference is used (e.g. GPS in an automobile, or NTP in a residential device), the reference source may become inaccessible (driving in a tunnel, or out of wireless range). This can lead to an arbitrarily large time offset error, and a very large step forward or backward when reachability is restored.
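
One plausible reading of the tick-timer mistake in item 1 is an implementation
that counts wake-ups instead of measuring elapsed time. The sketch below compares
that with checking elapsed time at each tick, under the assumption that ticks fall
on exact multiples of the tick period; both function names are hypothetical.

```python
import math


def fire_time_tick_count(start: int, timeout: int, tick: int) -> int:
    """Naive scheme: fire once timeout/tick wake-ups have been counted."""
    ticks_needed = math.ceil(timeout / tick)
    first_tick = math.floor(start / tick) * tick + tick  # first tick strictly after start
    return first_tick + (ticks_needed - 1) * tick


def fire_time_elapsed_check(start: int, timeout: int, tick: int) -> int:
    """Safer scheme: at each wake-up, fire only if the full timeout has elapsed."""
    deadline = start + timeout
    return math.ceil(deadline / tick) * tick


if __name__ == "__main__":
    start, timeout = 45, 60
    for tick in (30, 60):
        naive = fire_time_tick_count(start, timeout, tick) - start
        safe = fire_time_elapsed_check(start, timeout, tick) - start
        print(f"tick={tick}s: tick counting fires after {naive}s, elapsed check after {safe}s")
```

With these inputs the tick-counting scheme fires well before 60 seconds have
passed, while the elapsed-time check is late by at most one tick and never early.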

In conclusion, I would suggest we avoid any assumptions about both the magnitude and the direction of the phase error.
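
The drift figures in item 2 can be checked with a short calculation. This is a
sketch only: the tolerance is expressed in parts per million to keep the
arithmetic exact, and the helper names are hypothetical.

```python
def max_drift_seconds(interval_s: int, tolerance_ppm: int) -> float:
    """Worst-case error of one clock over interval_s at the given tolerance."""
    return interval_s * tolerance_ppm / 1_000_000


def as_min_sec(seconds: float) -> tuple:
    """Split a duration into whole minutes and remaining seconds."""
    return divmod(round(seconds), 60)


DAY_S = 24 * 60 * 60
RESONATOR_PPM = 5_000  # +/- 0.5 %, the ceramic resonator tolerance quoted above

one_clock = max_drift_seconds(DAY_S, RESONATOR_PPM)
two_clocks = 2 * one_clock  # two devices drifting in opposite directions

if __name__ == "__main__":
    print(one_clock, as_min_sec(one_clock))    # 432.0 (7, 12)
    print(two_clocks, as_min_sec(two_clocks))  # 864.0 (14, 24)
    # Over a 60 second Keep Alive the same tolerance is a fraction of a second.
    print(max_drift_seconds(60, RESONATOR_PPM))
```

At this tolerance the error over a short Keep Alive interval is small compared
with the 1.5 times allowance, while over long intervals such as session expiry it
grows to minutes.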


Ian Craggs (Inactive) added a comment - 14/Nov/16 2:21 PM - edited

In response to Ken's first comment, I don't particularly want to add any specific numbers. The only point I am keen to highlight is that the shorter the interval chosen, the more noticeable any timing discrepancy is likely to be. No particular accuracy can be assumed, unless specified by a particular server implementation.


Andrew Banks (Inactive) added a comment - 17/Nov/16 6:45 AM

Ed, the worst consequences of inaccurate clocks that you mention seem to be:
1) A client may fail to send PINGREQ in time.
The client could mitigate this by asking for a longer keep alive interval or by accepting that it will have to reconnect. Another possibility would be for the Client to not use Keep Alive and rely on a server-initiated ping to detect liveness.

2) A server might expire state sooner than it should.

This would happen if the server clock drifts or is corrected towards the future. The server might know this could happen, for example because it is using a time-of-day clock which is reporting
a time before some hard-coded value which is clearly wrong, like 15 November 2016 (it's 17 November at the time of writing). In this case it could be very lenient with its expiry and perhaps never expire the data.
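
The lenient-expiry idea above might look something like the following sketch. It
assumes the server only has a time-of-day clock and compares it with a hard-coded
date baked in at build time; EARLIEST_PLAUSIBLE, clock_is_plausible and may_expire
are hypothetical names, and the date is simply the example given in this comment.

```python
from datetime import datetime, timezone

# Hypothetical hard-coded value: any clock reading before this is clearly wrong.
EARLIEST_PLAUSIBLE = datetime(2016, 11, 15, tzinfo=timezone.utc)


def clock_is_plausible(reading: datetime) -> bool:
    """True if a time-of-day reading is not obviously unset or reset."""
    return reading >= EARLIEST_PLAUSIBLE


def may_expire(now: datetime, stored_at: datetime, expiry_s: int) -> bool:
    """Decide whether state stored at stored_at may be expired at now."""
    if not (clock_is_plausible(now) and clock_is_plausible(stored_at)):
        return False  # clock cannot be trusted: be lenient and keep the data
    elapsed = (now - stored_at).total_seconds()
    if elapsed < 0:
        return False  # clock stepped backward: keep the data
    return elapsed >= expiry_s


if __name__ == "__main__":
    stored = datetime(2016, 11, 17, 6, 45, tzinfo=timezone.utc)
    unset_clock = datetime(1970, 1, 1, tzinfo=timezone.utc)
    print(may_expire(unset_clock, stored, 3600))
    print(may_expire(datetime(2016, 11, 17, 8, 0, tzinfo=timezone.utc), stored, 3600))
```

This check only catches a clock that is implausibly early. A clock that has been
stepped forward by mistake still looks plausible, so expiry could still happen
sooner than intended in that case.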


People

Assignee:

Ian Craggs (Inactive)

Reporter:

Ian Craggs (Inactive)

Watchers:

4

Dates

Due:

08/Dec/16

Created:

20/Oct/16 2:58 PM

Updated:

20/Oct/17 2:27 PM

Resolved:

20/Oct/17 2:18 PM