[chrony-users] Possible bug in PPS support

Post by Rob Janssen
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 0 13h -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx 1 10 377 250 +3462us[+3462us] +/- 10ms
As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.
I would have expected it to drop back to using those network time servers after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2. When it would operate that way, we would have
received an alert.
Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms. So it really is not considering those network time sources anymore.

It would have switched eventually when the estimated error of the
refclock was larger than the error of the NTP source (10
milliseconds).

Have you saved the tracking or sourcestats output? From the skew we
could estimate how long it would take.

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It's a feature, but there is apparently a bug which may make the
switch take much longer than it should.

Post by Rob Janssen
How could we work around that in this case?

Decreasing the maximum number of samples of the NTP source with the
maxsamples option should reduce the maximum span (as reported in
sourcestats) and also the time it will switch from unreachable
sources.

Increasing the maxclockerror would do that too if it were included in
the source selection. Even with the default value it would take only a few
hours to switch in your case.

I thought it was included when I responded a couple of days ago to a
similar question on this list. I just checked and it's not included.
I'll look into that.

--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.

Rob Janssen

2017-10-23 16:06:17 UTC

It would have switched eventually when the estimated error of the
refclock was larger than the error of the NTP source (10
milliseconds).

That does not seem reasonable... should it not refer to the estimated error of the source itself rather
than to the network source?

Post by Miroslav Lichvar
Have you saved the tracking or sourcestats output? From the skew we
could estimate how long it would take.

Ok here is the tracking.log, the last few lines before it failed:

2017-10-21 22:18:30 PPS 1 -12.275 0.048 -6.697e-07 N 1 4.525e-07 1.504e-07
2017-10-21 22:18:46 PPS 1 -12.279 0.030 -1.661e-07 N 1 3.788e-07 3.638e-11
2017-10-21 22:19:02 PPS 1 -12.284 0.029 -7.386e-07 N 1 4.446e-07 1.177e-07
2017-10-21 22:19:18 PPS 1 -12.286 0.020 -6.956e-08 N 1 3.629e-07 4.908e-11
2017-10-21 22:19:34 PPS 1 -12.290 0.022 -7.190e-07 N 1 4.091e-07 6.094e-08
2017-10-21 22:19:50 PPS 1 -12.292 0.018 -1.540e-07 N 1 3.709e-07 4.822e-11
2017-10-21 22:20:06 PPS 1 -12.295 0.017 -4.841e-07 N 1 4.030e-07 1.114e-07
2017-10-21 22:20:22 PPS 1 -12.297 0.014 -1.363e-07 N 1 3.626e-07 8.935e-09

After this, nothing was logged until I restarted chronyd 13 hours later and it synced to the network sources.

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It's a feature, but there is apparently a bug which may make the
switch take much longer than it should.

However, we use this form of time synchronization because we need the clock to be within about 20us
of real time. When the PPS sync is lost and only network sync is achieved, that is not really attainable.
So we need some indication whenever there is no PPS sync.
Would it not be reasonable to indicate loss of PPS sync when the Reach value becomes zero?
Ok, it could be that freewheeling keeps a more accurate time than syncing to another source, but
at least the error condition should be monitored.

Post by Rob Janssen
How could we work around that in this case?

Decreasing the maximum number of samples of the NTP source with the
maxsamples option should reduce the maximum span (as reported in
sourcestats) and also the time it will switch from unreachable
sources.
Increasing the maxclockerror would do that too if it was included in
the source selection. Even with the default value it would take only a few
hours to switch in your case.

Ok but rather than "only a few hours" I would like to see "only a few minutes".
The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds? That would be fine.

Rob

Bill Unruh

2017-10-23 16:33:05 UTC

...

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It's a feature, but there is apparently a bug which may make the
switch take much longer than it should.

However, we use this form of time synchronization because we need the clock
to be within about 20us
of real time. When the PPS sync is lost and only network sync is achieved,
that is not really attainable.
So we need some indication whenever there is no PPS sync.
Would it not be reasonable to indicate loss of PPS sync when the Reach value becomes zero?
Ok, it could be that freewheeling keeps a more accurate time than syncing to
another source, but
at least the error condition should be monitored.

Post by Rob Janssen
How could we work around that in this case?

Decreasing the maximum number of samples of the NTP source with the
maxsamples option should reduce the maximum span (as reported in
sourcestats) and also the time it will switch from unreachable
sources.
Increasing the maxclockerror would do that too if it was included in
the source selection. Even with the default value it would take only a few
hours to switch in your case.

Ok but rather than "only a few hours" I would like to see "only a few minutes".

But that would be totally ridiculous. The offset of the local clock from UTC
after even a few hours is still far far better than that from the network, and
far better even than 20us. Remember what you want to know is how far the local
clock is from UTC, not whether or not the local clock has not heard from PPS
in the past few minutes.

Post by Rob Janssen
The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds? That would be fine.

No. You really need to think through what you want and what the time on your
server machine delivers. After all, if the computer clock in your local machine
were an exact track of UTC, always to attoseconds, and you used the GPS only
to determine the initial offset, then it would be silly to
throw away that source just because the PPS had not been heard from.


Rob Janssen

2017-10-23 16:44:21 UTC

Post by Rob Janssen
Ok but rather than "only a few hours" I would like to see "only a few minutes".

You don't support my calculation that if the clock apparently wandered away 3400us
after 13 hours, it would take about 5 minutes to wander 20us?
I would think it is a best-case calculation as it assumes a linear drift in one direction.
In practice it will probably wobble, and take less than 5 minutes to wander 20us.

Please note we are talking MICROseconds here. Not MILLIseconds. I don't think
many standard systems will remain within 20us for several hours if left without sync.
(it would likely require some TCXO clock option)
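
The 5-minute figure above follows from simple proportionality. A minimal sketch of that arithmetic, assuming (as Rob does) a constant, linear drift; the function name `time_to_budget_min` is made up for illustration:

```shell
# Minutes until a clock accumulates budget_us of error, assuming the same
# constant drift that produced drift_us of error over the given hours.
# usage: time_to_budget_min <drift_us> <hours> <budget_us>
time_to_budget_min() {
    awk -v drift_us="$1" -v hours="$2" -v budget_us="$3" \
        'BEGIN { printf "%.1f\n", budget_us / drift_us * hours * 60 }'
}

# 3400 us accumulated in 13 hours, 20 us error budget
time_to_budget_min 3400 13 20
```

This is only a mean-rate estimate. It says nothing about how the drift was distributed over the 13 hours.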

Post by Rob Janssen
The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds? That would be fine.

We are not interested in "time that is likely a good estimation". We require accurate time and if
we do not have it, or do not have certainty about it, we need to shutdown our application.
So we require some monitoring. Of course I can add monitoring of "sources" or "sourcestats"
to the monitoring of "tracking" that we currently do, and alert when "Reach" of the PPS
clock is zero. That is probably our quickest solution. However, I would have expected this
error condition (missing PPS pulses) to be somehow reflected in the "tracking" output.

Rob

Bill Unruh

2017-10-23 17:24:54 UTC

Post by Rob Janssen
Ok but rather than "only a few hours" I would like to see "only a few minutes".

You don't support my calculation that if the clock apparently wandered away 3400us

Again, no evidence of that 3400 us.

From the evidence that chrony has, pps does NOT wander that badly in 13 hrs.
Remember chrony constantly measures both the standard deviation in the offset
AND in the rate. So it has a good estimate of how far the offset will wander
in that time. And it is NOT 3400us. So you need to tell us how you measure
that 3400us.

Post by Rob Janssen
after 13 hours, it would take about 5 minutes to wander 20us?

No, it would not. chrony has measured it, and it is not that much.

Post by Rob Janssen
I would think it is a best-case calculation as it assumes a linear drift in one direction.
In practice it will probably wobble, and take less than 5 minutes to wander 20us.
Please note we are talking MICROseconds here. Not MILLIseconds. I don't think
many standard systems will remain within 20us for several hours if left without sync.
(it would likely require some TCXO clock option)

Sure they could, if the temperature is constant, as you claim. Temperature is the main cause of
changes in drift rate.

Post by Rob Janssen
The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds? That would be fine.

We are not interested in "time that is likely a good estimation". We require
accurate time and if

I am sorry, but nothing will give you "accurate time". Not even GPS. What it
can give you is an estimate of the time and the accuracy of that estimate.

Post by Rob Janssen
we do not have it, or do not have certainty about it, we need to shutdown our application.
So we require some monitoring. Of course I can add monitoring of "sources"
or "sourcestats"
to the monitoring of "tracking" that we currently do, and alert when "Reach" of the PPS
clock is zero. That is probably our quickest solution. However, I would
have expected this
error condition (missing PPS pulses) to be somehow reflected in the "tracking" output.

Why?


Miroslav Lichvar

2017-10-23 17:44:06 UTC

Post by Rob Janssen
You don't support my calculation that if the clock apparently wandered away 3400us

Again, no evidence of that 3400 us.

If I understand it correctly, 3.4ms was the offset of the NTP source
13 hours after the PPS stopped working. The stddev of the NTP source
from sourcestats is ~50 microseconds, so if the offset was originally
better than 1.6ms (there are 3 different sources in the original
report with -0.1ms, 1.5ms, 1.5ms offsets), it drifted at least by
~1.8ms in that time.

If there was a significant change in the temperature, the error gained
in 13 hours could be much larger. On one of my servers with PPS I see
that the frequency offset can change by 0.5 ppm in just a few seconds.

Rob Janssen

2017-10-23 17:56:26 UTC

Post by Rob Janssen
You don't support my calculation that if the clock apparently wandered away 3400us

Again, no evidence of that 3400 us.

Yes, I forgot that there was a systematic 1.4 ms offset at the time the PPS sync was
active so after it ran unsynced for 13h and had a 3.4 ms offset the drift was more like
2ms instead of 3.4ms.
However, that still is 2 orders of magnitude more than we can allow. So we certainly
need to alert on this condition, we cannot just freewheel for 13 hours and assume the
time is still accurate enough.

I am now testing with the root delay/dispersion. A couple of minutes after the PPS
has been removed, the root delay remains at 0.000000001 seconds but the root dispersion
now has increased to 0.000662125 seconds. That certainly is a value that is immediately
affected by the lack of sync, however I need to determine a threshold value for the
monitoring alert.
The tracking also shows "System time : 0.000000009 seconds fast of NTP time"
but I cannot believe the time is still that accurate.

I understand now that the 10ms value shown in "chronyc sources" is based on the 20ms
round-trip time of the network towards the NTP source. This time is quite constant as
indicated by the low Std Dev but the fixed RTT apparently makes chrony believe the
network is dodgy (as Bill expresses it). The only thing dodgy about it is that for this
particular site there is a systematic offset in the propagation time from/towards the
site of 1.4 ms resulting in the 1.4ms offset observed when PPS is available, probably
caused by asymmetric routing. Other than that, it is quite stable. It is a network
designed for distribution of audio and video to transmitter sites, well dimensioned with
guaranteed bandwidth and not overloaded at any time.

Rob

Bill Unruh

2017-10-23 18:59:48 UTC

If you really need 20usec, then relying on one gps is certainly a bad
decision. You should have two or three machines all with independent gps
sources so you could catch one of them going rogue, or quitting.

You seem to be saying that having no time source whatsoever is better than
having one which may be off by 20us? I think you need to set out the real
conditions that you need in detail ("We need accuracy to 20us" could be
because it was a number that some administrator with absolutely no idea of
time came up with, or it could be a legal requirement, or it could be "we
should be able to do that" kind of requirement). Then implement a system which
can with some confidence deliver that. There are of course no guarantees. A
nuke over the building would severely degrade the accuracy of the clocks in a
way that was totally unpredictable beforehand. Or a power failure. etc.

William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ ***@physics.ubc.ca
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/


Rob Janssen

2017-10-23 19:14:05 UTC

Post by Bill Unruh
If you really need 20usec, then relying on one gps is certainly a bad
decision. You should have two or three machines all with independent gps
sources so you could catch one of them going rogue, or quitting.

The GPSDOs we are using are 2-3 orders of magnitude better than that.
These are not your typical $50 modules, but professional GPSDO with OCXO
or better oscillator.
Monitoring of their accuracy is done by their owners, we only get the signal
via distribution amplifiers. That is why we would prefer to have some additional
validation, like the PPS signal completely missing.
(which could also be caused by a mistakenly unplugged or cut cable, which
would never be detected by the GPSDO monitoring)

Post by Bill Unruh
You seem to be saying that having no time source whatsoever is better than
having one which may be off by 20us? I think you need to set out the real
conditions that you need in detail ("We need accuracy to 20us" could be
because it was a number that some administrator with absolutely no idea of
time came up with, or it could be a legal requirement, or it could be "we
should be able to do that" kind of requirement)

The time is used for a single-channel simulcast transmitter system. That is,
the same signal is transmitted from multiple locations on the same frequency
at the same time. When this is not done within 20us at the same time, it will cause
severe distortion of the signal. When we don't know we are within 20us, we
prefer to not transmit at all, so disable that particular transmitter.

I think I know better what is involved and what the limitations are than you do.
Also, I prefer to discuss with Miroslav, who concentrates on the problem under
discussion rather than casting doubt on everything. Thank you for your input
until now.

Rob

Bill Unruh

2017-10-23 20:21:01 UTC

The GPSDOs we are using are 2-3 orders of magnitude better than that.
These are not your typical $50 modules, but professional GPSDO with OCXO
or better oscillator.

It is not the accuracy of the individual GPS but the fallback in case one
of them goes mad (as happened to you). You do not want them on the same
machine unless they have hardware timestamping, since the interrupt latency is
far larger than 1us for servicing each interrupt.

Post by Rob Janssen
Monitoring of their accuracy is done by their owners, we only get the signal
via distribution amplifiers. That is why we would prefer to have some additional
validation, like the PPS signal completely missing.
(which could also be caused by a mistakenly unplugged or cut cable, which
would never be detected by the GPSDO monitoring)

As I said, you could do that with a cron job checking every 5 min.

OK then you should have redundancy on each transmitter, and monitoring eg via
that cron job.

Post by Rob Janssen
I think I know better what is involved and what the limitations are than you do.

Of course. But that is not what is at issue here.

Post by Rob Janssen
Also, I prefer to discuss with Miroslav, who concentrates on the problem under
discussion rather than casting doubt on everything. Thank you for your input
until now.

???
You are making claims. I ask for what your evidence is for those claims, and
you have never given the evidence. Operating on false evidence is a sure way
of making bad decisions.
I am not casting doubt on everything. I am trying to explain how chrony works
and why it does what it does.


Rob Janssen

2017-10-23 20:34:12 UTC

The GPSDOs we are using are 2-3 orders of magnitude better than that.
These are not your typical $50 modules, but professional GPSDO with OCXO
or better oscillator.

Again you are wandering away from the topic Bill!
The discussion is about detection of a possible problem, not about availability.
I did not specify availability of the system, it may well be down when there is a component
failure, but we only want to know about it.

As I said, you could do that with a cron job checking every 5 min.

We already have a comprehensive monitoring system based on Nagios, that in case
of this service uses "chronyc -h host tracking" to regularly retrieve the status of chrony
and alerts responsible people when something is wrong.

The issue is that it monitors "stratum" and "last offset" and it failed to trigger when the
PPS signal went away, even after 13 hours. It would have triggered when stratum
went above 1 or last offset above 20us, but it didn't. Both of these values remain frozen
when there is no PPS.

That is the issue I want to rectify, but that won't happen when I discuss with you.
Fortunately there is Miroslav who gave me useful hints.

Rob

Bill Unruh

2017-10-24 03:34:41 UTC

If that is all you want, then you could look at the "refclock" log and see
when the last successful input came in. If it is more than say 15 min ago then
the reach would be down to 0 and the refclock would have stopped.

Or you could run chronyc in a cron, and use the sources and look at the reach and if it
was 0 hit an error flag.
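
A minimal sketch of the second suggestion, assuming the column layout of the `chronyc sources` output pasted earlier in this thread (MS, name, stratum, poll, reach). The function name `source_reachable` is made up for illustration:

```shell
# Reads `chronyc sources` output on stdin. Succeeds only if the named
# source is listed and its Reach column (5th field) is not 0.
# usage: chronyc sources | source_reachable PPS
source_reachable() {
    awk -v name="$1" '
        $2 == name { found = 1; if ($5 != "0") ok = 1 }
        END { exit !(found && ok) }
    '
}

# Example cron-style use (the alert action is left to the caller):
#   chronyc sources | source_reachable PPS || echo "PPS unreachable"
```

A source that is missing from the list altogether is treated the same as one with Reach 0, which seems the safer default for an alert.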

William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ ***@physics.ubc.ca
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/


Bill Unruh

2017-10-24 07:27:31 UTC

Post by Bill Unruh
If that is all you want, then you could look at the "refclock" log and see

Sorry. That's refclocks.log


Bill Unruh

2017-10-23 18:53:47 UTC

Post by Rob Janssen
You don't support my calculation that if the clock apparently wandered away 3400us

Again, no evidence of that 3400 us.

If I understand it correctly, 3.4ms was the offset of the NTP source

My question is how he determined that the offset was 3.4 ms after 13 hours.
Simply looking at the offset from one of the NTP servers does not cut it.
That is only 2 std dev from the mean.

Post by Miroslav Lichvar
13 hours after the PPS stopped working. The stddev of the NTP source
from sourcestats is ~50 microseconds, so if the offset was originally
better than 1.6ms (there are 3 different sources in the original
report with -0.1ms, 1.5ms, 1.5ms offsets), it drifted at least by
~1.8ms in that time.
If there was a significant change in the temperature, the error gained
in 13 hours could be much larger. On one of my servers with PPS I see
that the frequency offset can change by 0.5 ppm in just a few seconds.

I agree, and chrony PPS does a bad job of measuring that. Perhaps chrony
should keep track of the drift over a much longer period than the measurement
period (a maximum of 64 samples at 16 sec per sample is only about 15 min), so keeping
a list of the drift rate over, say, a day would give a much better feeling for
the drift wander due to temperature differences, etc. It is certainly true that the
drift fluctuations are not Gaussian, so an estimate derived from 15 min really
gives a very poor estimate of the fluctuations on the time scale of hours or
days.

So that part of his concern is certainly valid. On the other hand, being
worried about the loss of connectivity on the 15 min time scale probably is
not, unless he has evidence.
But the evidence is all there in the measurement logs when the pps is running.
He could use that to estimate what the skew is over a variety of time periods.


Rob Janssen

2017-10-23 19:00:42 UTC

Post by Rob Janssen
You don't support my calculation that if the clock apparently wandered away 3400us

Again, no evidence of that 3400 us.

If I understand it correctly, 3.4ms was the offset of the NTP source

My question is how he determined that the offset was 3.4 ms after 13 hours. Simply looking at the offset from one of the NTP servers does not cut it.
That is only 2 std dev from the mean.

I pasted the output for a single server but the other 2 were within very short offset of that:
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 0 13h -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx 1 10 377 17m +3476us[+3476us] +/- 9930us
^- xxxxxx.xxxx.xxx 1 10 377 250 +3462us[+3462us] +/- 10ms
^- xxx.xxxxxx.xxxx.xxx 1 10 377 299 +3459us[+3459us] +/- 10ms

I am confident that those offsets were correct, but as I mentioned I forgot to subtract the offset that
was already there when the PPS sync was present (due to network delay asymmetry).

So that part of his concern is certainly valid. On the other hand, being
worried about the loss of connectivity on the 15 min time scale probably is
not, unless he has evidence.
But the evidence is all there in the measurement logs when the pps is running.
He could use that to estimate what the skew is over a variety of time periods.

Again, I am not interested in the performance when the clock is free-running as I do not believe
that it is good enough for our application anyway.
I am interested in monitoring/detecting that the clock is not synced to (recent) PPS input.

Rob

Miroslav Lichvar

2017-10-23 16:55:29 UTC

Post by Rob Janssen
Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms. So it really is not considering those network time sources anymore.

It would have switched eventually when the estimated error of the
refclock was larger than the error of the NTP source (10
milliseconds).

That does not seem reasonable... should it not refer to the estimated error of the source itself rather
than to the network source?

I'm not sure what you mean here.

Post by Miroslav Lichvar
Have you saved the tracking or sourcestats output? From the skew we
could estimate how long it would take.

2017-10-21 22:18:30 PPS 1 -12.275 0.048 -6.697e-07 N 1 4.525e-07 1.504e-07
2017-10-21 22:18:46 PPS 1 -12.279 0.030 -1.661e-07 N 1 3.788e-07 3.638e-11
2017-10-21 22:19:02 PPS 1 -12.284 0.029 -7.386e-07 N 1 4.446e-07 1.177e-07
2017-10-21 22:19:18 PPS 1 -12.286 0.020 -6.956e-08 N 1 3.629e-07 4.908e-11
2017-10-21 22:19:34 PPS 1 -12.290 0.022 -7.190e-07 N 1 4.091e-07 6.094e-08
2017-10-21 22:19:50 PPS 1 -12.292 0.018 -1.540e-07 N 1 3.709e-07 4.822e-11
2017-10-21 22:20:06 PPS 1 -12.295 0.017 -4.841e-07 N 1 4.030e-07 1.114e-07
2017-10-21 22:20:22 PPS 1 -12.297 0.014 -1.363e-07 N 1 3.626e-07 8.935e-09
After this, nothing was logged until I restarted chronyd 13 hours later and it synced to the network sources.

The last skew was 14 ppb, so it would take about 8 days to accumulate
10 milliseconds worth of dispersion. The other check comparing the age
of samples between sources would kick in sooner (64 * 1024 seconds =
~18 hours).
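
Both estimates can be reproduced with a short calculation. This is a sketch of the arithmetic only, not of chronyd's selection code; the function names are made up for illustration, and the 0.014 ppm skew is read from the last tracking.log line above:

```shell
# Days for a frequency uncertainty of skew_ppb to accumulate limit_ms
# of dispersion.
# usage: dispersion_time_days <skew_ppb> <limit_ms>
dispersion_time_days() {
    awk -v skew_ppb="$1" -v limit_ms="$2" \
        'BEGIN { printf "%.1f\n", (limit_ms * 1e-3) / (skew_ppb * 1e-9) / 86400 }'
}

# Hours spanned by n samples taken every poll_s seconds.
# usage: sample_span_hours <n> <poll_s>
sample_span_hours() {
    awk -v n="$1" -v poll_s="$2" 'BEGIN { printf "%.1f\n", n * poll_s / 3600 }'
}

dispersion_time_days 14 10     # skew 14 ppb, 10 ms error of the NTP source
sample_span_hours 64 1024      # 64 samples at poll 10 (1024 s)
```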

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It's a feature, but there is apparently a bug which may make the
switch take much longer than it should.

It's not an error condition in chrony, as it was designed for
intermittent "connection". Refclocks are handled in the same way as
NTP sources.

I think the best approach for checking the accuracy of the clock is to
monitor the root delay+dispersion. That's the estimated maximum error
of the clock. If you really wanted to make sure an update of the clock
was made in the last X seconds, you can check the reference time.
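
A minimal sketch of such a check, written as a Nagios-style plugin function. It uses the bound given in the chronyc documentation, |system time offset| + root dispersion + 0.5 * root delay. It assumes the CSV output of `chronyc -c tracking` carries the system time offset, root delay and root dispersion in fields 5, 11 and 12; verify those positions against your chrony version. The function names and the 20 us threshold are illustrative only:

```shell
# Reads one line of `chronyc -c tracking` on stdin and prints the estimated
# maximum clock error in microseconds.
# ASSUMPTION: field 5 = system time offset, 11 = root delay, 12 = root dispersion.
max_error_us() {
    awk -F , '{
        sys = $5; if (sys < 0) sys = -sys
        printf "%.1f\n", (sys + $12 + 0.5 * $11) * 1e6
    }'
}

# usage: chronyc -c tracking | check_max_error <limit_us>
# Exit status follows the Nagios convention: 0 = OK, 2 = CRITICAL.
check_max_error() {
    limit_us="$1"
    err_us="$(max_error_us)"
    if awk -v e="$err_us" -v l="$limit_us" 'BEGIN { exit !(e > l) }'; then
        echo "CRITICAL: estimated max error ${err_us} us exceeds ${limit_us} us"
        return 2
    fi
    echo "OK: estimated max error ${err_us} us"
    return 0
}
```

Unlike stratum or the selected source, this value keeps growing while no source is updating the clock, so it does not depend on chronyd reselecting.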

Post by Rob Janssen
Ok but rather than "only a few hours" I would like to see "only a few minutes".
The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds? That would be fine.

No, that would be 103 minutes if the span didn't change in that time.

Rob Janssen

2017-10-23 17:01:30 UTC

Post by Miroslav Lichvar
The last skew was 14 ppb, so it would take about 8 days to accumulate
10 milliseconds worth of dispersion.

Can you explain where the 10ms comes from? I know it is displayed in the "sources" output,
but how is it calculated? It is way above the StdDev indicated in the "sourcestats".
And of course it is also way above our usual accuracy.

Rob

Miroslav Lichvar

2017-10-23 17:05:44 UTC

Post by Miroslav Lichvar
The last skew was 14 ppb, so it would take about 8 days to accumulate
10 milliseconds worth of dispersion.

It includes the root delay and distance. Check "chronyc ntpdata". Most
of that is probably the round-trip time to the server.

One point I forgot to make is that even if chronyd reselected
immediately after the reach value of the PPS refclock got to 0, like
ntpd does, checking stratum or selected source wouldn't be a reliable
way to monitor the accuracy, because the reselection wouldn't happen
if the NTP source was down too.

Rob Janssen

2017-10-23 17:10:28 UTC

One point I forgot to make is that even if chronyd reselected immediately after the reach value of the PPS refclock got to 0, like ntpd does, checking stratum or selected source wouldn't be a reliable way to monitor the accuracy, because the
reselection wouldn't happen if the NTP source was down too.

Ok I will experiment with watching the root delay and dispersion and see how they behave when removing PPS on my test system.
At the moment (after being locked for 8 hours or so) it shows:

Root delay : 0.000000001 seconds
Root dispersion : 0.000010389 seconds

Rob

Rob Janssen

2017-10-24 21:14:21 UTC

Post by Miroslav Lichvar
I think the best approach for checking the accuracy of the clock is to
monitor the root delay+dispersion. That's the estimated maximum error
of the clock. If you really wanted to make sure an update of the clock
was made in the last X seconds, you can check the reference time.

I am now monitoring the Root dispersion and this appears to work OK, after some tweaking
of the threshold value. The reference time unfortunately is in a format that is not easy to
check for "being recent" in a simple script, it would be nice if there was a "seconds since epoch"
field as well (as there is in ntpd/ntpq). But well, it looks like the dispersion increases
rapidly when there is no PPS reference and this is much like what I require.
(after all, the same is happening to the uncertainty of the time for our application)

Thanks for the hint!

Rob

Miroslav Lichvar

2017-10-25 06:29:41 UTC

Post by Rob Janssen
I am now monitoring the Root dispersion and this appears to work OK, after some tweaking
of the threshold value. The reference time unfortunately is in a format that is not easy to
check for "being recent" in a simple script, it would be nice if there was a "seconds since epoch"
field as well (as there is in ntpd/ntpq).

With the -c option, which is available in newer chrony versions, the
reference timestamp is printed in "seconds since epoch".

$ chronyc -c tracking | awk -F , '{ print $4 }'
1508912704.491908798
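
Building on that, a "was the clock updated recently" check could look roughly like the sketch below. The helper names and the 60-second limit are assumptions for illustration. The only thing taken from chronyc is that field 4 of `chronyc -c tracking` is the reference time in seconds since the epoch, as shown above.

```shell
#!/bin/sh
# Sketch only: helper names and limits are made up for illustration.

# Print the whole seconds elapsed between REF_TIME and NOW
# (both given as seconds since the epoch).
ref_age() {
    awk -v ref="$1" -v now="$2" 'BEGIN { printf "%d\n", int(now - ref) }'
}

# Succeed only if the reference time is at most MAX_AGE seconds old.
# usage: ref_is_recent REF_TIME NOW MAX_AGE
ref_is_recent() {
    [ "$(ref_age "$1" "$2")" -le "$3" ]
}

# Live use (not run here):
#   ref=$(chronyc -c tracking | awk -F , '{ print $4 }')
#   ref_is_recent "$ref" "$(date +%s)" 60 || echo "no clock update in 60 s"
```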

Rob Janssen

2017-10-25 07:40:09 UTC

With the -c option, which is available in newer chrony versions, the
reference timestamp is printed in "seconds since epoch".
$ chronyc -c tracking | awk -F , '{ print $4 }'
1508912704.491908798

Thanks! I have updated to 3.2 but not re-read the manpage.
This format is much easier to parse in our monitoring plugin, I'll rework it to use this feature.

Rob

Bill Unruh

2017-10-23 16:12:06 UTC

Post by Rob Janssen
210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 377 24 +218ns[ +278ns] +/- 124ns
^- xxxxxx.xxxx.xxx 1 10 377 877 -147us[ -122us] +/- 11ms
^- xxxxxx.xxxx.xxx 1 10 377 14 +1480us[+1480us] +/- 10ms
^- xxx.xxxxxx.xxxx.xxx 1 10 377 345 +1446us[+1447us] +/- 10ms
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 0 13h -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx 1 10 377 250 +3462us[+3462us] +/- 10ms
As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.
I would have expected it to drop back to using those network time servers after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2. When it would operate that way, we would have
received an alert.
Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms. So it really is not considering those network time sources anymore.

Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1ms? I do not believe those figures, unless you meant 3.5us and
1us, or unless by remote monitoring you mean really, really remote, with a
dodgy network in between.
Was this a test, by the way, where you unplugged the GPS from the machine?
Otherwise, figuring out why the GPS PPS was lost for that period of time is
probably the first thing to do.
Miroslav is better placed to figure out what is happening within chrony when
it loses PPS input. Given the uncertainty in the rate as estimated from the
PPS, the PPS from 13 hrs ago is still probably a better estimate of the
current time than is the network time from the other systems.
Remember that they are at poll 10, which is 1024 seconds (about 17 min),
so the network time sources have not had that many "measurements" in that time
interval, and those are pretty crappy (10ms std dev, which is really huge).
The PPS std dev is in the ns range, about 10000 times better. So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.

Post by Rob Janssen
The above situation occurred with chrony 2.1
However, I have reproduced it with an installation updated to version 3.2
although with an "outage" time of 15 minutes.
It had Reach 0 but still was indicating lock to PPS after 869 seconds.

The star means that the PPS is the best indicator of what the true time now
is.

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care; that is how it is often
taken to mean). Here it indicates that the PPS is still, 13 hrs later, the
best indication of the offset from UTC. Now, this assumption that it is the
best could itself be off. For example, if the time span used by the PPS was
overnight when the machine was cool inside, and during the day the machine is
used a lot and heats up, then the estimate from the PPS rate could well be
off, because those kinds of jumps in the rate would not enter into the estimate
of the skew for the PPS. (If the PPS had accumulated 64 samples at 16 sec per
sample, that is only about 17 min, so the time span over which the PPS is
measuring the rate and the changes in the rate is quite short and would not
capture large rate deviations which occur with a non-Gaussian distribution,
like the heating up every morning.)

Post by Rob Janssen
How could we work around that in this case?

It is not clear what it is you want to work around. From all the data, the PPS
from 13 hrs ago is still the best estimate of UTC. Why would you want chrony to
use a measurably much worse source just because the PPS has not been heard
from for 13 hrs? Eventually the PPS from the remote past is no longer as good
as the relatively really crappy time from the network, but that could take
days.


Rob Janssen

2017-10-23 16:29:58 UTC

Post by Rob Janssen
210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 377 24 +218ns[ +278ns] +/- 124ns
^- xxxxxx.xxxx.xxx 1 10 377 877 -147us[ -122us] +/- 11ms
^- xxxxxx.xxxx.xxx 1 10 377 14 +1480us[+1480us] +/- 10ms
^- xxx.xxxxxx.xxxx.xxx 1 10 377 345 +1446us[+1447us] +/- 10ms
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 0 13h -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx 1 10 377 250 +3462us[+3462us] +/- 10ms
As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.
I would have expected it to drop back to using those network time servers after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2. When it would operate that way, we would have
received an alert.
Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms. So it really is not considering those network time sources anymore.

Look in the above stats: it usually is at about 1.5ms (14xx us) from the network time sources,
and when the error condition occurred, it was at 3462us offset.
There is a network between the source and the system, but it isn't dodgy.

Post by Bill Unruh
Was this a test, by the way, where you unplugged the GPS from the machine?
Otherwise figuring out why gps pps was lost for that period of time is
probably the first thing to do.

We know what happened: the GPSDO went defective so there were no PPS pulses anymore.
(and also no 10 MHz reference, which we need in another part of the system)

What I would like to see is handling of the error condition. Of course it is understandable that
there is no time syncing when there are no PPS pulses, but the condition should be visible.
(e.g. by the stratum increasing and/or the source changing)

Post by Bill Unruh
Miroslav is better placed to figure out what is happening within chrony when
it loses PPS input. Given the uncertainty in the rate as estimated from the
PPS, the PPS from 13 hrs ago is still probably a better estimate of the
current time than is the network time from the other systems.

It isn't! Network time from the other systems would be about 1500us out, time was now 3400us out.
However, that is not the main point.

Post by Bill Unruh
Remember that they are at poll 10 which is 1000 seconds or so (about 15 min)
so the network time sources have not had that many "measurements" in that time
interval and those are pretty crappy (10ms std dev which is really huge).
The PPS std dev is in the ns range-- about 10000 times better.

I don't think the shown output in the last column of "chronyc sources" is the stddev.
Right now that column still indicates 10ms, but when I use "chronyc sourcestats" the last
column actually has a header Std Dev and the values are around 40-60us.

Post by Bill Unruh
So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.

The network sources aren't crappy. There is a systematic offset but the variation is low.
I have no idea what the figure in the last column of sources means, it has no header.

Post by Rob Janssen
The above situation occurred with chrony 2.1
However, I have reproduced it with an installation updated to version 3.2 although with an "outage" time of 15 minutes.
It had Reach 0 but still was indicating lock to PPS after 869 seconds.

The star means that the PPS is the best indicator of what the true time now
is.

Even when it has not provided information for 13 hours?

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care-- that is how it is often
taken to mean).

Of course it could be that the design has a different objective.
We need the time to be very accurate (preferably within 2us but certainly within 20us)
and it looks like chrony is normally able to achieve that, but a design feature could
be that it is freewheeling on loss of sync rather than indicating an error.
I don't mind that it is freewheeling but I need an indication of that - because I need to
turn off our application as I know it does not take long for time to wander out of the 20us
window. Assuming 3400us of wander in 13 hours we should not be without sync for
more than 5 minutes without knowing it.
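
The 5-minute figure can be reproduced with a straight-line extrapolation. This is a simplifying assumption (real drift depends on temperature and on the oscillator), and the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: given DRIFT_US of wander observed over SPAN_S seconds, print the
# number of seconds until BUDGET_US of wander is reached, assuming the
# drift is linear.
# usage: seconds_to_budget DRIFT_US SPAN_S BUDGET_US
seconds_to_budget() {
    awk -v d="$1" -v s="$2" -v b="$3" \
        'BEGIN { printf "%.0f\n", b / (d / s) }'
}

# Example: 3400 us in 13 hours (46800 s), budget 20 us
#   seconds_to_budget 3400 46800 20
```

For the numbers above this comes to about 275 seconds, a little under 5 minutes.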

Post by Bill Unruh
Here it indicates that the PPS is still, 13 hrs later, the
best indication of the offset from UTC. Now, this assumption that it is the
best could be off itself. For example if the time span used by the PPS was
overnight when the machine was cool inside, and during the day the machine is
used a lot and heats up, then the estimate from the PPS rate could well be
off because those kinds of jump in the rate would not enter into the estimate
of the skew for the PPS. (if the PPS had accumulated 64 samples at 16 sec per
sample, that is only 15 min, so the time span over which the pps is measuring
the rate and the changes in the rate is quite short and would not capture
large rate deviations which occur with non-gaussian distribution-- like the
heating up every morning)

Well, the systems are in a rack in airconditioned system rooms, it should not be a big
problem.

Post by Rob Janssen
How could we work around that in this case?

What I need mostly is the information that there is no sync. And, I would expect that
when chrony notices that the offset from the external sources is larger than it usually
is, that it starts tracking those sources instead of running free. But that does not
really matter because the time offset is way out of our tolerances by that time.

Rob

Bill Unruh

2017-10-23 17:13:48 UTC

Post by Rob Janssen
210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 377 24 +218ns[ +278ns] +/- 124ns
^- xxxxxx.xxxx.xxx 1 10 377 877 -147us[ -122us] +/- 11ms
^- xxxxxx.xxxx.xxx 1 10 377 14 +1480us[+1480us] +/- 10ms
^- xxx.xxxxxx.xxxx.xxx 1 10 377 345 +1446us[+1447us] +/- 10ms
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 0 13h -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx 1 10 377 250 +3462us[+3462us] +/- 10ms
As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.
I would have expected it to drop back to using those network time servers after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2. When it would operate that way, we would have
received an alert.
Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms. So it really is not considering those network time sources anymore.

Look in the above stats: it usually is at about 1.5ms (14xx us) from the network time sources,
and when the error condition occurred, it was at 3462us offset.
There is a network between the source and the system, but it isn't dodgy.

Yes, it is. Note that it is saying that the standard deviation is 10ms. That
one particular measurement was only off by 1.5ms does not tell one anything.
The standard deviation tells much more.

And if it is off by 1.5 ms, that is still 10000 times worse than the PPS.

Post by Bill Unruh
Was this a test, by the way, where you unplugged the GPS from the machine?
Otherwise figuring out why gps pps was lost for that period of time is
probably the first thing to do.

We know what happened: the GPSDO went defective so there were no PPS pulses anymore.
(and also no 10 MHz reference, which we need in another part of the system)

That is of course a different issue. And seeing no 10MHz reference is surely
something you can test for elsewhere.

Post by Rob Janssen
What I would like to see is handling of the error condition. Of course it is

The purpose of chrony is to discipline the local clock, not to test GPS
receivers.
You could run a cron job which looks at the PPS reach every 5 min and if it
finds it has dropped to 0, it can do something like let you know your gps has
problems. But why should that be chrony's job? It is giving you the best
estimate of UTC it can given the data. I certainly would not want it giving me
worse estimates.
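
A cron job along those lines could be sketched as follows. It assumes the `chronyc sources` layout shown in this thread, where the source name is the second column and Reach is the fifth. The function names, the script path and the mail command in the comments are hypothetical.

```shell
#!/bin/sh
# Sketch only: names, paths and the alert action are assumptions.

# Read `chronyc sources` text on stdin and print the Reach value of the
# source called NAME; fail if no such source is listed.
source_reach() {
    awk -v name="$1" '$2 == name { print $5; found = 1 }
                      END { exit !found }'
}

# Read `chronyc sources` text on stdin; succeed only if the named source
# (default: PPS) has a non-zero Reach. Returns 3 if the source is missing.
pps_alive() {
    reach=$(source_reach "${1:-PPS}") || return 3
    [ "$reach" != "0" ]
}

# Live use (not run here), e.g. from a script called every 5 minutes:
#   chronyc sources | pps_alive PPS || echo "PPS reach is 0" | mail -s alert root
# Hypothetical crontab entry:
#   */5 * * * * /usr/local/bin/check_pps.sh
```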

Post by Rob Janssen
understandable that
there is no time syncing when there are no PPS pulses, but the condition

Sure there is. You can still use the past info from PPS to sync the current
clock.

Post by Rob Janssen
should be visible.
(e.g. by the stratum increasing and/or the source changing)

It isn't! Network time from the other systems would be about 1500us out,
time was now 3400us out.

No idea what you mean. As I said I have seen no evidence about how you
determined those figures.

Post by Rob Janssen
However, that is not the main point.

Post by Bill Unruh
So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.

The network sources aren't crappy. There is a systematic offset but the variation is low.

No, it is not.

Post by Rob Janssen
I have no idea what the figure in the last column of sources means, it has no header.

The star means that the PPS is the best indicator of what the true time now
is.

Even when it has not provided information for 13 hours?

Sure.

Post by Rob Janssen
Is it to be considered a bug, or is this just a design feature?

It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care-- that is how it is often
taken to mean).

Of course it could be that the design has a different objective.
We need the time to be very accurate (preferably within 2us but certainly within 20us)

chrony's job is to try to do the best job it can with the data available of
disciplining the local clock. That is its job. That you want that discipline
to have a certain accuracy is a separate job, which you could handle by having
a cron job look at the log files for example.

Post by Rob Janssen
and it looks like chrony is normally able to achieve that, but a design feature could
be that it is freewheeling on loss of sync rather than indicating an error.

But if that freewheeling is more accurate than the other clock sources, why
would you object to freewheeling?

Post by Rob Janssen
I don't mind that it is freewheeling but I need an indication of that - because I need to
turn off our application as I know it does not take long for time to wander out of the 20us

You know that how?

Post by Rob Janssen
window. Assuming 3400us of wander in 13 hours we should not be without sync

Again, you have not told us how you determined that 3400us.

Post by Rob Janssen
for
more than 5 minutes without knowing it.

Why? How are you arriving at these figures?

Well, the systems are in a rack in airconditioned system rooms, it should not be a big
problem.

It is internal temperatures, not room temperatures that are important.