Discussion:
[chrony-users] Tracking lost but server selected
Ariel Garcia
2018-03-17 11:48:02 UTC
Permalink
I'm wondering how my Chrony landed in this state:

# chronyc tracking
Reference ID : 00000000 ()
Stratum : 0
Ref time (UTC) : Thu Jan 01 00:00:00 1970
System time : 0.000280024 seconds slow of NTP time
Last offset : +0.000216718 seconds
RMS offset : 0.004225173 seconds
Frequency : 36.861 ppm fast
Residual freq : +0.000 ppm
Skew : 0.000 ppm
Root delay : 1.000000000 seconds
Root dispersion : 1.000000000 seconds
Update interval : 9332.6 seconds
Leap status : Not synchronised


being that a source is still selected:

chronyc> sources
210 Number of sources = 10
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 40.xxx.xxx.xxx 2 10 377 229m -23ms[ -23ms] +/- 163ms
^+ 52.xxx.xxx.xxx 2 10 377 22m +41ms[ +41ms] +/- 90ms
^+ 52.xxx.xxx.xxx 2 10 377 109m +34ms[ +34ms] +/- 76ms
^+ 13.xxx.xxx.xxx 2 10 377 299m +7232us[+7232us] +/- 113ms
^- 223.xxx.xxx.xxx 7 10 377 126m +26ms[ +26ms] +/- 296ms
^+ ntp0.rrze.uni-erlangen.de 1 10 377 354 +40ms[ +40ms] +/- 45ms
^* rustime01.rus.uni-stuttg> 1 10 377 453m -4143us[-4982us] +/- 48ms
^+ ptbtime1.ptb.de 1 10 377 346 +32ms[ +32ms] +/- 54ms
^- time-c-g.nist.gov 1 10 4 206m -296ms[ -296ms] +/- 404ms
^+ 52.xxx.xxx.xxx 2 10 377 195m +21ms[ +21ms] +/- 111ms

(many sources since i am trying to compare private servers in the cloud to
"known-good" public servers)

My tracking.log ends in this way:
---------------------
==> /var/log/chrony/tracking.log <==
2018-03-16 17:20:48 192.53.103.108 2 37.569 0.186 -2.258e-03 N 7 8.752e-03 -1.581e-03 8.815e-02 3.405e-03 4.938e-02
2018-03-16 22:01:01 131.188.3.220 2 37.397 0.235 -1.169e-02 N 8 1.670e-02 -1.405e-07 9.432e-02 1.041e-03 6.827e-02
2018-03-16 23:02:56 131.188.3.220 2 37.231 0.201 -3.187e-03 N 8 1.301e-02 1.048e-02 9.414e-02 1.102e-03 6.461e-02
2018-03-16 23:20:01 129.69.1.153 2 37.082 0.188 -6.433e-03 N 9 2.570e-02 1.103e-02 8.334e-02 2.310e-03 6.309e-02
2018-03-16 23:20:05 129.69.1.153 2 37.016 0.161 -8.800e-04 N 9 2.534e-02 1.743e-02 8.334e-02 1.764e-03 6.145e-02
2018-03-16 23:54:39 129.69.1.153 2 36.840 0.269 1.549e-03 N 8 1.052e-02 -7.460e-06 8.334e-02 2.914e-03 6.417e-02
2018-03-17 00:11:59 129.69.1.153 2 36.759 0.191 -1.346e-03 N 8 1.062e-02 -1.480e-03 8.195e-02 1.045e-03 4.753e-02
2018-03-17 01:21:03 129.69.1.153 2 36.749 0.159 -6.800e-04 N 8 1.140e-02 -1.296e-04 8.195e-02 2.893e-03 4.713e-02
2018-03-17 03:56:36 129.69.1.153 2 36.732 0.121 2.167e-04 N 8 1.209e-02 5.104e-04 8.195e-02 6.179e-03 5.526e-02
2018-03-17 06:13:08 0.0.0.0 0 36.861 0.121 0.000e+00 ? 0 0.000e+00 2.895e-04 8.195e-02 1.539e-02 5.666e-02
---------------------

while i am still seeing measurments comming in:

==> /var/log/chrony/measurements.log <==
2018-03-17 11:42:02 40.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00 2.474e-01 6.753e-01 3.317e-06 1.280e-02 4.428e-02 83BC03DC 4B D K
2018-03-17 11:42:17 52.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00 3.201e-01 8.283e-01 2.489e-06 3.340e-02 4.530e-02 C035676C 4B D K
2018-03-17 11:42:27 129.69.1.153 N 1 111 111 1101 10 10 1.00 3.446e-01 8.649e-01 6.185e-06 0.000e+00 1.099e-03 505A4600 4B D K
2018-03-17 11:42:27 52.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00 6.632e-02 3.054e-01 2.481e-06 1.364e-02 3.954e-02 814501AA 4B D K

==> /var/log/chrony/statistics.log <==
2018-03-17 11:07:23 192.53.103.108 1.299e-02 3.078e-02 4.285e-03 3.641e-07 2.461e-07 1.3e-01 31 0 12 -0.44
2018-03-17 11:07:46 52.xxx.xxx.xxx 1.303e-02 2.626e-02 4.708e-03 4.498e-07 3.968e-07 2.3e-01 24 0 9 0.00
2018-03-17 11:24:28 131.188.3.220 2.940e-03 4.116e-02 1.658e-03 1.343e-06 3.955e-07 1.4e-02 10 0 7 -0.50
2018-03-17 11:24:36 192.53.103.108 1.310e-02 3.085e-02 4.530e-03 3.386e-07 3.023e-07 1.0e-01 32 2 13 -0.44


I did not run a "reselect" command in case somebody wants additional
input from the daemon in its current state.

Thanks!
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Ariel Garcia
2018-03-17 22:17:05 UTC
Permalink
Strace'ing the daemon, excepting connections i run from the client, i only see
this activity for hours:

----------------------
select(6, [1 5], NULL, NULL, {2783, 75820}
) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {1521307188, 980265551}) = 0
clock_gettime(CLOCK_REALTIME, {1521307188, 980640329}) = 0
clock_gettime(CLOCK_REALTIME, {1521307188, 980950442}) = 0
adjtimex({modes=ADJ_FREQUENCY|ADJ_TICK, offset=0, freq=-2415664, maxerror=16000000, esterror=16000000, status=STA_UNSYNC, constant=2, precision=1, tolerance=32768000, time={1521307188, 981261}, tick=10000, ppsfreq=0, jitter=0, shift=0, stabil=0, jitcnt=0, calcnt=0, errcnt=0, stbcnt=0}) = 5 (TIME_ERROR)
clock_gettime(CLOCK_REALTIME, {1521307188, 981697915}) = 0
select(6, [1 5], NULL, NULL, {9999, 999252}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {1521317189, 82886363}) = 0
clock_gettime(CLOCK_REALTIME, {1521317189, 83261558}) = 0
clock_gettime(CLOCK_REALTIME, {1521317189, 83577629}) = 0
adjtimex({modes=ADJ_FREQUENCY|ADJ_TICK, offset=0, freq=-2415665, maxerror=16000000, esterror=16000000, status=STA_UNSYNC, constant=2, precision=1, tolerance=32768000, time={1521317189, 83888}, tick=10000, ppsfreq=0, jitter=0, shift=0, stabil=0, jitcnt=0, calcnt=0, errcnt=0, stbcnt=0}) = 5 (TIME_ERROR)
clock_gettime(CLOCK_REALTIME, {1521317189, 84323852}) = 0
select(6, [1 5], NULL, NULL, {9999, 999253}
----------------------

I can get much out of it, but it seems clear that the failing
adjtimex() call is either cause or consequence.
Post by Ariel Garcia
# chronyc tracking
Reference ID : 00000000 ()
Stratum : 0
Ref time (UTC) : Thu Jan 01 00:00:00 1970
System time : 0.000280024 seconds slow of NTP time
Last offset : +0.000216718 seconds
RMS offset : 0.004225173 seconds
Frequency : 36.861 ppm fast
Residual freq : +0.000 ppm
Skew : 0.000 ppm
Root delay : 1.000000000 seconds
Root dispersion : 1.000000000 seconds
Update interval : 9332.6 seconds
Leap status : Not synchronised
chronyc> sources
210 Number of sources = 10
MS Name/IP address Stratum Poll Reach LastRx Last sample
============================================================================
=== ^+ 40.xxx.xxx.xxx 2 10 377 229m -23ms[ -23ms]
+/- 163ms ^+ 52.xxx.xxx.xxx 2 10 377 22m +41ms[
+41ms] +/- 90ms ^+ 52.xxx.xxx.xxx 2 10 377 109m
+34ms[ +34ms] +/- 76ms ^+ 13.xxx.xxx.xxx 2 10 377
299m +7232us[+7232us] +/- 113ms ^- 223.xxx.xxx.xxx 7
10 377 126m +26ms[ +26ms] +/- 296ms ^+ ntp0.rrze.uni-erlangen.de
1 10 377 354 +40ms[ +40ms] +/- 45ms ^*
rustime01.rus.uni-stuttg> 1 10 377 453m -4143us[-4982us] +/-
48ms ^+ ptbtime1.ptb.de 1 10 377 346 +32ms[ +32ms]
+/- 54ms ^- time-c-g.nist.gov 1 10 4 206m -296ms[
-296ms] +/- 404ms ^+ 52.xxx.xxx.xxx 2 10 377 195m
+21ms[ +21ms] +/- 111ms
(many sources since i am trying to compare private servers in the cloud to
"known-good" public servers)
---------------------
==> /var/log/chrony/tracking.log <==
2018-03-16 17:20:48 192.53.103.108 2 37.569 0.186 -2.258e-03 N 7
8.752e-03 -1.581e-03 8.815e-02 3.405e-03 4.938e-02 2018-03-16 22:01:01
131.188.3.220 2 37.397 0.235 -1.169e-02 N 8 1.670e-02
-1.405e-07 9.432e-02 1.041e-03 6.827e-02 2018-03-16 23:02:56
131.188.3.220 2 37.231 0.201 -3.187e-03 N 8 1.301e-02
1.048e-02 9.414e-02 1.102e-03 6.461e-02 2018-03-16 23:20:01 129.69.1.153
2 37.082 0.188 -6.433e-03 N 9 2.570e-02 1.103e-02
8.334e-02 2.310e-03 6.309e-02 2018-03-16 23:20:05 129.69.1.153 2
37.016 0.161 -8.800e-04 N 9 2.534e-02 1.743e-02 8.334e-02
1.764e-03 6.145e-02 2018-03-16 23:54:39 129.69.1.153 2 36.840
0.269 1.549e-03 N 8 1.052e-02 -7.460e-06 8.334e-02 2.914e-03
6.417e-02 2018-03-17 00:11:59 129.69.1.153 2 36.759 0.191
-1.346e-03 N 8 1.062e-02 -1.480e-03 8.195e-02 1.045e-03 4.753e-02
2018-03-17 01:21:03 129.69.1.153 2 36.749 0.159 -6.800e-04 N
8 1.140e-02 -1.296e-04 8.195e-02 2.893e-03 4.713e-02 2018-03-17
03:56:36 129.69.1.153 2 36.732 0.121 2.167e-04 N 8
1.209e-02 5.104e-04 8.195e-02 6.179e-03 5.526e-02 2018-03-17 06:13:08
0.0.0.0 0 36.861 0.121 0.000e+00 ? 0 0.000e+00
2.895e-04 8.195e-02 1.539e-02 5.666e-02 ---------------------
==> /var/log/chrony/measurements.log <==
2018-03-17 11:42:02 40.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00
2.474e-01 6.753e-01 3.317e-06 1.280e-02 4.428e-02 83BC03DC 4B D K
2018-03-17 11:42:17 52.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00
3.201e-01 8.283e-01 2.489e-06 3.340e-02 4.530e-02 C035676C 4B D K
2018-03-17 11:42:27 129.69.1.153 N 1 111 111 1101 10 10 1.00
3.446e-01 8.649e-01 6.185e-06 0.000e+00 1.099e-03 505A4600 4B D K
2018-03-17 11:42:27 52.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00
6.632e-02 3.054e-01 2.481e-06 1.364e-02 3.954e-02 814501AA 4B D K
==> /var/log/chrony/statistics.log <==
2018-03-17 11:07:23 192.53.103.108 1.299e-02 3.078e-02 4.285e-03
3.641e-07 2.461e-07 1.3e-01 31 0 12 -0.44 2018-03-17 11:07:46
52.xxx.xxx.xxx 1.303e-02 2.626e-02 4.708e-03 4.498e-07 3.968e-07
2.3e-01 24 0 9 0.00 2018-03-17 11:24:28 131.188.3.220 2.940e-03
4.116e-02 1.658e-03 1.343e-06 3.955e-07 1.4e-02 10 0 7 -0.50
2018-03-17 11:24:36 192.53.103.108 1.310e-02 3.085e-02 4.530e-03
3.386e-07 3.023e-07 1.0e-01 32 2 13 -0.44
I did not run a "reselect" command in case somebody wants additional
input from the daemon in its current state.
Thanks!
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Miroslav Lichvar
2018-03-19 09:12:21 UTC
Permalink
Post by Ariel Garcia
# chronyc tracking
Reference ID : 00000000 ()
Tracking reporting no reference probably means that the selection
failed at some point (e.g. no majority reached) and the newly selected
source (if there is one) didn't have an update yet.

Is there a message in the system log corresponding to the selection of
0.0.0.0?
Post by Ariel Garcia
==> /var/log/chrony/measurements.log <==
2018-03-17 11:42:02 40.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00 2.474e-01 6.753e-01 3.317e-06 1.280e-02 4.428e-02 83BC03DC 4B D K
2018-03-17 11:42:17 52.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00 3.201e-01 8.283e-01 2.489e-06 3.340e-02 4.530e-02 C035676C 4B D K
2018-03-17 11:42:27 129.69.1.153 N 1 111 111 1101 10 10 1.00 3.446e-01 8.649e-01 6.185e-06 0.000e+00 1.099e-03 505A4600 4B D K
2018-03-17 11:42:27 52.xxx.xxx.xxx N 2 111 111 1101 10 10 1.00 6.632e-02 3.054e-01 2.481e-06 1.364e-02 3.954e-02 814501AA 4B D K
Was there a network change which could cause the delay to be increased?

The sources seem to be failing in the test C, which suggests the
network delay is larger than it used to be before. chronyd is waiting
for the delay to get back to normal, but if it is a permanent change
(e.g. network routing has changed which added more delay), it may take
a while before the sample with the old minimum delay is flushed, or a
different source with better statistics is selected.

If you look further back in the measurements log, do you see a point
when all sources suddenly started to consistently fail the test C?
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Ariel Garcia
2018-03-19 11:59:39 UTC
Permalink
Hello Miroslav,

thanks a lot for your enlightening answer :)
In the meantime i've noticed, that the state is "normalizing" itself after
several hours: i've seen it remaining w/o reference for anything between
1 to 18 hours) but eventually happening again.

The device is connected via mobile network, with pretty varying and bad RTT's
to the NTP servers (i've seen anything between 0.1 to 2+ seconds, mean is
around 0.8s). Peer delay (chronyc ntpdata) is usually ~0.8s also.
Post by Miroslav Lichvar
Tracking reporting no reference probably means that the selection
failed at some point (e.g. no majority reached) and the newly selected
source (if there is one) didn't have an update yet.
Is there a message in the system log corresponding to the selection of
0.0.0.0?
yes, right, it always prints
Can't synchronise: no selectable sources
whenever it turns to "Ref time 1970" state
Post by Miroslav Lichvar
Was there a network change which could cause the delay to be increased?
well, no from the device but as said above the mobile nw is quite varying,
delays could be better and are for sure nor regular.
Post by Miroslav Lichvar
The sources seem to be failing in the test C, which suggests the
network delay is larger than it used to be before. chronyd is waiting
for the delay to get back to normal, but if it is a permanent change
(e.g. network routing has changed which added more delay), it may take
a while before the sample with the old minimum delay is flushed, or a
different source with better statistics is selected.
great! that is a very helpful information.

Is there any parameter i could adjust, to make it less "sensitive" to the
(bad) network behaviour, like increasing maxdelay ? (currently i have the
default value, 3s)
As said above, i've seen RTT's of 2 seconds, but looking sporadically, they
may get even worse...

Would having less sources (4 intended) help in reaching majority?
Post by Miroslav Lichvar
If you look further back in the measurements log, do you see a point
when all sources suddenly started to consistently fail the test C?
it is rather the other way around: most of the time i see test C failing:
111 111 1101
and every once in a while you find a server where all bits are set (logs
attached in case helpful, i can't really see a pattern)

So i guess when the rate of "all good" points falls even more than the normal
value, i loose the reference?
Any of the parameters printed in the logs (root delay, dispersion etc) helps
to know when the synchronization is going to be lost?

Thanks a lot for your help and for the great tool :)
Ariel
Miroslav Lichvar
2018-03-19 13:32:21 UTC
Permalink
Post by Ariel Garcia
Post by Miroslav Lichvar
Is there a message in the system log corresponding to the selection of
0.0.0.0?
yes, right, it always prints
Can't synchronise: no selectable sources
whenever it turns to "Ref time 1970" state
Ok, so it's not a problem with sources not agreeing with one another,
but rather they all become unfit for synchronization, e.g. having a
distance larger than maxdistance.

Default maxdistance is 3 seconds and from the logs it doesn't look
to me like it could be reached in just few hours.

Which chrony version are you running?
Post by Ariel Garcia
Is there any parameter i could adjust, to make it less "sensitive" to the
(bad) network behaviour, like increasing maxdelay ? (currently i have the
default value, 3s)
As said above, i've seen RTT's of 2 seconds, but looking sporadically, they
may get even worse...
You could increase maxdelaydevratio to make the test C less sensitive.
A very large value like 1e6 will effectively disable it.

But that will most likely only make the clock less stable. If most of
the measurements have delay in tenths of a second, it's probably
better to stick to the rare measurements that have a delay below 0.1s.
Post by Ariel Garcia
Would having less sources (4 intended) help in reaching majority?
From the log message it doesn't seem to be an issue with reaching
majority, so changing the number of sources should not make much of a
difference.
Post by Ariel Garcia
So i guess when the rate of "all good" points falls even more than the normal
value, i loose the reference?
Any of the parameters printed in the logs (root delay, dispersion etc) helps
to know when the synchronization is going to be lost?
It's not very clear to me what is going on here. If the issue was with
sources being "too distant", skew in sourcestats would be the most
important value. A source which has a large skew will fail the
maxdistance check faster when its measurements are not accumulated
(e.g. failing the test C).

The source 129.69.1.153 was ok at 03:56:36, but at 06:13:08 it
couldn't remain selected? That's weird.

Could you please post your config file?
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Ariel Garcia
2018-03-19 14:07:24 UTC
Permalink
Post by Miroslav Lichvar
Default maxdistance is 3 seconds and from the logs it doesn't look
to me like it could be reached in just few hours.
you mean the measurements are too bad to enable selecting a reference
source faster than in several hours?
("maxdistance 3s" roughly means that measurements with RTT > 3s are discarded
right?)
Post by Miroslav Lichvar
Which chrony version are you running?
3.2 downloaded from Github
(taken from https://github.com/mlichvar/chrony/archive/3.2.tar.gz and then
compiled in a TPXdist/ARM crosscompiling environment)
Post by Miroslav Lichvar
You could increase maxdelaydevratio to make the test C less sensitive.
A very large value like 1e6 will effectively disable it.
But that will most likely only make the clock less stable. If most of
the measurements have delay in tenths of a second, it's probably
better to stick to the rare measurements that have a delay below 0.1s.
ok. Well, i do not necessarily target a very precise clock: 100ms (in)accuracy
would still be fine.
However, i would like to make sure chrony always considers itself
"synchronized" unless the sources are unreachable for several days. The device
will be connecting for short periods every 12h, and if
chrony waitsync 1 1 500
fails (= not sync'ed) i'll force a longer connection until synchronization
happens...
Post by Miroslav Lichvar
It's not very clear to me what is going on here. If the issue was with
sources being "too distant", skew in sourcestats would be the most
important value. A source which has a large skew will fail the
maxdistance check faster when its measurements are not accumulated
(e.g. failing the test C).
The source 129.69.1.153 was ok at 03:56:36, but at 06:13:08 it
couldn't remain selected? That's weird.
Could you please post your config file?
sure, attached.

Thanks!
Miroslav Lichvar
2018-03-19 15:40:19 UTC
Permalink
Post by Ariel Garcia
Post by Miroslav Lichvar
Default maxdistance is 3 seconds and from the logs it doesn't look
to me like it could be reached in just few hours.
you mean the measurements are too bad to enable selecting a reference
source faster than in several hours?
They seem too good to cause the sources to have a large skew and
quickly reach maxdistance when they are missing a larger number of new
measurements.
Post by Ariel Garcia
("maxdistance 3s" roughly means that measurements with RTT > 3s are discarded
right?)
maxdistance is only related to the source selection. The maxdelay*
options are related only to the individual measurements. If we ignore
the skew and other sources of dispersion, a source which doesn't have
a measurement with better delay than 6 seconds would always fail the
maxdistance check. (distance = delay / 2 + dispersion)

Do you set any of the sources to offline (with chronyc)?

Offline sources are treated as unreachable and have some additional
requirements for selection.

It would probably help if we could see debug output from the time when
the sources cannot be selected.
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Ariel Garcia
2018-03-20 10:32:50 UTC
Permalink
Post by Miroslav Lichvar
maxdistance is only related to the source selection. The maxdelay*
options are related only to the individual measurements. If we ignore
the skew and other sources of dispersion, a source which doesn't have
a measurement with better delay than 6 seconds would always fail the
maxdistance check. (distance = delay / 2 + dispersion)
great, thanks for the explanations!
Post by Miroslav Lichvar
Do you set any of the sources to offline (with chronyc)?
no, didn't
Post by Miroslav Lichvar
Offline sources are treated as unreachable and have some additional
requirements for selection.
It would probably help if we could see debug output from the time when
the sources cannot be selected.
i've already recompiled with --enable-debug and restarted the daemon, but
didn't happen up to now. I will post again as soon as i get the data, perhaps
1-2 days.

Thanks!
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Ariel Garcia
2018-03-23 01:05:58 UTC
Permalink
Hello,
Post by Miroslav Lichvar
Post by Ariel Garcia
you mean the measurements are too bad to enable selecting a reference
source faster than in several hours?
They seem too good to cause the sources to have a large skew and
quickly reach maxdistance when they are missing a larger number of new
measurements.
Post by Ariel Garcia
("maxdistance 3s" roughly means that measurements with RTT > 3s are
discarded right?)
maxdistance is only related to the source selection. The maxdelay*
options are related only to the individual measurements. If we ignore
the skew and other sources of dispersion, a source which doesn't have
a measurement with better delay than 6 seconds would always fail the
maxdistance check. (distance = delay / 2 + dispersion)
Do you set any of the sources to offline (with chronyc)?
Offline sources are treated as unreachable and have some additional
requirements for selection.
It would probably help if we could see debug output from the time when
the sources cannot be selected.
Ok, got it :)

There seems to be a 1 to 1 correlation with the "Fallback drift 13"
activation, which is the value i selected in the config file:
fallbackdrift 13 19

Attached the log files and all the debugging output since a device reboot
on 21st March. Another app turned the network connection down around
6-9am on March 22nd, but then the problem appeared at 13:35 and then
again at 18:28 on March 22nd.

Thanks again!
Ariel Garcia
2018-03-26 11:03:15 UTC
Permalink
Hello,

Sorry i had sent this email on Friday but something went wrong.
I now have a lot more of data (debug logs, chrony logs) showing the
correlation between "reference lost" and "fallback drift activation" but i
think the data from Friday should be quite helpful already.

----
Post by Miroslav Lichvar
Post by Ariel Garcia
you mean the measurements are too bad to enable selecting a reference
source faster than in several hours?
They seem too good to cause the sources to have a large skew and
quickly reach maxdistance when they are missing a larger number of new
measurements.
Post by Ariel Garcia
("maxdistance 3s" roughly means that measurements with RTT > 3s are
discarded right?)
maxdistance is only related to the source selection. The maxdelay*
options are related only to the individual measurements. If we ignore
the skew and other sources of dispersion, a source which doesn't have
a measurement with better delay than 6 seconds would always fail the
maxdistance check. (distance = delay / 2 + dispersion)
Do you set any of the sources to offline (with chronyc)?
Offline sources are treated as unreachable and have some additional
requirements for selection.
It would probably help if we could see debug output from the time when
the sources cannot be selected.
Ok, got it :)

There seems to be a 1 to 1 correlation with the "Fallback drift 13"
activation, which is the value i selected in the config file:
fallbackdrift 13 19

Attached the log files and all the debugging output since a device reboot
on 21st March. Another app turned the network connection down around
6-9am on March 22nd, but then the problem appeared at 13:35 and then
again at 18:28 on March 22nd.

Thanks again!
Miroslav Lichvar
2018-03-26 16:16:21 UTC
Permalink
Post by Ariel Garcia
Post by Miroslav Lichvar
It would probably help if we could see debug output from the time when
the sources cannot be selected.
Ok, got it :)
There seems to be a 1 to 1 correlation with the "Fallback drift 13"
fallbackdrift 13 19
Ok, I think chronyd switching to a fallback drift when no update of
the clock was made in 2^13 seconds explains the unsynchronized status
(0.0.0.0 entries in the tracking log). That's actually how it was
designed to work. I think it could be improved to keep the current
reference and adjust the dispersion rate to allow it to operate as an
NTP server and make the tracking report useful.

The "Can't synchronise: no selectable sources" messages seem to
correspond to chronyc offline commands. Maybe you have a networking
script running automatically when an interface is brought up/down?
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Ariel Garcia
2018-03-26 17:47:56 UTC
Permalink
Post by Miroslav Lichvar
Post by Ariel Garcia
Post by Miroslav Lichvar
It would probably help if we could see debug output from the time when
the sources cannot be selected.
Ok, got it :)
There seems to be a 1 to 1 correlation with the "Fallback drift 13"
fallbackdrift 13 19
Ok, I think chronyd switching to a fallback drift when no update of
the clock was made in 2^13 seconds explains the unsynchronized status
(0.0.0.0 entries in the tracking log). That's actually how it was
designed to work. I think it could be improved to keep the current
reference and adjust the dispersion rate to allow it to operate as an
NTP server and make the tracking report useful.
Ok. I'm misinterpreting the fallbackdrift documentation then (or it is perhaps
a bit misleading), since it says: "They (drifts) are used when the clock is no
longer synchronised ...". From what i understand
"synchronized" == "a reference is set"
so the fallback drift causes the reference to be lost, and not the other way
around.

I would like to make sure that a reference source stays selected even if the
network is offline, at least until chrony estimates the clock may get way too
off (eg, greater than 1s): i poll chrony with
"chronyc waitsync 1 1 500"
every 5 minutes to make sure it is still sync'ed, and trigger a network
connection if not.

Shall i simply remove the fallbackdrift setting in that case?
Post by Miroslav Lichvar
The "Can't synchronise: no selectable sources" messages seem to
correspond to chronyc offline commands. Maybe you have a networking
script running automatically when an interface is brought up/down?
yes, exactly, although i tried to ensure the 3G modem stays connected all the
time. Probably now i can't recover a log of that anymore, but i could retest
if interesting for you.

Thanks a lot for your support!
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Miroslav Lichvar
2018-03-27 09:56:49 UTC
Permalink
Post by Ariel Garcia
Post by Miroslav Lichvar
Ok, I think chronyd switching to a fallback drift when no update of
the clock was made in 2^13 seconds explains the unsynchronized status
(0.0.0.0 entries in the tracking log). That's actually how it was
designed to work. I think it could be improved to keep the current
reference and adjust the dispersion rate to allow it to operate as an
NTP server and make the tracking report useful.
Ok. I'm misinterpreting the fallbackdrift documentation then (or it is perhaps
a bit misleading), since it says: "They (drifts) are used when the clock is no
longer synchronised ...". From what i understand
"synchronized" == "a reference is set"
so the fallback drift causes the reference to be lost, and not the other way
around.
Hm, yes, it's confusing. It changed when chrony was updated to NTPv4
(chrony-2.0). Before that, the fallback drift was activated only if
there was no reference, but that's a rare case in NTPv4, so it always
activates. I'll try to fix the description.
Post by Ariel Garcia
I would like to make sure that a reference source stays selected even if the
network is offline, at least until chrony estimates the clock may get way too
off (eg, greater than 1s): i poll chrony with
"chronyc waitsync 1 1 500"
every 5 minutes to make sure it is still sync'ed, and trigger a network
connection if not.
Shall i simply remove the fallbackdrift setting in that case?
Yes, or increase the fallbackdrift minimum interval to a value where
you would consider the clock to no longer be synchronized.
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Loading...