Discussion:
[chrony-users] need help with large delays reported in ntpdata
Charbonneau, André
2021-04-03 19:33:13 UTC
Permalink
Hi,
I currently have a Linux server getting its time from a 1pps source (atomic clock) and 2 stratum 1 servers (for time of day). My Linux server and stratum 1 servers are connected to the same switch. The 1pps is from a Rb synchronized to UTC.

I'm seeing some weird peer delay and peer dispersion values in the output of chronyc ntpdata command:

Remote address : ***.***.52.242 (******F2)
Remote port : 123
Local address : ***.***.52.175 (******AF)
Leap status : Normal
Version : 3
Mode : Server
Stratum : 1
Poll interval : 10 (1024 seconds)
Precision : -5 (0.031250000 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Sat Apr 03 18:55:15 2021
Offset : +0.000004316 seconds
Peer delay : 0.031250019 seconds
Peer dispersion : 0.031250019 seconds
Response time : 0.000404820 seconds
Jitter asymmetry: +0.00
NTP tests : 111 111 1101
Interleaved : No
Authenticated : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX : 108
Total RX : 108
Total valid RX : 108

These peer delay and peer dispersion values are quite large (being connected to the same switch) and constant over time (they don't change). I was looking at the chrony code and this looks like this could be due because of large estimated error?

Here is the output of chronyc sources:

$ chronyc sources
210 Number of sources = 3
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 377 11 +120ns[ +189ns] +/- 66ns
^- ***.***.52.204 1 5 377 38m +2040ns[ +347ns] +/- 62us (this seems very large... I've seen this jump to 47ms)
^- ***.***.52.242 1 5 377 39m +1913ns[ +232ns] +/- 62us

The LastRX values for the 2 stratum 1 servers is very large, even though the maxpoll is set to 5.

So this seems to indicate that there is a NTP test that fails constantly for these 2 entries.

I'm not sure where to look next to try to figure out what is the root cause of this problem. Any information about what might explain these constant large delays reported by the ntpdata command and large LastRX values would be much appreciated.

Thanks,
Andre


CentOS Linux release 7.9.2009 (Core)

chronyd (chrony) version 3.4 (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6
+DEBUG)




--
Andre Charbonneau

Frequency & Time
Metrology Research Centre
National Research Council Canada / Government of Canada
***@nrc-cnrc.gc.ca<mailto:***@nrc-cnrc.gc.ca> / 613-993-3129

Fréquence et temps
Centre de recherche en métrologie
Conseil national de recherches Canada / Gouvernement du Canada
***@nrc-cnrc.gc.ca<mailto:***@nrc-cnrc.gc.ca> / 613-993-3129
Bill Unruh
2021-04-03 21:25:47 UTC
Permalink
William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ ***@physics.ubc.ca
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
[CAUTION: Non-UBC Email]
Hi,
I currently have a Linux server getting its time from a 1pps source (atomic clock) and 2 stratum
1 servers (for time of day).  My Linux server and stratum 1 servers are connected to the same
switch.  The 1pps is from a Rb synchronized to UTC.
 
I’m seeing some weird peer delay and peer dispersion values in the output of chronyc ntpdata
.......
NTP tests       : 111 111 1101
This says that the source i failing the "delay dev ration"
The explanation would seem to be

maxdelaydevratio ratio
If a measurement has a ratio of the increase in the round-trip delay from the minimum delay amongst the previous measurements to the standard deviation of the previous measurements that is greater than the specified ratio, it will be rejected. The default is 10.0.

You could try increasing this max to see if that helps.

It is not entirely clear to me what this means however. And you sources says
that the system has a reach of 377, which means taht the the last 8 contacts
with the source have been successful.
Of course this does not say why they have been successfull if they have been
rejected.
Interleaved     : No
Authenticated   : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX        : 108
Total RX        : 108
Total valid RX  : 108
 
These peer delay and peer dispersion values are quite large (being connected to the same switch)
and constant over time (they don’t change).  I was looking at the chrony code and this looks like
this could be due because of large estimated error?
 
 
$ chronyc sources
210 Number of sources = 3
MS Name/IP address         Stratum Poll Reach LastRx Last sample              
===============================================================================
#* PPS                           0   4   377    11   +120ns[ +189ns] +/-   66ns
^- ***.***.52.204                1   5   377   38m  +2040ns[ +347ns] +/-   62us (this seems very
large
  I’ve seen this jump to 47ms)
^- ***.***.52.242                1   5   377   39m  +1913ns[ +232ns] +/-   62us
 
The LastRX values for the 2 stratum 1 servers is very large, even though the maxpoll is set to 5.
 
So this seems to indicate that there is a NTP test that fails constantly for these 2 entries.
 
I’m not sure where to look next to try to figure out what is the root cause of this problem.  Any
information about what might explain these constant large delays reported by the ntpdata command
and large LastRX values would be much appreciated.
 
Thanks,
  Andre
 
 
CentOS Linux release 7.9.2009 (Core)
 
chronyd (chrony) version 3.4 (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS
+SECHASH +IPV6
+DEBUG)
 
 
 
 
--
Andre Charbonneau
 
Frequency & Time
Metrology Research Centre
National Research Council Canada / Government of Canada
 
Fréquence et temps
Centre de recherche en métrologie
Conseil national de recherches Canada / Gouvernement du Canada
 
Dan Gearty
2021-04-04 01:23:43 UTC
Permalink
I get the same value for "NTP tests" for an NTP server on the same subnet.
But the peer delay is plausible, and the time performance is good.

Not sure what could cause network delays like you are seeing outside of one
of the servers being CPU bound or the network being saturated.



# chronyc ntpdata | head -27

Remote address : 192.168.50.230 (C0A832E6)

Remote port : 123

Local address : 192.168.50.1 (C0A83201)

Leap status : Normal

Version : 4

Mode : Server

Stratum : 1

Poll interval : 6 (64 seconds)

Precision : -25 (0.000000030 seconds)

Root delay : 0.000000 seconds

Root dispersion : 0.000000 seconds

Reference ID : 47505300 (GPS)

Reference time : Sun Apr 04 01:11:14 2021

Offset : -0.000024434 seconds

Peer delay : 0.000593282 seconds

Peer dispersion : 0.000000070 seconds

Response time : 0.000004822 seconds

Jitter asymmetry: +0.00

NTP tests : 111 111 1101

Interleaved : No

Authenticated : No

TX timestamping : Daemon

RX timestamping : Kernel

Total TX : 12224

Total RX : 12224

Total valid RX : 12224



# chronyc sources | head -3

MS Name/IP address Stratum Poll Reach LastRx Last sample

============================================================================
===

^* LeoNTP 1 3 377 46 +8327ns[ +11us] +/-
289us



# chronyc sourcestats | head -3

Name/IP Address NP NR Span Frequency Freq Skew Offset Std
Dev

============================================================================
==

LeoNTP 20 12 630 +0.001 0.066 +58ns
16us



# chronyc -version

chronyc (chrony) version 4.0 (-READLINE -SECHASH +IPV6 -DEBUG)
Miroslav Lichvar
2021-04-06 08:15:21 UTC
Permalink
Post by Charbonneau, André
Precision : -5 (0.031250000 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Sat Apr 03 18:55:15 2021
Offset : +0.000004316 seconds
Peer delay : 0.031250019 seconds
Peer dispersion : 0.031250019 seconds
These peer delay and peer dispersion values are quite large (being connected to the same switch) and constant over time (they don't change). I was looking at the chrony code and this looks like this could be due because of large estimated error?
chrony as a client rounds up the peer delay and dispersion to the
precision reported by the server. -5 is not great. What kind of server
is that?
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Charbonneau, André
2021-04-06 21:10:24 UTC
Permalink
Thanks Miroslav.

The server is a Masterclock GMR1000. Funny enough, as soon as I disconnected the 1pps input pulse to that GMR1000 unit, and it falls back to GPS for reference, then the precision is improved by a lot. For example:

Stratum 1 with 1pps input (as reported in the output of chronyc ntpdata command):

Remote address : ***.***.52.242 (***634F2)
Remote port : 123
Local address : ***.***.52.101 (***63465)
Leap status : Normal
Version : 3
Mode : Server
Stratum : 1
Poll interval : 10 (1024 seconds)
Precision : -5 (0.031250000 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Tue Apr 06 18:13:51 2021
Offset : -0.000000193 seconds
Peer delay : 0.031250019 seconds
Peer dispersion : 0.031250019 seconds
Response time : 0.000412496 seconds
Jitter asymmetry: +0.00
NTP tests : 111 111 1101
Interleaved : No
Authenticated : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX : 8095
Total RX : 8095
Total valid RX : 8095


Same stratum 1 with 1pps disconnected, shortly after, using GPS as fallback:

Remote address : ***.***.52.242 (***634F2)
Remote port : 123
Local address : ***.***.52.101 (***63465)
Leap status : Normal
Version : 3
Mode : Server
Stratum : 1
Poll interval : 10 (1024 seconds)
Precision : -18 (0.000003815 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Tue Apr 06 18:19:15 2021
Offset : -0.000027722 seconds
Peer delay : 0.000116265 seconds
Peer dispersion : 0.000003832 seconds
Response time : 0.000400542 seconds
Jitter asymmetry: +0.00
NTP tests : 111 111 1111
Interleaved : No
Authenticated : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX : 8105
Total RX : 8105
Total valid RX : 8105

I wonder what might be causing this discrepancy between the 2 references.

Thanks,
Andre
--
Andre Charbonneau

Frequency & Time
Metrology Research Centre
National Research Council Canada / Government of Canada
***@nrc-cnrc.gc.ca / 613-993-3129

Fréquence et temps
Centre de recherche en métrologie
Conseil national de recherches Canada / Gouvernement du Canada
***@nrc-cnrc.gc.ca / 613-993-3129


-----Original Message-----
From: Miroslav Lichvar [mailto:***@redhat.com]
Sent: April 6, 2021 4:15 AM
To: chrony-***@chrony.tuxfamily.org
Subject: Re: [chrony-users] need help with large delays reported in ntpdata

***ATTENTION*** This email originated from outside of the NRC. ***ATTENTION*** Ce courriel provient de l'extérieur du CNRC
Post by Charbonneau, André
Precision : -5 (0.031250000 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Sat Apr 03 18:55:15 2021
Offset : +0.000004316 seconds
Peer delay : 0.031250019 seconds
Peer dispersion : 0.031250019 seconds
These peer delay and peer dispersion values are quite large (being connected to the same switch) and constant over time (they don't change). I was looking at the chrony code and this looks like this could be due because of large estimated error?
chrony as a client rounds up the peer delay and dispersion to the precision reported by the server. -5 is not great. What kind of server is that?

--
Miroslav Lichvar


--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Charbonneau, André
2021-04-07 13:53:05 UTC
Permalink
Hi,

UPDATE:
We've just heard back from that unit's manufacturer and it looks like we've discovered a bug in their firmware related to the precision field calculation. They are working on a fix.
Thanks everyone for the feedback on this; much appreciated. It helped us narrow down on the issue and identify the root cause of the problem.

Best regards,
Andre

--
Andre Charbonneau

Frequency & Time
Metrology Research Centre
National Research Council Canada / Government of Canada
***@nrc-cnrc.gc.ca / 613-993-3129

Fréquence et temps
Centre de recherche en métrologie
Conseil national de recherches Canada / Gouvernement du Canada
***@nrc-cnrc.gc.ca / 613-993-3129

-----Original Message-----
From: Charbonneau, André [mailto:***@nrc-cnrc.gc.ca]
Sent: April 6, 2021 5:10 PM
To: chrony-***@chrony.tuxfamily.org
Subject: RE: [chrony-users] need help with large delays reported in ntpdata

Thanks Miroslav.

The server is a Masterclock GMR1000. Funny enough, as soon as I disconnected the 1pps input pulse to that GMR1000 unit, and it falls back to GPS for reference, then the precision is improved by a lot. For example:

Stratum 1 with 1pps input (as reported in the output of chronyc ntpdata command):

Remote address : ***.***.52.242 (***634F2)
Remote port : 123
Local address : ***.***.52.101 (***63465)
Leap status : Normal
Version : 3
Mode : Server
Stratum : 1
Poll interval : 10 (1024 seconds)
Precision : -5 (0.031250000 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Tue Apr 06 18:13:51 2021
Offset : -0.000000193 seconds
Peer delay : 0.031250019 seconds
Peer dispersion : 0.031250019 seconds
Response time : 0.000412496 seconds
Jitter asymmetry: +0.00
NTP tests : 111 111 1101
Interleaved : No
Authenticated : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX : 8095
Total RX : 8095
Total valid RX : 8095


Same stratum 1 with 1pps disconnected, shortly after, using GPS as fallback:

Remote address : ***.***.52.242 (***634F2)
Remote port : 123
Local address : ***.***.52.101 (***63465)
Leap status : Normal
Version : 3
Mode : Server
Stratum : 1
Poll interval : 10 (1024 seconds)
Precision : -18 (0.000003815 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Tue Apr 06 18:19:15 2021
Offset : -0.000027722 seconds
Peer delay : 0.000116265 seconds
Peer dispersion : 0.000003832 seconds
Response time : 0.000400542 seconds
Jitter asymmetry: +0.00
NTP tests : 111 111 1111
Interleaved : No
Authenticated : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX : 8105
Total RX : 8105
Total valid RX : 8105

I wonder what might be causing this discrepancy between the 2 references.

Thanks,
Andre
--
Andre Charbonneau

Frequency & Time
Metrology Research Centre
National Research Council Canada / Government of Canada ***@nrc-cnrc.gc.ca / 613-993-3129

Fréquence et temps
Centre de recherche en métrologie
Conseil national de recherches Canada / Gouvernement du Canada ***@nrc-cnrc.gc.ca / 613-993-3129


-----Original Message-----
From: Miroslav Lichvar [mailto:***@redhat.com]
Sent: April 6, 2021 4:15 AM
To: chrony-***@chrony.tuxfamily.org
Subject: Re: [chrony-users] need help with large delays reported in ntpdata

***ATTENTION*** This email originated from outside of the NRC. ***ATTENTION*** Ce courriel provient de l'extérieur du CNRC
Post by Charbonneau, André
Precision : -5 (0.031250000 seconds)
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Reference ID : 474D5200 (GMR)
Reference time : Sat Apr 03 18:55:15 2021
Offset : +0.000004316 seconds
Peer delay : 0.031250019 seconds
Peer dispersion : 0.031250019 seconds
These peer delay and peer dispersion values are quite large (being connected to the same switch) and constant over time (they don't change). I was looking at the chrony code and this looks like this could be due because of large estimated error?
chrony as a client rounds up the peer delay and dispersion to the precision reported by the server. -5 is not great. What kind of server is that?

--
Miroslav Lichvar


--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.


--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Loading...