Discussion:
[chrony-users] HW Timestamping fails with specific source
Thibaut BEYLER
2018-01-29 16:17:48 UTC
Permalink
Hello,

We have chrony 3.2 clients on 5 sites, each synchronised with 3 sources :

- Source 1 : An high performance asic-based ntp server with server-side
hw timestamping
- Source 2 : A 'backup' software-based ntp server
- Source 3 : A (shitty) linux ntpd

Source 1 & 2 are from a stratum 1 time server deployed on each site.

All our client run on a recent kernel and have nics that have at least TX
hw timestamping features, timestamping should be at least hardware/kernel.
In this mode, peer delay is around 12us normally with source 1 (only 1
switch between the clients and the timeserver)

I recently investigated some server on a site that struggle to use HW
timestamp and spend most of their time in deamon/deamon mode instead of
hardware/kernel. The weird part is that it occurs only with source 1, the
client kernels are stable with hw/kernel on the less acurate sources.

Turns out I have this problem only with my high performance source, the
'slower' sources being fine.

Running chrony in debug mode i have messages such as

2018-01-26T16:46:26Z sys_linux.c:755:(get_precise_phc_sample)
ioctl(PTP_SYS_OFFSET_PRECISE) failed : Operation not supported
2018-01-26T16:46:26Z hwclock.c:171:(HCL_AccumulateSample) HW clock needs
more samples
2018-01-26T16:46:26Z ntp_io_linux.c:653:(NIO_Linux_ProcessMessage)
Received 90 (48) bytes from error queue for 10.214.16.11:123 fd=13 if=2
tss=0
2018-01-26T16:46:26Z ntp_io_linux.c:665:(NIO_Linux_ProcessMessage) Missing
TX timestamp
I tried to run chrony in debug mode in another site where I have no
problems to compare the logs, but when i run chronyd as root (debug mode or
not) with the exact same command that systemd launches it, then I have the
exact same problem (timestamp reported deamon/deamon with source 1 and fine
with the others)

Any leads ?
Miroslav Lichvar
2018-01-29 17:05:42 UTC
Permalink
Post by Thibaut BEYLER
I recently investigated some server on a site that struggle to use HW
timestamp and spend most of their time in deamon/deamon mode instead of
hardware/kernel. The weird part is that it occurs only with source 1, the
client kernels are stable with hw/kernel on the less acurate sources.
It's a race condition between receiving the TX timestamp of the client
request and receiving the response from the server. If the response is
so fast that it is received before the TX timestamp of the request,
the late timestamp will be ignored as there will not be a
corresponding request to which it could be applied.

What NIC do the clients have? I've seen this with an Intel card. It
happened only for a minority of requests and they were all dropped due
to failing the test C, so overall it worked well with HW timestamping.

I'm not sure if it should be treated as a driver/HW issue or if
applications should really be expected to get TX timestamps so late.
I asked about this on the Intel development list some time ago, but
didn't get a response. I think it could be addressed in chrony by
introducing a new timeout for timestamps, but I'd rather avoid the
extra complexity.

As a workaround you could try to add another switch between the server
and clients to increase the peer delay. You could also try to lower
the priority of the chronyd process to give it more time to get the
timestamp. Someone reported it happened only when chronyd was running
with a high priority.
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Thibaut BEYLER
2018-01-29 18:21:30 UTC
Permalink
Nics are Intel 10GB FC, 82599ES on all sites.

I haven't though of that race condition, site which has problem is the most
recent so more recent switches, cpu & mobos, so that would make sense,

However, in my case 98% of he requests are Deamon/Deamon, so most of them
pasts all the tests (generally those below 60us) which is bad for time
accurancy.

Looking at the logs i can also spots some crazy vales getting 2 secondes
peer delay (and thoses are Deamon/Kernel ) .. mixed timestamp ? Will
investigate more tommorow, thanks
Post by Miroslav Lichvar
Post by Thibaut BEYLER
I recently investigated some server on a site that struggle to use HW
timestamp and spend most of their time in deamon/deamon mode instead of
hardware/kernel. The weird part is that it occurs only with source 1, the
client kernels are stable with hw/kernel on the less acurate sources.
It's a race condition between receiving the TX timestamp of the client
request and receiving the response from the server. If the response is
so fast that it is received before the TX timestamp of the request,
the late timestamp will be ignored as there will not be a
corresponding request to which it could be applied.
What NIC do the clients have? I've seen this with an Intel card. It
happened only for a minority of requests and they were all dropped due
to failing the test C, so overall it worked well with HW timestamping.
I'm not sure if it should be treated as a driver/HW issue or if
applications should really be expected to get TX timestamps so late.
I asked about this on the Intel development list some time ago, but
didn't get a response. I think it could be addressed in chrony by
introducing a new timeout for timestamps, but I'd rather avoid the
extra complexity.
As a workaround you could try to add another switch between the server
and clients to increase the peer delay. You could also try to lower
the priority of the chronyd process to give it more time to get the
timestamp. Someone reported it happened only when chronyd was running
with a high priority.
--
Miroslav Lichvar
--
with "unsubscribe" in the subject.
with "help" in the subject.
Miroslav Lichvar
2018-02-02 10:48:59 UTC
Permalink
Post by Thibaut BEYLER
However, in my case 98% of he requests are Deamon/Deamon, so most of them
pasts all the tests (generally those below 60us) which is bad for time
accurancy.
Hm, with recent kernels and drivers that support SW timestamping the
timestamps should always be at least kernel/kernel. What kernel
version do the clients have?

I've tried a few different approaches to wait for the late HW
timestamp and pushed one that I'm quite happy with. It is specific to
the Linux timestamping code and should have a minimal impact in the
usual case when the timestamps are not late (mostly limited to a
configuration with one interface that has HW timestamping, but another
does not).

Could you please build from git and see if it fixes the issue in your
environment?
Post by Thibaut BEYLER
Looking at the logs i can also spots some crazy vales getting 2 secondes
peer delay (and thoses are Deamon/Kernel ) .. mixed timestamp ? Will
investigate more tommorow, thanks
Do you have a packet capture or chrony debug output showing this?
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Thibaut BEYLER
2018-02-12 17:16:02 UTC
Permalink
Post by Miroslav Lichvar
Hm, with recent kernels and drivers that support SW timestamping the
timestamps should always be at least kernel/kernel. What kernel
version do the clients have ?
Clients have a kernel v4.9.51 so yes it should work (and it works with
other sources)


I've tried a few different approaches to wait for the late HW
Post by Miroslav Lichvar
timestamp and pushed one that I'm quite happy with. It is specific to
the Linux timestamping code and should have a minimal impact in the
usual case when the timestamps are not late (mostly limited to a
configuration with one interface that has HW timestamping, but another
does not).
Could you please build from git and see if it fixes the issue in your
Post by Miroslav Lichvar
environment?
Thanks ! back from holiday and tested that. Good news is that TX
timestamping is now always HW, however RX timestamp is Deamon most of the
time, sometimes Kernel. However, the general performance are worse now :
the peer delay is jittery and goes very high. Here's the debug log with my
two sources (slow : 10.214.11.23 , fast : 10.214.16.11 ) the best one
having the worst perfs : https://pastebin.com/yi0nYNCm
Post by Miroslav Lichvar
Looking at the logs i can also spots some crazy vales getting 2 secondes
Post by Thibaut BEYLER
peer delay (and thoses are Deamon/Kernel ) .. mixed timestamp ? Will
investigate more tommorow, thanks
Do you have a packet capture or chrony debug output showing this?
I could not reproduce that issue today but here's a debug log from 2 weeks
ago that I saved : https://pastebin.com/LUcYsTNz
Post by Miroslav Lichvar
Post by Thibaut BEYLER
However, in my case 98% of he requests are Deamon/Deamon, so most of them
pasts all the tests (generally those below 60us) which is bad for time
accurancy.
Hm, with recent kernels and drivers that support SW timestamping the
timestamps should always be at least kernel/kernel. What kernel
version do the clients have?
I've tried a few different approaches to wait for the late HW
timestamp and pushed one that I'm quite happy with. It is specific to
the Linux timestamping code and should have a minimal impact in the
usual case when the timestamps are not late (mostly limited to a
configuration with one interface that has HW timestamping, but another
does not).
Could you please build from git and see if it fixes the issue in your
environment?
Post by Thibaut BEYLER
Looking at the logs i can also spots some crazy vales getting 2 secondes
peer delay (and thoses are Deamon/Kernel ) .. mixed timestamp ? Will
investigate more tommorow, thanks
Do you have a packet capture or chrony debug output showing this?
--
Miroslav Lichvar
--
with "unsubscribe" in the subject.
with "help" in the subject.
Miroslav Lichvar
2018-02-13 14:28:58 UTC
Permalink
Post by Thibaut BEYLER
Post by Miroslav Lichvar
Could you please build from git and see if it fixes the issue in your
environment?
Thanks ! back from holiday and tested that. Good news is that TX
timestamping is now always HW, however RX timestamp is Deamon most of the
the peer delay is jittery and goes very high. Here's the debug log with my
two sources (slow : 10.214.11.23 , fast : 10.214.16.11 ) the best one
having the worst perfs : https://pastebin.com/yi0nYNCm
Thanks. I can reproduce the issue now. It's a race condition in the
kernel and it's not related to HW timestamping.

When chronyd opens a socket with kernel RX timestamping and it's the
only socket on the system using timestamping, it takes some time
before the kernel actually starts timestamping received packets. If a
response is received before that, it will only have a daemon timestamp
and that one is bad due to waiting for the transmit timestamp.

I'm not sure yet how to fix it. We could keep a dummy socket open to
permanently enable timestamping in the kernel, or maybe reuse the
cmdmon sockets for this purpose.

As a workaround, you can enable the NTP server socket, which will keep
the timestamping enabled, e.g. by adding "allow 127.0.0.1" to
chrony.conf.
Post by Thibaut BEYLER
Post by Miroslav Lichvar
Looking at the logs i can also spots some crazy vales getting 2 secondes
Post by Thibaut BEYLER
peer delay (and thoses are Deamon/Kernel ) .. mixed timestamp ? Will
investigate more tommorow, thanks
Do you have a packet capture or chrony debug output showing this?
I could not reproduce that issue today but here's a debug log from 2 weeks
ago that I saved : https://pastebin.com/LUcYsTNz
Interesting. I don't see anything wrong here. It looks more like an
issue with the network or server, and not the timestamping itself, as
there is a >2s gap between the log messages corresponding to the
transmission and reception.
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Thibaut BEYLER
2018-02-14 10:26:13 UTC
Permalink
Post by Miroslav Lichvar
When chronyd opens a socket with kernel RX timestamping and it's the
only socket on the system using timestamping, it takes some time
before the kernel actually starts timestamping received packets. If a
response is received before that, it will only have a daemon timestamp
and that one is bad due to waiting for the transmit timestamp.
Glad you found something, so the problem occurs when the ntp response
packet is received before the kernel could initialize the RX timestamping ?

As a workaround, you can enable the NTP server socket, which will keep
Post by Miroslav Lichvar
the timestamping enabled, e.g. by adding "allow 127.0.0.1" to
chrony.conf.
Works great this way, thanks ! Hardware/Kernel all the time and low peer
delay with this workaround.
Post by Miroslav Lichvar
Post by Thibaut BEYLER
Post by Miroslav Lichvar
Could you please build from git and see if it fixes the issue in your
environment?
Thanks ! back from holiday and tested that. Good news is that TX
timestamping is now always HW, however RX timestamp is Deamon most of the
the peer delay is jittery and goes very high. Here's the debug log with
my
Post by Thibaut BEYLER
two sources (slow : 10.214.11.23 , fast : 10.214.16.11 ) the best one
having the worst perfs : https://pastebin.com/yi0nYNCm
Thanks. I can reproduce the issue now. It's a race condition in the
kernel and it's not related to HW timestamping.
When chronyd opens a socket with kernel RX timestamping and it's the
only socket on the system using timestamping, it takes some time
before the kernel actually starts timestamping received packets. If a
response is received before that, it will only have a daemon timestamp
and that one is bad due to waiting for the transmit timestamp.
I'm not sure yet how to fix it. We could keep a dummy socket open to
permanently enable timestamping in the kernel, or maybe reuse the
cmdmon sockets for this purpose.
As a workaround, you can enable the NTP server socket, which will keep
the timestamping enabled, e.g. by adding "allow 127.0.0.1" to
chrony.conf.
Post by Thibaut BEYLER
Post by Miroslav Lichvar
Looking at the logs i can also spots some crazy vales getting 2
secondes
Post by Thibaut BEYLER
Post by Miroslav Lichvar
Post by Thibaut BEYLER
peer delay (and thoses are Deamon/Kernel ) .. mixed timestamp ? Will
investigate more tommorow, thanks
Do you have a packet capture or chrony debug output showing this?
I could not reproduce that issue today but here's a debug log from 2
weeks
Post by Thibaut BEYLER
ago that I saved : https://pastebin.com/LUcYsTNz
Interesting. I don't see anything wrong here. It looks more like an
issue with the network or server, and not the timestamping itself, as
there is a >2s gap between the log messages corresponding to the
transmission and reception.
--
Miroslav Lichvar
--
with "unsubscribe" in the subject.
with "help" in the subject.
Miroslav Lichvar
2018-02-14 10:55:16 UTC
Permalink
Post by Thibaut BEYLER
Post by Miroslav Lichvar
When chronyd opens a socket with kernel RX timestamping and it's the
only socket on the system using timestamping, it takes some time
before the kernel actually starts timestamping received packets. If a
response is received before that, it will only have a daemon timestamp
and that one is bad due to waiting for the transmit timestamp.
Glad you found something, so the problem occurs when the ntp response
packet is received before the kernel could initialize the RX timestamping ?
Yes. When chronyd is polling a server, it opens a socket, enables
timestamping and few other options on the socket, and sends a request.
The kernel has a global counter for sockets using timestamping, but it
is not updated synchronously with that setting on the socket (for
performance reasons I assume). If chronyd is the only application
using timestamping and it doesn't have the server socket open, its
client socket will be the one which turns the kernel timestamping on
and off.
Post by Thibaut BEYLER
Works great this way, thanks ! Hardware/Kernel all the time and low peer
delay with this workaround.
Great.

I'll probably push a fix which will keep one socket open all the time
just for the timestamping.
--
Miroslav Lichvar
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Denny Page
2018-01-29 17:39:42 UTC
Permalink
Is it a cut-through switch?
Post by Thibaut BEYLER
Hello,
• Source 1 : An high performance asic-based ntp server with server-side hw timestamping
• Source 2 : A 'backup' software-based ntp server
• Source 3 : A (shitty) linux ntpd
Source 1 & 2 are from a stratum 1 time server deployed on each site.
All our client run on a recent kernel and have nics that have at least TX hw timestamping features, timestamping should be at least hardware/kernel. In this mode, peer delay is around 12us normally with source 1 (only 1 switch between the clients and the timeserver)
I recently investigated some server on a site that struggle to use HW timestamp and spend most of their time in deamon/deamon mode instead of hardware/kernel. The weird part is that it occurs only with source 1, the client kernels are stable with hw/kernel on the less acurate sources.
Turns out I have this problem only with my high performance source, the 'slower' sources being fine.
Running chrony in debug mode i have messages such as
2018-01-26T16:46:26Z sys_linux.c:755:(get_precise_phc_sample) ioctl(PTP_SYS_OFFSET_PRECISE) failed : Operation not supported
2018-01-26T16:46:26Z hwclock.c:171:(HCL_AccumulateSample) HW clock needs more samples
2018-01-26T16:46:26Z ntp_io_linux.c:653:(NIO_Linux_ProcessMessage) Received 90 (48) bytes from error queue for 10.214.16.11:123 fd=13 if=2 tss=0
2018-01-26T16:46:26Z ntp_io_linux.c:665:(NIO_Linux_ProcessMessage) Missing TX timestamp
I tried to run chrony in debug mode in another site where I have no problems to compare the logs, but when i run chronyd as root (debug mode or not) with the exact same command that systemd launches it, then I have the exact same problem (timestamp reported deamon/deamon with source 1 and fine with the others)
Any leads ?
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Loading...