[chrony-users] Chrony vs. Linux RNG
Holger Hoffstätte
2018-04-22 17:15:12 UTC
Hello!

I test stable/LTS kernels to help Greg KH and just updated to 4.16.4-rc1.
This contains a few patches that are supposed to help with CVEs around
randomness, and which cause an interesting catch-22 that affects chrony,
hence this mail.

The patches in question are in the stable queue and can be found under the
"random-*" prefix at:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-4.16

Not sure exactly which patch is "at fault" because I don't feel like bisecting
this mess and it's unlikely to be reverted anyway.

The initial symptom was that starting chronyd on boot seemed to "hang",
but eventually continued after ~30 secs or so, working fine as usual, so
I blamed Gremlins and continued.

Since the symptom reliably reproduced on two other machines I investigated
further and eventually found that it relates to access of the CRNG: as soon
as "random: crng init done" appeared in the kernel log, chrony would start
up without delay. Apparently accessing the CRNG now blocks in early phases
of the boot process, when not enough entropy has been collected - which is
typically the time when chrony is started as well. This can make e.g. a
headless server without concurrent background activity take a *really* long
time to boot: in one instance I measured a blocked boot process taking over
a minute instead of the usual 5 seconds. IMHO furiously pinging a booting
remote host is not really a solution, though it does seem to help. :)

Long story short, is there something chrony can do to avoid this?
Why does it need to access any random number generators in the first
place?

For now I just quick-fixed this issue for myself by starting chrony in the
background, allowing the system to boot and so creating more entropy faster -
but I realize of course the downside of adjusting time later etc.

Thanks!
Holger
--
To unsubscribe email chrony-users-***@chrony.tuxfamily.org
with "unsubscribe" in the subject.
For help email chrony-users-***@chrony.tuxfamily.org
with "help" in the subject.
Trouble? Email ***@chrony.tuxfamily.org.
Bill Unruh
2018-04-22 23:25:40 UTC
It definitely should not do that. It sounds like something is using /dev/random
instead of /dev/urandom. Nothing should use /dev/random
(chrony does not seem to). urandom should NOT block. If they have altered the
kernel so it does, that is a bug.



William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ ***@physics.ubc.ca
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
Miroslav Lichvar
2018-04-23 09:35:21 UTC
Post by Bill Unruh
It definitely should not do that. It sounds like something is using /dev/random
instead of /dev/urandom. Nothing should use /dev/random
(chrony does not seem to). urandom should NOT block. If they have altered the
kernel so it does, that is a bug.
chronyd uses the getrandom() system call if it is available, in a mode
that can block.

According to the random(4) man page, reading from /dev/urandom never
blocks and always returns some data, but the getrandom() system call
either blocks or fails if not allowed to block when the PRNG isn't
initialized yet.

So, chronyd could fall back to reading from /dev/urandom if the
getrandom() call failed (setting the flag to not block).
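A minimal sketch of that fallback, assuming Linux with glibc 2.25 or later
(which provides <sys/random.h>). The helper names read_urandom() and
get_random_bytes_nonblocking() are made up for illustration and are not
chrony functions. The idea is to ask getrandom() for bytes with
GRND_NONBLOCK and, if that does not deliver the full amount, to read the same
number of bytes from /dev/urandom instead:

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(), GRND_NONBLOCK */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes from /dev/urandom.
   Returns 1 on success, 0 on failure. */
static int
read_urandom(unsigned char *buf, size_t len)
{
  size_t done = 0;
  int fd = open("/dev/urandom", O_RDONLY);

  if (fd < 0)
    return 0;

  while (done < len) {
    ssize_t r = read(fd, buf + done, len - done);

    if (r < 0 && errno == EINTR)
      continue;                 /* interrupted by a signal, try again */
    if (r <= 0)
      break;                    /* real error or unexpected end of file */
    done += (size_t)r;
  }

  close(fd);
  return done == len;
}

/* Hypothetical helper: try getrandom() without blocking. If it fails
   (e.g. the CRNG is not initialised yet, or the kernel lacks the system
   call) or returns fewer bytes than requested, fall back to /dev/urandom.
   Returns 1 on success, 0 on failure. */
static int
get_random_bytes_nonblocking(unsigned char *buf, size_t len)
{
  ssize_t r = getrandom(buf, len, GRND_NONBLOCK);

  if (r == (ssize_t)len)
    return 1;

  return read_urandom(buf, len);
}
```

A caller would use it as get_random_bytes_nonblocking(buf, sizeof (buf)) and
treat a return value of 0 as "no random source available". Bytes read from
/dev/urandom before the pool is initialised may be of lower quality, which is
the trade-off of not blocking.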
--
Miroslav Lichvar
Miroslav Lichvar
2018-04-23 09:04:37 UTC
Post by Holger Hoffstätte
I test stable/LTS kernels to help Greg KH and just updated to 4.16.4-rc1.
This contains a few patches that are supposed to help with CVEs around
randomness, and which cause an interesting catch-22 that affects chrony,
hence this mail.
Thanks for the heads up.

I tried booting a VM with 4.17-rc2, which should include the patches
you are referring to, but didn't see any delay problems.

On what distro do you test it? Does it save and restore the random
seed on boot (e.g. the systemd-random-seed)?
Post by Holger Hoffstätte
The patches in question are in the stable queue and can be found under the
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-4.16
Long story short, is there something chrony can do to avoid this?
I guess it could use a non-blocking read for the urandom device (or
getrandom() syscall) and fall back to random(), but I'm not sure if it
would be a good idea from the security point of view.
Post by Holger Hoffstätte
Why does it need to access any random number generators in the first
place?
There are a few places in the chrony code that use random numbers. I
think the most important one is randomness added to transmit timestamp
in client requests in order to make it more difficult for attackers
(that don't see the request) to send you a valid response and mess
with your clock.
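To illustrate the general idea only (this is not chrony's code): an NTP
client accepts a response only if it echoes back the transmit timestamp of
the request, so filling the low-order bits of that timestamp with random
bits makes it harder for an off-path attacker to guess. The sketch below
assumes a 64-bit timestamp with seconds in the upper 32 bits and the
fraction in the lower 32 bits; the type ntp_ts64 and the function
fuzz_timestamp() are hypothetical names, and the random bits are passed in
by the caller:

```c
#include <stdint.h>

/* Hypothetical 64-bit NTP timestamp: seconds in the upper 32 bits,
   binary fraction of a second in the lower 32 bits. */
typedef uint64_t ntp_ts64;

/* Replace the lowest fuzz_bits bits of ts with the corresponding bits of
   rnd. fuzz_bits is clamped to 32 so that only the fraction is touched
   and the seconds field is always preserved. */
static ntp_ts64
fuzz_timestamp(ntp_ts64 ts, uint64_t rnd, unsigned int fuzz_bits)
{
  uint64_t mask;

  if (fuzz_bits > 32)
    fuzz_bits = 32;

  mask = fuzz_bits ? (UINT64_C(1) << fuzz_bits) - 1 : 0;

  return (ts & ~mask) | (rnd & mask);
}
```

The more bits are randomized, the harder the value is to guess, at the cost
of timestamp resolution. How many bits are safe to replace depends on the
precision of the local clock.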
--
Miroslav Lichvar
Holger Hoffstätte
2018-04-23 09:52:00 UTC
Post by Miroslav Lichvar
Post by Holger Hoffstätte
I test stable/LTS kernels to help Greg KH and just updated to 4.16.4-rc1.
This contains a few patches that are supposed to help with CVEs around
randomness, and which cause an interesting catch-22 that affects chrony,
hence this mail.
Thanks for the heads up.
I tried booting a VM with 4.17-rc2, which should include the patches
Yeah, I could have mentioned that..
Post by Miroslav Lichvar
you are referring to, but didn't see any delay problems.
On what distro do you test it? Does it save and restore the random
seed on boot (e.g. the systemd-random-seed)?
Gentoo using OpenRC, chronyd 3.3. It uses start-stop-daemon and it
was definitely chronyd hanging the boot sequence; for tests I disabled
chronyd from the default runlevel and all was back to smooth sailing.
Since s-s-d relies on chronyd going into the background, the temporary
fix was to add the --background flag to s-s-d so that OpenRC returns
immediately.

I just saw that it does indeed have a "urandom" service in the boot
runlevel, reading/writing from/to /var/lib/misc/random-seed.
But that happens way before chrony's default runlevel.

glibc is 2.26, so it should be using getrandom() and not use the
urandom fallbacks. Unfortunately it's really hard to trace/debug
this since the bug only manifests itself during the early stages, and
as soon as I do anything on a freshly booted system I create entropy,
initialising the crng and thus making everything work.
Post by Miroslav Lichvar
I guess it could use a non-blocking read for the urandom device (or
getrandom() syscall) and fall back to random(), but I'm not sure if it
would be a good idea from the security point of view.
I found in util.c that it *should* be using getrandom() already?
Maybe the HAVE_GETRANDOM detection didn't work, but even then the
urandom fallback should not be blocking either. I'll double-check the
package script's autoconf log.

thanks,
Holger
Holger Hoffstätte
2018-04-23 10:05:55 UTC
Post by Holger Hoffstätte
Post by Miroslav Lichvar
I guess it could use a non-blocking read for the urandom device (or
getrandom() syscall) and fall back to random(), but I'm not sure if it
would be a good idea from the security point of view.
I found in util.c that it *should* be using getrandom() already?
Maybe the HAVE_GETRANDOM detection didn't work, but even then the
urandom fallback should not be blocking either. I'll double-check the
package script's autoconf log.
As I suspected, everything looks good:

$ ebuild chrony-3.3.ebuild configure
* chrony-3.3.tar.gz BLAKE2B SHA512 size ;-) ... [ ok ]
Unpacking source...
Unpacking chrony-3.3.tar.gz to /tmp/portage/net-misc/chrony-3.3/work
Source unpacked in /tmp/portage/net-misc/chrony-3.3/work
Preparing source in /tmp/portage/net-misc/chrony-3.3/work/chrony-3.3 ...
Source prepared.
Configuring source in /tmp/portage/net-misc/chrony-3.3/work/chrony-3.3 ...
* ./configure --enable-scfilter --disable-pps --without-editline --docdir=/usr/share/doc/chrony-3.3 --chronysockdir=/run/chrony --mandir=/usr/share/man --prefix=/usr --sysconfdir=/etc/chrony --disable-sechash --without-nss --without-tomcrypt
Configuring for Linux-x86_64
Checking for x86_64-pc-linux-gnu-gcc : Yes
Checking for 64-bit time_t : Yes
NTP time mapped to 1968-05-05T09:56:04Z/2104-06-11T16:24:20Z
Checking for math : No
Checking for math in -lm : Yes
Checking for <stdint.h> : Yes
Checking for <inttypes.h> : Yes
Checking for struct in_pktinfo : Yes
Checking for IPv6 support : Yes
Checking for struct in6_pktinfo : No
Checking for struct in6_pktinfo with _GNU_SOURCE : Yes
Checking for clock_gettime() : Yes
Checking for getaddrinfo() : Yes
Checking for pthread : Yes
Checking for arc4random_buf() : No
Checking for getrandom() : Yes
Checking for recvmmsg() : Yes
Checking for SW/HW timestamping : Yes
Checking for other timestamping options : Yes
Checking for libcap : Yes
Checking for seccomp : Yes
Checking for <linux/rtc.h> : Yes
Checking for <linux/ptp_clock.h> : Yes
Checking for sched_setscheduler() : Yes
Checking for mlockall() : Yes
Checking for readline : Yes
Features : +CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER -SIGND +ASYNCDNS +READLINE -SECHASH +IPV6 -DEBUG
Creating Makefile
Creating doc/Makefile
Creating test/unit/Makefile
Source configured.
So it's probably indeed blocking in too-early getrandom() (thanks for
pointing that out!) and falling back to urandom with GRND_NONBLOCK could
work. Let me know if I can try any patches.

thanks,
Holger
Miroslav Lichvar
2018-04-23 10:40:23 UTC
Post by Holger Hoffstätte
So it's probably indeed blocking in too-early getrandom() (thanks for
pointing that out!) and falling back to urandom with GRND_NONBLOCK could
work. Let me know if I can try any patches.
You can try the following patch. It should prevent getrandom() from
blocking and allow fall back to /dev/urandom.

--- a/util.c
+++ b/util.c
@@ -1224,7 +1224,7 @@ get_random_bytes_getrandom(char *buf, unsigned int len)
       if (disabled)
         break;

-      if (getrandom(rand_buf, sizeof (rand_buf), 0) != sizeof (rand_buf)) {
+      if (getrandom(rand_buf, sizeof (rand_buf), GRND_NONBLOCK) != sizeof (rand_buf)) {
         disabled = 1;
         break;
       }
--
Miroslav Lichvar
Holger Hoffstätte
2018-04-23 11:00:38 UTC
Post by Miroslav Lichvar
Post by Holger Hoffstätte
So it's probably indeed blocking in too-early getrandom() (thanks for
pointing that out!) and falling back to urandom with GRND_NONBLOCK could
work. Let me know if I can try any patches.
You can try the following patch. It should prevent getrandom() from
blocking and allow fall back to /dev/urandom.
--- a/util.c
+++ b/util.c
@@ -1224,7 +1224,7 @@ get_random_bytes_getrandom(char *buf, unsigned int len)
       if (disabled)
         break;

-      if (getrandom(rand_buf, sizeof (rand_buf), 0) != sizeof (rand_buf)) {
+      if (getrandom(rand_buf, sizeof (rand_buf), GRND_NONBLOCK) != sizeof (rand_buf)) {
         disabled = 1;
         break;
       }
Works as expected. \o/

Tested-by: Holger Hoffstätte <***@applied-asynchrony.com>

Thanks!
Holger
Miroslav Lichvar
2018-04-23 11:07:23 UTC
Post by Holger Hoffstätte
Works as expected. \o/
Great. Thanks. I'll think a bit about possible implications before
pushing the change.
--
Miroslav Lichvar
Holger Hoffstätte
2018-04-23 11:21:42 UTC
Post by Miroslav Lichvar
Great. Thanks. I'll think a bit about possible implications before
pushing the change.
Maybe make "available" and "disabled" non-static so that they are
not just evaluated once? On subsequent calls the CRNG will eventually
be initialized, so at some point it will start working with the
expected randomness. Just an idea.

cheers
Holger
Miroslav Lichvar
2018-04-23 11:38:46 UTC
Post by Holger Hoffstätte
Post by Miroslav Lichvar
Great. Thanks. I'll think a bit about possible implications before
pushing the change.
Maybe make "available" and "disabled" non-static so that they are
not just evaluated once?
They are static to avoid a performance loss when the system call is
not supported (e.g. on an old kernel).
Post by Holger Hoffstätte
On subsequent calls the CRNG will eventually
be initialized, so at some point it will start working with the
expected randomness. Just an idea.
I think that's possible, but it would need to check the error code to
distinguish between getrandom() not being fully initialized and
getrandom() missing.
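A rough sketch of such a check, assuming the errno values documented in
getrandom(2): EAGAIN when GRND_NONBLOCK is set and the call would have
blocked, ENOSYS when the kernel does not provide the system call. The enum
rng_status and both function names are invented for this example:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(), GRND_NONBLOCK */
#include <sys/types.h>

/* Hypothetical result codes for a non-blocking getrandom() attempt. */
enum rng_status { RNG_OK, RNG_NOT_READY, RNG_UNSUPPORTED, RNG_ERROR };

/* Map a getrandom() return value and the errno seen after it to a status.
   Kept separate from the system call so the mapping is easy to test. */
static enum rng_status
classify_getrandom_result(ssize_t ret, int err, size_t wanted)
{
  if (ret == (ssize_t)wanted)
    return RNG_OK;
  if (ret < 0 && err == EAGAIN)
    return RNG_NOT_READY;       /* CRNG not initialised yet: retry later */
  if (ret < 0 && err == ENOSYS)
    return RNG_UNSUPPORTED;     /* no such system call: stop trying */
  return RNG_ERROR;             /* short read or any other failure */
}

/* Try getrandom() without blocking and report what happened. */
static enum rng_status
try_getrandom(void *buf, size_t len)
{
  ssize_t ret;
  int err;

  do {
    ret = getrandom(buf, len, GRND_NONBLOCK);
    err = errno;
  } while (ret < 0 && err == EINTR);

  return classify_getrandom_result(ret, err, len);
}
```

With this split, only RNG_UNSUPPORTED would need to be remembered in a
static flag. RNG_NOT_READY could use the fallback for the current call and
try getrandom() again on the next one.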

One thing that I don't like much about the fallback is that it may
cause chronyd to randomly fail in environments where /dev/urandom is
not available. Before, it either always worked or failed. Now it may
fail if it's started too early and restarting it later will fix it.
--
Miroslav Lichvar
Miroslav Lichvar
2018-04-23 10:13:51 UTC
Post by Holger Hoffstätte
Gentoo using OpenRC, chronyd 3.3. It uses start-stop-daemon and it
was definitely chronyd hanging the boot sequence; for tests I disabled
chronyd from the default runlevel and all was back to smooth sailing.
Since s-s-d relies on chronyd going into the background, the temporary
fix was to add the --background flag to s-s-d so that OpenRC returns
immediately.
Ok, if it is blocking before the foreground process exits, that
probably means it's not due to NTP, but something else is using random
numbers, e.g. a timer is added to the scheduler.
Post by Holger Hoffstätte
I found in util.c that it *should* be using getrandom() already?
It should and that's probably why it is blocking. If you disable
HAVE_GETRANDOM, it should stop.
--
Miroslav Lichvar
Holger Hoffstätte
2018-04-23 10:53:59 UTC
Post by Miroslav Lichvar
Post by Holger Hoffstätte
Gentoo using OpenRC, chronyd 3.3. It uses start-stop-daemon and it
was definitely chronyd hanging the boot sequence; for tests I disabled
chronyd from the default runlevel and all was back to smooth sailing.
Since s-s-d relies on chronyd going into the background, the temporary
fix was to add the --background flag to s-s-d so that OpenRC returns
immediately.
Ok, if it is blocking before the foreground process exits, that
probably means it's not due to NTP, but something else is using random
numbers, e.g. a timer is added to the scheduler.
Post by Holger Hoffstätte
I found in util.c that it *should* be using getrandom() already?
It should and that's probably why it is blocking. If you disable
HAVE_GETRANDOM, it should stop.
Indeed it does. I configured as usual but added a swift
"sed -i '/HAVE_GETRANDOM/d' config.h" post-configure, built,
removed the --background flag from s-s-d, rebooted and it
immediately starts just as before.

Now trying the other patch. :)

thanks!
Holger
Bill Unruh
2018-04-24 03:31:11 UTC
Post by Miroslav Lichvar
Post by Holger Hoffstätte
I test stable/LTS kernels to help Greg KH and just updated to 4.16.4-rc1.
This contains a few patches that are supposed to help with CVEs around
randomness, and which cause an interesting catch-22 that affects chrony,
hence this mail.
Thanks for the heads up.
I tried booting a VM with 4.17-rc2, which should include the patches
you are referring to, but didn't see any delay problems.
On what distro do you test it? Does it save and restore the random
seed on boot (e.g. the systemd-random-seed)?
Post by Holger Hoffstätte
The patches in question are in the stable queue and can be found under the
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-4.16
Long story short, is there something chrony can do to avoid this?
I guess it could use a non-blocking read for the urandom device (or
getrandom() syscall) and fall back to random(), but I'm not sure if it
would be a good idea from the security point of view.
Do not use /dev/random. It gives no extra security over /dev/urandom, and it can
block. Using it is a terrible idea no matter what you use it for.

The random() glibc routine is terribly weak. (It has nothing to do with
/dev/random, uses a totally different generator, and is easily breakable
nowadays.) Just read /dev/urandom (or use getrandom() with the GRND_NONBLOCK
flag set).
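One property that makes random() unsuitable here can be shown with a short
sketch: its output is fully determined by the seed, so anyone who knows or
guesses the seed can reproduce every value. If srandom() is never called,
the generator behaves as if it had been seeded with 1. The helper names
below are made up for the demonstration:

```c
#define _DEFAULT_SOURCE   /* expose random()/srandom() under strict -std= modes */
#include <stdlib.h>

/* Fill out[] with n values from random() after seeding it with seed. */
static void
random_sequence(unsigned int seed, long *out, int n)
{
  int i;

  srandom(seed);
  for (i = 0; i < n; i++)
    out[i] = random();
}

/* Return 1 if two runs seeded with the same value produce identical
   output, 0 otherwise. */
static int
same_seed_repeats(unsigned int seed)
{
  long a[8], b[8];
  int i;

  random_sequence(seed, a, 8);
  random_sequence(seed, b, 8);

  for (i = 0; i < 8; i++)
    if (a[i] != b[i])
      return 0;

  return 1;
}
```

same_seed_repeats() returns 1 for any seed, which is exactly what you do not
want from a source of unpredictable values.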
Post by Miroslav Lichvar
Post by Holger Hoffstätte
Why does it need to access any random number generators in the first
place?
There are a few places in the chrony code that use random numbers. I
think the most important one is randomness added to transmit timestamp
in client requests in order to make it more difficult for attackers
(that don't see the request) to send you a valid response and mess
with your clock.
The code does seem to use /dev/urandom in general, except that occasionally it
uses getrandom() with the flags set to 0 (which is the same as urandom, but
blocks on initial use if the pool's entropy estimate is not yet high enough).