[Openvpn-devel] Switch assertion failure to returning false

Message ID 20200520183404.54822-1-code@jeremyevans.net
State Accepted
Headers show
Series [Openvpn-devel] Switch assertion failure to returning false | expand

Commit Message

Jeremy Evans May 20, 2020, 8:34 a.m. UTC
This assertion failure can be hit in production, which causes the
openvpn server process to stop and all clients to be disconnected.
Bug #1270 has been filed for this issue on Trac by another user
who has experienced the issue, and this patch attempts to address it.

Tracing callers, it appears that some callers check ks->authenticated
before calling, but others do not.  It may be possible to add the check
for the callers that do not check, but this seems to be a simpler
solution.

To give some background, we hit this assertion failure, with the
following log output:

```
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 PUSH: Received control message: 'PUSH_REQUEST'
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 SENT CONTROL [username]: 'PUSH_REPLY,redirect-gateway def1,comp-lzo,persist-key,persist-tun,route-gateway 10.28.47.1,topology subnet,ping 10,ping-restart 120,ifconfig 10.28.47.38 255.255.255.0,peer-id 89' (status=1)
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 Assertion failed at /path/to/openvpn-2.4.7/src/openvpn/ssl.c:1944 (ks->authenticated)
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 Exiting due to fatal error
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 Closing TUN/TAP interface
```

using the following OpenVPN server configuration:

```
port 1194
proto udp
dev-type tun
ca ca.crt
cert server.crt
key server.key
dh dh.pem
topology subnet
push "redirect-gateway def1"
push "comp-lzo"
push "persist-key"
push "persist-tun"
keepalive 10 120
comp-lzo
user nobody
group nobody
persist-key
persist-tun
cd /home/openvpn/server
chroot /var/empty
daemon
verb 3
crl-verify crl.pem
tls-auth ta.key 0
cipher AES-256-CBC
tls-version-min 1.2
tls-cipher ECDHE-RSA-AES256-GCM-SHA384
ncp-disable
mute-replay-warnings
script-security 3
auth-user-pass-verify "ldap-auth/ldap-auth" via-env
auth-user-pass-optional
```

and the following command line options:

```
--config openvpn.conf --dev tun1 --local 206.131.72.52 \
--log-append openvpn.log --status openvpn-status.log \
--server 10.28.47.0 255.255.255.0
```

The failed assertion is inside the function
`tls_session_generate_data_channel_keys`, which is called 3 other places
in `ssl.c.`:

* `key_method_2_write`: checks for `ks->authenticated` before calling

* `key_method_2_read`: appears to run in client mode but not in server
  mode

* `tls_session_update_crypto_params`: runs in server mode and does not
  check before calling

That leads me to believe the problem caller is
`tls_session_update_crypto_params`. There.s three callers of
`tls_session_update_crypto_params`:.

* `incoming_push_message` (`push.c`): Probably this caller, since the
  server pushes configuration to clients, and the log shows the
  assertion failure right after the push reply.

* `multi_process_file_closed` (`multi.c`): Not this caller.  NCP is
  disabled in config, and async push was not enabled when compiling.

* `do_deferred_options` (`init.c`): Not this caller.  The server
  configuration doesn't pull.

Changing the assertion to returning false appears to be the simplest
fix.  Another approach would be changing callers to check
`ks->authenticated` before calling, either
`tls_session_update_crypto_params` or `incoming_push_message`.

Signed-off-by: Jeremy Evans <code@jeremyevans.net>
---
 src/openvpn/ssl.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Gert Doering May 20, 2020, 9:33 a.m. UTC | #1
Hi,

On Wed, May 20, 2020 at 11:34:04AM -0700, Jeremy Evans wrote:
> To give some background, we hit this assertion failure, with the
> following log output:

This should not happen, asserting out in "normal server use" is bad.  

(Neither should it ever reach that point without ks->authenticated being 
true)

> Tue May 19 15:57:05 2020 username/73.135.141.11:1194 Assertion failed at /path/to/openvpn-2.4.7/src/openvpn/ssl.c:1944 (ks->authenticated)

Is this with true 2.4.7, or is this something more recent?

> @@ -1930,7 +1930,10 @@ tls_session_generate_data_channel_keys(struct tls_session *session)
>                                            &ks->session_id_remote : &session->session_id;
>  
> -    ASSERT(ks->authenticated);
> +    if (!ks->authenticated) {
> +        msg(D_TLS_ERRORS, "TLS Error: key_state not authenticated");
> +        goto cleanup;
> +    }
>  
>      ks->crypto_options.flags = session->opt->crypto_flags;
>      if (!generate_key_expansion(&ks->crypto_options.key_ctx_bi,

I'm not sure if that code is correct, though - it will erase key
material (in cleanup) without actually having generated a session
key.  So "bad things might happen later".

But I leave that to Steffan or Arne who understands these intricacies
better than I do.

(Just confirmed, the ASSERT() is still like this in master)


Trying to understand the somewhat twisted maze how ks->authenticated
is set to true/false leads me to more questions...

 - are plugins or auth scripts involved on your side?
 - can you reproduce this, or was this a once-in-a-lifetime crash?

(reproduceable is better, of course :-) - as this could be a plugin
or script authentication interfering with changes brought by NCP...)

gert
Jeremy Evans May 20, 2020, 11:31 a.m. UTC | #2
On 05/20 09:33, Gert Doering wrote:
> Hi,
> 
> On Wed, May 20, 2020 at 11:34:04AM -0700, Jeremy Evans wrote:
> > To give some background, we hit this assertion failure, with the
> > following log output:
> 
> This should not happen, asserting out in "normal server use" is bad.  
> 
> (Neither should it ever reach that point without ks->authenticated being 
> true)
> 
> > Tue May 19 15:57:05 2020 username/73.135.141.11:1194 Assertion failed at /path/to/openvpn-2.4.7/src/openvpn/ssl.c:1944 (ks->authenticated)
> 
> Is this with true 2.4.7, or is this something more recent?

This failure was with 2.4.7, but the first time it happened we were
running 2.4.4 (see below)
 
> > @@ -1930,7 +1930,10 @@ tls_session_generate_data_channel_keys(struct tls_session *session)
> >                                            &ks->session_id_remote : &session->session_id;
> >  
> > -    ASSERT(ks->authenticated);
> > +    if (!ks->authenticated) {
> > +        msg(D_TLS_ERRORS, "TLS Error: key_state not authenticated");
> > +        goto cleanup;
> > +    }
> >  
> >      ks->crypto_options.flags = session->opt->crypto_flags;
> >      if (!generate_key_expansion(&ks->crypto_options.key_ctx_bi,
> 
> I'm not sure if that code is correct, though - it will erase key
> material (in cleanup) without actually having generated a session
> key.  So "bad things might happen later".
 
The same behavior would happen if generate_key_expansion fails a few
lines below, so my assumption was it was safe to do so.  However, that's
just an assumption and not even an educated guess.

> But I leave that to Steffan or Arne who understands these intricacies
> better than I do.
> 
> (Just confirmed, the ASSERT() is still like this in master)
> 
> 
> Trying to understand the somewhat twisted maze how ks->authenticated
> is set to true/false leads me to more questions...
> 
>  - are plugins or auth scripts involved on your side?
>  - can you reproduce this, or was this a once-in-a-lifetime crash?

There is a program to implement multifactor authentication as mentioned
in the config (`auth-user-pass-verify "ldap-auth/ldap-auth" via-env`)

This is a statically compiled C program that exits 0 on successful
multifactor authentication and 1 on unsuccessful authentication.

> (reproduceable is better, of course :-) - as this could be a plugin
> or script authentication interfering with changes brought by NCP...)

It's reproducible in the sense that it has happened to us multiple
times.  This is actually the fourth time it has happened.  Previous
times were 2018-04-16 (2.4.4), 2019-09-25 (2.4.7), and 2020-05-13
(2.4.7). Bug #1270 says it also affects 2.4.8.  I haven't tried 2.4.9
yet, but I plan to upgrade the server to use it next month.

It's not reproducible in the sense that I know how to trigger it,
though.

Thanks,
Jeremy
Steffan Karger May 27, 2020, 3:08 a.m. UTC | #3
Hi,

Thanks for the clear report, patch and follow-up.

On 20-05-2020 23:31, Jeremy Evans wrote:
> On 05/20 09:33, Gert Doering wrote:
>> On Wed, May 20, 2020 at 11:34:04AM -0700, Jeremy Evans wrote:
>>> To give some background, we hit this assertion failure, with the
>>> following log output:
>>
>> This should not happen, asserting out in "normal server use" is bad.  
>>
>> (Neither should it ever reach that point without ks->authenticated being 
>> true)

Agreed. I think it makes sense to first fix the urgent part (ie, not
kill the server), then figure out how de code ends up at this ASSERT.

Jeremy, can you determine from your logs whether this always
happenedwhen --auth-user-pass-verify returns zero or non-zero? Ie,
should the connections that trigger this succeed or fail?

>>> @@ -1930,7 +1930,10 @@ tls_session_generate_data_channel_keys(struct tls_session *session)
>>>                                            &ks->session_id_remote : &session->session_id;
>>>  
>>> -    ASSERT(ks->authenticated);
>>> +    if (!ks->authenticated) {
>>> +        msg(D_TLS_ERRORS, "TLS Error: key_state not authenticated");
>>> +        goto cleanup;
>>> +    }
>>>  
>>>      ks->crypto_options.flags = session->opt->crypto_flags;
>>>      if (!generate_key_expansion(&ks->crypto_options.key_ctx_bi,
>>
>> I'm not sure if that code is correct, though - it will erase key
>> material (in cleanup) without actually having generated a session
>> key.  So "bad things might happen later".
>  
> The same behavior would happen if generate_key_expansion fails a few
> lines below, so my assumption was it was safe to do so.  However, that's
> just an assumption and not even an educated guess.

This change correctly turns "kill the server" into "reject the
connection", which I agree is a good thing.

Acked-by: Steffan Karger <steffan.karger@foxcrypto.com>

-Steffan
Jeremy Evans May 27, 2020, 5:11 a.m. UTC | #4
On 05/27 03:08, Steffan Karger wrote:
> Hi,
> 
> Thanks for the clear report, patch and follow-up.
> 
> On 20-05-2020 23:31, Jeremy Evans wrote:
> > On 05/20 09:33, Gert Doering wrote:
> >> On Wed, May 20, 2020 at 11:34:04AM -0700, Jeremy Evans wrote:
> >>> To give some background, we hit this assertion failure, with the
> >>> following log output:
> >>
> >> This should not happen, asserting out in "normal server use" is bad.  
> >>
> >> (Neither should it ever reach that point without ks->authenticated being 
> >> true)
> 
> Agreed. I think it makes sense to first fix the urgent part (ie, not
> kill the server), then figure out how de code ends up at this ASSERT.
> 
> Jeremy, can you determine from your logs whether this always
> happenedwhen --auth-user-pass-verify returns zero or non-zero? Ie,
> should the connections that trigger this succeed or fail?

--auth-user-pass-verify returned non-zero in each crash.  In that sense,
the assert was good, because the alternative would be that an
unauthenticated user may be able to get VPN access (assuming there
wasn't some later check that caught the issue).

Here are the relevant log sections from the 4 crashes:

Tue May 19 15:57:03 2020 username/73.135.141.11:1194 WARNING: Failed running command (--auth-user-pass-verify): external program exited with error status: 1
Tue May 19 15:57:03 2020 username/73.135.141.11:1194 TLS Auth Error: Auth Username/Password verification failed for peer
Tue May 19 15:57:03 2020 username/73.135.141.11:1194 Control Channel: TLSv1.2, cipher TLSv1/SSLv3 ECDHE-RSA-AES256-GCM-SHA384, 2048 bit RSA
Tue May 19 15:57:04 2020 username1/73.112.148.11:1194 TLS: new session incoming connection from [AF_INET]73.112.148.11:1194
Tue May 19 15:57:04 2020 username2/73.190.61.11:1194 TLS: new session incoming connection from [AF_INET]73.190.61.11:1194
Tue May 19 15:57:04 2020 username3/73.190.134.11:1194 TLS: new session incoming connection from [AF_INET]73.190.134.11:1194
Tue May 19 15:57:04 2020 username4/73.120.76.11:1194 TLS: new session incoming connection from [AF_INET]73.120.76.11:1194
Tue May 19 15:57:05 2020 username5/73.16.123.11:1194 TLS: new session incoming connection from [AF_INET]73.16.123.11:1194
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 PUSH: Received control message: 'PUSH_REQUEST'
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 SENT CONTROL [username]: 'PUSH_REPLY,redirect-gateway def1,comp-lzo,persist-key,persist-tun,route-gateway 10.28.47.1,topology subnet,ping 10,ping-restart 120,ifconfig 10.28.47.38 255.255.255.0,peer-id 89' (status=1)
Tue May 19 15:57:05 2020 username/73.135.141.11:1194 Assertion failed at /path/to/openvpn-2.4.7/src/openvpn/ssl.c:1944 (ks->authenticated)

Wed May 13 14:14:54 2020 username/24.106.43.11:1194 WARNING: Failed running command (--auth-user-pass-verify): external program exited with error status: 1
Wed May 13 14:14:54 2020 username/24.106.43.11:1194 TLS Auth Error: Auth Username/Password verification failed for peer
Wed May 13 14:14:54 2020 username/24.106.43.11:1194 TLS: move_session: dest=TM_ACTIVE src=TM_UNTRUSTED reinit_src=1
Wed May 13 14:14:54 2020 username/24.106.43.11:1194 TLS: tls_multi_process: untrusted session promoted to semi-trusted
Wed May 13 14:14:54 2020 username/24.106.43.11:1194 Control Channel: TLSv1.2, cipher TLSv1/SSLv3 ECDHE-RSA-AES256-GCM-SHA384, 2048 bit RSA
Wed May 13 14:14:57 2020 username/24.106.43.11:1194 PUSH: Received control message: 'PUSH_REQUEST'
Wed May 13 14:14:57 2020 username/24.106.43.11:1194 SENT CONTROL [username]: 'PUSH_REPLY,redirect-gateway def1,comp-lzo,persist-key,persist-tun,route-gateway 10.28.47.1,topology subnet,ping 10,ping-restart 120,ifconfig 10.28.47.38 255.255.255.0,peer-id 89' (status=1)
Wed May 13 14:14:57 2020 username/24.106.43.11:1194 Assertion failed at /path/to/openvpn-2.4.7/src/openvpn/ssl.c:1944 (ks->authenticated)

Wed Sep 25 12:14:56 2019 username/174.94.150.11:30407 WARNING: Failed running command (--auth-user-pass-verify): external program exited with error status: 1
Wed Sep 25 12:14:56 2019 username/174.94.150.11:30407 TLS Auth Error: Auth Username/Password verification failed for peer
Wed Sep 25 12:14:56 2019 username/174.94.150.11:30407 TLS: move_session: dest=TM_ACTIVE src=TM_UNTRUSTED reinit_src=1
Wed Sep 25 12:14:56 2019 username/174.94.150.11:30407 TLS: tls_multi_process: untrusted session promoted to semi-trusted
Wed Sep 25 12:14:56 2019 username/174.94.150.11:30407 Control Channel: TLSv1.2, cipher TLSv1/SSLv3 ECDHE-RSA-AES256-GCM-SHA384, 2048 bit RSA
Wed Sep 25 12:14:57 2019 username/174.94.150.11:30407 PUSH: Received control message: 'PUSH_REQUEST'
Wed Sep 25 12:14:57 2019 username/174.94.150.11:30407 SENT CONTROL [username]: 'PUSH_REPLY,redirect-gateway def1,comp-lzo,persist-key,persist-tun,route-gateway 10.28.47.1,topology subnet,ping 10,ping-restart 120,ifconfig 10.28.47.38 255.255.255.0,peer-id 89' (status=1)
Wed Sep 25 12:14:57 2019 username/174.94.150.11:30407 Assertion failed at /path/to/openvpn-2.4.7/src/openvpn/ssl.c:1944 (ks->authenticated)

Mon Apr 16 16:20:27 2018 username/73.112.174.11:2939 WARNING: Failed running command (--auth-user-pass-verify): external program exited with error status: 1
Mon Apr 16 16:20:27 2018 username/73.112.174.11:2939 TLS Auth Error: Auth Username/Password verification failed for peer
Mon Apr 16 16:20:27 2018 username/73.112.174.11:2939 Control Channel: TLSv1.2, cipher TLSv1/SSLv3 ECDHE-RSA-AES256-GCM-SHA384, 2048 bit RSA
Mon Apr 16 16:20:28 2018 username/73.112.174.11:2939 PUSH: Received control message: 'PUSH_REQUEST'
Mon Apr 16 16:20:28 2018 username/73.112.174.11:2939 SENT CONTROL [username]: 'PUSH_REPLY,redirect-gateway def1,comp-lzo,persist-key,persist-tun,route-gateway 10.28.47.1,topology subnet,ping 10,ping-restart 120,ifconfig 10.28.47.38 255.255.255.0,peer-id 89' (status=1)
Mon Apr 16 16:20:28 2018 username/73.112.174.11:2939 Assertion failed at /path/to/openvpn-2.4.4/src/openvpn/ssl.c:1929 (ks->authenticated)

Thanks,
Jeremy
Gert Doering May 30, 2020, 8:28 a.m. UTC | #5
Your patch has been applied to the master and release/2.4 branch (bugfix).

Thanks.  I do hope Steffan finds time to track down the root cause of 
this - from your log, it smells like a timing issue if multiple clients 
connect in quick succession and multiple instances of the plugin are
called.  Making "something" not finish initializing before the first 
client sends his PUSH_REQUEST.  Or so.

commit 984bd1e1601e4b9562dbc88b02a8db60b884286f (master)
commit 098edbb1f5a2e1360fd6a4ae0642b63bec12e992 (release/2.4)
Author: Jeremy Evans
Date:   Wed May 20 11:34:04 2020 -0700

     Switch assertion failure to returning false

     Signed-off-by: Jeremy Evans <code@jeremyevans.net>
     Acked-by: Steffan Karger <steffan.karger@foxcrypto.com>
     Message-Id: <20200520183404.54822-1-code@jeremyevans.net>
     URL: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg19914.html
     Signed-off-by: Gert Doering <gert@greenie.muc.de>


--
kind regards,

Gert Doering

Patch

diff --git a/src/openvpn/ssl.c b/src/openvpn/ssl.c
index c2e9a4f3..c8b3a959 100644
--- a/src/openvpn/ssl.c
+++ b/src/openvpn/ssl.c
@@ -1930,7 +1930,10 @@  tls_session_generate_data_channel_keys(struct tls_session *session)
     const struct session_id *server_sid = !session->opt->server ?
                                           &ks->session_id_remote : &session->session_id;
 
-    ASSERT(ks->authenticated);
+    if (!ks->authenticated) {
+        msg(D_TLS_ERRORS, "TLS Error: key_state not authenticated");
+        goto cleanup;
+    }
 
     ks->crypto_options.flags = session->opt->crypto_flags;
     if (!generate_key_expansion(&ks->crypto_options.key_ctx_bi,