[Openvpn-devel,v3] bandaid fix for TCP multipoint server crash with Linux-DCO

Message ID 20221227202614.2114971-1-gert@greenie.muc.de
State Accepted
Headers show
Series [Openvpn-devel,v3] bandaid fix for TCP multipoint server crash with Linux-DCO | expand

Commit Message

Gert Doering Dec. 27, 2022, 8:26 p.m. UTC
TCP multipoint servers with Linux-DCO can crash under yet-unknown
circumstances where a TCP socket gets handed to the kernel (= userland
shall not acceess it again) but the socket still lands in the event
polling mechanism, and is passed to link_socket_read() with
sock->fd being "-1" (SOCKET_UNDEFINED).

This is a bug, but it happens very unfrequently so not fixed yet.

When this happens, the server gets stuck in an endless loop of
"trying recvfrom(-1, ..), getting an error, looging that error,
continue" until the server's disk is full.

The situation is being made a bit more complex by the dco-win
approach of treating "all kernel sockets as UDP", so the Linux
implementation tries to access the -1 socket as UDP, confusing
the picture more.

As a bandaid to avoid the crash, this patch changes

 - socket.h: only do the "if dco_installed, treat as UDP" for WIN32
   (link_socket_read())

 - socket.c: add ASSERT(sock->fd >= 0); checks to all UDP socket paths
   (we should never even hit those as this is a TCP specific problem,
   but in the "sock->fd = -1" case, doing a clean server abort is
   preferred to "the disk is full with non-helpful logfiles, and then
   the server crashes anyway")

 - socket.c: in the TCP read function, link_socket_read_tcp(),
   check for sock->fd < 0 and trigger "sock->stream_reset = true"
   (+ write to the log what happened).

This change will kill this particular TCP client instance (SIGTERM),
but leave the rest of the server running fine - and given that
in our tests this issue seems to be triggered by inbound TCP RST
in just the wrong moment, it seems to be "a properly-sized bandaid".

v2: rebase on top of "move dco_installed back to link_socket"
v3: move sock->fd check inside !residual_fully_formed clause (so
    we can still handle already-read packets)

Github: OpenVPN/openvpn#190

Signed-off-by: Gert Doering <gert@greenie.muc.de>
---
 src/openvpn/socket.c | 17 +++++++++++++++++
 src/openvpn/socket.h |  4 ++++
 2 files changed, 21 insertions(+)

Comments

Arne Schwabe Dec. 27, 2022, 8:39 p.m. UTC | #1
Am 27.12.22 um 21:26 schrieb Gert Doering:
> TCP multipoint servers with Linux-DCO can crash under yet-unknown
> circumstances where a TCP socket gets handed to the kernel (= userland
> shall not acceess it again) but the socket still lands in the event
> polling mechanism, and is passed to link_socket_read() with
> sock->fd being "-1" (SOCKET_UNDEFINED).
> 
> This is a bug, but it happens very unfrequently so not fixed yet.
> 
> When this happens, the server gets stuck in an endless loop of
> "trying recvfrom(-1, ..), getting an error, looging that error,
> continue" until the server's disk is full.
> 
> The situation is being made a bit more complex by the dco-win
> approach of treating "all kernel sockets as UDP", so the Linux
> implementation tries to access the -1 socket as UDP, confusing
> the picture more.
> 
> As a bandaid to avoid the crash, this patch changes
> 
>   - socket.h: only do the "if dco_installed, treat as UDP" for WIN32
>     (link_socket_read())
> 
>   - socket.c: add ASSERT(sock->fd >= 0); checks to all UDP socket paths
>     (we should never even hit those as this is a TCP specific problem,
>     but in the "sock->fd = -1" case, doing a clean server abort is
>     preferred to "the disk is full with non-helpful logfiles, and then
>     the server crashes anyway")
> 
>   - socket.c: in the TCP read function, link_socket_read_tcp(),
>     check for sock->fd < 0 and trigger "sock->stream_reset = true"
>     (+ write to the log what happened).
> 
> This change will kill this particular TCP client instance (SIGTERM),
> but leave the rest of the server running fine - and given that
> in our tests this issue seems to be triggered by inbound TCP RST
> in just the wrong moment, it seems to be "a properly-sized bandaid".
> 
> v2: rebase on top of "move dco_installed back to link_socket"
> v3: move sock->fd check inside !residual_fully_formed clause (so
>      we can still handle already-read packets)

Acked-By: Arne Schwabe <arne@rfc2549.org>

The move of going the tcp read path and allowing the 
residual_fully_formed path to work makes the problem non-reproducible 
for me so far
Gert Doering Dec. 27, 2022, 10:07 p.m. UTC | #2
Patch has been applied to the master and release/2.6 branch.

commit c7416160fb2e5a66d5801e4b789751a7480e6384 (master)
commit b17cbbdab0f81dd7d54db8786adfffa70b87e1d6 (HEAD -> release/2.6)
Author: Gert Doering
Date:   Tue Dec 27 21:26:14 2022 +0100

     bandaid fix for TCP multipoint server crash with Linux-DCO

     Signed-off-by: Gert Doering <gert@greenie.muc.de>
     Acked-by: Arne Schwabe <arne@rfc2549.org>
     Message-Id: <20221227202614.2114971-1-gert@greenie.muc.de>
     URL: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg25844.html
     Signed-off-by: Gert Doering <gert@greenie.muc.de>


--
kind regards,

Gert Doering

Patch

diff --git a/src/openvpn/socket.c b/src/openvpn/socket.c
index c7ec0e06..2230052f 100644
--- a/src/openvpn/socket.c
+++ b/src/openvpn/socket.c
@@ -3228,6 +3228,18 @@  link_socket_read_tcp(struct link_socket *sock,
 
     if (!sock->stream_buf.residual_fully_formed)
     {
+        /* with Linux-DCO, we sometimes try to access a socket that is
+         * already installed in the kernel and has no valid file descriptor
+         * anymore.  This is a bug.
+         * Handle by resetting client instance instead of crashing.
+         */
+        if (sock->sd == SOCKET_UNDEFINED)
+        {
+            msg(M_INFO, "BUG: link_socket_read_tcp(): sock->sd==-1, reset client instance" );
+            sock->stream_reset = true;              /* reset client instance */
+            return buf->len = 0;                    /* nothing to read */
+        }
+
 #ifdef _WIN32
         sockethandle_t sh = { .s = sock->sd };
         len = sockethandle_finalize(sh, &sock->reads, buf, NULL);
@@ -3285,6 +3297,8 @@  link_socket_read_udp_posix_recvmsg(struct link_socket *sock,
     struct msghdr mesg;
     socklen_t fromlen = sizeof(from->dest.addr);
 
+    ASSERT(sock->sd >= 0);                      /* can't happen */
+
     iov.iov_base = BPTR(buf);
     iov.iov_len = buf_forward_capacity_total(buf);
     mesg.msg_iov = &iov;
@@ -3351,6 +3365,9 @@  link_socket_read_udp_posix(struct link_socket *sock,
     socklen_t fromlen = sizeof(from->dest.addr);
     socklen_t expectedlen = af_addr_size(sock->info.af);
     addr_zero_host(&from->dest);
+
+    ASSERT(sock->sd >= 0);                      /* can't happen */
+
 #if ENABLE_IP_PKTINFO
     /* Both PROTO_UDPv4 and PROTO_UDPv6 */
     if (sock->info.proto == PROTO_UDP && sock->sockflags & SF_USE_IP_PKTINFO)
diff --git a/src/openvpn/socket.h b/src/openvpn/socket.h
index 05c31b10..3b9c1ba3 100644
--- a/src/openvpn/socket.h
+++ b/src/openvpn/socket.h
@@ -1058,7 +1058,11 @@  link_socket_read(struct link_socket *sock,
                  struct buffer *buf,
                  struct link_socket_actual *from)
 {
+#ifdef _WIN32
     if (proto_is_udp(sock->info.proto) || sock->dco_installed)
+#else
+    if (proto_is_udp(sock->info.proto))
+#endif
     /* unified UDPv4 and UDPv6, for DCO the kernel
      * will strip the length header */
     {