Message ID | 20221227202614.2114971-1-gert@greenie.muc.de |
---|---|
State | Accepted |
Headers | show |
Series | [Openvpn-devel,v3] bandaid fix for TCP multipoint server crash with Linux-DCO | expand |
Am 27.12.22 um 21:26 schrieb Gert Doering: > TCP multipoint servers with Linux-DCO can crash under yet-unknown > circumstances where a TCP socket gets handed to the kernel (= userland > shall not acceess it again) but the socket still lands in the event > polling mechanism, and is passed to link_socket_read() with > sock->fd being "-1" (SOCKET_UNDEFINED). > > This is a bug, but it happens very unfrequently so not fixed yet. > > When this happens, the server gets stuck in an endless loop of > "trying recvfrom(-1, ..), getting an error, looging that error, > continue" until the server's disk is full. > > The situation is being made a bit more complex by the dco-win > approach of treating "all kernel sockets as UDP", so the Linux > implementation tries to access the -1 socket as UDP, confusing > the picture more. > > As a bandaid to avoid the crash, this patch changes > > - socket.h: only do the "if dco_installed, treat as UDP" for WIN32 > (link_socket_read()) > > - socket.c: add ASSERT(sock->fd >= 0); checks to all UDP socket paths > (we should never even hit those as this is a TCP specific problem, > but in the "sock->fd = -1" case, doing a clean server abort is > preferred to "the disk is full with non-helpful logfiles, and then > the server crashes anyway") > > - socket.c: in the TCP read function, link_socket_read_tcp(), > check for sock->fd < 0 and trigger "sock->stream_reset = true" > (+ write to the log what happened). > > This change will kill this particular TCP client instance (SIGTERM), > but leave the rest of the server running fine - and given that > in our tests this issue seems to be triggered by inbound TCP RST > in just the wrong moment, it seems to be "a properly-sized bandaid". > > v2: rebase on top of "move dco_installed back to link_socket" > v3: move sock->fd check inside !residual_fully_formed clause (so > we can still handle already-read packets) Acked-By: Arne Schwabe <arne@rfc2549.org> The move of going the tcp read path and allowing the residual_fully_formed path to work makes the problem non-reproducible for me so far
Patch has been applied to the master and release/2.6 branch. commit c7416160fb2e5a66d5801e4b789751a7480e6384 (master) commit b17cbbdab0f81dd7d54db8786adfffa70b87e1d6 (HEAD -> release/2.6) Author: Gert Doering Date: Tue Dec 27 21:26:14 2022 +0100 bandaid fix for TCP multipoint server crash with Linux-DCO Signed-off-by: Gert Doering <gert@greenie.muc.de> Acked-by: Arne Schwabe <arne@rfc2549.org> Message-Id: <20221227202614.2114971-1-gert@greenie.muc.de> URL: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg25844.html Signed-off-by: Gert Doering <gert@greenie.muc.de> -- kind regards, Gert Doering
diff --git a/src/openvpn/socket.c b/src/openvpn/socket.c index c7ec0e06..2230052f 100644 --- a/src/openvpn/socket.c +++ b/src/openvpn/socket.c @@ -3228,6 +3228,18 @@ link_socket_read_tcp(struct link_socket *sock, if (!sock->stream_buf.residual_fully_formed) { + /* with Linux-DCO, we sometimes try to access a socket that is + * already installed in the kernel and has no valid file descriptor + * anymore. This is a bug. + * Handle by resetting client instance instead of crashing. + */ + if (sock->sd == SOCKET_UNDEFINED) + { + msg(M_INFO, "BUG: link_socket_read_tcp(): sock->sd==-1, reset client instance" ); + sock->stream_reset = true; /* reset client instance */ + return buf->len = 0; /* nothing to read */ + } + #ifdef _WIN32 sockethandle_t sh = { .s = sock->sd }; len = sockethandle_finalize(sh, &sock->reads, buf, NULL); @@ -3285,6 +3297,8 @@ link_socket_read_udp_posix_recvmsg(struct link_socket *sock, struct msghdr mesg; socklen_t fromlen = sizeof(from->dest.addr); + ASSERT(sock->sd >= 0); /* can't happen */ + iov.iov_base = BPTR(buf); iov.iov_len = buf_forward_capacity_total(buf); mesg.msg_iov = &iov; @@ -3351,6 +3365,9 @@ link_socket_read_udp_posix(struct link_socket *sock, socklen_t fromlen = sizeof(from->dest.addr); socklen_t expectedlen = af_addr_size(sock->info.af); addr_zero_host(&from->dest); + + ASSERT(sock->sd >= 0); /* can't happen */ + #if ENABLE_IP_PKTINFO /* Both PROTO_UDPv4 and PROTO_UDPv6 */ if (sock->info.proto == PROTO_UDP && sock->sockflags & SF_USE_IP_PKTINFO) diff --git a/src/openvpn/socket.h b/src/openvpn/socket.h index 05c31b10..3b9c1ba3 100644 --- a/src/openvpn/socket.h +++ b/src/openvpn/socket.h @@ -1058,7 +1058,11 @@ link_socket_read(struct link_socket *sock, struct buffer *buf, struct link_socket_actual *from) { +#ifdef _WIN32 if (proto_is_udp(sock->info.proto) || sock->dco_installed) +#else + if (proto_is_udp(sock->info.proto)) +#endif /* unified UDPv4 and UDPv6, for DCO the kernel * will strip the length header */ {
TCP multipoint servers with Linux-DCO can crash under yet-unknown circumstances where a TCP socket gets handed to the kernel (= userland shall not acceess it again) but the socket still lands in the event polling mechanism, and is passed to link_socket_read() with sock->fd being "-1" (SOCKET_UNDEFINED). This is a bug, but it happens very unfrequently so not fixed yet. When this happens, the server gets stuck in an endless loop of "trying recvfrom(-1, ..), getting an error, looging that error, continue" until the server's disk is full. The situation is being made a bit more complex by the dco-win approach of treating "all kernel sockets as UDP", so the Linux implementation tries to access the -1 socket as UDP, confusing the picture more. As a bandaid to avoid the crash, this patch changes - socket.h: only do the "if dco_installed, treat as UDP" for WIN32 (link_socket_read()) - socket.c: add ASSERT(sock->fd >= 0); checks to all UDP socket paths (we should never even hit those as this is a TCP specific problem, but in the "sock->fd = -1" case, doing a clean server abort is preferred to "the disk is full with non-helpful logfiles, and then the server crashes anyway") - socket.c: in the TCP read function, link_socket_read_tcp(), check for sock->fd < 0 and trigger "sock->stream_reset = true" (+ write to the log what happened). This change will kill this particular TCP client instance (SIGTERM), but leave the rest of the server running fine - and given that in our tests this issue seems to be triggered by inbound TCP RST in just the wrong moment, it seems to be "a properly-sized bandaid". v2: rebase on top of "move dco_installed back to link_socket" v3: move sock->fd check inside !residual_fully_formed clause (so we can still handle already-read packets) Github: OpenVPN/openvpn#190 Signed-off-by: Gert Doering <gert@greenie.muc.de> --- src/openvpn/socket.c | 17 +++++++++++++++++ src/openvpn/socket.h | 4 ++++ 2 files changed, 21 insertions(+)