[Openvpn-devel,ovpn-net-next,v2] ovpn: ensure sk is still valid during cleanup

Message ID 20250505225549.19492-1-a@unstable.cc
State Accepted
Delegated to: Antonio Quartulli
Headers show
Series [Openvpn-devel,ovpn-net-next,v2] ovpn: ensure sk is still valid during cleanup | expand

Commit Message

Antonio Quartulli May 5, 2025, 10:55 p.m. UTC
From: Antonio Quartulli <antonio@openvpn.net>

In case of UDP peer timeout, an openvpn client (userspace)
performs the following actions:
1. receives the peer deletion notification (reason=timeout)
2. closes the socket

Upon 1. we have the following:
- ovpn_peer_keepalive_work()
 - ovpn_socket_release()
  - synchronize_rcu()
At this point, 2. gets a chance to complete and ovpn_sock->sock->sk
becomes NULL. ovpn_socket_release() will then attempt dereferencing it,
resulting in the following crash log:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf]
CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G           O        6.15.0-rc2-00635-g521139ac3840 #272 PREEMPT(full)
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014
Workqueue: events ovpn_peer_keepalive_work [ovpn]
RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn]
Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30
RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217
RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb
RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be
RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840
R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000
R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c
FS:  0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 unlock_ovpn+0x8b/0xe0 [ovpn]
 ovpn_peer_keepalive_work+0xe3/0x540 [ovpn]
 ? ovpn_peers_free+0x780/0x780 [ovpn]
 ? lock_acquire+0x56/0x70
 ? process_one_work+0x888/0x1740
 process_one_work+0x933/0x1740
 ? pwq_dec_nr_in_flight+0x10b0/0x10b0
 ? move_linked_works+0x12d/0x2c0
 ? assign_work+0x163/0x270
 worker_thread+0x4d6/0xd90
 ? preempt_count_sub+0x4c/0x70
 ? process_one_work+0x1740/0x1740
 kthread+0x36c/0x710
 ? trace_preempt_on+0x8c/0x1e0
 ? kthread_is_per_cpu+0xc0/0xc0
 ? preempt_count_sub+0x4c/0x70
 ? _raw_spin_unlock_irq+0x36/0x60
 ? calculate_sigpending+0x7b/0xa0
 ? kthread_is_per_cpu+0xc0/0xc0
 ret_from_fork+0x3a/0x80
 ? kthread_is_per_cpu+0xc0/0xc0
 ret_from_fork_asm+0x11/0x20
 </TASK>
Modules linked in: ovpn(O)

Reason for accessing sk is ithat we need to retrieve its
protocol and continue the cleanup routine accordingly.

Fix the crash by grabbing a reference to sk before proceeding
with the cleanup. If the refcounter has reached zero, we
know that the socket is being destroyed and thus we skip
the cleanup in ovpn_socket_release().

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
---
 drivers/net/ovpn/socket.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

Comments

Antonio Quartulli May 6, 2025, 8:50 a.m. UTC | #1
Hi,

On 06/05/2025 00:55, Antonio Quartulli wrote:
> From: Antonio Quartulli <antonio@openvpn.net>
> 

[...]

> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

This is v2 for "[PATCH ovpn-net-next 1/2] ovpn: don't access sk after 
release".

> ---
>   drivers/net/ovpn/socket.c | 21 ++++++++++++---------
>   1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ovpn/socket.c b/drivers/net/ovpn/socket.c
> index a83cbab72591..66a2ecbc483b 100644
> --- a/drivers/net/ovpn/socket.c
> +++ b/drivers/net/ovpn/socket.c
> @@ -66,6 +66,7 @@ static bool ovpn_socket_put(struct ovpn_peer *peer, struct ovpn_socket *sock)
>   void ovpn_socket_release(struct ovpn_peer *peer)
>   {
>   	struct ovpn_socket *sock;
> +	struct sock *sk;
>   	bool released;
>   
>   	might_sleep();
> @@ -75,13 +76,14 @@ void ovpn_socket_release(struct ovpn_peer *peer)
>   	if (!sock)
>   		return;
>   
> -	/* sanity check: we should not end up here if the socket
> -	 * was already closed
> +	/* sock->sk may be released concurrently, therefore we
> +	 * first attempt grabbing a reference.
> +	 * if sock->sk is NULL it means it is already being
> +	 * destroyed and we don't need any further cleanup
>   	 */
> -	if (!sock->sock->sk) {
> -		DEBUG_NET_WARN_ON_ONCE(1);
> +	sk = sock->sock->sk;
> +	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
>   		return;
> -	}

Rather than accessing sk without any guarantee of not having been 
nullified in the meantime, I think it is safer to check and possibly 
grab a reference.

At this point we know for sure that sk must be valid and wasn't released 
yet.

I hope it makes sense.


Cheers,
Gert Doering May 7, 2025, 11:30 a.m. UTC | #2
Hi,

On Tue, May 06, 2025 at 12:55:49AM +0200, Antonio Quartulli wrote:
> From: Antonio Quartulli <antonio@openvpn.net>
> 
> In case of UDP peer timeout, an openvpn client (userspace)
> performs the following actions:
> 1. receives the peer deletion notification (reason=timeout)
> 2. closes the socket
[..]
> 
> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
> ---
>  drivers/net/ovpn/socket.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ovpn/socket.c b/drivers/net/ovpn/socket.c
> index a83cbab72591..66a2ecbc483b 100644

On my ubuntu 20.04 ("backport of ovpn") running a few OpenVPN instances
for test purposes, opening and closing DCO interface frequently, I could
fairly reliably crash the system in ~1 hour.

With this patch, the machine survived the full test parcours.

I can not comment on the technical merits on the patch, just state "it
fixes my problem", so

Tested-By: Gert Doering <gert@greenie.muc.de>

gert
Antonio Quartulli May 8, 2025, 11:28 a.m. UTC | #3
Hi Sabrina,

On 07/05/2025 13:30, Gert Doering wrote:
> Tested-By: Gert Doering <gert@greenie.muc.de>
> 

Gert has stress tested this change quite a bit and he couldn't reproduce 
any side issue.

To me it looks like this is a reasonable solution to avoid clashes due 
to racing cleanups (ovpn_socket_release vs destroy/close).

Please let me know if you have any comment, otherwise I'd like to send 
this to netdev for inclusion, since it fixes a frequent kernel panic.

Regards,
Antonio Quartulli May 9, 2025, 2:15 p.m. UTC | #4
On 06/05/2025 00:55, Antonio Quartulli wrote:

> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

Merged to main, commit id 476148af4e28

So far we didn't spot any regression, so we'll move on with this fix for 
the time being as we know it is preventing the crash for good.

We can improve it later if we feel this is not the best we can do.

Cheers!

Patch

diff --git a/drivers/net/ovpn/socket.c b/drivers/net/ovpn/socket.c
index a83cbab72591..66a2ecbc483b 100644
--- a/drivers/net/ovpn/socket.c
+++ b/drivers/net/ovpn/socket.c
@@ -66,6 +66,7 @@  static bool ovpn_socket_put(struct ovpn_peer *peer, struct ovpn_socket *sock)
 void ovpn_socket_release(struct ovpn_peer *peer)
 {
 	struct ovpn_socket *sock;
+	struct sock *sk;
 	bool released;
 
 	might_sleep();
@@ -75,13 +76,14 @@  void ovpn_socket_release(struct ovpn_peer *peer)
 	if (!sock)
 		return;
 
-	/* sanity check: we should not end up here if the socket
-	 * was already closed
+	/* sock->sk may be released concurrently, therefore we
+	 * first attempt grabbing a reference.
+	 * if sock->sk is NULL it means it is already being
+	 * destroyed and we don't need any further cleanup
 	 */
-	if (!sock->sock->sk) {
-		DEBUG_NET_WARN_ON_ONCE(1);
+	sk = sock->sock->sk;
+	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
 		return;
-	}
 
 	/* Drop the reference while holding the sock lock to avoid
 	 * concurrent ovpn_socket_new call to mess up with a partially
@@ -90,18 +92,18 @@  void ovpn_socket_release(struct ovpn_peer *peer)
 	 * Holding the lock ensures that a socket with refcnt 0 is fully
 	 * detached before it can be picked by a concurrent reader.
 	 */
-	lock_sock(sock->sock->sk);
+	lock_sock(sk);
 	released = ovpn_socket_put(peer, sock);
-	release_sock(sock->sock->sk);
+	release_sock(sk);
 
 	/* align all readers with sk_user_data being NULL */
 	synchronize_rcu();
 
 	/* following cleanup should happen with lock released */
 	if (released) {
-		if (sock->sock->sk->sk_protocol == IPPROTO_UDP) {
+		if (sk->sk_protocol == IPPROTO_UDP) {
 			netdev_put(sock->ovpn->dev, &sock->dev_tracker);
-		} else if (sock->sock->sk->sk_protocol == IPPROTO_TCP) {
+		} else if (sk->sk_protocol == IPPROTO_TCP) {
 			/* wait for TCP jobs to terminate */
 			ovpn_tcp_socket_wait_finish(sock);
 			ovpn_peer_put(sock->peer);
@@ -111,6 +113,7 @@  void ovpn_socket_release(struct ovpn_peer *peer)
 		 */
 		kfree(sock);
 	}
+	sock_put(sk);
 }
 
 static bool ovpn_socket_hold(struct ovpn_socket *sock)