[Openvpn-devel] multi.c: Allow floating to a new IP right after connection setup

Message ID 20250507074438.1326755-1-walter.openvpn@wjd.nu
State New
Headers show
Series [Openvpn-devel] multi.c: Allow floating to a new IP right after connection setup | expand

Commit Message

Walter Doekes May 7, 2025, 7:44 a.m. UTC
From: Walter Doekes <walter+github@wjd.nu>

When you're connected to a VPN which is used as the default gateway, a
connection to a second VPN will cause a tunnel-in-tunnel traffic. If the
administrator of the second VPN wants to avoid that, by pushing its IP
as net_gateway, this means that the client's source IP switches right
after connect:

  the client source IP switches, from
  - the first-VPN-exit-IP, to the
  - regular-ISP-exit-IP

In openvpn 2.5 and below, this worked fine. Since openvpn 2.6,
specifially b364711486, this triggers the "Disallow float to an address
taken by another client" code. Since that change, the traffic from the
second source IP creates a second connection, which now needs special
handling in the check-floating-IP code.

This change allows one to switch to the new IP, if it is still in an
unconnected state. That makes the use-case mentioned above work again.

Github: closes OpenVPN/openvpn#704
Signed-off-by: Walter Doekes <walter+github@wjd.nu>
---
 src/openvpn/multi.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Walter Doekes May 22, 2025, 7:09 a.m. UTC | #1
Dear mailing list and Arne,

I saw in the IRC meeting notes that 2.7 is about to be branched.

I don't mean to push, but I personally think this bug is important enough
to get looked at and fixed, preferably before 2.7.0 is made.

Is this new patch good enough? Should I put it in Gerrit? Are there any
questions about the functionality?

I realise that I did not explain _why_ b364711486 breaks things. I'm also
curious, but I think the author of that patch, or someone else more
intimate with the codebase, can more easily explain. Regardless there is a
bug in 2.6.x right now, and I'd would like it if a fix is upstreamed; be
it my fix, or someone elses.

Thanks!
Walter


> From: Walter Doekes <walter+github@wjd.nu>
>
> When you're connected to a VPN which is used as the default gateway, a
> connection to a second VPN will cause a tunnel-in-tunnel traffic. If the
> administrator of the second VPN wants to avoid that, by pushing its IP
> as net_gateway, this means that the client's source IP switches right
> after connect:
>
>   the client source IP switches, from
>   - the first-VPN-exit-IP, to the
>   - regular-ISP-exit-IP
>
> In openvpn 2.5 and below, this worked fine. Since openvpn 2.6,
> specifially b364711486, this triggers the "Disallow float to an address
> taken by another client" code. Since that change, the traffic from the
> second source IP creates a second connection, which now needs special
> handling in the check-floating-IP code.
>
> This change allows one to switch to the new IP, if it is still in an
> unconnected state. That makes the use-case mentioned above work again.
>
> Github: closes OpenVPN/openvpn#704
> Signed-off-by: Walter Doekes <walter+github@wjd.nu>
> ---
>  src/openvpn/multi.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/src/openvpn/multi.c b/src/openvpn/multi.c
> index a2d3fd10..51a00b71 100644
> --- a/src/openvpn/multi.c
> +++ b/src/openvpn/multi.c
> @@ -3236,8 +3236,21 @@ multi_process_float(struct multi_context *m, struct
> multi_instance *mi,
>          struct tls_multi *m1 = mi->context.c2.tls_multi;
>          struct tls_multi *m2 = ex_mi->context.c2.tls_multi;
>
> +        /* if the new connection is fresh and the old one is already
> connected, this
> +         * might be a legitimate move to a new IP by the original client;
> +         * for example when the server IP is pushed as net_gateway to
> escape from
> +         * a double VPN. */
> +        if (m1->multi_state == CAS_CONNECT_DONE
> +            && m2->multi_state == CAS_NOT_CONNECTED
> +            && m1->locked_cert_hash_set
> +            && !m2->locked_cert_hash_set
> +            && session_id_equal(&m1->session[TM_ACTIVE].session_id,
> +                          &m2->session[TM_ACTIVE].session_id))
> +        {
> +            /* allow this case */
> +        }
>          /* do not float if target address is taken by client with another
> cert */
> -        if (!cert_hash_compare(m1->locked_cert_hash_set,
> m2->locked_cert_hash_set))
> +        else if (!cert_hash_compare(m1->locked_cert_hash_set,
> m2->locked_cert_hash_set))
>          {
>              msg(D_MULTI_LOW, "Disallow float to an address taken by
> another client %s",
>                  multi_instance_string(ex_mi, false, &gc));
> --
> 2.34.1
>
>
Arne Schwabe May 22, 2025, 11:14 a.m. UTC | #2
Am 22.05.2025 um 09:09 schrieb Walter Doekes:
> Dear mailing list and Arne,
>
> I saw in the IRC meeting notes that 2.7 is about to be branched.
>
> I don't mean to push, but I personally think this bug is important enough
> to get looked at and fixed, preferably before 2.7.0 is made.
>
> Is this new patch good enough? Should I put it in Gerrit? Are there any
> questions about the functionality?
>
> I realise that I did not explain _why_ b364711486 breaks things. I'm also
> curious, but I think the author of that patch, or someone else more
> intimate with the codebase, can more easily explain. Regardless there is a
> bug in 2.6.x right now, and I'd would like it if a fix is upstreamed; be
> it my fix, or someone elses.

The thing is that I do not really understand your scenario and how it 
exactly breaks for you to the extend that I cannot reproduce the issue.

You are saying that the client switches the IP address after connect. But that is just a regular float from the perspective of the VPN server. I still do not understand where the other connection that is already on that IP/port is coming from. It is also not an older connection as it is not a fully established connection either.

In summary I am not able to either reproduce or understand what is happning in your scenario. And I do not want to apply a patch that I don't understand.

Arne
Walter Doekes May 22, 2025, 2:24 p.m. UTC | #3
> The thing is that I do not really understand your scenario and how it
> exactly breaks for you to the extend that I cannot reproduce the issue.

I thought I explained things sufficiently in:

https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31502.html

Apparently not. Please let me know what is unclear about that explanation.


> You are saying that the client switches the IP address after connect. But
> that is just a regular float from the perspective of the VPN server. I
> still do not understand where the other connection that is already on that
> IP/port is coming from. It is also not an older connection as it is not a
> fully established connection either.

Well, as far as I can tell, it _is_ just a regular float... that stopped
working after the mentioned commit.

It is indeed that recent connection. From what I gather from the earlier
workings, we should not end up in that piece of code (where I added the
fix), but for some reason we _do_ now.


> In summary I am not able to either reproduce or understand what is
> happning in your scenario. And I do not want to apply a patch that I don't
> understand.

Totally fair that you don't want to apply a patch that you don't
understand. I on the other hand do not see why you're unable to reproduce.

The scenario is not at all complicated:

- Two vpn servers;
- first vpn server pushes a default gateway;
- second vpn server pushes its external IP as net_gateway (*);
- second vpn server immediately sees the client float from one IP to another.

If you're unable to reproduce that, then:

- Either you're using a vastly different version and it has been fixed
since then (but not something that landed in debian/bookworm or
ubuntu/noble, and I _think_ I did try latest 2.6 as well);
- or you're using different settings (udp; auth/tls-auth; dev-tun;
subnet-topology);
- or there is some unknown factor involved that neither of us can think or
right now.

I will create a reproducer config so you can see the exact settings (apart
from the IP addresses).

In the mean time, can you confirm that you understand the scenario or ask
for additional clarification?

Thank you!

Walter


(*) Why? Because if it didn't, traffic from the client to VPN-two goes
through VPN-one as well. And that incurs overhead: additional latency and
cpu load, possible MTU issues.
Arne Schwabe May 23, 2025, 12:26 a.m. UTC | #4
> Totally fair that you don't want to apply a patch that you don't
> understand. I on the other hand do not see why you're unable to reproduce.
>
> The scenario is not at all complicated:
>
> - Two vpn servers;
> - first vpn server pushes a default gateway;
> - second vpn server pushes its external IP as net_gateway (*);
> - second vpn server immediately sees the client float from one IP to another.

What I understand so far:

- so you connect to vpn 1 first and that is a normal VPN with a default 
gateway and you get VPN1 IP

- Then via that VPN, you connect a 2nd VPN and you have as source the 
VPN IP, so the 2nd VPN server only see the VPN1 IP.

- after connection is established, you  do the host route directly to 
the server.

- 2nd VPN server sees a float from VPN1 IP to extern IP (EXTIP) of client

- Server refuses the float since there is already a not fully 
established connection on EXTIP

What I don't understand where the this not fully established connection 
should be coming from. That would mean that the server would have need 
to have received a valid connection attempt from EXTIP that was never 
established. And I do not understand from you explaination where that 
happens.

> If you're unable to reproduce that, then:
>
> - Either you're using a vastly different version and it has been fixed
> since then (but not something that landed in debian/bookworm or
> ubuntu/noble, and I _think_ I did try latest 2.6 as well);
> - or you're using different settings (udp; auth/tls-auth; dev-tun;
> subnet-topology);
> - or there is some unknown factor involved that neither of us can think or
> right now.
>
> I will create a reproducer config so you can see the exact settings (apart
> from the IP addresses).
>
> In the mean time, can you confirm that you understand the scenario or ask
> for additional clarification?

I wrote again down what you basically told me and there is still this 
mystery connection that blocks you. And there is no explaination why 
this connection exist in the first place. You are fixing the sympton of 
this ghost connection that blocks your float but from my perspective we 
have not really established why it exists in the first place.

Arne
Walter Doekes May 25, 2025, 8:27 p.m. UTC | #5
Good. Your understanding of the situation is the same.

I did not yet make a reproducer config -- mostly because I don't think 
we're doing anything non-standard. But I did double check that latest 
2.6 is affected, tested both client and server.

And, I added some tracing code, and I observed the following:

>     else if (verdict == VERDICT_VALID_CONTROL_V1 || verdict == VERDICT_VALID_ACK_V1)
>     {   
>         /* ACK_V1 contains the peer id (our id) while CONTROL_V1 can but does not
>          * need to contain the peer id */
>         struct gc_arena gc = gc_new();
> 
>         bool ret = check_session_id_hmac(state, from, hmac, handwindow);
> 
>         const char *peer = print_link_socket_actual(&m->top.c2.from, &gc);
>         if (!ret)
>         {   
>             msg(D_MULTI_MEDIUM, "Packet with invalid or missing SID from %s", peer);
>         }
>         else
>         {   
>             msg(D_MULTI_DEBUG, "Valid packet with HMAC challenge from peer (%s), "
>                 "accepting new connection.", peer);
>             //ret = false; // setting false here makes all setup fail
>         }
>         gc_free(&gc);
> 
>         msg(D_MULTI_MEDIUM, "do_pre_decrypt_check: verdict %d (%d)", verdict, ret);
>         return ret;
>     }

This is the code that has the changed behaviour.

In the before situation, we had this:

> --- a/src/openvpn/mudp.c
> +++ b/src/openvpn/mudp.c
> @@ -55,8 +55,10 @@ do_pre_decrypt_check(struct multi_context *m)
>  
>      if (verdict == VERDICT_INVALID || verdict == VERDICT_VALID_CONTROL_V1)
>      {
> +        msg(D_MULTI_MEDIUM, "do_pre_decrypt_check: verdict %d (false)", verdict);
>          return false;
>      }
> +    msg(D_MULTI_MEDIUM, "do_pre_decrypt_check: verdict %d (true)", verdict);
>      return true;
>  }
>  

<time> do_pre_decrypt_check: verdict 1 (false)
m->pending = (nil), float = 0 // in multi_process_incoming_link
<time> Float requested for peer 0 to EXTIP:PORT


In the after-situation, we have this:

<time> do_pre_decrypt_check: verdict 2 (1)
<time> MULTI: multi_create_instance called
...
m->pending = 0x5a36ce132f10, float = 0 // in multi_process_incoming_link
<time> Float requested for peer 0 to EXTIP:PORT


verdict is VERDICT_VALID_CONTROL_V1 in both cases (which changed from 1 
to 2 after b364711486dc6371ad2659a5aa190941136f4f04), and the 
do_pre_decrypt_check now returns TRUE.

That TRUE-result results in calls to multi_create_instance, and now we 
have that "mystery connection".


When I find more time, I'll debug some more. But this might shed some 
light on things for you.

As far as I understand, the packet with VERDICT_VALID_CONTROL_V1 verdict 
is now detected as a new valid connection setup. That creates a 
connection on the new IP. And then the float gets rejected.

Walter



On 23-05-2025 02:26, Arne Schwabe wrote:
>> Totally fair that you don't want to apply a patch that you don't
>> understand. I on the other hand do not see why you're unable to reproduce.
>>
>> The scenario is not at all complicated:
>>
>> - Two vpn servers;
>> - first vpn server pushes a default gateway;
>> - second vpn server pushes its external IP as net_gateway (*);
>> - second vpn server immediately sees the client float from one IP to another.
> 
> What I understand so far:
> 
> - so you connect to vpn 1 first and that is a normal VPN with a default 
> gateway and you get VPN1 IP
> 
> - Then via that VPN, you connect a 2nd VPN and you have as source the 
> VPN IP, so the 2nd VPN server only see the VPN1 IP.
> 
> - after connection is established, you  do the host route directly to 
> the server.
> 
> - 2nd VPN server sees a float from VPN1 IP to extern IP (EXTIP) of client
> 
> - Server refuses the float since there is already a not fully 
> established connection on EXTIP
> 
> What I don't understand where the this not fully established connection 
> should be coming from. That would mean that the server would have need 
> to have received a valid connection attempt from EXTIP that was never 
> established. And I do not understand from you explaination where that 
> happens.
> 
>> If you're unable to reproduce that, then:
>>
>> - Either you're using a vastly different version and it has been fixed
>> since then (but not something that landed in debian/bookworm or
>> ubuntu/noble, and I _think_ I did try latest 2.6 as well);
>> - or you're using different settings (udp; auth/tls-auth; dev-tun;
>> subnet-topology);
>> - or there is some unknown factor involved that neither of us can think or
>> right now.
>>
>> I will create a reproducer config so you can see the exact settings (apart
>> from the IP addresses).
>>
>> In the mean time, can you confirm that you understand the scenario or ask
>> for additional clarification?
> 
> I wrote again down what you basically told me and there is still this 
> mystery connection that blocks you. And there is no explaination why 
> this connection exist in the first place. You are fixing the sympton of 
> this ghost connection that blocks your float but from my perspective we 
> have not really established why it exists in the first place.
> 
> Arne
>
Arne Schwabe May 26, 2025, 7:31 a.m. UTC | #6
Am 25.05.25 um 22:27 schrieb Walter Doekes:
> Good. Your understanding of the situation is the same.
> 
> I did not yet make a reproducer config -- mostly because I don't think 
> we're doing anything non-standard. But I did double check that latest 
> 2.6 is affected, tested both client and server.

It is non-standard enough that you are the only ones affected and that 
we have trouble reproducing it.

It would be good to get a pcap of the servers view of the client 
connection. You can send that in private if you want. I might have some 
idea what is happening if I can have a look at that. But currently it is 
just too much guesswork without knowing anything concrete that I can go on.

Arne
Arne Schwabe June 25, 2025, 12:40 p.m. UTC | #7
Am 25.05.25 um 22:27 schrieb Walter Doekes:
> Good. Your understanding of the situation is the same.
> 
> I did not yet make a reproducer config -- mostly because I don't think 
> we're doing anything non-standard. But I did double check that latest 
> 2.6 is affected, tested both client and server.

With pcap dumps I think I got an understanding what is happening. Can 
you see if this patch on the server fixes the problem for you? It is 
marked as WIP since I want to have more unit tests but the code 
shouldn't change.

https://gerrit.openvpn.net/c/openvpn/+/1067

Arne
Walter Doekes June 25, 2025, 1:30 p.m. UTC | #8
Good. I backported the patch so it ran against the culprit version
(b3647114).

I got these mesages:


  SENT CONTROL [mycommonname]: 'PUSH_REPLY,route ... 255.255.255.255
net_gateway,route-gateway 10.x.x.1,topology subnet,ping 15,ping-restart
55,route 10.x.x.0 255.255.0.0 vpn_gateway,ifconfig 10.x.x.3
255.255.255.0,peer-id 4,cipher AES-256-GCM' (status=1)

  Packet with invalid or missing SID from [AF_INET]HOME_IP:33567

  Float requested for peer 4 to HOME_IP:33567

  peer 4 (mycommonname) floated from VPN_IP:33567 to [AF_INET]HOME_IP:33567



The "Packet with invalid or missing SID" is new to me. But other than
that, it works.

I also tried it against 2.6-latest (0169b4ad). Also works. There the
message is:

  Packet (P_ACK_V1) with invalid or missing SID from [AF_INET]HOME_IP:46088

I can't tell if this new message is problematic or not. It doesn't
negatively impact my connection setup. And I (now) know when to expect it.



As for your patch: there's a minor typo in your patch at ssl_pkt.h in the
signature:

"bool check_session_id_hmac" should be "bool pkt_is_ack"


Further, I would prefer if the commit message itself mentioned something
about "floating IPs and 60 second timeout after connect" instead of "rare
circumstances" which are not rare in 100% of my use cases. That might be
beneficial to the next person who runs into this.


Thanks for the fixes!

Walter


> Am 25.05.25 um 22:27 schrieb Walter Doekes:
>> Good. Your understanding of the situation is the same.
>>
>> I did not yet make a reproducer config -- mostly because I don't think
>> we're doing anything non-standard. But I did double check that latest
>> 2.6 is affected, tested both client and server.
>
> With pcap dumps I think I got an understanding what is happening. Can
> you see if this patch on the server fixes the problem for you? It is
> marked as WIP since I want to have more unit tests but the code
> shouldn't change.
>
> https://gerrit.openvpn.net/c/openvpn/+/1067
>
> Arne
>
Arne Schwabe June 25, 2025, 10:43 p.m. UTC | #9
Am 25.06.25 um 15:30 schrieb Walter Doekes:
> Good. I backported the patch so it ran against the culprit version
> (b3647114).
> 
> I got these mesages:
> 
> 
>    SENT CONTROL [mycommonname]: 'PUSH_REPLY,route ... 255.255.255.255
> net_gateway,route-gateway 10.x.x.1,topology subnet,ping 15,ping-restart
> 55,route 10.x.x.0 255.255.0.0 vpn_gateway,ifconfig 10.x.x.3
> 255.255.255.0,peer-id 4,cipher AES-256-GCM' (status=1)
> 
>    Packet with invalid or missing SID from [AF_INET]HOME_IP:33567
> 
>    Float requested for peer 4 to HOME_IP:33567
> 
>    peer 4 (mycommonname) floated from VPN_IP:33567 to [AF_INET]HOME_IP:33567
> 
> 
> 
> The "Packet with invalid or missing SID" is new to me. But other than
> that, it works.

Thanks for testing and confirming that it works.

> I also tried it against 2.6-latest (0169b4ad). Also works. There the
> message is:
> 
>    Packet (P_ACK_V1) with invalid or missing SID from [AF_INET]HOME_IP:46088
> 
> I can't tell if this new message is problematic or not. It doesn't
> negatively impact my connection setup. And I (now) know when to expect it.
> 
> 
> 
> As for your patch: there's a minor typo in your patch at ssl_pkt.h in the
> signature:
> 
> "bool check_session_id_hmac" should be "bool pkt_is_ack"

Thanks will fix in the  next revision.

> Further, I would prefer if the commit message itself mentioned something
> about "floating IPs and 60 second timeout after connect" instead of "rare
> circumstances" which are not rare in 100% of my use cases. That might be
> beneficial to the next person who runs into this.


You need a connection that

- starts on IP A
- successfully floats to IP B by data packet
- then has a control packet from IP A before any data packet can trigger 
the float back to IP A

In this scenario we would trigger a new connection coming before while 
know we detect that this should not trigger a new connection as it is 
not a new connection attempt. So instead of creating a new connection, 
you see (at verb 4 or higher) the message

    Packet (P_ACK_V1) with invalid or missing SID

instead with the patch.

I will update the commit message when the patch is no longer in draft 
mode. I wanted to confirm that this is actually the problem/scenario we 
are fixing before finishing the patch.

Arne

Patch

diff --git a/src/openvpn/multi.c b/src/openvpn/multi.c
index a2d3fd10..51a00b71 100644
--- a/src/openvpn/multi.c
+++ b/src/openvpn/multi.c
@@ -3236,8 +3236,21 @@  multi_process_float(struct multi_context *m, struct multi_instance *mi,
         struct tls_multi *m1 = mi->context.c2.tls_multi;
         struct tls_multi *m2 = ex_mi->context.c2.tls_multi;
 
+        /* if the new connection is fresh and the old one is already connected, this
+         * might be a legitimate move to a new IP by the original client;
+         * for example when the server IP is pushed as net_gateway to escape from
+         * a double VPN. */
+        if (m1->multi_state == CAS_CONNECT_DONE
+            && m2->multi_state == CAS_NOT_CONNECTED
+            && m1->locked_cert_hash_set
+            && !m2->locked_cert_hash_set
+            && session_id_equal(&m1->session[TM_ACTIVE].session_id,
+                          &m2->session[TM_ACTIVE].session_id))
+        {
+            /* allow this case */
+        }
         /* do not float if target address is taken by client with another cert */
-        if (!cert_hash_compare(m1->locked_cert_hash_set, m2->locked_cert_hash_set))
+        else if (!cert_hash_compare(m1->locked_cert_hash_set, m2->locked_cert_hash_set))
         {
             msg(D_MULTI_LOW, "Disallow float to an address taken by another client %s",
                 multi_instance_string(ex_mi, false, &gc));