Message ID | 20250507074438.1326755-1-walter.openvpn@wjd.nu |
---|---|
State | New |
Headers | show |
Series | [Openvpn-devel] multi.c: Allow floating to a new IP right after connection setup | expand |
Dear mailing list and Arne, I saw in the IRC meeting notes that 2.7 is about to be branched. I don't mean to push, but I personally think this bug is important enough to get looked at and fixed, preferably before 2.7.0 is made. Is this new patch good enough? Should I put it in Gerrit? Are there any questions about the functionality? I realise that I did not explain _why_ b364711486 breaks things. I'm also curious, but I think the author of that patch, or someone else more intimate with the codebase, can more easily explain. Regardless there is a bug in 2.6.x right now, and I'd would like it if a fix is upstreamed; be it my fix, or someone elses. Thanks! Walter > From: Walter Doekes <walter+github@wjd.nu> > > When you're connected to a VPN which is used as the default gateway, a > connection to a second VPN will cause a tunnel-in-tunnel traffic. If the > administrator of the second VPN wants to avoid that, by pushing its IP > as net_gateway, this means that the client's source IP switches right > after connect: > > the client source IP switches, from > - the first-VPN-exit-IP, to the > - regular-ISP-exit-IP > > In openvpn 2.5 and below, this worked fine. Since openvpn 2.6, > specifially b364711486, this triggers the "Disallow float to an address > taken by another client" code. Since that change, the traffic from the > second source IP creates a second connection, which now needs special > handling in the check-floating-IP code. > > This change allows one to switch to the new IP, if it is still in an > unconnected state. That makes the use-case mentioned above work again. > > Github: closes OpenVPN/openvpn#704 > Signed-off-by: Walter Doekes <walter+github@wjd.nu> > --- > src/openvpn/multi.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/src/openvpn/multi.c b/src/openvpn/multi.c > index a2d3fd10..51a00b71 100644 > --- a/src/openvpn/multi.c > +++ b/src/openvpn/multi.c > @@ -3236,8 +3236,21 @@ multi_process_float(struct multi_context *m, struct > multi_instance *mi, > struct tls_multi *m1 = mi->context.c2.tls_multi; > struct tls_multi *m2 = ex_mi->context.c2.tls_multi; > > + /* if the new connection is fresh and the old one is already > connected, this > + * might be a legitimate move to a new IP by the original client; > + * for example when the server IP is pushed as net_gateway to > escape from > + * a double VPN. */ > + if (m1->multi_state == CAS_CONNECT_DONE > + && m2->multi_state == CAS_NOT_CONNECTED > + && m1->locked_cert_hash_set > + && !m2->locked_cert_hash_set > + && session_id_equal(&m1->session[TM_ACTIVE].session_id, > + &m2->session[TM_ACTIVE].session_id)) > + { > + /* allow this case */ > + } > /* do not float if target address is taken by client with another > cert */ > - if (!cert_hash_compare(m1->locked_cert_hash_set, > m2->locked_cert_hash_set)) > + else if (!cert_hash_compare(m1->locked_cert_hash_set, > m2->locked_cert_hash_set)) > { > msg(D_MULTI_LOW, "Disallow float to an address taken by > another client %s", > multi_instance_string(ex_mi, false, &gc)); > -- > 2.34.1 > >
Am 22.05.2025 um 09:09 schrieb Walter Doekes: > Dear mailing list and Arne, > > I saw in the IRC meeting notes that 2.7 is about to be branched. > > I don't mean to push, but I personally think this bug is important enough > to get looked at and fixed, preferably before 2.7.0 is made. > > Is this new patch good enough? Should I put it in Gerrit? Are there any > questions about the functionality? > > I realise that I did not explain _why_ b364711486 breaks things. I'm also > curious, but I think the author of that patch, or someone else more > intimate with the codebase, can more easily explain. Regardless there is a > bug in 2.6.x right now, and I'd would like it if a fix is upstreamed; be > it my fix, or someone elses. The thing is that I do not really understand your scenario and how it exactly breaks for you to the extend that I cannot reproduce the issue. You are saying that the client switches the IP address after connect. But that is just a regular float from the perspective of the VPN server. I still do not understand where the other connection that is already on that IP/port is coming from. It is also not an older connection as it is not a fully established connection either. In summary I am not able to either reproduce or understand what is happning in your scenario. And I do not want to apply a patch that I don't understand. Arne
> The thing is that I do not really understand your scenario and how it > exactly breaks for you to the extend that I cannot reproduce the issue. I thought I explained things sufficiently in: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31502.html Apparently not. Please let me know what is unclear about that explanation. > You are saying that the client switches the IP address after connect. But > that is just a regular float from the perspective of the VPN server. I > still do not understand where the other connection that is already on that > IP/port is coming from. It is also not an older connection as it is not a > fully established connection either. Well, as far as I can tell, it _is_ just a regular float... that stopped working after the mentioned commit. It is indeed that recent connection. From what I gather from the earlier workings, we should not end up in that piece of code (where I added the fix), but for some reason we _do_ now. > In summary I am not able to either reproduce or understand what is > happning in your scenario. And I do not want to apply a patch that I don't > understand. Totally fair that you don't want to apply a patch that you don't understand. I on the other hand do not see why you're unable to reproduce. The scenario is not at all complicated: - Two vpn servers; - first vpn server pushes a default gateway; - second vpn server pushes its external IP as net_gateway (*); - second vpn server immediately sees the client float from one IP to another. If you're unable to reproduce that, then: - Either you're using a vastly different version and it has been fixed since then (but not something that landed in debian/bookworm or ubuntu/noble, and I _think_ I did try latest 2.6 as well); - or you're using different settings (udp; auth/tls-auth; dev-tun; subnet-topology); - or there is some unknown factor involved that neither of us can think or right now. I will create a reproducer config so you can see the exact settings (apart from the IP addresses). In the mean time, can you confirm that you understand the scenario or ask for additional clarification? Thank you! Walter (*) Why? Because if it didn't, traffic from the client to VPN-two goes through VPN-one as well. And that incurs overhead: additional latency and cpu load, possible MTU issues.
> Totally fair that you don't want to apply a patch that you don't > understand. I on the other hand do not see why you're unable to reproduce. > > The scenario is not at all complicated: > > - Two vpn servers; > - first vpn server pushes a default gateway; > - second vpn server pushes its external IP as net_gateway (*); > - second vpn server immediately sees the client float from one IP to another. What I understand so far: - so you connect to vpn 1 first and that is a normal VPN with a default gateway and you get VPN1 IP - Then via that VPN, you connect a 2nd VPN and you have as source the VPN IP, so the 2nd VPN server only see the VPN1 IP. - after connection is established, you do the host route directly to the server. - 2nd VPN server sees a float from VPN1 IP to extern IP (EXTIP) of client - Server refuses the float since there is already a not fully established connection on EXTIP What I don't understand where the this not fully established connection should be coming from. That would mean that the server would have need to have received a valid connection attempt from EXTIP that was never established. And I do not understand from you explaination where that happens. > If you're unable to reproduce that, then: > > - Either you're using a vastly different version and it has been fixed > since then (but not something that landed in debian/bookworm or > ubuntu/noble, and I _think_ I did try latest 2.6 as well); > - or you're using different settings (udp; auth/tls-auth; dev-tun; > subnet-topology); > - or there is some unknown factor involved that neither of us can think or > right now. > > I will create a reproducer config so you can see the exact settings (apart > from the IP addresses). > > In the mean time, can you confirm that you understand the scenario or ask > for additional clarification? I wrote again down what you basically told me and there is still this mystery connection that blocks you. And there is no explaination why this connection exist in the first place. You are fixing the sympton of this ghost connection that blocks your float but from my perspective we have not really established why it exists in the first place. Arne
Good. Your understanding of the situation is the same. I did not yet make a reproducer config -- mostly because I don't think we're doing anything non-standard. But I did double check that latest 2.6 is affected, tested both client and server. And, I added some tracing code, and I observed the following: > else if (verdict == VERDICT_VALID_CONTROL_V1 || verdict == VERDICT_VALID_ACK_V1) > { > /* ACK_V1 contains the peer id (our id) while CONTROL_V1 can but does not > * need to contain the peer id */ > struct gc_arena gc = gc_new(); > > bool ret = check_session_id_hmac(state, from, hmac, handwindow); > > const char *peer = print_link_socket_actual(&m->top.c2.from, &gc); > if (!ret) > { > msg(D_MULTI_MEDIUM, "Packet with invalid or missing SID from %s", peer); > } > else > { > msg(D_MULTI_DEBUG, "Valid packet with HMAC challenge from peer (%s), " > "accepting new connection.", peer); > //ret = false; // setting false here makes all setup fail > } > gc_free(&gc); > > msg(D_MULTI_MEDIUM, "do_pre_decrypt_check: verdict %d (%d)", verdict, ret); > return ret; > } This is the code that has the changed behaviour. In the before situation, we had this: > --- a/src/openvpn/mudp.c > +++ b/src/openvpn/mudp.c > @@ -55,8 +55,10 @@ do_pre_decrypt_check(struct multi_context *m) > > if (verdict == VERDICT_INVALID || verdict == VERDICT_VALID_CONTROL_V1) > { > + msg(D_MULTI_MEDIUM, "do_pre_decrypt_check: verdict %d (false)", verdict); > return false; > } > + msg(D_MULTI_MEDIUM, "do_pre_decrypt_check: verdict %d (true)", verdict); > return true; > } > <time> do_pre_decrypt_check: verdict 1 (false) m->pending = (nil), float = 0 // in multi_process_incoming_link <time> Float requested for peer 0 to EXTIP:PORT In the after-situation, we have this: <time> do_pre_decrypt_check: verdict 2 (1) <time> MULTI: multi_create_instance called ... m->pending = 0x5a36ce132f10, float = 0 // in multi_process_incoming_link <time> Float requested for peer 0 to EXTIP:PORT verdict is VERDICT_VALID_CONTROL_V1 in both cases (which changed from 1 to 2 after b364711486dc6371ad2659a5aa190941136f4f04), and the do_pre_decrypt_check now returns TRUE. That TRUE-result results in calls to multi_create_instance, and now we have that "mystery connection". When I find more time, I'll debug some more. But this might shed some light on things for you. As far as I understand, the packet with VERDICT_VALID_CONTROL_V1 verdict is now detected as a new valid connection setup. That creates a connection on the new IP. And then the float gets rejected. Walter On 23-05-2025 02:26, Arne Schwabe wrote: >> Totally fair that you don't want to apply a patch that you don't >> understand. I on the other hand do not see why you're unable to reproduce. >> >> The scenario is not at all complicated: >> >> - Two vpn servers; >> - first vpn server pushes a default gateway; >> - second vpn server pushes its external IP as net_gateway (*); >> - second vpn server immediately sees the client float from one IP to another. > > What I understand so far: > > - so you connect to vpn 1 first and that is a normal VPN with a default > gateway and you get VPN1 IP > > - Then via that VPN, you connect a 2nd VPN and you have as source the > VPN IP, so the 2nd VPN server only see the VPN1 IP. > > - after connection is established, you do the host route directly to > the server. > > - 2nd VPN server sees a float from VPN1 IP to extern IP (EXTIP) of client > > - Server refuses the float since there is already a not fully > established connection on EXTIP > > What I don't understand where the this not fully established connection > should be coming from. That would mean that the server would have need > to have received a valid connection attempt from EXTIP that was never > established. And I do not understand from you explaination where that > happens. > >> If you're unable to reproduce that, then: >> >> - Either you're using a vastly different version and it has been fixed >> since then (but not something that landed in debian/bookworm or >> ubuntu/noble, and I _think_ I did try latest 2.6 as well); >> - or you're using different settings (udp; auth/tls-auth; dev-tun; >> subnet-topology); >> - or there is some unknown factor involved that neither of us can think or >> right now. >> >> I will create a reproducer config so you can see the exact settings (apart >> from the IP addresses). >> >> In the mean time, can you confirm that you understand the scenario or ask >> for additional clarification? > > I wrote again down what you basically told me and there is still this > mystery connection that blocks you. And there is no explaination why > this connection exist in the first place. You are fixing the sympton of > this ghost connection that blocks your float but from my perspective we > have not really established why it exists in the first place. > > Arne >
Am 25.05.25 um 22:27 schrieb Walter Doekes: > Good. Your understanding of the situation is the same. > > I did not yet make a reproducer config -- mostly because I don't think > we're doing anything non-standard. But I did double check that latest > 2.6 is affected, tested both client and server. It is non-standard enough that you are the only ones affected and that we have trouble reproducing it. It would be good to get a pcap of the servers view of the client connection. You can send that in private if you want. I might have some idea what is happening if I can have a look at that. But currently it is just too much guesswork without knowing anything concrete that I can go on. Arne
Am 25.05.25 um 22:27 schrieb Walter Doekes: > Good. Your understanding of the situation is the same. > > I did not yet make a reproducer config -- mostly because I don't think > we're doing anything non-standard. But I did double check that latest > 2.6 is affected, tested both client and server. With pcap dumps I think I got an understanding what is happening. Can you see if this patch on the server fixes the problem for you? It is marked as WIP since I want to have more unit tests but the code shouldn't change. https://gerrit.openvpn.net/c/openvpn/+/1067 Arne
Good. I backported the patch so it ran against the culprit version (b3647114). I got these mesages: SENT CONTROL [mycommonname]: 'PUSH_REPLY,route ... 255.255.255.255 net_gateway,route-gateway 10.x.x.1,topology subnet,ping 15,ping-restart 55,route 10.x.x.0 255.255.0.0 vpn_gateway,ifconfig 10.x.x.3 255.255.255.0,peer-id 4,cipher AES-256-GCM' (status=1) Packet with invalid or missing SID from [AF_INET]HOME_IP:33567 Float requested for peer 4 to HOME_IP:33567 peer 4 (mycommonname) floated from VPN_IP:33567 to [AF_INET]HOME_IP:33567 The "Packet with invalid or missing SID" is new to me. But other than that, it works. I also tried it against 2.6-latest (0169b4ad). Also works. There the message is: Packet (P_ACK_V1) with invalid or missing SID from [AF_INET]HOME_IP:46088 I can't tell if this new message is problematic or not. It doesn't negatively impact my connection setup. And I (now) know when to expect it. As for your patch: there's a minor typo in your patch at ssl_pkt.h in the signature: "bool check_session_id_hmac" should be "bool pkt_is_ack" Further, I would prefer if the commit message itself mentioned something about "floating IPs and 60 second timeout after connect" instead of "rare circumstances" which are not rare in 100% of my use cases. That might be beneficial to the next person who runs into this. Thanks for the fixes! Walter > Am 25.05.25 um 22:27 schrieb Walter Doekes: >> Good. Your understanding of the situation is the same. >> >> I did not yet make a reproducer config -- mostly because I don't think >> we're doing anything non-standard. But I did double check that latest >> 2.6 is affected, tested both client and server. > > With pcap dumps I think I got an understanding what is happening. Can > you see if this patch on the server fixes the problem for you? It is > marked as WIP since I want to have more unit tests but the code > shouldn't change. > > https://gerrit.openvpn.net/c/openvpn/+/1067 > > Arne >
Am 25.06.25 um 15:30 schrieb Walter Doekes: > Good. I backported the patch so it ran against the culprit version > (b3647114). > > I got these mesages: > > > SENT CONTROL [mycommonname]: 'PUSH_REPLY,route ... 255.255.255.255 > net_gateway,route-gateway 10.x.x.1,topology subnet,ping 15,ping-restart > 55,route 10.x.x.0 255.255.0.0 vpn_gateway,ifconfig 10.x.x.3 > 255.255.255.0,peer-id 4,cipher AES-256-GCM' (status=1) > > Packet with invalid or missing SID from [AF_INET]HOME_IP:33567 > > Float requested for peer 4 to HOME_IP:33567 > > peer 4 (mycommonname) floated from VPN_IP:33567 to [AF_INET]HOME_IP:33567 > > > > The "Packet with invalid or missing SID" is new to me. But other than > that, it works. Thanks for testing and confirming that it works. > I also tried it against 2.6-latest (0169b4ad). Also works. There the > message is: > > Packet (P_ACK_V1) with invalid or missing SID from [AF_INET]HOME_IP:46088 > > I can't tell if this new message is problematic or not. It doesn't > negatively impact my connection setup. And I (now) know when to expect it. > > > > As for your patch: there's a minor typo in your patch at ssl_pkt.h in the > signature: > > "bool check_session_id_hmac" should be "bool pkt_is_ack" Thanks will fix in the next revision. > Further, I would prefer if the commit message itself mentioned something > about "floating IPs and 60 second timeout after connect" instead of "rare > circumstances" which are not rare in 100% of my use cases. That might be > beneficial to the next person who runs into this. You need a connection that - starts on IP A - successfully floats to IP B by data packet - then has a control packet from IP A before any data packet can trigger the float back to IP A In this scenario we would trigger a new connection coming before while know we detect that this should not trigger a new connection as it is not a new connection attempt. So instead of creating a new connection, you see (at verb 4 or higher) the message Packet (P_ACK_V1) with invalid or missing SID instead with the patch. I will update the commit message when the patch is no longer in draft mode. I wanted to confirm that this is actually the problem/scenario we are fixing before finishing the patch. Arne
diff --git a/src/openvpn/multi.c b/src/openvpn/multi.c index a2d3fd10..51a00b71 100644 --- a/src/openvpn/multi.c +++ b/src/openvpn/multi.c @@ -3236,8 +3236,21 @@ multi_process_float(struct multi_context *m, struct multi_instance *mi, struct tls_multi *m1 = mi->context.c2.tls_multi; struct tls_multi *m2 = ex_mi->context.c2.tls_multi; + /* if the new connection is fresh and the old one is already connected, this + * might be a legitimate move to a new IP by the original client; + * for example when the server IP is pushed as net_gateway to escape from + * a double VPN. */ + if (m1->multi_state == CAS_CONNECT_DONE + && m2->multi_state == CAS_NOT_CONNECTED + && m1->locked_cert_hash_set + && !m2->locked_cert_hash_set + && session_id_equal(&m1->session[TM_ACTIVE].session_id, + &m2->session[TM_ACTIVE].session_id)) + { + /* allow this case */ + } /* do not float if target address is taken by client with another cert */ - if (!cert_hash_compare(m1->locked_cert_hash_set, m2->locked_cert_hash_set)) + else if (!cert_hash_compare(m1->locked_cert_hash_set, m2->locked_cert_hash_set)) { msg(D_MULTI_LOW, "Disallow float to an address taken by another client %s", multi_instance_string(ex_mi, false, &gc));