Commit Graph

40 Commits

Author SHA1 Message Date
Wade Simmons 4eb1da0958
remove deadlock in GetOrHandshake (#1151)
We had a rare deadlock in GetOrHandshake because we kept the hostmap
lock when we do the call to StartHandshake. StartHandshake can block
while sending to the lighthouse query worker channel, and that worker
needs to be able to grab the hostmap lock to do its work. Other calls
for StartHandshake don't hold the hostmap lock so we should be able to
drop it here.

This lock was originally added with: https://github.com/slackhq/nebula/pull/954
2024-05-29 12:52:52 -04:00
Wade Simmons 7efa750aef
avoid deadlock in lighthouse queryWorker (#1112)
* avoid deadlock in lighthouse queryWorker

If the lighthouse queryWorker tries to grab to call StartHandshake on
a lighthouse vpnIp, we can deadlock on the handshake_manager lock. This
change drops the handshake_manager lock before we send on the lighthouse
queryChan (which could block), and also avoids sending to the channel if
this is a lighthouse IP itself.

* need to hold lock during cacheCb
2024-04-11 17:00:01 -04:00
Nate Brown a390125935
Support reloading preferred_ranges (#1043) 2024-04-03 22:14:51 -05:00
Nate Brown 072edd56b3
Fix re-entrant `GetOrHandshake` issues (#1044) 2023-12-19 11:58:31 -06:00
Nate Brown a44e1b8b05
Clean up a hostinfo to reduce memory usage (#955) 2023-11-02 16:53:59 -05:00
Nate Brown 50d6a1e8ca
QueryServer needs to be done outside of the lock (#996) 2023-10-17 15:43:51 -05:00
Nate Brown 076ebc6c6e
Simplify getting a hostinfo or starting a handshake with one (#954) 2023-08-21 18:51:45 -05:00
Nate Brown 7edcf620c0
We only need the certificate in ConnectionState (#953) 2023-08-21 14:11:06 -05:00
Nate Brown a10baeee92
Pull hostmap and pending hostmap apart, remove unused functions (#843) 2023-07-24 12:37:52 -05:00
Nate Brown 3bbf5f4e67
Use an interface for udp conns (#901) 2023-06-14 10:48:52 -05:00
Nate Brown 03e4a7f988
Rehandshaking (#838)
Co-authored-by: Brad Higgins <brad@defined.net>
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-05-04 15:16:37 -05:00
brad-defined 9b03053191
update EncReader and EncWriter interface function args to have concrete types (#844)
* Update LightHouseHandlerFunc to remove EncWriter param.
* Move EncWriter to interface
* EncReader, too
2023-04-07 14:28:37 -04:00
Nate Brown d3fe3efcb0
Fix handshake retry regression (#842) 2023-04-05 10:04:30 -05:00
Nate Brown ee8e1348e9
Use connection manager to drive NAT maintenance (#835)
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
2023-03-31 15:45:05 -05:00
Nate Brown 1a6c657451
Normalize logs (#837) 2023-03-30 15:07:31 -05:00
brad-defined 2801fb2286
Fix relay (#827)
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
2023-03-30 11:09:20 -05:00
Nate Brown f0ef80500d
Remove dead code and re-order transit from pending to main hostmap on stage 2 (#828) 2023-03-17 15:36:24 -05:00
Wade Simmons e1af37e46d
add calculated_remotes (#759)
* add calculated_remotes

This setting allows us to "guess" what the remote might be for a host
while we wait for the lighthouse response. For networks that hard
designed with in mind, it can help speed up handshake performance, as well as
improve resiliency in the case that all lighthouses are down.

Example:

    lighthouse:
      # ...

      calculated_remotes:
        # For any Nebula IPs in 10.0.10.0/24, this will apply the mask and add
        # the calculated IP as an initial remote (while we wait for the response
        # from the lighthouse). Both CIDRs must have the same mask size.
        # For example, Nebula IP 10.0.10.123 will have a calculated remote of
        # 192.168.1.123

        10.0.10.0/24:
          - mask: 192.168.1.0/24
            port: 4242

* figure out what is up with this test

* add test

* better logic for sending handshakes

Keep track of the last light of hosts we sent handshakes to. Only log
handshake sent messages if the list has changed.

Remove the test Test_NewHandshakeManagerTrigger because it is faulty and
makes no sense. It relys on the fact that no handshake packets actually
get sent, but with these changes we would send packets now (which it
should!)

* use atomic.Pointer

* cleanup to make it clearer

* fix typo in example
2023-03-13 15:09:08 -04:00
Nate Brown 92cc32f844
Remove handshake race avoidance (#820)
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-03-13 12:35:14 -05:00
Nate Brown 5278b6f926
Generic timerwheel (#804) 2023-01-18 10:56:42 -06:00
Caleb Jasik 12dbbd3dd3
Fix typos found by https://github.com/crate-ci/typos (#735) 2022-12-19 11:28:27 -06:00
brad-defined 1a7c575011
Relay (#678)
Co-authored-by: Wade Simmons <wsimmons@slack-corp.com>
2022-06-21 13:35:23 -05:00
Wade Simmons 304b12f63f
create ConnectionState before adding to HostMap (#535)
We have a few small race conditions with creating the HostInfo.ConnectionState
since we add the host info to the pendingHostMap before we set this
field. We can make everything a lot easier if we just add an "init"
function so that we can set this field in the hostinfo before we add it
to the hostmap.
2021-11-08 14:46:22 -05:00
Nate Brown bcabcfdaca
Rework some things into packages (#489) 2021-11-03 20:54:04 -05:00
brad-defined 6ae8ba26f7
Add a context object in nebula.Main to clean up on error (#550) 2021-11-02 13:14:26 -05:00
John Maguire 98c391396c
Remove log when no handshake message is sent (#452) 2021-04-30 18:19:40 -05:00
Wade Simmons 44cb697552
Add more metrics (#450)
* Add more metrics

This change adds the following counter metrics:

Metrics to track packets dropped at the firewall:

    firewall.dropped.local_ip
    firewall.dropped.remote_ip
    firewall.dropped.no_rule

Metrics to track handshakes attempts that have been initiated and ones
that have timed out (ones that have completed are tracked by the
existing "handshakes" histogram).

    handshake_manager.initiated
    handshake_manager.timed_out

Metrics to track when cached_packets are dropped because we run out of
buffer space, and how many are sent once the handshake completes.

    hostinfo.cached_packets.dropped
    hostinfo.cached_packets.sent

This change also notes how many cached packets we have when we log the
final "Handshake received" message for either stage1 for stage2.

* separate incoming/outgoing metrics

* remove "allowed" firewall metrics

We don't need this on the hotpath, they aren't worh it.

* don't need pointers here
2021-04-27 22:23:18 -04:00
Nathan Brown db23fdf9bc
Dont apply race avoidance to existing handshakes, use the handshake time to determine who wins (#451)
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2021-04-27 21:15:34 -05:00
Nathan Brown 710df6a876
Refactor remotes and handshaking to give every address a fair shot (#437) 2021-04-14 13:50:09 -05:00
Nathan Brown 3ea7e1b75f
Don't use a global logger (#423) 2021-03-26 09:46:30 -05:00
Wade Simmons 6c55d67f18
Refactor handshake_ix (#401)
There are some subtle race conditions with the previous handshake_ix implementation, mostly around collisions with localIndexId. This change refactors it so that we have a "commit" phase during the handshake where we grab the lock for the hostmap and ensure that we have a unique local index before storing it. We also now avoid using the pending hostmap at all for receiving stage1 packets, since we have everything we need to just store the completed handshake.

Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Ryan Huber <rhuber@gmail.com>
Co-authored-by: forfuncsake <drussell@slack-corp.com>
2021-03-12 14:16:25 -05:00
Wade Simmons d604270966
Fix most known data races (#396)
This change fixes all of the known data races that `make smoke-docker-race` finds, except for one.

Most of these races are around the handshake phase for a hostinfo, so we add a RWLock to the hostinfo and Lock during each of the handshake stages.

Some of the other races are around consistently using `atomic` around the `messageCounter` field. To make this harder to mess up, I have renamed the field to `atomicMessageCounter` (I also removed the unnecessary extra pointer deference as we can just point directly to the struct field).

The last remaining data race is around reading `ConnectionInfo.ready`, which is a boolean that is only written to once when the handshake has finished. Due to it being in the hot path for packets and the rare case that this could actually be an issue, holding off on fixing that one for now.

here is the results of `make smoke-docker-race`:

before:

    lighthouse1: Found 2 data race(s)
    host2:       Found 36 data race(s)
    host3:       Found 17 data race(s)
    host4:       Found 31 data race(s)

after:

    host2: Found 1 data race(s)
    host4: Found 1 data race(s)

Fixes: #147
Fixes: #226
Fixes: #283
Fixes: #316
2021-03-05 21:18:33 -05:00
Wade Simmons 1bae5b2550
more validation in pending hostmap deletes (#344)
We are currently seeing some cases where we are not deleting entries
correctly from the pending hostmap. I believe this is a case of
an inbound timer tick firing and deleting the Hosts map entry for
a newer handshake attempt than intended, thus leaving the old Indexes
entry orphaned. This change adds some extra checking when deleteing from
the Indexes and Hosts maps to ensure we clean everything up correctly.
2021-03-01 12:40:46 -05:00
Wade Simmons ee7c27093c
add HostMap.RemoteIndexes (#329)
This change adds an index based on HostInfo.remoteIndexId. This allows
us to use HostMap.QueryReverseIndex without having to loop over all
entries in the map (this can be a bottleneck under high traffic
lighthouses).

Without this patch, a high traffic lighthouse server receiving recv_error
packets and lots of handshakes, cpu pprof trace can look like this:

      flat  flat%   sum%        cum   cum%
    2000ms 32.26% 32.26%     3040ms 49.03%  github.com/slackhq/nebula.(*HostMap).QueryReverseIndex
     870ms 14.03% 46.29%     1060ms 17.10%  runtime.mapiternext

Which shows 50% of total cpu time is being spent in QueryReverseIndex.
2020-11-23 14:51:16 -05:00
Wade Simmons a54f3fc681
fix fast handshake trigger for static hosts (#265)
We are currently triggering a fast handshake for static hosts right
inside HandshakeManager.AddVpnIP, but this can actually trigger before
we have generated the handshake packet to use. Instead, we should be
triggering right after we call ixHandshakeStage0 in getOrHandshake
(which generates the handshake packet)
2020-08-02 20:59:50 -04:00
Wade Simmons 4756c9613d
trigger handshakes when lighthouse reply arrives (#246)
Currently, we wait until the next timer tick to act on the lighthouse's
reply to our HostQuery. This means we can easily add hundreds of
milliseconds of unnecessary delay to the handshake. To fix this, we
can introduce a channel to trigger an outbound handshake without waiting
for the next timer tick.

A few samples of cold ping time between two hosts that require a
lighthouse lookup:

    before (v1.2.0):

    time=156 ms
    time=252 ms
    time=12.6 ms
    time=301 ms
    time=352 ms
    time=49.4 ms
    time=150 ms
    time=13.5 ms
    time=8.24 ms
    time=161 ms
    time=355 ms

    after:

    time=3.53 ms
    time=3.14 ms
    time=3.08 ms
    time=3.92 ms
    time=7.78 ms
    time=3.59 ms
    time=3.07 ms
    time=3.22 ms
    time=3.12 ms
    time=3.08 ms
    time=8.04 ms

I recommend reviewing this PR by looking at each commit individually, as
some refactoring was required that makes the diff a bit confusing when
combined together.
2020-07-22 10:35:10 -04:00
Wade Simmons b37a91cfbc
add meta packet statistics (#230)
This change add more metrics around "meta" (non "message" type packets).
For lighthouse packets, we also record statistics around the specific
lighthouse meta type.

We don't keep statistics for the "message" type so that we don't slow
down the fast path (and you can just look at metrics on the tun
interface to find that information).
2020-06-26 13:45:48 -04:00
Wade Simmons b4f2f7ce4e
log `certName` alongside `vpnIp` (#200)
This change adds a new helper, `(*HostInfo).logger()`, that starts a new
logrus.Entry with `vpnIp` and `certName`. We don't use the helper inside
of handshake_ix though since the certificate has not been attached to
the HostInfo yet.

Fixes: #84
2020-04-06 11:34:00 -07:00
Wade Simmons 179a369130
add configuration options for HandshakeManager (#179)
This change exposes the current constants we have defined for the handshake
manager as configuration options. This will allow us to test and tweak
with different intervals and wait rotations.

    # Handshake Manger Settings
    handshakes:
      # Total time to try a handshake = sequence of `try_interval * retries`
      # With 100ms interval and 20 retries it is 23.5 seconds
      try_interval: 100ms
      retries: 20

      # wait_rotation is the number of handshake attempts to do before starting to try non-local IP addresses
      wait_rotation: 5
2020-02-21 16:25:11 -05:00
Slack Security Team f22b4b584d Public Release 2019-11-19 17:00:20 +00:00