* Fixes a reocnfig freeze where the reconfig attempts to send to an unbuffered channel with no readers.
Only create stop channel when a DNS goroutine is created, and only send when the channel exists.
Buffer to size 1 so that the stop message can be immediately sent even if the goroutine is busy doing DNS lookups.
* add calculated_remotes
This setting allows us to "guess" what the remote might be for a host
while we wait for the lighthouse response. For networks that hard
designed with in mind, it can help speed up handshake performance, as well as
improve resiliency in the case that all lighthouses are down.
Example:
lighthouse:
# ...
calculated_remotes:
# For any Nebula IPs in 10.0.10.0/24, this will apply the mask and add
# the calculated IP as an initial remote (while we wait for the response
# from the lighthouse). Both CIDRs must have the same mask size.
# For example, Nebula IP 10.0.10.123 will have a calculated remote of
# 192.168.1.123
10.0.10.0/24:
- mask: 192.168.1.0/24
port: 4242
* figure out what is up with this test
* add test
* better logic for sending handshakes
Keep track of the last light of hosts we sent handshakes to. Only log
handshake sent messages if the list has changed.
Remove the test Test_NewHandshakeManagerTrigger because it is faulty and
makes no sense. It relys on the fact that no handshake packets actually
get sent, but with these changes we would send packets now (which it
should!)
* use atomic.Pointer
* cleanup to make it clearer
* fix typo in example
These new helpers make the code a lot cleaner. I confirmed that the
simple helpers like `atomic.Int64` don't add any extra overhead as they
get inlined by the compiler. `atomic.Pointer` adds an extra method call
as it no longer gets inlined, but we aren't using these on the hot path
so it is probably okay.
This allows you to configure remote allow lists specific to different
subnets of the inside CIDR. Example:
remote_allow_ranges:
10.42.42.0/24:
192.168.0.0/16: true
This would only allow hosts with a VPN IP in the 10.42.42.0/24 range to
have private IPs (and thus don't connect over public IPs).
The PR also refactors AllowList into RemoteAllowList and LocalAllowList to make it clearer which methods are allowed on which allow list.
Currently, if you use the remote allow list config, as soon as you attempt to create a tunnel to a node that has a blocked IP address, a mutex is locked and never unlocked. This happens even if the node has an allowed remote IP address in addition to the blocked remote IP address.
This pull request ensures that the lighthouse mutex is unlocked whenever we attempt to add a remote IP.
The change introduced by #320 incorrectly re-uses the output buffer for
sending punchBack packets. Since we are currently spawning a new
goroutine for each send here, we need to allocate a new buffer each
time. We can come back and optimize this in the future, but for now we
should fix the regression.
We noticed that the number of memory allocations LightHouse.HandleRequest creates for each call can seriously impact performance for high traffic lighthouses. This PR introduces a benchmark in the first commit and then optimizes memory usage by creating a LightHouseHandler struct. This struct allows us to re-use memory between each lighthouse request (one instance per UDP listener go-routine).
Currently, we wait until the next timer tick to act on the lighthouse's
reply to our HostQuery. This means we can easily add hundreds of
milliseconds of unnecessary delay to the handshake. To fix this, we
can introduce a channel to trigger an outbound handshake without waiting
for the next timer tick.
A few samples of cold ping time between two hosts that require a
lighthouse lookup:
before (v1.2.0):
time=156 ms
time=252 ms
time=12.6 ms
time=301 ms
time=352 ms
time=49.4 ms
time=150 ms
time=13.5 ms
time=8.24 ms
time=161 ms
time=355 ms
after:
time=3.53 ms
time=3.14 ms
time=3.08 ms
time=3.92 ms
time=7.78 ms
time=3.59 ms
time=3.07 ms
time=3.22 ms
time=3.12 ms
time=3.08 ms
time=8.04 ms
I recommend reviewing this PR by looking at each commit individually, as
some refactoring was required that makes the diff a bit confusing when
combined together.
This change add more metrics around "meta" (non "message" type packets).
For lighthouse packets, we also record statistics around the specific
lighthouse meta type.
We don't keep statistics for the "message" type so that we don't slow
down the fast path (and you can just look at metrics on the tun
interface to find that information).
These settings make it possible to blacklist / whitelist IP addresses
that are used for remote connections.
`lighthouse.remoteAllowList` filters which remote IPs are allow when
fetching from the lighthouse (or, if you are the lighthouse, which IPs
you store and forward to querying hosts). By default, any remote IPs are
allowed. You can provide CIDRs here with `true` to allow and `false` to
deny. The most specific CIDR rule applies to each remote. If all rules
are "allow", the default will be "deny", and vice-versa. If both "allow"
and "deny" rules are present, then you MUST set a rule for "0.0.0.0/0"
as the default.
lighthouse:
remoteAllowList:
# Example to block IPs from this subnet from being used for remote IPs.
"172.16.0.0/12": false
# A more complicated example, allow public IPs but only private IPs from a specific subnet
"0.0.0.0/0": true
"10.0.0.0/8": false
"10.42.42.0/24": true
`lighthouse.localAllowList` has the same logic as above, but it applies
to the local addresses we advertise to the lighthouse. Additionally, you
can specify an `interfaces` map of regular expressions to match against
interface names. The regexp must match the entire name. All interface
rules must be either true or false (and the default rule will be the
inverse). CIDR rules are matched after interface name rules.
Default is all local IP addresses.
lighthouse:
localAllowList:
# Example to blacklist docker interfaces.
interfaces:
'docker.*': false
# Example to only advertise IPs in this subnet to the lighthouse.
"10.0.0.0/8": true
* add configurable punching delay because of race-condition-y conntracks
* add changelog
* fix tests
* only do one punch per query
* Coalesce punchy config
* It is not is not set
* Add tests
Co-authored-by: Nate Brown <nbrown.us@gmail.com>