If we send out an event which refers to `prev_events` which other servers in
the federation are missing, then (after a round or two of backfill attempts),
they will end up asking us for `/state_ids` at a particular point in the DAG.
As per https://github.com/matrix-org/synapse/issues/7893, this is quite
expensive, and we tend to see lots of very similar requests around the same
time.
We can therefore handle this much more efficiently by using a cache, which (a)
ensures that if we see the same request from multiple servers (or even the same
server, multiple times), then they share the result, and (b) any other servers
that miss the initial excitement can also benefit from the work.
[It's interesting to note that `/state` has a cache for exactly this
reason. `/state` is now essentially unused and replaced with `/state_ids`, but
evidently when we replaced it we forgot to add a cache to the new endpoint.]
For inbound federation requests, if a given remote server makes too many
requests at once, we start stacking them up rather than processing them
immediatedly.
However, that means that there is a fair chance that the requesting server will
disconnect before we start processing the request. In that case, if it was a
read-only request (ie, a GET request), there is absolutely no point in
building a response (and some requests are quite expensive to handle).
Even in the case of a POST request, one of two things will happen:
* Most likely, the requesting server will retry the request and we'll get the
information anyway.
* Even if it doesn't, the requesting server has to assume that we didn't get
the memo, and act accordingly.
In short, we're better off aborting the request at this point rather than
ploughing on with what might be a quite expensive request.
It was correct at the time of our friend Jorik writing it (checking
git blame), but the world has moved now and it is no longer a
generator.
Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
Add changelog
Save retrieved keys to the db
lint
Fix and de-brittle remote result dict processing
Use query_user_devices instead, assume only master, self_signing key types
Make changelog more useful
Remove very specific exception handling
Wrap get_verify_key_from_cross_signing_key in a try/except
Note that _get_e2e_cross_signing_verify_key can raise a SynapseError
lint
Add comment explaining why this is useful
Only fetch master and self_signing key types
Fix log statements, docstrings
Remove extraneous items from remote query try/except
lint
Factor key retrieval out into a separate function
Send device updates, modeled after SigningKeyEduUpdater._handle_signing_key_updates
Update method docstring
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
* Pull Sentinel out of LoggingContext
... and drop a few unnecessary references to it
* Factor out LoggingContext.current_context
move `current_context` and `set_context` out to top-level functions.
Mostly this means that I can more easily trace what's actually referring to
LoggingContext, but I think it's generally neater.
* move copy-to-parent into `stop`
this really just makes `start` and `stop` more symetric. It also means that it
behaves correctly if you manually `set_log_context` rather than using the
context manager.
* Replace `LoggingContext.alive` with `finished`
Turn `alive` into `finished` and make it a bit better defined.
A lot of the things we log at INFO are now a bit superfluous, so lets
make them DEBUG logs to reduce the amount we log by default.
Co-Authored-By: Brendan Abolivier <babolivier@matrix.org>
Co-authored-by: Brendan Abolivier <github@brendanabolivier.com>
I messed this up a bit in #6805, but fortunately we weren't actually doing
anything with the room_version so it didn't matter that it was a str not a RoomVersion.
We were sending device updates down both the federation stream and
device streams. This mean there was a race if the federation sender
worker processed the federation stream first, as when the sender checked
if there were new device updates the slaved ID generator hadn't been
updated with the new stream IDs and so returned nothing.
This situation is correctly handled by events/receipts/etc by not
sending updates down the federation stream and instead having the
federation sender worker listen on the other streams and poke the
transaction queues as appropriate.
* Port synapse.replication.tcp to async/await
* Newsfile
* Correctly document type of on_<FOO> functions as async
* Don't be overenthusiastic with the asyncing....
* Fix /federation/v1/state for recent room versions
Turns out this endpoint was completely broken for v3 rooms. Hopefully this
re-signing code is irrelevant nowadays anyway.
Another small fixup noticed during work on a larger PR. The `origin` field of `add_display_name_to_third_party_invite` is not used and likely was just carried over from the `on_PUT` method of `FederationThirdPartyInviteExchangeServlet` which, like all other servlets, provides an `origin` argument.
Since it's not used anywhere in the handler function though, we should remove it from the function arguments.
Python will return a tuple whether there are parentheses around the returned values or not.
I'm just sick of my editor complaining about this all over the place :)
Propagate opentracing contexts across workers
Also includes some Convenience modifications to opentracing for servlets, notably:
- Add boolean to skip the whitelisting check on inject
extract methods. - useful when injecting into carriers
locally. Otherwise we'd always have to include our
own servername and whitelist our servername
- start_active_span_from_request instead of header
- Add boolean to decide whether to extract context
from a request to a servlet
Add authenticated_entity and servlet_names tags.
Functionally:
- Add a tag for authenticated_entity
- Add a tag for servlet_names
Stylistically:
Moved to importing methods directly from opentracing.
is cached and so does not always return a `Deferred`.
`await` does not silently pass-through non-Deferreds like `yield` used to.
Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
--------
- Fix a regression introduced in v1.2.0rc1 which led to incorrect labels on some prometheus metrics. ([\#5734](https://github.com/matrix-org/synapse/issues/5734))
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEgQG31Z317NrSMt0QiISIDS7+X/QFAl04Ur0THGFuZHJld0Bh
bW9yZ2FuLnh5egAKCRCIhIgNLv5f9F4oD/0TY6S/SEd2uAmzor64ojmbX5BOwPzf
j/wzUTrfvuf40EvkNPDpnejNZSvy/ysbaGQaQusv0SQKlV3xrvdn4RuMvnOWVWck
kBsO+lvzOaUTR0KHDxN4y9F5eI2NdPbub4847PPVzyqSIHAd+kolxXS8kSBBhwpL
yfaICWV/AOy5L7xN+JZ9IQpnegVAvUj5DmgXzDHd6VdeiHDVJuARaBgrR5uCkwVS
ZoLRqZ95XV/qiguMAUvPOwyEqht2mwO64989MswP16YYm8oMkB5QA6I5nYnACsTP
qk9YcN/oNvEfQXUhttku6MxK1/4yUMPUhEoDBDH7ebc0440QDtWN+IHTdA6oPVZB
IuStL9YGY16m7Ltx37ZUA4URfNMiSeLHo3zKc/mCAcwxN4HyOjJewtxbG5zKQAOZ
SMs8UcDwGR4zL1hnt8ZDNYtWwfzJBQIdGjoHvjXJEY7/1csTv2lmAwewFTXiqSAr
30GW5ews94kotqBK53zZT6V0F5gHNqgGHniOz1ZpqLLxYLqO3LSAGe97CrqlWUdX
GkhA9tZyweknociD9fyyBmKdcFJ4mL4a+oGI5CMnSMph8UvCY8Y5XMb1T+iYEABI
tA9G3mBvgkLPj+5V+8QggNkBafSigW2Q4FX7enGsDmiiskZOtfeKrAcVkapD4ooi
3I7IW5aetZr2IQ==
=+JBn
-----END PGP SIGNATURE-----
Merge tag 'v1.2.0rc2' into develop
Bugfixes
--------
- Fix a regression introduced in v1.2.0rc1 which led to incorrect labels on some prometheus metrics. ([\#5734](https://github.com/matrix-org/synapse/issues/5734))
* Fix servlet metric names
Co-Authored-By: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
* Remove redundant check
* Cover all return paths
* Convert BaseFederationServlet._wrap to async
Empirically, this fixes some lost stacktraces. It should be safe because the
wrapped function is called from JsonResource._async_render, which is already
async.
* Convert the rest of synapse.federation.transport.server to async
We may as well do the whole file while we're here.
* changelog
* flake8
* Configure and initialise tracer
Includes config options for the tracer and sets up JaegerClient.
* Scope manager using LogContexts
We piggy-back our tracer scopes by using log context.
The current log context gives us the current scope. If new scope is
created we create a stack of scopes in the context.
* jaeger is a dependency now
* Carrier inject and extraction for Twisted Headers
* Trace federation requests on the way in and out.
The span is created in _started_processing and closed in
_finished_processing because we need a meaningful log context.
* Create logcontext for new scope.
Instead of having a stack of scopes in a logcontext we create a new
context for a new scope if the current logcontext already has a scope.
* Remove scope from logcontext if logcontext is top level
* Disable tracer if not configured
* typo
* Remove dependence on jaeger internals
* bools
* Set service name
* :Explicitely state that the tracer is disabled
* Black is the new black
* Newsfile
* Code style
* Use the new config setup.
* Generate config.
* Copyright
* Rename config to opentracing
* Remove user whitelisting
* Empty whitelist by default
* User ConfigError instead of RuntimeError
* Use isinstance
* Use tag constants for opentracing.
* Remove debug comment and no need to explicitely record error
* Two errors a "s(c)entry"
* Docstrings!
* Remove debugging brainslip
* Homeserver Whitlisting
* Better opentracing config comment
* linting
* Inclue worker name in service_name
* Make opentracing an optional dependency
* Neater config retreival
* Clean up dummy tags
* Instantiate tracing as object instead of global class
* Inlcude opentracing as a homeserver member.
* Thread opentracing to the request level
* Reference opetnracing through hs
* Instantiate dummy opentracin g for tests.
* About to revert, just keeping the unfinished changes just in case
* Revert back to global state, commit number:
9ce4a3d906
* Use class level methods in tracerutils
* Start and stop requests spans in a place where we
have access to the authenticated entity
* Seen it, isort it
* Make sure to close the active span.
* I'm getting black and blue from this.
* Logger formatting
Co-Authored-By: Erik Johnston <erik@matrix.org>
* Outdated comment
* Import opentracing at the top
* Return a contextmanager
* Start tracing client requests from the servlet
* Return noop context manager if not tracing
* Explicitely say that these are federation requests
* Include servlet name in client requests
* Use context manager
* Move opentracing to logging/
* Seen it, isort it again!
* Ignore twisted return exceptions on context exit
* Escape the scope
* Scopes should be entered to make them useful.
* Nicer decorator names
* Just one init, init?
* Don't need to close something that isn't open
* Docs make you smarter
Adds new config option `cleanup_extremities_with_dummy_events` which
periodically sends dummy events to rooms with more than 10 extremities.
THIS IS REALLY EXPERIMENTAL.
Also:
* rename VerifyKeyRequest->VerifyJsonRequest
* calculate key_ids on VerifyJsonRequest construction
* refactor things to pass around VerifyJsonRequests instead of 4-tuples
FederationClient.get_pdu is called in a loop to fetch a batch of PDUs. A
failure to fetch one should not result in a failure of the whole batch. Add the
missing `continue`.
We have too many things called get_event, and it's hard to figure out what we
mean. Also remove some unused params from the signature, and add some logging.
When handling incoming federation requests, make sure that we have an
up-to-date copy of the signing key.
We do not yet enforce the validity period for event signatures.
If we remove support for a particular room version, we should behave more
gracefully. This should make client requests fail with a 400 rather than a 500,
and will ignore individiual PDUs in a federation transaction, rather than the
whole transaction.
This commit adds two config options:
* `restrict_public_rooms_to_local_users`
Requires auth to fetch the public rooms directory through the CS API and disables fetching it through the federation API.
* `require_auth_for_profile_requests`
When set to `true`, requires that requests to `/profile` over the CS API are authenticated, and only returns the user's profile if the requester shares a room with the profile's owner, as per MSC1301.
MSC1301 also specifies a behaviour for federation (only returning the profile if the server asking for it shares a room with the profile's owner), but that's currently really non-trivial to do in a not too expensive way. Next step is writing down a MSC that allows a HS to specify which user sent the profile query. In this implementation, Synapse won't send a profile query over federation if it doesn't believe it already shares a room with the profile's owner, though.
Groups have been intentionally omitted from this commit.
Primarily this fixes a bug in the handling of remote users joining a
room where the server sent out the presence for all local users in the
room to all servers in the room.
We also change to using the state delta stream, rather than the
distributor, as it will make it easier to split processing out of the
master process (as well as being more flexible).
Finally, when sending presence states to newly joined servers we filter
out old presence states to reduce the number sent. Initially we filter
out states that are offline and have a last active more than a week ago,
though this can be changed down the line.
Fixes#3962
As per #3622, we remove trailing slashes from outbound federation requests. However, to ensure that we remain backwards compatible with previous versions of Synapse, if we receive a HTTP 400 with `M_UNRECOGNIZED`, then we are likely talking to an older version of Synapse in which case we retry with a trailing slash appended to the request path.
In worker mode, on the federation sender, when we receive an edu for sending
over the replication socket, it is parsed into an Edu object. There is no point
extracting the contents of it so that we can then immediately build another Edu.
* make 'event_id' a required parameter in federated state requests
As per the spec: https://matrix.org/docs/spec/server_server/r0.1.1.html#id40
Signed-off-by: Joseph Weston <joseph@weston.cloud>
* add changelog entry for bugfix
Signed-off-by: Joseph Weston <joseph@weston.cloud>
* Update server.py
We only process events sent to us from a server if the event ID matches
the server, to help guard against federation storms. We replace this
with a check against the event origin.
The transaction queue only sends out events that we generate. This was
done by checking domain of event ID, but that can no longer be used.
Instead, we may as well use the sender field.
We currently pass FrozenEvent instead of `dict` to
`compute_event_signature`, which works by accident due to `dict(event)`
producing the correct result.
This fixes PR #4493 commit 855a151
Currently they're stored as non-outliers even though the server isn't in
the room, which can be problematic in places where the code assumes it
has the state for all non outlier events.
In particular, there is an edge case where persisting the leave event
triggers a state resolution, which requires looking up the room version
from state. Since the server doesn't have the state, this causes an
exception to be thrown.
This allows the OpenID userinfo endpoint to be active even if the
federation resource is not active. The OpenID userinfo endpoint
is called by integration managers to verify user actions using the
client API OpenID access token. Without this verification, the
integration manager cannot know that the access token is valid.
The OpenID userinfo endpoint will be loaded in the case that either
"federation" or "openid" resource is defined. The new "openid"
resource is defaulted to active in default configuration.
Signed-off-by: Jason Robinson <jasonr@matrix.org>
* Correctly retry and back off if we get a HTTPerror response
* Refactor request sending to have better excpetions
MatrixFederationHttpClient blindly reraised exceptions to the caller
without differentiating "expected" failures (e.g. connection timeouts
etc) versus more severe problems (e.g. programming errors).
This commit adds a RequestSendFailed exception that is raised when
"expected" failures happen, allowing the TransactionQueue to log them as
warnings while allowing us to log other exceptions as actual exceptions.
When we receive events over federation we will need to know the room
version to be able to correctly handle them, e.g. once we start changing
event formats. Currently, we attempt to handle events in unknown rooms.
Broadly three things here:
* disable W504 which seems a bit whacko
* remove a bunch of `as e` expressions from exception handlers that don't use
them
* use `r""` for strings which include backslashes
Also, we don't use pep8 any more, so we can get rid of the duplicate config
there.
It's quite important that get_missing_events returns the *latest* events in the
room; however we were pulling event ids out of the database until we got *at
least* 10, and then taking the *earliest* of the results.
We also shouldn't really be relying on depth, and should be checking the
room_id.
- Improve logging: log things in the right order, include destination and txids
in all log lines, don't log successful responses twice
- Fix the docstring on TransportLayerClient.send_transaction
- Don't use treq.request, which is overcomplicated for our purposes: just use a
twisted.web.client.Agent.
- simplify the logic for setting up the bodyProducer
- fix bytes/str confusions
when processing incoming transactions, it can be hard to see what's going on,
because we process a bunch of stuff in parallel, and because we may end up
recursively working our way through a chain of three or four events.
This commit creates a way to use logcontexts to add the relevant event ids to
the log lines.
There's really no point in checking for destinations called "localhost" because
there is nothing stopping people creating other DNS entries which point to
127.0.0.1. The right fix for this is
https://github.com/matrix-org/synapse/issues/3953.
Blocking localhost, on the other hand, means that you get a surprise when
trying to connect a test server on localhost to an existing server (with a
'normal' server_name).
ExpiringCache required that `start()` be called before it would actually
start expiring entries. A number of places didn't do that.
This PR removes `start` from ExpiringCache, and automatically starts
backround reaping process on creation instead.
Add some informative comments about what's going on here.
Also, `sent_to_us_directly` and `get_missing` were doing the same thing (apart
from in `_handle_queued_pdus`, which looks like a bug), so let's get rid of
`get_missing` and use `sent_to_us_directly` consistently.
If we receive an event that doesn't pass their content hash check (e.g.
due to already being redacted) then we hit a bug which causes an
exception to be raised, which then promplty stops the event (and
request) from being processed.
This effects all sorts of federation APIs, including joining rooms with
a redacted state event.
We should check that both the sender's server, and the server which created the
event_id (which may be different from whatever the remote server has told us
the origin is), have signed the event.
When we get a federation request which refers to an event id, make sure that
said event is in the room the caller claims it is in.
(patch supplied by @turt2live)
This commit replaces SynapseError.from_http_response_exception with
HttpResponseException.to_synapse_error.
The new method actually returns a ProxiedRequestError, which allows us to pass
through additional metadata from the API call.
We really shouldn't be sending all CodeMessageExceptions back over the C-S API;
it will include things like 401s which we shouldn't proxy.
That means that we need to explicitly turn a few HttpResponseExceptions into
SynapseErrors in the federation layer.
The effect of the latter is that the matrix errcode will get passed through
correctly to calling clients, which might help with some of the random
M_UNKNOWN errors when trying to join rooms.
This fixes#3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.
(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
This introduces a mechanism for tracking resource usage by background
processes, along with an example of how it will be used.
This will help address #3518, but more importantly will give us better insights
into things which are happening but not being shown up by the request metrics.
We *could* do this with Measure blocks, but:
- I think having them pulled out as a completely separate metric class will
make it easier to distinguish top-level processes from those which are
nested.
- I want to be able to report on in-flight background processes, and I don't
think we want to do this for *all* Measure blocks.
We need to do a bit more validation when we get a server name, but don't want
to be re-doing it all over the shop, so factor out a separate
parse_and_validate_server_name, and do the extra validation.
Also, use it to verify the server name in the config file.
Make sure that server_names used in auth headers are sane, and reject them with
a sensible error code, before they disappear off into the depths of the system.
While I was going through uses of preserve_fn for other PRs, I converted places
which only use the wrapped function once to use run_in_background, to avoid
creating the function object.
There were a bunch of places where we fire off a process to happen in the
background, but don't have any exception handling on it - instead relying on
the unhandled error being logged when the relevent deferred gets
garbage-collected.
This is unsatisfactory for a number of reasons:
- logging on garbage collection is best-effort and may happen some time after
the error, if at all
- it can be hard to figure out where the error actually happened.
- it is logged as a scary CRITICAL error which (a) I always forget to grep for
and (b) it's not really CRITICAL if a background process we don't care about
fails.
So this is an attempt to add exception handling to everything we fire off into
the background.
It turns out that most of the time we were calling have_events, we were only
using half of the result. Replace have_events with have_seen_events and
get_rejection_reasons, so that we can see what's going on a bit more clearly.
This reverts commit 9fbe70a7dc.
It turns out that sortedcontainers.SortedDict is not an exact match for
blist.sorteddict; in particular, `popitem()` removes things from the opposite
end of the dict.
This is trivial to fix, but I want to add some unit tests, and potentially some
more thought about it, before we do so.
Adds a `.wrap` method to ResponseCache which wraps up the boilerplate of a
(get, set) pair, and then use it throughout the codebase.
This will be largely non-functional, but does include the following functional
changes:
* federation_server.on_context_state_request: drops use of _server_linearizer
which looked redundant and could cause incorrect cache misses by yielding
between the get and the set.
* RoomListHandler.get_remote_public_room_list(): fixes logcontext leaks
* the wrap function includes some logging. I'm hoping this won't be too noisy
on production.