* More precise type for LoggingTransaction.execute
* Add an annotation for stream_ordering_month_ago
This would have spotted the error that was fixed in "Add comma missing from #15382. (#15429)"
Previously, we would spin in a tight loop until
`update_state_for_partial_state_event` stopped raising
`FederationPullAttemptBackoffError`s. Replace the spinloop with a wait
until the backoff period has expired.
Signed-off-by: Sean Quah <seanq@matrix.org>
It's important that collections returned from `@cached` methods are not
modified, otherwise future retrievals from the cache will return the
modified collection.
This applies to the return values from `@cached` methods and the values
inside the dictionaries returned by `@cachedList` methods. It's not
necessary for the dictionaries returned by `@cachedList` methods
themselves to be read-only.
Signed-off-by: Sean Quah <seanq@matrix.org>
Co-authored-by: David Robertson <davidr@element.io>
To perform an emulated upsert into a table safely, we must either:
* lock the table,
* be the only writer upserting into the table
* or rely on another unique index being present.
When the 2nd or 3rd cases were applicable, we previously avoided locking
the table as an optimization. However, as seen in #14406, it is easy to
slip up when adding new schema deltas and corrupt the database.
The only time we lock when performing emulated upserts is while waiting
for background updates on postgres. On sqlite, we do no locking at all.
Let's remove the option to skip locking tables, so that we don't shoot
ourselves in the foot again.
Signed-off-by: Sean Quah <seanq@matrix.org>
While https://github.com/matrix-org/synapse/pull/13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.
Related to
- https://github.com/matrix-org/synapse/issues/13622
- https://github.com/matrix-org/synapse/pull/13635
- https://github.com/matrix-org/synapse/issues/13676
Part of https://github.com/matrix-org/synapse/issues/13356
Follow-up to https://github.com/matrix-org/synapse/pull/13815 which tracks event signature failures.
With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now.
For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761
To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.
- `backfill`
- `outgoing-federation-request` `/backfill`
- `_check_sigs_and_hash_and_fetch`
- `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
- ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
- ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
- `_process_pulled_events`
- `_process_pulled_event` for each validated event
- ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
- `_get_state_ids_after_missing_prev_event`
- `outgoing-federation-request` `/state_ids`
- ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
- ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
There is no need to grab thousands of backfill points when we only need 5 to make the `/backfill` request with. We need to grab a few extra in case the first few aren't visible in the history.
Previously, we grabbed thousands of backfill points from the database, then sorted and filtered them in the app. Fetching the 4.6k backfill points for `#matrix:matrix.org` from the database takes ~50ms - ~570ms so it's not like this saves a lot of time 🤷. But it might save us more time now that `get_backfill_points_in_room`/`get_insertion_event_backward_extremities_in_room` are more complicated after https://github.com/matrix-org/synapse/pull/13635
This PR moves the filtering and limiting to the SQL query so we just have less data to work with in the first place.
Part of https://github.com/matrix-org/synapse/issues/13356
Try to avoid an OOM by checking fewer extremities.
Generally this is a big rewrite of _maybe_backfill, to try and fix some of the TODOs and other problems in it. It's best reviewed commit-by-commit.
When we are processing a `/backfill` request from a remote server, exclude any
outliers from consideration early on. We can't return outliers anyway (since we
don't know the state at the outlier), and filtering them out earlier means that
we won't attempt to calulate the state for them.
* Make `get_auth_chain_ids` return a Set
It has a set internally, and a set is often useful where it gets used, so let's
avoid converting to an intermediate list.
* Minor refactors in `on_send_join_request`
A little bit of non-functional groundwork
* Implement MSC3706: partial state in /send_join response
==============================
Bugfixes
--------
- Fix a bug introduced in Synapse 1.40.0 that caused Synapse to fail to process incoming federation traffic after handling a large amount of events in a v1 room. ([\#11806](https://github.com/matrix-org/synapse/issues/11806))
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEgQG31Z317NrSMt0QiISIDS7+X/QFAmHum5QTHGFuZHJld0Bh
bW9yZ2FuLnh5egAKCRCIhIgNLv5f9G6vD/9Dw6V0687vzahnU6aYWaecRSO1sbww
EtcCiXOh0r8HXPAwEIXJYomSTTbPl0eAwP8T1the1WZVQArvRW2VvMzDoBheO8bt
dCm4CTKFCGUsI/GrlFDPtjAEd6kruAETmpHZn7bAFSqtIpissD6FjUwg/ND/NJfM
zVM/bMgM0+js6eD9J5/k16E1ZWj7Lbp+/fKN+qTeQrXzIeT13WQZ4Nz1o/cqe/21
o/coI3FmYq8CzKpfA6qZMDd/OtYpYqwwr7otSmW+6qHeZI/yqeoxlpzfgSZGpRUe
mtXqbQllZQeqrbm8oK96GNmKcThy6awTwJPoD46b1AOkmFdf/lGSqO0lnTQRRPqR
hyPPdrx6Lt2t7DDbVVyUElzkqLPhVJtrItPDC8669sWvmSgsPQdUqIsRmhF+8aJe
ffjRvKbGRICnNZkT+qGf1HwuMHajZyAIHAS/kyFDKUZCvau3VQ1wlyOqVQ1hWr7+
3k3CobjSx4y1bYKXncAK6hWH6lE8M319jaTnVfYXocDLWRonyFf7o286Q+c8WBGF
tY0FzvUPb5S2kktC4WLwcqtcTWK1cu5MI6GfD69EqL7iifJRVyKDeUoD7tiUvgzN
O2wBl2soJIsAU+8y16WQu7p+k2nZIomCCAySnW3C8mlxvv7CKQu0Url8J7IH8uZE
ZVN2MMe7H+ZHJQ==
=Fpsa
-----END PGP SIGNATURE-----
Merge tag 'v1.51.0rc2' into develop
Synapse 1.51.0rc2 (2022-01-24)
==============================
Bugfixes
--------
- Fix a bug introduced in Synapse 1.40.0 that caused Synapse to fail to process incoming federation traffic after handling a large amount of events in a v1 room. ([\#11806](https://github.com/matrix-org/synapse/issues/11806))
* Make historical messages available to federated servers
Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716
Follow-up to https://github.com/matrix-org/synapse/pull/9247
* Debug message not available on federation
* Add base starting insertion point when no chunk ID is provided
* Fix messages from multiple senders in historical chunk
Follow-up to https://github.com/matrix-org/synapse/pull/9247
Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716
---
Previously, Synapse would throw a 403,
`Cannot force another user to join.`,
because we were trying to use `?user_id` from a single virtual user
which did not match with messages from other users in the chunk.
* Remove debug lines
* Messing with selecting insertion event extremeties
* Move db schema change to new version
* Add more better comments
* Make a fake requester with just what we need
See https://github.com/matrix-org/synapse/pull/10276#discussion_r660999080
* Store insertion events in table
* Make base insertion event float off on its own
See https://github.com/matrix-org/synapse/pull/10250#issuecomment-875711889
Conflicts:
synapse/rest/client/v1/room.py
* Validate that the app service can actually control the given user
See https://github.com/matrix-org/synapse/pull/10276#issuecomment-876316455
Conflicts:
synapse/rest/client/v1/room.py
* Add some better comments on what we're trying to check for
* Continue debugging
* Share validation logic
* Add inserted historical messages to /backfill response
* Remove debug sql queries
* Some marker event implemntation trials
* Clean up PR
* Rename insertion_event_id to just event_id
* Add some better sql comments
* More accurate description
* Add changelog
* Make it clear what MSC the change is part of
* Add more detail on which insertion event came through
* Address review and improve sql queries
* Only use event_id as unique constraint
* Fix test case where insertion event is already in the normal DAG
* Remove debug changes
* Add support for MSC2716 marker events
* Process markers when we receive it over federation
* WIP: make hs2 backfill historical messages after marker event
* hs2 to better ask for insertion event extremity
But running into the `sqlite3.IntegrityError: NOT NULL constraint failed: event_to_state_groups.state_group`
error
* Add insertion_event_extremities table
* Switch to chunk events so we can auth via power_levels
Previously, we were using `content.chunk_id` to connect one
chunk to another. But these events can be from any `sender`
and we can't tell who should be able to send historical events.
We know we only want the application service to do it but these
events have the sender of a real historical message, not the
application service user ID as the sender. Other federated homeservers
also have no indicator which senders are an application service on
the originating homeserver.
So we want to auth all of the MSC2716 events via power_levels
and have them be sent by the application service with proper
PL levels in the room.
* Switch to chunk events for federation
* Add unstable room version to support new historical PL
* Messy: Fix undefined state_group for federated historical events
```
2021-07-13 02:27:57,810 - synapse.handlers.federation - 1248 - ERROR - GET-4 - Failed to backfill from hs1 because NOT NULL constraint failed: event_to_state_groups.state_group
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 1216, in try_backfill
await self.backfill(
File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 1035, in backfill
await self._auth_and_persist_event(dest, event, context, backfilled=True)
File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 2222, in _auth_and_persist_event
await self._run_push_actions_and_persist_event(event, context, backfilled)
File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 2244, in _run_push_actions_and_persist_event
await self.persist_events_and_notify(
File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 3290, in persist_events_and_notify
events, max_stream_token = await self.storage.persistence.persist_events(
File "/usr/local/lib/python3.8/site-packages/synapse/logging/opentracing.py", line 774, in _trace_inner
return await func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/persist_events.py", line 320, in persist_events
ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
File "/usr/local/lib/python3.8/site-packages/synapse/storage/persist_events.py", line 237, in handle_queue_loop
ret = await self._per_item_callback(
File "/usr/local/lib/python3.8/site-packages/synapse/storage/persist_events.py", line 577, in _persist_event_batch
await self.persist_events_store._persist_events_and_state_updates(
File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/events.py", line 176, in _persist_events_and_state_updates
await self.db_pool.runInteraction(
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 681, in runInteraction
result = await self.runWithConnection(
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 770, in runWithConnection
return await make_deferred_yieldable(
File "/usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py", line 238, in inContext
result = inContext.theWork() # type: ignore[attr-defined]
File "/usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py", line 254, in <lambda>
inContext.theWork = lambda: context.call( # type: ignore[attr-defined]
File "/usr/local/lib/python3.8/site-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/usr/local/lib/python3.8/site-packages/twisted/python/context.py", line 83, in callWithContext
return func(*args, **kw)
File "/usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 293, in _runWithConnection
compat.reraise(excValue, excTraceback)
File "/usr/local/lib/python3.8/site-packages/twisted/python/deprecate.py", line 298, in deprecatedFunction
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/twisted/python/compat.py", line 403, in reraise
raise exception.with_traceback(traceback)
File "/usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 284, in _runWithConnection
result = func(conn, *args, **kw)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 765, in inner_func
return func(db_conn, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 549, in new_transaction
r = func(cursor, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/synapse/logging/utils.py", line 69, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/events.py", line 385, in _persist_events_txn
self._store_event_state_mappings_txn(txn, events_and_contexts)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/events.py", line 2065, in _store_event_state_mappings_txn
self.db_pool.simple_insert_many_txn(
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 923, in simple_insert_many_txn
txn.execute_batch(sql, vals)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 280, in execute_batch
self.executemany(sql, args)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 300, in executemany
self._do_execute(self.txn.executemany, sql, *args)
File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 330, in _do_execute
return func(sql, *args)
sqlite3.IntegrityError: NOT NULL constraint failed: event_to_state_groups.state_group
```
* Revert "Messy: Fix undefined state_group for federated historical events"
This reverts commit 187ab28611.
* Fix federated events being rejected for no state_groups
Add fix from https://github.com/matrix-org/synapse/pull/10439
until it merges.
* Adapting to experimental room version
* Some log cleanup
* Add better comments around extremity fetching code and why
* Rename to be more accurate to what the function returns
* Add changelog
* Ignore rejected events
* Use simplified upsert
* Add Erik's explanation of extra event checks
See https://github.com/matrix-org/synapse/pull/10498#discussion_r680880332
* Clarify that the depth is not directly correlated to the backwards extremity that we return
See https://github.com/matrix-org/synapse/pull/10498#discussion_r681725404
* lock only matters for sqlite
See https://github.com/matrix-org/synapse/pull/10498#discussion_r681728061
* Move new SQL changes to its own delta file
* Clean up upsert docstring
* Bump database schema version (62)
* Make historical messages available to federated servers
Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716
Follow-up to https://github.com/matrix-org/synapse/pull/9247
* Debug message not available on federation
* Add base starting insertion point when no chunk ID is provided
* Fix messages from multiple senders in historical chunk
Follow-up to https://github.com/matrix-org/synapse/pull/9247
Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716
---
Previously, Synapse would throw a 403,
`Cannot force another user to join.`,
because we were trying to use `?user_id` from a single virtual user
which did not match with messages from other users in the chunk.
* Remove debug lines
* Messing with selecting insertion event extremeties
* Move db schema change to new version
* Add more better comments
* Make a fake requester with just what we need
See https://github.com/matrix-org/synapse/pull/10276#discussion_r660999080
* Store insertion events in table
* Make base insertion event float off on its own
See https://github.com/matrix-org/synapse/pull/10250#issuecomment-875711889
Conflicts:
synapse/rest/client/v1/room.py
* Validate that the app service can actually control the given user
See https://github.com/matrix-org/synapse/pull/10276#issuecomment-876316455
Conflicts:
synapse/rest/client/v1/room.py
* Add some better comments on what we're trying to check for
* Continue debugging
* Share validation logic
* Add inserted historical messages to /backfill response
* Remove debug sql queries
* Some marker event implemntation trials
* Clean up PR
* Rename insertion_event_id to just event_id
* Add some better sql comments
* More accurate description
* Add changelog
* Make it clear what MSC the change is part of
* Add more detail on which insertion event came through
* Address review and improve sql queries
* Only use event_id as unique constraint
* Fix test case where insertion event is already in the normal DAG
* Remove debug changes
* Switch to chunk events so we can auth via power_levels
Previously, we were using `content.chunk_id` to connect one
chunk to another. But these events can be from any `sender`
and we can't tell who should be able to send historical events.
We know we only want the application service to do it but these
events have the sender of a real historical message, not the
application service user ID as the sender. Other federated homeservers
also have no indicator which senders are an application service on
the originating homeserver.
So we want to auth all of the MSC2716 events via power_levels
and have them be sent by the application service with proper
PL levels in the room.
* Switch to chunk events for federation
* Add unstable room version to support new historical PL
* Fix federated events being rejected for no state_groups
Add fix from https://github.com/matrix-org/synapse/pull/10439
until it merges.
* Only connect base insertion event to prev_event_ids
Per discussion with @erikjohnston,
https://matrix.to/#/!UytJQHLQYfvYWsGrGY:jki.re/$12bTUiObDFdHLAYtT7E-BvYRp3k_xv8w0dUQHibasJk?via=jki.re&via=matrix.org
* Make it possible to get the room_version with txn
* Allow but ignore historical events in unsupported room version
See https://github.com/matrix-org/synapse/pull/10245#discussion_r675592489
We can't reject historical events on unsupported room versions because homeservers without knowledge of MSC2716 or the new room version don't reject historical events either.
Since we can't rely on the auth check here to stop historical events on unsupported room versions, I've added some additional checks in the processing/persisting code (`synapse/storage/databases/main/events.py` -> `_handle_insertion_event` and `_handle_chunk_event`). I've had to do some refactoring so there is method to fetch the room version by `txn`.
* Move to unique index syntax
See https://github.com/matrix-org/synapse/pull/10245#discussion_r675638509
* High-level document how the insertion->chunk lookup works
* Remove create_event fallback for room_versions
See https://github.com/matrix-org/synapse/pull/10245/files#r677641879
* Use updated method name
Fixes#9490
This will break a couple of SyTest that are expecting failures to be added to the response of a federation /send, which obviously doesn't happen now that things are asynchronous.
Two drawbacks:
Currently there is no logic to handle any events left in the staging area after restart, and so they'll only be handled on the next incoming event in that room. That can be fixed separately.
We now only process one event per room at a time. This can be fixed up further down the line.